Index: user/markj/netdump/bin/setfacl/setfacl.1 =================================================================== --- user/markj/netdump/bin/setfacl/setfacl.1 (revision 332407) +++ user/markj/netdump/bin/setfacl/setfacl.1 (revision 332408) @@ -1,495 +1,515 @@ .\"- .\" Copyright (c) 2001 Chris D. Faulhaber .\" Copyright (c) 2011 Edward Tomasz NapieraƂa .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd January 23, 2016 +.Dd April 10, 2018 .Dt SETFACL 1 .Os .Sh NAME .Nm setfacl .Nd set ACL information .Sh SYNOPSIS .Nm +.Op Fl R Op Fl H | L | P .Op Fl bdhkn .Op Fl a Ar position entries .Op Fl m Ar entries .Op Fl M Ar file .Op Fl x Ar entries | position .Op Fl X Ar file .Op Ar .Sh DESCRIPTION The .Nm utility sets discretionary access control information on the specified file(s). If no files are specified, or the list consists of the only .Sq Fl , the file names are taken from the standard input. .Pp The following options are available: .Bl -tag -width indent .It Fl a Ar position entries Modify the ACL on the specified files by inserting new ACL entries specified in .Ar entries , starting at position .Ar position , counting from zero. This option is only applicable to NFSv4 ACLs. .It Fl b Remove all ACL entries except for the ones synthesized from the file mode - the three mandatory entries in case of POSIX.1e ACL. If the POSIX.1e ACL contains a .Dq Li mask entry, the permissions of the .Dq Li group entry in the resulting ACL will be set to the permission associated with both the .Dq Li group and .Dq Li mask entries of the current ACL. .It Fl d The operations apply to the default ACL entries instead of access ACL entries. Currently only directories may have -default ACL's. This option is not applicable to NFSv4 ACLs. +default ACL's. +This option is not applicable to NFSv4 ACLs. .It Fl h If the target of the operation is a symbolic link, perform the operation on the symbolic link itself, rather than following the link. +.It Fl H +If the +.Fl R +option is specified, symbolic links on the command line are followed +and hence unaffected by the command. +(Symbolic links encountered during tree traversal are not followed.) .It Fl k Delete any default ACL entries on the specified files. 
It is not considered an error if the specified files do not have any default ACL entries. An error will be reported if any of -the specified files cannot have a default entry (i.e.\& -non-directories). This option is not applicable to NFSv4 ACLs. +the specified files cannot have a default entry (i.e., +non-directories). +This option is not applicable to NFSv4 ACLs. +.It Fl L +If the +.Fl R +option is specified, all symbolic links are followed. .It Fl m Ar entries Modify the ACL on the specified file. New entries will be added, and existing entries will be modified according to the .Ar entries argument. For NFSv4 ACLs, it is recommended to use the .Fl a and .Fl x options instead. .It Fl M Ar file Modify the ACL entries on the specified files by adding new ACL entries and modifying existing ACL entries with the ACL entries specified in the file .Ar file . If .Ar file is .Fl , the input is taken from stdin. .It Fl n Do not recalculate the permissions associated with the ACL mask entry. This option is not applicable to NFSv4 ACLs. +.It Fl P +If the +.Fl R +option is specified, no symbolic links are followed. +This is the default. +.It Fl R +Perform the action recursively on any specified directories. .It Fl x Ar entries | position If .Ar entries is specified, remove the ACL entries specified there from the access or default ACL of the specified files. Otherwise, remove entry at index .Ar position , counting from zero. .It Fl X Ar file Remove the ACL entries specified in the file .Ar file from the access or default ACL of the specified files. .El .Pp The above options are evaluated in the order specified on the command-line. .Sh POSIX.1e ACL ENTRIES A POSIX.1E ACL entry contains three colon-separated fields: an ACL tag, an ACL qualifier, and discretionary access permissions: .Bl -tag -width indent .It Ar "ACL tag" The ACL tag specifies the ACL entry type and consists of one of the following: .Dq Li user or .Ql u specifying the access granted to the owner of the file or a specified user; .Dq Li group or .Ql g specifying the access granted to the file owning group or a specified group; .Dq Li other or .Ql o specifying the access granted to any process that does not match any user or group ACL entry; .Dq Li mask or .Ql m specifying the maximum access granted to any ACL entry except the .Dq Li user ACL entry for the file owner and the .Dq Li other ACL entry. .It Ar "ACL qualifier" The ACL qualifier field describes the user or group associated with the ACL entry. It may consist of one of the following: uid or user name, gid or group name, or empty. For .Dq Li user ACL entries, an empty field specifies access granted to the file owner. For .Dq Li group ACL entries, an empty field specifies access granted to the file owning group. .Dq Li mask and .Dq Li other ACL entries do not use this field. .It Ar "access permissions" The access permissions field contains up to one of each of the following: .Ql r , .Ql w , and .Ql x to set read, write, and execute permissions, respectively. Each of these may be excluded or replaced with a .Ql - character to indicate no access. .El .Pp A .Dq Li mask ACL entry is required on a file with any ACL entries other than the default .Dq Li user , .Dq Li group , and .Dq Li other ACL entries. If the .Fl n option is not specified and no .Dq Li mask ACL entry was specified, the .Nm utility will apply a .Dq Li mask ACL entry consisting of the union of the permissions associated with all .Dq Li group ACL entries in the resulting ACL. 
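The mask union described in the paragraph above is also exposed by the POSIX.1e draft library as acl_calc_mask(3). A minimal standalone sketch (not setfacl's own code; the program name "recalcmask" is illustrative) that recomputes and stores the mask entry of a file's access ACL:

#include <sys/types.h>
#include <sys/acl.h>
#include <err.h>

int
main(int argc, char *argv[])
{
	acl_t acl;

	if (argc != 2)
		errx(1, "usage: recalcmask file");
	/* Fetch the access ACL, recompute the ACL_MASK entry as the
	 * union of the group-class entries, and write it back. */
	if ((acl = acl_get_file(argv[1], ACL_TYPE_ACCESS)) == NULL)
		err(1, "acl_get_file");
	if (acl_calc_mask(&acl) == -1)
		err(1, "acl_calc_mask");
	if (acl_set_file(argv[1], ACL_TYPE_ACCESS, acl) == -1)
		err(1, "acl_set_file");
	acl_free(acl);
	return (0);
}

Separately, the new -R, -H, -L, and -P options in the list above map directly onto fts(3) flags, and that mapping is how the rewritten setfacl.c later in this commit implements recursion. A standalone sketch of the same mapping and walk, where the flag variables and the printf stand in for setfacl's real per-file work:

#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <fts.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static void
walk(char *const *paths, bool R_flag, bool H_flag, bool L_flag, bool h_flag)
{
	FTS *ftsp;
	FTSENT *file;
	int fts_options;

	if (R_flag) {
		/* -L follows every symlink; otherwise walk physically,
		 * with -H additionally following command-line symlinks. */
		fts_options = L_flag ? FTS_LOGICAL : FTS_PHYSICAL;
		if (!L_flag && H_flag)
			fts_options |= FTS_COMFOLLOW;
	} else {
		/* Without -R, -h selects the no-follow (physical) view. */
		fts_options = h_flag ? FTS_PHYSICAL : FTS_LOGICAL;
	}

	if ((ftsp = fts_open(paths, fts_options | FTS_NOSTAT, NULL)) == NULL)
		err(1, "fts_open");
	while ((file = fts_read(ftsp)) != NULL) {
		switch (file->fts_info) {
		case FTS_D:
			if (!R_flag)	/* do not descend without -R */
				fts_set(ftsp, file, FTS_SKIP);
			break;
		case FTS_DP:		/* skip post-order revisits */
			continue;
		case FTS_DNR:
		case FTS_ERR:
			warnx("%s: %s", file->fts_path,
			    strerror(file->fts_errno));
			continue;
		default:
			break;
		}
		printf("%s\n", file->fts_path);
	}
	fts_close(ftsp);
}

int
main(int argc, char *argv[])
{
	if (argc < 2)
		errx(1, "usage: ftswalk file ...");
	walk(&argv[1], true, false, false, false);	/* acts like -R -P */
	return (0);
}

Note that FTS_COMFOLLOW only affects command-line arguments (-H), while the FTS_LOGICAL/FTS_PHYSICAL choice decides symlink handling everywhere else; the follow_symlink test in the new setfacl.c reconstructs exactly this per-file distinction from the chosen options and fts_level.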
.Pp Traditional POSIX interfaces acting on file system object modes have modified semantics in the presence of POSIX.1e extended ACLs. When a mask entry is present on the access ACL of an object, the mask entry is substituted for the group bits; this occurs in programs such as .Xr stat 1 or .Xr ls 1 . When the mode is modified on an object that has a mask entry, the changes applied to the group bits will actually be applied to the mask entry. These semantics provide for greater application compatibility: applications modifying the mode instead of the ACL will see conservative behavior, limiting the effective rights granted by all of the additional user and group entries; this occurs in programs such as .Xr chmod 1 . .Pp ACL entries applied from a file using the .Fl M or .Fl X options shall be of the following form: one ACL entry per line, as previously specified; whitespace is ignored; any text after a .Ql # is ignored (comments). .Pp When POSIX.1e ACL entries are evaluated, the access check algorithm checks the ACL entries in the following order: file owner, .Dq Li user ACL entries, file owning group, .Dq Li group ACL entries, and .Dq Li other ACL entry. .Pp Multiple ACL entries specified on the command line are separated by commas. .Pp It is possible for files and directories to inherit ACL entries from their parent directory. This is accomplished through the use of the default ACL. It should be noted that before you can specify a default ACL, the mandatory ACL entries for user, group, other and mask must be set. For more details see the examples below. Default ACLs can be created by using .Fl d . .Sh NFSv4 ACL ENTRIES An NFSv4 ACL entry contains four or five colon-separated fields: an ACL tag, an ACL qualifier (only for .Dq Li user and .Dq Li group tags), discretionary access permissions, ACL inheritance flags, and ACL type: .Bl -tag -width indent .It Ar "ACL tag" The ACL tag specifies the ACL entry type and consists of one of the following: .Dq Li user or .Ql u specifying the access granted to the specified user; .Dq Li group or .Ql g specifying the access granted to the specified group; .Dq Li owner@ specifying the access granted to the owner of the file; .Dq Li group@ specifying the access granted to the file owning group; .Dq Li everyone@ specifying everyone. Note that .Dq Li everyone@ is not the same as traditional Unix .Dq Li other - it means, literally, everyone, including file owner and owning group. .It Ar "ACL qualifier" The ACL qualifier field describes the user or group associated with the ACL entry. It may consist of one of the following: uid or user name, or gid or group name. In entries whose tag type is one of .Dq Li owner@ , .Dq Li group@ , or .Dq Li everyone@ , this field is omitted altogether, including the trailing comma. .It Ar "access permissions" Access permissions may be specified in either short or long form. Short and long forms may not be mixed. Permissions in long form are separated by the .Ql / character; in short form, they are concatenated together. 
Valid permissions are: .Bl -tag -width ".Dv modify_set" .It Short Long .It r read_data .It w write_data .It x execute .It p append_data .It D delete_child .It d delete .It a read_attributes .It A write_attributes .It R read_xattr .It W write_xattr .It c read_acl .It C write_acl .It o write_owner .It s synchronize .El .Pp In addition, the following permission sets may be used: .Bl -tag -width ".Dv modify_set" .It Set Permissions .It full_set all permissions, as shown above .It modify_set all permissions except write_acl and write_owner .It read_set read_data, read_attributes, read_xattr and read_acl .It write_set write_data, append_data, write_attributes and write_xattr .El .It Ar "ACL inheritance flags" Inheritance flags may be specified in either short or long form. Short and long forms may not be mixed. Access flags in long form are separated by the .Ql / character; in short form, they are concatenated together. Valid inheritance flags are: .Bl -tag -width ".Dv short" .It Short Long .It f file_inherit .It d dir_inherit .It i inherit_only .It n no_propagate .It I inherited .El .Pp Other than the "inherited" flag, inheritance flags may be only set on directories. .It Ar "ACL type" The ACL type field is either .Dq Li allow or .Dq Li deny . .El .Pp ACL entries applied from a file using the .Fl M or .Fl X options shall be of the following form: one ACL entry per line, as previously specified; whitespace is ignored; any text after a .Ql # is ignored (comments). .Pp NFSv4 ACL entries are evaluated in their visible order. .Pp Multiple ACL entries specified on the command line are separated by commas. .Pp Note that the file owner is always granted the read_acl, write_acl, read_attributes, and write_attributes permissions, even if the ACL would deny it. .Sh EXIT STATUS .Ex -std .Sh EXAMPLES .Dl setfacl -d -m u::rwx,g::rx,o::rx,mask::rwx dir .Dl setfacl -d -m g:admins:rwx dir .Pp The first command sets the mandatory elements of the POSIX.1e default ACL. The second command specifies that users in group admins can have read, write, and execute permissions for directory named "dir". It should be noted that any files or directories created underneath "dir" will inherit these default ACLs upon creation. .Pp .Dl setfacl -m u::rwx,g:mail:rw file .Pp Sets read, write, and execute permissions for the .Pa file owner's POSIX.1e ACL entry and read and write permissions for group mail on .Pa file . .Pp .Dl setfacl -m owner@:rwxp::allow,g:mail:rwp::allow file .Pp Semantically equal to the example above, but for NFSv4 ACL. .Pp .Dl setfacl -M file1 file2 .Pp Sets/updates the ACL entries contained in .Pa file1 on .Pa file2 . .Pp .Dl setfacl -x g:mail:rw file .Pp Remove the group mail POSIX.1e ACL entry containing read/write permissions from .Pa file . .Pp .Dl setfacl -x0 file .Pp Remove the first entry from the NFSv4 ACL from .Pa file . .Pp .Dl setfacl -bn file .Pp Remove all .Dq Li access ACL entries except for the three required from .Pa file . .Pp .Dl getfacl file1 | setfacl -b -n -M - file2 .Pp Copy ACL entries from .Pa file1 to .Pa file2 . .Sh SEE ALSO .Xr getfacl 1 , .Xr acl 3 , .Xr getextattr 8 , .Xr setextattr 8 , .Xr acl 9 , .Xr extattr 9 .Sh STANDARDS The .Nm utility is expected to be .Tn IEEE Std 1003.2c compliant. .Sh HISTORY Extended Attribute and Access Control List support was developed as part of the .Tn TrustedBSD Project and introduced in .Fx 5.0 . NFSv4 ACL support was introduced in .Fx 8.1 . .Sh AUTHORS .An -nosplit The .Nm utility was written by .An Chris D. Faulhaber Aq Mt jedgar@fxp.org . 
NFSv4 ACL support was implemented by .An Edward Tomasz Napierala Aq Mt trasz@FreeBSD.org . Index: user/markj/netdump/bin/setfacl/setfacl.c =================================================================== --- user/markj/netdump/bin/setfacl/setfacl.c (revision 332407) +++ user/markj/netdump/bin/setfacl/setfacl.c (revision 332408) @@ -1,384 +1,432 @@ /*- * Copyright (c) 2001 Chris D. Faulhaber * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include +#include +#include +#include #include #include #include #include #include "setfacl.h" /* file operations */ #define OP_MERGE_ACL 0x00 /* merge acl's (-mM) */ #define OP_REMOVE_DEF 0x01 /* remove default acl's (-k) */ #define OP_REMOVE_EXT 0x02 /* remove extended acl's (-b) */ #define OP_REMOVE_ACL 0x03 /* remove acl's (-xX) */ #define OP_REMOVE_BY_NUMBER 0x04 /* remove acl's (-xX) by acl entry number */ #define OP_ADD_ACL 0x05 /* add acls entries at a given position */ /* TAILQ entry for acl operations */ struct sf_entry { uint op; acl_t acl; uint entry_number; TAILQ_ENTRY(sf_entry) next; }; static TAILQ_HEAD(, sf_entry) entrylist; -/* TAILQ entry for files */ -struct sf_file { - const char *filename; - TAILQ_ENTRY(sf_file) next; -}; -static TAILQ_HEAD(, sf_file) filelist; - uint have_mask; uint need_mask; uint have_stdin; uint n_flag; -static void add_filename(const char *filename); static void usage(void); static void -add_filename(const char *filename) -{ - struct sf_file *file; - - if (strlen(filename) > PATH_MAX - 1) { - warn("illegal filename"); - return; - } - file = zmalloc(sizeof(struct sf_file)); - file->filename = filename; - TAILQ_INSERT_TAIL(&filelist, file, next); -} - -static void usage(void) { - fprintf(stderr, "usage: setfacl [-bdhkn] [-a position entries] " - "[-m entries] [-M file] [-x entries] [-X file] [file ...]\n"); + fprintf(stderr, "usage: setfacl [-R [-H | -L | -P]] [-bdhkn] " + "[-a position entries] [-m entries] [-M file] " + "[-x entries] [-X file] [file ...]\n"); exit(1); } int main(int argc, char *argv[]) { acl_t acl; acl_type_t acl_type; acl_entry_t unused_entry; char filename[PATH_MAX]; - int local_error, carried_error, ch, i, entry_number, ret; - int h_flag; - struct sf_file *file; + int local_error, carried_error, 
ch, entry_number, ret, fts_options; + bool h_flag, H_flag, L_flag, R_flag, follow_symlink; + size_t fl_count, i; + FTS *ftsp; + FTSENT *file; + char **files_list; struct sf_entry *entry; - const char *fn_dup; char *end; - struct stat sb; acl_type = ACL_TYPE_ACCESS; - carried_error = local_error = 0; - h_flag = have_mask = have_stdin = n_flag = need_mask = 0; + carried_error = local_error = fts_options = 0; + have_mask = have_stdin = n_flag = need_mask = 0; + h_flag = H_flag = L_flag = R_flag = false; TAILQ_INIT(&entrylist); - TAILQ_INIT(&filelist); - while ((ch = getopt(argc, argv, "M:X:a:bdhkm:nx:")) != -1) + while ((ch = getopt(argc, argv, "HLM:PRX:a:bdhkm:nx:")) != -1) switch(ch) { + case 'H': + H_flag = true; + L_flag = false; + break; + case 'L': + L_flag = true; + H_flag = false; + break; case 'M': entry = zmalloc(sizeof(struct sf_entry)); entry->acl = get_acl_from_file(optarg); if (entry->acl == NULL) err(1, "%s: get_acl_from_file() failed", optarg); entry->op = OP_MERGE_ACL; TAILQ_INSERT_TAIL(&entrylist, entry, next); break; + case 'P': + H_flag = L_flag = false; + break; + case 'R': + R_flag = true; + break; case 'X': entry = zmalloc(sizeof(struct sf_entry)); entry->acl = get_acl_from_file(optarg); entry->op = OP_REMOVE_ACL; TAILQ_INSERT_TAIL(&entrylist, entry, next); break; case 'a': entry = zmalloc(sizeof(struct sf_entry)); entry_number = strtol(optarg, &end, 10); if (end - optarg != (int)strlen(optarg)) errx(1, "%s: invalid entry number", optarg); if (entry_number < 0) errx(1, "%s: entry number cannot be less than zero", optarg); entry->entry_number = entry_number; if (argv[optind] == NULL) errx(1, "missing ACL"); entry->acl = acl_from_text(argv[optind]); if (entry->acl == NULL) err(1, "%s", argv[optind]); optind++; entry->op = OP_ADD_ACL; TAILQ_INSERT_TAIL(&entrylist, entry, next); break; case 'b': entry = zmalloc(sizeof(struct sf_entry)); entry->op = OP_REMOVE_EXT; TAILQ_INSERT_TAIL(&entrylist, entry, next); break; case 'd': acl_type = ACL_TYPE_DEFAULT; break; case 'h': h_flag = 1; break; case 'k': entry = zmalloc(sizeof(struct sf_entry)); entry->op = OP_REMOVE_DEF; TAILQ_INSERT_TAIL(&entrylist, entry, next); break; case 'm': entry = zmalloc(sizeof(struct sf_entry)); entry->acl = acl_from_text(optarg); if (entry->acl == NULL) err(1, "%s", optarg); entry->op = OP_MERGE_ACL; TAILQ_INSERT_TAIL(&entrylist, entry, next); break; case 'n': n_flag++; break; case 'x': entry = zmalloc(sizeof(struct sf_entry)); entry_number = strtol(optarg, &end, 10); if (end - optarg == (int)strlen(optarg)) { if (entry_number < 0) errx(1, "%s: entry number cannot be less than zero", optarg); entry->entry_number = entry_number; entry->op = OP_REMOVE_BY_NUMBER; } else { entry->acl = acl_from_text(optarg); if (entry->acl == NULL) err(1, "%s", optarg); entry->op = OP_REMOVE_ACL; } TAILQ_INSERT_TAIL(&entrylist, entry, next); break; default: usage(); break; } argc -= optind; argv += optind; if (n_flag == 0 && TAILQ_EMPTY(&entrylist)) usage(); /* take list of files from stdin */ if (argc == 0 || strcmp(argv[0], "-") == 0) { if (have_stdin) err(1, "cannot have more than one stdin"); have_stdin = 1; bzero(&filename, sizeof(filename)); + i = 0; + /* Start with an array size sufficient for basic cases. 
*/ + fl_count = 1024; + files_list = zmalloc(fl_count * sizeof(char *)); while (fgets(filename, (int)sizeof(filename), stdin)) { /* remove the \n */ filename[strlen(filename) - 1] = '\0'; - fn_dup = strdup(filename); - if (fn_dup == NULL) + files_list[i] = strdup(filename); + if (files_list[i] == NULL) err(1, "strdup() failed"); - add_filename(fn_dup); + /* Grow array if necessary. */ + if (++i == fl_count) { + fl_count <<= 1; + if (fl_count > SIZE_MAX / sizeof(char *)) + errx(1, "Too many input files"); + files_list = zrealloc(files_list, + fl_count * sizeof(char *)); + } } + + /* fts_open() requires the last array element to be NULL. */ + files_list[i] = NULL; } else - for (i = 0; i < argc; i++) - add_filename(argv[i]); + files_list = argv; - /* cycle through each file */ - TAILQ_FOREACH(file, &filelist, next) { - local_error = 0; + if (R_flag) { + if (h_flag) + errx(1, "the -R and -h options may not be " + "specified together."); + if (L_flag) { + fts_options = FTS_LOGICAL; + } else { + fts_options = FTS_PHYSICAL; - if (stat(file->filename, &sb) == -1) { - warn("%s: stat() failed", file->filename); - carried_error++; + if (H_flag) { + fts_options |= FTS_COMFOLLOW; + } + } + } else if (h_flag) { + fts_options = FTS_PHYSICAL; + } else { + fts_options = FTS_LOGICAL; + } + + /* Open all files. */ + if ((ftsp = fts_open(files_list, fts_options | FTS_NOSTAT, 0)) == NULL) + err(1, "fts_open"); + while ((file = fts_read(ftsp)) != NULL) { + switch (file->fts_info) { + case FTS_D: + /* Do not recurse if -R not specified. */ + if (!R_flag) + fts_set(ftsp, file, FTS_SKIP); + break; + case FTS_DP: + /* Skip the second visit to a directory. */ continue; + case FTS_DNR: + case FTS_ERR: + warnx("%s: %s", file->fts_path, + strerror(file->fts_errno)); + continue; + default: + break; } - if (acl_type == ACL_TYPE_DEFAULT && S_ISDIR(sb.st_mode) == 0) { - warnx("%s: default ACL may only be set on a directory", - file->filename); + if (acl_type == ACL_TYPE_DEFAULT && file->fts_info != FTS_D) { + warnx("%s: default ACL may only be set on " + "a directory", file->fts_path); carried_error++; continue; } - if (h_flag) - ret = lpathconf(file->filename, _PC_ACL_NFS4); + local_error = 0; + + follow_symlink = ((fts_options & FTS_LOGICAL) || + ((fts_options & FTS_COMFOLLOW) && + file->fts_level == FTS_ROOTLEVEL)); + + if (follow_symlink) + ret = pathconf(file->fts_accpath, _PC_ACL_NFS4); else - ret = pathconf(file->filename, _PC_ACL_NFS4); + ret = lpathconf(file->fts_accpath, _PC_ACL_NFS4); if (ret > 0) { if (acl_type == ACL_TYPE_DEFAULT) { warnx("%s: there are no default entries " - "in NFSv4 ACLs", file->filename); + "in NFSv4 ACLs", file->fts_path); carried_error++; continue; } acl_type = ACL_TYPE_NFS4; } else if (ret == 0) { if (acl_type == ACL_TYPE_NFS4) acl_type = ACL_TYPE_ACCESS; } else if (ret < 0 && errno != EINVAL) { warn("%s: pathconf(..., _PC_ACL_NFS4) failed", - file->filename); + file->fts_path); } - if (h_flag) - acl = acl_get_link_np(file->filename, acl_type); + if (follow_symlink) + acl = acl_get_file(file->fts_accpath, acl_type); else - acl = acl_get_file(file->filename, acl_type); + acl = acl_get_link_np(file->fts_accpath, acl_type); if (acl == NULL) { - if (h_flag) - warn("%s: acl_get_link_np() failed", - file->filename); - else + if (follow_symlink) warn("%s: acl_get_file() failed", - file->filename); + file->fts_path); + else + warn("%s: acl_get_link_np() failed", + file->fts_path); carried_error++; continue; } /* cycle through each option */ TAILQ_FOREACH(entry, &entrylist, next) { if 
(local_error) continue; switch(entry->op) { case OP_ADD_ACL: local_error += add_acl(entry->acl, - entry->entry_number, &acl, file->filename); + entry->entry_number, + &acl, file->fts_path); break; case OP_MERGE_ACL: local_error += merge_acl(entry->acl, &acl, - file->filename); + file->fts_path); need_mask = 1; break; case OP_REMOVE_EXT: /* * Don't try to call remove_ext() for empty * default ACL. */ if (acl_type == ACL_TYPE_DEFAULT && acl_get_entry(acl, ACL_FIRST_ENTRY, &unused_entry) == 0) { local_error += remove_default(&acl, - file->filename); + file->fts_path); break; } - remove_ext(&acl, file->filename); + remove_ext(&acl, file->fts_path); need_mask = 0; break; case OP_REMOVE_DEF: if (acl_type == ACL_TYPE_NFS4) { warnx("%s: there are no default entries in NFSv4 ACLs; " - "cannot remove", file->filename); + "cannot remove", file->fts_path); local_error++; break; } - if (acl_delete_def_file(file->filename) == -1) { + if (acl_delete_def_file(file->fts_accpath) == -1) { warn("%s: acl_delete_def_file() failed", - file->filename); + file->fts_path); local_error++; } if (acl_type == ACL_TYPE_DEFAULT) local_error += remove_default(&acl, - file->filename); + file->fts_path); need_mask = 0; break; case OP_REMOVE_ACL: local_error += remove_acl(entry->acl, &acl, - file->filename); + file->fts_path); need_mask = 1; break; case OP_REMOVE_BY_NUMBER: local_error += remove_by_number(entry->entry_number, - &acl, file->filename); + &acl, file->fts_path); need_mask = 1; break; } } /* * Don't try to set an empty default ACL; it will always fail. * Use acl_delete_def_file(3) instead. */ if (acl_type == ACL_TYPE_DEFAULT && acl_get_entry(acl, ACL_FIRST_ENTRY, &unused_entry) == 0) { - if (acl_delete_def_file(file->filename) == -1) { + if (acl_delete_def_file(file->fts_accpath) == -1) { warn("%s: acl_delete_def_file() failed", - file->filename); + file->fts_path); carried_error++; } continue; } /* don't bother setting the ACL if something is broken */ if (local_error) { carried_error++; continue; } if (acl_type != ACL_TYPE_NFS4 && need_mask && - set_acl_mask(&acl, file->filename) == -1) { - warnx("%s: failed to set ACL mask", file->filename); + set_acl_mask(&acl, file->fts_path) == -1) { + warnx("%s: failed to set ACL mask", file->fts_path); carried_error++; - } else if (h_flag) { - if (acl_set_link_np(file->filename, acl_type, + } else if (follow_symlink) { + if (acl_set_file(file->fts_accpath, acl_type, acl) == -1) { carried_error++; - warn("%s: acl_set_link_np() failed", - file->filename); + warn("%s: acl_set_file() failed", + file->fts_path); } } else { - if (acl_set_file(file->filename, acl_type, + if (acl_set_link_np(file->fts_accpath, acl_type, acl) == -1) { carried_error++; - warn("%s: acl_set_file() failed", - file->filename); + warn("%s: acl_set_link_np() failed", + file->fts_path); } } acl_free(acl); } return (carried_error); } Index: user/markj/netdump/bin/setfacl/setfacl.h =================================================================== --- user/markj/netdump/bin/setfacl/setfacl.h (revision 332407) +++ user/markj/netdump/bin/setfacl/setfacl.h (revision 332408) @@ -1,58 +1,59 @@ /*- * Copyright (c) 2001 Chris D. Faulhaber * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _SETFACL_H #define _SETFACL_H #include #include #include /* files.c */ acl_t get_acl_from_file(const char *filename); /* merge.c */ int merge_acl(acl_t acl, acl_t *prev_acl, const char *filename); int add_acl(acl_t acl, uint entry_number, acl_t *prev_acl, const char *filename); /* remove.c */ int remove_acl(acl_t acl, acl_t *prev_acl, const char *filename); int remove_by_number(uint entry_number, acl_t *prev_acl, const char *filename); int remove_default(acl_t *prev_acl, const char *filename); void remove_ext(acl_t *prev_acl, const char *filename); /* mask.c */ int set_acl_mask(acl_t *prev_acl, const char *filename); /* util.c */ void *zmalloc(size_t size); +void *zrealloc(void *ptr, size_t size); const char *brand_name(int brand); int branding_mismatch(int brand1, int brand2); extern uint have_mask; extern uint need_mask; extern uint have_stdin; extern uint n_flag; #endif /* _SETFACL_H */ Index: user/markj/netdump/bin/setfacl/util.c =================================================================== --- user/markj/netdump/bin/setfacl/util.c (revision 332407) +++ user/markj/netdump/bin/setfacl/util.c (revision 332408) @@ -1,68 +1,79 @@ /*- * Copyright (c) 2001 Chris D. Faulhaber * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include "setfacl.h" void * zmalloc(size_t size) { void *ptr; ptr = calloc(1, size); if (ptr == NULL) err(1, "calloc() failed"); return (ptr); } +void * +zrealloc(void *ptr, size_t size) +{ + void *newptr; + + newptr = realloc(ptr, size); + if (newptr == NULL) + err(1, "realloc() failed"); + return (newptr); +} + const char * brand_name(int brand) { switch (brand) { case ACL_BRAND_NFS4: return "NFSv4"; case ACL_BRAND_POSIX: return "POSIX.1e"; default: return "unknown"; } } int branding_mismatch(int brand1, int brand2) { if (brand1 == ACL_BRAND_UNKNOWN || brand2 == ACL_BRAND_UNKNOWN) return (0); if (brand1 != brand2) return (1); return (0); } Index: user/markj/netdump/release/amd64/make-memstick.sh =================================================================== --- user/markj/netdump/release/amd64/make-memstick.sh (revision 332407) +++ user/markj/netdump/release/amd64/make-memstick.sh (revision 332408) @@ -1,41 +1,47 @@ #!/bin/sh # # This script generates a "memstick image" (image that can be copied to a # USB memory stick) from a directory tree. Note that the script does not # clean up after itself very well for error conditions on purpose so the # problem can be diagnosed (full filesystem most likely but ...). # # Usage: make-memstick.sh # # $FreeBSD$ # set -e PATH=/bin:/usr/bin:/sbin:/usr/sbin export PATH if [ $# -ne 2 ]; then echo "make-memstick.sh /path/to/directory /path/to/image/file" exit 1 fi if [ ! -d ${1} ]; then echo "${1} must be a directory" exit 1 fi if [ -e ${2} ]; then echo "won't overwrite ${2}" exit 1 fi echo '/dev/ufs/FreeBSD_Install / ufs ro,noatime 1 1' > ${1}/etc/fstab echo 'root_rw_mount="NO"' > ${1}/etc/rc.conf.local makefs -B little -o label=FreeBSD_Install -o version=2 ${2}.part ${1} rm ${1}/etc/fstab rm ${1}/etc/rc.conf.local -mkimg -s gpt -b ${1}/boot/pmbr -p efi:=${1}/boot/boot1.efifat -p freebsd-boot:=${1}/boot/gptboot -p freebsd-ufs:=${2}.part -p freebsd-swap::1M -o ${2} +mkimg -s gpt \ + -b ${1}/boot/pmbr \ + -p efi:=${1}/boot/boot1.efifat \ + -p freebsd-boot:=${1}/boot/gptboot \ + -p freebsd-ufs:=${2}.part \ + -p freebsd-swap::1M \ + -o ${2} rm ${2}.part Index: user/markj/netdump/release/amd64/mkisoimages.sh =================================================================== --- user/markj/netdump/release/amd64/mkisoimages.sh (revision 332407) +++ user/markj/netdump/release/amd64/mkisoimages.sh (revision 332408) @@ -1,60 +1,60 @@ #!/bin/sh # # Module: mkisoimages.sh # Author: Jordan K Hubbard # Date: 22 June 2001 # # $FreeBSD$ # # This script is used by release/Makefile to build the (optional) ISO images # for a FreeBSD release. It is considered architecture dependent since each # platform has a slightly unique way of making bootable CDs. This script # is also allowed to generate any number of images since that is more of # publishing decision than anything else. # # Usage: # # mkisoimages.sh [-b] image-label image-name base-bits-dir [extra-bits-dir] # # Where -b is passed if the ISO image should be made "bootable" by # whatever standards this architecture supports (may be unsupported), # image-label is the ISO image label, image-name is the filename of the # resulting ISO image, base-bits-dir contains the image contents and # extra-bits-dir, if provided, contains additional files to be merged # into base-bits-dir as part of making the image. if [ "$1" = "-b" ]; then # This is highly x86-centric and will be used directly below. 
bootable="-o bootimage=i386;$4/boot/cdboot -o no-emul-boot" # Make EFI system partition (should be done with makefs in the future) dd if=/dev/zero of=efiboot.img bs=4k count=200 device=`mdconfig -a -t vnode -f efiboot.img` newfs_msdos -F 12 -m 0xf8 /dev/$device mkdir efi mount -t msdosfs /dev/$device efi mkdir -p efi/efi/boot cp "$4/boot/loader.efi" efi/efi/boot/bootx64.efi umount efi rmdir efi mdconfig -d -u $device - bootable="-o bootimage=efi;efiboot.img -o no-emul-boot $bootable" + bootable="-o bootimage=i386;efiboot.img -o no-emul-boot -o platformid=efi $bootable" shift else bootable="" fi if [ $# -lt 3 ]; then echo "Usage: $0 [-b] image-label image-name base-bits-dir [extra-bits-dir]" exit 1 fi LABEL=`echo "$1" | tr '[:lower:]' '[:upper:]'`; shift NAME="$1"; shift publisher="The FreeBSD Project. https://www.FreeBSD.org/" echo "/dev/iso9660/$LABEL / cd9660 ro 0 0" > "$1/etc/fstab" makefs -t cd9660 $bootable -o rockridge -o label="$LABEL" -o publisher="$publisher" "$NAME" "$@" rm -f "$1/etc/fstab" rm -f efiboot.img Index: user/markj/netdump/release/arm64/RPI3.conf =================================================================== --- user/markj/netdump/release/arm64/RPI3.conf (revision 332407) +++ user/markj/netdump/release/arm64/RPI3.conf (revision 332408) @@ -1,61 +1,65 @@ #!/bin/sh # # $FreeBSD$ # DTB_DIR="/usr/local/share/rpi-firmware" DTB="bcm2710-rpi-3-b.dtb" EMBEDDED_TARGET_ARCH="aarch64" EMBEDDED_TARGET="arm64" EMBEDDEDBUILD=1 EMBEDDEDPORTS="sysutils/u-boot-rpi3 sysutils/rpi-firmware" FAT_SIZE="50m -b 1m" FAT_TYPE="16" IMAGE_SIZE="2560M" KERNEL="GENERIC" MD_ARGS="-x 63 -y 255" NODOC=1 OL_DIR="${DTB_DIR}/overlays" OVERLAYS="mmc.dtbo pi3-disable-bt.dtbo" PART_SCHEME="MBR" export BOARDNAME="RPI3" arm_install_uboot() { UBOOT_DIR="/usr/local/share/u-boot/u-boot-rpi3" - UBOOT_FILES="LICENCE.broadcom README armstub8.bin bootcode.bin config.txt \ - fixup.dat fixup_cd.dat fixup_x.dat start.elf start_cd.elf \ - start_x.elf u-boot.bin" + UBOOT_FILES="README u-boot.bin" + DTB_FILES="armstub8.bin bootcode.bin config.txt fixup_cd.dat \ + fixup_db.dat fixup_x.dat fixup.dat LICENCE.broadcom \ + start_cd.elf start_db.elf start_x.elf start.elf ${DTB}" FATMOUNT="${DESTDIR%${KERNEL}}fat" UFSMOUNT="${DESTDIR%${KERNEL}}ufs" chroot ${CHROOTDIR} mkdir -p "${FATMOUNT}" "${UFSMOUNT}" chroot ${CHROOTDIR} mount_msdosfs /dev/${mddev}s1 ${FATMOUNT} chroot ${CHROOTDIR} mount /dev/${mddev}s2a ${UFSMOUNT} for _UF in ${UBOOT_FILES}; do chroot ${CHROOTDIR} cp -p ${UBOOT_DIR}/${_UF} \ ${FATMOUNT}/${_UF} done - chroot ${CHROOTDIR} cp -p ${DTB_DIR}/${DTB} ${FATMOUNT}/${DTB} + for _DF in ${DTB_FILES}; do + chroot ${CHROOTDIR} cp -p ${DTB_DIR}/${_DF} \ + ${FATMOUNT}/${_DF} + done chroot ${CHROOTDIR} mkdir -p ${FATMOUNT}/overlays for _OL in ${OVERLAYS}; do chroot ${CHROOTDIR} cp -p ${OL_DIR}/${_OL} \ ${FATMOUNT}/overlays/${_OL} done BOOTFILES="$(chroot ${CHROOTDIR} \ env TARGET=${EMBEDDED_TARGET} TARGET_ARCH=${EMBEDDED_TARGET_ARCH} \ WITH_UNIFIED_OBJDIR=yes \ make -C ${WORLDDIR}/stand -V .OBJDIR)" BOOTFILES="$(chroot ${CHROOTDIR} realpath ${BOOTFILES})" chroot ${CHROOTDIR} mkdir -p ${FATMOUNT}/EFI/BOOT chroot ${CHROOTDIR} cp -p ${BOOTFILES}/efi/boot1/boot1.efi \ ${FATMOUNT}/EFI/BOOT/bootaa64.efi chroot ${CHROOTDIR} touch ${UFSMOUNT}/firstboot sync umount_loop ${CHROOTDIR}/${FATMOUNT} umount_loop ${CHROOTDIR}/${UFSMOUNT} chroot ${CHROOTDIR} rmdir ${FATMOUNT} chroot ${CHROOTDIR} rmdir ${UFSMOUNT} return 0 } Index: user/markj/netdump/release/arm64/make-memstick.sh 
=================================================================== --- user/markj/netdump/release/arm64/make-memstick.sh (revision 332407) +++ user/markj/netdump/release/arm64/make-memstick.sh (revision 332408) @@ -1,41 +1,44 @@ #!/bin/sh # # This script generates a "memstick image" (image that can be copied to a # USB memory stick) from a directory tree. Note that the script does not # clean up after itself very well for error conditions on purpose so the # problem can be diagnosed (full filesystem most likely but ...). # # Usage: make-memstick.sh # # $FreeBSD$ # set -e PATH=/bin:/usr/bin:/sbin:/usr/sbin export PATH if [ $# -ne 2 ]; then echo "make-memstick.sh /path/to/directory /path/to/image/file" exit 1 fi if [ ! -d ${1} ]; then echo "${1} must be a directory" exit 1 fi if [ -e ${2} ]; then echo "won't overwrite ${2}" exit 1 fi echo '/dev/ufs/FreeBSD_Install / ufs ro,noatime 1 1' > ${1}/etc/fstab echo 'root_rw_mount="NO"' > ${1}/etc/rc.conf.local makefs -B little -o label=FreeBSD_Install -o version=2 ${2}.part ${1} rm ${1}/etc/fstab rm ${1}/etc/rc.conf.local -mkimg -s gpt -p efi:=${1}/boot/boot1.efifat -p freebsd:=${2}.part -o ${2} +mkimg -s gpt \ + -p efi:=${1}/boot/boot1.efifat \ + -p freebsd:=${2}.part \ + -o ${2} rm ${2}.part Index: user/markj/netdump/release/i386/make-memstick.sh =================================================================== --- user/markj/netdump/release/i386/make-memstick.sh (revision 332407) +++ user/markj/netdump/release/i386/make-memstick.sh (revision 332408) @@ -1,41 +1,46 @@ #!/bin/sh # # This script generates a "memstick image" (image that can be copied to a # USB memory stick) from a directory tree. Note that the script does not # clean up after itself very well for error conditions on purpose so the # problem can be diagnosed (full filesystem most likely but ...). # # Usage: make-memstick.sh # # $FreeBSD$ # set -e PATH=/bin:/usr/bin:/sbin:/usr/sbin export PATH if [ $# -ne 2 ]; then echo "make-memstick.sh /path/to/directory /path/to/image/file" exit 1 fi if [ ! -d ${1} ]; then echo "${1} must be a directory" exit 1 fi if [ -e ${2} ]; then echo "won't overwrite ${2}" exit 1 fi echo '/dev/ufs/FreeBSD_Install / ufs ro,noatime 1 1' > ${1}/etc/fstab echo 'root_rw_mount="NO"' > ${1}/etc/rc.conf.local makefs -B little -o label=FreeBSD_Install -o version=2 ${2}.part ${1} rm ${1}/etc/fstab rm ${1}/etc/rc.conf.local -mkimg -s gpt -b ${1}/boot/pmbr -p freebsd-boot:=${1}/boot/gptboot -p freebsd-ufs:=${2}.part -p freebsd-swap::1M -o ${2} +mkimg -s gpt \ + -b ${1}/boot/pmbr \ + -p freebsd-boot:=${1}/boot/gptboot \ + -p freebsd-ufs:=${2}.part \ + -p freebsd-swap::1M \ + -o ${2} rm ${2}.part Index: user/markj/netdump/release/powerpc/make-memstick.sh =================================================================== --- user/markj/netdump/release/powerpc/make-memstick.sh (revision 332407) +++ user/markj/netdump/release/powerpc/make-memstick.sh (revision 332408) @@ -1,47 +1,50 @@ #!/bin/sh # # This script generates a "memstick image" (image that can be copied to a # USB memory stick) from a directory tree. Note that the script does not # clean up after itself very well for error conditions on purpose so the # problem can be diagnosed (full filesystem most likely but ...). # # Usage: make-memstick.sh # # $FreeBSD$ # set -e PATH=/bin:/usr/bin:/sbin:/usr/sbin export PATH BLOCKSIZE=10240 if [ $# -ne 2 ]; then echo "make-memstick.sh /path/to/directory /path/to/image/file" exit 1 fi tempfile="${2}.$$" if [ ! 
-d ${1} ]; then echo "${1} must be a directory" exit 1 fi if [ -e ${2} ]; then echo "won't overwrite ${2}" exit 1 fi echo '/dev/da0s3 / ufs ro,noatime 1 1' > ${1}/etc/fstab echo 'root_rw_mount="NO"' > ${1}/etc/rc.conf.local rm -f ${tempfile} makefs -B big -o version=2 ${tempfile} ${1} rm ${1}/etc/fstab rm ${1}/etc/rc.conf.local -mkimg -s apm -p freebsd-boot:=${1}/boot/boot1.hfs -p freebsd-ufs/FreeBSD_Install:=${tempfile} -o ${2} +mkimg -s apm \ + -p freebsd-boot:=${1}/boot/boot1.hfs \ + -p freebsd-ufs/FreeBSD_Install:=${tempfile} \ + -o ${2} rm -f ${tempfile} Index: user/markj/netdump/sbin/camcontrol/camcontrol.8 =================================================================== --- user/markj/netdump/sbin/camcontrol/camcontrol.8 (revision 332407) +++ user/markj/netdump/sbin/camcontrol/camcontrol.8 (revision 332408) @@ -1,2887 +1,2887 @@ .\" .\" Copyright (c) 1998, 1999, 2000, 2002, 2005, 2006, 2007 Kenneth D. Merry. .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" 3. The name of the author may not be used to endorse or promote products .\" derived from this software without specific prior written permission. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. 
.\" .\" $FreeBSD$ .\" .Dd May 3, 2017 .Dt CAMCONTROL 8 .Os .Sh NAME .Nm camcontrol .Nd CAM control program .Sh SYNOPSIS .Nm .Aq Ar command .Op device id .Op generic args .Op command args .Nm .Ic devlist .Op Fl b .Op Fl v .Nm .Ic periphlist .Op device id .Op Fl n Ar dev_name .Op Fl u Ar unit_number .Nm .Ic tur .Op device id .Op generic args .Nm .Ic inquiry .Op device id .Op generic args .Op Fl D .Op Fl S .Op Fl R .Nm .Ic identify .Op device id .Op generic args .Op Fl v .Nm .Ic reportluns .Op device id .Op generic args .Op Fl c .Op Fl l .Op Fl r Ar reporttype .Nm .Ic readcap .Op device id .Op generic args .Op Fl b .Op Fl h .Op Fl H .Op Fl N .Op Fl q .Op Fl s .Nm .Ic start .Op device id .Op generic args .Nm .Ic stop .Op device id .Op generic args .Nm .Ic load .Op device id .Op generic args .Nm .Ic eject .Op device id .Op generic args .Nm .Ic reprobe .Op device id .Nm .Ic rescan .Aq all | device id | bus Ns Op :target:lun .Nm .Ic reset .Aq all | device id | bus Ns Op :target:lun .Nm .Ic defects .Op device id .Op generic args .Aq Fl f Ar format .Op Fl P .Op Fl G .Op Fl q .Op Fl s .Op Fl S Ar offset .Op Fl X .Nm .Ic modepage .Op device id .Op generic args .Aq Fl m Ar page[,subpage] | Fl l .Op Fl P Ar pgctl .Op Fl b | Fl e .Op Fl d .Nm .Ic cmd .Op device id .Op generic args .Aq Fl a Ar cmd Op args .Aq Fl c Ar cmd Op args .Op Fl d .Op Fl f .Op Fl i Ar len Ar fmt .Bk -words .Op Fl o Ar len Ar fmt Op args .Op Fl r Ar fmt .Ek .Nm .Ic smpcmd .Op device id .Op generic args .Aq Fl r Ar len Ar fmt Op args .Aq Fl R Ar len Ar fmt Op args .Nm .Ic smprg .Op device id .Op generic args .Op Fl l .Nm .Ic smppc .Op device id .Op generic args .Aq Fl p Ar phy .Op Fl l .Op Fl o Ar operation .Op Fl d Ar name .Op Fl m Ar rate .Op Fl M Ar rate .Op Fl T Ar pp_timeout .Op Fl a Ar enable|disable .Op Fl A Ar enable|disable .Op Fl s Ar enable|disable .Op Fl S Ar enable|disable .Nm .Ic smpphylist .Op device id .Op generic args .Op Fl l .Op Fl q .Nm .Ic smpmaninfo .Op device id .Op generic args .Op Fl l .Nm .Ic debug .Op Fl I .Op Fl P .Op Fl T .Op Fl S .Op Fl X .Op Fl c .Op Fl p .Aq all|off|bus Ns Op :target Ns Op :lun .Nm .Ic tags .Op device id .Op generic args .Op Fl N Ar tags .Op Fl q .Op Fl v .Nm .Ic negotiate .Op device id .Op generic args .Op Fl c .Op Fl D Ar enable|disable .Op Fl M Ar mode .Op Fl O Ar offset .Op Fl q .Op Fl R Ar syncrate .Op Fl T Ar enable|disable .Op Fl U .Op Fl W Ar bus_width .Op Fl v .Nm .Ic format .Op device id .Op generic args .Op Fl q .Op Fl r .Op Fl w .Op Fl y .Nm .Ic sanitize .Op device id .Op generic args .Aq Fl a Ar overwrite | block | crypto | exitfailure .Op Fl c Ar passes .Op Fl I .Op Fl P Ar pattern .Op Fl q .Op Fl U .Op Fl r .Op Fl w .Op Fl y .Nm .Ic idle .Op device id .Op generic args .Op Fl t Ar time .Nm .Ic standby .Op device id .Op generic args .Op Fl t Ar time .Nm .Ic sleep .Op device id .Op generic args .Nm .Ic apm .Op device id .Op generic args .Op Fl l Ar level .Nm .Ic aam .Op device id .Op generic args .Op Fl l Ar level .Nm .Ic fwdownload .Op device id .Op generic args .Aq Fl f Ar fw_image .Op Fl q .Op Fl s .Op Fl y .Nm .Ic security .Op device id .Op generic args .Op Fl d Ar pwd .Op Fl e Ar pwd .Op Fl f .Op Fl h Ar pwd .Op Fl k Ar pwd .Op Fl l Ar high|maximum .Op Fl q .Op Fl s Ar pwd .Op Fl T Ar timeout .Op Fl U Ar user|master .Op Fl y .Nm .Ic hpa .Op device id .Op generic args .Op Fl f .Op Fl l .Op Fl P .Op Fl p Ar pwd .Op Fl q .Op Fl s Ar max_sectors .Op Fl U Ar pwd .Op Fl y .Nm .Ic persist .Op device id .Op generic args .Aq Fl i Ar action | Fl o Ar action .Op Fl a .Op Fl I Ar 
trans_id .Op Fl k Ar key .Op Fl K Ar sa_key .Op Fl p .Op Fl R Ar rel_tgt_port .Op Fl s Ar scope .Op Fl S .Op Fl T Ar res_type .Op Fl U .Nm .Ic attrib .Op device id .Op generic args .Aq Fl r Ar action | Fl w Ar attrib .Op Fl a Ar attr_num .Op Fl c .Op Fl e Ar elem_addr .Op Fl F Ar form1,form2 .Op Fl p Ar part .Op Fl s Ar start_addr .Op Fl T Ar elem_type .Op Fl V Ar lv_num .Nm .Ic opcodes .Op device id .Op generic args .Op Fl o Ar opcode .Op Fl s Ar service_action .Op Fl N .Op Fl T .Nm .Ic zone .Aq Fl c Ar cmd .Op Fl a .Op Fl l Ar lba .Op Fl o Ar rep_opts .Op Fl P Ar print_opts .Nm .Ic epc .Aq Fl c Ar cmd .Op Fl d .Op Fl D .Op Fl e .Op Fl H .Op Fl p Ar power_cond .Op Fl P .Op Fl r Ar restore_src .Op Fl s .Op Fl S Ar power_src .Op Fl T Ar timer .Nm .Ic timestamp .Op device id .Op generic args .Ao Fl r Oo Ns Fl f Ar format | Fl m | Fl U Oc | Fl s Ao Fl f Ar format Fl T Ar time | Fl U Ac Ac .Nm .Ic help .Sh DESCRIPTION The .Nm utility is designed to provide a way for users to access and control the .Fx CAM subsystem. .Pp The .Nm utility can cause a loss of data and/or system crashes if used improperly. Even expert users are encouraged to exercise caution when using this command. Novice users should stay away from this utility. .Pp The .Nm utility has a number of primary functions, many of which support an optional device identifier. A device identifier can take one of three forms: .Bl -tag -width 14n .It deviceUNIT Specify a device name and unit number combination, like "da5" or "cd3". .It bus:target Specify a bus number and target id. The bus number can be determined from the output of .Dq camcontrol devlist . The lun defaults to 0. .It bus:target:lun Specify the bus, target and lun for a device. (e.g.\& 1:2:0) .El .Pp The device identifier, if it is specified, .Em must come immediately after the function name, and before any generic or function-specific arguments. Note that the .Fl n and .Fl u arguments described below will override any device name or unit number specified beforehand. The .Fl n and .Fl u arguments will .Em not override a specified bus:target or bus:target:lun, however. .Pp Most of the .Nm primary functions support these generic arguments: .Bl -tag -width 14n .It Fl C Ar count SCSI command retry count. In order for this to work, error recovery .Pq Fl E must be turned on. .It Fl E Instruct the kernel to perform generic SCSI error recovery for the given command. This is needed in order for the retry count .Pq Fl C to be honored. Other than retrying commands, the generic error recovery in the code will generally attempt to spin up drives that are not spinning. It may take some other actions, depending upon the sense code returned from the command. .It Fl n Ar dev_name Specify the device type to operate on, e.g.\& "da", "cd". .It Fl Q Ar task_attr .Tn SCSI task attribute for the command, if it is a .Tn SCSI command. This may be ordered, simple, head, or aca. In most cases this is not needed. The default is simple, which works with all .Tn SCSI devices. The task attribute may also be specified numerically. .It Fl t Ar timeout SCSI command timeout in seconds. This overrides the default timeout for any given command. .It Fl u Ar unit_number Specify the device unit number, e.g.\& "1", "5". .It Fl v Be verbose, print out sense information for failed SCSI commands. .El .Pp Primary command functions: .Bl -tag -width periphlist .It Ic devlist List all physical devices (logical units) attached to the CAM subsystem. This also includes a list of peripheral drivers attached to each device. 
With the .Fl v argument, SCSI bus number, adapter name and unit numbers are printed as well. On the other hand, with the .Fl b argument, only the bus adapter and unit information will be printed, and device information will be omitted. .It Ic periphlist List all peripheral drivers attached to a given physical device (logical unit). .It Ic tur Send the SCSI test unit ready (0x00) command to the given device. The .Nm utility will report whether the device is ready or not. .It Ic inquiry Send a SCSI inquiry command (0x12) to a device. By default, .Nm will print out the standard inquiry data, device serial number, and transfer rate information. The user can specify that only certain types of inquiry data be printed: .Bl -tag -width 4n .It Fl D Get the standard inquiry data. .It Fl S Print out the serial number. If this flag is the only one specified, .Nm will not print out "Serial Number" before the value returned by the drive. This is to aid in script writing. .It Fl R Print out transfer rate information. .El .It Ic identify Send an ATA identify command (0xec) to a device. .It Ic reportluns Send the SCSI REPORT LUNS (0xA0) command to the given device. By default, .Nm will print out the list of logical units (LUNs) supported by the target device. There are a couple of options to modify the output: .Bl -tag -width 14n .It Fl c Just print out a count of LUNs, not the actual LUN numbers. .It Fl l Just print out the LUNs, and do not print out the count. .It Fl r Ar reporttype Specify the type of report to request from the target: .Bl -tag -width 012345678 .It default Return the default report. This is the .Nm default. Most targets will support this report if they support the REPORT LUNS command. .It wellknown Return only well known LUNs. .It all Return all available LUNs. .El .El .Pp .Nm will try to print out LUN numbers in a reasonable format. It can understand the peripheral, flat, LUN and extended LUN formats. .It Ic readcap Send the SCSI READ CAPACITY command to the given device and display the results. If the device is larger than 2TB, the SCSI READ CAPACITY (16) service action will be sent to obtain the full size of the device. By default, .Nm will print out the last logical block of the device, and the blocksize of the device in bytes. To modify the output format, use the following options: .Bl -tag -width 5n .It Fl b Just print out the blocksize, not the last block or device size. This cannot be used with .Fl N or .Fl s . .It Fl h Print out the device size in human readable (base 2, 1K == 1024) format. This implies .Fl N and cannot be used with .Fl q or .Fl b . .It Fl H Print out the device size in human readable (base 10, 1K == 1000) format. .It Fl N Print out the number of blocks in the device instead of the last logical block. .It Fl q Quiet, print out the numbers only (separated by a comma if .Fl b or .Fl s are not specified). .It Fl s Print out the last logical block or the size of the device only, and omit the blocksize. .El .Pp Note that this command only displays the information; it does not update the kernel data structures. Use the -.Nm +.Nm reprobe subcommand to do that. .It Ic start Send the SCSI Start/Stop Unit (0x1B) command to the given device with the start bit set. .It Ic stop Send the SCSI Start/Stop Unit (0x1B) command to the given device with the start bit cleared. .It Ic load Send the SCSI Start/Stop Unit (0x1B) command to the given device with the start bit set and the load/eject bit set.
.It Ic eject Send the SCSI Start/Stop Unit (0x1B) command to the given device with the start bit cleared and the load/eject bit set. .It Ic rescan Tell the kernel to scan all buses in the system (with the .Ar all argument), the given bus (XPT_SCAN_BUS), bus:target:lun or device (XPT_SCAN_LUN) for new devices or devices that have gone away. The user may specify a scan of all buses, a single bus, or a lun. Scanning all luns on a target is not supported. .Pp If a device is specified by peripheral name and unit number, for instance da4, it may only be rescanned if that device currently exists in the CAM EDT (Existing Device Table). If the device is no longer there (see .Nm devlist ), you must use the bus:target:lun form to rescan it. .It Ic reprobe Tell the kernel to refresh the information about the device and notify the upper layer, .Xr GEOM 4 . This includes sending the SCSI READ CAPACITY command and updating the disk size visible to the rest of the system. .It Ic reset Tell the kernel to reset all buses in the system (with the .Ar all argument), the given bus (XPT_RESET_BUS) by issuing a SCSI bus reset for that bus, or to reset the given bus:target:lun or device (XPT_RESET_DEV), typically by issuing a BUS DEVICE RESET message after connecting to that device. Note that this can have a destructive impact on the system. .It Ic defects Send the .Tn SCSI READ DEFECT DATA (10) command (0x37) or the .Tn SCSI READ DEFECT DATA (12) command (0xB7) to the given device, and print out any combination of: the total number of defects, the primary defect list (PLIST), and the grown defect list (GLIST). .Bl -tag -width 11n .It Fl f Ar format Specify the requested format of the defect list. The format argument is required. Most drives support the physical sector format. Some drives support the logical block format. Many drives, if they do not support the requested format, return the data in an alternate format, along with sense information indicating that the requested data format is not supported. The .Nm utility attempts to detect this, and print out whatever format the drive returns. If the drive uses a non-standard sense code to report that it does not support the requested format, .Nm will probably see the error as a failure to complete the request. .Pp The format options are: .Bl -tag -width 9n .It block Print out the list as logical blocks. This is limited to 32-bit block sizes, and isn't supported by many modern drives. .It longblock Print out the list as logical blocks. This option uses a 64-bit block size. .It bfi Print out the list in bytes from index format. .It extbfi Print out the list in extended bytes from index format. The extended format allows for ranges of blocks to be printed. .It phys Print out the list in physical sector format. Most drives support this format. .It extphys Print out the list in extended physical sector format. The extended format allows for ranges of blocks to be printed. .El .It Fl G Print out the grown defect list. This is a list of bad blocks that have been remapped since the disk left the factory. .It Fl P Print out the primary defect list. This is the list of defects that were present in the factory. .It Fl q When printing status information with .Fl s , only print the number of defects. .It Fl s Just print the number of defects, not the list of defects. .It Fl S Ar offset Specify the starting offset into the defect list. 
This implies using the .Tn SCSI READ DEFECT DATA (12) command, as the 10 byte version of the command doesn't support the address descriptor index field. Not all drives support the 12 byte command, and some drives that support the 12 byte command don't support the address descriptor index field. .It Fl X Print out defects in hexadecimal (base 16) form instead of base 10 form. .El .Pp If neither .Fl P nor .Fl G is specified, .Nm will print out the number of defects given in the READ DEFECT DATA header returned from the drive. Some drives will report 0 defects if neither the primary nor grown defect lists are requested. .It Ic modepage Allows the user to display and optionally edit a SCSI mode page. The mode page formats are located in .Pa /usr/share/misc/scsi_modes . This can be overridden by specifying a different file in the .Ev SCSI_MODES environment variable. The .Ic modepage command takes several arguments: .Bl -tag -width 12n .It Fl d Disable block descriptors for mode sense. .It Fl b Displays mode page data in binary format. .It Fl e This flag allows the user to edit values in the mode page. The user may either edit mode page values with the text editor pointed to by the .Ev EDITOR environment variable, or supply mode page values via standard input, using the same format that .Nm uses to display mode page values. The editor will be invoked if .Nm detects that standard input is a terminal. .It Fl l Lists all available mode pages. If specified more than once, also lists subpages. .It Fl m Ar page[,subpage] This specifies the number of the mode page and optionally subpage the user would like to view and/or edit. This argument is mandatory unless .Fl l is specified. .It Fl P Ar pgctl This allows the user to specify the page control field. Possible values are: .Bl -tag -width xxx -compact .It 0 Current values .It 1 Changeable values .It 2 Default values .It 3 Saved values .El .El .It Ic cmd Allows the user to send an arbitrary ATA or SCSI CDB to any device. The .Ic cmd function requires the .Fl c argument to specify the SCSI CDB or the .Fl a argument to specify ATA Command Block register values. Other arguments are optional, depending on the command type. The command and data specification syntax is documented in .Xr cam_cdbparse 3 . NOTE: If the CDB specified causes data to be transferred to or from the SCSI device in question, you MUST specify either .Fl i or .Fl o . .Bl -tag -width 17n .It Fl a Ar cmd Op args This specifies the content of 12 ATA Command Block registers (command, features, lba_low, lba_mid, lba_high, device, lba_low_exp, lba_mid_exp, lba_high_exp, features_exp, sector_count, sector_count_exp). .It Fl c Ar cmd Op args This specifies the SCSI CDB. SCSI CDBs may be 6, 10, 12 or 16 bytes. .It Fl d Specifies DMA protocol to be used for ATA command. .It Fl f Specifies FPDMA (NCQ) protocol to be used for ATA command. .It Fl i Ar len Ar fmt This specifies the amount of data to read, and how it should be displayed. If the format is .Sq - , .Ar len bytes of data will be read from the device and written to standard output. .It Fl o Ar len Ar fmt Op args This specifies the amount of data to be written to a device, and the data that is to be written. If the format is .Sq - , .Ar len bytes of data will be read from standard input and written to the device. .It Fl r Ar fmt This specifies that 11 result ATA Command Block registers should be displayed (status, error, lba_low, lba_mid, lba_high, device, lba_low_exp, lba_mid_exp, lba_high_exp, sector_count, sector_count_exp), and how.
If the format is .Sq - , 11 result registers will be written to standard output in hex. .El .It Ic smpcmd Allows the user to send an arbitrary Serial Management Protocol (SMP) command to a device. The .Ic smpcmd function requires the .Fl r argument to specify the SMP request to be sent, and the .Fl R argument to specify the format of the SMP response. The syntax for the SMP request and response arguments is documented in .Xr cam_cdbparse 3 . .Pp Note that SAS adapters that support SMP passthrough (at least the currently known adapters) do not accept CRC bytes from the user in the request and do not pass CRC bytes back to the user in the response. Therefore users should not include the CRC bytes in the length of the request and not expect CRC bytes to be returned in the response. .Bl -tag -width 17n .It Fl r Ar len Ar fmt Op args This specifies the size of the SMP request, without the CRC bytes, and the SMP request format. If the format is .Sq - , .Ar len bytes of data will be read from standard input and written as the SMP request. .It Fl R Ar len Ar fmt Op args This specifies the size of the buffer allocated for the SMP response, and the SMP response format. If the format is .Sq - , .Ar len bytes of data will be allocated for the response and the response will be written to standard output. .El .It Ic smprg Allows the user to send the Serial Management Protocol (SMP) Report General command to a device. .Nm will display the data returned by the Report General command. If the SMP target supports the long response format, the additional data will be requested and displayed automatically. .Bl -tag -width 8n .It Fl l Request the long response format only. Not all SMP targets support the long response format. This option causes .Nm to skip sending the initial report general request without the long bit set and only issue a report general request with the long bit set. .El .It Ic smppc Allows the user to issue the Serial Management Protocol (SMP) PHY Control command to a device. This function should be used with some caution, as it can render devices inaccessible, and could potentially cause data corruption as well. The .Fl p argument is required to specify the PHY to operate on. .Bl -tag -width 17n .It Fl p Ar phy Specify the PHY to operate on. This argument is required. .It Fl l Request the long request/response format. Not all SMP targets support the long response format. For the PHY Control command, this currently only affects whether the request length is set to a value other than 0. .It Fl o Ar operation Specify a PHY control operation. Only one .Fl o operation may be specified. The operation may be specified numerically (in decimal, hexadecimal, or octal) or one of the following operation names may be specified: .Bl -tag -width 16n .It nop No operation. It is not necessary to specify this argument. .It linkreset Send the LINK RESET command to the phy. .It hardreset Send the HARD RESET command to the phy. .It disable Send the DISABLE command to the phy. Note that the LINK RESET or HARD RESET commands should re-enable the phy. .It clearerrlog Send the CLEAR ERROR LOG command. This clears the error log counters for the specified phy. .It clearaffiliation Send the CLEAR AFFILIATION command. This clears the affiliation from the STP initiator port with the same SAS address as the SMP initiator that requests the clear operation. .It sataportsel Send the TRANSMIT SATA PORT SELECTION SIGNAL command to the phy. 
This will cause a SATA port selector to use the given phy as its active phy and make the other phy inactive. .It clearitnl Send the CLEAR STP I_T NEXUS LOSS command to the PHY. .It setdevname Send the SET ATTACHED DEVICE NAME command to the PHY. This requires the .Fl d argument to specify the device name. .El .It Fl d Ar name Specify the attached device name. This option is needed with the .Fl o Ar setdevname phy operation. The name is a 64-bit number, and can be specified in decimal, hexadecimal or octal format. .It Fl m Ar rate Set the minimum physical link rate for the phy. This is a numeric argument. Currently known link rates are: .Bl -tag -width 5n .It 0x0 Do not change current value. .It 0x8 1.5 Gbps .It 0x9 3 Gbps .It 0xa 6 Gbps .El .Pp Other values may be specified for newer physical link rates. .It Fl M Ar rate Set the maximum physical link rate for the phy. This is a numeric argument. See the .Fl m argument description for known link rate arguments. .It Fl T Ar pp_timeout Set the partial pathway timeout value, in microseconds. See the .Tn ANSI .Tn SAS Protocol Layer (SPL) specification for more information on this field. .It Fl a Ar enable|disable Enable or disable SATA slumber phy power conditions. .It Fl A Ar enable|disable Enable or disable SATA partial power conditions. .It Fl s Ar enable|disable Enable or disable SAS slumber phy power conditions. .It Fl S Ar enable|disable Enable or disable SAS partial phy power conditions. .El .It Ic smpphylist List phys attached to a SAS expander, the address of the end device attached to the phy, and the inquiry data for that device and peripheral devices attached to that device. The inquiry data and peripheral devices are displayed if available. .Bl -tag -width 5n .It Fl l Turn on the long response format for the underlying SMP commands used for this command. .It Fl q Only print out phys that are attached to a device in the CAM EDT (Existing Device Table). .El .It Ic smpmaninfo Send the SMP Report Manufacturer Information command to the device and display the response. .Bl -tag -width 5n .It Fl l Turn on the long response format for the underlying SMP commands used for this command. .El .It Ic debug Turn on CAM debugging printfs in the kernel. This requires options CAMDEBUG in your kernel config file. WARNING: enabling debugging printfs currently causes an EXTREME number of kernel printfs. You may have difficulty turning off the debugging printfs once they start, since the kernel will be busy printing messages and unable to service other requests quickly. The .Ic debug function takes a number of arguments: .Bl -tag -width 18n .It Fl I Enable CAM_DEBUG_INFO printfs. .It Fl P Enable CAM_DEBUG_PERIPH printfs. .It Fl T Enable CAM_DEBUG_TRACE printfs. .It Fl S Enable CAM_DEBUG_SUBTRACE printfs. .It Fl X Enable CAM_DEBUG_XPT printfs. .It Fl c Enable CAM_DEBUG_CDB printfs. This will cause the kernel to print out the SCSI CDBs sent to the specified device(s). .It Fl p Enable CAM_DEBUG_PROBE printfs. .It all Enable debugging for all devices. .It off Turn off debugging for all devices. .It bus Ns Op :target Ns Op :lun Turn on debugging for the given bus, target or lun. If the lun or target and lun are not specified, they are wildcarded. (i.e., just specifying a bus turns on debugging printfs for all devices on that bus.) .El .It Ic tags Show or set the number of "tagged openings" or simultaneous transactions we attempt to queue to a particular device.
By default, the .Ic tags command, with no command-specific arguments (i.e., only generic arguments) prints out the "soft" maximum number of transactions that can be queued to the device in question. For more detailed information, use the .Fl v argument described below. .Bl -tag -width 7n .It Fl N Ar tags Set the number of tags for the given device. This must be between the minimum and maximum number set in the kernel quirk table. The default for most devices that support tagged queueing is a minimum of 2 and a maximum of 255. The minimum and maximum values for a given device may be determined by using the .Fl v switch. The meaning of the .Fl v switch for this .Nm subcommand is described below. .It Fl q Be quiet, and do not report the number of tags. This is generally used when setting the number of tags. .It Fl v The verbose flag has special functionality for the .Em tags argument. It causes .Nm to print out the tagged queueing related fields of the XPT_GDEV_TYPE CCB: .Bl -tag -width 13n .It dev_openings This is the amount of capacity for transactions queued to a given device. .It dev_active This is the number of transactions currently queued to a device. .It devq_openings This is the kernel queue space for transactions. This count usually mirrors dev_openings except during error recovery operations when the device queue is frozen (device is not allowed to receive commands), the number of dev_openings is reduced, or transaction replay is occurring. .It devq_queued This is the number of transactions waiting in the kernel queue for capacity on the device. This number is usually zero unless error recovery is in progress. .It held The held count is the number of CCBs held by peripheral drivers that have either just been completed or are about to be released to the transport layer for service by a device. Held CCBs reserve capacity on a given device. .It mintags This is the current "hard" minimum number of transactions that can be queued to a device at once. The .Ar dev_openings value above cannot go below this number. The default value for .Ar mintags is 2, although it may be set higher or lower for various devices. .It maxtags This is the "hard" maximum number of transactions that can be queued to a device at one time. The .Ar dev_openings value cannot go above this number. The default value for .Ar maxtags is 255, although it may be set higher or lower for various devices. .El .El .It Ic negotiate Show or negotiate various communication parameters. Some controllers may not support setting or changing some of these values. For instance, the Adaptec 174x controllers do not support changing a device's sync rate or offset. The .Nm utility will not attempt to set the parameter if the controller indicates that it does not support setting the parameter. To find out what the controller supports, use the .Fl v flag. The meaning of the .Fl v flag for the .Ic negotiate command is described below. Also, some controller drivers do not support setting negotiation parameters, even if the underlying controller supports negotiation changes. Some controllers, such as the Advansys wide controllers, support enabling and disabling synchronous negotiation for a device, but do not support setting the synchronous negotiation rate. .Bl -tag -width 17n .It Fl a Attempt to make the negotiation settings take effect immediately by sending a Test Unit Ready command to the device. .It Fl c Show or set current negotiation settings. This is the default. .It Fl D Ar enable|disable Enable or disable disconnection. 
.It Fl M Ar mode Set ATA mode. .It Fl O Ar offset Set the command delay offset. .It Fl q Be quiet, do not print anything. This is generally useful when you want to set a parameter, but do not want any status information. .It Fl R Ar syncrate Change the synchronization rate for a device. The sync rate is a floating point value specified in MHz. So, for instance, .Sq 20.000 is a legal value, as is .Sq 20 . .It Fl T Ar enable|disable Enable or disable tagged queueing for a device. .It Fl U Show or set user negotiation settings. The default is to show or set current negotiation settings. .It Fl v The verbose switch has special meaning for the .Ic negotiate subcommand. It causes .Nm to print out the contents of a Path Inquiry (XPT_PATH_INQ) CCB sent to the controller driver. .It Fl W Ar bus_width Specify the bus width to negotiate with a device. The bus width is specified in bits. The only useful values to specify are 8, 16, and 32 bits. The controller must support the bus width in question in order for the setting to take effect. .El .Pp In general, sync rate and offset settings will not take effect for a device until a command has been sent to the device. The .Fl a switch above will automatically send a Test Unit Ready to the device so negotiation parameters will take effect. .It Ic format Issue the .Tn SCSI FORMAT UNIT command to the named device. .Pp .Em WARNING! WARNING! WARNING! .Pp Low level formatting a disk will destroy ALL data on the disk. Use extreme caution when issuing this command. Many users low-level format disks that do not really need to be low-level formatted. There are relatively few scenarios that call for low-level formatting a disk. One reason for low-level formatting a disk is to initialize the disk after changing its physical sector size. Another reason for low-level formatting a disk is to revive the disk if you are getting "medium format corrupted" errors from the disk in response to read and write requests. .Pp Some disks take longer than others to format. Users should specify a timeout long enough to allow the format to complete. The default format timeout is 3 hours, which should be long enough for most disks. Some hard disks will complete a format operation in a very short period of time (on the order of 5 minutes or less). This is often because the drive does not really support the FORMAT UNIT command -- it just accepts the command, waits a few minutes and then returns it. .Pp The .Sq format subcommand takes several arguments that modify its default behavior. The .Fl q and .Fl y arguments can be useful for scripts. .Bl -tag -width 6n .It Fl q Be quiet, do not print any status messages. This option will not disable the questions, however. To disable questions, use the .Fl y argument, below. .It Fl r Run in .Dq report only mode. This will report status on a format that is already running on the drive. .It Fl w Issue a non-immediate format command. By default, .Nm issues the FORMAT UNIT command with the immediate bit set. This tells the device to immediately return the format command, before the format has actually completed. Then, .Nm gathers .Tn SCSI sense information from the device every second to determine how far along in the format process it is. If the .Fl w argument is specified, .Nm will issue a non-immediate format command, and will be unable to print any information to let the user know what percentage of the disk has been formatted. .It Fl y Do not ask any questions. 
By default, .Nm will ask the user if he/she really wants to format the disk in question, and also if the default format command timeout is acceptable. The user will not be asked about the timeout if a timeout is specified on the command line. .El .It Ic sanitize Issue the .Tn SCSI SANITIZE command to the named device. .Pp .Em WARNING! WARNING! WARNING! .Pp ALL data in the cache and on the disk will be destroyed or made inaccessible. Recovery of the data is not possible. Use extreme caution when issuing this command. .Pp The .Sq sanitize subcommand takes several arguments that modify its default behavior. The .Fl q and .Fl y arguments can be useful for scripts. .Bl -tag -width 6n .It Fl a Ar operation Specify the sanitize operation to perform. .Bl -tag -width 16n .It overwrite Perform an overwrite operation by writing a user supplied data pattern to the device one or more times. The pattern is given by the .Fl P argument. The number of times is given by the .Fl c argument. .It block Perform a block erase operation. All the device's blocks are set to a vendor defined value, typically zero. .It crypto Perform a cryptographic erase operation. The encryption keys are changed to prevent the decryption of the data. .It exitfailure Exit a previously failed sanitize operation. A failed sanitize operation can only be exited if it was run in the unrestricted completion mode, as provided by the .Fl U argument. .El .It Fl c Ar passes The number of passes when performing an .Sq overwrite operation. Valid values are between 1 and 31. The default is 1. .It Fl I When performing an .Sq overwrite operation, the pattern is inverted between consecutive passes. .It Fl P Ar pattern Path to the file containing the pattern to use when performing an .Sq overwrite operation. The pattern is repeated as needed to fill each block. .It Fl q Be quiet, do not print any status messages. This option will not disable the questions, however. To disable questions, use the .Fl y argument, below. .It Fl U Perform the sanitize in the unrestricted completion mode. If the operation fails, it can later be exited with the .Sq exitfailure operation. .It Fl r Run in .Dq report only mode. This will report status on a sanitize that is already running on the drive. .It Fl w Issue a non-immediate sanitize command. By default, .Nm issues the SANITIZE command with the immediate bit set. This tells the device to immediately return the sanitize command, before the sanitize has actually completed. Then, .Nm gathers .Tn SCSI sense information from the device every second to determine how far along in the sanitize process it is. If the .Fl w argument is specified, .Nm will issue a non-immediate sanitize command, and will be unable to print any information to let the user know what percentage of the disk has been sanitized. .It Fl y Do not ask any questions. By default, .Nm will ask the user if he/she really wants to sanitize the disk in question, and also if the default sanitize command timeout is acceptable. The user will not be asked about the timeout if a timeout is specified on the command line. .El .It Ic idle Put ATA device into IDLE state. Optional parameter .Pq Fl t specifies automatic standby timer value in seconds. Value 0 disables timer. .It Ic standby Put ATA device into STANDBY state. Optional parameter .Pq Fl t specifies automatic standby timer value in seconds. Value 0 disables timer. .It Ic sleep Put ATA device into SLEEP state. Note that the only way to get the device out of this state may be a reset.
.It Ic apm If the optional parameter .Pq Fl l is specified, enables and sets the advanced power management level, where 1 -- minimum power, 127 -- maximum performance with standby, 128 -- minimum power without standby, 254 -- maximum performance. If not specified, APM is disabled. .It Ic aam If the optional parameter .Pq Fl l is specified, enables and sets the automatic acoustic management level, where 1 -- minimum noise, 254 -- maximum performance. If not specified, AAM is disabled. .It Ic security Update or report security settings, using an ATA identify command (0xec). By default, .Nm will print out the security support and associated settings of the device. The .Ic security command takes several arguments: .Bl -tag -width 0n .It Fl d Ar pwd .Pp Disable device security using the given password for the selected user according to the device's configured security level. .It Fl e Ar pwd .Pp Erase the device using the given password for the selected user. .Pp .Em WARNING! WARNING! WARNING! .Pp Issuing a secure erase will .Em ERASE ALL user data on the device and may take several hours to complete. .Pp When this command is used against an SSD drive, all its cells will be marked as empty, restoring it to factory default write performance. For SSDs, this action usually takes just a few seconds. .It Fl f .Pp Freeze the security configuration of the specified device. .Pp After command completion any other commands that update the device lock mode shall be command aborted. Frozen mode is disabled by power-off or hardware reset. .It Fl h Ar pwd .Pp Enhanced erase the device using the given password for the selected user. .Pp .Em WARNING! WARNING! WARNING! .Pp Issuing an enhanced secure erase will .Em ERASE ALL user data on the device and may take several hours to complete. .Pp An enhanced erase writes predetermined data patterns to all user data areas; all previously written user data shall be overwritten, including sectors that are no longer in use due to reallocation. .It Fl k Ar pwd .Pp Unlock the device using the given password for the selected user according to the device's configured security level. .It Fl l Ar high|maximum .Pp Specifies which security level to set when issuing a .Fl s Ar pwd command. The security level determines device behavior when the master password is used to unlock the device. When the security level is set to high, the device requires the unlock command and the master password to unlock. When the security level is set to maximum, the device requires a secure erase with the master password to unlock. .Pp This option must be used in conjunction with one of the security action commands. .Pp Defaults to .Em high . .It Fl q .Pp Be quiet, do not print any status messages. This option will not disable the questions, however. To disable questions, use the .Fl y argument, below. .It Fl s Ar pwd .Pp Password the device (enable security) using the given password for the selected user. This option can be combined with other options such as .Fl e Ar pwd . .Pp A master password may be set in addition to the user password. The purpose of the master password is to allow an administrator to establish a password that is kept secret from the user, and which may be used to unlock the device if the user password is lost. .Pp .Em Note: Setting the master password does not enable device security. .Pp If the master password is set and the drive supports a Master Revision Code feature, the Master Password Revision Code will be decremented.
.It Fl T Ar timeout .Pp Overrides the default timeout, specified in seconds, used for both .Fl e and .Fl h . This is useful if your system has problems processing long timeouts correctly. .Pp Usually the timeout is calculated from the information stored on the drive if present; otherwise it defaults to 2 hours. .It Fl U Ar user|master .Pp Specifies which user to set or use for the requested action command; valid values are user and master. .Pp This option must be used in conjunction with one of the security action commands. .Pp Defaults to .Em master . .It Fl y .Pp Confirm yes to dangerous options such as .Fl e without prompting for confirmation. .El .Pp If the password specified for any action commands does not match the configured password for the specified user, the command will fail. .Pp The password in all cases is limited to 32 characters; longer passwords will fail. .It Ic hpa Update or report Host Protected Area details. By default, .Nm will print out the HPA support and associated settings of the device. The .Ic hpa command takes several optional arguments: .Bl -tag -width 0n .It Fl f .Pp Freeze the HPA configuration of the specified device. .Pp After command completion any other commands that update the HPA configuration shall be command aborted. Frozen mode is disabled by power-off or hardware reset. .It Fl l .Pp Lock the HPA configuration of the device until a successful call to unlock or the next power-on reset occurs. .It Fl P .Pp Make the HPA max sectors persist across power-on reset or a hardware reset. This must be used in combination with .Fl s Ar max_sectors . .It Fl p Ar pwd .Pp Set the HPA configuration password required for unlock calls. .It Fl q .Pp Be quiet, do not print any status messages. This option will not disable the questions. To disable questions, use the .Fl y argument, below. .It Fl s Ar max_sectors .Pp Configures the maximum user accessible sectors of the device. This will change the number of sectors the device reports. .Pp .Em WARNING! WARNING! WARNING! .Pp Changing the max sectors of a device using this option will make the data on the device beyond the specified value inaccessible. .Pp Only one successful .Fl s Ar max_sectors call can be made without a power-on reset or a hardware reset of the device. .It Fl U Ar pwd .Pp Unlock the HPA configuration of the specified device using the given password. If the password specified does not match the password configured via .Fl p Ar pwd , the command will fail. .Pp After 5 failed unlock calls, due to password mismatch, the device will refuse additional unlock calls until after a power-on reset. .It Fl y .Pp Confirm yes to dangerous options such as .Fl s without prompting for confirmation. .El .Pp The password for all HPA commands is limited to 32 characters; longer passwords will fail. .It Ic fwdownload Program firmware of the named .Tn SCSI or ATA device using the image file provided. .Pp If the device is a .Tn SCSI device and it provides a recommended timeout for the WRITE BUFFER command (see the -.Nm +.Nm opcodes subcommand), that timeout will be used for the firmware download. The drive-recommended timeout value may be overridden on the command line with the .Fl t option. .Pp Current list of supported vendors for SCSI/SAS drives: .Bl -tag -width 10n .It HGST Tested with 4TB SAS drives, model number HUS724040ALS640. .It HITACHI .It HP .It IBM Tested with LTO-5 (ULTRIUM-HH5) and LTO-6 (ULTRIUM-HH6) tape drives.
There is a separate table entry for hard drives, because the update method for hard drives is different than the method for tape drives. .It PLEXTOR .It QUALSTAR .It QUANTUM .It SAMSUNG Tested with SM1625 SSDs. .It SEAGATE Tested with Constellation ES (ST32000444SS), ES.2 (ST33000651SS) and ES.3 (ST1000NM0023) drives. .It SmrtStor Tested with 400GB Optimus SSDs (TXA2D20400GA6001). .El .Pp .Em WARNING! WARNING! WARNING! .Pp Little testing has been done to make sure that different device models from each vendor work correctly with the fwdownload command. A vendor name appearing in the supported list means only that firmware of at least one device type from that vendor has successfully been programmed with the fwdownload command. Extra caution should be taken when using this command since there is no guarantee it will not break a device from the listed vendors. Ensure that you have a recent backup of the data on the device before performing a firmware update. .Pp Note that unknown .Tn SCSI protocol devices will not be programmed, since there is little chance of the firmware download succeeding. .Pp .Nm will currently attempt a firmware download to any .Tn ATA or .Tn SATA -device, since the standard +device, since the standard .Tn ATA DOWNLOAD MICROCODE command may work. Firmware downloads to .Tn ATA -and +and .Tn SATA devices are supported for devices connected to standard .Tn ATA and .Tn SATA controllers, and devices connected to SAS controllers with -.Tn SCSI +.Tn SCSI to .Tn ATA translation capability. In the latter case, .Nm uses the .Tn SCSI .Tn ATA -PASS-THROUGH command to send the +PASS-THROUGH command to send the .Tn ATA DOWNLOAD MICROCODE command to the drive. Some .Tn SCSI to .Tn ATA translation implementations don't work fully when translating .Tn SCSI WRITE BUFFER commands to .Tn ATA DOWNLOAD MICROCODE commands, but do support .Tn ATA passthrough well enough to do a firmware download. .Bl -tag -width 11n .It Fl f Ar fw_image Path to the firmware image file to be downloaded to the specified device. .It Fl q Do not print informational messages, only print errors. This option should be used with the .Fl y option to suppress all output. .It Fl s Run in simulation mode. Device checks are run and the confirmation dialog is shown, but no firmware download will occur. .It Fl v Show .Tn SCSI or .Tn ATA errors in the event of a failure. .Pp In simulation mode, print out the .Tn SCSI CDB or .Tn ATA register values that would be used for the firmware download command. .It Fl y Do not ask for confirmation. .El .It Ic persist Persistent reservation support. Persistent reservations are a way to reserve a particular .Tn SCSI LUN for use by one or more .Tn SCSI initiators. If the .Fl i option is specified, .Nm will issue the .Tn SCSI PERSISTENT RESERVE IN command using the requested service action. If the .Fl o option is specified, .Nm will issue the .Tn SCSI PERSISTENT RESERVE OUT command using the requested service action. One of those two options is required. .Pp Persistent reservations are complex, and fully explaining them is outside the scope of this manual. Please visit http://www.t10.org and download the latest SPC spec for a full explanation of persistent reservations. .Bl -tag -width 8n .It Fl i Ar mode Specify the service action for the PERSISTENT RESERVE IN command. Supported service actions: .Bl -tag -width 19n .It read_keys Report the current persistent reservation generation (PRgeneration) and any registered keys. .It read_reservation Report the persistent reservation, if any. 
.It report_capabilities Report the persistent reservation capabilities of the LUN. .It read_full_status Report the full status of persistent reservations on the LUN. .El .It Fl o Ar mode Specify the service action for the PERSISTENT RESERVE OUT command. For service actions like register that are components of other service action names, the entire name must be specified. Otherwise, enough of the service action name must be specified to distinguish it from other possible service actions. Supported service actions: .Bl -tag -width 15n .It register Register a reservation key with the LUN or unregister a reservation key. To register a key, specify the requested key as the Service Action Reservation Key. To unregister a key, specify the previously registered key as the Reservation Key. To change a key, specify the old key as the Reservation Key and the new key as the Service Action Reservation Key. .It register_ignore This is similar to the register subcommand, except that the Reservation Key is ignored. The Service Action Reservation Key will overwrite any previous key registered for the initiator. .It reserve Create a reservation. A key must be registered with the LUN before the LUN can be reserved, and it must be specified as the Reservation Key. The type of reservation must also be specified. The scope defaults to LUN scope (LU_SCOPE), but may be changed. .It release Release a reservation. The Reservation Key must be specified. .It clear Release a reservation and remove all keys from the device. The Reservation Key must be specified. .It preempt Remove a reservation belonging to another initiator. The Reservation Key must be specified. The Service Action Reservation Key may be specified, depending on the operation being performed. .It preempt_abort Remove a reservation belonging to another initiator and abort all outstanding commands from that initiator. The Reservation Key must be specified. The Service Action Reservation Key may be specified, depending on the operation being performed. .It register_move Register another initiator with the LUN, and establish a reservation on the LUN for that initiator. The Reservation Key and Service Action Reservation Key must be specified. .It replace_lost Replace Lost Reservation information. .El .It Fl a Set the All Target Ports (ALL_TG_PT) bit. This requests that the key registration be applied to all target ports and not just the particular target port that receives the command. This only applies to the register and register_ignore actions. .It Fl I Ar tid Specify a Transport ID. This only applies to the Register and Register and Move service actions for Persistent Reserve Out. Multiple Transport IDs may be specified with multiple .Fl I arguments. With the Register service action, specifying one or more Transport IDs implicitly enables the .Fl S option which turns on the SPEC_I_PT bit. Transport IDs generally have the format protocol,id. .Bl -tag -width 5n .It SAS A SAS Transport ID consists of .Dq sas, followed by a 64-bit SAS address. For example: .Pp .Dl sas,0x1234567812345678 .It FC A Fibre Channel Transport ID consists of .Dq fcp, followed by a 64-bit Fibre Channel World Wide Name. For example: .Pp .Dl fcp,0x1234567812345678 .It SPI A Parallel SCSI address consists of .Dq spi, followed by a SCSI target ID and a relative target port identifier. For example: .Pp .Dl spi,4,1 .It 1394 An IEEE 1394 (Firewire) Transport ID consists of .Dq sbp, followed by a 64-bit EUI-64 IEEE 1394 node unique identifier. 
For example: .Pp .Dl sbp,0x1234567812345678 .It RDMA A SCSI over RDMA Transport ID consists of .Dq srp, followed by a 128-bit RDMA initiator port identifier. The port identifier must be exactly 32 or 34 (if the leading 0x is included) hexadecimal digits. Only hexadecimal (base 16) numbers are supported. For example: .Pp .Dl srp,0x12345678123456781234567812345678 .It iSCSI An iSCSI Transport ID consists of an iSCSI name and optionally a separator and iSCSI session ID. For example, if only the iSCSI name is specified: .Pp .Dl iqn.2012-06.com.example:target0 .Pp If the iSCSI separator and initiator session ID are specified: .Pp .Dl iqn.2012-06.com.example:target0,i,0x123 .It PCIe A SCSI over PCIe Transport ID consists of .Dq sop, followed by a PCIe Routing ID. The Routing ID consists of a bus, device and function or in the alternate form, a bus and function. The bus must be in the range of 0 to 255 inclusive and the device must be in the range of 0 to 31 inclusive. The function must be in the range of 0 to 7 inclusive if the standard form is used, and in the range of 0 to 255 inclusive if the alternate form is used. For example, if a bus, device and function are specified for the standard Routing ID form: .Pp .Dl sop,4,5,1 .Pp If the alternate Routing ID form is used: .Pp .Dl sop,4,1 .El .It Fl k Ar key Specify the Reservation Key. This may be in decimal, octal or hexadecimal format. The value is zero by default if not otherwise specified. The value must be between 0 and 2^64 - 1, inclusive. .It Fl K Ar key Specify the Service Action Reservation Key. This may be in decimal, octal or hexadecimal format. The value is zero by default if not otherwise specified. The value must be between 0 and 2^64 - 1, inclusive. .It Fl p Enable the Activate Persist Through Power Loss bit. This is only used for the register and register_ignore actions. This requests that the reservation persist across power loss events. .It Fl s Ar scope Specify the scope of the reservation. The scope may be specified by name or by number. The scope is ignored for register, register_ignore and clear. If the desired scope isn't available by name, you may specify the number. .Bl -tag -width 7n .It lun LUN scope (0x00). This encompasses the entire LUN. .It extent Extent scope (0x01). .It element Element scope (0x02). .El .It Fl R Ar rtp Specify the Relative Target Port. This only applies to the Register and Move service action of the Persistent Reserve Out command. .It Fl S Enable the SPEC_I_PT bit. This only applies to the Register service action of Persistent Reserve Out. You must also specify at least one Transport ID with .Fl I if this option is set. If you specify a Transport ID, this option is automatically set. It is an error to specify this option for any service action other than Register. .It Fl T Ar type Specify the reservation type. The reservation type may be specified by name or by number. If the desired reservation type isn't available by name, you may specify the number. Supported reservation type names: .Bl -tag -width 11n .It read_shared Read Shared mode. .It wr_ex Write Exclusive mode. May also be specified as .Dq write_exclusive . .It rd_ex Read Exclusive mode. May also be specified as .Dq read_exclusive . .It ex_ac Exclusive access mode. May also be specified as .Dq exclusive_access . .It wr_ex_ro Write Exclusive Registrants Only mode. May also be specified as .Dq write_exclusive_reg_only . .It ex_ac_ro Exclusive Access Registrants Only mode. May also be specified as .Dq exclusive_access_reg_only .
.It wr_ex_ar Write Exclusive All Registrants mode. May also be specified as .Dq write_exclusive_all_regs . .It ex_ac_ar Exclusive Access All Registrants mode. May also be specified as .Dq exclusive_access_all_regs . .El .It Fl U Specify that the target should unregister the initiator that sent the Register and Move request. By default, the target will not unregister the initiator that sends the Register and Move request. This option only applies to the Register and Move service action of the Persistent Reserve Out command. .El .It Ic attrib Issue the .Tn SCSI READ or WRITE ATTRIBUTE commands. These commands are used to read and write attributes in Medium Auxiliary Memory (MAM). The most common place Medium Auxiliary Memory is found is small flash chips included in tape cartridges. For instance, .Tn LTO tapes have MAM. Either the -.Fl r +.Fl r option or the -.Fl w +.Fl w option must be specified. .Bl -tag -width 14n .It Fl r Ar action Specify the READ ATTRIBUTE service action. .Bl -tag -width 11n .It attr_values Issue the ATTRIBUTE VALUES service action. Read and decode the available attributes and their values. .It attr_list Issue the ATTRIBUTE LIST service action. List the attributes that are available to read and write. .It lv_list Issue the LOGICAL VOLUME LIST service action. List the available logical volumes in the MAM. .It part_list Issue the PARTITION LIST service action. List the available partitions in the MAM. .It supp_attr Issue the SUPPORTED ATTRIBUTES service action. List attributes that are supported for reading or writing. These attributes may or may not be currently present in the MAM. .El .It Fl w Ar attr Specify an attribute to write to the MAM. This option is not yet implemented. .It Fl a Ar num Specify the attribute number to display. This option only works with the attr_values, attr_list and supp_attr -arguments to +arguments to .Fl r . .It Fl c Display cached attributes. If the device supports this flag, it allows displaying attributes for the last piece of media loaded in the drive. .It Fl e Ar num Specify the element address. This is used for specifying which element number in a medium changer to access when reading attributes. The element number could be for a picker, portal, slot or drive. .It Fl F Ar form1,form2 Specify the output format for the attribute values (attr_val) display as a comma separated list of options. The default output is currently set to field_all,nonascii_trim,text_raw. Once this code is ported to FreeBSD 10, any text fields will be converted -from their codeset to the user's native codeset with +from their codeset to the user's native codeset with .Xr iconv 3 . .Pp The text options are mutually exclusive; if you specify more than one, you will get unpredictable results. The nonascii options are also mutually exclusive. Most of the field options may be logically ORed together. .Bl -tag -width 12n .It text_esc Print text fields with non-ASCII characters escaped. .It text_raw Print text fields natively, with no codeset conversion. .It nonascii_esc If any non-ASCII characters occur in fields that are supposed to be ASCII, escape the non-ASCII characters. .It nonascii_trim If any non-ASCII characters occur in fields that are supposed to be ASCII, omit the non-ASCII characters. .It nonascii_raw If any non-ASCII characters occur in fields that are supposed to be ASCII, print them as they are. .It field_all Print all of the prefix fields: description, attribute number, attribute size, and the attribute's readonly status.
If field_all is specified, specifying any other field options will not have an effect. .It field_none Print none of the prefix fields, and only print out the attribute value. If field_none is specified, specifying any other field options will result in those fields being printed. .It field_desc Print out the attribute description. .It field_num Print out the attribute number. .It field_size Print out the attribute size. .It field_rw Print out the attribute's readonly status. .El .It Fl p Ar part Specify the partition. When the media has multiple partitions, specifying different partition numbers allows seeing the values for each individual partition. .It Fl s Ar start_num Specify the starting attribute number. This requests that the target device return attribute information starting at the given number. .It Fl T Ar elem_type Specify the element type. For medium changer devices, this allows specifying the type of the element referenced in the element address ( .Fl e ) . Valid types are: .Dq all , .Dq picker , .Dq slot , .Dq portal , and .Dq drive . .It Fl V Ar vol_num Specify the number of the logical volume to operate on. If the media has multiple logical volumes, this will allow displaying or writing attributes on the given logical volume. .El .It Ic opcodes Issue the REPORT SUPPORTED OPCODES service action of the .Tn SCSI MAINTENANCE IN command. Without arguments, this command will return a list of all .Tn SCSI commands supported by the device, including service actions of commands that support service actions. It will also include the .Tn SCSI CDB (Command Descriptor Block) length for each command, and the description of each command if it is known. .Bl -tag -width 18n .It Fl o Ar opcode Request information on a specific opcode instead of the list of supported commands. If supported, the target will return a CDB-like structure that indicates the opcode, service action (if any), and a mask of bits that are supported in that CDB. .It Fl s Ar service_action For commands that support a service action, specify the service action to query. .It Fl N If a service action is specified for a given opcode, and the device does not support the given service action, the device should not return a .Tn SCSI error, but rather indicate in the returned parameter data that the command is not supported. By default, if a service action is specified for an opcode, and service actions are not supported for the opcode in question, the device will return an error. .It Fl T Include timeout values. This option works with the default display, which includes all commands supported by the device, and with the .Fl o and .Fl s options, which request information on a specific command and service action. This requests that the device report Nominal and Recommended timeout values for the given command or commands. The timeout values are in seconds. -The timeout descriptor also includes a command-specific .El .It Ic zone Manage .Tn SCSI and .Tn ATA Zoned Block devices. This allows managing devices that conform to the .Tn SCSI Zoned Block Commands (ZBC) and .Tn ATA Zoned ATA Command Set (ZAC) specifications. Devices using these command sets are usually hard drives using Shingled Magnetic Recording (SMR). There are three types of SMR drives: .Bl -tag -width 13n .It Drive Managed Drive Managed drives look and act just like a standard random access block device, but underneath, the drive reads and writes the bulk of its capacity using SMR zones.
Sequential writes will yield better performance, but writing sequentially is not required. .It Host Aware Host Aware drives expose the underlying zone layout via .Tn SCSI or .Tn ATA commands and allow the host to manage the zone conditions. The host is not required to manage the zones on the drive, though. Sequential writes will yield better performance in Sequential Write Preferred zones, but the host can write randomly in those zones. .It Host Managed Host Managed drives expose the underlying zone layout via .Tn SCSI or .Tn ATA commands. The host is required to access the zones according to the rules described by the zone layout. Any commands that violate the rules will be returned with an error. .El .Pp SMR drives are divided into zones (typically in the range of 256MB each) that fall into three general categories: .Bl -tag -width 20n .It Conventional These are also known as Non Write Pointer zones. These zones can be randomly written without an unexpected performance penalty. .It Sequential Preferred These zones should be written sequentially starting at the write pointer for the zone. They may be written randomly. Writes that do not conform to the zone layout may be significantly slower than expected. .It Sequential Required These zones must be written sequentially. If they are not written sequentially, starting at the write pointer, the command will fail. .El .Pp .Bl -tag -width 12n .It Fl c Ar cmd Specify the zone subcommand: .Bl -tag -width 6n .It rz Issue the Report Zones command. All zones are returned by default. Specify report options with .Fl o and printing options with .Fl P . Specify the starting LBA with .Fl l . Note that .Dq reportzones is also accepted as a command argument. .It open Explicitly open the zone specified by the starting LBA. .It close Close the zone specified by the starting LBA. .It finish Finish the zone specified by the starting LBA. .It rwp Reset the write pointer for the zone specified by the starting LBA. .El .It Fl a For the Open, Close, Finish, and Reset Write Pointer operations, apply the operation to all zones on the drive. .It Fl l Ar lba Specify the starting LBA. For the Report Zones command, this tells the drive to report starting with the zone that starts at the given LBA. For the other commands, this allows the user to identify the zone requested by its starting LBA. The LBA may be specified in decimal, hexadecimal or octal notation. .It Fl o Ar rep_opt For the Report Zones command, specify a subset of zones to report. .Bl -tag -width 8n .It all Report all zones. This is the default. .It empty Report only empty zones. .It imp_open Report zones that are implicitly open. This means that the host has sent a write to the zone without explicitly opening the zone. .It exp_open Report zones that are explicitly open. .It closed Report zones that have been closed by the host. .It full Report zones that are full. .It ro Report zones that are in the read only state. Note that .Dq readonly is also accepted as an argument. .It offline Report zones that are in the offline state. .It reset Report zones where the device recommends resetting write pointers. .It nonseq Report zones that have the Non Sequential Resources Active flag set. These are zones that are Sequential Write Preferred, but have been written non-sequentially. .It nonwp Report Non Write Pointer zones, also known as Conventional zones. .El .It Fl P Ar print_opt Specify a printing option for Report Zones: .Bl -tag -width 7n .It normal Normal Report Zones output. This is the default.
The summary and column headings are printed, fields are separated by spaces and the fields themselves may contain spaces. .It summary Just print the summary: the number of zones, the maximum LBA (LBA of the -last logical block on the drive), and the value of the -.Dq same +last logical block on the drive), and the value of the +.Dq same field. The .Dq same field describes whether the zones on the drive are all identical, all different, or whether they are the same except for the last zone, etc. .It script Print the zones in a script friendly format. The summary and column headings are omitted, the fields are separated by commas, and the fields do not contain spaces. The fields contain underscores where spaces would normally be used. .El .El .It Ic epc Issue .Tn ATA Extended Power Conditions (EPC) feature set commands. This only works on .Tn ATA protocol drives, and will not work on .Tn SCSI protocol drives. It will work on .Tn SATA drives behind a .Tn SCSI to .Tn ATA translation layer (SAT). It may be helpful to read the ATA Command Set - 4 (ACS-4) description of the Extended Power Conditions feature set, available at t13.org, to -understand the details of this particular +understand the details of this particular .Nm subcommand. .Bl -tag -width 6n .It Fl c Ar cmd Specify the epc subcommand .Bl -tag -width 7n .It restore Restore drive power condition settings. .Bl -tag -width 6n .It Fl r Ar src Specify the source for the restored power settings, either .Dq default or .Dq saved . This argument is required. .It Fl s Save the settings. This only makes sense to specify when restoring from defaults. .El .It goto Go to the specified power condition. .Bl -tag -width 7n .It Fl p Ar cond Specify the power condition: Idle_a, Idle_b, Idle_c, Standby_y, Standby_z. This argument is required. .It Fl D Specify delayed entry to the power condition. The drive, if it supports this, can enter the power condition after the command completes. .It Fl H Hold the power condition. If the drive supports this option, it will hold the power condition and reject all commands that would normally cause it to exit that power condition. .El .It timer Set the timer value for a power condition and enable or disable the condition. See the .Dq list display described below to see what the current timer settings are for each Idle and Standby mode supported by the drive. .Bl -tag -width 8n .It Fl e Enable the power condition. -One of +One of .Fl e or .Fl d is required. .It Fl d Disable the power condition. One of .Fl d or .Fl e is required. .It Fl T Ar timer Specify the timer in seconds. The user may specify a timer as a floating point number with a maximum supported resolution of tenths of a second. Drives may or may not support sub-second timer values. .It Fl p Ar cond Specify the power condition: Idle_a, Idle_b, Idle_c, Standby_y, Standby_z. This argument is required. .It Fl s Save the timer and power condition enable/disable state. By default, if this option is not specified, only the current values for this power condition will be affected. .El .It state Enable or disable a particular power condition. .Bl -tag -width 7n .It Fl e Enable the power condition. -One of +One of .Fl e or .Fl d is required. .It Fl d Disable the power condition. One of .Fl d or .Fl e is required. .It Fl p Ar cond Specify the power condition: Idle_a, Idle_b, Idle_c, Standby_y, Standby_z. This argument is required. .It Fl s Save the power condition enable/disable state. 
By default, if this option is not specified, only the current values for this power condition will be affected. .El .It enable Enable the Extended Power Condition (EPC) feature set. .It disable Disable the Extended Power Condition (EPC) feature set. .It source Specify the EPC power source. .Bl -tag -width 6n .It Fl S Ar src Specify the power source, either .Dq battery or .Dq nonbattery . .El .It status Get the current status of several parameters related to the Extended Power Condition (EPC) feature set, including whether APM and EPC are supported and enabled, whether Low Power Standby is supported, whether setting the EPC power source is supported, and the current power condition. .Bl -tag -width 3n .It Fl P Only report the current power condition. Some drives will exit their current power condition if a command other than the .Tn ATA CHECK POWER MODE command is received. If this flag is specified, .Nm will only issue the .Tn ATA CHECK POWER MODE command to the drive. .El .It list Display the .Tn ATA Power Conditions log (Log Address 0x08). This shows the list of Idle and Standby power conditions the drive supports, and a number of parameters about each condition, including whether it is enabled and what the timer value is. .El .El .It Ic timestamp Issue REPORT TIMESTAMP or SET TIMESTAMP .Tn SCSI commands. Either the .Fl r option or the .Fl s option must be specified. .Bl -tag -width 6n .It Fl r Report the device's timestamp. If no more arguments are specified, the timestamp will be reported using the national representation of the date and time, followed by the time zone. .Bl -tag -width 9n .It Fl f Ar format Specify the strftime format string, as documented in .Xr strftime 3 , to be used to format the reported timestamp. .It Fl m Report the timestamp as milliseconds since the epoch. .It Fl U Report the timestamp using the national representation of the date and time, but override the system time zone and use UTC instead. .El .El .Bl -tag -width 6n .It Fl s Set the device's timestamp. Either the .Fl f -and +and .Fl T options or the .Fl U option must be specified. .Bl -tag -width 9n .It Fl f Ar format Specify the strptime format string, as documented in .Xr strptime 3 . The time must also be specified with the -.Fl T +.Fl T option. .It Fl T Ar time Provide the time in the format specified with the .Fl f option. .It Fl U Set the timestamp to the host system's time in UTC. .El .El .It Ic help Print out verbose usage information. .El .Sh ENVIRONMENT The .Ev SCSI_MODES variable allows the user to specify an alternate mode page format file. .Pp The .Ev EDITOR variable determines which text editor .Nm starts when editing mode pages. .Sh FILES .Bl -tag -width /usr/share/misc/scsi_modes -compact .It Pa /usr/share/misc/scsi_modes is the SCSI mode format database. .It Pa /dev/xpt0 is the transport layer device. .It Pa /dev/pass* are the CAM application passthrough devices. .El .Sh EXAMPLES .Dl camcontrol eject -n cd -u 1 -v .Pp Eject the CD from cd1, and print SCSI sense information if the command fails. .Pp .Dl camcontrol tur da0 .Pp Send the SCSI test unit ready command to da0. The .Nm utility will report whether the disk is ready, but will not display sense information if the command fails since the .Fl v switch was not specified. .Bd -literal -offset indent camcontrol tur da1 -E -C 4 -t 50 -Q head -v .Ed .Pp Send a test unit ready command to da1. Enable kernel error recovery. Specify a retry count of 4, and a timeout of 50 seconds.
Enable sense printing (with the .Fl v flag) if the command fails. Since error recovery is turned on, the disk will be spun up if it is not currently spinning. The .Tn SCSI task attribute for the command will be set to Head of Queue. The .Nm utility will report whether the disk is ready. .Bd -literal -offset indent camcontrol cmd -n cd -u 1 -v -c "3C 00 00 00 00 00 00 00 0e 00" \e -i 0xe "s1 i3 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1" .Ed .Pp Issue a READ BUFFER command (0x3C) to cd1. Display the buffer size of cd1, and display the first 10 bytes from the cache on cd1. Display SCSI sense information if the command fails. .Bd -literal -offset indent camcontrol cmd -n cd -u 1 -v -c "3B 00 00 00 00 00 00 00 0e 00" \e -o 14 "00 00 00 00 1 2 3 4 5 6 v v v v" 7 8 9 8 .Ed .Pp Issue a WRITE BUFFER (0x3B) command to cd1. Write out 10 bytes of data, not including the (reserved) 4 byte header. Print out sense information if the command fails. Be very careful with this command; improper use may cause data corruption. .Bd -literal -offset indent camcontrol modepage da3 -m 1 -e -P 3 .Ed .Pp Edit mode page 1 (the Read-Write Error Recovery page) for da3, and save the settings on the drive. Mode page 1 contains a disk drive's auto read and write reallocation settings, among other things. .Pp .Dl camcontrol rescan all .Pp Rescan all SCSI buses in the system for devices that have been added, removed or changed. .Pp .Dl camcontrol rescan 0 .Pp Rescan SCSI bus 0 for devices that have been added, removed or changed. .Pp .Dl camcontrol rescan 0:1:0 .Pp Rescan SCSI bus 0, target 1, lun 0 to see if it has been added, removed, or changed. .Pp .Dl camcontrol tags da5 -N 24 .Pp Set the number of concurrent transactions for da5 to 24. .Bd -literal -offset indent camcontrol negotiate -n da -u 4 -T disable .Ed .Pp Disable tagged queueing for da4. .Bd -literal -offset indent camcontrol negotiate -n da -u 3 -R 20.000 -O 15 -a .Ed .Pp Negotiate a sync rate of 20MHz and an offset of 15 with da3. Then send a Test Unit Ready command to make the settings take effect. .Bd -literal -offset indent camcontrol smpcmd ses0 -v -r 4 "40 0 00 0" -R 1020 "s9 i1" .Ed .Pp Send the SMP REPORT GENERAL command to ses0, and display the number of PHYs it contains. Display SMP errors if the command fails. .Bd -literal -offset indent camcontrol security ada0 .Ed .Pp Report security support and settings for ada0. .Bd -literal -offset indent camcontrol security ada0 -U user -s MyPass .Ed .Pp Enable security on device ada0 with the password MyPass. .Bd -literal -offset indent camcontrol security ada0 -U user -e MyPass .Ed .Pp Secure erase ada0, which has had security enabled with user password MyPass. .Pp .Em WARNING! WARNING! WARNING! .Pp This will .Em ERASE ALL data from the device, so back up your data before using! .Pp This command can be used against an SSD drive to restore it to factory default write performance. .Bd -literal -offset indent camcontrol hpa ada0 .Ed .Pp Report HPA support and settings for ada0 (also reported via identify). .Bd -literal -offset indent camcontrol hpa ada0 -s 10240 .Ed .Pp Enable HPA on ada0, setting the maximum reported sectors to 10240. .Pp .Em WARNING! WARNING! WARNING! .Pp This will .Em PREVENT ACCESS to all data on the device beyond this limit until HPA is disabled by setting HPA to native max sectors of the device, which can only be done after a power-on or hardware reset! .Pp .Em DO NOT use this on a device which has an active filesystem!
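.Pp The following additional examples are illustrative sketches; the device names (da0 and ada0) and the firmware image path are placeholders and should be replaced with values appropriate to the system at hand. .Bd -literal -offset indent camcontrol readcap da0 -h .Ed .Pp Print the size of da0 in human readable (base 2) format, instead of the default last logical block and blocksize output. .Bd -literal -offset indent camcontrol defects da0 -f phys -G .Ed .Pp Print out the grown defect list of da0 in physical sector format. .Bd -literal -offset indent camcontrol fwdownload ada0 -f /path/to/firmware.img -s -v .Ed .Pp Run a simulated firmware download to ada0 with the given image file: the device checks are run and the confirmation dialog is shown, and the CDB or register values that would be used are printed, but no firmware is actually downloaded.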
.Bd -literal -offset indent camcontrol persist da0 -v -i read_keys .Ed .Pp This will read any persistent reservation keys registered with da0, and display any errors encountered when sending the PERSISTENT RESERVE IN .Tn SCSI command. .Bd -literal -offset indent camcontrol persist da0 -v -o register -a -K 0x12345678 .Ed .Pp This will register the persistent reservation key 0x12345678 with da0, apply that registration to all ports on da0, and display any errors that occur when sending the PERSISTENT RESERVE OUT command. .Bd -literal -offset indent camcontrol persist da0 -v -o reserve -s lun -k 0x12345678 -T ex_ac .Ed .Pp This will reserve da0 for the exclusive use of the initiator issuing the command. The scope of the reservation is the entire LUN. Any errors sending the PERSISTENT RESERVE OUT command will be displayed. .Bd -literal -offset indent camcontrol persist da0 -v -i read_full .Ed .Pp This will display the full status of all reservations on da0 and print out status if there are any errors. .Bd -literal -offset indent camcontrol persist da0 -v -o release -k 0x12345678 -T ex_ac .Ed .Pp This will release a reservation on da0 of the type ex_ac (Exclusive Access). The Reservation Key for this registration is 0x12345678. Any errors that occur will be displayed. .Bd -literal -offset indent camcontrol persist da0 -v -o register -K 0x12345678 -S \e -I sas,0x1234567812345678 -I sas,0x8765432187654321 .Ed .Pp This will register the key 0x12345678 with da0, specifying that it applies to the SAS initiators with SAS addresses 0x1234567812345678 and 0x8765432187654321. .Bd -literal -offset indent camcontrol persist da0 -v -o register_move -k 0x87654321 \e -K 0x12345678 -U -p -R 2 -I fcp,0x1234567812345678 .Ed .Pp This will move the registration from the current initiator, whose Registration Key is 0x87654321, to the Fibre Channel initiator with the Fibre Channel World Wide Node Name 0x1234567812345678. A new registration key, 0x12345678, will be registered for the initiator with the Fibre Channel World Wide Node Name 0x1234567812345678, and the current initiator will be unregistered from the target. The reservation will be moved to relative target port 2 on the target device. The registration will persist across power losses. .Bd -literal -offset indent camcontrol attrib sa0 -v -i attr_values -p 1 .Ed .Pp This will read and decode the attribute values from partition 1 on the tape in tape drive sa0, and will display any .Tn SCSI errors that result. .Pp .Bd -literal -offset indent camcontrol zone da0 -v -c rz -P summary .Ed .Pp This will request the SMR zone list from disk da0, and print out a summary of the zone parameters, and display any .Tn SCSI or .Tn ATA errors that result. .Pp .Bd -literal -offset indent camcontrol zone da0 -v -c rz -o reset .Ed .Pp This will request from the disk da0 the list of SMR zones that should have their write pointer reset, and display any .Tn SCSI or .Tn ATA errors that result. .Pp .Bd -literal -offset indent camcontrol zone da0 -v -c rwp -l 0x2c80000 .Ed .Pp This will issue the Reset Write Pointer command to disk da0 for the zone that starts at LBA 0x2c80000 and display any .Tn SCSI or .Tn ATA errors that result. .Pp .Bd -literal -offset indent camcontrol epc ada0 -c timer -T 60.1 -p Idle_a -e -s .Ed .Pp Set the timer for the Idle_a power condition on drive .Pa ada0 to 60.1 seconds, enable that particular power condition, and save the timer value and the enabled state of the power condition.
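.Pp If the EPC feature set itself is disabled, it can be turned on with the .Cm enable subcommand described above. A minimal sketch (the drive name is illustrative): .Bd -literal -offset indent camcontrol epc ada0 -c enable .Ed .Pp Enable the Extended Power Condition feature set on drive .Pa ada0 ; the .Cm disable subcommand turns it back off.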
.Pp .Bd -literal -offset indent camcontrol epc da4 -c goto -p Standby_z -H .Ed .Pp Tell drive .Pa da4 to go to the Standby_z power state (which is -the drive's lowest power state) and hold in that state until it is +the drive's lowest power state) and hold in that state until it is explicitly released by another .Cm goto command. .Pp .Bd -literal -offset indent camcontrol epc da2 -c status -P .Ed .Pp Report only the power state of drive .Pa da2 . -Some drives will power up in response to the commands sent by the +Some drives will power up in response to the commands sent by the .Pa status subcommand, and the .Fl P option causes .Nm to only send the -.Tn ATA +.Tn ATA CHECK POWER MODE command, which should not trigger a change in the drive's power state. .Pp .Bd -literal -offset indent camcontrol epc ada0 -c list .Ed .Pp Display the ATA Power Conditions log (Log Address 0x08) for drive .Pa ada0 . .Pp .Bd -literal -offset indent camcontrol timestamp sa0 -s -f "%a, %d %b %Y %T %z" \e -T "Wed, 26 Oct 2016 21:43:57 -0600" .Ed .Pp Set the timestamp of drive .Pa sa0 using a -.Xr strptime 3 +.Xr strptime 3 format string followed by a time string that was created using this format string. .Sh SEE ALSO .Xr cam 3 , .Xr cam_cdbparse 3 , .Xr cam 4 , .Xr pass 4 , .Xr xpt 4 .Sh HISTORY The .Nm utility first appeared in .Fx 3.0 . .Pp The mode page editing code and arbitrary SCSI command code are based upon code in the old .Xr scsi 8 utility and .Xr scsi 3 library, written by Julian Elischer and Peter Dufault. The .Xr scsi 8 program first appeared in .Bx 386 0.1.2.4 , and first appeared in .Fx in .Fx 2.0.5 . .Sh AUTHORS .An Kenneth Merry Aq Mt ken@FreeBSD.org .Sh BUGS The code that parses the generic command line arguments does not know that some of the subcommands take multiple arguments. So if, for instance, you tried something like this: .Bd -literal -offset indent camcontrol cmd -n da -u 1 -c "00 00 00 00 00 v" 0x00 -v .Ed .Pp The sense information from the test unit ready command would not get printed out, since the first .Xr getopt 3 call in .Nm bails out when it sees the second argument to .Fl c (0x00), above. Fixing this behavior would take some gross code, or changes to the .Xr getopt 3 interface. The best way to circumvent this problem is to always make sure to specify generic .Nm arguments before any command-specific arguments. Index: user/markj/netdump/sbin/geom/class/eli/geli.8 =================================================================== --- user/markj/netdump/sbin/geom/class/eli/geli.8 (revision 332407) +++ user/markj/netdump/sbin/geom/class/eli/geli.8 (revision 332408) @@ -1,1116 +1,1119 @@ .\" Copyright (c) 2005-2011 Pawel Jakub Dawidek .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd September 17, 2017 +.Dd April 10, 2018 .Dt GELI 8 .Os .Sh NAME .Nm geli .Nd "control utility for the cryptographic GEOM class" .Sh SYNOPSIS To compile GEOM_ELI into your kernel, add the following lines to your kernel configuration file: .Bd -ragged -offset indent .Cd "device crypto" .Cd "options GEOM_ELI" .Ed .Pp Alternatively, to load the GEOM_ELI module at boot time, add the following line to your .Xr loader.conf 5 : .Bd -literal -offset indent geom_eli_load="YES" .Ed .Pp Usage of the .Nm utility: .Pp .Nm .Cm init .Op Fl bdgPTv .Op Fl a Ar aalgo .Op Fl B Ar backupfile .Op Fl e Ar ealgo .Op Fl i Ar iterations .Op Fl J Ar newpassfile .Op Fl K Ar newkeyfile .Op Fl l Ar keylen .Op Fl s Ar sectorsize .Op Fl V Ar version .Ar prov .Nm .Cm label - an alias for .Cm init .Nm .Cm attach -.Op Fl dprv +.Op Fl dnprv .Op Fl j Ar passfile .Op Fl k Ar keyfile .Ar prov .Nm .Cm detach .Op Fl fl .Ar prov ... .Nm .Cm stop - an alias for .Cm detach .Nm .Cm onetime .Op Fl dT .Op Fl a Ar aalgo .Op Fl e Ar ealgo .Op Fl l Ar keylen .Op Fl s Ar sectorsize .Ar prov .Nm .Cm configure .Op Fl bBdDgGtT .Ar prov ... .Nm .Cm setkey .Op Fl pPv .Op Fl i Ar iterations .Op Fl j Ar passfile .Op Fl J Ar newpassfile .Op Fl k Ar keyfile .Op Fl K Ar newkeyfile .Op Fl n Ar keyno .Ar prov .Nm .Cm delkey .Op Fl afv .Op Fl n Ar keyno .Ar prov .Nm .Cm kill .Op Fl av .Op Ar prov ... .Nm .Cm backup .Op Fl v .Ar prov .Ar file .Nm .Cm restore .Op Fl fv .Ar file .Ar prov .Nm .Cm suspend .Op Fl v .Fl a | Ar prov ... .Nm .Cm resume .Op Fl pv .Op Fl j Ar passfile .Op Fl k Ar keyfile .Ar prov .Nm .Cm resize .Op Fl v .Fl s Ar oldsize .Ar prov .Nm .Cm version .Op Ar prov ... .Nm .Cm clear .Op Fl v .Ar prov ... .Nm .Cm dump .Op Fl v .Ar prov ... .Nm .Cm list .Nm .Cm status .Nm .Cm load .Nm .Cm unload .Sh DESCRIPTION The .Nm utility is used to configure encryption on GEOM providers. .Pp The following is a list of the most important features: .Pp .Bl -bullet -offset indent -compact .It Utilizes the .Xr crypto 9 framework, so when there is crypto hardware available, .Nm will make use of it automatically. .It Supports many cryptographic algorithms (currently .Nm AES-XTS , .Nm AES-CBC , .Nm Blowfish-CBC , .Nm Camellia-CBC and .Nm 3DES-CBC ) . .It Can optionally perform data authentication (integrity verification) utilizing one of the following algorithms: .Nm HMAC/MD5 , .Nm HMAC/SHA1 , .Nm HMAC/RIPEMD160 , .Nm HMAC/SHA256 , .Nm HMAC/SHA384 or .Nm HMAC/SHA512 . .It Can create a User Key from up to two, piecewise components: a passphrase entered via prompt or read from one or more passfiles; a keyfile read from one or more files. .It Allows encryption of the root partition. The user will be asked for the passphrase before the root file system is mounted. .It Strengthens the passphrase component of the User Key with: .Rs .%A B. Kaliski .%T "PKCS #5: Password-Based Cryptography Specification, Version 2.0." 
.%R RFC .%N 2898 .Re .It Allows the use of two independent User Keys (e.g., a .Qq "user key" and a .Qq "company key" ) . .It It is fast - .Nm performs simple sector-to-sector encryption. .It Allows the encrypted Master Key to be backed up and restored, so that if a user has to quickly destroy key material, it is possible to get the data back by restoring keys from backup. .It Providers can be configured to automatically detach on last close (so users do not have to remember to detach providers after unmounting the file systems). .It Allows attaching a provider with a random, one-time Master Key - useful for swap partitions and temporary file systems. .It Allows verification of data integrity (data authentication). .It Allows suspending and resuming encrypted devices. .El .Pp The first argument to .Nm indicates an action to be performed: .Bl -tag -width ".Cm configure" .It Cm init Initialize the provider which needs to be encrypted. Here you can set up the cryptographic algorithm to use, Data Key length, etc. The last sector of the provider is used to store metadata. The .Cm init subcommand also automatically writes metadata backups to the .Pa /var/backups/<prov>.eli file. The metadata can be recovered with the .Cm restore subcommand described below. .Pp Additional options include: .Bl -tag -width ".Fl J Ar newpassfile" .It Fl a Ar aalgo Enable data integrity verification (authentication) using the given algorithm. This will reduce the size of storage available and also reduce speed. For example, when using a 4096-byte sector and the .Nm HMAC/SHA256 algorithm, 89% of the original provider storage will be available for use. Currently supported algorithms are: .Nm HMAC/MD5 , .Nm HMAC/SHA1 , .Nm HMAC/RIPEMD160 , .Nm HMAC/SHA256 , .Nm HMAC/SHA384 and .Nm HMAC/SHA512 . If the option is not given, there will be no authentication, only encryption. The recommended algorithm is .Nm HMAC/SHA256 . .It Fl b Try to decrypt this partition during boot, before the root partition is mounted. This makes it possible to use an encrypted root partition. One will still need bootable unencrypted storage with a .Pa /boot/ directory, which can be a CD-ROM disc or USB pen-drive that can be removed after boot. .It Fl B Ar backupfile File name to use for metadata backup instead of the default .Pa /var/backups/<prov>.eli . To inhibit backups, you can use .Pa none as the .Ar backupfile . .It Fl d When entering the passphrase to boot from this encrypted root filesystem, echo .Ql * characters. This makes the length of the passphrase visible. .It Fl e Ar ealgo Encryption algorithm to use. Currently supported algorithms are: .Nm AES-XTS , .Nm AES-CBC , .Nm Blowfish-CBC , .Nm Camellia-CBC , .Nm 3DES-CBC , and .Nm NULL . The default and recommended algorithm is .Nm AES-XTS . .Nm NULL is unencrypted. .It Fl g Enable booting from this encrypted root filesystem. The boot loader prompts for the passphrase and loads .Xr loader 8 from the encrypted partition. .It Fl i Ar iterations Number of iterations to use with PKCS#5v2 when processing the User Key passphrase component. If this option is not specified, .Nm will find the number of iterations which is equivalent to about 2 seconds of crypto work. If 0 is given, PKCS#5v2 will not be used. PKCS#5v2 processing is performed once, after all parts of the passphrase component have been read. .It Fl J Ar newpassfile Specifies a file which contains the passphrase component of the User Key (or part of it). If .Ar newpassfile is given as -, standard input will be used.
Only the first line (excluding new-line character) is taken from the given file. This argument can be specified multiple times, which has the effect of reassembling a single passphrase split across multiple files. Cannot be combined with the .Fl P option. .It Fl K Ar newkeyfile Specifies a file which contains the keyfile component of the User Key (or part of it). If .Ar newkeyfile is given as -, standard input will be used. This argument can be specified multiple times, which has the effect of reassembling a single keyfile split across multiple keyfile parts. .It Fl l Ar keylen Data Key length to use with the given cryptographic algorithm. If the length is not specified, the selected algorithm uses its .Em default key length. .Bl -ohang -offset indent .It Nm AES-XTS .Em 128 , 256 .It Nm AES-CBC , Nm Camellia-CBC .Em 128 , 192, 256 .It Nm Blowfish-CBC .Em 128 + n * 32, for n=[0..10] .It Nm 3DES-CBC .Em 192 .El .It Fl P Do not use a passphrase as a component of the User Key. Cannot be combined with the .Fl J option. .It Fl s Ar sectorsize Change decrypted provider's sector size. Increasing the sector size allows increased performance, because encryption/decryption which requires an initialization vector is done per sector; fewer sectors means less computational work. .It Fl T Don't pass through .Dv BIO_DELETE calls (i.e., TRIM/UNMAP). This can prevent an attacker from knowing how much space you're actually using and which sectors contain live data, but will also prevent the backing store (SSD, etc) from reclaiming space you're not using, which may degrade its performance and lifespan. The underlying provider may or may not actually obliterate the deleted sectors when TRIM is enabled, so it should not be considered to add any security. .It Fl V Ar version Metadata version to use. This option is helpful when creating a provider that may be used by older .Nm FreeBSD/GELI versions. Consult the .Sx HISTORY section to find which metadata version is supported by which FreeBSD version. Note that using an older version of metadata may limit the number of features available. .El .It Cm attach Attach the given provider. The encrypted Master Key will be loaded from the metadata and decrypted using the given passphrase/keyfile and a new GEOM provider will be created using the given provider's name with an .Qq .eli suffix. .Pp Additional options include: .Bl -tag -width ".Fl j Ar passfile" .It Fl d If specified, a decrypted provider will be detached automatically on last close. This can help with scarce memory so the user does not have to remember to detach the provider after unmounting the file system. It only works when the provider was opened for writing, so it will not work if the file system on the provider is mounted read-only. Probably a better choice is the .Fl l option for the .Cm detach subcommand. .It Fl j Ar passfile Specifies a file which contains the passphrase component of the User Key (or part of it). For more information see the description of the .Fl J option for the .Cm init subcommand. .It Fl k Ar keyfile Specifies a file which contains the keyfile component of the User Key (or part of it). For more information see the description of the .Fl K option for the .Cm init subcommand. +.It Fl n +Do a dry-run decryption. +This is useful to verify passphrase and keyfile without decrypting the device. .It Fl p Do not use a passphrase as a component of the User Key. Cannot be combined with the .Fl j option. .It Fl r Attach read-only provider. It will not be opened for writing. 
.El .It Cm detach Detach the given providers, which means remove the devfs entry and clear the Master Key and Data Keys from memory. .Pp Additional options include: .Bl -tag -width ".Fl f" .It Fl f Force detach - detach even if the provider is open. .It Fl l Mark provider to detach on last close. If this option is specified, the provider will not be detached while it is open, but will be automatically detached when it is closed for the last time even if it was only opened for reading. .El .It Cm onetime Attach the given providers with a random, one-time (ephemeral) Master Key. The command can be used to encrypt swap partitions or temporary file systems. .Pp Additional options include: .Bl -tag -width ".Fl a Ar sectorsize" .It Fl a Ar aalgo Enable data integrity verification (authentication). For more information, see the description of the .Cm init subcommand. .It Fl e Ar ealgo Encryption algorithm to use. For more information, see the description of the .Cm init subcommand. .It Fl d Detach on last close. Note: this option is not usable for temporary file systems as the provider will be detached after creating the file system on it. It can still (and should) be used for swap partitions. For more information, see the description of the .Cm attach subcommand. .It Fl l Ar keylen Data Key length to use with the given cryptographic algorithm. For more information, see the description of the .Cm init subcommand. .It Fl s Ar sectorsize Change decrypted provider's sector size. For more information, see the description of the .Cm init subcommand. .It Fl T Disable TRIM/UNMAP passthru. For more information, see the description of the .Cm init subcommand. .El .It Cm configure Change configuration of the given providers. .Pp Additional options include: .Bl -tag -width ".Fl b" .It Fl b Set the BOOT flag on the given providers. For more information, see the description of the .Cm init subcommand. .It Fl B Remove the BOOT flag from the given providers. .It Fl d When entering the passphrase to boot from this encrypted root filesystem, echo .Ql * characters. This makes the length of the passphrase visible. .It Fl D Disable echoing of any characters when a passphrase is entered to boot from this encrypted root filesystem. This hides the passphrase length. .It Fl g Enable booting from this encrypted root filesystem. The boot loader prompts for the passphrase and loads .Xr loader 8 from the encrypted partition. .It Fl G Deactivate booting from this encrypted root partition. .It Fl t Enable TRIM/UNMAP passthru. For more information, see the description of the .Cm init subcommand. .It Fl T Disable TRIM/UNMAP passthru. .El .It Cm setkey Install a copy of the Master Key into the selected slot, encrypted with a new User Key. If the selected slot is populated, replace the existing copy. A provider has one Master Key, which can be stored in one or both slots, each encrypted with an independent User Key. With the .Cm init subcommand, only key number 0 is initialized. The User Key can be changed at any time: for an attached provider, for a detached provider, or on the backup file. When a provider is attached, the user does not have to provide an existing passphrase/keyfile. .Pp Additional options include: .Bl -tag -width ".Fl J Ar newpassfile" .It Fl i Ar iterations Number of iterations to use with PKCS#5v2. If 0 is given, PKCS#5v2 will not be used. To be able to use this option with the .Cm setkey subcommand, only one key has to be defined and this key must be changed.
.It Fl j Ar passfile Specifies a file which contains the passphrase component of a current User Key (or part of it). .It Fl J Ar newpassfile Specifies a file which contains the passphrase component of the new User Key (or part of it). .It Fl k Ar keyfile Specifies a file which contains the keyfile component of a current User Key (or part of it). .It Fl K Ar newkeyfile Specifies a file which contains the keyfile component of the new User Key (or part of it). .It Fl n Ar keyno Specifies the index number of the Master Key copy to change (could be 0 or 1). If the provider is attached and no key number is given, the key used for attaching the provider will be changed. If the provider is detached (or we are operating on a backup file) and no key number is given, the first Master Key copy to be successfully decrypted with the provided User Key passphrase/keyfile will be changed. .It Fl p Do not use a passphrase as a component of the current User Key. Cannot be combined with the .Fl j option. .It Fl P Do not use a passphrase as a component of the new User Key. Cannot be combined with the .Fl J option. .El .It Cm delkey Destroy (overwrite with random data) the selected Master Key copy. If one is destroying keys for an attached provider, the provider will not be detached even if all copies of the Master Key are destroyed. It can even be rescued with the .Cm setkey subcommand because the Master Key is still in memory. .Pp Additional options include: .Bl -tag -width ".Fl a Ar keyno" .It Fl a Destroy all copies of the Master Key (does not need .Fl f option). .It Fl f Force key destruction. This option is needed to destroy the last copy of the Master Key. .It Fl n Ar keyno Specifies the index number of the Master Key copy. If the provider is attached and no key number is given, the key used for attaching the provider will be destroyed. If the provider is detached (or we are operating on a backup file), the key number has to be given. .El .It Cm kill This command should be used only in emergency situations. It will destroy all copies of the Master Key on a given provider and will detach it forcibly (if it is attached). This is absolutely a one-way command - if you do not have a metadata backup, your data is gone for good. In case the provider was attached with the .Fl r flag, the keys will not be destroyed, only the provider will be detached. .Pp Additional options include: .Bl -tag -width ".Fl a" .It Fl a If specified, all currently attached providers will be killed. .El .It Cm backup Back up metadata from the given provider to the given file. .It Cm restore Restore metadata from the given file to the given provider. .Pp Additional options include: .Bl -tag -width ".Fl f" .It Fl f Metadata contains the size of the provider to ensure that the correct partition or slice is attached. If an attempt is made to restore metadata to a provider that has a different size, .Nm will refuse to restore the data unless the .Fl f switch is used. If the partition or slice has been grown, the .Cm resize subcommand should be used rather than attempting to relocate the metadata through .Cm backup and .Cm restore . .El .It Cm suspend Suspend a device by waiting for all in-flight requests to finish, clearing all sensitive information (like the Master Key and Data Keys) from kernel memory, and blocking all further I/O requests until the .Cm resume subcommand is executed. This functionality is useful for laptops: when one wants to suspend a laptop, one does not want to leave an encrypted device attached.
Instead of closing all files and directories opened from a file system located on an encrypted device, unmounting the file system, and detaching the device, the .Cm suspend subcommand can be used. Any access to the encrypted device will be blocked until the Master Key is reloaded through the .Cm resume subcommand. Thus there is no need to close or unmount anything. The .Cm suspend subcommand does not work with devices created with the .Cm onetime subcommand. Please note that sensitive data might still be present in memory after suspending an encrypted device due to the file system cache, etc. .Pp Additional options include: .Bl -tag -width ".Fl a" .It Fl a Suspend all .Nm devices. .El .It Cm resume Resume a previously suspended device. The caller must ensure that executing this subcommand does not itself access the suspended device, which would lead to a deadlock. For example, suspending a device which contains the file system where the .Nm utility is stored is a bad idea. .Pp Additional options include: .Bl -tag -width ".Fl j Ar passfile" .It Fl j Ar passfile Specifies a file which contains the passphrase component of the User Key (or part of it). For more information, see the description of the .Fl J option for the .Cm init subcommand. .It Fl k Ar keyfile Specifies a file which contains the keyfile component of the User Key (or part of it). For more information, see the description of the .Fl K option for the .Cm init subcommand. .It Fl p Do not use a passphrase as a component of the User Key. Cannot be combined with the .Fl j option. .El .It Cm resize Inform .Nm that the provider has been resized. The old metadata block is relocated to the correct position at the end of the provider and the provider size is updated. .Pp Additional options include: .Bl -tag -width ".Fl s Ar oldsize" .It Fl s Ar oldsize The size of the provider before it was resized. .El .It Cm version If no arguments are given, the .Cm version subcommand will print the version of the .Nm userland utility as well as the version of the .Nm ELI GEOM class. .Pp If GEOM providers are specified, the .Cm version subcommand will print the metadata version used by each of them. .It Cm clear Clear metadata from the given providers. .Em WARNING : This will erase with zeros the encrypted Master Key copies stored in the metadata. .It Cm dump Dump metadata stored on the given providers. .It Cm list See .Xr geom 8 . .It Cm status See .Xr geom 8 . .It Cm load See .Xr geom 8 . .It Cm unload See .Xr geom 8 . .El .Pp Additional options include: .Bl -tag -width ".Fl v" .It Fl v Be more verbose. .El .Sh KEY SUMMARY .Ss Master Key Upon .Cm init , the .Nm utility generates a random Master Key for the provider. The Master Key never changes during the lifetime of the provider. Each copy of the provider metadata, active or backed up to a file, can store up to two independently encrypted copies of the Master Key. .Ss User Key Each stored copy of the Master Key is encrypted with a User Key, which is generated by the .Nm utility from a passphrase and/or a keyfile. The .Nm utility first reads all parts of the keyfile in the order specified on the command line, then reads all parts of the stored passphrase in the order specified on the command line. If no passphrase parts are specified, the system prompts the user to enter the passphrase. The passphrase is optionally strengthened by PKCS#5v2. The User Key is a digest computed over the concatenated keyfile and passphrase.
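.Pp Because the keyfile and passphrase parts are concatenated in command-line order, multi-part components must be supplied in the same order on every invocation. A minimal sketch (the device and file names are illustrative): .Bd -literal -offset indent # geli init -K key.part0 -K key.part1 /dev/da2 Enter new passphrase: Reenter new passphrase: # geli attach -k key.part0 -k key.part1 /dev/da2 Enter passphrase: .Ed .Pp Listing .Pa key.part1 before .Pa key.part0 at attach time would derive a different User Key, and the attach would fail.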
.Ss Data Key During operation, one or more Data Keys are deterministically derived by the kernel from the Master Key and cached in memory. The number of Data Keys used by a given provider, and the way they are derived, depend on the GELI version and whether the provider is configured to use data authentication. .Sh SYSCTL VARIABLES The following .Xr sysctl 8 variables can be used to control the behavior of the .Nm ELI GEOM class. The default value is shown next to each variable. Some variables can also be set in .Pa /boot/loader.conf . .Bl -tag -width indent .It Va kern.geom.eli.version Version number of the .Nm ELI GEOM class. .It Va kern.geom.eli.debug : No 0 Debug level of the .Nm ELI GEOM class. This can be set to a number between 0 and 3 inclusive. If set to 0, minimal debug information is printed. If set to 3, the maximum amount of debug information is printed. .It Va kern.geom.eli.tries : No 3 Number of times a user is asked for the passphrase. This is only used for providers which are attached on boot (before the root file system is mounted). If set to 0, attaching providers on boot will be disabled. This variable should be set in .Pa /boot/loader.conf . .It Va kern.geom.eli.overwrites : No 5 Specifies how many times the Master Key will be overwritten with random values when it is destroyed. After this operation it is filled with zeros. .It Va kern.geom.eli.visible_passphrase : No 0 If set to 1, the passphrase entered on boot (before the root file system is mounted) will be visible. This alternative should be used with caution, as the entered passphrase can be logged and exposed via .Xr dmesg 8 . This variable should be set in .Pa /boot/loader.conf . .It Va kern.geom.eli.threads : No 0 Specifies how many kernel threads should be used for doing software cryptography. Its purpose is to increase performance on SMP systems. If set to 0, a CPU-pinned thread will be started for every active CPU. .It Va kern.geom.eli.batch : No 0 When set to 1, crypto operations can be sped up by using batching. Batching reduces the number of interrupts by responding to a group of crypto requests with one interrupt. The crypto card and the driver have to support this feature. .It Va kern.geom.eli.key_cache_limit : No 8192 Specifies how many Data Keys to cache. The default limit (8192 keys) will allow caching of all keys for a 4TB provider with 512-byte sectors and will take around 1MB of memory. .It Va kern.geom.eli.key_cache_hits Reports how many times we were looking up a Data Key and it was already in cache. This sysctl is not updated for providers that need fewer Data Keys than the limit specified in .Va kern.geom.eli.key_cache_limit . .It Va kern.geom.eli.key_cache_misses Reports how many times we were looking up a Data Key and it was not in cache. This sysctl is not updated for providers that need fewer Data Keys than the limit specified in .Va kern.geom.eli.key_cache_limit . .El .Sh EXIT STATUS Exit status is 0 on success, and 1 if the command fails. .Sh EXAMPLES Initialize a provider which is going to be encrypted with a passphrase and random data from a file on the user's pen drive. Use a 4kB sector size. Attach the provider, create a file system, and mount it. Do the work.
Unmount the provider and detach it: .Bd -literal -offset indent # dd if=/dev/random of=/mnt/pendrive/da2.key bs=64 count=1 # geli init -s 4096 -K /mnt/pendrive/da2.key /dev/da2 Enter new passphrase: Reenter new passphrase: # geli attach -k /mnt/pendrive/da2.key /dev/da2 Enter passphrase: # dd if=/dev/random of=/dev/da2.eli bs=1m # newfs /dev/da2.eli # mount /dev/da2.eli /mnt/secret \&... # umount /mnt/secret # geli detach da2.eli .Ed .Pp Create an encrypted provider, but use two User Keys: one for your employee and one for you as the company's security officer (so it is not a tragedy if the employee .Qq accidentally forgets his passphrase): .Bd -literal -offset indent # geli init /dev/da2 Enter new passphrase: (enter security officer's passphrase) Reenter new passphrase: # geli setkey -n 1 /dev/da2 Enter passphrase: (enter security officer's passphrase) Enter new passphrase: (let your employee enter his passphrase ...) Reenter new passphrase: (... twice) .Ed .Pp You are the security officer in your company. Create an encrypted provider for use by the user, but remember that users forget their passphrases, so back up the Master Key with your own random key: .Bd -literal -offset indent # dd if=/dev/random of=/mnt/pendrive/keys/`hostname` bs=64 count=1 # geli init -P -K /mnt/pendrive/keys/`hostname` /dev/ada0s1e # geli backup /dev/ada0s1e /mnt/pendrive/backups/`hostname` (use key number 0, so the encrypted Master Key will be re-encrypted by this) # geli setkey -n 0 -k /mnt/pendrive/keys/`hostname` /dev/ada0s1e (allow the user to enter his passphrase) Enter new passphrase: Reenter new passphrase: .Ed .Pp Encrypted swap partition setup: .Bd -literal -offset indent # dd if=/dev/random of=/dev/ada0s1b bs=1m # geli onetime -d -e 3des ada0s1b # swapon /dev/ada0s1b.eli .Ed .Pp The example below shows how to configure two providers which will be attached on boot (before the root file system is mounted). One of them uses a passphrase and three keyfile parts, and the other uses only a keyfile in one part: .Bd -literal -offset indent # dd if=/dev/random of=/dev/da0 bs=1m # dd if=/dev/random of=/boot/keys/da0.key0 bs=32k count=1 # dd if=/dev/random of=/boot/keys/da0.key1 bs=32k count=1 # dd if=/dev/random of=/boot/keys/da0.key2 bs=32k count=1 # geli init -b -K /boot/keys/da0.key0 -K /boot/keys/da0.key1 -K /boot/keys/da0.key2 da0 Enter new passphrase: Reenter new passphrase: # dd if=/dev/random of=/dev/da1s3a bs=1m # dd if=/dev/random of=/boot/keys/da1s3a.key bs=128k count=1 # geli init -b -P -K /boot/keys/da1s3a.key da1s3a .Ed .Pp The providers are initialized; now we have to add these lines to .Pa /boot/loader.conf : .Bd -literal -offset indent geli_da0_keyfile0_load="YES" geli_da0_keyfile0_type="da0:geli_keyfile0" geli_da0_keyfile0_name="/boot/keys/da0.key0" geli_da0_keyfile1_load="YES" geli_da0_keyfile1_type="da0:geli_keyfile1" geli_da0_keyfile1_name="/boot/keys/da0.key1" geli_da0_keyfile2_load="YES" geli_da0_keyfile2_type="da0:geli_keyfile2" geli_da0_keyfile2_name="/boot/keys/da0.key2" geli_da1s3a_keyfile0_load="YES" geli_da1s3a_keyfile0_type="da1s3a:geli_keyfile0" geli_da1s3a_keyfile0_name="/boot/keys/da1s3a.key" .Ed .Pp If there is only one keyfile, the index might be omitted: .Bd -literal -offset indent geli_da1s3a_keyfile_load="YES" geli_da1s3a_keyfile_type="da1s3a:geli_keyfile" geli_da1s3a_keyfile_name="/boot/keys/da1s3a.key" .Ed .Pp Configure not only encryption but also data integrity verification, using .Nm HMAC/SHA256 :
.Bd -literal -offset indent # geli init -a hmac/sha256 -s 4096 /dev/da0 Enter new passphrase: Reenter new passphrase: # geli attach /dev/da0 Enter passphrase: # dd if=/dev/random of=/dev/da0.eli bs=1m # newfs /dev/da0.eli # mount /dev/da0.eli /mnt/secret .Ed .Pp .Cm geli writes the metadata backup by default to the .Pa /var/backups/<prov>.eli file. If the metadata is lost in any way (e.g., by accidental overwrite), it can be restored. Consider the following situation: .Bd -literal -offset indent # geli init /dev/da0 Enter new passphrase: Reenter new passphrase: Metadata backup can be found in /var/backups/da0.eli and can be restored with the following command: # geli restore /var/backups/da0.eli /dev/da0 # geli clear /dev/da0 # geli attach /dev/da0 geli: Cannot read metadata from /dev/da0: Invalid argument. # geli restore /var/backups/da0.eli /dev/da0 # geli attach /dev/da0 Enter passphrase: .Ed .Pp If an encrypted file system is extended, it is necessary to relocate and update the metadata: .Bd -literal -offset indent # gpart create -s GPT ada0 # gpart add -s 1g -t freebsd-ufs -i 1 ada0 # geli init -K keyfile -P ada0p1 # gpart resize -s 2g -i 1 ada0 # geli resize -s 1g ada0p1 # geli attach -k keyfile -p ada0p1 .Ed .Pp Initialize a provider with the passphrase split into two files. The provider can be attached using those two files or by entering .Dq foobar as the passphrase at the .Nm prompt: .Bd -literal -offset indent # echo foo > da0.pass0 # echo bar > da0.pass1 # geli init -J da0.pass0 -J da0.pass1 da0 # geli attach -j da0.pass0 -j da0.pass1 da0 # geli detach da0 # geli attach da0 Enter passphrase: foobar .Ed .Pp Suspend all .Nm devices on a laptop, suspend the laptop, then resume devices one by one after resuming the laptop: .Bd -literal -offset indent # geli suspend -a # zzz # geli resume -p -k keyfile gpt/secret # geli resume gpt/private Enter passphrase: .Ed .Sh ENCRYPTION MODES .Nm supports two encryption modes: .Nm XTS , which was standardized as .Nm IEEE P1619 , and .Nm CBC with unpredictable IV. The .Nm CBC mode used by .Nm is very similar to the .Nm ESSIV mode. .Sh DATA AUTHENTICATION .Nm can verify data integrity when an authentication algorithm is specified. When data corruption/modification is detected, .Nm will not return any data, but instead will return an error .Pq Er EINVAL . The offset and size of the corrupted data will be printed on the console. It is important to know against which attacks .Nm protects your data. If data is modified in-place or copied from one place on the disk to another even without modification, .Nm should be able to detect such a change. If an attacker can remember the encrypted data, he can overwrite any future changes with the data he owns without it being noticed. In other words, .Nm will not protect your data against replay attacks. .Pp It is recommended to write to the whole provider before first use, in order to make sure that all sectors and their corresponding checksums are properly initialized into a consistent state. One can safely ignore data authentication errors that occur immediately after the first time a provider is attached and before it is initialized in this way. .Sh SEE ALSO .Xr crypto 4 , .Xr gbde 4 , .Xr geom 4 , .Xr loader.conf 5 , .Xr gbde 8 , .Xr geom 8 , .Xr crypto 9 .Sh HISTORY The .Nm utility appeared in .Fx 6.0 . Support for the .Nm Camellia block cipher was implemented by Yoshisato Yanagisawa in .Fx 7.0 .
.Pp Highest .Nm GELI metadata version supported by the given FreeBSD version: .Bl -column -offset indent ".Sy FreeBSD" ".Sy version" .It Sy FreeBSD Ta Sy GELI .It Sy version Ta Sy version .Pp .It Li 6.0 Ta 0 .It Li 6.1 Ta 0 .It Li 6.2 Ta 3 .It Li 6.3 Ta 3 .It Li 6.4 Ta 3 .Pp .It Li 7.0 Ta 3 .It Li 7.1 Ta 3 .It Li 7.2 Ta 3 .It Li 7.3 Ta 3 .It Li 7.4 Ta 3 .Pp .It Li 8.0 Ta 3 .It Li 8.1 Ta 3 .It Li 8.2 Ta 5 .Pp .It Li 9.0 Ta 6 .Pp .It Li 10.0 Ta 7 .El .Sh AUTHORS .An Pawel Jakub Dawidek Aq Mt pjd@FreeBSD.org Index: user/markj/netdump/sbin/geom/class/eli/geom_eli.c =================================================================== --- user/markj/netdump/sbin/geom/class/eli/geom_eli.c (revision 332407) +++ user/markj/netdump/sbin/geom/class/eli/geom_eli.c (revision 332408) @@ -1,1767 +1,1768 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004-2010 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "core/geom.h" #include "misc/subr.h" uint32_t lib_version = G_LIB_VERSION; uint32_t version = G_ELI_VERSION; #define GELI_BACKUP_DIR "/var/backups/" #define GELI_ENC_ALGO "aes" static void eli_main(struct gctl_req *req, unsigned flags); static void eli_init(struct gctl_req *req); static void eli_attach(struct gctl_req *req); static void eli_configure(struct gctl_req *req); static void eli_setkey(struct gctl_req *req); static void eli_delkey(struct gctl_req *req); static void eli_resume(struct gctl_req *req); static void eli_kill(struct gctl_req *req); static void eli_backup(struct gctl_req *req); static void eli_restore(struct gctl_req *req); static void eli_resize(struct gctl_req *req); static void eli_version(struct gctl_req *req); static void eli_clear(struct gctl_req *req); static void eli_dump(struct gctl_req *req); static int eli_backup_create(struct gctl_req *req, const char *prov, const char *file); /* * Available commands: * * init [-bdgPTv] [-a aalgo] [-B backupfile] [-e ealgo] [-i iterations] [-l keylen] [-J newpassfile] [-K newkeyfile] [-s sectorsize] [-V version] prov * label - alias for 'init' * attach [-dprv] [-j passfile] [-k keyfile] prov * detach [-fl] prov ... * stop - alias for 'detach' * onetime [-d] [-a aalgo] [-e ealgo] [-l keylen] prov * configure [-bBgGtT] prov ... * setkey [-pPv] [-n keyno] [-j passfile] [-J newpassfile] [-k keyfile] [-K newkeyfile] prov * delkey [-afv] [-n keyno] prov * suspend [-v] -a | prov ... * resume [-pv] [-j passfile] [-k keyfile] prov * kill [-av] [prov ...] * backup [-v] prov file * restore [-fv] file prov * resize [-v] -s oldsize prov * version [prov ...] * clear [-v] prov ... * dump [-v] prov ... 
*/ struct g_command class_commands[] = { { "init", G_FLAG_VERBOSE, eli_main, { { 'a', "aalgo", "", G_TYPE_STRING }, { 'b', "boot", NULL, G_TYPE_BOOL }, { 'B', "backupfile", "", G_TYPE_STRING }, { 'd', "displaypass", NULL, G_TYPE_BOOL }, { 'e', "ealgo", "", G_TYPE_STRING }, { 'g', "geliboot", NULL, G_TYPE_BOOL }, { 'i', "iterations", "-1", G_TYPE_NUMBER }, { 'J', "newpassfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'K', "newkeyfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'l', "keylen", "0", G_TYPE_NUMBER }, { 'P', "nonewpassphrase", NULL, G_TYPE_BOOL }, { 's', "sectorsize", "0", G_TYPE_NUMBER }, { 'T', "notrim", NULL, G_TYPE_BOOL }, { 'V', "mdversion", "-1", G_TYPE_NUMBER }, G_OPT_SENTINEL }, "[-bdgPTv] [-a aalgo] [-B backupfile] [-e ealgo] [-i iterations] [-l keylen] [-J newpassfile] [-K newkeyfile] [-s sectorsize] [-V version] prov" }, { "label", G_FLAG_VERBOSE, eli_main, { { 'a', "aalgo", "", G_TYPE_STRING }, { 'b', "boot", NULL, G_TYPE_BOOL }, { 'B', "backupfile", "", G_TYPE_STRING }, { 'd', "displaypass", NULL, G_TYPE_BOOL }, { 'e', "ealgo", "", G_TYPE_STRING }, { 'g', "geliboot", NULL, G_TYPE_BOOL }, { 'i', "iterations", "-1", G_TYPE_NUMBER }, { 'J', "newpassfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'K', "newkeyfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'l', "keylen", "0", G_TYPE_NUMBER }, { 'P', "nonewpassphrase", NULL, G_TYPE_BOOL }, { 's', "sectorsize", "0", G_TYPE_NUMBER }, { 'V', "mdversion", "-1", G_TYPE_NUMBER }, G_OPT_SENTINEL }, "- an alias for 'init'" }, { "attach", G_FLAG_VERBOSE | G_FLAG_LOADKLD, eli_main, { { 'd', "detach", NULL, G_TYPE_BOOL }, { 'j', "passfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'k', "keyfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, + { 'n', "dryrun", NULL, G_TYPE_BOOL }, { 'p', "nopassphrase", NULL, G_TYPE_BOOL }, { 'r', "readonly", NULL, G_TYPE_BOOL }, G_OPT_SENTINEL }, - "[-dprv] [-j passfile] [-k keyfile] prov" + "[-dnprv] [-j passfile] [-k keyfile] prov" }, { "detach", 0, NULL, { { 'f', "force", NULL, G_TYPE_BOOL }, { 'l', "last", NULL, G_TYPE_BOOL }, G_OPT_SENTINEL }, "[-fl] prov ..." }, { "stop", 0, NULL, { { 'f', "force", NULL, G_TYPE_BOOL }, { 'l', "last", NULL, G_TYPE_BOOL }, G_OPT_SENTINEL }, "- an alias for 'detach'" }, { "onetime", G_FLAG_VERBOSE | G_FLAG_LOADKLD, NULL, { { 'a', "aalgo", "", G_TYPE_STRING }, { 'd', "detach", NULL, G_TYPE_BOOL }, { 'e', "ealgo", GELI_ENC_ALGO, G_TYPE_STRING }, { 'l', "keylen", "0", G_TYPE_NUMBER }, { 's', "sectorsize", "0", G_TYPE_NUMBER }, { 'T', "notrim", NULL, G_TYPE_BOOL }, G_OPT_SENTINEL }, "[-dT] [-a aalgo] [-e ealgo] [-l keylen] [-s sectorsize] prov" }, { "configure", G_FLAG_VERBOSE, eli_main, { { 'b', "boot", NULL, G_TYPE_BOOL }, { 'B', "noboot", NULL, G_TYPE_BOOL }, { 'd', "displaypass", NULL, G_TYPE_BOOL }, { 'D', "nodisplaypass", NULL, G_TYPE_BOOL }, { 'g', "geliboot", NULL, G_TYPE_BOOL }, { 'G', "nogeliboot", NULL, G_TYPE_BOOL }, { 't', "trim", NULL, G_TYPE_BOOL }, { 'T', "notrim", NULL, G_TYPE_BOOL }, G_OPT_SENTINEL }, "[-bBdDgGtT] prov ..." 
}, { "setkey", G_FLAG_VERBOSE, eli_main, { { 'i', "iterations", "-1", G_TYPE_NUMBER }, { 'j', "passfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'J', "newpassfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'k', "keyfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'K', "newkeyfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'n', "keyno", "-1", G_TYPE_NUMBER }, { 'p', "nopassphrase", NULL, G_TYPE_BOOL }, { 'P', "nonewpassphrase", NULL, G_TYPE_BOOL }, G_OPT_SENTINEL }, "[-pPv] [-n keyno] [-i iterations] [-j passfile] [-J newpassfile] [-k keyfile] [-K newkeyfile] prov" }, { "delkey", G_FLAG_VERBOSE, eli_main, { { 'a', "all", NULL, G_TYPE_BOOL }, { 'f', "force", NULL, G_TYPE_BOOL }, { 'n', "keyno", "-1", G_TYPE_NUMBER }, G_OPT_SENTINEL }, "[-afv] [-n keyno] prov" }, { "suspend", G_FLAG_VERBOSE, NULL, { { 'a', "all", NULL, G_TYPE_BOOL }, G_OPT_SENTINEL }, "[-v] -a | prov ..." }, { "resume", G_FLAG_VERBOSE, eli_main, { { 'j', "passfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'k', "keyfile", G_VAL_OPTIONAL, G_TYPE_STRING | G_TYPE_MULTI }, { 'p', "nopassphrase", NULL, G_TYPE_BOOL }, G_OPT_SENTINEL }, "[-pv] [-j passfile] [-k keyfile] prov" }, { "kill", G_FLAG_VERBOSE, eli_main, { { 'a', "all", NULL, G_TYPE_BOOL }, G_OPT_SENTINEL }, "[-av] [prov ...]" }, { "backup", G_FLAG_VERBOSE, eli_main, G_NULL_OPTS, "[-v] prov file" }, { "restore", G_FLAG_VERBOSE, eli_main, { { 'f', "force", NULL, G_TYPE_BOOL }, G_OPT_SENTINEL }, "[-fv] file prov" }, { "resize", G_FLAG_VERBOSE, eli_main, { { 's', "oldsize", NULL, G_TYPE_NUMBER }, G_OPT_SENTINEL }, "[-v] -s oldsize prov" }, { "version", G_FLAG_LOADKLD, eli_main, G_NULL_OPTS, "[prov ...]" }, { "clear", G_FLAG_VERBOSE, eli_main, G_NULL_OPTS, "[-v] prov ..." }, { "dump", G_FLAG_VERBOSE, eli_main, G_NULL_OPTS, "[-v] prov ..." }, G_CMD_SENTINEL }; static int verbose = 0; #define BUFSIZE 1024 static int eli_protect(struct gctl_req *req) { struct rlimit rl; /* Disable core dumps. */ rl.rlim_cur = 0; rl.rlim_max = 0; if (setrlimit(RLIMIT_CORE, &rl) == -1) { gctl_error(req, "Cannot disable core dumps: %s.", strerror(errno)); return (-1); } /* Disable swapping. */ if (mlockall(MCL_FUTURE) == -1) { gctl_error(req, "Cannot lock memory: %s.", strerror(errno)); return (-1); } return (0); } static void eli_main(struct gctl_req *req, unsigned int flags) { const char *name; if (eli_protect(req) == -1) return; if ((flags & G_FLAG_VERBOSE) != 0) verbose = 1; name = gctl_get_ascii(req, "verb"); if (name == NULL) { gctl_error(req, "No '%s' argument.", "verb"); return; } if (strcmp(name, "init") == 0 || strcmp(name, "label") == 0) eli_init(req); else if (strcmp(name, "attach") == 0) eli_attach(req); else if (strcmp(name, "configure") == 0) eli_configure(req); else if (strcmp(name, "setkey") == 0) eli_setkey(req); else if (strcmp(name, "delkey") == 0) eli_delkey(req); else if (strcmp(name, "resume") == 0) eli_resume(req); else if (strcmp(name, "kill") == 0) eli_kill(req); else if (strcmp(name, "backup") == 0) eli_backup(req); else if (strcmp(name, "restore") == 0) eli_restore(req); else if (strcmp(name, "resize") == 0) eli_resize(req); else if (strcmp(name, "version") == 0) eli_version(req); else if (strcmp(name, "dump") == 0) eli_dump(req); else if (strcmp(name, "clear") == 0) eli_clear(req); else gctl_error(req, "Unknown command: %s.", name); } static bool eli_is_attached(const char *prov) { char name[MAXPATHLEN]; /* * Not the best way to do it, but the easiest. 
* We try to open provider and check if it is a GEOM provider * by asking about its sectorsize. */ snprintf(name, sizeof(name), "%s%s", prov, G_ELI_SUFFIX); return (g_get_sectorsize(name) > 0); } static int eli_genkey_files(struct gctl_req *req, bool new, const char *type, struct hmac_ctx *ctxp, char *passbuf, size_t passbufsize) { char *p, buf[BUFSIZE], argname[16]; const char *file; int error, fd, i; ssize_t done; assert((strcmp(type, "keyfile") == 0 && ctxp != NULL && passbuf == NULL && passbufsize == 0) || (strcmp(type, "passfile") == 0 && ctxp == NULL && passbuf != NULL && passbufsize > 0)); assert(strcmp(type, "keyfile") == 0 || passbuf[0] == '\0'); for (i = 0; ; i++) { snprintf(argname, sizeof(argname), "%s%s%d", new ? "new" : "", type, i); /* No more {key,pass}files? */ if (!gctl_has_param(req, argname)) return (i); file = gctl_get_ascii(req, "%s", argname); assert(file != NULL); if (strcmp(file, "-") == 0) fd = STDIN_FILENO; else { fd = open(file, O_RDONLY); if (fd == -1) { gctl_error(req, "Cannot open %s %s: %s.", type, file, strerror(errno)); return (-1); } } if (strcmp(type, "keyfile") == 0) { while ((done = read(fd, buf, sizeof(buf))) > 0) g_eli_crypto_hmac_update(ctxp, buf, done); } else /* if (strcmp(type, "passfile") == 0) */ { assert(strcmp(type, "passfile") == 0); while ((done = read(fd, buf, sizeof(buf) - 1)) > 0) { buf[done] = '\0'; p = strchr(buf, '\n'); if (p != NULL) { *p = '\0'; done = p - buf; } if (strlcat(passbuf, buf, passbufsize) >= passbufsize) { gctl_error(req, "Passphrase in %s too long.", file); bzero(buf, sizeof(buf)); return (-1); } if (p != NULL) break; } } error = errno; if (strcmp(file, "-") != 0) close(fd); bzero(buf, sizeof(buf)); if (done == -1) { gctl_error(req, "Cannot read %s %s: %s.", type, file, strerror(error)); return (-1); } } /* NOTREACHED */ } static int eli_genkey_passphrase_prompt(struct gctl_req *req, bool new, char *passbuf, size_t passbufsize) { char *p; for (;;) { p = readpassphrase( new ? "Enter new passphrase: " : "Enter passphrase: ", passbuf, passbufsize, RPP_ECHO_OFF | RPP_REQUIRE_TTY); if (p == NULL) { bzero(passbuf, passbufsize); gctl_error(req, "Cannot read passphrase: %s.", strerror(errno)); return (-1); } if (new) { char tmpbuf[BUFSIZE]; p = readpassphrase("Reenter new passphrase: ", tmpbuf, sizeof(tmpbuf), RPP_ECHO_OFF | RPP_REQUIRE_TTY); if (p == NULL) { bzero(passbuf, passbufsize); gctl_error(req, "Cannot read passphrase: %s.", strerror(errno)); return (-1); } if (strcmp(passbuf, tmpbuf) != 0) { bzero(passbuf, passbufsize); fprintf(stderr, "They didn't match.\n"); continue; } bzero(tmpbuf, sizeof(tmpbuf)); } return (0); } /* NOTREACHED */ } static int eli_genkey_passphrase(struct gctl_req *req, struct g_eli_metadata *md, bool new, struct hmac_ctx *ctxp) { char passbuf[BUFSIZE]; bool nopassphrase; int nfiles; nopassphrase = gctl_get_int(req, new ? "nonewpassphrase" : "nopassphrase"); if (nopassphrase) { if (gctl_has_param(req, new ? "newpassfile0" : "passfile0")) { gctl_error(req, "Options -%c and -%c are mutually exclusive.", new ? 'J' : 'j', new ? 'P' : 'p'); return (-1); } return (0); } if (!new && md->md_iterations == -1) { gctl_error(req, "Missing -p flag."); return (-1); } passbuf[0] = '\0'; nfiles = eli_genkey_files(req, new, "passfile", NULL, passbuf, sizeof(passbuf)); if (nfiles == -1) return (-1); else if (nfiles == 0) { if (eli_genkey_passphrase_prompt(req, new, passbuf, sizeof(passbuf)) == -1) { return (-1); } } /* * Field md_iterations equal to -1 means "choose some sane * value for me". 
*/ if (md->md_iterations == -1) { assert(new); if (verbose) printf("Calculating number of iterations...\n"); md->md_iterations = pkcs5v2_calculate(2000000); assert(md->md_iterations > 0); if (verbose) { printf("Done, using %d iterations.\n", md->md_iterations); } } /* * If md_iterations is equal to 0, user doesn't want PKCS#5v2. */ if (md->md_iterations == 0) { g_eli_crypto_hmac_update(ctxp, md->md_salt, sizeof(md->md_salt)); g_eli_crypto_hmac_update(ctxp, passbuf, strlen(passbuf)); } else /* if (md->md_iterations > 0) */ { unsigned char dkey[G_ELI_USERKEYLEN]; pkcs5v2_genkey(dkey, sizeof(dkey), md->md_salt, sizeof(md->md_salt), passbuf, md->md_iterations); g_eli_crypto_hmac_update(ctxp, dkey, sizeof(dkey)); bzero(dkey, sizeof(dkey)); } bzero(passbuf, sizeof(passbuf)); return (0); } static unsigned char * eli_genkey(struct gctl_req *req, struct g_eli_metadata *md, unsigned char *key, bool new) { struct hmac_ctx ctx; bool nopassphrase; int nfiles; nopassphrase = gctl_get_int(req, new ? "nonewpassphrase" : "nopassphrase"); g_eli_crypto_hmac_init(&ctx, NULL, 0); nfiles = eli_genkey_files(req, new, "keyfile", &ctx, NULL, 0); if (nfiles == -1) return (NULL); else if (nfiles == 0 && nopassphrase) { gctl_error(req, "No key components given."); return (NULL); } if (eli_genkey_passphrase(req, md, new, &ctx) == -1) return (NULL); g_eli_crypto_hmac_final(&ctx, key, 0); return (key); } static int eli_metadata_read(struct gctl_req *req, const char *prov, struct g_eli_metadata *md) { unsigned char sector[sizeof(struct g_eli_metadata)]; int error; if (g_get_sectorsize(prov) == 0) { int fd; /* This is a file probably. */ fd = open(prov, O_RDONLY); if (fd == -1) { gctl_error(req, "Cannot open %s: %s.", prov, strerror(errno)); return (-1); } if (read(fd, sector, sizeof(sector)) != sizeof(sector)) { gctl_error(req, "Cannot read metadata from %s: %s.", prov, strerror(errno)); close(fd); return (-1); } close(fd); } else { /* This is a GEOM provider. */ error = g_metadata_read(prov, sector, sizeof(sector), G_ELI_MAGIC); if (error != 0) { gctl_error(req, "Cannot read metadata from %s: %s.", prov, strerror(error)); return (-1); } } error = eli_metadata_decode(sector, md); switch (error) { case 0: break; case EOPNOTSUPP: gctl_error(req, "Provider's %s metadata version %u is too new.\n" "geli: The highest supported version is %u.", prov, (unsigned int)md->md_version, G_ELI_VERSION); return (-1); case EINVAL: gctl_error(req, "Inconsistent provider's %s metadata.", prov); return (-1); default: gctl_error(req, "Unexpected error while decoding provider's %s metadata: %s.", prov, strerror(error)); return (-1); } return (0); } static int eli_metadata_store(struct gctl_req *req, const char *prov, struct g_eli_metadata *md) { unsigned char sector[sizeof(struct g_eli_metadata)]; int error; eli_metadata_encode(md, sector); if (g_get_sectorsize(prov) == 0) { int fd; /* This is a file probably. */ fd = open(prov, O_WRONLY | O_TRUNC); if (fd == -1) { gctl_error(req, "Cannot open %s: %s.", prov, strerror(errno)); bzero(sector, sizeof(sector)); return (-1); } if (write(fd, sector, sizeof(sector)) != sizeof(sector)) { gctl_error(req, "Cannot write metadata to %s: %s.", prov, strerror(errno)); bzero(sector, sizeof(sector)); close(fd); return (-1); } close(fd); } else { /* This is a GEOM provider. 
*/ error = g_metadata_store(prov, sector, sizeof(sector)); if (error != 0) { gctl_error(req, "Cannot write metadata to %s: %s.", prov, strerror(errno)); bzero(sector, sizeof(sector)); return (-1); } } bzero(sector, sizeof(sector)); return (0); } static void eli_init(struct gctl_req *req) { struct g_eli_metadata md; unsigned char sector[sizeof(struct g_eli_metadata)] __aligned(4); unsigned char key[G_ELI_USERKEYLEN]; char backfile[MAXPATHLEN]; const char *str, *prov; unsigned int secsize, version; off_t mediasize; intmax_t val; int error, nargs; nargs = gctl_get_int(req, "nargs"); if (nargs != 1) { gctl_error(req, "Invalid number of arguments."); return; } prov = gctl_get_ascii(req, "arg0"); mediasize = g_get_mediasize(prov); secsize = g_get_sectorsize(prov); if (mediasize == 0 || secsize == 0) { gctl_error(req, "Cannot get informations about %s: %s.", prov, strerror(errno)); return; } bzero(&md, sizeof(md)); strlcpy(md.md_magic, G_ELI_MAGIC, sizeof(md.md_magic)); val = gctl_get_intmax(req, "mdversion"); if (val == -1) { version = G_ELI_VERSION; } else if (val < 0 || val > G_ELI_VERSION) { gctl_error(req, "Invalid version specified should be between %u and %u.", G_ELI_VERSION_00, G_ELI_VERSION); return; } else { version = val; } md.md_version = version; md.md_flags = 0; if (gctl_get_int(req, "boot")) md.md_flags |= G_ELI_FLAG_BOOT; if (gctl_get_int(req, "geliboot")) md.md_flags |= G_ELI_FLAG_GELIBOOT; if (gctl_get_int(req, "displaypass")) md.md_flags |= G_ELI_FLAG_GELIDISPLAYPASS; if (gctl_get_int(req, "notrim")) md.md_flags |= G_ELI_FLAG_NODELETE; md.md_ealgo = CRYPTO_ALGORITHM_MIN - 1; str = gctl_get_ascii(req, "aalgo"); if (*str != '\0') { if (version < G_ELI_VERSION_01) { gctl_error(req, "Data authentication is supported starting from version %u.", G_ELI_VERSION_01); return; } md.md_aalgo = g_eli_str2aalgo(str); if (md.md_aalgo >= CRYPTO_ALGORITHM_MIN && md.md_aalgo <= CRYPTO_ALGORITHM_MAX) { md.md_flags |= G_ELI_FLAG_AUTH; } else { /* * For backward compatibility, check if the -a option * was used to provide encryption algorithm. */ md.md_ealgo = g_eli_str2ealgo(str); if (md.md_ealgo < CRYPTO_ALGORITHM_MIN || md.md_ealgo > CRYPTO_ALGORITHM_MAX) { gctl_error(req, "Invalid authentication algorithm."); return; } else { fprintf(stderr, "warning: The -e option, not " "the -a option is now used to specify " "encryption algorithm to use.\n"); } } } if (md.md_ealgo < CRYPTO_ALGORITHM_MIN || md.md_ealgo > CRYPTO_ALGORITHM_MAX) { str = gctl_get_ascii(req, "ealgo"); if (*str == '\0') { if (version < G_ELI_VERSION_05) str = "aes-cbc"; else str = GELI_ENC_ALGO; } md.md_ealgo = g_eli_str2ealgo(str); if (md.md_ealgo < CRYPTO_ALGORITHM_MIN || md.md_ealgo > CRYPTO_ALGORITHM_MAX) { gctl_error(req, "Invalid encryption algorithm."); return; } if (md.md_ealgo == CRYPTO_CAMELLIA_CBC && version < G_ELI_VERSION_04) { gctl_error(req, "Camellia-CBC algorithm is supported starting from version %u.", G_ELI_VERSION_04); return; } if (md.md_ealgo == CRYPTO_AES_XTS && version < G_ELI_VERSION_05) { gctl_error(req, "AES-XTS algorithm is supported starting from version %u.", G_ELI_VERSION_05); return; } } val = gctl_get_intmax(req, "keylen"); md.md_keylen = val; md.md_keylen = g_eli_keylen(md.md_ealgo, md.md_keylen); if (md.md_keylen == 0) { gctl_error(req, "Invalid key length."); return; } md.md_provsize = mediasize; val = gctl_get_intmax(req, "iterations"); if (val != -1) { int nonewpassphrase; /* * Don't allow to set iterations when there will be no * passphrase. 
*/ nonewpassphrase = gctl_get_int(req, "nonewpassphrase"); if (nonewpassphrase) { gctl_error(req, "Options -i and -P are mutually exclusive."); return; } } md.md_iterations = val; val = gctl_get_intmax(req, "sectorsize"); if (val == 0) md.md_sectorsize = secsize; else { if (val < 0 || (val % secsize) != 0 || !powerof2(val)) { gctl_error(req, "Invalid sector size."); return; } if (val > sysconf(_SC_PAGE_SIZE)) { fprintf(stderr, "warning: Using sectorsize bigger than the page size!\n"); } md.md_sectorsize = val; } md.md_keys = 0x01; arc4random_buf(md.md_salt, sizeof(md.md_salt)); arc4random_buf(md.md_mkeys, sizeof(md.md_mkeys)); /* Generate user key. */ if (eli_genkey(req, &md, key, true) == NULL) { bzero(key, sizeof(key)); bzero(&md, sizeof(md)); return; } /* Encrypt the first and the only Master Key. */ error = g_eli_mkey_encrypt(md.md_ealgo, key, md.md_keylen, md.md_mkeys); bzero(key, sizeof(key)); if (error != 0) { bzero(&md, sizeof(md)); gctl_error(req, "Cannot encrypt Master Key: %s.", strerror(error)); return; } eli_metadata_encode(&md, sector); bzero(&md, sizeof(md)); error = g_metadata_store(prov, sector, sizeof(sector)); bzero(sector, sizeof(sector)); if (error != 0) { gctl_error(req, "Cannot store metadata on %s: %s.", prov, strerror(error)); return; } if (verbose) printf("Metadata value stored on %s.\n", prov); /* Backup metadata to a file. */ str = gctl_get_ascii(req, "backupfile"); if (str[0] != '\0') { /* Backup file given by the user, just copy it. */ strlcpy(backfile, str, sizeof(backfile)); } else { /* Generate file name automatically. */ const char *p = prov; unsigned int i; if (strncmp(p, _PATH_DEV, sizeof(_PATH_DEV) - 1) == 0) p += sizeof(_PATH_DEV) - 1; snprintf(backfile, sizeof(backfile), "%s%s.eli", GELI_BACKUP_DIR, p); /* Replace all / with _.
*/ for (i = strlen(GELI_BACKUP_DIR); backfile[i] != '\0'; i++) { if (backfile[i] == '/') backfile[i] = '_'; } } if (strcmp(backfile, "none") != 0 && eli_backup_create(req, prov, backfile) == 0) { printf("\nMetadata backup can be found in %s and\n", backfile); printf("can be restored with the following command:\n"); printf("\n\t# geli restore %s %s\n\n", backfile, prov); } } static void eli_attach(struct gctl_req *req) { struct g_eli_metadata md; unsigned char key[G_ELI_USERKEYLEN]; const char *prov; off_t mediasize; int nargs; nargs = gctl_get_int(req, "nargs"); if (nargs != 1) { gctl_error(req, "Invalid number of arguments."); return; } prov = gctl_get_ascii(req, "arg0"); if (eli_metadata_read(req, prov, &md) == -1) return; mediasize = g_get_mediasize(prov); if (md.md_provsize != (uint64_t)mediasize) { gctl_error(req, "Provider size mismatch."); return; } if (eli_genkey(req, &md, key, false) == NULL) { bzero(key, sizeof(key)); return; } gctl_ro_param(req, "key", sizeof(key), key); if (gctl_issue(req) == NULL) { if (verbose) printf("Attached to %s.\n", prov); } bzero(key, sizeof(key)); } static void eli_configure_detached(struct gctl_req *req, const char *prov, int boot, int geliboot, int displaypass, int trim) { struct g_eli_metadata md; bool changed = 0; if (eli_metadata_read(req, prov, &md) == -1) return; if (boot == 1 && (md.md_flags & G_ELI_FLAG_BOOT)) { if (verbose) printf("BOOT flag already configured for %s.\n", prov); } else if (boot == 0 && !(md.md_flags & G_ELI_FLAG_BOOT)) { if (verbose) printf("BOOT flag not configured for %s.\n", prov); } else if (boot >= 0) { if (boot) md.md_flags |= G_ELI_FLAG_BOOT; else md.md_flags &= ~G_ELI_FLAG_BOOT; changed = 1; } if (geliboot == 1 && (md.md_flags & G_ELI_FLAG_GELIBOOT)) { if (verbose) printf("GELIBOOT flag already configured for %s.\n", prov); } else if (geliboot == 0 && !(md.md_flags & G_ELI_FLAG_GELIBOOT)) { if (verbose) printf("GELIBOOT flag not configured for %s.\n", prov); } else if (geliboot >= 0) { if (geliboot) md.md_flags |= G_ELI_FLAG_GELIBOOT; else md.md_flags &= ~G_ELI_FLAG_GELIBOOT; changed = 1; } if (displaypass == 1 && (md.md_flags & G_ELI_FLAG_GELIDISPLAYPASS)) { if (verbose) printf("GELIDISPLAYPASS flag already configured for %s.\n", prov); } else if (displaypass == 0 && !(md.md_flags & G_ELI_FLAG_GELIDISPLAYPASS)) { if (verbose) printf("GELIDISPLAYPASS flag not configured for %s.\n", prov); } else if (displaypass >= 0) { if (displaypass) md.md_flags |= G_ELI_FLAG_GELIDISPLAYPASS; else md.md_flags &= ~G_ELI_FLAG_GELIDISPLAYPASS; changed = 1; } if (trim == 0 && (md.md_flags & G_ELI_FLAG_NODELETE)) { if (verbose) printf("TRIM disable flag already configured for %s.\n", prov); } else if (trim == 1 && !(md.md_flags & G_ELI_FLAG_NODELETE)) { if (verbose) printf("TRIM disable flag not configured for %s.\n", prov); } else if (trim >= 0) { if (trim) md.md_flags &= ~G_ELI_FLAG_NODELETE; else md.md_flags |= G_ELI_FLAG_NODELETE; changed = 1; } if (changed) eli_metadata_store(req, prov, &md); bzero(&md, sizeof(md)); } static void eli_configure(struct gctl_req *req) { const char *prov; bool boot, noboot, geliboot, nogeliboot, displaypass, nodisplaypass; bool trim, notrim; int doboot, dogeliboot, dodisplaypass, dotrim; int i, nargs; nargs = gctl_get_int(req, "nargs"); if (nargs == 0) { gctl_error(req, "Too few arguments."); return; } boot = gctl_get_int(req, "boot"); noboot = gctl_get_int(req, "noboot"); geliboot = gctl_get_int(req, "geliboot"); nogeliboot = gctl_get_int(req, "nogeliboot"); displaypass = gctl_get_int(req, 
"displaypass"); nodisplaypass = gctl_get_int(req, "nodisplaypass"); trim = gctl_get_int(req, "trim"); notrim = gctl_get_int(req, "notrim"); doboot = -1; if (boot && noboot) { gctl_error(req, "Options -b and -B are mutually exclusive."); return; } if (boot) doboot = 1; else if (noboot) doboot = 0; dogeliboot = -1; if (geliboot && nogeliboot) { gctl_error(req, "Options -g and -G are mutually exclusive."); return; } if (geliboot) dogeliboot = 1; else if (nogeliboot) dogeliboot = 0; dodisplaypass = -1; if (displaypass && nodisplaypass) { gctl_error(req, "Options -d and -D are mutually exclusive."); return; } if (displaypass) dodisplaypass = 1; else if (nodisplaypass) dodisplaypass = 0; dotrim = -1; if (trim && notrim) { gctl_error(req, "Options -t and -T are mutually exclusive."); return; } if (trim) dotrim = 1; else if (notrim) dotrim = 0; if (doboot == -1 && dogeliboot == -1 && dodisplaypass == -1 && dotrim == -1) { gctl_error(req, "No option given."); return; } /* First attached providers. */ gctl_issue(req); /* Now the rest. */ for (i = 0; i < nargs; i++) { prov = gctl_get_ascii(req, "arg%d", i); if (!eli_is_attached(prov)) { eli_configure_detached(req, prov, doboot, dogeliboot, dodisplaypass, dotrim); } } } static void eli_setkey_attached(struct gctl_req *req, struct g_eli_metadata *md) { unsigned char key[G_ELI_USERKEYLEN]; intmax_t val, old = 0; int error; val = gctl_get_intmax(req, "iterations"); /* Check if iterations number should be changed. */ if (val != -1) md->md_iterations = val; else old = md->md_iterations; /* Generate key for Master Key encryption. */ if (eli_genkey(req, md, key, true) == NULL) { bzero(key, sizeof(key)); return; } /* * If number of iterations has changed, but wasn't given as a * command-line argument, update the request. */ if (val == -1 && md->md_iterations != old) { error = gctl_change_param(req, "iterations", sizeof(intmax_t), &md->md_iterations); assert(error == 0); } gctl_ro_param(req, "key", sizeof(key), key); gctl_issue(req); bzero(key, sizeof(key)); } static void eli_setkey_detached(struct gctl_req *req, const char *prov, struct g_eli_metadata *md) { unsigned char key[G_ELI_USERKEYLEN], mkey[G_ELI_DATAIVKEYLEN]; unsigned char *mkeydst; unsigned int nkey; intmax_t val; int error; if (md->md_keys == 0) { gctl_error(req, "No valid keys on %s.", prov); return; } /* Generate key for Master Key decryption. */ if (eli_genkey(req, md, key, false) == NULL) { bzero(key, sizeof(key)); return; } /* Decrypt Master Key. */ error = g_eli_mkey_decrypt(md, key, mkey, &nkey); bzero(key, sizeof(key)); if (error != 0) { bzero(md, sizeof(*md)); if (error == -1) gctl_error(req, "Wrong key for %s.", prov); else /* if (error > 0) */ { gctl_error(req, "Cannot decrypt Master Key: %s.", strerror(error)); } return; } if (verbose) printf("Decrypted Master Key %u.\n", nkey); val = gctl_get_intmax(req, "keyno"); if (val != -1) nkey = val; #if 0 else ; /* Use the key number which was found during decryption. */ #endif if (nkey >= G_ELI_MAXMKEYS) { gctl_error(req, "Invalid '%s' argument.", "keyno"); return; } val = gctl_get_intmax(req, "iterations"); /* Check if iterations number should and can be changed. 
*/ if (val != -1 && md->md_iterations == -1) { md->md_iterations = val; } else if (val != -1 && val != md->md_iterations) { if (bitcount32(md->md_keys) != 1) { gctl_error(req, "To be able to use the '-i' option, only " "one key can be defined."); return; } if (md->md_keys != (1 << nkey)) { gctl_error(req, "Only an already defined key can be " "changed when the '-i' option is used."); return; } md->md_iterations = val; } mkeydst = md->md_mkeys + nkey * G_ELI_MKEYLEN; md->md_keys |= (1 << nkey); bcopy(mkey, mkeydst, sizeof(mkey)); bzero(mkey, sizeof(mkey)); /* Generate key for Master Key encryption. */ if (eli_genkey(req, md, key, true) == NULL) { bzero(key, sizeof(key)); bzero(md, sizeof(*md)); return; } /* Encrypt the Master-Key with the new key. */ error = g_eli_mkey_encrypt(md->md_ealgo, key, md->md_keylen, mkeydst); bzero(key, sizeof(key)); if (error != 0) { bzero(md, sizeof(*md)); gctl_error(req, "Cannot encrypt Master Key: %s.", strerror(error)); return; } /* Store metadata with fresh key. */ eli_metadata_store(req, prov, md); bzero(md, sizeof(*md)); } static void eli_setkey(struct gctl_req *req) { struct g_eli_metadata md; const char *prov; int nargs; nargs = gctl_get_int(req, "nargs"); if (nargs != 1) { gctl_error(req, "Invalid number of arguments."); return; } prov = gctl_get_ascii(req, "arg0"); if (eli_metadata_read(req, prov, &md) == -1) return; if (eli_is_attached(prov)) eli_setkey_attached(req, &md); else eli_setkey_detached(req, prov, &md); if (req->error == NULL || req->error[0] == '\0') { printf("Note that the master key encrypted with old keys " "and/or passphrase may still exist in a metadata backup " "file.\n"); } } static void eli_delkey_attached(struct gctl_req *req, const char *prov __unused) { gctl_issue(req); } static void eli_delkey_detached(struct gctl_req *req, const char *prov) { struct g_eli_metadata md; unsigned char *mkeydst; unsigned int nkey; intmax_t val; bool all, force; if (eli_metadata_read(req, prov, &md) == -1) return; all = gctl_get_int(req, "all"); if (all) arc4random_buf(md.md_mkeys, sizeof(md.md_mkeys)); else { force = gctl_get_int(req, "force"); val = gctl_get_intmax(req, "keyno"); if (val == -1) { gctl_error(req, "Key number has to be specified."); return; } nkey = val; if (nkey >= G_ELI_MAXMKEYS) { gctl_error(req, "Invalid '%s' argument.", "keyno"); return; } if (!(md.md_keys & (1 << nkey)) && !force) { gctl_error(req, "Master Key %u is not set.", nkey); return; } md.md_keys &= ~(1 << nkey); if (md.md_keys == 0 && !force) { gctl_error(req, "This is the last Master Key.
Use '-f' " "option if you really want to remove it."); return; } mkeydst = md.md_mkeys + nkey * G_ELI_MKEYLEN; arc4random_buf(mkeydst, G_ELI_MKEYLEN); } eli_metadata_store(req, prov, &md); bzero(&md, sizeof(md)); } static void eli_delkey(struct gctl_req *req) { const char *prov; int nargs; nargs = gctl_get_int(req, "nargs"); if (nargs != 1) { gctl_error(req, "Invalid number of arguments."); return; } prov = gctl_get_ascii(req, "arg0"); if (eli_is_attached(prov)) eli_delkey_attached(req, prov); else eli_delkey_detached(req, prov); } static void eli_resume(struct gctl_req *req) { struct g_eli_metadata md; unsigned char key[G_ELI_USERKEYLEN]; const char *prov; off_t mediasize; int nargs; nargs = gctl_get_int(req, "nargs"); if (nargs != 1) { gctl_error(req, "Invalid number of arguments."); return; } prov = gctl_get_ascii(req, "arg0"); if (eli_metadata_read(req, prov, &md) == -1) return; mediasize = g_get_mediasize(prov); if (md.md_provsize != (uint64_t)mediasize) { gctl_error(req, "Provider size mismatch."); return; } if (eli_genkey(req, &md, key, false) == NULL) { bzero(key, sizeof(key)); return; } gctl_ro_param(req, "key", sizeof(key), key); if (gctl_issue(req) == NULL) { if (verbose) printf("Resumed %s.\n", prov); } bzero(key, sizeof(key)); } static int eli_trash_metadata(struct gctl_req *req, const char *prov, int fd, off_t offset) { unsigned int overwrites; unsigned char *sector; ssize_t size; int error; size = sizeof(overwrites); if (sysctlbyname("kern.geom.eli.overwrites", &overwrites, &size, NULL, 0) == -1 || overwrites == 0) { overwrites = G_ELI_OVERWRITES; } size = g_sectorsize(fd); if (size <= 0) { gctl_error(req, "Cannot obtain provider sector size %s: %s.", prov, strerror(errno)); return (-1); } sector = malloc(size); if (sector == NULL) { gctl_error(req, "Cannot allocate %zd bytes of memory.", size); return (-1); } error = 0; do { arc4random_buf(sector, size); if (pwrite(fd, sector, size, offset) != size) { if (error == 0) error = errno; } (void)g_flush(fd); } while (--overwrites > 0); free(sector); if (error != 0) { gctl_error(req, "Cannot trash metadata on provider %s: %s.", prov, strerror(error)); return (-1); } return (0); } static void eli_kill_detached(struct gctl_req *req, const char *prov) { off_t offset; int fd; /* * NOTE: Maybe we should verify if this is geli provider first, * but 'kill' command is quite critical so better don't waste * the time. 
*/ #if 0 error = g_metadata_read(prov, (unsigned char *)&md, sizeof(md), G_ELI_MAGIC); if (error != 0) { gctl_error(req, "Cannot read metadata from %s: %s.", prov, strerror(error)); return; } #endif fd = g_open(prov, 1); if (fd == -1) { gctl_error(req, "Cannot open provider %s: %s.", prov, strerror(errno)); return; } offset = g_mediasize(fd) - g_sectorsize(fd); if (offset <= 0) { gctl_error(req, "Cannot obtain media size or sector size for provider %s: %s.", prov, strerror(errno)); (void)g_close(fd); return; } (void)eli_trash_metadata(req, prov, fd, offset); (void)g_close(fd); } static void eli_kill(struct gctl_req *req) { const char *prov; int i, nargs, all; nargs = gctl_get_int(req, "nargs"); all = gctl_get_int(req, "all"); if (!all && nargs == 0) { gctl_error(req, "Too few arguments."); return; } /* * How the '-a' option combines with a list of providers: * Delete Master Keys from all attached providers: * geli kill -a * Delete Master Keys from all attached providers and from * detached da0 and da1: * geli kill -a da0 da1 * Delete Master Keys from (attached or detached) da0 and da1: * geli kill da0 da1 */ /* First detached providers. */ for (i = 0; i < nargs; i++) { prov = gctl_get_ascii(req, "arg%d", i); if (!eli_is_attached(prov)) eli_kill_detached(req, prov); } /* Now attached providers. */ gctl_issue(req); } static int eli_backup_create(struct gctl_req *req, const char *prov, const char *file) { unsigned char *sector; ssize_t secsize; int error, filefd, ret; ret = -1; filefd = -1; sector = NULL; secsize = 0; secsize = g_get_sectorsize(prov); if (secsize == 0) { gctl_error(req, "Cannot get information about %s: %s.", prov, strerror(errno)); goto out; } sector = malloc(secsize); if (sector == NULL) { gctl_error(req, "Cannot allocate memory."); goto out; } /* Read metadata from the provider. */ error = g_metadata_read(prov, sector, secsize, G_ELI_MAGIC); if (error != 0) { gctl_error(req, "Unable to read metadata from %s: %s.", prov, strerror(error)); goto out; } filefd = open(file, O_WRONLY | O_TRUNC | O_CREAT, 0600); if (filefd == -1) { gctl_error(req, "Unable to open %s: %s.", file, strerror(errno)); goto out; } /* Write metadata to the destination file. */ if (write(filefd, sector, secsize) != secsize) { gctl_error(req, "Unable to write to %s: %s.", file, strerror(errno)); (void)close(filefd); (void)unlink(file); goto out; } (void)fsync(filefd); (void)close(filefd); /* Success. */ ret = 0; out: if (sector != NULL) { bzero(sector, secsize); free(sector); } return (ret); } static void eli_backup(struct gctl_req *req) { const char *file, *prov; int nargs; nargs = gctl_get_int(req, "nargs"); if (nargs != 2) { gctl_error(req, "Invalid number of arguments."); return; } prov = gctl_get_ascii(req, "arg0"); file = gctl_get_ascii(req, "arg1"); eli_backup_create(req, prov, file); } static void eli_restore(struct gctl_req *req) { struct g_eli_metadata md; const char *file, *prov; off_t mediasize; int nargs; nargs = gctl_get_int(req, "nargs"); if (nargs != 2) { gctl_error(req, "Invalid number of arguments."); return; } file = gctl_get_ascii(req, "arg0"); prov = gctl_get_ascii(req, "arg1"); /* Read metadata from the backup file. */ if (eli_metadata_read(req, file, &md) == -1) return; /* Obtain provider's mediasize. */ mediasize = g_get_mediasize(prov); if (mediasize == 0) { gctl_error(req, "Cannot get information about %s: %s.", prov, strerror(errno)); return; } /* Check if the provider size has changed since we did the backup.
*/ if (md.md_provsize != (uint64_t)mediasize) { if (gctl_get_int(req, "force")) { md.md_provsize = mediasize; } else { gctl_error(req, "Provider size mismatch: " "wrong backup file?"); return; } } /* Write metadata to the provider. */ (void)eli_metadata_store(req, prov, &md); } static void eli_resize(struct gctl_req *req) { struct g_eli_metadata md; const char *prov; unsigned char *sector; ssize_t secsize; off_t mediasize, oldsize; int error, nargs, provfd; nargs = gctl_get_int(req, "nargs"); if (nargs != 1) { gctl_error(req, "Invalid number of arguments."); return; } prov = gctl_get_ascii(req, "arg0"); provfd = -1; sector = NULL; secsize = 0; provfd = g_open(prov, 1); if (provfd == -1) { gctl_error(req, "Cannot open %s: %s.", prov, strerror(errno)); goto out; } mediasize = g_mediasize(provfd); secsize = g_sectorsize(provfd); if (mediasize == -1 || secsize == -1) { gctl_error(req, "Cannot get information about %s: %s.", prov, strerror(errno)); goto out; } sector = malloc(secsize); if (sector == NULL) { gctl_error(req, "Cannot allocate memory."); goto out; } oldsize = gctl_get_intmax(req, "oldsize"); if (oldsize < 0 || oldsize > mediasize) { gctl_error(req, "Invalid oldsize: Out of range."); goto out; } if (oldsize == mediasize) { gctl_error(req, "Size hasn't changed."); goto out; } /* Read metadata from the 'oldsize' offset. */ if (pread(provfd, sector, secsize, oldsize - secsize) != secsize) { gctl_error(req, "Cannot read old metadata: %s.", strerror(errno)); goto out; } /* Check if this sector contains geli metadata. */ error = eli_metadata_decode(sector, &md); switch (error) { case 0: break; case EOPNOTSUPP: gctl_error(req, "Provider's %s metadata version %u is too new.\n" "geli: The highest supported version is %u.", prov, (unsigned int)md.md_version, G_ELI_VERSION); goto out; case EINVAL: gctl_error(req, "Inconsistent provider's %s metadata.", prov); goto out; default: gctl_error(req, "Unexpected error while decoding provider's %s metadata: %s.", prov, strerror(error)); goto out; } /* * If the old metadata doesn't have a correct provider size, refuse * to resize. */ if (md.md_provsize != (uint64_t)oldsize) { gctl_error(req, "Provider size mismatch at oldsize."); goto out; } /* * Update the old metadata with the current provider size and write * it back to the correct place on the provider. */ md.md_provsize = mediasize; /* Write metadata to the provider. */ (void)eli_metadata_store(req, prov, &md); /* Now trash the old metadata. 
*/ (void)eli_trash_metadata(req, prov, provfd, oldsize - secsize); out: if (provfd != -1) (void)g_close(provfd); if (sector != NULL) { bzero(sector, secsize); free(sector); } } static void eli_version(struct gctl_req *req) { struct g_eli_metadata md; const char *name; unsigned int version; int error, i, nargs; nargs = gctl_get_int(req, "nargs"); if (nargs == 0) { unsigned int kernver; ssize_t size; size = sizeof(kernver); if (sysctlbyname("kern.geom.eli.version", &kernver, &size, NULL, 0) == -1) { warn("Unable to obtain GELI kernel version"); } else { printf("kernel: %u\n", kernver); } printf("userland: %u\n", G_ELI_VERSION); return; } for (i = 0; i < nargs; i++) { name = gctl_get_ascii(req, "arg%d", i); error = g_metadata_read(name, (unsigned char *)&md, sizeof(md), G_ELI_MAGIC); if (error != 0) { warn("%s: Unable to read metadata: %s.", name, strerror(error)); gctl_error(req, "Not fully done."); continue; } version = le32dec(&md.md_version); printf("%s: %u\n", name, version); } } static void eli_clear(struct gctl_req *req) { const char *name; int error, i, nargs; nargs = gctl_get_int(req, "nargs"); if (nargs < 1) { gctl_error(req, "Too few arguments."); return; } for (i = 0; i < nargs; i++) { name = gctl_get_ascii(req, "arg%d", i); error = g_metadata_clear(name, G_ELI_MAGIC); if (error != 0) { fprintf(stderr, "Cannot clear metadata on %s: %s.\n", name, strerror(error)); gctl_error(req, "Not fully done."); continue; } if (verbose) printf("Metadata cleared on %s.\n", name); } } static void eli_dump(struct gctl_req *req) { struct g_eli_metadata md; const char *name; int i, nargs; nargs = gctl_get_int(req, "nargs"); if (nargs < 1) { gctl_error(req, "Too few arguments."); return; } for (i = 0; i < nargs; i++) { name = gctl_get_ascii(req, "arg%d", i); if (eli_metadata_read(NULL, name, &md) == -1) { gctl_error(req, "Not fully done."); continue; } printf("Metadata on %s:\n", name); eli_metadata_dump(&md); printf("\n"); } } Index: user/markj/netdump/sbin/ipfw/ipfw.8 =================================================================== --- user/markj/netdump/sbin/ipfw/ipfw.8 (revision 332407) +++ user/markj/netdump/sbin/ipfw/ipfw.8 (revision 332408) @@ -1,4075 +1,4075 @@ .\" .\" $FreeBSD$ .\" .Dd March 19, 2018 .Dt IPFW 8 .Os .Sh NAME .Nm ipfw .Nd User interface for firewall, traffic shaper, packet scheduler, in-kernel NAT. .Sh SYNOPSIS .Ss FIREWALL CONFIGURATION .Nm .Op Fl cq .Cm add .Ar rule .Nm .Op Fl acdefnNStT .Op Cm set Ar N .Brq Cm list | show .Op Ar rule | first-last ... .Nm .Op Fl f | q .Op Cm set Ar N .Cm flush .Nm .Op Fl q .Op Cm set Ar N .Brq Cm delete | zero | resetlog .Op Ar number ... .Pp .Nm .Cm set Oo Cm disable Ar number ... Oc Op Cm enable Ar number ... .Nm .Cm set move .Op Cm rule .Ar number Cm to Ar number .Nm .Cm set swap Ar number number .Nm .Cm set show .Ss SYSCTL SHORTCUTS .Nm .Cm enable .Brq Cm firewall | altq | one_pass | debug | verbose | dyn_keepalive .Nm .Cm disable .Brq Cm firewall | altq | one_pass | debug | verbose | dyn_keepalive .Ss LOOKUP TABLES .Nm .Oo Cm set Ar N Oc Cm table Ar name Cm create Ar create-options .Nm .Oo Cm set Ar N Oc Cm table .Brq Ar name | all .Cm destroy .Nm .Oo Cm set Ar N Oc Cm table Ar name Cm modify Ar modify-options .Nm .Oo Cm set Ar N Oc Cm table Ar name Cm swap Ar name .Nm .Oo Cm set Ar N Oc Cm table Ar name Cm add Ar table-key Op Ar value .Nm .Oo Cm set Ar N Oc Cm table Ar name Cm add Op Ar table-key Ar value ... .Nm .Oo Cm set Ar N Oc Cm table Ar name Cm atomic add Op Ar table-key Ar value ... 
.Nm .Oo Cm set Ar N Oc Cm table Ar name Cm delete Op Ar table-key ... .Nm .Oo Cm set Ar N Oc Cm table Ar name Cm lookup Ar addr .Nm .Oo Cm set Ar N Oc Cm table Ar name Cm lock .Nm .Oo Cm set Ar N Oc Cm table Ar name Cm unlock .Nm .Oo Cm set Ar N Oc Cm table .Brq Ar name | all .Cm list .Nm .Oo Cm set Ar N Oc Cm table .Brq Ar name | all .Cm info .Nm .Oo Cm set Ar N Oc Cm table .Brq Ar name | all .Cm detail .Nm .Oo Cm set Ar N Oc Cm table .Brq Ar name | all .Cm flush .Ss DUMMYNET CONFIGURATION (TRAFFIC SHAPER AND PACKET SCHEDULER) .Nm .Brq Cm pipe | queue | sched .Ar number .Cm config .Ar config-options .Nm .Op Fl s Op Ar field .Brq Cm pipe | queue | sched .Brq Cm delete | list | show .Op Ar number ... .Ss IN-KERNEL NAT .Nm .Op Fl q .Cm nat .Ar number .Cm config .Ar config-options .Pp .Nm .Op Fl cfnNqS .Oo .Fl p Ar preproc .Oo .Ar preproc-flags .Oc .Oc .Ar pathname .Ss STATEFUL IPv6/IPv4 NETWORK ADDRESS AND PROTOCOL TRANSLATION .Nm .Oo Cm set Ar N Oc Cm nat64lsn Ar name Cm create Ar create-options .Nm .Oo Cm set Ar N Oc Cm nat64lsn Ar name Cm config Ar config-options .Nm .Oo Cm set Ar N Oc Cm nat64lsn .Brq Ar name | all .Brq Cm list | show .Op Cm states .Nm .Oo Cm set Ar N Oc Cm nat64lsn .Brq Ar name | all .Cm destroy .Nm .Oo Cm set Ar N Oc Cm nat64lsn Ar name Cm stats Op Cm reset .Ss STATELESS IPv6/IPv4 NETWORK ADDRESS AND PROTOCOL TRANSLATION .Nm .Oo Cm set Ar N Oc Cm nat64stl Ar name Cm create Ar create-options .Nm .Oo Cm set Ar N Oc Cm nat64stl Ar name Cm config Ar config-options .Nm .Oo Cm set Ar N Oc Cm nat64stl .Brq Ar name | all .Brq Cm list | show .Nm .Oo Cm set Ar N Oc Cm nat64stl .Brq Ar name | all .Cm destroy .Nm .Oo Cm set Ar N Oc Cm nat64stl Ar name Cm stats Op Cm reset .Ss IPv6-to-IPv6 NETWORK PREFIX TRANSLATION .Nm .Oo Cm set Ar N Oc Cm nptv6 Ar name Cm create Ar create-options .Nm .Oo Cm set Ar N Oc Cm nptv6 .Brq Ar name | all .Brq Cm list | show .Nm .Oo Cm set Ar N Oc Cm nptv6 .Brq Ar name | all .Cm destroy .Nm .Oo Cm set Ar N Oc Cm nptv6 Ar name Cm stats Op Cm reset .Ss INTERNAL DIAGNOSTICS .Nm .Cm internal iflist .Nm .Cm internal talist .Nm .Cm internal vlist .Sh DESCRIPTION The .Nm utility is the user interface for controlling the .Xr ipfw 4 firewall, the .Xr dummynet 4 traffic shaper/packet scheduler, and the in-kernel NAT services. .Pp A firewall configuration, or .Em ruleset , is made of a list of .Em rules numbered from 1 to 65535. Packets are passed to the firewall from a number of different places in the protocol stack (depending on the source and destination of the packet, it is possible for the firewall to be invoked multiple times on the same packet). The packet passed to the firewall is compared against each of the rules in the .Em ruleset , in rule-number order (multiple rules with the same number are permitted, in which case they are processed in order of insertion). When a match is found, the action corresponding to the matching rule is performed. .Pp Depending on the action and certain system settings, packets can be reinjected into the firewall at some rule after the matching one for further processing. .Pp A ruleset always includes a .Em default rule (numbered 65535) which cannot be modified or deleted, and matches all packets. The action associated with the .Em default rule can be either .Cm deny or .Cm allow depending on how the kernel is configured. 
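.Pp For example, on a running system the action of the .Em default rule can be inspected by listing rule 65535: .Bd -literal -offset indent ipfw list 65535 .Ed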
.Pp If the ruleset includes one or more rules with the .Cm keep-state or .Cm limit option, the firewall will have a .Em stateful behaviour, i.e., upon a match it will create .Em dynamic rules , i.e., rules that match packets with the same 5-tuple (protocol, source and destination addresses and ports) as the packet which caused their creation. Dynamic rules, which have a limited lifetime, are checked at the first occurrence of a .Cm check-state , .Cm keep-state or .Cm limit rule, and are typically used to open the firewall on-demand to legitimate traffic only. See the .Sx STATEFUL FIREWALL and .Sx EXAMPLES Sections below for more information on the stateful behaviour of .Nm . .Pp All rules (including dynamic ones) have a few associated counters: a packet count, a byte count, a log count and a timestamp indicating the time of the last match. Counters can be displayed or reset with .Nm commands. .Pp Each rule belongs to one of 32 different .Em sets , and there are .Nm commands to atomically manipulate sets, such as enable, disable, swap sets, move all rules in a set to another one, delete all rules in a set. These can be useful to install temporary configurations, or to test them. See Section .Sx SETS OF RULES for more information on .Em sets . .Pp Rules can be added with the .Cm add command; deleted individually or in groups with the .Cm delete command, and globally (except those in set 31) with the .Cm flush command; displayed, optionally with the content of the counters, using the .Cm show and .Cm list commands. Finally, counters can be reset with the .Cm zero and .Cm resetlog commands. .Pp .Ss COMMAND OPTIONS The following general options are available when invoking .Nm : .Bl -tag -width indent .It Fl a Show counter values when listing rules. The .Cm show command implies this option. .It Fl b Only show the action and the comment, not the body of a rule. Implies .Fl c . .It Fl c When entering or showing rules, print them in compact form, i.e., omitting the "ip from any to any" string when this does not carry any additional information. .It Fl d When listing, show dynamic rules in addition to static ones. .It Fl e When listing and .Fl d is specified, also show expired dynamic rules. .It Fl f Do not ask for confirmation for commands that can cause problems if misused, i.e., .Cm flush . If there is no tty associated with the process, this is implied. .It Fl i When listing a table (see the .Sx LOOKUP TABLES section below for more information on lookup tables), format values as IP addresses. By default, values are shown as integers. .It Fl n Only check syntax of the command strings, without actually passing them to the kernel. .It Fl N Try to resolve addresses and service names in output. .It Fl q Be quiet when executing the .Cm add , .Cm nat , .Cm zero , .Cm resetlog or .Cm flush commands; (implies .Fl f ) . This is useful when updating rulesets by executing multiple .Nm commands in a script (e.g., .Ql sh\ /etc/rc.firewall ) , or by processing a file with many .Nm rules across a remote login session. It also stops a table add or delete from failing if the entry already exists or is not present. .Pp The reason why this option may be important is that for some of these actions, .Nm may print a message; if the action results in blocking the traffic to the remote client, the remote login session will be closed and the rest of the ruleset will not be processed. Access to the console would then be required to recover. .It Fl S When listing rules, show the .Em set each rule belongs to. 
If this flag is not specified, disabled rules will not be listed. .It Fl s Op Ar field When listing pipes, sort according to one of the four counters (total or current packets or bytes). .It Fl t When listing, show last match timestamp converted with ctime(). .It Fl T When listing, show last match timestamp as seconds from the epoch. This form can be more convenient for postprocessing by scripts. .El .Ss LIST OF RULES AND PREPROCESSING To ease configuration, rules can be put into a file which is processed using .Nm as shown in the last synopsis line. An absolute .Ar pathname must be used. The file will be read line by line and applied as arguments to the .Nm utility. .Pp Optionally, a preprocessor can be specified using .Fl p Ar preproc where .Ar pathname is to be piped through. Useful preprocessors include .Xr cpp 1 and .Xr m4 1 . If .Ar preproc does not start with a slash .Pq Ql / as its first character, the usual .Ev PATH name search is performed. Care should be taken with this in environments where not all file systems are mounted (yet) by the time .Nm is being run (e.g.\& when they are mounted over NFS). Once .Fl p has been specified, any additional arguments are passed on to the preprocessor for interpretation. This allows for flexible configuration files (like conditionalizing them on the local hostname) and the use of macros to centralize frequently required arguments like IP addresses. .Ss TRAFFIC SHAPER CONFIGURATION The .Nm .Cm pipe , queue and .Cm sched commands are used to configure the traffic shaper and packet scheduler. See the .Sx TRAFFIC SHAPER (DUMMYNET) CONFIGURATION Section below for details. .Pp If the world and the kernel get out of sync the .Nm ABI may break, preventing you from being able to add any rules. This can adversely affect the booting process. You can use .Nm .Cm disable .Cm firewall to temporarily disable the firewall to regain access to the network, allowing you to fix the problem. .Sh PACKET FLOW A packet is checked against the active ruleset in multiple places in the protocol stack, under control of several sysctl variables. These places and variables are shown below, and it is important to have this picture in mind in order to design a correct ruleset. .Bd -literal -offset indent ^ to upper layers V | | +----------->-----------+ ^ V [ip(6)_input] [ip(6)_output] net.inet(6).ip(6).fw.enable=1 | | ^ V [ether_demux] [ether_output_frame] net.link.ether.ipfw=1 | | +-->--[bdg_forward]-->--+ net.link.bridge.ipfw=1 ^ V | to devices | .Ed .Pp The number of times the same packet goes through the firewall can vary between 0 and 4 depending on packet source and destination, and system configuration. .Pp Note that as packets flow through the stack, headers can be stripped or added to it, and so they may or may not be available for inspection. E.g., incoming packets will include the MAC header when .Nm is invoked from .Cm ether_demux() , but the same packets will have the MAC header stripped off when .Nm is invoked from .Cm ip_input() or .Cm ip6_input() . .Pp Also note that each packet is always checked against the complete ruleset, irrespective of the place where the check occurs, or the source of the packet. If a rule contains some match patterns or actions which are not valid for the place of invocation (e.g.\& trying to match a MAC header within .Cm ip_input or .Cm ip6_input ), the match pattern will not match, but a .Cm not operator in front of such patterns .Em will cause the pattern to .Em always match on those packets. 
It is thus the responsibility of the programmer, if necessary, to write a suitable ruleset to differentiate among the possible places. .Cm skipto rules can be useful here, as an example: .Bd -literal -offset indent # packets from ether_demux or bdg_forward ipfw add 10 skipto 1000 all from any to any layer2 in # packets from ip_input ipfw add 10 skipto 2000 all from any to any not layer2 in # packets from ip_output ipfw add 10 skipto 3000 all from any to any not layer2 out # packets from ether_output_frame ipfw add 10 skipto 4000 all from any to any layer2 out .Ed .Pp (yes, at the moment there is no way to differentiate between ether_demux and bdg_forward). .Sh SYNTAX In general, each keyword or argument must be provided as a separate command line argument, with no leading or trailing spaces. Keywords are case-sensitive, whereas arguments may or may not be case-sensitive depending on their nature (e.g.\& uid's are, hostnames are not). .Pp Some arguments (e.g., port or address lists) are comma-separated lists of values. In this case, spaces after commas ',' are allowed to make the line more readable. You can also put the entire command (including flags) into a single argument. E.g., the following forms are equivalent: .Bd -literal -offset indent ipfw -q add deny src-ip 10.0.0.0/24,127.0.0.1/8 ipfw -q add deny src-ip 10.0.0.0/24, 127.0.0.1/8 ipfw "-q add deny src-ip 10.0.0.0/24, 127.0.0.1/8" .Ed .Sh RULE FORMAT The format of firewall rules is the following: .Bd -ragged -offset indent .Bk -words .Op Ar rule_number .Op Cm set Ar set_number .Op Cm prob Ar match_probability .Ar action .Op Cm log Op Cm logamount Ar number .Op Cm altq Ar queue .Oo .Bro Cm tag | untag .Brc Ar number .Oc .Ar body .Ek .Ed .Pp where the body of the rule specifies which information is used for filtering packets, among the following: .Pp .Bl -tag -width "Source and dest. addresses and ports" -offset XXX -compact .It Layer-2 header fields When available .It IPv4 and IPv6 Protocol SCTP, TCP, UDP, ICMP, etc. .It Source and dest. addresses and ports .It Direction See Section .Sx PACKET FLOW .It Transmit and receive interface By name or address .It Misc. IP header fields Version, type of service, datagram length, identification, fragment flag (non-zero IP offset), Time To Live .It IP options .It IPv6 Extension headers Fragmentation, Hop-by-Hop options, Routing Headers, Source routing rthdr0, Mobile IPv6 rthdr2, IPSec options. .It IPv6 Flow-ID .It Misc. TCP header fields TCP flags (SYN, FIN, ACK, RST, etc.), sequence number, acknowledgment number, window .It TCP options .It ICMP types for ICMP packets .It ICMP6 types for ICMP6 packets .It User/group ID When the packet can be associated with a local socket. .It Divert status Whether a packet came from a divert socket (e.g., .Xr natd 8 ) . .It Fib annotation state Whether a packet has been tagged for using a specific FIB (routing table) in future forwarding decisions. .El .Pp Note that some of the above information, e.g.\& source MAC or IP addresses and TCP/UDP ports, can be easily spoofed, so filtering on those fields alone might not guarantee the desired results. .Bl -tag -width indent .It Ar rule_number Each rule is associated with a .Ar rule_number in the range 1..65535, with the latter reserved for the .Em default rule. Rules are checked sequentially by rule number. Multiple rules can have the same number, in which case they are checked (and listed) according to the order in which they have been added. 
If a rule is entered without specifying a number, the kernel will assign one in such a way that the rule becomes the last one before the .Em default rule. Automatic rule numbers are assigned by incrementing the last non-default rule number by the value of the sysctl variable .Ar net.inet.ip.fw.autoinc_step which defaults to 100. If this is not possible (e.g.\& because we would go beyond the maximum allowed rule number), the number of the last non-default value is used instead. .It Cm set Ar set_number Each rule is associated with a .Ar set_number in the range 0..31. Sets can be individually disabled and enabled, so this parameter is of fundamental importance for atomic ruleset manipulation. It can be also used to simplify deletion of groups of rules. If a rule is entered without specifying a set number, set 0 will be used. .br Set 31 is special in that it cannot be disabled, and rules in set 31 are not deleted by the .Nm ipfw flush command (but you can delete them with the .Nm ipfw delete set 31 command). Set 31 is also used for the .Em default rule. .It Cm prob Ar match_probability A match is only declared with the specified probability (floating point number between 0 and 1). This can be useful for a number of applications such as random packet drop or (in conjunction with .Nm dummynet ) to simulate the effect of multiple paths leading to out-of-order packet delivery. .Pp Note: this condition is checked before any other condition, including ones such as keep-state or check-state which might have side effects. .It Cm log Op Cm logamount Ar number Packets matching a rule with the .Cm log keyword will be made available for logging in two ways: if the sysctl variable .Va net.inet.ip.fw.verbose is set to 0 (default), one can use .Xr bpf 4 attached to the .Li ipfw0 pseudo interface. This pseudo interface can be created after a boot manually by using the following command: .Bd -literal -offset indent # ifconfig ipfw0 create .Ed .Pp Or, automatically at boot time by adding the following line to the .Xr rc.conf 5 file: .Bd -literal -offset indent firewall_logif="YES" .Ed .Pp There is no overhead if no .Xr bpf 4 is attached to the pseudo interface. .Pp If .Va net.inet.ip.fw.verbose is set to 1, packets will be logged to .Xr syslogd 8 with a .Dv LOG_SECURITY facility up to a maximum of .Cm logamount packets. If no .Cm logamount is specified, the limit is taken from the sysctl variable .Va net.inet.ip.fw.verbose_limit . In both cases, a value of 0 means unlimited logging. .Pp Once the limit is reached, logging can be re-enabled by clearing the logging counter or the packet counter for that entry, see the .Cm resetlog command. .Pp Note: logging is done after all other packet matching conditions have been successfully verified, and before performing the final action (accept, deny, etc.) on the packet. .It Cm tag Ar number When a packet matches a rule with the .Cm tag keyword, the numeric tag for the given .Ar number in the range 1..65534 will be attached to the packet. The tag acts as an internal marker (it is not sent out over the wire) that can be used to identify these packets later on. This can be used, for example, to provide trust between interfaces and to start doing policy-based filtering. A packet can have multiple tags at the same time. Tags are "sticky", meaning once a tag is applied to a packet by a matching rule it exists until explicit removal. 
Tags are kept with the packet everywhere within the kernel, but are lost when the packet leaves the kernel, for example, when transmitting the packet out to the network or sending the packet to a .Xr divert 4 socket. .Pp To check for previously applied tags, use the .Cm tagged rule option. To delete a previously applied tag, use the .Cm untag keyword. .Pp Note: since tags are kept with the packet everywhere in kernelspace, they can be set and unset anywhere in the kernel network subsystem (using the .Xr mbuf_tags 9 facility), not only by means of the .Xr ipfw 4 .Cm tag and .Cm untag keywords. For example, there can be a specialized .Xr netgraph 4 node doing traffic analysis and tagging for later inspection in the firewall. .It Cm untag Ar number When a packet matches a rule with the .Cm untag keyword, the tag with the number .Ar number is searched among the tags attached to this packet and, if found, removed from it. Other tags bound to the packet, if present, are left untouched. .It Cm altq Ar queue When a packet matches a rule with the .Cm altq keyword, the ALTQ identifier for the given .Ar queue (see .Xr altq 4 ) will be attached. Note that this ALTQ tag is only meaningful for packets going "out" of IPFW, and not being rejected or going to divert sockets. Note that if there is insufficient memory at the time the packet is processed, it will not be tagged, so it is wise to make your ALTQ "default" queue policy account for this. If multiple .Cm altq rules match a single packet, only the first one adds the ALTQ classification tag. In doing so, traffic may be shaped by using .Cm count Cm altq Ar queue rules for classification early in the ruleset, then later applying the filtering decision. For example, .Cm check-state and .Cm keep-state rules may come later and provide the actual filtering decisions in addition to the fallback ALTQ tag. .Pp You must run .Xr pfctl 8 to set up the queues before IPFW will be able to look them up by name, and if the ALTQ disciplines are rearranged, the rules containing the queue identifiers in the kernel will likely have gone stale and need to be reloaded. Stale queue identifiers will probably result in misclassification. .Pp All system ALTQ processing can be turned on or off via .Nm .Cm enable Ar altq and .Nm .Cm disable Ar altq . The usage of .Va net.inet.ip.fw.one_pass is irrelevant to ALTQ traffic shaping, as the actual rule action is always followed after adding an ALTQ tag. .El .Ss RULE ACTIONS A rule can be associated with one of the following actions, which will be executed when the packet matches the body of the rule. .Bl -tag -width indent .It Cm allow | accept | pass | permit Allow packets that match the rule. The search terminates. .It Cm check-state Op Ar :flowname | Cm :any Checks the packet against the dynamic ruleset. If a match is found, execute the action associated with the rule which generated this dynamic rule, otherwise move to the next rule. .br .Cm Check-state rules do not have a body. If no .Cm check-state rule is found, the dynamic ruleset is checked at the first .Cm keep-state or .Cm limit rule. The .Ar :flowname is a symbolic name assigned to the dynamic rule by the .Cm keep-state opcode. The special flowname .Cm :any can be used to ignore the state's flowname when matching. The .Cm :default keyword is a special name used for compatibility with old rulesets. .It Cm count Update counters for all packets that match the rule. The search continues with the next rule. .It Cm deny | drop Discard packets that match this rule. The search terminates.
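.Pp For example, a minimal rule discarding all traffic from a given network (here the 192.0.2.0/24 documentation prefix stands in for a real address range) might be: .Bd -literal -offset indent ipfw add deny ip from 192.0.2.0/24 to any .Ed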
.It Cm divert Ar port Divert packets that match this rule to the .Xr divert 4 socket bound to port .Ar port . The search terminates. .It Cm fwd | forward Ar ipaddr | tablearg Ns Op , Ns Ar port Change the next-hop on matching packets to .Ar ipaddr , which can be an IP address or a host name. The next hop can also be supplied by the last table looked up for the packet by using the .Cm tablearg keyword instead of an explicit address. The search terminates if this rule matches. .Pp If .Ar ipaddr is a local address, then matching packets will be forwarded to .Ar port (or the port number in the packet if one is not specified in the rule) on the local machine. .br If .Ar ipaddr is not a local address, then the port number (if specified) is ignored, and the packet will be forwarded to the remote address, using the route as found in the local routing table for that IP. .br A .Ar fwd rule will not match layer-2 packets (those received on ether_input, ether_output, or bridged). .br The .Cm fwd action does not change the contents of the packet at all. In particular, the destination address remains unmodified, so packets forwarded to another system will usually be rejected by that system unless there is a matching rule on that system to capture them. For packets forwarded locally, the local address of the socket will be set to the original destination address of the packet. This makes the .Xr netstat 1 entry look rather weird but is intended for use with transparent proxy servers. .It Cm nat Ar nat_nr | tablearg Pass packet to a nat instance (for network address translation, address redirect, etc.): see the .Sx NETWORK ADDRESS TRANSLATION (NAT) Section for further information. .It Cm nat64lsn Ar name Pass packet to a stateful NAT64 instance (for IPv6/IPv4 network address and protocol translation): see the .Sx IPv6/IPv4 NETWORK ADDRESS AND PROTOCOL TRANSLATION Section for further information. .It Cm nat64stl Ar name Pass packet to a stateless NAT64 instance (for IPv6/IPv4 network address and protocol translation): see the .Sx IPv6/IPv4 NETWORK ADDRESS AND PROTOCOL TRANSLATION Section for further information. .It Cm nptv6 Ar name Pass packet to a NPTv6 instance (for IPv6-to-IPv6 network prefix translation): see the .Sx IPv6-to-IPv6 NETWORK PREFIX TRANSLATION (NPTv6) Section for further information. .It Cm pipe Ar pipe_nr Pass packet to a .Nm dummynet .Dq pipe (for bandwidth limitation, delay, etc.). See the .Sx TRAFFIC SHAPER (DUMMYNET) CONFIGURATION Section for further information. The search terminates; however, on exit from the pipe and if the .Xr sysctl 8 variable .Va net.inet.ip.fw.one_pass is not set, the packet is passed again to the firewall code starting from the next rule. .It Cm queue Ar queue_nr Pass packet to a .Nm dummynet .Dq queue (for bandwidth limitation using WF2Q+). .It Cm reject (Deprecated). Synonym for .Cm unreach host . .It Cm reset Discard packets that match this rule, and if the packet is a TCP packet, try to send a TCP reset (RST) notice. The search terminates. .It Cm reset6 Discard packets that match this rule, and if the packet is a TCP packet, try to send a TCP reset (RST) notice. The search terminates. .It Cm skipto Ar number | tablearg Skip all subsequent rules numbered less than .Ar number . The search continues with the first rule numbered .Ar number or higher. It is possible to use the .Cm tablearg keyword with a skipto for a .Em computed skipto. Skipto may work either in O(log(N)) or in O(1) depending on amount of memory and/or sysctl variables. 
See the .Sx SYSCTL VARIABLES section for more details. .It Cm call Ar number | tablearg The current rule number is saved in the internal stack and ruleset processing continues with the first rule numbered .Ar number or higher. If later a rule with the .Cm return action is encountered, the processing returns to the first rule with a number of this .Cm call rule plus one or higher (the same behaviour as with packets returning from a .Xr divert 4 socket after a .Cm divert action). This can be used to implement something like assembly-language .Dq subroutine calls to rules with common checks for different interfaces, etc. .Pp A rule with any number can be called, not only forward jumps as with .Cm skipto . So, to prevent endless loops in case of mistakes, both .Cm call and .Cm return actions don't do any jumps and simply go to the next rule if memory cannot be allocated or the stack has overflowed or underflowed. .Pp Internally, the stack of rule numbers is implemented using the .Xr mbuf_tags 9 facility and currently has a size of 16 entries. As mbuf tags are lost when the packet leaves the kernel, .Cm divert should not be used in subroutines to avoid endless loops and other undesired effects. .It Cm return Takes the rule number saved to the internal stack by the last .Cm call action and returns ruleset processing to the first rule with a number greater than the number of the corresponding .Cm call rule. See the description of the .Cm call action for more details. .Pp Note that .Cm return rules usually end a .Dq subroutine and thus are unconditional, but the .Nm command-line utility currently requires every action except .Cm check-state to have a body. While it is sometimes useful to return only on some packets, usually you want to print just .Dq return for readability. A workaround for this is to use the new syntax and the .Fl c switch: .Bd -literal -offset indent # Add a rule without actual body ipfw add 2999 return via any # List rules without "from any to any" part ipfw -c list .Ed .Pp This cosmetic annoyance may be fixed in future releases. .It Cm tee Ar port Send a copy of packets matching this rule to the .Xr divert 4 socket bound to port .Ar port . The search continues with the next rule. .It Cm unreach Ar code Discard packets that match this rule, and try to send an ICMP unreachable notice with code .Ar code , where .Ar code is a number from 0 to 255, or one of these aliases: .Cm net , host , protocol , port , .Cm needfrag , srcfail , net-unknown , host-unknown , .Cm isolated , net-prohib , host-prohib , tosnet , .Cm toshost , filter-prohib , host-precedence or .Cm precedence-cutoff . The search terminates. .It Cm unreach6 Ar code Discard packets that match this rule, and try to send an ICMPv6 unreachable notice with code .Ar code , where .Ar code is a number from 0, 1, 3 or 4, or one of these aliases: .Cm no-route, admin-prohib, address or .Cm port . The search terminates. .It Cm netgraph Ar cookie Divert the packet into netgraph with the given .Ar cookie . The search terminates. If the packet is later returned from netgraph, it is either accepted or continues with the next rule, depending on the .Va net.inet.ip.fw.one_pass sysctl variable. .It Cm ngtee Ar cookie A copy of the packet is diverted into netgraph; the original packet continues with the next rule. See .Xr ng_ipfw 4 for more information on the .Cm netgraph and .Cm ngtee actions. .It Cm setfib Ar fibnum | tablearg The packet is tagged so as to use the FIB (routing table) .Ar fibnum in any subsequent forwarding decisions. In the current implementation, this is limited to the values 0 through 15, see .Xr setfib 2 .
Processing continues at the next rule. It is possible to use the .Cm tablearg keyword with setfib. If the tablearg value is not within the compiled range of fibs, the packet's fib is set to 0. .It Cm setdscp Ar DSCP | number | tablearg Set the specified DiffServ codepoint for an IPv4/IPv6 packet. Processing continues at the next rule. Supported values are: .Pp .Cm cs0 .Pq Dv 000000 , .Cm cs1 .Pq Dv 001000 , .Cm cs2 .Pq Dv 010000 , .Cm cs3 .Pq Dv 011000 , .Cm cs4 .Pq Dv 100000 , .Cm cs5 .Pq Dv 101000 , .Cm cs6 .Pq Dv 110000 , .Cm cs7 .Pq Dv 111000 , .Cm af11 .Pq Dv 001010 , .Cm af12 .Pq Dv 001100 , .Cm af13 .Pq Dv 001110 , .Cm af21 .Pq Dv 010010 , .Cm af22 .Pq Dv 010100 , .Cm af23 .Pq Dv 010110 , .Cm af31 .Pq Dv 011010 , .Cm af32 .Pq Dv 011100 , .Cm af33 .Pq Dv 011110 , .Cm af41 .Pq Dv 100010 , .Cm af42 .Pq Dv 100100 , .Cm af43 .Pq Dv 100110 , .Cm ef .Pq Dv 101110 , .Cm be .Pq Dv 000000 . Additionally, the DSCP value can be specified by number (0..63). It is also possible to use the .Cm tablearg keyword with setdscp. If the tablearg value is not within the 0..63 range, the lower 6 bits of the supplied value are used. .It Cm tcp-setmss Ar mss Set the Maximum Segment Size (MSS) in the TCP segment to the value .Ar mss . The kernel module .Cm ipfw_pmod should be loaded, or the kernel should have .Cm options IPFIREWALL_PMOD , to be able to use this action. This command does not change a packet if its original MSS value is lower than the specified value. Both TCP over IPv4 and over IPv6 are supported. Regardless of whether a packet matched the .Cm tcp-setmss rule or not, the search continues with the next rule. .It Cm reass Queue and reassemble IPv4 fragments. If the packet is not fragmented, counters are updated and processing continues with the next rule. If the packet is the last logical fragment, the packet is reassembled and, if .Va net.inet.ip.fw.one_pass is set to 0, processing continues with the next rule. Otherwise, the packet is allowed to pass and the search terminates. If the packet is a fragment in the middle of a logical group of fragments, it is consumed and processing stops immediately. .Pp Fragment handling can be tuned via .Va net.inet.ip.maxfragpackets and .Va net.inet.ip.maxfragsperpacket which limit, respectively, the maximum number of processable fragments (default: 800) and the maximum number of fragments per packet (default: 16). .Pp NOTA BENE: since fragments do not contain port numbers, port-based match patterns should be avoided when the .Nm reass rule is used. Alternatively, direction-based (like .Nm in / .Nm out ) and source-based (like .Nm via ) match patterns can be used to select fragments. .Pp Usually a simple rule like: .Bd -literal -offset indent # reassemble incoming fragments ipfw add reass all from any to any in .Ed .Pp is all you need at the beginning of your ruleset. .It Cm abort Discard packets that match this rule, and if the packet is an SCTP packet, try to send an SCTP packet containing an ABORT chunk. The search terminates. .It Cm abort6 Discard packets that match this rule, and if the packet is an SCTP packet, try to send an SCTP packet containing an ABORT chunk. The search terminates. .El .Ss RULE BODY The body of a rule contains zero or more patterns (such as specific source and destination addresses or ports, protocol options, incoming or outgoing interfaces, etc.) that the packet must match in order to be recognised. In general, the patterns are connected by (implicit) .Cm and operators -- i.e., all must match in order for the rule to match.
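.Pp For example, the following rule (using the placeholder address 192.0.2.1) matches only packets that are simultaneously TCP, from 192.0.2.1, and destined for port 25: .Bd -literal -offset indent ipfw add allow tcp from 192.0.2.1 to any 25 .Ed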
Individual patterns can be prefixed by the .Cm not operator to reverse the result of the match, as in .Pp .Dl "ipfw add 100 allow ip from not 1.2.3.4 to any" .Pp Additionally, sets of alternative match patterns .Pq Em or-blocks can be constructed by putting the patterns in lists enclosed between parentheses ( ) or braces { }, and using the .Cm or operator as follows: .Pp .Dl "ipfw add 100 allow ip from { x or not y or z } to any" .Pp Only one level of parentheses is allowed. Beware that most shells have special meanings for parentheses or braces, so it is advisable to put a backslash \\ in front of them to prevent such interpretations. .Pp The body of a rule must in general include a source and destination address specifier. The keyword .Ar any can be used in various places to specify that the content of a required field is irrelevant. .Pp The rule body has the following format: .Bd -ragged -offset indent .Op Ar proto Cm from Ar src Cm to Ar dst .Op Ar options .Ed .Pp The first part (proto from src to dst) is for backward compatibility with earlier versions of .Fx . In modern .Fx any match pattern (including MAC headers, IP protocols, addresses and ports) can be specified in the .Ar options section. .Pp Rule fields have the following meaning: .Bl -tag -width indent .It Ar proto : protocol | Cm { Ar protocol Cm or ... } .It Ar protocol : Oo Cm not Oc Ar protocol-name | protocol-number An IP protocol specified by number or name (for a complete list see .Pa /etc/protocols ) , or one of the following keywords: .Bl -tag -width indent .It Cm ip4 | ipv4 Matches IPv4 packets. .It Cm ip6 | ipv6 Matches IPv6 packets. .It Cm ip | all Matches any packet. .El .Pp The .Cm ipv6 keyword in the .Cm proto option will be treated as an inner protocol; the .Cm ipv4 keyword is not available in the .Cm proto option. .Pp The .Cm { Ar protocol Cm or ... } format (an .Em or-block ) is provided for convenience only but its use is deprecated. .It Ar src No and Ar dst : Bro Cm addr | Cm { Ar addr Cm or ... } Brc Op Oo Cm not Oc Ar ports An address (or a list, see below) optionally followed by .Ar ports specifiers. .Pp The second format .Em ( or-block with multiple addresses) is provided for convenience only and its use is discouraged. .It Ar addr : Oo Cm not Oc Bro .Cm any | me | me6 | .Cm table Ns Pq Ar name Ns Op , Ns Ar value .Ar | addr-list | addr-set .Brc .Bl -tag -width indent .It Cm any matches any IP address. .It Cm me matches any IP address configured on an interface in the system. .It Cm me6 matches any IPv6 address configured on an interface in the system. The address list is evaluated at the time the packet is analysed. .It Cm table Ns Pq Ar name Ns Op , Ns Ar value Matches any IPv4 or IPv6 address for which an entry exists in the lookup table .Ar name . If an optional 32-bit unsigned .Ar value is also specified, an entry will match only if it has this value. See the .Sx LOOKUP TABLES section below for more information on lookup tables. .El .It Ar addr-list : ip-addr Ns Op Ns , Ns Ar addr-list .It Ar ip-addr : A host or subnet address specified in one of the following ways: .Bl -tag -width indent .It Ar numeric-ip | hostname Matches a single IPv4 address, specified as a dotted quad or a hostname. Hostnames are resolved at the time the rule is added to the firewall list. .It Ar addr Ns / Ns Ar masklen Matches all addresses with base .Ar addr (specified as an IP address, a network number, or a hostname) and mask width of .Cm masklen bits.
As an example, 1.2.3.4/25 or 1.2.3.0/25 will match all IP addresses from 1.2.3.0 to 1.2.3.127 . .It Ar addr Ns : Ns Ar mask Matches all addresses with base .Ar addr (specified as an IP address, a network number, or a hostname) and the mask of .Ar mask , specified as a dotted quad. As an example, 1.2.3.4:255.0.255.0 or 1.0.3.0:255.0.255.0 will match 1.*.3.*. This form is advised only for non-contiguous masks. It is better to resort to the .Ar addr Ns / Ns Ar masklen format for contiguous masks, which is more compact and less error-prone. .El .It Ar addr-set : addr Ns Oo Ns / Ns Ar masklen Oc Ns Cm { Ns Ar list Ns Cm } .It Ar list : Bro Ar num | num-num Brc Ns Op Ns , Ns Ar list Matches all addresses with base address .Ar addr (specified as an IP address, a network number, or a hostname) and whose last byte is in the list between braces { } . Note that there must be no spaces between braces and numbers (spaces after commas are allowed). Elements of the list can be specified as single entries or ranges. The .Ar masklen field is used to limit the size of the set of addresses, and can have any value between 24 and 32. If not specified, it is assumed to be 24. .br This format is particularly useful to handle sparse address sets within a single rule. Because the matching occurs using a bitmask, it takes constant time and dramatically reduces the complexity of rulesets. .br As an example, an address specified as 1.2.3.4/24{128,35-55,89} or 1.2.3.0/24{128,35-55,89} will match the following IP addresses: .br 1.2.3.128, 1.2.3.35 to 1.2.3.55, 1.2.3.89 . .It Ar addr6-list : ip6-addr Ns Op Ns , Ns Ar addr6-list .It Ar ip6-addr : A host or subnet specified in one of the following ways: .Bl -tag -width indent .It Ar numeric-ip | hostname Matches a single IPv6 address as allowed by .Xr inet_pton 3 or a hostname. Hostnames are resolved at the time the rule is added to the firewall list. .It Ar addr Ns / Ns Ar masklen Matches all IPv6 addresses with base .Ar addr (specified as allowed by .Xr inet_pton 3 or a hostname) and mask width of .Ar masklen bits. .It Ar addr Ns / Ns Ar mask Matches all IPv6 addresses with base .Ar addr (specified as allowed by .Xr inet_pton 3 or a hostname) and the mask of .Ar mask , specified as allowed by .Xr inet_pton 3 . As an example, fe::640:0:0/ffff::ffff:ffff:0:0 will match fe:*:*:*:0:640:*:*. This form is advised only for non-contiguous masks. It is better to resort to the .Ar addr Ns / Ns Ar masklen format for contiguous masks, which is more compact and less error-prone. .El .Pp No support for sets of IPv6 addresses is provided because IPv6 addresses are typically random past the initial prefix. .It Ar ports : Bro Ar port | port Ns \&- Ns Ar port Ns Brc Ns Op , Ns Ar ports For protocols which support port numbers (such as SCTP, TCP and UDP), optional .Cm ports may be specified as one or more ports or port ranges, separated by commas but no spaces, and an optional .Cm not operator. The .Ql \&- notation specifies a range of ports (including boundaries). .Pp Service names (from .Pa /etc/services ) may be used instead of numeric port values. The length of the port list is limited to 30 ports or ranges, though one can specify larger ranges by using an .Em or-block in the .Cm options section of the rule. .Pp A backslash .Pq Ql \e can be used to escape the dash .Pq Ql - character in a service name (from a shell, the backslash must be typed twice to avoid the shell itself interpreting it as an escape character).
.Pp .Dl "ipfw add count tcp from any ftp\e\e-data-ftp to any" .Pp Fragmented packets which have a non-zero offset (i.e., not the first fragment) will never match a rule which has one or more port specifications. See the .Cm frag option for details on matching fragmented packets. .El .Ss RULE OPTIONS (MATCH PATTERNS) Additional match patterns can be used within rules. Zero or more of these so-called .Em options can be present in a rule, optionally prefixed by the .Cm not operand, and possibly grouped into .Em or-blocks . .Pp The following match patterns can be used (listed in alphabetical order): .Bl -tag -width indent .It Cm // this is a comment. Inserts the specified text as a comment in the rule. Everything following // is considered as a comment and stored in the rule. You can have comment-only rules, which are listed as having a .Cm count action followed by the comment. .It Cm bridged Alias for .Cm layer2 . .It Cm diverted Matches only packets generated by a divert socket. .It Cm diverted-loopback Matches only packets coming from a divert socket back into the IP stack input for delivery. .It Cm diverted-output Matches only packets going from a divert socket back outward to the IP stack output for delivery. .It Cm dst-ip Ar ip-address Matches IPv4 packets whose destination IP is one of the address(es) specified as argument. .It Bro Cm dst-ip6 | dst-ipv6 Brc Ar ip6-address Matches IPv6 packets whose destination IP is one of the address(es) specified as argument. .It Cm dst-port Ar ports Matches IP packets whose destination port is one of the port(s) specified as argument. .It Cm established Matches TCP packets that have the RST or ACK bits set. .It Cm ext6hdr Ar header Matches IPv6 packets containing the extended header given by .Ar header . Supported headers are: .Pp Fragment .Pq Cm frag , Hop-by-hop options .Pq Cm hopopt , any type of Routing Header .Pq Cm route , Source routing Routing Header Type 0 .Pq Cm rthdr0 , Mobile IPv6 Routing Header Type 2 .Pq Cm rthdr2 , Destination options .Pq Cm dstopt , IPsec authentication headers .Pq Cm ah , and IPsec encapsulated security payload headers .Pq Cm esp . .It Cm fib Ar fibnum Matches a packet that has been tagged to use the given FIB (routing table) number. .It Cm flow Ar table Ns Pq Ar name Ns Op , Ns Ar value Search for the flow entry in lookup table .Ar name . If not found, the match fails. Otherwise, the match succeeds and .Cm tablearg is set to the value extracted from the table. .Pp This option can be useful to quickly dispatch traffic based on certain packet fields. See the .Sx LOOKUP TABLES section below for more information on lookup tables. .It Cm flow-id Ar labels Matches IPv6 packets containing any of the flow labels given in .Ar labels . .Ar labels is a comma separated list of numeric flow labels. .It Cm frag Matches packets that are fragments and not the first fragment of an IP datagram. Note that these packets will not have the next protocol header (e.g.\& TCP, UDP) so options that look into these headers cannot match. .It Cm gid Ar group Matches all TCP or UDP packets sent by or received for a .Ar group . A .Ar group may be specified by name or number. .It Cm jail Ar prisonID Matches all TCP or UDP packets sent by or received for the jail whose prison ID is .Ar prisonID . .It Cm icmptypes Ar types Matches ICMP packets whose ICMP type is in the list .Ar types . The list may be specified as any combination of individual types (numeric) separated by commas. .Em Ranges are not allowed .
The supported ICMP types are: .Pp echo reply .Pq Cm 0 , destination unreachable .Pq Cm 3 , source quench .Pq Cm 4 , redirect .Pq Cm 5 , echo request .Pq Cm 8 , router advertisement .Pq Cm 9 , router solicitation .Pq Cm 10 , time-to-live exceeded .Pq Cm 11 , IP header bad .Pq Cm 12 , timestamp request .Pq Cm 13 , timestamp reply .Pq Cm 14 , information request .Pq Cm 15 , information reply .Pq Cm 16 , address mask request .Pq Cm 17 and address mask reply .Pq Cm 18 . .It Cm icmp6types Ar types Matches ICMP6 packets whose ICMP6 type is in the list .Ar types . The list may be specified as any combination of individual types (numeric) separated by commas. .Em Ranges are not allowed . .It Cm in | out Matches incoming or outgoing packets, respectively. .Cm in and .Cm out are mutually exclusive (in fact, .Cm out is implemented as .Cm not in Ns No ). .It Cm ipid Ar id-list Matches IPv4 packets whose .Cm ip_id field has a value included in .Ar id-list , which is either a single value or a list of values or ranges specified in the same way as .Ar ports . .It Cm iplen Ar len-list Matches IP packets whose total length, including header and data, is in the set .Ar len-list , which is either a single value or a list of values or ranges specified in the same way as .Ar ports . .It Cm ipoptions Ar spec Matches packets whose IPv4 header contains the comma separated list of options specified in .Ar spec . The supported IP options are: .Pp .Cm ssrr (strict source route), .Cm lsrr (loose source route), .Cm rr (record packet route) and .Cm ts (timestamp). The absence of a particular option may be denoted with a .Ql \&! . .It Cm ipprecedence Ar precedence Matches IPv4 packets whose precedence field is equal to .Ar precedence . .It Cm ipsec Matches packets that have IPSEC history associated with them (i.e., the packet comes encapsulated in IPSEC, the kernel has IPSEC support, and can correctly decapsulate it). .Pp Note that specifying .Cm ipsec is different from specifying .Cm proto Ar ipsec as the latter will only look at the specific IP protocol field, irrespective of IPSEC kernel support and the validity of the IPSEC data. .Pp Further note that this flag is silently ignored in kernels without IPSEC support. It does not affect rule processing when given and the rules are handled as if with no .Cm ipsec flag. .It Cm iptos Ar spec Matches IPv4 packets whose .Cm tos field contains the comma separated list of service types specified in .Ar spec . The supported IP types of service are: .Pp .Cm lowdelay .Pq Dv IPTOS_LOWDELAY , .Cm throughput .Pq Dv IPTOS_THROUGHPUT , .Cm reliability .Pq Dv IPTOS_RELIABILITY , .Cm mincost .Pq Dv IPTOS_MINCOST , .Cm congestion .Pq Dv IPTOS_ECN_CE . The absence of a particular type may be denoted with a .Ql \&! . .It Cm dscp Ar spec Ns Op , Ns Ar spec Matches IPv4/IPv6 packets whose .Cm DS field value is contained in the .Ar spec mask. Multiple values can be specified as a comma separated list. Each value can be one of the keywords used in the .Cm setdscp action, or an exact number. .It Cm ipttl Ar ttl-list Matches IPv4 packets whose time to live is included in .Ar ttl-list , which is either a single value or a list of values or ranges specified in the same way as .Ar ports . .It Cm ipversion Ar ver Matches IP packets whose IP version field is .Ar ver . .It Cm keep-state Op Ar :flowname Upon a match, the firewall will create a dynamic rule, whose default behaviour is to match bidirectional traffic between source and destination IP/port using the same protocol.
The rule has a limited lifetime (controlled by a set of .Xr sysctl 8 variables), and the lifetime is refreshed every time a matching packet is found. The .Ar :flowname is used to assign an additional parameter, alongside addresses, ports and protocol, to the dynamic rule. It can be used for more accurate matching by a .Cm check-state rule. The .Cm :default keyword is a special name used for compatibility with old rulesets. .It Cm layer2 Matches only layer2 packets, i.e., those passed to .Nm from ether_demux() and ether_output_frame(). .It Cm limit Bro Cm src-addr | src-port | dst-addr | dst-port Brc Ar N Op Ar :flowname The firewall will only allow .Ar N connections with the same set of parameters as specified in the rule. One or more of source and destination addresses and ports can be specified. .It Cm lookup Bro Cm dst-ip | dst-port | src-ip | src-port | uid | jail Brc Ar name Search for an entry in lookup table .Ar name that matches the field specified as argument. If not found, the match fails. Otherwise, the match succeeds and .Cm tablearg is set to the value extracted from the table. .Pp This option can be useful to quickly dispatch traffic based on certain packet fields. See the .Sx LOOKUP TABLES section below for more information on lookup tables. .It Cm { MAC | mac } Ar dst-mac src-mac Matches packets with the given .Ar dst-mac and .Ar src-mac addresses, specified as the .Cm any keyword (matching any MAC address), or six groups of hex digits separated by colons, and optionally followed by a mask indicating the significant bits. The mask may be specified using either of the following methods: .Bl -enum -width indent .It A slash .Pq / followed by the number of significant bits. For example, an address with 33 significant bits could be specified as: .Pp .Dl "MAC 10:20:30:40:50:60/33 any" .It An ampersand .Pq & followed by a bitmask specified as six groups of hex digits separated by colons. For example, an address in which the last 16 bits are significant could be specified as: .Pp .Dl "MAC 10:20:30:40:50:60&00:00:00:00:ff:ff any" .Pp Note that the ampersand character has a special meaning in many shells and should generally be escaped. .El Note that the order of MAC addresses (destination first, source second) is the same as on the wire, but the opposite of the one used for IP addresses. .It Cm mac-type Ar mac-type Matches packets whose Ethernet Type field corresponds to one of those specified as argument. .Ar mac-type is specified in the same way as .Cm port numbers (i.e., one or more comma-separated single values or ranges). You can use symbolic names for known values such as .Em vlan , ipv4, ipv6 . Values can be entered as decimal or hexadecimal (if prefixed by 0x), and they are always printed as hexadecimal (unless the .Cm -N option is used, in which case symbolic resolution will be attempted). .It Cm proto Ar protocol Matches packets with the corresponding IP protocol. .It Cm recv | xmit | via Brq Ar ifX | Ar if Ns Cm * | Ar table Ns Po Ar name Ns Oo , Ns Ar value Oc Pc | Ar ipno | Ar any Matches packets received, transmitted or going through, respectively, the interface specified by exact name .Po Ar ifX Pc , by device name .Po Ar if* Pc , by IP address, or through some interface. Table .Ar name may be used to match an interface by its kernel ifindex. See the .Sx LOOKUP TABLES section below for more information on lookup tables. .Pp The .Cm via keyword causes the interface to always be checked.
If .Cm recv or .Cm xmit is used instead of .Cm via , then only the receive or transmit interface (respectively) is checked. By specifying both, it is possible to match packets based on both receive and transmit interface, e.g.: .Pp .Dl "ipfw add deny ip from any to any out recv ed0 xmit ed1" .Pp The .Cm recv interface can be tested on either incoming or outgoing packets, while the .Cm xmit interface can only be tested on outgoing packets. So .Cm out is required (and .Cm in is invalid) whenever .Cm xmit is used. .Pp A packet might not have a receive or transmit interface: packets originating from the local host have no receive interface, while packets destined for the local host have no transmit interface. .It Cm setup Matches TCP packets that have the SYN bit set but no ACK bit. This is the short form of .Dq Li tcpflags\ syn,!ack . .It Cm sockarg Matches packets that are associated with a local socket and for which the SO_USER_COOKIE socket option has been set to a non-zero value. As a side effect, the value of the option is made available as a .Cm tablearg value, which in turn can be used as a .Cm skipto or .Cm pipe number. .It Cm src-ip Ar ip-address Matches IPv4 packets whose source IP is one of the address(es) specified as an argument. .It Cm src-ip6 Ar ip6-address Matches IPv6 packets whose source IP is one of the address(es) specified as an argument. .It Cm src-port Ar ports Matches IP packets whose source port is one of the port(s) specified as argument. .It Cm tagged Ar tag-list Matches packets whose tags are included in .Ar tag-list , which is either a single value or a list of values or ranges specified in the same way as .Ar ports . Tags can be applied to the packet using the .Cm tag rule action parameter (see its description for details on tags). .It Cm tcpack Ar ack TCP packets only. Match if the TCP header acknowledgment number field is set to .Ar ack . .It Cm tcpdatalen Ar tcpdatalen-list Matches TCP packets whose TCP data length is in .Ar tcpdatalen-list , which is either a single value or a list of values or ranges specified in the same way as .Ar ports . .It Cm tcpflags Ar spec TCP packets only. Match if the TCP header contains the comma separated list of flags specified in .Ar spec . The supported TCP flags are: .Pp .Cm fin , .Cm syn , .Cm rst , .Cm psh , .Cm ack and .Cm urg . The absence of a particular flag may be denoted with a .Ql \&! . A rule which contains a .Cm tcpflags specification can never match a fragmented packet which has a non-zero offset. See the .Cm frag option for details on matching fragmented packets. .It Cm tcpseq Ar seq TCP packets only. Match if the TCP header sequence number field is set to .Ar seq . .It Cm tcpwin Ar tcpwin-list Matches TCP packets whose header window field is set to .Ar tcpwin-list , which is either a single value or a list of values or ranges specified in the same way as .Ar ports . .It Cm tcpoptions Ar spec TCP packets only. Match if the TCP header contains the comma separated list of options specified in .Ar spec . The supported TCP options are: .Pp .Cm mss (maximum segment size), .Cm window (tcp window advertisement), .Cm sack (selective ack), .Cm ts (rfc1323 timestamp) and .Cm cc (rfc1644 t/tcp connection count). The absence of a particular option may be denoted with a .Ql \&! . .It Cm uid Ar user Matches all TCP or UDP packets sent by or received for a .Ar user . A .Ar user may be matched by name or identification number. .It Cm verrevpath For incoming packets, a routing table lookup is done on the packet's source address.
If the interface on which the packet entered the system matches the outgoing interface for the route, the packet matches. If the interfaces do not match up, the packet does not match. All outgoing packets or packets with no incoming interface match. .Pp The name and functionality of the option is intentionally similar to the Cisco IOS command: .Pp .Dl ip verify unicast reverse-path .Pp This option can be used to make anti-spoofing rules to reject all packets with source addresses not from this interface. See also the option .Cm antispoof . .It Cm versrcreach For incoming packets, a routing table lookup is done on the packet's source address. If a route to the source address exists, but not the default route or a blackhole/reject route, the packet matches. Otherwise, the packet does not match. All outgoing packets match. .Pp The name and functionality of the option is intentionally similar to the Cisco IOS command: .Pp .Dl ip verify unicast source reachable-via any .Pp This option can be used to make anti-spoofing rules to reject all packets whose source address is unreachable. .It Cm antispoof For incoming packets, the packet's source address is checked to see whether it belongs to a directly connected network. If the network is directly connected, then the interface the packet came in on is compared to the interface the network is connected to. If the incoming interface and the directly connected interface are not the same, the packet does not match; otherwise, it matches. All outgoing packets match. .Pp This option can be used to make anti-spoofing rules to reject all packets that pretend to be from a directly connected network but do not come in through that interface. This option is similar to but more restricted than .Cm verrevpath because it engages only on packets with source addresses of directly connected networks instead of all source addresses. .El .Sh LOOKUP TABLES Lookup tables are useful to handle large sparse sets of addresses or other search keys (e.g., ports, jail IDs, interface names). In the rest of this section we will use the term ``key''. A table name needs to match the following spec: .Ar table-name . Tables with the same name can be created in different .Ar sets . However, rules link to tables in .Ar set 0 by default. This behavior can be controlled by the .Va net.inet.ip.fw.tables_sets variable. See the .Sx SETS OF RULES section for more information. There may be up to 65535 different lookup tables. .Pp The following table types are supported: .Bl -tag -width indent .It Ar table-type : Ar addr | iface | number | flow .It Ar table-key : Ar addr Ns Oo / Ns Ar masklen Oc | iface-name | number | flow-spec .It Ar flow-spec : Ar flow-field Ns Op , Ns Ar flow-spec .It Ar flow-field : src-ip | proto | src-port | dst-ip | dst-port .It Cm addr Matches IPv4 or IPv6 addresses. Each entry is represented by an .Ar addr Ns Op / Ns Ar masklen and will match all addresses with base .Ar addr (specified as an IPv4/IPv6 address, or a hostname) and mask width of .Ar masklen bits. If .Ar masklen is not specified, it defaults to 32 for IPv4 and 128 for IPv6. When looking up an IP address in a table, the most specific entry will match. .It Cm iface Matches interface names. Each entry is represented by a string treated as an interface name. Wildcards are not supported. .It Cm number Matches protocol ports, uids/gids or jail IDs. Each entry is represented by a 32-bit unsigned integer. Ranges are not supported. .It Cm flow Matches packet fields specified by .Ar flow type suboptions with table entries.
.El .Pp Tables require explicit creation via .Cm create before use. .Pp The following creation options are supported: .Bl -tag -width indent .It Ar create-options : Ar create-option | create-options .It Ar create-option : Cm type Ar table-type | Cm valtype Ar value-mask | Cm algo Ar algo-desc | .Cm limit Ar number | Cm locked .It Cm type Table key type. .It Cm valtype Table value mask. .It Cm algo Table algorithm to use (see below). .It Cm limit Maximum number of items that may be inserted into the table. .It Cm locked Restrict any table modifications. .El .Pp Some of these options may be modified later via the .Cm modify keyword. The following options can be changed: .Bl -tag -width indent .It Ar modify-options : Ar modify-option | modify-options .It Ar modify-option : Cm limit Ar number .It Cm limit Alter the maximum number of items that may be inserted into the table. .El .Pp Additionally, a table can be locked or unlocked using the .Cm lock or .Cm unlock commands. .Pp Tables of the same .Ar type can be swapped with each other using the .Cm swap Ar name command. The swap may fail if table limits are set and the data exchange would exceed a limit. The operation is performed atomically. .Pp One or more entries can be added to a table at once using the .Cm add command. Addition of all items is performed atomically. By default, a failure to add one entry does not prevent the other entries from being added; however, a non-zero error code is returned in that case. The special .Cm atomic keyword may be specified before .Cm add to request an all-or-none addition. .Pp One or more entries can be removed from a table at once using the .Cm delete command. By default, a failure to remove one entry does not prevent the other entries from being removed; however, a non-zero error code is returned in that case. .Pp It may be possible to check which entry will be found for a particular .Ar table-key using the .Cm lookup .Ar table-key command. This functionality is optional and may be unsupported by some algorithms. .Pp The following operations can be performed on .Ar one or .Cm all tables: .Bl -tag -width indent .It Cm list Lists all entries. .It Cm flush Removes all entries. .It Cm info Shows generic table information. .It Cm detail Shows generic table information and algo-specific data. .El .Pp The following lookup algorithms are supported: .Bl -tag -width indent .It Ar algo-desc : algo-name | "algo-name algo-data" .It Ar algo-name: Ar addr:radix | addr:hash | iface:array | number:array | flow:hash .It Cm addr:radix Separate Radix trees for IPv4 and IPv6, the same way as the routing table (see .Xr route 4 ) . The default choice for the .Ar addr type. .It Cm addr:hash Separate auto-growing hashes for IPv4 and IPv6. Accepts entries with the same mask length specified initially via the .Cm "addr:hash masks=/v4,/v6" algorithm creation options. The masks are assumed to be /32 and /128 by default. A search removes the host bits (according to the mask) from the supplied address and checks the resulting key in the appropriate hash. Mostly optimized for /64 and byte-ranged IPv6 masks. .It Cm iface:array An array storing sorted indexes of entries which are present in the system. Optimized for very fast lookup. .It Cm number:array An array storing sorted u32 numbers. .It Cm flow:hash An auto-growing hash storing flow entries. A search calculates a hash over the required packet fields and looks for matching entries in the selected bucket. .El
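.Pp As a minimal illustration of the commands described above, a table might be created, populated and referenced from a rule as follows (the table name and addresses are examples only; note the quoting needed to protect the parentheses from the shell):
.Bd -literal -offset indent
ipfw table badhosts create type addr
ipfw table badhosts add 192.0.2.0/24 198.51.100.17
ipfw add 100 deny ip from 'table(badhosts)' to any
.Ed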
.Pp The .Cm tablearg feature provides the ability to use a value, looked up in the table, as the argument for a rule action, action parameter or rule option. This can significantly reduce the number of rules in some configurations. If two tables are used in a rule, the result of the second (destination) is used. .Pp Each record may hold one or more values according to .Ar value-mask . This mask is set on table creation via the .Cm valtype option. The following value types are supported: .Bl -tag -width indent .It Ar value-mask : Ar value-type Ns Op , Ns Ar value-mask .It Ar value-type : Ar skipto | pipe | fib | nat | dscp | tag | divert | .Ar netgraph | limit | ipv4 | ipv6 .It Cm skipto Rule number to jump to. .It Cm pipe Pipe number to use. .It Cm fib FIB number to match/set. .It Cm nat Nat number to jump to. .It Cm dscp DSCP value to match/set. .It Cm tag Tag number to match/set. .It Cm divert Port number to divert traffic to. .It Cm netgraph Hook number to move packet to. .It Cm limit Maximum number of connections. .It Cm ipv4 IPv4 nexthop to fwd packets to. .It Cm ipv6 IPv6 nexthop to fwd packets to. .El .Pp The .Cm tablearg argument can be used with the following actions: .Cm nat , pipe , queue , divert , tee , netgraph , ngtee , fwd , skipto , setfib , action parameters: .Cm tag , untag , rule options: .Cm limit , tagged . .Pp When used with the .Cm skipto action, the user should be aware that the code will walk the ruleset up to a rule equal to, or past, the given number. .Pp See the .Sx EXAMPLES Section for example usage of tables and the tablearg keyword. .Sh SETS OF RULES Each rule or table belongs to one of 32 different .Em sets , numbered 0 to 31. Set 31 is reserved for the default rule. .Pp By default, rules or tables are put in set 0, unless you use the .Cm set N attribute when adding a new rule or table. Sets can be individually and atomically enabled or disabled, so this mechanism permits an easy way to store multiple configurations of the firewall and quickly (and atomically) switch between them. .Pp By default, tables from set 0 are referenced when adding a rule with table opcodes, regardless of the rule's set. This behavior can be changed by setting the -.Va net.inet.ip.fw.tables_set +.Va net.inet.ip.fw.tables_sets variable to 1. The rule's set will then be used for table references. .Pp The command to enable/disable sets is .Bd -ragged -offset indent .Nm .Cm set Oo Cm disable Ar number ... Oc Op Cm enable Ar number ... .Ed .Pp where multiple .Cm enable or .Cm disable sections can be specified. Command execution is atomic on all the sets specified in the command. By default, all sets are enabled. .Pp When you disable a set, its rules behave as if they do not exist in the firewall configuration, with only one exception: .Bd -ragged -offset indent dynamic rules created from a rule before it had been disabled will still be active until they expire. In order to delete dynamic rules you have to explicitly delete the parent rule which generated them. .Ed .Pp The set number of rules can be changed with the command .Bd -ragged -offset indent .Nm .Cm set move .Brq Cm rule Ar rule-number | old-set .Cm to Ar new-set .Ed .Pp Also, you can atomically swap two rulesets with the command .Bd -ragged -offset indent .Nm .Cm set swap Ar first-set second-set .Ed .Pp See the .Sx EXAMPLES Section on some possible uses of sets of rules.
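.Pp For instance, a new configuration can be staged in a disabled set and then switched in atomically; a sketch (the set number and rule numbers are arbitrary):
.Bd -literal -offset indent
ipfw set disable 18
ipfw add 100 set 18 allow ip from me to any
# ... load the rest of the new ruleset into set 18 ...
ipfw set swap 18 0
ipfw delete set 18
.Ed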
.Sh STATEFUL FIREWALL Stateful operation is a way for the firewall to dynamically create rules for specific flows when packets that match a given pattern are detected. Support for stateful operation comes through the .Cm check-state , keep-state and .Cm limit options of .Nm rules . .Pp Dynamic rules are created when a packet matches a .Cm keep-state or .Cm limit rule, causing the creation of a .Em dynamic rule which will match all and only packets with a given .Em protocol between a .Em src-ip/src-port dst-ip/dst-port pair of addresses .Em ( src and .Em dst are used here only to denote the initial match addresses, but they are completely equivalent afterwards). Rules created by the .Cm keep-state option also have a .Ar :flowname taken from it. This name is used in matching together with addresses, ports and protocol. Dynamic rules will be checked at the first .Cm check-state , keep-state or .Cm limit occurrence, and the action performed upon a match will be the same as in the parent rule. .Pp Note that no attributes other than protocol, IP addresses, ports and :flowname are checked on dynamic rules. .Pp The typical use of dynamic rules is to keep a closed firewall configuration, but let the first TCP SYN packet from the inside network install a dynamic rule for the flow so that packets belonging to that session will be allowed through the firewall: .Pp .Dl "ipfw add check-state :OUTBOUND" .Dl "ipfw add allow tcp from my-subnet to any setup keep-state :OUTBOUND" .Dl "ipfw add deny tcp from any to any" .Pp A similar approach can be used for UDP, where a UDP packet coming from the inside will install a dynamic rule to let the response through the firewall: .Pp .Dl "ipfw add check-state :OUTBOUND" .Dl "ipfw add allow udp from my-subnet to any keep-state :OUTBOUND" .Dl "ipfw add deny udp from any to any" .Pp Dynamic rules expire after some time, which depends on the status of the flow and the setting of some .Cm sysctl variables. See Section .Sx SYSCTL VARIABLES for more details. For TCP sessions, dynamic rules can be instructed to periodically send keepalive packets to refresh the state of the rule when it is about to expire. .Pp See Section .Sx EXAMPLES for more examples on how to use dynamic rules. .Sh TRAFFIC SHAPER (DUMMYNET) CONFIGURATION .Nm is also the user interface for the .Nm dummynet traffic shaper, packet scheduler and network emulator, a subsystem that can artificially queue, delay or drop packets emulating the behaviour of certain network links or queueing systems. .Pp .Nm dummynet operates by first using the firewall to select packets using any match pattern that can be used in .Nm rules. Matching packets are then passed to either of two different objects, which implement the traffic regulation: .Bl -hang -offset XXXX .It Em pipe A .Em pipe emulates a .Em link with given bandwidth and propagation delay, driven by a FIFO scheduler and a single queue with programmable queue size and packet loss rate. Packets are appended to the queue as they come out from .Nm ipfw , and then transferred in FIFO order to the link at the desired rate. .It Em queue A .Em queue is an abstraction used to implement packet scheduling using one of several packet scheduling algorithms. Packets sent to a .Em queue are first grouped into flows according to a mask on the 5-tuple. Flows are then passed to the scheduler associated with the .Em queue , and each flow uses scheduling parameters (weight and others) as configured in the .Em queue itself. A scheduler in turn is connected to an emulated link, and arbitrates the link's bandwidth among backlogged flows according to weights and to the features of the scheduling algorithm in use.
.El .Pp In practice, .Em pipes can be used to set hard limits to the bandwidth that a flow can use, whereas .Em queues can be used to determine how different flows share the available bandwidth. .Pp A graphical representation of the binding of queues, flows, schedulers and links is below.
.Bd -literal -offset indent
                 (flow_mask|sched_mask)  sched_mask
         +---------+   weight Wx  +-------------+
         |         |->-[flow]-->--|             |-+
    -->--| QUEUE x |   ...        |             | |
         |         |->-[flow]-->--| SCHEDuler N | |
         +---------+              |             | |
             ...                  |             +--[LINK N]-->--
         +---------+   weight Wy  |             | +--[LINK N]-->--
         |         |->-[flow]-->--|             | |
    -->--| QUEUE y |   ...        |             | |
         |         |->-[flow]-->--|             | |
         +---------+              +-------------+ |
                                    +-------------+
.Ed
It is important to understand the role of the SCHED_MASK and FLOW_MASK, which are configured through the commands .Dl "ipfw sched N config mask SCHED_MASK ..." and .Dl "ipfw queue X config mask FLOW_MASK ..." . .Pp The SCHED_MASK is used to assign flows to one or more scheduler instances, one for each value of the packet's 5-tuple after applying SCHED_MASK. As an example, using ``src-ip 0xffffff00'' creates one instance for each /24 source subnet. .Pp The FLOW_MASK, together with the SCHED_MASK, is used to split packets into flows. As an example, using ``src-ip 0x000000ff'' together with the previous SCHED_MASK makes a flow for each individual source address. In turn, flows for each /24 subnet will be sent to the same scheduler instance. .Pp The above diagram holds even for the .Em pipe case, with the only restriction that a .Em pipe only supports a SCHED_MASK, and forces the use of a FIFO scheduler (these are for backward compatibility reasons; in fact, internally, a .Nm dummynet's pipe is implemented exactly as above). .Pp There are two modes of .Nm dummynet operation: .Dq normal and .Dq fast . The .Dq normal mode tries to emulate a real link: the .Nm dummynet scheduler ensures that the packet will not leave the pipe faster than it would on the real link with a given bandwidth. The .Dq fast mode allows certain packets to bypass the .Nm dummynet scheduler (if the packet flow does not exceed the pipe's bandwidth). This is the reason why the .Dq fast mode requires fewer CPU cycles per packet (on average) and packet latency can be significantly lower in comparison to a real link with the same bandwidth. The default mode is .Dq normal . The .Dq fast mode can be enabled by setting the .Va net.inet.ip.dummynet.io_fast .Xr sysctl 8 variable to a non-zero value. .Pp .Ss PIPE, QUEUE AND SCHEDULER CONFIGURATION The .Em pipe , .Em queue and .Em scheduler configuration commands are the following: .Bd -ragged -offset indent .Cm pipe Ar number Cm config Ar pipe-configuration .Pp .Cm queue Ar number Cm config Ar queue-configuration .Pp .Cm sched Ar number Cm config Ar sched-configuration .Ed .Pp The following parameters can be configured for a pipe: .Pp .Bl -tag -width indent -compact .It Cm bw Ar bandwidth | device Bandwidth, measured in .Sm off .Op Cm K | M | G .Brq Cm bit/s | Byte/s . .Sm on .Pp A value of 0 (default) means unlimited bandwidth. The unit must immediately follow the number, as in .Pp .Dl "ipfw pipe 1 config bw 300Kbit/s" .Pp If a device name is specified instead of a numeric value, as in .Pp .Dl "ipfw pipe 1 config bw tun0" .Pp then the transmit clock is supplied by the specified device. At the moment only the .Xr tun 4 device supports this functionality, for use in conjunction with .Xr ppp 8 . .Pp .It Cm delay Ar ms-delay Propagation delay, measured in milliseconds.
The value is rounded to the next multiple of the clock tick (typically 10ms, but it is a good practice to run kernels with .Dq "options HZ=1000" to reduce the granularity to 1ms or less). The default value is 0, meaning no delay. .Pp .It Cm burst Ar size If the data to be sent exceeds the pipe's bandwidth limit (and the pipe was previously idle), up to .Ar size bytes of data are allowed to bypass the .Nm dummynet scheduler, and will be sent as fast as the physical link allows. Any additional data will be transmitted at the rate specified by the .Nm pipe bandwidth. The burst size depends on how long the pipe has been idle; the effective burst size is calculated as follows: MAX( .Ar size , .Nm bw * pipe_idle_time). .Pp .It Cm profile Ar filename A file specifying the additional overhead incurred in the transmission of a packet on the link. .Pp Some link types introduce extra delays in the transmission of a packet, e.g., because of MAC level framing, contention on the use of the channel, MAC level retransmissions and so on. From our point of view, the channel is effectively unavailable for this extra time, which is constant or variable depending on the link type. Additionally, packets may be dropped after this time (e.g., on a wireless link after too many retransmissions). We can model the additional delay with an empirical curve that represents its distribution.
.Bd -literal -offset indent
cumulative probability
 1.0 ^
     |
L    +-- loss-level                x
     |                       ******
     |                      *
     |                 *****
     |                *
     |              **
     |             *
     +-------*------------------->
                            delay
.Ed
The empirical curve may have both vertical and horizontal lines. Vertical lines represent constant delay for a range of probabilities. Horizontal lines correspond to a discontinuity in the delay distribution: the pipe will use the largest delay for a given probability. .Pp The file format is the following, with whitespace acting as a separator and '#' indicating the beginning of a comment: .Bl -tag -width indent .It Cm name Ar identifier optional name (listed by "ipfw pipe show") to identify the delay distribution; .It Cm bw Ar value the bandwidth used for the pipe. If not specified here, it must be present explicitly as a configuration parameter for the pipe; .It Cm loss-level Ar L the probability above which packets are lost (0.0 <= L <= 1.0; default 1.0, i.e., no loss); .It Cm samples Ar N the number of samples used in the internal representation of the curve (2..1024; default 100); .It Cm "delay prob" | "prob delay" One of these two lines is mandatory and defines the format of the following lines with data points. .It Ar XXX Ar YYY 2 or more lines representing points in the curve, with either delay or probability first, according to the chosen format. The unit for delay is milliseconds. Data points do not need to be sorted. Also, the number of actual lines can be different from the value of the "samples" parameter: the .Nm utility will sort and interpolate the curve as needed. .El .Pp Example of a profile file:
.Bd -literal -offset indent
name          bla_bla_bla
samples       100
loss-level    0.86
prob delay
0    200	# minimum overhead is 200ms
0.5  200
0.5  300
0.8  1000
0.9  1300
1    1300
#configuration file end
.Ed
.El .Pp The following parameters can be configured for a queue: .Pp .Bl -tag -width indent -compact .It Cm pipe Ar pipe_nr Connects a queue to the specified pipe. Multiple queues (with the same or different weights) can be connected to the same pipe, which specifies the aggregate rate for the set of queues. .Pp .It Cm weight Ar weight Specifies the weight to be used for flows matching this queue.
The weight must be in the range 1..100, and defaults to 1. .El .Pp The following case-insensitive parameters can be configured for a scheduler: .Pp .Bl -tag -width indent -compact .It Cm type Ar {fifo | wf2q+ | rr | qfq} specifies the scheduling algorithm to use. .Bl -tag -width indent -compact .It Cm fifo is just a FIFO scheduler (which means that all packets are stored in the same queue as they arrive at the scheduler). FIFO has O(1) per-packet time complexity, with very low constants (estimate 60-80ns on a 2GHz desktop machine) but gives no service guarantees. .It Cm wf2q+ implements the WF2Q+ algorithm, which is a Weighted Fair Queueing algorithm which permits flows to share bandwidth according to their weights. Note that weights are not priorities; even a flow with a minuscule weight will never starve. WF2Q+ has O(log N) per-packet processing cost, where N is the number of flows, and is the default algorithm used by queues in previous versions of .Nm dummynet . .It Cm rr implements the Deficit Round Robin algorithm, which has O(1) processing costs (roughly, 100-150ns per packet) and permits bandwidth allocation according to weights, but with poor service guarantees. .It Cm qfq implements the QFQ algorithm, which is a very fast variant of WF2Q+, with similar service guarantees and O(1) processing costs (roughly, 200-250ns per packet). .El .El .Pp In addition to the type, all parameters allowed for a pipe can also be specified for a scheduler. .Pp Finally, the following parameters can be configured for both pipes and queues: .Pp .Bl -tag -width XXXX -compact .It Cm buckets Ar hash-table-size Specifies the size of the hash table used for storing the various queues. The default value is 64, controlled by the .Xr sysctl 8 variable .Va net.inet.ip.dummynet.hash_size ; the allowed range is 16 to 65536. .Pp .It Cm mask Ar mask-specifier Packets sent to a given pipe or queue by an .Nm rule can be further classified into multiple flows, each of which is then sent to a different .Em dynamic pipe or queue. A flow identifier is constructed by masking the IP addresses, ports and protocol types as specified with the .Cm mask options in the configuration of the pipe or queue. For each different flow identifier, a new pipe or queue is created with the same parameters as the original object, and matching packets are sent to it. .Pp Thus, when .Em dynamic pipes are used, each flow will get the same bandwidth as defined by the pipe, whereas when .Em dynamic queues are used, each flow will share the parent's pipe bandwidth evenly with other flows generated by the same queue (note that other queues with different weights might be connected to the same pipe). .br Available mask specifiers are a combination of one or more of the following: .Pp .Cm dst-ip Ar mask , .Cm dst-ip6 Ar mask , .Cm src-ip Ar mask , .Cm src-ip6 Ar mask , .Cm dst-port Ar mask , .Cm src-port Ar mask , .Cm flow-id Ar mask , .Cm proto Ar mask or .Cm all , .Pp where the latter means all bits in all fields are significant. .Pp .It Cm noerror When a packet is dropped by a .Nm dummynet queue or pipe, the error is normally reported to the caller routine in the kernel, in the same way as it happens when a device queue fills up. Setting this option reports the packet as successfully delivered, which can be needed for some experimental setups where you want to simulate loss or congestion at a remote router. .Pp .It Cm plr Ar packet-loss-rate Packet loss rate.
Argument .Ar packet-loss-rate is a floating-point number between 0 and 1, with 0 meaning no loss and 1 meaning 100% loss. The loss rate is internally represented on 31 bits. .Pp .It Cm queue Brq Ar slots | size Ns Cm Kbytes Queue size, in .Ar slots or .Cm KBytes . The default value is 50 slots, which is the typical queue size for Ethernet devices. Note that for slow speed links you should keep the queue size short or your traffic might be affected by a significant queueing delay. E.g., 50 max-sized ethernet packets (1500 bytes) mean 600Kbit or 20s of queue on a 30Kbit/s pipe. Even worse effects can result if you get packets from an interface with a much larger MTU, e.g.\& the loopback interface with its 16KB packets. The .Xr sysctl 8 variables .Va net.inet.ip.dummynet.pipe_byte_limit and .Va net.inet.ip.dummynet.pipe_slot_limit control the maximum lengths that can be specified. .Pp .It Cm red | gred Ar w_q Ns / Ns Ar min_th Ns / Ns Ar max_th Ns / Ns Ar max_p [ecn] Make use of the RED (Random Early Detection) queue management algorithm. .Ar w_q and .Ar max_p are floating point numbers between 0 and 1 (inclusive), while .Ar min_th and .Ar max_th are integer numbers specifying thresholds for queue management (thresholds are computed in bytes if the queue has been defined in bytes, in slots otherwise). The two parameters can also be of the same value if needed. The .Nm dummynet subsystem also supports the gentle RED variant (gred) and ECN (Explicit Congestion Notification) as optional features. Three .Xr sysctl 8 variables can be used to control the RED behaviour: .Bl -tag -width indent .It Va net.inet.ip.dummynet.red_lookup_depth specifies the accuracy in computing the average queue when the link is idle (defaults to 256, must be greater than zero) .It Va net.inet.ip.dummynet.red_avg_pkt_size specifies the expected average packet size (defaults to 512, must be greater than zero) .It Va net.inet.ip.dummynet.red_max_pkt_size specifies the expected maximum packet size, only used when queue thresholds are in bytes (defaults to 1500, must be greater than zero). .El .El .Pp When used with IPv6 data, .Nm dummynet currently has several limitations. Information necessary to route link-local packets to an interface is not available after processing by .Nm dummynet , so those packets are dropped in the output path. Care should be taken to ensure that link-local packets are not passed to .Nm dummynet . .Sh CHECKLIST Here are some important points to consider when designing your rules: .Bl -bullet .It Remember that you filter both packets going .Cm in and .Cm out . Most connections need packets going in both directions. .It Remember to test very carefully. It is a good idea to be near the console when doing this. If you cannot be near the console, use an auto-recovery script such as the one in .Pa /usr/share/examples/ipfw/change_rules.sh . .It Do not forget the loopback interface. .El .Sh FINE POINTS .Bl -bullet .It There are circumstances where fragmented datagrams are unconditionally dropped. TCP packets are dropped if they do not contain at least 20 bytes of TCP header, UDP packets are dropped if they do not contain a full 8 byte UDP header, and ICMP packets are dropped if they do not contain 4 bytes of ICMP header, enough to specify the ICMP type, code, and checksum. These packets are simply logged as .Dq pullup failed since there may not be enough good data in the packet to produce a meaningful log entry. .It Another type of packet is unconditionally dropped: a TCP packet with a fragment offset of one.
This is a valid packet, but it only has one use: to try to circumvent firewalls. When logging is enabled, these packets are reported as being dropped by rule -1. .It If you are logged in over a network, loading the .Xr kld 4 version of .Nm is probably not as straightforward as you would think. The following command line is recommended:
.Bd -literal -offset indent
kldload ipfw && \e
ipfw add 32000 allow ip from any to any
.Ed
.Pp Along the same lines, doing an
.Bd -literal -offset indent
ipfw flush
.Ed
.Pp in similar surroundings is also a bad idea. .It The .Nm filter list may not be modified if the system security level is set to 3 or higher (see .Xr init 8 for information on system security levels). .El .Sh PACKET DIVERSION A .Xr divert 4 socket bound to the specified port will receive all packets diverted to that port. If no socket is bound to the destination port, or if the divert module is not loaded, or if the kernel was not compiled with divert socket support, the packets are dropped. .Sh NETWORK ADDRESS TRANSLATION (NAT) .Nm supports in-kernel NAT using the kernel version of .Xr libalias 3 . The kernel module .Cm ipfw_nat should be loaded, or the kernel should have .Cm options IPFIREWALL_NAT , to be able to use NAT. .Pp The nat configuration command is the following: .Bd -ragged -offset indent .Bk -words .Cm nat .Ar nat_number .Cm config .Ar nat-configuration .Ek .Ed .Pp The following parameters can be configured: .Bl -tag -width indent .It Cm ip Ar ip_address Define an IP address to use for aliasing. .It Cm if Ar nic Use the IP address of the NIC for aliasing, dynamically changing it if the NIC's IP address changes. .It Cm log Enable logging on this nat instance. .It Cm deny_in Deny any incoming connection from the outside world. .It Cm same_ports Try to leave the alias port numbers unchanged from the actual local port numbers. .It Cm unreg_only Traffic on the local network not originating from unregistered address spaces will be ignored. .It Cm reset Reset the table of the packet aliasing engine on address change. .It Cm reverse Reverse the way libalias handles aliasing. .It Cm proxy_only Obey transparent proxy rules only; packet aliasing is not performed. .It Cm skip_global Skip this instance in case of global state lookup (see below). .El .Pp Some special values can be supplied instead of .Va nat_number : .Bl -tag -width indent .It Cm global Looks up the translation state in all configured nat instances. If an entry is found, the packet is aliased according to that entry. If no entry was found in any of the instances, the packet is passed unchanged, and no new entry will be created. See section .Sx MULTIPLE INSTANCES in .Xr natd 8 for more information. .It Cm tablearg Uses the argument supplied in a lookup table. See the .Sx LOOKUP TABLES section below for more information on lookup tables. .El .Pp To let the packet continue after being (de)aliased, set the sysctl variable .Va net.inet.ip.fw.one_pass to 0. For more information about aliasing modes, refer to .Xr libalias 3 . See Section .Sx EXAMPLES for some examples about nat usage.
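.Pp A minimal configuration, aliasing all traffic passing through a single external interface, could look as follows (the interface name .Ar em0 and the rule number are illustrative):
.Bd -literal -offset indent
ipfw nat 1 config if em0 same_ports reset
ipfw add 100 nat 1 ip from any to any via em0
.Ed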
.Ss REDIRECT AND LSNAT SUPPORT IN IPFW Redirect and LSNAT support follow closely the syntax used in .Xr natd 8 . See Section .Sx EXAMPLES for some examples on how to do redirect and lsnat. .Ss SCTP NAT SUPPORT SCTP nat can be configured in a similar manner to TCP through the .Nm command line tool. The main difference is that .Nm sctp nat does not do port translation. Since the local and global side ports will be the same, there is no need to specify both. Ports are redirected as follows: .Bd -ragged -offset indent .Bk -words .Cm nat .Ar nat_number .Cm config if .Ar nic .Cm redirect_port sctp .Ar ip_address [,addr_list] {[port | port-port] [,ports]} .Ek .Ed .Pp Most .Nm sctp nat configuration can be done in real-time through the .Xr sysctl 8 interface. All may be changed dynamically, though the hash_table size will only change for new .Nm nat instances. See .Sx SYSCTL VARIABLES for more info. .Sh IPv6/IPv4 NETWORK ADDRESS AND PROTOCOL TRANSLATION .Nm supports in-kernel IPv6/IPv4 network address and protocol translation. Stateful NAT64 translation allows IPv6-only clients to contact IPv4 servers using unicast TCP, UDP or ICMP protocols. One or more IPv4 addresses assigned to a stateful NAT64 translator are shared among several IPv6-only clients. When stateful NAT64 is used in conjunction with DNS64, no changes are usually required in the IPv6 client or the IPv4 server. The kernel module .Cm ipfw_nat64 should be loaded, or the kernel should have .Cm options IPFIREWALL_NAT64 , to be able to use the stateful NAT64 translator. .Pp Stateful NAT64 allocates memory for several types of objects. When an IPv6 client initiates a connection, the NAT64 translator creates a host entry in the states table. Each host entry has a number of ports group entries allocated on demand. Ports group entries contain connection state entries. There are several options to control limits and lifetime for these objects. .Pp The NAT64 translator follows RFC7915 when doing ICMPv6/ICMP translation; unsupported message types will be silently dropped. IPv6 needs several ICMPv6 message types to be explicitly allowed for correct operation. Make sure that ND6 neighbor solicitation (ICMPv6 type 135) and neighbor advertisement (ICMPv6 type 136) messages will not be handled by translation rules. .Pp After translation, the NAT64 translator sends packets through the corresponding netisr queue. Thus the translator host should be configured as an IPv4 and IPv6 router. .Pp Currently both stateful and stateless NAT64 translators use the Well-Known IPv6 Prefix .Ar 64:ff9b::/96 to represent IPv4 addresses within IPv6 addresses. Thus the DNS64 service and routing should be configured to use the Well-Known IPv6 Prefix. .Pp The stateful NAT64 configuration command is the following: .Bd -ragged -offset indent .Bk -words .Cm nat64lsn .Ar name .Cm create .Ar create-options .Ek .Ed .Pp The following parameters can be configured: .Bl -tag -width indent .It Cm prefix4 Ar ipv4_prefix/mask The IPv4 prefix with mask defines the pool of IPv4 addresses used as source addresses after translation. The stateful NAT64 module translates the IPv6 source address of a client to one IPv4 address from this pool. Note that incoming IPv4 packets that do not have a corresponding state entry in the states table will be dropped by the translator. Make sure that translation rules handle packets destined for the configured prefix. .It Cm max_ports Ar number The maximum number of ports reserved for upper level protocols for one IPv6 client. All reserved ports are divided into chunks between supported protocols. The number of connections from one IPv6 client is limited by this option. Note that closed TCP connections still remain in the list of connections until the .Cm tcp_close_age interval expires. Default value is .Ar 2048 . .It Cm host_del_age Ar seconds The number of seconds until the host entry for an IPv6 client will be deleted and all its resources will be released due to inactivity. Default value is .Ar 3600 .
.It Cm pg_del_age Ar seconds The number of seconds until a ports group with unused state entries will be released. Default value is .Ar 900 . .It Cm tcp_syn_age Ar seconds The number of seconds a state entry for a TCP connection with only a SYN sent will be kept. If the TCP connection is not established in that time, the state entry will be deleted. Default value is .Ar 10 . .It Cm tcp_est_age Ar seconds The number of seconds a state entry for an established TCP connection will be kept. Default value is .Ar 7200 . .It Cm tcp_close_age Ar seconds The number of seconds a state entry for a closed TCP connection will be kept. Keeping state entries for closed connections is needed because IPv4 servers typically keep closed connections in a TIME_WAIT state for several minutes. Since the translator's IPv4 addresses are shared among all IPv6 clients, new connections from the same addresses and ports may be rejected by the server while these connections are still in a TIME_WAIT state. Keeping them in the translator's state table protects against such rejections. Default value is .Ar 180 . .It Cm udp_age Ar seconds The number of seconds the translator keeps a state entry while waiting for a reply to a sent UDP datagram. Default value is .Ar 120 . .It Cm icmp_age Ar seconds The number of seconds the translator keeps a state entry while waiting for a reply to a sent ICMP message. Default value is .Ar 60 . .It Cm log Turn on logging of all handled packets via BPF through the .Ar ipfwlog0 interface. .Ar ipfwlog0 is a pseudo interface and can be created manually after boot with the .Cm ifconfig command. Note that it has a different purpose than the .Ar ipfw0 interface. The translator sends additional information to BPF with each packet. With .Cm tcpdump you are able to see each handled packet before and after translation. .It Cm -log Turn off logging of all handled packets via BPF. .El .Pp To inspect the states table of a stateful NAT64 instance, the following command can be used: .Bd -ragged -offset indent .Bk -words .Cm nat64lsn .Ar name .Cm show Cm states .Ek .Ed .Pp The stateless NAT64 translator does not use a states table for translation; it converts IPv4 addresses to IPv6 and vice versa solely based on the mappings taken from configured lookup tables. Since no states table is used by the stateless translator, it can be configured to pass IPv4 clients to IPv6-only servers. .Pp The stateless NAT64 configuration command is the following: .Bd -ragged -offset indent .Bk -words .Cm nat64stl .Ar name .Cm create .Ar create-options .Ek .Ed .Pp The following parameters can be configured: .Bl -tag -width indent .It Cm table4 Ar table46 The lookup table .Ar table46 contains the mapping describing how IPv4 addresses should be translated to IPv6 addresses. .It Cm table6 Ar table64 The lookup table .Ar table64 contains the mapping describing how IPv6 addresses should be translated to IPv4 addresses. .It Cm log Turn on logging of all handled packets via BPF through the .Ar ipfwlog0 interface. .It Cm -log Turn off logging of all handled packets via BPF. .El .Pp Note that the behavior of the stateless translator with respect to unmatched packets differs from the stateful translator. If a corresponding address is not found in the lookup tables, the packet will not be dropped and the search continues.
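.Pp A small stateful NAT64 setup using the Well-Known Prefix might look like the following sketch, which first allows ND6 messages as recommended above (the instance name, IPv4 pool and rule numbers are illustrative):
.Bd -literal -offset indent
ipfw nat64lsn xlat create prefix4 192.0.2.0/28
ipfw add 50 allow ipv6-icmp from any to any icmp6types 135,136
ipfw add 100 nat64lsn xlat ip from any to 192.0.2.0/28 in
ipfw add 200 nat64lsn xlat ip6 from any to 64:ff9b::/96 in
.Ed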
.Sh IPv6-to-IPv6 NETWORK PREFIX TRANSLATION (NPTv6) .Nm supports in-kernel IPv6-to-IPv6 network prefix translation as described in RFC6296. The kernel module .Cm ipfw_nptv6 should be loaded, or the kernel should have .Cm options IPFIREWALL_NPTV6 , to be able to use the NPTv6 translator. .Pp The NPTv6 configuration command is the following: .Bd -ragged -offset indent .Bk -words .Cm nptv6 .Ar name .Cm create .Ar create-options .Ek .Ed .Pp The following parameters can be configured: .Bl -tag -width indent .It Cm int_prefix Ar ipv6_prefix The IPv6 prefix used in the internal network. The NPTv6 module translates the source address when it matches this prefix. .It Cm ext_prefix Ar ipv6_prefix The IPv6 prefix used in the external network. The NPTv6 module translates the destination address when it matches this prefix. .It Cm prefixlen Ar length The length of the specified IPv6 prefixes. It must be in the range 8 to 64. .El .Pp Note that the prefix translation rules are silently ignored when IPv6 packet forwarding is disabled. To enable packet forwarding, set the sysctl variable .Va net.inet6.ip6.forwarding to 1. .Pp To let the packet continue after being translated, set the sysctl variable .Va net.inet.ip.fw.one_pass to 0. .Sh LOADER TUNABLES Tunables can be set at the .Xr loader 8 prompt, in .Xr loader.conf 5 , or with .Xr kenv 1 before the ipfw module is loaded. .Bl -tag -width indent .It Va net.inet.ip.fw.default_to_accept: No 0 Defines the behavior of the last (default) ipfw rule. This value overrides .Cd "options IPFW_DEFAULT_TO_(ACCEPT|DENY)" from the kernel configuration file. .It Va net.inet.ip.fw.tables_max: No 128 Defines the number of tables available in ipfw. The number cannot exceed 65534. .El
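.Pp For example, to have the last rule accept packets by default, the corresponding tunable could be set in .Xr loader.conf 5 :
.Bd -literal -offset indent
net.inet.ip.fw.default_to_accept=1
.Ed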
.Sh LOADER TUNABLES
Tunables can be set at the
.Xr loader 8
prompt, in
.Xr loader.conf 5
or with
.Xr kenv 1
before the ipfw module gets loaded.
.Bl -tag -width indent
.It Va net.inet.ip.fw.default_to_accept: No 0
Defines the behavior of the last (default) ipfw rule.
This value overrides
.Cd "options IPFW_DEFAULT_TO_(ACCEPT|DENY)"
from the kernel configuration file.
.It Va net.inet.ip.fw.tables_max: No 128
Defines the number of tables available in ipfw.
The number cannot exceed 65534.
.El
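.Pp
For example, to make the default rule accept packets and to raise the
table limit at boot time, lines like the following could be added to
.Xr loader.conf 5
(the values are illustrative):
.Pp
.Dl net.inet.ip.fw.default_to_accept=\*q1\*q
.Dl net.inet.ip.fw.tables_max=\*q1024\*q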
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables controls the behaviour of the firewall and associated modules
.Pq Nm dummynet , bridge , sctp nat .
These are shown below together with their default value (but always
check the value actually in use with the
.Xr sysctl 8
command) and meaning:
.Bl -tag -width indent
.It Va net.inet.ip.alias.sctp.accept_global_ootb_addip: No 0
Defines how the
.Nm nat
responds to receipt of global OOTB ASCONF-AddIP:
.Bl -tag -width indent
.It Cm 0
No response (unless a partially matching association exists -
ports and vtags match but global address does not)
.It Cm 1
.Nm nat
will accept and process all OOTB global AddIP messages.
.El
.Pp
Option 1 should never be selected as this forms a security risk.
An attacker can establish multiple fake associations by sending AddIP
messages.
.It Va net.inet.ip.alias.sctp.chunk_proc_limit: No 5
Defines the maximum number of chunks in an SCTP packet that will be
parsed for a packet that matches an existing association.
This value is enforced to be greater than or equal to
.Cm net.inet.ip.alias.sctp.initialising_chunk_proc_limit .
A high value is a DoS risk, yet setting too low a value may result in
important control chunks in the packet not being located and parsed.
.It Va net.inet.ip.alias.sctp.error_on_ootb: No 1
Defines when the
.Nm nat
responds to any Out-of-the-Blue (OOTB) packets with ErrorM packets.
An OOTB packet is a packet that arrives with no existing association
registered in the
.Nm nat
and is not an INIT or ASCONF-AddIP packet:
.Bl -tag -width indent
.It Cm 0
ErrorM is never sent in response to OOTB packets.
.It Cm 1
ErrorM is only sent to OOTB packets received on the local side.
.It Cm 2
ErrorM is sent to the local side and on the global side ONLY if there
is a partial match (ports and vtags match but the source global IP
does not).
This value is only useful if the
.Nm nat
is tracking global IP addresses.
.It Cm 3
ErrorM is sent in response to all OOTB packets on both the local and
global side (DoS risk).
.El
.Pp
At the moment the default is 0, since the ErrorM packet is not yet
supported by most SCTP stacks.
When it is supported, and if not tracking global addresses, we
recommend setting this value to 1 to allow multi-homed local hosts to
function with the
.Nm nat .
To track global addresses, we recommend setting this value to 2 to
allow global hosts to be informed when they need to (re)send an
ASCONF-AddIP.
Value 3 should never be chosen (except for debugging) as the
.Nm nat
will respond to all OOTB global packets (a DoS risk).
.It Va net.inet.ip.alias.sctp.hashtable_size: No 2003
Size of hash tables used for
.Nm nat
lookups (a prime number in the range 100 to 1000001).
This value sets the
.Nm
hash table size for any
.Nm nat
instance created afterwards, and therefore must be set prior to
creating a
.Nm nat
instance.
The table sizes may be changed to suit specific needs.
If there will be few concurrent associations, and memory is scarce, you
may make these smaller.
If there will be many thousands (or millions) of concurrent
associations, you should make these larger.
A prime number is best for the table size.
The sysctl update function will adjust your input value to the next
highest prime number.
.It Va net.inet.ip.alias.sctp.holddown_time: No 0
Hold the association in the table for this many seconds after receiving
a SHUTDOWN-COMPLETE.
This allows endpoints to complete the shutdown gracefully if a
SHUTDOWN-COMPLETE is lost and retransmissions are required.
.It Va net.inet.ip.alias.sctp.init_timer: No 15
Timeout value while waiting for (INIT-ACK|AddIP-ACK).
This value cannot be 0.
.It Va net.inet.ip.alias.sctp.initialising_chunk_proc_limit: No 2
Defines the maximum number of chunks in an SCTP packet that will be
parsed when no existing association matches that packet.
Ideally this packet will only be an INIT or ASCONF-AddIP packet.
A higher value may become a DoS risk as malformed packets can consume
processing resources.
.It Va net.inet.ip.alias.sctp.param_proc_limit: No 25
Defines the maximum number of parameters within a chunk that will be
parsed in a packet.
As for other similar sysctl variables, larger values pose a DoS risk.
.It Va net.inet.ip.alias.sctp.log_level: No 0
Level of detail in the system log messages (0 \- minimal, 1 \- event,
2 \- info, 3 \- detail, 4 \- debug, 5 \- max debug).
May be a good option in high loss environments.
.It Va net.inet.ip.alias.sctp.shutdown_time: No 15
Timeout value while waiting for SHUTDOWN-COMPLETE.
This value cannot be 0.
.It Va net.inet.ip.alias.sctp.track_global_addresses: No 0
Enables/disables global IP address tracking within the
.Nm nat
and places an upper limit on the number of addresses tracked for each
association:
.Bl -tag -width indent
.It Cm 0
Global tracking is disabled
.It Cm >1
Enables tracking; the maximum number of addresses tracked for each
association is limited to this value
.El
.Pp
This variable is fully dynamic; the new value will be adopted for all
newly arriving associations, while existing associations are treated
as they were previously.
Global tracking will decrease the number of collisions within the
.Nm nat
at a cost of increased processing load, memory usage, complexity, and
possible
.Nm nat
state problems in complex networks with multiple
.Nm nats .
We recommend not tracking global IP addresses; this will still result
in a fully functional
.Nm nat .
.It Va net.inet.ip.alias.sctp.up_timer: No 300
Timeout value to keep an association up with no traffic.
This value cannot be 0.
.It Va net.inet.ip.dummynet.expire : No 1
Lazily delete dynamic pipes/queues once they have no pending traffic.
You can disable this by setting the variable to 0, in which case the
pipes/queues will only be deleted when the threshold is reached.
.It Va net.inet.ip.dummynet.hash_size : No 64
Default size of the hash table used for dynamic pipes/queues.
This value is used when no
.Cm buckets
option is specified when configuring a pipe/queue.
.It Va net.inet.ip.dummynet.io_fast : No 0
If set to a non-zero value, the
.Dq fast
mode of
.Nm dummynet
operation (see above) is enabled.
.It Va net.inet.ip.dummynet.io_pkt
Number of packets passed to
.Nm dummynet .
.It Va net.inet.ip.dummynet.io_pkt_drop
Number of packets dropped by
.Nm dummynet .
.It Va net.inet.ip.dummynet.io_pkt_fast
Number of packets bypassed by the
.Nm dummynet
scheduler.
.It Va net.inet.ip.dummynet.max_chain_len : No 16
Target value for the maximum number of pipes/queues in a hash bucket.
The product
.Cm max_chain_len*hash_size
is used to determine the threshold over which empty pipes/queues will
be expired even when
.Cm net.inet.ip.dummynet.expire=0 .
.It Va net.inet.ip.dummynet.red_lookup_depth : No 256
.It Va net.inet.ip.dummynet.red_avg_pkt_size : No 512
.It Va net.inet.ip.dummynet.red_max_pkt_size : No 1500
Parameters used in the computations of the drop probability for the
RED algorithm.
.It Va net.inet.ip.dummynet.pipe_byte_limit : No 1048576
.It Va net.inet.ip.dummynet.pipe_slot_limit : No 100
The maximum queue size that can be specified in bytes or packets.
These limits prevent accidental exhaustion of resources such as mbufs.
If you raise these limits, you should make sure the system is
configured so that sufficient resources are available.
.It Va net.inet.ip.fw.autoinc_step : No 100
Delta between rule numbers when auto-generating them.
The value must be in the range 1..1000.
.It Va net.inet.ip.fw.curr_dyn_buckets : Va net.inet.ip.fw.dyn_buckets
The current number of buckets in the hash table for dynamic rules
(read-only).
.It Va net.inet.ip.fw.debug : No 1
Controls debugging messages produced by
.Nm .
.It Va net.inet.ip.fw.default_rule : No 65535
The default rule number (read-only).
By the design of
.Nm , the default rule is the last one, so its number
can also serve as the highest number allowed for a rule.
.It Va net.inet.ip.fw.dyn_buckets : No 256
The number of buckets in the hash table for dynamic rules.
Must be a power of 2, up to 65536.
It only takes effect when all dynamic rules have expired, so you are
advised to use a
.Cm flush
command to make sure that the hash table is resized.
.It Va net.inet.ip.fw.dyn_count : No 3
Current number of dynamic rules (read-only).
.It Va net.inet.ip.fw.dyn_keepalive : No 1
Enables generation of keepalive packets for
.Cm keep-state
rules on TCP sessions.
A keepalive is generated to both sides of the connection every 5
seconds for the last 20 seconds of the lifetime of the rule.
.It Va net.inet.ip.fw.dyn_max : No 8192
Maximum number of dynamic rules.
When you hit this limit, no more dynamic rules can be installed until
old ones expire.
.It Va net.inet.ip.fw.dyn_ack_lifetime : No 300
.It Va net.inet.ip.fw.dyn_syn_lifetime : No 20
.It Va net.inet.ip.fw.dyn_fin_lifetime : No 1
.It Va net.inet.ip.fw.dyn_rst_lifetime : No 1
.It Va net.inet.ip.fw.dyn_udp_lifetime : No 5
.It Va net.inet.ip.fw.dyn_short_lifetime : No 30
These variables control the lifetime, in seconds, of dynamic rules.
Upon the initial SYN exchange the lifetime is kept short, then
increased after both SYN have been seen, then decreased again during
the final FIN exchange or when a RST is received.
Both
.Em dyn_fin_lifetime
and
.Em dyn_rst_lifetime
must be strictly lower than 5 seconds, the period of repetition of
keepalives.
The firewall enforces that.
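.Pp
For example, to let idle dynamic UDP states live for 20 seconds instead
of the default 5, the corresponding variable can be changed at run time
with
.Xr sysctl 8
(the value is illustrative):
.Pp
.Dl "sysctl net.inet.ip.fw.dyn_udp_lifetime=20"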
.It Va net.inet.ip.fw.dyn_keep_states: No 0
Keep dynamic states on rule/set deletion.
States are relinked to the default rule (65535).
This can be handy for ruleset reload.
Turned off by default.
.It Va net.inet.ip.fw.enable : No 1
Enables the firewall.
Setting this variable to 0 lets you run your machine without the
firewall even if it is compiled in.
.It Va net.inet6.ip6.fw.enable : No 1
Provides the same functionality as above for the IPv6 case.
.It Va net.inet.ip.fw.one_pass : No 1
When set, the packet exiting from the
.Nm dummynet
pipe or from
.Xr ng_ipfw 4
node is not passed through the firewall again.
Otherwise, after an action, the packet is reinjected into the firewall
at the next rule.
.It Va net.inet.ip.fw.tables_max : No 128
Maximum number of tables.
.It Va net.inet.ip.fw.verbose : No 1
Enables verbose messages.
.It Va net.inet.ip.fw.verbose_limit : No 0
Limits the number of messages produced by a verbose firewall.
.It Va net.inet6.ip6.fw.deny_unknown_exthdrs : No 1
If enabled, packets with unknown IPv6 Extension Headers will be denied.
.It Va net.link.ether.ipfw : No 0
Controls whether layer-2 packets are passed to
.Nm .
Default is no.
.It Va net.link.bridge.ipfw : No 0
Controls whether bridged packets are passed to
.Nm .
Default is no.
.El
.Sh INTERNAL DIAGNOSTICS
There are some commands that may be useful to understand the current
state of certain subsystems inside the kernel module.
These commands provide debugging output which may change without
notice.
.Pp
Currently the following commands are available as
.Cm internal
sub-options:
.Bl -tag -width indent
.It Cm iflist
Lists all interfaces which are currently tracked by
.Nm
with their in-kernel status.
.It Cm talist
List all table lookup algorithms currently available.
.El
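.Pp
For example, the list of tracked interfaces can be printed with:
.Pp
.Dl "ipfw internal iflist"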
.Sh EXAMPLES
There are far too many possible uses of
.Nm
so this section will only give a small set of examples.
.Pp
.Ss BASIC PACKET FILTERING
This command adds an entry which denies all tcp packets from
.Em cracker.evil.org
to the telnet port of
.Em wolf.tambov.su
from being forwarded by the host:
.Pp
.Dl "ipfw add deny tcp from cracker.evil.org to wolf.tambov.su telnet"
.Pp
This one disallows any connection from the entire cracker's network to
my host:
.Pp
.Dl "ipfw add deny ip from 123.45.67.0/24 to my.host.org"
.Pp
A first and efficient way to limit access (not using dynamic rules)
is the use of the following rules:
.Pp
.Dl "ipfw add allow tcp from any to any established"
.Dl "ipfw add allow tcp from net1 portlist1 to net2 portlist2 setup"
.Dl "ipfw add allow tcp from net3 portlist3 to net3 portlist3 setup"
.Dl "..."
.Dl "ipfw add deny tcp from any to any"
.Pp
The first rule will be a quick match for normal TCP packets, but it
will not match the initial SYN packet, which will be matched by the
.Cm setup
rules only for selected source/destination pairs.
All other SYN packets will be rejected by the final
.Cm deny
rule.
.Pp
If you administer one or more subnets, you can take advantage of the
address sets and or-blocks and write extremely compact rulesets which
selectively enable services to blocks of clients, as below:
.Pp
.Dl "goodguys=\*q{ 10.1.2.0/24{20,35,66,18} or 10.2.3.0/28{6,3,11} }\*q"
.Dl "badguys=\*q10.1.2.0/24{8,38,60}\*q"
.Dl ""
.Dl "ipfw add allow ip from ${goodguys} to any"
.Dl "ipfw add deny ip from ${badguys} to any"
.Dl "... normal policies ..."
.Pp
The
.Cm verrevpath
option could be used to do automated anti-spoofing by adding the
following to the top of a ruleset:
.Pp
.Dl "ipfw add deny ip from any to any not verrevpath in"
.Pp
This rule drops all incoming packets that appear to be coming to the
system on the wrong interface.
For example, a packet with a source address belonging to a host on a
protected internal network would be dropped if it tried to enter the
system from an external interface.
.Pp
The
.Cm antispoof
option could be used to do similar but more restricted anti-spoofing
by adding the following to the top of a ruleset:
.Pp
.Dl "ipfw add deny ip from any to any not antispoof in"
.Pp
This rule drops all incoming packets that appear to be coming from
another directly connected system but on the wrong interface.
For example, a packet with a source address of
.Li 192.168.0.0/24 ,
configured on
.Li fxp0 ,
but coming in on
.Li fxp1
would be dropped.
.Pp
The
.Cm setdscp
option could be used to (re)mark user traffic, by adding the following
to the appropriate place in the ruleset:
.Pp
.Dl "ipfw add setdscp be ip from any to any dscp af11,af21"
.Ss DYNAMIC RULES
In order to protect a site from flood attacks involving fake
TCP packets, it is safer to use dynamic rules:
.Pp
.Dl "ipfw add check-state"
.Dl "ipfw add deny tcp from any to any established"
.Dl "ipfw add allow tcp from my-net to any setup keep-state"
.Pp
This will let the firewall install dynamic rules only for those
connections which start with a regular SYN packet coming from the
inside of our network.
Dynamic rules are checked when encountering the first occurrence of a
.Cm check-state ,
.Cm keep-state
or
.Cm limit
rule.
A
.Cm check-state
rule should usually be placed near the beginning of the ruleset to
minimize the amount of work scanning the ruleset.
Your mileage may vary.
.Pp
To limit the number of connections a user can open you can use the
following type of rules:
.Pp
.Dl "ipfw add allow tcp from my-net/24 to any setup limit src-addr 10"
.Dl "ipfw add allow tcp from any to me setup limit src-addr 4"
.Pp
The former (assuming it runs on a gateway) will allow each host on a
/24 network to open at most 10 TCP connections.
The latter can be placed on a server to make sure that a single client
does not use more than 4 simultaneous connections.
.Pp
.Em BEWARE :
stateful rules can be subject to denial-of-service attacks by a
SYN-flood which opens a huge number of dynamic rules.
The effects of such attacks can be partially limited by acting on a
set of
.Xr sysctl 8
variables which control the operation of the firewall.
.Pp
Here is a good usage of the
.Cm list
command to see accounting records and timestamp information:
.Pp
.Dl ipfw -at list
.Pp
or in short form without timestamps:
.Pp
.Dl ipfw -a list
.Pp
which is equivalent to:
.Pp
.Dl ipfw show
.Pp
The next rule diverts all incoming packets from 192.168.2.0/24 to
divert port 5000:
.Pp
.Dl ipfw add divert 5000 ip from 192.168.2.0/24 to any in
.Ss TRAFFIC SHAPING
The following rules show some of the applications of
.Nm
and
.Nm dummynet
for simulations and the like.
.Pp
This rule drops random incoming packets with a probability of 5%:
.Pp
.Dl "ipfw add prob 0.05 deny ip from any to any in"
.Pp
A similar effect can be achieved making use of
.Nm dummynet
pipes:
.Pp
.Dl "ipfw add pipe 10 ip from any to any"
.Dl "ipfw pipe 10 config plr 0.05"
.Pp
We can use pipes to artificially limit bandwidth, e.g.\& on a
machine acting as a router, if we want to limit traffic from
local clients on 192.168.2.0/24 we do:
.Pp
.Dl "ipfw add pipe 1 ip from 192.168.2.0/24 to any out"
.Dl "ipfw pipe 1 config bw 300Kbit/s queue 50KBytes"
.Pp
Note that we use the
.Cm out
modifier so that the rule is not used twice.
Remember in fact that
.Nm
rules are checked both on incoming and outgoing packets.
.Pp
Should we want to simulate a bidirectional link with bandwidth
limitations, the correct way is the following:
.Pp
.Dl "ipfw add pipe 1 ip from any to any out"
.Dl "ipfw add pipe 2 ip from any to any in"
.Dl "ipfw pipe 1 config bw 64Kbit/s queue 10Kbytes"
.Dl "ipfw pipe 2 config bw 64Kbit/s queue 10Kbytes"
.Pp
The above can be very useful, e.g.\& if you want to see how
your fancy Web page will look for a residential user who is connected
only through a slow link.
You should not use only one pipe for both directions, unless you want
to simulate a half-duplex medium (e.g.\& AppleTalk, Ethernet, IRDA).
It is not necessary that both pipes have the same configuration, so we
can also simulate asymmetric links.
.Pp
Should we want to verify network performance with the RED queue
management algorithm:
.Pp
.Dl "ipfw add pipe 1 ip from any to any"
.Dl "ipfw pipe 1 config bw 500Kbit/s queue 100 red 0.002/30/80/0.1"
.Pp
Another typical application of the traffic shaper is to introduce some
delay in the communication.
This can significantly affect applications which do a lot of Remote
Procedure Calls, and where the round-trip-time of the connection often
becomes a limiting factor much more than bandwidth:
.Pp
.Dl "ipfw add pipe 1 ip from any to any out"
.Dl "ipfw add pipe 2 ip from any to any in"
.Dl "ipfw pipe 1 config delay 250ms bw 1Mbit/s"
.Dl "ipfw pipe 2 config delay 250ms bw 1Mbit/s"
.Pp
Per-flow queueing can be useful for a variety of purposes.
A very simple one is counting traffic:
.Pp
.Dl "ipfw add pipe 1 tcp from any to any"
.Dl "ipfw add pipe 1 udp from any to any"
.Dl "ipfw add pipe 1 ip from any to any"
.Dl "ipfw pipe 1 config mask all"
.Pp
The above set of rules will create queues (and collect statistics) for
all traffic.
Because the pipes have no limitations, the only effect is collecting
statistics.
Note that we need 3 rules, not just the last one, because when
.Nm
tries to match IP packets it will not consider ports, so we would not
see connections on separate ports as different ones.
.Pp
A more sophisticated example is limiting the outbound traffic on a net
with per-host limits, rather than per-network limits:
.Pp
.Dl "ipfw add pipe 1 ip from 192.168.2.0/24 to any out"
.Dl "ipfw add pipe 2 ip from any to 192.168.2.0/24 in"
.Dl "ipfw pipe 1 config mask src-ip 0x000000ff bw 200Kbit/s queue 20Kbytes"
.Dl "ipfw pipe 2 config mask dst-ip 0x000000ff bw 200Kbit/s queue 20Kbytes"
.Ss LOOKUP TABLES
In the following example, we need to create several traffic bandwidth
classes and we need different hosts/networks to fall into different
classes.
We create one pipe for each class and configure them accordingly.
Then we create a single table and fill it with IP subnets and
addresses.
For each subnet/host we set the argument equal to the number of the
pipe that it should use.
Then we classify traffic using a single rule:
.Pp
.Dl "ipfw pipe 1 config bw 1000Kbyte/s"
.Dl "ipfw pipe 4 config bw 4000Kbyte/s"
.Dl "..."
.Dl "ipfw table T1 create type addr"
.Dl "ipfw table T1 add 192.168.2.0/24 1"
.Dl "ipfw table T1 add 192.168.0.0/27 4"
.Dl "ipfw table T1 add 192.168.0.2 1"
.Dl "..."
.Dl "ipfw add pipe tablearg ip from 'table(T1)' to any"
.Pp
Using the
.Cm fwd
action, the table entries may include hostnames and IP addresses.
.Pp
.Dl "ipfw table T2 create type addr ftype ip"
.Dl "ipfw table T2 add 192.168.2.0/24 10.23.2.1"
.Dl "ipfw table T2 add 192.168.0.0/27 router1.dmz"
.Dl "..."
.Dl "ipfw add 100 fwd tablearg ip from any to 'table(T2)'"
.Pp
In the following example a per-interface firewall is created:
.Pp
.Dl "ipfw table IN create type iface valtype skipto,fib"
.Dl "ipfw table IN add vlan20 12000,12"
.Dl "ipfw table IN add vlan30 13000,13"
.Dl "ipfw table OUT create type iface valtype skipto"
.Dl "ipfw table OUT add vlan20 22000"
.Dl "ipfw table OUT add vlan30 23000"
.Dl ".."
.Dl "ipfw add 100 setfib tablearg ip from any to any recv 'table(IN)' in"
.Dl "ipfw add 200 skipto tablearg ip from any to any recv 'table(IN)' in"
.Dl "ipfw add 300 skipto tablearg ip from any to any xmit 'table(OUT)' out"
.Pp
The following example illustrates the usage of flow tables:
.Pp
.Dl "ipfw table fl create type flow:src-ip,proto,dst-ip,dst-port"
.Dl "ipfw table fl add 2a02:6b8:77::88,tcp,2a02:6b8:77::99,80 11"
.Dl "ipfw table fl add 10.0.0.1,udp,10.0.0.2,53 12"
.Dl ".."
.Dl "ipfw add 100 allow ip from any to any flow 'table(fl,11)' recv ix0"
.Ss SETS OF RULES
To add a set of rules atomically, e.g.\& set 18:
.Pp
.Dl "ipfw set disable 18"
.Dl "ipfw add NN set 18 ... # repeat as needed"
.Dl "ipfw set enable 18"
.Pp
To delete a set of rules atomically the command is simply:
.Pp
.Dl "ipfw delete set 18"
.Pp
To test a ruleset and disable it and regain control if something goes
wrong:
.Pp
.Dl "ipfw set disable 18"
.Dl "ipfw add NN set 18 ... # repeat as needed"
.Dl "ipfw set enable 18; echo done; sleep 30 && ipfw set disable 18"
.Pp
Here if everything goes well, you press control-C before the
"sleep" terminates, and your ruleset will be left active.
Otherwise, e.g.\& if you cannot access your box, the ruleset will be
disabled after the sleep terminates, thus restoring the previous
situation.
.Pp
To show rules of the specific set:
.Pp
.Dl "ipfw set 18 show"
.Pp
To show rules of the disabled set:
.Pp
.Dl "ipfw -S set 18 show"
.Pp
To clear the counters of a specific rule of the specific set:
.Pp
.Dl "ipfw set 18 zero NN"
.Pp
To delete a specific rule of the specific set:
.Pp
.Dl "ipfw set 18 delete NN"
.Ss NAT, REDIRECT AND LSNAT
First redirect all the traffic to nat instance 123:
.Pp
.Dl "ipfw add nat 123 all from any to any"
.Pp
Then to configure nat instance 123 to alias all the outgoing traffic
with ip 192.168.0.123, blocking all incoming connections, trying to
keep same ports on both sides, clearing the aliasing table on address
change and keeping a log of traffic/link statistics:
.Pp
.Dl "ipfw nat 123 config ip 192.168.0.123 log deny_in reset same_ports"
.Pp
Or to change the address of instance 123; the aliasing table will be
cleared (see the reset option):
.Pp
.Dl "ipfw nat 123 config ip 10.0.0.1"
.Pp
To see the configuration of nat instance 123:
.Pp
.Dl "ipfw nat 123 show config"
.Pp
To show logs of all the instances in range 111-999:
.Pp
.Dl "ipfw nat 111-999 show"
.Pp
To see the configurations of all instances:
.Pp
.Dl "ipfw nat show config"
.Pp
Or a redirect rule with mixed modes could look like:
.Pp
.Dl "ipfw nat 123 config redirect_addr 10.0.0.1 10.0.0.66"
.Dl " redirect_port tcp 192.168.0.1:80 500"
.Dl " redirect_proto udp 192.168.1.43 192.168.1.1"
.Dl " redirect_addr 192.168.0.10,192.168.0.11"
.Dl " 10.0.0.100 # LSNAT"
.Dl " redirect_port tcp 192.168.0.1:80,192.168.0.10:22"
.Dl " 500 # LSNAT"
.Pp
or it could be split into:
.Pp
.Dl "ipfw nat 1 config redirect_addr 10.0.0.1 10.0.0.66"
.Dl "ipfw nat 2 config redirect_port tcp 192.168.0.1:80 500"
.Dl "ipfw nat 3 config redirect_proto udp 192.168.1.43 192.168.1.1"
.Dl "ipfw nat 4 config redirect_addr 192.168.0.10,192.168.0.11,192.168.0.12"
.Dl " 10.0.0.100"
.Dl "ipfw nat 5 config redirect_port tcp"
.Dl " 192.168.0.1:80,192.168.0.10:22,192.168.0.20:25 500"
.Sh SEE ALSO
.Xr cpp 1 ,
.Xr m4 1 ,
.Xr altq 4 ,
.Xr divert 4 ,
.Xr dummynet 4 ,
.Xr if_bridge 4 ,
.Xr ip 4 ,
.Xr ipfirewall 4 ,
.Xr ng_ipfw 4 ,
.Xr protocols 5 ,
.Xr services 5 ,
.Xr init 8 ,
.Xr kldload 8 ,
.Xr reboot 8 ,
.Xr sysctl 8 ,
.Xr syslogd 8
.Sh HISTORY
The
.Nm
utility first appeared in
.Fx 2.0 .
.Nm dummynet
was introduced in
.Fx 2.2.8 .
Stateful extensions were introduced in
.Fx 4.0 .
.Nm ipfw2
was introduced in Summer 2002.
.Sh AUTHORS
.An Ugen J. S. Antsilevich ,
.An Poul-Henning Kamp ,
.An Alex Nash ,
.An Archie Cobbs ,
.An Luigi Rizzo .
.Pp
.An -nosplit
API based upon code written by
.An Daniel Boulet
for BSDI.
.Pp
Dummynet was introduced by Luigi Rizzo in 1997-1998.
.Pp
Some early work (1999-2000) on the
.Nm dummynet
traffic shaper was supported by Akamba Corp.
.Pp
The ipfw core (ipfw2) was completely redesigned and reimplemented by
Luigi Rizzo in summer 2002.
Further actions and options have been added by various developers over
the years.
.Pp
.An -nosplit
In-kernel NAT support written by
.An Paolo Pisati Aq Mt piso@FreeBSD.org
as part of a Summer of Code 2005 project.
.Pp
SCTP
.Nm nat
support has been developed by
.An The Centre for Advanced Internet Architectures (CAIA) Aq http://www.caia.swin.edu.au .
The primary developers and maintainers are David Hayes and Jason But.
For further information visit:
.Aq http://www.caia.swin.edu.au/urp/SONATA
.Pp
Delay profiles have been developed by Alessandro Cerri and
Luigi Rizzo, supported by the European Commission within Projects
Onelab and Onelab2.
.Sh BUGS
The syntax has grown over the years and sometimes it might be
confusing.
Unfortunately, backward compatibility prevents cleaning up mistakes
made in the definition of the syntax.
.Pp
.Em !!! WARNING !!!
.Pp
Misconfiguring the firewall can put your computer in an unusable state,
possibly shutting down network services and requiring console access to
regain control of it.
.Pp
Incoming packet fragments diverted by
.Cm divert
are reassembled before delivery to the socket.
The action used on those packets is the one from the rule which matches
the first fragment of the packet.
.Pp
Packets diverted to userland, and then reinserted by a userland process
may lose various packet attributes.
The packet source interface name will be preserved if it is shorter
than 8 bytes and the userland process saves and reuses the sockaddr_in
(as does
.Xr natd 8 ) ;
otherwise, it may be lost.
If a packet is reinserted in this manner, later rules may be
incorrectly applied, making the order of
.Cm divert
rules in the rule sequence very important.
.Pp
Dummynet drops all packets with IPv6 link-local addresses.
.Pp
Rules using
.Cm uid
or
.Cm gid
may not behave as expected.
In particular, incoming SYN packets may have no uid or gid associated
with them since they do not yet belong to a TCP connection, and the
uid/gid associated with a packet may not be as expected if the
associated process calls
.Xr setuid 2
or similar system calls.
.Pp
Rule syntax is subject to the command line environment and some
patterns may need to be escaped with the backslash character or quoted
appropriately.
.Pp
Due to the architecture of
.Xr libalias 3 ,
ipfw nat is not compatible with the TCP segmentation offloading (TSO).
Thus, to reliably nat your network traffic, please disable TSO on your
NICs using
.Xr ifconfig 8 .
.Pp
ICMP error messages are not implicitly matched by dynamic rules for the
respective conversations.
To avoid failures of network error detection and path MTU discovery,
ICMP error messages may need to be allowed explicitly through static
rules.
.Pp
Rules using
.Cm call
and
.Cm return
actions may lead to confusing behaviour if the ruleset has mistakes,
or if interaction with other subsystems (netgraph, dummynet, etc.) is
involved.
One possible case for this is a packet leaving
.Nm
in a subroutine on the input pass and later, on the output pass,
encountering an unpaired
.Cm return
first.
As the call stack is kept intact after the input pass, the packet will
suddenly return to the rule number used on the input pass, not the
output one.
The order of processing should be checked carefully to avoid such
mistakes.
Index: user/markj/netdump/share/man/man7/development.7
===================================================================
--- user/markj/netdump/share/man/man7/development.7 (revision 332407)
+++ user/markj/netdump/share/man/man7/development.7 (revision 332408)
@@ -1,130 +1,135 @@
.\" Copyright (c) 2018 Edward Tomasz Napierala
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd March 27, 2018 +.Dd April 10, 2018 .Dt DEVELOPMENT 7 .Os .Sh NAME .Nm development .Nd introduction to FreeBSD development process .Sh DESCRIPTION .Fx development is split into three major suprojects: doc, ports, and src. Doc is the documentation, such as the FreeBSD Handbook. To read more, see: .Pp .Lk https://www.FreeBSD.org/doc/en/books/fdp-primer/ .Pp Ports, described further in .Xr ports 7 , are the way to build, package, and install third party software. To read more, see: .Pp .Lk https://www.FreeBSD.org/doc/en/books/porters-handbook/ .Pp The last one, src, revolves around the source code for the base system, consisting of the kernel, and the libraries and utilities commonly called the world. .Pp The Committer's Guide, describing topics relevant to all committers, can be found at: .Pp .Lk https://www.FreeBSD.org/doc/en/articles/committers-guide/ .Pp FreeBSD src development takes place in the CURRENT branch in Subversion, located at: .Pp .Lk https://svn.FreeBSD.org/base/head .Pp There is also a read-only GitHub mirror at: .Pp .Lk https://github.com/freebsd/freebsd .Pp Changes are first committed to CURRENT and then usually merged back to STABLE. Every few years the CURRENT branch is renamed to STABLE, and a new CURRENT is branched, with an incremented major version number. Releases are then branched off STABLE and numbered with consecutive minor numbers. .Pp Layout of the source tree is described in .Xr hier 7 . Build instructions can be found in .Xr build 7 and .Xr release 7 . +Kernel APIs are usually documented, use +.Cm apropos -s 9 '' +for a list. +Regression test suite is described in +.Xr tests 7 . For coding conventions, see .Xr style 9 . .Pp To ask questions regarding development, use the mailing lists, such as freebsd-arch@ and freebsd-hackers@: .Pp .Lk https://lists.FreeBSD.org/ .Pp To get your patches integrated into the main FreeBSD repository use Phabricator; it is a code review tool that allows other developers to review the changes, suggest improvements, and, eventually, allows them to pick up the change and commit it: .Pp .Lk https://reviews.FreeBSD.org/ .Pp .Sh EXAMPLES Check out the CURRENT branch, build it, and install, overwriting the current system: .Dl svnlite co https://svn.FreeBSD.org/base/head src .Dl cd src .Dl make -j8 buildworld buildkernel installkernel .Dl reboot .Pp After reboot: .Dl cd src .Dl make -j8 installworld .Pp .Sh SEE ALSO .Xr witness 4 , .Xr build 7 , .Xr hier 7 , .Xr release 7 , .Xr locking 9 , .Xr style 9 .Sh HISTORY The .Nm manual page was originally written by .An Matthew Dillon Aq Mt dillon@FreeBSD.org and first appeared in .Fx 5.0 , December 2002. 
It was since extensively modified by .An Eitan Adler Aq Mt eadler@FreeBSD.org to reflect the repository conversion from .Xr cvs 1 to .Xr svn 1 . It was rewritten from scratch by .An Edward Tomasz Napierala Aq Mt trasz@FreeBSD.org for .Fx 12.0 . Index: user/markj/netdump/share/man/man9/Makefile =================================================================== --- user/markj/netdump/share/man/man9/Makefile (revision 332407) +++ user/markj/netdump/share/man/man9/Makefile (revision 332408) @@ -1,2238 +1,2259 @@ # $FreeBSD$ .include PACKAGE=runtime-manuals MAN= accept_filter.9 \ accf_data.9 \ accf_dns.9 \ accf_http.9 \ acl.9 \ alq.9 \ altq.9 \ atomic.9 \ bhnd.9 \ bhnd_erom.9 \ bios.9 \ bitset.9 \ boot.9 \ bpf.9 \ buf.9 \ buf_ring.9 \ BUF_ISLOCKED.9 \ BUF_LOCK.9 \ BUF_LOCKFREE.9 \ BUF_LOCKINIT.9 \ BUF_RECURSED.9 \ BUF_TIMELOCK.9 \ BUF_UNLOCK.9 \ bus_activate_resource.9 \ BUS_ADD_CHILD.9 \ bus_adjust_resource.9 \ bus_alloc_resource.9 \ BUS_BIND_INTR.9 \ bus_child_present.9 \ BUS_CHILD_DELETED.9 \ BUS_CHILD_DETACHED.9 \ BUS_CONFIG_INTR.9 \ BUS_DESCRIBE_INTR.9 \ bus_dma.9 \ bus_generic_attach.9 \ bus_generic_detach.9 \ bus_generic_new_pass.9 \ bus_generic_print_child.9 \ bus_generic_read_ivar.9 \ bus_generic_shutdown.9 \ BUS_GET_CPUS.9 \ bus_get_resource.9 \ bus_map_resource.9 \ BUS_NEW_PASS.9 \ BUS_PRINT_CHILD.9 \ BUS_READ_IVAR.9 \ BUS_RESCAN.9 \ bus_release_resource.9 \ bus_set_pass.9 \ bus_set_resource.9 \ BUS_SETUP_INTR.9 \ bus_space.9 \ byteorder.9 \ casuword.9 \ cd.9 \ cnv.9 \ condvar.9 \ config_intrhook.9 \ contigmalloc.9 \ copy.9 \ counter.9 \ cpuset.9 \ cr_cansee.9 \ critical_enter.9 \ cr_seeothergids.9 \ cr_seeotheruids.9 \ crypto.9 \ CTASSERT.9 \ DB_COMMAND.9 \ DECLARE_GEOM_CLASS.9 \ DECLARE_MODULE.9 \ DELAY.9 \ devclass.9 \ devclass_find.9 \ devclass_get_device.9 \ devclass_get_devices.9 \ devclass_get_drivers.9 \ devclass_get_maxunit.9 \ devclass_get_name.9 \ devclass_get_softc.9 \ dev_clone.9 \ devfs_set_cdevpriv.9 \ device.9 \ device_add_child.9 \ DEVICE_ATTACH.9 \ device_delete_child.9 \ DEVICE_DETACH.9 \ device_enable.9 \ device_find_child.9 \ device_get_children.9 \ device_get_devclass.9 \ device_get_driver.9 \ device_get_ivars.9 \ device_get_name.9 \ device_get_parent.9 \ device_get_softc.9 \ device_get_state.9 \ device_get_sysctl.9 \ device_get_unit.9 \ DEVICE_IDENTIFY.9 \ device_printf.9 \ DEVICE_PROBE.9 \ device_probe_and_attach.9 \ device_quiet.9 \ device_set_desc.9 \ device_set_driver.9 \ device_set_flags.9 \ DEVICE_SHUTDOWN.9 \ DEV_MODULE.9 \ devstat.9 \ devtoname.9 \ disk.9 \ dnv.9 \ domain.9 \ domainset.9 \ dpcpu.9 \ drbr.9 \ driver.9 \ DRIVER_MODULE.9 \ EVENTHANDLER.9 \ eventtimers.9 \ extattr.9 \ fail.9 \ fdt_pinctrl.9 \ fetch.9 \ firmware.9 \ fpu_kern.9 \ g_access.9 \ g_attach.9 \ g_bio.9 \ g_consumer.9 \ g_data.9 \ get_cyclecount.9 \ getenv.9 \ getnewvnode.9 \ g_event.9 \ g_geom.9 \ g_provider.9 \ g_provider_by_name.9 \ groupmember.9 \ g_wither_geom.9 \ hash.9 \ hashinit.9 \ hexdump.9 \ hhook.9 \ ieee80211.9 \ ieee80211_amrr.9 \ ieee80211_beacon.9 \ ieee80211_bmiss.9 \ ieee80211_crypto.9 \ ieee80211_ddb.9 \ ieee80211_input.9 \ ieee80211_node.9 \ ieee80211_output.9 \ ieee80211_proto.9 \ ieee80211_radiotap.9 \ ieee80211_regdomain.9 \ ieee80211_scan.9 \ ieee80211_vap.9 \ iflibdd.9 \ iflibdi.9 \ iflibtxrx.9 \ ifnet.9 \ inittodr.9 \ insmntque.9 \ intro.9 \ ithread.9 \ KASSERT.9 \ kern_testfrwk.9 \ kernacc.9 \ kernel_mount.9 \ khelp.9 \ kobj.9 \ kproc.9 \ kqueue.9 \ kthread.9 \ ktr.9 \ lock.9 \ locking.9 \ LOCK_PROFILING.9 \ mac.9 \ make_dev.9 \ malloc.9 \ mbchain.9 \ 
mbuf.9 \ mbuf_tags.9 \ MD5.9 \ mdchain.9 \ memcchr.9 \ memguard.9 \ microseq.9 \ microtime.9 \ microuptime.9 \ mi_switch.9 \ mod_cc.9 \ module.9 \ MODULE_DEPEND.9 \ MODULE_PNP_INFO.9 \ MODULE_VERSION.9 \ mtx_pool.9 \ mutex.9 \ namei.9 \ netisr.9 \ nv.9 \ + OF_child.9 \ + OF_device_from_xref.9 \ + OF_finddevice.9 \ + OF_getprop.9 \ + OF_node_from_xref.9 \ + OF_package_to_path.9 \ ofw_bus_is_compatible.9 \ ofw_bus_status_okay.9 \ osd.9 \ owll.9 \ own.9 \ panic.9 \ pbuf.9 \ PCBGROUP.9 \ p_candebug.9 \ p_cansee.9 \ pci.9 \ PCI_IOV_ADD_VF.9 \ PCI_IOV_INIT.9 \ pci_iov_schema.9 \ PCI_IOV_UNINIT.9 \ pfil.9 \ pfind.9 \ pget.9 \ pgfind.9 \ PHOLD.9 \ physio.9 \ pmap.9 \ pmap_activate.9 \ pmap_clear_modify.9 \ pmap_copy.9 \ pmap_enter.9 \ pmap_extract.9 \ pmap_growkernel.9 \ pmap_init.9 \ pmap_is_modified.9 \ pmap_is_prefaultable.9 \ pmap_map.9 \ pmap_mincore.9 \ pmap_object_init_pt.9 \ pmap_page_exists_quick.9 \ pmap_page_init.9 \ pmap_pinit.9 \ pmap_protect.9 \ pmap_qenter.9 \ pmap_quick_enter_page.9 \ pmap_release.9 \ pmap_remove.9 \ pmap_resident_count.9 \ pmap_unwire.9 \ pmap_zero_page.9 \ printf.9 \ prison_check.9 \ priv.9 \ proc_rwmem.9 \ pseudofs.9 \ psignal.9 \ random.9 \ random_harvest.9 \ redzone.9 \ refcount.9 \ resettodr.9 \ resource_int_value.9 \ rijndael.9 \ rman.9 \ rmlock.9 \ rtalloc.9 \ rtentry.9 \ runqueue.9 \ rwlock.9 \ sbuf.9 \ scheduler.9 \ SDT.9 \ securelevel_gt.9 \ selrecord.9 \ sema.9 \ sf_buf.9 \ sglist.9 \ shm_map.9 \ signal.9 \ sleep.9 \ sleepqueue.9 \ socket.9 \ stack.9 \ store.9 \ style.9 \ style.lua.9 \ swi.9 \ sx.9 \ syscall_helper_register.9 \ SYSCALL_MODULE.9 \ sysctl.9 \ sysctl_add_oid.9 \ sysctl_ctx_init.9 \ SYSINIT.9 \ taskqueue.9 \ tcp_functions.9 \ thread_exit.9 \ time.9 \ timeout.9 \ tvtohz.9 \ ucred.9 \ uidinfo.9 \ uio.9 \ unr.9 \ vaccess.9 \ vaccess_acl_nfs4.9 \ vaccess_acl_posix1e.9 \ vcount.9 \ vflush.9 \ VFS.9 \ vfs_busy.9 \ VFS_CHECKEXP.9 \ vfsconf.9 \ VFS_FHTOVP.9 \ vfs_getnewfsid.9 \ vfs_getopt.9 \ vfs_getvfs.9 \ VFS_MOUNT.9 \ vfs_mountedfrom.9 \ VFS_QUOTACTL.9 \ VFS_ROOT.9 \ vfs_rootmountalloc.9 \ VFS_SET.9 \ VFS_STATFS.9 \ vfs_suser.9 \ VFS_SYNC.9 \ vfs_timestamp.9 \ vfs_unbusy.9 \ VFS_UNMOUNT.9 \ vfs_unmountall.9 \ VFS_VGET.9 \ vget.9 \ vgone.9 \ vhold.9 \ vinvalbuf.9 \ vm_fault_prefault.9 \ vm_map.9 \ vm_map_check_protection.9 \ vm_map_create.9 \ vm_map_delete.9 \ vm_map_entry_resize_free.9 \ vm_map_find.9 \ vm_map_findspace.9 \ vm_map_inherit.9 \ vm_map_init.9 \ vm_map_insert.9 \ vm_map_lock.9 \ vm_map_lookup.9 \ vm_map_madvise.9 \ vm_map_max.9 \ vm_map_protect.9 \ vm_map_remove.9 \ vm_map_simplify_entry.9 \ vm_map_stack.9 \ vm_map_submap.9 \ vm_map_sync.9 \ vm_map_wire.9 \ vm_page_alloc.9 \ vm_page_bits.9 \ vm_page_busy.9 \ vm_page_deactivate.9 \ vm_page_dontneed.9 \ vm_page_aflag.9 \ vm_page_free.9 \ vm_page_grab.9 \ vm_page_hold.9 \ vm_page_insert.9 \ vm_page_lookup.9 \ vm_page_rename.9 \ vm_page_wire.9 \ vm_set_page_size.9 \ vmem.9 \ vn_fullpath.9 \ vn_isdisk.9 \ vnet.9 \ vnode.9 \ VOP_ACCESS.9 \ VOP_ACLCHECK.9 \ VOP_ADVISE.9 \ VOP_ADVLOCK.9 \ VOP_ALLOCATE.9 \ VOP_ATTRIB.9 \ VOP_BWRITE.9 \ VOP_CREATE.9 \ VOP_FSYNC.9 \ VOP_GETACL.9 \ VOP_GETEXTATTR.9 \ VOP_GETPAGES.9 \ VOP_INACTIVE.9 \ VOP_IOCTL.9 \ VOP_LINK.9 \ VOP_LISTEXTATTR.9 \ VOP_LOCK.9 \ VOP_LOOKUP.9 \ VOP_OPENCLOSE.9 \ VOP_PATHCONF.9 \ VOP_PRINT.9 \ VOP_RDWR.9 \ VOP_READDIR.9 \ VOP_READLINK.9 \ VOP_REALLOCBLKS.9 \ VOP_REMOVE.9 \ VOP_RENAME.9 \ VOP_REVOKE.9 \ VOP_SETACL.9 \ VOP_SETEXTATTR.9 \ VOP_STRATEGY.9 \ VOP_VPTOCNP.9 \ VOP_VPTOFH.9 \ vref.9 \ vrefcnt.9 \ vrele.9 \ vslock.9 \ 
watchdog.9 \ zone.9 MLINKS= unr.9 alloc_unr.9 \ unr.9 alloc_unrl.9 \ unr.9 alloc_unr_specific.9 \ unr.9 clear_unrhdr.9 \ unr.9 delete_unrhdr.9 \ unr.9 free_unr.9 \ unr.9 new_unrhdr.9 MLINKS+=accept_filter.9 accept_filt_add.9 \ accept_filter.9 accept_filt_del.9 \ accept_filter.9 accept_filt_generic_mod_event.9 \ accept_filter.9 accept_filt_get.9 MLINKS+=alq.9 ALQ.9 \ alq.9 alq_close.9 \ alq.9 alq_flush.9 \ alq.9 alq_get.9 \ alq.9 alq_getn.9 \ alq.9 alq_open.9 \ alq.9 alq_open_flags.9 \ alq.9 alq_post.9 \ alq.9 alq_post_flags.9 \ alq.9 alq_write.9 \ alq.9 alq_writen.9 MLINKS+=altq.9 ALTQ.9 MLINKS+=atomic.9 atomic_add.9 \ atomic.9 atomic_clear.9 \ atomic.9 atomic_cmpset.9 \ atomic.9 atomic_fetchadd.9 \ atomic.9 atomic_load.9 \ atomic.9 atomic_readandclear.9 \ atomic.9 atomic_set.9 \ atomic.9 atomic_store.9 \ atomic.9 atomic_subtract.9 \ atomic.9 atomic_swap.9 \ atomic.9 atomic_testandset.9 MLINKS+=bhnd.9 BHND_MATCH_BOARD_TYPE.9 \ bhnd.9 BHND_MATCH_BOARD_VENDOR.9 \ bhnd.9 BHND_MATCH_CHIP_ID.9 \ bhnd.9 BHND_MATCH_CHIP_PKG.9 \ bhnd.9 BHND_MATCH_CHIP_REV.9 \ bhnd.9 BHND_MATCH_CORE_ID.9 \ bhnd.9 BHND_MATCH_CORE_VENDOR.9 \ bhnd.9 bhnd_activate_resource.9 \ bhnd.9 bhnd_alloc_pmu.9 \ bhnd.9 bhnd_alloc_resource.9 \ bhnd.9 bhnd_alloc_resource_any.9 \ bhnd.9 bhnd_alloc_resources.9 \ bhnd.9 bhnd_board_matches.9 \ bhnd.9 bhnd_bus_match_child.9 \ bhnd.9 bhnd_bus_read_1.9 \ bhnd.9 bhnd_bus_read_2.9 \ bhnd.9 bhnd_bus_read_4.9 \ bhnd.9 bhnd_bus_read_stream_1.9 \ bhnd.9 bhnd_bus_read_stream_2.9 \ bhnd.9 bhnd_bus_read_stream_4.9 \ bhnd.9 bhnd_bus_write_1.9 \ bhnd.9 bhnd_bus_write_2.9 \ bhnd.9 bhnd_bus_write_4.9 \ bhnd.9 bhnd_bus_write_stream_1.9 \ bhnd.9 bhnd_bus_write_stream_2.9 \ bhnd.9 bhnd_bus_write_stream_4.9 \ bhnd.9 bhnd_chip_matches.9 \ bhnd.9 bhnd_core_class.9 \ bhnd.9 bhnd_core_get_match_desc.9 \ bhnd.9 bhnd_core_matches.9 \ bhnd.9 bhnd_core_name.9 \ bhnd.9 bhnd_cores_equal.9 \ bhnd.9 bhnd_deactivate_resource.9 \ bhnd.9 bhnd_decode_port_rid.9 \ bhnd.9 bhnd_deregister_provider.9 \ bhnd.9 bhnd_device_lookup.9 \ bhnd.9 bhnd_device_matches.9 \ bhnd.9 bhnd_device_quirks.9 \ bhnd.9 bhnd_driver_get_erom_class.9 \ bhnd.9 bhnd_enable_clocks.9 \ bhnd.9 bhnd_find_core_class.9 \ bhnd.9 bhnd_find_core_name.9 \ bhnd.9 bhnd_format_chip_id.9 \ bhnd.9 bhnd_get_attach_type.9 \ bhnd.9 bhnd_get_chipid.9 \ bhnd.9 bhnd_get_class.9 \ bhnd.9 bhnd_get_clock_freq.9 \ bhnd.9 bhnd_get_clock_latency.9 \ bhnd.9 bhnd_get_core_index.9 \ bhnd.9 bhnd_get_core_info.9 \ bhnd.9 bhnd_get_core_unit.9 \ bhnd.9 bhnd_get_device.9 \ bhnd.9 bhnd_get_device_name.9 \ bhnd.9 bhnd_get_dma_translation.9 \ bhnd.9 bhnd_get_hwrev.9 \ bhnd.9 bhnd_get_intr_count.9 \ bhnd.9 bhnd_get_intr_ivec.9 \ bhnd.9 bhnd_get_port_count.9 \ bhnd.9 bhnd_get_port_rid.9 \ bhnd.9 bhnd_get_region_addr.9 \ bhnd.9 bhnd_get_region_count.9 \ bhnd.9 bhnd_get_vendor.9 \ bhnd.9 bhnd_get_vendor_name.9 \ bhnd.9 bhnd_hwrev_matches.9 \ bhnd.9 bhnd_is_hw_suspended.9 \ bhnd.9 bhnd_is_region_valid.9 \ bhnd.9 bhnd_map_intr.9 \ bhnd.9 bhnd_match_core.9 \ bhnd.9 bhnd_nvram_getvar.9 \ bhnd.9 bhnd_nvram_getvar_array.9 \ bhnd.9 bhnd_nvram_getvar_int.9 \ bhnd.9 bhnd_nvram_getvar_int16.9 \ bhnd.9 bhnd_nvram_getvar_int32.9 \ bhnd.9 bhnd_nvram_getvar_int8.9 \ bhnd.9 bhnd_nvram_getvar_str.9 \ bhnd.9 bhnd_nvram_getvar_uint.9 \ bhnd.9 bhnd_nvram_getvar_uint16.9 \ bhnd.9 bhnd_nvram_getvar_uint32.9 \ bhnd.9 bhnd_nvram_getvar_uint8.9 \ bhnd.9 bhnd_nvram_string_array_next.9 \ bhnd.9 bhnd_read_board_info.9 \ bhnd.9 bhnd_read_config.9 \ bhnd.9 bhnd_read_ioctl.9 \ bhnd.9 bhnd_read_iost.9 \ bhnd.9 
bhnd_register_provider.9 \ bhnd.9 bhnd_release_ext_rsrc.9 \ bhnd.9 bhnd_release_pmu.9 \ bhnd.9 bhnd_release_provider.9 \ bhnd.9 bhnd_release_resource.9 \ bhnd.9 bhnd_release_resources.9 \ bhnd.9 bhnd_request_clock.9 \ bhnd.9 bhnd_request_ext_rsrc.9 \ bhnd.9 bhnd_reset_hw.9 \ bhnd.9 bhnd_retain_provider.9 \ bhnd.9 bhnd_set_custom_core_desc.9 \ bhnd.9 bhnd_set_default_core_desc.9 \ bhnd.9 bhnd_suspend_hw.9 \ bhnd.9 bhnd_unmap_intr.9 \ bhnd.9 bhnd_vendor_name.9 \ bhnd.9 bhnd_write_config.9 \ bhnd.9 bhnd_write_ioctl.9 MLINKS+=bhnd_erom.9 bhnd_erom_alloc.9 \ bhnd_erom.9 bhnd_erom_dump.9 \ bhnd_erom.9 bhnd_erom_fini_static.9 \ bhnd_erom.9 bhnd_erom_free.9 \ bhnd_erom.9 bhnd_erom_free_core_table.9 \ bhnd_erom.9 bhnd_erom_get_core_table.9 \ bhnd_erom.9 bhnd_erom_init_static.9 \ bhnd_erom.9 bhnd_erom_io.9 \ bhnd_erom.9 bhnd_erom_io_fini.9 \ bhnd_erom.9 bhnd_erom_io_map.9 \ bhnd_erom.9 bhnd_erom_io_read.9 \ bhnd_erom.9 bhnd_erom_iobus_init.9 \ bhnd_erom.9 bhnd_erom_iores_new.9 \ bhnd_erom.9 bhnd_erom_lookup_core.9 \ bhnd_erom.9 bhnd_erom_lookup_core_addr.9 \ bhnd_erom.9 bhnd_erom_probe.9 \ bhnd_erom.9 bhnd_erom_probe_driver_classes.9 MLINKS+=bitset.9 BITSET_DEFINE.9 \ bitset.9 BITSET_T_INITIALIZER.9 \ bitset.9 BITSET_FSET.9 \ bitset.9 BIT_CLR.9 \ bitset.9 BIT_COPY.9 \ bitset.9 BIT_ISSET.9 \ bitset.9 BIT_SET.9 \ bitset.9 BIT_ZERO.9 \ bitset.9 BIT_FILL.9 \ bitset.9 BIT_SETOF.9 \ bitset.9 BIT_EMPTY.9 \ bitset.9 BIT_ISFULLSET.9 \ bitset.9 BIT_FFS.9 \ bitset.9 BIT_COUNT.9 \ bitset.9 BIT_SUBSET.9 \ bitset.9 BIT_OVERLAP.9 \ bitset.9 BIT_CMP.9 \ bitset.9 BIT_OR.9 \ bitset.9 BIT_AND.9 \ bitset.9 BIT_NAND.9 \ bitset.9 BIT_CLR_ATOMIC.9 \ bitset.9 BIT_SET_ATOMIC.9 \ bitset.9 BIT_SET_ATOMIC_ACQ.9 \ bitset.9 BIT_AND_ATOMIC.9 \ bitset.9 BIT_OR_ATOMIC.9 \ bitset.9 BIT_COPY_STORE_REL.9 MLINKS+=bpf.9 bpfattach.9 \ bpf.9 bpfattach2.9 \ bpf.9 bpfdetach.9 \ bpf.9 bpf_filter.9 \ bpf.9 bpf_mtap.9 \ bpf.9 bpf_mtap2.9 \ bpf.9 bpf_tap.9 \ bpf.9 bpf_validate.9 MLINKS+=buf.9 bp.9 MLINKS+=buf_ring.9 buf_ring_alloc.9 \ buf_ring.9 buf_ring_free.9 \ buf_ring.9 buf_ring_enqueue.9 \ buf_ring.9 buf_ring_enqueue_bytes.9 \ buf_ring.9 buf_ring_dequeue_mc.9 \ buf_ring.9 buf_ring_dequeue_sc.9 \ buf_ring.9 buf_ring_count.9 \ buf_ring.9 buf_ring_empty.9 \ buf_ring.9 buf_ring_full.9 \ buf_ring.9 buf_ring_peek.9 MLINKS+=bus_activate_resource.9 bus_deactivate_resource.9 MLINKS+=bus_alloc_resource.9 bus_alloc_resource_any.9 MLINKS+=BUS_BIND_INTR.9 bus_bind_intr.9 MLINKS+=BUS_DESCRIBE_INTR.9 bus_describe_intr.9 MLINKS+=bus_dma.9 busdma.9 \ bus_dma.9 bus_dmamap_create.9 \ bus_dma.9 bus_dmamap_destroy.9 \ bus_dma.9 bus_dmamap_load.9 \ bus_dma.9 bus_dmamap_load_bio.9 \ bus_dma.9 bus_dmamap_load_ccb.9 \ bus_dma.9 bus_dmamap_load_mbuf.9 \ bus_dma.9 bus_dmamap_load_mbuf_sg.9 \ bus_dma.9 bus_dmamap_load_uio.9 \ bus_dma.9 bus_dmamap_sync.9 \ bus_dma.9 bus_dmamap_unload.9 \ bus_dma.9 bus_dmamem_alloc.9 \ bus_dma.9 bus_dmamem_free.9 \ bus_dma.9 bus_dma_tag_create.9 \ bus_dma.9 bus_dma_tag_destroy.9 MLINKS+=bus_generic_read_ivar.9 bus_generic_write_ivar.9 MLINKS+=BUS_GET_CPUS.9 bus_get_cpus.9 MLINKS+=bus_map_resource.9 bus_unmap_resource.9 \ bus_map_resource.9 resource_init_map_request.9 MLINKS+=BUS_READ_IVAR.9 BUS_WRITE_IVAR.9 MLINKS+=BUS_SETUP_INTR.9 bus_setup_intr.9 \ BUS_SETUP_INTR.9 BUS_TEARDOWN_INTR.9 \ BUS_SETUP_INTR.9 bus_teardown_intr.9 MLINKS+=bus_space.9 bus_space_alloc.9 \ bus_space.9 bus_space_barrier.9 \ bus_space.9 bus_space_copy_region_1.9 \ bus_space.9 bus_space_copy_region_2.9 \ bus_space.9 bus_space_copy_region_4.9 \ bus_space.9 
bus_space_copy_region_8.9 \ bus_space.9 bus_space_copy_region_stream_1.9 \ bus_space.9 bus_space_copy_region_stream_2.9 \ bus_space.9 bus_space_copy_region_stream_4.9 \ bus_space.9 bus_space_copy_region_stream_8.9 \ bus_space.9 bus_space_free.9 \ bus_space.9 bus_space_map.9 \ bus_space.9 bus_space_read_1.9 \ bus_space.9 bus_space_read_2.9 \ bus_space.9 bus_space_read_4.9 \ bus_space.9 bus_space_read_8.9 \ bus_space.9 bus_space_read_multi_1.9 \ bus_space.9 bus_space_read_multi_2.9 \ bus_space.9 bus_space_read_multi_4.9 \ bus_space.9 bus_space_read_multi_8.9 \ bus_space.9 bus_space_read_multi_stream_1.9 \ bus_space.9 bus_space_read_multi_stream_2.9 \ bus_space.9 bus_space_read_multi_stream_4.9 \ bus_space.9 bus_space_read_multi_stream_8.9 \ bus_space.9 bus_space_read_region_1.9 \ bus_space.9 bus_space_read_region_2.9 \ bus_space.9 bus_space_read_region_4.9 \ bus_space.9 bus_space_read_region_8.9 \ bus_space.9 bus_space_read_region_stream_1.9 \ bus_space.9 bus_space_read_region_stream_2.9 \ bus_space.9 bus_space_read_region_stream_4.9 \ bus_space.9 bus_space_read_region_stream_8.9 \ bus_space.9 bus_space_read_stream_1.9 \ bus_space.9 bus_space_read_stream_2.9 \ bus_space.9 bus_space_read_stream_4.9 \ bus_space.9 bus_space_read_stream_8.9 \ bus_space.9 bus_space_set_multi_1.9 \ bus_space.9 bus_space_set_multi_2.9 \ bus_space.9 bus_space_set_multi_4.9 \ bus_space.9 bus_space_set_multi_8.9 \ bus_space.9 bus_space_set_multi_stream_1.9 \ bus_space.9 bus_space_set_multi_stream_2.9 \ bus_space.9 bus_space_set_multi_stream_4.9 \ bus_space.9 bus_space_set_multi_stream_8.9 \ bus_space.9 bus_space_set_region_1.9 \ bus_space.9 bus_space_set_region_2.9 \ bus_space.9 bus_space_set_region_4.9 \ bus_space.9 bus_space_set_region_8.9 \ bus_space.9 bus_space_set_region_stream_1.9 \ bus_space.9 bus_space_set_region_stream_2.9 \ bus_space.9 bus_space_set_region_stream_4.9 \ bus_space.9 bus_space_set_region_stream_8.9 \ bus_space.9 bus_space_subregion.9 \ bus_space.9 bus_space_unmap.9 \ bus_space.9 bus_space_write_1.9 \ bus_space.9 bus_space_write_2.9 \ bus_space.9 bus_space_write_4.9 \ bus_space.9 bus_space_write_8.9 \ bus_space.9 bus_space_write_multi_1.9 \ bus_space.9 bus_space_write_multi_2.9 \ bus_space.9 bus_space_write_multi_4.9 \ bus_space.9 bus_space_write_multi_8.9 \ bus_space.9 bus_space_write_multi_stream_1.9 \ bus_space.9 bus_space_write_multi_stream_2.9 \ bus_space.9 bus_space_write_multi_stream_4.9 \ bus_space.9 bus_space_write_multi_stream_8.9 \ bus_space.9 bus_space_write_region_1.9 \ bus_space.9 bus_space_write_region_2.9 \ bus_space.9 bus_space_write_region_4.9 \ bus_space.9 bus_space_write_region_8.9 \ bus_space.9 bus_space_write_region_stream_1.9 \ bus_space.9 bus_space_write_region_stream_2.9 \ bus_space.9 bus_space_write_region_stream_4.9 \ bus_space.9 bus_space_write_region_stream_8.9 \ bus_space.9 bus_space_write_stream_1.9 \ bus_space.9 bus_space_write_stream_2.9 \ bus_space.9 bus_space_write_stream_4.9 \ bus_space.9 bus_space_write_stream_8.9 MLINKS+=byteorder.9 be16dec.9 \ byteorder.9 be16enc.9 \ byteorder.9 be16toh.9 \ byteorder.9 be32dec.9 \ byteorder.9 be32enc.9 \ byteorder.9 be32toh.9 \ byteorder.9 be64dec.9 \ byteorder.9 be64enc.9 \ byteorder.9 be64toh.9 \ byteorder.9 bswap16.9 \ byteorder.9 bswap32.9 \ byteorder.9 bswap64.9 \ byteorder.9 htobe16.9 \ byteorder.9 htobe32.9 \ byteorder.9 htobe64.9 \ byteorder.9 htole16.9 \ byteorder.9 htole32.9 \ byteorder.9 htole64.9 \ byteorder.9 le16dec.9 \ byteorder.9 le16enc.9 \ byteorder.9 le16toh.9 \ byteorder.9 le32dec.9 \ byteorder.9 
le32enc.9 \ byteorder.9 le32toh.9 \ byteorder.9 le64dec.9 \ byteorder.9 le64enc.9 \ byteorder.9 le64toh.9 MLINKS+=cnv.9 cnvlist.9 \ cnv.9 cnvlist_free_binary.9 \ cnv.9 cnvlist_free_bool.9 \ cnv.9 cnvlist_free_bool_array.9 \ cnv.9 cnvlist_free_descriptor.9 \ cnv.9 cnvlist_free_descriptor_array.9 \ cnv.9 cnvlist_free_null.9 \ cnv.9 cnvlist_free_number.9 \ cnv.9 cnvlist_free_number_array.9 \ cnv.9 cnvlist_free_nvlist.9 \ cnv.9 cnvlist_free_nvlist_array.9 \ cnv.9 cnvlist_free_string.9 \ cnv.9 cnvlist_free_string_array.9 \ cnv.9 cnvlist_get_binary.9 \ cnv.9 cnvlist_get_bool.9 \ cnv.9 cnvlist_get_bool_array.9 \ cnv.9 cnvlist_get_descriptor.9 \ cnv.9 cnvlist_get_descriptor_array.9 \ cnv.9 cnvlist_get_number.9 \ cnv.9 cnvlist_get_number_array.9 \ cnv.9 cnvlist_get_nvlist.9 \ cnv.9 cnvlist_get_nvlist_array.9 \ cnv.9 cnvlist_get_string.9 \ cnv.9 cnvlist_get_string_array.9 \ cnv.9 cnvlist_take_binary.9 \ cnv.9 cnvlist_take_bool.9 \ cnv.9 cnvlist_take_bool_array.9 \ cnv.9 cnvlist_take_descriptor.9 \ cnv.9 cnvlist_take_descriptor_array.9 \ cnv.9 cnvlist_take_number.9 \ cnv.9 cnvlist_take_number_array.9 \ cnv.9 cnvlist_take_nvlist.9 \ cnv.9 cnvlist_take_nvlist_array.9 \ cnv.9 cnvlist_take_string.9 \ cnv.9 cnvlist_take_string_array.9 MLINKS+=condvar.9 cv_broadcast.9 \ condvar.9 cv_broadcastpri.9 \ condvar.9 cv_destroy.9 \ condvar.9 cv_init.9 \ condvar.9 cv_signal.9 \ condvar.9 cv_timedwait.9 \ condvar.9 cv_timedwait_sig.9 \ condvar.9 cv_timedwait_sig_sbt.9 \ condvar.9 cv_wait.9 \ condvar.9 cv_wait_sig.9 \ condvar.9 cv_wait_unlock.9 \ condvar.9 cv_wmesg.9 MLINKS+=config_intrhook.9 config_intrhook_disestablish.9 \ config_intrhook.9 config_intrhook_establish.9 \ config_intrhook.9 config_intrhook_oneshot.9 MLINKS+=contigmalloc.9 contigfree.9 MLINKS+=casuword.9 casueword.9 \ casuword.9 casueword32.9 \ casuword.9 casuword32.9 MLINKS+=copy.9 copyin.9 \ copy.9 copyin_nofault.9 \ copy.9 copyinstr.9 \ copy.9 copyout.9 \ copy.9 copyout_nofault.9 \ copy.9 copystr.9 MLINKS+=counter.9 counter_u64_alloc.9 \ counter.9 counter_u64_free.9 \ counter.9 counter_u64_add.9 \ counter.9 counter_enter.9 \ counter.9 counter_exit.9 \ counter.9 counter_u64_add_protected.9 \ counter.9 counter_u64_fetch.9 \ counter.9 counter_u64_zero.9 \ counter.9 SYSCTL_COUNTER_U64.9 \ counter.9 SYSCTL_ADD_COUNTER_U64.9 \ counter.9 SYSCTL_COUNTER_U64_ARRAY.9 \ counter.9 SYSCTL_ADD_COUNTER_U64_ARRAY.9 MLINKS+=cpuset.9 CPUSET_T_INITIALIZER.9 \ cpuset.9 CPUSET_FSET.9 \ cpuset.9 CPU_CLR.9 \ cpuset.9 CPU_COPY.9 \ cpuset.9 CPU_ISSET.9 \ cpuset.9 CPU_SET.9 \ cpuset.9 CPU_ZERO.9 \ cpuset.9 CPU_FILL.9 \ cpuset.9 CPU_SETOF.9 \ cpuset.9 CPU_EMPTY.9 \ cpuset.9 CPU_ISFULLSET.9 \ cpuset.9 CPU_FFS.9 \ cpuset.9 CPU_COUNT.9 \ cpuset.9 CPU_SUBSET.9 \ cpuset.9 CPU_OVERLAP.9 \ cpuset.9 CPU_CMP.9 \ cpuset.9 CPU_OR.9 \ cpuset.9 CPU_AND.9 \ cpuset.9 CPU_NAND.9 \ cpuset.9 CPU_CLR_ATOMIC.9 \ cpuset.9 CPU_SET_ATOMIC.9 \ cpuset.9 CPU_SET_ATOMIC_ACQ.9 \ cpuset.9 CPU_AND_ATOMIC.9 \ cpuset.9 CPU_OR_ATOMIC.9 \ cpuset.9 CPU_COPY_STORE_REL.9 MLINKS+=critical_enter.9 critical.9 \ critical_enter.9 critical_exit.9 MLINKS+=crypto.9 crypto_dispatch.9 \ crypto.9 crypto_done.9 \ crypto.9 crypto_freereq.9 \ crypto.9 crypto_freesession.9 \ crypto.9 crypto_get_driverid.9 \ crypto.9 crypto_getreq.9 \ crypto.9 crypto_kdispatch.9 \ crypto.9 crypto_kdone.9 \ crypto.9 crypto_kregister.9 \ crypto.9 crypto_newsession.9 \ crypto.9 crypto_register.9 \ crypto.9 crypto_unblock.9 \ crypto.9 crypto_unregister.9 \ crypto.9 crypto_unregister_all.9 MLINKS+=DB_COMMAND.9 DB_SHOW_ALL_COMMAND.9 \ DB_COMMAND.9 
DB_SHOW_COMMAND.9 MLINKS+=DECLARE_MODULE.9 DECLARE_MODULE_TIED.9 MLINKS+=dev_clone.9 drain_dev_clone_events.9 MLINKS+=devfs_set_cdevpriv.9 devfs_clear_cdevpriv.9 \ devfs_set_cdevpriv.9 devfs_get_cdevpriv.9 MLINKS+=device_add_child.9 device_add_child_ordered.9 MLINKS+=device_enable.9 device_disable.9 \ device_enable.9 device_is_enabled.9 MLINKS+=device_get_ivars.9 device_set_ivars.9 MLINKS+=device_get_name.9 device_get_nameunit.9 MLINKS+=device_get_state.9 device_busy.9 \ device_get_state.9 device_is_alive.9 \ device_get_state.9 device_is_attached.9 \ device_get_state.9 device_unbusy.9 MLINKS+=device_get_sysctl.9 device_get_sysctl_ctx.9 \ device_get_sysctl.9 device_get_sysctl_tree.9 MLINKS+=device_quiet.9 device_is_quiet.9 \ device_quiet.9 device_verbose.9 MLINKS+=device_set_desc.9 device_get_desc.9 \ device_set_desc.9 device_set_desc_copy.9 MLINKS+=device_set_flags.9 device_get_flags.9 MLINKS+=devstat.9 devicestat.9 \ devstat.9 devstat_add_entry.9 \ devstat.9 devstat_end_transaction.9 \ devstat.9 devstat_remove_entry.9 \ devstat.9 devstat_start_transaction.9 MLINKS+=disk.9 disk_add_alias.9 \ disk.9 disk_alloc.9 \ disk.9 disk_create.9 \ disk.9 disk_destroy.9 \ disk.9 disk_gone.9 \ disk.9 disk_resize.9 MLINKS+=dnv.9 dnvlist.9 \ dnv.9 dnvlist_get_binary.9 \ dnv.9 dnvlist_get_bool.9 \ dnv.9 dnvlist_get_descriptor.9 \ dnv.9 dnvlist_get_number.9 \ dnv.9 dnvlist_get_nvlist.9 \ dnv.9 dnvlist_get_string.9 \ dnv.9 dnvlist_take_binary.9 \ dnv.9 dnvlist_take_bool.9 \ dnv.9 dnvlist_take_descriptor.9 \ dnv.9 dnvlist_take_number.9 \ dnv.9 dnvlist_take_nvlist.9 \ dnv.9 dnvlist_take_string.9 MLINKS+=domain.9 DOMAIN_SET.9 \ domain.9 domain_add.9 \ domain.9 pfctlinput.9 \ domain.9 pfctlinput2.9 \ domain.9 pffinddomain.9 \ domain.9 pffindproto.9 \ domain.9 pffindtype.9 MLINKS+=drbr.9 drbr_free.9 \ drbr.9 drbr_enqueue.9 \ drbr.9 drbr_dequeue.9 \ drbr.9 drbr_dequeue_cond.9 \ drbr.9 drbr_flush.9 \ drbr.9 drbr_empty.9 \ drbr.9 drbr_inuse.9 \ drbr.9 drbr_stats_update.9 MLINKS+=DRIVER_MODULE.9 DRIVER_MODULE_ORDERED.9 \ DRIVER_MODULE.9 EARLY_DRIVER_MODULE.9 \ DRIVER_MODULE.9 EARLY_DRIVER_MODULE_ORDERED.9 MLINKS+=EVENTHANDLER.9 EVENTHANDLER_DECLARE.9 \ EVENTHANDLER.9 EVENTHANDLER_DEFINE.9 \ EVENTHANDLER.9 EVENTHANDLER_DEREGISTER.9 \ EVENTHANDLER.9 eventhandler_deregister.9 \ EVENTHANDLER.9 eventhandler_find_list.9 \ EVENTHANDLER.9 EVENTHANDLER_INVOKE.9 \ EVENTHANDLER.9 eventhandler_prune_list.9 \ EVENTHANDLER.9 EVENTHANDLER_REGISTER.9 \ EVENTHANDLER.9 eventhandler_register.9 MLINKS+=eventtimers.9 et_register.9 \ eventtimers.9 et_deregister.9 \ eventtimers.9 et_ban.9 \ eventtimers.9 et_find.9 \ eventtimers.9 et_free.9 \ eventtimers.9 et_init.9 \ eventtimers.9 ET_LOCK.9 \ eventtimers.9 ET_UNLOCK.9 \ eventtimers.9 et_start.9 \ eventtimers.9 et_stop.9 MLINKS+=fail.9 KFAIL_POINT_CODE.9 \ fail.9 KFAIL_POINT_ERROR.9 \ fail.9 KFAIL_POINT_GOTO.9 \ fail.9 KFAIL_POINT_RETURN.9 \ fail.9 KFAIL_POINT_RETURN_VOID.9 MLINKS+=fdt_pinctrl.9 fdt_pinctrl_configure.9 \ fdt_pinctrl.9 fdt_pinctrl_configure_by_name.9 \ fdt_pinctrl.9 fdt_pinctrl_configure_tree.9 \ fdt_pinctrl.9 fdt_pinctrl_register.9 MLINKS+=fetch.9 fubyte.9 \ fetch.9 fuswintr.9 \ fetch.9 fuword.9 \ fetch.9 fuword16.9 \ fetch.9 fuword32.9 \ fetch.9 fuword64.9 \ fetch.9 fueword.9 \ fetch.9 fueword32.9 \ fetch.9 fueword64.9 MLINKS+=firmware.9 firmware_get.9 \ firmware.9 firmware_put.9 \ firmware.9 firmware_register.9 \ firmware.9 firmware_unregister.9 MLINKS+=fpu_kern.9 fpu_kern_alloc_ctx.9 \ fpu_kern.9 fpu_kern_free_ctx.9 \ fpu_kern.9 fpu_kern_enter.9 \ fpu_kern.9 
fpu_kern_leave.9 \ fpu_kern.9 fpu_kern_thread.9 \ fpu_kern.9 is_fpu_kern_thread.9 MLINKS+=g_attach.9 g_detach.9 MLINKS+=g_bio.9 g_alloc_bio.9 \ g_bio.9 g_clone_bio.9 \ g_bio.9 g_destroy_bio.9 \ g_bio.9 g_duplicate_bio.9 \ g_bio.9 g_new_bio.9 \ g_bio.9 g_print_bio.9 \ g_bio.9 g_reset_bio.9 MLINKS+=g_consumer.9 g_destroy_consumer.9 \ g_consumer.9 g_new_consumer.9 MLINKS+=g_data.9 g_read_data.9 \ g_data.9 g_write_data.9 MLINKS+=getenv.9 freeenv.9 \ getenv.9 getenv_int.9 \ getenv.9 getenv_long.9 \ getenv.9 getenv_string.9 \ getenv.9 getenv_quad.9 \ getenv.9 getenv_uint.9 \ getenv.9 getenv_ulong.9 \ getenv.9 kern_getenv.9 \ getenv.9 kern_setenv.9 \ getenv.9 kern_unsetenv.9 \ getenv.9 setenv.9 \ getenv.9 testenv.9 \ getenv.9 unsetenv.9 MLINKS+=g_event.9 g_cancel_event.9 \ g_event.9 g_post_event.9 \ g_event.9 g_waitfor_event.9 MLINKS+=g_geom.9 g_destroy_geom.9 \ g_geom.9 g_new_geomf.9 MLINKS+=g_provider.9 g_destroy_provider.9 \ g_provider.9 g_error_provider.9 \ g_provider.9 g_new_providerf.9 MLINKS+=hash.9 hash32.9 \ hash.9 hash32_buf.9 \ hash.9 hash32_str.9 \ hash.9 hash32_stre.9 \ hash.9 hash32_strn.9 \ hash.9 hash32_strne.9 \ hash.9 jenkins_hash.9 \ hash.9 jenkins_hash32.9 MLINKS+=hashinit.9 hashdestroy.9 \ hashinit.9 hashinit_flags.9 \ hashinit.9 phashinit.9 MLINKS+=hhook.9 hhook_head_register.9 \ hhook.9 hhook_head_deregister.9 \ hhook.9 hhook_head_deregister_lookup.9 \ hhook.9 hhook_run_hooks.9 \ hhook.9 HHOOKS_RUN_IF.9 \ hhook.9 HHOOKS_RUN_LOOKUP_IF.9 MLINKS+=ieee80211.9 ieee80211_ifattach.9 \ ieee80211.9 ieee80211_ifdetach.9 MLINKS+=ieee80211_amrr.9 ieee80211_amrr_choose.9 \ ieee80211_amrr.9 ieee80211_amrr_cleanup.9 \ ieee80211_amrr.9 ieee80211_amrr_init.9 \ ieee80211_amrr.9 ieee80211_amrr_node_init.9 \ ieee80211_amrr.9 ieee80211_amrr_setinterval.9 \ ieee80211_amrr.9 ieee80211_amrr_tx_complete.9 \ ieee80211_amrr.9 ieee80211_amrr_tx_update.9 MLINKS+=ieee80211_beacon.9 ieee80211_beacon_alloc.9 \ ieee80211_beacon.9 ieee80211_beacon_notify.9 \ ieee80211_beacon.9 ieee80211_beacon_update.9 MLINKS+=ieee80211_bmiss.9 ieee80211_beacon_miss.9 MLINKS+=ieee80211_crypto.9 ieee80211_crypto_available.9 \ ieee80211_crypto.9 ieee80211_crypto_decap.9 \ ieee80211_crypto.9 ieee80211_crypto_delglobalkeys.9 \ ieee80211_crypto.9 ieee80211_crypto_delkey.9 \ ieee80211_crypto.9 ieee80211_crypto_demic.9 \ ieee80211_crypto.9 ieee80211_crypto_encap.9 \ ieee80211_crypto.9 ieee80211_crypto_enmic.9 \ ieee80211_crypto.9 ieee80211_crypto_newkey.9 \ ieee80211_crypto.9 ieee80211_crypto_register.9 \ ieee80211_crypto.9 ieee80211_crypto_reload_keys.9 \ ieee80211_crypto.9 ieee80211_crypto_setkey.9 \ ieee80211_crypto.9 ieee80211_crypto_unregister.9 \ ieee80211_crypto.9 ieee80211_key_update_begin.9 \ ieee80211_crypto.9 ieee80211_key_update_end.9 \ ieee80211_crypto.9 ieee80211_notify_michael_failure.9 \ ieee80211_crypto.9 ieee80211_notify_replay_failure.9 MLINKS+=ieee80211_input.9 ieee80211_input_all.9 MLINKS+=ieee80211_node.9 ieee80211_dump_node.9 \ ieee80211_node.9 ieee80211_dump_nodes.9 \ ieee80211_node.9 ieee80211_find_rxnode.9 \ ieee80211_node.9 ieee80211_find_rxnode_withkey.9 \ ieee80211_node.9 ieee80211_free_node.9 \ ieee80211_node.9 ieee80211_iterate_nodes.9 \ ieee80211_node.9 ieee80211_ref_node.9 \ ieee80211_node.9 ieee80211_unref_node.9 MLINKS+=ieee80211_output.9 ieee80211_process_callback.9 \ ieee80211_output.9 M_SEQNO_GET.9 \ ieee80211_output.9 M_WME_GETAC.9 MLINKS+=ieee80211_proto.9 ieee80211_new_state.9 \ ieee80211_proto.9 ieee80211_resume_all.9 \ ieee80211_proto.9 ieee80211_start_all.9 \ ieee80211_proto.9 
ieee80211_stop_all.9 \ ieee80211_proto.9 ieee80211_suspend_all.9 \ ieee80211_proto.9 ieee80211_waitfor_parent.9 MLINKS+=ieee80211_radiotap.9 ieee80211_radiotap_active.9 \ ieee80211_radiotap.9 ieee80211_radiotap_active_vap.9 \ ieee80211_radiotap.9 ieee80211_radiotap_attach.9 \ ieee80211_radiotap.9 ieee80211_radiotap_tx.9 \ ieee80211_radiotap.9 radiotap.9 MLINKS+=ieee80211_regdomain.9 ieee80211_alloc_countryie.9 \ ieee80211_regdomain.9 ieee80211_init_channels.9 \ ieee80211_regdomain.9 ieee80211_sort_channels.9 MLINKS+=ieee80211_scan.9 ieee80211_add_scan.9 \ ieee80211_scan.9 ieee80211_bg_scan.9 \ ieee80211_scan.9 ieee80211_cancel_scan.9 \ ieee80211_scan.9 ieee80211_cancel_scan_any.9 \ ieee80211_scan.9 ieee80211_check_scan.9 \ ieee80211_scan.9 ieee80211_check_scan_current.9 \ ieee80211_scan.9 ieee80211_flush.9 \ ieee80211_scan.9 ieee80211_probe_curchan.9 \ ieee80211_scan.9 ieee80211_scan_assoc_fail.9 \ ieee80211_scan.9 ieee80211_scan_done.9 \ ieee80211_scan.9 ieee80211_scan_dump_channels.9 \ ieee80211_scan.9 ieee80211_scan_flush.9 \ ieee80211_scan.9 ieee80211_scan_iterate.9 \ ieee80211_scan.9 ieee80211_scan_next.9 \ ieee80211_scan.9 ieee80211_scan_timeout.9 \ ieee80211_scan.9 ieee80211_scanner_get.9 \ ieee80211_scan.9 ieee80211_scanner_register.9 \ ieee80211_scan.9 ieee80211_scanner_unregister.9 \ ieee80211_scan.9 ieee80211_scanner_unregister_all.9 \ ieee80211_scan.9 ieee80211_start_scan.9 MLINKS+=ieee80211_vap.9 ieee80211_vap_attach.9 \ ieee80211_vap.9 ieee80211_vap_detach.9 \ ieee80211_vap.9 ieee80211_vap_setup.9 MLINKS+=iflibdd.9 ifdi_attach_pre.9 \ iflibdd.9 ifdi_attach_post.9 \ iflibdd.9 ifdi_detach.9 \ iflibdd.9 ifdi_get_counter.9 \ iflibdd.9 ifdi_i2c_req.9 \ iflibdd.9 ifdi_init.9 \ iflibdd.9 ifdi_intr_enable.9 \ iflibdd.9 ifdi_intr_disable.9 \ iflibdd.9 ifdi_led_func.9 \ iflibdd.9 ifdi_link_intr_enable.9 \ iflibdd.9 ifdi_media_set.9 \ iflibdd.9 ifdi_media_status.9 \ iflibdd.9 ifdi_media_change.9 \ iflibdd.9 ifdi_mtu_set.9 \ iflibdd.9 ifdi_multi_set.9 \ iflibdd.9 ifdi_promisc_set.9 \ iflibdd.9 ifdi_queues_alloc.9 \ iflibdd.9 ifdi_queues_free.9 \ iflibdd.9 ifdi_queue_intr_enable.9 \ iflibdd.9 ifdi_resume.9 \ iflibdd.9 ifdi_rxq_setup.9 \ iflibdd.9 ifdi_stop.9 \ iflibdd.9 ifdi_suspend.9 \ iflibdd.9 ifdi_sysctl_int_delay.9 \ iflibdd.9 ifdi_timer.9 \ iflibdd.9 ifdi_txq_setup.9 \ iflibdd.9 ifdi_update_admin_status.9 \ iflibdd.9 ifdi_vf_add.9 \ iflibdd.9 ifdi_vflr_handle.9 \ iflibdd.9 ifdi_vlan_register.9 \ iflibdd.9 ifdi_vlan_unregister.9 \ iflibdd.9 ifdi_watchdog_reset.9 \ iflibdd.9 iov_init.9 \ iflibdd.9 iov_uinit.9 MLINKS+=iflibdi.9 iflib_add_int_delay_sysctl.9 \ iflibdi.9 iflib_device_attach.9 \ iflibdi.9 iflib_device_deregister.9 \ iflibdi.9 iflib_device_detach.9 \ iflibdi.9 iflib_device_suspend.9 \ iflibdi.9 iflib_device_register.9 \ iflibdi.9 iflib_device_resume.9 \ iflibdi.9 iflib_led_create.9 \ iflibdi.9 iflib_irq_alloc.9 \ iflibdi.9 iflib_irq_alloc_generic.9 \ iflibdi.9 iflib_link_intr_deferred.9 \ iflibdi.9 iflib_link_state_change.9 \ iflibdi.9 iflib_rx_intr_deferred.9 \ iflibdi.9 iflib_tx_intr_deferred.9 MLINKS+=iflibtxrx.9 isc_rxd_available.9 \ iflibtxrx.9 isc_rxd_refill.9 \ iflibtxrx.9 isc_rxd_flush.9 \ iflibtxrx.9 isc_rxd_pkt_get.9 \ iflibtxrx.9 isc_txd_credits_update.9 \ iflibtxrx.9 isc_txd_encap.9 \ iflibtxrx.9 isc_txd_flush.9 MLINKS+=ifnet.9 if_addmulti.9 \ ifnet.9 if_alloc.9 \ ifnet.9 if_allmulti.9 \ ifnet.9 if_attach.9 \ ifnet.9 if_data.9 \ ifnet.9 IF_DEQUEUE.9 \ ifnet.9 if_delmulti.9 \ ifnet.9 if_detach.9 \ ifnet.9 if_down.9 \ ifnet.9 if_findmulti.9 \ ifnet.9 if_free.9 \ 
ifnet.9 if_free_type.9 \ ifnet.9 if_up.9 \ ifnet.9 ifa_free.9 \ ifnet.9 ifa_ifwithaddr.9 \ ifnet.9 ifa_ifwithdstaddr.9 \ ifnet.9 ifa_ifwithnet.9 \ ifnet.9 ifa_ref.9 \ ifnet.9 ifaddr.9 \ ifnet.9 ifaddr_byindex.9 \ ifnet.9 ifaof_ifpforaddr.9 \ ifnet.9 ifioctl.9 \ ifnet.9 ifpromisc.9 \ ifnet.9 ifqueue.9 \ ifnet.9 ifunit.9 \ ifnet.9 ifunit_ref.9 MLINKS+=insmntque.9 insmntque1.9 MLINKS+=ithread.9 ithread_add_handler.9 \ ithread.9 ithread_create.9 \ ithread.9 ithread_destroy.9 \ ithread.9 ithread_priority.9 \ ithread.9 ithread_remove_handler.9 \ ithread.9 ithread_schedule.9 MLINKS+=kernacc.9 useracc.9 MLINKS+=kernel_mount.9 free_mntarg.9 \ kernel_mount.9 kernel_vmount.9 \ kernel_mount.9 mount_arg.9 \ kernel_mount.9 mount_argb.9 \ kernel_mount.9 mount_argf.9 \ kernel_mount.9 mount_argsu.9 MLINKS+=khelp.9 khelp_add_hhook.9 \ khelp.9 KHELP_DECLARE_MOD.9 \ khelp.9 KHELP_DECLARE_MOD_UMA.9 \ khelp.9 khelp_destroy_osd.9 \ khelp.9 khelp_get_id.9 \ khelp.9 khelp_get_osd.9 \ khelp.9 khelp_init_osd.9 \ khelp.9 khelp_remove_hhook.9 MLINKS+=kobj.9 DEFINE_CLASS.9 \ kobj.9 kobj_class_compile.9 \ kobj.9 kobj_class_compile_static.9 \ kobj.9 kobj_class_free.9 \ kobj.9 kobj_create.9 \ kobj.9 kobj_delete.9 \ kobj.9 kobj_init.9 \ kobj.9 kobj_init_static.9 MLINKS+=kproc.9 kproc_create.9 \ kproc.9 kproc_exit.9 \ kproc.9 kproc_kthread_add.9 \ kproc.9 kproc_resume.9 \ kproc.9 kproc_shutdown.9 \ kproc.9 kproc_start.9 \ kproc.9 kproc_suspend.9 \ kproc.9 kproc_suspend_check.9 \ kproc.9 kthread_create.9 MLINKS+=kqueue.9 knlist_add.9 \ kqueue.9 knlist_clear.9 \ kqueue.9 knlist_delete.9 \ kqueue.9 knlist_destroy.9 \ kqueue.9 knlist_empty.9 \ kqueue.9 knlist_init.9 \ kqueue.9 knlist_init_mtx.9 \ kqueue.9 knlist_init_rw_reader.9 \ kqueue.9 knlist_remove.9 \ kqueue.9 knlist_remove_inevent.9 \ kqueue.9 knote_fdclose.9 \ kqueue.9 KNOTE_LOCKED.9 \ kqueue.9 KNOTE_UNLOCKED.9 \ kqueue.9 kqfd_register.9 \ kqueue.9 kqueue_add_filteropts.9 \ kqueue.9 kqueue_del_filteropts.9 MLINKS+=kthread.9 kthread_add.9 \ kthread.9 kthread_exit.9 \ kthread.9 kthread_resume.9 \ kthread.9 kthread_shutdown.9 \ kthread.9 kthread_start.9 \ kthread.9 kthread_suspend.9 \ kthread.9 kthread_suspend_check.9 MLINKS+=ktr.9 CTR0.9 \ ktr.9 CTR1.9 \ ktr.9 CTR2.9 \ ktr.9 CTR3.9 \ ktr.9 CTR4.9 \ ktr.9 CTR5.9 \ ktr.9 CTR6.9 MLINKS+=lock.9 lockdestroy.9 \ lock.9 lockinit.9 \ lock.9 lockmgr.9 \ lock.9 lockmgr_args.9 \ lock.9 lockmgr_args_rw.9 \ lock.9 lockmgr_assert.9 \ lock.9 lockmgr_disown.9 \ lock.9 lockmgr_printinfo.9 \ lock.9 lockmgr_recursed.9 \ lock.9 lockmgr_rw.9 \ lock.9 lockstatus.9 MLINKS+=LOCK_PROFILING.9 MUTEX_PROFILING.9 MLINKS+=make_dev.9 destroy_dev.9 \ make_dev.9 destroy_dev_drain.9 \ make_dev.9 destroy_dev_sched.9 \ make_dev.9 destroy_dev_sched_cb.9 \ make_dev.9 dev_depends.9 \ make_dev.9 make_dev_alias.9 \ make_dev.9 make_dev_alias_p.9 \ make_dev.9 make_dev_cred.9 \ make_dev.9 make_dev_credf.9 \ make_dev.9 make_dev_p.9 \ make_dev.9 make_dev_s.9 MLINKS+=malloc.9 free.9 \ malloc.9 malloc_domain.9 \ malloc.9 free_domain.9 \ malloc.9 mallocarray.9 \ malloc.9 MALLOC_DECLARE.9 \ malloc.9 MALLOC_DEFINE.9 \ malloc.9 realloc.9 \ malloc.9 reallocf.9 MLINKS+=mbchain.9 mb_detach.9 \ mbchain.9 mb_done.9 \ mbchain.9 mb_fixhdr.9 \ mbchain.9 mb_init.9 \ mbchain.9 mb_initm.9 \ mbchain.9 mb_put_int64be.9 \ mbchain.9 mb_put_int64le.9 \ mbchain.9 mb_put_mbuf.9 \ mbchain.9 mb_put_mem.9 \ mbchain.9 mb_put_uint16be.9 \ mbchain.9 mb_put_uint16le.9 \ mbchain.9 mb_put_uint32be.9 \ mbchain.9 mb_put_uint32le.9 \ mbchain.9 mb_put_uint8.9 \ mbchain.9 mb_put_uio.9 \ mbchain.9 
mb_reserve.9 MLINKS+=\ mbuf.9 m_adj.9 \ mbuf.9 m_align.9 \ mbuf.9 M_ALIGN.9 \ mbuf.9 m_append.9 \ mbuf.9 m_apply.9 \ mbuf.9 m_cat.9 \ mbuf.9 m_catpkt.9 \ mbuf.9 MCHTYPE.9 \ mbuf.9 MCLGET.9 \ mbuf.9 m_collapse.9 \ mbuf.9 m_copyback.9 \ mbuf.9 m_copydata.9 \ mbuf.9 m_copym.9 \ mbuf.9 m_copypacket.9 \ mbuf.9 m_copyup.9 \ mbuf.9 m_defrag.9 \ mbuf.9 m_devget.9 \ mbuf.9 m_dup.9 \ mbuf.9 m_dup_pkthdr.9 \ mbuf.9 MEXTADD.9 \ mbuf.9 m_fixhdr.9 \ mbuf.9 m_free.9 \ mbuf.9 m_freem.9 \ mbuf.9 MGET.9 \ mbuf.9 m_get.9 \ mbuf.9 m_get2.9 \ mbuf.9 m_getjcl.9 \ mbuf.9 m_getcl.9 \ mbuf.9 MGETHDR.9 \ mbuf.9 m_gethdr.9 \ mbuf.9 m_getm.9 \ mbuf.9 m_getptr.9 \ mbuf.9 MH_ALIGN.9 \ mbuf.9 M_LEADINGSPACE.9 \ mbuf.9 m_length.9 \ mbuf.9 M_MOVE_PKTHDR.9 \ mbuf.9 m_move_pkthdr.9 \ mbuf.9 M_PREPEND.9 \ mbuf.9 m_prepend.9 \ mbuf.9 m_pulldown.9 \ mbuf.9 m_pullup.9 \ mbuf.9 m_split.9 \ mbuf.9 mtod.9 \ mbuf.9 M_TRAILINGSPACE.9 \ mbuf.9 m_unshare.9 \ mbuf.9 M_WRITABLE.9 MLINKS+=\ mbuf_tags.9 m_tag_alloc.9 \ mbuf_tags.9 m_tag_copy.9 \ mbuf_tags.9 m_tag_copy_chain.9 \ mbuf_tags.9 m_tag_delete.9 \ mbuf_tags.9 m_tag_delete_chain.9 \ mbuf_tags.9 m_tag_delete_nonpersistent.9 \ mbuf_tags.9 m_tag_find.9 \ mbuf_tags.9 m_tag_first.9 \ mbuf_tags.9 m_tag_free.9 \ mbuf_tags.9 m_tag_get.9 \ mbuf_tags.9 m_tag_init.9 \ mbuf_tags.9 m_tag_locate.9 \ mbuf_tags.9 m_tag_next.9 \ mbuf_tags.9 m_tag_prepend.9 \ mbuf_tags.9 m_tag_unlink.9 MLINKS+=MD5.9 MD5Init.9 \ MD5.9 MD5Transform.9 MLINKS+=mdchain.9 md_append_record.9 \ mdchain.9 md_done.9 \ mdchain.9 md_get_int64.9 \ mdchain.9 md_get_int64be.9 \ mdchain.9 md_get_int64le.9 \ mdchain.9 md_get_mbuf.9 \ mdchain.9 md_get_mem.9 \ mdchain.9 md_get_uint16.9 \ mdchain.9 md_get_uint16be.9 \ mdchain.9 md_get_uint16le.9 \ mdchain.9 md_get_uint32.9 \ mdchain.9 md_get_uint32be.9 \ mdchain.9 md_get_uint32le.9 \ mdchain.9 md_get_uint8.9 \ mdchain.9 md_get_uio.9 \ mdchain.9 md_initm.9 \ mdchain.9 md_next_record.9 MLINKS+=microtime.9 bintime.9 \ microtime.9 getbintime.9 \ microtime.9 getmicrotime.9 \ microtime.9 getnanotime.9 \ microtime.9 nanotime.9 MLINKS+=microuptime.9 binuptime.9 \ microuptime.9 getbinuptime.9 \ microuptime.9 getmicrouptime.9 \ microuptime.9 getnanouptime.9 \ microuptime.9 getsbinuptime.9 \ microuptime.9 nanouptime.9 \ microuptime.9 sbinuptime.9 MLINKS+=mi_switch.9 cpu_switch.9 \ mi_switch.9 cpu_throw.9 MLINKS+=mod_cc.9 CCV.9 \ mod_cc.9 DECLARE_CC_MODULE.9 MLINKS+=mtx_pool.9 mtx_pool_alloc.9 \ mtx_pool.9 mtx_pool_create.9 \ mtx_pool.9 mtx_pool_destroy.9 \ mtx_pool.9 mtx_pool_find.9 \ mtx_pool.9 mtx_pool_lock.9 \ mtx_pool.9 mtx_pool_lock_spin.9 \ mtx_pool.9 mtx_pool_unlock.9 \ mtx_pool.9 mtx_pool_unlock_spin.9 MLINKS+=mutex.9 mtx_assert.9 \ mutex.9 mtx_destroy.9 \ mutex.9 mtx_init.9 \ mutex.9 mtx_initialized.9 \ mutex.9 mtx_lock.9 \ mutex.9 mtx_lock_flags.9 \ mutex.9 mtx_lock_spin.9 \ mutex.9 mtx_lock_spin_flags.9 \ mutex.9 mtx_owned.9 \ mutex.9 mtx_recursed.9 \ mutex.9 mtx_sleep.9 \ mutex.9 MTX_SYSINIT.9 \ mutex.9 mtx_trylock.9 \ mutex.9 mtx_trylock_flags.9 \ mutex.9 mtx_trylock_spin.9 \ mutex.9 mtx_trylock_spin_flags.9 \ mutex.9 mtx_unlock.9 \ mutex.9 mtx_unlock_flags.9 \ mutex.9 mtx_unlock_spin.9 \ mutex.9 mtx_unlock_spin_flags.9 MLINKS+=namei.9 NDFREE.9 \ namei.9 NDINIT.9 MLINKS+=netisr.9 netisr_clearqdrops.9 \ netisr.9 netisr_default_flow2cpu.9 \ netisr.9 netisr_dispatch.9 \ netisr.9 netisr_dispatch_src.9 \ netisr.9 netisr_get_cpucount.9 \ netisr.9 netisr_get_cpuid.9 \ netisr.9 netisr_getqdrops.9 \ netisr.9 netisr_getqlimit.9 \ netisr.9 netisr_queue.9 \ netisr.9 netisr_queue_src.9 \ 
netisr.9 netisr_register.9 \ netisr.9 netisr_setqlimit.9 \ netisr.9 netisr_unregister.9 MLINKS+=nv.9 libnv.9 \ nv.9 nvlist.9 \ nv.9 nvlist_add_binary.9 \ nv.9 nvlist_add_bool.9 \ nv.9 nvlist_add_bool_array.9 \ nv.9 nvlist_add_descriptor.9 \ nv.9 nvlist_add_descriptor_array.9 \ nv.9 nvlist_add_null.9 \ nv.9 nvlist_add_number.9 \ nv.9 nvlist_add_number_array.9 \ nv.9 nvlist_add_nvlist.9 \ nv.9 nvlist_add_nvlist_array.9 \ nv.9 nvlist_add_string.9 \ nv.9 nvlist_add_stringf.9 \ nv.9 nvlist_add_stringv.9 \ nv.9 nvlist_add_string_array.9 \ nv.9 nvlist_clone.9 \ nv.9 nvlist_create.9 \ nv.9 nvlist_destroy.9 \ nv.9 nvlist_dump.9 \ nv.9 nvlist_empty.9 \ nv.9 nvlist_error.9 \ nv.9 nvlist_exists.9 \ nv.9 nvlist_exists_binary.9 \ nv.9 nvlist_exists_bool.9 \ nv.9 nvlist_exists_bool_array.9 \ nv.9 nvlist_exists_descriptor.9 \ nv.9 nvlist_exists_descriptor_array.9 \ nv.9 nvlist_exists_null.9 \ nv.9 nvlist_exists_number.9 \ nv.9 nvlist_exists_number_array.9 \ nv.9 nvlist_exists_nvlist.9 \ nv.9 nvlist_exists_nvlist_array.9 \ nv.9 nvlist_exists_string.9 \ nv.9 nvlist_exists_type.9 \ nv.9 nvlist_fdump.9 \ nv.9 nvlist_flags.9 \ nv.9 nvlist_free.9 \ nv.9 nvlist_free_binary.9 \ nv.9 nvlist_free_bool.9 \ nv.9 nvlist_free_bool_array.9 \ nv.9 nvlist_free_descriptor.9 \ nv.9 nvlist_free_descriptor_array.9 \ nv.9 nvlist_free_null.9 \ nv.9 nvlist_free_number.9 \ nv.9 nvlist_free_number_array.9 \ nv.9 nvlist_free_nvlist.9 \ nv.9 nvlist_free_nvlist_array.9 \ nv.9 nvlist_free_string.9 \ nv.9 nvlist_free_string_array.9 \ nv.9 nvlist_free_type.9 \ nv.9 nvlist_get_binary.9 \ nv.9 nvlist_get_bool.9 \ nv.9 nvlist_get_bool_array.9 \ nv.9 nvlist_get_descriptor.9 \ nv.9 nvlist_get_descriptor_array.9 \ nv.9 nvlist_get_number.9 \ nv.9 nvlist_get_number_array.9 \ nv.9 nvlist_get_nvlist.9 \ nv.9 nvlist_get_nvlist_array.9 \ nv.9 nvlist_get_parent.9 \ nv.9 nvlist_get_string.9 \ nv.9 nvlist_get_string_array.9 \ nv.9 nvlist_move_binary.9 \ nv.9 nvlist_move_descriptor.9 \ nv.9 nvlist_move_descriptor_array.9 \ nv.9 nvlist_move_nvlist.9 \ nv.9 nvlist_move_nvlist_array.9 \ nv.9 nvlist_move_string.9 \ nv.9 nvlist_move_string_array.9 \ nv.9 nvlist_next.9 \ nv.9 nvlist_pack.9 \ nv.9 nvlist_recv.9 \ nv.9 nvlist_send.9 \ nv.9 nvlist_set_error.9 \ nv.9 nvlist_size.9 \ nv.9 nvlist_take_binary.9 \ nv.9 nvlist_take_bool.9 \ nv.9 nvlist_take_bool_array.9 \ nv.9 nvlist_take_descriptor.9 \ nv.9 nvlist_take_descriptor_array.9 \ nv.9 nvlist_take_number.9 \ nv.9 nvlist_take_number_array.9 \ nv.9 nvlist_take_nvlist.9 \ nv.9 nvlist_take_nvlist_array.9 \ nv.9 nvlist_take_string.9 \ nv.9 nvlist_take_string_array.9 \ nv.9 nvlist_unpack.9 \ nv.9 nvlist_xfer.9 +MLINKS+=OF_child.9 OF_parent.9 \ + OF_child.9 OF_peer.9 +MLINKS+=OF_device_from_xref.9 OF_device_register_xref.9 \ + OF_device_from_xref.9 OF_xref_from_device.9 +MLINKS+=OF_getprop.9 OF_getencprop.9 \ + OF_getprop.9 OF_getencprop_alloc.9 \ + OF_getprop.9 OF_getprop_alloc.9 \ + OF_getprop.9 OF_getproplen.9 \ + OF_getprop.9 OF_hasprop.9 \ + OF_getprop.9 OF_nextprop.9 \ + OF_getprop.9 OF_prop_free.9 \ + OF_getprop.9 OF_searchencprop.9 \ + OF_getprop.9 OF_searchprop.9 \ + OF_getprop.9 OF_setprop.9 +MLINKS+=OF_node_from_xref.9 OF_xref_from_node.9 MLINKS+=ofw_bus_is_compatible.9 ofw_bus_is_compatible_strict.9 \ ofw_bus_is_compatible.9 ofw_bus_node_is_compatible.9 \ ofw_bus_is_compatible.9 ofw_bus_search_compatible.9 MLINKS+= ofw_bus_status_okay.9 ofw_bus_get_status.9 \ ofw_bus_status_okay.9 ofw_bus_node_status_okay.9 MLINKS+=osd.9 osd_call.9 \ osd.9 osd_del.9 \ osd.9 osd_deregister.9 \ osd.9 osd_exit.9 \ osd.9 
osd_get.9 \ osd.9 osd_register.9 \ osd.9 osd_set.9 MLINKS+=panic.9 vpanic.9 MLINKS+=pbuf.9 getpbuf.9 \ pbuf.9 relpbuf.9 \ pbuf.9 trypbuf.9 MLINKS+=PCBGROUP.9 in_pcbgroup_byhash.9 \ PCBGROUP.9 in_pcbgroup_byinpcb.9 \ PCBGROUP.9 in_pcbgroup_destroy.9 \ PCBGROUP.9 in_pcbgroup_enabled.9 \ PCBGROUP.9 in_pcbgroup_init.9 \ PCBGROUP.9 in_pcbgroup_remove.9 \ PCBGROUP.9 in_pcbgroup_update.9 \ PCBGROUP.9 in_pcbgroup_update_mbuf.9 \ PCBGROUP.9 in6_pcbgroup_byhash.9 MLINKS+=pci.9 pci_alloc_msi.9 \ pci.9 pci_alloc_msix.9 \ pci.9 pci_disable_busmaster.9 \ pci.9 pci_disable_io.9 \ pci.9 pci_enable_busmaster.9 \ pci.9 pci_enable_io.9 \ pci.9 pci_find_bsf.9 \ pci.9 pci_find_cap.9 \ pci.9 pci_find_dbsf.9 \ pci.9 pci_find_device.9 \ pci.9 pci_find_extcap.9 \ pci.9 pci_find_htcap.9 \ pci.9 pci_find_pcie_root_port.9 \ pci.9 pci_get_id.9 \ pci.9 pci_get_max_read_req.9 \ pci.9 pci_get_powerstate.9 \ pci.9 pci_get_vpd_ident.9 \ pci.9 pci_get_vpd_readonly.9 \ pci.9 pci_iov_attach.9 \ pci.9 pci_iov_attach_name.9 \ pci.9 pci_iov_detach.9 \ pci.9 pci_msi_count.9 \ pci.9 pci_msix_count.9 \ pci.9 pci_msix_pba_bar.9 \ pci.9 pci_msix_table_bar.9 \ pci.9 pci_pending_msix.9 \ pci.9 pci_read_config.9 \ pci.9 pci_release_msi.9 \ pci.9 pci_remap_msix.9 \ pci.9 pci_restore_state.9 \ pci.9 pci_save_state.9 \ pci.9 pci_set_powerstate.9 \ pci.9 pci_set_max_read_req.9 \ pci.9 pci_write_config.9 \ pci.9 pcie_adjust_config.9 \ pci.9 pcie_flr.9 \ pci.9 pcie_max_completion_timeout.9 \ pci.9 pcie_read_config.9 \ pci.9 pcie_wait_for_pending_transactions.9 \ pci.9 pcie_write_config.9 MLINKS+=pci_iov_schema.9 pci_iov_schema_alloc_node.9 \ pci_iov_schema.9 pci_iov_schema_add_bool.9 \ pci_iov_schema.9 pci_iov_schema_add_string.9 \ pci_iov_schema.9 pci_iov_schema_add_uint8.9 \ pci_iov_schema.9 pci_iov_schema_add_uint16.9 \ pci_iov_schema.9 pci_iov_schema_add_uint32.9 \ pci_iov_schema.9 pci_iov_schema_add_uint64.9 \ pci_iov_schema.9 pci_iov_schema_add_unicast_mac.9 MLINKS+=pfil.9 pfil_add_hook.9 \ pfil.9 pfil_head_register.9 \ pfil.9 pfil_head_unregister.9 \ pfil.9 pfil_hook_get.9 \ pfil.9 pfil_remove_hook.9 \ pfil.9 pfil_rlock.9 \ pfil.9 pfil_run_hooks.9 \ pfil.9 pfil_runlock.9 \ pfil.9 pfil_wlock.9 \ pfil.9 pfil_wunlock.9 MLINKS+=pfind.9 zpfind.9 MLINKS+=PHOLD.9 PRELE.9 \ PHOLD.9 _PHOLD.9 \ PHOLD.9 _PRELE.9 \ PHOLD.9 PROC_ASSERT_HELD.9 \ PHOLD.9 PROC_ASSERT_NOT_HELD.9 MLINKS+=pmap_copy.9 pmap_copy_page.9 MLINKS+=pmap_extract.9 pmap_extract_and_hold.9 MLINKS+=pmap_init.9 pmap_init2.9 MLINKS+=pmap_is_modified.9 pmap_ts_referenced.9 MLINKS+=pmap_pinit.9 pmap_pinit0.9 \ pmap_pinit.9 pmap_pinit2.9 MLINKS+=pmap_qenter.9 pmap_qremove.9 MLINKS+=pmap_quick_enter_page.9 pmap_quick_remove_page.9 MLINKS+=pmap_remove.9 pmap_remove_all.9 \ pmap_remove.9 pmap_remove_pages.9 MLINKS+=pmap_resident_count.9 pmap_wired_count.9 MLINKS+=pmap_zero_page.9 pmap_zero_area.9 MLINKS+=printf.9 log.9 \ printf.9 tprintf.9 \ printf.9 uprintf.9 MLINKS+=priv.9 priv_check.9 \ priv.9 priv_check_cred.9 MLINKS+=proc_rwmem.9 proc_readmem.9 \ proc_rwmem.9 proc_writemem.9 MLINKS+=psignal.9 gsignal.9 \ psignal.9 pgsignal.9 \ psignal.9 tdsignal.9 MLINKS+=random.9 arc4rand.9 \ random.9 arc4random.9 \ random.9 read_random.9 \ random.9 read_random_uio.9 \ random.9 srandom.9 MLINKS+=random_harvest.9 random_harvest_direct.9 \ random_harvest.9 random_harvest_fast.9 \ random_harvest.9 random_harvest_queue.9 MLINKS+=refcount.9 refcount_acquire.9 \ refcount.9 refcount_init.9 \ refcount.9 refcount_release.9 MLINKS+=resource_int_value.9 resource_long_value.9 \ resource_int_value.9 
resource_string_value.9 MLINKS+=rman.9 rman_activate_resource.9 \ rman.9 rman_adjust_resource.9 \ rman.9 rman_deactivate_resource.9 \ rman.9 rman_fini.9 \ rman.9 rman_first_free_region.9 \ rman.9 rman_get_bushandle.9 \ rman.9 rman_get_bustag.9 \ rman.9 rman_get_device.9 \ rman.9 rman_get_end.9 \ rman.9 rman_get_flags.9 \ rman.9 rman_get_mapping.9 \ rman.9 rman_get_rid.9 \ rman.9 rman_get_size.9 \ rman.9 rman_get_start.9 \ rman.9 rman_get_virtual.9 \ rman.9 rman_init.9 \ rman.9 rman_init_from_resource.9 \ rman.9 rman_is_region_manager.9 \ rman.9 rman_last_free_region.9 \ rman.9 rman_make_alignment_flags.9 \ rman.9 rman_manage_region.9 \ rman.9 rman_release_resource.9 \ rman.9 rman_reserve_resource.9 \ rman.9 rman_reserve_resource_bound.9 \ rman.9 rman_set_bushandle.9 \ rman.9 rman_set_bustag.9 \ rman.9 rman_set_mapping.9 \ rman.9 rman_set_rid.9 \ rman.9 rman_set_virtual.9 MLINKS+=rmlock.9 rm_assert.9 \ rmlock.9 rm_destroy.9 \ rmlock.9 rm_init.9 \ rmlock.9 rm_init_flags.9 \ rmlock.9 rm_rlock.9 \ rmlock.9 rm_runlock.9 \ rmlock.9 rm_sleep.9 \ rmlock.9 RM_SYSINIT.9 \ rmlock.9 RM_SYSINIT_FLAGS.9 \ rmlock.9 rm_try_rlock.9 \ rmlock.9 rm_wlock.9 \ rmlock.9 rm_wowned.9 \ rmlock.9 rm_wunlock.9 MLINKS+=rtalloc.9 rtalloc1.9 \ rtalloc.9 rtalloc_ign.9 \ rtalloc.9 RT_ADDREF.9 \ rtalloc.9 RT_LOCK.9 \ rtalloc.9 RT_REMREF.9 \ rtalloc.9 RT_RTFREE.9 \ rtalloc.9 RT_UNLOCK.9 \ rtalloc.9 RTFREE_LOCKED.9 \ rtalloc.9 RTFREE.9 \ rtalloc.9 rtfree.9 \ rtalloc.9 rtalloc1_fib.9 \ rtalloc.9 rtalloc_ign_fib.9 \ rtalloc.9 rtalloc_fib.9 MLINKS+=runqueue.9 choosethread.9 \ runqueue.9 procrunnable.9 \ runqueue.9 remrunqueue.9 \ runqueue.9 setrunqueue.9 MLINKS+=rwlock.9 rw_assert.9 \ rwlock.9 rw_destroy.9 \ rwlock.9 rw_downgrade.9 \ rwlock.9 rw_init.9 \ rwlock.9 rw_init_flags.9 \ rwlock.9 rw_initialized.9 \ rwlock.9 rw_rlock.9 \ rwlock.9 rw_runlock.9 \ rwlock.9 rw_unlock.9 \ rwlock.9 rw_sleep.9 \ rwlock.9 RW_SYSINIT.9 \ rwlock.9 RW_SYSINIT_FLAGS.9 \ rwlock.9 rw_try_rlock.9 \ rwlock.9 rw_try_upgrade.9 \ rwlock.9 rw_try_wlock.9 \ rwlock.9 rw_wlock.9 \ rwlock.9 rw_wowned.9 \ rwlock.9 rw_wunlock.9 MLINKS+=sbuf.9 sbuf_bcat.9 \ sbuf.9 sbuf_bcopyin.9 \ sbuf.9 sbuf_bcpy.9 \ sbuf.9 sbuf_cat.9 \ sbuf.9 sbuf_clear.9 \ sbuf.9 sbuf_clear_flags.9 \ sbuf.9 sbuf_copyin.9 \ sbuf.9 sbuf_cpy.9 \ sbuf.9 sbuf_data.9 \ sbuf.9 sbuf_delete.9 \ sbuf.9 sbuf_done.9 \ sbuf.9 sbuf_error.9 \ sbuf.9 sbuf_finish.9 \ sbuf.9 sbuf_get_flags.9 \ sbuf.9 sbuf_hexdump.9 \ sbuf.9 sbuf_len.9 \ sbuf.9 sbuf_new.9 \ sbuf.9 sbuf_new_auto.9 \ sbuf.9 sbuf_new_for_sysctl.9 \ sbuf.9 sbuf_printf.9 \ sbuf.9 sbuf_putc.9 \ sbuf.9 sbuf_set_drain.9 \ sbuf.9 sbuf_set_flags.9 \ sbuf.9 sbuf_setpos.9 \ sbuf.9 sbuf_start_section.9 \ sbuf.9 sbuf_end_section.9 \ sbuf.9 sbuf_trim.9 \ sbuf.9 sbuf_vprintf.9 MLINKS+=scheduler.9 curpriority_cmp.9 \ scheduler.9 maybe_resched.9 \ scheduler.9 propagate_priority.9 \ scheduler.9 resetpriority.9 \ scheduler.9 roundrobin.9 \ scheduler.9 roundrobin_interval.9 \ scheduler.9 schedclock.9 \ scheduler.9 schedcpu.9 \ scheduler.9 sched_setup.9 \ scheduler.9 setrunnable.9 \ scheduler.9 updatepri.9 MLINKS+=SDT.9 SDT_PROVIDER_DECLARE.9 \ SDT.9 SDT_PROVIDER_DEFINE.9 \ SDT.9 SDT_PROBE_DECLARE.9 \ SDT.9 SDT_PROBE_DEFINE.9 \ SDT.9 SDT_PROBE.9 MLINKS+=securelevel_gt.9 securelevel_ge.9 MLINKS+=selrecord.9 seldrain.9 \ selrecord.9 selwakeup.9 MLINKS+=sema.9 sema_destroy.9 \ sema.9 sema_init.9 \ sema.9 sema_post.9 \ sema.9 sema_timedwait.9 \ sema.9 sema_trywait.9 \ sema.9 sema_value.9 \ sema.9 sema_wait.9 MLINKS+=sf_buf.9 sf_buf_alloc.9 \ sf_buf.9 sf_buf_free.9 \ 
sf_buf.9 sf_buf_kva.9 \ sf_buf.9 sf_buf_page.9 MLINKS+=sglist.9 sglist_alloc.9 \ sglist.9 sglist_append.9 \ sglist.9 sglist_append_bio.9 \ sglist.9 sglist_append_mbuf.9 \ sglist.9 sglist_append_phys.9 \ sglist.9 sglist_append_sglist.9 \ sglist.9 sglist_append_uio.9 \ sglist.9 sglist_append_user.9 \ sglist.9 sglist_append_vmpages.9 \ sglist.9 sglist_build.9 \ sglist.9 sglist_clone.9 \ sglist.9 sglist_consume_uio.9 \ sglist.9 sglist_count.9 \ sglist.9 sglist_count_vmpages.9 \ sglist.9 sglist_free.9 \ sglist.9 sglist_hold.9 \ sglist.9 sglist_init.9 \ sglist.9 sglist_join.9 \ sglist.9 sglist_length.9 \ sglist.9 sglist_reset.9 \ sglist.9 sglist_slice.9 \ sglist.9 sglist_split.9 MLINKS+=shm_map.9 shm_unmap.9 MLINKS+=signal.9 cursig.9 \ signal.9 execsigs.9 \ signal.9 issignal.9 \ signal.9 killproc.9 \ signal.9 pgsigio.9 \ signal.9 postsig.9 \ signal.9 SETSETNEQ.9 \ signal.9 SETSETOR.9 \ signal.9 SIGADDSET.9 \ signal.9 SIG_CONTSIGMASK.9 \ signal.9 SIGDELSET.9 \ signal.9 SIGEMPTYSET.9 \ signal.9 sigexit.9 \ signal.9 SIGFILLSET.9 \ signal.9 siginit.9 \ signal.9 SIGISEMPTY.9 \ signal.9 SIGISMEMBER.9 \ signal.9 SIGNOTEMPTY.9 \ signal.9 signotify.9 \ signal.9 SIGPENDING.9 \ signal.9 SIGSETAND.9 \ signal.9 SIGSETCANTMASK.9 \ signal.9 SIGSETEQ.9 \ signal.9 SIGSETNAND.9 \ signal.9 SIG_STOPSIGMASK.9 \ signal.9 trapsignal.9 MLINKS+=sleep.9 msleep.9 \ sleep.9 msleep_sbt.9 \ sleep.9 msleep_spin.9 \ sleep.9 msleep_spin_sbt.9 \ sleep.9 pause.9 \ sleep.9 pause_sig.9 \ sleep.9 pause_sbt.9 \ sleep.9 tsleep.9 \ sleep.9 tsleep_sbt.9 \ sleep.9 wakeup.9 \ sleep.9 wakeup_one.9 MLINKS+=sleepqueue.9 init_sleepqueues.9 \ sleepqueue.9 sleepq_abort.9 \ sleepqueue.9 sleepq_add.9 \ sleepqueue.9 sleepq_alloc.9 \ sleepqueue.9 sleepq_broadcast.9 \ sleepqueue.9 sleepq_free.9 \ sleepqueue.9 sleepq_lookup.9 \ sleepqueue.9 sleepq_lock.9 \ sleepqueue.9 sleepq_release.9 \ sleepqueue.9 sleepq_remove.9 \ sleepqueue.9 sleepq_set_timeout.9 \ sleepqueue.9 sleepq_set_timeout_sbt.9 \ sleepqueue.9 sleepq_signal.9 \ sleepqueue.9 sleepq_sleepcnt.9 \ sleepqueue.9 sleepq_timedwait.9 \ sleepqueue.9 sleepq_timedwait_sig.9 \ sleepqueue.9 sleepq_type.9 \ sleepqueue.9 sleepq_wait.9 \ sleepqueue.9 sleepq_wait_sig.9 MLINKS+=socket.9 soabort.9 \ socket.9 soaccept.9 \ socket.9 sobind.9 \ socket.9 socheckuid.9 \ socket.9 soclose.9 \ socket.9 soconnect.9 \ socket.9 socreate.9 \ socket.9 sodisconnect.9 \ socket.9 sodupsockaddr.9 \ socket.9 sofree.9 \ socket.9 sogetopt.9 \ socket.9 sohasoutofband.9 \ socket.9 solisten.9 \ socket.9 solisten_proto.9 \ socket.9 solisten_proto_check.9 \ socket.9 sonewconn.9 \ socket.9 sooptcopyin.9 \ socket.9 sooptcopyout.9 \ socket.9 sopoll.9 \ socket.9 sopoll_generic.9 \ socket.9 soreceive.9 \ socket.9 soreceive_dgram.9 \ socket.9 soreceive_generic.9 \ socket.9 soreceive_stream.9 \ socket.9 soreserve.9 \ socket.9 sorflush.9 \ socket.9 sosend.9 \ socket.9 sosend_dgram.9 \ socket.9 sosend_generic.9 \ socket.9 sosetopt.9 \ socket.9 soshutdown.9 \ socket.9 sotoxsocket.9 \ socket.9 soupcall_clear.9 \ socket.9 soupcall_set.9 \ socket.9 sowakeup.9 MLINKS+=stack.9 stack_copy.9 \ stack.9 stack_create.9 \ stack.9 stack_destroy.9 \ stack.9 stack_print.9 \ stack.9 stack_print_ddb.9 \ stack.9 stack_print_short.9 \ stack.9 stack_print_short_ddb.9 \ stack.9 stack_put.9 \ stack.9 stack_save.9 \ stack.9 stack_sbuf_print.9 \ stack.9 stack_sbuf_print_ddb.9 \ stack.9 stack_zero.9 MLINKS+=store.9 subyte.9 \ store.9 suswintr.9 \ store.9 suword.9 \ store.9 suword16.9 \ store.9 suword32.9 \ store.9 suword64.9 MLINKS+=swi.9 swi_add.9 \ swi.9 
swi_remove.9 \ swi.9 swi_sched.9 MLINKS+=sx.9 sx_assert.9 \ sx.9 sx_destroy.9 \ sx.9 sx_downgrade.9 \ sx.9 sx_init.9 \ sx.9 sx_init_flags.9 \ sx.9 sx_sleep.9 \ sx.9 sx_slock.9 \ sx.9 sx_slock_sig.9 \ sx.9 sx_sunlock.9 \ sx.9 SX_SYSINIT.9 \ sx.9 SX_SYSINIT_FLAGS.9 \ sx.9 sx_try_slock.9 \ sx.9 sx_try_upgrade.9 \ sx.9 sx_try_xlock.9 \ sx.9 sx_unlock.9 \ sx.9 sx_xholder.9 \ sx.9 sx_xlock.9 \ sx.9 sx_xlock_sig.9 \ sx.9 sx_xlocked.9 \ sx.9 sx_xunlock.9 MLINKS+=syscall_helper_register.9 syscall_helper_unregister.9 \ syscall_helper_register.9 SYSCALL_INIT_HELPER.9 \ syscall_helper_register.9 SYSCALL_INIT_HELPER_COMPAT.9 \ syscall_helper_register.9 SYSCALL_INIT_HELPER_COMPAT_F.9 \ syscall_helper_register.9 SYSCALL_INIT_HELPER_F.9 MLINKS+=sysctl.9 SYSCTL_DECL.9 \ sysctl.9 SYSCTL_ADD_INT.9 \ sysctl.9 SYSCTL_ADD_LONG.9 \ sysctl.9 SYSCTL_ADD_NODE.9 \ sysctl.9 SYSCTL_ADD_NODE_WITH_LABEL.9 \ sysctl.9 SYSCTL_ADD_OPAQUE.9 \ sysctl.9 SYSCTL_ADD_PROC.9 \ sysctl.9 SYSCTL_ADD_QUAD.9 \ sysctl.9 SYSCTL_ADD_ROOT_NODE.9 \ sysctl.9 SYSCTL_ADD_S8.9 \ sysctl.9 SYSCTL_ADD_S16.9 \ sysctl.9 SYSCTL_ADD_S32.9 \ sysctl.9 SYSCTL_ADD_S64.9 \ sysctl.9 SYSCTL_ADD_STRING.9 \ sysctl.9 SYSCTL_ADD_STRUCT.9 \ sysctl.9 SYSCTL_ADD_U8.9 \ sysctl.9 SYSCTL_ADD_U16.9 \ sysctl.9 SYSCTL_ADD_U32.9 \ sysctl.9 SYSCTL_ADD_U64.9 \ sysctl.9 SYSCTL_ADD_UAUTO.9 \ sysctl.9 SYSCTL_ADD_UINT.9 \ sysctl.9 SYSCTL_ADD_ULONG.9 \ sysctl.9 SYSCTL_ADD_UQUAD.9 \ sysctl.9 SYSCTL_CHILDREN.9 \ sysctl.9 SYSCTL_STATIC_CHILDREN.9 \ sysctl.9 SYSCTL_NODE_CHILDREN.9 \ sysctl.9 SYSCTL_PARENT.9 \ sysctl.9 SYSCTL_INT.9 \ sysctl.9 SYSCTL_INT_WITH_LABEL.9 \ sysctl.9 SYSCTL_LONG.9 \ sysctl.9 SYSCTL_NODE.9 \ sysctl.9 SYSCTL_NODE_WITH_LABEL.9 \ sysctl.9 SYSCTL_OPAQUE.9 \ sysctl.9 SYSCTL_PROC.9 \ sysctl.9 SYSCTL_QUAD.9 \ sysctl.9 SYSCTL_ROOT_NODE.9 \ sysctl.9 SYSCTL_S8.9 \ sysctl.9 SYSCTL_S16.9 \ sysctl.9 SYSCTL_S32.9 \ sysctl.9 SYSCTL_S64.9 \ sysctl.9 SYSCTL_STRING.9 \ sysctl.9 SYSCTL_STRUCT.9 \ sysctl.9 SYSCTL_U8.9 \ sysctl.9 SYSCTL_U16.9 \ sysctl.9 SYSCTL_U32.9 \ sysctl.9 SYSCTL_U64.9 \ sysctl.9 SYSCTL_UINT.9 \ sysctl.9 SYSCTL_ULONG.9 \ sysctl.9 SYSCTL_UQUAD.9 MLINKS+=sysctl_add_oid.9 sysctl_move_oid.9 \ sysctl_add_oid.9 sysctl_remove_oid.9 \ sysctl_add_oid.9 sysctl_remove_name.9 MLINKS+=sysctl_ctx_init.9 sysctl_ctx_entry_add.9 \ sysctl_ctx_init.9 sysctl_ctx_entry_del.9 \ sysctl_ctx_init.9 sysctl_ctx_entry_find.9 \ sysctl_ctx_init.9 sysctl_ctx_free.9 MLINKS+=SYSINIT.9 SYSUNINIT.9 MLINKS+=taskqueue.9 TASK_INIT.9 \ taskqueue.9 TASK_INITIALIZER.9 \ taskqueue.9 taskqueue_block.9 \ taskqueue.9 taskqueue_cancel.9 \ taskqueue.9 taskqueue_cancel_timeout.9 \ taskqueue.9 taskqueue_create.9 \ taskqueue.9 taskqueue_create_fast.9 \ taskqueue.9 TASKQUEUE_DECLARE.9 \ taskqueue.9 TASKQUEUE_DEFINE.9 \ taskqueue.9 TASKQUEUE_DEFINE_THREAD.9 \ taskqueue.9 taskqueue_drain.9 \ taskqueue.9 taskqueue_drain_all.9 \ taskqueue.9 taskqueue_drain_timeout.9 \ taskqueue.9 taskqueue_enqueue.9 \ taskqueue.9 taskqueue_enqueue_timeout.9 \ taskqueue.9 TASKQUEUE_FAST_DEFINE.9 \ taskqueue.9 TASKQUEUE_FAST_DEFINE_THREAD.9 \ taskqueue.9 taskqueue_free.9 \ taskqueue.9 taskqueue_member.9 \ taskqueue.9 taskqueue_run.9 \ taskqueue.9 taskqueue_set_callback.9 \ taskqueue.9 taskqueue_start_threads.9 \ taskqueue.9 taskqueue_start_threads_pinned.9 \ taskqueue.9 taskqueue_unblock.9 \ taskqueue.9 TIMEOUT_TASK_INIT.9 MLINKS+=tcp_functions.9 register_tcp_functions.9 \ tcp_functions.9 register_tcp_functions_as_name.9 \ tcp_functions.9 register_tcp_functions_as_names.9 \ tcp_functions.9 deregister_tcp_functions.9 MLINKS+=time.9 
boottime.9 \ time.9 time_second.9 \ time.9 time_uptime.9 MLINKS+=timeout.9 callout.9 \ timeout.9 callout_active.9 \ timeout.9 callout_async_drain.9 \ timeout.9 callout_deactivate.9 \ timeout.9 callout_drain.9 \ timeout.9 callout_handle_init.9 \ timeout.9 callout_init.9 \ timeout.9 callout_init_mtx.9 \ timeout.9 callout_init_rm.9 \ timeout.9 callout_init_rw.9 \ timeout.9 callout_pending.9 \ timeout.9 callout_reset.9 \ timeout.9 callout_reset_curcpu.9 \ timeout.9 callout_reset_on.9 \ timeout.9 callout_reset_sbt.9 \ timeout.9 callout_reset_sbt_curcpu.9 \ timeout.9 callout_reset_sbt_on.9 \ timeout.9 callout_schedule.9 \ timeout.9 callout_schedule_curcpu.9 \ timeout.9 callout_schedule_on.9 \ timeout.9 callout_schedule_sbt.9 \ timeout.9 callout_schedule_sbt_curcpu.9 \ timeout.9 callout_schedule_sbt_on.9 \ timeout.9 callout_stop.9 \ timeout.9 callout_when.9 \ timeout.9 untimeout.9 MLINKS+=ucred.9 cred_update_thread.9 \ ucred.9 crcopy.9 \ ucred.9 crcopysafe.9 \ ucred.9 crdup.9 \ ucred.9 crfree.9 \ ucred.9 crget.9 \ ucred.9 crhold.9 \ ucred.9 crsetgroups.9 \ ucred.9 cru2x.9 MLINKS+=uidinfo.9 uifind.9 \ uidinfo.9 uifree.9 \ uidinfo.9 uihashinit.9 \ uidinfo.9 uihold.9 MLINKS+=uio.9 uiomove.9 \ uio.9 uiomove_frombuf.9 \ uio.9 uiomove_nofault.9 .if ${MK_USB} != "no" MAN+= usbdi.9 MLINKS+=usbdi.9 usbd_do_request.9 \ usbdi.9 usbd_do_request_flags.9 \ usbdi.9 usbd_errstr.9 \ usbdi.9 usbd_lookup_id_by_info.9 \ usbdi.9 usbd_lookup_id_by_uaa.9 \ usbdi.9 usbd_transfer_clear_stall.9 \ usbdi.9 usbd_transfer_drain.9 \ usbdi.9 usbd_transfer_pending.9 \ usbdi.9 usbd_transfer_poll.9 \ usbdi.9 usbd_transfer_setup.9 \ usbdi.9 usbd_transfer_start.9 \ usbdi.9 usbd_transfer_stop.9 \ usbdi.9 usbd_transfer_submit.9 \ usbdi.9 usbd_transfer_unsetup.9 \ usbdi.9 usbd_xfer_clr_flag.9 \ usbdi.9 usbd_xfer_frame_data.9 \ usbdi.9 usbd_xfer_frame_len.9 \ usbdi.9 usbd_xfer_get_frame.9 \ usbdi.9 usbd_xfer_get_priv.9 \ usbdi.9 usbd_xfer_is_stalled.9 \ usbdi.9 usbd_xfer_max_framelen.9 \ usbdi.9 usbd_xfer_max_frames.9 \ usbdi.9 usbd_xfer_max_len.9 \ usbdi.9 usbd_xfer_set_flag.9 \ usbdi.9 usbd_xfer_set_frame_data.9 \ usbdi.9 usbd_xfer_set_frame_len.9 \ usbdi.9 usbd_xfer_set_frame_offset.9 \ usbdi.9 usbd_xfer_set_frames.9 \ usbdi.9 usbd_xfer_set_interval.9 \ usbdi.9 usbd_xfer_set_priv.9 \ usbdi.9 usbd_xfer_set_stall.9 \ usbdi.9 usbd_xfer_set_timeout.9 \ usbdi.9 usbd_xfer_softc.9 \ usbdi.9 usbd_xfer_state.9 \ usbdi.9 usbd_xfer_status.9 \ usbdi.9 usb_fifo_alloc_buffer.9 \ usbdi.9 usb_fifo_attach.9 \ usbdi.9 usb_fifo_detach.9 \ usbdi.9 usb_fifo_free_buffer.9 \ usbdi.9 usb_fifo_get_data.9 \ usbdi.9 usb_fifo_get_data_buffer.9 \ usbdi.9 usb_fifo_get_data_error.9 \ usbdi.9 usb_fifo_get_data_linear.9 \ usbdi.9 usb_fifo_put_bytes_max.9 \ usbdi.9 usb_fifo_put_data.9 \ usbdi.9 usb_fifo_put_data_buffer.9 \ usbdi.9 usb_fifo_put_data_error.9 \ usbdi.9 usb_fifo_put_data_linear.9 \ usbdi.9 usb_fifo_reset.9 \ usbdi.9 usb_fifo_softc.9 \ usbdi.9 usb_fifo_wakeup.9 .endif MLINKS+=vcount.9 count_dev.9 MLINKS+=vfsconf.9 vfs_modevent.9 \ vfsconf.9 vfs_register.9 \ vfsconf.9 vfs_unregister.9 MLINKS+=vfs_getopt.9 vfs_copyopt.9 \ vfs_getopt.9 vfs_filteropt.9 \ vfs_getopt.9 vfs_flagopt.9 \ vfs_getopt.9 vfs_getopts.9 \ vfs_getopt.9 vfs_scanopt.9 \ vfs_getopt.9 vfs_setopt.9 \ vfs_getopt.9 vfs_setopt_part.9 \ vfs_getopt.9 vfs_setopts.9 MLINKS+=vhold.9 vdrop.9 \ vhold.9 vdropl.9 \ vhold.9 vholdl.9 MLINKS+=vmem.9 vmem_add.9 \ vmem.9 vmem_alloc.9 \ vmem.9 vmem_create.9 \ vmem.9 vmem_destroy.9 \ vmem.9 vmem_free.9 \ vmem.9 vmem_xalloc.9 \ vmem.9 vmem_xfree.9 
MLINKS+=vm_map_lock.9 vm_map_lock_downgrade.9 \ vm_map_lock.9 vm_map_lock_read.9 \ vm_map_lock.9 vm_map_lock_upgrade.9 \ vm_map_lock.9 vm_map_trylock.9 \ vm_map_lock.9 vm_map_trylock_read.9 \ vm_map_lock.9 vm_map_unlock.9 \ vm_map_lock.9 vm_map_unlock_read.9 MLINKS+=vm_map_lookup.9 vm_map_lookup_done.9 MLINKS+=vm_map_max.9 vm_map_min.9 \ vm_map_max.9 vm_map_pmap.9 MLINKS+=vm_map_stack.9 vm_map_growstack.9 MLINKS+=vm_map_wire.9 vm_map_unwire.9 MLINKS+=vm_page_bits.9 vm_page_clear_dirty.9 \ vm_page_bits.9 vm_page_dirty.9 \ vm_page_bits.9 vm_page_is_valid.9 \ vm_page_bits.9 vm_page_set_invalid.9 \ vm_page_bits.9 vm_page_set_validclean.9 \ vm_page_bits.9 vm_page_test_dirty.9 \ vm_page_bits.9 vm_page_undirty.9 \ vm_page_bits.9 vm_page_zero_invalid.9 MLINKS+=vm_page_busy.9 vm_page_busied.9 \ vm_page_busy.9 vm_page_busy_downgrade.9 \ vm_page_busy.9 vm_page_busy_sleep.9 \ vm_page_busy.9 vm_page_sbusied.9 \ vm_page_busy.9 vm_page_sbusy.9 \ vm_page_busy.9 vm_page_sleep_if_busy.9 \ vm_page_busy.9 vm_page_sunbusy.9 \ vm_page_busy.9 vm_page_trysbusy.9 \ vm_page_busy.9 vm_page_tryxbusy.9 \ vm_page_busy.9 vm_page_xbusied.9 \ vm_page_busy.9 vm_page_xbusy.9 \ vm_page_busy.9 vm_page_xunbusy.9 \ vm_page_busy.9 vm_page_assert_sbusied.9 \ vm_page_busy.9 vm_page_assert_unbusied.9 \ vm_page_busy.9 vm_page_assert_xbusied.9 MLINKS+=vm_page_aflag.9 vm_page_aflag_clear.9 \ vm_page_aflag.9 vm_page_aflag_set.9 \ vm_page_aflag.9 vm_page_reference.9 MLINKS+=vm_page_free.9 vm_page_free_toq.9 \ vm_page_free.9 vm_page_free_zero.9 \ vm_page_free.9 vm_page_try_to_free.9 MLINKS+=vm_page_hold.9 vm_page_unhold.9 MLINKS+=vm_page_insert.9 vm_page_remove.9 MLINKS+=vm_page_wire.9 vm_page_unwire.9 MLINKS+=VOP_ACCESS.9 VOP_ACCESSX.9 MLINKS+=VOP_ATTRIB.9 VOP_GETATTR.9 \ VOP_ATTRIB.9 VOP_SETATTR.9 MLINKS+=VOP_CREATE.9 VOP_MKDIR.9 \ VOP_CREATE.9 VOP_MKNOD.9 \ VOP_CREATE.9 VOP_SYMLINK.9 MLINKS+=VOP_GETPAGES.9 VOP_PUTPAGES.9 MLINKS+=VOP_INACTIVE.9 VOP_RECLAIM.9 MLINKS+=VOP_LOCK.9 vn_lock.9 \ VOP_LOCK.9 VOP_ISLOCKED.9 \ VOP_LOCK.9 VOP_UNLOCK.9 MLINKS+=VOP_OPENCLOSE.9 VOP_CLOSE.9 \ VOP_OPENCLOSE.9 VOP_OPEN.9 MLINKS+=VOP_RDWR.9 VOP_READ.9 \ VOP_RDWR.9 VOP_WRITE.9 MLINKS+=VOP_REMOVE.9 VOP_RMDIR.9 MLINKS+=vnet.9 vimage.9 MLINKS+=vref.9 VREF.9 \ vref.9 vrefl.9 MLINKS+=vrele.9 vput.9 \ vrele.9 vunref.9 MLINKS+=vslock.9 vsunlock.9 MLINKS+=zone.9 uma.9 \ zone.9 uma_zalloc.9 \ zone.9 uma_zalloc_arg.9 \ zone.9 uma_zalloc_domain.9 \ zone.9 uma_zcreate.9 \ zone.9 uma_zdestroy.9 \ zone.9 uma_zfree.9 \ zone.9 uma_zfree_arg.9 \ zone.9 uma_zfree_domain.9 \ zone.9 uma_zone_get_cur.9 \ zone.9 uma_zone_get_max.9 \ zone.9 uma_zone_set_max.9 \ zone.9 uma_zone_set_warning.9 \ zone.9 uma_zone_set_maxaction.9 .include <bsd.prog.mk>
Index: user/markj/netdump/share/man/man9/OF_child.9
===================================================================
--- user/markj/netdump/share/man/man9/OF_child.9 (nonexistent)
+++ user/markj/netdump/share/man/man9/OF_child.9 (revision 332408)
@@ -0,0 +1,76 @@
+.\"
+.\" Copyright (c) 2018 Oleksandr Tymoshenko
+.\"
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR
+.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT,
+.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd April 9, 2018
+.Dt OF_CHILD 9
+.Os
+.Sh NAME
+.Nm OF_child ,
+.Nm OF_parent ,
+.Nm OF_peer
+.Nd navigate device tree
+.Sh SYNOPSIS
+.In dev/ofw/ofw_bus.h
+.In dev/ofw/ofw_bus_subr.h
+.Ft phandle_t
+.Fn OF_child "phandle_t node"
+.Ft phandle_t
+.Fn OF_parent "phandle_t node"
+.Ft phandle_t
+.Fn OF_peer "phandle_t node"
+.Sh DESCRIPTION
+.Fn OF_child
+returns the phandle value of the first child of the
+.Fa node .
+Zero is returned if there are no child nodes.
+.Pp
+.Fn OF_parent
+returns the phandle for the parent of the
+.Fa node .
+Zero is returned if
+.Fa node
+is the root node.
+.Pp
+.Fn OF_peer
+returns the phandle value of the next sibling of the
+.Fa node .
+Zero is returned if there is no sibling node.
+.Sh EXAMPLES
+.Bd -literal
+phandle_t node, child;
+ ...
+for (child = OF_child(node); child != 0; child = OF_peer(child)) {
+	...
+}
+.Ed
+.Sh SEE ALSO
+.Xr OF_finddevice 9
+.Sh AUTHORS
+.An -nosplit
+This manual page was written by
+.An Oleksandr Tymoshenko Aq Mt gonzo@FreeBSD.org .

Property changes on: user/markj/netdump/share/man/man9/OF_child.9
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: user/markj/netdump/share/man/man9/OF_device_from_xref.9
===================================================================
--- user/markj/netdump/share/man/man9/OF_device_from_xref.9 (nonexistent)
+++ user/markj/netdump/share/man/man9/OF_device_from_xref.9 (revision 332408)
@@ -0,0 +1,91 @@
+.\"
+.\" Copyright (c) 2018 Oleksandr Tymoshenko
+.\"
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR
+.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT, +.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT +.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF +.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd April 9, 2018 +.Dt OF_DEVICE_FROM_XREF 9 +.Os +.Sh NAME +.Nm OF_device_from_xref , +.Nm OF_xref_from_device, +.Nm OF_device_register_xref +.Nd "manage mappings between xrefs and devices" +.Sh SYNOPSIS +.In dev/ofw/ofw_bus.h +.In dev/ofw/ofw_bus_subr.h +.Ft int +.Fn OF_device_register_xref "phandle_t xref" "device_t dev" +.Ft device_t +.Fn OF_device_from_xref "phandle_t xref" +.Ft phandle_t +.Fn OF_xref_from_device "device_t dev" +.Sh DESCRIPTION +.Pp +When a device tree node references another node, the driver may +need to get a device_t instance associated with the referenced node. +For instance, an Ethernet driver accessing a PHY device. +To make this possible, the kernel maintains a table that +maps effective handles to device_t instances. +.Pp +.Fn OF_device_register_xref +adds a map entry from the effective phandle +.Fa xref +to device +.Fa dev . +If a mapping entry for +.Fa xref +already exists, it is replaced with the new one. +The function always returns 0. +.Pp +.Fn OF_device_from_xref +returns a device_t instance associated with the effective phandle +.Fa xref . +If no such mapping exists, the function returns NULL. +.Pp +.Fn OF_xref_from_device +returns the effective phandle associated with the device +.Fa dev . +If no such mapping exists, the function returns 0. +.Sh EXAMPLES +.Bd -literal + static int + acmephy_attach(device_t dev) + { + phandle_t node; + + /* PHY node is referenced from eth device, register it */ + node = ofw_bus_get_node(dev); + OF_device_register_xref(OF_xref_from_node(node), dev); + + return (0); + } +.Ed +.Sh SEE ALSO +.Xr OF_node_to_xref 9 +.Sh AUTHORS +.An -nosplit +This manual page was written by +.An Oleksandr Tymoshenko Aq Mt gonzo@FreeBSD.org . Property changes on: user/markj/netdump/share/man/man9/OF_device_from_xref.9 ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/markj/netdump/share/man/man9/OF_finddevice.9 =================================================================== --- user/markj/netdump/share/man/man9/OF_finddevice.9 (nonexistent) +++ user/markj/netdump/share/man/man9/OF_finddevice.9 (revision 332408) @@ -0,0 +1,74 @@ +.\" +.\" Copyright (c) 2018 Oleksandr Tymoshenko +.\" +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. 
+.\" +.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR +.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES +.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. +.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT, +.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT +.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF +.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd April 9, 2018 +.Dt OF_FINDDEVICE 9 +.Os +.Sh NAME +.Nm OF_finddevice +.Nd find node in device tree +.Sh SYNOPSIS +.In dev/ofw/ofw_bus.h +.In dev/ofw/ofw_bus_subr.h +.Ft phandle_t +.Fn OF_finddevice "const char *path" +.Sh DESCRIPTION +.Pp +.Fn OF_finddevice +returns the phandle for the node specified by the +.Fa path . +Returns -1 if the path cannot be found in the tree. +.Sh CAVEATS +The return value should only be checked with equality +operators (equal to, not equal to) and not relational comparison +(less than, greater than ). +There is a discrepancy between IEEE 1275 standard and +.Fx Ns 's +internal representation of a phandle: IEEE 1275 +requires the return value of this function to be -1 if the path +is not found. +But phandle_t is an unsigned type, so it cannot +be relationally compared with -1 or 0, this comparison +is always true or always false. +.Sh EXAMPLES +.Bd -literal + phandle_t root, i2c; + + root = OF_finddevice("/"); + i2c = OF_finddevice("/soc/axi/i2c@a0e0000"); + if (i2c != -1) { + ... + } +.Ed +.Sh SEE ALSO +.Xr OF_child 9 +.Xr OF_parent 9 +.Xr OF_peer 9 +.Sh AUTHORS +.An -nosplit +This manual page was written by +.An Oleksandr Tymoshenko Aq Mt gonzo@FreeBSD.org . Property changes on: user/markj/netdump/share/man/man9/OF_finddevice.9 ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/markj/netdump/share/man/man9/OF_getprop.9 =================================================================== --- user/markj/netdump/share/man/man9/OF_getprop.9 (nonexistent) +++ user/markj/netdump/share/man/man9/OF_getprop.9 (revision 332408) @@ -0,0 +1,291 @@ +.\" +.\" Copyright (c) 2018 Oleksandr Tymoshenko +.\" +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR +.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES +.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
+.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT, +.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT +.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF +.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd April 9, 2018 +.Dt OF_CHILD 9 +.Os +.Sh NAME +.Nm OF_getprop , +.Nm OF_getproplen , +.Nm OF_getencprop , +.Nm OF_hasprop , +.Nm OF_searchprop , +.Nm OF_searchencprop , +.Nm OF_getprop_alloc , +.Nm OF_getencprop_alloc , +.Nm OF_prop_free , +.Nm OF_nextprop , +.Nm OF_setprop +.Nd access properties of device tree node +.Sh SYNOPSIS +.In dev/ofw/ofw_bus.h +.In dev/ofw/ofw_bus_subr.h +.Ft ssize_t +.Fn OF_getproplen "phandle_t node" "const char *propname" +.Ft ssize_t +.Fn OF_getprop "phandle_t node" "const char *propname" \ +"void *buf" "size_t len" +.Ft ssize_t +.Fn OF_getencprop "phandle_t node" "const char *prop" \ +"pcell_t *buf" "size_t len" +.Ft int +.Fn OF_hasprop "phandle_t node" "const char *propname" +.Ft ssize_t +.Fn OF_searchprop "phandle_t node" "const char *propname" \ +"void *buf" "size_t len" +.Ft ssize_t +.Fn OF_searchencprop "phandle_t node" "const char *propname" \ +"pcell_t *buf" "size_t len" +.Ft ssize_t +.Fn OF_getprop_alloc "phandle_t node" "const char *propname" \ +"void **buf" +.Ft ssize_t +.Fn OF_getencprop_alloc "phandle_t node" "const char *propname" \ +"pcell_t **buf" +.Ft void +.Fn OF_prop_free "void *buf" +.Ft int +.Fn OF_nextprop "phandle_t node" "const char *propname" \ +"char *buf" "size_t len" +.Ft int +.Fn OF_setprop "phandle_t node" "const char *propname" \ +"const void *buf" "size_t len" +.Sh DESCRIPTION +.Pp +Device nodes can have associated properties. +Properties consist of a name and a value. +A name is a human-readable string from 1 to 31 characters long. +A value is an array of zero or more bytes that encode certain +information. +The meaning of that bytes depends on how drivers interpret them. +Properties can encode byte arrays, text strings, unsigned 32-bit +values or any combination of these types. +.Pp +Property with a zero-length value usually represents boolean +information. +If the property is present, it signifies true, otherwise false. +.Pp +A byte array is encoded as a sequence of bytes and represents +values like MAC addresses. +.Pp +A text string is a sequence of n printable characters. +It is encoded as a byte array of length n + 1 bytes with +characters represented as bytes plus a terminating null character. +.Pp +Unsigned 32-bit values, also sometimes called cells, are +encoded as a sequence of 4 bytes in big-endian order. +.Pp +.Fn OF_getproplen +returns either the length of the value associated with the property +.Fa propname +in the node identified by +.Fa node , +or 0 if the property exists but has no associated value. +If +.Fa propname +does not exist, -1 is returned. +.Pp +.Fn OF_getprop +copies a maximum of +.Fa len +bytes from the value associated with the property +.Fa propname +of the device node +.Fa node +into the memory specified by +.Fa buf . +Returns the actual size of the value or -1 if the +property does not exist. +.Pp +.Fn OF_getencprop +copies a maximum of +.Fa len +bytes into memory specified by +.Fa buf , +then converts cell values from big-endian to host byte order. 
+Returns the actual size of the value in bytes, or -1
+if the property does not exist.
+.Fa len
+must be a multiple of 4.
+.Pp
+.Fn OF_hasprop
+returns 1 if the device node
+.Fa node
+has a property specified by
+.Fa propname ,
+and zero if the property does not exist.
+.Pp
+.Fn OF_searchprop
+recursively looks for the property specified by
+.Fa propname
+starting with the device node
+.Fa node ,
+followed by the parent node, and up to the root node.
+If the property is found, the function copies a maximum of
+.Fa len
+bytes of the value associated with the property
+into the memory specified by
+.Fa buf .
+Returns the actual size in bytes of the value,
+or -1 if the property does not exist.
+.Pp
+.Fn OF_searchencprop
+recursively looks for the property specified by
+.Fa propname
+starting with the device node
+.Fa node ,
+followed by the parent node, and up to the root node.
+If the property is found, the function copies a maximum of
+.Fa len
+bytes of the value associated with the property
+into the memory specified by
+.Fa buf ,
+then converts cell values from big-endian to host byte order.
+Returns the actual size in bytes of the value,
+or -1 if the property does not exist.
+.Pp
+.Fn OF_getprop_alloc
+allocates memory large enough to hold the
+value associated with the property
+.Fa propname
+of the device node
+.Fa node
+and copies the value into the newly allocated memory region.
+Returns the actual size of the value and stores
+the address of the allocated memory in
+.Fa *buf .
+If the property has a zero-sized value,
+.Fa *buf
+is set to NULL.
+Returns -1 if the property does not exist or
+memory allocation failed.
+Allocated memory should be released when no longer required
+by calling
+.Fn OF_prop_free .
+The function might sleep when allocating memory.
+.Pp
+.Fn OF_getencprop_alloc
+allocates enough memory to hold the
+value associated with the property
+.Fa propname
+of the device node
+.Fa node ,
+copies the value into the newly allocated memory region, and
+then converts cell values from big-endian to host byte
+order.
+The actual size of the value is returned and the
+address of the allocated memory is stored in
+.Fa *buf .
+If the property has a zero-length value,
+.Fa *buf
+is set to NULL.
+Returns -1 if the property does not exist or
+memory allocation failed.
+Allocated memory should be released when no longer required
+by calling
+.Fn OF_prop_free .
+The function might sleep when allocating memory.
+.Pp
+.Fn OF_prop_free
+releases the memory at
+.Fa buf
+that was allocated by
+.Fn OF_getprop_alloc
+or
+.Fn OF_getencprop_alloc .
+.Pp
+.Fn OF_nextprop
+copies a maximum of
+.Fa len
+bytes of the name of the property following the
+.Fa propname
+property into
+.Fa buf .
+If
+.Fa propname
+is NULL, the function copies the name of the first property of the
+device node
+.Fa node .
+.Fn OF_nextprop
+returns -1 if
+.Fa propname
+is invalid or there is an internal error, 0 if there are no more
+properties after
+.Fa propname ,
+or 1 otherwise.
+.Pp
+.Fn OF_setprop
+sets the value of the property
+.Fa propname
+in the device node
+.Fa node
+to the value beginning at the address specified by
+.Fa buf
+and running for
+.Fa len
+bytes.
+If the property does not exist, the function tries to create
+it.
+.Fn OF_setprop
+returns the actual size of the new value, or -1 if the
+property value cannot be changed or the new property
+cannot be created.
+.Sh EXAMPLES
+.Bd -literal
+    phandle_t node;
+    phandle_t hdmixref, hdminode;
+    device_t hdmi;
+    uint8_t mac[6];
+    char *model;
+
+    /*
+     * Get a byte array property
+     */
+    if (OF_getprop(node, "eth,hwaddr", mac, sizeof(mac)) != sizeof(mac))
+        return;
+
+    /*
+     * Get internal node reference and device associated with it
+     */
+    if (OF_getencprop(node, "hdmi", &hdmixref, sizeof(hdmixref)) <= 0)
+        return;
+    hdmi = OF_device_from_xref(hdmixref);
+
+    /*
+     * Get string value of model property of HDMI framer node
+     */
+    hdminode = OF_node_from_xref(hdmixref);
+    if (OF_getprop_alloc(hdminode, "model", (void **)&model) <= 0)
+        return;
+.Ed
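+.Pp
+The following is a minimal sketch (not code from the tree) of
+enumerating all property names of a node with
+.Fn OF_nextprop
+and releasing values obtained with
+.Fn OF_getprop_alloc
+via
+.Fn OF_prop_free :
+.Bd -literal
+    char prop[32], next[32];
+    void *value;
+    ssize_t len;
+
+    /* Names are at most 31 characters, so 32 bytes suffice */
+    if (OF_nextprop(node, NULL, prop, sizeof(prop)) <= 0)
+        return;
+    for (;;) {
+        len = OF_getprop_alloc(node, prop, &value);
+        if (len > 0) {
+            /* ... use the property value ... */
+            OF_prop_free(value);
+        }
+        /* Advance to the name of the next property */
+        if (OF_nextprop(node, prop, next, sizeof(next)) <= 0)
+            break;
+        strlcpy(prop, next, sizeof(prop));
+    }
+.Ed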
+.Sh SEE ALSO
+.Xr OF_device_from_xref 9
+.Xr OF_node_from_xref 9
+.Sh AUTHORS
+.An -nosplit
+This manual page was written by
+.An Oleksandr Tymoshenko Aq Mt gonzo@FreeBSD.org .

Property changes on: user/markj/netdump/share/man/man9/OF_getprop.9
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: user/markj/netdump/share/man/man9/OF_node_from_xref.9
===================================================================
--- user/markj/netdump/share/man/man9/OF_node_from_xref.9 (nonexistent)
+++ user/markj/netdump/share/man/man9/OF_node_from_xref.9 (revision 332408)
@@ -0,0 +1,100 @@
+.\"
+.\" Copyright (c) 2018 Oleksandr Tymoshenko
+.\"
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR
+.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT,
+.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd April 9, 2018
+.Dt OF_NODE_FROM_XREF 9
+.Os
+.Sh NAME
+.Nm OF_node_from_xref ,
+.Nm OF_xref_from_node
+.Nd convert between kernel phandle and effective phandle
+.Sh SYNOPSIS
+.In dev/ofw/ofw_bus.h
+.In dev/ofw/ofw_bus_subr.h
+.Ft phandle_t
+.Fn OF_node_from_xref "phandle_t xref"
+.Ft phandle_t
+.Fn OF_xref_from_node "phandle_t node"
+.Sh DESCRIPTION
+Some OpenFirmware implementations (FDT, IBM) have a concept
+of effective phandles, or xrefs.
+They are used to cross-reference device tree nodes.
+For instance, a framebuffer controller may refer to a GPIO
+controller and pin that controls the backlight.
+In this example, the GPIO node would have a cell (32-bit integer)
+property with a reserved name like "phandle" or "linux,phandle"
+whose value uniquely identifies the node.
+The actual name depends on the implementation.
+The framebuffer node would have a property with the name
+described by device bindings (device-specific set of properties).
+It can be a cell property or a combined property with one part
+of it being a cell.
+The value of the framebuffer node's property would be the same
+as the value of the GPIO "phandle" property, so it can be said
+that the framebuffer node refers to the GPIO node.
+The kernel uses internal logic to assign unique identifiers
+to the device tree nodes, and these values do not match
+the values of "phandle" properties.
+.Fn OF_node_from_xref
+and
+.Fn OF_xref_from_node
+are used to perform conversion between these two kinds of node
+identifiers.
+.Pp
+.Fn OF_node_from_xref
+returns the kernel phandle for the effective phandle
+.Fa xref .
+If one cannot be found or the OpenFirmware implementation
+does not support effective phandles, the function returns
+the input value.
+.Pp
+.Fn OF_xref_from_node
+returns the effective phandle for the kernel phandle
+.Fa node .
+If one cannot be found or the OpenFirmware implementation
+does not support effective phandles, the function returns
+the input value.
+.Sh EXAMPLES
+.Bd -literal
+    phandle_t node, panelnode, panelxref;
+    char *model;
+
+    if (OF_getencprop(node, "lcd-panel", &panelxref,
+        sizeof(panelxref)) <= 0)
+        return;
+
+    panelnode = OF_node_from_xref(panelxref);
+    if (OF_getprop_alloc(panelnode, "model", (void **)&model) <= 0)
+        return;
+.Ed
+.Sh SEE ALSO
+.Xr OF_device_from_xref 9
+.Xr OF_device_register_xref 9
+.Sh AUTHORS
+.An -nosplit
+This manual page was written by
+.An Oleksandr Tymoshenko Aq Mt gonzo@FreeBSD.org .

Property changes on: user/markj/netdump/share/man/man9/OF_node_from_xref.9
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: user/markj/netdump/share/man/man9/OF_package_to_path.9
===================================================================
--- user/markj/netdump/share/man/man9/OF_package_to_path.9 (nonexistent)
+++ user/markj/netdump/share/man/man9/OF_package_to_path.9 (revision 332408)
@@ -0,0 +1,54 @@
+.\"
+.\" Copyright (c) 2018 Oleksandr Tymoshenko
+.\"
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR
+.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT, +.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT +.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF +.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd April 9, 2018 +.Dt OF_PACKAGE_TO_PATH 9 +.Os +.Sh NAME +.Nm OF_package_to_path +.Nd get fully qualified path to a device tree node +.Sh SYNOPSIS +.In dev/ofw/ofw_bus.h +.In dev/ofw/ofw_bus_subr.h +.Ft ssize_t +.Fn OF_package_to_path "phandle_t node" "char *buf" "size_t len" +.Sh DESCRIPTION +.Pp +.Fn OF_package_to_path +copies at most +.Fa len +bytes of the fully qualified path to the device tree node +.Fa node +into the memory specified by +.Fa buf . +The function returns the number of bytes copied or -1 in case of the error. +.Sh SEE ALSO +.Xr OF_finddevice 9 +.Sh AUTHORS +.An -nosplit +This manual page was written by +.An Oleksandr Tymoshenko Aq Mt gonzo@FreeBSD.org . Property changes on: user/markj/netdump/share/man/man9/OF_package_to_path.9 ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/markj/netdump/share/misc/committers-src.dot =================================================================== --- user/markj/netdump/share/misc/committers-src.dot (revision 332407) +++ user/markj/netdump/share/misc/committers-src.dot (revision 332408) @@ -1,842 +1,845 @@ # $FreeBSD$ # This file is meant to list all FreeBSD src committers and describe the # mentor-mentee relationships between them. # The graphical output can be generated from this file with the following # command: # $ dot -T png -o file.png committers-src.dot # # The dot binary is part of the graphics/graphviz port. digraph src { # Node definitions follow this example: # # foo [label="Foo Bar\nfoo@FreeBSD.org\n????/??/??"] # # ????/??/?? is the date when the commit bit was obtained, usually the one you # can find looking at svn logs for the svnadmin/conf/access file. # Use YYYY/MM/DD format. # # For returned commit bits, the node definition will follow this example: # # foo [label="Foo Bar\nfoo@FreeBSD.org\n????/??/??\n????/??/??"] # # The first date is the same as for an active committer, the second date is # the date when the commit bit has been returned. Again, check svn logs. node [color=grey62, style=filled, bgcolor=black]; # Alumni go here.. Try to keep things sorted. alm [label="Andrew Moore\nalm@FreeBSD.org\n1993/06/12\n????/??/??"] anholt [label="Eric Anholt\nanholt@FreeBSD.org\n2002/04/22\n2008/08/07"] archie [label="Archie Cobbs\narchie@FreeBSD.org\n1998/11/06\n2006/06/09"] arr [label="Andrew R. Reiter\narr@FreeBSD.org\n2001/11/02\n2005/05/25"] arun [label="Arun Sharma\narun@FreeBSD.org\n2003/03/06\n2006/12/16"] asmodai [label="Jeroen Ruigrok\nasmodai@FreeBSD.org\n1999/12/16\n2001/11/16"] benjsc [label="Benjamin Close\nbenjsc@FreeBSD.org\n2007/02/09\n2010/09/15"] billf [label="Bill Fumerola\nbillf@FreeBSD.org\n1998/11/11\n2008/11/10"] bmah [label="Bruce A. 
bmilekic [label="Bosko Milekic\nbmilekic@FreeBSD.org\n2000/09/21\n2008/11/10"]
bushman [label="Michael Bushkov\nbushman@FreeBSD.org\n2007/03/10\n2010/04/29"]
carl [label="Carl Delsey\ncarl@FreeBSD.org\n2013/01/14\n2014/03/06"]
ceri [label="Ceri Davies\nceri@FreeBSD.org\n2006/11/07\n2012/03/07"]
cjc [label="Crist J. Clark\ncjc@FreeBSD.org\n2001/06/01\n2006/12/29"]
davidxu [label="David Xu\ndavidxu@FreeBSD.org\n2002/09/02\n2014/04/14"]
dds [label="Diomidis Spinellis\ndds@FreeBSD.org\n2003/06/20\n2010/09/22"]
dhartmei [label="Daniel Hartmeier\ndhartmei@FreeBSD.org\n2004/04/06\n2008/12/08"]
dmlb [label="Duncan Barclay\ndmlb@FreeBSD.org\n2001/12/14\n2008/11/10"]
dougb [label="Doug Barton\ndougb@FreeBSD.org\n2000/10/26\n2012/10/08"]
eik [label="Oliver Eikemeier\neik@FreeBSD.org\n2004/05/20\n2008/11/10"]
furuta [label="Atsushi Furuta\nfuruta@FreeBSD.org\n2000/06/21\n2003/03/08"]
gj [label="Gary L. Jennejohn\ngj@FreeBSD.org\n1994/??/??\n2006/04/28"]
groudier [label="Gerard Roudier\ngroudier@FreeBSD.org\n1999/12/30\n2006/04/06"]
jake [label="Jake Burkholder\njake@FreeBSD.org\n2000/05/16\n2008/11/10"]
jayanth [label="Jayanth Vijayaraghavan\njayanth@FreeBSD.org\n2000/05/08\n2008/11/10"]
jb [label="John Birrell\njb@FreeBSD.org\n1997/03/27\n2009/12/15"]
jdp [label="John Polstra\njdp@FreeBSD.org\n1995/12/07\n2008/02/26"]
jedgar [label="Chris D. Faulhaber\njedgar@FreeBSD.org\n1999/12/15\n2006/04/07"]
jkh [label="Jordan K. Hubbard\njkh@FreeBSD.org\n1993/06/12\n2008/06/13"]
jlemon [label="Jonathan Lemon\njlemon@FreeBSD.org\n1997/08/14\n2008/11/10"]
joe [label="Josef Karthauser\njoe@FreeBSD.org\n1999/10/22\n2008/08/10"]
jtc [label="J.T. Conklin\njtc@FreeBSD.org\n1993/06/12\n????/??/??"]
kargl [label="Steven G. Kargl\nkargl@FreeBSD.org\n2011/01/17\n2015/06/28"]
kbyanc [label="Kelly Yancey\nkbyanc@FreeBSD.org\n2000/07/11\n2006/07/25"]
keichii [label="Michael Wu\nkeichii@FreeBSD.org\n2001/03/07\n2006/04/28"]
linimon [label="Mark Linimon\nlinimon@FreeBSD.org\n2006/09/30\n2008/05/04"]
lulf [label="Ulf Lilleengen\nlulf@FreeBSD.org\n2007/10/24\n2012/01/19"]
mb [label="Maxim Bolotin\nmb@FreeBSD.org\n2000/04/06\n2003/03/08"]
marks [label="Mark Santcroos\nmarks@FreeBSD.org\n2004/03/18\n2008/09/29"]
mike [label="Mike Barcroft\nmike@FreeBSD.org\n2001/07/17\n2006/04/28"]
msmith [label="Mike Smith\nmsmith@FreeBSD.org\n1996/10/22\n2003/12/15"]
murray [label="Murray Stokely\nmurray@FreeBSD.org\n2000/04/05\n2010/07/25"]
mux [label="Maxime Henrion\nmux@FreeBSD.org\n2002/03/03\n2011/06/22"]
nate [label="Nate Williams\nnate@FreeBSD.org\n1993/06/12\n2003/12/15"]
njl [label="Nate Lawson\nnjl@FreeBSD.org\n2002/08/07\n2008/02/16"]
non [label="Noriaki Mitsnaga\nnon@FreeBSD.org\n2000/06/19\n2007/03/06"]
onoe [label="Atsushi Onoe\nonoe@FreeBSD.org\n2000/07/21\n2008/11/10"]
+ram [label="Ram Kishore Vegesna\nram@FreeBSD.org\n2018/04/04\n????/??/??"]
rafan [label="Rong-En Fan\nrafan@FreeBSD.org\n2007/01/31\n2012/07/23"]
randi [label="Randi Harper\nrandi@FreeBSD.org\n2010/04/20\n2012/05/10"]
rink [label="Rink Springer\nrink@FreeBSD.org\n2006/01/16\n2010/11/04"]
robert [label="Robert Drehmel\nrobert@FreeBSD.org\n2001/08/23\n2006/05/13"]
sah [label="Sam Hopkins\nsah@FreeBSD.org\n2004/12/15\n2008/11/10"]
shafeeq [label="Shafeeq Sinnamohideen\nshafeeq@FreeBSD.org\n2000/06/19\n2006/04/06"]
sheldonh [label="Sheldon Hearn\nsheldonh@FreeBSD.org\n1999/06/14\n2006/05/13"]
shiba [label="Takeshi Shibagaki\nshiba@FreeBSD.org\n2000/06/19\n2008/11/10"]
shin [label="Yoshinobu
Inoue\nshin@FreeBSD.org\n1999/07/29\n2003/03/08"] snb [label="Nick Barkas\nsnb@FreeBSD.org\n2009/05/05\n2010/11/04"] tmm [label="Thomas Moestl\ntmm@FreeBSD.org\n2001/03/07\n2006/07/12"] toshi [label="Toshihiko Arai\ntoshi@FreeBSD.org\n2000/07/06\n2003/03/08"] tshiozak [label="Takuya SHIOZAKI\ntshiozak@FreeBSD.org\n2001/04/25\n2003/03/08"] uch [label="UCHIYAMA Yasushi\nuch@FreeBSD.org\n2000/06/21\n2002/04/24"] wilko [label="Wilko Bulte\nwilko@FreeBSD.org\n2000/01/13\n2013/01/17"] yar [label="Yar Tikhiy\nyar@FreeBSD.org\n2001/03/25\n2012/05/23"] zack [label="Zack Kirsch\nzack@FreeBSD.org\n2010/11/05\n2012/09/08"] node [color=lightblue2, style=filled, bgcolor=black]; # Current src committers go here. Try to keep things sorted. ache [label="Andrey Chernov\nache@FreeBSD.org\n1993/10/31"] achim [label="Achim Leubner\nachim@FreeBSD.org\n2013/01/23"] adrian [label="Adrian Chadd\nadrian@FreeBSD.org\n2000/07/03"] ae [label="Andrey V. Elsukov\nae@FreeBSD.org\n2010/06/03"] akiyama [label="Shunsuke Akiyama\nakiyama@FreeBSD.org\n2000/06/19"] alc [label="Alan Cox\nalc@FreeBSD.org\n1999/02/23"] allanjude [label="Allan Jude\nallanjude@FreeBSD.org\n2015/07/30"] ambrisko [label="Doug Ambrisko\nambrisko@FreeBSD.org\n2001/12/19"] anchie [label="Ana Kukec\nanchie@FreeBSD.org\n2010/04/14"] andre [label="Andre Oppermann\nandre@FreeBSD.org\n2003/11/12"] andreast [label="Andreas Tobler\nandreast@FreeBSD.org\n2010/09/05"] andrew [label="Andrew Turner\nandrew@FreeBSD.org\n2010/07/19"] antoine [label="Antoine Brodin\nantoine@FreeBSD.org\n2008/02/03"] araujo [label="Marcelo Araujo\naraujo@FreeBSD.org\n2015/08/04"] arichardson [label="Alex Richardson\narichardson@FreeBSD.org\n2017/10/30"] ariff [label="Ariff Abdullah\nariff@FreeBSD.org\n2005/11/14"] art [label="Artem Belevich\nart@FreeBSD.org\n2011/03/29"] arybchik [label="Andrew Rybchenko\narybchik@FreeBSD.org\n2014/10/12"] asomers [label="Alan Somers\nasomers@FreeBSD.org\n2013/04/24"] avg [label="Andriy Gapon\navg@FreeBSD.org\n2009/02/18"] avos [label="Andriy Voskoboinyk\navos@FreeBSD.org\n2015/09/24"] badger [label="Eric Badger\nbadger@FreeBSD.org\n2016/07/01"] bapt [label="Baptiste Daroussin\nbapt@FreeBSD.org\n2011/12/23"] bde [label="Bruce Evans\nbde@FreeBSD.org\n1994/08/20"] bdrewery [label="Bryan Drewery\nbdrewery@FreeBSD.org\n2013/12/14"] benl [label="Ben Laurie\nbenl@FreeBSD.org\n2011/05/18"] benno [label="Benno Rice\nbenno@FreeBSD.org\n2000/11/02"] bms [label="Bruce M Simpson\nbms@FreeBSD.org\n2003/08/06"] br [label="Ruslan Bukin\nbr@FreeBSD.org\n2013/09/02"] brian [label="Brian Somers\nbrian@FreeBSD.org\n1996/12/16"] brooks [label="Brooks Davis\nbrooks@FreeBSD.org\n2001/06/21"] brucec [label="Bruce Cran\nbrucec@FreeBSD.org\n2010/01/29"] brueffer [label="Christian Brueffer\nbrueffer@FreeBSD.org\n2006/02/28"] bruno [label="Bruno Ducrot\nbruno@FreeBSD.org\n2005/07/18"] bryanv [label="Bryan Venteicher\nbryanv@FreeBSD.org\n2012/11/03"] bschmidt [label="Bernhard Schmidt\nbschmidt@FreeBSD.org\n2010/02/06"] bz [label="Bjoern A. Zeeb\nbz@FreeBSD.org\n2004/07/27"] cem [label="Conrad Meyer\ncem@FreeBSD.org\n2015/07/05"] chuck [label="Chuck Tuffli\nchuck@FreeBSD.org\n2017/09/06"] cognet [label="Olivier Houchard\ncognet@FreeBSD.org\n2002/10/09"] cokane [label="Coleman Kane\ncokane@FreeBSD.org\n2000/06/19"] cperciva [label="Colin Percival\ncperciva@FreeBSD.org\n2004/01/20"] csjp [label="Christian S.J. 
Peron\ncsjp@FreeBSD.org\n2004/05/04"] dab [label="David Bright\ndab@FreeBSD.org\n2016/10/24"] das [label="David Schultz\ndas@FreeBSD.org\n2003/02/21"] davide [label="Davide Italiano\ndavide@FreeBSD.org\n2012/01/27"] dchagin [label="Dmitry Chagin\ndchagin@FreeBSD.org\n2009/02/28"] def [label="Konrad Witaszczyk\ndef@FreeBSD.org\n2016/11/02"] delphij [label="Xin Li\ndelphij@FreeBSD.org\n2004/09/14"] des [label="Dag-Erling Smorgrav\ndes@FreeBSD.org\n1998/04/03"] dexuan [label="Dexuan Cui\ndexuan@FreeBSD.org\n2016/10/24"] dfr [label="Doug Rabson\ndfr@FreeBSD.org\n????/??/??"] dg [label="David Greenman\ndg@FreeBSD.org\n1993/06/14"] dim [label="Dimitry Andric\ndim@FreeBSD.org\n2010/08/30"] dteske [label="Devin Teske\ndteske@FreeBSD.org\n2012/04/10"] dumbbell [label="Jean-Sebastien Pedron\ndumbbell@FreeBSD.org\n2004/11/29"] dwmalone [label="David Malone\ndwmalone@FreeBSD.org\n2000/07/11"] eadler [label="Eitan Adler\neadler@FreeBSD.org\n2012/01/18"] ed [label="Ed Schouten\ned@FreeBSD.org\n2008/05/22"] edavis [label="Eric Davis\nedavis@FreeBSD.org\n2013/10/09"] edwin [label="Edwin Groothuis\nedwin@FreeBSD.org\n2007/06/25"] eivind [label="Eivind Eklund\neivind@FreeBSD.org\n1997/02/02"] emaste [label="Ed Maste\nemaste@FreeBSD.org\n2005/10/04"] emax [label="Maksim Yevmenkin\nemax@FreeBSD.org\n2003/10/12"] eri [label="Ermal Luci\neri@FreeBSD.org\n2008/06/11"] erj [label="Eric Joyner\nerj@FreeBSD.org\n2014/12/14"] eugen [label="Eugene Grosbein\neugen@FreeBSD.org\n2017/09/19"] fabient [label="Fabien Thomas\nfabient@FreeBSD.org\n2009/03/16"] fanf [label="Tony Finch\nfanf@FreeBSD.org\n2002/05/05"] fjoe [label="Max Khon\nfjoe@FreeBSD.org\n2001/08/06"] flz [label="Florent Thoumie\nflz@FreeBSD.org\n2006/03/30"] fsu [label="Fedor Uporov\nfsu@FreeBSD.org\n2017/08/28"] gabor [label="Gabor Kovesdan\ngabor@FreeBSD.org\n2010/02/02"] gad [label="Garance A. Drosehn\ngad@FreeBSD.org\n2000/10/27"] gallatin [label="Andrew Gallatin\ngallatin@FreeBSD.org\n1999/01/15"] ganbold [label="Ganbold Tsagaankhuu\nganbold@FreeBSD.org\n2013/12/18"] gavin [label="Gavin Atkinson\ngavin@FreeBSD.org\n2009/12/07"] gibbs [label="Justin T. Gibbs\ngibbs@FreeBSD.org\n????/??/??"] gjb [label="Glen Barber\ngjb@FreeBSD.org\n2013/06/04"] gleb [label="Gleb Kurtsou\ngleb@FreeBSD.org\n2011/09/19"] glebius [label="Gleb Smirnoff\nglebius@FreeBSD.org\n2004/07/14"] gnn [label="George V. Neville-Neil\ngnn@FreeBSD.org\n2004/10/11"] gordon [label="Gordon Tetlow\ngordon@FreeBSD.org\n2002/05/17"] grehan [label="Peter Grehan\ngrehan@FreeBSD.org\n2002/08/08"] grog [label="Greg Lehey\ngrog@FreeBSD.org\n1998/08/30"] gshapiro [label="Gregory Shapiro\ngshapiro@FreeBSD.org\n2000/07/12"] harti [label="Hartmut Brandt\nharti@FreeBSD.org\n2003/01/29"] hiren [label="Hiren Panchasara\nhiren@FreeBSD.org\n2013/04/12"] hmp [label="Hiten Pandya\nhmp@FreeBSD.org\n2004/03/23"] hselasky [label="Hans Petter Selasky\nhselasky@FreeBSD.org\n"] ian [label="Ian Lepore\nian@FreeBSD.org\n2013/01/07"] iedowse [label="Ian Dowse\niedowse@FreeBSD.org\n2000/12/01"] imp [label="Warner Losh\nimp@FreeBSD.org\n1996/09/20"] ivoras [label="Ivan Voras\nivoras@FreeBSD.org\n2008/06/10"] jah [label="Jason A. 
Harmening\njah@FreeBSD.org\n2015/03/08"] jamie [label="Jamie Gritton\njamie@FreeBSD.org\n2009/01/28"] jasone [label="Jason Evans\njasone@FreeBSD.org\n1999/03/03"] jceel [label="Jakub Klama\njceel@FreeBSD.org\n2011/09/25"] jch [label="Julien Charbon\njch@FreeBSD.org\n2014/09/24"] jchandra [label="Jayachandran C.\njchandra@FreeBSD.org\n2010/05/19"] jeb [label="Jeb Cramer\njeb@FreeBSD.org\n2018/01/25"] jeff [label="Jeff Roberson\njeff@FreeBSD.org\n2002/02/21"] jh [label="Jaakko Heinonen\njh@FreeBSD.org\n2009/10/02"] jhb [label="John Baldwin\njhb@FreeBSD.org\n1999/08/23"] jhibbits [label="Justin Hibbits\njhibbits@FreeBSD.org\n2011/11/30"] jilles [label="Jilles Tjoelker\njilles@FreeBSD.org\n2009/05/22"] jimharris [label="Jim Harris\njimharris@FreeBSD.org\n2011/12/09"] jinmei [label="JINMEI Tatuya\njinmei@FreeBSD.org\n2007/03/17"] jkim [label="Jung-uk Kim\njkim@FreeBSD.org\n2005/07/06"] jkoshy [label="A. Joseph Koshy\njkoshy@FreeBSD.org\n1998/05/13"] jlh [label="Jeremie Le Hen\njlh@FreeBSD.org\n2012/04/22"] jls [label="Jordan Sissel\njls@FreeBSD.org\n2006/12/06"] jmcneill [label="Jared McNeill\njmcneill@FreeBSD.org\n2016/02/24"] jmg [label="John-Mark Gurney\njmg@FreeBSD.org\n1997/02/13"] jmmv [label="Julio Merino\njmmv@FreeBSD.org\n2013/11/02"] joerg [label="Joerg Wunsch\njoerg@FreeBSD.org\n1993/11/14"] jon [label="Jonathan Chen\njon@FreeBSD.org\n2000/10/17"] jonathan [label="Jonathan Anderson\njonathan@FreeBSD.org\n2010/10/07"] jpaetzel [label="Josh Paetzel\njpaetzel@FreeBSD.org\n2011/01/21"] jtl [label="Jonathan T. Looney\njtl@FreeBSD.org\n2015/10/26"] julian [label="Julian Elischer\njulian@FreeBSD.org\n1993/04/19"] jwd [label="John De Boskey\njwd@FreeBSD.org\n2000/05/19"] kaiw [label="Kai Wang\nkaiw@FreeBSD.org\n2007/09/26"] kan [label="Alexander Kabaev\nkan@FreeBSD.org\n2002/07/21"] karels [label="Mike Karels\nkarels@FreeBSD.org\n2016/06/09"] ken [label="Ken Merry\nken@FreeBSD.org\n1998/09/08"] kensmith [label="Ken Smith\nkensmith@FreeBSD.org\n2004/01/23"] kevans [label="Kyle Evans\nkevans@FreeBSD.org\n2017/06/20"] kevlo [label="Kevin Lo\nkevlo@FreeBSD.org\n2006/07/23"] kib [label="Konstantin Belousov\nkib@FreeBSD.org\n2006/06/03"] kibab [label="Ilya Bakulin\nkibab@FreeBSD.org\n2017/09/02"] kmacy [label="Kip Macy\nkmacy@FreeBSD.org\n2005/06/01"] kp [label="Kristof Provost\nkp@FreeBSD.org\n2015/03/22"] landonf [label="Landon Fuller\nlandonf@FreeBSD.org\n2016/05/31"] le [label="Lukas Ertl\nle@FreeBSD.org\n2004/02/02"] lidl [label="Kurt Lidl\nlidl@FreeBSD.org\n2015/10/21"] loos [label="Luiz Otavio O Souza\nloos@FreeBSD.org\n2013/07/03"] lstewart [label="Lawrence Stewart\nlstewart@FreeBSD.org\n2008/10/06"] manu [label="Emmanuel Vadot\nmanu@FreeBSD.org\n2016/04/24"] marcel [label="Marcel Moolenaar\nmarcel@FreeBSD.org\n1999/07/03"] marius [label="Marius Strobl\nmarius@FreeBSD.org\n2004/04/17"] markj [label="Mark Johnston\nmarkj@FreeBSD.org\n2012/12/18"] markm [label="Mark Murray\nmarkm@FreeBSD.org\n1995/04/24"] markus [label="Markus Brueffer\nmarkus@FreeBSD.org\n2006/06/01"] matteo [label="Matteo Riondato\nmatteo@FreeBSD.org\n2006/01/18"] mav [label="Alexander Motin\nmav@FreeBSD.org\n2007/04/12"] maxim [label="Maxim Konovalov\nmaxim@FreeBSD.org\n2002/02/07"] mdf [label="Matthew Fleming\nmdf@FreeBSD.org\n2010/06/04"] mdodd [label="Matthew N. Dodd\nmdodd@FreeBSD.org\n1999/07/27"] melifaro [label="Alexander V. 
Chernikov\nmelifaro@FreeBSD.org\n2011/10/04"]
mizhka [label="Michael Zhilin\nmizhka@FreeBSD.org\n2016/07/19"]
mjacob [label="Matt Jacob\nmjacob@FreeBSD.org\n1997/08/13"]
mjg [label="Mateusz Guzik\nmjg@FreeBSD.org\n2012/06/04"]
mjoras [label="Matt Joras\nmjoras@FreeBSD.org\n2017/07/12"]
mlaier [label="Max Laier\nmlaier@FreeBSD.org\n2004/02/10"]
mmel [label="Michal Meloun\nmmel@FreeBSD.org\n2015/11/01"]
monthadar [label="Monthadar Al Jaberi\nmonthadar@FreeBSD.org\n2012/04/02"]
mp [label="Mark Peek\nmp@FreeBSD.org\n2001/07/27"]
mr [label="Michael Reifenberger\nmr@FreeBSD.org\n2001/09/30"]
mw [label="Marcin Wojtas\nmw@FreeBSD.org\n2017/07/18"]
neel [label="Neel Natu\nneel@FreeBSD.org\n2009/09/20"]
netchild [label="Alexander Leidinger\nnetchild@FreeBSD.org\n2005/03/31"]
ngie [label="Ngie Cooper\nngie@FreeBSD.org\n2014/07/27"]
nork [label="Norikatsu Shigemura\nnork@FreeBSD.org\n2009/06/09"]
np [label="Navdeep Parhar\nnp@FreeBSD.org\n2009/06/05"]
nwhitehorn [label="Nathan Whitehorn\nnwhitehorn@FreeBSD.org\n2008/07/03"]
n_hibma [label="Nick Hibma\nn_hibma@FreeBSD.org\n1998/11/26"]
obrien [label="David E. O'Brien\nobrien@FreeBSD.org\n1996/10/29"]
olli [label="Oliver Fromme\nolli@FreeBSD.org\n2008/02/14"]
oshogbo [label="Mariusz Zaborski\noshogbo@FreeBSD.org\n2015/04/15"]
peadar [label="Peter Edwards\npeadar@FreeBSD.org\n2004/03/08"]
peter [label="Peter Wemm\npeter@FreeBSD.org\n1995/07/04"]
peterj [label="Peter Jeremy\npeterj@FreeBSD.org\n2012/09/14"]
pfg [label="Pedro Giffuni\npfg@FreeBSD.org\n2011/12/01"]
phil [label="Phil Shafer\nphil@FreeBSD.org\n2015/12/30"]
philip [label="Philip Paeps\nphilip@FreeBSD.org\n2004/01/21"]
phk [label="Poul-Henning Kamp\nphk@FreeBSD.org\n1994/02/21"]
pho [label="Peter Holm\npho@FreeBSD.org\n2008/11/16"]
pjd [label="Pawel Jakub Dawidek\npjd@FreeBSD.org\n2004/02/02"]
pkelsey [label="Patrick Kelsey\npkelsey@FreeBSD.org\n2014/05/29"]
pluknet [label="Sergey Kandaurov\npluknet@FreeBSD.org\n2010/10/05"]
ps [label="Paul Saab\nps@FreeBSD.org\n2000/02/23"]
qingli [label="Qing Li\nqingli@FreeBSD.org\n2005/04/13"]
ray [label="Aleksandr Rybalko\nray@FreeBSD.org\n2011/05/25"]
rdivacky [label="Roman Divacky\nrdivacky@FreeBSD.org\n2008/03/13"]
remko [label="Remko Lodder\nremko@FreeBSD.org\n2007/02/23"]
rgrimes [label="Rodney W. Grimes\nrgrimes@FreeBSD.org\n1993/06/12\n2017/03/03"]
rik [label="Roman Kurakin\nrik@FreeBSD.org\n2003/12/18"]
rlibby [label="Ryan Libby\nrlibby@FreeBSD.org\n2017/06/07"]
rmacklem [label="Rick Macklem\nrmacklem@FreeBSD.org\n2009/03/27"]
rmh [label="Robert Millan\nrmh@FreeBSD.org\n2011/09/18"]
rnoland [label="Robert Noland\nrnoland@FreeBSD.org\n2008/09/15"]
roberto [label="Ollivier Robert\nroberto@FreeBSD.org\n1995/02/22"]
rodrigc [label="Craig Rodrigues\nrodrigc@FreeBSD.org\n2005/05/14"]
royger [label="Roger Pau Monne\nroyger@FreeBSD.org\n2013/11/26"]
rpaulo [label="Rui Paulo\nrpaulo@FreeBSD.org\n2007/09/25"]
rpokala [label="Ravi Pokala\nrpokala@FreeBSD.org\n2015/11/19"]
rrs [label="Randall R Stewart\nrrs@FreeBSD.org\n2007/02/08"]
rse [label="Ralf S. Engelschall\nrse@FreeBSD.org\n1997/07/31"]
rstone [label="Ryan Stone\nrstone@FreeBSD.org\n2010/04/19"]
ru [label="Ruslan Ermilov\nru@FreeBSD.org\n1999/05/27"]
rwatson [label="Robert N. M. Watson\nrwatson@FreeBSD.org\n1999/12/16"]
sam [label="Sam Leffler\nsam@FreeBSD.org\n2002/07/02"]
sanpei [label="MIHIRA Sanpei Yoshiro\nsanpei@FreeBSD.org\n2000/06/19"]
sbruno [label="Sean Bruno\nsbruno@FreeBSD.org\n2008/08/02"]
scf [label="Sean C.
Farley\nscf@FreeBSD.org\n2007/06/24"] schweikh [label="Jens Schweikhardt\nschweikh@FreeBSD.org\n2001/04/06"] scottl [label="Scott Long\nscottl@FreeBSD.org\n2000/09/28"] se [label="Stefan Esser\nse@FreeBSD.org\n1994/08/26"] sephe [label="Sepherosa Ziehau\nsephe@FreeBSD.org\n2007/03/28"] sepotvin [label="Stephane E. Potvin\nsepotvin@FreeBSD.org\n2007/02/15"] sgalabov [label="Stanislav Galabov\nsgalabov@FreeBSD.org\n2016/02/24"] shurd [label="Stephen Hurd\nshurd@FreeBSD.org\n2017/09/02"] simon [label="Simon L. Nielsen\nsimon@FreeBSD.org\n2006/03/07"] sjg [label="Simon J. Gerraty\nsjg@FreeBSD.org\n2012/10/23"] skra [label="Svatopluk Kraus\nskra@FreeBSD.org\n2015/10/28"] slavash [label="Slava Shwartsman\nslavash@FreeBSD.org\n2018/02/08"] slm [label="Stephen McConnell\nslm@FreeBSD.org\n2014/05/07"] smh [label="Steven Hartland\nsmh@FreeBSD.org\n2012/11/12"] sobomax [label="Maxim Sobolev\nsobomax@FreeBSD.org\n2001/07/25"] sos [label="Soren Schmidt\nsos@FreeBSD.org\n????/??/??"] sson [label="Stacey Son\nsson@FreeBSD.org\n2008/07/08"] stas [label="Stanislav Sedov\nstas@FreeBSD.org\n2008/08/22"] stevek [label="Stephen J. Kiernan\nstevek@FreeBSD.org\n2016/07/18"] suz [label="SUZUKI Shinsuke\nsuz@FreeBSD.org\n2002/03/26"] syrinx [label="Shteryana Shopova\nsyrinx@FreeBSD.org\n2006/10/07"] takawata [label="Takanori Watanabe\ntakawata@FreeBSD.org\n2000/07/06"] theraven [label="David Chisnall\ntheraven@FreeBSD.org\n2011/11/11"] thompsa [label="Andrew Thompson\nthompsa@FreeBSD.org\n2005/05/25"] ticso [label="Bernd Walter\nticso@FreeBSD.org\n2002/01/31"] tijl [label="Tijl Coosemans\ntijl@FreeBSD.org\n2010/07/16"] tsoome [label="Toomas Soome\ntsoome@FreeBSD.org\n2016/08/10"] trasz [label="Edward Tomasz Napierala\ntrasz@FreeBSD.org\n2008/08/22"] trhodes [label="Tom Rhodes\ntrhodes@FreeBSD.org\n2002/05/28"] trociny [label="Mikolaj Golub\ntrociny@FreeBSD.org\n2011/03/10"] tuexen [label="Michael Tuexen\ntuexen@FreeBSD.org\n2009/06/06"] tychon [label="Tycho Nightingale\ntychon@FreeBSD.org\n2014/01/21"] ume [label="Hajimu UMEMOTO\nume@FreeBSD.org\n2000/02/26"] uqs [label="Ulrich Spoerlein\nuqs@FreeBSD.org\n2010/01/28"] vangyzen [label="Eric van Gyzen\nvangyzen@FreeBSD.org\n2015/03/08"] vanhu [label="Yvan Vanhullebus\nvanhu@FreeBSD.org\n2008/07/21"] versus [label="Konrad Jankowski\nversus@FreeBSD.org\n2008/10/27"] weongyo [label="Weongyo Jeong\nweongyo@FreeBSD.org\n2007/12/21"] wes [label="Wes Peters\nwes@FreeBSD.org\n1998/11/25"] whu [label="Wei Hu\nwhu@FreeBSD.org\n2015/02/11"] wkoszek [label="Wojciech A. Koszek\nwkoszek@FreeBSD.org\n2006/02/21"] wma [label="Wojciech Macek\nwma@FreeBSD.org\n2016/01/18"] wollman [label="Garrett Wollman\nwollman@FreeBSD.org\n????/??/??"] wsalamon [label="Wayne Salamon\nwsalamon@FreeBSD.org\n2005/06/25"] wulf [label="Vladimir Kondratyev\nwulf@FreeBSD.org\n2017/04/27"] yongari [label="Pyun YongHyeon\nyongari@FreeBSD.org\n2004/08/01"] zbb [label="Zbigniew Bodek\nzbb@FreeBSD.org\n2013/09/02"] zec [label="Marko Zec\nzec@FreeBSD.org\n2008/06/22"] zml [label="Zachary Loafman\nzml@FreeBSD.org\n2009/05/27"] zont [label="Andrey Zonov\nzont@FreeBSD.org\n2012/08/21"] # Pseudo target representing rev 1.1 of commit.allow day1 [label="Birth of FreeBSD"] # Here are the mentor/mentee relationships. # Group together all the mentees for a particular mentor. # Keep the list sorted by mentor login. 
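# An edge "a -> b" below records that a mentored b; for example,
# "jkh -> imp" means jkh mentored imp.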
day1 -> jtc day1 -> jkh day1 -> nate day1 -> rgrimes day1 -> alm day1 -> dg adrian -> avos adrian -> jmcneill adrian -> landonf adrian -> lidl adrian -> loos adrian -> mizhka adrian -> monthadar adrian -> ray adrian -> rmh adrian -> sephe adrian -> sgalabov ae -> melifaro allanjude -> tsoome alc -> davide andre -> qingli andrew -> manu anholt -> jkim avg -> art avg -> eugen avg -> pluknet avg -> smh bapt -> allanjude bapt -> araujo bapt -> bdrewery bapt -> wulf bde -> rgrimes benno -> grehan billf -> dougb billf -> gad billf -> jedgar billf -> jhb billf -> shafeeq bmilekic -> csjp bms -> dhartmei bms -> mlaier bms -> thompsa brian -> joe brooks -> bushman brooks -> jamie brooks -> theraven brooks -> arichardson bz -> anchie bz -> jamie bz -> syrinx cognet -> br cognet -> jceel cognet -> kevlo cognet -> ian cognet -> manu cognet -> mw cognet -> wkoszek cognet -> wma cognet -> zbb cperciva -> eadler cperciva -> flz cperciva -> randi cperciva -> simon csjp -> bushman das -> kargl das -> rodrigc delphij -> gabor delphij -> rafan delphij -> sephe des -> anholt des -> hmp des -> mike des -> olli des -> ru des -> bapt dds -> versus dfr -> gallatin dfr -> zml dg -> peter dim -> theraven dwmalone -> fanf dwmalone -> peadar dwmalone -> snb ed -> dim ed -> gavin ed -> jilles ed -> rdivacky ed -> uqs eivind -> des eivind -> rwatson emaste -> achim emaste -> dteske emaste -> kevans emaste -> markj emaste -> rstone emax -> markus erj -> jeb fjoe -> versus gallatin -> ticso gavin -> versus gibbs -> mjacob gibbs -> njl gibbs -> royger gibbs -> whu glebius -> mav gnn -> jinmei gnn -> rrs gnn -> ivoras gnn -> vanhu gnn -> lstewart gnn -> np gnn -> davide gnn -> arybchik gnn -> erj gnn -> kp gnn -> jtl gnn -> karels gonzo -> jmcneill gonzo -> wulf grehan -> bryanv grehan -> rgrimes grog -> edwin grog -> le grog -> peterj hselasky -> slavash imp -> akiyama imp -> ambrisko imp -> andrew imp -> bmah imp -> bruno imp -> chuck imp -> dmlb imp -> emax imp -> furuta imp -> joe imp -> jon imp -> keichii imp -> kibab imp -> mb imp -> mr imp -> neel imp -> non imp -> nork imp -> onoe imp -> remko imp -> rik imp -> rink imp -> sanpei imp -> shiba imp -> takawata imp -> toshi imp -> tsoome imp -> uch jake -> bms jake -> gordon jake -> harti jake -> jeff jake -> kmacy jake -> robert jake -> yongari jb -> sson jdp -> fjoe jfv -> erj jhb -> arr jhb -> avg jhb -> jch jhb -> jeff jhb -> kbyanc jhb -> peterj jhb -> pfg jhb -> rnoland jhb -> rpokala jhb -> arichardson jimharris -> carl jkh -> dfr jkh -> gj jkh -> grog jkh -> imp jkh -> jlemon jkh -> joerg jkh -> jwd jkh -> msmith jkh -> murray jkh -> phk jkh -> wes jkh -> yar jkoshy -> kaiw jkoshy -> fabient jkoshy -> rstone jlemon -> bmilekic jlemon -> brooks jmallett -> pkelsey jmmv -> ngie joerg -> brian joerg -> eik joerg -> jmg joerg -> le joerg -> netchild joerg -> schweikh julian -> glebius julian -> davidxu julian -> archie julian -> adrian julian -> zec julian -> mp kan -> kib ken -> asomers ken -> chuck +ken -> ram ken -> slm kib -> ae kib -> badger kib -> dchagin kib -> gjb kib -> jah kib -> jlh kib -> jpaetzel kib -> lulf kib -> melifaro kib -> mmel kib -> pho kib -> pluknet kib -> rdivacky kib -> rmacklem kib -> rmh kib -> skra kib -> slavash kib -> stas kib -> tijl kib -> trociny kib -> vangyzen kib -> zont kmacy -> lstewart marcel -> allanjude marcel -> art marcel -> arun marcel -> marius marcel -> nwhitehorn marcel -> sjg markj -> cem markj -> rlibby markm -> jasone markm -> sheldonh mav -> ae mav -> eugen +mav -> ram mdf -> gleb mdodd -> jake mike -> das 
mlaier -> benjsc mlaier -> dhartmei mlaier -> thompsa mlaier -> eri msmith -> cokane msmith -> jasone msmith -> scottl murray -> delphij mux -> cognet mux -> dumbbell netchild -> ariff njl -> marks njl -> philip njl -> rpaulo njl -> sepotvin nwhitehorn -> andreast nwhitehorn -> jhibbits obrien -> benno obrien -> groudier obrien -> gshapiro obrien -> kan obrien -> sam pfg -> fsu peter -> asmodai peter -> jayanth peter -> ps philip -> benl philip -> ed philip -> jls philip -> matteo philip -> uqs philip -> kp phk -> jkoshy phk -> mux phk -> rgrimes pjd -> def pjd -> kib pjd -> lulf pjd -> oshogbo pjd -> smh pjd -> trociny rgrimes -> markm rmacklem -> jwd royger -> whu rpaulo -> avg rpaulo -> bschmidt rpaulo -> dim rpaulo -> jmmv rpaulo -> lidl rpaulo -> ngie rrs -> brucec rrs -> jchandra rrs -> tuexen rstone -> markj rstone -> mjoras ru -> ceri ru -> cjc ru -> eik ru -> maxim ru -> sobomax rwatson -> adrian rwatson -> antoine rwatson -> bmah rwatson -> brueffer rwatson -> bz rwatson -> cperciva rwatson -> emaste rwatson -> gnn rwatson -> jh rwatson -> jonathan rwatson -> kensmith rwatson -> kmacy rwatson -> linimon rwatson -> rmacklem rwatson -> shafeeq rwatson -> tmm rwatson -> trasz rwatson -> trhodes rwatson -> wsalamon rodrigc -> araujo sam -> andre sam -> benjsc sam -> sephe sbruno -> hiren sbruno -> jeb sbruno -> jimharris sbruno -> shurd schweikh -> dds scottl -> achim scottl -> jimharris scottl -> pjd scottl -> sah scottl -> sbruno scottl -> slm scottl -> yongari sephe -> dexuan sheldonh -> dwmalone sheldonh -> iedowse shin -> ume simon -> benl sjg -> phil sjg -> stevek sos -> marcel stas -> ganbold theraven -> phil thompsa -> weongyo thompsa -> eri trasz -> jh trasz -> mjg ume -> jinmei ume -> suz ume -> tshiozak vangyzen -> badger vangyzen -> dab wes -> scf wkoszek -> jceel wollman -> gad zml -> mdf zml -> zack } Index: user/markj/netdump/sys/arm/amlogic/aml8726/aml8726_usb_phy-m3.c =================================================================== --- user/markj/netdump/sys/arm/amlogic/aml8726/aml8726_usb_phy-m3.c (revision 332407) +++ user/markj/netdump/sys/arm/amlogic/aml8726/aml8726_usb_phy-m3.c (revision 332408) @@ -1,428 +1,428 @@ /*- * Copyright 2014-2015 John Wehle * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * Amlogic aml8726-m3 USB physical layer driver. 
* * Both USB physical interfaces share the same configuration register. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "gpio_if.h" struct aml8726_usb_phy_gpio { device_t dev; uint32_t pin; uint32_t pol; }; struct aml8726_usb_phy_softc { device_t dev; struct resource *res[1]; uint32_t npwr_en; struct aml8726_usb_phy_gpio *pwr_en; }; static struct resource_spec aml8726_usb_phy_spec[] = { { SYS_RES_MEMORY, 0, RF_ACTIVE }, { -1, 0 } }; #define AML_USB_PHY_CFG_REG 0 #define AML_USB_PHY_CFG_A_CLK_DETECTED (1U << 31) #define AML_USB_PHY_CFG_CLK_DIV_MASK (0x7f << 24) #define AML_USB_PHY_CFG_CLK_DIV_SHIFT 24 #define AML_USB_PHY_CFG_B_CLK_DETECTED (1 << 22) #define AML_USB_PHY_CFG_A_PLL_RST (1 << 19) #define AML_USB_PHY_CFG_A_PHYS_RST (1 << 18) #define AML_USB_PHY_CFG_A_RST (1 << 17) #define AML_USB_PHY_CFG_B_PLL_RST (1 << 13) #define AML_USB_PHY_CFG_B_PHYS_RST (1 << 12) #define AML_USB_PHY_CFG_B_RST (1 << 11) #define AML_USB_PHY_CFG_CLK_EN (1 << 8) #define AML_USB_PHY_CFG_CLK_SEL_MASK (7 << 5) #define AML_USB_PHY_CFG_CLK_SEL_XTAL (0 << 5) #define AML_USB_PHY_CFG_CLK_SEL_XTAL_DIV2 (1 << 5) #define AML_USB_PHY_CFG_B_POR (1 << 1) #define AML_USB_PHY_CFG_A_POR (1 << 0) #define AML_USB_PHY_CFG_CLK_DETECTED \ (AML_USB_PHY_CFG_A_CLK_DETECTED | AML_USB_PHY_CFG_B_CLK_DETECTED) #define AML_USB_PHY_MISC_A_REG 12 #define AML_USB_PHY_MISC_B_REG 16 #define AML_USB_PHY_MISC_ID_OVERIDE_EN (1 << 23) #define AML_USB_PHY_MISC_ID_OVERIDE_DEVICE (1 << 22) #define AML_USB_PHY_MISC_ID_OVERIDE_HOST (0 << 22) #define CSR_WRITE_4(sc, reg, val) bus_write_4((sc)->res[0], reg, (val)) #define CSR_READ_4(sc, reg) bus_read_4((sc)->res[0], reg) #define CSR_BARRIER(sc, reg) bus_barrier((sc)->res[0], reg, 4, \ (BUS_SPACE_BARRIER_READ | BUS_SPACE_BARRIER_WRITE)) #define PIN_ON_FLAG(pol) ((pol) == 0 ? \ GPIO_PIN_LOW : GPIO_PIN_HIGH) #define PIN_OFF_FLAG(pol) ((pol) == 0 ? 
\ GPIO_PIN_HIGH : GPIO_PIN_LOW) static int aml8726_usb_phy_mode(const char *dwcotg_path, uint32_t *mode) { char *usb_mode; phandle_t node; ssize_t len; if ((node = OF_finddevice(dwcotg_path)) == -1) return (ENXIO); if (fdt_is_compatible_strict(node, "synopsys,designware-hs-otg2") == 0) return (ENXIO); *mode = 0; len = OF_getprop_alloc(node, "dr_mode", (void **)&usb_mode); if (len <= 0) return (0); if (strcasecmp(usb_mode, "host") == 0) { *mode = AML_USB_PHY_MISC_ID_OVERIDE_EN | AML_USB_PHY_MISC_ID_OVERIDE_HOST; } else if (strcasecmp(usb_mode, "peripheral") == 0) { *mode = AML_USB_PHY_MISC_ID_OVERIDE_EN | AML_USB_PHY_MISC_ID_OVERIDE_DEVICE; } OF_prop_free(usb_mode); return (0); } static int aml8726_usb_phy_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (!ofw_bus_is_compatible(dev, "amlogic,aml8726-m3-usb-phy")) return (ENXIO); device_set_desc(dev, "Amlogic aml8726-m3 USB PHY"); return (BUS_PROBE_DEFAULT); } static int aml8726_usb_phy_attach(device_t dev) { struct aml8726_usb_phy_softc *sc = device_get_softc(dev); int err; int npwr_en; pcell_t *prop; phandle_t node; ssize_t len; uint32_t div; uint32_t i; uint32_t mode_a; uint32_t mode_b; uint32_t value; sc->dev = dev; if (aml8726_usb_phy_mode("/soc/usb@c9040000", &mode_a) != 0) { device_printf(dev, "missing usb@c9040000 node in FDT\n"); return (ENXIO); } if (aml8726_usb_phy_mode("/soc/usb@c90c0000", &mode_b) != 0) { device_printf(dev, "missing usb@c90c0000 node in FDT\n"); return (ENXIO); } if (bus_alloc_resources(dev, aml8726_usb_phy_spec, sc->res)) { device_printf(dev, "can not allocate resources for device\n"); return (ENXIO); } node = ofw_bus_get_node(dev); err = 0; - len = OF_getencprop_alloc(node, "usb-pwr-en", + len = OF_getencprop_alloc_multi(node, "usb-pwr-en", 3 * sizeof(pcell_t), (void **)&prop); npwr_en = (len > 0) ? len : 0; sc->npwr_en = 0; sc->pwr_en = (struct aml8726_usb_phy_gpio *) malloc(npwr_en * sizeof (*sc->pwr_en), M_DEVBUF, M_WAITOK); for (i = 0; i < npwr_en; i++) { sc->pwr_en[i].dev = OF_device_from_xref(prop[i * 3]); sc->pwr_en[i].pin = prop[i * 3 + 1]; sc->pwr_en[i].pol = prop[i * 3 + 2]; if (sc->pwr_en[i].dev == NULL) { err = 1; break; } } OF_prop_free(prop); if (err) { device_printf(dev, "unable to parse gpio\n"); goto fail; } /* Turn on power by setting pin and then enabling output driver. */ for (i = 0; i < npwr_en; i++) { if (GPIO_PIN_SET(sc->pwr_en[i].dev, sc->pwr_en[i].pin, PIN_ON_FLAG(sc->pwr_en[i].pol)) != 0 || GPIO_PIN_SETFLAGS(sc->pwr_en[i].dev, sc->pwr_en[i].pin, GPIO_PIN_OUTPUT) != 0) { device_printf(dev, "could not use gpio to control power\n"); goto fail; } sc->npwr_en++; } /* * Configure the clock source and divider. */ div = 2; value = CSR_READ_4(sc, AML_USB_PHY_CFG_REG); value &= ~(AML_USB_PHY_CFG_CLK_DIV_MASK | AML_USB_PHY_CFG_CLK_SEL_MASK); value &= ~(AML_USB_PHY_CFG_A_RST | AML_USB_PHY_CFG_B_RST); value &= ~(AML_USB_PHY_CFG_A_PLL_RST | AML_USB_PHY_CFG_B_PLL_RST); value &= ~(AML_USB_PHY_CFG_A_PHYS_RST | AML_USB_PHY_CFG_B_PHYS_RST); value &= ~(AML_USB_PHY_CFG_A_POR | AML_USB_PHY_CFG_B_POR); value |= AML_USB_PHY_CFG_CLK_SEL_XTAL; value |= ((div - 1) << AML_USB_PHY_CFG_CLK_DIV_SHIFT) & AML_USB_PHY_CFG_CLK_DIV_MASK; value |= AML_USB_PHY_CFG_CLK_EN; CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CFG_REG); /* * Issue the reset sequence. 
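 * Each step below is pulsed with a 200us settling delay: the
 * overall A/B resets, then the PLL resets, then the PHY resets,
 * and finally the power on reset bits.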
*/ value |= (AML_USB_PHY_CFG_A_RST | AML_USB_PHY_CFG_B_RST); CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CFG_REG); DELAY(200); value &= ~(AML_USB_PHY_CFG_A_RST | AML_USB_PHY_CFG_B_RST); CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CFG_REG); DELAY(200); value |= (AML_USB_PHY_CFG_A_PLL_RST | AML_USB_PHY_CFG_B_PLL_RST); CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CFG_REG); DELAY(200); value &= ~(AML_USB_PHY_CFG_A_PLL_RST | AML_USB_PHY_CFG_B_PLL_RST); CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CFG_REG); DELAY(200); value |= (AML_USB_PHY_CFG_A_PHYS_RST | AML_USB_PHY_CFG_B_PHYS_RST); CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CFG_REG); DELAY(200); value &= ~(AML_USB_PHY_CFG_A_PHYS_RST | AML_USB_PHY_CFG_B_PHYS_RST); CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CFG_REG); DELAY(200); value |= (AML_USB_PHY_CFG_A_POR | AML_USB_PHY_CFG_B_POR); CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CFG_REG); DELAY(200); /* * Enable by clearing the power on reset. */ value &= ~(AML_USB_PHY_CFG_A_POR | AML_USB_PHY_CFG_B_POR); CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CFG_REG); DELAY(200); /* * Check if the clock was detected. */ value = CSR_READ_4(sc, AML_USB_PHY_CFG_REG); if ((value & AML_USB_PHY_CFG_CLK_DETECTED) != AML_USB_PHY_CFG_CLK_DETECTED) device_printf(dev, "PHY Clock not detected\n"); /* * Configure the mode for each port. */ value = CSR_READ_4(sc, AML_USB_PHY_MISC_A_REG); value &= ~(AML_USB_PHY_MISC_ID_OVERIDE_EN | AML_USB_PHY_MISC_ID_OVERIDE_DEVICE | AML_USB_PHY_MISC_ID_OVERIDE_HOST); value |= mode_a; CSR_WRITE_4(sc, AML_USB_PHY_MISC_A_REG, value); value = CSR_READ_4(sc, AML_USB_PHY_MISC_B_REG); value &= ~(AML_USB_PHY_MISC_ID_OVERIDE_EN | AML_USB_PHY_MISC_ID_OVERIDE_DEVICE | AML_USB_PHY_MISC_ID_OVERIDE_HOST); value |= mode_b; CSR_WRITE_4(sc, AML_USB_PHY_MISC_B_REG, value); CSR_BARRIER(sc, AML_USB_PHY_MISC_B_REG); return (0); fail: /* In the event of problems attempt to turn things back off. */ i = sc->npwr_en; while (i-- != 0) { GPIO_PIN_SET(sc->pwr_en[i].dev, sc->pwr_en[i].pin, PIN_OFF_FLAG(sc->pwr_en[i].pol)); } free (sc->pwr_en, M_DEVBUF); sc->pwr_en = NULL; bus_release_resources(dev, aml8726_usb_phy_spec, sc->res); return (ENXIO); } static int aml8726_usb_phy_detach(device_t dev) { struct aml8726_usb_phy_softc *sc = device_get_softc(dev); uint32_t i; uint32_t value; /* * Disable by issuing a power on reset. 
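 * Asserting POR for both interfaces quiesces the PHYs before
 * power to the ports is removed below.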
*/ value = CSR_READ_4(sc, AML_USB_PHY_CFG_REG); value |= (AML_USB_PHY_CFG_A_POR | AML_USB_PHY_CFG_B_POR); CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CFG_REG); /* Turn off power */ i = sc->npwr_en; while (i-- != 0) { (void)GPIO_PIN_SET(sc->pwr_en[i].dev, sc->pwr_en[i].pin, PIN_OFF_FLAG(sc->pwr_en[i].pol)); } free (sc->pwr_en, M_DEVBUF); sc->pwr_en = NULL; bus_release_resources(dev, aml8726_usb_phy_spec, sc->res); return (0); } static device_method_t aml8726_usb_phy_methods[] = { /* Device interface */ DEVMETHOD(device_probe, aml8726_usb_phy_probe), DEVMETHOD(device_attach, aml8726_usb_phy_attach), DEVMETHOD(device_detach, aml8726_usb_phy_detach), DEVMETHOD_END }; static driver_t aml8726_usb_phy_driver = { "usbphy", aml8726_usb_phy_methods, sizeof(struct aml8726_usb_phy_softc), }; static devclass_t aml8726_usb_phy_devclass; DRIVER_MODULE(aml8726_m3usbphy, simplebus, aml8726_usb_phy_driver, aml8726_usb_phy_devclass, 0, 0); MODULE_DEPEND(aml8726_m3usbphy, aml8726_gpio, 1, 1, 1); Index: user/markj/netdump/sys/arm/amlogic/aml8726/aml8726_usb_phy-m6.c =================================================================== --- user/markj/netdump/sys/arm/amlogic/aml8726/aml8726_usb_phy-m6.c (revision 332407) +++ user/markj/netdump/sys/arm/amlogic/aml8726/aml8726_usb_phy-m6.c (revision 332408) @@ -1,418 +1,418 @@ /*- * Copyright 2014-2015 John Wehle * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * Amlogic aml8726-m6 (and later) USB physical layer driver. * * Each USB physical interface has a dedicated register block. 
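 * (This differs from the aml8726-m3, where both interfaces share
 * a single configuration register.)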
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "gpio_if.h" struct aml8726_usb_phy_gpio { device_t dev; uint32_t pin; uint32_t pol; }; struct aml8726_usb_phy_softc { device_t dev; struct resource *res[1]; uint32_t npwr_en; struct aml8726_usb_phy_gpio *pwr_en; boolean_t force_aca; struct aml8726_usb_phy_gpio hub_rst; }; static struct resource_spec aml8726_usb_phy_spec[] = { { SYS_RES_MEMORY, 0, RF_ACTIVE }, { -1, 0 } }; #define AML_USB_PHY_CFG_REG 0 #define AML_USB_PHY_CFG_CLK_SEL_32K_ALT (1 << 15) #define AML_USB_PHY_CFG_CLK_DIV_MASK (0x7f << 4) #define AML_USB_PHY_CFG_CLK_DIV_SHIFT 4 #define AML_USB_PHY_CFG_CLK_SEL_MASK (7 << 1) #define AML_USB_PHY_CFG_CLK_SEL_XTAL (0 << 1) #define AML_USB_PHY_CFG_CLK_SEL_XTAL_DIV2 (1 << 1) #define AML_USB_PHY_CFG_CLK_EN (1 << 0) #define AML_USB_PHY_CTRL_REG 4 #define AML_USB_PHY_CTRL_FSEL_MASK (7 << 22) #define AML_USB_PHY_CTRL_FSEL_12M (2 << 22) #define AML_USB_PHY_CTRL_FSEL_24M (5 << 22) #define AML_USB_PHY_CTRL_POR (1 << 15) #define AML_USB_PHY_CTRL_CLK_DETECTED (1 << 8) #define AML_USB_PHY_ADP_BC_REG 12 #define AML_USB_PHY_ADP_BC_ACA_FLOATING (1 << 26) #define AML_USB_PHY_ADP_BC_ACA_EN (1 << 16) #define CSR_WRITE_4(sc, reg, val) bus_write_4((sc)->res[0], reg, (val)) #define CSR_READ_4(sc, reg) bus_read_4((sc)->res[0], reg) #define CSR_BARRIER(sc, reg) bus_barrier((sc)->res[0], reg, 4, \ (BUS_SPACE_BARRIER_READ | BUS_SPACE_BARRIER_WRITE)) #define PIN_ON_FLAG(pol) ((pol) == 0 ? \ GPIO_PIN_LOW : GPIO_PIN_HIGH) #define PIN_OFF_FLAG(pol) ((pol) == 0 ? \ GPIO_PIN_HIGH : GPIO_PIN_LOW) static int aml8726_usb_phy_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (!ofw_bus_is_compatible(dev, "amlogic,aml8726-m6-usb-phy") && !ofw_bus_is_compatible(dev, "amlogic,aml8726-m8-usb-phy")) return (ENXIO); switch (aml8726_soc_hw_rev) { case AML_SOC_HW_REV_M8: case AML_SOC_HW_REV_M8B: device_set_desc(dev, "Amlogic aml8726-m8 USB PHY"); break; default: device_set_desc(dev, "Amlogic aml8726-m6 USB PHY"); break; } return (BUS_PROBE_DEFAULT); } static int aml8726_usb_phy_attach(device_t dev) { struct aml8726_usb_phy_softc *sc = device_get_softc(dev); char *force_aca; int err; int npwr_en; pcell_t *prop; phandle_t node; ssize_t len; uint32_t div; uint32_t i; uint32_t value; sc->dev = dev; if (bus_alloc_resources(dev, aml8726_usb_phy_spec, sc->res)) { device_printf(dev, "can not allocate resources for device\n"); return (ENXIO); } node = ofw_bus_get_node(dev); len = OF_getprop_alloc(node, "force-aca", (void **)&force_aca); sc->force_aca = FALSE; if (len > 0) { if (strncmp(force_aca, "true", len) == 0) sc->force_aca = TRUE; } OF_prop_free(force_aca); err = 0; - len = OF_getencprop_alloc(node, "usb-pwr-en", + len = OF_getencprop_alloc_multi(node, "usb-pwr-en", 3 * sizeof(pcell_t), (void **)&prop); npwr_en = (len > 0) ? 
len : 0;
	sc->npwr_en = 0;
	sc->pwr_en = (struct aml8726_usb_phy_gpio *)
	    malloc(npwr_en * sizeof (*sc->pwr_en), M_DEVBUF, M_WAITOK);

	for (i = 0; i < npwr_en; i++) {
		sc->pwr_en[i].dev = OF_device_from_xref(prop[i * 3]);
		sc->pwr_en[i].pin = prop[i * 3 + 1];
		sc->pwr_en[i].pol = prop[i * 3 + 2];

		if (sc->pwr_en[i].dev == NULL) {
			err = 1;
			break;
		}
	}

	OF_prop_free(prop);

-	len = OF_getencprop_alloc(node, "usb-hub-rst",
+	len = OF_getencprop_alloc_multi(node, "usb-hub-rst",
	    3 * sizeof(pcell_t), (void **)&prop);
	if (len > 0) {
		sc->hub_rst.dev = OF_device_from_xref(prop[0]);
		sc->hub_rst.pin = prop[1];
		sc->hub_rst.pol = prop[2];

		if (len > 1 || sc->hub_rst.dev == NULL)
			err = 1;
	}

	OF_prop_free(prop);

	if (err) {
		device_printf(dev, "unable to parse gpio\n");
		goto fail;
	}

	/* Turn on power by setting pin and then enabling output driver. */

	for (i = 0; i < npwr_en; i++) {
		if (GPIO_PIN_SET(sc->pwr_en[i].dev, sc->pwr_en[i].pin,
		    PIN_ON_FLAG(sc->pwr_en[i].pol)) != 0 ||
		    GPIO_PIN_SETFLAGS(sc->pwr_en[i].dev, sc->pwr_en[i].pin,
		    GPIO_PIN_OUTPUT) != 0) {
			device_printf(dev,
			    "could not use gpio to control power\n");
			goto fail;
		}

		sc->npwr_en++;
	}

	/*
	 * Configure the clock source and divider.
	 */
	value = CSR_READ_4(sc, AML_USB_PHY_CFG_REG);
	value &= ~(AML_USB_PHY_CFG_CLK_SEL_32K_ALT |
	    AML_USB_PHY_CFG_CLK_DIV_MASK |
	    AML_USB_PHY_CFG_CLK_SEL_MASK |
	    AML_USB_PHY_CFG_CLK_EN);

	switch (aml8726_soc_hw_rev) {
	case AML_SOC_HW_REV_M8:
	case AML_SOC_HW_REV_M8B:
		value |= AML_USB_PHY_CFG_CLK_SEL_32K_ALT;
		break;
	default:
		div = 2;
		value |= AML_USB_PHY_CFG_CLK_SEL_XTAL;
		value |= ((div - 1) << AML_USB_PHY_CFG_CLK_DIV_SHIFT) &
		    AML_USB_PHY_CFG_CLK_DIV_MASK;
		value |= AML_USB_PHY_CFG_CLK_EN;
		break;
	}

	CSR_WRITE_4(sc, AML_USB_PHY_CFG_REG, value);
	CSR_BARRIER(sc, AML_USB_PHY_CFG_REG);

	/*
	 * Configure the clock frequency and issue a power on reset.
	 */
	value = CSR_READ_4(sc, AML_USB_PHY_CTRL_REG);
	value &= ~AML_USB_PHY_CTRL_FSEL_MASK;

	switch (aml8726_soc_hw_rev) {
	case AML_SOC_HW_REV_M8:
	case AML_SOC_HW_REV_M8B:
		value |= AML_USB_PHY_CTRL_FSEL_24M;
		break;
	default:
		value |= AML_USB_PHY_CTRL_FSEL_12M;
		break;
	}

	value |= AML_USB_PHY_CTRL_POR;
	CSR_WRITE_4(sc, AML_USB_PHY_CTRL_REG, value);
	CSR_BARRIER(sc, AML_USB_PHY_CTRL_REG);

	DELAY(500);

	/*
	 * Enable by clearing the power on reset.
	 */
	value &= ~AML_USB_PHY_CTRL_POR;
	CSR_WRITE_4(sc, AML_USB_PHY_CTRL_REG, value);
	CSR_BARRIER(sc, AML_USB_PHY_CTRL_REG);

	DELAY(1000);

	/*
	 * Check if the clock was detected.
	 */
	value = CSR_READ_4(sc, AML_USB_PHY_CTRL_REG);
	if ((value & AML_USB_PHY_CTRL_CLK_DETECTED) == 0)
		device_printf(dev, "PHY Clock not detected\n");

	/*
	 * If necessary, enable Accessory Charger Adaptor detection
	 * so that the port knows what mode to operate in.
	 */
	if (sc->force_aca) {
		value = CSR_READ_4(sc, AML_USB_PHY_ADP_BC_REG);
		value |= AML_USB_PHY_ADP_BC_ACA_EN;
		CSR_WRITE_4(sc, AML_USB_PHY_ADP_BC_REG, value);
		CSR_BARRIER(sc, AML_USB_PHY_ADP_BC_REG);

		DELAY(50);

		value = CSR_READ_4(sc, AML_USB_PHY_ADP_BC_REG);

		if ((value & AML_USB_PHY_ADP_BC_ACA_FLOATING) != 0) {
			device_printf(dev,
			    "force-aca requires newer silicon\n");
			goto fail;
		}
	}

	/*
	 * Reset the hub.
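	 * The reset GPIO is asserted for 30us and the hub is then
	 * given 60ms to come out of reset.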
*/ if (sc->hub_rst.dev != NULL) { err = 0; if (GPIO_PIN_SET(sc->hub_rst.dev, sc->hub_rst.pin, PIN_ON_FLAG(sc->hub_rst.pol)) != 0 || GPIO_PIN_SETFLAGS(sc->hub_rst.dev, sc->hub_rst.pin, GPIO_PIN_OUTPUT) != 0) err = 1; DELAY(30); if (GPIO_PIN_SET(sc->hub_rst.dev, sc->hub_rst.pin, PIN_OFF_FLAG(sc->hub_rst.pol)) != 0) err = 1; DELAY(60000); if (err) { device_printf(dev, "could not use gpio to reset hub\n"); goto fail; } } return (0); fail: /* In the event of problems attempt to turn things back off. */ i = sc->npwr_en; while (i-- != 0) { GPIO_PIN_SET(sc->pwr_en[i].dev, sc->pwr_en[i].pin, PIN_OFF_FLAG(sc->pwr_en[i].pol)); } free (sc->pwr_en, M_DEVBUF); sc->pwr_en = NULL; bus_release_resources(dev, aml8726_usb_phy_spec, sc->res); return (ENXIO); } static int aml8726_usb_phy_detach(device_t dev) { struct aml8726_usb_phy_softc *sc = device_get_softc(dev); uint32_t i; uint32_t value; /* * Disable by issuing a power on reset. */ value = CSR_READ_4(sc, AML_USB_PHY_CTRL_REG); value |= AML_USB_PHY_CTRL_POR; CSR_WRITE_4(sc, AML_USB_PHY_CTRL_REG, value); CSR_BARRIER(sc, AML_USB_PHY_CTRL_REG); /* Turn off power */ i = sc->npwr_en; while (i-- != 0) { GPIO_PIN_SET(sc->pwr_en[i].dev, sc->pwr_en[i].pin, PIN_OFF_FLAG(sc->pwr_en[i].pol)); } free (sc->pwr_en, M_DEVBUF); sc->pwr_en = NULL; bus_release_resources(dev, aml8726_usb_phy_spec, sc->res); return (0); } static device_method_t aml8726_usb_phy_methods[] = { /* Device interface */ DEVMETHOD(device_probe, aml8726_usb_phy_probe), DEVMETHOD(device_attach, aml8726_usb_phy_attach), DEVMETHOD(device_detach, aml8726_usb_phy_detach), DEVMETHOD_END }; static driver_t aml8726_usb_phy_driver = { "usbphy", aml8726_usb_phy_methods, sizeof(struct aml8726_usb_phy_softc), }; static devclass_t aml8726_usb_phy_devclass; DRIVER_MODULE(aml8726_m6usbphy, simplebus, aml8726_usb_phy_driver, aml8726_usb_phy_devclass, 0, 0); MODULE_DEPEND(aml8726_m6usbphy, aml8726_gpio, 1, 1, 1); Index: user/markj/netdump/sys/arm/annapurna/alpine/alpine_pci_msix.c =================================================================== --- user/markj/netdump/sys/arm/annapurna/alpine/alpine_pci_msix.c (revision 332407) +++ user/markj/netdump/sys/arm/annapurna/alpine/alpine_pci_msix.c (revision 332408) @@ -1,394 +1,394 @@ /*- * Copyright (c) 2015,2016 Annapurna Labs Ltd. and affiliates * All rights reserved. * * Developed by Semihalf. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include "msi_if.h" #include "pic_if.h" #define AL_SPI_INTR 0 #define AL_EDGE_HIGH 1 #define ERR_NOT_IN_MAP -1 #define IRQ_OFFSET 1 #define GIC_INTR_CELL_CNT 3 #define INTR_RANGE_COUNT 2 #define MAX_MSIX_COUNT 160 static int al_msix_attach(device_t); static int al_msix_probe(device_t); static msi_alloc_msi_t al_msix_alloc_msi; static msi_release_msi_t al_msix_release_msi; static msi_alloc_msix_t al_msix_alloc_msix; static msi_release_msix_t al_msix_release_msix; static msi_map_msi_t al_msix_map_msi; static int al_find_intr_pos_in_map(device_t, struct intr_irqsrc *); static struct ofw_compat_data compat_data[] = { {"annapurna-labs,al-msix", true}, {"annapurna-labs,alpine-msix", true}, {NULL, false} }; /* * Bus interface definitions. */ static device_method_t al_msix_methods[] = { DEVMETHOD(device_probe, al_msix_probe), DEVMETHOD(device_attach, al_msix_attach), /* Interrupt controller interface */ DEVMETHOD(msi_alloc_msi, al_msix_alloc_msi), DEVMETHOD(msi_release_msi, al_msix_release_msi), DEVMETHOD(msi_alloc_msix, al_msix_alloc_msix), DEVMETHOD(msi_release_msix, al_msix_release_msix), DEVMETHOD(msi_map_msi, al_msix_map_msi), DEVMETHOD_END }; struct al_msix_softc { bus_addr_t base_addr; struct resource *res; uint32_t irq_min; uint32_t irq_max; uint32_t irq_count; struct mtx msi_mtx; vmem_t *irq_alloc; device_t gic_dev; /* Table of isrcs maps isrc pointer to vmem_alloc'd irq number */ struct intr_irqsrc *isrcs[MAX_MSIX_COUNT]; }; static driver_t al_msix_driver = { "al_msix", al_msix_methods, sizeof(struct al_msix_softc), }; devclass_t al_msix_devclass; DRIVER_MODULE(al_msix, ofwbus, al_msix_driver, al_msix_devclass, 0, 0); DRIVER_MODULE(al_msix, simplebus, al_msix_driver, al_msix_devclass, 0, 0); MALLOC_DECLARE(M_AL_MSIX); MALLOC_DEFINE(M_AL_MSIX, "al_msix", "Alpine MSIX"); static int al_msix_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (!ofw_bus_search_compatible(dev, compat_data)->ocd_data) return (ENXIO); device_set_desc(dev, "Annapurna-Labs MSI-X Controller"); return (BUS_PROBE_DEFAULT); } static int al_msix_attach(device_t dev) { struct al_msix_softc *sc; device_t gic_dev; phandle_t iparent; phandle_t node; intptr_t xref; int interrupts[INTR_RANGE_COUNT]; int nintr, i, rid; uint32_t icells, *intr; sc = device_get_softc(dev); node = ofw_bus_get_node(dev); xref = OF_xref_from_node(node); OF_device_register_xref(xref, dev); rid = 0; sc->res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE); if (sc->res == NULL) { device_printf(dev, "Failed to allocate resource\n"); return (ENXIO); } sc->base_addr = (bus_addr_t)rman_get_start(sc->res); /* Register this device to handle MSI interrupts */ if (intr_msi_register(dev, xref) != 0) { device_printf(dev, "could not register MSI-X controller\n"); return (ENXIO); } else device_printf(dev, "MSI-X controller registered\n"); /* Find root interrupt controller */ 
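	/*
	 * The "interrupts" property read below is encoded using the parent
	 * GIC's #interrupt-cells, so the cell count is fetched from the
	 * interrupt parent first.
	 */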
iparent = ofw_bus_find_iparent(node);
	if (iparent == 0) {
		device_printf(dev, "No interrupt-parent found. "
		    "Error in DTB\n");
		return (ENXIO);
	} else {
		/* While at parent - store interrupt cells prop */
		if (OF_searchencprop(OF_node_from_xref(iparent),
		    "#interrupt-cells", &icells, sizeof(icells)) == -1) {
			device_printf(dev, "DTB: Missing #interrupt-cells "
			    "property in GIC node\n");
			return (ENXIO);
		}
	}

	gic_dev = OF_device_from_xref(iparent);
	if (gic_dev == NULL) {
		device_printf(dev, "Cannot find GIC device\n");
		return (ENXIO);
	}
	sc->gic_dev = gic_dev;

	/* Manually read range of interrupts from DTB */
-	nintr = OF_getencprop_alloc(node, "interrupts", sizeof(*intr),
+	nintr = OF_getencprop_alloc_multi(node, "interrupts", sizeof(*intr),
	    (void **)&intr);
	if (nintr == 0) {
		device_printf(dev, "Cannot read interrupts prop from DTB\n");
		return (ENXIO);
	} else if ((nintr / icells) != INTR_RANGE_COUNT) {
		/* Supposed to have min and max value only */
		device_printf(dev, "Unexpected count of interrupts "
		    "in DTB node\n");
		return (EINVAL);
	}

	/* Read interrupt range values */
	for (i = 0; i < INTR_RANGE_COUNT; i++)
		interrupts[i] = intr[(i * icells) + IRQ_OFFSET];

	sc->irq_min = interrupts[0];
	sc->irq_max = interrupts[1];
	sc->irq_count = (sc->irq_max - sc->irq_min + 1);

	if (sc->irq_count > MAX_MSIX_COUNT) {
		device_printf(dev,
		    "Available MSI-X count exceeds buffer size."
		    " Capping to %d\n", MAX_MSIX_COUNT);
		sc->irq_count = MAX_MSIX_COUNT;
	}

	mtx_init(&sc->msi_mtx, "msi_mtx", NULL, MTX_DEF);

	sc->irq_alloc = vmem_create("Alpine MSI-X IRQs", 0, sc->irq_count,
	    1, 0, M_FIRSTFIT | M_WAITOK);

	device_printf(dev, "MSI-X SPI IRQ %d-%d\n", sc->irq_min, sc->irq_max);

	return (bus_generic_attach(dev));
}

static int
al_find_intr_pos_in_map(device_t dev, struct intr_irqsrc *isrc)
{
	struct al_msix_softc *sc;
	int i;

	sc = device_get_softc(dev);
	for (i = 0; i < MAX_MSIX_COUNT; i++)
		if (sc->isrcs[i] == isrc)
			return (i);

	return (ERR_NOT_IN_MAP);
}

static int
al_msix_map_msi(device_t dev, device_t child, struct intr_irqsrc *isrc,
    uint64_t *addr, uint32_t *data)
{
	struct al_msix_softc *sc;
	int i, spi;

	sc = device_get_softc(dev);

	i = al_find_intr_pos_in_map(dev, isrc);
	if (i == ERR_NOT_IN_MAP)
		return (EINVAL);

	spi = sc->irq_min + i;

	/*
	 * MSIX message address format:
	 * [63:20] - MSIx TBAR
	 *           Same value as the MSIx Translation Base Address Register
	 * [19]    - WFE_EXIT
	 *           Once set by MSIx message, an EVENTI is signaled to the
	 *           CPU cluster specified by ‘Local GIC Target List’
	 * [18:17] - Target GIC ID
	 *           Specifies which IO-GIC (external shared GIC) is targeted
	 *           0: Local GIC, as specified by the Local GIC Target List
	 *           1: IO-GIC 0
	 *           2: Reserved
	 *           3: Reserved
	 * [16:13] - Local GIC Target List
	 *           Specifies the Local GICs list targeted by this MSIx
	 *           message.
*addr = (uint64_t)sc->base_addr + (uint64_t)((1 << 16) + (spi << 3)); *data = 0; if (bootverbose) device_printf(dev, "MSI mapping: SPI: %d addr: %jx data: %x\n", spi, (uintmax_t)*addr, *data); return (0); } static int al_msix_alloc_msi(device_t dev, device_t child, int count, int maxcount, device_t *pic, struct intr_irqsrc **srcs) { struct intr_map_data_fdt *fdt_data; struct al_msix_softc *sc; vmem_addr_t irq_base; int error; u_int i, j; sc = device_get_softc(dev); if ((powerof2(count) == 0) || (count > 8)) return (EINVAL); if (vmem_alloc(sc->irq_alloc, count, M_FIRSTFIT | M_NOWAIT, &irq_base) != 0) return (ENOMEM); /* Fabricate OFW data to get ISRC from GIC and return it */ fdt_data = malloc(sizeof(*fdt_data) + GIC_INTR_CELL_CNT * sizeof(pcell_t), M_AL_MSIX, M_WAITOK); fdt_data->hdr.type = INTR_MAP_DATA_FDT; fdt_data->iparent = 0; fdt_data->ncells = GIC_INTR_CELL_CNT; fdt_data->cells[0] = AL_SPI_INTR; /* code for SPI interrupt */ fdt_data->cells[1] = 0; /* SPI number (uninitialized) */ fdt_data->cells[2] = AL_EDGE_HIGH; /* trig = edge, pol = high */ mtx_lock(&sc->msi_mtx); for (i = irq_base; i < irq_base + count; i++) { fdt_data->cells[1] = sc->irq_min + i; error = PIC_MAP_INTR(sc->gic_dev, (struct intr_map_data *)fdt_data, srcs); if (error) { for (j = irq_base; j < i; j++) sc->isrcs[j] = NULL; mtx_unlock(&sc->msi_mtx); vmem_free(sc->irq_alloc, irq_base, count); free(fdt_data, M_AL_MSIX); return (error); } sc->isrcs[i] = *srcs; srcs++; } mtx_unlock(&sc->msi_mtx); free(fdt_data, M_AL_MSIX); if (bootverbose) device_printf(dev, "MSI-X allocation: start SPI %d, count %d\n", (int)irq_base + sc->irq_min, count); *pic = sc->gic_dev; return (0); } static int al_msix_release_msi(device_t dev, device_t child, int count, struct intr_irqsrc **srcs) { struct al_msix_softc *sc; int i, pos; sc = device_get_softc(dev); mtx_lock(&sc->msi_mtx); pos = al_find_intr_pos_in_map(dev, *srcs); vmem_free(sc->irq_alloc, pos, count); for (i = 0; i < count; i++) { pos = al_find_intr_pos_in_map(dev, *srcs); if (pos != ERR_NOT_IN_MAP) sc->isrcs[pos] = NULL; srcs++; } mtx_unlock(&sc->msi_mtx); return (0); } static int al_msix_alloc_msix(device_t dev, device_t child, device_t *pic, struct intr_irqsrc **isrcp) { return (al_msix_alloc_msi(dev, child, 1, 1, pic, isrcp)); } static int al_msix_release_msix(device_t dev, device_t child, struct intr_irqsrc *isrc) { return (al_msix_release_msi(dev, child, 1, &isrc)); } Index: user/markj/netdump/sys/arm/at91/at91_pinctrl.c =================================================================== --- user/markj/netdump/sys/arm/at91/at91_pinctrl.c (revision 332407) +++ user/markj/netdump/sys/arm/at91/at91_pinctrl.c (revision 332408) @@ -1,516 +1,517 @@ /*- * Copyright (c) 2014 Warner Losh. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2.
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define BUS_PASS_PINMUX (BUS_PASS_INTERRUPT + 1) struct pinctrl_range { uint64_t bus; uint64_t host; uint64_t size; }; struct pinctrl_softc { device_t dev; phandle_t node; struct pinctrl_range *ranges; int nranges; pcell_t acells, scells; int done_pinmux; }; struct pinctrl_devinfo { struct ofw_bus_devinfo obdinfo; struct resource_list rl; }; static int at91_pinctrl_probe(device_t dev) { if (!ofw_bus_is_compatible(dev, "atmel,at91rm9200-pinctrl")) return (ENXIO); device_set_desc(dev, "pincontrol bus"); return (0); } /* XXX Make this a subclass of simplebus */ static struct pinctrl_devinfo * at91_pinctrl_setup_dinfo(device_t dev, phandle_t node) { struct pinctrl_softc *sc; struct pinctrl_devinfo *ndi; uint32_t *reg, *intr, icells; uint64_t phys, size; phandle_t iparent; int i, j, k; int nintr; int nreg; sc = device_get_softc(dev); ndi = malloc(sizeof(*ndi), M_DEVBUF, M_WAITOK | M_ZERO); if (ofw_bus_gen_setup_devinfo(&ndi->obdinfo, node) != 0) { free(ndi, M_DEVBUF); return (NULL); } resource_list_init(&ndi->rl); - nreg = OF_getencprop_alloc(node, "reg", sizeof(*reg), (void **)®); + nreg = OF_getencprop_alloc_multi(node, "reg", sizeof(*reg), + (void **)®); if (nreg == -1) nreg = 0; if (nreg % (sc->acells + sc->scells) != 0) { // if (bootverbose) device_printf(dev, "Malformed reg property on <%s>\n", ndi->obdinfo.obd_name); nreg = 0; } for (i = 0, k = 0; i < nreg; i += sc->acells + sc->scells, k++) { phys = size = 0; for (j = 0; j < sc->acells; j++) { phys <<= 32; phys |= reg[i + j]; } for (j = 0; j < sc->scells; j++) { size <<= 32; size |= reg[i + sc->acells + j]; } resource_list_add(&ndi->rl, SYS_RES_MEMORY, k, phys, phys + size - 1, size); } OF_prop_free(reg); - nintr = OF_getencprop_alloc(node, "interrupts", sizeof(*intr), + nintr = OF_getencprop_alloc_multi(node, "interrupts", sizeof(*intr), (void **)&intr); if (nintr > 0) { if (OF_searchencprop(node, "interrupt-parent", &iparent, sizeof(iparent)) == -1) { device_printf(dev, "No interrupt-parent found, " "assuming direct parent\n"); iparent = OF_parent(node); } if (OF_searchencprop(OF_node_from_xref(iparent), "#interrupt-cells", &icells, sizeof(icells)) == -1) { device_printf(dev, "Missing #interrupt-cells property," " assuming <1>\n"); icells = 1; } if (icells < 1 || icells > nintr) { device_printf(dev, "Invalid #interrupt-cells property " "value <%d>, assuming <1>\n", icells); icells = 1; } for (i = 0, k = 0; i < nintr; i += icells, k++) 
{ intr[i] = ofw_bus_map_intr(dev, iparent, icells, &intr[i]); resource_list_add(&ndi->rl, SYS_RES_IRQ, k, intr[i], intr[i], 1); } OF_prop_free(intr); } return (ndi); } static int at91_pinctrl_fill_ranges(phandle_t node, struct pinctrl_softc *sc) { int host_address_cells; cell_t *base_ranges; ssize_t nbase_ranges; int err; int i, j, k; err = OF_searchencprop(OF_parent(node), "#address-cells", &host_address_cells, sizeof(host_address_cells)); if (err <= 0) return (-1); nbase_ranges = OF_getproplen(node, "ranges"); if (nbase_ranges < 0) return (-1); sc->nranges = nbase_ranges / sizeof(cell_t) / (sc->acells + host_address_cells + sc->scells); if (sc->nranges == 0) return (0); sc->ranges = malloc(sc->nranges * sizeof(sc->ranges[0]), M_DEVBUF, M_WAITOK); base_ranges = malloc(nbase_ranges, M_DEVBUF, M_WAITOK); OF_getencprop(node, "ranges", base_ranges, nbase_ranges); for (i = 0, j = 0; i < sc->nranges; i++) { sc->ranges[i].bus = 0; for (k = 0; k < sc->acells; k++) { sc->ranges[i].bus <<= 32; sc->ranges[i].bus |= base_ranges[j++]; } sc->ranges[i].host = 0; for (k = 0; k < host_address_cells; k++) { sc->ranges[i].host <<= 32; sc->ranges[i].host |= base_ranges[j++]; } sc->ranges[i].size = 0; for (k = 0; k < sc->scells; k++) { sc->ranges[i].size <<= 32; sc->ranges[i].size |= base_ranges[j++]; } } free(base_ranges, M_DEVBUF); return (sc->nranges); } static int at91_pinctrl_attach(device_t dev) { struct pinctrl_softc *sc; struct pinctrl_devinfo *di; phandle_t node; device_t cdev; sc = device_get_softc(dev); node = ofw_bus_get_node(dev); sc->dev = dev; sc->node = node; /* * Some important numbers */ sc->acells = 2; OF_getencprop(node, "#address-cells", &sc->acells, sizeof(sc->acells)); sc->scells = 1; OF_getencprop(node, "#size-cells", &sc->scells, sizeof(sc->scells)); if (at91_pinctrl_fill_ranges(node, sc) < 0) { device_printf(dev, "could not get ranges\n"); return (ENXIO); } for (node = OF_child(node); node > 0; node = OF_peer(node)) { if ((di = at91_pinctrl_setup_dinfo(dev, node)) == NULL) continue; cdev = device_add_child(dev, NULL, -1); if (cdev == NULL) { device_printf(dev, "<%s>: device_add_child failed\n", di->obdinfo.obd_name); resource_list_free(&di->rl); ofw_bus_gen_destroy_devinfo(&di->obdinfo); free(di, M_DEVBUF); continue; } device_set_ivars(cdev, di); } fdt_pinctrl_register(dev, "atmel,pins"); return (bus_generic_attach(dev)); } static const struct ofw_bus_devinfo * pinctrl_get_devinfo(device_t bus __unused, device_t child) { struct pinctrl_devinfo *ndi; ndi = device_get_ivars(child); return (&ndi->obdinfo); } static struct resource * pinctrl_alloc_resource(device_t bus, device_t child, int type, int *rid, u_long start, u_long end, u_long count, u_int flags) { struct pinctrl_softc *sc; struct pinctrl_devinfo *di; struct resource_list_entry *rle; int j; sc = device_get_softc(bus); /* * Request for the default allocation with a given rid: use resource * list stored in the local device info. 
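 * For SYS_RES_MEMORY requests the start/end addresses are then
 * translated from the child bus space into the host space through
 * the matching "ranges" entry, i.e., host = addr - range.bus +
 * range.host; requests that fall outside every range are refused.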
*/ if (RMAN_IS_DEFAULT_RANGE(start, end)) { if ((di = device_get_ivars(child)) == NULL) return (NULL); if (type == SYS_RES_IOPORT) type = SYS_RES_MEMORY; rle = resource_list_find(&di->rl, type, *rid); if (rle == NULL) { // if (bootverbose) device_printf(bus, "no default resources for " "rid = %d, type = %d\n", *rid, type); return (NULL); } start = rle->start; end = rle->end; count = rle->count; } if (type == SYS_RES_MEMORY) { /* Remap through ranges property */ for (j = 0; j < sc->nranges; j++) { if (start >= sc->ranges[j].bus && end < sc->ranges[j].bus + sc->ranges[j].size) { start -= sc->ranges[j].bus; start += sc->ranges[j].host; end -= sc->ranges[j].bus; end += sc->ranges[j].host; break; } } if (j == sc->nranges && sc->nranges != 0) { // if (bootverbose) device_printf(bus, "Could not map resource " "%#lx-%#lx\n", start, end); return (NULL); } } return (bus_generic_alloc_resource(bus, child, type, rid, start, end, count, flags)); } static int pinctrl_print_res(struct pinctrl_devinfo *di) { int rv; rv = 0; rv += resource_list_print_type(&di->rl, "mem", SYS_RES_MEMORY, "%#jx"); rv += resource_list_print_type(&di->rl, "irq", SYS_RES_IRQ, "%jd"); return (rv); } static void pinctrl_probe_nomatch(device_t bus, device_t child) { const char *name, *type, *compat; // if (!bootverbose) return; name = ofw_bus_get_name(child); type = ofw_bus_get_type(child); compat = ofw_bus_get_compat(child); device_printf(bus, "<%s>", name != NULL ? name : "unknown"); pinctrl_print_res(device_get_ivars(child)); if (!ofw_bus_status_okay(child)) printf(" disabled"); if (type) printf(" type %s", type); if (compat) printf(" compat %s", compat); printf(" (no driver attached)\n"); } static int pinctrl_print_child(device_t bus, device_t child) { int rv; rv = bus_print_child_header(bus, child); rv += pinctrl_print_res(device_get_ivars(child)); if (!ofw_bus_status_okay(child)) rv += printf(" disabled"); rv += bus_print_child_footer(bus, child); return (rv); } const char *periphs[] = {"gpio", "periph A", "periph B", "periph C", "periph D", "periph E" }; struct pincfg { uint32_t unit; uint32_t pin; uint32_t periph; uint32_t flags; }; static int pinctrl_configure_pins(device_t bus, phandle_t cfgxref) { struct pinctrl_softc *sc; struct pincfg *cfg, *cfgdata; char name[32]; phandle_t node; ssize_t npins; int i; sc = device_get_softc(bus); node = OF_node_from_xref(cfgxref); memset(name, 0, sizeof(name)); OF_getprop(node, "name", name, sizeof(name)); - npins = OF_getencprop_alloc(node, "atmel,pins", sizeof(*cfgdata), + npins = OF_getencprop_alloc_multi(node, "atmel,pins", sizeof(*cfgdata), (void **)&cfgdata); if (npins < 0) { printf("We're doing it wrong %s\n", name); return (ENXIO); } if (npins == 0) return (0); for (i = 0, cfg = cfgdata; i < npins; i++, cfg++) { uint32_t pio; pio = (0xfffffff & sc->ranges[0].bus) + 0x200 * cfg->unit; printf("P%c%d %s %#x\n", cfg->unit + 'A', cfg->pin, periphs[cfg->periph], cfg->flags); switch (cfg->periph) { case 0: at91_pio_use_gpio(pio, 1u << cfg->pin); at91_pio_gpio_pullup(pio, 1u << cfg->pin, !!(cfg->flags & 1)); at91_pio_gpio_high_z(pio, 1u << cfg->pin, !!(cfg->flags & 2)); at91_pio_gpio_set_deglitch(pio, 1u << cfg->pin, !!(cfg->flags & 4)); // at91_pio_gpio_pulldown(pio, 1u << cfg->pin, // !!(cfg->flags & 8)); // at91_pio_gpio_dis_schmidt(pio, // 1u << cfg->pin, !!(cfg->flags & 16)); break; case 1: at91_pio_use_periph_a(pio, 1u << cfg->pin, cfg->flags); break; case 2: at91_pio_use_periph_b(pio, 1u << cfg->pin, cfg->flags); break; } } OF_prop_free(cfgdata); return (0); } static void 
pinctrl_new_pass(device_t bus) { struct pinctrl_softc *sc; sc = device_get_softc(bus); bus_generic_new_pass(bus); if (sc->done_pinmux || bus_current_pass < BUS_PASS_PINMUX) return; sc->done_pinmux++; fdt_pinctrl_configure_tree(bus); } static device_method_t at91_pinctrl_methods[] = { DEVMETHOD(device_probe, at91_pinctrl_probe), DEVMETHOD(device_attach, at91_pinctrl_attach), DEVMETHOD(bus_print_child, pinctrl_print_child), DEVMETHOD(bus_probe_nomatch, pinctrl_probe_nomatch), DEVMETHOD(bus_setup_intr, bus_generic_setup_intr), DEVMETHOD(bus_teardown_intr, bus_generic_teardown_intr), DEVMETHOD(bus_alloc_resource, pinctrl_alloc_resource), DEVMETHOD(bus_release_resource, bus_generic_release_resource), DEVMETHOD(bus_activate_resource, bus_generic_activate_resource), DEVMETHOD(bus_deactivate_resource, bus_generic_deactivate_resource), DEVMETHOD(bus_adjust_resource, bus_generic_adjust_resource), DEVMETHOD(bus_child_pnpinfo_str, ofw_bus_gen_child_pnpinfo_str), DEVMETHOD(bus_new_pass, pinctrl_new_pass), /* ofw_bus interface */ DEVMETHOD(ofw_bus_get_devinfo, pinctrl_get_devinfo), DEVMETHOD(ofw_bus_get_compat, ofw_bus_gen_get_compat), DEVMETHOD(ofw_bus_get_model, ofw_bus_gen_get_model), DEVMETHOD(ofw_bus_get_name, ofw_bus_gen_get_name), DEVMETHOD(ofw_bus_get_node, ofw_bus_gen_get_node), DEVMETHOD(ofw_bus_get_type, ofw_bus_gen_get_type), /* fdt_pintrl interface */ DEVMETHOD(fdt_pinctrl_configure,pinctrl_configure_pins), DEVMETHOD_END }; static driver_t at91_pinctrl_driver = { "at91_pinctrl", at91_pinctrl_methods, sizeof(struct pinctrl_softc), }; static devclass_t at91_pinctrl_devclass; EARLY_DRIVER_MODULE(at91_pinctrl, simplebus, at91_pinctrl_driver, at91_pinctrl_devclass, NULL, NULL, BUS_PASS_BUS); /* * dummy driver to force pass BUS_PASS_PINMUX to happen. */ static int at91_pingroup_probe(device_t dev) { return ENXIO; } static device_method_t at91_pingroup_methods[] = { DEVMETHOD(device_probe, at91_pingroup_probe), DEVMETHOD_END }; static driver_t at91_pingroup_driver = { "at91_pingroup", at91_pingroup_methods, 0, }; static devclass_t at91_pingroup_devclass; EARLY_DRIVER_MODULE(at91_pingroup, at91_pinctrl, at91_pingroup_driver, at91_pingroup_devclass, NULL, NULL, BUS_PASS_PINMUX); Index: user/markj/netdump/sys/arm/broadcom/bcm2835/bcm2835_gpio.c =================================================================== --- user/markj/netdump/sys/arm/broadcom/bcm2835/bcm2835_gpio.c (revision 332407) +++ user/markj/netdump/sys/arm/broadcom/bcm2835/bcm2835_gpio.c (revision 332408) @@ -1,1324 +1,1324 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2012 Oleksandr Tymoshenko * Copyright (c) 2012-2015 Luiz Otavio O Souza * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include "opt_platform.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "gpio_if.h" #include "pic_if.h" #ifdef DEBUG #define dprintf(fmt, args...) do { printf("%s(): ", __func__); \ printf(fmt,##args); } while (0) #else #define dprintf(fmt, args...) #endif #define BCM_GPIO_IRQS 4 #define BCM_GPIO_PINS 54 #define BCM_GPIO_PINS_PER_BANK 32 #define BCM_GPIO_DEFAULT_CAPS (GPIO_PIN_INPUT | GPIO_PIN_OUTPUT | \ GPIO_PIN_PULLUP | GPIO_PIN_PULLDOWN | GPIO_INTR_LEVEL_LOW | \ GPIO_INTR_LEVEL_HIGH | GPIO_INTR_EDGE_RISING | \ GPIO_INTR_EDGE_FALLING | GPIO_INTR_EDGE_BOTH) #define BCM2835_FSEL_GPIO_IN 0 #define BCM2835_FSEL_GPIO_OUT 1 #define BCM2835_FSEL_ALT5 2 #define BCM2835_FSEL_ALT4 3 #define BCM2835_FSEL_ALT0 4 #define BCM2835_FSEL_ALT1 5 #define BCM2835_FSEL_ALT2 6 #define BCM2835_FSEL_ALT3 7 #define BCM2835_PUD_OFF 0 #define BCM2835_PUD_DOWN 1 #define BCM2835_PUD_UP 2 static struct resource_spec bcm_gpio_res_spec[] = { { SYS_RES_MEMORY, 0, RF_ACTIVE }, { SYS_RES_IRQ, 0, RF_ACTIVE }, /* bank 0 interrupt */ { SYS_RES_IRQ, 1, RF_ACTIVE }, /* bank 1 interrupt */ { -1, 0, 0 } }; struct bcm_gpio_sysctl { struct bcm_gpio_softc *sc; uint32_t pin; }; struct bcm_gpio_irqsrc { struct intr_irqsrc bgi_isrc; uint32_t bgi_irq; uint32_t bgi_mode; uint32_t bgi_mask; }; struct bcm_gpio_softc { device_t sc_dev; device_t sc_busdev; struct mtx sc_mtx; struct resource * sc_res[BCM_GPIO_IRQS + 1]; bus_space_tag_t sc_bst; bus_space_handle_t sc_bsh; void * sc_intrhand[BCM_GPIO_IRQS]; int sc_gpio_npins; int sc_ro_npins; int sc_ro_pins[BCM_GPIO_PINS]; struct gpio_pin sc_gpio_pins[BCM_GPIO_PINS]; struct bcm_gpio_sysctl sc_sysctl[BCM_GPIO_PINS]; struct bcm_gpio_irqsrc sc_isrcs[BCM_GPIO_PINS]; }; enum bcm_gpio_pud { BCM_GPIO_NONE, BCM_GPIO_PULLDOWN, BCM_GPIO_PULLUP, }; #define BCM_GPIO_LOCK(_sc) mtx_lock_spin(&(_sc)->sc_mtx) #define BCM_GPIO_UNLOCK(_sc) mtx_unlock_spin(&(_sc)->sc_mtx) #define BCM_GPIO_LOCK_ASSERT(_sc) mtx_assert(&(_sc)->sc_mtx, MA_OWNED) #define BCM_GPIO_WRITE(_sc, _off, _val) \ bus_space_write_4((_sc)->sc_bst, (_sc)->sc_bsh, _off, _val) #define BCM_GPIO_READ(_sc, _off) \ bus_space_read_4((_sc)->sc_bst, (_sc)->sc_bsh, _off) #define BCM_GPIO_CLEAR_BITS(_sc, _off, _bits) \ BCM_GPIO_WRITE(_sc, _off, BCM_GPIO_READ(_sc, _off) & ~(_bits)) #define BCM_GPIO_SET_BITS(_sc, _off, _bits) \ BCM_GPIO_WRITE(_sc, _off, BCM_GPIO_READ(_sc, _off) | _bits) #define BCM_GPIO_BANK(a) (a / BCM_GPIO_PINS_PER_BANK) #define BCM_GPIO_MASK(a) (1U << (a % BCM_GPIO_PINS_PER_BANK)) #define BCM_GPIO_GPFSEL(_bank) (0x00 + _bank * 4) /* Function Select */ #define BCM_GPIO_GPSET(_bank) (0x1c + _bank * 4) /* Pin Out Set */ #define BCM_GPIO_GPCLR(_bank) (0x28 + _bank * 4) /* Pin Out Clear */ #define BCM_GPIO_GPLEV(_bank) (0x34 + _bank * 4) /* Pin Level */ #define BCM_GPIO_GPEDS(_bank) (0x40 + _bank * 4) /* Event Status */ #define BCM_GPIO_GPREN(_bank) (0x4c + _bank * 4) /* Rising Edge 
irq */ #define BCM_GPIO_GPFEN(_bank) (0x58 + _bank * 4) /* Falling Edge irq */ #define BCM_GPIO_GPHEN(_bank) (0x64 + _bank * 4) /* High Level irq */ #define BCM_GPIO_GPLEN(_bank) (0x70 + _bank * 4) /* Low Level irq */ #define BCM_GPIO_GPAREN(_bank) (0x7c + _bank * 4) /* Async Rising Edge */ #define BCM_GPIO_GPAFEN(_bank) (0x88 + _bank * 4) /* Async Falling Egde */ #define BCM_GPIO_GPPUD(_bank) (0x94) /* Pin Pull up/down */ #define BCM_GPIO_GPPUDCLK(_bank) (0x98 + _bank * 4) /* Pin Pull up clock */ static struct ofw_compat_data compat_data[] = { {"broadcom,bcm2835-gpio", 1}, {"brcm,bcm2835-gpio", 1}, {NULL, 0} }; static struct bcm_gpio_softc *bcm_gpio_sc = NULL; static int bcm_gpio_intr_bank0(void *arg); static int bcm_gpio_intr_bank1(void *arg); static int bcm_gpio_pic_attach(struct bcm_gpio_softc *sc); static int bcm_gpio_pic_detach(struct bcm_gpio_softc *sc); static int bcm_gpio_pin_is_ro(struct bcm_gpio_softc *sc, int pin) { int i; for (i = 0; i < sc->sc_ro_npins; i++) if (pin == sc->sc_ro_pins[i]) return (1); return (0); } static uint32_t bcm_gpio_get_function(struct bcm_gpio_softc *sc, uint32_t pin) { uint32_t bank, func, offset; /* Five banks, 10 pins per bank, 3 bits per pin. */ bank = pin / 10; offset = (pin - bank * 10) * 3; BCM_GPIO_LOCK(sc); func = (BCM_GPIO_READ(sc, BCM_GPIO_GPFSEL(bank)) >> offset) & 7; BCM_GPIO_UNLOCK(sc); return (func); } static void bcm_gpio_func_str(uint32_t nfunc, char *buf, int bufsize) { switch (nfunc) { case BCM2835_FSEL_GPIO_IN: strncpy(buf, "input", bufsize); break; case BCM2835_FSEL_GPIO_OUT: strncpy(buf, "output", bufsize); break; case BCM2835_FSEL_ALT0: strncpy(buf, "alt0", bufsize); break; case BCM2835_FSEL_ALT1: strncpy(buf, "alt1", bufsize); break; case BCM2835_FSEL_ALT2: strncpy(buf, "alt2", bufsize); break; case BCM2835_FSEL_ALT3: strncpy(buf, "alt3", bufsize); break; case BCM2835_FSEL_ALT4: strncpy(buf, "alt4", bufsize); break; case BCM2835_FSEL_ALT5: strncpy(buf, "alt5", bufsize); break; default: strncpy(buf, "invalid", bufsize); } } static int bcm_gpio_str_func(char *func, uint32_t *nfunc) { if (strcasecmp(func, "input") == 0) *nfunc = BCM2835_FSEL_GPIO_IN; else if (strcasecmp(func, "output") == 0) *nfunc = BCM2835_FSEL_GPIO_OUT; else if (strcasecmp(func, "alt0") == 0) *nfunc = BCM2835_FSEL_ALT0; else if (strcasecmp(func, "alt1") == 0) *nfunc = BCM2835_FSEL_ALT1; else if (strcasecmp(func, "alt2") == 0) *nfunc = BCM2835_FSEL_ALT2; else if (strcasecmp(func, "alt3") == 0) *nfunc = BCM2835_FSEL_ALT3; else if (strcasecmp(func, "alt4") == 0) *nfunc = BCM2835_FSEL_ALT4; else if (strcasecmp(func, "alt5") == 0) *nfunc = BCM2835_FSEL_ALT5; else return (-1); return (0); } static uint32_t bcm_gpio_func_flag(uint32_t nfunc) { switch (nfunc) { case BCM2835_FSEL_GPIO_IN: return (GPIO_PIN_INPUT); case BCM2835_FSEL_GPIO_OUT: return (GPIO_PIN_OUTPUT); } return (0); } static void bcm_gpio_set_function(struct bcm_gpio_softc *sc, uint32_t pin, uint32_t f) { uint32_t bank, data, offset; /* Must be called with lock held. */ BCM_GPIO_LOCK_ASSERT(sc); /* Five banks, 10 pins per bank, 3 bits per pin. */ bank = pin / 10; offset = (pin - bank * 10) * 3; data = BCM_GPIO_READ(sc, BCM_GPIO_GPFSEL(bank)); data &= ~(7 << offset); data |= (f << offset); BCM_GPIO_WRITE(sc, BCM_GPIO_GPFSEL(bank), data); } static void bcm_gpio_set_pud(struct bcm_gpio_softc *sc, uint32_t pin, uint32_t state) { uint32_t bank; /* Must be called with lock held. 
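 * The BCM2835 latches the pull state by driving the desired value on
 * GPPUD while pulsing the pin's bit in the per-bank GPPUDCLK register,
 * as done below; the datasheet's short set-up/hold waits around the
 * pulse are assumed to be covered by the bus access latency here.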
*/ BCM_GPIO_LOCK_ASSERT(sc); bank = BCM_GPIO_BANK(pin); BCM_GPIO_WRITE(sc, BCM_GPIO_GPPUD(0), state); BCM_GPIO_WRITE(sc, BCM_GPIO_GPPUDCLK(bank), BCM_GPIO_MASK(pin)); BCM_GPIO_WRITE(sc, BCM_GPIO_GPPUD(0), 0); BCM_GPIO_WRITE(sc, BCM_GPIO_GPPUDCLK(bank), 0); } static void bcm_gpio_set_alternate(device_t dev, uint32_t pin, uint32_t nfunc) { struct bcm_gpio_softc *sc; int i; sc = device_get_softc(dev); BCM_GPIO_LOCK(sc); /* Set the pin function. */ bcm_gpio_set_function(sc, pin, nfunc); /* Update the pin flags. */ for (i = 0; i < sc->sc_gpio_npins; i++) { if (sc->sc_gpio_pins[i].gp_pin == pin) break; } if (i < sc->sc_gpio_npins) sc->sc_gpio_pins[i].gp_flags = bcm_gpio_func_flag(nfunc); BCM_GPIO_UNLOCK(sc); } static void bcm_gpio_pin_configure(struct bcm_gpio_softc *sc, struct gpio_pin *pin, unsigned int flags) { BCM_GPIO_LOCK(sc); /* * Manage input/output. */ if (flags & (GPIO_PIN_INPUT|GPIO_PIN_OUTPUT)) { pin->gp_flags &= ~(GPIO_PIN_INPUT|GPIO_PIN_OUTPUT); if (flags & GPIO_PIN_OUTPUT) { pin->gp_flags |= GPIO_PIN_OUTPUT; bcm_gpio_set_function(sc, pin->gp_pin, BCM2835_FSEL_GPIO_OUT); } else { pin->gp_flags |= GPIO_PIN_INPUT; bcm_gpio_set_function(sc, pin->gp_pin, BCM2835_FSEL_GPIO_IN); } } /* Manage Pull-up/pull-down. */ pin->gp_flags &= ~(GPIO_PIN_PULLUP|GPIO_PIN_PULLDOWN); if (flags & (GPIO_PIN_PULLUP|GPIO_PIN_PULLDOWN)) { if (flags & GPIO_PIN_PULLUP) { pin->gp_flags |= GPIO_PIN_PULLUP; bcm_gpio_set_pud(sc, pin->gp_pin, BCM_GPIO_PULLUP); } else { pin->gp_flags |= GPIO_PIN_PULLDOWN; bcm_gpio_set_pud(sc, pin->gp_pin, BCM_GPIO_PULLDOWN); } } else bcm_gpio_set_pud(sc, pin->gp_pin, BCM_GPIO_NONE); BCM_GPIO_UNLOCK(sc); } static device_t bcm_gpio_get_bus(device_t dev) { struct bcm_gpio_softc *sc; sc = device_get_softc(dev); return (sc->sc_busdev); } static int bcm_gpio_pin_max(device_t dev, int *maxpin) { *maxpin = BCM_GPIO_PINS - 1; return (0); } static int bcm_gpio_pin_getcaps(device_t dev, uint32_t pin, uint32_t *caps) { struct bcm_gpio_softc *sc = device_get_softc(dev); int i; for (i = 0; i < sc->sc_gpio_npins; i++) { if (sc->sc_gpio_pins[i].gp_pin == pin) break; } if (i >= sc->sc_gpio_npins) return (EINVAL); BCM_GPIO_LOCK(sc); *caps = sc->sc_gpio_pins[i].gp_caps; BCM_GPIO_UNLOCK(sc); return (0); } static int bcm_gpio_pin_getflags(device_t dev, uint32_t pin, uint32_t *flags) { struct bcm_gpio_softc *sc = device_get_softc(dev); int i; for (i = 0; i < sc->sc_gpio_npins; i++) { if (sc->sc_gpio_pins[i].gp_pin == pin) break; } if (i >= sc->sc_gpio_npins) return (EINVAL); BCM_GPIO_LOCK(sc); *flags = sc->sc_gpio_pins[i].gp_flags; BCM_GPIO_UNLOCK(sc); return (0); } static int bcm_gpio_pin_getname(device_t dev, uint32_t pin, char *name) { struct bcm_gpio_softc *sc = device_get_softc(dev); int i; for (i = 0; i < sc->sc_gpio_npins; i++) { if (sc->sc_gpio_pins[i].gp_pin == pin) break; } if (i >= sc->sc_gpio_npins) return (EINVAL); BCM_GPIO_LOCK(sc); memcpy(name, sc->sc_gpio_pins[i].gp_name, GPIOMAXNAME); BCM_GPIO_UNLOCK(sc); return (0); } static int bcm_gpio_pin_setflags(device_t dev, uint32_t pin, uint32_t flags) { struct bcm_gpio_softc *sc = device_get_softc(dev); int i; for (i = 0; i < sc->sc_gpio_npins; i++) { if (sc->sc_gpio_pins[i].gp_pin == pin) break; } if (i >= sc->sc_gpio_npins) return (EINVAL); /* We never touch on read-only/reserved pins. 
*/ if (bcm_gpio_pin_is_ro(sc, pin)) return (EINVAL); bcm_gpio_pin_configure(sc, &sc->sc_gpio_pins[i], flags); return (0); } static int bcm_gpio_pin_set(device_t dev, uint32_t pin, unsigned int value) { struct bcm_gpio_softc *sc = device_get_softc(dev); uint32_t bank, reg; int i; for (i = 0; i < sc->sc_gpio_npins; i++) { if (sc->sc_gpio_pins[i].gp_pin == pin) break; } if (i >= sc->sc_gpio_npins) return (EINVAL); /* We never write to read-only/reserved pins. */ if (bcm_gpio_pin_is_ro(sc, pin)) return (EINVAL); BCM_GPIO_LOCK(sc); bank = BCM_GPIO_BANK(pin); if (value) reg = BCM_GPIO_GPSET(bank); else reg = BCM_GPIO_GPCLR(bank); BCM_GPIO_WRITE(sc, reg, BCM_GPIO_MASK(pin)); BCM_GPIO_UNLOCK(sc); return (0); } static int bcm_gpio_pin_get(device_t dev, uint32_t pin, unsigned int *val) { struct bcm_gpio_softc *sc = device_get_softc(dev); uint32_t bank, reg_data; int i; for (i = 0; i < sc->sc_gpio_npins; i++) { if (sc->sc_gpio_pins[i].gp_pin == pin) break; } if (i >= sc->sc_gpio_npins) return (EINVAL); bank = BCM_GPIO_BANK(pin); BCM_GPIO_LOCK(sc); reg_data = BCM_GPIO_READ(sc, BCM_GPIO_GPLEV(bank)); BCM_GPIO_UNLOCK(sc); *val = (reg_data & BCM_GPIO_MASK(pin)) ? 1 : 0; return (0); } static int bcm_gpio_pin_toggle(device_t dev, uint32_t pin) { struct bcm_gpio_softc *sc = device_get_softc(dev); uint32_t bank, data, reg; int i; for (i = 0; i < sc->sc_gpio_npins; i++) { if (sc->sc_gpio_pins[i].gp_pin == pin) break; } if (i >= sc->sc_gpio_npins) return (EINVAL); /* We never write to read-only/reserved pins. */ if (bcm_gpio_pin_is_ro(sc, pin)) return (EINVAL); BCM_GPIO_LOCK(sc); bank = BCM_GPIO_BANK(pin); data = BCM_GPIO_READ(sc, BCM_GPIO_GPLEV(bank)); if (data & BCM_GPIO_MASK(pin)) reg = BCM_GPIO_GPCLR(bank); else reg = BCM_GPIO_GPSET(bank); BCM_GPIO_WRITE(sc, reg, BCM_GPIO_MASK(pin)); BCM_GPIO_UNLOCK(sc); return (0); } static int bcm_gpio_func_proc(SYSCTL_HANDLER_ARGS) { char buf[16]; struct bcm_gpio_softc *sc; struct bcm_gpio_sysctl *sc_sysctl; uint32_t nfunc; int error; sc_sysctl = arg1; sc = sc_sysctl->sc; /* Get the current pin function. */ nfunc = bcm_gpio_get_function(sc, sc_sysctl->pin); bcm_gpio_func_str(nfunc, buf, sizeof(buf)); error = sysctl_handle_string(oidp, buf, sizeof(buf), req); if (error != 0 || req->newptr == NULL) return (error); /* Ignore changes on read-only pins. */ if (bcm_gpio_pin_is_ro(sc, sc_sysctl->pin)) return (0); /* Parse the user supplied string and check for a valid pin function. */ if (bcm_gpio_str_func(buf, &nfunc) != 0) return (EINVAL); /* Update the pin alternate function. */ bcm_gpio_set_alternate(sc->sc_dev, sc_sysctl->pin, nfunc); return (0); } static void bcm_gpio_sysctl_init(struct bcm_gpio_softc *sc) { char pinbuf[3]; struct bcm_gpio_sysctl *sc_sysctl; struct sysctl_ctx_list *ctx; struct sysctl_oid *tree_node, *pin_node, *pinN_node; struct sysctl_oid_list *tree, *pin_tree, *pinN_tree; int i; /* * Add per-pin sysctl tree/handlers. 
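 * As a usage sketch (unit number assumed), pin 17 of gpio0 then shows
 * up as dev.gpio.0.pin.17.function and accepts the strings parsed by
 * bcm_gpio_str_func(), e.g.:
 *
 *	sysctl dev.gpio.0.pin.17.function=alt5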
*/ ctx = device_get_sysctl_ctx(sc->sc_dev); tree_node = device_get_sysctl_tree(sc->sc_dev); tree = SYSCTL_CHILDREN(tree_node); pin_node = SYSCTL_ADD_NODE(ctx, tree, OID_AUTO, "pin", CTLFLAG_RD, NULL, "GPIO Pins"); pin_tree = SYSCTL_CHILDREN(pin_node); for (i = 0; i < sc->sc_gpio_npins; i++) { snprintf(pinbuf, sizeof(pinbuf), "%d", i); pinN_node = SYSCTL_ADD_NODE(ctx, pin_tree, OID_AUTO, pinbuf, CTLFLAG_RD, NULL, "GPIO Pin"); pinN_tree = SYSCTL_CHILDREN(pinN_node); sc->sc_sysctl[i].sc = sc; sc_sysctl = &sc->sc_sysctl[i]; sc_sysctl->sc = sc; sc_sysctl->pin = sc->sc_gpio_pins[i].gp_pin; SYSCTL_ADD_PROC(ctx, pinN_tree, OID_AUTO, "function", CTLFLAG_RW | CTLTYPE_STRING, sc_sysctl, sizeof(struct bcm_gpio_sysctl), bcm_gpio_func_proc, "A", "Pin Function"); } } static int bcm_gpio_get_ro_pins(struct bcm_gpio_softc *sc, phandle_t node, const char *propname, const char *label) { int i, need_comma, npins, range_start, range_stop; pcell_t *pins; /* Get the property data. */ - npins = OF_getencprop_alloc(node, propname, sizeof(*pins), + npins = OF_getencprop_alloc_multi(node, propname, sizeof(*pins), (void **)&pins); if (npins < 0) return (-1); if (npins == 0) { OF_prop_free(pins); return (0); } for (i = 0; i < npins; i++) sc->sc_ro_pins[i + sc->sc_ro_npins] = pins[i]; sc->sc_ro_npins += npins; need_comma = 0; device_printf(sc->sc_dev, "%s pins: ", label); range_start = range_stop = pins[0]; for (i = 1; i < npins; i++) { if (pins[i] != range_stop + 1) { if (need_comma) printf(","); if (range_start != range_stop) printf("%d-%d", range_start, range_stop); else printf("%d", range_start); range_start = range_stop = pins[i]; need_comma = 1; } else range_stop++; } if (need_comma) printf(","); if (range_start != range_stop) printf("%d-%d.\n", range_start, range_stop); else printf("%d.\n", range_start); OF_prop_free(pins); return (0); } static int bcm_gpio_get_reserved_pins(struct bcm_gpio_softc *sc) { char *name; phandle_t gpio, node, reserved; ssize_t len; /* Get read-only pins if they're provided */ gpio = ofw_bus_get_node(sc->sc_dev); if (bcm_gpio_get_ro_pins(sc, gpio, "broadcom,read-only", "read-only") != 0) return (0); /* Traverse the GPIO subnodes to find the reserved pins node. */ reserved = 0; node = OF_child(gpio); while ((node != 0) && (reserved == 0)) { len = OF_getprop_alloc(node, "name", (void **)&name); if (len == -1) return (-1); if (strcmp(name, "reserved") == 0) reserved = node; OF_prop_free(name); node = OF_peer(node); } if (reserved == 0) return (-1); /* Get the reserved pins. */ if (bcm_gpio_get_ro_pins(sc, reserved, "broadcom,pins", "reserved") != 0) return (-1); return (0); } static int bcm_gpio_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (ofw_bus_search_compatible(dev, compat_data)->ocd_data == 0) return (ENXIO); device_set_desc(dev, "BCM2708/2835 GPIO controller"); return (BUS_PROBE_DEFAULT); } static int bcm_gpio_intr_attach(device_t dev) { struct bcm_gpio_softc *sc; /* * Only the first two interrupt lines are used. The third line * mirrors the second and the fourth line is common to all banks.
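 * Hence only sc_res[1] (bank 0) and sc_res[2] (bank 1) are hooked
 * up as filter handlers below.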
*/ sc = device_get_softc(dev); if (sc->sc_res[1] == NULL || sc->sc_res[2] == NULL) return (-1); if (bcm_gpio_pic_attach(sc) != 0) { device_printf(dev, "unable to attach PIC\n"); return (-1); } if (bus_setup_intr(dev, sc->sc_res[1], INTR_TYPE_MISC | INTR_MPSAFE, bcm_gpio_intr_bank0, NULL, sc, &sc->sc_intrhand[0]) != 0) return (-1); if (bus_setup_intr(dev, sc->sc_res[2], INTR_TYPE_MISC | INTR_MPSAFE, bcm_gpio_intr_bank1, NULL, sc, &sc->sc_intrhand[1]) != 0) return (-1); return (0); } static void bcm_gpio_intr_detach(device_t dev) { struct bcm_gpio_softc *sc; sc = device_get_softc(dev); if (sc->sc_intrhand[0] != NULL) bus_teardown_intr(dev, sc->sc_res[1], sc->sc_intrhand[0]); if (sc->sc_intrhand[1] != NULL) bus_teardown_intr(dev, sc->sc_res[2], sc->sc_intrhand[1]); bcm_gpio_pic_detach(sc); } static int bcm_gpio_attach(device_t dev) { int i, j; phandle_t gpio; struct bcm_gpio_softc *sc; uint32_t func; if (bcm_gpio_sc != NULL) return (ENXIO); bcm_gpio_sc = sc = device_get_softc(dev); sc->sc_dev = dev; mtx_init(&sc->sc_mtx, "bcm gpio", "gpio", MTX_SPIN); if (bus_alloc_resources(dev, bcm_gpio_res_spec, sc->sc_res) != 0) { device_printf(dev, "cannot allocate resources\n"); goto fail; } sc->sc_bst = rman_get_bustag(sc->sc_res[0]); sc->sc_bsh = rman_get_bushandle(sc->sc_res[0]); /* Setup the GPIO interrupt handler. */ if (bcm_gpio_intr_attach(dev)) { device_printf(dev, "unable to setup the gpio irq handler\n"); goto fail; } /* Find our node. */ gpio = ofw_bus_get_node(sc->sc_dev); if (!OF_hasprop(gpio, "gpio-controller")) /* Node is not a GPIO controller. */ goto fail; /* * Find the read-only pins. These are pins we never touch or bad * things could happen. */ if (bcm_gpio_get_reserved_pins(sc) == -1) goto fail; /* Initialize the software controlled pins. */ for (i = 0, j = 0; j < BCM_GPIO_PINS; j++) { snprintf(sc->sc_gpio_pins[i].gp_name, GPIOMAXNAME, "pin %d", j); func = bcm_gpio_get_function(sc, j); sc->sc_gpio_pins[i].gp_pin = j; sc->sc_gpio_pins[i].gp_caps = BCM_GPIO_DEFAULT_CAPS; sc->sc_gpio_pins[i].gp_flags = bcm_gpio_func_flag(func); i++; } sc->sc_gpio_npins = i; bcm_gpio_sysctl_init(sc); sc->sc_busdev = gpiobus_attach_bus(dev); if (sc->sc_busdev == NULL) goto fail; fdt_pinctrl_register(dev, "brcm,pins"); fdt_pinctrl_configure_tree(dev); return (0); fail: bcm_gpio_intr_detach(dev); bus_release_resources(dev, bcm_gpio_res_spec, sc->sc_res); mtx_destroy(&sc->sc_mtx); return (ENXIO); } static int bcm_gpio_detach(device_t dev) { return (EBUSY); } static inline void bcm_gpio_modify(struct bcm_gpio_softc *sc, uint32_t reg, uint32_t mask, bool set_bits) { if (set_bits) BCM_GPIO_SET_BITS(sc, reg, mask); else BCM_GPIO_CLEAR_BITS(sc, reg, mask); } static inline void bcm_gpio_isrc_eoi(struct bcm_gpio_softc *sc, struct bcm_gpio_irqsrc *bgi) { uint32_t bank; /* Write 1 to clear. 
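 * GPEDS latches a detected event per pin; writing the pin's mask
 * bit back acknowledges it, which is all that EOI amounts to here.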
*/ bank = BCM_GPIO_BANK(bgi->bgi_irq); BCM_GPIO_WRITE(sc, BCM_GPIO_GPEDS(bank), bgi->bgi_mask); } static inline bool bcm_gpio_isrc_is_level(struct bcm_gpio_irqsrc *bgi) { return (bgi->bgi_mode == GPIO_INTR_LEVEL_LOW || bgi->bgi_mode == GPIO_INTR_LEVEL_HIGH); } static inline void bcm_gpio_isrc_mask(struct bcm_gpio_softc *sc, struct bcm_gpio_irqsrc *bgi) { uint32_t bank; bank = BCM_GPIO_BANK(bgi->bgi_irq); BCM_GPIO_LOCK(sc); switch (bgi->bgi_mode) { case GPIO_INTR_LEVEL_LOW: BCM_GPIO_CLEAR_BITS(sc, BCM_GPIO_GPLEN(bank), bgi->bgi_mask); break; case GPIO_INTR_LEVEL_HIGH: BCM_GPIO_CLEAR_BITS(sc, BCM_GPIO_GPHEN(bank), bgi->bgi_mask); break; case GPIO_INTR_EDGE_RISING: BCM_GPIO_CLEAR_BITS(sc, BCM_GPIO_GPREN(bank), bgi->bgi_mask); break; case GPIO_INTR_EDGE_FALLING: BCM_GPIO_CLEAR_BITS(sc, BCM_GPIO_GPFEN(bank), bgi->bgi_mask); break; case GPIO_INTR_EDGE_BOTH: BCM_GPIO_CLEAR_BITS(sc, BCM_GPIO_GPREN(bank), bgi->bgi_mask); BCM_GPIO_CLEAR_BITS(sc, BCM_GPIO_GPFEN(bank), bgi->bgi_mask); break; } BCM_GPIO_UNLOCK(sc); } static inline void bcm_gpio_isrc_unmask(struct bcm_gpio_softc *sc, struct bcm_gpio_irqsrc *bgi) { uint32_t bank; bank = BCM_GPIO_BANK(bgi->bgi_irq); BCM_GPIO_LOCK(sc); switch (bgi->bgi_mode) { case GPIO_INTR_LEVEL_LOW: BCM_GPIO_SET_BITS(sc, BCM_GPIO_GPLEN(bank), bgi->bgi_mask); break; case GPIO_INTR_LEVEL_HIGH: BCM_GPIO_SET_BITS(sc, BCM_GPIO_GPHEN(bank), bgi->bgi_mask); break; case GPIO_INTR_EDGE_RISING: BCM_GPIO_SET_BITS(sc, BCM_GPIO_GPREN(bank), bgi->bgi_mask); break; case GPIO_INTR_EDGE_FALLING: BCM_GPIO_SET_BITS(sc, BCM_GPIO_GPFEN(bank), bgi->bgi_mask); break; case GPIO_INTR_EDGE_BOTH: BCM_GPIO_SET_BITS(sc, BCM_GPIO_GPREN(bank), bgi->bgi_mask); BCM_GPIO_SET_BITS(sc, BCM_GPIO_GPFEN(bank), bgi->bgi_mask); break; } BCM_GPIO_UNLOCK(sc); } static int bcm_gpio_intr_internal(struct bcm_gpio_softc *sc, uint32_t bank) { u_int irq; struct bcm_gpio_irqsrc *bgi; uint32_t reg; /* Do not care of spurious interrupt on GPIO. */ reg = BCM_GPIO_READ(sc, BCM_GPIO_GPEDS(bank)); while (reg != 0) { irq = BCM_GPIO_PINS_PER_BANK * bank + ffs(reg) - 1; bgi = sc->sc_isrcs + irq; if (!bcm_gpio_isrc_is_level(bgi)) bcm_gpio_isrc_eoi(sc, bgi); if (intr_isrc_dispatch(&bgi->bgi_isrc, curthread->td_intr_frame) != 0) { bcm_gpio_isrc_mask(sc, bgi); if (bcm_gpio_isrc_is_level(bgi)) bcm_gpio_isrc_eoi(sc, bgi); device_printf(sc->sc_dev, "Stray irq %u disabled\n", irq); } reg &= ~bgi->bgi_mask; } return (FILTER_HANDLED); } static int bcm_gpio_intr_bank0(void *arg) { return (bcm_gpio_intr_internal(arg, 0)); } static int bcm_gpio_intr_bank1(void *arg) { return (bcm_gpio_intr_internal(arg, 1)); } static int bcm_gpio_pic_attach(struct bcm_gpio_softc *sc) { int error; uint32_t irq; const char *name; name = device_get_nameunit(sc->sc_dev); for (irq = 0; irq < BCM_GPIO_PINS; irq++) { sc->sc_isrcs[irq].bgi_irq = irq; sc->sc_isrcs[irq].bgi_mask = BCM_GPIO_MASK(irq); sc->sc_isrcs[irq].bgi_mode = GPIO_INTR_CONFORM; error = intr_isrc_register(&sc->sc_isrcs[irq].bgi_isrc, sc->sc_dev, 0, "%s,%u", name, irq); if (error != 0) return (error); /* XXX deregister ISRCs */ } if (intr_pic_register(sc->sc_dev, OF_xref_from_node(ofw_bus_get_node(sc->sc_dev))) == NULL) return (ENXIO); return (0); } static int bcm_gpio_pic_detach(struct bcm_gpio_softc *sc) { /* * There has not been established any procedure yet * how to detach PIC from living system correctly. 
*/ device_printf(sc->sc_dev, "%s: not implemented yet\n", __func__); return (EBUSY); } static void bcm_gpio_pic_config_intr(struct bcm_gpio_softc *sc, struct bcm_gpio_irqsrc *bgi, uint32_t mode) { uint32_t bank; bank = BCM_GPIO_BANK(bgi->bgi_irq); BCM_GPIO_LOCK(sc); bcm_gpio_modify(sc, BCM_GPIO_GPREN(bank), bgi->bgi_mask, mode == GPIO_INTR_EDGE_RISING || mode == GPIO_INTR_EDGE_BOTH); bcm_gpio_modify(sc, BCM_GPIO_GPFEN(bank), bgi->bgi_mask, mode == GPIO_INTR_EDGE_FALLING || mode == GPIO_INTR_EDGE_BOTH); bcm_gpio_modify(sc, BCM_GPIO_GPHEN(bank), bgi->bgi_mask, mode == GPIO_INTR_LEVEL_HIGH); bcm_gpio_modify(sc, BCM_GPIO_GPLEN(bank), bgi->bgi_mask, mode == GPIO_INTR_LEVEL_LOW); bgi->bgi_mode = mode; BCM_GPIO_UNLOCK(sc); } static void bcm_gpio_pic_disable_intr(device_t dev, struct intr_irqsrc *isrc) { struct bcm_gpio_softc *sc = device_get_softc(dev); struct bcm_gpio_irqsrc *bgi = (struct bcm_gpio_irqsrc *)isrc; bcm_gpio_isrc_mask(sc, bgi); } static void bcm_gpio_pic_enable_intr(device_t dev, struct intr_irqsrc *isrc) { struct bcm_gpio_softc *sc = device_get_softc(dev); struct bcm_gpio_irqsrc *bgi = (struct bcm_gpio_irqsrc *)isrc; arm_irq_memory_barrier(bgi->bgi_irq); bcm_gpio_isrc_unmask(sc, bgi); } static int bcm_gpio_pic_map_fdt(struct bcm_gpio_softc *sc, struct intr_map_data_fdt *daf, u_int *irqp, uint32_t *modep) { u_int irq; uint32_t mode; /* * The first cell is the interrupt number. * The second cell is used to specify flags: * bits[3:0] trigger type and level flags: * 1 = low-to-high edge triggered. * 2 = high-to-low edge triggered. * 4 = active high level-sensitive. * 8 = active low level-sensitive. */ if (daf->ncells != 2) return (EINVAL); irq = daf->cells[0]; if (irq >= BCM_GPIO_PINS || bcm_gpio_pin_is_ro(sc, irq)) return (EINVAL); /* Only reasonable modes are supported. 
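 * For example, an FDT interrupt specifier of <23 2> selects pin 23
 * with a high-to-low edge (GPIO_INTR_EDGE_FALLING); the value 3 is
 * additionally accepted as "both edges" even though the generic
 * binding only defines 1, 2, 4 and 8.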
*/ if (daf->cells[1] == 1) mode = GPIO_INTR_EDGE_RISING; else if (daf->cells[1] == 2) mode = GPIO_INTR_EDGE_FALLING; else if (daf->cells[1] == 3) mode = GPIO_INTR_EDGE_BOTH; else if (daf->cells[1] == 4) mode = GPIO_INTR_LEVEL_HIGH; else if (daf->cells[1] == 8) mode = GPIO_INTR_LEVEL_LOW; else return (EINVAL); *irqp = irq; if (modep != NULL) *modep = mode; return (0); } static int bcm_gpio_pic_map_gpio(struct bcm_gpio_softc *sc, struct intr_map_data_gpio *dag, u_int *irqp, uint32_t *modep) { u_int irq; uint32_t mode; irq = dag->gpio_pin_num; if (irq >= BCM_GPIO_PINS || bcm_gpio_pin_is_ro(sc, irq)) return (EINVAL); mode = dag->gpio_intr_mode; if (mode != GPIO_INTR_LEVEL_LOW && mode != GPIO_INTR_LEVEL_HIGH && mode != GPIO_INTR_EDGE_RISING && mode != GPIO_INTR_EDGE_FALLING && mode != GPIO_INTR_EDGE_BOTH) return (EINVAL); *irqp = irq; if (modep != NULL) *modep = mode; return (0); } static int bcm_gpio_pic_map(struct bcm_gpio_softc *sc, struct intr_map_data *data, u_int *irqp, uint32_t *modep) { switch (data->type) { case INTR_MAP_DATA_FDT: return (bcm_gpio_pic_map_fdt(sc, (struct intr_map_data_fdt *)data, irqp, modep)); case INTR_MAP_DATA_GPIO: return (bcm_gpio_pic_map_gpio(sc, (struct intr_map_data_gpio *)data, irqp, modep)); default: return (ENOTSUP); } } static int bcm_gpio_pic_map_intr(device_t dev, struct intr_map_data *data, struct intr_irqsrc **isrcp) { int error; u_int irq; struct bcm_gpio_softc *sc = device_get_softc(dev); error = bcm_gpio_pic_map(sc, data, &irq, NULL); if (error == 0) *isrcp = &sc->sc_isrcs[irq].bgi_isrc; return (error); } static void bcm_gpio_pic_post_filter(device_t dev, struct intr_irqsrc *isrc) { struct bcm_gpio_softc *sc = device_get_softc(dev); struct bcm_gpio_irqsrc *bgi = (struct bcm_gpio_irqsrc *)isrc; if (bcm_gpio_isrc_is_level(bgi)) bcm_gpio_isrc_eoi(sc, bgi); } static void bcm_gpio_pic_post_ithread(device_t dev, struct intr_irqsrc *isrc) { bcm_gpio_pic_enable_intr(dev, isrc); } static void bcm_gpio_pic_pre_ithread(device_t dev, struct intr_irqsrc *isrc) { struct bcm_gpio_softc *sc = device_get_softc(dev); struct bcm_gpio_irqsrc *bgi = (struct bcm_gpio_irqsrc *)isrc; bcm_gpio_isrc_mask(sc, bgi); if (bcm_gpio_isrc_is_level(bgi)) bcm_gpio_isrc_eoi(sc, bgi); } static int bcm_gpio_pic_setup_intr(device_t dev, struct intr_irqsrc *isrc, struct resource *res, struct intr_map_data *data) { u_int irq; uint32_t mode; struct bcm_gpio_softc *sc; struct bcm_gpio_irqsrc *bgi; if (data == NULL) return (ENOTSUP); sc = device_get_softc(dev); bgi = (struct bcm_gpio_irqsrc *)isrc; /* Get and check config for an interrupt. */ if (bcm_gpio_pic_map(sc, data, &irq, &mode) != 0 || bgi->bgi_irq != irq) return (EINVAL); /* * If this is a setup for another handler, * only check that its configuration match. */ if (isrc->isrc_handlers != 0) return (bgi->bgi_mode == mode ? 0 : EINVAL); bcm_gpio_pic_config_intr(sc, bgi, mode); return (0); } static int bcm_gpio_pic_teardown_intr(device_t dev, struct intr_irqsrc *isrc, struct resource *res, struct intr_map_data *data) { struct bcm_gpio_softc *sc = device_get_softc(dev); struct bcm_gpio_irqsrc *bgi = (struct bcm_gpio_irqsrc *)isrc; if (isrc->isrc_handlers == 0) bcm_gpio_pic_config_intr(sc, bgi, GPIO_INTR_CONFORM); return (0); } static phandle_t bcm_gpio_get_node(device_t bus, device_t dev) { /* We only have one child, the GPIO bus, which needs our own node. 
*/ return (ofw_bus_get_node(bus)); } static int bcm_gpio_configure_pins(device_t dev, phandle_t cfgxref) { phandle_t cfgnode; int i, pintuples, pulltuples; uint32_t pin; uint32_t *pins; uint32_t *pulls; uint32_t function; static struct bcm_gpio_softc *sc; sc = device_get_softc(dev); cfgnode = OF_node_from_xref(cfgxref); pins = NULL; - pintuples = OF_getencprop_alloc(cfgnode, "brcm,pins", sizeof(*pins), - (void **)&pins); + pintuples = OF_getencprop_alloc_multi(cfgnode, "brcm,pins", + sizeof(*pins), (void **)&pins); char name[32]; OF_getprop(cfgnode, "name", &name, sizeof(name)); if (pintuples < 0) return (ENOENT); if (pintuples == 0) return (0); /* Empty property is not an error. */ if (OF_getencprop(cfgnode, "brcm,function", &function, sizeof(function)) <= 0) { OF_prop_free(pins); return (EINVAL); } pulls = NULL; - pulltuples = OF_getencprop_alloc(cfgnode, "brcm,pull", sizeof(*pulls), - (void **)&pulls); + pulltuples = OF_getencprop_alloc_multi(cfgnode, "brcm,pull", + sizeof(*pulls), (void **)&pulls); if ((pulls != NULL) && (pulltuples != pintuples)) { OF_prop_free(pins); OF_prop_free(pulls); return (EINVAL); } for (i = 0; i < pintuples; i++) { pin = pins[i]; bcm_gpio_set_alternate(dev, pin, function); if (bootverbose) device_printf(dev, "set pin %d to func %d", pin, function); if (pulls) { if (bootverbose) printf(", pull %d", pulls[i]); switch (pulls[i]) { /* Convert to gpio(4) flags */ case BCM2835_PUD_OFF: bcm_gpio_pin_setflags(dev, pin, 0); break; case BCM2835_PUD_UP: bcm_gpio_pin_setflags(dev, pin, GPIO_PIN_PULLUP); break; case BCM2835_PUD_DOWN: bcm_gpio_pin_setflags(dev, pin, GPIO_PIN_PULLDOWN); break; default: printf("%s: invalid pull value for pin %d: %d\n", name, pin, pulls[i]); } } if (bootverbose) printf("\n"); } OF_prop_free(pins); if (pulls) OF_prop_free(pulls); return (0); } static device_method_t bcm_gpio_methods[] = { /* Device interface */ DEVMETHOD(device_probe, bcm_gpio_probe), DEVMETHOD(device_attach, bcm_gpio_attach), DEVMETHOD(device_detach, bcm_gpio_detach), /* GPIO protocol */ DEVMETHOD(gpio_get_bus, bcm_gpio_get_bus), DEVMETHOD(gpio_pin_max, bcm_gpio_pin_max), DEVMETHOD(gpio_pin_getname, bcm_gpio_pin_getname), DEVMETHOD(gpio_pin_getflags, bcm_gpio_pin_getflags), DEVMETHOD(gpio_pin_getcaps, bcm_gpio_pin_getcaps), DEVMETHOD(gpio_pin_setflags, bcm_gpio_pin_setflags), DEVMETHOD(gpio_pin_get, bcm_gpio_pin_get), DEVMETHOD(gpio_pin_set, bcm_gpio_pin_set), DEVMETHOD(gpio_pin_toggle, bcm_gpio_pin_toggle), /* Interrupt controller interface */ DEVMETHOD(pic_disable_intr, bcm_gpio_pic_disable_intr), DEVMETHOD(pic_enable_intr, bcm_gpio_pic_enable_intr), DEVMETHOD(pic_map_intr, bcm_gpio_pic_map_intr), DEVMETHOD(pic_post_filter, bcm_gpio_pic_post_filter), DEVMETHOD(pic_post_ithread, bcm_gpio_pic_post_ithread), DEVMETHOD(pic_pre_ithread, bcm_gpio_pic_pre_ithread), DEVMETHOD(pic_setup_intr, bcm_gpio_pic_setup_intr), DEVMETHOD(pic_teardown_intr, bcm_gpio_pic_teardown_intr), /* ofw_bus interface */ DEVMETHOD(ofw_bus_get_node, bcm_gpio_get_node), /* fdt_pinctrl interface */ DEVMETHOD(fdt_pinctrl_configure, bcm_gpio_configure_pins), DEVMETHOD_END }; static devclass_t bcm_gpio_devclass; static driver_t bcm_gpio_driver = { "gpio", bcm_gpio_methods, sizeof(struct bcm_gpio_softc), }; EARLY_DRIVER_MODULE(bcm_gpio, simplebus, bcm_gpio_driver, bcm_gpio_devclass, 0, 0, BUS_PASS_INTERRUPT + BUS_PASS_ORDER_LATE); Index: user/markj/netdump/sys/arm/broadcom/bcm2835/bcm2835_pwm.c =================================================================== --- 
user/markj/netdump/sys/arm/broadcom/bcm2835/bcm2835_pwm.c (revision 332407) +++ user/markj/netdump/sys/arm/broadcom/bcm2835/bcm2835_pwm.c (revision 332408) @@ -1,379 +1,375 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2017 Poul-Henning Kamp * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include static struct ofw_compat_data compat_data[] = { {"broadcom,bcm2835-pwm", 1}, {"brcm,bcm2835-pwm", 1}, {NULL, 0} }; struct bcm_pwm_softc { device_t sc_dev; struct resource * sc_mem_res; bus_space_tag_t sc_m_bst; bus_space_handle_t sc_m_bsh; device_t clkman; uint32_t freq; uint32_t period; uint32_t ratio; uint32_t mode; }; #define BCM_PWM_MEM_WRITE(_sc, _off, _val) \ bus_space_write_4(_sc->sc_m_bst, _sc->sc_m_bsh, _off, _val) #define BCM_PWM_MEM_READ(_sc, _off) \ bus_space_read_4(_sc->sc_m_bst, _sc->sc_m_bsh, _off) #define BCM_PWM_CLK_WRITE(_sc, _off, _val) \ bus_space_write_4(_sc->sc_c_bst, _sc->sc_c_bsh, _off, _val) #define BCM_PWM_CLK_READ(_sc, _off) \ bus_space_read_4(_sc->sc_c_bst, _sc->sc_c_bsh, _off) #define W_CTL(_sc, _val) BCM_PWM_MEM_WRITE(_sc, 0x00, _val) #define R_CTL(_sc) BCM_PWM_MEM_READ(_sc, 0x00) #define W_STA(_sc, _val) BCM_PWM_MEM_WRITE(_sc, 0x04, _val) #define R_STA(_sc) BCM_PWM_MEM_READ(_sc, 0x04) #define W_RNG(_sc, _val) BCM_PWM_MEM_WRITE(_sc, 0x10, _val) #define R_RNG(_sc) BCM_PWM_MEM_READ(_sc, 0x10) #define W_DAT(_sc, _val) BCM_PWM_MEM_WRITE(_sc, 0x14, _val) #define R_DAT(_sc) BCM_PWM_MEM_READ(_sc, 0x14) static int bcm_pwm_reconf(struct bcm_pwm_softc *sc) { uint32_t u; /* Disable PWM */ W_CTL(sc, 0); /* Stop PWM clock */ (void)bcm2835_clkman_set_frequency(sc->clkman, BCM_PWM_CLKSRC, 0); if (sc->mode == 0) return (0); u = bcm2835_clkman_set_frequency(sc->clkman, BCM_PWM_CLKSRC, sc->freq); if (u == 0) return (EINVAL); sc->freq = u; /* Config PWM */ W_RNG(sc, sc->period); if (sc->ratio > sc->period) sc->ratio = sc->period; W_DAT(sc, sc->ratio); /* Start PWM */ if (sc->mode == 1) W_CTL(sc, 0x81); else W_CTL(sc, 0x1); return (0); } static int bcm_pwm_pwm_freq_proc(SYSCTL_HANDLER_ARGS) { struct bcm_pwm_softc *sc; uint32_t r; int error; sc = (struct bcm_pwm_softc *)arg1; if (sc->mode == 1) r = sc->freq / 
sc->period; else r = 0; error = sysctl_handle_int(oidp, &r, sizeof(r), req); return (error); } static int bcm_pwm_mode_proc(SYSCTL_HANDLER_ARGS) { struct bcm_pwm_softc *sc; uint32_t r; int error; sc = (struct bcm_pwm_softc *)arg1; r = sc->mode; error = sysctl_handle_int(oidp, &r, sizeof(r), req); if (error != 0 || req->newptr == NULL) return (error); if (r > 2) return (EINVAL); sc->mode = r; return (bcm_pwm_reconf(sc)); } static int bcm_pwm_freq_proc(SYSCTL_HANDLER_ARGS) { struct bcm_pwm_softc *sc; uint32_t r; int error; sc = (struct bcm_pwm_softc *)arg1; r = sc->freq; error = sysctl_handle_int(oidp, &r, sizeof(r), req); if (error != 0 || req->newptr == NULL) return (error); if (r > 125000000) return (EINVAL); sc->freq = r; return (bcm_pwm_reconf(sc)); } static int bcm_pwm_period_proc(SYSCTL_HANDLER_ARGS) { struct bcm_pwm_softc *sc; int error; sc = (struct bcm_pwm_softc *)arg1; error = sysctl_handle_int(oidp, &sc->period, sizeof(sc->period), req); if (error != 0 || req->newptr == NULL) return (error); return (bcm_pwm_reconf(sc)); } static int bcm_pwm_ratio_proc(SYSCTL_HANDLER_ARGS) { struct bcm_pwm_softc *sc; uint32_t r; int error; sc = (struct bcm_pwm_softc *)arg1; r = sc->ratio; error = sysctl_handle_int(oidp, &r, sizeof(r), req); if (error != 0 || req->newptr == NULL) return (error); if (r > sc->period) // XXX >= ? return (EINVAL); sc->ratio = r; BCM_PWM_MEM_WRITE(sc, 0x14, sc->ratio); return (0); } static int bcm_pwm_reg_proc(SYSCTL_HANDLER_ARGS) { struct bcm_pwm_softc *sc; uint32_t reg; int error; sc = (struct bcm_pwm_softc *)arg1; reg = BCM_PWM_MEM_READ(sc, arg2 & 0xff); error = sysctl_handle_int(oidp, ®, sizeof(reg), req); if (error != 0 || req->newptr == NULL) return (error); BCM_PWM_MEM_WRITE(sc, arg2, reg); return (0); } static void bcm_pwm_sysctl_init(struct bcm_pwm_softc *sc) { struct sysctl_ctx_list *ctx; struct sysctl_oid *tree_node; struct sysctl_oid_list *tree; /* * Add system sysctl tree/handlers. */ ctx = device_get_sysctl_ctx(sc->sc_dev); tree_node = device_get_sysctl_tree(sc->sc_dev); tree = SYSCTL_CHILDREN(tree_node); if (bootverbose) { #define RR(x,y) \ SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, y, \ CTLFLAG_RW | CTLTYPE_UINT, sc, 0x##x, \ bcm_pwm_reg_proc, "IU", "Register 0x" #x " " y); RR(24, "DAT2") RR(20, "RNG2") RR(18, "FIF1") RR(14, "DAT1") RR(10, "RNG1") RR(08, "DMAC") RR(04, "STA") RR(00, "CTL") #undef RR } SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, "pwm_freq", CTLFLAG_RD | CTLTYPE_UINT, sc, 0, bcm_pwm_pwm_freq_proc, "IU", "PWM frequency (Hz)"); SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, "period", CTLFLAG_RW | CTLTYPE_UINT, sc, 0, bcm_pwm_period_proc, "IU", "PWM period (#clocks)"); SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, "ratio", CTLFLAG_RW | CTLTYPE_UINT, sc, 0, bcm_pwm_ratio_proc, "IU", "PWM ratio (0...period)"); SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, "freq", CTLFLAG_RW | CTLTYPE_UINT, sc, 0, bcm_pwm_freq_proc, "IU", "PWM clock (Hz)"); SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, "mode", CTLFLAG_RW | CTLTYPE_UINT, sc, 0, bcm_pwm_mode_proc, "IU", "PWM mode (0=off, 1=pwm, 2=dither)"); } static int bcm_pwm_probe(device_t dev) { -#if 0 - // XXX: default state is disabled in RPI3 DTB, assume for now - // XXX: that people want the PWM to work if the KLD this module. 
if (!ofw_bus_status_okay(dev)) return (ENXIO); -#endif if (ofw_bus_search_compatible(dev, compat_data)->ocd_data == 0) return (ENXIO); device_set_desc(dev, "BCM2708/2835 PWM controller"); return (BUS_PROBE_DEFAULT); } static int bcm_pwm_attach(device_t dev) { struct bcm_pwm_softc *sc; int rid; if (device_get_unit(dev) != 0) { device_printf(dev, "only one PWM controller supported\n"); return (ENXIO); } sc = device_get_softc(dev); sc->sc_dev = dev; sc->clkman = devclass_get_device(devclass_find("bcm2835_clkman"), 0); if (sc->clkman == NULL) { device_printf(dev, "cannot find Clock Manager\n"); return (ENXIO); } rid = 0; sc->sc_mem_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE); if (!sc->sc_mem_res) { device_printf(dev, "cannot allocate memory window\n"); return (ENXIO); } sc->sc_m_bst = rman_get_bustag(sc->sc_mem_res); sc->sc_m_bsh = rman_get_bushandle(sc->sc_mem_res); /* Add sysctl nodes. */ bcm_pwm_sysctl_init(sc); sc->freq = 125000000; sc->period = 10000; sc->ratio = 2500; return (bus_generic_attach(dev)); } static int bcm_pwm_detach(device_t dev) { struct bcm_pwm_softc *sc; bus_generic_detach(dev); sc = device_get_softc(dev); sc->mode = 0; (void)bcm_pwm_reconf(sc); if (sc->sc_mem_res) bus_release_resource(dev, SYS_RES_MEMORY, 0, sc->sc_mem_res); return (0); } static phandle_t bcm_pwm_get_node(device_t bus, device_t dev) { return (ofw_bus_get_node(bus)); } static device_method_t bcm_pwm_methods[] = { /* Device interface */ DEVMETHOD(device_probe, bcm_pwm_probe), DEVMETHOD(device_attach, bcm_pwm_attach), DEVMETHOD(device_detach, bcm_pwm_detach), DEVMETHOD(ofw_bus_get_node, bcm_pwm_get_node), DEVMETHOD_END }; static devclass_t bcm_pwm_devclass; static driver_t bcm_pwm_driver = { "pwm", bcm_pwm_methods, sizeof(struct bcm_pwm_softc), }; DRIVER_MODULE(bcm2835_pwm, simplebus, bcm_pwm_driver, bcm_pwm_devclass, 0, 0); MODULE_DEPEND(bcm2835_pwm, bcm2835_clkman, 1, 1, 1); Index: user/markj/netdump/sys/arm/freescale/imx/imx_iomux.c =================================================================== --- user/markj/netdump/sys/arm/freescale/imx/imx_iomux.c (revision 332407) +++ user/markj/netdump/sys/arm/freescale/imx/imx_iomux.c (revision 332408) @@ -1,330 +1,330 @@ /*- * Copyright (c) 2014 Ian Lepore * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ /* * Pin mux and pad control driver for imx5 and imx6. * * This driver implements the fdt_pinctrl interface for configuring the gpio and * peripheral pins based on fdt configuration data. * * When the driver attaches, it walks the entire fdt tree and automatically * configures the pins for each device which has a pinctrl-0 property and whose * status is "okay". In addition it implements the fdt_pinctrl_configure() * method which any other driver can call at any time to reconfigure its pins. * * The nature of the fsl,pins property in fdt data makes this driver's job very * easy. Instead of representing each pin and pad configuration using symbolic * properties such as pullup-enable="true" and so on, the data simply contains * the addresses of the registers that control the pins, and the raw values to * store in those registers. * * The imx5 and imx6 SoCs also have a small number of "general purpose * registers" in the iomuxc device which are used to control an assortment * of completely unrelated aspects of SoC behavior. This driver provides other * drivers with direct access to those registers via simple accessor functions. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct iomux_softc { device_t dev; struct resource *mem_res; u_int last_gpregaddr; }; static struct iomux_softc *iomux_sc; static struct ofw_compat_data compat_data[] = { {"fsl,imx6dl-iomuxc", true}, {"fsl,imx6q-iomuxc", true}, {"fsl,imx6sl-iomuxc", true}, {"fsl,imx6ul-iomuxc", true}, {"fsl,imx6sx-iomuxc", true}, {"fsl,imx53-iomuxc", true}, {"fsl,imx51-iomuxc", true}, {NULL, false}, }; /* * Each tuple in an fsl,pins property contains these fields. */ struct pincfg { uint32_t mux_reg; uint32_t padconf_reg; uint32_t input_reg; uint32_t mux_val; uint32_t input_val; uint32_t padconf_val; }; #define PADCONF_NONE (1U << 31) /* Do not configure pad. */ #define PADCONF_SION (1U << 30) /* Force SION bit in mux register. */ #define PADMUX_SION (1U << 4) /* The SION bit in the mux register. */ static inline uint32_t RD4(struct iomux_softc *sc, bus_size_t off) { return (bus_read_4(sc->mem_res, off)); } static inline void WR4(struct iomux_softc *sc, bus_size_t off, uint32_t val) { bus_write_4(sc->mem_res, off, val); } static void iomux_configure_input(struct iomux_softc *sc, uint32_t reg, uint32_t val) { u_int select, mask, shift, width; /* If register and value are zero, there is nothing to configure. */ if (reg == 0 && val == 0) return; /* * If the config value has 0xff in the high byte it is encoded: * 31 23 15 7 0 * | 0xff | shift | width | select | * We need to mask out the old select value and OR in the new, using a * mask of the given width and shifting the values up by shift. 
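* For illustration (a hypothetical value, not taken from any DTS): an encoded input_val of 0xff110302 means select = 0x02, width = 0x03 and shift = 0x11, so mask = ((1 << 3) - 1) << 17 = 0x000e0000 and the register is rewritten as (old_val & ~0x000e0000) | (2 << 17).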
*/ if ((val & 0xff000000) == 0xff000000) { select = val & 0x000000ff; width = (val & 0x0000ff00) >> 8; shift = (val & 0x00ff0000) >> 16; mask = ((1u << width) - 1) << shift; val = (RD4(sc, reg) & ~mask) | (select << shift); } WR4(sc, reg, val); } static int iomux_configure_pins(device_t dev, phandle_t cfgxref) { struct iomux_softc *sc; struct pincfg *cfgtuples, *cfg; phandle_t cfgnode; int i, ntuples; uint32_t sion; sc = device_get_softc(dev); cfgnode = OF_node_from_xref(cfgxref); - ntuples = OF_getencprop_alloc(cfgnode, "fsl,pins", sizeof(*cfgtuples), - (void **)&cfgtuples); + ntuples = OF_getencprop_alloc_multi(cfgnode, "fsl,pins", + sizeof(*cfgtuples), (void **)&cfgtuples); if (ntuples < 0) return (ENOENT); if (ntuples == 0) return (0); /* Empty property is not an error. */ for (i = 0, cfg = cfgtuples; i < ntuples; i++, cfg++) { sion = (cfg->padconf_val & PADCONF_SION) ? PADMUX_SION : 0; WR4(sc, cfg->mux_reg, cfg->mux_val | sion); iomux_configure_input(sc, cfg->input_reg, cfg->input_val); if ((cfg->padconf_val & PADCONF_NONE) == 0) WR4(sc, cfg->padconf_reg, cfg->padconf_val); if (bootverbose) { char name[32]; OF_getprop(cfgnode, "name", &name, sizeof(name)); printf("%16s: muxreg 0x%04x muxval 0x%02x " "inpreg 0x%04x inpval 0x%02x " "padreg 0x%04x padval 0x%08x\n", name, cfg->mux_reg, cfg->mux_val | sion, cfg->input_reg, cfg->input_val, cfg->padconf_reg, cfg->padconf_val); } } OF_prop_free(cfgtuples); return (0); } static int iomux_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (!ofw_bus_search_compatible(dev, compat_data)->ocd_data) return (ENXIO); device_set_desc(dev, "Freescale i.MX pin configuration"); return (BUS_PROBE_DEFAULT); } static int iomux_detach(device_t dev) { /* This device is always present. */ return (EBUSY); } static int iomux_attach(device_t dev) { struct iomux_softc * sc; int rid; sc = device_get_softc(dev); sc->dev = dev; switch (imx_soc_type()) { case IMXSOC_51: sc->last_gpregaddr = 1 * sizeof(uint32_t); break; case IMXSOC_53: sc->last_gpregaddr = 2 * sizeof(uint32_t); break; case IMXSOC_6DL: case IMXSOC_6S: case IMXSOC_6SL: case IMXSOC_6Q: sc->last_gpregaddr = 13 * sizeof(uint32_t); break; case IMXSOC_6UL: sc->last_gpregaddr = 14 * sizeof(uint32_t); break; default: device_printf(dev, "Unknown SoC type\n"); return (ENXIO); } rid = 0; sc->mem_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE); if (sc->mem_res == NULL) { device_printf(dev, "Cannot allocate memory resources\n"); return (ENXIO); } iomux_sc = sc; /* * Register as a pinctrl device, and call the convenience function that * walks the entire device tree invoking FDT_PINCTRL_CONFIGURE() on any * pinctrl-0 property cells whose xref phandle refers to a configuration * that is a child node of our node in the tree. * * The pinctrl bindings documentation specifically mentions that the * pinctrl device itself may have a pinctrl-0 property which contains * static configuration to be applied at device init time. The tree * walk will automatically handle this for us when it passes through our * node in the tree. 
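* As a hedged usage sketch (assuming the standard fdt_pinctrl KPI): a client driver that needs to switch pin states at runtime would call fdt_pinctrl_configure(client_dev, n), with n indexing its own pinctrl-n properties; each referenced configuration node then lands in iomux_configure_pins() above.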
*/ fdt_pinctrl_register(dev, "fsl,pins"); fdt_pinctrl_configure_tree(dev); return (0); } uint32_t imx_iomux_gpr_get(u_int regaddr) { struct iomux_softc * sc; sc = iomux_sc; KASSERT(sc != NULL, ("%s called before attach", __FUNCTION__)); KASSERT(regaddr >= 0 && regaddr <= sc->last_gpregaddr, ("%s bad regaddr %u, max %u", __FUNCTION__, regaddr, sc->last_gpregaddr)); return (RD4(iomux_sc, regaddr)); } void imx_iomux_gpr_set(u_int regaddr, uint32_t val) { struct iomux_softc * sc; sc = iomux_sc; KASSERT(sc != NULL, ("%s called before attach", __FUNCTION__)); KASSERT(regaddr >= 0 && regaddr <= sc->last_gpregaddr, ("%s bad regaddr %u, max %u", __FUNCTION__, regaddr, sc->last_gpregaddr)); WR4(iomux_sc, regaddr, val); } void imx_iomux_gpr_set_masked(u_int regaddr, uint32_t clrbits, uint32_t setbits) { struct iomux_softc * sc; uint32_t val; sc = iomux_sc; KASSERT(sc != NULL, ("%s called before attach", __FUNCTION__)); KASSERT(regaddr >= 0 && regaddr <= sc->last_gpregaddr, ("%s bad regaddr %u, max %u", __FUNCTION__, regaddr, sc->last_gpregaddr)); val = RD4(iomux_sc, regaddr); val = (val & ~clrbits) | setbits; WR4(iomux_sc, regaddr, val); } static device_method_t imx_iomux_methods[] = { /* Device interface */ DEVMETHOD(device_probe, iomux_probe), DEVMETHOD(device_attach, iomux_attach), DEVMETHOD(device_detach, iomux_detach), /* fdt_pinctrl interface */ DEVMETHOD(fdt_pinctrl_configure,iomux_configure_pins), DEVMETHOD_END }; static driver_t imx_iomux_driver = { "imx_iomux", imx_iomux_methods, sizeof(struct iomux_softc), }; static devclass_t imx_iomux_devclass; EARLY_DRIVER_MODULE(imx_iomux, simplebus, imx_iomux_driver, imx_iomux_devclass, 0, 0, BUS_PASS_CPU + BUS_PASS_ORDER_LATE); Index: user/markj/netdump/sys/arm/mv/mv_common.c =================================================================== --- user/markj/netdump/sys/arm/mv/mv_common.c (revision 332407) +++ user/markj/netdump/sys/arm/mv/mv_common.c (revision 332408) @@ -1,3037 +1,3055 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (C) 2008-2011 MARVELL INTERNATIONAL LTD. * All rights reserved. * * Developed by Semihalf. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of MARVELL nor the names of contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE.
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include MALLOC_DEFINE(M_IDMA, "idma", "idma dma test memory"); #define IDMA_DEBUG #undef IDMA_DEBUG #define MAX_CPU_WIN 5 #ifdef DEBUG #define debugf(fmt, args...) do { printf("%s(): ", __func__); \ printf(fmt,##args); } while (0) #else #define debugf(fmt, args...) #endif #ifdef DEBUG #define MV_DUMP_WIN 1 #else #define MV_DUMP_WIN 0 #endif struct soc_node_spec; static enum soc_family soc_family; static int mv_win_cesa_attr(int wng_sel); static int mv_win_cesa_attr_armv5(int eng_sel); static int mv_win_cesa_attr_armada38x(int eng_sel); static int mv_win_cesa_attr_armadaxp(int eng_sel); uint32_t read_cpu_ctrl_armv5(uint32_t reg); uint32_t read_cpu_ctrl_armv7(uint32_t reg); void write_cpu_ctrl_armv5(uint32_t reg, uint32_t val); void write_cpu_ctrl_armv7(uint32_t reg, uint32_t val); static int win_eth_can_remap(int i); static int decode_win_cesa_valid(void); static int decode_win_cpu_valid(void); static int decode_win_usb_valid(void); static int decode_win_usb3_valid(void); static int decode_win_eth_valid(void); static int decode_win_pcie_valid(void); static int decode_win_sata_valid(void); static int decode_win_sdhci_valid(void); static int decode_win_idma_valid(void); static int decode_win_xor_valid(void); static void decode_win_cpu_setup(void); static int decode_win_sdram_fixup(void); static void decode_win_cesa_setup(u_long); +static void decode_win_a38x_cesa_setup(u_long); static void decode_win_usb_setup(u_long); static void decode_win_usb3_setup(u_long); static void decode_win_eth_setup(u_long); static void decode_win_neta_setup(u_long); static void decode_win_sata_setup(u_long); static void decode_win_ahci_setup(u_long); static void decode_win_sdhci_setup(u_long); static void decode_win_idma_setup(u_long); static void decode_win_xor_setup(u_long); static void decode_win_cesa_dump(u_long); +static void decode_win_a38x_cesa_dump(u_long); static void decode_win_usb_dump(u_long); static void decode_win_usb3_dump(u_long); static void decode_win_eth_dump(u_long base); static void decode_win_neta_dump(u_long base); static void decode_win_idma_dump(u_long base); static void decode_win_xor_dump(u_long base); static void decode_win_ahci_dump(u_long base); static void decode_win_sdhci_dump(u_long); static void decode_win_pcie_dump(u_long); static uint32_t win_cpu_cr_read(int); static uint32_t win_cpu_armv5_cr_read(int); static uint32_t win_cpu_armv7_cr_read(int); static uint32_t win_cpu_br_read(int); static uint32_t win_cpu_armv5_br_read(int); static uint32_t win_cpu_armv7_br_read(int); static uint32_t win_cpu_remap_l_read(int); static uint32_t win_cpu_armv5_remap_l_read(int); static uint32_t win_cpu_armv7_remap_l_read(int); static uint32_t win_cpu_remap_h_read(int); static uint32_t win_cpu_armv5_remap_h_read(int); static uint32_t win_cpu_armv7_remap_h_read(int); static void win_cpu_cr_write(int, uint32_t); static void win_cpu_armv5_cr_write(int, uint32_t); static void win_cpu_armv7_cr_write(int, uint32_t); static void win_cpu_br_write(int, uint32_t); static void win_cpu_armv5_br_write(int, uint32_t); static void win_cpu_armv7_br_write(int, uint32_t); static void win_cpu_remap_l_write(int, uint32_t); static void win_cpu_armv5_remap_l_write(int, uint32_t); static void win_cpu_armv7_remap_l_write(int, uint32_t); static void win_cpu_remap_h_write(int, uint32_t); static void win_cpu_armv5_remap_h_write(int, uint32_t); 
static void win_cpu_armv7_remap_h_write(int, uint32_t); static uint32_t ddr_br_read(int); static uint32_t ddr_sz_read(int); static uint32_t ddr_armv5_br_read(int); static uint32_t ddr_armv5_sz_read(int); static uint32_t ddr_armv7_br_read(int); static uint32_t ddr_armv7_sz_read(int); static void ddr_br_write(int, uint32_t); static void ddr_sz_write(int, uint32_t); static void ddr_armv5_br_write(int, uint32_t); static void ddr_armv5_sz_write(int, uint32_t); static void ddr_armv7_br_write(int, uint32_t); static void ddr_armv7_sz_write(int, uint32_t); static int fdt_get_ranges(const char *, void *, int, int *, int *); int gic_decode_fdt(phandle_t iparent, pcell_t *intr, int *interrupt, int *trig, int *pol); static int win_cpu_from_dt(void); static int fdt_win_setup(void); static int fdt_win_process_child(phandle_t, struct soc_node_spec *, const char*); static uint32_t dev_mask = 0; static int cpu_wins_no = 0; static int eth_port = 0; static int usb_port = 0; static boolean_t platform_io_coherent = false; static struct decode_win cpu_win_tbl[MAX_CPU_WIN]; const struct decode_win *cpu_wins = cpu_win_tbl; typedef void (*decode_win_setup_t)(u_long); typedef void (*dump_win_t)(u_long); typedef int (*valid_t)(void); /* * The power status of device feature is only supported on * Kirkwood and Discovery SoCs. */ #if defined(SOC_MV_KIRKWOOD) || defined(SOC_MV_DISCOVERY) #define SOC_MV_POWER_STAT_SUPPORTED 1 #else #define SOC_MV_POWER_STAT_SUPPORTED 0 #endif struct soc_node_spec { const char *compat; decode_win_setup_t decode_handler; dump_win_t dump_handler; valid_t valid_handler; }; static struct soc_node_spec soc_nodes[] = { { "mrvl,ge", &decode_win_eth_setup, &decode_win_eth_dump, &decode_win_eth_valid}, { "marvell,armada-370-neta", &decode_win_neta_setup, &decode_win_neta_dump, NULL }, { "mrvl,usb-ehci", &decode_win_usb_setup, &decode_win_usb_dump, &decode_win_usb_valid}, { "marvell,orion-ehci", &decode_win_usb_setup, &decode_win_usb_dump, &decode_win_usb_valid }, { "marvell,armada-380-xhci", &decode_win_usb3_setup, &decode_win_usb3_dump, &decode_win_usb3_valid }, { "marvell,armada-380-ahci", &decode_win_ahci_setup, &decode_win_ahci_dump, NULL }, { "marvell,armada-380-sdhci", &decode_win_sdhci_setup, &decode_win_sdhci_dump, &decode_win_sdhci_valid}, { "mrvl,sata", &decode_win_sata_setup, NULL, &decode_win_sata_valid}, { "mrvl,xor", &decode_win_xor_setup, &decode_win_xor_dump, &decode_win_xor_valid}, { "mrvl,idma", &decode_win_idma_setup, &decode_win_idma_dump, &decode_win_idma_valid}, { "mrvl,cesa", &decode_win_cesa_setup, &decode_win_cesa_dump, &decode_win_cesa_valid}, { "mrvl,pcie", &decode_win_pcie_setup, &decode_win_pcie_dump, &decode_win_pcie_valid}, + { "marvell,armada-38x-crypto", &decode_win_a38x_cesa_setup, + &decode_win_a38x_cesa_dump, &decode_win_cesa_valid}, { NULL, NULL, NULL, NULL }, }; #define SOC_NODE_PCIE_ENTRY_IDX 11 typedef uint32_t(*read_cpu_ctrl_t)(uint32_t); typedef void(*write_cpu_ctrl_t)(uint32_t, uint32_t); typedef uint32_t (*win_read_t)(int); typedef void (*win_write_t)(int, uint32_t); typedef int (*win_cesa_attr_t)(int); typedef uint32_t (*get_t)(void); struct decode_win_spec { read_cpu_ctrl_t read_cpu_ctrl; write_cpu_ctrl_t write_cpu_ctrl; win_read_t cr_read; win_read_t br_read; win_read_t remap_l_read; win_read_t remap_h_read; win_write_t cr_write; win_write_t br_write; win_write_t remap_l_write; win_write_t remap_h_write; uint32_t mv_win_cpu_max; win_cesa_attr_t win_cesa_attr; int win_cesa_target; win_read_t ddr_br_read; win_read_t ddr_sz_read; win_write_t ddr_br_write; 
win_write_t ddr_sz_write; #if __ARM_ARCH >= 6 get_t get_tclk; get_t get_cpu_freq; #endif }; struct decode_win_spec *soc_decode_win_spec; static struct decode_win_spec decode_win_specs[] = { { &read_cpu_ctrl_armv7, &write_cpu_ctrl_armv7, &win_cpu_armv7_cr_read, &win_cpu_armv7_br_read, &win_cpu_armv7_remap_l_read, &win_cpu_armv7_remap_h_read, &win_cpu_armv7_cr_write, &win_cpu_armv7_br_write, &win_cpu_armv7_remap_l_write, &win_cpu_armv7_remap_h_write, MV_WIN_CPU_MAX_ARMV7, &mv_win_cesa_attr_armada38x, MV_WIN_CESA_TARGET_ARMADA38X, &ddr_armv7_br_read, &ddr_armv7_sz_read, &ddr_armv7_br_write, &ddr_armv7_sz_write, #if __ARM_ARCH >= 6 &get_tclk_armada38x, &get_cpu_freq_armada38x, #endif }, { &read_cpu_ctrl_armv7, &write_cpu_ctrl_armv7, &win_cpu_armv7_cr_read, &win_cpu_armv7_br_read, &win_cpu_armv7_remap_l_read, &win_cpu_armv7_remap_h_read, &win_cpu_armv7_cr_write, &win_cpu_armv7_br_write, &win_cpu_armv7_remap_l_write, &win_cpu_armv7_remap_h_write, MV_WIN_CPU_MAX_ARMV7, &mv_win_cesa_attr_armadaxp, MV_WIN_CESA_TARGET_ARMADAXP, &ddr_armv7_br_read, &ddr_armv7_sz_read, &ddr_armv7_br_write, &ddr_armv7_sz_write, #if __ARM_ARCH >= 6 &get_tclk_armadaxp, &get_cpu_freq_armadaxp, #endif }, { &read_cpu_ctrl_armv5, &write_cpu_ctrl_armv5, &win_cpu_armv5_cr_read, &win_cpu_armv5_br_read, &win_cpu_armv5_remap_l_read, &win_cpu_armv5_remap_h_read, &win_cpu_armv5_cr_write, &win_cpu_armv5_br_write, &win_cpu_armv5_remap_l_write, &win_cpu_armv5_remap_h_write, MV_WIN_CPU_MAX, &mv_win_cesa_attr_armv5, MV_WIN_CESA_TARGET, &ddr_armv5_br_read, &ddr_armv5_sz_read, &ddr_armv5_br_write, &ddr_armv5_sz_write, #if __ARM_ARCH >= 6 NULL, NULL, #endif }, }; struct fdt_pm_mask_entry { char *compat; uint32_t mask; }; static struct fdt_pm_mask_entry fdt_pm_mask_table[] = { { "mrvl,ge", CPU_PM_CTRL_GE(0) }, { "mrvl,ge", CPU_PM_CTRL_GE(1) }, { "mrvl,usb-ehci", CPU_PM_CTRL_USB(0) }, { "mrvl,usb-ehci", CPU_PM_CTRL_USB(1) }, { "mrvl,usb-ehci", CPU_PM_CTRL_USB(2) }, { "mrvl,xor", CPU_PM_CTRL_XOR }, { "mrvl,sata", CPU_PM_CTRL_SATA }, { NULL, 0 } }; static __inline int pm_is_disabled(uint32_t mask) { #if SOC_MV_POWER_STAT_SUPPORTED return (soc_power_ctrl_get(mask) == mask ? 0 : 1); #else return (0); #endif } /* * Disable device using power management register. * 1 - Device Power On * 0 - Device Power Off * Mask can be set in loader. * EXAMPLE: * loader> set hw.pm-disable-mask=0x2 * * Common mask: * |-------------------------------| * | Device | Kirkwood | Discovery | * |-------------------------------| * | USB0 | 0x00008 | 0x020000 | * |-------------------------------| * | USB1 | - | 0x040000 | * |-------------------------------| * | USB2 | - | 0x080000 | * |-------------------------------| * | GE0 | 0x00001 | 0x000002 | * |-------------------------------| * | GE1 | - | 0x000004 | * |-------------------------------| * | IDMA | - | 0x100000 | * |-------------------------------| * | XOR | 0x10000 | 0x200000 | * |-------------------------------| * | CESA | 0x20000 | 0x400000 | * |-------------------------------| * | SATA | 0x04000 | 0x004000 | * --------------------------------| * This feature can be used only on Kirkwood and Discovery * machines. 
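* For example, powering off both USB0 and XOR on a Kirkwood board means OR-ing the masks from the table above, 0x00008 | 0x10000 = 0x10008: * loader> set hw.pm-disable-mask=0x10008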
*/ static int mv_win_cesa_attr(int eng_sel) { if (soc_decode_win_spec->win_cesa_attr != NULL) return (soc_decode_win_spec->win_cesa_attr(eng_sel)); return (-1); } static int mv_win_cesa_attr_armv5(int eng_sel) { return MV_WIN_CESA_ATTR(eng_sel); } static int mv_win_cesa_attr_armada38x(int eng_sel) { return MV_WIN_CESA_ATTR_ARMADA38X(eng_sel); } static int mv_win_cesa_attr_armadaxp(int eng_sel) { return MV_WIN_CESA_ATTR_ARMADAXP(eng_sel); } enum soc_family mv_check_soc_family() { uint32_t dev, rev; soc_id(&dev, &rev); switch (dev) { case MV_DEV_MV78230: case MV_DEV_MV78260: case MV_DEV_MV78460: soc_decode_win_spec = &decode_win_specs[MV_SOC_ARMADA_XP]; soc_family = MV_SOC_ARMADA_XP; return (MV_SOC_ARMADA_XP); case MV_DEV_88F6828: case MV_DEV_88F6820: case MV_DEV_88F6810: soc_decode_win_spec = &decode_win_specs[MV_SOC_ARMADA_38X]; soc_family = MV_SOC_ARMADA_38X; return (MV_SOC_ARMADA_38X); case MV_DEV_88F5181: case MV_DEV_88F5182: case MV_DEV_88F5281: case MV_DEV_88F6281: case MV_DEV_88RC8180: case MV_DEV_88RC9480: case MV_DEV_88RC9580: case MV_DEV_88F6781: case MV_DEV_88F6282: case MV_DEV_MV78100_Z0: case MV_DEV_MV78100: case MV_DEV_MV78160: soc_decode_win_spec = &decode_win_specs[MV_SOC_ARMV5]; soc_family = MV_SOC_ARMV5; return (MV_SOC_ARMV5); default: soc_family = MV_SOC_UNSUPPORTED; return (MV_SOC_UNSUPPORTED); } } static __inline void pm_disable_device(int mask) { #ifdef DIAGNOSTIC uint32_t reg; reg = soc_power_ctrl_get(CPU_PM_CTRL_ALL); printf("Power Management Register: 0x%x\n", reg); reg &= ~mask; soc_power_ctrl_set(reg); printf("Device %x is disabled\n", mask); reg = soc_power_ctrl_get(CPU_PM_CTRL_ALL); printf("Power Management Register: 0x%x\n", reg); #endif } int mv_fdt_is_type(phandle_t node, const char *typestr) { #define FDT_TYPE_LEN 64 char type[FDT_TYPE_LEN]; if (OF_getproplen(node, "device_type") <= 0) return (0); if (OF_getprop(node, "device_type", type, FDT_TYPE_LEN) < 0) return (0); if (strncasecmp(type, typestr, FDT_TYPE_LEN) == 0) /* This fits.
*/ return (1); return (0); #undef FDT_TYPE_LEN } int mv_fdt_pm(phandle_t node) { uint32_t cpu_pm_ctrl; int i, ena, compat; ena = 1; cpu_pm_ctrl = read_cpu_ctrl(CPU_PM_CTRL); for (i = 0; fdt_pm_mask_table[i].compat != NULL; i++) { if (dev_mask & (1 << i)) continue; compat = ofw_bus_node_is_compatible(node, fdt_pm_mask_table[i].compat); #if defined(SOC_MV_KIRKWOOD) if (compat && (cpu_pm_ctrl & fdt_pm_mask_table[i].mask)) { dev_mask |= (1 << i); ena = 0; break; } else if (compat) { dev_mask |= (1 << i); break; } #else if (compat && (~cpu_pm_ctrl & fdt_pm_mask_table[i].mask)) { dev_mask |= (1 << i); ena = 0; break; } else if (compat) { dev_mask |= (1 << i); break; } #endif } return (ena); } uint32_t read_cpu_ctrl(uint32_t reg) { if (soc_decode_win_spec->read_cpu_ctrl != NULL) return (soc_decode_win_spec->read_cpu_ctrl(reg)); return (-1); } uint32_t read_cpu_ctrl_armv5(uint32_t reg) { return (bus_space_read_4(fdtbus_bs_tag, MV_CPU_CONTROL_BASE, reg)); } uint32_t read_cpu_ctrl_armv7(uint32_t reg) { return (bus_space_read_4(fdtbus_bs_tag, MV_CPU_CONTROL_BASE_ARMV7, reg)); } void write_cpu_ctrl(uint32_t reg, uint32_t val) { if (soc_decode_win_spec->write_cpu_ctrl != NULL) soc_decode_win_spec->write_cpu_ctrl(reg, val); } void write_cpu_ctrl_armv5(uint32_t reg, uint32_t val) { bus_space_write_4(fdtbus_bs_tag, MV_CPU_CONTROL_BASE, reg, val); } void write_cpu_ctrl_armv7(uint32_t reg, uint32_t val) { bus_space_write_4(fdtbus_bs_tag, MV_CPU_CONTROL_BASE_ARMV7, reg, val); } uint32_t read_cpu_mp_clocks(uint32_t reg) { return (bus_space_read_4(fdtbus_bs_tag, MV_MP_CLOCKS_BASE, reg)); } void write_cpu_mp_clocks(uint32_t reg, uint32_t val) { bus_space_write_4(fdtbus_bs_tag, MV_MP_CLOCKS_BASE, reg, val); } uint32_t read_cpu_misc(uint32_t reg) { return (bus_space_read_4(fdtbus_bs_tag, MV_MISC_BASE, reg)); } void write_cpu_misc(uint32_t reg, uint32_t val) { bus_space_write_4(fdtbus_bs_tag, MV_MISC_BASE, reg, val); } uint32_t cpu_extra_feat(void) { uint32_t dev, rev; uint32_t ef = 0; soc_id(&dev, &rev); switch (dev) { case MV_DEV_88F6281: case MV_DEV_88F6282: case MV_DEV_88RC8180: case MV_DEV_MV78100_Z0: case MV_DEV_MV78100: __asm __volatile("mrc p15, 1, %0, c15, c1, 0" : "=r" (ef)); break; case MV_DEV_88F5182: case MV_DEV_88F5281: __asm __volatile("mrc p15, 0, %0, c14, c0, 0" : "=r" (ef)); break; default: if (bootverbose) printf("This ARM Core does not support any extra features\n"); } return (ef); } /* * Get the power status of device. This feature is only supported on * Kirkwood and Discovery SoCs. */ uint32_t soc_power_ctrl_get(uint32_t mask) { #if SOC_MV_POWER_STAT_SUPPORTED if (mask != CPU_PM_CTRL_NONE) mask &= read_cpu_ctrl(CPU_PM_CTRL); return (mask); #else return (mask); #endif } /* * Set the power status of device. This feature is only supported on * Kirkwood and Discovery SoCs. */ void soc_power_ctrl_set(uint32_t mask) { #if !defined(SOC_MV_ORION) if (mask != CPU_PM_CTRL_NONE) write_cpu_ctrl(CPU_PM_CTRL, mask); #endif } void soc_id(uint32_t *dev, uint32_t *rev) { uint64_t mv_pcie_base = MV_PCIE_BASE; phandle_t node; /* * Notice: system identifiers are available in the registers range of * PCIE controller, so using this function is only allowed (and * possible) after the internal registers range has been mapped in via * devmap_bootstrap(). 
*/ *dev = 0; *rev = 0; if ((node = OF_finddevice("/")) == -1) return; if (ofw_bus_node_is_compatible(node, "marvell,armada380")) mv_pcie_base = MV_PCIE_BASE_ARMADA38X; *dev = bus_space_read_4(fdtbus_bs_tag, mv_pcie_base, 0) >> 16; *rev = bus_space_read_4(fdtbus_bs_tag, mv_pcie_base, 8) & 0xff; } static void soc_identify(void) { uint32_t d, r, size, mode, freq; const char *dev; const char *rev; soc_id(&d, &r); printf("SOC: "); if (bootverbose) printf("(0x%4x:0x%02x) ", d, r); rev = ""; switch (d) { case MV_DEV_88F5181: dev = "Marvell 88F5181"; if (r == 3) rev = "B1"; break; case MV_DEV_88F5182: dev = "Marvell 88F5182"; if (r == 2) rev = "A2"; break; case MV_DEV_88F5281: dev = "Marvell 88F5281"; if (r == 4) rev = "D0"; else if (r == 5) rev = "D1"; else if (r == 6) rev = "D2"; break; case MV_DEV_88F6281: dev = "Marvell 88F6281"; if (r == 0) rev = "Z0"; else if (r == 2) rev = "A0"; else if (r == 3) rev = "A1"; break; case MV_DEV_88RC8180: dev = "Marvell 88RC8180"; break; case MV_DEV_88RC9480: dev = "Marvell 88RC9480"; break; case MV_DEV_88RC9580: dev = "Marvell 88RC9580"; break; case MV_DEV_88F6781: dev = "Marvell 88F6781"; if (r == 2) rev = "Y0"; break; case MV_DEV_88F6282: dev = "Marvell 88F6282"; if (r == 0) rev = "A0"; else if (r == 1) rev = "A1"; break; case MV_DEV_88F6828: dev = "Marvell 88F6828"; break; case MV_DEV_88F6820: dev = "Marvell 88F6820"; break; case MV_DEV_88F6810: dev = "Marvell 88F6810"; break; case MV_DEV_MV78100_Z0: dev = "Marvell MV78100 Z0"; break; case MV_DEV_MV78100: dev = "Marvell MV78100"; break; case MV_DEV_MV78160: dev = "Marvell MV78160"; break; case MV_DEV_MV78260: dev = "Marvell MV78260"; break; case MV_DEV_MV78460: dev = "Marvell MV78460"; break; default: dev = "UNKNOWN"; break; } printf("%s", dev); if (*rev != '\0') printf(" rev %s", rev); printf(", TClock %dMHz", get_tclk() / 1000 / 1000); freq = get_cpu_freq(); if (freq != 0) printf(", Frequency %dMHz", freq / 1000 / 1000); printf("\n"); mode = read_cpu_ctrl(CPU_CONFIG); printf(" Instruction cache prefetch %s, data cache prefetch %s\n", (mode & CPU_CONFIG_IC_PREF) ? "enabled" : "disabled", (mode & CPU_CONFIG_DC_PREF) ? "enabled" : "disabled"); switch (d) { case MV_DEV_88F6281: case MV_DEV_88F6282: mode = read_cpu_ctrl(CPU_L2_CONFIG) & CPU_L2_CONFIG_MODE; printf(" 256KB 4-way set-associative %s unified L2 cache\n", mode ? "write-through" : "write-back"); break; case MV_DEV_MV78100: mode = read_cpu_ctrl(CPU_CONTROL); size = mode & CPU_CONTROL_L2_SIZE; mode = mode & CPU_CONTROL_L2_MODE; printf(" %s set-associative %s unified L2 cache\n", size ? "256KB 4-way" : "512KB 8-way", mode ? "write-through" : "write-back"); break; default: break; } } static void platform_identify(void *dummy) { soc_identify(); /* * XXX Board identification e.g. read out from FPGA or similar should * go here */ } SYSINIT(platform_identify, SI_SUB_CPU, SI_ORDER_SECOND, platform_identify, NULL); #ifdef KDB static void mv_enter_debugger(void *dummy) { if (boothowto & RB_KDB) kdb_enter(KDB_WHY_BOOTFLAGS, "Boot flags requested debugger"); } SYSINIT(mv_enter_debugger, SI_SUB_CPU, SI_ORDER_ANY, mv_enter_debugger, NULL); #endif int soc_decode_win(void) { uint32_t dev, rev; int mask, err; mask = 0; TUNABLE_INT_FETCH("hw.pm-disable-mask", &mask); if (mask != 0) pm_disable_device(mask); /* Retrieve data about physical addresses from device tree. 
*/ if ((err = win_cpu_from_dt()) != 0) return (err); /* Retrieve our ID: some windows facilities vary between SoC models */ soc_id(&dev, &rev); if (soc_family == MV_SOC_ARMADA_XP) if ((err = decode_win_sdram_fixup()) != 0) return(err); decode_win_cpu_setup(); if (MV_DUMP_WIN) soc_dump_decode_win(); eth_port = 0; usb_port = 0; if ((err = fdt_win_setup()) != 0) return (err); return (0); } /************************************************************************** * Decode windows registers accessors **************************************************************************/ WIN_REG_IDX_RD(win_cpu_armv5, cr, MV_WIN_CPU_CTRL_ARMV5, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_RD(win_cpu_armv5, br, MV_WIN_CPU_BASE_ARMV5, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_RD(win_cpu_armv5, remap_l, MV_WIN_CPU_REMAP_LO_ARMV5, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_RD(win_cpu_armv5, remap_h, MV_WIN_CPU_REMAP_HI_ARMV5, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_WR(win_cpu_armv5, cr, MV_WIN_CPU_CTRL_ARMV5, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_WR(win_cpu_armv5, br, MV_WIN_CPU_BASE_ARMV5, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_WR(win_cpu_armv5, remap_l, MV_WIN_CPU_REMAP_LO_ARMV5, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_WR(win_cpu_armv5, remap_h, MV_WIN_CPU_REMAP_HI_ARMV5, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_RD(win_cpu_armv7, cr, MV_WIN_CPU_CTRL_ARMV7, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_RD(win_cpu_armv7, br, MV_WIN_CPU_BASE_ARMV7, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_RD(win_cpu_armv7, remap_l, MV_WIN_CPU_REMAP_LO_ARMV7, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_RD(win_cpu_armv7, remap_h, MV_WIN_CPU_REMAP_HI_ARMV7, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_WR(win_cpu_armv7, cr, MV_WIN_CPU_CTRL_ARMV7, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_WR(win_cpu_armv7, br, MV_WIN_CPU_BASE_ARMV7, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_WR(win_cpu_armv7, remap_l, MV_WIN_CPU_REMAP_LO_ARMV7, MV_MBUS_BRIDGE_BASE) WIN_REG_IDX_WR(win_cpu_armv7, remap_h, MV_WIN_CPU_REMAP_HI_ARMV7, MV_MBUS_BRIDGE_BASE) static uint32_t win_cpu_cr_read(int i) { if (soc_decode_win_spec->cr_read != NULL) return (soc_decode_win_spec->cr_read(i)); return (-1); } static uint32_t win_cpu_br_read(int i) { if (soc_decode_win_spec->br_read != NULL) return (soc_decode_win_spec->br_read(i)); return (-1); } static uint32_t win_cpu_remap_l_read(int i) { if (soc_decode_win_spec->remap_l_read != NULL) return (soc_decode_win_spec->remap_l_read(i)); return (-1); } static uint32_t win_cpu_remap_h_read(int i) { if (soc_decode_win_spec->remap_h_read != NULL) return soc_decode_win_spec->remap_h_read(i); return (-1); } static void win_cpu_cr_write(int i, uint32_t val) { if (soc_decode_win_spec->cr_write != NULL) soc_decode_win_spec->cr_write(i, val); } static void win_cpu_br_write(int i, uint32_t val) { if (soc_decode_win_spec->br_write != NULL) soc_decode_win_spec->br_write(i, val); } static void win_cpu_remap_l_write(int i, uint32_t val) { if (soc_decode_win_spec->remap_l_write != NULL) soc_decode_win_spec->remap_l_write(i, val); } static void win_cpu_remap_h_write(int i, uint32_t val) { if (soc_decode_win_spec->remap_h_write != NULL) soc_decode_win_spec->remap_h_write(i, val); } WIN_REG_BASE_IDX_RD(win_cesa, cr, MV_WIN_CESA_CTRL) WIN_REG_BASE_IDX_RD(win_cesa, br, MV_WIN_CESA_BASE) WIN_REG_BASE_IDX_WR(win_cesa, cr, MV_WIN_CESA_CTRL) WIN_REG_BASE_IDX_WR(win_cesa, br, MV_WIN_CESA_BASE) WIN_REG_BASE_IDX_RD(win_usb, cr, MV_WIN_USB_CTRL) WIN_REG_BASE_IDX_RD(win_usb, br, MV_WIN_USB_BASE) WIN_REG_BASE_IDX_WR(win_usb, cr, MV_WIN_USB_CTRL) WIN_REG_BASE_IDX_WR(win_usb, br, MV_WIN_USB_BASE) WIN_REG_BASE_IDX_RD(win_usb3, cr, MV_WIN_USB3_CTRL) WIN_REG_BASE_IDX_RD(win_usb3, br, 
MV_WIN_USB3_BASE) WIN_REG_BASE_IDX_WR(win_usb3, cr, MV_WIN_USB3_CTRL) WIN_REG_BASE_IDX_WR(win_usb3, br, MV_WIN_USB3_BASE) WIN_REG_BASE_IDX_RD(win_eth, br, MV_WIN_ETH_BASE) WIN_REG_BASE_IDX_RD(win_eth, sz, MV_WIN_ETH_SIZE) WIN_REG_BASE_IDX_RD(win_eth, har, MV_WIN_ETH_REMAP) WIN_REG_BASE_IDX_WR(win_eth, br, MV_WIN_ETH_BASE) WIN_REG_BASE_IDX_WR(win_eth, sz, MV_WIN_ETH_SIZE) WIN_REG_BASE_IDX_WR(win_eth, har, MV_WIN_ETH_REMAP) WIN_REG_BASE_IDX_RD2(win_xor, br, MV_WIN_XOR_BASE) WIN_REG_BASE_IDX_RD2(win_xor, sz, MV_WIN_XOR_SIZE) WIN_REG_BASE_IDX_RD2(win_xor, har, MV_WIN_XOR_REMAP) WIN_REG_BASE_IDX_RD2(win_xor, ctrl, MV_WIN_XOR_CTRL) WIN_REG_BASE_IDX_WR2(win_xor, br, MV_WIN_XOR_BASE) WIN_REG_BASE_IDX_WR2(win_xor, sz, MV_WIN_XOR_SIZE) WIN_REG_BASE_IDX_WR2(win_xor, har, MV_WIN_XOR_REMAP) WIN_REG_BASE_IDX_WR2(win_xor, ctrl, MV_WIN_XOR_CTRL) WIN_REG_BASE_RD(win_eth, bare, 0x290) WIN_REG_BASE_RD(win_eth, epap, 0x294) WIN_REG_BASE_WR(win_eth, bare, 0x290) WIN_REG_BASE_WR(win_eth, epap, 0x294) WIN_REG_BASE_IDX_RD(win_pcie, cr, MV_WIN_PCIE_CTRL); WIN_REG_BASE_IDX_RD(win_pcie, br, MV_WIN_PCIE_BASE); WIN_REG_BASE_IDX_RD(win_pcie, remap, MV_WIN_PCIE_REMAP); WIN_REG_BASE_IDX_WR(win_pcie, cr, MV_WIN_PCIE_CTRL); WIN_REG_BASE_IDX_WR(win_pcie, br, MV_WIN_PCIE_BASE); WIN_REG_BASE_IDX_WR(win_pcie, remap, MV_WIN_PCIE_REMAP); WIN_REG_BASE_IDX_RD(pcie_bar, br, MV_PCIE_BAR_BASE); WIN_REG_BASE_IDX_RD(pcie_bar, brh, MV_PCIE_BAR_BASE_H); WIN_REG_BASE_IDX_RD(pcie_bar, cr, MV_PCIE_BAR_CTRL); WIN_REG_BASE_IDX_WR(pcie_bar, br, MV_PCIE_BAR_BASE); WIN_REG_BASE_IDX_WR(pcie_bar, brh, MV_PCIE_BAR_BASE_H); WIN_REG_BASE_IDX_WR(pcie_bar, cr, MV_PCIE_BAR_CTRL); WIN_REG_BASE_IDX_RD(win_idma, br, MV_WIN_IDMA_BASE) WIN_REG_BASE_IDX_RD(win_idma, sz, MV_WIN_IDMA_SIZE) WIN_REG_BASE_IDX_RD(win_idma, har, MV_WIN_IDMA_REMAP) WIN_REG_BASE_IDX_RD(win_idma, cap, MV_WIN_IDMA_CAP) WIN_REG_BASE_IDX_WR(win_idma, br, MV_WIN_IDMA_BASE) WIN_REG_BASE_IDX_WR(win_idma, sz, MV_WIN_IDMA_SIZE) WIN_REG_BASE_IDX_WR(win_idma, har, MV_WIN_IDMA_REMAP) WIN_REG_BASE_IDX_WR(win_idma, cap, MV_WIN_IDMA_CAP) WIN_REG_BASE_RD(win_idma, bare, 0xa80) WIN_REG_BASE_WR(win_idma, bare, 0xa80) WIN_REG_BASE_IDX_RD(win_sata, cr, MV_WIN_SATA_CTRL); WIN_REG_BASE_IDX_RD(win_sata, br, MV_WIN_SATA_BASE); WIN_REG_BASE_IDX_WR(win_sata, cr, MV_WIN_SATA_CTRL); WIN_REG_BASE_IDX_WR(win_sata, br, MV_WIN_SATA_BASE); WIN_REG_BASE_IDX_RD(win_sata_armada38x, sz, MV_WIN_SATA_SIZE_ARMADA38X); WIN_REG_BASE_IDX_WR(win_sata_armada38x, sz, MV_WIN_SATA_SIZE_ARMADA38X); WIN_REG_BASE_IDX_RD(win_sata_armada38x, cr, MV_WIN_SATA_CTRL_ARMADA38X); WIN_REG_BASE_IDX_RD(win_sata_armada38x, br, MV_WIN_SATA_BASE_ARMADA38X); WIN_REG_BASE_IDX_WR(win_sata_armada38x, cr, MV_WIN_SATA_CTRL_ARMADA38X); WIN_REG_BASE_IDX_WR(win_sata_armada38x, br, MV_WIN_SATA_BASE_ARMADA38X); WIN_REG_BASE_IDX_RD(win_sdhci, cr, MV_WIN_SDHCI_CTRL); WIN_REG_BASE_IDX_RD(win_sdhci, br, MV_WIN_SDHCI_BASE); WIN_REG_BASE_IDX_WR(win_sdhci, cr, MV_WIN_SDHCI_CTRL); WIN_REG_BASE_IDX_WR(win_sdhci, br, MV_WIN_SDHCI_BASE); #ifndef SOC_MV_DOVE WIN_REG_IDX_RD(ddr_armv5, br, MV_WIN_DDR_BASE, MV_DDR_CADR_BASE) WIN_REG_IDX_RD(ddr_armv5, sz, MV_WIN_DDR_SIZE, MV_DDR_CADR_BASE) WIN_REG_IDX_WR(ddr_armv5, br, MV_WIN_DDR_BASE, MV_DDR_CADR_BASE) WIN_REG_IDX_WR(ddr_armv5, sz, MV_WIN_DDR_SIZE, MV_DDR_CADR_BASE) WIN_REG_IDX_RD(ddr_armv7, br, MV_WIN_DDR_BASE, MV_DDR_CADR_BASE_ARMV7) WIN_REG_IDX_RD(ddr_armv7, sz, MV_WIN_DDR_SIZE, MV_DDR_CADR_BASE_ARMV7) WIN_REG_IDX_WR(ddr_armv7, br, MV_WIN_DDR_BASE, MV_DDR_CADR_BASE_ARMV7) WIN_REG_IDX_WR(ddr_armv7, sz, MV_WIN_DDR_SIZE, 
MV_DDR_CADR_BASE_ARMV7) static inline uint32_t ddr_br_read(int i) { if (soc_decode_win_spec->ddr_br_read != NULL) return (soc_decode_win_spec->ddr_br_read(i)); return (-1); } static inline uint32_t ddr_sz_read(int i) { if (soc_decode_win_spec->ddr_sz_read != NULL) return (soc_decode_win_spec->ddr_sz_read(i)); return (-1); } static inline void ddr_br_write(int i, uint32_t val) { if (soc_decode_win_spec->ddr_br_write != NULL) soc_decode_win_spec->ddr_br_write(i, val); } static inline void ddr_sz_write(int i, uint32_t val) { if (soc_decode_win_spec->ddr_sz_write != NULL) soc_decode_win_spec->ddr_sz_write(i, val); } #else /* * On 88F6781 (Dove) SoC DDR Controller is accessed through * single MBUS <-> AXI bridge. In this case we provide emulated * ddr_br_read() and ddr_sz_read() functions to keep compatibility * with common decoding windows setup code. */ static inline uint32_t ddr_br_read(int i) { uint32_t mmap; /* Read Memory Address Map Register for CS i */ mmap = bus_space_read_4(fdtbus_bs_tag, MV_DDR_CADR_BASE + (i * 0x10), 0); /* Return CS i base address */ return (mmap & 0xFF000000); } static inline uint32_t ddr_sz_read(int i) { uint32_t mmap, size; /* Read Memory Address Map Register for CS i */ mmap = bus_space_read_4(fdtbus_bs_tag, MV_DDR_CADR_BASE + (i * 0x10), 0); /* Extract size of CS space in 64kB units */ size = (1 << ((mmap >> 16) & 0x0F)); /* Return CS size and enable/disable status */ return (((size - 1) << 16) | (mmap & 0x01)); } #endif /************************************************************************** * Decode windows helper routines **************************************************************************/ void soc_dump_decode_win(void) { int i; for (i = 0; i < soc_decode_win_spec->mv_win_cpu_max; i++) { printf("CPU window#%d: c 0x%08x, b 0x%08x", i, win_cpu_cr_read(i), win_cpu_br_read(i)); if (win_cpu_can_remap(i)) printf(", rl 0x%08x, rh 0x%08x", win_cpu_remap_l_read(i), win_cpu_remap_h_read(i)); printf("\n"); } printf("Internal regs base: 0x%08x\n", bus_space_read_4(fdtbus_bs_tag, MV_INTREGS_BASE, 0)); for (i = 0; i < MV_WIN_DDR_MAX; i++) printf("DDR CS#%d: b 0x%08x, s 0x%08x\n", i, ddr_br_read(i), ddr_sz_read(i)); } /************************************************************************** * CPU windows routines **************************************************************************/ int win_cpu_can_remap(int i) { uint32_t dev, rev; soc_id(&dev, &rev); /* Depending on the SoC certain windows have remap capability */ if ((dev == MV_DEV_88F5182 && i < 2) || (dev == MV_DEV_88F5281 && i < 4) || (dev == MV_DEV_88F6281 && i < 4) || (dev == MV_DEV_88F6282 && i < 4) || (dev == MV_DEV_88F6828 && i < 20) || (dev == MV_DEV_88F6820 && i < 20) || (dev == MV_DEV_88F6810 && i < 20) || (dev == MV_DEV_88RC8180 && i < 2) || (dev == MV_DEV_88F6781 && i < 4) || (dev == MV_DEV_MV78100_Z0 && i < 8) || ((dev & MV_DEV_FAMILY_MASK) == MV_DEV_DISCOVERY && i < 8)) return (1); return (0); } /* XXX This should check for overlapping remap fields too.. 
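* The test below is the usual interval rule: windows [b1, e1] and [b2, e2] are disjoint iff e1 < b2 or e2 < b1; the index of the first window for which neither inequality holds is returned, or -1 if no window overlaps.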
*/ int decode_win_overlap(int win, int win_no, const struct decode_win *wintab) { const struct decode_win *tab; int i; tab = wintab; for (i = 0; i < win_no; i++, tab++) { if (i == win) /* Skip self */ continue; if ((tab->base + tab->size - 1) < (wintab + win)->base) continue; else if (((wintab + win)->base + (wintab + win)->size - 1) < tab->base) continue; else return (i); } return (-1); } static int decode_win_cpu_valid(void) { int i, j, rv; uint32_t b, e, s; if (cpu_wins_no > soc_decode_win_spec->mv_win_cpu_max) { printf("CPU windows: too many entries: %d\n", cpu_wins_no); return (0); } rv = 1; for (i = 0; i < cpu_wins_no; i++) { if (cpu_wins[i].target == 0) { printf("CPU window#%d: DDR target window is not " "supposed to be reprogrammed!\n", i); rv = 0; } if (cpu_wins[i].remap != ~0 && win_cpu_can_remap(i) != 1) { printf("CPU window#%d: not capable of remapping, but " "val 0x%08x defined\n", i, cpu_wins[i].remap); rv = 0; } s = cpu_wins[i].size; b = cpu_wins[i].base; e = b + s - 1; if (s > (0xFFFFFFFF - b + 1)) { /* * XXX this boundary check should account for 64bit * and remapping.. */ printf("CPU window#%d: no space for size 0x%08x at " "0x%08x\n", i, s, b); rv = 0; continue; } if (b != rounddown2(b, s)) { printf("CPU window#%d: address 0x%08x is not aligned " "to 0x%08x\n", i, b, s); rv = 0; continue; } j = decode_win_overlap(i, cpu_wins_no, &cpu_wins[0]); if (j >= 0) { printf("CPU window#%d: (0x%08x - 0x%08x) overlaps " "with #%d (0x%08x - 0x%08x)\n", i, b, e, j, cpu_wins[j].base, cpu_wins[j].base + cpu_wins[j].size - 1); rv = 0; } } return (rv); } int decode_win_cpu_set(int target, int attr, vm_paddr_t base, uint32_t size, vm_paddr_t remap) { uint32_t br, cr; int win, i; if (remap == ~0) { win = soc_decode_win_spec->mv_win_cpu_max - 1; i = -1; } else { win = 0; i = 1; } while ((win >= 0) && (win < soc_decode_win_spec->mv_win_cpu_max)) { cr = win_cpu_cr_read(win); if ((cr & MV_WIN_CPU_ENABLE_BIT) == 0) break; if ((cr & ((0xff << MV_WIN_CPU_ATTR_SHIFT) | (0x1f << MV_WIN_CPU_TARGET_SHIFT))) == ((attr << MV_WIN_CPU_ATTR_SHIFT) | (target << MV_WIN_CPU_TARGET_SHIFT))) break; win += i; } if ((win < 0) || (win >= soc_decode_win_spec->mv_win_cpu_max) || ((remap != ~0) && (win_cpu_can_remap(win) == 0))) return (-1); br = base & 0xffff0000; win_cpu_br_write(win, br); if (win_cpu_can_remap(win)) { if (remap != ~0) { win_cpu_remap_l_write(win, remap & 0xffff0000); win_cpu_remap_h_write(win, 0); } else { /* * Remap function is not used for a given window * (capable of remapping) - set remap field with the * same value as base. */ win_cpu_remap_l_write(win, base & 0xffff0000); win_cpu_remap_h_write(win, 0); } } cr = ((size - 1) & 0xffff0000) | (attr << MV_WIN_CPU_ATTR_SHIFT) | (target << MV_WIN_CPU_TARGET_SHIFT) | MV_WIN_CPU_ENABLE_BIT; win_cpu_cr_write(win, cr); return (0); } static void decode_win_cpu_setup(void) { int i; /* Disable all CPU windows */ for (i = 0; i < soc_decode_win_spec->mv_win_cpu_max; i++) { win_cpu_cr_write(i, 0); win_cpu_br_write(i, 0); if (win_cpu_can_remap(i)) { win_cpu_remap_l_write(i, 0); win_cpu_remap_h_write(i, 0); } } for (i = 0; i < cpu_wins_no; i++) if (cpu_wins[i].target > 0) decode_win_cpu_set(cpu_wins[i].target, cpu_wins[i].attr, cpu_wins[i].base, cpu_wins[i].size, cpu_wins[i].remap); } static int decode_win_sdram_fixup(void) { struct mem_region mr[FDT_MEM_REGIONS]; uint8_t window_valid[MV_WIN_DDR_MAX]; int mr_cnt, err, i, j; uint32_t valid_win_num = 0; /* Grab physical memory regions information from device tree. 
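* A window programmed by u-boot is kept only when some DTB memory region matches it exactly in both base and size; every other active window is disabled below, and a count mismatch between regions and matched windows is reported as EINVAL.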
*/ err = fdt_get_mem_regions(mr, &mr_cnt, NULL); if (err != 0) return (err); for (i = 0; i < MV_WIN_DDR_MAX; i++) window_valid[i] = 0; /* Try to match entries from device tree with settings from u-boot */ for (i = 0; i < mr_cnt; i++) { for (j = 0; j < MV_WIN_DDR_MAX; j++) { if (ddr_is_active(j) && (ddr_base(j) == mr[i].mr_start) && (ddr_size(j) == mr[i].mr_size)) { window_valid[j] = 1; valid_win_num++; } } } if (mr_cnt != valid_win_num) return (EINVAL); /* Destroy windows without corresponding device tree entry */ for (j = 0; j < MV_WIN_DDR_MAX; j++) { if (ddr_is_active(j) && (window_valid[j] != 1)) { printf("Disabling SDRAM decoding window: %d\n", j); ddr_disable(j); } } return (0); } /* * Check if we're able to cover all active DDR banks. */ static int decode_win_can_cover_ddr(int max) { int i, c; c = 0; for (i = 0; i < MV_WIN_DDR_MAX; i++) if (ddr_is_active(i)) c++; if (c > max) { printf("Unable to cover all active DDR banks: " "%d, available windows: %d\n", c, max); return (0); } return (1); } /************************************************************************** * DDR windows routines **************************************************************************/ int ddr_is_active(int i) { if (ddr_sz_read(i) & 0x1) return (1); return (0); } void ddr_disable(int i) { ddr_sz_write(i, 0); ddr_br_write(i, 0); } uint32_t ddr_base(int i) { return (ddr_br_read(i) & 0xff000000); } uint32_t ddr_size(int i) { return ((ddr_sz_read(i) | 0x00ffffff) + 1); } uint32_t ddr_attr(int i) { uint32_t dev, rev, attr; soc_id(&dev, &rev); if (dev == MV_DEV_88RC8180) return ((ddr_sz_read(i) & 0xf0) >> 4); if (dev == MV_DEV_88F6781) return (0); attr = (i == 0 ? 0xe : (i == 1 ? 0xd : (i == 2 ? 0xb : (i == 3 ? 0x7 : 0xff)))); if (platform_io_coherent) attr |= 0x10; return (attr); } uint32_t ddr_target(int i) { uint32_t dev, rev; soc_id(&dev, &rev); if (dev == MV_DEV_88RC8180) { i = (ddr_sz_read(i) & 0xf0) >> 4; return (i == 0xe ? 0xc : (i == 0xd ? 0xd : (i == 0xb ? 0xe : (i == 0x7 ? 0xf : 0xc)))); } /* * On SOCs other than 88RC8180 Mbus unit ID for * DDR SDRAM controller is always 0x0. */ return (0); } /************************************************************************** * CESA windows routines **************************************************************************/ static int decode_win_cesa_valid(void) { return (decode_win_can_cover_ddr(MV_WIN_CESA_MAX)); } static void decode_win_cesa_dump(u_long base) { int i; for (i = 0; i < MV_WIN_CESA_MAX; i++) printf("CESA window#%d: c 0x%08x, b 0x%08x\n", i, win_cesa_cr_read(base, i), win_cesa_br_read(base, i)); } /* * Set CESA decode windows. */ static void decode_win_cesa_setup(u_long base) { uint32_t br, cr; uint64_t size; int i, j; for (i = 0; i < MV_WIN_CESA_MAX; i++) { win_cesa_cr_write(base, i, 0); win_cesa_br_write(base, i, 0); } /* Only access to active DRAM banks is required */ for (i = 0; i < MV_WIN_DDR_MAX; i++) { if (ddr_is_active(i)) { br = ddr_base(i); size = ddr_size(i); /* * Armada 38x SoC's equipped with 4GB DRAM * suffer freeze during CESA operation, if * MBUS window opened at given DRAM CS reaches * end of the address space. Apply a workaround * by setting the window size to the closest possible * value, i.e. divide it by 2. 
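* (Worked example: a single 4GB bank at base 0 satisfies size + ddr_base(i) == 0x100000000ULL, so it is programmed as a 2GB window instead.)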
*/ if ((soc_family == MV_SOC_ARMADA_38X) && (size + ddr_base(i) == 0x100000000ULL)) size /= 2; cr = (((size - 1) & 0xffff0000) | (ddr_attr(i) << IO_WIN_ATTR_SHIFT) | (ddr_target(i) << IO_WIN_TGT_SHIFT) | IO_WIN_ENA_MASK); /* Set the first free CESA window */ for (j = 0; j < MV_WIN_CESA_MAX; j++) { if (win_cesa_cr_read(base, j) & 0x1) continue; win_cesa_br_write(base, j, br); win_cesa_cr_write(base, j, cr); break; } } } +} + +static void +decode_win_a38x_cesa_setup(u_long base) +{ + decode_win_cesa_setup(base); + decode_win_cesa_setup(base + MV_WIN_CESA_OFFSET); +} + +static void +decode_win_a38x_cesa_dump(u_long base) +{ + decode_win_cesa_dump(base); + decode_win_cesa_dump(base + MV_WIN_CESA_OFFSET); } /************************************************************************** * USB windows routines **************************************************************************/ static int decode_win_usb_valid(void) { return (decode_win_can_cover_ddr(MV_WIN_USB_MAX)); } static void decode_win_usb_dump(u_long base) { int i; if (pm_is_disabled(CPU_PM_CTRL_USB(usb_port - 1))) return; for (i = 0; i < MV_WIN_USB_MAX; i++) printf("USB window#%d: c 0x%08x, b 0x%08x\n", i, win_usb_cr_read(base, i), win_usb_br_read(base, i)); } /* * Set USB decode windows. */ static void decode_win_usb_setup(u_long base) { uint32_t br, cr; int i, j; if (pm_is_disabled(CPU_PM_CTRL_USB(usb_port))) return; usb_port++; for (i = 0; i < MV_WIN_USB_MAX; i++) { win_usb_cr_write(base, i, 0); win_usb_br_write(base, i, 0); } /* Only access to active DRAM banks is required */ for (i = 0; i < MV_WIN_DDR_MAX; i++) { if (ddr_is_active(i)) { br = ddr_base(i); /* * XXX for 6281 we should handle Mbus write * burst limit field in the ctrl reg */ cr = (((ddr_size(i) - 1) & 0xffff0000) | (ddr_attr(i) << 8) | (ddr_target(i) << 4) | 1); /* Set the first free USB window */ for (j = 0; j < MV_WIN_USB_MAX; j++) { if (win_usb_cr_read(base, j) & 0x1) continue; win_usb_br_write(base, j, br); win_usb_cr_write(base, j, cr); break; } } } } /************************************************************************** * USB3 windows routines **************************************************************************/ static int decode_win_usb3_valid(void) { return (decode_win_can_cover_ddr(MV_WIN_USB3_MAX)); } static void decode_win_usb3_dump(u_long base) { int i; for (i = 0; i < MV_WIN_USB3_MAX; i++) printf("USB3.0 window#%d: c 0x%08x, b 0x%08x\n", i, win_usb3_cr_read(base, i), win_usb3_br_read(base, i)); } /* * Set USB3 decode windows */ static void decode_win_usb3_setup(u_long base) { uint32_t br, cr; int i, j; for (i = 0; i < MV_WIN_USB3_MAX; i++) { win_usb3_cr_write(base, i, 0); win_usb3_br_write(base, i, 0); } /* Only access to active DRAM banks is required */ for (i = 0; i < MV_WIN_DDR_MAX; i++) { if (ddr_is_active(i)) { br = ddr_base(i); cr = (((ddr_size(i) - 1) & (IO_WIN_SIZE_MASK << IO_WIN_SIZE_SHIFT)) | (ddr_attr(i) << IO_WIN_ATTR_SHIFT) | (ddr_target(i) << IO_WIN_TGT_SHIFT) | IO_WIN_ENA_MASK); /* Set the first free USB3.0 window */ for (j = 0; j < MV_WIN_USB3_MAX; j++) { if (win_usb3_cr_read(base, j) & IO_WIN_ENA_MASK) continue; win_usb3_br_write(base, j, br); win_usb3_cr_write(base, j, cr); break; } } } } /************************************************************************** * ETH windows routines **************************************************************************/ static int win_eth_can_remap(int i) { /* ETH encode windows 0-3 have remap capability */ if (i < 4) return (1); return (0); } static int eth_bare_read(uint32_t base, int 
i) { uint32_t v; v = win_eth_bare_read(base); v &= (1 << i); return (v >> i); } static void eth_bare_write(uint32_t base, int i, int val) { uint32_t v; v = win_eth_bare_read(base); v &= ~(1 << i); v |= (val << i); win_eth_bare_write(base, v); } static void eth_epap_write(uint32_t base, int i, int val) { uint32_t v; v = win_eth_epap_read(base); v &= ~(0x3 << (i * 2)); v |= (val << (i * 2)); win_eth_epap_write(base, v); } static void decode_win_eth_dump(u_long base) { int i; if (pm_is_disabled(CPU_PM_CTRL_GE(eth_port - 1))) return; for (i = 0; i < MV_WIN_ETH_MAX; i++) { printf("ETH window#%d: b 0x%08x, s 0x%08x", i, win_eth_br_read(base, i), win_eth_sz_read(base, i)); if (win_eth_can_remap(i)) printf(", ha 0x%08x", win_eth_har_read(base, i)); printf("\n"); } printf("ETH windows: bare 0x%08x, epap 0x%08x\n", win_eth_bare_read(base), win_eth_epap_read(base)); } #define MV_WIN_ETH_DDR_TRGT(n) ddr_target(n) static void decode_win_eth_setup(u_long base) { uint32_t br, sz; int i, j; if (pm_is_disabled(CPU_PM_CTRL_GE(eth_port))) return; eth_port++; /* Disable, clear and revoke protection for all ETH windows */ for (i = 0; i < MV_WIN_ETH_MAX; i++) { eth_bare_write(base, i, 1); eth_epap_write(base, i, 0); win_eth_br_write(base, i, 0); win_eth_sz_write(base, i, 0); if (win_eth_can_remap(i)) win_eth_har_write(base, i, 0); } /* Only access to active DRAM banks is required */ for (i = 0; i < MV_WIN_DDR_MAX; i++) if (ddr_is_active(i)) { br = ddr_base(i) | (ddr_attr(i) << 8) | MV_WIN_ETH_DDR_TRGT(i); sz = ((ddr_size(i) - 1) & 0xffff0000); /* Set the first free ETH window */ for (j = 0; j < MV_WIN_ETH_MAX; j++) { if (eth_bare_read(base, j) == 0) continue; win_eth_br_write(base, j, br); win_eth_sz_write(base, j, sz); /* XXX remapping ETH windows not supported */ /* Set protection RW */ eth_epap_write(base, j, 0x3); /* Enable window */ eth_bare_write(base, j, 0); break; } } } static void decode_win_neta_dump(u_long base) { decode_win_eth_dump(base + MV_WIN_NETA_OFFSET); } static void decode_win_neta_setup(u_long base) { decode_win_eth_setup(base + MV_WIN_NETA_OFFSET); } static int decode_win_eth_valid(void) { return (decode_win_can_cover_ddr(MV_WIN_ETH_MAX)); } /************************************************************************** * PCIE windows routines **************************************************************************/ static void decode_win_pcie_dump(u_long base) { int i; printf("PCIE windows base 0x%08lx\n", base); for (i = 0; i < MV_WIN_PCIE_MAX; i++) printf("PCIE window#%d: cr 0x%08x br 0x%08x remap 0x%08x\n", i, win_pcie_cr_read(base, i), win_pcie_br_read(base, i), win_pcie_remap_read(base, i)); for (i = 0; i < MV_PCIE_BAR_MAX; i++) printf("PCIE bar#%d: cr 0x%08x br 0x%08x brh 0x%08x\n", i, pcie_bar_cr_read(base, i), pcie_bar_br_read(base, i), pcie_bar_brh_read(base, i)); } void decode_win_pcie_setup(u_long base) { uint32_t size = 0, ddrbase = ~0; uint32_t cr, br; int i, j; for (i = 0; i < MV_PCIE_BAR_MAX; i++) { pcie_bar_br_write(base, i, MV_PCIE_BAR_64BIT | MV_PCIE_BAR_PREFETCH_EN); if (i < 3) pcie_bar_brh_write(base, i, 0); if (i > 0) pcie_bar_cr_write(base, i, 0); } for (i = 0; i < MV_WIN_PCIE_MAX; i++) { win_pcie_cr_write(base, i, 0); win_pcie_br_write(base, i, 0); win_pcie_remap_write(base, i, 0); } /* On End-Point only set BAR size to 1MB regardless of DDR size */ if ((bus_space_read_4(fdtbus_bs_tag, base, MV_PCIE_CONTROL) & MV_PCIE_ROOT_CMPLX) == 0) { pcie_bar_cr_write(base, 1, 0xf0000 | 1); return; } for (i = 0; i < MV_WIN_DDR_MAX; i++) { if (ddr_is_active(i)) { /* Map DDR to BAR 
1 */ cr = (ddr_size(i) - 1) & 0xffff0000; size += ddr_size(i) & 0xffff0000; cr |= (ddr_attr(i) << 8) | (ddr_target(i) << 4) | 1; br = ddr_base(i); if (br < ddrbase) ddrbase = br; /* Use the first available PCIE window */ for (j = 0; j < MV_WIN_PCIE_MAX; j++) { if (win_pcie_cr_read(base, j) != 0) continue; win_pcie_br_write(base, j, br); win_pcie_cr_write(base, j, cr); break; } } } /* * Upper 16 bits in the BAR register are interpreted as the BAR size * (in 64 kB units) plus 64 kB, so subtract 0x10000 * from the value passed to the register to get the correct value * (e.g. a single 512MB bank accumulates size 0x20000000, and * 0x1fff0000 is what gets written). */ size -= 0x10000; pcie_bar_cr_write(base, 1, size | 1); pcie_bar_br_write(base, 1, ddrbase | MV_PCIE_BAR_64BIT | MV_PCIE_BAR_PREFETCH_EN); pcie_bar_br_write(base, 0, fdt_immr_pa | MV_PCIE_BAR_64BIT | MV_PCIE_BAR_PREFETCH_EN); } static int decode_win_pcie_valid(void) { return (decode_win_can_cover_ddr(MV_WIN_PCIE_MAX)); } /************************************************************************** * IDMA windows routines **************************************************************************/ #if defined(SOC_MV_ORION) || defined(SOC_MV_DISCOVERY) static int idma_bare_read(u_long base, int i) { uint32_t v; v = win_idma_bare_read(base); v &= (1 << i); return (v >> i); } static void idma_bare_write(u_long base, int i, int val) { uint32_t v; v = win_idma_bare_read(base); v &= ~(1 << i); v |= (val << i); win_idma_bare_write(base, v); } /* * Sets channel protection 'val' for window 'w' on channel 'c' */ static void idma_cap_write(u_long base, int c, int w, int val) { uint32_t v; v = win_idma_cap_read(base, c); v &= ~(0x3 << (w * 2)); v |= (val << (w * 2)); win_idma_cap_write(base, c, v); } /* * Set protection 'val' on all channels for window 'w' */ static void idma_set_prot(u_long base, int w, int val) { int c; for (c = 0; c < MV_IDMA_CHAN_MAX; c++) idma_cap_write(base, c, w, val); } static int win_idma_can_remap(int i) { /* IDMA decode windows 0-3 have remap capability */ if (i < 4) return (1); return (0); } void decode_win_idma_setup(u_long base) { uint32_t br, sz; int i, j; if (pm_is_disabled(CPU_PM_CTRL_IDMA)) return; /* * Disable and clear all IDMA windows, revoke protection for all channels */ for (i = 0; i < MV_WIN_IDMA_MAX; i++) { idma_bare_write(base, i, 1); win_idma_br_write(base, i, 0); win_idma_sz_write(base, i, 0); if (win_idma_can_remap(i) == 1) win_idma_har_write(base, i, 0); } for (i = 0; i < MV_IDMA_CHAN_MAX; i++) win_idma_cap_write(base, i, 0); /* * Set up access to all active DRAM banks */ for (i = 0; i < MV_WIN_DDR_MAX; i++) if (ddr_is_active(i)) { br = ddr_base(i) | (ddr_attr(i) << 8) | ddr_target(i); sz = ((ddr_size(i) - 1) & 0xffff0000); /* Place DDR entries in non-remapped windows */ for (j = 0; j < MV_WIN_IDMA_MAX; j++) if (win_idma_can_remap(j) != 1 && idma_bare_read(base, j) == 1) { /* Configure window */ win_idma_br_write(base, j, br); win_idma_sz_write(base, j, sz); /* Set protection RW on all channels */ idma_set_prot(base, j, 0x3); /* Enable window */ idma_bare_write(base, j, 0); break; } } /* * Remaining targets -- from statically defined table */ for (i = 0; i < idma_wins_no; i++) if (idma_wins[i].target > 0) { br = (idma_wins[i].base & 0xffff0000) | (idma_wins[i].attr << 8) | idma_wins[i].target; sz = ((idma_wins[i].size - 1) & 0xffff0000); /* Set the first free IDMA window */ for (j = 0; j < MV_WIN_IDMA_MAX; j++) { if (idma_bare_read(base, j) == 0) continue; /* Configure window */ win_idma_br_write(base, j, br); win_idma_sz_write(base, j, sz); if (win_idma_can_remap(j) && idma_wins[j].remap >= 0)
win_idma_har_write(base, j, idma_wins[j].remap); /* Set protection RW on all channels */ idma_set_prot(base, j, 0x3); /* Enable window */ idma_bare_write(base, j, 0); break; } } } int decode_win_idma_valid(void) { const struct decode_win *wintab; int c, i, j, rv; uint32_t b, e, s; if (idma_wins_no > MV_WIN_IDMA_MAX) { printf("IDMA windows: too many entries: %d\n", idma_wins_no); return (0); } for (i = 0, c = 0; i < MV_WIN_DDR_MAX; i++) if (ddr_is_active(i)) c++; if (idma_wins_no > (MV_WIN_IDMA_MAX - c)) { printf("IDMA windows: too many entries: %d, available: %d\n", idma_wins_no, MV_WIN_IDMA_MAX - c); return (0); } wintab = idma_wins; rv = 1; for (i = 0; i < idma_wins_no; i++, wintab++) { if (wintab->target == 0) { printf("IDMA window#%d: DDR target window is not " "supposed to be reprogrammed!\n", i); rv = 0; } if (wintab->remap >= 0 && win_cpu_can_remap(i) != 1) { printf("IDMA window#%d: not capable of remapping, but " "val 0x%08x defined\n", i, wintab->remap); rv = 0; } s = wintab->size; b = wintab->base; e = b + s - 1; if (s > (0xFFFFFFFF - b + 1)) { /* XXX this boundary check should account for 64bit and * remapping.. */ printf("IDMA window#%d: no space for size 0x%08x at " "0x%08x\n", i, s, b); rv = 0; continue; } j = decode_win_overlap(i, idma_wins_no, &idma_wins[0]); if (j >= 0) { printf("IDMA window#%d: (0x%08x - 0x%08x) overlaps " "with #%d (0x%08x - 0x%08x)\n", i, b, e, j, idma_wins[j].base, idma_wins[j].base + idma_wins[j].size - 1); rv = 0; } } return (rv); } void decode_win_idma_dump(u_long base) { int i; if (pm_is_disabled(CPU_PM_CTRL_IDMA)) return; for (i = 0; i < MV_WIN_IDMA_MAX; i++) { printf("IDMA window#%d: b 0x%08x, s 0x%08x", i, win_idma_br_read(base, i), win_idma_sz_read(base, i)); if (win_idma_can_remap(i)) printf(", ha 0x%08x", win_idma_har_read(base, i)); printf("\n"); } for (i = 0; i < MV_IDMA_CHAN_MAX; i++) printf("IDMA channel#%d: ap 0x%08x\n", i, win_idma_cap_read(base, i)); printf("IDMA windows: bare 0x%08x\n", win_idma_bare_read(base)); } #else /* Provide dummy functions to satisfy the build for SoCs not equipped with IDMA */ int decode_win_idma_valid(void) { return (1); } void decode_win_idma_setup(u_long base) { } void decode_win_idma_dump(u_long base) { } #endif /************************************************************************** * XOR windows routines **************************************************************************/ #if defined(SOC_MV_KIRKWOOD) || defined(SOC_MV_DISCOVERY) static int xor_ctrl_read(u_long base, int i, int c, int e) { uint32_t v; v = win_xor_ctrl_read(base, c, e); v &= (1 << i); return (v >> i); } static void xor_ctrl_write(u_long base, int i, int c, int e, int val) { uint32_t v; v = win_xor_ctrl_read(base, c, e); v &= ~(1 << i); v |= (val << i); win_xor_ctrl_write(base, c, e, v); } /* * Set channel protection 'val' for window 'w' on channel 'c' */ static void xor_chan_write(u_long base, int c, int e, int w, int val) { uint32_t v; v = win_xor_ctrl_read(base, c, e); v &= ~(0x3 << (w * 2 + 16)); v |= (val << (w * 2 + 16)); win_xor_ctrl_write(base, c, e, v); } /* * Set protection 'val' on all channels for window 'w' on engine 'e' */ static void xor_set_prot(u_long base, int w, int e, int val) { int c; for (c = 0; c < MV_XOR_CHAN_MAX; c++) xor_chan_write(base, c, e, w, val); } static int win_xor_can_remap(int i) { /* XOR decode windows 0-3 have remap capability */ if (i < 4) return (1); return (0); } static int xor_max_eng(void) { uint32_t dev, rev; soc_id(&dev, &rev); switch (dev) { case MV_DEV_88F6281: case MV_DEV_88F6282: 
	case MV_DEV_MV78130:
	case MV_DEV_MV78160:
	case MV_DEV_MV78230:
	case MV_DEV_MV78260:
	case MV_DEV_MV78460:
		return (2);
	case MV_DEV_MV78100:
	case MV_DEV_MV78100_Z0:
		return (1);
	default:
		return (0);
	}
}

static void
xor_active_dram(u_long base, int c, int e, int *window)
{
	uint32_t br, sz;
	int i, m, w;

	/*
	 * Set up access to all active DRAM banks
	 */
	m = xor_max_eng();
	for (i = 0; i < m; i++)
		if (ddr_is_active(i)) {
			br = ddr_base(i) | (ddr_attr(i) << 8) |
			    ddr_target(i);
			sz = ((ddr_size(i) - 1) & 0xffff0000);

			/* Place DDR entries in non-remapped windows */
			for (w = 0; w < MV_WIN_XOR_MAX; w++)
				if (win_xor_can_remap(w) != 1 &&
				    (xor_ctrl_read(base, w, c, e) == 0) &&
				    w > *window) {
					/* Configure window */
					win_xor_br_write(base, w, e, br);
					win_xor_sz_write(base, w, e, sz);

					/* Set protection RW on all channels */
					xor_set_prot(base, w, e, 0x3);

					/* Enable window */
					xor_ctrl_write(base, w, c, e, 1);
					(*window)++;
					break;
				}
		}
}

void
decode_win_xor_setup(u_long base)
{
	uint32_t br, sz;
	int i, j, z, e = 1, m, window;

	if (pm_is_disabled(CPU_PM_CTRL_XOR))
		return;

	/*
	 * Disable and clear all XOR windows, revoke protection for all
	 * channels
	 */
	m = xor_max_eng();
	for (j = 0; j < m; j++, e--) {

		/* Number of non-remapped windows */
		window = MV_XOR_NON_REMAP - 1;

		for (i = 0; i < MV_WIN_XOR_MAX; i++) {
			win_xor_br_write(base, i, e, 0);
			win_xor_sz_write(base, i, e, 0);

			if (win_xor_can_remap(i) == 1)
				win_xor_har_write(base, i, e, 0);
		}

		for (i = 0; i < MV_XOR_CHAN_MAX; i++) {
			win_xor_ctrl_write(base, i, e, 0);
			xor_active_dram(base, i, e, &window);
		}

		/*
		 * Remaining targets -- from a statically defined table
		 */
		for (i = 0; i < xor_wins_no; i++)
			if (xor_wins[i].target > 0) {
				br = (xor_wins[i].base & 0xffff0000) |
				    (xor_wins[i].attr << 8) |
				    xor_wins[i].target;
				sz = ((xor_wins[i].size - 1) & 0xffff0000);

				/* Set the first free XOR window */
				for (z = 0; z < MV_WIN_XOR_MAX; z++) {
					if (xor_ctrl_read(base, z, 0, e) &&
					    xor_ctrl_read(base, z, 1, e))
						continue;

					/* Configure window */
					win_xor_br_write(base, z, e, br);
					win_xor_sz_write(base, z, e, sz);
					if (win_xor_can_remap(z) &&
					    xor_wins[z].remap >= 0)
						win_xor_har_write(base, z, e,
						    xor_wins[z].remap);

					/* Set protection RW on all channels */
					xor_set_prot(base, z, e, 0x3);

					/* Enable window */
					xor_ctrl_write(base, z, 0, e, 1);
					xor_ctrl_write(base, z, 1, e, 1);
					break;
				}
			}
	}
}

int
decode_win_xor_valid(void)
{
	const struct decode_win *wintab;
	int c, i, j, rv;
	uint32_t b, e, s;

	if (xor_wins_no > MV_WIN_XOR_MAX) {
		printf("XOR windows: too many entries: %d\n", xor_wins_no);
		return (0);
	}

	for (i = 0, c = 0; i < MV_WIN_DDR_MAX; i++)
		if (ddr_is_active(i))
			c++;

	if (xor_wins_no > (MV_WIN_XOR_MAX - c)) {
		printf("XOR windows: too many entries: %d, available: %d\n",
		    xor_wins_no, MV_WIN_XOR_MAX - c);
		return (0);
	}

	wintab = xor_wins;
	rv = 1;
	for (i = 0; i < xor_wins_no; i++, wintab++) {

		if (wintab->target == 0) {
			printf("XOR window#%d: DDR target window is not "
			    "supposed to be reprogrammed!\n", i);
			rv = 0;
		}

		if (wintab->remap >= 0 && win_cpu_can_remap(i) != 1) {
			printf("XOR window#%d: not capable of remapping, but "
			    "val 0x%08x defined\n", i, wintab->remap);
			rv = 0;
		}

		s = wintab->size;
		b = wintab->base;
		e = b + s - 1;
		if (s > (0xFFFFFFFF - b + 1)) {
			/*
			 * XXX this boundary check should account for 64bit
			 * and remapping..
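			 * As written it only rejects windows that wrap the
			 * 32-bit space: e.g. base 0xF0000000 with size
			 * 0x20000000 fails, since 0x20000000 is greater
			 * than 0xFFFFFFFF - 0xF0000000 + 1 = 0x10000000.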
*/ printf("XOR window#%d: no space for size 0x%08x at " "0x%08x\n", i, s, b); rv = 0; continue; } j = decode_win_overlap(i, xor_wins_no, &xor_wins[0]); if (j >= 0) { printf("XOR window#%d: (0x%08x - 0x%08x) overlaps " "with #%d (0x%08x - 0x%08x)\n", i, b, e, j, xor_wins[j].base, xor_wins[j].base + xor_wins[j].size - 1); rv = 0; } } return (rv); } void decode_win_xor_dump(u_long base) { int i, j; int e = 1; if (pm_is_disabled(CPU_PM_CTRL_XOR)) return; for (j = 0; j < xor_max_eng(); j++, e--) { for (i = 0; i < MV_WIN_XOR_MAX; i++) { printf("XOR window#%d: b 0x%08x, s 0x%08x", i, win_xor_br_read(base, i, e), win_xor_sz_read(base, i, e)); if (win_xor_can_remap(i)) printf(", ha 0x%08x", win_xor_har_read(base, i, e)); printf("\n"); } for (i = 0; i < MV_XOR_CHAN_MAX; i++) printf("XOR control#%d: 0x%08x\n", i, win_xor_ctrl_read(base, i, e)); } } #else /* Provide dummy functions to satisfy the build for SoCs not equipped with XOR */ static int decode_win_xor_valid(void) { return (1); } static void decode_win_xor_setup(u_long base) { } static void decode_win_xor_dump(u_long base) { } #endif /************************************************************************** * SATA windows routines **************************************************************************/ static void decode_win_sata_setup(u_long base) { uint32_t cr, br; int i, j; if (pm_is_disabled(CPU_PM_CTRL_SATA)) return; for (i = 0; i < MV_WIN_SATA_MAX; i++) { win_sata_cr_write(base, i, 0); win_sata_br_write(base, i, 0); } for (i = 0; i < MV_WIN_DDR_MAX; i++) if (ddr_is_active(i)) { cr = ((ddr_size(i) - 1) & 0xffff0000) | (ddr_attr(i) << 8) | (ddr_target(i) << 4) | 1; br = ddr_base(i); /* Use the first available SATA window */ for (j = 0; j < MV_WIN_SATA_MAX; j++) { if ((win_sata_cr_read(base, j) & 1) != 0) continue; win_sata_br_write(base, j, br); win_sata_cr_write(base, j, cr); break; } } } /* * Configure AHCI decoding windows */ static void decode_win_ahci_setup(u_long base) { uint32_t br, cr, sz; int i, j; for (i = 0; i < MV_WIN_SATA_MAX_ARMADA38X; i++) { win_sata_armada38x_cr_write(base, i, 0); win_sata_armada38x_br_write(base, i, 0); win_sata_armada38x_sz_write(base, i, 0); } for (i = 0; i < MV_WIN_DDR_MAX; i++) { if (ddr_is_active(i)) { cr = (ddr_attr(i) << IO_WIN_ATTR_SHIFT) | (ddr_target(i) << IO_WIN_TGT_SHIFT) | IO_WIN_ENA_MASK; br = ddr_base(i); sz = (ddr_size(i) - 1) & (IO_WIN_SIZE_MASK << IO_WIN_SIZE_SHIFT); /* Use first available SATA window */ for (j = 0; j < MV_WIN_SATA_MAX_ARMADA38X; j++) { if (win_sata_armada38x_cr_read(base, j) & IO_WIN_ENA_MASK) continue; /* BASE is set to DRAM base (0x00000000) */ win_sata_armada38x_br_write(base, j, br); /* CTRL targets DRAM ctrl with 0x0E or 0x0D */ win_sata_armada38x_cr_write(base, j, cr); /* SIZE is set to 16MB - max value */ win_sata_armada38x_sz_write(base, j, sz); break; } } } } static void decode_win_ahci_dump(u_long base) { int i; for (i = 0; i < MV_WIN_SATA_MAX_ARMADA38X; i++) printf("SATA window#%d: cr 0x%08x, br 0x%08x, sz 0x%08x\n", i, win_sata_armada38x_cr_read(base, i), win_sata_br_read(base, i), win_sata_armada38x_sz_read(base,i)); } static int decode_win_sata_valid(void) { uint32_t dev, rev; soc_id(&dev, &rev); if (dev == MV_DEV_88F5281) return (1); return (decode_win_can_cover_ddr(MV_WIN_SATA_MAX)); } static void decode_win_sdhci_setup(u_long base) { uint32_t cr, br; int i, j; for (i = 0; i < MV_WIN_SDHCI_MAX; i++) { win_sdhci_cr_write(base, i, 0); win_sdhci_br_write(base, i, 0); } for (i = 0; i < MV_WIN_DDR_MAX; i++) if (ddr_is_active(i)) { br = ddr_base(i); cr = 
(((ddr_size(i) - 1) & (IO_WIN_SIZE_MASK << IO_WIN_SIZE_SHIFT)) | (ddr_attr(i) << IO_WIN_ATTR_SHIFT) | (ddr_target(i) << IO_WIN_TGT_SHIFT) | IO_WIN_ENA_MASK); /* Use the first available SDHCI window */ for (j = 0; j < MV_WIN_SDHCI_MAX; j++) { if (win_sdhci_cr_read(base, j) & IO_WIN_ENA_MASK) continue; win_sdhci_cr_write(base, j, cr); win_sdhci_br_write(base, j, br); break; } } } static void decode_win_sdhci_dump(u_long base) { int i; for (i = 0; i < MV_WIN_SDHCI_MAX; i++) printf("SDHCI window#%d: c 0x%08x, b 0x%08x\n", i, win_sdhci_cr_read(base, i), win_sdhci_br_read(base, i)); } static int decode_win_sdhci_valid(void) { return (decode_win_can_cover_ddr(MV_WIN_SDHCI_MAX)); } /************************************************************************** * FDT parsing routines. **************************************************************************/ static int fdt_get_ranges(const char *nodename, void *buf, int size, int *tuples, int *tuplesize) { phandle_t node; pcell_t addr_cells, par_addr_cells, size_cells; int len, tuple_size, tuples_count; node = OF_finddevice(nodename); if (node == -1) return (EINVAL); if ((fdt_addrsize_cells(node, &addr_cells, &size_cells)) != 0) return (ENXIO); par_addr_cells = fdt_parent_addr_cells(node); if (par_addr_cells > 2) return (ERANGE); tuple_size = sizeof(pcell_t) * (addr_cells + par_addr_cells + size_cells); /* Note the OF_getprop_alloc() cannot be used at this early stage. */ len = OF_getprop(node, "ranges", buf, size); /* * XXX this does not handle the empty 'ranges;' case, which is * legitimate and should be allowed. */ tuples_count = len / tuple_size; if (tuples_count <= 0) return (ERANGE); if (par_addr_cells > 2 || addr_cells > 2 || size_cells > 2) return (ERANGE); *tuples = tuples_count; *tuplesize = tuple_size; return (0); } static int win_cpu_from_dt(void) { pcell_t ranges[48]; phandle_t node; int i, entry_size, err, t, tuple_size, tuples; u_long sram_base, sram_size; t = 0; /* Retrieve 'ranges' property of '/localbus' node. */ if ((err = fdt_get_ranges("/localbus", ranges, sizeof(ranges), &tuples, &tuple_size)) == 0) { /* * Fill CPU decode windows table. */ bzero((void *)&cpu_win_tbl, sizeof(cpu_win_tbl)); entry_size = tuple_size / sizeof(pcell_t); cpu_wins_no = tuples; /* Check range */ if (tuples > nitems(cpu_win_tbl)) { debugf("too many tuples to fit into cpu_win_tbl\n"); return (ENOMEM); } for (i = 0, t = 0; t < tuples; i += entry_size, t++) { cpu_win_tbl[t].target = 1; cpu_win_tbl[t].attr = fdt32_to_cpu(ranges[i + 1]); cpu_win_tbl[t].base = fdt32_to_cpu(ranges[i + 2]); cpu_win_tbl[t].size = fdt32_to_cpu(ranges[i + 3]); cpu_win_tbl[t].remap = ~0; debugf("target = 0x%0x attr = 0x%0x base = 0x%0x " "size = 0x%0x remap = 0x%0x\n", cpu_win_tbl[t].target, cpu_win_tbl[t].attr, cpu_win_tbl[t].base, cpu_win_tbl[t].size, cpu_win_tbl[t].remap); } } /* * Retrieve CESA SRAM data. */ if ((node = OF_finddevice("sram")) != -1) if (ofw_bus_node_is_compatible(node, "mrvl,cesa-sram")) goto moveon; if ((node = OF_finddevice("/")) == -1) return (ENXIO); if ((node = fdt_find_compatible(node, "mrvl,cesa-sram", 0)) == 0) /* SRAM block is not always present. 
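		 * Returning success here simply leaves the CESA decode
		 * window unprogrammed instead of failing the whole CPU
		 * window setup.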
*/ return (0); moveon: sram_base = sram_size = 0; if (fdt_regsize(node, &sram_base, &sram_size) != 0) return (EINVAL); /* Check range */ if (t >= nitems(cpu_win_tbl)) { debugf("cannot fit CESA tuple into cpu_win_tbl\n"); return (ENOMEM); } cpu_win_tbl[t].target = soc_decode_win_spec->win_cesa_target; if (soc_family == MV_SOC_ARMADA_38X) cpu_win_tbl[t].attr = soc_decode_win_spec->win_cesa_attr(0); else cpu_win_tbl[t].attr = soc_decode_win_spec->win_cesa_attr(1); cpu_win_tbl[t].base = sram_base; cpu_win_tbl[t].size = sram_size; cpu_win_tbl[t].remap = ~0; cpu_wins_no++; debugf("sram: base = 0x%0lx size = 0x%0lx\n", sram_base, sram_size); /* Check if there is a second CESA node */ while ((node = OF_peer(node)) != 0) { if (ofw_bus_node_is_compatible(node, "mrvl,cesa-sram")) { if (fdt_regsize(node, &sram_base, &sram_size) != 0) return (EINVAL); break; } } if (node == 0) return (0); t++; if (t >= nitems(cpu_win_tbl)) { debugf("cannot fit CESA tuple into cpu_win_tbl\n"); return (ENOMEM); } /* Configure window for CESA1 */ cpu_win_tbl[t].target = soc_decode_win_spec->win_cesa_target; cpu_win_tbl[t].attr = soc_decode_win_spec->win_cesa_attr(1); cpu_win_tbl[t].base = sram_base; cpu_win_tbl[t].size = sram_size; cpu_win_tbl[t].remap = ~0; cpu_wins_no++; debugf("sram: base = 0x%0lx size = 0x%0lx\n", sram_base, sram_size); return (0); } static int fdt_win_process(phandle_t child) { int i, ret; for (i = 0; soc_nodes[i].compat != NULL; i++) { /* Setup only for enabled devices */ if (ofw_bus_node_status_okay(child) == 0) continue; if (!ofw_bus_node_is_compatible(child, soc_nodes[i].compat)) continue; ret = fdt_win_process_child(child, &soc_nodes[i], "reg"); if (ret != 0) return (ret); } return (0); } static int fdt_win_process_child(phandle_t child, struct soc_node_spec *soc_node, const char* mimo_reg_source) { int addr_cells, size_cells; pcell_t reg[8]; u_long size, base; if (fdt_addrsize_cells(OF_parent(child), &addr_cells, &size_cells)) return (ENXIO); if ((sizeof(pcell_t) * (addr_cells + size_cells)) > sizeof(reg)) return (ENOMEM); if (OF_getprop(child, mimo_reg_source, ®, sizeof(reg)) <= 0) return (EINVAL); if (addr_cells <= 2) base = fdt_data_get(®[0], addr_cells); else base = fdt_data_get(®[addr_cells - 2], 2); size = fdt_data_get(®[addr_cells], size_cells); if (soc_node->valid_handler != NULL) if (!soc_node->valid_handler()) return (EINVAL); base = (base & 0x000fffff) | fdt_immr_va; if (soc_node->decode_handler != NULL) soc_node->decode_handler(base); else return (ENXIO); if (MV_DUMP_WIN && (soc_node->dump_handler != NULL)) soc_node->dump_handler(base); return (0); } static int fdt_win_setup(void) { phandle_t node, child, sb; phandle_t child_pci; int err; sb = 0; node = OF_finddevice("/"); if (node == -1) panic("fdt_win_setup: no root node"); /* Allow for coherent transactions on the A38x MBUS */ if (ofw_bus_node_is_compatible(node, "marvell,armada380")) platform_io_coherent = true; /* * Traverse through all children of root and simple-bus nodes. * For each found device retrieve decode windows data (if applicable). 
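	 * The loop below first walks the children of the root node, then
	 * drops into the first "simple-bus" node, and finally into the
	 * internal-regs level (itself "simple-bus" compatible) when present.
	 * Children of Armada PCIe controllers are handled out of band using
	 * their "assigned-addresses" property rather than "reg".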
 */
	child = OF_child(node);
	while (child != 0) {
		/* Look up the callback and run it */
		err = fdt_win_process(child);
		if (err != 0)
			return (err);

		/* Process Marvell Armada-XP/38x PCIe controllers */
		if (ofw_bus_node_is_compatible(child,
		    "marvell,armada-370-pcie")) {
			child_pci = OF_child(child);
			while (child_pci != 0) {
				err = fdt_win_process_child(child_pci,
				    &soc_nodes[SOC_NODE_PCIE_ENTRY_IDX],
				    "assigned-addresses");
				if (err != 0)
					return (err);

				child_pci = OF_peer(child_pci);
			}
		}

		/*
		 * Once done with root-level children let's move down to
		 * simple-bus and its children.
		 */
		child = OF_peer(child);
		if ((child == 0) && (node == OF_finddevice("/"))) {
			sb = node = fdt_find_compatible(node, "simple-bus", 0);
			if (node == 0)
				return (ENXIO);
			child = OF_child(node);
		}
		/*
		 * Next, move one more level down to the internal-regs node
		 * (if it is present) and its children. This node is also
		 * "simple-bus" compatible.
		 */
		if ((child == 0) && (node == sb)) {
			node = fdt_find_compatible(node, "simple-bus", 0);
			if (node == 0)
				return (0);
			child = OF_child(node);
		}
	}

	return (0);
}

static void
fdt_fixup_busfreq(phandle_t root)
{
	phandle_t sb;
	pcell_t freq;

	freq = cpu_to_fdt32(get_tclk());

	/*
	 * Fix bus speed in cpu node
	 */
	if ((sb = OF_finddevice("cpu")) != -1)
		if (fdt_is_compatible_strict(sb, "ARM,88VS584"))
			OF_setprop(sb, "bus-frequency", (void *)&freq,
			    sizeof(freq));

	/*
	 * This fixup sets the simple-bus bus-frequency property.
	 */
	if ((sb = fdt_find_compatible(root, "simple-bus", 1)) != 0)
		OF_setprop(sb, "bus-frequency", (void *)&freq, sizeof(freq));
}

static void
fdt_fixup_ranges(phandle_t root)
{
	phandle_t node;
	pcell_t par_addr_cells, addr_cells, size_cells;
	pcell_t ranges[3], reg[2], *rangesptr;
	int len, tuple_size, tuples_count;
	uint32_t base;

	/* Fix-up SoC ranges according to real fdt_immr_pa */
	if ((node = fdt_find_compatible(root, "simple-bus", 1)) != 0) {
		if (fdt_addrsize_cells(node, &addr_cells, &size_cells) == 0 &&
		    ((par_addr_cells = fdt_parent_addr_cells(node)) <= 2)) {
			tuple_size = sizeof(pcell_t) * (par_addr_cells +
			    addr_cells + size_cells);
			len = OF_getprop(node, "ranges", ranges,
			    sizeof(ranges));
			tuples_count = len / tuple_size;
			/* Unexpected settings are not supported */
			if (tuples_count != 1)
				goto fixup_failed;

			rangesptr = &ranges[0];
			rangesptr += par_addr_cells;
			base = fdt_data_get((void *)rangesptr, addr_cells);
			*rangesptr = cpu_to_fdt32(fdt_immr_pa);

			if (OF_setprop(node, "ranges", (void *)&ranges[0],
			    sizeof(ranges)) < 0)
				goto fixup_failed;
		}
	}

	/* Fix-up PCIe reg according to real PCIe registers' PA */
	if ((node = fdt_find_compatible(root, "mrvl,pcie", 1)) != 0) {
		if (fdt_addrsize_cells(OF_parent(node), &par_addr_cells,
		    &size_cells) == 0) {
			tuple_size = sizeof(pcell_t) * (par_addr_cells +
			    size_cells);
			len = OF_getprop(node, "reg", reg, sizeof(reg));
			tuples_count = len / tuple_size;
			/* Unexpected settings are not supported */
			if (tuples_count != 1)
				goto fixup_failed;

			base = fdt_data_get((void *)&reg[0], par_addr_cells);
			base &= ~0xFF000000;
			base |= fdt_immr_pa;
			reg[0] = cpu_to_fdt32(base);
			if (OF_setprop(node, "reg", (void *)&reg[0],
			    sizeof(reg)) < 0)
				goto fixup_failed;
		}
	}

	/* Fix-up succeeded. May return and continue */
	return;

fixup_failed:
	while (1) {
		/*
		 * In case of any error while fixing ranges just hang.
		 * 1. No message can be displayed yet since console
		 *    is not initialized.
		 * 2. Going further will cause failure on bus_space_map()
		 *    relying on the wrong ranges or data abort when
		 *    accessing PCIe registers.
*/ } } struct fdt_fixup_entry fdt_fixup_table[] = { { "mrvl,DB-88F6281", &fdt_fixup_busfreq }, { "mrvl,DB-78460", &fdt_fixup_busfreq }, { "mrvl,DB-78460", &fdt_fixup_ranges }, { NULL, NULL } }; #if __ARM_ARCH >= 6 uint32_t get_tclk(void) { if (soc_decode_win_spec->get_tclk != NULL) return soc_decode_win_spec->get_tclk(); else return -1; } uint32_t get_cpu_freq(void) { if (soc_decode_win_spec->get_cpu_freq != NULL) return soc_decode_win_spec->get_cpu_freq(); else return -1; } #endif #ifndef INTRNG static int fdt_pic_decode_ic(phandle_t node, pcell_t *intr, int *interrupt, int *trig, int *pol) { if (!ofw_bus_node_is_compatible(node, "mrvl,pic") && !ofw_bus_node_is_compatible(node, "mrvl,mpic")) return (ENXIO); *interrupt = fdt32_to_cpu(intr[0]); *trig = INTR_TRIGGER_CONFORM; *pol = INTR_POLARITY_CONFORM; return (0); } fdt_pic_decode_t fdt_pic_table[] = { &fdt_pic_decode_ic, NULL }; #endif Index: user/markj/netdump/sys/arm/mv/mvwin.h =================================================================== --- user/markj/netdump/sys/arm/mv/mvwin.h (revision 332407) +++ user/markj/netdump/sys/arm/mv/mvwin.h (revision 332408) @@ -1,392 +1,394 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (C) 2007-2011 MARVELL INTERNATIONAL LTD. * All rights reserved. * * Developed by Semihalf. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of MARVELL nor the names of contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _MVWIN_H_ #define _MVWIN_H_ /* * Decode windows addresses. * * All decoding windows must be aligned to their size, which has to be * a power of 2. */ /* * SoC Integrated devices: 0xF1000000, 16 MB (VA == PA) */ /* SoC Regs */ #define MV_PHYS_BASE 0xF1000000 #define MV_SIZE (1024 * 1024) /* 1 MB */ /* SRAM */ #define MV_CESA_SRAM_BASE 0xF1100000 /* * External devices: 0x80000000, 1 GB (VA == PA) * Includes Device Bus, PCI and PCIE. 
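 * The 512 MB of PCI/PCIE memory space and the 16 MB of I/O space defined
 * below are split evenly between MV_PCI_PORTS ports, e.g. 64 MB of memory
 * window per port on an 8-port Discovery configuration.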
*/ #if defined(SOC_MV_ORION) #define MV_PCI_PORTS 2 /* 1x PCI + 1x PCIE */ #elif defined(SOC_MV_KIRKWOOD) #define MV_PCI_PORTS 1 /* 1x PCIE */ #elif defined(SOC_MV_DISCOVERY) #define MV_PCI_PORTS 8 /* 8x PCIE */ #else #define MV_PCI_PORTS 1 /* 1x PCIE -> worst case */ #endif /* PCI/PCIE Memory */ #define MV_PCI_MEM_PHYS_BASE 0x80000000 #define MV_PCI_MEM_SIZE (512 * 1024 * 1024) /* 512 MB */ #define MV_PCI_MEM_BASE MV_PCI_MEM_PHYS_BASE #define MV_PCI_MEM_SLICE_SIZE (MV_PCI_MEM_SIZE / MV_PCI_PORTS) /* PCI/PCIE I/O */ #define MV_PCI_IO_PHYS_BASE 0xBF000000 #define MV_PCI_IO_SIZE (16 * 1024 * 1024) /* 16 MB */ #define MV_PCI_IO_BASE MV_PCI_IO_PHYS_BASE #define MV_PCI_IO_SLICE_SIZE (MV_PCI_IO_SIZE / MV_PCI_PORTS) #define MV_PCI_VA_MEM_BASE 0 #define MV_PCI_VA_IO_BASE 0 /* * Device Bus (VA == PA) */ #define MV_DEV_BOOT_BASE 0xF9300000 #define MV_DEV_BOOT_SIZE (1024 * 1024) /* 1 MB */ #define MV_DEV_CS0_BASE 0xF9400000 #define MV_DEV_CS0_SIZE (1024 * 1024) /* 1 MB */ #define MV_DEV_CS1_BASE 0xF9500000 #define MV_DEV_CS1_SIZE (32 * 1024 * 1024) /* 32 MB */ #define MV_DEV_CS2_BASE 0xFB500000 #define MV_DEV_CS2_SIZE (1024 * 1024) /* 1 MB */ /* * Integrated SoC peripherals addresses */ #define MV_BASE MV_PHYS_BASE /* VA == PA mapping */ #define MV_DDR_CADR_BASE_ARMV7 (MV_BASE + 0x20180) #define MV_DDR_CADR_BASE (MV_BASE + 0x1500) #define MV_MPP_BASE (MV_BASE + 0x10000) #define MV_MISC_BASE (MV_BASE + 0x18200) #define MV_MBUS_BRIDGE_BASE (MV_BASE + 0x20000) #define MV_INTREGS_BASE (MV_MBUS_BRIDGE_BASE + 0x80) #define MV_MP_CLOCKS_BASE (MV_MBUS_BRIDGE_BASE + 0x700) #define MV_CPU_CONTROL_BASE_ARMV7 (MV_MBUS_BRIDGE_BASE + 0x1800) #define MV_CPU_CONTROL_BASE (MV_MBUS_BRIDGE_BASE + 0x100) #define MV_PCI_BASE (MV_BASE + 0x30000) #define MV_PCI_SIZE 0x2000 #define MV_PCIE_BASE_ARMADA38X (MV_BASE + 0x80000) #define MV_PCIE_BASE (MV_BASE + 0x40000) #define MV_PCIE_SIZE 0x2000 #define MV_SDIO_BASE (MV_BASE + 0x90000) #define MV_SDIO_SIZE 0x10000 /* * Decode windows definitions and macros */ #define MV_WIN_CPU_CTRL_ARMV7(n) (((n) < 8) ? 0x10 * (n) : 0x90 + (0x8 * ((n) - 8))) #define MV_WIN_CPU_BASE_ARMV7(n) ((((n) < 8) ? 0x10 * (n) : 0x90 + (0x8 * ((n) - 8))) + 0x4) #define MV_WIN_CPU_REMAP_LO_ARMV7(n) (0x10 * (n) + 0x008) #define MV_WIN_CPU_REMAP_HI_ARMV7(n) (0x10 * (n) + 0x00C) #define MV_WIN_CPU_CTRL_ARMV5(n) (0x10 * (n) + (((n) < 8) ? 0x000 : 0x880)) #define MV_WIN_CPU_BASE_ARMV5(n) (0x10 * (n) + (((n) < 8) ? 0x004 : 0x884)) #define MV_WIN_CPU_REMAP_LO_ARMV5(n) (0x10 * (n) + (((n) < 8) ? 0x008 : 0x888)) #define MV_WIN_CPU_REMAP_HI_ARMV5(n) (0x10 * (n) + (((n) < 8) ? 
0x00C : 0x88C)) #if defined(SOC_MV_DISCOVERY) #define MV_WIN_CPU_MAX 14 #else #define MV_WIN_CPU_MAX 8 #endif #define MV_WIN_CPU_MAX_ARMV7 20 #define MV_WIN_CPU_ATTR_SHIFT 8 #define MV_WIN_CPU_TARGET_SHIFT 4 #define MV_WIN_CPU_ENABLE_BIT 1 #define MV_WIN_DDR_BASE(n) (0x8 * (n) + 0x0) #define MV_WIN_DDR_SIZE(n) (0x8 * (n) + 0x4) #define MV_WIN_DDR_MAX 4 /* * These values are valid only for peripherals decoding windows * Bit in ATTR is zeroed according to CS bank number */ #define MV_WIN_DDR_ATTR(cs) (0x0F & ~(0x01 << (cs))) #define MV_WIN_DDR_TARGET 0x0 #if defined(SOC_MV_DISCOVERY) #define MV_WIN_CESA_TARGET 9 #define MV_WIN_CESA_ATTR(eng_sel) 1 #else #define MV_WIN_CESA_TARGET 3 #define MV_WIN_CESA_ATTR(eng_sel) 0 #endif #define MV_WIN_CESA_TARGET_ARMADAXP 9 /* * Bits [2:3] of cesa attribute select engine: * eng_sel: * 1: engine1 * 2: engine0 */ #define MV_WIN_CESA_ATTR_ARMADAXP(eng_sel) (1 | ((eng_sel) << 2)) #define MV_WIN_CESA_TARGET_ARMADA38X 9 /* * Bits [1:0] = Data swapping * 0x0 = Byte swap * 0x1 = No swap * 0x2 = Byte and word swap * 0x3 = Word swap * Bits [4:2] = CESA select: * 0x6 = CESA0 * 0x5 = CESA1 */ #define MV_WIN_CESA_ATTR_ARMADA38X(eng_sel) (0x11 | (1 << (3 - (eng_sel)))) /* CESA TDMA address decoding registers */ #define MV_WIN_CESA_CTRL(n) (0x8 * (n) + 0xA04) #define MV_WIN_CESA_BASE(n) (0x8 * (n) + 0xA00) #define MV_WIN_CESA_MAX 4 #define MV_WIN_USB_CTRL(n) (0x10 * (n) + 0x320) #define MV_WIN_USB_BASE(n) (0x10 * (n) + 0x324) #define MV_WIN_USB_MAX 4 #define MV_WIN_USB3_CTRL(n) (0x8 * (n) + 0x4000) #define MV_WIN_USB3_BASE(n) (0x8 * (n) + 0x4004) #define MV_WIN_USB3_MAX 8 #define MV_WIN_NETA_OFFSET 0x2000 #define MV_WIN_NETA_BASE(n) MV_WIN_ETH_BASE(n) + MV_WIN_NETA_OFFSET +#define MV_WIN_CESA_OFFSET 0x2000 + #define MV_WIN_ETH_BASE(n) (0x8 * (n) + 0x200) #define MV_WIN_ETH_SIZE(n) (0x8 * (n) + 0x204) #define MV_WIN_ETH_REMAP(n) (0x4 * (n) + 0x280) #define MV_WIN_ETH_MAX 6 #define MV_WIN_IDMA_BASE(n) (0x8 * (n) + 0xa00) #define MV_WIN_IDMA_SIZE(n) (0x8 * (n) + 0xa04) #define MV_WIN_IDMA_REMAP(n) (0x4 * (n) + 0xa60) #define MV_WIN_IDMA_CAP(n) (0x4 * (n) + 0xa70) #define MV_WIN_IDMA_MAX 8 #define MV_IDMA_CHAN_MAX 4 #define MV_WIN_XOR_BASE(n, m) (0x4 * (n) + 0xa50 + (m) * 0x100) #define MV_WIN_XOR_SIZE(n, m) (0x4 * (n) + 0xa70 + (m) * 0x100) #define MV_WIN_XOR_REMAP(n, m) (0x4 * (n) + 0xa90 + (m) * 0x100) #define MV_WIN_XOR_CTRL(n, m) (0x4 * (n) + 0xa40 + (m) * 0x100) #define MV_WIN_XOR_OVERR(n, m) (0x4 * (n) + 0xaa0 + (m) * 0x100) #define MV_WIN_XOR_MAX 8 #define MV_XOR_CHAN_MAX 2 #define MV_XOR_NON_REMAP 4 #define MV_WIN_PCIE_TARGET_ARMADAXP(n) (4 + (4 * ((n) % 2))) #define MV_WIN_PCIE_MEM_ATTR_ARMADAXP(n) (0xE8 + (0x10 * ((n) / 2))) #define MV_WIN_PCIE_IO_ATTR_ARMADAXP(n) (0xE0 + (0x10 * ((n) / 2))) #define MV_WIN_PCIE_TARGET_ARMADA38X(n) ((n) == 0 ? 8 : 4) #define MV_WIN_PCIE_MEM_ATTR_ARMADA38X(n) ((n) < 2 ? 0xE8 : (0xD8 - (((n) % 2) * 0x20))) #define MV_WIN_PCIE_IO_ATTR_ARMADA38X(n) ((n) < 2 ? 
0xE0 : (0xD0 - (((n) % 2) * 0x20))) #if defined(SOC_MV_DISCOVERY) || defined(SOC_MV_KIRKWOOD) #define MV_WIN_PCIE_TARGET(n) 4 #define MV_WIN_PCIE_MEM_ATTR(n) 0xE8 #define MV_WIN_PCIE_IO_ATTR(n) 0xE0 #elif defined(SOC_MV_ORION) #define MV_WIN_PCIE_TARGET(n) 4 #define MV_WIN_PCIE_MEM_ATTR(n) 0x59 #define MV_WIN_PCIE_IO_ATTR(n) 0x51 #else #define MV_WIN_PCIE_TARGET(n) (4 + (4 * ((n) % 2))) #define MV_WIN_PCIE_MEM_ATTR(n) (0xE8 + (0x10 * ((n) / 2))) #define MV_WIN_PCIE_IO_ATTR(n) (0xE0 + (0x10 * ((n) / 2))) #endif #define MV_WIN_PCI_TARGET 3 #define MV_WIN_PCI_MEM_ATTR 0x59 #define MV_WIN_PCI_IO_ATTR 0x51 #define MV_WIN_PCIE_CTRL(n) (0x10 * (((n) < 5) ? (n) : \ (n) + 1) + 0x1820) #define MV_WIN_PCIE_BASE(n) (0x10 * (((n) < 5) ? (n) : \ (n) + 1) + 0x1824) #define MV_WIN_PCIE_REMAP(n) (0x10 * (((n) < 5) ? (n) : \ (n) + 1) + 0x182C) #define MV_WIN_PCIE_MAX 6 #define MV_PCIE_BAR_CTRL(n) (0x04 * (n) + 0x1800) #define MV_PCIE_BAR_BASE(n) (0x08 * ((n) < 3 ? (n) : 4) + 0x0010) #define MV_PCIE_BAR_BASE_H(n) (0x08 * (n) + 0x0014) #define MV_PCIE_BAR_MAX 4 #define MV_PCIE_BAR_64BIT (0x4) #define MV_PCIE_BAR_PREFETCH_EN (0x8) #define MV_PCIE_CONTROL (0x1a00) #define MV_PCIE_ROOT_CMPLX (1 << 1) #define MV_WIN_SATA_CTRL_ARMADA38X(n) (0x10 * (n) + 0x60) #define MV_WIN_SATA_BASE_ARMADA38X(n) (0x10 * (n) + 0x64) #define MV_WIN_SATA_SIZE_ARMADA38X(n) (0x10 * (n) + 0x68) #define MV_WIN_SATA_MAX_ARMADA38X 4 #define MV_WIN_SATA_CTRL(n) (0x10 * (n) + 0x30) #define MV_WIN_SATA_BASE(n) (0x10 * (n) + 0x34) #define MV_WIN_SATA_MAX 4 #define MV_WIN_SDHCI_CTRL(n) (0x8 * (n) + 0x4080) #define MV_WIN_SDHCI_BASE(n) (0x8 * (n) + 0x4084) #define MV_WIN_SDHCI_MAX 8 #define MV_BOOTROM_MEM_ADDR 0xFFF00000 #define MV_BOOTROM_WIN_SIZE 0xF #define MV_CPU_SUBSYS_REGS_LEN 0x100 #define IO_WIN_9_CTRL_OFFSET 0x98 #define IO_WIN_9_BASE_OFFSET 0x9C /* Mbus decoding unit IDs and attributes */ #define MBUS_BOOTROM_TGT_ID 0x1 #define MBUS_BOOTROM_ATTR 0x1D /* Internal Units Sync Barrier Control Register */ #define MV_SYNC_BARRIER_CTRL 0x84 #define MV_SYNC_BARRIER_CTRL_ALL 0xFFFF /* IO Window Control Register fields */ #define IO_WIN_SIZE_SHIFT 16 #define IO_WIN_SIZE_MASK 0xFFFF #define IO_WIN_COH_ATTR_MASK (0xF << 12) #define IO_WIN_ATTR_SHIFT 8 #define IO_WIN_ATTR_MASK 0xFF #define IO_WIN_TGT_SHIFT 4 #define IO_WIN_TGT_MASK 0xF #define IO_WIN_SYNC_SHIFT 1 #define IO_WIN_SYNC_MASK 0x1 #define IO_WIN_ENA_SHIFT 0 #define IO_WIN_ENA_MASK 0x1 #define WIN_REG_IDX_RD(pre,reg,off,base) \ static __inline uint32_t \ pre ## _ ## reg ## _read(int i) \ { \ return (bus_space_read_4(fdtbus_bs_tag, base, off(i))); \ } #define WIN_REG_IDX_RD2(pre,reg,off,base) \ static __inline uint32_t \ pre ## _ ## reg ## _read(int i, int j) \ { \ return (bus_space_read_4(fdtbus_bs_tag, base, off(i, j))); \ } \ #define WIN_REG_BASE_IDX_RD(pre,reg,off) \ static __inline uint32_t \ pre ## _ ## reg ## _read(uint32_t base, int i) \ { \ return (bus_space_read_4(fdtbus_bs_tag, base, off(i))); \ } #define WIN_REG_BASE_IDX_RD2(pre,reg,off) \ static __inline uint32_t \ pre ## _ ## reg ## _read(uint32_t base, int i, int j) \ { \ return (bus_space_read_4(fdtbus_bs_tag, base, off(i, j))); \ } #define WIN_REG_IDX_WR(pre,reg,off,base) \ static __inline void \ pre ## _ ## reg ## _write(int i, uint32_t val) \ { \ bus_space_write_4(fdtbus_bs_tag, base, off(i), val); \ } #define WIN_REG_IDX_WR2(pre,reg,off,base) \ static __inline void \ pre ## _ ## reg ## _write(int i, int j, uint32_t val) \ { \ bus_space_write_4(fdtbus_bs_tag, base, off(i, j), val); \ } #define 
WIN_REG_BASE_IDX_WR(pre,reg,off) \ static __inline void \ pre ## _ ## reg ## _write(uint32_t base, int i, uint32_t val) \ { \ bus_space_write_4(fdtbus_bs_tag, base, off(i), val); \ } #define WIN_REG_BASE_IDX_WR2(pre,reg,off) \ static __inline void \ pre ## _ ## reg ## _write(uint32_t base, int i, int j, uint32_t val) \ { \ bus_space_write_4(fdtbus_bs_tag, base, off(i, j), val); \ } #define WIN_REG_RD(pre,reg,off,base) \ static __inline uint32_t \ pre ## _ ## reg ## _read(void) \ { \ return (bus_space_read_4(fdtbus_bs_tag, base, off)); \ } #define WIN_REG_BASE_RD(pre,reg,off) \ static __inline uint32_t \ pre ## _ ## reg ## _read(uint32_t base) \ { \ return (bus_space_read_4(fdtbus_bs_tag, base, off)); \ } #define WIN_REG_WR(pre,reg,off,base) \ static __inline void \ pre ## _ ## reg ## _write(uint32_t val) \ { \ bus_space_write_4(fdtbus_bs_tag, base, off, val); \ } #define WIN_REG_BASE_WR(pre,reg,off) \ static __inline void \ pre ## _ ## reg ## _write(uint32_t base, uint32_t val) \ { \ bus_space_write_4(fdtbus_bs_tag, base, off, val); \ } #endif /* _MVWIN_H_ */ Index: user/markj/netdump/sys/arm/nvidia/drm2/tegra_drm_subr.c =================================================================== --- user/markj/netdump/sys/arm/nvidia/drm2/tegra_drm_subr.c (revision 332407) +++ user/markj/netdump/sys/arm/nvidia/drm2/tegra_drm_subr.c (revision 332408) @@ -1,177 +1,177 @@ /*- * Copyright (c) 2015 Michal Meloun * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include int tegra_drm_connector_get_modes(struct drm_connector *connector) { struct tegra_drm_encoder *output; struct edid *edid = NULL; int rv; output = container_of(connector, struct tegra_drm_encoder, connector); /* Panel is first */ if (output->panel != NULL) { /* XXX panel parsing */ return (0); } /* static EDID is second*/ edid = output->edid; /* EDID from monitor is last */ if (edid == NULL) edid = drm_get_edid(connector, output->ddc); if (edid == NULL) return (0); /* Process EDID */ drm_mode_connector_update_edid_property(connector, edid); rv = drm_add_edid_modes(connector, edid); drm_edid_to_eld(connector, edid); return (rv); } struct drm_encoder * tegra_drm_connector_best_encoder(struct drm_connector *connector) { struct tegra_drm_encoder *output; output = container_of(connector, struct tegra_drm_encoder, connector); return &(output->encoder); } enum drm_connector_status tegra_drm_connector_detect(struct drm_connector *connector, bool force) { struct tegra_drm_encoder *output; bool active; int rv; output = container_of(connector, struct tegra_drm_encoder, connector); if (output->gpio_hpd == NULL) { return ((output->panel != NULL) ? connector_status_connected: connector_status_disconnected); } rv = gpio_pin_is_active(output->gpio_hpd, &active); if (rv != 0) { device_printf(output->dev, " GPIO read failed: %d\n", rv); return (connector_status_unknown); } return (active ? connector_status_connected : connector_status_disconnected); } int tegra_drm_encoder_attach(struct tegra_drm_encoder *output, phandle_t node) { int rv; phandle_t ddc; /* XXX parse output panel here */ - rv = OF_getencprop_alloc(node, "nvidia,edid", 1, + rv = OF_getencprop_alloc(node, "nvidia,edid", (void **)&output->edid); /* EDID exist but have invalid size */ if ((rv >= 0) && (rv != sizeof(struct edid))) { device_printf(output->dev, "Malformed \"nvidia,edid\" property\n"); if (output->edid != NULL) free(output->edid, M_OFWPROP); return (ENXIO); } gpio_pin_get_by_ofw_property(output->dev, node, "nvidia,hpd-gpio", &output->gpio_hpd); ddc = 0; OF_getencprop(node, "nvidia,ddc-i2c-bus", &ddc, sizeof(ddc)); if (ddc > 0) output->ddc = OF_device_from_xref(ddc); if ((output->edid == NULL) && (output->ddc == NULL)) return (ENXIO); if (output->gpio_hpd != NULL) { output->connector.polled = // DRM_CONNECTOR_POLL_HPD; DRM_CONNECTOR_POLL_DISCONNECT | DRM_CONNECTOR_POLL_CONNECT; } return (0); } int tegra_drm_encoder_init(struct tegra_drm_encoder *output, struct tegra_drm *drm) { if (output->panel) { /* attach panel */ } return (0); } int tegra_drm_encoder_exit(struct tegra_drm_encoder *output, struct tegra_drm *drm) { if (output->panel) { /* detach panel */ } return (0); -} \ No newline at end of file +} Index: user/markj/netdump/sys/arm/ti/ti_adc.c =================================================================== --- user/markj/netdump/sys/arm/ti/ti_adc.c (revision 332407) +++ user/markj/netdump/sys/arm/ti/ti_adc.c (revision 332408) @@ -1,965 +1,966 @@ /*- * Copyright 2014 Luiz Otavio O Souza * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_evdev.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef EVDEV_SUPPORT #include #include #endif #include #include #include #undef DEBUG_TSC #define DEFAULT_CHARGE_DELAY 0x400 #define STEPDLY_OPEN 0x98 #define ORDER_XP 0 #define ORDER_XN 1 #define ORDER_YP 2 #define ORDER_YN 3 /* Define our 8 steps, one for each input channel. */ static struct ti_adc_input ti_adc_inputs[TI_ADC_NPINS] = { { .stepconfig = ADC_STEPCFG(1), .stepdelay = ADC_STEPDLY(1) }, { .stepconfig = ADC_STEPCFG(2), .stepdelay = ADC_STEPDLY(2) }, { .stepconfig = ADC_STEPCFG(3), .stepdelay = ADC_STEPDLY(3) }, { .stepconfig = ADC_STEPCFG(4), .stepdelay = ADC_STEPDLY(4) }, { .stepconfig = ADC_STEPCFG(5), .stepdelay = ADC_STEPDLY(5) }, { .stepconfig = ADC_STEPCFG(6), .stepdelay = ADC_STEPDLY(6) }, { .stepconfig = ADC_STEPCFG(7), .stepdelay = ADC_STEPDLY(7) }, { .stepconfig = ADC_STEPCFG(8), .stepdelay = ADC_STEPDLY(8) }, }; static int ti_adc_samples[5] = { 0, 2, 4, 8, 16 }; static int ti_adc_detach(device_t dev); #ifdef EVDEV_SUPPORT static void ti_adc_ev_report(struct ti_adc_softc *sc) { evdev_push_event(sc->sc_evdev, EV_ABS, ABS_X, sc->sc_x); evdev_push_event(sc->sc_evdev, EV_ABS, ABS_Y, sc->sc_y); evdev_push_event(sc->sc_evdev, EV_KEY, BTN_TOUCH, sc->sc_pen_down); evdev_sync(sc->sc_evdev); } #endif /* EVDEV */ static void ti_adc_enable(struct ti_adc_softc *sc) { uint32_t reg; TI_ADC_LOCK_ASSERT(sc); if (sc->sc_last_state == 1) return; /* Enable the FIFO0 threshold and the end of sequence interrupt. */ ADC_WRITE4(sc, ADC_IRQENABLE_SET, ADC_IRQ_FIFO0_THRES | ADC_IRQ_FIFO1_THRES | ADC_IRQ_END_OF_SEQ); reg = ADC_CTRL_STEP_WP | ADC_CTRL_STEP_ID; if (sc->sc_tsc_wires > 0) { reg |= ADC_CTRL_TSC_ENABLE; switch (sc->sc_tsc_wires) { case 4: reg |= ADC_CTRL_TSC_4WIRE; break; case 5: reg |= ADC_CTRL_TSC_5WIRE; break; case 8: reg |= ADC_CTRL_TSC_8WIRE; break; default: break; } } reg |= ADC_CTRL_ENABLE; /* Enable the ADC. Run thru enabled steps, start the conversions. */ ADC_WRITE4(sc, ADC_CTRL, reg); sc->sc_last_state = 1; } static void ti_adc_disable(struct ti_adc_softc *sc) { int count; uint32_t data; TI_ADC_LOCK_ASSERT(sc); if (sc->sc_last_state == 0) return; /* Disable all the enabled steps. */ ADC_WRITE4(sc, ADC_STEPENABLE, 0); /* Disable the ADC. */ ADC_WRITE4(sc, ADC_CTRL, ADC_READ4(sc, ADC_CTRL) & ~ADC_CTRL_ENABLE); /* Disable the FIFO0 threshold and the end of sequence interrupt. 
*/ ADC_WRITE4(sc, ADC_IRQENABLE_CLR, ADC_IRQ_FIFO0_THRES | ADC_IRQ_FIFO1_THRES | ADC_IRQ_END_OF_SEQ); /* ACK any pending interrupt. */ ADC_WRITE4(sc, ADC_IRQSTATUS, ADC_READ4(sc, ADC_IRQSTATUS)); /* Drain the FIFO data. */ count = ADC_READ4(sc, ADC_FIFO0COUNT) & ADC_FIFO_COUNT_MSK; while (count > 0) { data = ADC_READ4(sc, ADC_FIFO0DATA); count = ADC_READ4(sc, ADC_FIFO0COUNT) & ADC_FIFO_COUNT_MSK; } count = ADC_READ4(sc, ADC_FIFO1COUNT) & ADC_FIFO_COUNT_MSK; while (count > 0) { data = ADC_READ4(sc, ADC_FIFO1DATA); count = ADC_READ4(sc, ADC_FIFO1COUNT) & ADC_FIFO_COUNT_MSK; } sc->sc_last_state = 0; } static int ti_adc_setup(struct ti_adc_softc *sc) { int ain, i; uint32_t enabled; TI_ADC_LOCK_ASSERT(sc); /* Check for enabled inputs. */ enabled = sc->sc_tsc_enabled; for (i = 0; i < sc->sc_adc_nchannels; i++) { ain = sc->sc_adc_channels[i]; if (ti_adc_inputs[ain].enable) enabled |= (1U << (ain + 1)); } /* Set the ADC global status. */ if (enabled != 0) { ti_adc_enable(sc); /* Update the enabled steps. */ if (enabled != ADC_READ4(sc, ADC_STEPENABLE)) ADC_WRITE4(sc, ADC_STEPENABLE, enabled); } else ti_adc_disable(sc); return (0); } static void ti_adc_input_setup(struct ti_adc_softc *sc, int32_t ain) { struct ti_adc_input *input; uint32_t reg, val; TI_ADC_LOCK_ASSERT(sc); input = &ti_adc_inputs[ain]; reg = input->stepconfig; val = ADC_READ4(sc, reg); /* Set single ended operation. */ val &= ~ADC_STEP_DIFF_CNTRL; /* Set the negative voltage reference. */ val &= ~ADC_STEP_RFM_MSK; /* Set the positive voltage reference. */ val &= ~ADC_STEP_RFP_MSK; /* Set the samples average. */ val &= ~ADC_STEP_AVG_MSK; val |= input->samples << ADC_STEP_AVG_SHIFT; /* Select the desired input. */ val &= ~ADC_STEP_INP_MSK; val |= ain << ADC_STEP_INP_SHIFT; /* Set the ADC to one-shot mode. */ val &= ~ADC_STEP_MODE_MSK; ADC_WRITE4(sc, reg, val); } static void ti_adc_reset(struct ti_adc_softc *sc) { int ain, i; TI_ADC_LOCK_ASSERT(sc); /* Disable all the inputs. */ for (i = 0; i < sc->sc_adc_nchannels; i++) { ain = sc->sc_adc_channels[i]; ti_adc_inputs[ain].enable = 0; } } static int ti_adc_clockdiv_proc(SYSCTL_HANDLER_ARGS) { int error, reg; struct ti_adc_softc *sc; sc = (struct ti_adc_softc *)arg1; TI_ADC_LOCK(sc); reg = (int)ADC_READ4(sc, ADC_CLKDIV) + 1; TI_ADC_UNLOCK(sc); error = sysctl_handle_int(oidp, ®, sizeof(reg), req); if (error != 0 || req->newptr == NULL) return (error); /* * The actual written value is the prescaler setting - 1. * Enforce a minimum value of 10 (i.e. 9) which limits the maximum * ADC clock to ~2.4Mhz (CLK_M_OSC / 10). */ reg--; if (reg < 9) reg = 9; if (reg > USHRT_MAX) reg = USHRT_MAX; TI_ADC_LOCK(sc); /* Disable the ADC. */ ti_adc_disable(sc); /* Update the ADC prescaler setting. */ ADC_WRITE4(sc, ADC_CLKDIV, reg); /* Enable the ADC again. */ ti_adc_setup(sc); TI_ADC_UNLOCK(sc); return (0); } static int ti_adc_enable_proc(SYSCTL_HANDLER_ARGS) { int error; int32_t enable; struct ti_adc_softc *sc; struct ti_adc_input *input; input = (struct ti_adc_input *)arg1; sc = input->sc; enable = input->enable; error = sysctl_handle_int(oidp, &enable, sizeof(enable), req); if (error != 0 || req->newptr == NULL) return (error); if (enable) enable = 1; TI_ADC_LOCK(sc); /* Setup the ADC as needed. 
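	 * A change in the enable state rebuilds the global step-enable
	 * mask via ti_adc_setup(); the stale converted value is cleared
	 * when an input is switched off.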
*/ if (input->enable != enable) { input->enable = enable; ti_adc_setup(sc); if (input->enable == 0) input->value = 0; } TI_ADC_UNLOCK(sc); return (0); } static int ti_adc_open_delay_proc(SYSCTL_HANDLER_ARGS) { int error, reg; struct ti_adc_softc *sc; struct ti_adc_input *input; input = (struct ti_adc_input *)arg1; sc = input->sc; TI_ADC_LOCK(sc); reg = (int)ADC_READ4(sc, input->stepdelay) & ADC_STEP_OPEN_DELAY; TI_ADC_UNLOCK(sc); error = sysctl_handle_int(oidp, ®, sizeof(reg), req); if (error != 0 || req->newptr == NULL) return (error); if (reg < 0) reg = 0; TI_ADC_LOCK(sc); ADC_WRITE4(sc, input->stepdelay, reg & ADC_STEP_OPEN_DELAY); TI_ADC_UNLOCK(sc); return (0); } static int ti_adc_samples_avg_proc(SYSCTL_HANDLER_ARGS) { int error, samples, i; struct ti_adc_softc *sc; struct ti_adc_input *input; input = (struct ti_adc_input *)arg1; sc = input->sc; if (input->samples > nitems(ti_adc_samples)) input->samples = nitems(ti_adc_samples); samples = ti_adc_samples[input->samples]; error = sysctl_handle_int(oidp, &samples, 0, req); if (error != 0 || req->newptr == NULL) return (error); TI_ADC_LOCK(sc); if (samples != ti_adc_samples[input->samples]) { input->samples = 0; for (i = 0; i < nitems(ti_adc_samples); i++) if (samples >= ti_adc_samples[i]) input->samples = i; ti_adc_input_setup(sc, input->input); } TI_ADC_UNLOCK(sc); return (error); } static void ti_adc_read_data(struct ti_adc_softc *sc) { int count, ain; struct ti_adc_input *input; uint32_t data; TI_ADC_LOCK_ASSERT(sc); /* Read the available data. */ count = ADC_READ4(sc, ADC_FIFO0COUNT) & ADC_FIFO_COUNT_MSK; while (count > 0) { data = ADC_READ4(sc, ADC_FIFO0DATA); ain = (data & ADC_FIFO_STEP_ID_MSK) >> ADC_FIFO_STEP_ID_SHIFT; input = &ti_adc_inputs[ain]; if (input->enable == 0) input->value = 0; else input->value = (int32_t)(data & ADC_FIFO_DATA_MSK); count = ADC_READ4(sc, ADC_FIFO0COUNT) & ADC_FIFO_COUNT_MSK; } } static int cmp_values(const void *a, const void *b) { const uint32_t *v1, *v2; v1 = a; v2 = b; if (*v1 < *v2) return -1; if (*v1 > *v2) return 1; return (0); } static void ti_adc_tsc_read_data(struct ti_adc_softc *sc) { int count; uint32_t data[16]; uint32_t x, y; int i, start, end; TI_ADC_LOCK_ASSERT(sc); /* Read the available data. */ count = ADC_READ4(sc, ADC_FIFO1COUNT) & ADC_FIFO_COUNT_MSK; if (count == 0) return; i = 0; while (count > 0) { data[i++] = ADC_READ4(sc, ADC_FIFO1DATA) & ADC_FIFO_DATA_MSK; count = ADC_READ4(sc, ADC_FIFO1COUNT) & ADC_FIFO_COUNT_MSK; } if (sc->sc_coord_readouts > 3) { start = 1; end = sc->sc_coord_readouts - 1; qsort(data, sc->sc_coord_readouts, sizeof(data[0]), &cmp_values); qsort(&data[sc->sc_coord_readouts + 2], sc->sc_coord_readouts, sizeof(data[0]), &cmp_values); } else { start = 0; end = sc->sc_coord_readouts; } x = y = 0; for (i = start; i < end; i++) y += data[i]; y /= (end - start); for (i = sc->sc_coord_readouts + 2 + start; i < sc->sc_coord_readouts + 2 + end; i++) x += data[i]; x /= (end - start); #ifdef DEBUG_TSC device_printf(sc->sc_dev, "touchscreen x: %d, y: %d\n", x, y); #endif #ifdef EVDEV_SUPPORT if ((sc->sc_x != x) || (sc->sc_y != y)) { sc->sc_x = x; sc->sc_y = y; ti_adc_ev_report(sc); } #endif } static void ti_adc_intr_locked(struct ti_adc_softc *sc, uint32_t status) { /* Read the available data. */ if (status & ADC_IRQ_FIFO0_THRES) ti_adc_read_data(sc); } static void ti_adc_tsc_intr_locked(struct ti_adc_softc *sc, uint32_t status) { /* Read the available data. 
*/ if (status & ADC_IRQ_FIFO1_THRES) ti_adc_tsc_read_data(sc); } static void ti_adc_intr(void *arg) { struct ti_adc_softc *sc; uint32_t status, rawstatus; sc = (struct ti_adc_softc *)arg; TI_ADC_LOCK(sc); rawstatus = ADC_READ4(sc, ADC_IRQSTATUS_RAW); status = ADC_READ4(sc, ADC_IRQSTATUS); if (rawstatus & ADC_IRQ_HW_PEN_ASYNC) { sc->sc_pen_down = 1; status |= ADC_IRQ_HW_PEN_ASYNC; ADC_WRITE4(sc, ADC_IRQENABLE_CLR, ADC_IRQ_HW_PEN_ASYNC); #ifdef EVDEV_SUPPORT ti_adc_ev_report(sc); #endif } if (rawstatus & ADC_IRQ_PEN_UP) { sc->sc_pen_down = 0; status |= ADC_IRQ_PEN_UP; #ifdef EVDEV_SUPPORT ti_adc_ev_report(sc); #endif } if (status & ADC_IRQ_FIFO0_THRES) ti_adc_intr_locked(sc, status); if (status & ADC_IRQ_FIFO1_THRES) ti_adc_tsc_intr_locked(sc, status); if (status) { /* ACK the interrupt. */ ADC_WRITE4(sc, ADC_IRQSTATUS, status); } /* Start the next conversion ? */ if (status & ADC_IRQ_END_OF_SEQ) ti_adc_setup(sc); TI_ADC_UNLOCK(sc); } static void ti_adc_sysctl_init(struct ti_adc_softc *sc) { char pinbuf[3]; struct sysctl_ctx_list *ctx; struct sysctl_oid *tree_node, *inp_node, *inpN_node; struct sysctl_oid_list *tree, *inp_tree, *inpN_tree; int ain, i; /* * Add per-pin sysctl tree/handlers. */ ctx = device_get_sysctl_ctx(sc->sc_dev); tree_node = device_get_sysctl_tree(sc->sc_dev); tree = SYSCTL_CHILDREN(tree_node); SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, "clockdiv", CTLFLAG_RW | CTLTYPE_UINT, sc, 0, ti_adc_clockdiv_proc, "IU", "ADC clock prescaler"); inp_node = SYSCTL_ADD_NODE(ctx, tree, OID_AUTO, "ain", CTLFLAG_RD, NULL, "ADC inputs"); inp_tree = SYSCTL_CHILDREN(inp_node); for (i = 0; i < sc->sc_adc_nchannels; i++) { ain = sc->sc_adc_channels[i]; snprintf(pinbuf, sizeof(pinbuf), "%d", ain); inpN_node = SYSCTL_ADD_NODE(ctx, inp_tree, OID_AUTO, pinbuf, CTLFLAG_RD, NULL, "ADC input"); inpN_tree = SYSCTL_CHILDREN(inpN_node); SYSCTL_ADD_PROC(ctx, inpN_tree, OID_AUTO, "enable", CTLFLAG_RW | CTLTYPE_UINT, &ti_adc_inputs[ain], 0, ti_adc_enable_proc, "IU", "Enable ADC input"); SYSCTL_ADD_PROC(ctx, inpN_tree, OID_AUTO, "open_delay", CTLFLAG_RW | CTLTYPE_UINT, &ti_adc_inputs[ain], 0, ti_adc_open_delay_proc, "IU", "ADC open delay"); SYSCTL_ADD_PROC(ctx, inpN_tree, OID_AUTO, "samples_avg", CTLFLAG_RW | CTLTYPE_UINT, &ti_adc_inputs[ain], 0, ti_adc_samples_avg_proc, "IU", "ADC samples average"); SYSCTL_ADD_INT(ctx, inpN_tree, OID_AUTO, "input", CTLFLAG_RD, &ti_adc_inputs[ain].value, 0, "Converted raw value for the ADC input"); } } static void ti_adc_inputs_init(struct ti_adc_softc *sc) { int ain, i; struct ti_adc_input *input; TI_ADC_LOCK(sc); for (i = 0; i < sc->sc_adc_nchannels; i++) { ain = sc->sc_adc_channels[i]; input = &ti_adc_inputs[ain]; input->sc = sc; input->input = ain; input->value = 0; input->enable = 0; input->samples = 0; ti_adc_input_setup(sc, ain); } TI_ADC_UNLOCK(sc); } static void ti_adc_tsc_init(struct ti_adc_softc *sc) { int i, start_step, end_step; uint32_t stepconfig, val; TI_ADC_LOCK(sc); /* X coordinates */ stepconfig = ADC_STEP_FIFO1 | (4 << ADC_STEP_AVG_SHIFT) | ADC_STEP_MODE_HW_ONESHOT | sc->sc_xp_bit; if (sc->sc_tsc_wires == 4) stepconfig |= ADC_STEP_INP(sc->sc_yp_inp) | sc->sc_xn_bit; else if (sc->sc_tsc_wires == 5) stepconfig |= ADC_STEP_INP(4) | sc->sc_xn_bit | sc->sc_yn_bit | sc->sc_yp_bit; else if (sc->sc_tsc_wires == 8) stepconfig |= ADC_STEP_INP(sc->sc_yp_inp) | sc->sc_xn_bit; start_step = ADC_STEPS - sc->sc_coord_readouts + 1; end_step = start_step + sc->sc_coord_readouts - 1; for (i = start_step; i <= end_step; i++) { ADC_WRITE4(sc, ADC_STEPCFG(i), stepconfig); 
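		/*
		 * Steps are carved from the top of the sequencer down: with
		 * N = sc_coord_readouts, the top N steps sample X, the two
		 * below them sample Z (pressure) and the N below those
		 * sample Y, which matches the FIFO1 threshold of
		 * 2 * N + 2 - 1 programmed below.
		 */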
ADC_WRITE4(sc, ADC_STEPDLY(i), STEPDLY_OPEN); } /* Y coordinates */ stepconfig = ADC_STEP_FIFO1 | (4 << ADC_STEP_AVG_SHIFT) | ADC_STEP_MODE_HW_ONESHOT | sc->sc_yn_bit | ADC_STEP_INM(8); if (sc->sc_tsc_wires == 4) stepconfig |= ADC_STEP_INP(sc->sc_xp_inp) | sc->sc_yp_bit; else if (sc->sc_tsc_wires == 5) stepconfig |= ADC_STEP_INP(4) | sc->sc_xp_bit | sc->sc_xn_bit | sc->sc_yp_bit; else if (sc->sc_tsc_wires == 8) stepconfig |= ADC_STEP_INP(sc->sc_xp_inp) | sc->sc_yp_bit; start_step = ADC_STEPS - (sc->sc_coord_readouts*2 + 2) + 1; end_step = start_step + sc->sc_coord_readouts - 1; for (i = start_step; i <= end_step; i++) { ADC_WRITE4(sc, ADC_STEPCFG(i), stepconfig); ADC_WRITE4(sc, ADC_STEPDLY(i), STEPDLY_OPEN); } /* Charge config */ val = ADC_READ4(sc, ADC_IDLECONFIG); ADC_WRITE4(sc, ADC_TC_CHARGE_STEPCONFIG, val); ADC_WRITE4(sc, ADC_TC_CHARGE_DELAY, sc->sc_charge_delay); /* 2 steps for Z */ start_step = ADC_STEPS - (sc->sc_coord_readouts + 2) + 1; stepconfig = ADC_STEP_FIFO1 | (4 << ADC_STEP_AVG_SHIFT) | ADC_STEP_MODE_HW_ONESHOT | sc->sc_yp_bit | sc->sc_xn_bit | ADC_STEP_INP(sc->sc_xp_inp) | ADC_STEP_INM(8); ADC_WRITE4(sc, ADC_STEPCFG(start_step), stepconfig); ADC_WRITE4(sc, ADC_STEPDLY(start_step), STEPDLY_OPEN); start_step++; stepconfig |= ADC_STEP_INP(sc->sc_yn_inp); ADC_WRITE4(sc, ADC_STEPCFG(start_step), stepconfig); ADC_WRITE4(sc, ADC_STEPDLY(start_step), STEPDLY_OPEN); ADC_WRITE4(sc, ADC_FIFO1THRESHOLD, (sc->sc_coord_readouts*2 + 2) - 1); sc->sc_tsc_enabled = 1; start_step = ADC_STEPS - (sc->sc_coord_readouts*2 + 2) + 1; end_step = ADC_STEPS; for (i = start_step; i <= end_step; i++) { sc->sc_tsc_enabled |= (1 << i); } TI_ADC_UNLOCK(sc); } static void ti_adc_idlestep_init(struct ti_adc_softc *sc) { uint32_t val; val = ADC_STEP_YNN_SW | ADC_STEP_INM(8) | ADC_STEP_INP(8) | ADC_STEP_YPN_SW; ADC_WRITE4(sc, ADC_IDLECONFIG, val); } static int ti_adc_config_wires(struct ti_adc_softc *sc, int *wire_configs, int nwire_configs) { int i; int wire, ai; for (i = 0; i < nwire_configs; i++) { wire = wire_configs[i] & 0xf; ai = (wire_configs[i] >> 4) & 0xf; switch (wire) { case ORDER_XP: sc->sc_xp_bit = ADC_STEP_XPP_SW; sc->sc_xp_inp = ai; break; case ORDER_XN: sc->sc_xn_bit = ADC_STEP_XNN_SW; sc->sc_xn_inp = ai; break; case ORDER_YP: sc->sc_yp_bit = ADC_STEP_YPP_SW; sc->sc_yp_inp = ai; break; case ORDER_YN: sc->sc_yn_bit = ADC_STEP_YNN_SW; sc->sc_yn_inp = ai; break; default: device_printf(sc->sc_dev, "Invalid wire config\n"); return (-1); } } return (0); } static int ti_adc_probe(device_t dev) { if (!ofw_bus_is_compatible(dev, "ti,am3359-tscadc")) return (ENXIO); device_set_desc(dev, "TI ADC controller"); return (BUS_PROBE_DEFAULT); } static int ti_adc_attach(device_t dev) { int err, rid, i; struct ti_adc_softc *sc; uint32_t rev, reg; phandle_t node, child; pcell_t cell; int *channels; int nwire_configs; int *wire_configs; sc = device_get_softc(dev); sc->sc_dev = dev; node = ofw_bus_get_node(dev); sc->sc_tsc_wires = 0; sc->sc_coord_readouts = 1; sc->sc_x_plate_resistance = 0; sc->sc_charge_delay = DEFAULT_CHARGE_DELAY; /* Read "tsc" node properties */ child = ofw_bus_find_child(node, "tsc"); if (child != 0 && OF_hasprop(child, "ti,wires")) { if ((OF_getencprop(child, "ti,wires", &cell, sizeof(cell))) > 0) sc->sc_tsc_wires = cell; if ((OF_getencprop(child, "ti,coordinate-readouts", &cell, sizeof(cell))) > 0) sc->sc_coord_readouts = cell; if ((OF_getencprop(child, "ti,x-plate-resistance", &cell, sizeof(cell))) > 0) sc->sc_x_plate_resistance = cell; if ((OF_getencprop(child, "ti,charge-delay", &cell, 
sizeof(cell))) > 0) sc->sc_charge_delay = cell; - nwire_configs = OF_getencprop_alloc(child, "ti,wire-config", - sizeof(*wire_configs), (void **)&wire_configs); + nwire_configs = OF_getencprop_alloc_multi(child, + "ti,wire-config", sizeof(*wire_configs), + (void **)&wire_configs); if (nwire_configs != sc->sc_tsc_wires) { device_printf(sc->sc_dev, "invalid number of ti,wire-config: %d (should be %d)\n", nwire_configs, sc->sc_tsc_wires); OF_prop_free(wire_configs); return (EINVAL); } err = ti_adc_config_wires(sc, wire_configs, nwire_configs); OF_prop_free(wire_configs); if (err) return (EINVAL); } /* Read "adc" node properties */ child = ofw_bus_find_child(node, "adc"); if (child != 0) { - sc->sc_adc_nchannels = OF_getencprop_alloc(child, "ti,adc-channels", - sizeof(*channels), (void **)&channels); + sc->sc_adc_nchannels = OF_getencprop_alloc_multi(child, + "ti,adc-channels", sizeof(*channels), (void **)&channels); if (sc->sc_adc_nchannels > 0) { for (i = 0; i < sc->sc_adc_nchannels; i++) sc->sc_adc_channels[i] = channels[i]; OF_prop_free(channels); } } /* Sanity check FDT data */ if (sc->sc_tsc_wires + sc->sc_adc_nchannels > TI_ADC_NPINS) { device_printf(dev, "total number of chanels (%d) is larger than %d\n", sc->sc_tsc_wires + sc->sc_adc_nchannels, TI_ADC_NPINS); return (ENXIO); } rid = 0; sc->sc_mem_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE); if (!sc->sc_mem_res) { device_printf(dev, "cannot allocate memory window\n"); return (ENXIO); } /* Activate the ADC_TSC module. */ err = ti_prcm_clk_enable(TSC_ADC_CLK); if (err) return (err); rid = 0; sc->sc_irq_res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &rid, RF_ACTIVE); if (!sc->sc_irq_res) { bus_release_resource(dev, SYS_RES_MEMORY, 0, sc->sc_mem_res); device_printf(dev, "cannot allocate interrupt\n"); return (ENXIO); } if (bus_setup_intr(dev, sc->sc_irq_res, INTR_TYPE_MISC | INTR_MPSAFE, NULL, ti_adc_intr, sc, &sc->sc_intrhand) != 0) { bus_release_resource(dev, SYS_RES_IRQ, 0, sc->sc_irq_res); bus_release_resource(dev, SYS_RES_MEMORY, 0, sc->sc_mem_res); device_printf(dev, "Unable to setup the irq handler.\n"); return (ENXIO); } /* Check the ADC revision. */ rev = ADC_READ4(sc, ADC_REVISION); device_printf(dev, "scheme: %#x func: %#x rtl: %d rev: %d.%d custom rev: %d\n", (rev & ADC_REV_SCHEME_MSK) >> ADC_REV_SCHEME_SHIFT, (rev & ADC_REV_FUNC_MSK) >> ADC_REV_FUNC_SHIFT, (rev & ADC_REV_RTL_MSK) >> ADC_REV_RTL_SHIFT, (rev & ADC_REV_MAJOR_MSK) >> ADC_REV_MAJOR_SHIFT, rev & ADC_REV_MINOR_MSK, (rev & ADC_REV_CUSTOM_MSK) >> ADC_REV_CUSTOM_SHIFT); reg = ADC_READ4(sc, ADC_CTRL); ADC_WRITE4(sc, ADC_CTRL, reg | ADC_CTRL_STEP_WP | ADC_CTRL_STEP_ID); /* * Set the ADC prescaler to 2400 if touchscreen is not enabled * and to 24 if it is. This sets the ADC clock to ~10Khz and * ~1Mhz respectively (CLK_M_OSC / prescaler). 
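	 * With the usual 24 MHz master oscillator that works out to
	 * 24 MHz / 2400 = 10 kHz and 24 MHz / 24 = 1 MHz; note that the
	 * register holds the divider minus one.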
*/ if (sc->sc_tsc_wires) ADC_WRITE4(sc, ADC_CLKDIV, 24 - 1); else ADC_WRITE4(sc, ADC_CLKDIV, 2400 - 1); TI_ADC_LOCK_INIT(sc); ti_adc_idlestep_init(sc); ti_adc_inputs_init(sc); ti_adc_sysctl_init(sc); ti_adc_tsc_init(sc); TI_ADC_LOCK(sc); ti_adc_setup(sc); TI_ADC_UNLOCK(sc); #ifdef EVDEV_SUPPORT if (sc->sc_tsc_wires > 0) { sc->sc_evdev = evdev_alloc(); evdev_set_name(sc->sc_evdev, device_get_desc(dev)); evdev_set_phys(sc->sc_evdev, device_get_nameunit(dev)); evdev_set_id(sc->sc_evdev, BUS_VIRTUAL, 0, 0, 0); evdev_support_prop(sc->sc_evdev, INPUT_PROP_DIRECT); evdev_support_event(sc->sc_evdev, EV_SYN); evdev_support_event(sc->sc_evdev, EV_ABS); evdev_support_event(sc->sc_evdev, EV_KEY); evdev_support_abs(sc->sc_evdev, ABS_X, 0, 0, ADC_MAX_VALUE, 0, 0, 0); evdev_support_abs(sc->sc_evdev, ABS_Y, 0, 0, ADC_MAX_VALUE, 0, 0, 0); evdev_support_key(sc->sc_evdev, BTN_TOUCH); err = evdev_register(sc->sc_evdev); if (err) { device_printf(dev, "failed to register evdev: error=%d\n", err); ti_adc_detach(dev); return (err); } sc->sc_pen_down = 0; sc->sc_x = -1; sc->sc_y = -1; } #endif /* EVDEV */ return (0); } static int ti_adc_detach(device_t dev) { struct ti_adc_softc *sc; sc = device_get_softc(dev); /* Turn off the ADC. */ TI_ADC_LOCK(sc); ti_adc_reset(sc); ti_adc_setup(sc); #ifdef EVDEV_SUPPORT evdev_free(sc->sc_evdev); #endif TI_ADC_UNLOCK(sc); TI_ADC_LOCK_DESTROY(sc); if (sc->sc_intrhand) bus_teardown_intr(dev, sc->sc_irq_res, sc->sc_intrhand); if (sc->sc_irq_res) bus_release_resource(dev, SYS_RES_IRQ, 0, sc->sc_irq_res); if (sc->sc_mem_res) bus_release_resource(dev, SYS_RES_MEMORY, 0, sc->sc_mem_res); return (bus_generic_detach(dev)); } static device_method_t ti_adc_methods[] = { DEVMETHOD(device_probe, ti_adc_probe), DEVMETHOD(device_attach, ti_adc_attach), DEVMETHOD(device_detach, ti_adc_detach), DEVMETHOD_END }; static driver_t ti_adc_driver = { "ti_adc", ti_adc_methods, sizeof(struct ti_adc_softc), }; static devclass_t ti_adc_devclass; DRIVER_MODULE(ti_adc, simplebus, ti_adc_driver, ti_adc_devclass, 0, 0); MODULE_VERSION(ti_adc, 1); MODULE_DEPEND(ti_adc, simplebus, 1, 1, 1); #ifdef EVDEV_SUPPORT MODULE_DEPEND(ti_adc, evdev, 1, 1, 1); #endif Index: user/markj/netdump/sys/arm/ti/ti_pinmux.c =================================================================== --- user/markj/netdump/sys/arm/ti/ti_pinmux.c (revision 332407) +++ user/markj/netdump/sys/arm/ti/ti_pinmux.c (revision 332408) @@ -1,461 +1,461 @@ /* * Copyright (c) 2010 * Ben Gray . * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by Ben Gray. * 4. The name of the company nor the name of the author may be used to * endorse or promote products derived from this software without specific * prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY BEN GRAY ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL BEN GRAY BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /** * Exposes pinmux module to pinctrl-compatible interface */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "ti_pinmux.h" struct pincfg { uint32_t reg; uint32_t conf; }; static struct resource_spec ti_pinmux_res_spec[] = { { SYS_RES_MEMORY, 0, RF_ACTIVE }, /* Control memory window */ { -1, 0 } }; static struct ti_pinmux_softc *ti_pinmux_sc; #define ti_pinmux_read_2(sc, reg) \ bus_space_read_2((sc)->sc_bst, (sc)->sc_bsh, (reg)) #define ti_pinmux_write_2(sc, reg, val) \ bus_space_write_2((sc)->sc_bst, (sc)->sc_bsh, (reg), (val)) #define ti_pinmux_read_4(sc, reg) \ bus_space_read_4((sc)->sc_bst, (sc)->sc_bsh, (reg)) #define ti_pinmux_write_4(sc, reg, val) \ bus_space_write_4((sc)->sc_bst, (sc)->sc_bsh, (reg), (val)) /** * ti_padconf_devmap - Array of pins, should be defined one per SoC * * This array is typically defined in one of the targeted *_scm_pinumx.c * files and is specific to the given SoC platform. Each entry in the array * corresponds to an individual pin. */ static const struct ti_pinmux_device *ti_pinmux_dev; /** * ti_pinmux_padconf_from_name - searches the list of pads and returns entry * with matching ball name. * @ballname: the name of the ball * * RETURNS: * A pointer to the matching padconf or NULL if the ball wasn't found. */ static const struct ti_pinmux_padconf* ti_pinmux_padconf_from_name(const char *ballname) { const struct ti_pinmux_padconf *padconf; padconf = ti_pinmux_dev->padconf; while (padconf->ballname != NULL) { if (strcmp(ballname, padconf->ballname) == 0) return(padconf); padconf++; } return (NULL); } /** * ti_pinmux_padconf_set_internal - sets the muxmode and state for a pad/pin * @padconf: pointer to the pad structure * @muxmode: the name of the mode to use for the pin, i.e. "uart1_rx" * @state: the state to put the pad/pin in, i.e. PADCONF_PIN_??? * * * LOCKING: * Internally locks it's own context. * * RETURNS: * 0 on success. * EINVAL if pin requested is outside valid range or already in use. 
*/ static int ti_pinmux_padconf_set_internal(struct ti_pinmux_softc *sc, const struct ti_pinmux_padconf *padconf, const char *muxmode, unsigned int state) { unsigned int mode; uint16_t reg_val; /* populate the new value for the PADCONF register */ reg_val = (uint16_t)(state & ti_pinmux_dev->padconf_sate_mask); /* find the new mode requested */ for (mode = 0; mode < 8; mode++) { if ((padconf->muxmodes[mode] != NULL) && (strcmp(padconf->muxmodes[mode], muxmode) == 0)) { break; } } /* couldn't find the mux mode */ if (mode >= 8) { printf("Invalid mode \"%s\"\n", muxmode); return (EINVAL); } /* set the mux mode */ reg_val |= (uint16_t)(mode & ti_pinmux_dev->padconf_muxmode_mask); if (bootverbose) device_printf(sc->sc_dev, "setting internal %x for %s\n", reg_val, muxmode); /* write the register value (16-bit writes) */ ti_pinmux_write_2(sc, padconf->reg_off, reg_val); return (0); } /** * ti_pinmux_padconf_set - sets the muxmode and state for a pad/pin * @padname: the name of the pad, i.e. "c12" * @muxmode: the name of the mode to use for the pin, i.e. "uart1_rx" * @state: the state to put the pad/pin in, i.e. PADCONF_PIN_??? * * * LOCKING: * Internally locks it's own context. * * RETURNS: * 0 on success. * EINVAL if pin requested is outside valid range or already in use. */ int ti_pinmux_padconf_set(const char *padname, const char *muxmode, unsigned int state) { const struct ti_pinmux_padconf *padconf; if (!ti_pinmux_sc) return (ENXIO); /* find the pin in the devmap */ padconf = ti_pinmux_padconf_from_name(padname); if (padconf == NULL) return (EINVAL); return (ti_pinmux_padconf_set_internal(ti_pinmux_sc, padconf, muxmode, state)); } /** * ti_pinmux_padconf_get - gets the muxmode and state for a pad/pin * @padname: the name of the pad, i.e. "c12" * @muxmode: upon return will contain the name of the muxmode of the pin * @state: upon return will contain the state of the pad/pin * * * LOCKING: * Internally locks it's own context. * * RETURNS: * 0 on success. * EINVAL if pin requested is outside valid range or already in use. */ int ti_pinmux_padconf_get(const char *padname, const char **muxmode, unsigned int *state) { const struct ti_pinmux_padconf *padconf; uint16_t reg_val; if (!ti_pinmux_sc) return (ENXIO); /* find the pin in the devmap */ padconf = ti_pinmux_padconf_from_name(padname); if (padconf == NULL) return (EINVAL); /* read the register value (16-bit reads) */ reg_val = ti_pinmux_read_2(ti_pinmux_sc, padconf->reg_off); /* save the state */ if (state) *state = (reg_val & ti_pinmux_dev->padconf_sate_mask); /* save the mode */ if (muxmode) *muxmode = padconf->muxmodes[(reg_val & ti_pinmux_dev->padconf_muxmode_mask)]; return (0); } /** * ti_pinmux_padconf_set_gpiomode - converts a pad to GPIO mode. * @gpio: the GPIO pin number (0-195) * @state: the state to put the pad/pin in, i.e. PADCONF_PIN_??? * * * * LOCKING: * Internally locks it's own context. * * RETURNS: * 0 on success. * EINVAL if pin requested is outside valid range or already in use. 
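A consumer-side sketch of the setter above may help: board code muxes a pad by ball name and function name. The ball, mode, and state value here are illustrative placeholders, not entries from a real padconf table:

#include "ti_pinmux.h"

/* Hypothetical state value; real callers use the platform's pad-state
 * constants. */
#define	MY_STATE_INPUT_PULLUP	0x30

static int
my_mux_uart_rx(void)
{
	/* Ball and mode names come from the SoC-specific padconf table. */
	return (ti_pinmux_padconf_set("c12", "uart1_rx",
	    MY_STATE_INPUT_PULLUP));
}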
*/ int ti_pinmux_padconf_set_gpiomode(uint32_t gpio, unsigned int state) { const struct ti_pinmux_padconf *padconf; uint16_t reg_val; if (!ti_pinmux_sc) return (ENXIO); /* find the gpio pin in the padconf array */ padconf = ti_pinmux_dev->padconf; while (padconf->ballname != NULL) { if (padconf->gpio_pin == gpio) break; padconf++; } if (padconf->ballname == NULL) return (EINVAL); /* populate the new value for the PADCONF register */ reg_val = (uint16_t)(state & ti_pinmux_dev->padconf_sate_mask); /* set the mux mode */ reg_val |= (uint16_t)(padconf->gpio_mode & ti_pinmux_dev->padconf_muxmode_mask); /* write the register value (16-bit writes) */ ti_pinmux_write_2(ti_pinmux_sc, padconf->reg_off, reg_val); return (0); } /** * ti_pinmux_padconf_get_gpiomode - gets the current GPIO mode of the pin * @gpio: the GPIO pin number (0-195) * @state: upon return will contain the state * * * * LOCKING: * Internally locks it's own context. * * RETURNS: * 0 on success. * EINVAL if pin requested is outside valid range or not configured as GPIO. */ int ti_pinmux_padconf_get_gpiomode(uint32_t gpio, unsigned int *state) { const struct ti_pinmux_padconf *padconf; uint16_t reg_val; if (!ti_pinmux_sc) return (ENXIO); /* find the gpio pin in the padconf array */ padconf = ti_pinmux_dev->padconf; while (padconf->ballname != NULL) { if (padconf->gpio_pin == gpio) break; padconf++; } if (padconf->ballname == NULL) return (EINVAL); /* read the current register settings */ reg_val = ti_pinmux_read_2(ti_pinmux_sc, padconf->reg_off); /* check to make sure the pins is configured as GPIO in the first state */ if ((reg_val & ti_pinmux_dev->padconf_muxmode_mask) != padconf->gpio_mode) return (EINVAL); /* read and store the reset of the state, i.e. pull-up, pull-down, etc */ if (state) *state = (reg_val & ti_pinmux_dev->padconf_sate_mask); return (0); } static int ti_pinmux_configure_pins(device_t dev, phandle_t cfgxref) { struct pincfg *cfgtuples, *cfg; phandle_t cfgnode; int i, ntuples; static struct ti_pinmux_softc *sc; sc = device_get_softc(dev); cfgnode = OF_node_from_xref(cfgxref); - ntuples = OF_getencprop_alloc(cfgnode, "pinctrl-single,pins", sizeof(*cfgtuples), - (void **)&cfgtuples); + ntuples = OF_getencprop_alloc_multi(cfgnode, "pinctrl-single,pins", + sizeof(*cfgtuples), (void **)&cfgtuples); if (ntuples < 0) return (ENOENT); if (ntuples == 0) return (0); /* Empty property is not an error. */ for (i = 0, cfg = cfgtuples; i < ntuples; i++, cfg++) { if (bootverbose) { char name[32]; OF_getprop(cfgnode, "name", &name, sizeof(name)); printf("%16s: muxreg 0x%04x muxval 0x%02x\n", name, cfg->reg, cfg->conf); } /* write the register value (16-bit writes) */ ti_pinmux_write_2(sc, cfg->reg, cfg->conf); } OF_prop_free(cfgtuples); return (0); } /* * Device part of OMAP SCM driver */ static int ti_pinmux_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (!ofw_bus_is_compatible(dev, "pinctrl-single")) return (ENXIO); if (ti_pinmux_sc) { printf("%s: multiple pinctrl modules in device tree data, ignoring\n", __func__); return (EEXIST); } switch (ti_chip()) { #ifdef SOC_OMAP4 case CHIP_OMAP_4: ti_pinmux_dev = &omap4_pinmux_dev; break; #endif #ifdef SOC_TI_AM335X case CHIP_AM335X: ti_pinmux_dev = &ti_am335x_pinmux_dev; break; #endif default: printf("Unknown CPU in pinmux\n"); return (ENXIO); } device_set_desc(dev, "TI Pinmux Module"); return (BUS_PROBE_DEFAULT); } /** * ti_pinmux_attach - attaches the pinmux to the simplebus * @dev: new device * * RETURNS * Zero on success or ENXIO if an error occuried. 
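The "pinctrl-single,pins" property parsed by ti_pinmux_configure_pins() above is a flat list of (register offset, conf value) cell pairs, one struct pincfg per tuple. With a hypothetical FDT entry, the effect of each tuple reduces to a single 16-bit write:

/*
 * pinctrl-single,pins = <0x954 0x27>;	(hypothetical offset/value)
 */
static void
my_apply_tuple(struct ti_pinmux_softc *sc)
{
	/* cfg->reg = 0x954, cfg->conf = 0x27 */
	ti_pinmux_write_2(sc, 0x954, 0x27);
}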
*/ static int ti_pinmux_attach(device_t dev) { struct ti_pinmux_softc *sc = device_get_softc(dev); #if 0 if (ti_pinmux_sc) return (ENXIO); #endif sc->sc_dev = dev; if (bus_alloc_resources(dev, ti_pinmux_res_spec, sc->sc_res)) { device_printf(dev, "could not allocate resources\n"); return (ENXIO); } sc->sc_bst = rman_get_bustag(sc->sc_res[0]); sc->sc_bsh = rman_get_bushandle(sc->sc_res[0]); if (ti_pinmux_sc == NULL) ti_pinmux_sc = sc; fdt_pinctrl_register(dev, "pinctrl-single,pins"); fdt_pinctrl_configure_tree(dev); return (0); } static device_method_t ti_pinmux_methods[] = { DEVMETHOD(device_probe, ti_pinmux_probe), DEVMETHOD(device_attach, ti_pinmux_attach), /* fdt_pinctrl interface */ DEVMETHOD(fdt_pinctrl_configure, ti_pinmux_configure_pins), { 0, 0 } }; static driver_t ti_pinmux_driver = { "ti_pinmux", ti_pinmux_methods, sizeof(struct ti_pinmux_softc), }; static devclass_t ti_pinmux_devclass; DRIVER_MODULE(ti_pinmux, simplebus, ti_pinmux_driver, ti_pinmux_devclass, 0, 0); Index: user/markj/netdump/sys/arm64/conf/GENERIC =================================================================== --- user/markj/netdump/sys/arm64/conf/GENERIC (revision 332407) +++ user/markj/netdump/sys/arm64/conf/GENERIC (revision 332408) @@ -1,254 +1,257 @@ # # GENERIC -- Generic kernel configuration file for FreeBSD/arm64 # # For more information on this file, please read the config(5) manual page, # and/or the handbook section on Kernel Configuration Files: # # https://www.FreeBSD.org/doc/en_US.ISO8859-1/books/handbook/kernelconfig-config.html # # The handbook is also available locally in /usr/share/doc/handbook # if you've installed the doc distribution, otherwise always see the # FreeBSD World Wide Web server (https://www.FreeBSD.org/) for the # latest information. # # An exhaustive list of options and more detailed explanations of the # device lines is also present in the ../../conf/NOTES and NOTES files. # If you are in doubt as to the purpose or necessity of a line, check first # in NOTES. # # $FreeBSD$ cpu ARM64 ident GENERIC makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols makeoptions WITH_CTF=1 # Run ctfconvert(1) for DTrace support options SCHED_ULE # ULE scheduler options PREEMPTION # Enable kernel thread preemption #options VIMAGE # Subsystem virtualization, e.g. VNET options INET # InterNETworking options INET6 # IPv6 communications protocols options IPSEC # IP (v4/v6) security options IPSEC_SUPPORT # Allow kldload of ipsec and tcpmd5 options TCP_HHOOK # hhook(9) framework for TCP options TCP_OFFLOAD # TCP offload options TCP_RFC7413 # TCP Fast Open options SCTP # Stream Control Transmission Protocol options FFS # Berkeley Fast Filesystem options SOFTUPDATES # Enable FFS soft updates support options UFS_ACL # Support for access control lists options UFS_DIRHASH # Improve performance on big directories options UFS_GJOURNAL # Enable gjournal-based UFS journaling options QUOTA # Enable disk quotas for UFS options MD_ROOT # MD is a potential root device options NFSCL # Network Filesystem Client options NFSD # Network Filesystem Server options NFSLOCKD # Network Lock Manager options NFS_ROOT # NFS usable as /, requires NFSCL options MSDOSFS # MSDOS Filesystem options CD9660 # ISO 9660 Filesystem options PROCFS # Process filesystem (requires PSEUDOFS) options PSEUDOFS # Pseudo-filesystem framework options GEOM_PART_GPT # GUID Partition Tables. options GEOM_RAID # Soft RAID functionality. 
options GEOM_LABEL # Provides labelization options COMPAT_FREEBSD32 # Incomplete, but used by cloudabi32.ko. options COMPAT_FREEBSD11 # Compatible with FreeBSD11 options SCSI_DELAY=5000 # Delay (in ms) before probing SCSI options KTRACE # ktrace(1) support options STACK # stack(9) support options SYSVSHM # SYSV-style shared memory options SYSVMSG # SYSV-style message queues options SYSVSEM # SYSV-style semaphores options _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions options PRINTF_BUFR_SIZE=128 # Prevent printf output being interspersed. options KBD_INSTALL_CDEV # install a CDEV entry in /dev options HWPMC_HOOKS # Necessary kernel hooks for hwpmc(4) options AUDIT # Security event auditing options CAPABILITY_MODE # Capsicum capability mode options CAPABILITIES # Capsicum capabilities options MAC # TrustedBSD MAC Framework options KDTRACE_FRAME # Ensure frames are compiled in options KDTRACE_HOOKS # Kernel DTrace hooks options VFP # Floating-point support options RACCT # Resource accounting framework options RACCT_DEFAULT_TO_DISABLED # Set kern.racct.enable=0 by default options RCTL # Resource limits options SMP options INTRNG # Debugging support. Always need this: options KDB # Enable kernel debugger support. options KDB_TRACE # Print a stack trace for a panic. # For full debugger support use (turn off in stable branch): options DDB # Support DDB. #options GDB # Support remote GDB. options DEADLKRES # Enable the deadlock resolver options INVARIANTS # Enable calls of extra sanity checking options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS options WITNESS # Enable checks to detect deadlocks and cycles options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed options MALLOC_DEBUG_MAXZONES=8 # Separate malloc(9) zones # SoC support options SOC_ALLWINNER_A64 options SOC_ALLWINNER_H5 options SOC_CAVM_THUNDERX options SOC_HISI_HI6220 options SOC_BRCM_BCM2837 options SOC_ROCKCHIP_RK3328 # Annapurna Alpine drivers device al_ccu # Alpine Cache Coherency Unit device al_nb_service # Alpine North Bridge Service device al_iofic # I/O Fabric Interrupt Controller device al_serdes # Serializer/Deserializer device al_udma # Universal DMA +# Qualcomm Snapdragon drivers +device qcom_gcc # Global Clock Controller + # VirtIO support device virtio device virtio_pci device virtio_mmio device virtio_blk device vtnet # CPU frequency control device cpufreq # Bus drivers device pci device al_pci # Annapurna Alpine PCI-E options PCI_HP # PCI-Express native HotPlug options PCI_IOV # PCI SR-IOV support # Ethernet NICs device mdio device mii device miibus # MII bus support device awg # Allwinner EMAC Gigabit Ethernet device axgbe # AMD Opteron A1100 integrated NIC device em # Intel PRO/1000 Gigabit Ethernet Family device ix # Intel 10Gb Ethernet Family device msk # Marvell/SysKonnect Yukon II Gigabit Ethernet device neta # Marvell Armada 370/38x/XP/3700 NIC device smc # SMSC LAN91C111 device vnic # Cavium ThunderX NIC device al_eth # Annapurna Alpine Ethernet NIC # Block devices device ahci device scbus device da # ATA/SCSI peripherals device pass # Passthrough device (direct ATA/SCSI access) # MMC/SD/SDIO Card slot support device sdhci device aw_mmc # Allwinner SD/MMC controller device mmc # mmc/sd bus device mmcsd # mmc/sd flash cards device dwmmc # Serial (COM) ports device uart # Generic UART driver device uart_mvebu # Armada 3700 UART driver device uart_ns8250 # ns8250-type UART driver device uart_snps device pl011 # USB support options 
USB_DEBUG # enable debug msgs device aw_ehci # Allwinner EHCI USB interface (USB 2.0) device aw_usbphy # Allwinner USB PHY device dwcotg # DWC OTG controller device ohci # OHCI USB interface device ehci # EHCI USB interface (USB 2.0) device ehci_mv # Marvell EHCI USB interface device xhci # XHCI PCI->USB interface (USB 3.0) device xhci_mv # Marvell XHCI USB interface device usb # USB Bus (required) device ukbd # Keyboard device umass # Disks/Mass storage - Requires scbus and da # USB ethernet support device smcphy device smsc # GPIO device aw_gpio # Allwinner GPIO controller device gpio device gpioled device fdt_pinctrl # I2C device aw_rsb # Allwinner Reduced Serial Bus device bcm2835_bsc # Broadcom BCM283x I2C bus device iicbus device iic device twsi # Allwinner I2C controller # Clock and reset controllers device aw_ccu # Allwinner clock controller # Interrupt controllers device aw_nmi # Allwinner NMI support # Real-time clock support device aw_rtc # Allwinner Real-time Clock device mv_rtc # Marvell Real-time Clock # Watchdog controllers device aw_wdog # Allwinner Watchdog # Power management controllers device axp81x # X-Powers AXP81x PMIC # EFUSE device aw_sid # Allwinner Secure ID EFUSE # Thermal sensors device aw_thermal # Allwinner Thermal Sensor Controller # SPI device spibus device bcm2835_spi # Broadcom BCM283x SPI bus # Console device vt device kbdmux # Pseudo devices. device loop # Network loopback device random # Entropy device device ether # Ethernet support device vlan # 802.1Q VLAN support device tun # Packet tunnel. device md # Memory "disks" device gif # IPv6 and IPv4 tunneling device firmware # firmware assist module device psci # Support for ARM PSCI options EFIRT # EFI Runtime Services # EXT_RESOURCES pseudo devices options EXT_RESOURCES device clk device phy device hwreset device regulator device syscon # The `bpf' device enables the Berkeley Packet Filter. # Be aware of the administrative consequences of enabling this! # Note that 'bpf' is required for DHCP. device bpf # Berkeley packet filter # Chip-specific errata options THUNDERX_PASS_1_1_ERRATA options FDT device acpi # The crypto framework is required by IPSEC device crypto # Required by IPSEC Index: user/markj/netdump/sys/arm64/qualcomm/qcom_gcc.c =================================================================== --- user/markj/netdump/sys/arm64/qualcomm/qcom_gcc.c (nonexistent) +++ user/markj/netdump/sys/arm64/qualcomm/qcom_gcc.c (revision 332408) @@ -0,0 +1,148 @@ +/*- + * Copyright (c) 2018 Ruslan Bukin + * All rights reserved. + * + * This software was developed by BAE Systems, the University of Cambridge + * Computer Laboratory, and Memorial University under DARPA/AFRL contract + * FA8650-15-C-7558 ("CADETS"), as part of the DARPA Transparent Computing + * (TC) research program. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include +__FBSDID("$FreeBSD$"); + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#define GCC_QDSS_BCR 0x29000 +#define GCC_QDSS_BCR_BLK_ARES (1 << 0) /* Async software reset. */ +#define GCC_QDSS_CFG_AHB_CBCR 0x29008 +#define AHB_CBCR_CLK_ENABLE (1 << 0) /* AHB clk branch ctrl */ +#define GCC_QDSS_ETR_USB_CBCR 0x29028 +#define ETR_USB_CBCR_CLK_ENABLE (1 << 0) /* ETR USB clk branch ctrl */ +#define GCC_QDSS_DAP_CBCR 0x29084 +#define DAP_CBCR_CLK_ENABLE (1 << 0) /* DAP clk branch ctrl */ + +static struct ofw_compat_data compat_data[] = { + { "qcom,gcc-msm8916", 1 }, + { NULL, 0 } +}; + +struct qcom_gcc_softc { + struct resource *res; +}; + +static struct resource_spec qcom_gcc_spec[] = { + { SYS_RES_MEMORY, 0, RF_ACTIVE }, + { -1, 0 } +}; + +/* + * Qualcomm Debug Subsystem (QDSS) + * block enabling routine. + */ +static void +qcom_qdss_enable(struct qcom_gcc_softc *sc) +{ + + /* Put QDSS block to reset */ + bus_write_4(sc->res, GCC_QDSS_BCR, GCC_QDSS_BCR_BLK_ARES); + + /* Enable AHB clock branch */ + bus_write_4(sc->res, GCC_QDSS_CFG_AHB_CBCR, AHB_CBCR_CLK_ENABLE); + + /* Enable DAP clock branch */ + bus_write_4(sc->res, GCC_QDSS_DAP_CBCR, DAP_CBCR_CLK_ENABLE); + + /* Enable ETR USB clock branch */ + bus_write_4(sc->res, GCC_QDSS_ETR_USB_CBCR, ETR_USB_CBCR_CLK_ENABLE); + + /* Out of reset */ + bus_write_4(sc->res, GCC_QDSS_BCR, 0); +} + +static int +qcom_gcc_probe(device_t dev) +{ + if (!ofw_bus_status_okay(dev)) + return (ENXIO); + + if (ofw_bus_search_compatible(dev, compat_data)->ocd_data == 0) + return (ENXIO); + + device_set_desc(dev, "Qualcomm Global Clock Controller"); + + return (BUS_PROBE_DEFAULT); +} + +static int +qcom_gcc_attach(device_t dev) +{ + struct qcom_gcc_softc *sc; + + sc = device_get_softc(dev); + + if (bus_alloc_resources(dev, qcom_gcc_spec, &sc->res) != 0) { + device_printf(dev, "cannot allocate resources for device\n"); + return (ENXIO); + } + + /* + * Enable debug unit. + * This is required for Coresight operation. + * This also enables USB clock branch. 
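A note on qcom_qdss_enable() above: each store writes only the enable bit of its branch-control (CBCR) register, which suffices when the remaining bits are known to be zero. A sketch of the read-modify-write alternative, under the assumption that other CBCR bits ever need preserving (my_branch_enable_rmw is hypothetical):

static void
my_branch_enable_rmw(struct qcom_gcc_softc *sc, bus_size_t off)
{
	uint32_t reg;

	reg = bus_read_4(sc->res, off);
	reg |= AHB_CBCR_CLK_ENABLE;	/* bit 0 turns the branch clock on */
	bus_write_4(sc->res, off, reg);
}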
+ */ + qcom_qdss_enable(sc); + + return (0); +} + +static device_method_t qcom_gcc_methods[] = { + /* Device interface */ + DEVMETHOD(device_probe, qcom_gcc_probe), + DEVMETHOD(device_attach, qcom_gcc_attach), + + DEVMETHOD_END +}; + +static driver_t qcom_gcc_driver = { + "qcom_gcc", + qcom_gcc_methods, + sizeof(struct qcom_gcc_softc), +}; + +static devclass_t qcom_gcc_devclass; + +EARLY_DRIVER_MODULE(qcom_gcc, simplebus, qcom_gcc_driver, qcom_gcc_devclass, + 0, 0, BUS_PASS_BUS + BUS_PASS_ORDER_MIDDLE); +MODULE_VERSION(qcom_gcc, 1); Property changes on: user/markj/netdump/sys/arm64/qualcomm/qcom_gcc.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/markj/netdump/sys/cddl/contrib/opensolaris/uts/common/dtrace/dtrace.c =================================================================== --- user/markj/netdump/sys/cddl/contrib/opensolaris/uts/common/dtrace/dtrace.c (revision 332407) +++ user/markj/netdump/sys/cddl/contrib/opensolaris/uts/common/dtrace/dtrace.c (revision 332408) @@ -1,18386 +1,18424 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END * * $FreeBSD$ */ /* * Copyright (c) 2003, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2016, Joyent, Inc. All rights reserved. * Copyright (c) 2012, 2014 by Delphix. All rights reserved. */ /* * DTrace - Dynamic Tracing for Solaris * * This is the implementation of the Solaris Dynamic Tracing framework * (DTrace). The user-visible interface to DTrace is described at length in * the "Solaris Dynamic Tracing Guide". The interfaces between the libdtrace * library, the in-kernel DTrace framework, and the DTrace providers are * described in the block comments in the header file. The * internal architecture of DTrace is described in the block comments in the * header file. The comments contained within the DTrace * implementation very much assume mastery of all of these sources; if one has * an unanswered question about the implementation, one should consult them * first. 
* * The functions here are ordered roughly as follows: * * - Probe context functions * - Probe hashing functions * - Non-probe context utility functions * - Matching functions * - Provider-to-Framework API functions * - Probe management functions * - DIF object functions * - Format functions * - Predicate functions * - ECB functions * - Buffer functions * - Enabling functions * - DOF functions * - Anonymous enabling functions * - Consumer state functions * - Helper functions * - Hook functions * - Driver cookbook functions * * Each group of functions begins with a block comment labelled the "DTrace * [Group] Functions", allowing one to find each block by searching forward * on capital-f functions. */ #include #ifndef illumos #include #endif #include #include #include #include #ifdef illumos #include #include #endif #include #include #ifdef illumos #include #endif #include #include #include #include #ifdef illumos #include #include #endif #include #ifdef illumos #include #include #endif #include #ifdef illumos #include #include #endif #include #ifdef illumos #include #include #endif #include #include #include #include "strtolctype.h" /* FreeBSD includes: */ #ifndef illumos #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "dtrace_cddl.h" #include "dtrace_debug.c" #endif #include "dtrace_xoroshiro128_plus.h" /* * DTrace Tunable Variables * * The following variables may be tuned by adding a line to /etc/system that * includes both the name of the DTrace module ("dtrace") and the name of the * variable. For example: * * set dtrace:dtrace_destructive_disallow = 1 * * In general, the only variables that one should be tuning this way are those * that affect system-wide DTrace behavior, and for which the default behavior * is undesirable. Most of these variables are tunable on a per-consumer * basis using DTrace options, and need not be tuned on a system-wide basis. * When tuning these variables, avoid pathological values; while some attempt * is made to verify the integrity of these variables, they are not considered * part of the supported interface to DTrace, and they are therefore not * checked comprehensively. Further, these variables should not be tuned * dynamically via "mdb -kw" or other means; they should only be tuned via * /etc/system. 
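On FreeBSD, the /etc/system mechanism described above corresponds to loader tunables and sysctls (the kern.dtrace and debug.dtrace nodes are declared further down in this file). A minimal sketch of exporting such a knob; the name and backing variable are illustrative only:

#include <sys/sysctl.h>

static int my_example_knob = 0;
SYSCTL_INT(_kern_dtrace, OID_AUTO, example_knob, CTLFLAG_RDTUN,
    &my_example_knob, 0, "hypothetical DTrace tunable");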
*/ int dtrace_destructive_disallow = 0; #ifndef illumos /* Positive logic version of dtrace_destructive_disallow for loader tunable */ int dtrace_allow_destructive = 1; #endif dtrace_optval_t dtrace_nonroot_maxsize = (16 * 1024 * 1024); size_t dtrace_difo_maxsize = (256 * 1024); dtrace_optval_t dtrace_dof_maxsize = (8 * 1024 * 1024); size_t dtrace_statvar_maxsize = (16 * 1024); size_t dtrace_actions_max = (16 * 1024); size_t dtrace_retain_max = 1024; dtrace_optval_t dtrace_helper_actions_max = 128; dtrace_optval_t dtrace_helper_providers_max = 32; dtrace_optval_t dtrace_dstate_defsize = (1 * 1024 * 1024); size_t dtrace_strsize_default = 256; dtrace_optval_t dtrace_cleanrate_default = 9900990; /* 101 hz */ dtrace_optval_t dtrace_cleanrate_min = 200000; /* 5000 hz */ dtrace_optval_t dtrace_cleanrate_max = (uint64_t)60 * NANOSEC; /* 1/minute */ dtrace_optval_t dtrace_aggrate_default = NANOSEC; /* 1 hz */ dtrace_optval_t dtrace_statusrate_default = NANOSEC; /* 1 hz */ dtrace_optval_t dtrace_statusrate_max = (hrtime_t)10 * NANOSEC; /* 6/minute */ dtrace_optval_t dtrace_switchrate_default = NANOSEC; /* 1 hz */ dtrace_optval_t dtrace_nspec_default = 1; dtrace_optval_t dtrace_specsize_default = 32 * 1024; dtrace_optval_t dtrace_stackframes_default = 20; dtrace_optval_t dtrace_ustackframes_default = 20; dtrace_optval_t dtrace_jstackframes_default = 50; dtrace_optval_t dtrace_jstackstrsize_default = 512; int dtrace_msgdsize_max = 128; hrtime_t dtrace_chill_max = MSEC2NSEC(500); /* 500 ms */ hrtime_t dtrace_chill_interval = NANOSEC; /* 1000 ms */ int dtrace_devdepth_max = 32; int dtrace_err_verbose; hrtime_t dtrace_deadman_interval = NANOSEC; hrtime_t dtrace_deadman_timeout = (hrtime_t)10 * NANOSEC; hrtime_t dtrace_deadman_user = (hrtime_t)30 * NANOSEC; hrtime_t dtrace_unregister_defunct_reap = (hrtime_t)60 * NANOSEC; #ifndef illumos int dtrace_memstr_max = 4096; #endif /* * DTrace External Variables * * As dtrace(7D) is a kernel module, any DTrace variables are obviously * available to DTrace consumers via the backtick (`) syntax. One of these, * dtrace_zero, is made deliberately so: it is provided as a source of * well-known, zero-filled memory. While this variable is not documented, * it is used by some translators as an implementation detail. */ const char dtrace_zero[256] = { 0 }; /* zero-filled memory */ /* * DTrace Internal Variables */ #ifdef illumos static dev_info_t *dtrace_devi; /* device info */ #endif #ifdef illumos static vmem_t *dtrace_arena; /* probe ID arena */ static vmem_t *dtrace_minor; /* minor number arena */ #else static taskq_t *dtrace_taskq; /* task queue */ static struct unrhdr *dtrace_arena; /* Probe ID number. 
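The FreeBSD port replaces the illumos vmem probe-ID arena with a unit-number allocator, declared just above as dtrace_arena. A self-contained sketch of that API with illustrative bounds (passing a NULL mutex selects the allocator's internal lock):

#include <sys/param.h>
#include <sys/systm.h>

static void
my_unr_demo(void)
{
	struct unrhdr *arena;
	int id;

	arena = new_unrhdr(1, 0x7fffffff, NULL);
	id = alloc_unr(arena);		/* lowest free number, -1 if exhausted */
	if (id != -1)
		free_unr(arena, id);
	delete_unrhdr(arena);		/* allocator must be empty again */
}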
*/ #endif static dtrace_probe_t **dtrace_probes; /* array of all probes */ static int dtrace_nprobes; /* number of probes */ static dtrace_provider_t *dtrace_provider; /* provider list */ static dtrace_meta_t *dtrace_meta_pid; /* user-land meta provider */ static int dtrace_opens; /* number of opens */ static int dtrace_helpers; /* number of helpers */ static int dtrace_getf; /* number of unpriv getf()s */ #ifdef illumos static void *dtrace_softstate; /* softstate pointer */ #endif static dtrace_hash_t *dtrace_bymod; /* probes hashed by module */ static dtrace_hash_t *dtrace_byfunc; /* probes hashed by function */ static dtrace_hash_t *dtrace_byname; /* probes hashed by name */ static dtrace_toxrange_t *dtrace_toxrange; /* toxic range array */ static int dtrace_toxranges; /* number of toxic ranges */ static int dtrace_toxranges_max; /* size of toxic range array */ static dtrace_anon_t dtrace_anon; /* anonymous enabling */ static kmem_cache_t *dtrace_state_cache; /* cache for dynamic state */ static uint64_t dtrace_vtime_references; /* number of vtimestamp refs */ static kthread_t *dtrace_panicked; /* panicking thread */ static dtrace_ecb_t *dtrace_ecb_create_cache; /* cached created ECB */ static dtrace_genid_t dtrace_probegen; /* current probe generation */ static dtrace_helpers_t *dtrace_deferred_pid; /* deferred helper list */ static dtrace_enabling_t *dtrace_retained; /* list of retained enablings */ static dtrace_genid_t dtrace_retained_gen; /* current retained enab gen */ static dtrace_dynvar_t dtrace_dynhash_sink; /* end of dynamic hash chains */ static int dtrace_dynvar_failclean; /* dynvars failed to clean */ #ifndef illumos static struct mtx dtrace_unr_mtx; MTX_SYSINIT(dtrace_unr_mtx, &dtrace_unr_mtx, "Unique resource identifier", MTX_DEF); static eventhandler_tag dtrace_kld_load_tag; static eventhandler_tag dtrace_kld_unload_try_tag; #endif /* * DTrace Locking * DTrace is protected by three (relatively coarse-grained) locks: * * (1) dtrace_lock is required to manipulate essentially any DTrace state, * including enabling state, probes, ECBs, consumer state, helper state, * etc. Importantly, dtrace_lock is _not_ required when in probe context; * probe context is lock-free -- synchronization is handled via the * dtrace_sync() cross call mechanism. * * (2) dtrace_provider_lock is required when manipulating provider state, or * when provider state must be held constant. * * (3) dtrace_meta_lock is required when manipulating meta provider state, or * when meta provider state must be held constant. * * The lock ordering between these three locks is dtrace_meta_lock before * dtrace_provider_lock before dtrace_lock. (In particular, there are * several places where dtrace_provider_lock is held by the framework as it * calls into the providers -- which then call back into the framework, * grabbing dtrace_lock.) * * There are two other locks in the mix: mod_lock and cpu_lock. With respect * to dtrace_provider_lock and dtrace_lock, cpu_lock continues its historical * role as a coarse-grained lock; it is acquired before both of these locks. * With respect to dtrace_meta_lock, its behavior is stranger: cpu_lock must * be acquired _between_ dtrace_meta_lock and any other DTrace locks. * mod_lock is similar with respect to dtrace_provider_lock in that it must be * acquired _between_ dtrace_provider_lock and dtrace_lock. 
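The ordering rules above translate mechanically into acquisition order. A sketch (my_ordered_acquire is illustrative, using the compat mutex_enter()/mutex_exit() wrappers this file already relies on):

static void
my_ordered_acquire(void)
{
	/* meta before provider before dtrace_lock, released in reverse */
	mutex_enter(&dtrace_meta_lock);
	mutex_enter(&dtrace_provider_lock);
	mutex_enter(&dtrace_lock);
	/* ... manipulate framework state ... */
	mutex_exit(&dtrace_lock);
	mutex_exit(&dtrace_provider_lock);
	mutex_exit(&dtrace_meta_lock);
}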
*/ static kmutex_t dtrace_lock; /* probe state lock */ static kmutex_t dtrace_provider_lock; /* provider state lock */ static kmutex_t dtrace_meta_lock; /* meta-provider state lock */ #ifndef illumos /* XXX FreeBSD hacks. */ #define cr_suid cr_svuid #define cr_sgid cr_svgid #define ipaddr_t in_addr_t #define mod_modname pathname #define vuprintf vprintf #define ttoproc(_a) ((_a)->td_proc) #define crgetzoneid(_a) 0 #define SNOCD 0 #define CPU_ON_INTR(_a) 0 #define PRIV_EFFECTIVE (1 << 0) #define PRIV_DTRACE_KERNEL (1 << 1) #define PRIV_DTRACE_PROC (1 << 2) #define PRIV_DTRACE_USER (1 << 3) #define PRIV_PROC_OWNER (1 << 4) #define PRIV_PROC_ZONE (1 << 5) #define PRIV_ALL ~0 SYSCTL_DECL(_debug_dtrace); SYSCTL_DECL(_kern_dtrace); #endif #ifdef illumos #define curcpu CPU->cpu_id #endif /* * DTrace Provider Variables * * These are the variables relating to DTrace as a provider (that is, the * provider of the BEGIN, END, and ERROR probes). */ static dtrace_pattr_t dtrace_provider_attr = { { DTRACE_STABILITY_STABLE, DTRACE_STABILITY_STABLE, DTRACE_CLASS_COMMON }, { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN }, { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN }, { DTRACE_STABILITY_STABLE, DTRACE_STABILITY_STABLE, DTRACE_CLASS_COMMON }, { DTRACE_STABILITY_STABLE, DTRACE_STABILITY_STABLE, DTRACE_CLASS_COMMON }, }; static void dtrace_nullop(void) {} static dtrace_pops_t dtrace_provider_ops = { .dtps_provide = (void (*)(void *, dtrace_probedesc_t *))dtrace_nullop, .dtps_provide_module = (void (*)(void *, modctl_t *))dtrace_nullop, .dtps_enable = (void (*)(void *, dtrace_id_t, void *))dtrace_nullop, .dtps_disable = (void (*)(void *, dtrace_id_t, void *))dtrace_nullop, .dtps_suspend = (void (*)(void *, dtrace_id_t, void *))dtrace_nullop, .dtps_resume = (void (*)(void *, dtrace_id_t, void *))dtrace_nullop, .dtps_getargdesc = NULL, .dtps_getargval = NULL, .dtps_usermode = NULL, .dtps_destroy = (void (*)(void *, dtrace_id_t, void *))dtrace_nullop, }; static dtrace_id_t dtrace_probeid_begin; /* special BEGIN probe */ static dtrace_id_t dtrace_probeid_end; /* special END probe */ dtrace_id_t dtrace_probeid_error; /* special ERROR probe */ /* * DTrace Helper Tracing Variables * * These variables should be set dynamically to enable helper tracing. The * only variables that should be set are dtrace_helptrace_enable (which should * be set to a non-zero value to allocate helper tracing buffers on the next * open of /dev/dtrace) and dtrace_helptrace_disable (which should be set to a * non-zero value to deallocate helper tracing buffers on the next close of * /dev/dtrace). When (and only when) helper tracing is disabled, the * buffer size may also be set via dtrace_helptrace_bufsize. */ int dtrace_helptrace_enable = 0; int dtrace_helptrace_disable = 0; int dtrace_helptrace_bufsize = 16 * 1024 * 1024; uint32_t dtrace_helptrace_nlocals; static dtrace_helptrace_t *dtrace_helptrace_buffer; static uint32_t dtrace_helptrace_next = 0; static int dtrace_helptrace_wrapped = 0; /* * DTrace Error Hashing * * On DEBUG kernels, DTrace will track the errors that has seen in a hash * table. This is very useful for checking coverage of tests that are * expected to induce DIF or DOF processing errors, and may be useful for * debugging problems in the DIF code generator or in DOF generation . The * error hash may be examined with the ::dtrace_errhash MDB dcmd. 
*/ #ifdef DEBUG static dtrace_errhash_t dtrace_errhash[DTRACE_ERRHASHSZ]; static const char *dtrace_errlast; static kthread_t *dtrace_errthread; static kmutex_t dtrace_errlock; #endif /* * DTrace Macros and Constants * * These are various macros that are useful in various spots in the * implementation, along with a few random constants that have no meaning * outside of the implementation. There is no real structure to this cpp * mishmash -- but is there ever? */ #define DTRACE_HASHSTR(hash, probe) \ dtrace_hash_str(*((char **)((uintptr_t)(probe) + (hash)->dth_stroffs))) #define DTRACE_HASHNEXT(hash, probe) \ (dtrace_probe_t **)((uintptr_t)(probe) + (hash)->dth_nextoffs) #define DTRACE_HASHPREV(hash, probe) \ (dtrace_probe_t **)((uintptr_t)(probe) + (hash)->dth_prevoffs) #define DTRACE_HASHEQ(hash, lhs, rhs) \ (strcmp(*((char **)((uintptr_t)(lhs) + (hash)->dth_stroffs)), \ *((char **)((uintptr_t)(rhs) + (hash)->dth_stroffs))) == 0) #define DTRACE_AGGHASHSIZE_SLEW 17 #define DTRACE_V4MAPPED_OFFSET (sizeof (uint32_t) * 3) /* * The key for a thread-local variable consists of the lower 61 bits of the * t_did, plus the 3 bits of the highest active interrupt above LOCK_LEVEL. * We add DIF_VARIABLE_MAX to t_did to assure that the thread key is never * equal to a variable identifier. This is necessary (but not sufficient) to * assure that global associative arrays never collide with thread-local * variables. To guarantee that they cannot collide, we must also define the * order for keying dynamic variables. That order is: * * [ key0 ] ... [ keyn ] [ variable-key ] [ tls-key ] * * Because the variable-key and the tls-key are in orthogonal spaces, there is * no way for a global variable key signature to match a thread-local key * signature. */ #ifdef illumos #define DTRACE_TLS_THRKEY(where) { \ uint_t intr = 0; \ uint_t actv = CPU->cpu_intr_actv >> (LOCK_LEVEL + 1); \ for (; actv; actv >>= 1) \ intr++; \ ASSERT(intr < (1 << 3)); \ (where) = ((curthread->t_did + DIF_VARIABLE_MAX) & \ (((uint64_t)1 << 61) - 1)) | ((uint64_t)intr << 61); \ } #else #define DTRACE_TLS_THRKEY(where) { \ solaris_cpu_t *_c = &solaris_cpu[curcpu]; \ uint_t intr = 0; \ uint_t actv = _c->cpu_intr_actv; \ for (; actv; actv >>= 1) \ intr++; \ ASSERT(intr < (1 << 3)); \ (where) = ((curthread->td_tid + DIF_VARIABLE_MAX) & \ (((uint64_t)1 << 61) - 1)) | ((uint64_t)intr << 61); \ } #endif #define DT_BSWAP_8(x) ((x) & 0xff) #define DT_BSWAP_16(x) ((DT_BSWAP_8(x) << 8) | DT_BSWAP_8((x) >> 8)) #define DT_BSWAP_32(x) ((DT_BSWAP_16(x) << 16) | DT_BSWAP_16((x) >> 16)) #define DT_BSWAP_64(x) ((DT_BSWAP_32(x) << 32) | DT_BSWAP_32((x) >> 32)) #define DT_MASK_LO 0x00000000FFFFFFFFULL #define DTRACE_STORE(type, tomax, offset, what) \ *((type *)((uintptr_t)(tomax) + (uintptr_t)offset)) = (type)(what); #ifndef __x86 #define DTRACE_ALIGNCHECK(addr, size, flags) \ if (addr & (size - 1)) { \ *flags |= CPU_DTRACE_BADALIGN; \ cpu_core[curcpu].cpuc_dtrace_illval = addr; \ return (0); \ } #else #define DTRACE_ALIGNCHECK(addr, size, flags) #endif /* * Test whether a range of memory starting at testaddr of size testsz falls * within the range of memory described by addr, sz. We take care to avoid * problems with overflow and underflow of the unsigned quantities, and * disallow all negative sizes. Ranges of size 0 are allowed. 
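The DTRACE_INRANGE macro defined below encodes this in three unsigned comparisons; rewritten as a function for readability (my_inrange is a sketch, not part of the source). For example, with baseaddr 0x1000 and basesz 0x100, testaddr 0xfff fails the first clause because 0xfff - 0x1000 wraps to a huge unsigned value, while testaddr 0x10ff with testsz 1 passes all three:

static int
my_inrange(uintptr_t testaddr, size_t testsz, uintptr_t baseaddr,
    size_t basesz)
{
	return (testaddr - baseaddr < basesz &&		/* start inside */
	    testaddr + testsz - baseaddr <= basesz &&	/* end inside */
	    testaddr + testsz >= testaddr);		/* no wrap at top */
}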
*/ #define DTRACE_INRANGE(testaddr, testsz, baseaddr, basesz) \ ((testaddr) - (uintptr_t)(baseaddr) < (basesz) && \ (testaddr) + (testsz) - (uintptr_t)(baseaddr) <= (basesz) && \ (testaddr) + (testsz) >= (testaddr)) #define DTRACE_RANGE_REMAIN(remp, addr, baseaddr, basesz) \ do { \ if ((remp) != NULL) { \ *(remp) = (uintptr_t)(baseaddr) + (basesz) - (addr); \ } \ _NOTE(CONSTCOND) } while (0) /* * Test whether alloc_sz bytes will fit in the scratch region. We isolate * alloc_sz on the righthand side of the comparison in order to avoid overflow * or underflow in the comparison with it. This is simpler than the INRANGE * check above, because we know that the dtms_scratch_ptr is valid in the * range. Allocations of size zero are allowed. */ #define DTRACE_INSCRATCH(mstate, alloc_sz) \ ((mstate)->dtms_scratch_base + (mstate)->dtms_scratch_size - \ (mstate)->dtms_scratch_ptr >= (alloc_sz)) #define DTRACE_LOADFUNC(bits) \ /*CSTYLED*/ \ uint##bits##_t \ dtrace_load##bits(uintptr_t addr) \ { \ size_t size = bits / NBBY; \ /*CSTYLED*/ \ uint##bits##_t rval; \ int i; \ volatile uint16_t *flags = (volatile uint16_t *) \ &cpu_core[curcpu].cpuc_dtrace_flags; \ \ DTRACE_ALIGNCHECK(addr, size, flags); \ \ for (i = 0; i < dtrace_toxranges; i++) { \ if (addr >= dtrace_toxrange[i].dtt_limit) \ continue; \ \ if (addr + size <= dtrace_toxrange[i].dtt_base) \ continue; \ \ /* \ * This address falls within a toxic region; return 0. \ */ \ *flags |= CPU_DTRACE_BADADDR; \ cpu_core[curcpu].cpuc_dtrace_illval = addr; \ return (0); \ } \ \ *flags |= CPU_DTRACE_NOFAULT; \ /*CSTYLED*/ \ rval = *((volatile uint##bits##_t *)addr); \ *flags &= ~CPU_DTRACE_NOFAULT; \ \ return (!(*flags & CPU_DTRACE_FAULT) ? rval : 0); \ } #ifdef _LP64 #define dtrace_loadptr dtrace_load64 #else #define dtrace_loadptr dtrace_load32 #endif #define DTRACE_DYNHASH_FREE 0 #define DTRACE_DYNHASH_SINK 1 #define DTRACE_DYNHASH_VALID 2 #define DTRACE_MATCH_NEXT 0 #define DTRACE_MATCH_DONE 1 #define DTRACE_ANCHORED(probe) ((probe)->dtpr_func[0] != '\0') #define DTRACE_STATE_ALIGN 64 #define DTRACE_FLAGS2FLT(flags) \ (((flags) & CPU_DTRACE_BADADDR) ? DTRACEFLT_BADADDR : \ ((flags) & CPU_DTRACE_ILLOP) ? DTRACEFLT_ILLOP : \ ((flags) & CPU_DTRACE_DIVZERO) ? DTRACEFLT_DIVZERO : \ ((flags) & CPU_DTRACE_KPRIV) ? DTRACEFLT_KPRIV : \ ((flags) & CPU_DTRACE_UPRIV) ? DTRACEFLT_UPRIV : \ ((flags) & CPU_DTRACE_TUPOFLOW) ? DTRACEFLT_TUPOFLOW : \ ((flags) & CPU_DTRACE_BADALIGN) ? DTRACEFLT_BADALIGN : \ ((flags) & CPU_DTRACE_NOSCRATCH) ? DTRACEFLT_NOSCRATCH : \ ((flags) & CPU_DTRACE_BADSTACK) ? 
DTRACEFLT_BADSTACK : \ DTRACEFLT_UNKNOWN) #define DTRACEACT_ISSTRING(act) \ ((act)->dta_kind == DTRACEACT_DIFEXPR && \ (act)->dta_difo->dtdo_rtype.dtdt_kind == DIF_TYPE_STRING) /* Function prototype definitions: */ static size_t dtrace_strlen(const char *, size_t); static dtrace_probe_t *dtrace_probe_lookup_id(dtrace_id_t id); static void dtrace_enabling_provide(dtrace_provider_t *); static int dtrace_enabling_match(dtrace_enabling_t *, int *); static void dtrace_enabling_matchall(void); static void dtrace_enabling_reap(void); static dtrace_state_t *dtrace_anon_grab(void); static uint64_t dtrace_helper(int, dtrace_mstate_t *, dtrace_state_t *, uint64_t, uint64_t); static dtrace_helpers_t *dtrace_helpers_create(proc_t *); static void dtrace_buffer_drop(dtrace_buffer_t *); static int dtrace_buffer_consumed(dtrace_buffer_t *, hrtime_t when); static intptr_t dtrace_buffer_reserve(dtrace_buffer_t *, size_t, size_t, dtrace_state_t *, dtrace_mstate_t *); static int dtrace_state_option(dtrace_state_t *, dtrace_optid_t, dtrace_optval_t); static int dtrace_ecb_create_enable(dtrace_probe_t *, void *); static void dtrace_helper_provider_destroy(dtrace_helper_provider_t *); uint16_t dtrace_load16(uintptr_t); uint32_t dtrace_load32(uintptr_t); uint64_t dtrace_load64(uintptr_t); uint8_t dtrace_load8(uintptr_t); void dtrace_dynvar_clean(dtrace_dstate_t *); dtrace_dynvar_t *dtrace_dynvar(dtrace_dstate_t *, uint_t, dtrace_key_t *, size_t, dtrace_dynvar_op_t, dtrace_mstate_t *, dtrace_vstate_t *); uintptr_t dtrace_dif_varstr(uintptr_t, dtrace_state_t *, dtrace_mstate_t *); static int dtrace_priv_proc(dtrace_state_t *); static void dtrace_getf_barrier(void); static int dtrace_canload_remains(uint64_t, size_t, size_t *, dtrace_mstate_t *, dtrace_vstate_t *); static int dtrace_canstore_remains(uint64_t, size_t, size_t *, dtrace_mstate_t *, dtrace_vstate_t *); /* * DTrace Probe Context Functions * * These functions are called from probe context. Because probe context is * any context in which C may be called, arbitrarily locks may be held, * interrupts may be disabled, we may be in arbitrary dispatched state, etc. * As a result, functions called from probe context may only call other DTrace * support functions -- they may not interact at all with the system at large. * (Note that the ASSERT macro is made probe-context safe by redefining it in * terms of dtrace_assfail(), a probe-context safe function.) If arbitrary * loads are to be performed from probe context, they _must_ be in terms of * the safe dtrace_load*() variants. * * Some functions in this block are not actually called from probe context; * for these functions, there will be a comment above the function reading * "Note: not called from probe context." */ void dtrace_panic(const char *format, ...) { va_list alist; va_start(alist, format); #ifdef __FreeBSD__ vpanic(format, alist); #else dtrace_vpanic(format, alist); #endif va_end(alist); } int dtrace_assfail(const char *a, const char *f, int l) { dtrace_panic("assertion failed: %s, file: %s, line: %d", a, f, l); /* * We just need something here that even the most clever compiler * cannot optimize away. */ return (a[(uintptr_t)f]); } /* * Atomically increment a specified error counter from probe context. */ static void dtrace_error(uint32_t *counter) { /* * Most counters stored to in probe context are per-CPU counters. * However, there are some error conditions that are sufficiently * arcane that they don't merit per-CPU storage. 
If these counters * are incremented concurrently on different CPUs, scalability will be * adversely affected -- but we don't expect them to be white-hot in a * correctly constructed enabling... */ uint32_t oval, nval; do { oval = *counter; if ((nval = oval + 1) == 0) { /* * If the counter would wrap, set it to 1 -- assuring * that the counter is never zero when we have seen * errors. (The counter must be 32-bits because we * aren't guaranteed a 64-bit compare&swap operation.) * To save this code both the infamy of being fingered * by a priggish news story and the indignity of being * the target of a neo-puritan witch trial, we're * carefully avoiding any colorful description of the * likelihood of this condition -- but suffice it to * say that it is only slightly more likely than the * overflow of predicate cache IDs, as discussed in * dtrace_predicate_create(). */ nval = 1; } } while (dtrace_cas32(counter, oval, nval) != oval); } /* * Use the DTRACE_LOADFUNC macro to define functions for each of loading a * uint8_t, a uint16_t, a uint32_t and a uint64_t. */ /* BEGIN CSTYLED */ DTRACE_LOADFUNC(8) DTRACE_LOADFUNC(16) DTRACE_LOADFUNC(32) DTRACE_LOADFUNC(64) /* END CSTYLED */ static int dtrace_inscratch(uintptr_t dest, size_t size, dtrace_mstate_t *mstate) { if (dest < mstate->dtms_scratch_base) return (0); if (dest + size < dest) return (0); if (dest + size > mstate->dtms_scratch_ptr) return (0); return (1); } static int dtrace_canstore_statvar(uint64_t addr, size_t sz, size_t *remain, dtrace_statvar_t **svars, int nsvars) { int i; size_t maxglobalsize, maxlocalsize; if (nsvars == 0) return (0); maxglobalsize = dtrace_statvar_maxsize + sizeof (uint64_t); maxlocalsize = maxglobalsize * NCPU; for (i = 0; i < nsvars; i++) { dtrace_statvar_t *svar = svars[i]; uint8_t scope; size_t size; if (svar == NULL || (size = svar->dtsv_size) == 0) continue; scope = svar->dtsv_var.dtdv_scope; /* * We verify that our size is valid in the spirit of providing * defense in depth: we want to prevent attackers from using * DTrace to escalate an orthogonal kernel heap corruption bug * into the ability to store to arbitrary locations in memory. */ VERIFY((scope == DIFV_SCOPE_GLOBAL && size <= maxglobalsize) || (scope == DIFV_SCOPE_LOCAL && size <= maxlocalsize)); if (DTRACE_INRANGE(addr, sz, svar->dtsv_data, svar->dtsv_size)) { DTRACE_RANGE_REMAIN(remain, addr, svar->dtsv_data, svar->dtsv_size); return (1); } } return (0); } /* * Check to see if the address is within a memory region to which a store may * be issued. This includes the DTrace scratch areas, and any DTrace variable * region. The caller of dtrace_canstore() is responsible for performing any * alignment checks that are needed before stores are actually executed. */ static int dtrace_canstore(uint64_t addr, size_t sz, dtrace_mstate_t *mstate, dtrace_vstate_t *vstate) { return (dtrace_canstore_remains(addr, sz, NULL, mstate, vstate)); } /* * Implementation of dtrace_canstore which communicates the upper bound of the * allowed memory region. */ static int dtrace_canstore_remains(uint64_t addr, size_t sz, size_t *remain, dtrace_mstate_t *mstate, dtrace_vstate_t *vstate) { /* * First, check to see if the address is in scratch space... */ if (DTRACE_INRANGE(addr, sz, mstate->dtms_scratch_base, mstate->dtms_scratch_size)) { DTRACE_RANGE_REMAIN(remain, addr, mstate->dtms_scratch_base, mstate->dtms_scratch_size); return (1); } /* * Now check to see if it's a dynamic variable. 
This check will pick * up both thread-local variables and any global dynamically-allocated * variables. */ if (DTRACE_INRANGE(addr, sz, vstate->dtvs_dynvars.dtds_base, vstate->dtvs_dynvars.dtds_size)) { dtrace_dstate_t *dstate = &vstate->dtvs_dynvars; uintptr_t base = (uintptr_t)dstate->dtds_base + (dstate->dtds_hashsize * sizeof (dtrace_dynhash_t)); uintptr_t chunkoffs; dtrace_dynvar_t *dvar; /* * Before we assume that we can store here, we need to make * sure that it isn't in our metadata -- storing to our * dynamic variable metadata would corrupt our state. For * the range to not include any dynamic variable metadata, * it must: * * (1) Start above the hash table that is at the base of * the dynamic variable space * * (2) Have a starting chunk offset that is beyond the * dtrace_dynvar_t that is at the base of every chunk * * (3) Not span a chunk boundary * * (4) Not be in the tuple space of a dynamic variable * */ if (addr < base) return (0); chunkoffs = (addr - base) % dstate->dtds_chunksize; if (chunkoffs < sizeof (dtrace_dynvar_t)) return (0); if (chunkoffs + sz > dstate->dtds_chunksize) return (0); dvar = (dtrace_dynvar_t *)((uintptr_t)addr - chunkoffs); if (dvar->dtdv_hashval == DTRACE_DYNHASH_FREE) return (0); if (chunkoffs < sizeof (dtrace_dynvar_t) + ((dvar->dtdv_tuple.dtt_nkeys - 1) * sizeof (dtrace_key_t))) return (0); DTRACE_RANGE_REMAIN(remain, addr, dvar, dstate->dtds_chunksize); return (1); } /* * Finally, check the static local and global variables. These checks * take the longest, so we perform them last. */ if (dtrace_canstore_statvar(addr, sz, remain, vstate->dtvs_locals, vstate->dtvs_nlocals)) return (1); if (dtrace_canstore_statvar(addr, sz, remain, vstate->dtvs_globals, vstate->dtvs_nglobals)) return (1); return (0); } /* * Convenience routine to check to see if the address is within a memory * region in which a load may be issued given the user's privilege level; * if not, it sets the appropriate error flags and loads 'addr' into the * illegal value slot. * * DTrace subroutines (DIF_SUBR_*) should use this helper to implement * appropriate memory access protection. */ static int dtrace_canload(uint64_t addr, size_t sz, dtrace_mstate_t *mstate, dtrace_vstate_t *vstate) { return (dtrace_canload_remains(addr, sz, NULL, mstate, vstate)); } /* * Implementation of dtrace_canload which communicates the uppoer bound of the * allowed memory region. */ static int dtrace_canload_remains(uint64_t addr, size_t sz, size_t *remain, dtrace_mstate_t *mstate, dtrace_vstate_t *vstate) { volatile uintptr_t *illval = &cpu_core[curcpu].cpuc_dtrace_illval; file_t *fp; /* * If we hold the privilege to read from kernel memory, then * everything is readable. */ if ((mstate->dtms_access & DTRACE_ACCESS_KERNEL) != 0) { DTRACE_RANGE_REMAIN(remain, addr, addr, sz); return (1); } /* * You can obviously read that which you can store. */ if (dtrace_canstore_remains(addr, sz, remain, mstate, vstate)) return (1); /* * We're allowed to read from our own string table. */ if (DTRACE_INRANGE(addr, sz, mstate->dtms_difo->dtdo_strtab, mstate->dtms_difo->dtdo_strlen)) { DTRACE_RANGE_REMAIN(remain, addr, mstate->dtms_difo->dtdo_strtab, mstate->dtms_difo->dtdo_strlen); return (1); } if (vstate->dtvs_state != NULL && dtrace_priv_proc(vstate->dtvs_state)) { proc_t *p; /* * When we have privileges to the current process, there are * several context-related kernel structures that are safe to * read, even absent the privilege to read from kernel memory. 
* These reads are safe because these structures contain only * state that (1) we're permitted to read, (2) is harmless or * (3) contains pointers to additional kernel state that we're * not permitted to read (and as such, do not present an * opportunity for privilege escalation). Finally (and * critically), because of the nature of their relation with * the current thread context, the memory associated with these * structures cannot change over the duration of probe context, * and it is therefore impossible for this memory to be * deallocated and reallocated as something else while it's * being operated upon. */ if (DTRACE_INRANGE(addr, sz, curthread, sizeof (kthread_t))) { DTRACE_RANGE_REMAIN(remain, addr, curthread, sizeof (kthread_t)); return (1); } if ((p = curthread->t_procp) != NULL && DTRACE_INRANGE(addr, sz, curthread->t_procp, sizeof (proc_t))) { DTRACE_RANGE_REMAIN(remain, addr, curthread->t_procp, sizeof (proc_t)); return (1); } if (curthread->t_cred != NULL && DTRACE_INRANGE(addr, sz, curthread->t_cred, sizeof (cred_t))) { DTRACE_RANGE_REMAIN(remain, addr, curthread->t_cred, sizeof (cred_t)); return (1); } #ifdef illumos if (p != NULL && p->p_pidp != NULL && DTRACE_INRANGE(addr, sz, &(p->p_pidp->pid_id), sizeof (pid_t))) { DTRACE_RANGE_REMAIN(remain, addr, &(p->p_pidp->pid_id), sizeof (pid_t)); return (1); } if (curthread->t_cpu != NULL && DTRACE_INRANGE(addr, sz, curthread->t_cpu, offsetof(cpu_t, cpu_pause_thread))) { DTRACE_RANGE_REMAIN(remain, addr, curthread->t_cpu, offsetof(cpu_t, cpu_pause_thread)); return (1); } #endif } if ((fp = mstate->dtms_getf) != NULL) { uintptr_t psz = sizeof (void *); vnode_t *vp; vnodeops_t *op; /* * When getf() returns a file_t, the enabling is implicitly * granted the (transient) right to read the returned file_t * as well as the v_path and v_op->vnop_name of the underlying * vnode. These accesses are allowed after a successful * getf() because the members that they refer to cannot change * once set -- and the barrier logic in the kernel's closef() * path assures that the file_t and its referenced vode_t * cannot themselves be stale (that is, it impossible for * either dtms_getf itself or its f_vnode member to reference * freed memory). */ if (DTRACE_INRANGE(addr, sz, fp, sizeof (file_t))) { DTRACE_RANGE_REMAIN(remain, addr, fp, sizeof (file_t)); return (1); } if ((vp = fp->f_vnode) != NULL) { size_t slen; #ifdef illumos if (DTRACE_INRANGE(addr, sz, &vp->v_path, psz)) { DTRACE_RANGE_REMAIN(remain, addr, &vp->v_path, psz); return (1); } slen = strlen(vp->v_path) + 1; if (DTRACE_INRANGE(addr, sz, vp->v_path, slen)) { DTRACE_RANGE_REMAIN(remain, addr, vp->v_path, slen); return (1); } #endif if (DTRACE_INRANGE(addr, sz, &vp->v_op, psz)) { DTRACE_RANGE_REMAIN(remain, addr, &vp->v_op, psz); return (1); } #ifdef illumos if ((op = vp->v_op) != NULL && DTRACE_INRANGE(addr, sz, &op->vnop_name, psz)) { DTRACE_RANGE_REMAIN(remain, addr, &op->vnop_name, psz); return (1); } if (op != NULL && op->vnop_name != NULL && DTRACE_INRANGE(addr, sz, op->vnop_name, (slen = strlen(op->vnop_name) + 1))) { DTRACE_RANGE_REMAIN(remain, addr, op->vnop_name, slen); return (1); } #endif } } DTRACE_CPUFLAG_SET(CPU_DTRACE_KPRIV); *illval = addr; return (0); } /* * Convenience routine to check to see if a given string is within a memory * region in which a load may be issued given the user's privilege level; * this exists so that we don't need to issue unnecessary dtrace_strlen() * calls in the event that the user has all privileges. 
*/ static int dtrace_strcanload(uint64_t addr, size_t sz, size_t *remain, dtrace_mstate_t *mstate, dtrace_vstate_t *vstate) { size_t rsize; /* * If we hold the privilege to read from kernel memory, then * everything is readable. */ if ((mstate->dtms_access & DTRACE_ACCESS_KERNEL) != 0) { DTRACE_RANGE_REMAIN(remain, addr, addr, sz); return (1); } /* * Even if the caller is uninterested in querying the remaining valid * range, it is required to ensure that the access is allowed. */ if (remain == NULL) { remain = &rsize; } if (dtrace_canload_remains(addr, 0, remain, mstate, vstate)) { size_t strsz; /* * Perform the strlen after determining the length of the * memory region which is accessible. This prevents timing * information from being used to find NULs in memory which is * not accessible to the caller. */ strsz = 1 + dtrace_strlen((char *)(uintptr_t)addr, MIN(sz, *remain)); if (strsz <= *remain) { return (1); } } return (0); } /* * Convenience routine to check to see if a given variable is within a memory * region in which a load may be issued given the user's privilege level. */ static int dtrace_vcanload(void *src, dtrace_diftype_t *type, size_t *remain, dtrace_mstate_t *mstate, dtrace_vstate_t *vstate) { size_t sz; ASSERT(type->dtdt_flags & DIF_TF_BYREF); /* * Calculate the max size before performing any checks since even * DTRACE_ACCESS_KERNEL-credentialed callers expect that this function * return the max length via 'remain'. */ if (type->dtdt_kind == DIF_TYPE_STRING) { dtrace_state_t *state = vstate->dtvs_state; if (state != NULL) { sz = state->dts_options[DTRACEOPT_STRSIZE]; } else { /* * In helper context, we have a NULL state; fall back * to using the system-wide default for the string size * in this case. */ sz = dtrace_strsize_default; } } else { sz = type->dtdt_size; } /* * If we hold the privilege to read from kernel memory, then * everything is readable. */ if ((mstate->dtms_access & DTRACE_ACCESS_KERNEL) != 0) { DTRACE_RANGE_REMAIN(remain, (uintptr_t)src, src, sz); return (1); } if (type->dtdt_kind == DIF_TYPE_STRING) { return (dtrace_strcanload((uintptr_t)src, sz, remain, mstate, vstate)); } return (dtrace_canload_remains((uintptr_t)src, sz, remain, mstate, vstate)); } /* * Convert a string to a signed integer using safe loads. * * NOTE: This function uses various macros from strtolctype.h to manipulate * digit values, etc -- these have all been checked to ensure they make * no additional function calls. */ static int64_t dtrace_strtoll(char *input, int base, size_t limit) { uintptr_t pos = (uintptr_t)input; int64_t val = 0; int x; boolean_t neg = B_FALSE; char c, cc, ccc; uintptr_t end = pos + limit; /* * Consume any whitespace preceding digits. */ while ((c = dtrace_load8(pos)) == ' ' || c == '\t') pos++; /* * Handle an explicit sign if one is present. */ if (c == '-' || c == '+') { if (c == '-') neg = B_TRUE; c = dtrace_load8(++pos); } /* * Check for an explicit hexadecimal prefix ("0x" or "0X") and skip it * if present. */ if (base == 16 && c == '0' && ((cc = dtrace_load8(pos + 1)) == 'x' || cc == 'X') && isxdigit(ccc = dtrace_load8(pos + 2))) { pos += 2; c = ccc; } /* * Read in contiguous digits until the first non-digit character. */ for (; pos < end && c != '\0' && lisalnum(c) && (x = DIGIT(c)) < base; c = dtrace_load8(++pos)) val = val * base + x; return (neg ? -val : val); } /* * Compare two strings using safe loads. 
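 * As with strncmp(), the return value is the difference between the first
 * pair of differing bytes: for example, comparing "abc" with "abd" over a
 * limit of 3 yields a negative value, while comparing "abc" with "abcdef"
 * over the same limit yields 0.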
*/ static int dtrace_strncmp(char *s1, char *s2, size_t limit) { uint8_t c1, c2; volatile uint16_t *flags; if (s1 == s2 || limit == 0) return (0); flags = (volatile uint16_t *)&cpu_core[curcpu].cpuc_dtrace_flags; do { if (s1 == NULL) { c1 = '\0'; } else { c1 = dtrace_load8((uintptr_t)s1++); } if (s2 == NULL) { c2 = '\0'; } else { c2 = dtrace_load8((uintptr_t)s2++); } if (c1 != c2) return (c1 - c2); } while (--limit && c1 != '\0' && !(*flags & CPU_DTRACE_FAULT)); return (0); } /* * Compute strlen(s) for a string using safe memory accesses. The additional * len parameter is used to specify a maximum length to ensure completion. */ static size_t dtrace_strlen(const char *s, size_t lim) { uint_t len; for (len = 0; len != lim; len++) { if (dtrace_load8((uintptr_t)s++) == '\0') break; } return (len); } /* * Check if an address falls within a toxic region. */ static int dtrace_istoxic(uintptr_t kaddr, size_t size) { uintptr_t taddr, tsize; int i; for (i = 0; i < dtrace_toxranges; i++) { taddr = dtrace_toxrange[i].dtt_base; tsize = dtrace_toxrange[i].dtt_limit - taddr; if (kaddr - taddr < tsize) { DTRACE_CPUFLAG_SET(CPU_DTRACE_BADADDR); cpu_core[curcpu].cpuc_dtrace_illval = kaddr; return (1); } if (taddr - kaddr < size) { DTRACE_CPUFLAG_SET(CPU_DTRACE_BADADDR); cpu_core[curcpu].cpuc_dtrace_illval = taddr; return (1); } } return (0); } /* * Copy src to dst using safe memory accesses. The src is assumed to be unsafe * memory specified by the DIF program. The dst is assumed to be safe memory * that we can store to directly because it is managed by DTrace. As with * standard bcopy, overlapping copies are handled properly. */ static void dtrace_bcopy(const void *src, void *dst, size_t len) { if (len != 0) { uint8_t *s1 = dst; const uint8_t *s2 = src; if (s1 <= s2) { do { *s1++ = dtrace_load8((uintptr_t)s2++); } while (--len != 0); } else { s2 += len; s1 += len; do { *--s1 = dtrace_load8((uintptr_t)--s2); } while (--len != 0); } } } /* * Copy src to dst using safe memory accesses, up to either the specified * length, or the point that a nul byte is encountered. The src is assumed to * be unsafe memory specified by the DIF program. The dst is assumed to be * safe memory that we can store to directly because it is managed by DTrace. * Unlike dtrace_bcopy(), overlapping regions are not handled. */ static void dtrace_strcpy(const void *src, void *dst, size_t len) { if (len != 0) { uint8_t *s1 = dst, c; const uint8_t *s2 = src; do { *s1++ = c = dtrace_load8((uintptr_t)s2++); } while (--len != 0 && c != '\0'); } } /* * Copy src to dst, deriving the size and type from the specified (BYREF) * variable type. The src is assumed to be unsafe memory specified by the DIF * program. The dst is assumed to be DTrace variable memory that is of the * specified type; we assume that we can store to directly. */ static void dtrace_vcopy(void *src, void *dst, dtrace_diftype_t *type, size_t limit) { ASSERT(type->dtdt_flags & DIF_TF_BYREF); if (type->dtdt_kind == DIF_TYPE_STRING) { dtrace_strcpy(src, dst, MIN(type->dtdt_size, limit)); } else { dtrace_bcopy(src, dst, MIN(type->dtdt_size, limit)); } } /* * Compare s1 to s2 using safe memory accesses. The s1 data is assumed to be * unsafe memory specified by the DIF program. The s2 data is assumed to be * safe memory that we can access directly because it is managed by DTrace. 
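 * Unlike bcmp(), the return value carries no ordering information: 0
 * indicates a match and 1 indicates any mismatch, including the case in
 * which exactly one of the two pointers is NULL.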
*/ static int dtrace_bcmp(const void *s1, const void *s2, size_t len) { volatile uint16_t *flags; flags = (volatile uint16_t *)&cpu_core[curcpu].cpuc_dtrace_flags; if (s1 == s2) return (0); if (s1 == NULL || s2 == NULL) return (1); if (s1 != s2 && len != 0) { const uint8_t *ps1 = s1; const uint8_t *ps2 = s2; do { if (dtrace_load8((uintptr_t)ps1++) != *ps2++) return (1); } while (--len != 0 && !(*flags & CPU_DTRACE_FAULT)); } return (0); } /* * Zero the specified region using a simple byte-by-byte loop. Note that this * is for safe DTrace-managed memory only. */ static void dtrace_bzero(void *dst, size_t len) { uchar_t *cp; for (cp = dst; len != 0; len--) *cp++ = 0; } static void dtrace_add_128(uint64_t *addend1, uint64_t *addend2, uint64_t *sum) { uint64_t result[2]; result[0] = addend1[0] + addend2[0]; result[1] = addend1[1] + addend2[1] + (result[0] < addend1[0] || result[0] < addend2[0] ? 1 : 0); sum[0] = result[0]; sum[1] = result[1]; } /* * Shift the 128-bit value in a by b. If b is positive, shift left. * If b is negative, shift right. */ static void dtrace_shift_128(uint64_t *a, int b) { uint64_t mask; if (b == 0) return; if (b < 0) { b = -b; if (b >= 64) { a[0] = a[1] >> (b - 64); a[1] = 0; } else { a[0] >>= b; mask = 1LL << (64 - b); mask -= 1; a[0] |= ((a[1] & mask) << (64 - b)); a[1] >>= b; } } else { if (b >= 64) { a[1] = a[0] << (b - 64); a[0] = 0; } else { a[1] <<= b; mask = a[0] >> (64 - b); a[1] |= mask; a[0] <<= b; } } } /* * The basic idea is to break the 2 64-bit values into 4 32-bit values, * use native multiplication on those, and then re-combine into the * resulting 128-bit value. * * (hi1 << 32 + lo1) * (hi2 << 32 + lo2) = * hi1 * hi2 << 64 + * hi1 * lo2 << 32 + * hi2 * lo1 << 32 + * lo1 * lo2 */ static void dtrace_multiply_128(uint64_t factor1, uint64_t factor2, uint64_t *product) { uint64_t hi1, hi2, lo1, lo2; uint64_t tmp[2]; hi1 = factor1 >> 32; hi2 = factor2 >> 32; lo1 = factor1 & DT_MASK_LO; lo2 = factor2 & DT_MASK_LO; product[0] = lo1 * lo2; product[1] = hi1 * hi2; tmp[0] = hi1 * lo2; tmp[1] = 0; dtrace_shift_128(tmp, 32); dtrace_add_128(product, tmp, product); tmp[0] = hi2 * lo1; tmp[1] = 0; dtrace_shift_128(tmp, 32); dtrace_add_128(product, tmp, product); } /* * This privilege check should be used by actions and subroutines to * verify that the user credentials of the process that enabled the * invoking ECB match the target credentials */ static int dtrace_priv_proc_common_user(dtrace_state_t *state) { cred_t *cr, *s_cr = state->dts_cred.dcr_cred; /* * We should always have a non-NULL state cred here, since if cred * is null (anonymous tracing), we fast-path bypass this routine. */ ASSERT(s_cr != NULL); if ((cr = CRED()) != NULL && s_cr->cr_uid == cr->cr_uid && s_cr->cr_uid == cr->cr_ruid && s_cr->cr_uid == cr->cr_suid && s_cr->cr_gid == cr->cr_gid && s_cr->cr_gid == cr->cr_rgid && s_cr->cr_gid == cr->cr_sgid) return (1); return (0); } /* * This privilege check should be used by actions and subroutines to * verify that the zone of the process that enabled the invoking ECB * matches the target credentials */ static int dtrace_priv_proc_common_zone(dtrace_state_t *state) { #ifdef illumos cred_t *cr, *s_cr = state->dts_cred.dcr_cred; /* * We should always have a non-NULL state cred here, since if cred * is null (anonymous tracing), we fast-path bypass this routine. 
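 * (On platforms without zones -- e.g. FreeBSD -- there is nothing to
 * check, and the routine below unconditionally succeeds.)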
*/ ASSERT(s_cr != NULL); if ((cr = CRED()) != NULL && s_cr->cr_zone == cr->cr_zone) return (1); return (0); #else return (1); #endif } /* * This privilege check should be used by actions and subroutines to * verify that the process has not setuid or changed credentials. */ static int dtrace_priv_proc_common_nocd(void) { proc_t *proc; if ((proc = ttoproc(curthread)) != NULL && !(proc->p_flag & SNOCD)) return (1); return (0); } static int dtrace_priv_proc_destructive(dtrace_state_t *state) { int action = state->dts_cred.dcr_action; if (((action & DTRACE_CRA_PROC_DESTRUCTIVE_ALLZONE) == 0) && dtrace_priv_proc_common_zone(state) == 0) goto bad; if (((action & DTRACE_CRA_PROC_DESTRUCTIVE_ALLUSER) == 0) && dtrace_priv_proc_common_user(state) == 0) goto bad; if (((action & DTRACE_CRA_PROC_DESTRUCTIVE_CREDCHG) == 0) && dtrace_priv_proc_common_nocd() == 0) goto bad; return (1); bad: cpu_core[curcpu].cpuc_dtrace_flags |= CPU_DTRACE_UPRIV; return (0); } static int dtrace_priv_proc_control(dtrace_state_t *state) { if (state->dts_cred.dcr_action & DTRACE_CRA_PROC_CONTROL) return (1); if (dtrace_priv_proc_common_zone(state) && dtrace_priv_proc_common_user(state) && dtrace_priv_proc_common_nocd()) return (1); cpu_core[curcpu].cpuc_dtrace_flags |= CPU_DTRACE_UPRIV; return (0); } static int dtrace_priv_proc(dtrace_state_t *state) { if (state->dts_cred.dcr_action & DTRACE_CRA_PROC) return (1); cpu_core[curcpu].cpuc_dtrace_flags |= CPU_DTRACE_UPRIV; return (0); } static int dtrace_priv_kernel(dtrace_state_t *state) { if (state->dts_cred.dcr_action & DTRACE_CRA_KERNEL) return (1); cpu_core[curcpu].cpuc_dtrace_flags |= CPU_DTRACE_KPRIV; return (0); } static int dtrace_priv_kernel_destructive(dtrace_state_t *state) { if (state->dts_cred.dcr_action & DTRACE_CRA_KERNEL_DESTRUCTIVE) return (1); cpu_core[curcpu].cpuc_dtrace_flags |= CPU_DTRACE_KPRIV; return (0); } /* * Determine if the dte_cond of the specified ECB allows for processing of * the current probe to continue. Note that this routine may allow continued * processing, but with access(es) stripped from the mstate's dtms_access * field. */ static int dtrace_priv_probe(dtrace_state_t *state, dtrace_mstate_t *mstate, dtrace_ecb_t *ecb) { dtrace_probe_t *probe = ecb->dte_probe; dtrace_provider_t *prov = probe->dtpr_provider; dtrace_pops_t *pops = &prov->dtpv_pops; int mode = DTRACE_MODE_NOPRIV_DROP; ASSERT(ecb->dte_cond); #ifdef illumos if (pops->dtps_mode != NULL) { mode = pops->dtps_mode(prov->dtpv_arg, probe->dtpr_id, probe->dtpr_arg); ASSERT((mode & DTRACE_MODE_USER) || (mode & DTRACE_MODE_KERNEL)); ASSERT((mode & DTRACE_MODE_NOPRIV_RESTRICT) || (mode & DTRACE_MODE_NOPRIV_DROP)); } /* * If the dte_cond bits indicate that this consumer is only allowed to * see user-mode firings of this probe, call the provider's dtps_mode() * entry point to check that the probe was fired while in a user * context. If that's not the case, use the policy specified by the * provider to determine if we drop the probe or merely restrict * operation. */ if (ecb->dte_cond & DTRACE_COND_USERMODE) { ASSERT(mode != DTRACE_MODE_NOPRIV_DROP); if (!(mode & DTRACE_MODE_USER)) { if (mode & DTRACE_MODE_NOPRIV_DROP) return (0); mstate->dtms_access &= ~DTRACE_ACCESS_ARGS; } } #endif /* * This is more subtle than it looks. We have to be absolutely certain * that CRED() isn't going to change out from under us so it's only * legit to examine that structure if we're in constrained situations. 
* Currently, the only times we'll do this check is if a non-super-user * has enabled the profile or syscall providers -- providers that * allow visibility of all processes. For the profile case, the check * above will ensure that we're examining a user context. */ if (ecb->dte_cond & DTRACE_COND_OWNER) { cred_t *cr; cred_t *s_cr = state->dts_cred.dcr_cred; proc_t *proc; ASSERT(s_cr != NULL); if ((cr = CRED()) == NULL || s_cr->cr_uid != cr->cr_uid || s_cr->cr_uid != cr->cr_ruid || s_cr->cr_uid != cr->cr_suid || s_cr->cr_gid != cr->cr_gid || s_cr->cr_gid != cr->cr_rgid || s_cr->cr_gid != cr->cr_sgid || (proc = ttoproc(curthread)) == NULL || (proc->p_flag & SNOCD)) { if (mode & DTRACE_MODE_NOPRIV_DROP) return (0); #ifdef illumos mstate->dtms_access &= ~DTRACE_ACCESS_PROC; #endif } } #ifdef illumos /* * If our dte_cond is set to DTRACE_COND_ZONEOWNER and we are not * in our zone, check to see if our mode policy is to restrict rather * than to drop; if to restrict, strip away both DTRACE_ACCESS_PROC * and DTRACE_ACCESS_ARGS */ if (ecb->dte_cond & DTRACE_COND_ZONEOWNER) { cred_t *cr; cred_t *s_cr = state->dts_cred.dcr_cred; ASSERT(s_cr != NULL); if ((cr = CRED()) == NULL || s_cr->cr_zone->zone_id != cr->cr_zone->zone_id) { if (mode & DTRACE_MODE_NOPRIV_DROP) return (0); mstate->dtms_access &= ~(DTRACE_ACCESS_PROC | DTRACE_ACCESS_ARGS); } } #endif return (1); } /* * Note: not called from probe context. This function is called * asynchronously (and at a regular interval) from outside of probe context to * clean the dirty dynamic variable lists on all CPUs. Dynamic variable * cleaning is explained in detail in <sys/dtrace_impl.h>. */ void dtrace_dynvar_clean(dtrace_dstate_t *dstate) { dtrace_dynvar_t *dirty; dtrace_dstate_percpu_t *dcpu; dtrace_dynvar_t **rinsep; int i, j, work = 0; for (i = 0; i < NCPU; i++) { dcpu = &dstate->dtds_percpu[i]; rinsep = &dcpu->dtdsc_rinsing; /* * If the dirty list is NULL, there is no dirty work to do. */ if (dcpu->dtdsc_dirty == NULL) continue; if (dcpu->dtdsc_rinsing != NULL) { /* * If the rinsing list is non-NULL, then it is because * this CPU was selected to accept another CPU's * dirty list -- and since that time, dirty buffers * have accumulated. This is a highly unlikely * condition, but we choose to ignore the dirty * buffers -- they'll be picked up in a future cleanse. */ continue; } if (dcpu->dtdsc_clean != NULL) { /* * If the clean list is non-NULL, then we're in a * situation where a CPU has done deallocations (we * have a non-NULL dirty list) but no allocations (we * also have a non-NULL clean list). We can't simply * move the dirty list into the clean list on this * CPU, yet we also don't want to allow this condition * to persist, lest a short clean list prevent a * massive dirty list from being cleaned (which in * turn could lead to otherwise avoidable dynamic * drops). To deal with this, we look for some CPU * with a NULL clean list, NULL dirty list, and NULL * rinsing list -- and then we borrow this CPU to * rinse our dirty list. */ for (j = 0; j < NCPU; j++) { dtrace_dstate_percpu_t *rinser; rinser = &dstate->dtds_percpu[j]; if (rinser->dtdsc_rinsing != NULL) continue; if (rinser->dtdsc_dirty != NULL) continue; if (rinser->dtdsc_clean != NULL) continue; rinsep = &rinser->dtdsc_rinsing; break; } if (j == NCPU) { /* * We were unable to find another CPU that * could accept this dirty list -- we are * therefore unable to clean it now. */ dtrace_dynvar_failclean++; continue; } } work = 1; /* * Atomically move the dirty list aside.
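 * (This is a standard lock-free exchange: snapshot the dirty list head,
 * publish it as the rinsing list, and then compare-and-swap the dirty
 * head back to NULL.  If the cas fails because probe context freed
 * another variable in the interim, we retry with the new head.)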
*/ do { dirty = dcpu->dtdsc_dirty; /* * Before we zap the dirty list, set the rinsing list. * (This allows for a potential assertion in * dtrace_dynvar(): if a free dynamic variable appears * on a hash chain, either the dirty list or the * rinsing list for some CPU must be non-NULL.) */ *rinsep = dirty; dtrace_membar_producer(); } while (dtrace_casptr(&dcpu->dtdsc_dirty, dirty, NULL) != dirty); } if (!work) { /* * We have no work to do; we can simply return. */ return; } dtrace_sync(); for (i = 0; i < NCPU; i++) { dcpu = &dstate->dtds_percpu[i]; if (dcpu->dtdsc_rinsing == NULL) continue; /* * We are now guaranteed that no hash chain contains a pointer * into this dirty list; we can make it clean. */ ASSERT(dcpu->dtdsc_clean == NULL); dcpu->dtdsc_clean = dcpu->dtdsc_rinsing; dcpu->dtdsc_rinsing = NULL; } /* * Before we actually set the state to be DTRACE_DSTATE_CLEAN, make * sure that all CPUs have seen all of the dtdsc_clean pointers. * This prevents a race whereby a CPU incorrectly decides that * the state should be something other than DTRACE_DSTATE_CLEAN * after dtrace_dynvar_clean() has completed. */ dtrace_sync(); dstate->dtds_state = DTRACE_DSTATE_CLEAN; } /* * Depending on the value of the op parameter, this function looks-up, * allocates or deallocates an arbitrarily-keyed dynamic variable. If an * allocation is requested, this function will return a pointer to a * dtrace_dynvar_t corresponding to the allocated variable -- or NULL if no * variable can be allocated. If NULL is returned, the appropriate counter * will be incremented. */ dtrace_dynvar_t * dtrace_dynvar(dtrace_dstate_t *dstate, uint_t nkeys, dtrace_key_t *key, size_t dsize, dtrace_dynvar_op_t op, dtrace_mstate_t *mstate, dtrace_vstate_t *vstate) { uint64_t hashval = DTRACE_DYNHASH_VALID; dtrace_dynhash_t *hash = dstate->dtds_hash; dtrace_dynvar_t *free, *new_free, *next, *dvar, *start, *prev = NULL; processorid_t me = curcpu, cpu = me; dtrace_dstate_percpu_t *dcpu = &dstate->dtds_percpu[me]; size_t bucket, ksize; size_t chunksize = dstate->dtds_chunksize; uintptr_t kdata, lock, nstate; uint_t i; ASSERT(nkeys != 0); /* * Hash the key. As with aggregations, we use Jenkins' "One-at-a-time" * algorithm. For the by-value portions, we perform the algorithm in * 16-bit chunks (as opposed to 8-bit chunks). This speeds things up a * bit, and seems to have only a minute effect on distribution. For * the by-reference data, we perform "One-at-a-time" iterating (safely) * over each referenced byte. It's painful to do this, but it's much * better than pathological hash distribution. The efficacy of the * hashing algorithm (and a comparison with other algorithms) may be * found by running the ::dtrace_dynstat MDB dcmd. */ for (i = 0; i < nkeys; i++) { if (key[i].dttk_size == 0) { uint64_t val = key[i].dttk_value; hashval += (val >> 48) & 0xffff; hashval += (hashval << 10); hashval ^= (hashval >> 6); hashval += (val >> 32) & 0xffff; hashval += (hashval << 10); hashval ^= (hashval >> 6); hashval += (val >> 16) & 0xffff; hashval += (hashval << 10); hashval ^= (hashval >> 6); hashval += val & 0xffff; hashval += (hashval << 10); hashval ^= (hashval >> 6); } else { /* * This is incredibly painful, but it beats the hell * out of the alternative. 
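 * (The loop below applies the same "One-at-a-time" mixing step to each
 * safely-loaded byte that the by-value case above applies to each 16-bit
 * chunk:
 *
 *	hashval += dtrace_load8(base + j);
 *	hashval += (hashval << 10);
 *	hashval ^= (hashval >> 6);
 *
 * with the usual finalization performed once all keys have been
 * consumed.)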
*/ uint64_t j, size = key[i].dttk_size; uintptr_t base = (uintptr_t)key[i].dttk_value; if (!dtrace_canload(base, size, mstate, vstate)) break; for (j = 0; j < size; j++) { hashval += dtrace_load8(base + j); hashval += (hashval << 10); hashval ^= (hashval >> 6); } } } if (DTRACE_CPUFLAG_ISSET(CPU_DTRACE_FAULT)) return (NULL); hashval += (hashval << 3); hashval ^= (hashval >> 11); hashval += (hashval << 15); /* * There is a remote chance (ideally, 1 in 2^31) that our hashval * comes out to be one of our two sentinel hash values. If this * actually happens, we set the hashval to be a value known to be a * non-sentinel value. */ if (hashval == DTRACE_DYNHASH_FREE || hashval == DTRACE_DYNHASH_SINK) hashval = DTRACE_DYNHASH_VALID; /* * Yes, it's painful to do a divide here. If the cycle count becomes * important here, tricks can be pulled to reduce it. (However, it's * critical that hash collisions be kept to an absolute minimum; * they're much more painful than a divide.) It's better to have a * solution that generates few collisions and still keeps things * relatively simple. */ bucket = hashval % dstate->dtds_hashsize; if (op == DTRACE_DYNVAR_DEALLOC) { volatile uintptr_t *lockp = &hash[bucket].dtdh_lock; for (;;) { while ((lock = *lockp) & 1) continue; if (dtrace_casptr((volatile void *)lockp, (volatile void *)lock, (volatile void *)(lock + 1)) == (void *)lock) break; } dtrace_membar_producer(); } top: prev = NULL; lock = hash[bucket].dtdh_lock; dtrace_membar_consumer(); start = hash[bucket].dtdh_chain; ASSERT(start != NULL && (start->dtdv_hashval == DTRACE_DYNHASH_SINK || start->dtdv_hashval != DTRACE_DYNHASH_FREE || op != DTRACE_DYNVAR_DEALLOC)); for (dvar = start; dvar != NULL; dvar = dvar->dtdv_next) { dtrace_tuple_t *dtuple = &dvar->dtdv_tuple; dtrace_key_t *dkey = &dtuple->dtt_key[0]; if (dvar->dtdv_hashval != hashval) { if (dvar->dtdv_hashval == DTRACE_DYNHASH_SINK) { /* * We've reached the sink, and therefore the * end of the hash chain; we can kick out of * the loop knowing that we have seen a valid * snapshot of state. */ ASSERT(dvar->dtdv_next == NULL); ASSERT(dvar == &dtrace_dynhash_sink); break; } if (dvar->dtdv_hashval == DTRACE_DYNHASH_FREE) { /* * We've gone off the rails: somewhere along * the line, one of the members of this hash * chain was deleted. Note that we could also * detect this by simply letting this loop run * to completion, as we would eventually hit * the end of the dirty list. However, we * want to avoid running the length of the * dirty list unnecessarily (it might be quite * long), so we catch this as early as * possible by detecting the hash marker. In * this case, we simply set dvar to NULL and * break; the conditional after the loop will * send us back to top. 
*/ dvar = NULL; break; } goto next; } if (dtuple->dtt_nkeys != nkeys) goto next; for (i = 0; i < nkeys; i++, dkey++) { if (dkey->dttk_size != key[i].dttk_size) goto next; /* size or type mismatch */ if (dkey->dttk_size != 0) { if (dtrace_bcmp( (void *)(uintptr_t)key[i].dttk_value, (void *)(uintptr_t)dkey->dttk_value, dkey->dttk_size)) goto next; } else { if (dkey->dttk_value != key[i].dttk_value) goto next; } } if (op != DTRACE_DYNVAR_DEALLOC) return (dvar); ASSERT(dvar->dtdv_next == NULL || dvar->dtdv_next->dtdv_hashval != DTRACE_DYNHASH_FREE); if (prev != NULL) { ASSERT(hash[bucket].dtdh_chain != dvar); ASSERT(start != dvar); ASSERT(prev->dtdv_next == dvar); prev->dtdv_next = dvar->dtdv_next; } else { if (dtrace_casptr(&hash[bucket].dtdh_chain, start, dvar->dtdv_next) != start) { /* * We have failed to atomically swing the * hash table head pointer, presumably because * of a conflicting allocation on another CPU. * We need to reread the hash chain and try * again. */ goto top; } } dtrace_membar_producer(); /* * Now set the hash value to indicate that it's free. */ ASSERT(hash[bucket].dtdh_chain != dvar); dvar->dtdv_hashval = DTRACE_DYNHASH_FREE; dtrace_membar_producer(); /* * Set the next pointer to point at the dirty list, and * atomically swing the dirty pointer to the newly freed dvar. */ do { next = dcpu->dtdsc_dirty; dvar->dtdv_next = next; } while (dtrace_casptr(&dcpu->dtdsc_dirty, next, dvar) != next); /* * Finally, unlock this hash bucket. */ ASSERT(hash[bucket].dtdh_lock == lock); ASSERT(lock & 1); hash[bucket].dtdh_lock++; return (NULL); next: prev = dvar; continue; } if (dvar == NULL) { /* * If dvar is NULL, it is because we went off the rails: * one of the elements that we traversed in the hash chain * was deleted while we were traversing it. In this case, * we assert that we aren't doing a dealloc (deallocs lock * the hash bucket to prevent themselves from racing with * one another), and retry the hash chain traversal. */ ASSERT(op != DTRACE_DYNVAR_DEALLOC); goto top; } if (op != DTRACE_DYNVAR_ALLOC) { /* * If we are not to allocate a new variable, we want to * return NULL now. Before we return, check that the value * of the lock word hasn't changed. If it has, we may have * seen an inconsistent snapshot. */ if (op == DTRACE_DYNVAR_NOALLOC) { if (hash[bucket].dtdh_lock != lock) goto top; } else { ASSERT(op == DTRACE_DYNVAR_DEALLOC); ASSERT(hash[bucket].dtdh_lock == lock); ASSERT(lock & 1); hash[bucket].dtdh_lock++; } return (NULL); } /* * We need to allocate a new dynamic variable. The size we need is the * size of dtrace_dynvar plus the size of nkeys dtrace_key_t's plus the * size of any auxiliary key data (rounded up to 8-byte alignment) plus * the size of any referred-to data (dsize). We then round the final * size up to the chunksize for allocation. */ for (ksize = 0, i = 0; i < nkeys; i++) ksize += P2ROUNDUP(key[i].dttk_size, sizeof (uint64_t)); /* * This should be pretty much impossible, but could happen if, say, * strange DIF specified the tuple. Ideally, this should be an * assertion and not an error condition -- but that requires that the * chunksize calculation in dtrace_difo_chunksize() be absolutely * bullet-proof. (That is, it must not be able to be fooled by * malicious DIF.) Given the lack of backwards branches in DIF, * solving this would presumably not amount to solving the Halting * Problem -- but it still seems awfully hard. 
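 * (To make the check below concrete: a two-key tuple with 8 bytes of
 * auxiliary key data and a 16-byte value requires sizeof
 * (dtrace_dynvar_t) plus one additional dtrace_key_t plus 24 bytes; if
 * that total exceeds the chunksize, the allocation is refused and
 * counted as a drop.)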
*/ if (sizeof (dtrace_dynvar_t) + sizeof (dtrace_key_t) * (nkeys - 1) + ksize + dsize > chunksize) { dcpu->dtdsc_drops++; return (NULL); } nstate = DTRACE_DSTATE_EMPTY; do { retry: free = dcpu->dtdsc_free; if (free == NULL) { dtrace_dynvar_t *clean = dcpu->dtdsc_clean; void *rval; if (clean == NULL) { /* * We're out of dynamic variable space on * this CPU. Unless we have tried all CPUs, * we'll try to allocate from a different * CPU. */ switch (dstate->dtds_state) { case DTRACE_DSTATE_CLEAN: { void *sp = &dstate->dtds_state; if (++cpu >= NCPU) cpu = 0; if (dcpu->dtdsc_dirty != NULL && nstate == DTRACE_DSTATE_EMPTY) nstate = DTRACE_DSTATE_DIRTY; if (dcpu->dtdsc_rinsing != NULL) nstate = DTRACE_DSTATE_RINSING; dcpu = &dstate->dtds_percpu[cpu]; if (cpu != me) goto retry; (void) dtrace_cas32(sp, DTRACE_DSTATE_CLEAN, nstate); /* * To increment the correct bean * counter, take another lap. */ goto retry; } case DTRACE_DSTATE_DIRTY: dcpu->dtdsc_dirty_drops++; break; case DTRACE_DSTATE_RINSING: dcpu->dtdsc_rinsing_drops++; break; case DTRACE_DSTATE_EMPTY: dcpu->dtdsc_drops++; break; } DTRACE_CPUFLAG_SET(CPU_DTRACE_DROP); return (NULL); } /* * The clean list appears to be non-empty. We want to * move the clean list to the free list; we start by * moving the clean pointer aside. */ if (dtrace_casptr(&dcpu->dtdsc_clean, clean, NULL) != clean) { /* * We are in one of two situations: * * (a) The clean list was switched to the * free list by another CPU. * * (b) The clean list was added to by the * cleansing cyclic. * * In either of these situations, we can * just reattempt the free list allocation. */ goto retry; } ASSERT(clean->dtdv_hashval == DTRACE_DYNHASH_FREE); /* * Now we'll move the clean list to our free list. * It's impossible for this to fail: the only way * the free list can be updated is through this * code path, and only one CPU can own the clean list. * Thus, it would only be possible for this to fail if * this code were racing with dtrace_dynvar_clean(). * (That is, if dtrace_dynvar_clean() updated the clean * list, and we ended up racing to update the free * list.) This race is prevented by the dtrace_sync() * in dtrace_dynvar_clean() -- which flushes the * owners of the clean lists out before resetting * the clean lists. */ dcpu = &dstate->dtds_percpu[me]; rval = dtrace_casptr(&dcpu->dtdsc_free, NULL, clean); ASSERT(rval == NULL); goto retry; } dvar = free; new_free = dvar->dtdv_next; } while (dtrace_casptr(&dcpu->dtdsc_free, free, new_free) != free); /* * We have now allocated a new chunk. We copy the tuple keys into the * tuple array and copy any referenced key data into the data space * following the tuple array. As we do this, we relocate dttk_value * in the final tuple to point to the key data address in the chunk. */ kdata = (uintptr_t)&dvar->dtdv_tuple.dtt_key[nkeys]; dvar->dtdv_data = (void *)(kdata + ksize); dvar->dtdv_tuple.dtt_nkeys = nkeys; for (i = 0; i < nkeys; i++) { dtrace_key_t *dkey = &dvar->dtdv_tuple.dtt_key[i]; size_t kesize = key[i].dttk_size; if (kesize != 0) { dtrace_bcopy( (const void *)(uintptr_t)key[i].dttk_value, (void *)kdata, kesize); dkey->dttk_value = kdata; kdata += P2ROUNDUP(kesize, sizeof (uint64_t)); } else { dkey->dttk_value = key[i].dttk_value; } dkey->dttk_size = kesize; } ASSERT(dvar->dtdv_hashval == DTRACE_DYNHASH_FREE); dvar->dtdv_hashval = hashval; dvar->dtdv_next = start; if (dtrace_casptr(&hash[bucket].dtdh_chain, start, dvar) == start) return (dvar); /* * The cas has failed. 
Either another CPU is adding an element to * this hash chain, or another CPU is deleting an element from this * hash chain. The simplest way to deal with both of these cases * (though not necessarily the most efficient) is to free our * allocated block and re-attempt it all. Note that the free is * to the dirty list and _not_ to the free list. This is to prevent * races with allocators, above. */ dvar->dtdv_hashval = DTRACE_DYNHASH_FREE; dtrace_membar_producer(); do { free = dcpu->dtdsc_dirty; dvar->dtdv_next = free; } while (dtrace_casptr(&dcpu->dtdsc_dirty, free, dvar) != free); goto top; } /*ARGSUSED*/ static void dtrace_aggregate_min(uint64_t *oval, uint64_t nval, uint64_t arg) { if ((int64_t)nval < (int64_t)*oval) *oval = nval; } /*ARGSUSED*/ static void dtrace_aggregate_max(uint64_t *oval, uint64_t nval, uint64_t arg) { if ((int64_t)nval > (int64_t)*oval) *oval = nval; } static void dtrace_aggregate_quantize(uint64_t *quanta, uint64_t nval, uint64_t incr) { int i, zero = DTRACE_QUANTIZE_ZEROBUCKET; int64_t val = (int64_t)nval; if (val < 0) { for (i = 0; i < zero; i++) { if (val <= DTRACE_QUANTIZE_BUCKETVAL(i)) { quanta[i] += incr; return; } } } else { for (i = zero + 1; i < DTRACE_QUANTIZE_NBUCKETS; i++) { if (val < DTRACE_QUANTIZE_BUCKETVAL(i)) { quanta[i - 1] += incr; return; } } quanta[DTRACE_QUANTIZE_NBUCKETS - 1] += incr; return; } ASSERT(0); } static void dtrace_aggregate_lquantize(uint64_t *lquanta, uint64_t nval, uint64_t incr) { uint64_t arg = *lquanta++; int32_t base = DTRACE_LQUANTIZE_BASE(arg); uint16_t step = DTRACE_LQUANTIZE_STEP(arg); uint16_t levels = DTRACE_LQUANTIZE_LEVELS(arg); int32_t val = (int32_t)nval, level; ASSERT(step != 0); ASSERT(levels != 0); if (val < base) { /* * This is an underflow. */ lquanta[0] += incr; return; } level = (val - base) / step; if (level < levels) { lquanta[level + 1] += incr; return; } /* * This is an overflow. */ lquanta[levels + 1] += incr; } static int dtrace_aggregate_llquantize_bucket(uint16_t factor, uint16_t low, uint16_t high, uint16_t nsteps, int64_t value) { int64_t this = 1, last, next; int base = 1, order; ASSERT(factor <= nsteps); ASSERT(nsteps % factor == 0); for (order = 0; order < low; order++) this *= factor; /* * If our value is less than our factor taken to the power of the * low order of magnitude, it goes into the zeroth bucket. */ if (value < (last = this)) return (0); for (this *= factor; order <= high; order++) { int nbuckets = this > nsteps ? nsteps : this; if ((next = this * factor) < this) { /* * We should not generally get log/linear quantizations * with a high magnitude that allows 64-bits to * overflow, but we nonetheless protect against this * by explicitly checking for overflow, and clamping * our value accordingly. */ value = this - 1; } if (value < this) { /* * If our value lies within this order of magnitude, * determine its position by taking the offset within * the order of magnitude, dividing by the bucket * width, and adding to our (accumulated) base. */ return (base + (value - last) / (this / nbuckets)); } base += nbuckets - (nbuckets / factor); last = this; this = next; } /* * Our value is greater than or equal to our factor taken to the * power of one plus the high magnitude -- return the top bucket. 
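 * (For example, with factor 10, low 0, high 2 and nsteps 10, this
 * function assigns bucket 0 to values below 1, buckets 1 through 9 to
 * values 1 through 9, buckets 10 through 18 to values 10 through 99 in
 * steps of 10, buckets 19 through 27 to values 100 through 999 in steps
 * of 100, and bucket 28 -- the top bucket -- to values of 1000 and up.)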
*/ return (base); } static void dtrace_aggregate_llquantize(uint64_t *llquanta, uint64_t nval, uint64_t incr) { uint64_t arg = *llquanta++; uint16_t factor = DTRACE_LLQUANTIZE_FACTOR(arg); uint16_t low = DTRACE_LLQUANTIZE_LOW(arg); uint16_t high = DTRACE_LLQUANTIZE_HIGH(arg); uint16_t nsteps = DTRACE_LLQUANTIZE_NSTEP(arg); llquanta[dtrace_aggregate_llquantize_bucket(factor, low, high, nsteps, nval)] += incr; } /*ARGSUSED*/ static void dtrace_aggregate_avg(uint64_t *data, uint64_t nval, uint64_t arg) { data[0]++; data[1] += nval; } /*ARGSUSED*/ static void dtrace_aggregate_stddev(uint64_t *data, uint64_t nval, uint64_t arg) { int64_t snval = (int64_t)nval; uint64_t tmp[2]; data[0]++; data[1] += nval; /* * What we want to say here is: * * data[2] += nval * nval; * * But given that nval is 64-bit, we could easily overflow, so * we do this as 128-bit arithmetic. */ if (snval < 0) snval = -snval; dtrace_multiply_128((uint64_t)snval, (uint64_t)snval, tmp); dtrace_add_128(data + 2, tmp, data + 2); } /*ARGSUSED*/ static void dtrace_aggregate_count(uint64_t *oval, uint64_t nval, uint64_t arg) { *oval = *oval + 1; } /*ARGSUSED*/ static void dtrace_aggregate_sum(uint64_t *oval, uint64_t nval, uint64_t arg) { *oval += nval; } /* * Aggregate given the tuple in the principal data buffer, and the aggregating * action denoted by the specified dtrace_aggregation_t. The aggregation * buffer is specified as the buf parameter. This routine does not return * failure; if there is no space in the aggregation buffer, the data will be * dropped, and a corresponding counter incremented. */ static void dtrace_aggregate(dtrace_aggregation_t *agg, dtrace_buffer_t *dbuf, intptr_t offset, dtrace_buffer_t *buf, uint64_t expr, uint64_t arg) { dtrace_recdesc_t *rec = &agg->dtag_action.dta_rec; uint32_t i, ndx, size, fsize; uint32_t align = sizeof (uint64_t) - 1; dtrace_aggbuffer_t *agb; dtrace_aggkey_t *key; uint32_t hashval = 0, limit, isstr; caddr_t tomax, data, kdata; dtrace_actkind_t action; dtrace_action_t *act; uintptr_t offs; if (buf == NULL) return; if (!agg->dtag_hasarg) { /* * Currently, only quantize() and lquantize() take additional * arguments, and they have the same semantics: an increment * value that defaults to 1 when not present. If additional * aggregating actions take arguments, the setting of the * default argument value will presumably have to become more * sophisticated... */ arg = 1; } action = agg->dtag_action.dta_kind - DTRACEACT_AGGREGATION; size = rec->dtrd_offset - agg->dtag_base; fsize = size + rec->dtrd_size; ASSERT(dbuf->dtb_tomax != NULL); data = dbuf->dtb_tomax + offset + agg->dtag_base; if ((tomax = buf->dtb_tomax) == NULL) { dtrace_buffer_drop(buf); return; } /* * The metastructure is always at the bottom of the buffer. */ agb = (dtrace_aggbuffer_t *)(tomax + buf->dtb_size - sizeof (dtrace_aggbuffer_t)); if (buf->dtb_offset == 0) { /* * We just kludge up approximately 1/8th of the size to be * buckets. If this guess ends up being routinely * off-the-mark, we may need to dynamically readjust this * based on past performance. */ uintptr_t hashsize = (buf->dtb_size >> 3) / sizeof (uintptr_t); if ((uintptr_t)agb - hashsize * sizeof (dtrace_aggkey_t *) < (uintptr_t)tomax || hashsize == 0) { /* * We've been given a ludicrously small buffer; * increment our drop count and leave. */ dtrace_buffer_drop(buf); return; } /* * And now, a pathetic attempt to try to get an odd (or * perchance, a prime) hash size for better hash distribution.
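 * (The slew subtracted below is an odd constant; because aggregation
 * buffers are typically sized as powers of two, the derived hash size is
 * typically even, and subtracting an odd slew from any sufficiently
 * large even size yields an odd one at trivial cost.)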
*/ if (hashsize > (DTRACE_AGGHASHSIZE_SLEW << 3)) hashsize -= DTRACE_AGGHASHSIZE_SLEW; agb->dtagb_hashsize = hashsize; agb->dtagb_hash = (dtrace_aggkey_t **)((uintptr_t)agb - agb->dtagb_hashsize * sizeof (dtrace_aggkey_t *)); agb->dtagb_free = (uintptr_t)agb->dtagb_hash; for (i = 0; i < agb->dtagb_hashsize; i++) agb->dtagb_hash[i] = NULL; } ASSERT(agg->dtag_first != NULL); ASSERT(agg->dtag_first->dta_intuple); /* * Calculate the hash value based on the key. Note that we _don't_ * include the aggid in the hashing (but we will store it as part of * the key). The hashing algorithm is Bob Jenkins' "One-at-a-time" * algorithm: a simple, quick algorithm that has no known funnels, and * gets good distribution in practice. The efficacy of the hashing * algorithm (and a comparison with other algorithms) may be found by * running the ::dtrace_aggstat MDB dcmd. */ for (act = agg->dtag_first; act->dta_intuple; act = act->dta_next) { i = act->dta_rec.dtrd_offset - agg->dtag_base; limit = i + act->dta_rec.dtrd_size; ASSERT(limit <= size); isstr = DTRACEACT_ISSTRING(act); for (; i < limit; i++) { hashval += data[i]; hashval += (hashval << 10); hashval ^= (hashval >> 6); if (isstr && data[i] == '\0') break; } } hashval += (hashval << 3); hashval ^= (hashval >> 11); hashval += (hashval << 15); /* * Yes, the divide here is expensive -- but it's generally the least * of the performance issues given the amount of data that we iterate * over to compute hash values, compare data, etc. */ ndx = hashval % agb->dtagb_hashsize; for (key = agb->dtagb_hash[ndx]; key != NULL; key = key->dtak_next) { ASSERT((caddr_t)key >= tomax); ASSERT((caddr_t)key < tomax + buf->dtb_size); if (hashval != key->dtak_hashval || key->dtak_size != size) continue; kdata = key->dtak_data; ASSERT(kdata >= tomax && kdata < tomax + buf->dtb_size); for (act = agg->dtag_first; act->dta_intuple; act = act->dta_next) { i = act->dta_rec.dtrd_offset - agg->dtag_base; limit = i + act->dta_rec.dtrd_size; ASSERT(limit <= size); isstr = DTRACEACT_ISSTRING(act); for (; i < limit; i++) { if (kdata[i] != data[i]) goto next; if (isstr && data[i] == '\0') break; } } if (action != key->dtak_action) { /* * We are aggregating on the same value in the same * aggregation with two different aggregating actions. * (This should have been picked up in the compiler, * so we may be dealing with errant or devious DIF.) * This is an error condition; we indicate as much, * and return. */ DTRACE_CPUFLAG_SET(CPU_DTRACE_ILLOP); return; } /* * This is a hit: we need to apply the aggregator to * the value at this key. */ agg->dtag_aggregate((uint64_t *)(kdata + size), expr, arg); return; next: continue; } /* * We didn't find it. We need to allocate some zero-filled space, * link it into the hash table appropriately, and apply the aggregator * to the (zero-filled) value. */ offs = buf->dtb_offset; while (offs & (align - 1)) offs += sizeof (uint32_t); /* * If we don't have enough room to both allocate a new key _and_ * its associated data, increment the drop count and return. */ if ((uintptr_t)tomax + offs + fsize > agb->dtagb_free - sizeof (dtrace_aggkey_t)) { dtrace_buffer_drop(buf); return; } /*CONSTCOND*/ ASSERT(!(sizeof (dtrace_aggkey_t) & (sizeof (uintptr_t) - 1))); key = (dtrace_aggkey_t *)(agb->dtagb_free - sizeof (dtrace_aggkey_t)); agb->dtagb_free -= sizeof (dtrace_aggkey_t); key->dtak_data = kdata = tomax + offs; buf->dtb_offset = offs + fsize; /* * Now copy the data across. 
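 * (The key is laid out as the aggregation ID followed by the tuple data;
 * the running aggregation value itself lives immediately after the key,
 * at dtak_data + size, and is seeded from dtag_initial below before the
 * aggregator is first applied.)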
*/ *((dtrace_aggid_t *)kdata) = agg->dtag_id; for (i = sizeof (dtrace_aggid_t); i < size; i++) kdata[i] = data[i]; /* * Because strings are not zeroed out by default, we need to iterate * looking for actions that store strings, and we need to explicitly * pad these strings out with zeroes. */ for (act = agg->dtag_first; act->dta_intuple; act = act->dta_next) { int nul; if (!DTRACEACT_ISSTRING(act)) continue; i = act->dta_rec.dtrd_offset - agg->dtag_base; limit = i + act->dta_rec.dtrd_size; ASSERT(limit <= size); for (nul = 0; i < limit; i++) { if (nul) { kdata[i] = '\0'; continue; } if (data[i] != '\0') continue; nul = 1; } } for (i = size; i < fsize; i++) kdata[i] = 0; key->dtak_hashval = hashval; key->dtak_size = size; key->dtak_action = action; key->dtak_next = agb->dtagb_hash[ndx]; agb->dtagb_hash[ndx] = key; /* * Finally, apply the aggregator. */ *((uint64_t *)(key->dtak_data + size)) = agg->dtag_initial; agg->dtag_aggregate((uint64_t *)(key->dtak_data + size), expr, arg); } /* * Given consumer state, this routine finds a speculation in the INACTIVE * state and transitions it into the ACTIVE state. If there is no speculation * in the INACTIVE state, 0 is returned. In this case, no error counter is * incremented -- it is up to the caller to take appropriate action. */ static int dtrace_speculation(dtrace_state_t *state) { int i = 0; dtrace_speculation_state_t current; uint32_t *stat = &state->dts_speculations_unavail, count; while (i < state->dts_nspeculations) { dtrace_speculation_t *spec = &state->dts_speculations[i]; current = spec->dtsp_state; if (current != DTRACESPEC_INACTIVE) { if (current == DTRACESPEC_COMMITTINGMANY || current == DTRACESPEC_COMMITTING || current == DTRACESPEC_DISCARDING) stat = &state->dts_speculations_busy; i++; continue; } if (dtrace_cas32((uint32_t *)&spec->dtsp_state, current, DTRACESPEC_ACTIVE) == current) return (i + 1); } /* * We couldn't find a speculation. If we found as much as a single * busy speculation buffer, we'll attribute this failure as "busy" * instead of "unavail". */ do { count = *stat; } while (dtrace_cas32(stat, count, count + 1) != count); return (0); } /* * This routine commits an active speculation. If the specified speculation * is not in a valid state to perform a commit(), this routine will silently do * nothing. The state of the specified speculation is transitioned according * to the state transition diagram outlined in <sys/dtrace_impl.h>. */ static void dtrace_speculation_commit(dtrace_state_t *state, processorid_t cpu, dtrace_specid_t which) { dtrace_speculation_t *spec; dtrace_buffer_t *src, *dest; uintptr_t daddr, saddr, dlimit, slimit; dtrace_speculation_state_t current, new = 0; intptr_t offs; uint64_t timestamp; if (which == 0) return; if (which > state->dts_nspeculations) { cpu_core[cpu].cpuc_dtrace_flags |= CPU_DTRACE_ILLOP; return; } spec = &state->dts_speculations[which - 1]; src = &spec->dtsp_buffer[cpu]; dest = &state->dts_buffer[cpu]; do { current = spec->dtsp_state; if (current == DTRACESPEC_COMMITTINGMANY) break; switch (current) { case DTRACESPEC_INACTIVE: case DTRACESPEC_DISCARDING: return; case DTRACESPEC_COMMITTING: /* * This is only possible if we are (a) commit()'ing * without having done a prior speculate() on this CPU * and (b) racing with another commit() on a different * CPU. There's nothing to do -- we just assert that * our offset is 0. */ ASSERT(src->dtb_offset == 0); return; case DTRACESPEC_ACTIVE: new = DTRACESPEC_COMMITTING; break; case DTRACESPEC_ACTIVEONE: /* * This speculation is active on one CPU.
If our * buffer offset is non-zero, we know that the one CPU * must be us. Otherwise, we are committing on a * different CPU from the speculate(), and we must * rely on being asynchronously cleaned. */ if (src->dtb_offset != 0) { new = DTRACESPEC_COMMITTING; break; } /*FALLTHROUGH*/ case DTRACESPEC_ACTIVEMANY: new = DTRACESPEC_COMMITTINGMANY; break; default: ASSERT(0); } } while (dtrace_cas32((uint32_t *)&spec->dtsp_state, current, new) != current); /* * We have set the state to indicate that we are committing this * speculation. Now reserve the necessary space in the destination * buffer. */ if ((offs = dtrace_buffer_reserve(dest, src->dtb_offset, sizeof (uint64_t), state, NULL)) < 0) { dtrace_buffer_drop(dest); goto out; } /* * We have sufficient space to copy the speculative buffer into the * primary buffer. First, modify the speculative buffer, filling * in the timestamp of all entries with the current time. The data * must have the commit() time rather than the time it was traced, * so that all entries in the primary buffer are in timestamp order. */ timestamp = dtrace_gethrtime(); saddr = (uintptr_t)src->dtb_tomax; slimit = saddr + src->dtb_offset; while (saddr < slimit) { size_t size; dtrace_rechdr_t *dtrh = (dtrace_rechdr_t *)saddr; if (dtrh->dtrh_epid == DTRACE_EPIDNONE) { saddr += sizeof (dtrace_epid_t); continue; } ASSERT3U(dtrh->dtrh_epid, <=, state->dts_necbs); size = state->dts_ecbs[dtrh->dtrh_epid - 1]->dte_size; ASSERT3U(saddr + size, <=, slimit); ASSERT3U(size, >=, sizeof (dtrace_rechdr_t)); ASSERT3U(DTRACE_RECORD_LOAD_TIMESTAMP(dtrh), ==, UINT64_MAX); DTRACE_RECORD_STORE_TIMESTAMP(dtrh, timestamp); saddr += size; } /* * Copy the buffer across. (Note that this is a * highly suboptimal bcopy(); in the unlikely event that this becomes * a serious performance issue, a high-performance DTrace-specific * bcopy() should obviously be invented.) */ daddr = (uintptr_t)dest->dtb_tomax + offs; dlimit = daddr + src->dtb_offset; saddr = (uintptr_t)src->dtb_tomax; /* * First, the aligned portion. */ while (dlimit - daddr >= sizeof (uint64_t)) { *((uint64_t *)daddr) = *((uint64_t *)saddr); daddr += sizeof (uint64_t); saddr += sizeof (uint64_t); } /* * Now any left-over bit... */ while (dlimit - daddr) *((uint8_t *)daddr++) = *((uint8_t *)saddr++); /* * Finally, commit the reserved space in the destination buffer. */ dest->dtb_offset = offs + src->dtb_offset; out: /* * If we're lucky enough to be the only active CPU on this speculation * buffer, we can just set the state back to DTRACESPEC_INACTIVE. */ if (current == DTRACESPEC_ACTIVE || (current == DTRACESPEC_ACTIVEONE && new == DTRACESPEC_COMMITTING)) { uint32_t rval = dtrace_cas32((uint32_t *)&spec->dtsp_state, DTRACESPEC_COMMITTING, DTRACESPEC_INACTIVE); ASSERT(rval == DTRACESPEC_COMMITTING); } src->dtb_offset = 0; src->dtb_xamot_drops += src->dtb_drops; src->dtb_drops = 0; } /* * This routine discards an active speculation. If the specified speculation * is not in a valid state to perform a discard(), this routine will silently * do nothing.
The state of the specified speculation is transitioned * according to the state transition diagram outlined in <sys/dtrace_impl.h>. */ static void dtrace_speculation_discard(dtrace_state_t *state, processorid_t cpu, dtrace_specid_t which) { dtrace_speculation_t *spec; dtrace_speculation_state_t current, new = 0; dtrace_buffer_t *buf; if (which == 0) return; if (which > state->dts_nspeculations) { cpu_core[cpu].cpuc_dtrace_flags |= CPU_DTRACE_ILLOP; return; } spec = &state->dts_speculations[which - 1]; buf = &spec->dtsp_buffer[cpu]; do { current = spec->dtsp_state; switch (current) { case DTRACESPEC_INACTIVE: case DTRACESPEC_COMMITTINGMANY: case DTRACESPEC_COMMITTING: case DTRACESPEC_DISCARDING: return; case DTRACESPEC_ACTIVE: case DTRACESPEC_ACTIVEMANY: new = DTRACESPEC_DISCARDING; break; case DTRACESPEC_ACTIVEONE: if (buf->dtb_offset != 0) { new = DTRACESPEC_INACTIVE; } else { new = DTRACESPEC_DISCARDING; } break; default: ASSERT(0); } } while (dtrace_cas32((uint32_t *)&spec->dtsp_state, current, new) != current); buf->dtb_offset = 0; buf->dtb_drops = 0; } /* * Note: not called from probe context. This function is called * asynchronously from cross call context to clean any speculations that are * in the COMMITTINGMANY or DISCARDING states. These speculations may not be * transitioned back to the INACTIVE state until all CPUs have cleaned the * speculation. */ static void dtrace_speculation_clean_here(dtrace_state_t *state) { dtrace_icookie_t cookie; processorid_t cpu = curcpu; dtrace_buffer_t *dest = &state->dts_buffer[cpu]; dtrace_specid_t i; cookie = dtrace_interrupt_disable(); if (dest->dtb_tomax == NULL) { dtrace_interrupt_enable(cookie); return; } for (i = 0; i < state->dts_nspeculations; i++) { dtrace_speculation_t *spec = &state->dts_speculations[i]; dtrace_buffer_t *src = &spec->dtsp_buffer[cpu]; if (src->dtb_tomax == NULL) continue; if (spec->dtsp_state == DTRACESPEC_DISCARDING) { src->dtb_offset = 0; continue; } if (spec->dtsp_state != DTRACESPEC_COMMITTINGMANY) continue; if (src->dtb_offset == 0) continue; dtrace_speculation_commit(state, cpu, i + 1); } dtrace_interrupt_enable(cookie); } /* * Note: not called from probe context. This function is called * asynchronously (and at a regular interval) to clean any speculations that * are in the COMMITTINGMANY or DISCARDING states. If it discovers that there * is work to be done, it cross calls all CPUs to perform that work; * COMMITTINGMANY and DISCARDING speculations may not be transitioned back to * the INACTIVE state until they have been cleaned by all CPUs. */ static void dtrace_speculation_clean(dtrace_state_t *state) { int work = 0, rv; dtrace_specid_t i; for (i = 0; i < state->dts_nspeculations; i++) { dtrace_speculation_t *spec = &state->dts_speculations[i]; ASSERT(!spec->dtsp_cleaning); if (spec->dtsp_state != DTRACESPEC_DISCARDING && spec->dtsp_state != DTRACESPEC_COMMITTINGMANY) continue; work++; spec->dtsp_cleaning = 1; } if (!work) return; dtrace_xcall(DTRACE_CPUALL, (dtrace_xcall_t)dtrace_speculation_clean_here, state); /* * We now know that all CPUs have committed or discarded their * speculation buffers, as appropriate. We can now set the state * to inactive.
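 * (Only the speculations flagged as dtsp_cleaning above are transitioned
 * here; the cas below is expected to succeed unconditionally, since
 * nothing in probe context moves a speculation out of the DISCARDING or
 * COMMITTINGMANY states.)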
*/ for (i = 0; i < state->dts_nspeculations; i++) { dtrace_speculation_t *spec = &state->dts_speculations[i]; dtrace_speculation_state_t current, new; if (!spec->dtsp_cleaning) continue; current = spec->dtsp_state; ASSERT(current == DTRACESPEC_DISCARDING || current == DTRACESPEC_COMMITTINGMANY); new = DTRACESPEC_INACTIVE; rv = dtrace_cas32((uint32_t *)&spec->dtsp_state, current, new); ASSERT(rv == current); spec->dtsp_cleaning = 0; } } /* * Called as part of a speculate() to get the speculative buffer associated * with a given speculation. Returns NULL if the specified speculation is not * in an ACTIVE state. If the speculation is in the ACTIVEONE state -- and * the active CPU is not the specified CPU -- the speculation will be * atomically transitioned into the ACTIVEMANY state. */ static dtrace_buffer_t * dtrace_speculation_buffer(dtrace_state_t *state, processorid_t cpuid, dtrace_specid_t which) { dtrace_speculation_t *spec; dtrace_speculation_state_t current, new = 0; dtrace_buffer_t *buf; if (which == 0) return (NULL); if (which > state->dts_nspeculations) { cpu_core[cpuid].cpuc_dtrace_flags |= CPU_DTRACE_ILLOP; return (NULL); } spec = &state->dts_speculations[which - 1]; buf = &spec->dtsp_buffer[cpuid]; do { current = spec->dtsp_state; switch (current) { case DTRACESPEC_INACTIVE: case DTRACESPEC_COMMITTINGMANY: case DTRACESPEC_DISCARDING: return (NULL); case DTRACESPEC_COMMITTING: ASSERT(buf->dtb_offset == 0); return (NULL); case DTRACESPEC_ACTIVEONE: /* * This speculation is currently active on one CPU. * Check the offset in the buffer; if it's non-zero, * that CPU must be us (and we leave the state alone). * If it's zero, assume that we're starting on a new * CPU -- and change the state to indicate that the * speculation is active on more than one CPU. */ if (buf->dtb_offset != 0) return (buf); new = DTRACESPEC_ACTIVEMANY; break; case DTRACESPEC_ACTIVEMANY: return (buf); case DTRACESPEC_ACTIVE: new = DTRACESPEC_ACTIVEONE; break; default: ASSERT(0); } } while (dtrace_cas32((uint32_t *)&spec->dtsp_state, current, new) != current); ASSERT(new == DTRACESPEC_ACTIVEONE || new == DTRACESPEC_ACTIVEMANY); return (buf); } /* * Return a string. In the event that the user lacks the privilege to access * arbitrary kernel memory, we copy the string out to scratch memory so that we * don't fail access checking. * * dtrace_dif_variable() uses this routine as a helper for various * builtin values such as 'execname' and 'probefunc.' */ uintptr_t dtrace_dif_varstr(uintptr_t addr, dtrace_state_t *state, dtrace_mstate_t *mstate) { uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; uintptr_t ret; size_t strsz; /* * The easy case: this probe is allowed to read all of memory, so * we can just return this as a vanilla pointer. */ if ((mstate->dtms_access & DTRACE_ACCESS_KERNEL) != 0) return (addr); /* * This is the tougher case: we copy the string in question from * kernel memory into scratch memory and return it that way: this * ensures that we won't trip up when access checking tests the * BYREF return value. */ strsz = dtrace_strlen((char *)addr, size) + 1; if (mstate->dtms_scratch_ptr + strsz > mstate->dtms_scratch_base + mstate->dtms_scratch_size) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); return (0); } dtrace_strcpy((const void *)addr, (void *)mstate->dtms_scratch_ptr, strsz); ret = mstate->dtms_scratch_ptr; mstate->dtms_scratch_ptr += strsz; return (ret); } /* * Return a string from a memory address which is known to have one or * more concatenated, individually zero terminated, sub-strings.
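 * (The canonical example is the execargs vector, in which each argument
 * is its own NUL-terminated sub-string; the copy below rewrites the
 * interior NULs as spaces so that the result reads as a single string.)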
* In the event that the user lacks the privilege to access * arbitrary kernel memory, we copy the string out to scratch memory so that we * don't fail access checking. * * dtrace_dif_variable() uses this routine as a helper for various * builtin values such as 'execargs'. */ static uintptr_t dtrace_dif_varstrz(uintptr_t addr, size_t strsz, dtrace_state_t *state, dtrace_mstate_t *mstate) { char *p; size_t i; uintptr_t ret; if (mstate->dtms_scratch_ptr + strsz > mstate->dtms_scratch_base + mstate->dtms_scratch_size) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); return (0); } dtrace_bcopy((const void *)addr, (void *)mstate->dtms_scratch_ptr, strsz); /* Replace sub-string termination characters with a space. */ for (p = (char *) mstate->dtms_scratch_ptr, i = 0; i < strsz - 1; p++, i++) if (*p == '\0') *p = ' '; ret = mstate->dtms_scratch_ptr; mstate->dtms_scratch_ptr += strsz; return (ret); } /* * This function implements the DIF emulator's variable lookups. The emulator * passes a reserved variable identifier and optional built-in array index. */ static uint64_t dtrace_dif_variable(dtrace_mstate_t *mstate, dtrace_state_t *state, uint64_t v, uint64_t ndx) { /* * If we're accessing one of the uncached arguments, we'll turn this * into a reference in the args array. */ if (v >= DIF_VAR_ARG0 && v <= DIF_VAR_ARG9) { ndx = v - DIF_VAR_ARG0; v = DIF_VAR_ARGS; } switch (v) { case DIF_VAR_ARGS: ASSERT(mstate->dtms_present & DTRACE_MSTATE_ARGS); if (ndx >= sizeof (mstate->dtms_arg) / sizeof (mstate->dtms_arg[0])) { int aframes = mstate->dtms_probe->dtpr_aframes + 2; dtrace_provider_t *pv; uint64_t val; pv = mstate->dtms_probe->dtpr_provider; if (pv->dtpv_pops.dtps_getargval != NULL) val = pv->dtpv_pops.dtps_getargval(pv->dtpv_arg, mstate->dtms_probe->dtpr_id, mstate->dtms_probe->dtpr_arg, ndx, aframes); else val = dtrace_getarg(ndx, aframes); /* * This is regrettably required to keep the compiler * from tail-optimizing the call to dtrace_getarg(). * The condition always evaluates to true, but the * compiler has no way of figuring that out a priori. * (None of this would be necessary if the compiler * could be relied upon to _always_ tail-optimize * the call to dtrace_getarg() -- but it can't.) 
*/ if (mstate->dtms_probe != NULL) return (val); ASSERT(0); } return (mstate->dtms_arg[ndx]); #ifdef illumos case DIF_VAR_UREGS: { klwp_t *lwp; if (!dtrace_priv_proc(state)) return (0); if ((lwp = curthread->t_lwp) == NULL) { DTRACE_CPUFLAG_SET(CPU_DTRACE_BADADDR); cpu_core[curcpu].cpuc_dtrace_illval = NULL; return (0); } return (dtrace_getreg(lwp->lwp_regs, ndx)); return (0); } #else case DIF_VAR_UREGS: { struct trapframe *tframe; if (!dtrace_priv_proc(state)) return (0); if ((tframe = curthread->td_frame) == NULL) { DTRACE_CPUFLAG_SET(CPU_DTRACE_BADADDR); cpu_core[curcpu].cpuc_dtrace_illval = 0; return (0); } return (dtrace_getreg(tframe, ndx)); } #endif case DIF_VAR_CURTHREAD: if (!dtrace_priv_proc(state)) return (0); return ((uint64_t)(uintptr_t)curthread); case DIF_VAR_TIMESTAMP: if (!(mstate->dtms_present & DTRACE_MSTATE_TIMESTAMP)) { mstate->dtms_timestamp = dtrace_gethrtime(); mstate->dtms_present |= DTRACE_MSTATE_TIMESTAMP; } return (mstate->dtms_timestamp); case DIF_VAR_VTIMESTAMP: ASSERT(dtrace_vtime_references != 0); return (curthread->t_dtrace_vtime); case DIF_VAR_WALLTIMESTAMP: if (!(mstate->dtms_present & DTRACE_MSTATE_WALLTIMESTAMP)) { mstate->dtms_walltimestamp = dtrace_gethrestime(); mstate->dtms_present |= DTRACE_MSTATE_WALLTIMESTAMP; } return (mstate->dtms_walltimestamp); #ifdef illumos case DIF_VAR_IPL: if (!dtrace_priv_kernel(state)) return (0); if (!(mstate->dtms_present & DTRACE_MSTATE_IPL)) { mstate->dtms_ipl = dtrace_getipl(); mstate->dtms_present |= DTRACE_MSTATE_IPL; } return (mstate->dtms_ipl); #endif case DIF_VAR_EPID: ASSERT(mstate->dtms_present & DTRACE_MSTATE_EPID); return (mstate->dtms_epid); case DIF_VAR_ID: ASSERT(mstate->dtms_present & DTRACE_MSTATE_PROBE); return (mstate->dtms_probe->dtpr_id); case DIF_VAR_STACKDEPTH: if (!dtrace_priv_kernel(state)) return (0); if (!(mstate->dtms_present & DTRACE_MSTATE_STACKDEPTH)) { int aframes = mstate->dtms_probe->dtpr_aframes + 2; mstate->dtms_stackdepth = dtrace_getstackdepth(aframes); mstate->dtms_present |= DTRACE_MSTATE_STACKDEPTH; } return (mstate->dtms_stackdepth); case DIF_VAR_USTACKDEPTH: if (!dtrace_priv_proc(state)) return (0); if (!(mstate->dtms_present & DTRACE_MSTATE_USTACKDEPTH)) { /* * See comment in DIF_VAR_PID. */ if (DTRACE_ANCHORED(mstate->dtms_probe) && CPU_ON_INTR(CPU)) { mstate->dtms_ustackdepth = 0; } else { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); mstate->dtms_ustackdepth = dtrace_getustackdepth(); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); } mstate->dtms_present |= DTRACE_MSTATE_USTACKDEPTH; } return (mstate->dtms_ustackdepth); case DIF_VAR_CALLER: if (!dtrace_priv_kernel(state)) return (0); if (!(mstate->dtms_present & DTRACE_MSTATE_CALLER)) { int aframes = mstate->dtms_probe->dtpr_aframes + 2; if (!DTRACE_ANCHORED(mstate->dtms_probe)) { /* * If this is an unanchored probe, we are * required to go through the slow path: * dtrace_caller() only guarantees correct * results for anchored probes. */ pc_t caller[2] = {0, 0}; dtrace_getpcstack(caller, 2, aframes, (uint32_t *)(uintptr_t)mstate->dtms_arg[0]); mstate->dtms_caller = caller[1]; } else if ((mstate->dtms_caller = dtrace_caller(aframes)) == -1) { /* * We have failed to do this the quick way; * we must resort to the slower approach of * calling dtrace_getpcstack(). 
*/ pc_t caller = 0; dtrace_getpcstack(&caller, 1, aframes, NULL); mstate->dtms_caller = caller; } mstate->dtms_present |= DTRACE_MSTATE_CALLER; } return (mstate->dtms_caller); case DIF_VAR_UCALLER: if (!dtrace_priv_proc(state)) return (0); if (!(mstate->dtms_present & DTRACE_MSTATE_UCALLER)) { uint64_t ustack[3]; /* * dtrace_getupcstack() fills in the first uint64_t * with the current PID. The second uint64_t will * be the program counter at user-level. The third * uint64_t will contain the caller, which is what * we're after. */ ustack[2] = 0; DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); dtrace_getupcstack(ustack, 3); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); mstate->dtms_ucaller = ustack[2]; mstate->dtms_present |= DTRACE_MSTATE_UCALLER; } return (mstate->dtms_ucaller); case DIF_VAR_PROBEPROV: ASSERT(mstate->dtms_present & DTRACE_MSTATE_PROBE); return (dtrace_dif_varstr( (uintptr_t)mstate->dtms_probe->dtpr_provider->dtpv_name, state, mstate)); case DIF_VAR_PROBEMOD: ASSERT(mstate->dtms_present & DTRACE_MSTATE_PROBE); return (dtrace_dif_varstr( (uintptr_t)mstate->dtms_probe->dtpr_mod, state, mstate)); case DIF_VAR_PROBEFUNC: ASSERT(mstate->dtms_present & DTRACE_MSTATE_PROBE); return (dtrace_dif_varstr( (uintptr_t)mstate->dtms_probe->dtpr_func, state, mstate)); case DIF_VAR_PROBENAME: ASSERT(mstate->dtms_present & DTRACE_MSTATE_PROBE); return (dtrace_dif_varstr( (uintptr_t)mstate->dtms_probe->dtpr_name, state, mstate)); case DIF_VAR_PID: if (!dtrace_priv_proc(state)) return (0); #ifdef illumos /* * Note that we are assuming that an unanchored probe is * always due to a high-level interrupt. (And we're assuming * that there is only a single high level interrupt.) */ if (DTRACE_ANCHORED(mstate->dtms_probe) && CPU_ON_INTR(CPU)) return (pid0.pid_id); /* * It is always safe to dereference one's own t_procp pointer: * it always points to a valid, allocated proc structure. * Further, it is always safe to dereference the p_pidp member * of one's own proc structure. (These are truisms because * threads and processes don't clean up their own state -- * they leave that task to whomever reaps them.) */ return ((uint64_t)curthread->t_procp->p_pidp->pid_id); #else return ((uint64_t)curproc->p_pid); #endif case DIF_VAR_PPID: if (!dtrace_priv_proc(state)) return (0); #ifdef illumos /* * See comment in DIF_VAR_PID. */ if (DTRACE_ANCHORED(mstate->dtms_probe) && CPU_ON_INTR(CPU)) return (pid0.pid_id); /* * It is always safe to dereference one's own t_procp pointer: * it always points to a valid, allocated proc structure. * (This is true because threads don't clean up their own * state -- they leave that task to whomever reaps them.) */ return ((uint64_t)curthread->t_procp->p_ppid); #else if (curproc->p_pid == proc0.p_pid) return (curproc->p_pid); else return (curproc->p_pptr->p_pid); #endif case DIF_VAR_TID: #ifdef illumos /* * See comment in DIF_VAR_PID. */ if (DTRACE_ANCHORED(mstate->dtms_probe) && CPU_ON_INTR(CPU)) return (0); #endif return ((uint64_t)curthread->t_tid); case DIF_VAR_EXECARGS: { struct pargs *p_args = curthread->td_proc->p_args; if (p_args == NULL) return (0); return (dtrace_dif_varstrz( (uintptr_t) p_args->ar_args, p_args->ar_length, state, mstate)); } case DIF_VAR_EXECNAME: #ifdef illumos if (!dtrace_priv_proc(state)) return (0); /* * See comment in DIF_VAR_PID. */ if (DTRACE_ANCHORED(mstate->dtms_probe) && CPU_ON_INTR(CPU)) return ((uint64_t)(uintptr_t)p0.p_user.u_comm); /* * It is always safe to dereference one's own t_procp pointer: * it always points to a valid, allocated proc structure.
* (This is true because threads don't clean up their own * state -- they leave that task to whomever reaps them.) */ return (dtrace_dif_varstr( (uintptr_t)curthread->t_procp->p_user.u_comm, state, mstate)); #else return (dtrace_dif_varstr( (uintptr_t) curthread->td_proc->p_comm, state, mstate)); #endif case DIF_VAR_ZONENAME: #ifdef illumos if (!dtrace_priv_proc(state)) return (0); /* * See comment in DIF_VAR_PID. */ if (DTRACE_ANCHORED(mstate->dtms_probe) && CPU_ON_INTR(CPU)) return ((uint64_t)(uintptr_t)p0.p_zone->zone_name); /* * It is always safe to dereference one's own t_procp pointer: * it always points to a valid, allocated proc structure. * (This is true because threads don't clean up their own * state -- they leave that task to whomever reaps them.) */ return (dtrace_dif_varstr( (uintptr_t)curthread->t_procp->p_zone->zone_name, state, mstate)); #elif defined(__FreeBSD__) /* * On FreeBSD, we introduce compatibility to zonename by falling through * into jailname. */ case DIF_VAR_JAILNAME: if (!dtrace_priv_kernel(state)) return (0); return (dtrace_dif_varstr( (uintptr_t)curthread->td_ucred->cr_prison->pr_name, state, mstate)); case DIF_VAR_JID: if (!dtrace_priv_kernel(state)) return (0); return ((uint64_t)curthread->td_ucred->cr_prison->pr_id); #else return (0); #endif case DIF_VAR_UID: if (!dtrace_priv_proc(state)) return (0); #ifdef illumos /* * See comment in DIF_VAR_PID. */ if (DTRACE_ANCHORED(mstate->dtms_probe) && CPU_ON_INTR(CPU)) return ((uint64_t)p0.p_cred->cr_uid); /* * It is always safe to dereference one's own t_procp pointer: * it always points to a valid, allocated proc structure. * (This is true because threads don't clean up their own * state -- they leave that task to whomever reaps them.) * * Additionally, it is safe to dereference one's own process * credential, since this is never NULL after process birth. */ return ((uint64_t)curthread->t_procp->p_cred->cr_uid); #else return ((uint64_t)curthread->td_ucred->cr_uid); #endif case DIF_VAR_GID: if (!dtrace_priv_proc(state)) return (0); #ifdef illumos /* * See comment in DIF_VAR_PID. */ if (DTRACE_ANCHORED(mstate->dtms_probe) && CPU_ON_INTR(CPU)) return ((uint64_t)p0.p_cred->cr_gid); /* * It is always safe to dereference one's own t_procp pointer: * it always points to a valid, allocated proc structure. * (This is true because threads don't clean up their own * state -- they leave that task to whomever reaps them.) * * Additionally, it is safe to dereference one's own process * credential, since this is never NULL after process birth. */ return ((uint64_t)curthread->t_procp->p_cred->cr_gid); #else return ((uint64_t)curthread->td_ucred->cr_gid); #endif case DIF_VAR_ERRNO: { #ifdef illumos klwp_t *lwp; if (!dtrace_priv_proc(state)) return (0); /* * See comment in DIF_VAR_PID. */ if (DTRACE_ANCHORED(mstate->dtms_probe) && CPU_ON_INTR(CPU)) return (0); /* * It is always safe to dereference one's own t_lwp pointer in * the event that this pointer is non-NULL. (This is true * because threads and lwps don't clean up their own state -- * they leave that task to whomever reaps them.) 
*/ if ((lwp = curthread->t_lwp) == NULL) return (0); return ((uint64_t)lwp->lwp_errno); #else return (curthread->td_errno); #endif } #ifndef illumos case DIF_VAR_CPU: { return curcpu; } #endif default: DTRACE_CPUFLAG_SET(CPU_DTRACE_ILLOP); return (0); } } typedef enum dtrace_json_state { DTRACE_JSON_REST = 1, DTRACE_JSON_OBJECT, DTRACE_JSON_STRING, DTRACE_JSON_STRING_ESCAPE, DTRACE_JSON_STRING_ESCAPE_UNICODE, DTRACE_JSON_COLON, DTRACE_JSON_COMMA, DTRACE_JSON_VALUE, DTRACE_JSON_IDENTIFIER, DTRACE_JSON_NUMBER, DTRACE_JSON_NUMBER_FRAC, DTRACE_JSON_NUMBER_EXP, DTRACE_JSON_COLLECT_OBJECT } dtrace_json_state_t; /* * This function possesses just enough knowledge about JSON to extract a single * value from a JSON string and store it in the scratch buffer. It is able * to extract nested object values, and members of arrays by index. * * elemlist is a list of JSON keys, stored as packed NUL-terminated strings, to * be looked up as we descend into the object tree. e.g. * * foo[0].bar.baz[32] --> "foo" NUL "0" NUL "bar" NUL "baz" NUL "32" NUL * with nelems = 5. * * The run time of this function must be bounded above by strsize to limit the * amount of work done in probe context. As such, it is implemented as a * simple state machine, reading one character at a time using safe loads * until we find the requested element, hit a parsing error or run off the * end of the object or string. * * As there is no way for a subroutine to return an error without interrupting * clause execution, we simply return NULL in the event of a missing key or any * other error condition. Each NULL return in this function is commented with * the error condition it represents -- parsing or otherwise. * * The set of states for the state machine closely matches the JSON * specification (http://json.org/). Briefly: * * DTRACE_JSON_REST: * Skip whitespace until we find either a top-level Object, moving * to DTRACE_JSON_OBJECT; or an Array, moving to DTRACE_JSON_VALUE. * * DTRACE_JSON_OBJECT: * Locate the next key String in an Object. Sets a flag to denote * the next String as a key string and moves to DTRACE_JSON_STRING. * * DTRACE_JSON_COLON: * Skip whitespace until we find the colon that separates key Strings * from their values. Once found, move to DTRACE_JSON_VALUE. * * DTRACE_JSON_VALUE: * Detects the type of the next value (String, Number, Identifier, Object * or Array) and routes to the states that process that type. Here we also * deal with the element selector list if we are requested to traverse down * into the object tree. * * DTRACE_JSON_COMMA: * Skip whitespace until we find the comma that separates key-value pairs * in Objects (returning to DTRACE_JSON_OBJECT) or values in Arrays * (similarly DTRACE_JSON_VALUE). All following literal value processing * states return to this state at the end of their value, unless otherwise * noted. * * DTRACE_JSON_NUMBER, DTRACE_JSON_NUMBER_FRAC, DTRACE_JSON_NUMBER_EXP: * Processes a Number literal from the JSON, including any exponent * component that may be present. Numbers are returned as strings, which * may be passed to strtoll() if an integer is required. * * DTRACE_JSON_IDENTIFIER: * Processes a "true", "false" or "null" literal in the JSON. * * DTRACE_JSON_STRING, DTRACE_JSON_STRING_ESCAPE, * DTRACE_JSON_STRING_ESCAPE_UNICODE: * Processes a String literal from the JSON, whether the String denotes * a key, a value or part of a larger Object. 
Handles all escape sequences * present in the specification, including four-digit unicode characters, * but merely includes the escape sequence without converting it to the * actual escaped character. If the String is flagged as a key, we * move to DTRACE_JSON_COLON rather than DTRACE_JSON_COMMA. * * DTRACE_JSON_COLLECT_OBJECT: * This state collects an entire Object (or Array), correctly handling * embedded strings. If the full element selector list matches this nested * object, we return the Object in full as a string. If not, we use this * state to skip to the next value at this level and continue processing. * * NOTE: This function uses various macros from strtolctype.h to manipulate * digit values, etc -- these have all been checked to ensure they make * no additional function calls. */ static char * dtrace_json(uint64_t size, uintptr_t json, char *elemlist, int nelems, char *dest) { dtrace_json_state_t state = DTRACE_JSON_REST; int64_t array_elem = INT64_MIN; int64_t array_pos = 0; uint8_t escape_unicount = 0; boolean_t string_is_key = B_FALSE; boolean_t collect_object = B_FALSE; boolean_t found_key = B_FALSE; boolean_t in_array = B_FALSE; uint32_t braces = 0, brackets = 0; char *elem = elemlist; char *dd = dest; uintptr_t cur; for (cur = json; cur < json + size; cur++) { char cc = dtrace_load8(cur); if (cc == '\0') return (NULL); switch (state) { case DTRACE_JSON_REST: if (isspace(cc)) break; if (cc == '{') { state = DTRACE_JSON_OBJECT; break; } if (cc == '[') { in_array = B_TRUE; array_pos = 0; array_elem = dtrace_strtoll(elem, 10, size); found_key = array_elem == 0 ? B_TRUE : B_FALSE; state = DTRACE_JSON_VALUE; break; } /* * ERROR: expected to find a top-level object or array. */ return (NULL); case DTRACE_JSON_OBJECT: if (isspace(cc)) break; if (cc == '"') { state = DTRACE_JSON_STRING; string_is_key = B_TRUE; break; } /* * ERROR: either the object did not start with a key * string, or we've run off the end of the object * without finding the requested key. */ return (NULL); case DTRACE_JSON_STRING: if (cc == '\\') { *dd++ = '\\'; state = DTRACE_JSON_STRING_ESCAPE; break; } if (cc == '"') { if (collect_object) { /* * We don't reset the dest here, as * the string is part of a larger * object being collected. */ *dd++ = cc; collect_object = B_FALSE; state = DTRACE_JSON_COLLECT_OBJECT; break; } *dd = '\0'; dd = dest; /* reset string buffer */ if (string_is_key) { if (dtrace_strncmp(dest, elem, size) == 0) found_key = B_TRUE; } else if (found_key) { if (nelems > 1) { /* * We expected an object, not * this string. */ return (NULL); } return (dest); } state = string_is_key ? DTRACE_JSON_COLON : DTRACE_JSON_COMMA; string_is_key = B_FALSE; break; } *dd++ = cc; break; case DTRACE_JSON_STRING_ESCAPE: *dd++ = cc; if (cc == 'u') { escape_unicount = 0; state = DTRACE_JSON_STRING_ESCAPE_UNICODE; } else { state = DTRACE_JSON_STRING; } break; case DTRACE_JSON_STRING_ESCAPE_UNICODE: if (!isxdigit(cc)) { /* * ERROR: invalid unicode escape, expected * four valid hexadecimal digits. */ return (NULL); } *dd++ = cc; if (++escape_unicount == 4) state = DTRACE_JSON_STRING; break; case DTRACE_JSON_COLON: if (isspace(cc)) break; if (cc == ':') { state = DTRACE_JSON_VALUE; break; } /* * ERROR: expected a colon.
*/ return (NULL); case DTRACE_JSON_COMMA: if (isspace(cc)) break; if (cc == ',') { if (in_array) { state = DTRACE_JSON_VALUE; if (++array_pos == array_elem) found_key = B_TRUE; } else { state = DTRACE_JSON_OBJECT; } break; } /* * ERROR: either we hit an unexpected character, or * we reached the end of the object or array without * finding the requested key. */ return (NULL); case DTRACE_JSON_IDENTIFIER: if (islower(cc)) { *dd++ = cc; break; } *dd = '\0'; dd = dest; /* reset string buffer */ if (dtrace_strncmp(dest, "true", 5) == 0 || dtrace_strncmp(dest, "false", 6) == 0 || dtrace_strncmp(dest, "null", 5) == 0) { if (found_key) { if (nelems > 1) { /* * ERROR: We expected an object, * not this identifier. */ return (NULL); } return (dest); } else { cur--; state = DTRACE_JSON_COMMA; break; } } /* * ERROR: we did not recognise the identifier as one * of those in the JSON specification. */ return (NULL); case DTRACE_JSON_NUMBER: if (cc == '.') { *dd++ = cc; state = DTRACE_JSON_NUMBER_FRAC; break; } if (cc == 'x' || cc == 'X') { /* * ERROR: specification explicitly excludes * hexadecimal or octal numbers. */ return (NULL); } /* FALLTHRU */ case DTRACE_JSON_NUMBER_FRAC: if (cc == 'e' || cc == 'E') { *dd++ = cc; state = DTRACE_JSON_NUMBER_EXP; break; } if (cc == '+' || cc == '-') { /* * ERROR: expect sign as part of exponent only. */ return (NULL); } /* FALLTHRU */ case DTRACE_JSON_NUMBER_EXP: if (isdigit(cc) || cc == '+' || cc == '-') { *dd++ = cc; break; } *dd = '\0'; dd = dest; /* reset string buffer */ if (found_key) { if (nelems > 1) { /* * ERROR: We expected an object, not * this number. */ return (NULL); } return (dest); } cur--; state = DTRACE_JSON_COMMA; break; case DTRACE_JSON_VALUE: if (isspace(cc)) break; if (cc == '{' || cc == '[') { if (nelems > 1 && found_key) { in_array = cc == '[' ? B_TRUE : B_FALSE; /* * If our element selector directs us * to descend into this nested object, * then move to the next selector * element in the list and restart the * state machine. */ while (*elem != '\0') elem++; elem++; /* skip the inter-element NUL */ nelems--; dd = dest; if (in_array) { state = DTRACE_JSON_VALUE; array_pos = 0; array_elem = dtrace_strtoll( elem, 10, size); found_key = array_elem == 0 ? B_TRUE : B_FALSE; } else { found_key = B_FALSE; state = DTRACE_JSON_OBJECT; } break; } /* * Otherwise, we wish to either skip this * nested object or return it in full. */ if (cc == '[') brackets = 1; else braces = 1; *dd++ = cc; state = DTRACE_JSON_COLLECT_OBJECT; break; } if (cc == '"') { state = DTRACE_JSON_STRING; break; } if (islower(cc)) { /* * Here we deal with true, false and null. */ *dd++ = cc; state = DTRACE_JSON_IDENTIFIER; break; } if (cc == '-' || isdigit(cc)) { *dd++ = cc; state = DTRACE_JSON_NUMBER; break; } /* * ERROR: unexpected character at start of value. */ return (NULL); case DTRACE_JSON_COLLECT_OBJECT: if (cc == '\0') /* * ERROR: unexpected end of input. */ return (NULL); *dd++ = cc; if (cc == '"') { collect_object = B_TRUE; state = DTRACE_JSON_STRING; break; } if (cc == ']') { if (brackets-- == 0) { /* * ERROR: unbalanced brackets. */ return (NULL); } } else if (cc == '}') { if (braces-- == 0) { /* * ERROR: unbalanced braces. */ return (NULL); } } else if (cc == '{') { braces++; } else if (cc == '[') { brackets++; } if (brackets == 0 && braces == 0) { if (found_key) { *dd = '\0'; return (dest); } dd = dest; /* reset string buffer */ state = DTRACE_JSON_COMMA; } break; } } return (NULL); } /* * Emulate the execution of DTrace ID subroutines invoked by the call opcode.
* Notice that we don't bother validating the proper number of arguments or * their types in the tuple stack. This isn't needed because all argument * interpretation is safe because of our load safety -- the worst that can * happen is that a bogus program can obtain bogus results. */ static void dtrace_dif_subr(uint_t subr, uint_t rd, uint64_t *regs, dtrace_key_t *tupregs, int nargs, dtrace_mstate_t *mstate, dtrace_state_t *state) { volatile uint16_t *flags = &cpu_core[curcpu].cpuc_dtrace_flags; volatile uintptr_t *illval = &cpu_core[curcpu].cpuc_dtrace_illval; dtrace_vstate_t *vstate = &state->dts_vstate; #ifdef illumos union { mutex_impl_t mi; uint64_t mx; } m; union { krwlock_t ri; uintptr_t rw; } r; #else struct thread *lowner; union { struct lock_object *li; uintptr_t lx; } l; #endif switch (subr) { case DIF_SUBR_RAND: regs[rd] = dtrace_xoroshiro128_plus_next( state->dts_rstate[curcpu]); break; #ifdef illumos case DIF_SUBR_MUTEX_OWNED: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (kmutex_t), mstate, vstate)) { regs[rd] = 0; break; } m.mx = dtrace_load64(tupregs[0].dttk_value); if (MUTEX_TYPE_ADAPTIVE(&m.mi)) regs[rd] = MUTEX_OWNER(&m.mi) != MUTEX_NO_OWNER; else regs[rd] = LOCK_HELD(&m.mi.m_spin.m_spinlock); break; case DIF_SUBR_MUTEX_OWNER: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (kmutex_t), mstate, vstate)) { regs[rd] = 0; break; } m.mx = dtrace_load64(tupregs[0].dttk_value); if (MUTEX_TYPE_ADAPTIVE(&m.mi) && MUTEX_OWNER(&m.mi) != MUTEX_NO_OWNER) regs[rd] = (uintptr_t)MUTEX_OWNER(&m.mi); else regs[rd] = 0; break; case DIF_SUBR_MUTEX_TYPE_ADAPTIVE: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (kmutex_t), mstate, vstate)) { regs[rd] = 0; break; } m.mx = dtrace_load64(tupregs[0].dttk_value); regs[rd] = MUTEX_TYPE_ADAPTIVE(&m.mi); break; case DIF_SUBR_MUTEX_TYPE_SPIN: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (kmutex_t), mstate, vstate)) { regs[rd] = 0; break; } m.mx = dtrace_load64(tupregs[0].dttk_value); regs[rd] = MUTEX_TYPE_SPIN(&m.mi); break; case DIF_SUBR_RW_READ_HELD: { uintptr_t tmp; if (!dtrace_canload(tupregs[0].dttk_value, sizeof (uintptr_t), mstate, vstate)) { regs[rd] = 0; break; } r.rw = dtrace_loadptr(tupregs[0].dttk_value); regs[rd] = _RW_READ_HELD(&r.ri, tmp); break; } case DIF_SUBR_RW_WRITE_HELD: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (krwlock_t), mstate, vstate)) { regs[rd] = 0; break; } r.rw = dtrace_loadptr(tupregs[0].dttk_value); regs[rd] = _RW_WRITE_HELD(&r.ri); break; case DIF_SUBR_RW_ISWRITER: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (krwlock_t), mstate, vstate)) { regs[rd] = 0; break; } r.rw = dtrace_loadptr(tupregs[0].dttk_value); regs[rd] = _RW_ISWRITER(&r.ri); break; #else /* !illumos */ case DIF_SUBR_MUTEX_OWNED: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (struct lock_object), mstate, vstate)) { regs[rd] = 0; break; } l.lx = dtrace_loadptr((uintptr_t)&tupregs[0].dttk_value); DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = LOCK_CLASS(l.li)->lc_owner(l.li, &lowner); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_SUBR_MUTEX_OWNER: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (struct lock_object), mstate, vstate)) { regs[rd] = 0; break; } l.lx = dtrace_loadptr((uintptr_t)&tupregs[0].dttk_value); DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); LOCK_CLASS(l.li)->lc_owner(l.li, &lowner); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); regs[rd] = (uintptr_t)lowner; break; case DIF_SUBR_MUTEX_TYPE_ADAPTIVE: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (struct mtx), mstate, vstate)) { regs[rd] = 0; break; 
} l.lx = dtrace_loadptr((uintptr_t)&tupregs[0].dttk_value); DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = (LOCK_CLASS(l.li)->lc_flags & LC_SLEEPLOCK) != 0; DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_SUBR_MUTEX_TYPE_SPIN: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (struct mtx), mstate, vstate)) { regs[rd] = 0; break; } l.lx = dtrace_loadptr((uintptr_t)&tupregs[0].dttk_value); DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = (LOCK_CLASS(l.li)->lc_flags & LC_SPINLOCK) != 0; DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_SUBR_RW_READ_HELD: case DIF_SUBR_SX_SHARED_HELD: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (uintptr_t), mstate, vstate)) { regs[rd] = 0; break; } l.lx = dtrace_loadptr((uintptr_t)&tupregs[0].dttk_value); DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = LOCK_CLASS(l.li)->lc_owner(l.li, &lowner) && lowner == NULL; DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_SUBR_RW_WRITE_HELD: case DIF_SUBR_SX_EXCLUSIVE_HELD: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (uintptr_t), mstate, vstate)) { regs[rd] = 0; break; } l.lx = dtrace_loadptr(tupregs[0].dttk_value); DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = LOCK_CLASS(l.li)->lc_owner(l.li, &lowner) && lowner != NULL; DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_SUBR_RW_ISWRITER: case DIF_SUBR_SX_ISEXCLUSIVE: if (!dtrace_canload(tupregs[0].dttk_value, sizeof (uintptr_t), mstate, vstate)) { regs[rd] = 0; break; } l.lx = dtrace_loadptr(tupregs[0].dttk_value); DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); LOCK_CLASS(l.li)->lc_owner(l.li, &lowner); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); regs[rd] = (lowner == curthread); break; #endif /* illumos */ case DIF_SUBR_BCOPY: { /* * We need to be sure that the destination is in the scratch * region -- no other region is allowed. */ uintptr_t src = tupregs[0].dttk_value; uintptr_t dest = tupregs[1].dttk_value; size_t size = tupregs[2].dttk_value; if (!dtrace_inscratch(dest, size, mstate)) { *flags |= CPU_DTRACE_BADADDR; *illval = regs[rd]; break; } if (!dtrace_canload(src, size, mstate, vstate)) { regs[rd] = 0; break; } dtrace_bcopy((void *)src, (void *)dest, size); break; } case DIF_SUBR_ALLOCA: case DIF_SUBR_COPYIN: { uintptr_t dest = P2ROUNDUP(mstate->dtms_scratch_ptr, 8); uint64_t size = tupregs[subr == DIF_SUBR_ALLOCA ? 0 : 1].dttk_value; size_t scratch_size = (dest - mstate->dtms_scratch_ptr) + size; /* * This action doesn't require any credential checks since * probes will not activate in user contexts to which the * enabling user does not have permissions. */ /* * Rounding up the user allocation size could have overflowed * a large, bogus allocation (like -1ULL) to 0. */ if (scratch_size < size || !DTRACE_INSCRATCH(mstate, scratch_size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } if (subr == DIF_SUBR_COPYIN) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); dtrace_copyin(tupregs[0].dttk_value, dest, size, flags); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); } mstate->dtms_scratch_ptr += scratch_size; regs[rd] = dest; break; } case DIF_SUBR_COPYINTO: { uint64_t size = tupregs[1].dttk_value; uintptr_t dest = tupregs[2].dttk_value; /* * This action doesn't require any credential checks since * probes will not activate in user contexts to which the * enabling user does not have permissions. 
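 *
 * A hypothetical D snippet that lands here (editorial
 * illustration): "this->b = alloca(16); copyinto(arg1, 16,
 * this->b);", which arrives with tupregs[0] = arg1 (the user
 * address), tupregs[1] = 16 and tupregs[2] = the scratch
 * address returned by alloca().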
*/ if (!dtrace_inscratch(dest, size, mstate)) { *flags |= CPU_DTRACE_BADADDR; *illval = regs[rd]; break; } DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); dtrace_copyin(tupregs[0].dttk_value, dest, size, flags); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; } case DIF_SUBR_COPYINSTR: { uintptr_t dest = mstate->dtms_scratch_ptr; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; if (nargs > 1 && tupregs[1].dttk_value < size) size = tupregs[1].dttk_value + 1; /* * This action doesn't require any credential checks since * probes will not activate in user contexts to which the * enabling user does not have permissions. */ if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); dtrace_copyinstr(tupregs[0].dttk_value, dest, size, flags); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); ((char *)dest)[size - 1] = '\0'; mstate->dtms_scratch_ptr += size; regs[rd] = dest; break; } #ifdef illumos case DIF_SUBR_MSGSIZE: case DIF_SUBR_MSGDSIZE: { uintptr_t baddr = tupregs[0].dttk_value, daddr; uintptr_t wptr, rptr; size_t count = 0; int cont = 0; while (baddr != 0 && !(*flags & CPU_DTRACE_FAULT)) { if (!dtrace_canload(baddr, sizeof (mblk_t), mstate, vstate)) { regs[rd] = 0; break; } wptr = dtrace_loadptr(baddr + offsetof(mblk_t, b_wptr)); rptr = dtrace_loadptr(baddr + offsetof(mblk_t, b_rptr)); if (wptr < rptr) { *flags |= CPU_DTRACE_BADADDR; *illval = tupregs[0].dttk_value; break; } daddr = dtrace_loadptr(baddr + offsetof(mblk_t, b_datap)); baddr = dtrace_loadptr(baddr + offsetof(mblk_t, b_cont)); /* * We want to prevent against denial-of-service here, * so we're only going to search the list for * dtrace_msgdsize_max mblks. */ if (cont++ > dtrace_msgdsize_max) { *flags |= CPU_DTRACE_ILLOP; break; } if (subr == DIF_SUBR_MSGDSIZE) { if (dtrace_load8(daddr + offsetof(dblk_t, db_type)) != M_DATA) continue; } count += wptr - rptr; } if (!(*flags & CPU_DTRACE_FAULT)) regs[rd] = count; break; } #endif case DIF_SUBR_PROGENYOF: { pid_t pid = tupregs[0].dttk_value; proc_t *p; int rval = 0; DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); for (p = curthread->t_procp; p != NULL; p = p->p_parent) { #ifdef illumos if (p->p_pidp->pid_id == pid) { #else if (p->p_pid == pid) { #endif rval = 1; break; } } DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); regs[rd] = rval; break; } case DIF_SUBR_SPECULATION: regs[rd] = dtrace_speculation(state); break; case DIF_SUBR_COPYOUT: { uintptr_t kaddr = tupregs[0].dttk_value; uintptr_t uaddr = tupregs[1].dttk_value; uint64_t size = tupregs[2].dttk_value; if (!dtrace_destructive_disallow && dtrace_priv_proc_control(state) && !dtrace_istoxic(kaddr, size) && dtrace_canload(kaddr, size, mstate, vstate)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); dtrace_copyout(kaddr, uaddr, size, flags); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); } break; } case DIF_SUBR_COPYOUTSTR: { uintptr_t kaddr = tupregs[0].dttk_value; uintptr_t uaddr = tupregs[1].dttk_value; uint64_t size = tupregs[2].dttk_value; size_t lim; if (!dtrace_destructive_disallow && dtrace_priv_proc_control(state) && !dtrace_istoxic(kaddr, size) && dtrace_strcanload(kaddr, size, &lim, mstate, vstate)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); dtrace_copyoutstr(kaddr, uaddr, lim, flags); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); } break; } case DIF_SUBR_STRLEN: { size_t size = state->dts_options[DTRACEOPT_STRSIZE]; uintptr_t addr = (uintptr_t)tupregs[0].dttk_value; size_t lim; if (!dtrace_strcanload(addr, size, &lim, mstate, vstate)) { regs[rd] = 0; break; } regs[rd] = 
dtrace_strlen((char *)addr, lim); break; } case DIF_SUBR_STRCHR: case DIF_SUBR_STRRCHR: { /* * We're going to iterate over the string looking for the * specified character. We will iterate until we have reached * the string length or we have found the character. If this * is DIF_SUBR_STRRCHR, we will look for the last occurrence * of the specified character instead of the first. */ uintptr_t addr = tupregs[0].dttk_value; uintptr_t addr_limit; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; size_t lim; char c, target = (char)tupregs[1].dttk_value; if (!dtrace_strcanload(addr, size, &lim, mstate, vstate)) { regs[rd] = 0; break; } addr_limit = addr + lim; for (regs[rd] = 0; addr < addr_limit; addr++) { if ((c = dtrace_load8(addr)) == target) { regs[rd] = addr; if (subr == DIF_SUBR_STRCHR) break; } if (c == '\0') break; } break; } case DIF_SUBR_STRSTR: case DIF_SUBR_INDEX: case DIF_SUBR_RINDEX: { /* * We're going to iterate over the string looking for the * specified string. We will iterate until we have reached * the string length or we have found the string. (Yes, this * is done in the most naive way possible -- but considering * that the string we're searching for is likely to be * relatively short, the complexity of Rabin-Karp or similar * hardly seems merited.) */ char *addr = (char *)(uintptr_t)tupregs[0].dttk_value; char *substr = (char *)(uintptr_t)tupregs[1].dttk_value; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; size_t len = dtrace_strlen(addr, size); size_t sublen = dtrace_strlen(substr, size); char *limit = addr + len, *orig = addr; int notfound = subr == DIF_SUBR_STRSTR ? 0 : -1; int inc = 1; regs[rd] = notfound; if (!dtrace_canload((uintptr_t)addr, len + 1, mstate, vstate)) { regs[rd] = 0; break; } if (!dtrace_canload((uintptr_t)substr, sublen + 1, mstate, vstate)) { regs[rd] = 0; break; } /* * strstr() and index()/rindex() have similar semantics if * both strings are the empty string: strstr() returns a * pointer to the (empty) string, and index() and rindex() * both return index 0 (regardless of any position argument). */ if (sublen == 0 && len == 0) { if (subr == DIF_SUBR_STRSTR) regs[rd] = (uintptr_t)addr; else regs[rd] = 0; break; } if (subr != DIF_SUBR_STRSTR) { if (subr == DIF_SUBR_RINDEX) { limit = orig - 1; addr += len; inc = -1; } /* * Both index() and rindex() take an optional position * argument that denotes the starting position. */ if (nargs == 3) { int64_t pos = (int64_t)tupregs[2].dttk_value; /* * If the position argument to index() is * negative, Perl implicitly clamps it at * zero. This semantic is a little surprising * given the special meaning of negative * positions to similar Perl functions like * substr(), but it appears to reflect a * notion that index() can start from a * negative index and increment its way up to * the string. Given this notion, Perl's * rindex() is at least self-consistent in * that it implicitly clamps positions greater * than the string length to be the string * length. Where Perl completely loses * coherence, however, is when the specified * substring is the empty string (""). In * this case, even if the position is * negative, rindex() returns 0 -- and even if * the position is greater than the length, * index() returns the string length. These * semantics violate the notion that index() * should never return a value less than the * specified position and that rindex() should * never return a value greater than the * specified position. 
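 * To make the preserved behavior concrete:
 *
 *	index("foo", "", 10) == 3	(clamped to the length)
 *	rindex("foo", "", -5) == 0	(despite the position)
 *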
(One assumes that * these semantics are artifacts of Perl's * implementation and not the results of * deliberate design -- it beggars belief that * even Larry Wall could desire such oddness.) * While in the abstract one would wish for * consistent position semantics across * substr(), index() and rindex() -- or at the * very least self-consistent position * semantics for index() and rindex() -- we * instead opt to keep with the extant Perl * semantics, in all their broken glory. (Do * we have more desire to maintain Perl's * semantics than Perl does? Probably.) */ if (subr == DIF_SUBR_RINDEX) { if (pos < 0) { if (sublen == 0) regs[rd] = 0; break; } if (pos > len) pos = len; } else { if (pos < 0) pos = 0; if (pos >= len) { if (sublen == 0) regs[rd] = len; break; } } addr = orig + pos; } } for (regs[rd] = notfound; addr != limit; addr += inc) { if (dtrace_strncmp(addr, substr, sublen) == 0) { if (subr != DIF_SUBR_STRSTR) { /* * As D index() and rindex() are * modeled on Perl (and not on awk), * we return a zero-based (and not a * one-based) index. (For you Perl * weenies: no, we're not going to add * $[ -- and shouldn't you be at a con * or something?) */ regs[rd] = (uintptr_t)(addr - orig); break; } ASSERT(subr == DIF_SUBR_STRSTR); regs[rd] = (uintptr_t)addr; break; } } break; } case DIF_SUBR_STRTOK: { uintptr_t addr = tupregs[0].dttk_value; uintptr_t tokaddr = tupregs[1].dttk_value; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; uintptr_t limit, toklimit; size_t clim; uint8_t c = 0, tokmap[32]; /* 256 / 8 */ char *dest = (char *)mstate->dtms_scratch_ptr; int i; /* * Check both the token buffer and (later) the input buffer, * since both could be non-scratch addresses. */ if (!dtrace_strcanload(tokaddr, size, &clim, mstate, vstate)) { regs[rd] = 0; break; } toklimit = tokaddr + clim; if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } if (addr == 0) { /* * If the address specified is NULL, we use our saved * strtok pointer from the mstate. Note that this * means that the saved strtok pointer is _only_ * valid within multiple enablings of the same probe -- * it behaves like an implicit clause-local variable. */ addr = mstate->dtms_strtok; limit = mstate->dtms_strtok_limit; } else { /* * If the user-specified address is non-NULL we must * access check it. This is the only time we have * a chance to do so, since this address may reside * in the string table of this clause-- future calls * (when we fetch addr from mstate->dtms_strtok) * would fail this access check. */ if (!dtrace_strcanload(addr, size, &clim, mstate, vstate)) { regs[rd] = 0; break; } limit = addr + clim; } /* * First, zero the token map, and then process the token * string -- setting a bit in the map for every character * found in the token string. */ for (i = 0; i < sizeof (tokmap); i++) tokmap[i] = 0; for (; tokaddr < toklimit; tokaddr++) { if ((c = dtrace_load8(tokaddr)) == '\0') break; ASSERT((c >> 3) < sizeof (tokmap)); tokmap[c >> 3] |= (1 << (c & 0x7)); } for (; addr < limit; addr++) { /* * We're looking for a character that is _not_ * contained in the token string. */ if ((c = dtrace_load8(addr)) == '\0') break; if (!(tokmap[c >> 3] & (1 << (c & 0x7)))) break; } if (c == '\0') { /* * We reached the end of the string without finding * any character that was not in the token string. * We return NULL in this case, and we set the saved * address to NULL as well. 
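 * (For example, after strtok("::a:b", ":") has returned "a"
 * and a second strtok(NULL, ":") has returned "b", a third
 * call returns NULL and clears the saved pointer.)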
*/ regs[rd] = 0; mstate->dtms_strtok = 0; mstate->dtms_strtok_limit = 0; break; } /* * From here on, we're copying into the destination string. */ for (i = 0; addr < limit && i < size - 1; addr++) { if ((c = dtrace_load8(addr)) == '\0') break; if (tokmap[c >> 3] & (1 << (c & 0x7))) break; ASSERT(i < size); dest[i++] = c; } ASSERT(i < size); dest[i] = '\0'; regs[rd] = (uintptr_t)dest; mstate->dtms_scratch_ptr += size; mstate->dtms_strtok = addr; mstate->dtms_strtok_limit = limit; break; } case DIF_SUBR_SUBSTR: { uintptr_t s = tupregs[0].dttk_value; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; char *d = (char *)mstate->dtms_scratch_ptr; int64_t index = (int64_t)tupregs[1].dttk_value; int64_t remaining = (int64_t)tupregs[2].dttk_value; size_t len = dtrace_strlen((char *)s, size); int64_t i; if (!dtrace_canload(s, len + 1, mstate, vstate)) { regs[rd] = 0; break; } if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } if (nargs <= 2) remaining = (int64_t)size; if (index < 0) { index += len; if (index < 0 && index + remaining > 0) { remaining += index; index = 0; } } if (index >= len || index < 0) { remaining = 0; } else if (remaining < 0) { remaining += len - index; } else if (index + remaining > size) { remaining = size - index; } for (i = 0; i < remaining; i++) { if ((d[i] = dtrace_load8(s + index + i)) == '\0') break; } d[i] = '\0'; mstate->dtms_scratch_ptr += size; regs[rd] = (uintptr_t)d; break; } case DIF_SUBR_JSON: { uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; uintptr_t json = tupregs[0].dttk_value; size_t jsonlen = dtrace_strlen((char *)json, size); uintptr_t elem = tupregs[1].dttk_value; size_t elemlen = dtrace_strlen((char *)elem, size); char *dest = (char *)mstate->dtms_scratch_ptr; char *elemlist = (char *)mstate->dtms_scratch_ptr + jsonlen + 1; char *ee = elemlist; int nelems = 1; uintptr_t cur; if (!dtrace_canload(json, jsonlen + 1, mstate, vstate) || !dtrace_canload(elem, elemlen + 1, mstate, vstate)) { regs[rd] = 0; break; } if (!DTRACE_INSCRATCH(mstate, jsonlen + 1 + elemlen + 1)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } /* * Read the element selector and split it up into a packed list * of strings. */ for (cur = elem; cur < elem + elemlen; cur++) { char cc = dtrace_load8(cur); if (cur == elem && cc == '[') { /* * If the first element selector key is * actually an array index then ignore the * bracket. */ continue; } if (cc == ']') continue; if (cc == '.' 
|| cc == '[') { nelems++; cc = '\0'; } *ee++ = cc; } *ee++ = '\0'; if ((regs[rd] = (uintptr_t)dtrace_json(size, json, elemlist, nelems, dest)) != 0) mstate->dtms_scratch_ptr += jsonlen + 1; break; } case DIF_SUBR_TOUPPER: case DIF_SUBR_TOLOWER: { uintptr_t s = tupregs[0].dttk_value; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; char *dest = (char *)mstate->dtms_scratch_ptr, c; size_t len = dtrace_strlen((char *)s, size); char lower, upper, convert; int64_t i; if (subr == DIF_SUBR_TOUPPER) { lower = 'a'; upper = 'z'; convert = 'A'; } else { lower = 'A'; upper = 'Z'; convert = 'a'; } if (!dtrace_canload(s, len + 1, mstate, vstate)) { regs[rd] = 0; break; } if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } for (i = 0; i < size - 1; i++) { if ((c = dtrace_load8(s + i)) == '\0') break; if (c >= lower && c <= upper) c = convert + (c - lower); dest[i] = c; } ASSERT(i < size); dest[i] = '\0'; regs[rd] = (uintptr_t)dest; mstate->dtms_scratch_ptr += size; break; } #ifdef illumos case DIF_SUBR_GETMAJOR: #ifdef _LP64 regs[rd] = (tupregs[0].dttk_value >> NBITSMINOR64) & MAXMAJ64; #else regs[rd] = (tupregs[0].dttk_value >> NBITSMINOR) & MAXMAJ; #endif break; case DIF_SUBR_GETMINOR: #ifdef _LP64 regs[rd] = tupregs[0].dttk_value & MAXMIN64; #else regs[rd] = tupregs[0].dttk_value & MAXMIN; #endif break; case DIF_SUBR_DDI_PATHNAME: { /* * This one is a galactic mess. We are going to roughly * emulate ddi_pathname(), but it's made more complicated * by the fact that we (a) want to include the minor name and * (b) must proceed iteratively instead of recursively. */ uintptr_t dest = mstate->dtms_scratch_ptr; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; char *start = (char *)dest, *end = start + size - 1; uintptr_t daddr = tupregs[0].dttk_value; int64_t minor = (int64_t)tupregs[1].dttk_value; char *s; int i, len, depth = 0; /* * Due to all the pointer jumping we do and context we must * rely upon, we just mandate that the user must have kernel * read privileges to use this routine. */ if ((mstate->dtms_access & DTRACE_ACCESS_KERNEL) == 0) { *flags |= CPU_DTRACE_KPRIV; *illval = daddr; regs[rd] = 0; } if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } *end = '\0'; /* * We want to have a name for the minor. In order to do this, * we need to walk the minor list from the devinfo. We want * to be sure that we don't infinitely walk a circular list, * so we check for circularity by sending a scout pointer * ahead two elements for every element that we iterate over; * if the list is circular, these will ultimately point to the * same element. You may recognize this little trick as the * answer to a stupid interview question -- one that always * seems to be asked by those who had to have it laboriously * explained to them, and who can't even concisely describe * the conditions under which one would be forced to resort to * this technique. Needless to say, those conditions are * found here -- and probably only here. Is this the only use * of this infamous trick in shipping, production code? If it * isn't, it probably should be... 
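 *
 * (For the record: a scout that advances two links for every
 * link the primary pointer takes is Floyd's cycle detection,
 * the classic tortoise-and-hare scheme.)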
*/ if (minor != -1) { uintptr_t maddr = dtrace_loadptr(daddr + offsetof(struct dev_info, devi_minor)); uintptr_t next = offsetof(struct ddi_minor_data, next); uintptr_t name = offsetof(struct ddi_minor_data, d_minor) + offsetof(struct ddi_minor, name); uintptr_t dev = offsetof(struct ddi_minor_data, d_minor) + offsetof(struct ddi_minor, dev); uintptr_t scout; if (maddr != NULL) scout = dtrace_loadptr(maddr + next); while (maddr != NULL && !(*flags & CPU_DTRACE_FAULT)) { uint64_t m; #ifdef _LP64 m = dtrace_load64(maddr + dev) & MAXMIN64; #else m = dtrace_load32(maddr + dev) & MAXMIN; #endif if (m != minor) { maddr = dtrace_loadptr(maddr + next); if (scout == NULL) continue; scout = dtrace_loadptr(scout + next); if (scout == NULL) continue; scout = dtrace_loadptr(scout + next); if (scout == NULL) continue; if (scout == maddr) { *flags |= CPU_DTRACE_ILLOP; break; } continue; } /* * We have the minor data. Now we need to * copy the minor's name into the end of the * pathname. */ s = (char *)dtrace_loadptr(maddr + name); len = dtrace_strlen(s, size); if (*flags & CPU_DTRACE_FAULT) break; if (len != 0) { if ((end -= (len + 1)) < start) break; *end = ':'; } for (i = 1; i <= len; i++) end[i] = dtrace_load8((uintptr_t)s++); break; } } while (daddr != NULL && !(*flags & CPU_DTRACE_FAULT)) { ddi_node_state_t devi_state; devi_state = dtrace_load32(daddr + offsetof(struct dev_info, devi_node_state)); if (*flags & CPU_DTRACE_FAULT) break; if (devi_state >= DS_INITIALIZED) { s = (char *)dtrace_loadptr(daddr + offsetof(struct dev_info, devi_addr)); len = dtrace_strlen(s, size); if (*flags & CPU_DTRACE_FAULT) break; if (len != 0) { if ((end -= (len + 1)) < start) break; *end = '@'; } for (i = 1; i <= len; i++) end[i] = dtrace_load8((uintptr_t)s++); } /* * Now for the node name... */ s = (char *)dtrace_loadptr(daddr + offsetof(struct dev_info, devi_node_name)); daddr = dtrace_loadptr(daddr + offsetof(struct dev_info, devi_parent)); /* * If our parent is NULL (that is, if we're the root * node), we're going to use the special path * "devices". */ if (daddr == 0) s = "devices"; len = dtrace_strlen(s, size); if (*flags & CPU_DTRACE_FAULT) break; if ((end -= (len + 1)) < start) break; for (i = 1; i <= len; i++) end[i] = dtrace_load8((uintptr_t)s++); *end = '/'; if (depth++ > dtrace_devdepth_max) { *flags |= CPU_DTRACE_ILLOP; break; } } if (end < start) DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); if (daddr == 0) { regs[rd] = (uintptr_t)end; mstate->dtms_scratch_ptr += size; } break; } #endif case DIF_SUBR_STRJOIN: { char *d = (char *)mstate->dtms_scratch_ptr; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; uintptr_t s1 = tupregs[0].dttk_value; uintptr_t s2 = tupregs[1].dttk_value; int i = 0, j = 0; size_t lim1, lim2; char c; if (!dtrace_strcanload(s1, size, &lim1, mstate, vstate) || !dtrace_strcanload(s2, size, &lim2, mstate, vstate)) { regs[rd] = 0; break; } if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } for (;;) { if (i >= size) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } c = (i >= lim1) ? '\0' : dtrace_load8(s1++); if ((d[i++] = c) == '\0') { i--; break; } } for (;;) { if (i >= size) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } c = (j++ >= lim2) ? 
'\0' : dtrace_load8(s2++); if ((d[i++] = c) == '\0') break; } if (i < size) { mstate->dtms_scratch_ptr += i; regs[rd] = (uintptr_t)d; } break; } case DIF_SUBR_STRTOLL: { uintptr_t s = tupregs[0].dttk_value; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; size_t lim; int base = 10; if (nargs > 1) { if ((base = tupregs[1].dttk_value) <= 1 || base > ('z' - 'a' + 1) + ('9' - '0' + 1)) { *flags |= CPU_DTRACE_ILLOP; break; } } if (!dtrace_strcanload(s, size, &lim, mstate, vstate)) { regs[rd] = INT64_MIN; break; } regs[rd] = dtrace_strtoll((char *)s, base, lim); break; } case DIF_SUBR_LLTOSTR: { int64_t i = (int64_t)tupregs[0].dttk_value; uint64_t val, digit; uint64_t size = 65; /* enough room for 2^64 in binary */ char *end = (char *)mstate->dtms_scratch_ptr + size - 1; int base = 10; if (nargs > 1) { if ((base = tupregs[1].dttk_value) <= 1 || base > ('z' - 'a' + 1) + ('9' - '0' + 1)) { *flags |= CPU_DTRACE_ILLOP; break; } } val = (base == 10 && i < 0) ? i * -1 : i; if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } for (*end-- = '\0'; val; val /= base) { if ((digit = val % base) <= '9' - '0') { *end-- = '0' + digit; } else { *end-- = 'a' + (digit - ('9' - '0') - 1); } } if (i == 0 && base == 16) *end-- = '0'; if (base == 16) *end-- = 'x'; if (i == 0 || base == 8 || base == 16) *end-- = '0'; if (i < 0 && base == 10) *end-- = '-'; regs[rd] = (uintptr_t)end + 1; mstate->dtms_scratch_ptr += size; break; } case DIF_SUBR_HTONS: case DIF_SUBR_NTOHS: #if BYTE_ORDER == BIG_ENDIAN regs[rd] = (uint16_t)tupregs[0].dttk_value; #else regs[rd] = DT_BSWAP_16((uint16_t)tupregs[0].dttk_value); #endif break; case DIF_SUBR_HTONL: case DIF_SUBR_NTOHL: #if BYTE_ORDER == BIG_ENDIAN regs[rd] = (uint32_t)tupregs[0].dttk_value; #else regs[rd] = DT_BSWAP_32((uint32_t)tupregs[0].dttk_value); #endif break; case DIF_SUBR_HTONLL: case DIF_SUBR_NTOHLL: #if BYTE_ORDER == BIG_ENDIAN regs[rd] = (uint64_t)tupregs[0].dttk_value; #else regs[rd] = DT_BSWAP_64((uint64_t)tupregs[0].dttk_value); #endif break; case DIF_SUBR_DIRNAME: case DIF_SUBR_BASENAME: { char *dest = (char *)mstate->dtms_scratch_ptr; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; uintptr_t src = tupregs[0].dttk_value; int i, j, len = dtrace_strlen((char *)src, size); int lastbase = -1, firstbase = -1, lastdir = -1; int start, end; if (!dtrace_canload(src, len + 1, mstate, vstate)) { regs[rd] = 0; break; } if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } /* * The basename and dirname for a zero-length string is * defined to be "." */ if (len == 0) { len = 1; src = (uintptr_t)"."; } /* * Start from the back of the string, moving back toward the * front until we see a character that isn't a slash. That * character is the last character in the basename. */ for (i = len - 1; i >= 0; i--) { if (dtrace_load8(src + i) != '/') break; } if (i >= 0) lastbase = i; /* * Starting from the last character in the basename, move * towards the front until we find a slash. The character * that we processed immediately before that is the first * character in the basename. */ for (; i >= 0; i--) { if (dtrace_load8(src + i) == '/') break; } if (i >= 0) firstbase = i + 1; /* * Now keep going until we find a non-slash character. That * character is the last character in the dirname. 
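 *
 * A worked example: for src = "/usr//bin/" (len 10), the three
 * backward scans leave lastbase = 8 ('n'), firstbase = 6 ('b')
 * and lastdir = 3 ('r'), so basename() yields "bin" and
 * dirname() yields "/usr".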
*/ for (; i >= 0; i--) { if (dtrace_load8(src + i) != '/') break; } if (i >= 0) lastdir = i; ASSERT(!(lastbase == -1 && firstbase != -1)); ASSERT(!(firstbase == -1 && lastdir != -1)); if (lastbase == -1) { /* * We didn't find a non-slash character. We know that * the length is non-zero, so the whole string must be * slashes. In either the dirname or the basename * case, we return '/'. */ ASSERT(firstbase == -1); firstbase = lastbase = lastdir = 0; } if (firstbase == -1) { /* * The entire string consists only of a basename * component. If we're looking for dirname, we need * to change our string to be just "."; if we're * looking for a basename, we'll just set the first * character of the basename to be 0. */ if (subr == DIF_SUBR_DIRNAME) { ASSERT(lastdir == -1); src = (uintptr_t)"."; lastdir = 0; } else { firstbase = 0; } } if (subr == DIF_SUBR_DIRNAME) { if (lastdir == -1) { /* * We know that we have a slash in the name -- * or lastdir would be set to 0, above. And * because lastdir is -1, we know that this * slash must be the first character. (That * is, the full string must be of the form * "/basename".) In this case, the last * character of the directory name is 0. */ lastdir = 0; } start = 0; end = lastdir; } else { ASSERT(subr == DIF_SUBR_BASENAME); ASSERT(firstbase != -1 && lastbase != -1); start = firstbase; end = lastbase; } for (i = start, j = 0; i <= end && j < size - 1; i++, j++) dest[j] = dtrace_load8(src + i); dest[j] = '\0'; regs[rd] = (uintptr_t)dest; mstate->dtms_scratch_ptr += size; break; } case DIF_SUBR_GETF: { uintptr_t fd = tupregs[0].dttk_value; struct filedesc *fdp; file_t *fp; if (!dtrace_priv_proc(state)) { regs[rd] = 0; break; } fdp = curproc->p_fd; FILEDESC_SLOCK(fdp); fp = fget_locked(fdp, fd); mstate->dtms_getf = fp; regs[rd] = (uintptr_t)fp; FILEDESC_SUNLOCK(fdp); break; } case DIF_SUBR_CLEANPATH: { char *dest = (char *)mstate->dtms_scratch_ptr, c; uint64_t size = state->dts_options[DTRACEOPT_STRSIZE]; uintptr_t src = tupregs[0].dttk_value; size_t lim; int i = 0, j = 0; #ifdef illumos zone_t *z; #endif if (!dtrace_strcanload(src, size, &lim, mstate, vstate)) { regs[rd] = 0; break; } if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } /* * Move forward, loading each character. */ do { c = (i >= lim) ? '\0' : dtrace_load8(src + i++); next: if (j + 5 >= size) /* 5 = strlen("/..c\0") */ break; if (c != '/') { dest[j++] = c; continue; } c = (i >= lim) ? '\0' : dtrace_load8(src + i++); if (c == '/') { /* * We have two slashes -- we can just advance * to the next character. */ goto next; } if (c != '.') { /* * This is not "." and it's not ".." -- we can * just store the "/" and this character and * drive on. */ dest[j++] = '/'; dest[j++] = c; continue; } c = (i >= lim) ? '\0' : dtrace_load8(src + i++); if (c == '/') { /* * This is a "/./" component. We're not going * to store anything in the destination buffer; * we're just going to go to the next component. */ goto next; } if (c != '.') { /* * This is not ".." -- we can just store the * "/." and this character and continue * processing. */ dest[j++] = '/'; dest[j++] = '.'; dest[j++] = c; continue; } c = (i >= lim) ? '\0' : dtrace_load8(src + i++); if (c != '/' && c != '\0') { /* * This is not ".." -- it's "..[mumble]". * We'll store the "/.." and this character * and continue processing. */ dest[j++] = '/'; dest[j++] = '.'; dest[j++] = '.'; dest[j++] = c; continue; } /* * This is "/../" or "/..\0". We need to back up * our destination pointer until we find a "/". 
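 * (For example, "/foo/./bar//../baz" cleans up to "/foo/baz".)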
*/ i--; while (j != 0 && dest[--j] != '/') continue; if (c == '\0') dest[++j] = '/'; } while (c != '\0'); dest[j] = '\0'; #ifdef illumos if (mstate->dtms_getf != NULL && !(mstate->dtms_access & DTRACE_ACCESS_KERNEL) && (z = state->dts_cred.dcr_cred->cr_zone) != kcred->cr_zone) { /* * If we've done a getf() as a part of this ECB and we * don't have kernel access (and we're not in the global * zone), check if the path we cleaned up begins with * the zone's root path, and trim it off if so. Note * that this is an output cleanliness issue, not a * security issue: knowing one's zone root path does * not enable privilege escalation. */ if (strstr(dest, z->zone_rootpath) == dest) dest += strlen(z->zone_rootpath) - 1; } #endif regs[rd] = (uintptr_t)dest; mstate->dtms_scratch_ptr += size; break; } case DIF_SUBR_INET_NTOA: case DIF_SUBR_INET_NTOA6: case DIF_SUBR_INET_NTOP: { size_t size; int af, argi, i; char *base, *end; if (subr == DIF_SUBR_INET_NTOP) { af = (int)tupregs[0].dttk_value; argi = 1; } else { af = subr == DIF_SUBR_INET_NTOA ? AF_INET: AF_INET6; argi = 0; } if (af == AF_INET) { ipaddr_t ip4; uint8_t *ptr8, val; if (!dtrace_canload(tupregs[argi].dttk_value, sizeof (ipaddr_t), mstate, vstate)) { regs[rd] = 0; break; } /* * Safely load the IPv4 address. */ ip4 = dtrace_load32(tupregs[argi].dttk_value); /* * Check an IPv4 string will fit in scratch. */ size = INET_ADDRSTRLEN; if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } base = (char *)mstate->dtms_scratch_ptr; end = (char *)mstate->dtms_scratch_ptr + size - 1; /* * Stringify as a dotted decimal quad. */ *end-- = '\0'; ptr8 = (uint8_t *)&ip4; for (i = 3; i >= 0; i--) { val = ptr8[i]; if (val == 0) { *end-- = '0'; } else { for (; val; val /= 10) { *end-- = '0' + (val % 10); } } if (i > 0) *end-- = '.'; } ASSERT(end + 1 >= base); } else if (af == AF_INET6) { struct in6_addr ip6; int firstzero, tryzero, numzero, v6end; uint16_t val; const char digits[] = "0123456789abcdef"; /* * Stringify using RFC 1884 convention 2 - 16 bit * hexadecimal values with a zero-run compression. * Lower case hexadecimal digits are used. * eg, fe80::214:4fff:fe0b:76c8. * The IPv4 embedded form is returned for inet_ntop, * just the IPv4 string is returned for inet_ntoa6. */ if (!dtrace_canload(tupregs[argi].dttk_value, sizeof (struct in6_addr), mstate, vstate)) { regs[rd] = 0; break; } /* * Safely load the IPv6 address. */ dtrace_bcopy( (void *)(uintptr_t)tupregs[argi].dttk_value, (void *)(uintptr_t)&ip6, sizeof (struct in6_addr)); /* * Check an IPv6 string will fit in scratch. */ size = INET6_ADDRSTRLEN; if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } base = (char *)mstate->dtms_scratch_ptr; end = (char *)mstate->dtms_scratch_ptr + size - 1; *end-- = '\0'; /* * Find the longest run of 16 bit zero values * for the single allowed zero compression - "::". 
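 * For example, the loopback address 0:0:0:0:0:0:0:1 yields
 * firstzero = 0 and numzero = 14, and renders as "::1".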
*/ firstzero = -1; tryzero = -1; numzero = 1; for (i = 0; i < sizeof (struct in6_addr); i++) { #ifdef illumos if (ip6._S6_un._S6_u8[i] == 0 && #else if (ip6.__u6_addr.__u6_addr8[i] == 0 && #endif tryzero == -1 && i % 2 == 0) { tryzero = i; continue; } if (tryzero != -1 && #ifdef illumos (ip6._S6_un._S6_u8[i] != 0 || #else (ip6.__u6_addr.__u6_addr8[i] != 0 || #endif i == sizeof (struct in6_addr) - 1)) { if (i - tryzero <= numzero) { tryzero = -1; continue; } firstzero = tryzero; numzero = i - i % 2 - tryzero; tryzero = -1; #ifdef illumos if (ip6._S6_un._S6_u8[i] == 0 && #else if (ip6.__u6_addr.__u6_addr8[i] == 0 && #endif i == sizeof (struct in6_addr) - 1) numzero += 2; } } ASSERT(firstzero + numzero <= sizeof (struct in6_addr)); /* * Check for an IPv4 embedded address. */ v6end = sizeof (struct in6_addr) - 2; if (IN6_IS_ADDR_V4MAPPED(&ip6) || IN6_IS_ADDR_V4COMPAT(&ip6)) { for (i = sizeof (struct in6_addr) - 1; i >= DTRACE_V4MAPPED_OFFSET; i--) { ASSERT(end >= base); #ifdef illumos val = ip6._S6_un._S6_u8[i]; #else val = ip6.__u6_addr.__u6_addr8[i]; #endif if (val == 0) { *end-- = '0'; } else { for (; val; val /= 10) { *end-- = '0' + val % 10; } } if (i > DTRACE_V4MAPPED_OFFSET) *end-- = '.'; } if (subr == DIF_SUBR_INET_NTOA6) goto inetout; /* * Set v6end to skip the IPv4 address that * we have already stringified. */ v6end = 10; } /* * Build the IPv6 string by working through the * address in reverse. */ for (i = v6end; i >= 0; i -= 2) { ASSERT(end >= base); if (i == firstzero + numzero - 2) { *end-- = ':'; *end-- = ':'; i -= numzero - 2; continue; } if (i < 14 && i != firstzero - 2) *end-- = ':'; #ifdef illumos val = (ip6._S6_un._S6_u8[i] << 8) + ip6._S6_un._S6_u8[i + 1]; #else val = (ip6.__u6_addr.__u6_addr8[i] << 8) + ip6.__u6_addr.__u6_addr8[i + 1]; #endif if (val == 0) { *end-- = '0'; } else { for (; val; val /= 16) { *end-- = digits[val % 16]; } } } ASSERT(end + 1 >= base); } else { /* * The user didn't use AF_INET or AF_INET6. */ DTRACE_CPUFLAG_SET(CPU_DTRACE_ILLOP); regs[rd] = 0; break; } inetout: regs[rd] = (uintptr_t)end + 1; mstate->dtms_scratch_ptr += size; break; } case DIF_SUBR_MEMREF: { uintptr_t size = 2 * sizeof(uintptr_t); uintptr_t *memref = (uintptr_t *) P2ROUNDUP(mstate->dtms_scratch_ptr, sizeof(uintptr_t)); size_t scratch_size = ((uintptr_t) memref - mstate->dtms_scratch_ptr) + size; /* address and length */ memref[0] = tupregs[0].dttk_value; memref[1] = tupregs[1].dttk_value; regs[rd] = (uintptr_t) memref; mstate->dtms_scratch_ptr += scratch_size; break; } #ifndef illumos case DIF_SUBR_MEMSTR: { char *str = (char *)mstate->dtms_scratch_ptr; uintptr_t mem = tupregs[0].dttk_value; char c = tupregs[1].dttk_value; size_t size = tupregs[2].dttk_value; uint8_t n; int i; regs[rd] = 0; if (size == 0) break; if (!dtrace_canload(mem, size - 1, mstate, vstate)) break; if (!DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); break; } if (dtrace_memstr_max != 0 && size > dtrace_memstr_max) { *flags |= CPU_DTRACE_ILLOP; break; } for (i = 0; i < size - 1; i++) { n = dtrace_load8(mem++); str[i] = (n == 0) ? c : n; } str[size - 1] = 0; regs[rd] = (uintptr_t)str; mstate->dtms_scratch_ptr += size; break; } #endif } } /* * Emulate the execution of DTrace IR instructions specified by the given * DIF object. This function is deliberately void of assertions as all of * the necessary checks are handled by a call to dtrace_difo_validate().
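 * The emulator models a small register machine: an integer
 * register file (regs), condition codes (cc_n, cc_z, cc_v and
 * cc_c) that are set by the cmp and tst instructions and
 * consumed by the branches, and a tuple stack (tupregs) used
 * to marshal subroutine arguments and associative-array keys.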
*/ static uint64_t dtrace_dif_emulate(dtrace_difo_t *difo, dtrace_mstate_t *mstate, dtrace_vstate_t *vstate, dtrace_state_t *state) { const dif_instr_t *text = difo->dtdo_buf; const uint_t textlen = difo->dtdo_len; const char *strtab = difo->dtdo_strtab; const uint64_t *inttab = difo->dtdo_inttab; uint64_t rval = 0; dtrace_statvar_t *svar; dtrace_dstate_t *dstate = &vstate->dtvs_dynvars; dtrace_difv_t *v; volatile uint16_t *flags = &cpu_core[curcpu].cpuc_dtrace_flags; volatile uintptr_t *illval = &cpu_core[curcpu].cpuc_dtrace_illval; dtrace_key_t tupregs[DIF_DTR_NREGS + 2]; /* +2 for thread and id */ uint64_t regs[DIF_DIR_NREGS]; uint64_t *tmp; uint8_t cc_n = 0, cc_z = 0, cc_v = 0, cc_c = 0; int64_t cc_r; uint_t pc = 0, id, opc = 0; uint8_t ttop = 0; dif_instr_t instr; uint_t r1, r2, rd; /* * We stash the current DIF object into the machine state: we need it * for subsequent access checking. */ mstate->dtms_difo = difo; regs[DIF_REG_R0] = 0; /* %r0 is fixed at zero */ while (pc < textlen && !(*flags & CPU_DTRACE_FAULT)) { opc = pc; instr = text[pc++]; r1 = DIF_INSTR_R1(instr); r2 = DIF_INSTR_R2(instr); rd = DIF_INSTR_RD(instr); switch (DIF_INSTR_OP(instr)) { case DIF_OP_OR: regs[rd] = regs[r1] | regs[r2]; break; case DIF_OP_XOR: regs[rd] = regs[r1] ^ regs[r2]; break; case DIF_OP_AND: regs[rd] = regs[r1] & regs[r2]; break; case DIF_OP_SLL: regs[rd] = regs[r1] << regs[r2]; break; case DIF_OP_SRL: regs[rd] = regs[r1] >> regs[r2]; break; case DIF_OP_SUB: regs[rd] = regs[r1] - regs[r2]; break; case DIF_OP_ADD: regs[rd] = regs[r1] + regs[r2]; break; case DIF_OP_MUL: regs[rd] = regs[r1] * regs[r2]; break; case DIF_OP_SDIV: if (regs[r2] == 0) { regs[rd] = 0; *flags |= CPU_DTRACE_DIVZERO; } else { regs[rd] = (int64_t)regs[r1] / (int64_t)regs[r2]; } break; case DIF_OP_UDIV: if (regs[r2] == 0) { regs[rd] = 0; *flags |= CPU_DTRACE_DIVZERO; } else { regs[rd] = regs[r1] / regs[r2]; } break; case DIF_OP_SREM: if (regs[r2] == 0) { regs[rd] = 0; *flags |= CPU_DTRACE_DIVZERO; } else { regs[rd] = (int64_t)regs[r1] % (int64_t)regs[r2]; } break; case DIF_OP_UREM: if (regs[r2] == 0) { regs[rd] = 0; *flags |= CPU_DTRACE_DIVZERO; } else { regs[rd] = regs[r1] % regs[r2]; } break; case DIF_OP_NOT: regs[rd] = ~regs[r1]; break; case DIF_OP_MOV: regs[rd] = regs[r1]; break; case DIF_OP_CMP: cc_r = regs[r1] - regs[r2]; cc_n = cc_r < 0; cc_z = cc_r == 0; cc_v = 0; cc_c = regs[r1] < regs[r2]; break; case DIF_OP_TST: cc_n = cc_v = cc_c = 0; cc_z = regs[r1] == 0; break; case DIF_OP_BA: pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_BE: if (cc_z) pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_BNE: if (cc_z == 0) pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_BG: if ((cc_z | (cc_n ^ cc_v)) == 0) pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_BGU: if ((cc_c | cc_z) == 0) pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_BGE: if ((cc_n ^ cc_v) == 0) pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_BGEU: if (cc_c == 0) pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_BL: if (cc_n ^ cc_v) pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_BLU: if (cc_c) pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_BLE: if (cc_z | (cc_n ^ cc_v)) pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_BLEU: if (cc_c | cc_z) pc = DIF_INSTR_LABEL(instr); break; case DIF_OP_RLDSB: if (!dtrace_canload(regs[r1], 1, mstate, vstate)) break; /*FALLTHROUGH*/ case DIF_OP_LDSB: regs[rd] = (int8_t)dtrace_load8(regs[r1]); break; case DIF_OP_RLDSH: if (!dtrace_canload(regs[r1], 2, mstate, vstate)) break; /*FALLTHROUGH*/ case DIF_OP_LDSH: regs[rd] = 
(int16_t)dtrace_load16(regs[r1]); break; case DIF_OP_RLDSW: if (!dtrace_canload(regs[r1], 4, mstate, vstate)) break; /*FALLTHROUGH*/ case DIF_OP_LDSW: regs[rd] = (int32_t)dtrace_load32(regs[r1]); break; case DIF_OP_RLDUB: if (!dtrace_canload(regs[r1], 1, mstate, vstate)) break; /*FALLTHROUGH*/ case DIF_OP_LDUB: regs[rd] = dtrace_load8(regs[r1]); break; case DIF_OP_RLDUH: if (!dtrace_canload(regs[r1], 2, mstate, vstate)) break; /*FALLTHROUGH*/ case DIF_OP_LDUH: regs[rd] = dtrace_load16(regs[r1]); break; case DIF_OP_RLDUW: if (!dtrace_canload(regs[r1], 4, mstate, vstate)) break; /*FALLTHROUGH*/ case DIF_OP_LDUW: regs[rd] = dtrace_load32(regs[r1]); break; case DIF_OP_RLDX: if (!dtrace_canload(regs[r1], 8, mstate, vstate)) break; /*FALLTHROUGH*/ case DIF_OP_LDX: regs[rd] = dtrace_load64(regs[r1]); break; case DIF_OP_ULDSB: DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = (int8_t) dtrace_fuword8((void *)(uintptr_t)regs[r1]); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_OP_ULDSH: DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = (int16_t) dtrace_fuword16((void *)(uintptr_t)regs[r1]); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_OP_ULDSW: DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = (int32_t) dtrace_fuword32((void *)(uintptr_t)regs[r1]); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_OP_ULDUB: DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = dtrace_fuword8((void *)(uintptr_t)regs[r1]); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_OP_ULDUH: DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = dtrace_fuword16((void *)(uintptr_t)regs[r1]); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_OP_ULDUW: DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = dtrace_fuword32((void *)(uintptr_t)regs[r1]); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_OP_ULDX: DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); regs[rd] = dtrace_fuword64((void *)(uintptr_t)regs[r1]); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); break; case DIF_OP_RET: rval = regs[rd]; pc = textlen; break; case DIF_OP_NOP: break; case DIF_OP_SETX: regs[rd] = inttab[DIF_INSTR_INTEGER(instr)]; break; case DIF_OP_SETS: regs[rd] = (uint64_t)(uintptr_t) (strtab + DIF_INSTR_STRING(instr)); break; case DIF_OP_SCMP: { size_t sz = state->dts_options[DTRACEOPT_STRSIZE]; uintptr_t s1 = regs[r1]; uintptr_t s2 = regs[r2]; size_t lim1, lim2; if (s1 != 0 && !dtrace_strcanload(s1, sz, &lim1, mstate, vstate)) break; if (s2 != 0 && !dtrace_strcanload(s2, sz, &lim2, mstate, vstate)) break; cc_r = dtrace_strncmp((char *)s1, (char *)s2, MIN(lim1, lim2)); cc_n = cc_r < 0; cc_z = cc_r == 0; cc_v = cc_c = 0; break; } case DIF_OP_LDGA: regs[rd] = dtrace_dif_variable(mstate, state, r1, regs[r2]); break; case DIF_OP_LDGS: id = DIF_INSTR_VAR(instr); if (id >= DIF_VAR_OTHER_UBASE) { uintptr_t a; id -= DIF_VAR_OTHER_UBASE; svar = vstate->dtvs_globals[id]; ASSERT(svar != NULL); v = &svar->dtsv_var; if (!(v->dtdv_type.dtdt_flags & DIF_TF_BYREF)) { regs[rd] = svar->dtsv_data; break; } a = (uintptr_t)svar->dtsv_data; if (*(uint8_t *)a == UINT8_MAX) { /* * If the 0th byte is set to UINT8_MAX * then this is to be treated as a * reference to a NULL variable. 
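 *
 * The encoding in play here: a by-ref variable's storage starts with a
 * uint64_t-sized header, and a first byte of UINT8_MAX in that header
 * marks a NULL reference.  A compact sketch of the decode/encode pair
 * (hypothetical helper names):
 */
#if 0
static uintptr_t
byref_decode(uintptr_t a)
{
	/* A first byte of UINT8_MAX encodes a NULL reference. */
	if (*(uint8_t *)a == UINT8_MAX)
		return (0);
	return (a + sizeof (uint64_t));	/* payload follows the header */
}

static void
byref_mark(uintptr_t a, int isnull)
{
	*(uint8_t *)a = isnull ? UINT8_MAX : 0;
}
#endif
/*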
*/ regs[rd] = 0; } else { regs[rd] = a + sizeof (uint64_t); } break; } regs[rd] = dtrace_dif_variable(mstate, state, id, 0); break; case DIF_OP_STGS: id = DIF_INSTR_VAR(instr); ASSERT(id >= DIF_VAR_OTHER_UBASE); id -= DIF_VAR_OTHER_UBASE; VERIFY(id < vstate->dtvs_nglobals); svar = vstate->dtvs_globals[id]; ASSERT(svar != NULL); v = &svar->dtsv_var; if (v->dtdv_type.dtdt_flags & DIF_TF_BYREF) { uintptr_t a = (uintptr_t)svar->dtsv_data; size_t lim; ASSERT(a != 0); ASSERT(svar->dtsv_size != 0); if (regs[rd] == 0) { *(uint8_t *)a = UINT8_MAX; break; } else { *(uint8_t *)a = 0; a += sizeof (uint64_t); } if (!dtrace_vcanload( (void *)(uintptr_t)regs[rd], &v->dtdv_type, &lim, mstate, vstate)) break; dtrace_vcopy((void *)(uintptr_t)regs[rd], (void *)a, &v->dtdv_type, lim); break; } svar->dtsv_data = regs[rd]; break; case DIF_OP_LDTA: /* * There are no DTrace built-in thread-local arrays at * present. This opcode is saved for future work. */ *flags |= CPU_DTRACE_ILLOP; regs[rd] = 0; break; case DIF_OP_LDLS: id = DIF_INSTR_VAR(instr); if (id < DIF_VAR_OTHER_UBASE) { /* * For now, this has no meaning. */ regs[rd] = 0; break; } id -= DIF_VAR_OTHER_UBASE; ASSERT(id < vstate->dtvs_nlocals); ASSERT(vstate->dtvs_locals != NULL); svar = vstate->dtvs_locals[id]; ASSERT(svar != NULL); v = &svar->dtsv_var; if (v->dtdv_type.dtdt_flags & DIF_TF_BYREF) { uintptr_t a = (uintptr_t)svar->dtsv_data; size_t sz = v->dtdv_type.dtdt_size; size_t lim; sz += sizeof (uint64_t); ASSERT(svar->dtsv_size == NCPU * sz); a += curcpu * sz; if (*(uint8_t *)a == UINT8_MAX) { /* * If the 0th byte is set to UINT8_MAX * then this is to be treated as a * reference to a NULL variable. */ regs[rd] = 0; } else { regs[rd] = a + sizeof (uint64_t); } break; } ASSERT(svar->dtsv_size == NCPU * sizeof (uint64_t)); tmp = (uint64_t *)(uintptr_t)svar->dtsv_data; regs[rd] = tmp[curcpu]; break; case DIF_OP_STLS: id = DIF_INSTR_VAR(instr); ASSERT(id >= DIF_VAR_OTHER_UBASE); id -= DIF_VAR_OTHER_UBASE; VERIFY(id < vstate->dtvs_nlocals); ASSERT(vstate->dtvs_locals != NULL); svar = vstate->dtvs_locals[id]; ASSERT(svar != NULL); v = &svar->dtsv_var; if (v->dtdv_type.dtdt_flags & DIF_TF_BYREF) { uintptr_t a = (uintptr_t)svar->dtsv_data; size_t sz = v->dtdv_type.dtdt_size; size_t lim; sz += sizeof (uint64_t); ASSERT(svar->dtsv_size == NCPU * sz); a += curcpu * sz; if (regs[rd] == 0) { *(uint8_t *)a = UINT8_MAX; break; } else { *(uint8_t *)a = 0; a += sizeof (uint64_t); } if (!dtrace_vcanload( (void *)(uintptr_t)regs[rd], &v->dtdv_type, &lim, mstate, vstate)) break; dtrace_vcopy((void *)(uintptr_t)regs[rd], (void *)a, &v->dtdv_type, lim); break; } ASSERT(svar->dtsv_size == NCPU * sizeof (uint64_t)); tmp = (uint64_t *)(uintptr_t)svar->dtsv_data; tmp[curcpu] = regs[rd]; break; case DIF_OP_LDTS: { dtrace_dynvar_t *dvar; dtrace_key_t *key; id = DIF_INSTR_VAR(instr); ASSERT(id >= DIF_VAR_OTHER_UBASE); id -= DIF_VAR_OTHER_UBASE; v = &vstate->dtvs_tlocals[id]; key = &tupregs[DIF_DTR_NREGS]; key[0].dttk_value = (uint64_t)id; key[0].dttk_size = 0; DTRACE_TLS_THRKEY(key[1].dttk_value); key[1].dttk_size = 0; dvar = dtrace_dynvar(dstate, 2, key, sizeof (uint64_t), DTRACE_DYNVAR_NOALLOC, mstate, vstate); if (dvar == NULL) { regs[rd] = 0; break; } if (v->dtdv_type.dtdt_flags & DIF_TF_BYREF) { regs[rd] = (uint64_t)(uintptr_t)dvar->dtdv_data; } else { regs[rd] = *((uint64_t *)dvar->dtdv_data); } break; } case DIF_OP_STTS: { dtrace_dynvar_t *dvar; dtrace_key_t *key; id = DIF_INSTR_VAR(instr); ASSERT(id >= DIF_VAR_OTHER_UBASE); id -= DIF_VAR_OTHER_UBASE; VERIFY(id < 
vstate->dtvs_ntlocals); key = &tupregs[DIF_DTR_NREGS]; key[0].dttk_value = (uint64_t)id; key[0].dttk_size = 0; DTRACE_TLS_THRKEY(key[1].dttk_value); key[1].dttk_size = 0; v = &vstate->dtvs_tlocals[id]; dvar = dtrace_dynvar(dstate, 2, key, v->dtdv_type.dtdt_size > sizeof (uint64_t) ? v->dtdv_type.dtdt_size : sizeof (uint64_t), regs[rd] ? DTRACE_DYNVAR_ALLOC : DTRACE_DYNVAR_DEALLOC, mstate, vstate); /* * Given that we're storing to thread-local data, * we need to flush our predicate cache. */ curthread->t_predcache = 0; if (dvar == NULL) break; if (v->dtdv_type.dtdt_flags & DIF_TF_BYREF) { size_t lim; if (!dtrace_vcanload( (void *)(uintptr_t)regs[rd], &v->dtdv_type, &lim, mstate, vstate)) break; dtrace_vcopy((void *)(uintptr_t)regs[rd], dvar->dtdv_data, &v->dtdv_type, lim); } else { *((uint64_t *)dvar->dtdv_data) = regs[rd]; } break; } case DIF_OP_SRA: regs[rd] = (int64_t)regs[r1] >> regs[r2]; break; case DIF_OP_CALL: dtrace_dif_subr(DIF_INSTR_SUBR(instr), rd, regs, tupregs, ttop, mstate, state); break; case DIF_OP_PUSHTR: if (ttop == DIF_DTR_NREGS) { *flags |= CPU_DTRACE_TUPOFLOW; break; } if (r1 == DIF_TYPE_STRING) { /* * If this is a string type and the size is 0, * we'll use the system-wide default string * size. Note that we are _not_ looking at * the value of the DTRACEOPT_STRSIZE option; * had this been set, we would expect to have * a non-zero size value in the "pushtr". */ tupregs[ttop].dttk_size = dtrace_strlen((char *)(uintptr_t)regs[rd], regs[r2] ? regs[r2] : dtrace_strsize_default) + 1; } else { if (regs[r2] > LONG_MAX) { *flags |= CPU_DTRACE_ILLOP; break; } tupregs[ttop].dttk_size = regs[r2]; } tupregs[ttop++].dttk_value = regs[rd]; break; case DIF_OP_PUSHTV: if (ttop == DIF_DTR_NREGS) { *flags |= CPU_DTRACE_TUPOFLOW; break; } tupregs[ttop].dttk_value = regs[rd]; tupregs[ttop++].dttk_size = 0; break; case DIF_OP_POPTS: if (ttop != 0) ttop--; break; case DIF_OP_FLUSHTS: ttop = 0; break; case DIF_OP_LDGAA: case DIF_OP_LDTAA: { dtrace_dynvar_t *dvar; dtrace_key_t *key = tupregs; uint_t nkeys = ttop; id = DIF_INSTR_VAR(instr); ASSERT(id >= DIF_VAR_OTHER_UBASE); id -= DIF_VAR_OTHER_UBASE; key[nkeys].dttk_value = (uint64_t)id; key[nkeys++].dttk_size = 0; if (DIF_INSTR_OP(instr) == DIF_OP_LDTAA) { DTRACE_TLS_THRKEY(key[nkeys].dttk_value); key[nkeys++].dttk_size = 0; VERIFY(id < vstate->dtvs_ntlocals); v = &vstate->dtvs_tlocals[id]; } else { VERIFY(id < vstate->dtvs_nglobals); v = &vstate->dtvs_globals[id]->dtsv_var; } dvar = dtrace_dynvar(dstate, nkeys, key, v->dtdv_type.dtdt_size > sizeof (uint64_t) ? v->dtdv_type.dtdt_size : sizeof (uint64_t), DTRACE_DYNVAR_NOALLOC, mstate, vstate); if (dvar == NULL) { regs[rd] = 0; break; } if (v->dtdv_type.dtdt_flags & DIF_TF_BYREF) { regs[rd] = (uint64_t)(uintptr_t)dvar->dtdv_data; } else { regs[rd] = *((uint64_t *)dvar->dtdv_data); } break; } case DIF_OP_STGAA: case DIF_OP_STTAA: { dtrace_dynvar_t *dvar; dtrace_key_t *key = tupregs; uint_t nkeys = ttop; id = DIF_INSTR_VAR(instr); ASSERT(id >= DIF_VAR_OTHER_UBASE); id -= DIF_VAR_OTHER_UBASE; key[nkeys].dttk_value = (uint64_t)id; key[nkeys++].dttk_size = 0; if (DIF_INSTR_OP(instr) == DIF_OP_STTAA) { DTRACE_TLS_THRKEY(key[nkeys].dttk_value); key[nkeys++].dttk_size = 0; VERIFY(id < vstate->dtvs_ntlocals); v = &vstate->dtvs_tlocals[id]; } else { VERIFY(id < vstate->dtvs_nglobals); v = &vstate->dtvs_globals[id]->dtsv_var; } dvar = dtrace_dynvar(dstate, nkeys, key, v->dtdv_type.dtdt_size > sizeof (uint64_t) ? v->dtdv_type.dtdt_size : sizeof (uint64_t), regs[rd] ? 
DTRACE_DYNVAR_ALLOC : DTRACE_DYNVAR_DEALLOC, mstate, vstate); if (dvar == NULL) break; if (v->dtdv_type.dtdt_flags & DIF_TF_BYREF) { size_t lim; if (!dtrace_vcanload( (void *)(uintptr_t)regs[rd], &v->dtdv_type, &lim, mstate, vstate)) break; dtrace_vcopy((void *)(uintptr_t)regs[rd], dvar->dtdv_data, &v->dtdv_type, lim); } else { *((uint64_t *)dvar->dtdv_data) = regs[rd]; } break; } case DIF_OP_ALLOCS: { uintptr_t ptr = P2ROUNDUP(mstate->dtms_scratch_ptr, 8); size_t size = ptr - mstate->dtms_scratch_ptr + regs[r1]; /* * Rounding up the user allocation size could have * overflowed large, bogus allocations (like -1ULL) to * 0. */ if (size < regs[r1] || !DTRACE_INSCRATCH(mstate, size)) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); regs[rd] = 0; break; } dtrace_bzero((void *) mstate->dtms_scratch_ptr, size); mstate->dtms_scratch_ptr += size; regs[rd] = ptr; break; } case DIF_OP_COPYS: if (!dtrace_canstore(regs[rd], regs[r2], mstate, vstate)) { *flags |= CPU_DTRACE_BADADDR; *illval = regs[rd]; break; } if (!dtrace_canload(regs[r1], regs[r2], mstate, vstate)) break; dtrace_bcopy((void *)(uintptr_t)regs[r1], (void *)(uintptr_t)regs[rd], (size_t)regs[r2]); break; case DIF_OP_STB: if (!dtrace_canstore(regs[rd], 1, mstate, vstate)) { *flags |= CPU_DTRACE_BADADDR; *illval = regs[rd]; break; } *((uint8_t *)(uintptr_t)regs[rd]) = (uint8_t)regs[r1]; break; case DIF_OP_STH: if (!dtrace_canstore(regs[rd], 2, mstate, vstate)) { *flags |= CPU_DTRACE_BADADDR; *illval = regs[rd]; break; } if (regs[rd] & 1) { *flags |= CPU_DTRACE_BADALIGN; *illval = regs[rd]; break; } *((uint16_t *)(uintptr_t)regs[rd]) = (uint16_t)regs[r1]; break; case DIF_OP_STW: if (!dtrace_canstore(regs[rd], 4, mstate, vstate)) { *flags |= CPU_DTRACE_BADADDR; *illval = regs[rd]; break; } if (regs[rd] & 3) { *flags |= CPU_DTRACE_BADALIGN; *illval = regs[rd]; break; } *((uint32_t *)(uintptr_t)regs[rd]) = (uint32_t)regs[r1]; break; case DIF_OP_STX: if (!dtrace_canstore(regs[rd], 8, mstate, vstate)) { *flags |= CPU_DTRACE_BADADDR; *illval = regs[rd]; break; } if (regs[rd] & 7) { *flags |= CPU_DTRACE_BADALIGN; *illval = regs[rd]; break; } *((uint64_t *)(uintptr_t)regs[rd]) = regs[r1]; break; } } if (!(*flags & CPU_DTRACE_FAULT)) return (rval); mstate->dtms_fltoffs = opc * sizeof (dif_instr_t); mstate->dtms_present |= DTRACE_MSTATE_FLTOFFS; return (0); } static void dtrace_action_breakpoint(dtrace_ecb_t *ecb) { dtrace_probe_t *probe = ecb->dte_probe; dtrace_provider_t *prov = probe->dtpr_provider; char c[DTRACE_FULLNAMELEN + 80], *str; char *msg = "dtrace: breakpoint action at probe "; char *ecbmsg = " (ecb "; uintptr_t mask = (0xf << (sizeof (uintptr_t) * NBBY / 4)); uintptr_t val = (uintptr_t)ecb; int shift = (sizeof (uintptr_t) * NBBY) - 4, i = 0; if (dtrace_destructive_disallow) return; /* * It's impossible to be taking action on the NULL probe. */ ASSERT(probe != NULL); /* * This is a poor man's (destitute man's?) sprintf(): we want to * print the provider name, module name, function name and name of * the probe, along with the hex address of the ECB with the breakpoint * action -- all of which we must place in the character buffer by * hand. 
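 *
 * The pointer-to-hex conversion below is the interesting part: it walks
 * the value a nibble at a time from the most significant end, suppressing
 * leading zeroes.  The same technique in isolation (hypothetical name;
 * note that a value of zero yields no digits, exactly as below):
 */
#if 0
static int
hex_by_hand(char *c, uintptr_t val)
{
	int shift, i = 0;

	for (shift = sizeof (uintptr_t) * NBBY - 4; shift >= 0; shift -= 4) {
		uintptr_t mask = (uintptr_t)0xf << shift;

		/* Start emitting once we are inside the value's width. */
		if (val >= ((uintptr_t)1 << shift))
			c[i++] = "0123456789abcdef"[(val & mask) >> shift];
	}
	c[i] = '\0';
	return (i);
}
#endif
/*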
*/ while (*msg != '\0') c[i++] = *msg++; for (str = prov->dtpv_name; *str != '\0'; str++) c[i++] = *str; c[i++] = ':'; for (str = probe->dtpr_mod; *str != '\0'; str++) c[i++] = *str; c[i++] = ':'; for (str = probe->dtpr_func; *str != '\0'; str++) c[i++] = *str; c[i++] = ':'; for (str = probe->dtpr_name; *str != '\0'; str++) c[i++] = *str; while (*ecbmsg != '\0') c[i++] = *ecbmsg++; while (shift >= 0) { mask = (uintptr_t)0xf << shift; if (val >= ((uintptr_t)1 << shift)) c[i++] = "0123456789abcdef"[(val & mask) >> shift]; shift -= 4; } c[i++] = ')'; c[i] = '\0'; #ifdef illumos debug_enter(c); #else kdb_enter(KDB_WHY_DTRACE, "breakpoint action"); #endif } static void dtrace_action_panic(dtrace_ecb_t *ecb) { dtrace_probe_t *probe = ecb->dte_probe; /* * It's impossible to be taking action on the NULL probe. */ ASSERT(probe != NULL); if (dtrace_destructive_disallow) return; if (dtrace_panicked != NULL) return; if (dtrace_casptr(&dtrace_panicked, NULL, curthread) != NULL) return; /* * We won the right to panic. (We want to be sure that only one * thread calls panic() from dtrace_probe(), and that panic() is * called exactly once.) */ dtrace_panic("dtrace: panic action at probe %s:%s:%s:%s (ecb %p)", probe->dtpr_provider->dtpv_name, probe->dtpr_mod, probe->dtpr_func, probe->dtpr_name, (void *)ecb); } static void dtrace_action_raise(uint64_t sig) { if (dtrace_destructive_disallow) return; if (sig >= NSIG) { DTRACE_CPUFLAG_SET(CPU_DTRACE_ILLOP); return; } #ifdef illumos /* * raise() has a queue depth of 1 -- we ignore all subsequent * invocations of the raise() action. */ if (curthread->t_dtrace_sig == 0) curthread->t_dtrace_sig = (uint8_t)sig; curthread->t_sig_check = 1; aston(curthread); #else struct proc *p = curproc; PROC_LOCK(p); kern_psignal(p, sig); PROC_UNLOCK(p); #endif } static void dtrace_action_stop(void) { if (dtrace_destructive_disallow) return; #ifdef illumos if (!curthread->t_dtrace_stop) { curthread->t_dtrace_stop = 1; curthread->t_sig_check = 1; aston(curthread); } #else struct proc *p = curproc; PROC_LOCK(p); kern_psignal(p, SIGSTOP); PROC_UNLOCK(p); #endif } static void dtrace_action_chill(dtrace_mstate_t *mstate, hrtime_t val) { hrtime_t now; volatile uint16_t *flags; #ifdef illumos cpu_t *cpu = CPU; #else cpu_t *cpu = &solaris_cpu[curcpu]; #endif if (dtrace_destructive_disallow) return; flags = (volatile uint16_t *)&cpu_core[curcpu].cpuc_dtrace_flags; now = dtrace_gethrtime(); if (now - cpu->cpu_dtrace_chillmark > dtrace_chill_interval) { /* * We need to advance the mark to the current time. */ cpu->cpu_dtrace_chillmark = now; cpu->cpu_dtrace_chilled = 0; } /* * Now check to see if the requested chill time would take us over * the maximum amount of time allowed in the chill interval. (Or * worse, if the calculation itself induces overflow.) */ if (cpu->cpu_dtrace_chilled + val > dtrace_chill_max || cpu->cpu_dtrace_chilled + val < cpu->cpu_dtrace_chilled) { *flags |= CPU_DTRACE_ILLOP; return; } while (dtrace_gethrtime() - now < val) continue; /* * Normally, we assure that the value of the variable "timestamp" does * not change within an ECB. The presence of chill() represents an * exception to this rule, however. 
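 *
 * The budget check above is worth calling out: one comparison pair
 * rejects both "over the per-interval maximum" and "the addition itself
 * wrapped".  The same idiom in isolation (hypothetical name):
 */
#if 0
static int
chill_budget_ok(hrtime_t spent, hrtime_t request, hrtime_t max)
{
	/* sum < spent can only happen if spent + request overflowed */
	if (spent + request > max || spent + request < spent)
		return (0);
	return (1);
}
#endif
/*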
*/ mstate->dtms_present &= ~DTRACE_MSTATE_TIMESTAMP; cpu->cpu_dtrace_chilled += val; } static void dtrace_action_ustack(dtrace_mstate_t *mstate, dtrace_state_t *state, uint64_t *buf, uint64_t arg) { int nframes = DTRACE_USTACK_NFRAMES(arg); int strsize = DTRACE_USTACK_STRSIZE(arg); uint64_t *pcs = &buf[1], *fps; char *str = (char *)&pcs[nframes]; int size, offs = 0, i, j; size_t rem; uintptr_t old = mstate->dtms_scratch_ptr, saved; uint16_t *flags = &cpu_core[curcpu].cpuc_dtrace_flags; char *sym; /* * Should be taking a faster path if string space has not been * allocated. */ ASSERT(strsize != 0); /* * We will first allocate some temporary space for the frame pointers. */ fps = (uint64_t *)P2ROUNDUP(mstate->dtms_scratch_ptr, 8); size = (uintptr_t)fps - mstate->dtms_scratch_ptr + (nframes * sizeof (uint64_t)); if (!DTRACE_INSCRATCH(mstate, size)) { /* * Not enough room for our frame pointers -- need to indicate * that we ran out of scratch space. */ DTRACE_CPUFLAG_SET(CPU_DTRACE_NOSCRATCH); return; } mstate->dtms_scratch_ptr += size; saved = mstate->dtms_scratch_ptr; /* * Now get a stack with both program counters and frame pointers. */ DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); dtrace_getufpstack(buf, fps, nframes + 1); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); /* * If that faulted, we're cooked. */ if (*flags & CPU_DTRACE_FAULT) goto out; /* * Now we want to walk up the stack, calling the USTACK helper. For * each iteration, we restore the scratch pointer. */ for (i = 0; i < nframes; i++) { mstate->dtms_scratch_ptr = saved; if (offs >= strsize) break; sym = (char *)(uintptr_t)dtrace_helper( DTRACE_HELPER_ACTION_USTACK, mstate, state, pcs[i], fps[i]); /* * If we faulted while running the helper, we're going to * clear the fault and null out the corresponding string. */ if (*flags & CPU_DTRACE_FAULT) { *flags &= ~CPU_DTRACE_FAULT; str[offs++] = '\0'; continue; } if (sym == NULL) { str[offs++] = '\0'; continue; } if (!dtrace_strcanload((uintptr_t)sym, strsize, &rem, mstate, &(state->dts_vstate))) { str[offs++] = '\0'; continue; } DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); /* * Now copy in the string that the helper returned to us. */ for (j = 0; offs + j < strsize && j < rem; j++) { if ((str[offs + j] = sym[j]) == '\0') break; } DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); offs += j + 1; } if (offs >= strsize) { /* * If we didn't have room for all of the strings, we don't * abort processing -- this needn't be a fatal error -- but we * still want to increment a counter (dts_stkstroverflows) to * allow this condition to be warned about. (If this is from * a jstack() action, it is easily tuned via jstackstrsize.) */ dtrace_error(&state->dts_stkstroverflows); } while (offs < strsize) str[offs++] = '\0'; out: mstate->dtms_scratch_ptr = old; } static void dtrace_store_by_ref(dtrace_difo_t *dp, caddr_t tomax, size_t size, size_t *valoffsp, uint64_t *valp, uint64_t end, int intuple, int dtkind) { volatile uint16_t *flags; uint64_t val = *valp; size_t valoffs = *valoffsp; flags = (volatile uint16_t *)&cpu_core[curcpu].cpuc_dtrace_flags; ASSERT(dtkind == DIF_TF_BYREF || dtkind == DIF_TF_BYUREF); /* * If this is a string, we're going to only load until we find the zero * byte -- after which we'll store zero bytes. 
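 *
 * For the string case that means: copy bytes until the terminator is
 * seen, then keep storing NUL bytes out to the record size.  A minimal
 * analogue of that loop (hypothetical name):
 */
#if 0
static void
copy_string_padded(char *dst, const char *src, size_t size)
{
	char c = '\0' + 1;	/* any non-NUL value primes the loop */
	size_t s;

	for (s = 0; s < size; s++) {
		if (c != '\0')
			c = *src++;
		dst[s] = c;	/* stores NULs once the terminator is hit */
	}
}
#endif
/*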
*/ if (dp->dtdo_rtype.dtdt_kind == DIF_TYPE_STRING) { char c = '\0' + 1; size_t s; for (s = 0; s < size; s++) { if (c != '\0' && dtkind == DIF_TF_BYREF) { c = dtrace_load8(val++); } else if (c != '\0' && dtkind == DIF_TF_BYUREF) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); c = dtrace_fuword8((void *)(uintptr_t)val++); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); if (*flags & CPU_DTRACE_FAULT) break; } DTRACE_STORE(uint8_t, tomax, valoffs++, c); if (c == '\0' && intuple) break; } } else { uint8_t c; while (valoffs < end) { if (dtkind == DIF_TF_BYREF) { c = dtrace_load8(val++); } else if (dtkind == DIF_TF_BYUREF) { DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); c = dtrace_fuword8((void *)(uintptr_t)val++); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); if (*flags & CPU_DTRACE_FAULT) break; } DTRACE_STORE(uint8_t, tomax, valoffs++, c); } } *valp = val; *valoffsp = valoffs; } /* + * Disables interrupts and sets the per-thread inprobe flag. When DEBUG is + * defined, we also assert that we are not recursing unless the probe ID is an + * error probe. + */ +static dtrace_icookie_t +dtrace_probe_enter(dtrace_id_t id) +{ + dtrace_icookie_t cookie; + + cookie = dtrace_interrupt_disable(); + + /* + * Unless this is an ERROR probe, we are not allowed to recurse in + * dtrace_probe(). Recursing into a DTrace probe usually means that a + * function is instrumented that should not have been instrumented or + * that the ordering guarantee of the records will be violated, + * resulting in unexpected output. If there is an exception to this + * assertion, a new case should be added. + */ + ASSERT(curthread->t_dtrace_inprobe == 0 || + id == dtrace_probeid_error); + curthread->t_dtrace_inprobe = 1; + + return (cookie); +} + +/* + * Clears the per-thread inprobe flag and enables interrupts. + */ +static void +dtrace_probe_exit(dtrace_icookie_t cookie) +{ + + curthread->t_dtrace_inprobe = 0; + dtrace_interrupt_enable(cookie); +} + +/* * If you're looking for the epicenter of DTrace, you just found it. This * is the function called by the provider to fire a probe -- from which all * subsequent probe-context DTrace activity emanates. */ void dtrace_probe(dtrace_id_t id, uintptr_t arg0, uintptr_t arg1, uintptr_t arg2, uintptr_t arg3, uintptr_t arg4) { processorid_t cpuid; dtrace_icookie_t cookie; dtrace_probe_t *probe; dtrace_mstate_t mstate; dtrace_ecb_t *ecb; dtrace_action_t *act; intptr_t offs; size_t size; int vtime, onintr; volatile uint16_t *flags; hrtime_t now; if (panicstr != NULL) return; #ifdef illumos /* * Kick out immediately if this CPU is still being born (in which case * curthread will be set to -1) or the current thread can't allow * probes in its current context. */ if (((uintptr_t)curthread & 1) || (curthread->t_flag & T_DONTDTRACE)) return; #endif - cookie = dtrace_interrupt_disable(); + cookie = dtrace_probe_enter(id); probe = dtrace_probes[id - 1]; cpuid = curcpu; onintr = CPU_ON_INTR(CPU); if (!onintr && probe->dtpr_predcache != DTRACE_CACHEIDNONE && probe->dtpr_predcache == curthread->t_predcache) { /* * We have hit in the predicate cache; we know that * this predicate would evaluate to false. */ - dtrace_interrupt_enable(cookie); + dtrace_probe_exit(cookie); return; } #ifdef illumos if (panic_quiesce) { #else if (panicstr != NULL) { #endif /* * We don't trace anything if we're panicking.
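 *
 * Note how dtrace_probe_enter() and dtrace_probe_exit() above bracket
 * this function: every early return below must unwind through the same
 * pair so that interrupts are re-enabled and the inprobe flag is
 * cleared.  In outline:
 */
#if 0
void
probe_outline(dtrace_id_t id)
{
	dtrace_icookie_t cookie;

	cookie = dtrace_probe_enter(id);	/* cli + set inprobe */
	/* ... probe processing; every bail-out path drops through ... */
	dtrace_probe_exit(cookie);		/* clear inprobe + sti */
}
#endif
/*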
*/ - dtrace_interrupt_enable(cookie); + dtrace_probe_exit(cookie); return; } now = mstate.dtms_timestamp = dtrace_gethrtime(); mstate.dtms_present = DTRACE_MSTATE_TIMESTAMP; vtime = dtrace_vtime_references != 0; if (vtime && curthread->t_dtrace_start) curthread->t_dtrace_vtime += now - curthread->t_dtrace_start; mstate.dtms_difo = NULL; mstate.dtms_probe = probe; mstate.dtms_strtok = 0; mstate.dtms_arg[0] = arg0; mstate.dtms_arg[1] = arg1; mstate.dtms_arg[2] = arg2; mstate.dtms_arg[3] = arg3; mstate.dtms_arg[4] = arg4; flags = (volatile uint16_t *)&cpu_core[cpuid].cpuc_dtrace_flags; for (ecb = probe->dtpr_ecb; ecb != NULL; ecb = ecb->dte_next) { dtrace_predicate_t *pred = ecb->dte_predicate; dtrace_state_t *state = ecb->dte_state; dtrace_buffer_t *buf = &state->dts_buffer[cpuid]; dtrace_buffer_t *aggbuf = &state->dts_aggbuffer[cpuid]; dtrace_vstate_t *vstate = &state->dts_vstate; dtrace_provider_t *prov = probe->dtpr_provider; uint64_t tracememsize = 0; int committed = 0; caddr_t tomax; /* * A little subtlety with the following (seemingly innocuous) * declaration of the automatic 'val': by looking at the * code, you might think that it could be declared in the * action processing loop, below. (That is, it's only used in * the action processing loop.) However, it must be declared * out of that scope because in the case of DIF expression * arguments to aggregating actions, one iteration of the * action loop will use the last iteration's value. */ uint64_t val = 0; mstate.dtms_present = DTRACE_MSTATE_ARGS | DTRACE_MSTATE_PROBE; mstate.dtms_getf = NULL; *flags &= ~CPU_DTRACE_ERROR; if (prov == dtrace_provider) { /* * If dtrace itself is the provider of this probe, * we're only going to continue processing the ECB if * arg0 (the dtrace_state_t) is equal to the ECB's * creating state. (This prevents disjoint consumers * from seeing one another's metaprobes.) */ if (arg0 != (uint64_t)(uintptr_t)state) continue; } if (state->dts_activity != DTRACE_ACTIVITY_ACTIVE) { /* * We're not currently active. If our provider isn't * the dtrace pseudo provider, we're not interested. */ if (prov != dtrace_provider) continue; /* * Now we must further check if we are in the BEGIN * probe. If we are, we will only continue processing * if we're still in WARMUP -- if one BEGIN enabling * has invoked the exit() action, we don't want to * evaluate subsequent BEGIN enablings. */ if (probe->dtpr_id == dtrace_probeid_begin && state->dts_activity != DTRACE_ACTIVITY_WARMUP) { ASSERT(state->dts_activity == DTRACE_ACTIVITY_DRAINING); continue; } } if (ecb->dte_cond) { /* * If the dte_cond bits indicate that this * consumer is only allowed to see user-mode firings * of this probe, call the provider's dtps_usermode() * entry point to check that the probe was fired * while in a user context. Skip this ECB if that's * not the case. */ if ((ecb->dte_cond & DTRACE_COND_USERMODE) && prov->dtpv_pops.dtps_usermode(prov->dtpv_arg, probe->dtpr_id, probe->dtpr_arg) == 0) continue; #ifdef illumos /* * This is more subtle than it looks. We have to be * absolutely certain that CRED() isn't going to * change out from under us so it's only legit to * examine that structure if we're in constrained * situations. Currently, the only time we'll perform * this check is if a non-super-user has enabled the * profile or syscall providers -- providers that * allow visibility of all processes. For the * profile case, the check above will ensure that * we're examining a user context.
*/ if (ecb->dte_cond & DTRACE_COND_OWNER) { cred_t *cr; cred_t *s_cr = ecb->dte_state->dts_cred.dcr_cred; proc_t *proc; ASSERT(s_cr != NULL); if ((cr = CRED()) == NULL || s_cr->cr_uid != cr->cr_uid || s_cr->cr_uid != cr->cr_ruid || s_cr->cr_uid != cr->cr_suid || s_cr->cr_gid != cr->cr_gid || s_cr->cr_gid != cr->cr_rgid || s_cr->cr_gid != cr->cr_sgid || (proc = ttoproc(curthread)) == NULL || (proc->p_flag & SNOCD)) continue; } if (ecb->dte_cond & DTRACE_COND_ZONEOWNER) { cred_t *cr; cred_t *s_cr = ecb->dte_state->dts_cred.dcr_cred; ASSERT(s_cr != NULL); if ((cr = CRED()) == NULL || s_cr->cr_zone->zone_id != cr->cr_zone->zone_id) continue; } #endif } if (now - state->dts_alive > dtrace_deadman_timeout) { /* * We seem to be dead. Unless we (a) have kernel * destructive permissions (b) have explicitly enabled * destructive actions and (c) destructive actions have * not been disabled, we're going to transition into * the KILLED state, from which no further processing * on this state will be performed. */ if (!dtrace_priv_kernel_destructive(state) || !state->dts_cred.dcr_destructive || dtrace_destructive_disallow) { void *activity = &state->dts_activity; dtrace_activity_t current; do { current = state->dts_activity; } while (dtrace_cas32(activity, current, DTRACE_ACTIVITY_KILLED) != current); continue; } } if ((offs = dtrace_buffer_reserve(buf, ecb->dte_needed, ecb->dte_alignment, state, &mstate)) < 0) continue; tomax = buf->dtb_tomax; ASSERT(tomax != NULL); if (ecb->dte_size != 0) { dtrace_rechdr_t dtrh; if (!(mstate.dtms_present & DTRACE_MSTATE_TIMESTAMP)) { mstate.dtms_timestamp = dtrace_gethrtime(); mstate.dtms_present |= DTRACE_MSTATE_TIMESTAMP; } ASSERT3U(ecb->dte_size, >=, sizeof (dtrace_rechdr_t)); dtrh.dtrh_epid = ecb->dte_epid; DTRACE_RECORD_STORE_TIMESTAMP(&dtrh, mstate.dtms_timestamp); *((dtrace_rechdr_t *)(tomax + offs)) = dtrh; } mstate.dtms_epid = ecb->dte_epid; mstate.dtms_present |= DTRACE_MSTATE_EPID; if (state->dts_cred.dcr_visible & DTRACE_CRV_KERNEL) mstate.dtms_access = DTRACE_ACCESS_KERNEL; else mstate.dtms_access = 0; if (pred != NULL) { dtrace_difo_t *dp = pred->dtp_difo; uint64_t rval; rval = dtrace_dif_emulate(dp, &mstate, vstate, state); if (!(*flags & CPU_DTRACE_ERROR) && !rval) { dtrace_cacheid_t cid = probe->dtpr_predcache; if (cid != DTRACE_CACHEIDNONE && !onintr) { /* * Update the predicate cache... */ ASSERT(cid == pred->dtp_cacheid); curthread->t_predcache = cid; } continue; } } for (act = ecb->dte_action; !(*flags & CPU_DTRACE_ERROR) && act != NULL; act = act->dta_next) { size_t valoffs; dtrace_difo_t *dp; dtrace_recdesc_t *rec = &act->dta_rec; size = rec->dtrd_size; valoffs = offs + rec->dtrd_offset; if (DTRACEACT_ISAGG(act->dta_kind)) { uint64_t v = 0xbad; dtrace_aggregation_t *agg; agg = (dtrace_aggregation_t *)act; if ((dp = act->dta_difo) != NULL) v = dtrace_dif_emulate(dp, &mstate, vstate, state); if (*flags & CPU_DTRACE_ERROR) continue; /* * Note that we always pass the expression * value from the previous iteration of the * action loop. This value will only be used * if there is an expression argument to the * aggregating action, denoted by the * dtag_hasarg field. 
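 *
 * Concretely, for an aggregation that takes a separate increment
 * expression -- e.g. something like @ = quantize(x, w) -- the DIF for w
 * runs as the preceding iteration of this action loop, leaving its
 * result in 'val' for the aggregating action to consume.  In outline
 * (hypothetical helper names):
 */
#if 0
static void
action_loop_outline(dtrace_action_t *first)
{
	dtrace_action_t *act;
	uint64_t val = 0;

	for (act = first; act != NULL; act = act->dta_next) {
		if (DTRACEACT_ISAGG(act->dta_kind)) {
			toy_aggregate(act, val);	/* prior result */
			continue;
		}
		val = toy_emulate_action(act);	/* may feed an aggregation */
	}
}
#endif
/*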
*/ dtrace_aggregate(agg, buf, offs, aggbuf, v, val); continue; } switch (act->dta_kind) { case DTRACEACT_STOP: if (dtrace_priv_proc_destructive(state)) dtrace_action_stop(); continue; case DTRACEACT_BREAKPOINT: if (dtrace_priv_kernel_destructive(state)) dtrace_action_breakpoint(ecb); continue; case DTRACEACT_PANIC: if (dtrace_priv_kernel_destructive(state)) dtrace_action_panic(ecb); continue; case DTRACEACT_STACK: if (!dtrace_priv_kernel(state)) continue; dtrace_getpcstack((pc_t *)(tomax + valoffs), size / sizeof (pc_t), probe->dtpr_aframes, DTRACE_ANCHORED(probe) ? NULL : (uint32_t *)arg0); continue; case DTRACEACT_JSTACK: case DTRACEACT_USTACK: if (!dtrace_priv_proc(state)) continue; /* * See comment in DIF_VAR_PID. */ if (DTRACE_ANCHORED(mstate.dtms_probe) && CPU_ON_INTR(CPU)) { int depth = DTRACE_USTACK_NFRAMES( rec->dtrd_arg) + 1; dtrace_bzero((void *)(tomax + valoffs), DTRACE_USTACK_STRSIZE(rec->dtrd_arg) + depth * sizeof (uint64_t)); continue; } if (DTRACE_USTACK_STRSIZE(rec->dtrd_arg) != 0 && curproc->p_dtrace_helpers != NULL) { /* * This is the slow path -- we have * allocated string space, and we're * getting the stack of a process that * has helpers. Call into a separate * routine to perform this processing. */ dtrace_action_ustack(&mstate, state, (uint64_t *)(tomax + valoffs), rec->dtrd_arg); continue; } DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT); dtrace_getupcstack((uint64_t *) (tomax + valoffs), DTRACE_USTACK_NFRAMES(rec->dtrd_arg) + 1); DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT); continue; default: break; } dp = act->dta_difo; ASSERT(dp != NULL); val = dtrace_dif_emulate(dp, &mstate, vstate, state); if (*flags & CPU_DTRACE_ERROR) continue; switch (act->dta_kind) { case DTRACEACT_SPECULATE: { dtrace_rechdr_t *dtrh; ASSERT(buf == &state->dts_buffer[cpuid]); buf = dtrace_speculation_buffer(state, cpuid, val); if (buf == NULL) { *flags |= CPU_DTRACE_DROP; continue; } offs = dtrace_buffer_reserve(buf, ecb->dte_needed, ecb->dte_alignment, state, NULL); if (offs < 0) { *flags |= CPU_DTRACE_DROP; continue; } tomax = buf->dtb_tomax; ASSERT(tomax != NULL); if (ecb->dte_size == 0) continue; ASSERT3U(ecb->dte_size, >=, sizeof (dtrace_rechdr_t)); dtrh = ((void *)(tomax + offs)); dtrh->dtrh_epid = ecb->dte_epid; /* * When the speculation is committed, all of * the records in the speculative buffer will * have their timestamps set to the commit * time. Until then, it is set to a sentinel * value, for debuggability. */ DTRACE_RECORD_STORE_TIMESTAMP(dtrh, UINT64_MAX); continue; } case DTRACEACT_PRINTM: { /* The DIF returns a 'memref'. */ uintptr_t *memref = (uintptr_t *)(uintptr_t) val; /* Get the size from the memref. */ size = memref[1]; /* * Check if the size exceeds the allocated * buffer size. */ if (size + sizeof(uintptr_t) > dp->dtdo_rtype.dtdt_size) { /* Flag a drop! */ *flags |= CPU_DTRACE_DROP; continue; } /* Store the size in the buffer first. */ DTRACE_STORE(uintptr_t, tomax, valoffs, size); /* * Offset the buffer address to the start * of the data. */ valoffs += sizeof(uintptr_t); /* * Reset to the memory address rather than * the memref array, then let the BYREF * code below do the work to store the * memory data in the buffer. */ val = memref[0]; break; } case DTRACEACT_CHILL: if (dtrace_priv_kernel_destructive(state)) dtrace_action_chill(&mstate, val); continue; case DTRACEACT_RAISE: if (dtrace_priv_proc_destructive(state)) dtrace_action_raise(val); continue; case DTRACEACT_COMMIT: ASSERT(!committed); /* * We need to commit our buffer state.
*/ if (ecb->dte_size) buf->dtb_offset = offs + ecb->dte_size; buf = &state->dts_buffer[cpuid]; dtrace_speculation_commit(state, cpuid, val); committed = 1; continue; case DTRACEACT_DISCARD: dtrace_speculation_discard(state, cpuid, val); continue; case DTRACEACT_DIFEXPR: case DTRACEACT_LIBACT: case DTRACEACT_PRINTF: case DTRACEACT_PRINTA: case DTRACEACT_SYSTEM: case DTRACEACT_FREOPEN: case DTRACEACT_TRACEMEM: break; case DTRACEACT_TRACEMEM_DYNSIZE: tracememsize = val; break; case DTRACEACT_SYM: case DTRACEACT_MOD: if (!dtrace_priv_kernel(state)) continue; break; case DTRACEACT_USYM: case DTRACEACT_UMOD: case DTRACEACT_UADDR: { #ifdef illumos struct pid *pid = curthread->t_procp->p_pidp; #endif if (!dtrace_priv_proc(state)) continue; DTRACE_STORE(uint64_t, tomax, #ifdef illumos valoffs, (uint64_t)pid->pid_id); #else valoffs, (uint64_t) curproc->p_pid); #endif DTRACE_STORE(uint64_t, tomax, valoffs + sizeof (uint64_t), val); continue; } case DTRACEACT_EXIT: { /* * For the exit action, we are going to attempt * to atomically set our activity to be * draining. If this fails (either because * another CPU has beat us to the exit action, * or because our current activity is something * other than ACTIVE or WARMUP), we will * continue. This assures that the exit action * can be successfully recorded at most once * when we're in the ACTIVE state. If we're * encountering the exit() action while in * COOLDOWN, however, we want to honor the new * status code. (We know that we're the only * thread in COOLDOWN, so there is no race.) */ void *activity = &state->dts_activity; dtrace_activity_t current = state->dts_activity; if (current == DTRACE_ACTIVITY_COOLDOWN) break; if (current != DTRACE_ACTIVITY_WARMUP) current = DTRACE_ACTIVITY_ACTIVE; if (dtrace_cas32(activity, current, DTRACE_ACTIVITY_DRAINING) != current) { *flags |= CPU_DTRACE_DROP; continue; } break; } default: ASSERT(0); } if (dp->dtdo_rtype.dtdt_flags & DIF_TF_BYREF || dp->dtdo_rtype.dtdt_flags & DIF_TF_BYUREF) { uintptr_t end = valoffs + size; if (tracememsize != 0 && valoffs + tracememsize < end) { end = valoffs + tracememsize; tracememsize = 0; } if (dp->dtdo_rtype.dtdt_flags & DIF_TF_BYREF && !dtrace_vcanload((void *)(uintptr_t)val, &dp->dtdo_rtype, NULL, &mstate, vstate)) continue; dtrace_store_by_ref(dp, tomax, size, &valoffs, &val, end, act->dta_intuple, dp->dtdo_rtype.dtdt_flags & DIF_TF_BYREF ? DIF_TF_BYREF: DIF_TF_BYUREF); continue; } switch (size) { case 0: break; case sizeof (uint8_t): DTRACE_STORE(uint8_t, tomax, valoffs, val); break; case sizeof (uint16_t): DTRACE_STORE(uint16_t, tomax, valoffs, val); break; case sizeof (uint32_t): DTRACE_STORE(uint32_t, tomax, valoffs, val); break; case sizeof (uint64_t): DTRACE_STORE(uint64_t, tomax, valoffs, val); break; default: /* * Any other size should have been returned by * reference, not by value. */ ASSERT(0); break; } } if (*flags & CPU_DTRACE_DROP) continue; if (*flags & CPU_DTRACE_FAULT) { int ndx; dtrace_action_t *err; buf->dtb_errors++; if (probe->dtpr_id == dtrace_probeid_error) { /* * There's nothing we can do -- we had an * error on the error probe. We bump an * error counter to at least indicate that * this condition happened. */ dtrace_error(&state->dts_dblerrors); continue; } if (vtime) { /* * Before recursing on dtrace_probe(), we * need to explicitly clear out our start * time to prevent it from being accumulated * into t_dtrace_vtime. 
*/ curthread->t_dtrace_start = 0; } /* * Iterate over the actions to figure out which action * we were processing when we experienced the error. * Note that act points _past_ the faulting action; if * act is ecb->dte_action, the fault was in the * predicate, if it's ecb->dte_action->dta_next it's * in action #1, and so on. */ for (err = ecb->dte_action, ndx = 0; err != act; err = err->dta_next, ndx++) continue; dtrace_probe_error(state, ecb->dte_epid, ndx, (mstate.dtms_present & DTRACE_MSTATE_FLTOFFS) ? mstate.dtms_fltoffs : -1, DTRACE_FLAGS2FLT(*flags), cpu_core[cpuid].cpuc_dtrace_illval); continue; } if (!committed) buf->dtb_offset = offs + ecb->dte_size; } if (vtime) curthread->t_dtrace_start = dtrace_gethrtime(); - dtrace_interrupt_enable(cookie); + dtrace_probe_exit(cookie); } /* * DTrace Probe Hashing Functions * * The functions in this section (and indeed, the functions in remaining * sections) are not _called_ from probe context. (Any exceptions to this are * marked with a "Note:".) Rather, they are called from elsewhere in the * DTrace framework to look-up probes in, add probes to and remove probes from * the DTrace probe hashes. (Each probe is hashed by each element of the * probe tuple -- allowing for fast lookups, regardless of what was * specified.) */ static uint_t dtrace_hash_str(const char *p) { unsigned int g; uint_t hval = 0; while (*p) { hval = (hval << 4) + *p++; if ((g = (hval & 0xf0000000)) != 0) hval ^= g >> 24; hval &= ~g; } return (hval); } static dtrace_hash_t * dtrace_hash_create(uintptr_t stroffs, uintptr_t nextoffs, uintptr_t prevoffs) { dtrace_hash_t *hash = kmem_zalloc(sizeof (dtrace_hash_t), KM_SLEEP); hash->dth_stroffs = stroffs; hash->dth_nextoffs = nextoffs; hash->dth_prevoffs = prevoffs; hash->dth_size = 1; hash->dth_mask = hash->dth_size - 1; hash->dth_tab = kmem_zalloc(hash->dth_size * sizeof (dtrace_hashbucket_t *), KM_SLEEP); return (hash); } static void dtrace_hash_destroy(dtrace_hash_t *hash) { #ifdef DEBUG int i; for (i = 0; i < hash->dth_size; i++) ASSERT(hash->dth_tab[i] == NULL); #endif kmem_free(hash->dth_tab, hash->dth_size * sizeof (dtrace_hashbucket_t *)); kmem_free(hash, sizeof (dtrace_hash_t)); } static void dtrace_hash_resize(dtrace_hash_t *hash) { int size = hash->dth_size, i, ndx; int new_size = hash->dth_size << 1; int new_mask = new_size - 1; dtrace_hashbucket_t **new_tab, *bucket, *next; ASSERT((new_size & new_mask) == 0); new_tab = kmem_zalloc(new_size * sizeof (void *), KM_SLEEP); for (i = 0; i < size; i++) { for (bucket = hash->dth_tab[i]; bucket != NULL; bucket = next) { dtrace_probe_t *probe = bucket->dthb_chain; ASSERT(probe != NULL); ndx = DTRACE_HASHSTR(hash, probe) & new_mask; next = bucket->dthb_next; bucket->dthb_next = new_tab[ndx]; new_tab[ndx] = bucket; } } kmem_free(hash->dth_tab, hash->dth_size * sizeof (void *)); hash->dth_tab = new_tab; hash->dth_size = new_size; hash->dth_mask = new_mask; } static void dtrace_hash_add(dtrace_hash_t *hash, dtrace_probe_t *new) { int hashval = DTRACE_HASHSTR(hash, new); int ndx = hashval & hash->dth_mask; dtrace_hashbucket_t *bucket = hash->dth_tab[ndx]; dtrace_probe_t **nextp, **prevp; for (; bucket != NULL; bucket = bucket->dthb_next) { if (DTRACE_HASHEQ(hash, bucket->dthb_chain, new)) goto add; } if ((hash->dth_nbuckets >> 1) > hash->dth_size) { dtrace_hash_resize(hash); dtrace_hash_add(hash, new); return; } bucket = kmem_zalloc(sizeof (dtrace_hashbucket_t), KM_SLEEP); bucket->dthb_next = hash->dth_tab[ndx]; hash->dth_tab[ndx] = bucket; hash->dth_nbuckets++; add: nextp = 
DTRACE_HASHNEXT(hash, new); ASSERT(*nextp == NULL && *(DTRACE_HASHPREV(hash, new)) == NULL); *nextp = bucket->dthb_chain; if (bucket->dthb_chain != NULL) { prevp = DTRACE_HASHPREV(hash, bucket->dthb_chain); ASSERT(*prevp == NULL); *prevp = new; } bucket->dthb_chain = new; bucket->dthb_len++; } static dtrace_probe_t * dtrace_hash_lookup(dtrace_hash_t *hash, dtrace_probe_t *template) { int hashval = DTRACE_HASHSTR(hash, template); int ndx = hashval & hash->dth_mask; dtrace_hashbucket_t *bucket = hash->dth_tab[ndx]; for (; bucket != NULL; bucket = bucket->dthb_next) { if (DTRACE_HASHEQ(hash, bucket->dthb_chain, template)) return (bucket->dthb_chain); } return (NULL); } static int dtrace_hash_collisions(dtrace_hash_t *hash, dtrace_probe_t *template) { int hashval = DTRACE_HASHSTR(hash, template); int ndx = hashval & hash->dth_mask; dtrace_hashbucket_t *bucket = hash->dth_tab[ndx]; for (; bucket != NULL; bucket = bucket->dthb_next) { if (DTRACE_HASHEQ(hash, bucket->dthb_chain, template)) return (bucket->dthb_len); } return (0); } static void dtrace_hash_remove(dtrace_hash_t *hash, dtrace_probe_t *probe) { int ndx = DTRACE_HASHSTR(hash, probe) & hash->dth_mask; dtrace_hashbucket_t *bucket = hash->dth_tab[ndx]; dtrace_probe_t **prevp = DTRACE_HASHPREV(hash, probe); dtrace_probe_t **nextp = DTRACE_HASHNEXT(hash, probe); /* * Find the bucket that we're removing this probe from. */ for (; bucket != NULL; bucket = bucket->dthb_next) { if (DTRACE_HASHEQ(hash, bucket->dthb_chain, probe)) break; } ASSERT(bucket != NULL); if (*prevp == NULL) { if (*nextp == NULL) { /* * The removed probe was the only probe on this * bucket; we need to remove the bucket. */ dtrace_hashbucket_t *b = hash->dth_tab[ndx]; ASSERT(bucket->dthb_chain == probe); ASSERT(b != NULL); if (b == bucket) { hash->dth_tab[ndx] = bucket->dthb_next; } else { while (b->dthb_next != bucket) b = b->dthb_next; b->dthb_next = bucket->dthb_next; } ASSERT(hash->dth_nbuckets > 0); hash->dth_nbuckets--; kmem_free(bucket, sizeof (dtrace_hashbucket_t)); return; } bucket->dthb_chain = *nextp; } else { *(DTRACE_HASHNEXT(hash, *prevp)) = *nextp; } if (*nextp != NULL) *(DTRACE_HASHPREV(hash, *nextp)) = *prevp; } /* * DTrace Utility Functions * * These are random utility functions that are _not_ called from probe context. */ static int dtrace_badattr(const dtrace_attribute_t *a) { return (a->dtat_name > DTRACE_STABILITY_MAX || a->dtat_data > DTRACE_STABILITY_MAX || a->dtat_class > DTRACE_CLASS_MAX); } /* * Return a duplicate copy of a string. If the specified string is NULL, * this function returns a zero-length string. */ static char * dtrace_strdup(const char *str) { char *new = kmem_zalloc((str != NULL ? strlen(str) : 0) + 1, KM_SLEEP); if (str != NULL) (void) strcpy(new, str); return (new); } #define DTRACE_ISALPHA(c) \ (((c) >= 'a' && (c) <= 'z') || ((c) >= 'A' && (c) <= 'Z')) static int dtrace_badname(const char *s) { char c; if (s == NULL || (c = *s++) == '\0') return (0); if (!DTRACE_ISALPHA(c) && c != '-' && c != '_' && c != '.') return (1); while ((c = *s++) != '\0') { if (!DTRACE_ISALPHA(c) && (c < '0' || c > '9') && c != '-' && c != '_' && c != '.' && c != '`') return (1); } return (0); } static void dtrace_cred2priv(cred_t *cr, uint32_t *privp, uid_t *uidp, zoneid_t *zoneidp) { uint32_t priv; #ifdef illumos if (cr == NULL || PRIV_POLICY_ONLY(cr, PRIV_ALL, B_FALSE)) { /* * For DTRACE_PRIV_ALL, the uid and zoneid don't matter. 
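 *
 * The illumos branch below amounts to a small privilege-to-flag mapping
 * (with PRIV_DTRACE_KERNEL subsuming PRIV_DTRACE_USER, which the real
 * code expresses as an else-if).  Roughly, as a table:
 */
#if 0
static const struct {
	int		priv;	/* process privilege to test */
	uint32_t	dtpriv;	/* DTrace privilege flags granted */
} privmap[] = {
	{ PRIV_DTRACE_KERNEL,	DTRACE_PRIV_KERNEL | DTRACE_PRIV_USER },
	{ PRIV_DTRACE_USER,	DTRACE_PRIV_USER },
	{ PRIV_DTRACE_PROC,	DTRACE_PRIV_PROC },
	{ PRIV_PROC_OWNER,	DTRACE_PRIV_OWNER },
	{ PRIV_PROC_ZONE,	DTRACE_PRIV_ZONEOWNER },
};
#endif
/*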
*/ priv = DTRACE_PRIV_ALL; } else { *uidp = crgetuid(cr); *zoneidp = crgetzoneid(cr); priv = 0; if (PRIV_POLICY_ONLY(cr, PRIV_DTRACE_KERNEL, B_FALSE)) priv |= DTRACE_PRIV_KERNEL | DTRACE_PRIV_USER; else if (PRIV_POLICY_ONLY(cr, PRIV_DTRACE_USER, B_FALSE)) priv |= DTRACE_PRIV_USER; if (PRIV_POLICY_ONLY(cr, PRIV_DTRACE_PROC, B_FALSE)) priv |= DTRACE_PRIV_PROC; if (PRIV_POLICY_ONLY(cr, PRIV_PROC_OWNER, B_FALSE)) priv |= DTRACE_PRIV_OWNER; if (PRIV_POLICY_ONLY(cr, PRIV_PROC_ZONE, B_FALSE)) priv |= DTRACE_PRIV_ZONEOWNER; } #else priv = DTRACE_PRIV_ALL; #endif *privp = priv; } #ifdef DTRACE_ERRDEBUG static void dtrace_errdebug(const char *str) { int hval = dtrace_hash_str(str) % DTRACE_ERRHASHSZ; int occupied = 0; mutex_enter(&dtrace_errlock); dtrace_errlast = str; dtrace_errthread = curthread; while (occupied++ < DTRACE_ERRHASHSZ) { if (dtrace_errhash[hval].dter_msg == str) { dtrace_errhash[hval].dter_count++; goto out; } if (dtrace_errhash[hval].dter_msg != NULL) { hval = (hval + 1) % DTRACE_ERRHASHSZ; continue; } dtrace_errhash[hval].dter_msg = str; dtrace_errhash[hval].dter_count = 1; goto out; } panic("dtrace: undersized error hash"); out: mutex_exit(&dtrace_errlock); } #endif /* * DTrace Matching Functions * * These functions are used to match groups of probes, given some elements of * a probe tuple, or some globbed expressions for elements of a probe tuple. */ static int dtrace_match_priv(const dtrace_probe_t *prp, uint32_t priv, uid_t uid, zoneid_t zoneid) { if (priv != DTRACE_PRIV_ALL) { uint32_t ppriv = prp->dtpr_provider->dtpv_priv.dtpp_flags; uint32_t match = priv & ppriv; /* * No PRIV_DTRACE_* privileges... */ if ((priv & (DTRACE_PRIV_PROC | DTRACE_PRIV_USER | DTRACE_PRIV_KERNEL)) == 0) return (0); /* * No matching bits, but there were bits to match... */ if (match == 0 && ppriv != 0) return (0); /* * Need to have permissions to the process, but don't... */ if (((ppriv & ~match) & DTRACE_PRIV_OWNER) != 0 && uid != prp->dtpr_provider->dtpv_priv.dtpp_uid) { return (0); } /* * Need to be in the same zone unless we possess the * privilege to examine all zones. */ if (((ppriv & ~match) & DTRACE_PRIV_ZONEOWNER) != 0 && zoneid != prp->dtpr_provider->dtpv_priv.dtpp_zoneid) { return (0); } } return (1); } /* * dtrace_match_probe compares a dtrace_probe_t to a pre-compiled key, which * consists of input pattern strings and an ops-vector to evaluate them. * This function returns >0 for match, 0 for no match, and <0 for error. */ static int dtrace_match_probe(const dtrace_probe_t *prp, const dtrace_probekey_t *pkp, uint32_t priv, uid_t uid, zoneid_t zoneid) { dtrace_provider_t *pvp = prp->dtpr_provider; int rv; if (pvp->dtpv_defunct) return (0); if ((rv = pkp->dtpk_pmatch(pvp->dtpv_name, pkp->dtpk_prov, 0)) <= 0) return (rv); if ((rv = pkp->dtpk_mmatch(prp->dtpr_mod, pkp->dtpk_mod, 0)) <= 0) return (rv); if ((rv = pkp->dtpk_fmatch(prp->dtpr_func, pkp->dtpk_func, 0)) <= 0) return (rv); if ((rv = pkp->dtpk_nmatch(prp->dtpr_name, pkp->dtpk_name, 0)) <= 0) return (rv); if (dtrace_match_priv(prp, priv, uid, zoneid) == 0) return (0); return (rv); } /* * dtrace_match_glob() is a safe kernel implementation of the gmatch(3GEN) * interface for matching a glob pattern 'p' to an input string 's'. Unlike * libc's version, the kernel version only applies to 8-bit ASCII strings. * In addition, all of the recursion cases except for '*' matching have been * unwound. For '*', we still implement recursive evaluation, but a depth * counter is maintained and matching is aborted if we recurse too deep. 
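 *
 * A few illustrative calls (userland-style sketch; the depth argument
 * starts at zero and is only advanced on '*' recursion):
 */
#if 0
static void
glob_examples(void)
{
	ASSERT(dtrace_match_glob("syscall", "sys*", 0) > 0);	/* match */
	ASSERT(dtrace_match_glob("read", "[rw]*", 0) > 0);	/* match */
	ASSERT(dtrace_match_glob("write", "rea?", 0) == 0);	/* no match */
}
#endif
/*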
* The function returns 0 if no match, >0 if match, and <0 if recursion error. */ static int dtrace_match_glob(const char *s, const char *p, int depth) { const char *olds; char s1, c; int gs; if (depth > DTRACE_PROBEKEY_MAXDEPTH) return (-1); if (s == NULL) s = ""; /* treat NULL as empty string */ top: olds = s; s1 = *s++; if (p == NULL) return (0); if ((c = *p++) == '\0') return (s1 == '\0'); switch (c) { case '[': { int ok = 0, notflag = 0; char lc = '\0'; if (s1 == '\0') return (0); if (*p == '!') { notflag = 1; p++; } if ((c = *p++) == '\0') return (0); do { if (c == '-' && lc != '\0' && *p != ']') { if ((c = *p++) == '\0') return (0); if (c == '\\' && (c = *p++) == '\0') return (0); if (notflag) { if (s1 < lc || s1 > c) ok++; else return (0); } else if (lc <= s1 && s1 <= c) ok++; } else if (c == '\\' && (c = *p++) == '\0') return (0); lc = c; /* save left-hand 'c' for next iteration */ if (notflag) { if (s1 != c) ok++; else return (0); } else if (s1 == c) ok++; if ((c = *p++) == '\0') return (0); } while (c != ']'); if (ok) goto top; return (0); } case '\\': if ((c = *p++) == '\0') return (0); /*FALLTHRU*/ default: if (c != s1) return (0); /*FALLTHRU*/ case '?': if (s1 != '\0') goto top; return (0); case '*': while (*p == '*') p++; /* consecutive *'s are identical to a single one */ if (*p == '\0') return (1); for (s = olds; *s != '\0'; s++) { if ((gs = dtrace_match_glob(s, p, depth + 1)) != 0) return (gs); } return (0); } } /*ARGSUSED*/ static int dtrace_match_string(const char *s, const char *p, int depth) { return (s != NULL && strcmp(s, p) == 0); } /*ARGSUSED*/ static int dtrace_match_nul(const char *s, const char *p, int depth) { return (1); /* always match the empty pattern */ } /*ARGSUSED*/ static int dtrace_match_nonzero(const char *s, const char *p, int depth) { return (s != NULL && s[0] != '\0'); } static int dtrace_match(const dtrace_probekey_t *pkp, uint32_t priv, uid_t uid, zoneid_t zoneid, int (*matched)(dtrace_probe_t *, void *), void *arg) { dtrace_probe_t template, *probe; dtrace_hash_t *hash = NULL; int len, best = INT_MAX, nmatched = 0; dtrace_id_t i; ASSERT(MUTEX_HELD(&dtrace_lock)); /* * If the probe ID is specified in the key, just lookup by ID and * invoke the match callback once if a matching probe is found. */ if (pkp->dtpk_id != DTRACE_IDNONE) { if ((probe = dtrace_probe_lookup_id(pkp->dtpk_id)) != NULL && dtrace_match_probe(probe, pkp, priv, uid, zoneid) > 0) { (void) (*matched)(probe, arg); nmatched++; } return (nmatched); } template.dtpr_mod = (char *)pkp->dtpk_mod; template.dtpr_func = (char *)pkp->dtpk_func; template.dtpr_name = (char *)pkp->dtpk_name; /* * We want to find the most distinct of the module name, function * name, and name. So for each one that is not a glob pattern or * empty string, we perform a lookup in the corresponding hash and * use the hash table with the fewest collisions to do our search. */ if (pkp->dtpk_mmatch == &dtrace_match_string && (len = dtrace_hash_collisions(dtrace_bymod, &template)) < best) { best = len; hash = dtrace_bymod; } if (pkp->dtpk_fmatch == &dtrace_match_string && (len = dtrace_hash_collisions(dtrace_byfunc, &template)) < best) { best = len; hash = dtrace_byfunc; } if (pkp->dtpk_nmatch == &dtrace_match_string && (len = dtrace_hash_collisions(dtrace_byname, &template)) < best) { best = len; hash = dtrace_byname; } /* * If we did not select a hash table, iterate over every probe and * invoke our callback for each one that matches our input probe key. 
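 *
 * The hash selection above is an argmin over the candidate hashes; in
 * sketch form (hypothetical names):
 */
#if 0
static dtrace_hash_t *
pick_best_hash(dtrace_hash_t *cand[], int collisions[], int n)
{
	dtrace_hash_t *hash = NULL;
	int i, best = INT_MAX;

	for (i = 0; i < n; i++) {
		/* fewer collisions means a shorter chain to walk */
		if (collisions[i] < best) {
			best = collisions[i];
			hash = cand[i];
		}
	}
	return (hash);	/* NULL: no exact field; fall back to linear scan */
}
#endif
/*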
*/ if (hash == NULL) { for (i = 0; i < dtrace_nprobes; i++) { if ((probe = dtrace_probes[i]) == NULL || dtrace_match_probe(probe, pkp, priv, uid, zoneid) <= 0) continue; nmatched++; if ((*matched)(probe, arg) != DTRACE_MATCH_NEXT) break; } return (nmatched); } /* * If we selected a hash table, iterate over each probe of the same key * name and invoke the callback for every probe that matches the other * attributes of our input probe key. */ for (probe = dtrace_hash_lookup(hash, &template); probe != NULL; probe = *(DTRACE_HASHNEXT(hash, probe))) { if (dtrace_match_probe(probe, pkp, priv, uid, zoneid) <= 0) continue; nmatched++; if ((*matched)(probe, arg) != DTRACE_MATCH_NEXT) break; } return (nmatched); } /* * Return the function pointer dtrace_probecmp() should use to compare the * specified pattern with a string. For NULL or empty patterns, we select * dtrace_match_nul(). For glob pattern strings, we use dtrace_match_glob(). * For non-empty non-glob strings, we use dtrace_match_string(). */ static dtrace_probekey_f * dtrace_probekey_func(const char *p) { char c; if (p == NULL || *p == '\0') return (&dtrace_match_nul); while ((c = *p++) != '\0') { if (c == '[' || c == '?' || c == '*' || c == '\\') return (&dtrace_match_glob); } return (&dtrace_match_string); } /* * Build a probe comparison key for use with dtrace_match_probe() from the * given probe description. By convention, a null key only matches anchored * probes: if each field is the empty string, reset dtpk_fmatch to * dtrace_match_nonzero(). */ static void dtrace_probekey(dtrace_probedesc_t *pdp, dtrace_probekey_t *pkp) { pkp->dtpk_prov = pdp->dtpd_provider; pkp->dtpk_pmatch = dtrace_probekey_func(pdp->dtpd_provider); pkp->dtpk_mod = pdp->dtpd_mod; pkp->dtpk_mmatch = dtrace_probekey_func(pdp->dtpd_mod); pkp->dtpk_func = pdp->dtpd_func; pkp->dtpk_fmatch = dtrace_probekey_func(pdp->dtpd_func); pkp->dtpk_name = pdp->dtpd_name; pkp->dtpk_nmatch = dtrace_probekey_func(pdp->dtpd_name); pkp->dtpk_id = pdp->dtpd_id; if (pkp->dtpk_id == DTRACE_IDNONE && pkp->dtpk_pmatch == &dtrace_match_nul && pkp->dtpk_mmatch == &dtrace_match_nul && pkp->dtpk_fmatch == &dtrace_match_nul && pkp->dtpk_nmatch == &dtrace_match_nul) pkp->dtpk_fmatch = &dtrace_match_nonzero; } /* * DTrace Provider-to-Framework API Functions * * These functions implement much of the Provider-to-Framework API, as * described in . The parts of the API not in this section are * the functions in the API for probe management (found below), and * dtrace_probe() itself (found above). */ /* * Register the calling provider with the DTrace framework. This should * generally be called by DTrace providers in their attach(9E) entry point. */ int dtrace_register(const char *name, const dtrace_pattr_t *pap, uint32_t priv, cred_t *cr, const dtrace_pops_t *pops, void *arg, dtrace_provider_id_t *idp) { dtrace_provider_t *provider; if (name == NULL || pap == NULL || pops == NULL || idp == NULL) { cmn_err(CE_WARN, "failed to register provider '%s': invalid " "arguments", name ? 
name : ""); return (EINVAL); } if (name[0] == '\0' || dtrace_badname(name)) { cmn_err(CE_WARN, "failed to register provider '%s': invalid " "provider name", name); return (EINVAL); } if ((pops->dtps_provide == NULL && pops->dtps_provide_module == NULL) || pops->dtps_enable == NULL || pops->dtps_disable == NULL || pops->dtps_destroy == NULL || ((pops->dtps_resume == NULL) != (pops->dtps_suspend == NULL))) { cmn_err(CE_WARN, "failed to register provider '%s': invalid " "provider ops", name); return (EINVAL); } if (dtrace_badattr(&pap->dtpa_provider) || dtrace_badattr(&pap->dtpa_mod) || dtrace_badattr(&pap->dtpa_func) || dtrace_badattr(&pap->dtpa_name) || dtrace_badattr(&pap->dtpa_args)) { cmn_err(CE_WARN, "failed to register provider '%s': invalid " "provider attributes", name); return (EINVAL); } if (priv & ~DTRACE_PRIV_ALL) { cmn_err(CE_WARN, "failed to register provider '%s': invalid " "privilege attributes", name); return (EINVAL); } if ((priv & DTRACE_PRIV_KERNEL) && (priv & (DTRACE_PRIV_USER | DTRACE_PRIV_OWNER)) && pops->dtps_usermode == NULL) { cmn_err(CE_WARN, "failed to register provider '%s': need " "dtps_usermode() op for given privilege attributes", name); return (EINVAL); } provider = kmem_zalloc(sizeof (dtrace_provider_t), KM_SLEEP); provider->dtpv_name = kmem_alloc(strlen(name) + 1, KM_SLEEP); (void) strcpy(provider->dtpv_name, name); provider->dtpv_attr = *pap; provider->dtpv_priv.dtpp_flags = priv; if (cr != NULL) { provider->dtpv_priv.dtpp_uid = crgetuid(cr); provider->dtpv_priv.dtpp_zoneid = crgetzoneid(cr); } provider->dtpv_pops = *pops; if (pops->dtps_provide == NULL) { ASSERT(pops->dtps_provide_module != NULL); provider->dtpv_pops.dtps_provide = (void (*)(void *, dtrace_probedesc_t *))dtrace_nullop; } if (pops->dtps_provide_module == NULL) { ASSERT(pops->dtps_provide != NULL); provider->dtpv_pops.dtps_provide_module = (void (*)(void *, modctl_t *))dtrace_nullop; } if (pops->dtps_suspend == NULL) { ASSERT(pops->dtps_resume == NULL); provider->dtpv_pops.dtps_suspend = (void (*)(void *, dtrace_id_t, void *))dtrace_nullop; provider->dtpv_pops.dtps_resume = (void (*)(void *, dtrace_id_t, void *))dtrace_nullop; } provider->dtpv_arg = arg; *idp = (dtrace_provider_id_t)provider; if (pops == &dtrace_provider_ops) { ASSERT(MUTEX_HELD(&dtrace_provider_lock)); ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(dtrace_anon.dta_enabling == NULL); /* * We make sure that the DTrace provider is at the head of * the provider chain. */ provider->dtpv_next = dtrace_provider; dtrace_provider = provider; return (0); } mutex_enter(&dtrace_provider_lock); mutex_enter(&dtrace_lock); /* * If there is at least one provider registered, we'll add this * provider after the first provider. */ if (dtrace_provider != NULL) { provider->dtpv_next = dtrace_provider->dtpv_next; dtrace_provider->dtpv_next = provider; } else { dtrace_provider = provider; } if (dtrace_retained != NULL) { dtrace_enabling_provide(provider); /* * Now we need to call dtrace_enabling_matchall() -- which * will acquire cpu_lock and dtrace_lock. We therefore need * to drop all of our locks before calling into it... */ mutex_exit(&dtrace_lock); mutex_exit(&dtrace_provider_lock); dtrace_enabling_matchall(); return (0); } mutex_exit(&dtrace_lock); mutex_exit(&dtrace_provider_lock); return (0); } /* * Unregister the specified provider from the DTrace framework. This should * generally be called by DTrace providers in their detach(9E) entry point. 
*/ int dtrace_unregister(dtrace_provider_id_t id) { dtrace_provider_t *old = (dtrace_provider_t *)id; dtrace_provider_t *prev = NULL; int i, self = 0, noreap = 0; dtrace_probe_t *probe, *first = NULL; if (old->dtpv_pops.dtps_enable == (void (*)(void *, dtrace_id_t, void *))dtrace_nullop) { /* * If DTrace itself is the provider, we're called with locks * already held. */ ASSERT(old == dtrace_provider); #ifdef illumos ASSERT(dtrace_devi != NULL); #endif ASSERT(MUTEX_HELD(&dtrace_provider_lock)); ASSERT(MUTEX_HELD(&dtrace_lock)); self = 1; if (dtrace_provider->dtpv_next != NULL) { /* * There's another provider here; return failure. */ return (EBUSY); } } else { mutex_enter(&dtrace_provider_lock); #ifdef illumos mutex_enter(&mod_lock); #endif mutex_enter(&dtrace_lock); } /* * If anyone has /dev/dtrace open, or if there are anonymous enabled * probes, we refuse to let providers slither away, unless this * provider has already been explicitly invalidated. */ if (!old->dtpv_defunct && (dtrace_opens || (dtrace_anon.dta_state != NULL && dtrace_anon.dta_state->dts_necbs > 0))) { if (!self) { mutex_exit(&dtrace_lock); #ifdef illumos mutex_exit(&mod_lock); #endif mutex_exit(&dtrace_provider_lock); } return (EBUSY); } /* * Attempt to destroy the probes associated with this provider. */ for (i = 0; i < dtrace_nprobes; i++) { if ((probe = dtrace_probes[i]) == NULL) continue; if (probe->dtpr_provider != old) continue; if (probe->dtpr_ecb == NULL) continue; /* * If we are trying to unregister a defunct provider, and the * provider was made defunct within the interval dictated by * dtrace_unregister_defunct_reap, we'll (asynchronously) * attempt to reap our enablings. To denote that the provider * should reattempt to unregister itself at some point in the * future, we will return a differentiable error code (EAGAIN * instead of EBUSY) in this case. */ if (dtrace_gethrtime() - old->dtpv_defunct > dtrace_unregister_defunct_reap) noreap = 1; if (!self) { mutex_exit(&dtrace_lock); #ifdef illumos mutex_exit(&mod_lock); #endif mutex_exit(&dtrace_provider_lock); } if (noreap) return (EBUSY); (void) taskq_dispatch(dtrace_taskq, (task_func_t *)dtrace_enabling_reap, NULL, TQ_SLEEP); return (EAGAIN); } /* * All of the probes for this provider are disabled; we can safely * remove all of them from their hash chains and from the probe array. */ for (i = 0; i < dtrace_nprobes; i++) { if ((probe = dtrace_probes[i]) == NULL) continue; if (probe->dtpr_provider != old) continue; dtrace_probes[i] = NULL; dtrace_hash_remove(dtrace_bymod, probe); dtrace_hash_remove(dtrace_byfunc, probe); dtrace_hash_remove(dtrace_byname, probe); if (first == NULL) { first = probe; probe->dtpr_nextmod = NULL; } else { probe->dtpr_nextmod = first; first = probe; } } /* * The provider's probes have been removed from the hash chains and * from the probe array. Now issue a dtrace_sync() to be sure that * everyone has cleared out from any probe array processing. 
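	 *
	 * (Probe context runs with interrupts disabled, so once
	 * dtrace_sync() returns, no CPU can still be referencing the
	 * entries we just unhooked; compare the interrupt-disable
	 * discipline in dtrace_probe_foreach().)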
*/ dtrace_sync(); for (probe = first; probe != NULL; probe = first) { first = probe->dtpr_nextmod; old->dtpv_pops.dtps_destroy(old->dtpv_arg, probe->dtpr_id, probe->dtpr_arg); kmem_free(probe->dtpr_mod, strlen(probe->dtpr_mod) + 1); kmem_free(probe->dtpr_func, strlen(probe->dtpr_func) + 1); kmem_free(probe->dtpr_name, strlen(probe->dtpr_name) + 1); #ifdef illumos vmem_free(dtrace_arena, (void *)(uintptr_t)(probe->dtpr_id), 1); #else free_unr(dtrace_arena, probe->dtpr_id); #endif kmem_free(probe, sizeof (dtrace_probe_t)); } if ((prev = dtrace_provider) == old) { #ifdef illumos ASSERT(self || dtrace_devi == NULL); ASSERT(old->dtpv_next == NULL || dtrace_devi == NULL); #endif dtrace_provider = old->dtpv_next; } else { while (prev != NULL && prev->dtpv_next != old) prev = prev->dtpv_next; if (prev == NULL) { panic("attempt to unregister non-existent " "dtrace provider %p\n", (void *)id); } prev->dtpv_next = old->dtpv_next; } if (!self) { mutex_exit(&dtrace_lock); #ifdef illumos mutex_exit(&mod_lock); #endif mutex_exit(&dtrace_provider_lock); } kmem_free(old->dtpv_name, strlen(old->dtpv_name) + 1); kmem_free(old, sizeof (dtrace_provider_t)); return (0); } /* * Invalidate the specified provider. All subsequent probe lookups for the * specified provider will fail, but its probes will not be removed. */ void dtrace_invalidate(dtrace_provider_id_t id) { dtrace_provider_t *pvp = (dtrace_provider_t *)id; ASSERT(pvp->dtpv_pops.dtps_enable != (void (*)(void *, dtrace_id_t, void *))dtrace_nullop); mutex_enter(&dtrace_provider_lock); mutex_enter(&dtrace_lock); pvp->dtpv_defunct = dtrace_gethrtime(); mutex_exit(&dtrace_lock); mutex_exit(&dtrace_provider_lock); } /* * Indicate whether or not DTrace has attached. */ int dtrace_attached(void) { /* * dtrace_provider will be non-NULL iff the DTrace driver has * attached. (It's non-NULL because DTrace is always itself a * provider.) */ return (dtrace_provider != NULL); } /* * Remove all the unenabled probes for the given provider. This function is * not unlike dtrace_unregister(), except that it doesn't remove the provider * -- just as many of its associated probes as it can. */ int dtrace_condense(dtrace_provider_id_t id) { dtrace_provider_t *prov = (dtrace_provider_t *)id; int i; dtrace_probe_t *probe; /* * Make sure this isn't the dtrace provider itself. */ ASSERT(prov->dtpv_pops.dtps_enable != (void (*)(void *, dtrace_id_t, void *))dtrace_nullop); mutex_enter(&dtrace_provider_lock); mutex_enter(&dtrace_lock); /* * Attempt to destroy the probes associated with this provider. 
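	 *
	 * (Unlike dtrace_unregister(), an enabled probe is not an error
	 * here: probes that still have an ECB attached are simply skipped,
	 * and only the unenabled ones are destroyed.)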
*/ for (i = 0; i < dtrace_nprobes; i++) { if ((probe = dtrace_probes[i]) == NULL) continue; if (probe->dtpr_provider != prov) continue; if (probe->dtpr_ecb != NULL) continue; dtrace_probes[i] = NULL; dtrace_hash_remove(dtrace_bymod, probe); dtrace_hash_remove(dtrace_byfunc, probe); dtrace_hash_remove(dtrace_byname, probe); prov->dtpv_pops.dtps_destroy(prov->dtpv_arg, i + 1, probe->dtpr_arg); kmem_free(probe->dtpr_mod, strlen(probe->dtpr_mod) + 1); kmem_free(probe->dtpr_func, strlen(probe->dtpr_func) + 1); kmem_free(probe->dtpr_name, strlen(probe->dtpr_name) + 1); kmem_free(probe, sizeof (dtrace_probe_t)); #ifdef illumos vmem_free(dtrace_arena, (void *)((uintptr_t)i + 1), 1); #else free_unr(dtrace_arena, i + 1); #endif } mutex_exit(&dtrace_lock); mutex_exit(&dtrace_provider_lock); return (0); } /* * DTrace Probe Management Functions * * The functions in this section perform the DTrace probe management, * including functions to create probes, look-up probes, and call into the * providers to request that probes be provided. Some of these functions are * in the Provider-to-Framework API; these functions can be identified by the * fact that they are not declared "static". */ /* * Create a probe with the specified module name, function name, and name. */ dtrace_id_t dtrace_probe_create(dtrace_provider_id_t prov, const char *mod, const char *func, const char *name, int aframes, void *arg) { dtrace_probe_t *probe, **probes; dtrace_provider_t *provider = (dtrace_provider_t *)prov; dtrace_id_t id; if (provider == dtrace_provider) { ASSERT(MUTEX_HELD(&dtrace_lock)); } else { mutex_enter(&dtrace_lock); } #ifdef illumos id = (dtrace_id_t)(uintptr_t)vmem_alloc(dtrace_arena, 1, VM_BESTFIT | VM_SLEEP); #else id = alloc_unr(dtrace_arena); #endif probe = kmem_zalloc(sizeof (dtrace_probe_t), KM_SLEEP); probe->dtpr_id = id; probe->dtpr_gen = dtrace_probegen++; probe->dtpr_mod = dtrace_strdup(mod); probe->dtpr_func = dtrace_strdup(func); probe->dtpr_name = dtrace_strdup(name); probe->dtpr_arg = arg; probe->dtpr_aframes = aframes; probe->dtpr_provider = provider; dtrace_hash_add(dtrace_bymod, probe); dtrace_hash_add(dtrace_byfunc, probe); dtrace_hash_add(dtrace_byname, probe); if (id - 1 >= dtrace_nprobes) { size_t osize = dtrace_nprobes * sizeof (dtrace_probe_t *); size_t nsize = osize << 1; if (nsize == 0) { ASSERT(osize == 0); ASSERT(dtrace_probes == NULL); nsize = sizeof (dtrace_probe_t *); } probes = kmem_zalloc(nsize, KM_SLEEP); if (dtrace_probes == NULL) { ASSERT(osize == 0); dtrace_probes = probes; dtrace_nprobes = 1; } else { dtrace_probe_t **oprobes = dtrace_probes; bcopy(oprobes, probes, osize); dtrace_membar_producer(); dtrace_probes = probes; dtrace_sync(); /* * All CPUs are now seeing the new probes array; we can * safely free the old array. */ kmem_free(oprobes, osize); dtrace_nprobes <<= 1; } ASSERT(id - 1 < dtrace_nprobes); } ASSERT(dtrace_probes[id - 1] == NULL); dtrace_probes[id - 1] = probe; if (provider != dtrace_provider) mutex_exit(&dtrace_lock); return (id); } static dtrace_probe_t * dtrace_probe_lookup_id(dtrace_id_t id) { ASSERT(MUTEX_HELD(&dtrace_lock)); if (id == 0 || id > dtrace_nprobes) return (NULL); return (dtrace_probes[id - 1]); } static int dtrace_probe_lookup_match(dtrace_probe_t *probe, void *arg) { *((dtrace_id_t *)arg) = probe->dtpr_id; return (DTRACE_MATCH_DONE); } /* * Look up a probe based on provider and one or more of module name, function * name and probe name. 
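 *
 * For example (a sketch only; the provider id variable and probe names
 * here are hypothetical), a provider can check for an existing probe
 * before creating it:
 *
 *	if (dtrace_probe_lookup(noname_id, "mod", "func", "entry") == 0)
 *		(void) dtrace_probe_create(noname_id, "mod", "func",
 *		    "entry", 0, NULL);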
 */
dtrace_id_t
dtrace_probe_lookup(dtrace_provider_id_t prid, char *mod,
    char *func, char *name)
{
	dtrace_probekey_t pkey;
	dtrace_id_t id;
	int match;

	pkey.dtpk_prov = ((dtrace_provider_t *)prid)->dtpv_name;
	pkey.dtpk_pmatch = &dtrace_match_string;
	pkey.dtpk_mod = mod;
	pkey.dtpk_mmatch = mod ? &dtrace_match_string : &dtrace_match_nul;
	pkey.dtpk_func = func;
	pkey.dtpk_fmatch = func ? &dtrace_match_string : &dtrace_match_nul;
	pkey.dtpk_name = name;
	pkey.dtpk_nmatch = name ? &dtrace_match_string : &dtrace_match_nul;
	pkey.dtpk_id = DTRACE_IDNONE;

	mutex_enter(&dtrace_lock);
	match = dtrace_match(&pkey, DTRACE_PRIV_ALL, 0, 0,
	    dtrace_probe_lookup_match, &id);
	mutex_exit(&dtrace_lock);

	ASSERT(match == 1 || match == 0);
	return (match ? id : 0);
}

/*
 * Returns the probe argument associated with the specified probe.
 */
void *
dtrace_probe_arg(dtrace_provider_id_t id, dtrace_id_t pid)
{
	dtrace_probe_t *probe;
	void *rval = NULL;

	mutex_enter(&dtrace_lock);

	if ((probe = dtrace_probe_lookup_id(pid)) != NULL &&
	    probe->dtpr_provider == (dtrace_provider_t *)id)
		rval = probe->dtpr_arg;

	mutex_exit(&dtrace_lock);

	return (rval);
}

/*
 * Copy a probe into a probe description.
 */
static void
dtrace_probe_description(const dtrace_probe_t *prp, dtrace_probedesc_t *pdp)
{
	bzero(pdp, sizeof (dtrace_probedesc_t));
	pdp->dtpd_id = prp->dtpr_id;

	(void) strncpy(pdp->dtpd_provider,
	    prp->dtpr_provider->dtpv_name, DTRACE_PROVNAMELEN - 1);

	(void) strncpy(pdp->dtpd_mod, prp->dtpr_mod, DTRACE_MODNAMELEN - 1);
	(void) strncpy(pdp->dtpd_func, prp->dtpr_func, DTRACE_FUNCNAMELEN - 1);
	(void) strncpy(pdp->dtpd_name, prp->dtpr_name, DTRACE_NAMELEN - 1);
}

/*
 * Called to indicate that a probe -- or probes -- should be provided by a
 * specified provider.  If the specified description is NULL, the provider
 * will be told to provide all of its probes.  (This is done whenever a new
 * consumer comes along, or whenever a retained enabling is to be matched.)
 * If the specified description is non-NULL, the provider is given the
 * opportunity to dynamically provide the specified probe, allowing providers
 * to support the creation of probes on-the-fly.  (So-called _autocreated_
 * probes.)  If the provider is NULL, the operations will be applied to all
 * providers; if the provider is non-NULL, the operations will only be
 * applied to the specified provider.  The dtrace_provider_lock must be held,
 * and the dtrace_lock must _not_ be held -- the provider's dtps_provide()
 * operation will need to grab the dtrace_lock when it reenters the framework
 * through dtrace_probe_lookup(), dtrace_probe_create(), etc.
 */
static void
dtrace_probe_provide(dtrace_probedesc_t *desc, dtrace_provider_t *prv)
{
#ifdef illumos
	modctl_t *ctl;
#endif
	int all = 0;

	ASSERT(MUTEX_HELD(&dtrace_provider_lock));

	if (prv == NULL) {
		all = 1;
		prv = dtrace_provider;
	}

	do {
		/*
		 * First, call the blanket provide operation.
		 */
		prv->dtpv_pops.dtps_provide(prv->dtpv_arg, desc);

#ifdef illumos
		/*
		 * Now call the per-module provide operation.  We will grab
		 * mod_lock to prevent the list from being modified.  Note
		 * that this also prevents the mod_busy bits from changing.
		 * (mod_busy can only be changed with mod_lock held.)
*/ mutex_enter(&mod_lock); ctl = &modules; do { if (ctl->mod_busy || ctl->mod_mp == NULL) continue; prv->dtpv_pops.dtps_provide_module(prv->dtpv_arg, ctl); } while ((ctl = ctl->mod_next) != &modules); mutex_exit(&mod_lock); #endif } while (all && (prv = prv->dtpv_next) != NULL); } #ifdef illumos /* * Iterate over each probe, and call the Framework-to-Provider API function * denoted by offs. */ static void dtrace_probe_foreach(uintptr_t offs) { dtrace_provider_t *prov; void (*func)(void *, dtrace_id_t, void *); dtrace_probe_t *probe; dtrace_icookie_t cookie; int i; /* * We disable interrupts to walk through the probe array. This is * safe -- the dtrace_sync() in dtrace_unregister() assures that we * won't see stale data. */ cookie = dtrace_interrupt_disable(); for (i = 0; i < dtrace_nprobes; i++) { if ((probe = dtrace_probes[i]) == NULL) continue; if (probe->dtpr_ecb == NULL) { /* * This probe isn't enabled -- don't call the function. */ continue; } prov = probe->dtpr_provider; func = *((void(**)(void *, dtrace_id_t, void *)) ((uintptr_t)&prov->dtpv_pops + offs)); func(prov->dtpv_arg, i + 1, probe->dtpr_arg); } dtrace_interrupt_enable(cookie); } #endif static int dtrace_probe_enable(dtrace_probedesc_t *desc, dtrace_enabling_t *enab) { dtrace_probekey_t pkey; uint32_t priv; uid_t uid; zoneid_t zoneid; ASSERT(MUTEX_HELD(&dtrace_lock)); dtrace_ecb_create_cache = NULL; if (desc == NULL) { /* * If we're passed a NULL description, we're being asked to * create an ECB with a NULL probe. */ (void) dtrace_ecb_create_enable(NULL, enab); return (0); } dtrace_probekey(desc, &pkey); dtrace_cred2priv(enab->dten_vstate->dtvs_state->dts_cred.dcr_cred, &priv, &uid, &zoneid); return (dtrace_match(&pkey, priv, uid, zoneid, dtrace_ecb_create_enable, enab)); } /* * DTrace Helper Provider Functions */ static void dtrace_dofattr2attr(dtrace_attribute_t *attr, const dof_attr_t dofattr) { attr->dtat_name = DOF_ATTR_NAME(dofattr); attr->dtat_data = DOF_ATTR_DATA(dofattr); attr->dtat_class = DOF_ATTR_CLASS(dofattr); } static void dtrace_dofprov2hprov(dtrace_helper_provdesc_t *hprov, const dof_provider_t *dofprov, char *strtab) { hprov->dthpv_provname = strtab + dofprov->dofpv_name; dtrace_dofattr2attr(&hprov->dthpv_pattr.dtpa_provider, dofprov->dofpv_provattr); dtrace_dofattr2attr(&hprov->dthpv_pattr.dtpa_mod, dofprov->dofpv_modattr); dtrace_dofattr2attr(&hprov->dthpv_pattr.dtpa_func, dofprov->dofpv_funcattr); dtrace_dofattr2attr(&hprov->dthpv_pattr.dtpa_name, dofprov->dofpv_nameattr); dtrace_dofattr2attr(&hprov->dthpv_pattr.dtpa_args, dofprov->dofpv_argsattr); } static void dtrace_helper_provide_one(dof_helper_t *dhp, dof_sec_t *sec, pid_t pid) { uintptr_t daddr = (uintptr_t)dhp->dofhp_dof; dof_hdr_t *dof = (dof_hdr_t *)daddr; dof_sec_t *str_sec, *prb_sec, *arg_sec, *off_sec, *enoff_sec; dof_provider_t *provider; dof_probe_t *probe; uint32_t *off, *enoff; uint8_t *arg; char *strtab; uint_t i, nprobes; dtrace_helper_provdesc_t dhpv; dtrace_helper_probedesc_t dhpb; dtrace_meta_t *meta = dtrace_meta_pid; dtrace_mops_t *mops = &meta->dtm_mops; void *parg; provider = (dof_provider_t *)(uintptr_t)(daddr + sec->dofs_offset); str_sec = (dof_sec_t *)(uintptr_t)(daddr + dof->dofh_secoff + provider->dofpv_strtab * dof->dofh_secsize); prb_sec = (dof_sec_t *)(uintptr_t)(daddr + dof->dofh_secoff + provider->dofpv_probes * dof->dofh_secsize); arg_sec = (dof_sec_t *)(uintptr_t)(daddr + dof->dofh_secoff + provider->dofpv_prargs * dof->dofh_secsize); off_sec = (dof_sec_t *)(uintptr_t)(daddr + dof->dofh_secoff + provider->dofpv_proffs 
	    * dof->dofh_secsize);

	strtab = (char *)(uintptr_t)(daddr + str_sec->dofs_offset);
	off = (uint32_t *)(uintptr_t)(daddr + off_sec->dofs_offset);
	arg = (uint8_t *)(uintptr_t)(daddr + arg_sec->dofs_offset);
	enoff = NULL;

	/*
	 * See dtrace_helper_provider_validate().
	 */
	if (dof->dofh_ident[DOF_ID_VERSION] != DOF_VERSION_1 &&
	    provider->dofpv_prenoffs != DOF_SECT_NONE) {
		enoff_sec = (dof_sec_t *)(uintptr_t)(daddr +
		    dof->dofh_secoff + provider->dofpv_prenoffs *
		    dof->dofh_secsize);
		enoff = (uint32_t *)(uintptr_t)(daddr +
		    enoff_sec->dofs_offset);
	}

	nprobes = prb_sec->dofs_size / prb_sec->dofs_entsize;

	/*
	 * Create the provider.
	 */
	dtrace_dofprov2hprov(&dhpv, provider, strtab);

	if ((parg = mops->dtms_provide_pid(meta->dtm_arg, &dhpv, pid)) == NULL)
		return;

	meta->dtm_count++;

	/*
	 * Create the probes.
	 */
	for (i = 0; i < nprobes; i++) {
		probe = (dof_probe_t *)(uintptr_t)(daddr +
		    prb_sec->dofs_offset + i * prb_sec->dofs_entsize);

		/* See the check in dtrace_helper_provider_validate(). */
		if (strlen(strtab + probe->dofpr_func) >= DTRACE_FUNCNAMELEN)
			continue;

		dhpb.dthpb_mod = dhp->dofhp_mod;
		dhpb.dthpb_func = strtab + probe->dofpr_func;
		dhpb.dthpb_name = strtab + probe->dofpr_name;
		dhpb.dthpb_base = probe->dofpr_addr;
		dhpb.dthpb_offs = off + probe->dofpr_offidx;
		dhpb.dthpb_noffs = probe->dofpr_noffs;
		if (enoff != NULL) {
			dhpb.dthpb_enoffs = enoff + probe->dofpr_enoffidx;
			dhpb.dthpb_nenoffs = probe->dofpr_nenoffs;
		} else {
			dhpb.dthpb_enoffs = NULL;
			dhpb.dthpb_nenoffs = 0;
		}
		dhpb.dthpb_args = arg + probe->dofpr_argidx;
		dhpb.dthpb_nargc = probe->dofpr_nargc;
		dhpb.dthpb_xargc = probe->dofpr_xargc;
		dhpb.dthpb_ntypes = strtab + probe->dofpr_nargv;
		dhpb.dthpb_xtypes = strtab + probe->dofpr_xargv;

		mops->dtms_create_probe(meta->dtm_arg, parg, &dhpb);
	}
}

static void
dtrace_helper_provide(dof_helper_t *dhp, pid_t pid)
{
	uintptr_t daddr = (uintptr_t)dhp->dofhp_dof;
	dof_hdr_t *dof = (dof_hdr_t *)daddr;
	int i;

	ASSERT(MUTEX_HELD(&dtrace_meta_lock));

	for (i = 0; i < dof->dofh_secnum; i++) {
		dof_sec_t *sec = (dof_sec_t *)(uintptr_t)(daddr +
		    dof->dofh_secoff + i * dof->dofh_secsize);

		if (sec->dofs_type != DOF_SECT_PROVIDER)
			continue;

		dtrace_helper_provide_one(dhp, sec, pid);
	}

	/*
	 * We may have just created probes, so we must now rematch against
	 * any retained enablings.  Note that this call will acquire both
	 * cpu_lock and dtrace_lock; the fact that we are holding
	 * dtrace_meta_lock now is what defines the ordering with respect to
	 * these three locks.
	 */
	dtrace_enabling_matchall();
}

static void
dtrace_helper_provider_remove_one(dof_helper_t *dhp, dof_sec_t *sec,
    pid_t pid)
{
	uintptr_t daddr = (uintptr_t)dhp->dofhp_dof;
	dof_hdr_t *dof = (dof_hdr_t *)daddr;
	dof_sec_t *str_sec;
	dof_provider_t *provider;
	char *strtab;
	dtrace_helper_provdesc_t dhpv;
	dtrace_meta_t *meta = dtrace_meta_pid;
	dtrace_mops_t *mops = &meta->dtm_mops;

	provider = (dof_provider_t *)(uintptr_t)(daddr + sec->dofs_offset);
	str_sec = (dof_sec_t *)(uintptr_t)(daddr + dof->dofh_secoff +
	    provider->dofpv_strtab * dof->dofh_secsize);

	strtab = (char *)(uintptr_t)(daddr + str_sec->dofs_offset);

	/*
	 * Build the provider description and remove the provider.
	 */
	dtrace_dofprov2hprov(&dhpv, provider, strtab);

	mops->dtms_remove_pid(meta->dtm_arg, &dhpv, pid);

	meta->dtm_count--;
}

static void
dtrace_helper_provider_remove(dof_helper_t *dhp, pid_t pid)
{
	uintptr_t daddr = (uintptr_t)dhp->dofhp_dof;
	dof_hdr_t *dof = (dof_hdr_t *)daddr;
	int i;

	ASSERT(MUTEX_HELD(&dtrace_meta_lock));

	for (i = 0; i < dof->dofh_secnum; i++) {
		dof_sec_t *sec = (dof_sec_t *)(uintptr_t)(daddr +
		    dof->dofh_secoff + i * dof->dofh_secsize);

		if (sec->dofs_type != DOF_SECT_PROVIDER)
			continue;

		dtrace_helper_provider_remove_one(dhp, sec, pid);
	}
}

/*
 * DTrace Meta Provider-to-Framework API Functions
 *
 * These functions implement the Meta Provider-to-Framework API, as described
 * in <sys/dtrace.h>.
 */
int
dtrace_meta_register(const char *name, const dtrace_mops_t *mops, void *arg,
    dtrace_meta_provider_id_t *idp)
{
	dtrace_meta_t *meta;
	dtrace_helpers_t *help, *next;
	int i;

	*idp = DTRACE_METAPROVNONE;

	/*
	 * We strictly don't need the name, but we hold onto it for
	 * debuggability.  All hail error queues!
	 */
	if (name == NULL) {
		cmn_err(CE_WARN, "failed to register meta-provider: "
		    "invalid name");
		return (EINVAL);
	}

	if (mops == NULL ||
	    mops->dtms_create_probe == NULL ||
	    mops->dtms_provide_pid == NULL ||
	    mops->dtms_remove_pid == NULL) {
		cmn_err(CE_WARN, "failed to register meta-provider %s: "
		    "invalid ops", name);
		return (EINVAL);
	}

	meta = kmem_zalloc(sizeof (dtrace_meta_t), KM_SLEEP);
	meta->dtm_mops = *mops;
	meta->dtm_name = kmem_alloc(strlen(name) + 1, KM_SLEEP);
	(void) strcpy(meta->dtm_name, name);
	meta->dtm_arg = arg;

	mutex_enter(&dtrace_meta_lock);
	mutex_enter(&dtrace_lock);

	if (dtrace_meta_pid != NULL) {
		mutex_exit(&dtrace_lock);
		mutex_exit(&dtrace_meta_lock);
		cmn_err(CE_WARN, "failed to register meta-provider %s: "
		    "user-land meta-provider exists", name);
		kmem_free(meta->dtm_name, strlen(meta->dtm_name) + 1);
		kmem_free(meta, sizeof (dtrace_meta_t));
		return (EINVAL);
	}

	dtrace_meta_pid = meta;
	*idp = (dtrace_meta_provider_id_t)meta;

	/*
	 * If there are providers and probes ready to go, pass them
	 * off to the new meta provider now.
	 */
	help = dtrace_deferred_pid;
	dtrace_deferred_pid = NULL;

	mutex_exit(&dtrace_lock);

	while (help != NULL) {
		for (i = 0; i < help->dthps_nprovs; i++) {
			dtrace_helper_provide(&help->dthps_provs[i]->dthp_prov,
			    help->dthps_pid);
		}

		next = help->dthps_next;
		help->dthps_next = NULL;
		help->dthps_prev = NULL;
		help->dthps_deferred = 0;
		help = next;
	}

	mutex_exit(&dtrace_meta_lock);

	return (0);
}

int
dtrace_meta_unregister(dtrace_meta_provider_id_t id)
{
	dtrace_meta_t **pp, *old = (dtrace_meta_t *)id;

	mutex_enter(&dtrace_meta_lock);
	mutex_enter(&dtrace_lock);

	if (old == dtrace_meta_pid) {
		pp = &dtrace_meta_pid;
	} else {
		panic("attempt to unregister non-existent "
		    "dtrace meta-provider %p\n", (void *)old);
	}

	if (old->dtm_count != 0) {
		mutex_exit(&dtrace_lock);
		mutex_exit(&dtrace_meta_lock);
		return (EBUSY);
	}

	*pp = NULL;

	mutex_exit(&dtrace_lock);
	mutex_exit(&dtrace_meta_lock);

	kmem_free(old->dtm_name, strlen(old->dtm_name) + 1);
	kmem_free(old, sizeof (dtrace_meta_t));

	return (0);
}

/*
 * DTrace DIF Object Functions
 */
static int
dtrace_difo_err(uint_t pc, const char *format, ...)
{
	if (dtrace_err_verbose) {
		va_list alist;

		(void) uprintf("dtrace DIF object error: [%u]: ", pc);
		va_start(alist, format);
		(void) vuprintf(format, alist);
		va_end(alist);
	}

#ifdef DTRACE_ERRDEBUG
	dtrace_errdebug(format);
#endif

	return (1);
}

/*
 * Validate a DTrace DIF object by checking the IR instructions.  The
 * following rules are currently enforced by dtrace_difo_validate():
 *
 * 1.
Each instruction must have a valid opcode * 2. Each register, string, variable, or subroutine reference must be valid * 3. No instruction can modify register %r0 (must be zero) * 4. All instruction reserved bits must be set to zero * 5. The last instruction must be a "ret" instruction * 6. All branch targets must reference a valid instruction _after_ the branch */ static int dtrace_difo_validate(dtrace_difo_t *dp, dtrace_vstate_t *vstate, uint_t nregs, cred_t *cr) { int err = 0, i; int (*efunc)(uint_t pc, const char *, ...) = dtrace_difo_err; int kcheckload; uint_t pc; int maxglobal = -1, maxlocal = -1, maxtlocal = -1; kcheckload = cr == NULL || (vstate->dtvs_state->dts_cred.dcr_visible & DTRACE_CRV_KERNEL) == 0; dp->dtdo_destructive = 0; for (pc = 0; pc < dp->dtdo_len && err == 0; pc++) { dif_instr_t instr = dp->dtdo_buf[pc]; uint_t r1 = DIF_INSTR_R1(instr); uint_t r2 = DIF_INSTR_R2(instr); uint_t rd = DIF_INSTR_RD(instr); uint_t rs = DIF_INSTR_RS(instr); uint_t label = DIF_INSTR_LABEL(instr); uint_t v = DIF_INSTR_VAR(instr); uint_t subr = DIF_INSTR_SUBR(instr); uint_t type = DIF_INSTR_TYPE(instr); uint_t op = DIF_INSTR_OP(instr); switch (op) { case DIF_OP_OR: case DIF_OP_XOR: case DIF_OP_AND: case DIF_OP_SLL: case DIF_OP_SRL: case DIF_OP_SRA: case DIF_OP_SUB: case DIF_OP_ADD: case DIF_OP_MUL: case DIF_OP_SDIV: case DIF_OP_UDIV: case DIF_OP_SREM: case DIF_OP_UREM: case DIF_OP_COPYS: if (r1 >= nregs) err += efunc(pc, "invalid register %u\n", r1); if (r2 >= nregs) err += efunc(pc, "invalid register %u\n", r2); if (rd >= nregs) err += efunc(pc, "invalid register %u\n", rd); if (rd == 0) err += efunc(pc, "cannot write to %r0\n"); break; case DIF_OP_NOT: case DIF_OP_MOV: case DIF_OP_ALLOCS: if (r1 >= nregs) err += efunc(pc, "invalid register %u\n", r1); if (r2 != 0) err += efunc(pc, "non-zero reserved bits\n"); if (rd >= nregs) err += efunc(pc, "invalid register %u\n", rd); if (rd == 0) err += efunc(pc, "cannot write to %r0\n"); break; case DIF_OP_LDSB: case DIF_OP_LDSH: case DIF_OP_LDSW: case DIF_OP_LDUB: case DIF_OP_LDUH: case DIF_OP_LDUW: case DIF_OP_LDX: if (r1 >= nregs) err += efunc(pc, "invalid register %u\n", r1); if (r2 != 0) err += efunc(pc, "non-zero reserved bits\n"); if (rd >= nregs) err += efunc(pc, "invalid register %u\n", rd); if (rd == 0) err += efunc(pc, "cannot write to %r0\n"); if (kcheckload) dp->dtdo_buf[pc] = DIF_INSTR_LOAD(op + DIF_OP_RLDSB - DIF_OP_LDSB, r1, rd); break; case DIF_OP_RLDSB: case DIF_OP_RLDSH: case DIF_OP_RLDSW: case DIF_OP_RLDUB: case DIF_OP_RLDUH: case DIF_OP_RLDUW: case DIF_OP_RLDX: if (r1 >= nregs) err += efunc(pc, "invalid register %u\n", r1); if (r2 != 0) err += efunc(pc, "non-zero reserved bits\n"); if (rd >= nregs) err += efunc(pc, "invalid register %u\n", rd); if (rd == 0) err += efunc(pc, "cannot write to %r0\n"); break; case DIF_OP_ULDSB: case DIF_OP_ULDSH: case DIF_OP_ULDSW: case DIF_OP_ULDUB: case DIF_OP_ULDUH: case DIF_OP_ULDUW: case DIF_OP_ULDX: if (r1 >= nregs) err += efunc(pc, "invalid register %u\n", r1); if (r2 != 0) err += efunc(pc, "non-zero reserved bits\n"); if (rd >= nregs) err += efunc(pc, "invalid register %u\n", rd); if (rd == 0) err += efunc(pc, "cannot write to %r0\n"); break; case DIF_OP_STB: case DIF_OP_STH: case DIF_OP_STW: case DIF_OP_STX: if (r1 >= nregs) err += efunc(pc, "invalid register %u\n", r1); if (r2 != 0) err += efunc(pc, "non-zero reserved bits\n"); if (rd >= nregs) err += efunc(pc, "invalid register %u\n", rd); if (rd == 0) err += efunc(pc, "cannot write to 0 address\n"); break; case DIF_OP_CMP: case 
DIF_OP_SCMP:
			if (r1 >= nregs)
				err += efunc(pc, "invalid register %u\n", r1);
			if (r2 >= nregs)
				err += efunc(pc, "invalid register %u\n", r2);
			if (rd != 0)
				err += efunc(pc, "non-zero reserved bits\n");
			break;
		case DIF_OP_TST:
			if (r1 >= nregs)
				err += efunc(pc, "invalid register %u\n", r1);
			if (r2 != 0 || rd != 0)
				err += efunc(pc, "non-zero reserved bits\n");
			break;
		case DIF_OP_BA:
		case DIF_OP_BE:
		case DIF_OP_BNE:
		case DIF_OP_BG:
		case DIF_OP_BGU:
		case DIF_OP_BGE:
		case DIF_OP_BGEU:
		case DIF_OP_BL:
		case DIF_OP_BLU:
		case DIF_OP_BLE:
		case DIF_OP_BLEU:
			if (label >= dp->dtdo_len) {
				err += efunc(pc, "invalid branch target %u\n",
				    label);
			}
			if (label <= pc) {
				err += efunc(pc, "backward branch to %u\n",
				    label);
			}
			break;
		case DIF_OP_RET:
			if (r1 != 0 || r2 != 0)
				err += efunc(pc, "non-zero reserved bits\n");
			if (rd >= nregs)
				err += efunc(pc, "invalid register %u\n", rd);
			break;
		case DIF_OP_NOP:
		case DIF_OP_POPTS:
		case DIF_OP_FLUSHTS:
			if (r1 != 0 || r2 != 0 || rd != 0)
				err += efunc(pc, "non-zero reserved bits\n");
			break;
		case DIF_OP_SETX:
			if (DIF_INSTR_INTEGER(instr) >= dp->dtdo_intlen) {
				err += efunc(pc, "invalid integer ref %u\n",
				    DIF_INSTR_INTEGER(instr));
			}
			if (rd >= nregs)
				err += efunc(pc, "invalid register %u\n", rd);
			if (rd == 0)
				err += efunc(pc, "cannot write to %r0\n");
			break;
		case DIF_OP_SETS:
			if (DIF_INSTR_STRING(instr) >= dp->dtdo_strlen) {
				err += efunc(pc, "invalid string ref %u\n",
				    DIF_INSTR_STRING(instr));
			}
			if (rd >= nregs)
				err += efunc(pc, "invalid register %u\n", rd);
			if (rd == 0)
				err += efunc(pc, "cannot write to %r0\n");
			break;
		case DIF_OP_LDGA:
		case DIF_OP_LDTA:
			if (r1 > DIF_VAR_ARRAY_MAX)
				err += efunc(pc, "invalid array %u\n", r1);
			if (r2 >= nregs)
				err += efunc(pc, "invalid register %u\n", r2);
			if (rd >= nregs)
				err += efunc(pc, "invalid register %u\n", rd);
			if (rd == 0)
				err += efunc(pc, "cannot write to %r0\n");
			break;
		case DIF_OP_LDGS:
		case DIF_OP_LDTS:
		case DIF_OP_LDLS:
		case DIF_OP_LDGAA:
		case DIF_OP_LDTAA:
			if (v < DIF_VAR_OTHER_MIN || v > DIF_VAR_OTHER_MAX)
				err += efunc(pc, "invalid variable %u\n", v);
			if (rd >= nregs)
				err += efunc(pc, "invalid register %u\n", rd);
			if (rd == 0)
				err += efunc(pc, "cannot write to %r0\n");
			break;
		case DIF_OP_STGS:
		case DIF_OP_STTS:
		case DIF_OP_STLS:
		case DIF_OP_STGAA:
		case DIF_OP_STTAA:
			if (v < DIF_VAR_OTHER_UBASE || v > DIF_VAR_OTHER_MAX)
				err += efunc(pc, "invalid variable %u\n", v);
			if (rs >= nregs)
				err += efunc(pc, "invalid register %u\n", rs);
			break;
		case DIF_OP_CALL:
			if (subr > DIF_SUBR_MAX)
				err += efunc(pc, "invalid subr %u\n", subr);
			if (rd >= nregs)
				err += efunc(pc, "invalid register %u\n", rd);
			if (rd == 0)
				err += efunc(pc, "cannot write to %r0\n");

			if (subr == DIF_SUBR_COPYOUT ||
			    subr == DIF_SUBR_COPYOUTSTR) {
				dp->dtdo_destructive = 1;
			}

			if (subr == DIF_SUBR_GETF) {
				/*
				 * If we have a getf() we need to record that
				 * in our state.  Note that our state can be
				 * NULL if this is a helper -- but in that
				 * case, the call to getf() is itself illegal,
				 * and will be caught (slightly later) when
				 * the helper is validated.
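				 *
				 * (Specifically, getf() is absent from the
				 * list of subroutines permitted by
				 * dtrace_difo_validate_helper(), so the
				 * helper will fail there with "invalid
				 * subr".)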
*/ if (vstate->dtvs_state != NULL) vstate->dtvs_state->dts_getf++; } break; case DIF_OP_PUSHTR: if (type != DIF_TYPE_STRING && type != DIF_TYPE_CTF) err += efunc(pc, "invalid ref type %u\n", type); if (r2 >= nregs) err += efunc(pc, "invalid register %u\n", r2); if (rs >= nregs) err += efunc(pc, "invalid register %u\n", rs); break; case DIF_OP_PUSHTV: if (type != DIF_TYPE_CTF) err += efunc(pc, "invalid val type %u\n", type); if (r2 >= nregs) err += efunc(pc, "invalid register %u\n", r2); if (rs >= nregs) err += efunc(pc, "invalid register %u\n", rs); break; default: err += efunc(pc, "invalid opcode %u\n", DIF_INSTR_OP(instr)); } } if (dp->dtdo_len != 0 && DIF_INSTR_OP(dp->dtdo_buf[dp->dtdo_len - 1]) != DIF_OP_RET) { err += efunc(dp->dtdo_len - 1, "expected 'ret' as last DIF instruction\n"); } if (!(dp->dtdo_rtype.dtdt_flags & (DIF_TF_BYREF | DIF_TF_BYUREF))) { /* * If we're not returning by reference, the size must be either * 0 or the size of one of the base types. */ switch (dp->dtdo_rtype.dtdt_size) { case 0: case sizeof (uint8_t): case sizeof (uint16_t): case sizeof (uint32_t): case sizeof (uint64_t): break; default: err += efunc(dp->dtdo_len - 1, "bad return size\n"); } } for (i = 0; i < dp->dtdo_varlen && err == 0; i++) { dtrace_difv_t *v = &dp->dtdo_vartab[i], *existing = NULL; dtrace_diftype_t *vt, *et; uint_t id, ndx; if (v->dtdv_scope != DIFV_SCOPE_GLOBAL && v->dtdv_scope != DIFV_SCOPE_THREAD && v->dtdv_scope != DIFV_SCOPE_LOCAL) { err += efunc(i, "unrecognized variable scope %d\n", v->dtdv_scope); break; } if (v->dtdv_kind != DIFV_KIND_ARRAY && v->dtdv_kind != DIFV_KIND_SCALAR) { err += efunc(i, "unrecognized variable type %d\n", v->dtdv_kind); break; } if ((id = v->dtdv_id) > DIF_VARIABLE_MAX) { err += efunc(i, "%d exceeds variable id limit\n", id); break; } if (id < DIF_VAR_OTHER_UBASE) continue; /* * For user-defined variables, we need to check that this * definition is identical to any previous definition that we * encountered. 
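		 *
		 * For example, if one enabling defines a user variable as a
		 * scalar and a later enabling reuses the same variable id
		 * with a different kind, different type flags, or a
		 * different type size, the comparisons below reject the
		 * DIFO with the corresponding "changed variable ..." error.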
		 */
		ndx = id - DIF_VAR_OTHER_UBASE;

		switch (v->dtdv_scope) {
		case DIFV_SCOPE_GLOBAL:
			if (maxglobal == -1 || ndx > maxglobal)
				maxglobal = ndx;

			if (ndx < vstate->dtvs_nglobals) {
				dtrace_statvar_t *svar;

				if ((svar = vstate->dtvs_globals[ndx]) != NULL)
					existing = &svar->dtsv_var;
			}

			break;

		case DIFV_SCOPE_THREAD:
			if (maxtlocal == -1 || ndx > maxtlocal)
				maxtlocal = ndx;

			if (ndx < vstate->dtvs_ntlocals)
				existing = &vstate->dtvs_tlocals[ndx];
			break;

		case DIFV_SCOPE_LOCAL:
			if (maxlocal == -1 || ndx > maxlocal)
				maxlocal = ndx;

			if (ndx < vstate->dtvs_nlocals) {
				dtrace_statvar_t *svar;

				if ((svar = vstate->dtvs_locals[ndx]) != NULL)
					existing = &svar->dtsv_var;
			}

			break;
		}

		vt = &v->dtdv_type;

		if (vt->dtdt_flags & DIF_TF_BYREF) {
			if (vt->dtdt_size == 0) {
				err += efunc(i, "zero-sized variable\n");
				break;
			}

			if ((v->dtdv_scope == DIFV_SCOPE_GLOBAL ||
			    v->dtdv_scope == DIFV_SCOPE_LOCAL) &&
			    vt->dtdt_size > dtrace_statvar_maxsize) {
				err += efunc(i, "oversized by-ref static\n");
				break;
			}
		}

		if (existing == NULL || existing->dtdv_id == 0)
			continue;

		ASSERT(existing->dtdv_id == v->dtdv_id);
		ASSERT(existing->dtdv_scope == v->dtdv_scope);

		if (existing->dtdv_kind != v->dtdv_kind)
			err += efunc(i, "%d changed variable kind\n", id);

		et = &existing->dtdv_type;

		if (vt->dtdt_flags != et->dtdt_flags) {
			err += efunc(i, "%d changed variable type flags\n", id);
			break;
		}

		if (vt->dtdt_size != 0 && vt->dtdt_size != et->dtdt_size) {
			err += efunc(i, "%d changed variable type size\n", id);
			break;
		}
	}

	for (pc = 0; pc < dp->dtdo_len && err == 0; pc++) {
		dif_instr_t instr = dp->dtdo_buf[pc];
		uint_t v = DIF_INSTR_VAR(instr);
		uint_t op = DIF_INSTR_OP(instr);

		switch (op) {
		case DIF_OP_LDGS:
		case DIF_OP_LDGAA:
		case DIF_OP_STGS:
		case DIF_OP_STGAA:
			if (v > DIF_VAR_OTHER_UBASE + maxglobal)
				err += efunc(pc, "invalid variable %u\n", v);
			break;
		case DIF_OP_LDTS:
		case DIF_OP_LDTAA:
		case DIF_OP_STTS:
		case DIF_OP_STTAA:
			if (v > DIF_VAR_OTHER_UBASE + maxtlocal)
				err += efunc(pc, "invalid variable %u\n", v);
			break;
		case DIF_OP_LDLS:
		case DIF_OP_STLS:
			if (v > DIF_VAR_OTHER_UBASE + maxlocal)
				err += efunc(pc, "invalid variable %u\n", v);
			break;
		default:
			break;
		}
	}

	return (err);
}

/*
 * Validate a DTrace DIF object that is to be used as a helper.  Helpers
 * are much more constrained than normal DIFOs.  Specifically, they may
 * not:
 *
 * 1. Make calls to subroutines other than copyin(), copyinstr() or
 *    miscellaneous string routines.
 * 2. Access DTrace variables other than the args[] array, and the
 *    curthread, pid, ppid, tid, execargs, execname, zonename, uid and
 *    gid variables.
 * 3. Have thread-local variables.
 * 4. Have dynamic variables.
 */
static int
dtrace_difo_validate_helper(dtrace_difo_t *dp)
{
	int (*efunc)(uint_t pc, const char *, ...)
= dtrace_difo_err; int err = 0; uint_t pc; for (pc = 0; pc < dp->dtdo_len; pc++) { dif_instr_t instr = dp->dtdo_buf[pc]; uint_t v = DIF_INSTR_VAR(instr); uint_t subr = DIF_INSTR_SUBR(instr); uint_t op = DIF_INSTR_OP(instr); switch (op) { case DIF_OP_OR: case DIF_OP_XOR: case DIF_OP_AND: case DIF_OP_SLL: case DIF_OP_SRL: case DIF_OP_SRA: case DIF_OP_SUB: case DIF_OP_ADD: case DIF_OP_MUL: case DIF_OP_SDIV: case DIF_OP_UDIV: case DIF_OP_SREM: case DIF_OP_UREM: case DIF_OP_COPYS: case DIF_OP_NOT: case DIF_OP_MOV: case DIF_OP_RLDSB: case DIF_OP_RLDSH: case DIF_OP_RLDSW: case DIF_OP_RLDUB: case DIF_OP_RLDUH: case DIF_OP_RLDUW: case DIF_OP_RLDX: case DIF_OP_ULDSB: case DIF_OP_ULDSH: case DIF_OP_ULDSW: case DIF_OP_ULDUB: case DIF_OP_ULDUH: case DIF_OP_ULDUW: case DIF_OP_ULDX: case DIF_OP_STB: case DIF_OP_STH: case DIF_OP_STW: case DIF_OP_STX: case DIF_OP_ALLOCS: case DIF_OP_CMP: case DIF_OP_SCMP: case DIF_OP_TST: case DIF_OP_BA: case DIF_OP_BE: case DIF_OP_BNE: case DIF_OP_BG: case DIF_OP_BGU: case DIF_OP_BGE: case DIF_OP_BGEU: case DIF_OP_BL: case DIF_OP_BLU: case DIF_OP_BLE: case DIF_OP_BLEU: case DIF_OP_RET: case DIF_OP_NOP: case DIF_OP_POPTS: case DIF_OP_FLUSHTS: case DIF_OP_SETX: case DIF_OP_SETS: case DIF_OP_LDGA: case DIF_OP_LDLS: case DIF_OP_STGS: case DIF_OP_STLS: case DIF_OP_PUSHTR: case DIF_OP_PUSHTV: break; case DIF_OP_LDGS: if (v >= DIF_VAR_OTHER_UBASE) break; if (v >= DIF_VAR_ARG0 && v <= DIF_VAR_ARG9) break; if (v == DIF_VAR_CURTHREAD || v == DIF_VAR_PID || v == DIF_VAR_PPID || v == DIF_VAR_TID || v == DIF_VAR_EXECARGS || v == DIF_VAR_EXECNAME || v == DIF_VAR_ZONENAME || v == DIF_VAR_UID || v == DIF_VAR_GID) break; err += efunc(pc, "illegal variable %u\n", v); break; case DIF_OP_LDTA: case DIF_OP_LDTS: case DIF_OP_LDGAA: case DIF_OP_LDTAA: err += efunc(pc, "illegal dynamic variable load\n"); break; case DIF_OP_STTS: case DIF_OP_STGAA: case DIF_OP_STTAA: err += efunc(pc, "illegal dynamic variable store\n"); break; case DIF_OP_CALL: if (subr == DIF_SUBR_ALLOCA || subr == DIF_SUBR_BCOPY || subr == DIF_SUBR_COPYIN || subr == DIF_SUBR_COPYINTO || subr == DIF_SUBR_COPYINSTR || subr == DIF_SUBR_INDEX || subr == DIF_SUBR_INET_NTOA || subr == DIF_SUBR_INET_NTOA6 || subr == DIF_SUBR_INET_NTOP || subr == DIF_SUBR_JSON || subr == DIF_SUBR_LLTOSTR || subr == DIF_SUBR_STRTOLL || subr == DIF_SUBR_RINDEX || subr == DIF_SUBR_STRCHR || subr == DIF_SUBR_STRJOIN || subr == DIF_SUBR_STRRCHR || subr == DIF_SUBR_STRSTR || subr == DIF_SUBR_HTONS || subr == DIF_SUBR_HTONL || subr == DIF_SUBR_HTONLL || subr == DIF_SUBR_NTOHS || subr == DIF_SUBR_NTOHL || subr == DIF_SUBR_NTOHLL || subr == DIF_SUBR_MEMREF) break; #ifdef __FreeBSD__ if (subr == DIF_SUBR_MEMSTR) break; #endif err += efunc(pc, "invalid subr %u\n", subr); break; default: err += efunc(pc, "invalid opcode %u\n", DIF_INSTR_OP(instr)); } } return (err); } /* * Returns 1 if the expression in the DIF object can be cached on a per-thread * basis; 0 if not. */ static int dtrace_difo_cacheable(dtrace_difo_t *dp) { int i; if (dp == NULL) return (0); for (i = 0; i < dp->dtdo_varlen; i++) { dtrace_difv_t *v = &dp->dtdo_vartab[i]; if (v->dtdv_scope != DIFV_SCOPE_GLOBAL) continue; switch (v->dtdv_id) { case DIF_VAR_CURTHREAD: case DIF_VAR_PID: case DIF_VAR_TID: case DIF_VAR_EXECARGS: case DIF_VAR_EXECNAME: case DIF_VAR_ZONENAME: break; default: return (0); } } /* * This DIF object may be cacheable. Now we need to look for any * array loading instructions, any memory loading instructions, or * any stores to thread-local variables. 
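	 *
	 * (As an illustrative reading: a predicate equivalent to
	 * "pid == 1234" can stay cacheable, since pid is in the list above
	 * and involves no loads, whereas anything that dereferences memory
	 * or indexes an array may evaluate differently on every firing and
	 * so must not be cached.)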
*/ for (i = 0; i < dp->dtdo_len; i++) { uint_t op = DIF_INSTR_OP(dp->dtdo_buf[i]); if ((op >= DIF_OP_LDSB && op <= DIF_OP_LDX) || (op >= DIF_OP_ULDSB && op <= DIF_OP_ULDX) || (op >= DIF_OP_RLDSB && op <= DIF_OP_RLDX) || op == DIF_OP_LDGA || op == DIF_OP_STTS) return (0); } return (1); } static void dtrace_difo_hold(dtrace_difo_t *dp) { int i; ASSERT(MUTEX_HELD(&dtrace_lock)); dp->dtdo_refcnt++; ASSERT(dp->dtdo_refcnt != 0); /* * We need to check this DIF object for references to the variable * DIF_VAR_VTIMESTAMP. */ for (i = 0; i < dp->dtdo_varlen; i++) { dtrace_difv_t *v = &dp->dtdo_vartab[i]; if (v->dtdv_id != DIF_VAR_VTIMESTAMP) continue; if (dtrace_vtime_references++ == 0) dtrace_vtime_enable(); } } /* * This routine calculates the dynamic variable chunksize for a given DIF * object. The calculation is not fool-proof, and can probably be tricked by * malicious DIF -- but it works for all compiler-generated DIF. Because this * calculation is likely imperfect, dtrace_dynvar() is able to gracefully fail * if a dynamic variable size exceeds the chunksize. */ static void dtrace_difo_chunksize(dtrace_difo_t *dp, dtrace_vstate_t *vstate) { uint64_t sval = 0; dtrace_key_t tupregs[DIF_DTR_NREGS + 2]; /* +2 for thread and id */ const dif_instr_t *text = dp->dtdo_buf; uint_t pc, srd = 0; uint_t ttop = 0; size_t size, ksize; uint_t id, i; for (pc = 0; pc < dp->dtdo_len; pc++) { dif_instr_t instr = text[pc]; uint_t op = DIF_INSTR_OP(instr); uint_t rd = DIF_INSTR_RD(instr); uint_t r1 = DIF_INSTR_R1(instr); uint_t nkeys = 0; uchar_t scope = 0; dtrace_key_t *key = tupregs; switch (op) { case DIF_OP_SETX: sval = dp->dtdo_inttab[DIF_INSTR_INTEGER(instr)]; srd = rd; continue; case DIF_OP_STTS: key = &tupregs[DIF_DTR_NREGS]; key[0].dttk_size = 0; key[1].dttk_size = 0; nkeys = 2; scope = DIFV_SCOPE_THREAD; break; case DIF_OP_STGAA: case DIF_OP_STTAA: nkeys = ttop; if (DIF_INSTR_OP(instr) == DIF_OP_STTAA) key[nkeys++].dttk_size = 0; key[nkeys++].dttk_size = 0; if (op == DIF_OP_STTAA) { scope = DIFV_SCOPE_THREAD; } else { scope = DIFV_SCOPE_GLOBAL; } break; case DIF_OP_PUSHTR: if (ttop == DIF_DTR_NREGS) return; if ((srd == 0 || sval == 0) && r1 == DIF_TYPE_STRING) { /* * If the register for the size of the "pushtr" * is %r0 (or the value is 0) and the type is * a string, we'll use the system-wide default * string size. */ tupregs[ttop++].dttk_size = dtrace_strsize_default; } else { if (srd == 0) return; if (sval > LONG_MAX) return; tupregs[ttop++].dttk_size = sval; } break; case DIF_OP_PUSHTV: if (ttop == DIF_DTR_NREGS) return; tupregs[ttop++].dttk_size = 0; break; case DIF_OP_FLUSHTS: ttop = 0; break; case DIF_OP_POPTS: if (ttop != 0) ttop--; break; } sval = 0; srd = 0; if (nkeys == 0) continue; /* * We have a dynamic variable allocation; calculate its size. */ for (ksize = 0, i = 0; i < nkeys; i++) ksize += P2ROUNDUP(key[i].dttk_size, sizeof (uint64_t)); size = sizeof (dtrace_dynvar_t); size += sizeof (dtrace_key_t) * (nkeys - 1); size += ksize; /* * Now we need to determine the size of the stored data. */ id = DIF_INSTR_VAR(instr); for (i = 0; i < dp->dtdo_varlen; i++) { dtrace_difv_t *v = &dp->dtdo_vartab[i]; if (v->dtdv_id == id && v->dtdv_scope == scope) { size += v->dtdv_type.dtdt_size; break; } } if (i == dp->dtdo_varlen) return; /* * We have the size. If this is larger than the chunk size * for our dynamic variable state, reset the chunk size. */ size = P2ROUNDUP(size, sizeof (uint64_t)); /* * Before setting the chunk size, check that we're not going * to set it to a negative value... 
*/ if (size > LONG_MAX) return; /* * ...and make certain that we didn't badly overflow. */ if (size < ksize || size < sizeof (dtrace_dynvar_t)) return; if (size > vstate->dtvs_dynvars.dtds_chunksize) vstate->dtvs_dynvars.dtds_chunksize = size; } } static void dtrace_difo_init(dtrace_difo_t *dp, dtrace_vstate_t *vstate) { int i, oldsvars, osz, nsz, otlocals, ntlocals; uint_t id; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(dp->dtdo_buf != NULL && dp->dtdo_len != 0); for (i = 0; i < dp->dtdo_varlen; i++) { dtrace_difv_t *v = &dp->dtdo_vartab[i]; dtrace_statvar_t *svar, ***svarp = NULL; size_t dsize = 0; uint8_t scope = v->dtdv_scope; int *np = NULL; if ((id = v->dtdv_id) < DIF_VAR_OTHER_UBASE) continue; id -= DIF_VAR_OTHER_UBASE; switch (scope) { case DIFV_SCOPE_THREAD: while (id >= (otlocals = vstate->dtvs_ntlocals)) { dtrace_difv_t *tlocals; if ((ntlocals = (otlocals << 1)) == 0) ntlocals = 1; osz = otlocals * sizeof (dtrace_difv_t); nsz = ntlocals * sizeof (dtrace_difv_t); tlocals = kmem_zalloc(nsz, KM_SLEEP); if (osz != 0) { bcopy(vstate->dtvs_tlocals, tlocals, osz); kmem_free(vstate->dtvs_tlocals, osz); } vstate->dtvs_tlocals = tlocals; vstate->dtvs_ntlocals = ntlocals; } vstate->dtvs_tlocals[id] = *v; continue; case DIFV_SCOPE_LOCAL: np = &vstate->dtvs_nlocals; svarp = &vstate->dtvs_locals; if (v->dtdv_type.dtdt_flags & DIF_TF_BYREF) dsize = NCPU * (v->dtdv_type.dtdt_size + sizeof (uint64_t)); else dsize = NCPU * sizeof (uint64_t); break; case DIFV_SCOPE_GLOBAL: np = &vstate->dtvs_nglobals; svarp = &vstate->dtvs_globals; if (v->dtdv_type.dtdt_flags & DIF_TF_BYREF) dsize = v->dtdv_type.dtdt_size + sizeof (uint64_t); break; default: ASSERT(0); } while (id >= (oldsvars = *np)) { dtrace_statvar_t **statics; int newsvars, oldsize, newsize; if ((newsvars = (oldsvars << 1)) == 0) newsvars = 1; oldsize = oldsvars * sizeof (dtrace_statvar_t *); newsize = newsvars * sizeof (dtrace_statvar_t *); statics = kmem_zalloc(newsize, KM_SLEEP); if (oldsize != 0) { bcopy(*svarp, statics, oldsize); kmem_free(*svarp, oldsize); } *svarp = statics; *np = newsvars; } if ((svar = (*svarp)[id]) == NULL) { svar = kmem_zalloc(sizeof (dtrace_statvar_t), KM_SLEEP); svar->dtsv_var = *v; if ((svar->dtsv_size = dsize) != 0) { svar->dtsv_data = (uint64_t)(uintptr_t) kmem_zalloc(dsize, KM_SLEEP); } (*svarp)[id] = svar; } svar->dtsv_refcnt++; } dtrace_difo_chunksize(dp, vstate); dtrace_difo_hold(dp); } static dtrace_difo_t * dtrace_difo_duplicate(dtrace_difo_t *dp, dtrace_vstate_t *vstate) { dtrace_difo_t *new; size_t sz; ASSERT(dp->dtdo_buf != NULL); ASSERT(dp->dtdo_refcnt != 0); new = kmem_zalloc(sizeof (dtrace_difo_t), KM_SLEEP); ASSERT(dp->dtdo_buf != NULL); sz = dp->dtdo_len * sizeof (dif_instr_t); new->dtdo_buf = kmem_alloc(sz, KM_SLEEP); bcopy(dp->dtdo_buf, new->dtdo_buf, sz); new->dtdo_len = dp->dtdo_len; if (dp->dtdo_strtab != NULL) { ASSERT(dp->dtdo_strlen != 0); new->dtdo_strtab = kmem_alloc(dp->dtdo_strlen, KM_SLEEP); bcopy(dp->dtdo_strtab, new->dtdo_strtab, dp->dtdo_strlen); new->dtdo_strlen = dp->dtdo_strlen; } if (dp->dtdo_inttab != NULL) { ASSERT(dp->dtdo_intlen != 0); sz = dp->dtdo_intlen * sizeof (uint64_t); new->dtdo_inttab = kmem_alloc(sz, KM_SLEEP); bcopy(dp->dtdo_inttab, new->dtdo_inttab, sz); new->dtdo_intlen = dp->dtdo_intlen; } if (dp->dtdo_vartab != NULL) { ASSERT(dp->dtdo_varlen != 0); sz = dp->dtdo_varlen * sizeof (dtrace_difv_t); new->dtdo_vartab = kmem_alloc(sz, KM_SLEEP); bcopy(dp->dtdo_vartab, new->dtdo_vartab, sz); new->dtdo_varlen = dp->dtdo_varlen; } dtrace_difo_init(new, vstate); return 
(new); } static void dtrace_difo_destroy(dtrace_difo_t *dp, dtrace_vstate_t *vstate) { int i; ASSERT(dp->dtdo_refcnt == 0); for (i = 0; i < dp->dtdo_varlen; i++) { dtrace_difv_t *v = &dp->dtdo_vartab[i]; dtrace_statvar_t *svar, **svarp = NULL; uint_t id; uint8_t scope = v->dtdv_scope; int *np = NULL; switch (scope) { case DIFV_SCOPE_THREAD: continue; case DIFV_SCOPE_LOCAL: np = &vstate->dtvs_nlocals; svarp = vstate->dtvs_locals; break; case DIFV_SCOPE_GLOBAL: np = &vstate->dtvs_nglobals; svarp = vstate->dtvs_globals; break; default: ASSERT(0); } if ((id = v->dtdv_id) < DIF_VAR_OTHER_UBASE) continue; id -= DIF_VAR_OTHER_UBASE; ASSERT(id < *np); svar = svarp[id]; ASSERT(svar != NULL); ASSERT(svar->dtsv_refcnt > 0); if (--svar->dtsv_refcnt > 0) continue; if (svar->dtsv_size != 0) { ASSERT(svar->dtsv_data != 0); kmem_free((void *)(uintptr_t)svar->dtsv_data, svar->dtsv_size); } kmem_free(svar, sizeof (dtrace_statvar_t)); svarp[id] = NULL; } if (dp->dtdo_buf != NULL) kmem_free(dp->dtdo_buf, dp->dtdo_len * sizeof (dif_instr_t)); if (dp->dtdo_inttab != NULL) kmem_free(dp->dtdo_inttab, dp->dtdo_intlen * sizeof (uint64_t)); if (dp->dtdo_strtab != NULL) kmem_free(dp->dtdo_strtab, dp->dtdo_strlen); if (dp->dtdo_vartab != NULL) kmem_free(dp->dtdo_vartab, dp->dtdo_varlen * sizeof (dtrace_difv_t)); kmem_free(dp, sizeof (dtrace_difo_t)); } static void dtrace_difo_release(dtrace_difo_t *dp, dtrace_vstate_t *vstate) { int i; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(dp->dtdo_refcnt != 0); for (i = 0; i < dp->dtdo_varlen; i++) { dtrace_difv_t *v = &dp->dtdo_vartab[i]; if (v->dtdv_id != DIF_VAR_VTIMESTAMP) continue; ASSERT(dtrace_vtime_references > 0); if (--dtrace_vtime_references == 0) dtrace_vtime_disable(); } if (--dp->dtdo_refcnt == 0) dtrace_difo_destroy(dp, vstate); } /* * DTrace Format Functions */ static uint16_t dtrace_format_add(dtrace_state_t *state, char *str) { char *fmt, **new; uint16_t ndx, len = strlen(str) + 1; fmt = kmem_zalloc(len, KM_SLEEP); bcopy(str, fmt, len); for (ndx = 0; ndx < state->dts_nformats; ndx++) { if (state->dts_formats[ndx] == NULL) { state->dts_formats[ndx] = fmt; return (ndx + 1); } } if (state->dts_nformats == USHRT_MAX) { /* * This is only likely if a denial-of-service attack is being * attempted. As such, it's okay to fail silently here. */ kmem_free(fmt, len); return (0); } /* * For simplicity, we always resize the formats array to be exactly the * number of formats. 
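	 *
	 * (Growing one slot at a time makes repeated registration quadratic
	 * in the number of formats, which is acceptable here: formats are
	 * only added while enablings are being set up, never from probe
	 * context.)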
*/ ndx = state->dts_nformats++; new = kmem_alloc((ndx + 1) * sizeof (char *), KM_SLEEP); if (state->dts_formats != NULL) { ASSERT(ndx != 0); bcopy(state->dts_formats, new, ndx * sizeof (char *)); kmem_free(state->dts_formats, ndx * sizeof (char *)); } state->dts_formats = new; state->dts_formats[ndx] = fmt; return (ndx + 1); } static void dtrace_format_remove(dtrace_state_t *state, uint16_t format) { char *fmt; ASSERT(state->dts_formats != NULL); ASSERT(format <= state->dts_nformats); ASSERT(state->dts_formats[format - 1] != NULL); fmt = state->dts_formats[format - 1]; kmem_free(fmt, strlen(fmt) + 1); state->dts_formats[format - 1] = NULL; } static void dtrace_format_destroy(dtrace_state_t *state) { int i; if (state->dts_nformats == 0) { ASSERT(state->dts_formats == NULL); return; } ASSERT(state->dts_formats != NULL); for (i = 0; i < state->dts_nformats; i++) { char *fmt = state->dts_formats[i]; if (fmt == NULL) continue; kmem_free(fmt, strlen(fmt) + 1); } kmem_free(state->dts_formats, state->dts_nformats * sizeof (char *)); state->dts_nformats = 0; state->dts_formats = NULL; } /* * DTrace Predicate Functions */ static dtrace_predicate_t * dtrace_predicate_create(dtrace_difo_t *dp) { dtrace_predicate_t *pred; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(dp->dtdo_refcnt != 0); pred = kmem_zalloc(sizeof (dtrace_predicate_t), KM_SLEEP); pred->dtp_difo = dp; pred->dtp_refcnt = 1; if (!dtrace_difo_cacheable(dp)) return (pred); if (dtrace_predcache_id == DTRACE_CACHEIDNONE) { /* * This is only theoretically possible -- we have had 2^32 * cacheable predicates on this machine. We cannot allow any * more predicates to become cacheable: as unlikely as it is, * there may be a thread caching a (now stale) predicate cache * ID. (N.B.: the temptation is being successfully resisted to * have this cmn_err() "Holy shit -- we executed this code!") */ return (pred); } pred->dtp_cacheid = dtrace_predcache_id++; return (pred); } static void dtrace_predicate_hold(dtrace_predicate_t *pred) { ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(pred->dtp_difo != NULL && pred->dtp_difo->dtdo_refcnt != 0); ASSERT(pred->dtp_refcnt > 0); pred->dtp_refcnt++; } static void dtrace_predicate_release(dtrace_predicate_t *pred, dtrace_vstate_t *vstate) { dtrace_difo_t *dp = pred->dtp_difo; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(dp != NULL && dp->dtdo_refcnt != 0); ASSERT(pred->dtp_refcnt > 0); if (--pred->dtp_refcnt == 0) { dtrace_difo_release(pred->dtp_difo, vstate); kmem_free(pred, sizeof (dtrace_predicate_t)); } } /* * DTrace Action Description Functions */ static dtrace_actdesc_t * dtrace_actdesc_create(dtrace_actkind_t kind, uint32_t ntuple, uint64_t uarg, uint64_t arg) { dtrace_actdesc_t *act; #ifdef illumos ASSERT(!DTRACEACT_ISPRINTFLIKE(kind) || (arg != NULL && arg >= KERNELBASE) || (arg == NULL && kind == DTRACEACT_PRINTA)); #endif act = kmem_zalloc(sizeof (dtrace_actdesc_t), KM_SLEEP); act->dtad_kind = kind; act->dtad_ntuple = ntuple; act->dtad_uarg = uarg; act->dtad_arg = arg; act->dtad_refcnt = 1; return (act); } static void dtrace_actdesc_hold(dtrace_actdesc_t *act) { ASSERT(act->dtad_refcnt >= 1); act->dtad_refcnt++; } static void dtrace_actdesc_release(dtrace_actdesc_t *act, dtrace_vstate_t *vstate) { dtrace_actkind_t kind = act->dtad_kind; dtrace_difo_t *dp; ASSERT(act->dtad_refcnt >= 1); if (--act->dtad_refcnt != 0) return; if ((dp = act->dtad_difo) != NULL) dtrace_difo_release(dp, vstate); if (DTRACEACT_ISPRINTFLIKE(kind)) { char *str = (char *)(uintptr_t)act->dtad_arg; #ifdef illumos ASSERT((str != NULL && 
(uintptr_t)str >= KERNELBASE) || (str == NULL && act->dtad_kind == DTRACEACT_PRINTA)); #endif if (str != NULL) kmem_free(str, strlen(str) + 1); } kmem_free(act, sizeof (dtrace_actdesc_t)); } /* * DTrace ECB Functions */ static dtrace_ecb_t * dtrace_ecb_add(dtrace_state_t *state, dtrace_probe_t *probe) { dtrace_ecb_t *ecb; dtrace_epid_t epid; ASSERT(MUTEX_HELD(&dtrace_lock)); ecb = kmem_zalloc(sizeof (dtrace_ecb_t), KM_SLEEP); ecb->dte_predicate = NULL; ecb->dte_probe = probe; /* * The default size is the size of the default action: recording * the header. */ ecb->dte_size = ecb->dte_needed = sizeof (dtrace_rechdr_t); ecb->dte_alignment = sizeof (dtrace_epid_t); epid = state->dts_epid++; if (epid - 1 >= state->dts_necbs) { dtrace_ecb_t **oecbs = state->dts_ecbs, **ecbs; int necbs = state->dts_necbs << 1; ASSERT(epid == state->dts_necbs + 1); if (necbs == 0) { ASSERT(oecbs == NULL); necbs = 1; } ecbs = kmem_zalloc(necbs * sizeof (*ecbs), KM_SLEEP); if (oecbs != NULL) bcopy(oecbs, ecbs, state->dts_necbs * sizeof (*ecbs)); dtrace_membar_producer(); state->dts_ecbs = ecbs; if (oecbs != NULL) { /* * If this state is active, we must dtrace_sync() * before we can free the old dts_ecbs array: we're * coming in hot, and there may be active ring * buffer processing (which indexes into the dts_ecbs * array) on another CPU. */ if (state->dts_activity != DTRACE_ACTIVITY_INACTIVE) dtrace_sync(); kmem_free(oecbs, state->dts_necbs * sizeof (*ecbs)); } dtrace_membar_producer(); state->dts_necbs = necbs; } ecb->dte_state = state; ASSERT(state->dts_ecbs[epid - 1] == NULL); dtrace_membar_producer(); state->dts_ecbs[(ecb->dte_epid = epid) - 1] = ecb; return (ecb); } static void dtrace_ecb_enable(dtrace_ecb_t *ecb) { dtrace_probe_t *probe = ecb->dte_probe; ASSERT(MUTEX_HELD(&cpu_lock)); ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(ecb->dte_next == NULL); if (probe == NULL) { /* * This is the NULL probe -- there's nothing to do. */ return; } if (probe->dtpr_ecb == NULL) { dtrace_provider_t *prov = probe->dtpr_provider; /* * We're the first ECB on this probe. */ probe->dtpr_ecb = probe->dtpr_ecb_last = ecb; if (ecb->dte_predicate != NULL) probe->dtpr_predcache = ecb->dte_predicate->dtp_cacheid; prov->dtpv_pops.dtps_enable(prov->dtpv_arg, probe->dtpr_id, probe->dtpr_arg); } else { /* * This probe is already active. Swing the last pointer to * point to the new ECB, and issue a dtrace_sync() to assure * that all CPUs have seen the change. */ ASSERT(probe->dtpr_ecb_last != NULL); probe->dtpr_ecb_last->dte_next = ecb; probe->dtpr_ecb_last = ecb; probe->dtpr_predcache = 0; dtrace_sync(); } } static int dtrace_ecb_resize(dtrace_ecb_t *ecb) { dtrace_action_t *act; uint32_t curneeded = UINT32_MAX; uint32_t aggbase = UINT32_MAX; /* * If we record anything, we always record the dtrace_rechdr_t. (And * we always record it first.) 
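	 *
	 * Schematically, the record layout assembled below is:
	 *
	 *	+-----------------+----------+----------+-----+----------+
	 *	| dtrace_rechdr_t | action 0 | action 1 | ... | action n |
	 *	+-----------------+----------+----------+-----+----------+
	 *
	 * with each action's payload placed at the dtrd_offset assigned to
	 * its record description.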
*/ ecb->dte_size = sizeof (dtrace_rechdr_t); ecb->dte_alignment = sizeof (dtrace_epid_t); for (act = ecb->dte_action; act != NULL; act = act->dta_next) { dtrace_recdesc_t *rec = &act->dta_rec; ASSERT(rec->dtrd_size > 0 || rec->dtrd_alignment == 1); ecb->dte_alignment = MAX(ecb->dte_alignment, rec->dtrd_alignment); if (DTRACEACT_ISAGG(act->dta_kind)) { dtrace_aggregation_t *agg = (dtrace_aggregation_t *)act; ASSERT(rec->dtrd_size != 0); ASSERT(agg->dtag_first != NULL); ASSERT(act->dta_prev->dta_intuple); ASSERT(aggbase != UINT32_MAX); ASSERT(curneeded != UINT32_MAX); agg->dtag_base = aggbase; curneeded = P2ROUNDUP(curneeded, rec->dtrd_alignment); rec->dtrd_offset = curneeded; if (curneeded + rec->dtrd_size < curneeded) return (EINVAL); curneeded += rec->dtrd_size; ecb->dte_needed = MAX(ecb->dte_needed, curneeded); aggbase = UINT32_MAX; curneeded = UINT32_MAX; } else if (act->dta_intuple) { if (curneeded == UINT32_MAX) { /* * This is the first record in a tuple. Align * curneeded to be at offset 4 in an 8-byte * aligned block. */ ASSERT(act->dta_prev == NULL || !act->dta_prev->dta_intuple); ASSERT3U(aggbase, ==, UINT32_MAX); curneeded = P2PHASEUP(ecb->dte_size, sizeof (uint64_t), sizeof (dtrace_aggid_t)); aggbase = curneeded - sizeof (dtrace_aggid_t); ASSERT(IS_P2ALIGNED(aggbase, sizeof (uint64_t))); } curneeded = P2ROUNDUP(curneeded, rec->dtrd_alignment); rec->dtrd_offset = curneeded; if (curneeded + rec->dtrd_size < curneeded) return (EINVAL); curneeded += rec->dtrd_size; } else { /* tuples must be followed by an aggregation */ ASSERT(act->dta_prev == NULL || !act->dta_prev->dta_intuple); ecb->dte_size = P2ROUNDUP(ecb->dte_size, rec->dtrd_alignment); rec->dtrd_offset = ecb->dte_size; if (ecb->dte_size + rec->dtrd_size < ecb->dte_size) return (EINVAL); ecb->dte_size += rec->dtrd_size; ecb->dte_needed = MAX(ecb->dte_needed, ecb->dte_size); } } if ((act = ecb->dte_action) != NULL && !(act->dta_kind == DTRACEACT_SPECULATE && act->dta_next == NULL) && ecb->dte_size == sizeof (dtrace_rechdr_t)) { /* * If the size is still sizeof (dtrace_rechdr_t), then all * actions store no data; set the size to 0. 
*/ ecb->dte_size = 0; } ecb->dte_size = P2ROUNDUP(ecb->dte_size, sizeof (dtrace_epid_t)); ecb->dte_needed = P2ROUNDUP(ecb->dte_needed, (sizeof (dtrace_epid_t))); ecb->dte_state->dts_needed = MAX(ecb->dte_state->dts_needed, ecb->dte_needed); return (0); } static dtrace_action_t * dtrace_ecb_aggregation_create(dtrace_ecb_t *ecb, dtrace_actdesc_t *desc) { dtrace_aggregation_t *agg; size_t size = sizeof (uint64_t); int ntuple = desc->dtad_ntuple; dtrace_action_t *act; dtrace_recdesc_t *frec; dtrace_aggid_t aggid; dtrace_state_t *state = ecb->dte_state; agg = kmem_zalloc(sizeof (dtrace_aggregation_t), KM_SLEEP); agg->dtag_ecb = ecb; ASSERT(DTRACEACT_ISAGG(desc->dtad_kind)); switch (desc->dtad_kind) { case DTRACEAGG_MIN: agg->dtag_initial = INT64_MAX; agg->dtag_aggregate = dtrace_aggregate_min; break; case DTRACEAGG_MAX: agg->dtag_initial = INT64_MIN; agg->dtag_aggregate = dtrace_aggregate_max; break; case DTRACEAGG_COUNT: agg->dtag_aggregate = dtrace_aggregate_count; break; case DTRACEAGG_QUANTIZE: agg->dtag_aggregate = dtrace_aggregate_quantize; size = (((sizeof (uint64_t) * NBBY) - 1) * 2 + 1) * sizeof (uint64_t); break; case DTRACEAGG_LQUANTIZE: { uint16_t step = DTRACE_LQUANTIZE_STEP(desc->dtad_arg); uint16_t levels = DTRACE_LQUANTIZE_LEVELS(desc->dtad_arg); agg->dtag_initial = desc->dtad_arg; agg->dtag_aggregate = dtrace_aggregate_lquantize; if (step == 0 || levels == 0) goto err; size = levels * sizeof (uint64_t) + 3 * sizeof (uint64_t); break; } case DTRACEAGG_LLQUANTIZE: { uint16_t factor = DTRACE_LLQUANTIZE_FACTOR(desc->dtad_arg); uint16_t low = DTRACE_LLQUANTIZE_LOW(desc->dtad_arg); uint16_t high = DTRACE_LLQUANTIZE_HIGH(desc->dtad_arg); uint16_t nsteps = DTRACE_LLQUANTIZE_NSTEP(desc->dtad_arg); int64_t v; agg->dtag_initial = desc->dtad_arg; agg->dtag_aggregate = dtrace_aggregate_llquantize; if (factor < 2 || low >= high || nsteps < factor) goto err; /* * Now check that the number of steps evenly divides a power * of the factor. (This assures both integer bucket size and * linearity within each magnitude.) */ for (v = factor; v < nsteps; v *= factor) continue; if ((v % nsteps) || (nsteps % factor)) goto err; size = (dtrace_aggregate_llquantize_bucket(factor, low, high, nsteps, INT64_MAX) + 2) * sizeof (uint64_t); break; } case DTRACEAGG_AVG: agg->dtag_aggregate = dtrace_aggregate_avg; size = sizeof (uint64_t) * 2; break; case DTRACEAGG_STDDEV: agg->dtag_aggregate = dtrace_aggregate_stddev; size = sizeof (uint64_t) * 4; break; case DTRACEAGG_SUM: agg->dtag_aggregate = dtrace_aggregate_sum; break; default: goto err; } agg->dtag_action.dta_rec.dtrd_size = size; if (ntuple == 0) goto err; /* * We must make sure that we have enough actions for the n-tuple. */ for (act = ecb->dte_action_last; act != NULL; act = act->dta_prev) { if (DTRACEACT_ISAGG(act->dta_kind)) break; if (--ntuple == 0) { /* * This is the action with which our n-tuple begins. */ agg->dtag_first = act; goto success; } } /* * This n-tuple is short by ntuple elements. Return failure. */ ASSERT(ntuple != 0); err: kmem_free(agg, sizeof (dtrace_aggregation_t)); return (NULL); success: /* * If the last action in the tuple has a size of zero, it's actually * an expression argument for the aggregating action. */ ASSERT(ecb->dte_action_last != NULL); act = ecb->dte_action_last; if (act->dta_kind == DTRACEACT_DIFEXPR) { ASSERT(act->dta_difo != NULL); if (act->dta_difo->dtdo_rtype.dtdt_size == 0) agg->dtag_hasarg = 1; } /* * We need to allocate an id for this aggregation. 
*/ #ifdef illumos aggid = (dtrace_aggid_t)(uintptr_t)vmem_alloc(state->dts_aggid_arena, 1, VM_BESTFIT | VM_SLEEP); #else aggid = alloc_unr(state->dts_aggid_arena); #endif if (aggid - 1 >= state->dts_naggregations) { dtrace_aggregation_t **oaggs = state->dts_aggregations; dtrace_aggregation_t **aggs; int naggs = state->dts_naggregations << 1; int onaggs = state->dts_naggregations; ASSERT(aggid == state->dts_naggregations + 1); if (naggs == 0) { ASSERT(oaggs == NULL); naggs = 1; } aggs = kmem_zalloc(naggs * sizeof (*aggs), KM_SLEEP); if (oaggs != NULL) { bcopy(oaggs, aggs, onaggs * sizeof (*aggs)); kmem_free(oaggs, onaggs * sizeof (*aggs)); } state->dts_aggregations = aggs; state->dts_naggregations = naggs; } ASSERT(state->dts_aggregations[aggid - 1] == NULL); state->dts_aggregations[(agg->dtag_id = aggid) - 1] = agg; frec = &agg->dtag_first->dta_rec; if (frec->dtrd_alignment < sizeof (dtrace_aggid_t)) frec->dtrd_alignment = sizeof (dtrace_aggid_t); for (act = agg->dtag_first; act != NULL; act = act->dta_next) { ASSERT(!act->dta_intuple); act->dta_intuple = 1; } return (&agg->dtag_action); } static void dtrace_ecb_aggregation_destroy(dtrace_ecb_t *ecb, dtrace_action_t *act) { dtrace_aggregation_t *agg = (dtrace_aggregation_t *)act; dtrace_state_t *state = ecb->dte_state; dtrace_aggid_t aggid = agg->dtag_id; ASSERT(DTRACEACT_ISAGG(act->dta_kind)); #ifdef illumos vmem_free(state->dts_aggid_arena, (void *)(uintptr_t)aggid, 1); #else free_unr(state->dts_aggid_arena, aggid); #endif ASSERT(state->dts_aggregations[aggid - 1] == agg); state->dts_aggregations[aggid - 1] = NULL; kmem_free(agg, sizeof (dtrace_aggregation_t)); } static int dtrace_ecb_action_add(dtrace_ecb_t *ecb, dtrace_actdesc_t *desc) { dtrace_action_t *action, *last; dtrace_difo_t *dp = desc->dtad_difo; uint32_t size = 0, align = sizeof (uint8_t), mask; uint16_t format = 0; dtrace_recdesc_t *rec; dtrace_state_t *state = ecb->dte_state; dtrace_optval_t *opt = state->dts_options, nframes = 0, strsize; uint64_t arg = desc->dtad_arg; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(ecb->dte_action == NULL || ecb->dte_action->dta_refcnt == 1); if (DTRACEACT_ISAGG(desc->dtad_kind)) { /* * If this is an aggregating action, there must be neither * a speculate nor a commit on the action chain. */ dtrace_action_t *act; for (act = ecb->dte_action; act != NULL; act = act->dta_next) { if (act->dta_kind == DTRACEACT_COMMIT) return (EINVAL); if (act->dta_kind == DTRACEACT_SPECULATE) return (EINVAL); } action = dtrace_ecb_aggregation_create(ecb, desc); if (action == NULL) return (EINVAL); } else { if (DTRACEACT_ISDESTRUCTIVE(desc->dtad_kind) || (desc->dtad_kind == DTRACEACT_DIFEXPR && dp != NULL && dp->dtdo_destructive)) { state->dts_destructive = 1; } switch (desc->dtad_kind) { case DTRACEACT_PRINTF: case DTRACEACT_PRINTA: case DTRACEACT_SYSTEM: case DTRACEACT_FREOPEN: case DTRACEACT_DIFEXPR: /* * We know that our arg is a string -- turn it into a * format. 
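 *
 * For illustration: for an action like printf("%d bytes", arg0), the
 * compiler passes the format string's address in dtad_arg;
 * dtrace_format_add() copies it into the state's format table, and the
 * record then carries only the small index (dtrd_format), not the
 * string itself.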
*/ if (arg == 0) { ASSERT(desc->dtad_kind == DTRACEACT_PRINTA || desc->dtad_kind == DTRACEACT_DIFEXPR); format = 0; } else { ASSERT(arg != 0); #ifdef illumos ASSERT(arg > KERNELBASE); #endif format = dtrace_format_add(state, (char *)(uintptr_t)arg); } /*FALLTHROUGH*/ case DTRACEACT_LIBACT: case DTRACEACT_TRACEMEM: case DTRACEACT_TRACEMEM_DYNSIZE: if (dp == NULL) return (EINVAL); if ((size = dp->dtdo_rtype.dtdt_size) != 0) break; if (dp->dtdo_rtype.dtdt_kind == DIF_TYPE_STRING) { if (!(dp->dtdo_rtype.dtdt_flags & DIF_TF_BYREF)) return (EINVAL); size = opt[DTRACEOPT_STRSIZE]; } break; case DTRACEACT_STACK: if ((nframes = arg) == 0) { nframes = opt[DTRACEOPT_STACKFRAMES]; ASSERT(nframes > 0); arg = nframes; } size = nframes * sizeof (pc_t); break; case DTRACEACT_JSTACK: if ((strsize = DTRACE_USTACK_STRSIZE(arg)) == 0) strsize = opt[DTRACEOPT_JSTACKSTRSIZE]; if ((nframes = DTRACE_USTACK_NFRAMES(arg)) == 0) nframes = opt[DTRACEOPT_JSTACKFRAMES]; arg = DTRACE_USTACK_ARG(nframes, strsize); /*FALLTHROUGH*/ case DTRACEACT_USTACK: if (desc->dtad_kind != DTRACEACT_JSTACK && (nframes = DTRACE_USTACK_NFRAMES(arg)) == 0) { strsize = DTRACE_USTACK_STRSIZE(arg); nframes = opt[DTRACEOPT_USTACKFRAMES]; ASSERT(nframes > 0); arg = DTRACE_USTACK_ARG(nframes, strsize); } /* * Save a slot for the pid. */ size = (nframes + 1) * sizeof (uint64_t); size += DTRACE_USTACK_STRSIZE(arg); size = P2ROUNDUP(size, (uint32_t)(sizeof (uintptr_t))); break; case DTRACEACT_SYM: case DTRACEACT_MOD: if (dp == NULL || ((size = dp->dtdo_rtype.dtdt_size) != sizeof (uint64_t)) || (dp->dtdo_rtype.dtdt_flags & DIF_TF_BYREF)) return (EINVAL); break; case DTRACEACT_USYM: case DTRACEACT_UMOD: case DTRACEACT_UADDR: if (dp == NULL || (dp->dtdo_rtype.dtdt_size != sizeof (uint64_t)) || (dp->dtdo_rtype.dtdt_flags & DIF_TF_BYREF)) return (EINVAL); /* * We have a slot for the pid, plus a slot for the * argument. To keep things simple (aligned with * bitness-neutral sizing), we store each as a 64-bit * quantity. */ size = 2 * sizeof (uint64_t); break; case DTRACEACT_STOP: case DTRACEACT_BREAKPOINT: case DTRACEACT_PANIC: break; case DTRACEACT_CHILL: case DTRACEACT_DISCARD: case DTRACEACT_RAISE: if (dp == NULL) return (EINVAL); break; case DTRACEACT_EXIT: if (dp == NULL || (size = dp->dtdo_rtype.dtdt_size) != sizeof (int) || (dp->dtdo_rtype.dtdt_flags & DIF_TF_BYREF)) return (EINVAL); break; case DTRACEACT_SPECULATE: if (ecb->dte_size > sizeof (dtrace_rechdr_t)) return (EINVAL); if (dp == NULL) return (EINVAL); state->dts_speculates = 1; break; case DTRACEACT_PRINTM: size = dp->dtdo_rtype.dtdt_size; break; case DTRACEACT_COMMIT: { dtrace_action_t *act = ecb->dte_action; for (; act != NULL; act = act->dta_next) { if (act->dta_kind == DTRACEACT_COMMIT) return (EINVAL); } if (dp == NULL) return (EINVAL); break; } default: return (EINVAL); } if (size != 0 || desc->dtad_kind == DTRACEACT_SPECULATE) { /* * If this is a data-storing action or a speculate, * we must be sure that there isn't a commit on the * action chain. 
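 *
 * For illustration: a clause such as { commit(spec); trace(arg0); }
 * is rejected by this check -- once commit() is on the chain, no
 * further data-storing action (or speculate()) may join this ECB.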
*/ dtrace_action_t *act = ecb->dte_action; for (; act != NULL; act = act->dta_next) { if (act->dta_kind == DTRACEACT_COMMIT) return (EINVAL); } } action = kmem_zalloc(sizeof (dtrace_action_t), KM_SLEEP); action->dta_rec.dtrd_size = size; } action->dta_refcnt = 1; rec = &action->dta_rec; size = rec->dtrd_size; for (mask = sizeof (uint64_t) - 1; size != 0 && mask > 0; mask >>= 1) { if (!(size & mask)) { align = mask + 1; break; } } action->dta_kind = desc->dtad_kind; if ((action->dta_difo = dp) != NULL) dtrace_difo_hold(dp); rec->dtrd_action = action->dta_kind; rec->dtrd_arg = arg; rec->dtrd_uarg = desc->dtad_uarg; rec->dtrd_alignment = (uint16_t)align; rec->dtrd_format = format; if ((last = ecb->dte_action_last) != NULL) { ASSERT(ecb->dte_action != NULL); action->dta_prev = last; last->dta_next = action; } else { ASSERT(ecb->dte_action == NULL); ecb->dte_action = action; } ecb->dte_action_last = action; return (0); } static void dtrace_ecb_action_remove(dtrace_ecb_t *ecb) { dtrace_action_t *act = ecb->dte_action, *next; dtrace_vstate_t *vstate = &ecb->dte_state->dts_vstate; dtrace_difo_t *dp; uint16_t format; if (act != NULL && act->dta_refcnt > 1) { ASSERT(act->dta_next == NULL || act->dta_next->dta_refcnt == 1); act->dta_refcnt--; } else { for (; act != NULL; act = next) { next = act->dta_next; ASSERT(next != NULL || act == ecb->dte_action_last); ASSERT(act->dta_refcnt == 1); if ((format = act->dta_rec.dtrd_format) != 0) dtrace_format_remove(ecb->dte_state, format); if ((dp = act->dta_difo) != NULL) dtrace_difo_release(dp, vstate); if (DTRACEACT_ISAGG(act->dta_kind)) { dtrace_ecb_aggregation_destroy(ecb, act); } else { kmem_free(act, sizeof (dtrace_action_t)); } } } ecb->dte_action = NULL; ecb->dte_action_last = NULL; ecb->dte_size = 0; } static void dtrace_ecb_disable(dtrace_ecb_t *ecb) { /* * We disable the ECB by removing it from its probe. */ dtrace_ecb_t *pecb, *prev = NULL; dtrace_probe_t *probe = ecb->dte_probe; ASSERT(MUTEX_HELD(&dtrace_lock)); if (probe == NULL) { /* * This is the NULL probe; there is nothing to disable. */ return; } for (pecb = probe->dtpr_ecb; pecb != NULL; pecb = pecb->dte_next) { if (pecb == ecb) break; prev = pecb; } ASSERT(pecb != NULL); if (prev == NULL) { probe->dtpr_ecb = ecb->dte_next; } else { prev->dte_next = ecb->dte_next; } if (ecb == probe->dtpr_ecb_last) { ASSERT(ecb->dte_next == NULL); probe->dtpr_ecb_last = prev; } /* * The ECB has been disconnected from the probe; now sync to assure * that all CPUs have seen the change before returning. */ dtrace_sync(); if (probe->dtpr_ecb == NULL) { /* * That was the last ECB on the probe; clear the predicate * cache ID for the probe, disable it and sync one more time * to assure that we'll never hit it again. */ dtrace_provider_t *prov = probe->dtpr_provider; ASSERT(ecb->dte_next == NULL); ASSERT(probe->dtpr_ecb_last == NULL); probe->dtpr_predcache = DTRACE_CACHEIDNONE; prov->dtpv_pops.dtps_disable(prov->dtpv_arg, probe->dtpr_id, probe->dtpr_arg); dtrace_sync(); } else { /* * There is at least one ECB remaining on the probe. If there * is _exactly_ one, set the probe's predicate cache ID to be * the predicate cache ID of the remaining ECB. 
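 *
 * For illustration: dtrace_probe() compares dtpr_predcache against the
 * cache id stashed in the current thread, so with a single remaining
 * ECB whose predicate is thread-local, a non-matching thread can skip
 * the probe without evaluating the predicate at all.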
*/ ASSERT(probe->dtpr_ecb_last != NULL); ASSERT(probe->dtpr_predcache == DTRACE_CACHEIDNONE); if (probe->dtpr_ecb == probe->dtpr_ecb_last) { dtrace_predicate_t *p = probe->dtpr_ecb->dte_predicate; ASSERT(probe->dtpr_ecb->dte_next == NULL); if (p != NULL) probe->dtpr_predcache = p->dtp_cacheid; } ecb->dte_next = NULL; } } static void dtrace_ecb_destroy(dtrace_ecb_t *ecb) { dtrace_state_t *state = ecb->dte_state; dtrace_vstate_t *vstate = &state->dts_vstate; dtrace_predicate_t *pred; dtrace_epid_t epid = ecb->dte_epid; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(ecb->dte_next == NULL); ASSERT(ecb->dte_probe == NULL || ecb->dte_probe->dtpr_ecb != ecb); if ((pred = ecb->dte_predicate) != NULL) dtrace_predicate_release(pred, vstate); dtrace_ecb_action_remove(ecb); ASSERT(state->dts_ecbs[epid - 1] == ecb); state->dts_ecbs[epid - 1] = NULL; kmem_free(ecb, sizeof (dtrace_ecb_t)); } static dtrace_ecb_t * dtrace_ecb_create(dtrace_state_t *state, dtrace_probe_t *probe, dtrace_enabling_t *enab) { dtrace_ecb_t *ecb; dtrace_predicate_t *pred; dtrace_actdesc_t *act; dtrace_provider_t *prov; dtrace_ecbdesc_t *desc = enab->dten_current; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(state != NULL); ecb = dtrace_ecb_add(state, probe); ecb->dte_uarg = desc->dted_uarg; if ((pred = desc->dted_pred.dtpdd_predicate) != NULL) { dtrace_predicate_hold(pred); ecb->dte_predicate = pred; } if (probe != NULL) { /* * If the provider shows more leg than the consumer is old * enough to see, we need to enable the appropriate implicit * predicate bits to prevent the ecb from activating at * revealing times. * * Providers specifying DTRACE_PRIV_USER at register time * are stating that they need the /proc-style privilege * model to be enforced, and this is what DTRACE_COND_OWNER * and DTRACE_COND_ZONEOWNER will then do at probe time. */ prov = probe->dtpr_provider; if (!(state->dts_cred.dcr_visible & DTRACE_CRV_ALLPROC) && (prov->dtpv_priv.dtpp_flags & DTRACE_PRIV_USER)) ecb->dte_cond |= DTRACE_COND_OWNER; if (!(state->dts_cred.dcr_visible & DTRACE_CRV_ALLZONE) && (prov->dtpv_priv.dtpp_flags & DTRACE_PRIV_USER)) ecb->dte_cond |= DTRACE_COND_ZONEOWNER; /* * If the provider shows us kernel innards and the user * is lacking sufficient privilege, enable the * DTRACE_COND_USERMODE implicit predicate. */ if (!(state->dts_cred.dcr_visible & DTRACE_CRV_KERNEL) && (prov->dtpv_priv.dtpp_flags & DTRACE_PRIV_KERNEL)) ecb->dte_cond |= DTRACE_COND_USERMODE; } if (dtrace_ecb_create_cache != NULL) { /* * If we have a cached ecb, we'll use its action list instead * of creating our own (saving both time and space). 
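 *
 * For illustration: when one enabling matches many probes (say, a
 * single clause against syscall:::entry), every ECB after the first
 * just bumps dta_refcnt on the shared action chain and copies the
 * already-computed sizes below, rather than rebuilding identical
 * actions for each probe.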
*/ dtrace_ecb_t *cached = dtrace_ecb_create_cache; dtrace_action_t *act = cached->dte_action; if (act != NULL) { ASSERT(act->dta_refcnt > 0); act->dta_refcnt++; ecb->dte_action = act; ecb->dte_action_last = cached->dte_action_last; ecb->dte_needed = cached->dte_needed; ecb->dte_size = cached->dte_size; ecb->dte_alignment = cached->dte_alignment; } return (ecb); } for (act = desc->dted_action; act != NULL; act = act->dtad_next) { if ((enab->dten_error = dtrace_ecb_action_add(ecb, act)) != 0) { dtrace_ecb_destroy(ecb); return (NULL); } } if ((enab->dten_error = dtrace_ecb_resize(ecb)) != 0) { dtrace_ecb_destroy(ecb); return (NULL); } return (dtrace_ecb_create_cache = ecb); } static int dtrace_ecb_create_enable(dtrace_probe_t *probe, void *arg) { dtrace_ecb_t *ecb; dtrace_enabling_t *enab = arg; dtrace_state_t *state = enab->dten_vstate->dtvs_state; ASSERT(state != NULL); if (probe != NULL && probe->dtpr_gen < enab->dten_probegen) { /* * This probe was created in a generation for which this * enabling has previously created ECBs; we don't want to * enable it again, so just kick out. */ return (DTRACE_MATCH_NEXT); } if ((ecb = dtrace_ecb_create(state, probe, enab)) == NULL) return (DTRACE_MATCH_DONE); dtrace_ecb_enable(ecb); return (DTRACE_MATCH_NEXT); } static dtrace_ecb_t * dtrace_epid2ecb(dtrace_state_t *state, dtrace_epid_t id) { dtrace_ecb_t *ecb; ASSERT(MUTEX_HELD(&dtrace_lock)); if (id == 0 || id > state->dts_necbs) return (NULL); ASSERT(state->dts_necbs > 0 && state->dts_ecbs != NULL); ASSERT((ecb = state->dts_ecbs[id - 1]) == NULL || ecb->dte_epid == id); return (state->dts_ecbs[id - 1]); } static dtrace_aggregation_t * dtrace_aggid2agg(dtrace_state_t *state, dtrace_aggid_t id) { dtrace_aggregation_t *agg; ASSERT(MUTEX_HELD(&dtrace_lock)); if (id == 0 || id > state->dts_naggregations) return (NULL); ASSERT(state->dts_naggregations > 0 && state->dts_aggregations != NULL); ASSERT((agg = state->dts_aggregations[id - 1]) == NULL || agg->dtag_id == id); return (state->dts_aggregations[id - 1]); } /* * DTrace Buffer Functions * * The following functions manipulate DTrace buffers. Most of these functions * are called in the context of establishing or processing consumer state; * exceptions are explicitly noted. */ /* * Note: called from cross call context. This function switches the two * buffers on a given CPU. The atomicity of this operation is assured by * disabling interrupts while the actual switch takes place; the disabling of * interrupts serializes the execution with any execution of dtrace_probe() on * the same CPU. */ static void dtrace_buffer_switch(dtrace_buffer_t *buf) { caddr_t tomax = buf->dtb_tomax; caddr_t xamot = buf->dtb_xamot; dtrace_icookie_t cookie; hrtime_t now; ASSERT(!(buf->dtb_flags & DTRACEBUF_NOSWITCH)); ASSERT(!(buf->dtb_flags & DTRACEBUF_RING)); cookie = dtrace_interrupt_disable(); now = dtrace_gethrtime(); buf->dtb_tomax = xamot; buf->dtb_xamot = tomax; buf->dtb_xamot_drops = buf->dtb_drops; buf->dtb_xamot_offset = buf->dtb_offset; buf->dtb_xamot_errors = buf->dtb_errors; buf->dtb_xamot_flags = buf->dtb_flags; buf->dtb_offset = 0; buf->dtb_drops = 0; buf->dtb_errors = 0; buf->dtb_flags &= ~(DTRACEBUF_ERROR | DTRACEBUF_DROPPED); buf->dtb_interval = now - buf->dtb_switched; buf->dtb_switched = now; dtrace_interrupt_enable(cookie); } /* * Note: called from cross call context. This function activates a buffer * on a CPU. As with dtrace_buffer_switch(), the atomicity of the operation * is guaranteed by the disabling of interrupts. 
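 *
 * For illustration, the pattern is the same as in
 * dtrace_buffer_switch() above:
 *
 *	cookie = dtrace_interrupt_disable();
 *	... manipulate the per-CPU buffer ...
 *	dtrace_interrupt_enable(cookie);
 *
 * and it suffices because dtrace_probe() likewise runs with interrupts
 * disabled on the CPU whose buffer is being touched.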
 */
static void
dtrace_buffer_activate(dtrace_state_t *state)
{
	dtrace_buffer_t *buf;
	dtrace_icookie_t cookie = dtrace_interrupt_disable();

	buf = &state->dts_buffer[curcpu];

	if (buf->dtb_tomax != NULL) {
		/*
		 * We might like to assert that the buffer is marked inactive,
		 * but this isn't necessarily true:  the buffer for the CPU
		 * that processes the BEGIN probe has its buffer activated
		 * manually.  In this case, we take the (harmless) action
		 * of re-clearing the INACTIVE bit.
		 */
		buf->dtb_flags &= ~DTRACEBUF_INACTIVE;
	}

	dtrace_interrupt_enable(cookie);
}

#ifdef __FreeBSD__
/*
 * Activate the specified per-CPU buffer.  This is used instead of
 * dtrace_buffer_activate() when APs have not yet started, i.e. when
 * activating anonymous state.
 */
static void
dtrace_buffer_activate_cpu(dtrace_state_t *state, int cpu)
{

	if (state->dts_buffer[cpu].dtb_tomax != NULL)
		state->dts_buffer[cpu].dtb_flags &= ~DTRACEBUF_INACTIVE;
}
#endif

static int
dtrace_buffer_alloc(dtrace_buffer_t *bufs, size_t size, int flags,
    processorid_t cpu, int *factor)
{
#ifdef illumos
	cpu_t *cp;
#endif
	dtrace_buffer_t *buf;
	int allocated = 0, desired = 0;

#ifdef illumos
	ASSERT(MUTEX_HELD(&cpu_lock));
	ASSERT(MUTEX_HELD(&dtrace_lock));

	*factor = 1;

	if (size > dtrace_nonroot_maxsize &&
	    !PRIV_POLICY_CHOICE(CRED(), PRIV_ALL, B_FALSE))
		return (EFBIG);

	cp = cpu_list;

	do {
		if (cpu != DTRACE_CPUALL && cpu != cp->cpu_id)
			continue;

		buf = &bufs[cp->cpu_id];

		/*
		 * If there is already a buffer allocated for this CPU, it
		 * is only possible that this is a DR event.  In this case,
		 * the buffer size must match our specified size.
		 */
		if (buf->dtb_tomax != NULL) {
			ASSERT(buf->dtb_size == size);
			continue;
		}

		ASSERT(buf->dtb_xamot == NULL);

		if ((buf->dtb_tomax = kmem_zalloc(size,
		    KM_NOSLEEP | KM_NORMALPRI)) == NULL)
			goto err;

		buf->dtb_size = size;
		buf->dtb_flags = flags;
		buf->dtb_offset = 0;
		buf->dtb_drops = 0;

		if (flags & DTRACEBUF_NOSWITCH)
			continue;

		if ((buf->dtb_xamot = kmem_zalloc(size,
		    KM_NOSLEEP | KM_NORMALPRI)) == NULL)
			goto err;
	} while ((cp = cp->cpu_next) != cpu_list);

	return (0);

err:
	cp = cpu_list;

	do {
		if (cpu != DTRACE_CPUALL && cpu != cp->cpu_id)
			continue;

		buf = &bufs[cp->cpu_id];
		desired += 2;

		if (buf->dtb_xamot != NULL) {
			ASSERT(buf->dtb_tomax != NULL);
			ASSERT(buf->dtb_size == size);
			kmem_free(buf->dtb_xamot, size);
			allocated++;
		}

		if (buf->dtb_tomax != NULL) {
			ASSERT(buf->dtb_size == size);
			kmem_free(buf->dtb_tomax, size);
			allocated++;
		}

		buf->dtb_tomax = NULL;
		buf->dtb_xamot = NULL;
		buf->dtb_size = 0;
	} while ((cp = cp->cpu_next) != cpu_list);
#else
	int i;

	*factor = 1;
#if defined(__aarch64__) || defined(__amd64__) || defined(__arm__) || \
    defined(__mips__) || defined(__powerpc__) || defined(__riscv)
	/*
	 * FreeBSD isn't good at limiting the amount of memory we
	 * ask to malloc, so let's place a limit here before trying
	 * to do something that might well end in tears at bedtime.
	 */
	if (size > physmem * PAGE_SIZE / (128 * (mp_maxid + 1)))
		return (ENOMEM);
#endif

	ASSERT(MUTEX_HELD(&dtrace_lock));
	CPU_FOREACH(i) {
		if (cpu != DTRACE_CPUALL && cpu != i)
			continue;

		buf = &bufs[i];

		/*
		 * If there is already a buffer allocated for this CPU, it
		 * is only possible that this is a DR event.  In this case,
		 * the buffer size must match our specified size.
*/ if (buf->dtb_tomax != NULL) { ASSERT(buf->dtb_size == size); continue; } ASSERT(buf->dtb_xamot == NULL); if ((buf->dtb_tomax = kmem_zalloc(size, KM_NOSLEEP | KM_NORMALPRI)) == NULL) goto err; buf->dtb_size = size; buf->dtb_flags = flags; buf->dtb_offset = 0; buf->dtb_drops = 0; if (flags & DTRACEBUF_NOSWITCH) continue; if ((buf->dtb_xamot = kmem_zalloc(size, KM_NOSLEEP | KM_NORMALPRI)) == NULL) goto err; } return (0); err: /* * Error allocating memory, so free the buffers that were * allocated before the failed allocation. */ CPU_FOREACH(i) { if (cpu != DTRACE_CPUALL && cpu != i) continue; buf = &bufs[i]; desired += 2; if (buf->dtb_xamot != NULL) { ASSERT(buf->dtb_tomax != NULL); ASSERT(buf->dtb_size == size); kmem_free(buf->dtb_xamot, size); allocated++; } if (buf->dtb_tomax != NULL) { ASSERT(buf->dtb_size == size); kmem_free(buf->dtb_tomax, size); allocated++; } buf->dtb_tomax = NULL; buf->dtb_xamot = NULL; buf->dtb_size = 0; } #endif *factor = desired / (allocated > 0 ? allocated : 1); return (ENOMEM); } /* * Note: called from probe context. This function just increments the drop * count on a buffer. It has been made a function to allow for the * possibility of understanding the source of mysterious drop counts. (A * problem for which one may be particularly disappointed that DTrace cannot * be used to understand DTrace.) */ static void dtrace_buffer_drop(dtrace_buffer_t *buf) { buf->dtb_drops++; } /* * Note: called from probe context. This function is called to reserve space * in a buffer. If mstate is non-NULL, sets the scratch base and size in the * mstate. Returns the new offset in the buffer, or a negative value if an * error has occurred. */ static intptr_t dtrace_buffer_reserve(dtrace_buffer_t *buf, size_t needed, size_t align, dtrace_state_t *state, dtrace_mstate_t *mstate) { intptr_t offs = buf->dtb_offset, soffs; intptr_t woffs; caddr_t tomax; size_t total; if (buf->dtb_flags & DTRACEBUF_INACTIVE) return (-1); if ((tomax = buf->dtb_tomax) == NULL) { dtrace_buffer_drop(buf); return (-1); } if (!(buf->dtb_flags & (DTRACEBUF_RING | DTRACEBUF_FILL))) { while (offs & (align - 1)) { /* * Assert that our alignment is off by a number which * is itself sizeof (uint32_t) aligned. */ ASSERT(!((align - (offs & (align - 1))) & (sizeof (uint32_t) - 1))); DTRACE_STORE(uint32_t, tomax, offs, DTRACE_EPIDNONE); offs += sizeof (uint32_t); } if ((soffs = offs + needed) > buf->dtb_size) { dtrace_buffer_drop(buf); return (-1); } if (mstate == NULL) return (offs); mstate->dtms_scratch_base = (uintptr_t)tomax + soffs; mstate->dtms_scratch_size = buf->dtb_size - soffs; mstate->dtms_scratch_ptr = mstate->dtms_scratch_base; return (offs); } if (buf->dtb_flags & DTRACEBUF_FILL) { if (state->dts_activity != DTRACE_ACTIVITY_COOLDOWN && (buf->dtb_flags & DTRACEBUF_FULL)) return (-1); goto out; } total = needed + (offs & (align - 1)); /* * For a ring buffer, life is quite a bit more complicated. Before * we can store any padding, we need to adjust our wrapping offset. * (If we've never before wrapped or we're not about to, no adjustment * is required.) */ if ((buf->dtb_flags & DTRACEBUF_WRAPPED) || offs + total > buf->dtb_size) { woffs = buf->dtb_xamot_offset; if (offs + total > buf->dtb_size) { /* * We can't fit in the end of the buffer. First, a * sanity check that we can fit in the buffer at all. */ if (total > buf->dtb_size) { dtrace_buffer_drop(buf); return (-1); } /* * We're going to be storing at the top of the buffer, * so now we need to deal with the wrapped offset. 
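 * (For illustration with small numbers: if dtb_size is 64, offs is 56
 * and total is 16, the record cannot fit in bytes 56..63; the code
 * below zeroes that tail, wraps offs to 0, and then retires old
 * records starting at the wrapped offset until 16 contiguous bytes
 * are open.)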
We * only reset our wrapped offset to 0 if it is * currently greater than the current offset. If it * is less than the current offset, it is because a * previous allocation induced a wrap -- but the * allocation didn't subsequently take the space due * to an error or false predicate evaluation. In this * case, we'll just leave the wrapped offset alone: if * the wrapped offset hasn't been advanced far enough * for this allocation, it will be adjusted in the * lower loop. */ if (buf->dtb_flags & DTRACEBUF_WRAPPED) { if (woffs >= offs) woffs = 0; } else { woffs = 0; } /* * Now we know that we're going to be storing to the * top of the buffer and that there is room for us * there. We need to clear the buffer from the current * offset to the end (there may be old gunk there). */ while (offs < buf->dtb_size) tomax[offs++] = 0; /* * We need to set our offset to zero. And because we * are wrapping, we need to set the bit indicating as * much. We can also adjust our needed space back * down to the space required by the ECB -- we know * that the top of the buffer is aligned. */ offs = 0; total = needed; buf->dtb_flags |= DTRACEBUF_WRAPPED; } else { /* * There is room for us in the buffer, so we simply * need to check the wrapped offset. */ if (woffs < offs) { /* * The wrapped offset is less than the offset. * This can happen if we allocated buffer space * that induced a wrap, but then we didn't * subsequently take the space due to an error * or false predicate evaluation. This is * okay; we know that _this_ allocation isn't * going to induce a wrap. We still can't * reset the wrapped offset to be zero, * however: the space may have been trashed in * the previous failed probe attempt. But at * least the wrapped offset doesn't need to * be adjusted at all... */ goto out; } } while (offs + total > woffs) { dtrace_epid_t epid = *(uint32_t *)(tomax + woffs); size_t size; if (epid == DTRACE_EPIDNONE) { size = sizeof (uint32_t); } else { ASSERT3U(epid, <=, state->dts_necbs); ASSERT(state->dts_ecbs[epid - 1] != NULL); size = state->dts_ecbs[epid - 1]->dte_size; } ASSERT(woffs + size <= buf->dtb_size); ASSERT(size != 0); if (woffs + size == buf->dtb_size) { /* * We've reached the end of the buffer; we want * to set the wrapped offset to 0 and break * out. However, if the offs is 0, then we're * in a strange edge-condition: the amount of * space that we want to reserve plus the size * of the record that we're overwriting is * greater than the size of the buffer. This * is problematic because if we reserve the * space but subsequently don't consume it (due * to a failed predicate or error) the wrapped * offset will be 0 -- yet the EPID at offset 0 * will not be committed. This situation is * relatively easy to deal with: if we're in * this case, the buffer is indistinguishable * from one that hasn't wrapped; we need only * finish the job by clearing the wrapped bit, * explicitly setting the offset to be 0, and * zero'ing out the old data in the buffer. */ if (offs == 0) { buf->dtb_flags &= ~DTRACEBUF_WRAPPED; buf->dtb_offset = 0; woffs = total; while (woffs < buf->dtb_size) tomax[woffs++] = 0; } woffs = 0; break; } woffs += size; } /* * We have a wrapped offset. It may be that the wrapped offset * has become zero -- that's okay. */ buf->dtb_xamot_offset = woffs; } out: /* * Now we can plow the buffer with any necessary padding. */ while (offs & (align - 1)) { /* * Assert that our alignment is off by a number which * is itself sizeof (uint32_t) aligned. 
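 *
 * For illustration: with align == 8 and offs % 8 == 4, exactly one
 * 4-byte DTRACE_EPIDNONE filler is stored here; consumers skip such
 * zero EPIDs, which is why any misalignment must itself be a multiple
 * of sizeof (uint32_t).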
*/ ASSERT(!((align - (offs & (align - 1))) & (sizeof (uint32_t) - 1))); DTRACE_STORE(uint32_t, tomax, offs, DTRACE_EPIDNONE); offs += sizeof (uint32_t); } if (buf->dtb_flags & DTRACEBUF_FILL) { if (offs + needed > buf->dtb_size - state->dts_reserve) { buf->dtb_flags |= DTRACEBUF_FULL; return (-1); } } if (mstate == NULL) return (offs); /* * For ring buffers and fill buffers, the scratch space is always * the inactive buffer. */ mstate->dtms_scratch_base = (uintptr_t)buf->dtb_xamot; mstate->dtms_scratch_size = buf->dtb_size; mstate->dtms_scratch_ptr = mstate->dtms_scratch_base; return (offs); } static void dtrace_buffer_polish(dtrace_buffer_t *buf) { ASSERT(buf->dtb_flags & DTRACEBUF_RING); ASSERT(MUTEX_HELD(&dtrace_lock)); if (!(buf->dtb_flags & DTRACEBUF_WRAPPED)) return; /* * We need to polish the ring buffer. There are three cases: * * - The first (and presumably most common) is that there is no gap * between the buffer offset and the wrapped offset. In this case, * there is nothing in the buffer that isn't valid data; we can * mark the buffer as polished and return. * * - The second (less common than the first but still more common * than the third) is that there is a gap between the buffer offset * and the wrapped offset, and the wrapped offset is larger than the * buffer offset. This can happen because of an alignment issue, or * can happen because of a call to dtrace_buffer_reserve() that * didn't subsequently consume the buffer space. In this case, * we need to zero the data from the buffer offset to the wrapped * offset. * * - The third (and least common) is that there is a gap between the * buffer offset and the wrapped offset, but the wrapped offset is * _less_ than the buffer offset. This can only happen because a * call to dtrace_buffer_reserve() induced a wrap, but the space * was not subsequently consumed. In this case, we need to zero the * space from the offset to the end of the buffer _and_ from the * top of the buffer to the wrapped offset. */ if (buf->dtb_offset < buf->dtb_xamot_offset) { bzero(buf->dtb_tomax + buf->dtb_offset, buf->dtb_xamot_offset - buf->dtb_offset); } if (buf->dtb_offset > buf->dtb_xamot_offset) { bzero(buf->dtb_tomax + buf->dtb_offset, buf->dtb_size - buf->dtb_offset); bzero(buf->dtb_tomax, buf->dtb_xamot_offset); } } /* * This routine determines if data generated at the specified time has likely * been entirely consumed at user-level. This routine is called to determine * if an ECB on a defunct probe (but for an active enabling) can be safely * disabled and destroyed. 
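 *
 * For illustration: dtb_switched - dtb_interval is the time of the
 * next-to-last switch, so the check below treats data generated at
 * "when" as unconsumed unless the buffer has switched twice since
 * then; ring buffers and never-switched, non-empty buffers are never
 * considered consumed.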
*/ static int dtrace_buffer_consumed(dtrace_buffer_t *bufs, hrtime_t when) { int i; for (i = 0; i < NCPU; i++) { dtrace_buffer_t *buf = &bufs[i]; if (buf->dtb_size == 0) continue; if (buf->dtb_flags & DTRACEBUF_RING) return (0); if (!buf->dtb_switched && buf->dtb_offset != 0) return (0); if (buf->dtb_switched - buf->dtb_interval < when) return (0); } return (1); } static void dtrace_buffer_free(dtrace_buffer_t *bufs) { int i; for (i = 0; i < NCPU; i++) { dtrace_buffer_t *buf = &bufs[i]; if (buf->dtb_tomax == NULL) { ASSERT(buf->dtb_xamot == NULL); ASSERT(buf->dtb_size == 0); continue; } if (buf->dtb_xamot != NULL) { ASSERT(!(buf->dtb_flags & DTRACEBUF_NOSWITCH)); kmem_free(buf->dtb_xamot, buf->dtb_size); } kmem_free(buf->dtb_tomax, buf->dtb_size); buf->dtb_size = 0; buf->dtb_tomax = NULL; buf->dtb_xamot = NULL; } } /* * DTrace Enabling Functions */ static dtrace_enabling_t * dtrace_enabling_create(dtrace_vstate_t *vstate) { dtrace_enabling_t *enab; enab = kmem_zalloc(sizeof (dtrace_enabling_t), KM_SLEEP); enab->dten_vstate = vstate; return (enab); } static void dtrace_enabling_add(dtrace_enabling_t *enab, dtrace_ecbdesc_t *ecb) { dtrace_ecbdesc_t **ndesc; size_t osize, nsize; /* * We can't add to enablings after we've enabled them, or after we've * retained them. */ ASSERT(enab->dten_probegen == 0); ASSERT(enab->dten_next == NULL && enab->dten_prev == NULL); if (enab->dten_ndesc < enab->dten_maxdesc) { enab->dten_desc[enab->dten_ndesc++] = ecb; return; } osize = enab->dten_maxdesc * sizeof (dtrace_enabling_t *); if (enab->dten_maxdesc == 0) { enab->dten_maxdesc = 1; } else { enab->dten_maxdesc <<= 1; } ASSERT(enab->dten_ndesc < enab->dten_maxdesc); nsize = enab->dten_maxdesc * sizeof (dtrace_enabling_t *); ndesc = kmem_zalloc(nsize, KM_SLEEP); bcopy(enab->dten_desc, ndesc, osize); if (enab->dten_desc != NULL) kmem_free(enab->dten_desc, osize); enab->dten_desc = ndesc; enab->dten_desc[enab->dten_ndesc++] = ecb; } static void dtrace_enabling_addlike(dtrace_enabling_t *enab, dtrace_ecbdesc_t *ecb, dtrace_probedesc_t *pd) { dtrace_ecbdesc_t *new; dtrace_predicate_t *pred; dtrace_actdesc_t *act; /* * We're going to create a new ECB description that matches the * specified ECB in every way, but has the specified probe description. 
*/ new = kmem_zalloc(sizeof (dtrace_ecbdesc_t), KM_SLEEP); if ((pred = ecb->dted_pred.dtpdd_predicate) != NULL) dtrace_predicate_hold(pred); for (act = ecb->dted_action; act != NULL; act = act->dtad_next) dtrace_actdesc_hold(act); new->dted_action = ecb->dted_action; new->dted_pred = ecb->dted_pred; new->dted_probe = *pd; new->dted_uarg = ecb->dted_uarg; dtrace_enabling_add(enab, new); } static void dtrace_enabling_dump(dtrace_enabling_t *enab) { int i; for (i = 0; i < enab->dten_ndesc; i++) { dtrace_probedesc_t *desc = &enab->dten_desc[i]->dted_probe; #ifdef __FreeBSD__ printf("dtrace: enabling probe %d (%s:%s:%s:%s)\n", i, desc->dtpd_provider, desc->dtpd_mod, desc->dtpd_func, desc->dtpd_name); #else cmn_err(CE_NOTE, "enabling probe %d (%s:%s:%s:%s)", i, desc->dtpd_provider, desc->dtpd_mod, desc->dtpd_func, desc->dtpd_name); #endif } } static void dtrace_enabling_destroy(dtrace_enabling_t *enab) { int i; dtrace_ecbdesc_t *ep; dtrace_vstate_t *vstate = enab->dten_vstate; ASSERT(MUTEX_HELD(&dtrace_lock)); for (i = 0; i < enab->dten_ndesc; i++) { dtrace_actdesc_t *act, *next; dtrace_predicate_t *pred; ep = enab->dten_desc[i]; if ((pred = ep->dted_pred.dtpdd_predicate) != NULL) dtrace_predicate_release(pred, vstate); for (act = ep->dted_action; act != NULL; act = next) { next = act->dtad_next; dtrace_actdesc_release(act, vstate); } kmem_free(ep, sizeof (dtrace_ecbdesc_t)); } if (enab->dten_desc != NULL) kmem_free(enab->dten_desc, enab->dten_maxdesc * sizeof (dtrace_enabling_t *)); /* * If this was a retained enabling, decrement the dts_nretained count * and take it off of the dtrace_retained list. */ if (enab->dten_prev != NULL || enab->dten_next != NULL || dtrace_retained == enab) { ASSERT(enab->dten_vstate->dtvs_state != NULL); ASSERT(enab->dten_vstate->dtvs_state->dts_nretained > 0); enab->dten_vstate->dtvs_state->dts_nretained--; dtrace_retained_gen++; } if (enab->dten_prev == NULL) { if (dtrace_retained == enab) { dtrace_retained = enab->dten_next; if (dtrace_retained != NULL) dtrace_retained->dten_prev = NULL; } } else { ASSERT(enab != dtrace_retained); ASSERT(dtrace_retained != NULL); enab->dten_prev->dten_next = enab->dten_next; } if (enab->dten_next != NULL) { ASSERT(dtrace_retained != NULL); enab->dten_next->dten_prev = enab->dten_prev; } kmem_free(enab, sizeof (dtrace_enabling_t)); } static int dtrace_enabling_retain(dtrace_enabling_t *enab) { dtrace_state_t *state; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(enab->dten_next == NULL && enab->dten_prev == NULL); ASSERT(enab->dten_vstate != NULL); state = enab->dten_vstate->dtvs_state; ASSERT(state != NULL); /* * We only allow each state to retain dtrace_retain_max enablings. */ if (state->dts_nretained >= dtrace_retain_max) return (ENOSPC); state->dts_nretained++; dtrace_retained_gen++; if (dtrace_retained == NULL) { dtrace_retained = enab; return (0); } enab->dten_next = dtrace_retained; dtrace_retained->dten_prev = enab; dtrace_retained = enab; return (0); } static int dtrace_enabling_replicate(dtrace_state_t *state, dtrace_probedesc_t *match, dtrace_probedesc_t *create) { dtrace_enabling_t *new, *enab; int found = 0, err = ENOENT; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(strlen(match->dtpd_provider) < DTRACE_PROVNAMELEN); ASSERT(strlen(match->dtpd_mod) < DTRACE_MODNAMELEN); ASSERT(strlen(match->dtpd_func) < DTRACE_FUNCNAMELEN); ASSERT(strlen(match->dtpd_name) < DTRACE_NAMELEN); new = dtrace_enabling_create(&state->dts_vstate); /* * Iterate over all retained enablings, looking for enablings that * match the specified state. 
*/ for (enab = dtrace_retained; enab != NULL; enab = enab->dten_next) { int i; /* * dtvs_state can only be NULL for helper enablings -- and * helper enablings can't be retained. */ ASSERT(enab->dten_vstate->dtvs_state != NULL); if (enab->dten_vstate->dtvs_state != state) continue; /* * Now iterate over each probe description; we're looking for * an exact match to the specified probe description. */ for (i = 0; i < enab->dten_ndesc; i++) { dtrace_ecbdesc_t *ep = enab->dten_desc[i]; dtrace_probedesc_t *pd = &ep->dted_probe; if (strcmp(pd->dtpd_provider, match->dtpd_provider)) continue; if (strcmp(pd->dtpd_mod, match->dtpd_mod)) continue; if (strcmp(pd->dtpd_func, match->dtpd_func)) continue; if (strcmp(pd->dtpd_name, match->dtpd_name)) continue; /* * We have a winning probe! Add it to our growing * enabling. */ found = 1; dtrace_enabling_addlike(new, ep, create); } } if (!found || (err = dtrace_enabling_retain(new)) != 0) { dtrace_enabling_destroy(new); return (err); } return (0); } static void dtrace_enabling_retract(dtrace_state_t *state) { dtrace_enabling_t *enab, *next; ASSERT(MUTEX_HELD(&dtrace_lock)); /* * Iterate over all retained enablings, destroy the enablings retained * for the specified state. */ for (enab = dtrace_retained; enab != NULL; enab = next) { next = enab->dten_next; /* * dtvs_state can only be NULL for helper enablings -- and * helper enablings can't be retained. */ ASSERT(enab->dten_vstate->dtvs_state != NULL); if (enab->dten_vstate->dtvs_state == state) { ASSERT(state->dts_nretained > 0); dtrace_enabling_destroy(enab); } } ASSERT(state->dts_nretained == 0); } static int dtrace_enabling_match(dtrace_enabling_t *enab, int *nmatched) { int i = 0; int matched = 0; ASSERT(MUTEX_HELD(&cpu_lock)); ASSERT(MUTEX_HELD(&dtrace_lock)); for (i = 0; i < enab->dten_ndesc; i++) { dtrace_ecbdesc_t *ep = enab->dten_desc[i]; enab->dten_current = ep; enab->dten_error = 0; matched += dtrace_probe_enable(&ep->dted_probe, enab); if (enab->dten_error != 0) { /* * If we get an error half-way through enabling the * probes, we kick out -- perhaps with some number of * them enabled. Leaving enabled probes enabled may * be slightly confusing for user-level, but we expect * that no one will attempt to actually drive on in * the face of such errors. If this is an anonymous * enabling (indicated with a NULL nmatched pointer), * we cmn_err() a message. We aren't expecting to * get such an error -- such as it can exist at all, * it would be a result of corrupted DOF in the driver * properties. */ if (nmatched == NULL) { cmn_err(CE_WARN, "dtrace_enabling_match() " "error on %p: %d", (void *)ep, enab->dten_error); } return (enab->dten_error); } } enab->dten_probegen = dtrace_probegen; if (nmatched != NULL) *nmatched = matched; return (0); } static void dtrace_enabling_matchall(void) { dtrace_enabling_t *enab; mutex_enter(&cpu_lock); mutex_enter(&dtrace_lock); /* * Iterate over all retained enablings to see if any probes match * against them. We only perform this operation on enablings for which * we have sufficient permissions by virtue of being in the global zone * or in the same zone as the DTrace client. Because we can be called * after dtrace_detach() has been called, we cannot assert that there * are retained enablings. We can safely load from dtrace_retained, * however: the taskq_destroy() at the end of dtrace_detach() will * block pending our completion. 
	 */
	for (enab = dtrace_retained; enab != NULL; enab = enab->dten_next) {
#ifdef illumos
		cred_t *cr = enab->dten_vstate->dtvs_state->dts_cred.dcr_cred;

		if (INGLOBALZONE(curproc) ||
		    cr != NULL && getzoneid() == crgetzoneid(cr))
#endif
			(void) dtrace_enabling_match(enab, NULL);
	}

	mutex_exit(&dtrace_lock);
	mutex_exit(&cpu_lock);
}

/*
 * If an enabling is to be enabled without having matched probes (that is, if
 * dtrace_state_go() is to be called on the underlying dtrace_state_t), the
 * enabling must be _primed_ by creating an ECB for every ECB description.
 * This must be done to assure that we know the number of speculations, the
 * number of aggregations, the minimum buffer size needed, etc. before we
 * transition out of DTRACE_ACTIVITY_INACTIVE.  To do this without actually
 * enabling any probes, we create ECBs for every ECB description, but with a
 * NULL probe -- which is exactly what this function does.
 */
static void
dtrace_enabling_prime(dtrace_state_t *state)
{
	dtrace_enabling_t *enab;
	int i;

	for (enab = dtrace_retained; enab != NULL; enab = enab->dten_next) {
		ASSERT(enab->dten_vstate->dtvs_state != NULL);

		if (enab->dten_vstate->dtvs_state != state)
			continue;

		/*
		 * We don't want to prime an enabling more than once, lest
		 * we allow a malicious user to induce resource exhaustion.
		 * (The ECBs that result from priming an enabling aren't
		 * leaked -- but they also aren't deallocated until the
		 * consumer state is destroyed.)
		 */
		if (enab->dten_primed)
			continue;

		for (i = 0; i < enab->dten_ndesc; i++) {
			enab->dten_current = enab->dten_desc[i];
			(void) dtrace_probe_enable(NULL, enab);
		}

		enab->dten_primed = 1;
	}
}

/*
 * Called to indicate that probes should be provided due to retained
 * enablings.  This is implemented in terms of dtrace_probe_provide(), but it
 * must take an initial lap through the enabling calling the dtps_provide()
 * entry point explicitly to allow for autocreated probes.
 */
static void
dtrace_enabling_provide(dtrace_provider_t *prv)
{
	int i, all = 0;
	dtrace_probedesc_t desc;
	dtrace_genid_t gen;

	ASSERT(MUTEX_HELD(&dtrace_lock));
	ASSERT(MUTEX_HELD(&dtrace_provider_lock));

	if (prv == NULL) {
		all = 1;
		prv = dtrace_provider;
	}

	do {
		dtrace_enabling_t *enab;
		void *parg = prv->dtpv_arg;

retry:
		gen = dtrace_retained_gen;
		for (enab = dtrace_retained; enab != NULL;
		    enab = enab->dten_next) {
			for (i = 0; i < enab->dten_ndesc; i++) {
				desc = enab->dten_desc[i]->dted_probe;
				mutex_exit(&dtrace_lock);
				prv->dtpv_pops.dtps_provide(parg, &desc);
				mutex_enter(&dtrace_lock);
				/*
				 * Process the retained enablings again if
				 * they have changed while we weren't holding
				 * dtrace_lock.
				 */
				if (gen != dtrace_retained_gen)
					goto retry;
			}
		}
	} while (all && (prv = prv->dtpv_next) != NULL);

	mutex_exit(&dtrace_lock);
	dtrace_probe_provide(NULL, all ? NULL : prv);
	mutex_enter(&dtrace_lock);
}

/*
 * Called to reap ECBs that are attached to probes from defunct providers.
 */
static void
dtrace_enabling_reap(void)
{
	dtrace_provider_t *prov;
	dtrace_probe_t *probe;
	dtrace_ecb_t *ecb;
	hrtime_t when;
	int i;

	mutex_enter(&cpu_lock);
	mutex_enter(&dtrace_lock);

	for (i = 0; i < dtrace_nprobes; i++) {
		if ((probe = dtrace_probes[i]) == NULL)
			continue;

		if (probe->dtpr_ecb == NULL)
			continue;

		prov = probe->dtpr_provider;

		if ((when = prov->dtpv_defunct) == 0)
			continue;

		/*
		 * We have ECBs on a defunct provider:  we want to reap these
		 * ECBs to allow the provider to unregister.
The destruction * of these ECBs must be done carefully: if we destroy the ECB * and the consumer later wishes to consume an EPID that * corresponds to the destroyed ECB (and if the EPID metadata * has not been previously consumed), the consumer will abort * processing on the unknown EPID. To reduce (but not, sadly, * eliminate) the possibility of this, we will only destroy an * ECB for a defunct provider if, for the state that * corresponds to the ECB: * * (a) There is no speculative tracing (which can effectively * cache an EPID for an arbitrary amount of time). * * (b) The principal buffers have been switched twice since the * provider became defunct. * * (c) The aggregation buffers are of zero size or have been * switched twice since the provider became defunct. * * We use dts_speculates to determine (a) and call a function * (dtrace_buffer_consumed()) to determine (b) and (c). Note * that as soon as we've been unable to destroy one of the ECBs * associated with the probe, we quit trying -- reaping is only * fruitful in as much as we can destroy all ECBs associated * with the defunct provider's probes. */ while ((ecb = probe->dtpr_ecb) != NULL) { dtrace_state_t *state = ecb->dte_state; dtrace_buffer_t *buf = state->dts_buffer; dtrace_buffer_t *aggbuf = state->dts_aggbuffer; if (state->dts_speculates) break; if (!dtrace_buffer_consumed(buf, when)) break; if (!dtrace_buffer_consumed(aggbuf, when)) break; dtrace_ecb_disable(ecb); ASSERT(probe->dtpr_ecb != ecb); dtrace_ecb_destroy(ecb); } } mutex_exit(&dtrace_lock); mutex_exit(&cpu_lock); } /* * DTrace DOF Functions */ /*ARGSUSED*/ static void dtrace_dof_error(dof_hdr_t *dof, const char *str) { if (dtrace_err_verbose) cmn_err(CE_WARN, "failed to process DOF: %s", str); #ifdef DTRACE_ERRDEBUG dtrace_errdebug(str); #endif } /* * Create DOF out of a currently enabled state. Right now, we only create * DOF containing the run-time options -- but this could be expanded to create * complete DOF representing the enabled state. */ static dof_hdr_t * dtrace_dof_create(dtrace_state_t *state) { dof_hdr_t *dof; dof_sec_t *sec; dof_optdesc_t *opt; int i, len = sizeof (dof_hdr_t) + roundup(sizeof (dof_sec_t), sizeof (uint64_t)) + sizeof (dof_optdesc_t) * DTRACEOPT_MAX; ASSERT(MUTEX_HELD(&dtrace_lock)); dof = kmem_zalloc(len, KM_SLEEP); dof->dofh_ident[DOF_ID_MAG0] = DOF_MAG_MAG0; dof->dofh_ident[DOF_ID_MAG1] = DOF_MAG_MAG1; dof->dofh_ident[DOF_ID_MAG2] = DOF_MAG_MAG2; dof->dofh_ident[DOF_ID_MAG3] = DOF_MAG_MAG3; dof->dofh_ident[DOF_ID_MODEL] = DOF_MODEL_NATIVE; dof->dofh_ident[DOF_ID_ENCODING] = DOF_ENCODE_NATIVE; dof->dofh_ident[DOF_ID_VERSION] = DOF_VERSION; dof->dofh_ident[DOF_ID_DIFVERS] = DIF_VERSION; dof->dofh_ident[DOF_ID_DIFIREG] = DIF_DIR_NREGS; dof->dofh_ident[DOF_ID_DIFTREG] = DIF_DTR_NREGS; dof->dofh_flags = 0; dof->dofh_hdrsize = sizeof (dof_hdr_t); dof->dofh_secsize = sizeof (dof_sec_t); dof->dofh_secnum = 1; /* only DOF_SECT_OPTDESC */ dof->dofh_secoff = sizeof (dof_hdr_t); dof->dofh_loadsz = len; dof->dofh_filesz = len; dof->dofh_pad = 0; /* * Fill in the option section header... 
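 *
 * For illustration, a sketch of the DOF image being built here:
 *
 *	offset 0            dof_hdr_t
 *	dofh_secoff         dof_sec_t (DOF_SECT_OPTDESC)
 *	sec->dofs_offset    dof_optdesc_t[DTRACEOPT_MAX]
 *
 * where each dofo_value below is a straight snapshot of
 * state->dts_options[i].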
*/ sec = (dof_sec_t *)((uintptr_t)dof + sizeof (dof_hdr_t)); sec->dofs_type = DOF_SECT_OPTDESC; sec->dofs_align = sizeof (uint64_t); sec->dofs_flags = DOF_SECF_LOAD; sec->dofs_entsize = sizeof (dof_optdesc_t); opt = (dof_optdesc_t *)((uintptr_t)sec + roundup(sizeof (dof_sec_t), sizeof (uint64_t))); sec->dofs_offset = (uintptr_t)opt - (uintptr_t)dof; sec->dofs_size = sizeof (dof_optdesc_t) * DTRACEOPT_MAX; for (i = 0; i < DTRACEOPT_MAX; i++) { opt[i].dofo_option = i; opt[i].dofo_strtab = DOF_SECIDX_NONE; opt[i].dofo_value = state->dts_options[i]; } return (dof); } static dof_hdr_t * dtrace_dof_copyin(uintptr_t uarg, int *errp) { dof_hdr_t hdr, *dof; ASSERT(!MUTEX_HELD(&dtrace_lock)); /* * First, we're going to copyin() the sizeof (dof_hdr_t). */ if (copyin((void *)uarg, &hdr, sizeof (hdr)) != 0) { dtrace_dof_error(NULL, "failed to copyin DOF header"); *errp = EFAULT; return (NULL); } /* * Now we'll allocate the entire DOF and copy it in -- provided * that the length isn't outrageous. */ if (hdr.dofh_loadsz >= dtrace_dof_maxsize) { dtrace_dof_error(&hdr, "load size exceeds maximum"); *errp = E2BIG; return (NULL); } if (hdr.dofh_loadsz < sizeof (hdr)) { dtrace_dof_error(&hdr, "invalid load size"); *errp = EINVAL; return (NULL); } dof = kmem_alloc(hdr.dofh_loadsz, KM_SLEEP); if (copyin((void *)uarg, dof, hdr.dofh_loadsz) != 0 || dof->dofh_loadsz != hdr.dofh_loadsz) { kmem_free(dof, hdr.dofh_loadsz); *errp = EFAULT; return (NULL); } return (dof); } #ifdef __FreeBSD__ static dof_hdr_t * dtrace_dof_copyin_proc(struct proc *p, uintptr_t uarg, int *errp) { dof_hdr_t hdr, *dof; struct thread *td; size_t loadsz; ASSERT(!MUTEX_HELD(&dtrace_lock)); td = curthread; /* * First, we're going to copyin() the sizeof (dof_hdr_t). */ if (proc_readmem(td, p, uarg, &hdr, sizeof(hdr)) != sizeof(hdr)) { dtrace_dof_error(NULL, "failed to copyin DOF header"); *errp = EFAULT; return (NULL); } /* * Now we'll allocate the entire DOF and copy it in -- provided * that the length isn't outrageous. */ if (hdr.dofh_loadsz >= dtrace_dof_maxsize) { dtrace_dof_error(&hdr, "load size exceeds maximum"); *errp = E2BIG; return (NULL); } loadsz = (size_t)hdr.dofh_loadsz; if (loadsz < sizeof (hdr)) { dtrace_dof_error(&hdr, "invalid load size"); *errp = EINVAL; return (NULL); } dof = kmem_alloc(loadsz, KM_SLEEP); if (proc_readmem(td, p, uarg, dof, loadsz) != loadsz || dof->dofh_loadsz != loadsz) { kmem_free(dof, hdr.dofh_loadsz); *errp = EFAULT; return (NULL); } return (dof); } static __inline uchar_t dtrace_dof_char(char c) { switch (c) { case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': return (c - '0'); case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': return (c - 'A' + 10); case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': return (c - 'a' + 10); } /* Should not reach here. */ return (UCHAR_MAX); } #endif /* __FreeBSD__ */ static dof_hdr_t * dtrace_dof_property(const char *name) { #ifdef __FreeBSD__ uint8_t *dofbuf; u_char *data, *eol; caddr_t doffile; size_t bytes, len, i; dof_hdr_t *dof; u_char c1, c2; dof = NULL; doffile = preload_search_by_type("dtrace_dof"); if (doffile == NULL) return (NULL); data = preload_fetch_addr(doffile); len = preload_fetch_size(doffile); for (;;) { /* Look for the end of the line. All lines end in a newline. 
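 * For illustration, the preloaded data scanned here is a sequence of
 * newline-terminated name=value lines, the value being the DOF image
 * hex-encoded at two characters per byte; e.g. (hypothetical key,
 * payload truncated -- 7f 44 4f 46 being the DOF magic):
 *
 *	mymodule.dof=7f444f46...
 *
 * dtrace_dof_char() below folds each character pair back into a byte.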
*/ eol = memchr(data, '\n', len); if (eol == NULL) return (NULL); if (strncmp(name, data, strlen(name)) == 0) break; eol++; /* skip past the newline */ len -= eol - data; data = eol; } /* We've found the data corresponding to the specified key. */ data += strlen(name) + 1; /* skip past the '=' */ len = eol - data; if (len % 2 != 0) { dtrace_dof_error(NULL, "invalid DOF encoding length"); goto doferr; } bytes = len / 2; if (bytes < sizeof(dof_hdr_t)) { dtrace_dof_error(NULL, "truncated header"); goto doferr; } /* * Each byte is represented by the two ASCII characters in its hex * representation. */ dofbuf = malloc(bytes, M_SOLARIS, M_WAITOK); for (i = 0; i < bytes; i++) { c1 = dtrace_dof_char(data[i * 2]); c2 = dtrace_dof_char(data[i * 2 + 1]); if (c1 == UCHAR_MAX || c2 == UCHAR_MAX) { dtrace_dof_error(NULL, "invalid hex char in DOF"); goto doferr; } dofbuf[i] = c1 * 16 + c2; } dof = (dof_hdr_t *)dofbuf; if (bytes < dof->dofh_loadsz) { dtrace_dof_error(NULL, "truncated DOF"); goto doferr; } if (dof->dofh_loadsz >= dtrace_dof_maxsize) { dtrace_dof_error(NULL, "oversized DOF"); goto doferr; } return (dof); doferr: free(dof, M_SOLARIS); return (NULL); #else /* __FreeBSD__ */ uchar_t *buf; uint64_t loadsz; unsigned int len, i; dof_hdr_t *dof; /* * Unfortunately, array of values in .conf files are always (and * only) interpreted to be integer arrays. We must read our DOF * as an integer array, and then squeeze it into a byte array. */ if (ddi_prop_lookup_int_array(DDI_DEV_T_ANY, dtrace_devi, 0, (char *)name, (int **)&buf, &len) != DDI_PROP_SUCCESS) return (NULL); for (i = 0; i < len; i++) buf[i] = (uchar_t)(((int *)buf)[i]); if (len < sizeof (dof_hdr_t)) { ddi_prop_free(buf); dtrace_dof_error(NULL, "truncated header"); return (NULL); } if (len < (loadsz = ((dof_hdr_t *)buf)->dofh_loadsz)) { ddi_prop_free(buf); dtrace_dof_error(NULL, "truncated DOF"); return (NULL); } if (loadsz >= dtrace_dof_maxsize) { ddi_prop_free(buf); dtrace_dof_error(NULL, "oversized DOF"); return (NULL); } dof = kmem_alloc(loadsz, KM_SLEEP); bcopy(buf, dof, loadsz); ddi_prop_free(buf); return (dof); #endif /* !__FreeBSD__ */ } static void dtrace_dof_destroy(dof_hdr_t *dof) { kmem_free(dof, dof->dofh_loadsz); } /* * Return the dof_sec_t pointer corresponding to a given section index. If the * index is not valid, dtrace_dof_error() is called and NULL is returned. If * a type other than DOF_SECT_NONE is specified, the header is checked against * this type and NULL is returned if the types do not match. 
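 *
 * For illustration: section i lives at dof + dofh_secoff +
 * i * dofh_secsize, so with dofh_secsize == sizeof (dof_sec_t) (as
 * dtrace_dof_create() above emits), index 2 names the third entry of
 * the section table.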
*/ static dof_sec_t * dtrace_dof_sect(dof_hdr_t *dof, uint32_t type, dof_secidx_t i) { dof_sec_t *sec = (dof_sec_t *)(uintptr_t) ((uintptr_t)dof + dof->dofh_secoff + i * dof->dofh_secsize); if (i >= dof->dofh_secnum) { dtrace_dof_error(dof, "referenced section index is invalid"); return (NULL); } if (!(sec->dofs_flags & DOF_SECF_LOAD)) { dtrace_dof_error(dof, "referenced section is not loadable"); return (NULL); } if (type != DOF_SECT_NONE && type != sec->dofs_type) { dtrace_dof_error(dof, "referenced section is the wrong type"); return (NULL); } return (sec); } static dtrace_probedesc_t * dtrace_dof_probedesc(dof_hdr_t *dof, dof_sec_t *sec, dtrace_probedesc_t *desc) { dof_probedesc_t *probe; dof_sec_t *strtab; uintptr_t daddr = (uintptr_t)dof; uintptr_t str; size_t size; if (sec->dofs_type != DOF_SECT_PROBEDESC) { dtrace_dof_error(dof, "invalid probe section"); return (NULL); } if (sec->dofs_align != sizeof (dof_secidx_t)) { dtrace_dof_error(dof, "bad alignment in probe description"); return (NULL); } if (sec->dofs_offset + sizeof (dof_probedesc_t) > dof->dofh_loadsz) { dtrace_dof_error(dof, "truncated probe description"); return (NULL); } probe = (dof_probedesc_t *)(uintptr_t)(daddr + sec->dofs_offset); strtab = dtrace_dof_sect(dof, DOF_SECT_STRTAB, probe->dofp_strtab); if (strtab == NULL) return (NULL); str = daddr + strtab->dofs_offset; size = strtab->dofs_size; if (probe->dofp_provider >= strtab->dofs_size) { dtrace_dof_error(dof, "corrupt probe provider"); return (NULL); } (void) strncpy(desc->dtpd_provider, (char *)(str + probe->dofp_provider), MIN(DTRACE_PROVNAMELEN - 1, size - probe->dofp_provider)); if (probe->dofp_mod >= strtab->dofs_size) { dtrace_dof_error(dof, "corrupt probe module"); return (NULL); } (void) strncpy(desc->dtpd_mod, (char *)(str + probe->dofp_mod), MIN(DTRACE_MODNAMELEN - 1, size - probe->dofp_mod)); if (probe->dofp_func >= strtab->dofs_size) { dtrace_dof_error(dof, "corrupt probe function"); return (NULL); } (void) strncpy(desc->dtpd_func, (char *)(str + probe->dofp_func), MIN(DTRACE_FUNCNAMELEN - 1, size - probe->dofp_func)); if (probe->dofp_name >= strtab->dofs_size) { dtrace_dof_error(dof, "corrupt probe name"); return (NULL); } (void) strncpy(desc->dtpd_name, (char *)(str + probe->dofp_name), MIN(DTRACE_NAMELEN - 1, size - probe->dofp_name)); return (desc); } static dtrace_difo_t * dtrace_dof_difo(dof_hdr_t *dof, dof_sec_t *sec, dtrace_vstate_t *vstate, cred_t *cr) { dtrace_difo_t *dp; size_t ttl = 0; dof_difohdr_t *dofd; uintptr_t daddr = (uintptr_t)dof; size_t max = dtrace_difo_maxsize; int i, l, n; static const struct { int section; int bufoffs; int lenoffs; int entsize; int align; const char *msg; } difo[] = { { DOF_SECT_DIF, offsetof(dtrace_difo_t, dtdo_buf), offsetof(dtrace_difo_t, dtdo_len), sizeof (dif_instr_t), sizeof (dif_instr_t), "multiple DIF sections" }, { DOF_SECT_INTTAB, offsetof(dtrace_difo_t, dtdo_inttab), offsetof(dtrace_difo_t, dtdo_intlen), sizeof (uint64_t), sizeof (uint64_t), "multiple integer tables" }, { DOF_SECT_STRTAB, offsetof(dtrace_difo_t, dtdo_strtab), offsetof(dtrace_difo_t, dtdo_strlen), 0, sizeof (char), "multiple string tables" }, { DOF_SECT_VARTAB, offsetof(dtrace_difo_t, dtdo_vartab), offsetof(dtrace_difo_t, dtdo_varlen), sizeof (dtrace_difv_t), sizeof (uint_t), "multiple variable tables" }, { DOF_SECT_NONE, 0, 0, 0, 0, NULL } }; if (sec->dofs_type != DOF_SECT_DIFOHDR) { dtrace_dof_error(dof, "invalid DIFO header section"); return (NULL); } if (sec->dofs_align != sizeof (dof_secidx_t)) { dtrace_dof_error(dof, "bad 
alignment in DIFO header"); return (NULL); } if (sec->dofs_size < sizeof (dof_difohdr_t) || sec->dofs_size % sizeof (dof_secidx_t)) { dtrace_dof_error(dof, "bad size in DIFO header"); return (NULL); } dofd = (dof_difohdr_t *)(uintptr_t)(daddr + sec->dofs_offset); n = (sec->dofs_size - sizeof (*dofd)) / sizeof (dof_secidx_t) + 1; dp = kmem_zalloc(sizeof (dtrace_difo_t), KM_SLEEP); dp->dtdo_rtype = dofd->dofd_rtype; for (l = 0; l < n; l++) { dof_sec_t *subsec; void **bufp; uint32_t *lenp; if ((subsec = dtrace_dof_sect(dof, DOF_SECT_NONE, dofd->dofd_links[l])) == NULL) goto err; /* invalid section link */ if (ttl + subsec->dofs_size > max) { dtrace_dof_error(dof, "exceeds maximum size"); goto err; } ttl += subsec->dofs_size; for (i = 0; difo[i].section != DOF_SECT_NONE; i++) { if (subsec->dofs_type != difo[i].section) continue; if (!(subsec->dofs_flags & DOF_SECF_LOAD)) { dtrace_dof_error(dof, "section not loaded"); goto err; } if (subsec->dofs_align != difo[i].align) { dtrace_dof_error(dof, "bad alignment"); goto err; } bufp = (void **)((uintptr_t)dp + difo[i].bufoffs); lenp = (uint32_t *)((uintptr_t)dp + difo[i].lenoffs); if (*bufp != NULL) { dtrace_dof_error(dof, difo[i].msg); goto err; } if (difo[i].entsize != subsec->dofs_entsize) { dtrace_dof_error(dof, "entry size mismatch"); goto err; } if (subsec->dofs_entsize != 0 && (subsec->dofs_size % subsec->dofs_entsize) != 0) { dtrace_dof_error(dof, "corrupt entry size"); goto err; } *lenp = subsec->dofs_size; *bufp = kmem_alloc(subsec->dofs_size, KM_SLEEP); bcopy((char *)(uintptr_t)(daddr + subsec->dofs_offset), *bufp, subsec->dofs_size); if (subsec->dofs_entsize != 0) *lenp /= subsec->dofs_entsize; break; } /* * If we encounter a loadable DIFO sub-section that is not * known to us, assume this is a broken program and fail. */ if (difo[i].section == DOF_SECT_NONE && (subsec->dofs_flags & DOF_SECF_LOAD)) { dtrace_dof_error(dof, "unrecognized DIFO subsection"); goto err; } } if (dp->dtdo_buf == NULL) { /* * We can't have a DIF object without DIF text. */ dtrace_dof_error(dof, "missing DIF text"); goto err; } /* * Before we validate the DIF object, run through the variable table * looking for the strings -- if any of their size are under, we'll set * their size to be the system-wide default string size. Note that * this should _not_ happen if the "strsize" option has been set -- * in this case, the compiler should have set the size to reflect the * setting of the option. 
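 *
 * For illustration: a D variable declared as "string s" can arrive
 * with a dtdt_size of 0; the loop below widens it to
 * dtrace_strsize_default so that later bounds checks have a real size
 * to work with.  Had the consumer set "strsize", the compiler would
 * have emitted the chosen size and this fixup would never fire.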
*/ for (i = 0; i < dp->dtdo_varlen; i++) { dtrace_difv_t *v = &dp->dtdo_vartab[i]; dtrace_diftype_t *t = &v->dtdv_type; if (v->dtdv_id < DIF_VAR_OTHER_UBASE) continue; if (t->dtdt_kind == DIF_TYPE_STRING && t->dtdt_size == 0) t->dtdt_size = dtrace_strsize_default; } if (dtrace_difo_validate(dp, vstate, DIF_DIR_NREGS, cr) != 0) goto err; dtrace_difo_init(dp, vstate); return (dp); err: kmem_free(dp->dtdo_buf, dp->dtdo_len * sizeof (dif_instr_t)); kmem_free(dp->dtdo_inttab, dp->dtdo_intlen * sizeof (uint64_t)); kmem_free(dp->dtdo_strtab, dp->dtdo_strlen); kmem_free(dp->dtdo_vartab, dp->dtdo_varlen * sizeof (dtrace_difv_t)); kmem_free(dp, sizeof (dtrace_difo_t)); return (NULL); } static dtrace_predicate_t * dtrace_dof_predicate(dof_hdr_t *dof, dof_sec_t *sec, dtrace_vstate_t *vstate, cred_t *cr) { dtrace_difo_t *dp; if ((dp = dtrace_dof_difo(dof, sec, vstate, cr)) == NULL) return (NULL); return (dtrace_predicate_create(dp)); } static dtrace_actdesc_t * dtrace_dof_actdesc(dof_hdr_t *dof, dof_sec_t *sec, dtrace_vstate_t *vstate, cred_t *cr) { dtrace_actdesc_t *act, *first = NULL, *last = NULL, *next; dof_actdesc_t *desc; dof_sec_t *difosec; size_t offs; uintptr_t daddr = (uintptr_t)dof; uint64_t arg; dtrace_actkind_t kind; if (sec->dofs_type != DOF_SECT_ACTDESC) { dtrace_dof_error(dof, "invalid action section"); return (NULL); } if (sec->dofs_offset + sizeof (dof_actdesc_t) > dof->dofh_loadsz) { dtrace_dof_error(dof, "truncated action description"); return (NULL); } if (sec->dofs_align != sizeof (uint64_t)) { dtrace_dof_error(dof, "bad alignment in action description"); return (NULL); } if (sec->dofs_size < sec->dofs_entsize) { dtrace_dof_error(dof, "section entry size exceeds total size"); return (NULL); } if (sec->dofs_entsize != sizeof (dof_actdesc_t)) { dtrace_dof_error(dof, "bad entry size in action description"); return (NULL); } if (sec->dofs_size / sec->dofs_entsize > dtrace_actions_max) { dtrace_dof_error(dof, "actions exceed dtrace_actions_max"); return (NULL); } for (offs = 0; offs < sec->dofs_size; offs += sec->dofs_entsize) { desc = (dof_actdesc_t *)(daddr + (uintptr_t)sec->dofs_offset + offs); kind = (dtrace_actkind_t)desc->dofa_kind; if ((DTRACEACT_ISPRINTFLIKE(kind) && (kind != DTRACEACT_PRINTA || desc->dofa_strtab != DOF_SECIDX_NONE)) || (kind == DTRACEACT_DIFEXPR && desc->dofa_strtab != DOF_SECIDX_NONE)) { dof_sec_t *strtab; char *str, *fmt; uint64_t i; /* * The argument to these actions is an index into the * DOF string table. For printf()-like actions, this * is the format string. For print(), this is the * CTF type of the expression result. 
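 *
 * For illustration: for printf("%s", copyinstr(arg0)), dofa_arg is the
 * offset of the string "%s" within the referenced DOF_SECT_STRTAB; the
 * loop below walks from that offset to the terminating NUL, both
 * validating and measuring the format before it is copied out.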
*/ if ((strtab = dtrace_dof_sect(dof, DOF_SECT_STRTAB, desc->dofa_strtab)) == NULL) goto err; str = (char *)((uintptr_t)dof + (uintptr_t)strtab->dofs_offset); for (i = desc->dofa_arg; i < strtab->dofs_size; i++) { if (str[i] == '\0') break; } if (i >= strtab->dofs_size) { dtrace_dof_error(dof, "bogus format string"); goto err; } if (i == desc->dofa_arg) { dtrace_dof_error(dof, "empty format string"); goto err; } i -= desc->dofa_arg; fmt = kmem_alloc(i + 1, KM_SLEEP); bcopy(&str[desc->dofa_arg], fmt, i + 1); arg = (uint64_t)(uintptr_t)fmt; } else { if (kind == DTRACEACT_PRINTA) { ASSERT(desc->dofa_strtab == DOF_SECIDX_NONE); arg = 0; } else { arg = desc->dofa_arg; } } act = dtrace_actdesc_create(kind, desc->dofa_ntuple, desc->dofa_uarg, arg); if (last != NULL) { last->dtad_next = act; } else { first = act; } last = act; if (desc->dofa_difo == DOF_SECIDX_NONE) continue; if ((difosec = dtrace_dof_sect(dof, DOF_SECT_DIFOHDR, desc->dofa_difo)) == NULL) goto err; act->dtad_difo = dtrace_dof_difo(dof, difosec, vstate, cr); if (act->dtad_difo == NULL) goto err; } ASSERT(first != NULL); return (first); err: for (act = first; act != NULL; act = next) { next = act->dtad_next; dtrace_actdesc_release(act, vstate); } return (NULL); } static dtrace_ecbdesc_t * dtrace_dof_ecbdesc(dof_hdr_t *dof, dof_sec_t *sec, dtrace_vstate_t *vstate, cred_t *cr) { dtrace_ecbdesc_t *ep; dof_ecbdesc_t *ecb; dtrace_probedesc_t *desc; dtrace_predicate_t *pred = NULL; if (sec->dofs_size < sizeof (dof_ecbdesc_t)) { dtrace_dof_error(dof, "truncated ECB description"); return (NULL); } if (sec->dofs_align != sizeof (uint64_t)) { dtrace_dof_error(dof, "bad alignment in ECB description"); return (NULL); } ecb = (dof_ecbdesc_t *)((uintptr_t)dof + (uintptr_t)sec->dofs_offset); sec = dtrace_dof_sect(dof, DOF_SECT_PROBEDESC, ecb->dofe_probes); if (sec == NULL) return (NULL); ep = kmem_zalloc(sizeof (dtrace_ecbdesc_t), KM_SLEEP); ep->dted_uarg = ecb->dofe_uarg; desc = &ep->dted_probe; if (dtrace_dof_probedesc(dof, sec, desc) == NULL) goto err; if (ecb->dofe_pred != DOF_SECIDX_NONE) { if ((sec = dtrace_dof_sect(dof, DOF_SECT_DIFOHDR, ecb->dofe_pred)) == NULL) goto err; if ((pred = dtrace_dof_predicate(dof, sec, vstate, cr)) == NULL) goto err; ep->dted_pred.dtpdd_predicate = pred; } if (ecb->dofe_actions != DOF_SECIDX_NONE) { if ((sec = dtrace_dof_sect(dof, DOF_SECT_ACTDESC, ecb->dofe_actions)) == NULL) goto err; ep->dted_action = dtrace_dof_actdesc(dof, sec, vstate, cr); if (ep->dted_action == NULL) goto err; } return (ep); err: if (pred != NULL) dtrace_predicate_release(pred, vstate); kmem_free(ep, sizeof (dtrace_ecbdesc_t)); return (NULL); } /* * Apply the relocations from the specified 'sec' (a DOF_SECT_URELHDR) to the * specified DOF. SETX relocations are computed using 'ubase', the base load * address of the object containing the DOF, and DOFREL relocations are relative * to the relocation offset within the DOF. 
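 * As a worked example (illustrative values only): with ubase 0x400000,
 * a SETX slot holding 0x10 becomes 0x400010; under DOFREL the same
 * slot would instead gain udaddr plus the target section offset plus
 * the record offset, i.e. it becomes relative to wherever the DOF
 * itself was mapped.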
*/ static int dtrace_dof_relocate(dof_hdr_t *dof, dof_sec_t *sec, uint64_t ubase, uint64_t udaddr) { uintptr_t daddr = (uintptr_t)dof; uintptr_t ts_end; dof_relohdr_t *dofr = (dof_relohdr_t *)(uintptr_t)(daddr + sec->dofs_offset); dof_sec_t *ss, *rs, *ts; dof_relodesc_t *r; uint_t i, n; if (sec->dofs_size < sizeof (dof_relohdr_t) || sec->dofs_align != sizeof (dof_secidx_t)) { dtrace_dof_error(dof, "invalid relocation header"); return (-1); } ss = dtrace_dof_sect(dof, DOF_SECT_STRTAB, dofr->dofr_strtab); rs = dtrace_dof_sect(dof, DOF_SECT_RELTAB, dofr->dofr_relsec); ts = dtrace_dof_sect(dof, DOF_SECT_NONE, dofr->dofr_tgtsec); ts_end = (uintptr_t)ts + sizeof (dof_sec_t); if (ss == NULL || rs == NULL || ts == NULL) return (-1); /* dtrace_dof_error() has been called already */ if (rs->dofs_entsize < sizeof (dof_relodesc_t) || rs->dofs_align != sizeof (uint64_t)) { dtrace_dof_error(dof, "invalid relocation section"); return (-1); } r = (dof_relodesc_t *)(uintptr_t)(daddr + rs->dofs_offset); n = rs->dofs_size / rs->dofs_entsize; for (i = 0; i < n; i++) { uintptr_t taddr = daddr + ts->dofs_offset + r->dofr_offset; switch (r->dofr_type) { case DOF_RELO_NONE: break; case DOF_RELO_SETX: case DOF_RELO_DOFREL: if (r->dofr_offset >= ts->dofs_size || r->dofr_offset + sizeof (uint64_t) > ts->dofs_size) { dtrace_dof_error(dof, "bad relocation offset"); return (-1); } if (taddr >= (uintptr_t)ts && taddr < ts_end) { dtrace_dof_error(dof, "bad relocation offset"); return (-1); } if (!IS_P2ALIGNED(taddr, sizeof (uint64_t))) { dtrace_dof_error(dof, "misaligned setx relo"); return (-1); } if (r->dofr_type == DOF_RELO_SETX) *(uint64_t *)taddr += ubase; else *(uint64_t *)taddr += udaddr + ts->dofs_offset + r->dofr_offset; break; default: dtrace_dof_error(dof, "invalid relocation type"); return (-1); } r = (dof_relodesc_t *)((uintptr_t)r + rs->dofs_entsize); } return (0); } /* * The dof_hdr_t passed to dtrace_dof_slurp() should be a partially validated * header: it should be at the front of a memory region that is at least * sizeof (dof_hdr_t) in size -- and then at least dof_hdr.dofh_loadsz in * size. It need not be validated in any other way. */ static int dtrace_dof_slurp(dof_hdr_t *dof, dtrace_vstate_t *vstate, cred_t *cr, dtrace_enabling_t **enabp, uint64_t ubase, uint64_t udaddr, int noprobes) { uint64_t len = dof->dofh_loadsz, seclen; uintptr_t daddr = (uintptr_t)dof; dtrace_ecbdesc_t *ep; dtrace_enabling_t *enab; uint_t i; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(dof->dofh_loadsz >= sizeof (dof_hdr_t)); /* * Check the DOF header identification bytes. In addition to checking * valid settings, we also verify that unused bits/bytes are zeroed so * we can use them later without fear of regressing existing binaries. 
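 * The checks below reject, in order: a bad magic string, an unknown
 * data model, a non-native encoding, unsupported DOF and DIF versions,
 * excessive integer or tuple register counts, non-zero padding bytes,
 * and unknown flag bits.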
*/ if (bcmp(&dof->dofh_ident[DOF_ID_MAG0], DOF_MAG_STRING, DOF_MAG_STRLEN) != 0) { dtrace_dof_error(dof, "DOF magic string mismatch"); return (-1); } if (dof->dofh_ident[DOF_ID_MODEL] != DOF_MODEL_ILP32 && dof->dofh_ident[DOF_ID_MODEL] != DOF_MODEL_LP64) { dtrace_dof_error(dof, "DOF has invalid data model"); return (-1); } if (dof->dofh_ident[DOF_ID_ENCODING] != DOF_ENCODE_NATIVE) { dtrace_dof_error(dof, "DOF encoding mismatch"); return (-1); } if (dof->dofh_ident[DOF_ID_VERSION] != DOF_VERSION_1 && dof->dofh_ident[DOF_ID_VERSION] != DOF_VERSION_2) { dtrace_dof_error(dof, "DOF version mismatch"); return (-1); } if (dof->dofh_ident[DOF_ID_DIFVERS] != DIF_VERSION_2) { dtrace_dof_error(dof, "DOF uses unsupported instruction set"); return (-1); } if (dof->dofh_ident[DOF_ID_DIFIREG] > DIF_DIR_NREGS) { dtrace_dof_error(dof, "DOF uses too many integer registers"); return (-1); } if (dof->dofh_ident[DOF_ID_DIFTREG] > DIF_DTR_NREGS) { dtrace_dof_error(dof, "DOF uses too many tuple registers"); return (-1); } for (i = DOF_ID_PAD; i < DOF_ID_SIZE; i++) { if (dof->dofh_ident[i] != 0) { dtrace_dof_error(dof, "DOF has invalid ident byte set"); return (-1); } } if (dof->dofh_flags & ~DOF_FL_VALID) { dtrace_dof_error(dof, "DOF has invalid flag bits set"); return (-1); } if (dof->dofh_secsize == 0) { dtrace_dof_error(dof, "zero section header size"); return (-1); } /* * Check that the section headers don't exceed the amount of DOF * data. Note that we cast the section size and number of sections * to uint64_t's to prevent possible overflow in the multiplication. */ seclen = (uint64_t)dof->dofh_secnum * (uint64_t)dof->dofh_secsize; if (dof->dofh_secoff > len || seclen > len || dof->dofh_secoff + seclen > len) { dtrace_dof_error(dof, "truncated section headers"); return (-1); } if (!IS_P2ALIGNED(dof->dofh_secoff, sizeof (uint64_t))) { dtrace_dof_error(dof, "misaligned section headers"); return (-1); } if (!IS_P2ALIGNED(dof->dofh_secsize, sizeof (uint64_t))) { dtrace_dof_error(dof, "misaligned section size"); return (-1); } /* * Take an initial pass through the section headers to be sure that * the headers don't have stray offsets. If the 'noprobes' flag is * set, do not permit sections relating to providers, probes, or args. */ for (i = 0; i < dof->dofh_secnum; i++) { dof_sec_t *sec = (dof_sec_t *)(daddr + (uintptr_t)dof->dofh_secoff + i * dof->dofh_secsize); if (noprobes) { switch (sec->dofs_type) { case DOF_SECT_PROVIDER: case DOF_SECT_PROBES: case DOF_SECT_PRARGS: case DOF_SECT_PROFFS: dtrace_dof_error(dof, "illegal sections " "for enabling"); return (-1); } } if (DOF_SEC_ISLOADABLE(sec->dofs_type) && !(sec->dofs_flags & DOF_SECF_LOAD)) { dtrace_dof_error(dof, "loadable section with load " "flag unset"); return (-1); } if (!(sec->dofs_flags & DOF_SECF_LOAD)) continue; /* just ignore non-loadable sections */ if (!ISP2(sec->dofs_align)) { dtrace_dof_error(dof, "bad section alignment"); return (-1); } if (sec->dofs_offset & (sec->dofs_align - 1)) { dtrace_dof_error(dof, "misaligned section"); return (-1); } if (sec->dofs_offset > len || sec->dofs_size > len || sec->dofs_offset + sec->dofs_size > len) { dtrace_dof_error(dof, "corrupt section header"); return (-1); } if (sec->dofs_type == DOF_SECT_STRTAB && *((char *)daddr + sec->dofs_offset + sec->dofs_size - 1) != '\0') { dtrace_dof_error(dof, "non-terminating string table"); return (-1); } } /* * Take a second pass through the sections and locate and perform any * relocations that are present. 
We do this after the first pass to * be sure that all sections have had their headers validated. */ for (i = 0; i < dof->dofh_secnum; i++) { dof_sec_t *sec = (dof_sec_t *)(daddr + (uintptr_t)dof->dofh_secoff + i * dof->dofh_secsize); if (!(sec->dofs_flags & DOF_SECF_LOAD)) continue; /* skip sections that are not loadable */ switch (sec->dofs_type) { case DOF_SECT_URELHDR: if (dtrace_dof_relocate(dof, sec, ubase, udaddr) != 0) return (-1); break; } } if ((enab = *enabp) == NULL) enab = *enabp = dtrace_enabling_create(vstate); for (i = 0; i < dof->dofh_secnum; i++) { dof_sec_t *sec = (dof_sec_t *)(daddr + (uintptr_t)dof->dofh_secoff + i * dof->dofh_secsize); if (sec->dofs_type != DOF_SECT_ECBDESC) continue; if ((ep = dtrace_dof_ecbdesc(dof, sec, vstate, cr)) == NULL) { dtrace_enabling_destroy(enab); *enabp = NULL; return (-1); } dtrace_enabling_add(enab, ep); } return (0); } /* * Process DOF for any options. This routine assumes that the DOF has been * at least processed by dtrace_dof_slurp(). */ static int dtrace_dof_options(dof_hdr_t *dof, dtrace_state_t *state) { int i, rval; uint32_t entsize; size_t offs; dof_optdesc_t *desc; for (i = 0; i < dof->dofh_secnum; i++) { dof_sec_t *sec = (dof_sec_t *)((uintptr_t)dof + (uintptr_t)dof->dofh_secoff + i * dof->dofh_secsize); if (sec->dofs_type != DOF_SECT_OPTDESC) continue; if (sec->dofs_align != sizeof (uint64_t)) { dtrace_dof_error(dof, "bad alignment in " "option description"); return (EINVAL); } if ((entsize = sec->dofs_entsize) == 0) { dtrace_dof_error(dof, "zeroed option entry size"); return (EINVAL); } if (entsize < sizeof (dof_optdesc_t)) { dtrace_dof_error(dof, "bad option entry size"); return (EINVAL); } for (offs = 0; offs < sec->dofs_size; offs += entsize) { desc = (dof_optdesc_t *)((uintptr_t)dof + (uintptr_t)sec->dofs_offset + offs); if (desc->dofo_strtab != DOF_SECIDX_NONE) { dtrace_dof_error(dof, "non-zero option string"); return (EINVAL); } if (desc->dofo_value == DTRACEOPT_UNSET) { dtrace_dof_error(dof, "unset option"); return (EINVAL); } if ((rval = dtrace_state_option(state, desc->dofo_option, desc->dofo_value)) != 0) { dtrace_dof_error(dof, "rejected option"); return (rval); } } } return (0); } /* * DTrace Consumer State Functions */ static int dtrace_dstate_init(dtrace_dstate_t *dstate, size_t size) { size_t hashsize, maxper, min, chunksize = dstate->dtds_chunksize; void *base; uintptr_t limit; dtrace_dynvar_t *dvar, *next, *start; int i; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(dstate->dtds_base == NULL && dstate->dtds_percpu == NULL); bzero(dstate, sizeof (dtrace_dstate_t)); if ((dstate->dtds_chunksize = chunksize) == 0) dstate->dtds_chunksize = DTRACE_DYNVAR_CHUNKSIZE; VERIFY(dstate->dtds_chunksize < LONG_MAX); if (size < (min = dstate->dtds_chunksize + sizeof (dtrace_dynhash_t))) size = min; if ((base = kmem_zalloc(size, KM_NOSLEEP | KM_NORMALPRI)) == NULL) return (ENOMEM); dstate->dtds_size = size; dstate->dtds_base = base; dstate->dtds_percpu = kmem_cache_alloc(dtrace_state_cache, KM_SLEEP); bzero(dstate->dtds_percpu, NCPU * sizeof (dtrace_dstate_percpu_t)); hashsize = size / (dstate->dtds_chunksize + sizeof (dtrace_dynhash_t)); if (hashsize != 1 && (hashsize & 1)) hashsize--; dstate->dtds_hashsize = hashsize; dstate->dtds_hash = dstate->dtds_base; /* * Set all of our hash buckets to point to the single sink, and (if * it hasn't already been set), set the sink's hash value to be the * sink sentinel value. 
The sink is needed for dynamic variable * lookups to know that they have iterated over an entire, valid hash * chain. */ for (i = 0; i < hashsize; i++) dstate->dtds_hash[i].dtdh_chain = &dtrace_dynhash_sink; if (dtrace_dynhash_sink.dtdv_hashval != DTRACE_DYNHASH_SINK) dtrace_dynhash_sink.dtdv_hashval = DTRACE_DYNHASH_SINK; /* * Determine number of active CPUs. Divide free list evenly among * active CPUs. */ start = (dtrace_dynvar_t *) ((uintptr_t)base + hashsize * sizeof (dtrace_dynhash_t)); limit = (uintptr_t)base + size; VERIFY((uintptr_t)start < limit); VERIFY((uintptr_t)start >= (uintptr_t)base); maxper = (limit - (uintptr_t)start) / NCPU; maxper = (maxper / dstate->dtds_chunksize) * dstate->dtds_chunksize; #ifndef illumos CPU_FOREACH(i) { #else for (i = 0; i < NCPU; i++) { #endif dstate->dtds_percpu[i].dtdsc_free = dvar = start; /* * If we don't even have enough chunks to make it once through * NCPUs, we're just going to allocate everything to the first * CPU. And if we're on the last CPU, we're going to allocate * whatever is left over. In either case, we set the limit to * be the limit of the dynamic variable space. */ if (maxper == 0 || i == NCPU - 1) { limit = (uintptr_t)base + size; start = NULL; } else { limit = (uintptr_t)start + maxper; start = (dtrace_dynvar_t *)limit; } VERIFY(limit <= (uintptr_t)base + size); for (;;) { next = (dtrace_dynvar_t *)((uintptr_t)dvar + dstate->dtds_chunksize); if ((uintptr_t)next + dstate->dtds_chunksize >= limit) break; VERIFY((uintptr_t)dvar >= (uintptr_t)base && (uintptr_t)dvar <= (uintptr_t)base + size); dvar->dtdv_next = next; dvar = next; } if (maxper == 0) break; } return (0); } static void dtrace_dstate_fini(dtrace_dstate_t *dstate) { ASSERT(MUTEX_HELD(&cpu_lock)); if (dstate->dtds_base == NULL) return; kmem_free(dstate->dtds_base, dstate->dtds_size); kmem_cache_free(dtrace_state_cache, dstate->dtds_percpu); } static void dtrace_vstate_fini(dtrace_vstate_t *vstate) { /* * Logical XOR, where are you? */ ASSERT((vstate->dtvs_nglobals == 0) ^ (vstate->dtvs_globals != NULL)); if (vstate->dtvs_nglobals > 0) { kmem_free(vstate->dtvs_globals, vstate->dtvs_nglobals * sizeof (dtrace_statvar_t *)); } if (vstate->dtvs_ntlocals > 0) { kmem_free(vstate->dtvs_tlocals, vstate->dtvs_ntlocals * sizeof (dtrace_difv_t)); } ASSERT((vstate->dtvs_nlocals == 0) ^ (vstate->dtvs_locals != NULL)); if (vstate->dtvs_nlocals > 0) { kmem_free(vstate->dtvs_locals, vstate->dtvs_nlocals * sizeof (dtrace_statvar_t *)); } } #ifdef illumos static void dtrace_state_clean(dtrace_state_t *state) { if (state->dts_activity == DTRACE_ACTIVITY_INACTIVE) return; dtrace_dynvar_clean(&state->dts_vstate.dtvs_dynvars); dtrace_speculation_clean(state); } static void dtrace_state_deadman(dtrace_state_t *state) { hrtime_t now; dtrace_sync(); now = dtrace_gethrtime(); if (state != dtrace_anon.dta_state && now - state->dts_laststatus >= dtrace_deadman_user) return; /* * We must be sure that dts_alive never appears to be less than the * value upon entry to dtrace_state_deadman(), and because we lack a * dtrace_cas64(), we cannot store to it atomically. We thus instead * store INT64_MAX to it, followed by a memory barrier, followed by * the new value. This assures that dts_alive never appears to be * less than its true value, regardless of the order in which the * stores to the underlying storage are issued. 
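 * As a rough user-space sketch of the same publication pattern (C11
 * atomics stand in for the kernel's barrier primitive here):
 *
 *	alive = INT64_MAX;		(reader can only see "too large")
 *	atomic_thread_fence(memory_order_release);
 *	alive = now;			(publish the real timestamp)
 *
 * A racing reader may observe INT64_MAX or the new timestamp, but
 * never a value below the true one.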
*/ state->dts_alive = INT64_MAX; dtrace_membar_producer(); state->dts_alive = now; } #else /* !illumos */ static void dtrace_state_clean(void *arg) { dtrace_state_t *state = arg; dtrace_optval_t *opt = state->dts_options; if (state->dts_activity == DTRACE_ACTIVITY_INACTIVE) return; dtrace_dynvar_clean(&state->dts_vstate.dtvs_dynvars); dtrace_speculation_clean(state); callout_reset(&state->dts_cleaner, hz * opt[DTRACEOPT_CLEANRATE] / NANOSEC, dtrace_state_clean, state); } static void dtrace_state_deadman(void *arg) { dtrace_state_t *state = arg; hrtime_t now; dtrace_sync(); dtrace_debug_output(); now = dtrace_gethrtime(); if (state != dtrace_anon.dta_state && now - state->dts_laststatus >= dtrace_deadman_user) return; /* * We must be sure that dts_alive never appears to be less than the * value upon entry to dtrace_state_deadman(), and because we lack a * dtrace_cas64(), we cannot store to it atomically. We thus instead * store INT64_MAX to it, followed by a memory barrier, followed by * the new value. This assures that dts_alive never appears to be * less than its true value, regardless of the order in which the * stores to the underlying storage are issued. */ state->dts_alive = INT64_MAX; dtrace_membar_producer(); state->dts_alive = now; callout_reset(&state->dts_deadman, hz * dtrace_deadman_interval / NANOSEC, dtrace_state_deadman, state); } #endif /* illumos */ static dtrace_state_t * #ifdef illumos dtrace_state_create(dev_t *devp, cred_t *cr) #else dtrace_state_create(struct cdev *dev, struct ucred *cred __unused) #endif { #ifdef illumos minor_t minor; major_t major; #else cred_t *cr = NULL; int m = 0; #endif char c[30]; dtrace_state_t *state; dtrace_optval_t *opt; int bufsize = NCPU * sizeof (dtrace_buffer_t), i; int cpu_it; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(MUTEX_HELD(&cpu_lock)); #ifdef illumos minor = (minor_t)(uintptr_t)vmem_alloc(dtrace_minor, 1, VM_BESTFIT | VM_SLEEP); if (ddi_soft_state_zalloc(dtrace_softstate, minor) != DDI_SUCCESS) { vmem_free(dtrace_minor, (void *)(uintptr_t)minor, 1); return (NULL); } state = ddi_get_soft_state(dtrace_softstate, minor); #else if (dev != NULL) { cr = dev->si_cred; m = dev2unit(dev); } /* Allocate memory for the state. */ state = kmem_zalloc(sizeof(dtrace_state_t), KM_SLEEP); #endif state->dts_epid = DTRACE_EPIDNONE + 1; (void) snprintf(c, sizeof (c), "dtrace_aggid_%d", m); #ifdef illumos state->dts_aggid_arena = vmem_create(c, (void *)1, UINT32_MAX, 1, NULL, NULL, NULL, 0, VM_SLEEP | VMC_IDENTIFIER); if (devp != NULL) { major = getemajor(*devp); } else { major = ddi_driver_major(dtrace_devi); } state->dts_dev = makedevice(major, minor); if (devp != NULL) *devp = state->dts_dev; #else state->dts_aggid_arena = new_unrhdr(1, INT_MAX, &dtrace_unr_mtx); state->dts_dev = dev; #endif /* * We allocate NCPU buffers. On the one hand, this can be quite * a bit of memory per instance (nearly 36K on a Starcat). On the * other hand, it saves an additional memory reference in the probe * path. */ state->dts_buffer = kmem_zalloc(bufsize, KM_SLEEP); state->dts_aggbuffer = kmem_zalloc(bufsize, KM_SLEEP); /* * Allocate and initialise the per-process per-CPU random state. * SI_SUB_RANDOM < SI_SUB_DTRACE_ANON therefore entropy device is * assumed to be seeded at this point (if from Fortuna seed file). */ (void) read_random(&state->dts_rstate[0], 2 * sizeof(uint64_t)); for (cpu_it = 1; cpu_it < NCPU; cpu_it++) { /* * Each CPU is assigned a 2^64 period, non-overlapping * subsequence. 
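 * (The jump function advances the underlying xoroshiro128+ stream by
 * 2^64 steps, so each CPU's subsequence begins where the previous
 * one's ends; the streams cannot collide unless a single CPU draws
 * more than 2^64 values.)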
*/ dtrace_xoroshiro128_plus_jump(state->dts_rstate[cpu_it-1], state->dts_rstate[cpu_it]); } #ifdef illumos state->dts_cleaner = CYCLIC_NONE; state->dts_deadman = CYCLIC_NONE; #else callout_init(&state->dts_cleaner, 1); callout_init(&state->dts_deadman, 1); #endif state->dts_vstate.dtvs_state = state; for (i = 0; i < DTRACEOPT_MAX; i++) state->dts_options[i] = DTRACEOPT_UNSET; /* * Set the default options. */ opt = state->dts_options; opt[DTRACEOPT_BUFPOLICY] = DTRACEOPT_BUFPOLICY_SWITCH; opt[DTRACEOPT_BUFRESIZE] = DTRACEOPT_BUFRESIZE_AUTO; opt[DTRACEOPT_NSPEC] = dtrace_nspec_default; opt[DTRACEOPT_SPECSIZE] = dtrace_specsize_default; opt[DTRACEOPT_CPU] = (dtrace_optval_t)DTRACE_CPUALL; opt[DTRACEOPT_STRSIZE] = dtrace_strsize_default; opt[DTRACEOPT_STACKFRAMES] = dtrace_stackframes_default; opt[DTRACEOPT_USTACKFRAMES] = dtrace_ustackframes_default; opt[DTRACEOPT_CLEANRATE] = dtrace_cleanrate_default; opt[DTRACEOPT_AGGRATE] = dtrace_aggrate_default; opt[DTRACEOPT_SWITCHRATE] = dtrace_switchrate_default; opt[DTRACEOPT_STATUSRATE] = dtrace_statusrate_default; opt[DTRACEOPT_JSTACKFRAMES] = dtrace_jstackframes_default; opt[DTRACEOPT_JSTACKSTRSIZE] = dtrace_jstackstrsize_default; state->dts_activity = DTRACE_ACTIVITY_INACTIVE; /* * Depending on the user credentials, we set flag bits which alter probe * visibility or the amount of destructiveness allowed. In the case of * actual anonymous tracing, or the possession of all privileges, all of * the normal checks are bypassed. */ if (cr == NULL || PRIV_POLICY_ONLY(cr, PRIV_ALL, B_FALSE)) { state->dts_cred.dcr_visible = DTRACE_CRV_ALL; state->dts_cred.dcr_action = DTRACE_CRA_ALL; } else { /* * Set up the credentials for this instantiation. We take a * hold on the credential to prevent it from disappearing on * us; this in turn prevents the zone_t referenced by this * credential from disappearing. This means that we can * examine the credential and the zone from probe context. */ crhold(cr); state->dts_cred.dcr_cred = cr; /* * CRA_PROC means "we have *some* privilege for dtrace" and * unlocks the use of variables like pid, zonename, etc. */ if (PRIV_POLICY_ONLY(cr, PRIV_DTRACE_USER, B_FALSE) || PRIV_POLICY_ONLY(cr, PRIV_DTRACE_PROC, B_FALSE)) { state->dts_cred.dcr_action |= DTRACE_CRA_PROC; } /* * dtrace_user allows use of syscall and profile providers. * If the user also has proc_owner and/or proc_zone, we * extend the scope to include additional visibility and * destructive power. */ if (PRIV_POLICY_ONLY(cr, PRIV_DTRACE_USER, B_FALSE)) { if (PRIV_POLICY_ONLY(cr, PRIV_PROC_OWNER, B_FALSE)) { state->dts_cred.dcr_visible |= DTRACE_CRV_ALLPROC; state->dts_cred.dcr_action |= DTRACE_CRA_PROC_DESTRUCTIVE_ALLUSER; } if (PRIV_POLICY_ONLY(cr, PRIV_PROC_ZONE, B_FALSE)) { state->dts_cred.dcr_visible |= DTRACE_CRV_ALLZONE; state->dts_cred.dcr_action |= DTRACE_CRA_PROC_DESTRUCTIVE_ALLZONE; } /* * If we have all privs in whatever zone this is, * we can do destructive things to processes which * have altered credentials. */ #ifdef illumos if (priv_isequalset(priv_getset(cr, PRIV_EFFECTIVE), cr->cr_zone->zone_privset)) { state->dts_cred.dcr_action |= DTRACE_CRA_PROC_DESTRUCTIVE_CREDCHG; } #endif } /* * Holding the dtrace_kernel privilege also implies that * the user has the dtrace_user privilege from a visibility * perspective. But without further privileges, some * destructive actions are not available. */ if (PRIV_POLICY_ONLY(cr, PRIV_DTRACE_KERNEL, B_FALSE)) { /* * Make all probes in all zones visible. 
However, * this doesn't mean that all actions become available * to all zones. */ state->dts_cred.dcr_visible |= DTRACE_CRV_KERNEL | DTRACE_CRV_ALLPROC | DTRACE_CRV_ALLZONE; state->dts_cred.dcr_action |= DTRACE_CRA_KERNEL | DTRACE_CRA_PROC; /* * Holding proc_owner means that destructive actions * for *this* zone are allowed. */ if (PRIV_POLICY_ONLY(cr, PRIV_PROC_OWNER, B_FALSE)) state->dts_cred.dcr_action |= DTRACE_CRA_PROC_DESTRUCTIVE_ALLUSER; /* * Holding proc_zone means that destructive actions * for this user/group ID in all zones are allowed. */ if (PRIV_POLICY_ONLY(cr, PRIV_PROC_ZONE, B_FALSE)) state->dts_cred.dcr_action |= DTRACE_CRA_PROC_DESTRUCTIVE_ALLZONE; #ifdef illumos /* * If we have all privs in whatever zone this is, * we can do destructive things to processes which * have altered credentials. */ if (priv_isequalset(priv_getset(cr, PRIV_EFFECTIVE), cr->cr_zone->zone_privset)) { state->dts_cred.dcr_action |= DTRACE_CRA_PROC_DESTRUCTIVE_CREDCHG; } #endif } /* * Holding the dtrace_proc privilege gives control over fasttrap * and pid providers. We need to grant wider destructive * privileges in the event that the user has proc_owner and/or * proc_zone. */ if (PRIV_POLICY_ONLY(cr, PRIV_DTRACE_PROC, B_FALSE)) { if (PRIV_POLICY_ONLY(cr, PRIV_PROC_OWNER, B_FALSE)) state->dts_cred.dcr_action |= DTRACE_CRA_PROC_DESTRUCTIVE_ALLUSER; if (PRIV_POLICY_ONLY(cr, PRIV_PROC_ZONE, B_FALSE)) state->dts_cred.dcr_action |= DTRACE_CRA_PROC_DESTRUCTIVE_ALLZONE; } } return (state); } static int dtrace_state_buffer(dtrace_state_t *state, dtrace_buffer_t *buf, int which) { dtrace_optval_t *opt = state->dts_options, size; processorid_t cpu = 0; int flags = 0, rval, factor, divisor = 1; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(MUTEX_HELD(&cpu_lock)); ASSERT(which < DTRACEOPT_MAX); ASSERT(state->dts_activity == DTRACE_ACTIVITY_INACTIVE || (state == dtrace_anon.dta_state && state->dts_activity == DTRACE_ACTIVITY_ACTIVE)); if (opt[which] == DTRACEOPT_UNSET || opt[which] == 0) return (0); if (opt[DTRACEOPT_CPU] != DTRACEOPT_UNSET) cpu = opt[DTRACEOPT_CPU]; if (which == DTRACEOPT_SPECSIZE) flags |= DTRACEBUF_NOSWITCH; if (which == DTRACEOPT_BUFSIZE) { if (opt[DTRACEOPT_BUFPOLICY] == DTRACEOPT_BUFPOLICY_RING) flags |= DTRACEBUF_RING; if (opt[DTRACEOPT_BUFPOLICY] == DTRACEOPT_BUFPOLICY_FILL) flags |= DTRACEBUF_FILL; if (state != dtrace_anon.dta_state || state->dts_activity != DTRACE_ACTIVITY_ACTIVE) flags |= DTRACEBUF_INACTIVE; } for (size = opt[which]; size >= sizeof (uint64_t); size /= divisor) { /* * The size must be 8-byte aligned. If the size is not 8-byte * aligned, drop it down by the difference. */ if (size & (sizeof (uint64_t) - 1)) size -= size & (sizeof (uint64_t) - 1); if (size < state->dts_reserve) { /* * Buffers must always be large enough to accommodate * their prereserved space. We return E2BIG instead * of ENOMEM in this case to allow for user-level * software to differentiate the cases.
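 * (E2BIG thus means "the reserve cannot be met"; a plain ENOMEM from
 * dtrace_buffer_alloc() instead drives the halving retry loop below
 * whenever the resize policy is automatic.)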
*/ return (E2BIG); } rval = dtrace_buffer_alloc(buf, size, flags, cpu, &factor); if (rval != ENOMEM) { opt[which] = size; return (rval); } if (opt[DTRACEOPT_BUFRESIZE] == DTRACEOPT_BUFRESIZE_MANUAL) return (rval); for (divisor = 2; divisor < factor; divisor <<= 1) continue; } return (ENOMEM); } static int dtrace_state_buffers(dtrace_state_t *state) { dtrace_speculation_t *spec = state->dts_speculations; int rval, i; if ((rval = dtrace_state_buffer(state, state->dts_buffer, DTRACEOPT_BUFSIZE)) != 0) return (rval); if ((rval = dtrace_state_buffer(state, state->dts_aggbuffer, DTRACEOPT_AGGSIZE)) != 0) return (rval); for (i = 0; i < state->dts_nspeculations; i++) { if ((rval = dtrace_state_buffer(state, spec[i].dtsp_buffer, DTRACEOPT_SPECSIZE)) != 0) return (rval); } return (0); } static void dtrace_state_prereserve(dtrace_state_t *state) { dtrace_ecb_t *ecb; dtrace_probe_t *probe; state->dts_reserve = 0; if (state->dts_options[DTRACEOPT_BUFPOLICY] != DTRACEOPT_BUFPOLICY_FILL) return; /* * If our buffer policy is a "fill" buffer policy, we need to set the * prereserved space to be the space required by the END probes. */ probe = dtrace_probes[dtrace_probeid_end - 1]; ASSERT(probe != NULL); for (ecb = probe->dtpr_ecb; ecb != NULL; ecb = ecb->dte_next) { if (ecb->dte_state != state) continue; state->dts_reserve += ecb->dte_needed + ecb->dte_alignment; } } static int dtrace_state_go(dtrace_state_t *state, processorid_t *cpu) { dtrace_optval_t *opt = state->dts_options, sz, nspec; dtrace_speculation_t *spec; dtrace_buffer_t *buf; #ifdef illumos cyc_handler_t hdlr; cyc_time_t when; #endif int rval = 0, i, bufsize = NCPU * sizeof (dtrace_buffer_t); dtrace_icookie_t cookie; mutex_enter(&cpu_lock); mutex_enter(&dtrace_lock); if (state->dts_activity != DTRACE_ACTIVITY_INACTIVE) { rval = EBUSY; goto out; } /* * Before we can perform any checks, we must prime all of the * retained enablings that correspond to this state. */ dtrace_enabling_prime(state); if (state->dts_destructive && !state->dts_cred.dcr_destructive) { rval = EACCES; goto out; } dtrace_state_prereserve(state); /* * Now we want to try to allocate our speculations. * We do not automatically resize the number of speculations; if * this fails, we will fail the operation. */ nspec = opt[DTRACEOPT_NSPEC]; ASSERT(nspec != DTRACEOPT_UNSET); if (nspec > INT_MAX) { rval = ENOMEM; goto out; } spec = kmem_zalloc(nspec * sizeof (dtrace_speculation_t), KM_NOSLEEP | KM_NORMALPRI); if (spec == NULL) { rval = ENOMEM; goto out; } state->dts_speculations = spec; state->dts_nspeculations = (int)nspec; for (i = 0; i < nspec; i++) { if ((buf = kmem_zalloc(bufsize, KM_NOSLEEP | KM_NORMALPRI)) == NULL) { rval = ENOMEM; goto err; } spec[i].dtsp_buffer = buf; } if (opt[DTRACEOPT_GRABANON] != DTRACEOPT_UNSET) { if (dtrace_anon.dta_state == NULL) { rval = ENOENT; goto out; } if (state->dts_necbs != 0) { rval = EALREADY; goto out; } state->dts_anon = dtrace_anon_grab(); ASSERT(state->dts_anon != NULL); state = state->dts_anon; /* * We want "grabanon" to be set in the grabbed state, so we'll * copy that option value from the grabbing state into the * grabbed state. */ state->dts_options[DTRACEOPT_GRABANON] = opt[DTRACEOPT_GRABANON]; *cpu = dtrace_anon.dta_beganon; /* * If the anonymous state is active (as it almost certainly * is if the anonymous enabling ultimately matched anything), * we don't allow any further option processing -- but we * don't return failure.
*/ if (state->dts_activity != DTRACE_ACTIVITY_INACTIVE) goto out; } if (opt[DTRACEOPT_AGGSIZE] != DTRACEOPT_UNSET && opt[DTRACEOPT_AGGSIZE] != 0) { if (state->dts_aggregations == NULL) { /* * We're not going to create an aggregation buffer * because we don't have any ECBs that contain * aggregations -- set this option to 0. */ opt[DTRACEOPT_AGGSIZE] = 0; } else { /* * If we have an aggregation buffer, we must also have * a buffer to use as scratch. */ if (opt[DTRACEOPT_BUFSIZE] == DTRACEOPT_UNSET || opt[DTRACEOPT_BUFSIZE] < state->dts_needed) { opt[DTRACEOPT_BUFSIZE] = state->dts_needed; } } } if (opt[DTRACEOPT_SPECSIZE] != DTRACEOPT_UNSET && opt[DTRACEOPT_SPECSIZE] != 0) { if (!state->dts_speculates) { /* * We're not going to create speculation buffers * because we don't have any ECBs that actually * speculate -- set the speculation size to 0. */ opt[DTRACEOPT_SPECSIZE] = 0; } } /* * The bare minimum size for any buffer that we're actually going to * do anything to is sizeof (uint64_t). */ sz = sizeof (uint64_t); if ((state->dts_needed != 0 && opt[DTRACEOPT_BUFSIZE] < sz) || (state->dts_speculates && opt[DTRACEOPT_SPECSIZE] < sz) || (state->dts_aggregations != NULL && opt[DTRACEOPT_AGGSIZE] < sz)) { /* * A buffer size has been explicitly set to 0 (or to a size * that will be adjusted to 0) and we need the space -- we * need to return failure. We return ENOSPC to differentiate * it from failing to allocate a buffer due to failure to meet * the reserve (for which we return E2BIG). */ rval = ENOSPC; goto out; } if ((rval = dtrace_state_buffers(state)) != 0) goto err; if ((sz = opt[DTRACEOPT_DYNVARSIZE]) == DTRACEOPT_UNSET) sz = dtrace_dstate_defsize; do { rval = dtrace_dstate_init(&state->dts_vstate.dtvs_dynvars, sz); if (rval == 0) break; if (opt[DTRACEOPT_BUFRESIZE] == DTRACEOPT_BUFRESIZE_MANUAL) goto err; } while (sz >>= 1); opt[DTRACEOPT_DYNVARSIZE] = sz; if (rval != 0) goto err; if (opt[DTRACEOPT_STATUSRATE] > dtrace_statusrate_max) opt[DTRACEOPT_STATUSRATE] = dtrace_statusrate_max; if (opt[DTRACEOPT_CLEANRATE] == 0) opt[DTRACEOPT_CLEANRATE] = dtrace_cleanrate_max; if (opt[DTRACEOPT_CLEANRATE] < dtrace_cleanrate_min) opt[DTRACEOPT_CLEANRATE] = dtrace_cleanrate_min; if (opt[DTRACEOPT_CLEANRATE] > dtrace_cleanrate_max) opt[DTRACEOPT_CLEANRATE] = dtrace_cleanrate_max; state->dts_alive = state->dts_laststatus = dtrace_gethrtime(); #ifdef illumos hdlr.cyh_func = (cyc_func_t)dtrace_state_clean; hdlr.cyh_arg = state; hdlr.cyh_level = CY_LOW_LEVEL; when.cyt_when = 0; when.cyt_interval = opt[DTRACEOPT_CLEANRATE]; state->dts_cleaner = cyclic_add(&hdlr, &when); hdlr.cyh_func = (cyc_func_t)dtrace_state_deadman; hdlr.cyh_arg = state; hdlr.cyh_level = CY_LOW_LEVEL; when.cyt_when = 0; when.cyt_interval = dtrace_deadman_interval; state->dts_deadman = cyclic_add(&hdlr, &when); #else callout_reset(&state->dts_cleaner, hz * opt[DTRACEOPT_CLEANRATE] / NANOSEC, dtrace_state_clean, state); callout_reset(&state->dts_deadman, hz * dtrace_deadman_interval / NANOSEC, dtrace_state_deadman, state); #endif state->dts_activity = DTRACE_ACTIVITY_WARMUP; #ifdef illumos if (state->dts_getf != 0 && !(state->dts_cred.dcr_visible & DTRACE_CRV_KERNEL)) { /* * We don't have kernel privs but we have at least one call * to getf(); we need to bump our zone's count, and (if * this is the first enabling to have an unprivileged call * to getf()) we need to hook into closef(). 
*/ state->dts_cred.dcr_cred->cr_zone->zone_dtrace_getf++; if (dtrace_getf++ == 0) { ASSERT(dtrace_closef == NULL); dtrace_closef = dtrace_getf_barrier; } } #endif /* * Now it's time to actually fire the BEGIN probe. We need to disable * interrupts here both to record the CPU on which we fired the BEGIN * probe (the data from this CPU will be processed first at user * level) and to manually activate the buffer for this CPU. */ cookie = dtrace_interrupt_disable(); *cpu = curcpu; ASSERT(state->dts_buffer[*cpu].dtb_flags & DTRACEBUF_INACTIVE); state->dts_buffer[*cpu].dtb_flags &= ~DTRACEBUF_INACTIVE; dtrace_probe(dtrace_probeid_begin, (uint64_t)(uintptr_t)state, 0, 0, 0, 0); dtrace_interrupt_enable(cookie); /* * We may have had an exit action from a BEGIN probe; only change our * state to ACTIVE if we're still in WARMUP. */ ASSERT(state->dts_activity == DTRACE_ACTIVITY_WARMUP || state->dts_activity == DTRACE_ACTIVITY_DRAINING); if (state->dts_activity == DTRACE_ACTIVITY_WARMUP) state->dts_activity = DTRACE_ACTIVITY_ACTIVE; #ifdef __FreeBSD__ /* * We enable anonymous tracing before APs are started, so we must * activate buffers using the current CPU. */ if (state == dtrace_anon.dta_state) for (int i = 0; i < NCPU; i++) dtrace_buffer_activate_cpu(state, i); else dtrace_xcall(DTRACE_CPUALL, (dtrace_xcall_t)dtrace_buffer_activate, state); #else /* * Regardless of whether we're now in ACTIVE or DRAINING, we * want each CPU to transition its principal buffer out of the * INACTIVE state. Doing this assures that no CPU will suddenly begin * processing an ECB halfway down a probe's ECB chain; all CPUs will * atomically transition from processing none of a state's ECBs to * processing all of them. */ dtrace_xcall(DTRACE_CPUALL, (dtrace_xcall_t)dtrace_buffer_activate, state); #endif goto out; err: dtrace_buffer_free(state->dts_buffer); dtrace_buffer_free(state->dts_aggbuffer); if ((nspec = state->dts_nspeculations) == 0) { ASSERT(state->dts_speculations == NULL); goto out; } spec = state->dts_speculations; ASSERT(spec != NULL); for (i = 0; i < state->dts_nspeculations; i++) { if ((buf = spec[i].dtsp_buffer) == NULL) break; dtrace_buffer_free(buf); kmem_free(buf, bufsize); } kmem_free(spec, nspec * sizeof (dtrace_speculation_t)); state->dts_nspeculations = 0; state->dts_speculations = NULL; out: mutex_exit(&dtrace_lock); mutex_exit(&cpu_lock); return (rval); } static int dtrace_state_stop(dtrace_state_t *state, processorid_t *cpu) { dtrace_icookie_t cookie; ASSERT(MUTEX_HELD(&dtrace_lock)); if (state->dts_activity != DTRACE_ACTIVITY_ACTIVE && state->dts_activity != DTRACE_ACTIVITY_DRAINING) return (EINVAL); /* * We'll set the activity to DTRACE_ACTIVITY_DRAINING, and issue a sync * to be sure that every CPU has seen it. See below for the details * on why this is done. */ state->dts_activity = DTRACE_ACTIVITY_DRAINING; dtrace_sync(); /* * By this point, it is impossible for any CPU to be still processing * with DTRACE_ACTIVITY_ACTIVE. We can thus set our activity to * DTRACE_ACTIVITY_COOLDOWN and know that we're not racing with any * other CPU in dtrace_buffer_reserve(). This allows dtrace_probe() * and callees to know that the activity is DTRACE_ACTIVITY_COOLDOWN * iff we're in the END probe. */ state->dts_activity = DTRACE_ACTIVITY_COOLDOWN; dtrace_sync(); ASSERT(state->dts_activity == DTRACE_ACTIVITY_COOLDOWN); /* * Finally, we can release the reserve and call the END probe.
We * disable interrupts across calling the END probe to allow us to * return the CPU on which we actually called the END probe. This * allows user-land to be sure that this CPU's principal buffer is * processed last. */ state->dts_reserve = 0; cookie = dtrace_interrupt_disable(); *cpu = curcpu; dtrace_probe(dtrace_probeid_end, (uint64_t)(uintptr_t)state, 0, 0, 0, 0); dtrace_interrupt_enable(cookie); state->dts_activity = DTRACE_ACTIVITY_STOPPED; dtrace_sync(); #ifdef illumos if (state->dts_getf != 0 && !(state->dts_cred.dcr_visible & DTRACE_CRV_KERNEL)) { /* * We don't have kernel privs but we have at least one call * to getf(); we need to lower our zone's count, and (if * this is the last enabling to have an unprivileged call * to getf()) we need to clear the closef() hook. */ ASSERT(state->dts_cred.dcr_cred->cr_zone->zone_dtrace_getf > 0); ASSERT(dtrace_closef == dtrace_getf_barrier); ASSERT(dtrace_getf > 0); state->dts_cred.dcr_cred->cr_zone->zone_dtrace_getf--; if (--dtrace_getf == 0) dtrace_closef = NULL; } #endif return (0); } static int dtrace_state_option(dtrace_state_t *state, dtrace_optid_t option, dtrace_optval_t val) { ASSERT(MUTEX_HELD(&dtrace_lock)); if (state->dts_activity != DTRACE_ACTIVITY_INACTIVE) return (EBUSY); if (option >= DTRACEOPT_MAX) return (EINVAL); if (option != DTRACEOPT_CPU && val < 0) return (EINVAL); switch (option) { case DTRACEOPT_DESTRUCTIVE: if (dtrace_destructive_disallow) return (EACCES); state->dts_cred.dcr_destructive = 1; break; case DTRACEOPT_BUFSIZE: case DTRACEOPT_DYNVARSIZE: case DTRACEOPT_AGGSIZE: case DTRACEOPT_SPECSIZE: case DTRACEOPT_STRSIZE: if (val < 0) return (EINVAL); if (val >= LONG_MAX) { /* * If this is an otherwise negative value, set it to * the highest multiple of 128m less than LONG_MAX. * Technically, we're adjusting the size without * regard to the buffer resizing policy, but in fact, * this has no effect -- if we set the buffer size to * ~LONG_MAX and the buffer policy is ultimately set to * be "manual", the buffer allocation is guaranteed to * fail, if only because the allocation requires two * buffers. (We set the size to the highest * multiple of 128m because it ensures that the size * will remain a multiple of a megabyte when * repeatedly halved -- all the way down to 15m.) */ val = LONG_MAX - (1 << 27) + 1; } } state->dts_options[option] = val; return (0); } static void dtrace_state_destroy(dtrace_state_t *state) { dtrace_ecb_t *ecb; dtrace_vstate_t *vstate = &state->dts_vstate; #ifdef illumos minor_t minor = getminor(state->dts_dev); #endif int i, bufsize = NCPU * sizeof (dtrace_buffer_t); dtrace_speculation_t *spec = state->dts_speculations; int nspec = state->dts_nspeculations; uint32_t match; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(MUTEX_HELD(&cpu_lock)); /* * First, retract any retained enablings for this state. */ dtrace_enabling_retract(state); ASSERT(state->dts_nretained == 0); if (state->dts_activity == DTRACE_ACTIVITY_ACTIVE || state->dts_activity == DTRACE_ACTIVITY_DRAINING) { /* * We have managed to come into dtrace_state_destroy() on a * hot enabling -- almost certainly because of a disorderly * shutdown of a consumer. (That is, a consumer that is * exiting without having called dtrace_stop().) In this case, * we're going to set our activity to be KILLED, and then * issue a sync to be sure that everyone is out of probe * context before we start blowing away ECBs. */ state->dts_activity = DTRACE_ACTIVITY_KILLED; dtrace_sync(); } /* * Release the credential hold we took in dtrace_state_create().
*/ if (state->dts_cred.dcr_cred != NULL) crfree(state->dts_cred.dcr_cred); /* * Now we can safely disable and destroy any enabled probes. Because * any DTRACE_PRIV_KERNEL probes may actually be slowing our progress * (especially if they're all enabled), we take two passes through the * ECBs: in the first, we disable just DTRACE_PRIV_KERNEL probes, and * in the second we disable whatever is left over. */ for (match = DTRACE_PRIV_KERNEL; ; match = 0) { for (i = 0; i < state->dts_necbs; i++) { if ((ecb = state->dts_ecbs[i]) == NULL) continue; if (match && ecb->dte_probe != NULL) { dtrace_probe_t *probe = ecb->dte_probe; dtrace_provider_t *prov = probe->dtpr_provider; if (!(prov->dtpv_priv.dtpp_flags & match)) continue; } dtrace_ecb_disable(ecb); dtrace_ecb_destroy(ecb); } if (!match) break; } /* * Before we free the buffers, perform one more sync to assure that * every CPU is out of probe context. */ dtrace_sync(); dtrace_buffer_free(state->dts_buffer); dtrace_buffer_free(state->dts_aggbuffer); for (i = 0; i < nspec; i++) dtrace_buffer_free(spec[i].dtsp_buffer); #ifdef illumos if (state->dts_cleaner != CYCLIC_NONE) cyclic_remove(state->dts_cleaner); if (state->dts_deadman != CYCLIC_NONE) cyclic_remove(state->dts_deadman); #else callout_stop(&state->dts_cleaner); callout_drain(&state->dts_cleaner); callout_stop(&state->dts_deadman); callout_drain(&state->dts_deadman); #endif dtrace_dstate_fini(&vstate->dtvs_dynvars); dtrace_vstate_fini(vstate); if (state->dts_ecbs != NULL) kmem_free(state->dts_ecbs, state->dts_necbs * sizeof (dtrace_ecb_t *)); if (state->dts_aggregations != NULL) { #ifdef DEBUG for (i = 0; i < state->dts_naggregations; i++) ASSERT(state->dts_aggregations[i] == NULL); #endif ASSERT(state->dts_naggregations > 0); kmem_free(state->dts_aggregations, state->dts_naggregations * sizeof (dtrace_aggregation_t *)); } kmem_free(state->dts_buffer, bufsize); kmem_free(state->dts_aggbuffer, bufsize); for (i = 0; i < nspec; i++) kmem_free(spec[i].dtsp_buffer, bufsize); if (spec != NULL) kmem_free(spec, nspec * sizeof (dtrace_speculation_t)); dtrace_format_destroy(state); if (state->dts_aggid_arena != NULL) { #ifdef illumos vmem_destroy(state->dts_aggid_arena); #else delete_unrhdr(state->dts_aggid_arena); #endif state->dts_aggid_arena = NULL; } #ifdef illumos ddi_soft_state_free(dtrace_softstate, minor); vmem_free(dtrace_minor, (void *)(uintptr_t)minor, 1); #endif } /* * DTrace Anonymous Enabling Functions */ static dtrace_state_t * dtrace_anon_grab(void) { dtrace_state_t *state; ASSERT(MUTEX_HELD(&dtrace_lock)); if ((state = dtrace_anon.dta_state) == NULL) { ASSERT(dtrace_anon.dta_enabling == NULL); return (NULL); } ASSERT(dtrace_anon.dta_enabling != NULL); ASSERT(dtrace_retained != NULL); dtrace_enabling_destroy(dtrace_anon.dta_enabling); dtrace_anon.dta_enabling = NULL; dtrace_anon.dta_state = NULL; return (state); } static void dtrace_anon_property(void) { int i, rv; dtrace_state_t *state; dof_hdr_t *dof; char c[32]; /* enough for "dof-data-" + digits */ ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(MUTEX_HELD(&cpu_lock)); for (i = 0; ; i++) { (void) snprintf(c, sizeof (c), "dof-data-%d", i); dtrace_err_verbose = 1; if ((dof = dtrace_dof_property(c)) == NULL) { dtrace_err_verbose = 0; break; } #ifdef illumos /* * We want to create anonymous state, so we need to transition * the kernel debugger to indicate that DTrace is active. If * this fails (e.g. because the debugger has modified text in * some way), we won't continue with the processing. 
*/ if (kdi_dtrace_set(KDI_DTSET_DTRACE_ACTIVATE) != 0) { cmn_err(CE_NOTE, "kernel debugger active; anonymous " "enabling ignored."); dtrace_dof_destroy(dof); break; } #endif /* * If we haven't allocated an anonymous state, we'll do so now. */ if ((state = dtrace_anon.dta_state) == NULL) { state = dtrace_state_create(NULL, NULL); dtrace_anon.dta_state = state; if (state == NULL) { /* * This basically shouldn't happen: the only * failure mode from dtrace_state_create() is a * failure of ddi_soft_state_zalloc() that * itself should never happen. Still, the * interface allows for a failure mode, and * we want to fail as gracefully as possible: * we'll emit an error message and cease * processing anonymous state in this case. */ cmn_err(CE_WARN, "failed to create " "anonymous state"); dtrace_dof_destroy(dof); break; } } rv = dtrace_dof_slurp(dof, &state->dts_vstate, CRED(), &dtrace_anon.dta_enabling, 0, 0, B_TRUE); if (rv == 0) rv = dtrace_dof_options(dof, state); dtrace_err_verbose = 0; dtrace_dof_destroy(dof); if (rv != 0) { /* * This is malformed DOF; chuck any anonymous state * that we created. */ ASSERT(dtrace_anon.dta_enabling == NULL); dtrace_state_destroy(state); dtrace_anon.dta_state = NULL; break; } ASSERT(dtrace_anon.dta_enabling != NULL); } if (dtrace_anon.dta_enabling != NULL) { int rval; /* * dtrace_enabling_retain() can only fail because we are * trying to retain more enablings than are allowed -- but * we only have one anonymous enabling, and we are guaranteed * to be allowed at least one retained enabling; we assert * that dtrace_enabling_retain() returns success. */ rval = dtrace_enabling_retain(dtrace_anon.dta_enabling); ASSERT(rval == 0); dtrace_enabling_dump(dtrace_anon.dta_enabling); } } /* * DTrace Helper Functions */ static void dtrace_helper_trace(dtrace_helper_action_t *helper, dtrace_mstate_t *mstate, dtrace_vstate_t *vstate, int where) { uint32_t size, next, nnext, i; dtrace_helptrace_t *ent, *buffer; uint16_t flags = cpu_core[curcpu].cpuc_dtrace_flags; if ((buffer = dtrace_helptrace_buffer) == NULL) return; ASSERT(vstate->dtvs_nlocals <= dtrace_helptrace_nlocals); /* * What would a tracing framework be without its own tracing * framework? (Well, a hell of a lot simpler, for starters...) */ size = sizeof (dtrace_helptrace_t) + dtrace_helptrace_nlocals * sizeof (uint64_t) - sizeof (uint64_t); /* * Iterate until we can allocate a slot in the trace buffer. */ do { next = dtrace_helptrace_next; if (next + size < dtrace_helptrace_bufsize) { nnext = next + size; } else { nnext = size; } } while (dtrace_cas32(&dtrace_helptrace_next, next, nnext) != next); /* * We have our slot; fill it in. */ if (nnext == size) { dtrace_helptrace_wrapped++; next = 0; } ent = (dtrace_helptrace_t *)((uintptr_t)buffer + next); ent->dtht_helper = helper; ent->dtht_where = where; ent->dtht_nlocals = vstate->dtvs_nlocals; ent->dtht_fltoffs = (mstate->dtms_present & DTRACE_MSTATE_FLTOFFS) ? 
mstate->dtms_fltoffs : -1; ent->dtht_fault = DTRACE_FLAGS2FLT(flags); ent->dtht_illval = cpu_core[curcpu].cpuc_dtrace_illval; for (i = 0; i < vstate->dtvs_nlocals; i++) { dtrace_statvar_t *svar; if ((svar = vstate->dtvs_locals[i]) == NULL) continue; ASSERT(svar->dtsv_size >= NCPU * sizeof (uint64_t)); ent->dtht_locals[i] = ((uint64_t *)(uintptr_t)svar->dtsv_data)[curcpu]; } } static uint64_t dtrace_helper(int which, dtrace_mstate_t *mstate, dtrace_state_t *state, uint64_t arg0, uint64_t arg1) { uint16_t *flags = &cpu_core[curcpu].cpuc_dtrace_flags; uint64_t sarg0 = mstate->dtms_arg[0]; uint64_t sarg1 = mstate->dtms_arg[1]; uint64_t rval = 0; dtrace_helpers_t *helpers = curproc->p_dtrace_helpers; dtrace_helper_action_t *helper; dtrace_vstate_t *vstate; dtrace_difo_t *pred; int i, trace = dtrace_helptrace_buffer != NULL; ASSERT(which >= 0 && which < DTRACE_NHELPER_ACTIONS); if (helpers == NULL) return (0); if ((helper = helpers->dthps_actions[which]) == NULL) return (0); vstate = &helpers->dthps_vstate; mstate->dtms_arg[0] = arg0; mstate->dtms_arg[1] = arg1; /* * Now iterate over each helper. If its predicate evaluates to 'true', * we'll call the corresponding actions. Note that the below calls * to dtrace_dif_emulate() may set faults in machine state. This is * okay: our caller (the outer dtrace_dif_emulate()) will simply plow * the stored DIF offset with its own (which is the desired behavior). * Also, note the calls to dtrace_dif_emulate() may allocate scratch * from machine state; this is okay, too. */ for (; helper != NULL; helper = helper->dtha_next) { if ((pred = helper->dtha_predicate) != NULL) { if (trace) dtrace_helper_trace(helper, mstate, vstate, 0); if (!dtrace_dif_emulate(pred, mstate, vstate, state)) goto next; if (*flags & CPU_DTRACE_FAULT) goto err; } for (i = 0; i < helper->dtha_nactions; i++) { if (trace) dtrace_helper_trace(helper, mstate, vstate, i + 1); rval = dtrace_dif_emulate(helper->dtha_actions[i], mstate, vstate, state); if (*flags & CPU_DTRACE_FAULT) goto err; } next: if (trace) dtrace_helper_trace(helper, mstate, vstate, DTRACE_HELPTRACE_NEXT); } if (trace) dtrace_helper_trace(helper, mstate, vstate, DTRACE_HELPTRACE_DONE); /* * Restore the arg0 that we saved upon entry. */ mstate->dtms_arg[0] = sarg0; mstate->dtms_arg[1] = sarg1; return (rval); err: if (trace) dtrace_helper_trace(helper, mstate, vstate, DTRACE_HELPTRACE_ERR); /* * Restore the arg0 that we saved upon entry. 
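 * (arg1 is restored alongside arg0; both were saved at the top of
 * dtrace_helper() so that nested DIF evaluations observe consistent
 * arguments.)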
*/ mstate->dtms_arg[0] = sarg0; mstate->dtms_arg[1] = sarg1; return (0); } static void dtrace_helper_action_destroy(dtrace_helper_action_t *helper, dtrace_vstate_t *vstate) { int i; if (helper->dtha_predicate != NULL) dtrace_difo_release(helper->dtha_predicate, vstate); for (i = 0; i < helper->dtha_nactions; i++) { ASSERT(helper->dtha_actions[i] != NULL); dtrace_difo_release(helper->dtha_actions[i], vstate); } kmem_free(helper->dtha_actions, helper->dtha_nactions * sizeof (dtrace_difo_t *)); kmem_free(helper, sizeof (dtrace_helper_action_t)); } static int dtrace_helper_destroygen(dtrace_helpers_t *help, int gen) { proc_t *p = curproc; dtrace_vstate_t *vstate; int i; if (help == NULL) help = p->p_dtrace_helpers; ASSERT(MUTEX_HELD(&dtrace_lock)); if (help == NULL || gen > help->dthps_generation) return (EINVAL); vstate = &help->dthps_vstate; for (i = 0; i < DTRACE_NHELPER_ACTIONS; i++) { dtrace_helper_action_t *last = NULL, *h, *next; for (h = help->dthps_actions[i]; h != NULL; h = next) { next = h->dtha_next; if (h->dtha_generation == gen) { if (last != NULL) { last->dtha_next = next; } else { help->dthps_actions[i] = next; } dtrace_helper_action_destroy(h, vstate); } else { last = h; } } } /* * Iterate until we've cleared out all helper providers with the * given generation number. */ for (;;) { dtrace_helper_provider_t *prov; /* * Look for a helper provider with the right generation. We * have to start back at the beginning of the list each time * because we drop dtrace_lock. It's unlikely that we'll make * more than two passes. */ for (i = 0; i < help->dthps_nprovs; i++) { prov = help->dthps_provs[i]; if (prov->dthp_generation == gen) break; } /* * If there were no matches, we're done. */ if (i == help->dthps_nprovs) break; /* * Move the last helper provider into this slot. */ help->dthps_nprovs--; help->dthps_provs[i] = help->dthps_provs[help->dthps_nprovs]; help->dthps_provs[help->dthps_nprovs] = NULL; mutex_exit(&dtrace_lock); /* * If we have a meta provider, remove this helper provider. */ mutex_enter(&dtrace_meta_lock); if (dtrace_meta_pid != NULL) { ASSERT(dtrace_deferred_pid == NULL); dtrace_helper_provider_remove(&prov->dthp_prov, p->p_pid); } mutex_exit(&dtrace_meta_lock); dtrace_helper_provider_destroy(prov); mutex_enter(&dtrace_lock); } return (0); } static int dtrace_helper_validate(dtrace_helper_action_t *helper) { int err = 0, i; dtrace_difo_t *dp; if ((dp = helper->dtha_predicate) != NULL) err += dtrace_difo_validate_helper(dp); for (i = 0; i < helper->dtha_nactions; i++) err += dtrace_difo_validate_helper(helper->dtha_actions[i]); return (err == 0); } static int dtrace_helper_action_add(int which, dtrace_ecbdesc_t *ep, dtrace_helpers_t *help) { dtrace_helper_action_t *helper, *last; dtrace_actdesc_t *act; dtrace_vstate_t *vstate; dtrace_predicate_t *pred; int count = 0, nactions = 0, i; if (which < 0 || which >= DTRACE_NHELPER_ACTIONS) return (EINVAL); last = help->dthps_actions[which]; vstate = &help->dthps_vstate; for (count = 0; last != NULL; last = last->dtha_next) { count++; if (last->dtha_next == NULL) break; } /* * If we already have dtrace_helper_actions_max helper actions for this * helper action type, we'll refuse to add a new one.
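 * (ENOSPC, rather than EINVAL, is returned here so that callers can
 * tell resource exhaustion apart from a malformed helper.)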
*/ if (count >= dtrace_helper_actions_max) return (ENOSPC); helper = kmem_zalloc(sizeof (dtrace_helper_action_t), KM_SLEEP); helper->dtha_generation = help->dthps_generation; if ((pred = ep->dted_pred.dtpdd_predicate) != NULL) { ASSERT(pred->dtp_difo != NULL); dtrace_difo_hold(pred->dtp_difo); helper->dtha_predicate = pred->dtp_difo; } for (act = ep->dted_action; act != NULL; act = act->dtad_next) { if (act->dtad_kind != DTRACEACT_DIFEXPR) goto err; if (act->dtad_difo == NULL) goto err; nactions++; } helper->dtha_actions = kmem_zalloc(sizeof (dtrace_difo_t *) * (helper->dtha_nactions = nactions), KM_SLEEP); for (act = ep->dted_action, i = 0; act != NULL; act = act->dtad_next) { dtrace_difo_hold(act->dtad_difo); helper->dtha_actions[i++] = act->dtad_difo; } if (!dtrace_helper_validate(helper)) goto err; if (last == NULL) { help->dthps_actions[which] = helper; } else { last->dtha_next = helper; } if (vstate->dtvs_nlocals > dtrace_helptrace_nlocals) { dtrace_helptrace_nlocals = vstate->dtvs_nlocals; dtrace_helptrace_next = 0; } return (0); err: dtrace_helper_action_destroy(helper, vstate); return (EINVAL); } static void dtrace_helper_provider_register(proc_t *p, dtrace_helpers_t *help, dof_helper_t *dofhp) { ASSERT(MUTEX_NOT_HELD(&dtrace_lock)); mutex_enter(&dtrace_meta_lock); mutex_enter(&dtrace_lock); if (!dtrace_attached() || dtrace_meta_pid == NULL) { /* * If the dtrace module is loaded but not attached, or if * there isn't a meta provider registered to deal with * these provider descriptions, we need to postpone creating * the actual providers until later. */ if (help->dthps_next == NULL && help->dthps_prev == NULL && dtrace_deferred_pid != help) { help->dthps_deferred = 1; help->dthps_pid = p->p_pid; help->dthps_next = dtrace_deferred_pid; help->dthps_prev = NULL; if (dtrace_deferred_pid != NULL) dtrace_deferred_pid->dthps_prev = help; dtrace_deferred_pid = help; } mutex_exit(&dtrace_lock); } else if (dofhp != NULL) { /* * If the dtrace module is loaded and we have a particular * helper provider description, pass that off to the * meta provider. */ mutex_exit(&dtrace_lock); dtrace_helper_provide(dofhp, p->p_pid); } else { /* * Otherwise, just pass all the helper provider descriptions * off to the meta provider. */ int i; mutex_exit(&dtrace_lock); for (i = 0; i < help->dthps_nprovs; i++) { dtrace_helper_provide(&help->dthps_provs[i]->dthp_prov, p->p_pid); } } mutex_exit(&dtrace_meta_lock); } static int dtrace_helper_provider_add(dof_helper_t *dofhp, dtrace_helpers_t *help, int gen) { dtrace_helper_provider_t *hprov, **tmp_provs; uint_t tmp_maxprovs, i; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(help != NULL); /* * If we already have dtrace_helper_providers_max helper providers, * we'll refuse to add a new one. */ if (help->dthps_nprovs >= dtrace_helper_providers_max) return (ENOSPC); /* * Check to make sure this isn't a duplicate. */ for (i = 0; i < help->dthps_nprovs; i++) { if (dofhp->dofhp_addr == help->dthps_provs[i]->dthp_prov.dofhp_addr) return (EALREADY); } hprov = kmem_zalloc(sizeof (dtrace_helper_provider_t), KM_SLEEP); hprov->dthp_prov = *dofhp; hprov->dthp_ref = 1; hprov->dthp_generation = gen; /* * Allocate a bigger table for helper providers if it's already full.
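 * The table grows geometrically -- 0, 2, 4, 8, ... -- capped at
 * dtrace_helper_providers_max, so repeated additions cost amortized
 * O(1) copies per provider.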
*/ if (help->dthps_maxprovs == help->dthps_nprovs) { tmp_maxprovs = help->dthps_maxprovs; tmp_provs = help->dthps_provs; if (help->dthps_maxprovs == 0) help->dthps_maxprovs = 2; else help->dthps_maxprovs *= 2; if (help->dthps_maxprovs > dtrace_helper_providers_max) help->dthps_maxprovs = dtrace_helper_providers_max; ASSERT(tmp_maxprovs < help->dthps_maxprovs); help->dthps_provs = kmem_zalloc(help->dthps_maxprovs * sizeof (dtrace_helper_provider_t *), KM_SLEEP); if (tmp_provs != NULL) { bcopy(tmp_provs, help->dthps_provs, tmp_maxprovs * sizeof (dtrace_helper_provider_t *)); kmem_free(tmp_provs, tmp_maxprovs * sizeof (dtrace_helper_provider_t *)); } } help->dthps_provs[help->dthps_nprovs] = hprov; help->dthps_nprovs++; return (0); } static void dtrace_helper_provider_destroy(dtrace_helper_provider_t *hprov) { mutex_enter(&dtrace_lock); if (--hprov->dthp_ref == 0) { dof_hdr_t *dof; mutex_exit(&dtrace_lock); dof = (dof_hdr_t *)(uintptr_t)hprov->dthp_prov.dofhp_dof; dtrace_dof_destroy(dof); kmem_free(hprov, sizeof (dtrace_helper_provider_t)); } else { mutex_exit(&dtrace_lock); } } static int dtrace_helper_provider_validate(dof_hdr_t *dof, dof_sec_t *sec) { uintptr_t daddr = (uintptr_t)dof; dof_sec_t *str_sec, *prb_sec, *arg_sec, *off_sec, *enoff_sec; dof_provider_t *provider; dof_probe_t *probe; uint8_t *arg; char *strtab, *typestr; dof_stridx_t typeidx; size_t typesz; uint_t nprobes, j, k; ASSERT(sec->dofs_type == DOF_SECT_PROVIDER); if (sec->dofs_offset & (sizeof (uint_t) - 1)) { dtrace_dof_error(dof, "misaligned section offset"); return (-1); } /* * The section needs to be large enough to contain the DOF provider * structure appropriate for the given version. */ if (sec->dofs_size < ((dof->dofh_ident[DOF_ID_VERSION] == DOF_VERSION_1) ? offsetof(dof_provider_t, dofpv_prenoffs) : sizeof (dof_provider_t))) { dtrace_dof_error(dof, "provider section too small"); return (-1); } provider = (dof_provider_t *)(uintptr_t)(daddr + sec->dofs_offset); str_sec = dtrace_dof_sect(dof, DOF_SECT_STRTAB, provider->dofpv_strtab); prb_sec = dtrace_dof_sect(dof, DOF_SECT_PROBES, provider->dofpv_probes); arg_sec = dtrace_dof_sect(dof, DOF_SECT_PRARGS, provider->dofpv_prargs); off_sec = dtrace_dof_sect(dof, DOF_SECT_PROFFS, provider->dofpv_proffs); if (str_sec == NULL || prb_sec == NULL || arg_sec == NULL || off_sec == NULL) return (-1); enoff_sec = NULL; if (dof->dofh_ident[DOF_ID_VERSION] != DOF_VERSION_1 && provider->dofpv_prenoffs != DOF_SECT_NONE && (enoff_sec = dtrace_dof_sect(dof, DOF_SECT_PRENOFFS, provider->dofpv_prenoffs)) == NULL) return (-1); strtab = (char *)(uintptr_t)(daddr + str_sec->dofs_offset); if (provider->dofpv_name >= str_sec->dofs_size || strlen(strtab + provider->dofpv_name) >= DTRACE_PROVNAMELEN) { dtrace_dof_error(dof, "invalid provider name"); return (-1); } if (prb_sec->dofs_entsize == 0 || prb_sec->dofs_entsize > prb_sec->dofs_size) { dtrace_dof_error(dof, "invalid entry size"); return (-1); } if (prb_sec->dofs_entsize & (sizeof (uintptr_t) - 1)) { dtrace_dof_error(dof, "misaligned entry size"); return (-1); } if (off_sec->dofs_entsize != sizeof (uint32_t)) { dtrace_dof_error(dof, "invalid entry size"); return (-1); } if (off_sec->dofs_offset & (sizeof (uint32_t) - 1)) { dtrace_dof_error(dof, "misaligned section offset"); return (-1); } if (arg_sec->dofs_entsize != sizeof (uint8_t)) { dtrace_dof_error(dof, "invalid entry size"); return (-1); } arg = (uint8_t *)(uintptr_t)(daddr + arg_sec->dofs_offset); nprobes = prb_sec->dofs_size / prb_sec->dofs_entsize; /* * Take a pass through the 
probes to check for errors. */ for (j = 0; j < nprobes; j++) { probe = (dof_probe_t *)(uintptr_t)(daddr + prb_sec->dofs_offset + j * prb_sec->dofs_entsize); if (probe->dofpr_func >= str_sec->dofs_size) { dtrace_dof_error(dof, "invalid function name"); return (-1); } if (strlen(strtab + probe->dofpr_func) >= DTRACE_FUNCNAMELEN) { dtrace_dof_error(dof, "function name too long"); /* * Keep going if the function name is too long. * Unlike provider and probe names, we cannot reasonably * impose restrictions on function names, since they're * a property of the code being instrumented. We will * skip this probe in dtrace_helper_provide_one(). */ } if (probe->dofpr_name >= str_sec->dofs_size || strlen(strtab + probe->dofpr_name) >= DTRACE_NAMELEN) { dtrace_dof_error(dof, "invalid probe name"); return (-1); } /* * The offset count must not wrap the index, and the offsets * must also not overflow the section's data. */ if (probe->dofpr_offidx + probe->dofpr_noffs < probe->dofpr_offidx || (probe->dofpr_offidx + probe->dofpr_noffs) * off_sec->dofs_entsize > off_sec->dofs_size) { dtrace_dof_error(dof, "invalid probe offset"); return (-1); } if (dof->dofh_ident[DOF_ID_VERSION] != DOF_VERSION_1) { /* * If there's no is-enabled offset section, make sure * there aren't any is-enabled offsets. Otherwise * perform the same checks as for probe offsets * (immediately above). */ if (enoff_sec == NULL) { if (probe->dofpr_enoffidx != 0 || probe->dofpr_nenoffs != 0) { dtrace_dof_error(dof, "is-enabled " "offsets with null section"); return (-1); } } else if (probe->dofpr_enoffidx + probe->dofpr_nenoffs < probe->dofpr_enoffidx || (probe->dofpr_enoffidx + probe->dofpr_nenoffs) * enoff_sec->dofs_entsize > enoff_sec->dofs_size) { dtrace_dof_error(dof, "invalid is-enabled " "offset"); return (-1); } if (probe->dofpr_noffs + probe->dofpr_nenoffs == 0) { dtrace_dof_error(dof, "zero probe and " "is-enabled offsets"); return (-1); } } else if (probe->dofpr_noffs == 0) { dtrace_dof_error(dof, "zero probe offsets"); return (-1); } if (probe->dofpr_argidx + probe->dofpr_xargc < probe->dofpr_argidx || (probe->dofpr_argidx + probe->dofpr_xargc) * arg_sec->dofs_entsize > arg_sec->dofs_size) { dtrace_dof_error(dof, "invalid args"); return (-1); } typeidx = probe->dofpr_nargv; typestr = strtab + probe->dofpr_nargv; for (k = 0; k < probe->dofpr_nargc; k++) { if (typeidx >= str_sec->dofs_size) { dtrace_dof_error(dof, "bad " "native argument type"); return (-1); } typesz = strlen(typestr) + 1; if (typesz > DTRACE_ARGTYPELEN) { dtrace_dof_error(dof, "native " "argument type too long"); return (-1); } typeidx += typesz; typestr += typesz; } typeidx = probe->dofpr_xargv; typestr = strtab + probe->dofpr_xargv; for (k = 0; k < probe->dofpr_xargc; k++) { if (arg[probe->dofpr_argidx + k] > probe->dofpr_nargc) { dtrace_dof_error(dof, "bad " "native argument index"); return (-1); } if (typeidx >= str_sec->dofs_size) { dtrace_dof_error(dof, "bad " "translated argument type"); return (-1); } typesz = strlen(typestr) + 1; if (typesz > DTRACE_ARGTYPELEN) { dtrace_dof_error(dof, "translated argument " "type too long"); return (-1); } typeidx += typesz; typestr += typesz; } } return (0); } static int dtrace_helper_slurp(dof_hdr_t *dof, dof_helper_t *dhp, struct proc *p) { dtrace_helpers_t *help; dtrace_vstate_t *vstate; dtrace_enabling_t *enab = NULL; int i, gen, rv, nhelpers = 0, nprovs = 0, destroy = 1; uintptr_t daddr = (uintptr_t)dof; ASSERT(MUTEX_HELD(&dtrace_lock)); if ((help = p->p_dtrace_helpers) == NULL) help = dtrace_helpers_create(p); 
vstate = &help->dthps_vstate; if ((rv = dtrace_dof_slurp(dof, vstate, NULL, &enab, dhp->dofhp_addr, dhp->dofhp_dof, B_FALSE)) != 0) { dtrace_dof_destroy(dof); return (rv); } /* * Look for helper providers and validate their descriptions. */ for (i = 0; i < dof->dofh_secnum; i++) { dof_sec_t *sec = (dof_sec_t *)(uintptr_t)(daddr + dof->dofh_secoff + i * dof->dofh_secsize); if (sec->dofs_type != DOF_SECT_PROVIDER) continue; if (dtrace_helper_provider_validate(dof, sec) != 0) { dtrace_enabling_destroy(enab); dtrace_dof_destroy(dof); return (-1); } nprovs++; } /* * Now we need to walk through the ECB descriptions in the enabling. */ for (i = 0; i < enab->dten_ndesc; i++) { dtrace_ecbdesc_t *ep = enab->dten_desc[i]; dtrace_probedesc_t *desc = &ep->dted_probe; if (strcmp(desc->dtpd_provider, "dtrace") != 0) continue; if (strcmp(desc->dtpd_mod, "helper") != 0) continue; if (strcmp(desc->dtpd_func, "ustack") != 0) continue; if ((rv = dtrace_helper_action_add(DTRACE_HELPER_ACTION_USTACK, ep, help)) != 0) { /* * Adding this helper action failed -- we are now going * to rip out the entire generation and return failure. */ (void) dtrace_helper_destroygen(help, help->dthps_generation); dtrace_enabling_destroy(enab); dtrace_dof_destroy(dof); return (-1); } nhelpers++; } if (nhelpers < enab->dten_ndesc) dtrace_dof_error(dof, "unmatched helpers"); gen = help->dthps_generation++; dtrace_enabling_destroy(enab); if (nprovs > 0) { /* * Now that this is in-kernel, we change the sense of the * members: dofhp_dof denotes the in-kernel copy of the DOF * and dofhp_addr denotes the address at user-level. */ dhp->dofhp_addr = dhp->dofhp_dof; dhp->dofhp_dof = (uint64_t)(uintptr_t)dof; if (dtrace_helper_provider_add(dhp, help, gen) == 0) { mutex_exit(&dtrace_lock); dtrace_helper_provider_register(p, help, dhp); mutex_enter(&dtrace_lock); destroy = 0; } } if (destroy) dtrace_dof_destroy(dof); return (gen); } static dtrace_helpers_t * dtrace_helpers_create(proc_t *p) { dtrace_helpers_t *help; ASSERT(MUTEX_HELD(&dtrace_lock)); ASSERT(p->p_dtrace_helpers == NULL); help = kmem_zalloc(sizeof (dtrace_helpers_t), KM_SLEEP); help->dthps_actions = kmem_zalloc(sizeof (dtrace_helper_action_t *) * DTRACE_NHELPER_ACTIONS, KM_SLEEP); p->p_dtrace_helpers = help; dtrace_helpers++; return (help); } #ifdef illumos static #endif void dtrace_helpers_destroy(proc_t *p) { dtrace_helpers_t *help; dtrace_vstate_t *vstate; #ifdef illumos proc_t *p = curproc; #endif int i; mutex_enter(&dtrace_lock); ASSERT(p->p_dtrace_helpers != NULL); ASSERT(dtrace_helpers > 0); help = p->p_dtrace_helpers; vstate = &help->dthps_vstate; /* * We're now going to lose the help from this process. */ p->p_dtrace_helpers = NULL; dtrace_sync(); /* * Destroy the helper actions. */ for (i = 0; i < DTRACE_NHELPER_ACTIONS; i++) { dtrace_helper_action_t *h, *next; for (h = help->dthps_actions[i]; h != NULL; h = next) { next = h->dtha_next; dtrace_helper_action_destroy(h, vstate); h = next; } } mutex_exit(&dtrace_lock); /* * Destroy the helper providers. */ if (help->dthps_maxprovs > 0) { mutex_enter(&dtrace_meta_lock); if (dtrace_meta_pid != NULL) { ASSERT(dtrace_deferred_pid == NULL); for (i = 0; i < help->dthps_nprovs; i++) { dtrace_helper_provider_remove( &help->dthps_provs[i]->dthp_prov, p->p_pid); } } else { mutex_enter(&dtrace_lock); ASSERT(help->dthps_deferred == 0 || help->dthps_next != NULL || help->dthps_prev != NULL || help == dtrace_deferred_pid); /* * Remove the helper from the deferred list.
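 * The deferred list is a doubly-linked list headed by
 * dtrace_deferred_pid, so removal is the standard unlink: splice the
 * neighbors together and advance the head if we were first.  As a
 * sketch of the pattern (generic names, not this file's):
 *
 *	if (node->next != NULL)
 *		node->next->prev = node->prev;
 *	if (node->prev != NULL)
 *		node->prev->next = node->next;
 *	if (head == node)
 *		head = node->next;
 *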
*/ if (help->dthps_next != NULL) help->dthps_next->dthps_prev = help->dthps_prev; if (help->dthps_prev != NULL) help->dthps_prev->dthps_next = help->dthps_next; if (dtrace_deferred_pid == help) { dtrace_deferred_pid = help->dthps_next; ASSERT(help->dthps_prev == NULL); } mutex_exit(&dtrace_lock); } mutex_exit(&dtrace_meta_lock); for (i = 0; i < help->dthps_nprovs; i++) { dtrace_helper_provider_destroy(help->dthps_provs[i]); } kmem_free(help->dthps_provs, help->dthps_maxprovs * sizeof (dtrace_helper_provider_t *)); } mutex_enter(&dtrace_lock); dtrace_vstate_fini(&help->dthps_vstate); kmem_free(help->dthps_actions, sizeof (dtrace_helper_action_t *) * DTRACE_NHELPER_ACTIONS); kmem_free(help, sizeof (dtrace_helpers_t)); --dtrace_helpers; mutex_exit(&dtrace_lock); } #ifdef illumos static #endif void dtrace_helpers_duplicate(proc_t *from, proc_t *to) { dtrace_helpers_t *help, *newhelp; dtrace_helper_action_t *helper, *new, *last; dtrace_difo_t *dp; dtrace_vstate_t *vstate; int i, j, sz, hasprovs = 0; mutex_enter(&dtrace_lock); ASSERT(from->p_dtrace_helpers != NULL); ASSERT(dtrace_helpers > 0); help = from->p_dtrace_helpers; newhelp = dtrace_helpers_create(to); ASSERT(to->p_dtrace_helpers != NULL); newhelp->dthps_generation = help->dthps_generation; vstate = &newhelp->dthps_vstate; /* * Duplicate the helper actions. */ for (i = 0; i < DTRACE_NHELPER_ACTIONS; i++) { if ((helper = help->dthps_actions[i]) == NULL) continue; for (last = NULL; helper != NULL; helper = helper->dtha_next) { new = kmem_zalloc(sizeof (dtrace_helper_action_t), KM_SLEEP); new->dtha_generation = helper->dtha_generation; if ((dp = helper->dtha_predicate) != NULL) { dp = dtrace_difo_duplicate(dp, vstate); new->dtha_predicate = dp; } new->dtha_nactions = helper->dtha_nactions; sz = sizeof (dtrace_difo_t *) * new->dtha_nactions; new->dtha_actions = kmem_alloc(sz, KM_SLEEP); for (j = 0; j < new->dtha_nactions; j++) { dtrace_difo_t *dp = helper->dtha_actions[j]; ASSERT(dp != NULL); dp = dtrace_difo_duplicate(dp, vstate); new->dtha_actions[j] = dp; } if (last != NULL) { last->dtha_next = new; } else { newhelp->dthps_actions[i] = new; } last = new; } } /* * Duplicate the helper providers and register them with the * DTrace framework. */ if (help->dthps_nprovs > 0) { newhelp->dthps_nprovs = help->dthps_nprovs; newhelp->dthps_maxprovs = help->dthps_nprovs; newhelp->dthps_provs = kmem_alloc(newhelp->dthps_nprovs * sizeof (dtrace_helper_provider_t *), KM_SLEEP); for (i = 0; i < newhelp->dthps_nprovs; i++) { newhelp->dthps_provs[i] = help->dthps_provs[i]; newhelp->dthps_provs[i]->dthp_ref++; } hasprovs = 1; } mutex_exit(&dtrace_lock); if (hasprovs) dtrace_helper_provider_register(to, newhelp, NULL); } /* * DTrace Hook Functions */ static void dtrace_module_loaded(modctl_t *ctl) { dtrace_provider_t *prv; mutex_enter(&dtrace_provider_lock); #ifdef illumos mutex_enter(&mod_lock); #endif #ifdef illumos ASSERT(ctl->mod_busy); #endif /* * We're going to call each provider's per-module provide operation * specifying only this module. */ for (prv = dtrace_provider; prv != NULL; prv = prv->dtpv_next) prv->dtpv_pops.dtps_provide_module(prv->dtpv_arg, ctl); #ifdef illumos mutex_exit(&mod_lock); #endif mutex_exit(&dtrace_provider_lock); /* * If we have any retained enablings, we need to match against them. * Enabling probes requires that cpu_lock be held, and we cannot hold * cpu_lock here -- it is legal for cpu_lock to be held when loading a * module. (In particular, this happens when loading scheduling * classes.)
So if we have any retained enablings, we need to dispatch * our task queue to do the match for us. */ mutex_enter(&dtrace_lock); if (dtrace_retained == NULL) { mutex_exit(&dtrace_lock); return; } (void) taskq_dispatch(dtrace_taskq, (task_func_t *)dtrace_enabling_matchall, NULL, TQ_SLEEP); mutex_exit(&dtrace_lock); /* * And now, for a little heuristic sleaze: in general, we want to * match modules as soon as they load. However, we cannot guarantee * this, because it would lead us to the lock ordering violation * outlined above. The common case, of course, is that cpu_lock is * _not_ held -- so we delay here for a clock tick, hoping that that's * long enough for the task queue to do its work. If it's not, it's * not a serious problem -- it just means that the module that we * just loaded may not be immediately instrumentable. */ delay(1); } static void #ifdef illumos dtrace_module_unloaded(modctl_t *ctl) #else dtrace_module_unloaded(modctl_t *ctl, int *error) #endif { dtrace_probe_t template, *probe, *first, *next; dtrace_provider_t *prov; #ifndef illumos char modname[DTRACE_MODNAMELEN]; size_t len; #endif #ifdef illumos template.dtpr_mod = ctl->mod_modname; #else /* Handle the fact that ctl->filename may end in ".ko". */ strlcpy(modname, ctl->filename, sizeof(modname)); len = strlen(ctl->filename); if (len > 3 && strcmp(modname + len - 3, ".ko") == 0) modname[len - 3] = '\0'; template.dtpr_mod = modname; #endif mutex_enter(&dtrace_provider_lock); #ifdef illumos mutex_enter(&mod_lock); #endif mutex_enter(&dtrace_lock); #ifndef illumos if (ctl->nenabled > 0) { /* Don't allow unloads if a probe is enabled. */ mutex_exit(&dtrace_provider_lock); mutex_exit(&dtrace_lock); *error = -1; printf( "kldunload: attempt to unload module that has DTrace probes enabled\n"); return; } #endif if (dtrace_bymod == NULL) { /* * The DTrace module is loaded (obviously) but not attached; * we don't have any work to do. */ mutex_exit(&dtrace_provider_lock); #ifdef illumos mutex_exit(&mod_lock); #endif mutex_exit(&dtrace_lock); return; } for (probe = first = dtrace_hash_lookup(dtrace_bymod, &template); probe != NULL; probe = probe->dtpr_nextmod) { if (probe->dtpr_ecb != NULL) { mutex_exit(&dtrace_provider_lock); #ifdef illumos mutex_exit(&mod_lock); #endif mutex_exit(&dtrace_lock); /* * This shouldn't _actually_ be possible -- we're * unloading a module that has an enabled probe in it. * (It's normally up to the provider to make sure that * this can't happen.) However, because dtps_enable() * doesn't have a failure mode, there can be an * enable/unload race. Upshot: we don't want to * assert, but we're not going to disable the * probe, either. */ if (dtrace_err_verbose) { #ifdef illumos cmn_err(CE_WARN, "unloaded module '%s' had " "enabled probes", ctl->mod_modname); #else cmn_err(CE_WARN, "unloaded module '%s' had " "enabled probes", modname); #endif } return; } } probe = first; for (first = NULL; probe != NULL; probe = next) { ASSERT(dtrace_probes[probe->dtpr_id - 1] == probe); dtrace_probes[probe->dtpr_id - 1] = NULL; next = probe->dtpr_nextmod; dtrace_hash_remove(dtrace_bymod, probe); dtrace_hash_remove(dtrace_byfunc, probe); dtrace_hash_remove(dtrace_byname, probe); if (first == NULL) { first = probe; probe->dtpr_nextmod = NULL; } else { probe->dtpr_nextmod = first; first = probe; } } /* * We've removed all of the module's probes from the hash chains and * from the probe array. Now issue a dtrace_sync() to be sure that * everyone has cleared out from any probe array processing. 
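 * In other words, this follows the unlink-then-synchronize-then-free
 * discipline: make the object unreachable first, wait for every CPU to
 * pass through a synchronization point, and only then free it.  A
 * hedged sketch of the shape (generic names; dtrace_sync() is the real
 * barrier used here):
 *
 *	remove_from_lookup_structures(obj);	(no new references)
 *	dtrace_sync();				(old references drained)
 *	free_object(obj);			(now safe to free)
 *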
*/ dtrace_sync(); for (probe = first; probe != NULL; probe = first) { first = probe->dtpr_nextmod; prov = probe->dtpr_provider; prov->dtpv_pops.dtps_destroy(prov->dtpv_arg, probe->dtpr_id, probe->dtpr_arg); kmem_free(probe->dtpr_mod, strlen(probe->dtpr_mod) + 1); kmem_free(probe->dtpr_func, strlen(probe->dtpr_func) + 1); kmem_free(probe->dtpr_name, strlen(probe->dtpr_name) + 1); #ifdef illumos vmem_free(dtrace_arena, (void *)(uintptr_t)probe->dtpr_id, 1); #else free_unr(dtrace_arena, probe->dtpr_id); #endif kmem_free(probe, sizeof (dtrace_probe_t)); } mutex_exit(&dtrace_lock); #ifdef illumos mutex_exit(&mod_lock); #endif mutex_exit(&dtrace_provider_lock); } #ifndef illumos static void dtrace_kld_load(void *arg __unused, linker_file_t lf) { dtrace_module_loaded(lf); } static void dtrace_kld_unload_try(void *arg __unused, linker_file_t lf, int *error) { if (*error != 0) /* We already have an error, so don't do anything. */ return; dtrace_module_unloaded(lf, error); } #endif #ifdef illumos static void dtrace_suspend(void) { dtrace_probe_foreach(offsetof(dtrace_pops_t, dtps_suspend)); } static void dtrace_resume(void) { dtrace_probe_foreach(offsetof(dtrace_pops_t, dtps_resume)); } #endif static int dtrace_cpu_setup(cpu_setup_t what, processorid_t cpu) { ASSERT(MUTEX_HELD(&cpu_lock)); mutex_enter(&dtrace_lock); switch (what) { case CPU_CONFIG: { dtrace_state_t *state; dtrace_optval_t *opt, rs, c; /* * For now, we only allocate a new buffer for anonymous state. */ if ((state = dtrace_anon.dta_state) == NULL) break; if (state->dts_activity != DTRACE_ACTIVITY_ACTIVE) break; opt = state->dts_options; c = opt[DTRACEOPT_CPU]; if (c != DTRACE_CPUALL && c != DTRACEOPT_UNSET && c != cpu) break; /* * Regardless of what the actual policy is, we're going to * temporarily set our resize policy to be manual. We're * also going to temporarily set our CPU option to denote * the newly configured CPU. */ rs = opt[DTRACEOPT_BUFRESIZE]; opt[DTRACEOPT_BUFRESIZE] = DTRACEOPT_BUFRESIZE_MANUAL; opt[DTRACEOPT_CPU] = (dtrace_optval_t)cpu; (void) dtrace_state_buffers(state); opt[DTRACEOPT_BUFRESIZE] = rs; opt[DTRACEOPT_CPU] = c; break; } case CPU_UNCONFIG: /* * We don't free the buffer in the CPU_UNCONFIG case. (The * buffer will be freed when the consumer exits.) 
*/ break; default: break; } mutex_exit(&dtrace_lock); return (0); } #ifdef illumos static void dtrace_cpu_setup_initial(processorid_t cpu) { (void) dtrace_cpu_setup(CPU_CONFIG, cpu); } #endif static void dtrace_toxrange_add(uintptr_t base, uintptr_t limit) { if (dtrace_toxranges >= dtrace_toxranges_max) { int osize, nsize; dtrace_toxrange_t *range; osize = dtrace_toxranges_max * sizeof (dtrace_toxrange_t); if (osize == 0) { ASSERT(dtrace_toxrange == NULL); ASSERT(dtrace_toxranges_max == 0); dtrace_toxranges_max = 1; } else { dtrace_toxranges_max <<= 1; } nsize = dtrace_toxranges_max * sizeof (dtrace_toxrange_t); range = kmem_zalloc(nsize, KM_SLEEP); if (dtrace_toxrange != NULL) { ASSERT(osize != 0); bcopy(dtrace_toxrange, range, osize); kmem_free(dtrace_toxrange, osize); } dtrace_toxrange = range; } ASSERT(dtrace_toxrange[dtrace_toxranges].dtt_base == 0); ASSERT(dtrace_toxrange[dtrace_toxranges].dtt_limit == 0); dtrace_toxrange[dtrace_toxranges].dtt_base = base; dtrace_toxrange[dtrace_toxranges].dtt_limit = limit; dtrace_toxranges++; } static void dtrace_getf_barrier() { #ifdef illumos /* * When we have unprivileged (that is, non-DTRACE_CRV_KERNEL) enablings * that contain calls to getf(), this routine will be called on every * closef() before either the underlying vnode is released or the * file_t itself is freed. By the time we are here, it is essential * that the file_t can no longer be accessed from a call to getf() * in probe context -- that assures that a dtrace_sync() can be used * to clear out any enablings referring to the old structures. */ if (curthread->t_procp->p_zone->zone_dtrace_getf != 0 || kcred->cr_zone->zone_dtrace_getf != 0) dtrace_sync(); #endif } /* * DTrace Driver Cookbook Functions */ #ifdef illumos /*ARGSUSED*/ static int dtrace_attach(dev_info_t *devi, ddi_attach_cmd_t cmd) { dtrace_provider_id_t id; dtrace_state_t *state = NULL; dtrace_enabling_t *enab; mutex_enter(&cpu_lock); mutex_enter(&dtrace_provider_lock); mutex_enter(&dtrace_lock); if (ddi_soft_state_init(&dtrace_softstate, sizeof (dtrace_state_t), 0) != 0) { cmn_err(CE_NOTE, "/dev/dtrace failed to initialize soft state"); mutex_exit(&cpu_lock); mutex_exit(&dtrace_provider_lock); mutex_exit(&dtrace_lock); return (DDI_FAILURE); } if (ddi_create_minor_node(devi, DTRACEMNR_DTRACE, S_IFCHR, DTRACEMNRN_DTRACE, DDI_PSEUDO, NULL) == DDI_FAILURE || ddi_create_minor_node(devi, DTRACEMNR_HELPER, S_IFCHR, DTRACEMNRN_HELPER, DDI_PSEUDO, NULL) == DDI_FAILURE) { cmn_err(CE_NOTE, "/dev/dtrace couldn't create minor nodes"); ddi_remove_minor_node(devi, NULL); ddi_soft_state_fini(&dtrace_softstate); mutex_exit(&cpu_lock); mutex_exit(&dtrace_provider_lock); mutex_exit(&dtrace_lock); return (DDI_FAILURE); } ddi_report_dev(devi); dtrace_devi = devi; dtrace_modload = dtrace_module_loaded; dtrace_modunload = dtrace_module_unloaded; dtrace_cpu_init = dtrace_cpu_setup_initial; dtrace_helpers_cleanup = dtrace_helpers_destroy; dtrace_helpers_fork = dtrace_helpers_duplicate; dtrace_cpustart_init = dtrace_suspend; dtrace_cpustart_fini = dtrace_resume; dtrace_debugger_init = dtrace_suspend; dtrace_debugger_fini = dtrace_resume; register_cpu_setup_func((cpu_setup_func_t *)dtrace_cpu_setup, NULL); ASSERT(MUTEX_HELD(&cpu_lock)); dtrace_arena = vmem_create("dtrace", (void *)1, UINT32_MAX, 1, NULL, NULL, NULL, 0, VM_SLEEP | VMC_IDENTIFIER); dtrace_minor = vmem_create("dtrace_minor", (void *)DTRACEMNRN_CLONE, UINT32_MAX - DTRACEMNRN_CLONE, 1, NULL, NULL, NULL, 0, VM_SLEEP | VMC_IDENTIFIER); dtrace_taskq = taskq_create("dtrace_taskq", 1, 
maxclsyspri, 1, INT_MAX, 0); dtrace_state_cache = kmem_cache_create("dtrace_state_cache", sizeof (dtrace_dstate_percpu_t) * NCPU, DTRACE_STATE_ALIGN, NULL, NULL, NULL, NULL, NULL, 0); ASSERT(MUTEX_HELD(&cpu_lock)); dtrace_bymod = dtrace_hash_create(offsetof(dtrace_probe_t, dtpr_mod), offsetof(dtrace_probe_t, dtpr_nextmod), offsetof(dtrace_probe_t, dtpr_prevmod)); dtrace_byfunc = dtrace_hash_create(offsetof(dtrace_probe_t, dtpr_func), offsetof(dtrace_probe_t, dtpr_nextfunc), offsetof(dtrace_probe_t, dtpr_prevfunc)); dtrace_byname = dtrace_hash_create(offsetof(dtrace_probe_t, dtpr_name), offsetof(dtrace_probe_t, dtpr_nextname), offsetof(dtrace_probe_t, dtpr_prevname)); if (dtrace_retain_max < 1) { cmn_err(CE_WARN, "illegal value (%lu) for dtrace_retain_max; " "setting to 1", dtrace_retain_max); dtrace_retain_max = 1; } /* * Now discover our toxic ranges. */ dtrace_toxic_ranges(dtrace_toxrange_add); /* * Before we register ourselves as a provider to our own framework, * we would like to assert that dtrace_provider is NULL -- but that's * not true if we were loaded as a dependency of a DTrace provider. * Once we've registered, we can assert that dtrace_provider is our * pseudo provider. */ (void) dtrace_register("dtrace", &dtrace_provider_attr, DTRACE_PRIV_NONE, 0, &dtrace_provider_ops, NULL, &id); ASSERT(dtrace_provider != NULL); ASSERT((dtrace_provider_id_t)dtrace_provider == id); dtrace_probeid_begin = dtrace_probe_create((dtrace_provider_id_t) dtrace_provider, NULL, NULL, "BEGIN", 0, NULL); dtrace_probeid_end = dtrace_probe_create((dtrace_provider_id_t) dtrace_provider, NULL, NULL, "END", 0, NULL); dtrace_probeid_error = dtrace_probe_create((dtrace_provider_id_t) dtrace_provider, NULL, NULL, "ERROR", 1, NULL); dtrace_anon_property(); mutex_exit(&cpu_lock); /* * If there are already providers, we must ask them to provide their * probes, and then match any anonymous enabling against them. Note * that there should be no other retained enablings at this time: * the only retained enablings at this time should be the anonymous * enabling. */ if (dtrace_anon.dta_enabling != NULL) { ASSERT(dtrace_retained == dtrace_anon.dta_enabling); dtrace_enabling_provide(NULL); state = dtrace_anon.dta_state; /* * We couldn't hold cpu_lock across the above call to * dtrace_enabling_provide(), but we must hold it to actually * enable the probes. We have to drop all of our locks, pick * up cpu_lock, and regain our locks before matching the * retained anonymous enabling. */ mutex_exit(&dtrace_lock); mutex_exit(&dtrace_provider_lock); mutex_enter(&cpu_lock); mutex_enter(&dtrace_provider_lock); mutex_enter(&dtrace_lock); if ((enab = dtrace_anon.dta_enabling) != NULL) (void) dtrace_enabling_match(enab, NULL); mutex_exit(&cpu_lock); } mutex_exit(&dtrace_lock); mutex_exit(&dtrace_provider_lock); if (state != NULL) { /* * If we created any anonymous state, set it going now. */ (void) dtrace_state_go(state, &dtrace_anon.dta_beganon); } return (DDI_SUCCESS); } #endif /* illumos */ #ifndef illumos static void dtrace_dtr(void *); #endif /*ARGSUSED*/ static int #ifdef illumos dtrace_open(dev_t *devp, int flag, int otyp, cred_t *cred_p) #else dtrace_open(struct cdev *dev, int oflags, int devtype, struct thread *td) #endif { dtrace_state_t *state; uint32_t priv; uid_t uid; zoneid_t zoneid; #ifdef illumos if (getminor(*devp) == DTRACEMNRN_HELPER) return (0); /* * If this wasn't an open with the "helper" minor, then it must be * the "dtrace" minor. 
*/ if (getminor(*devp) != DTRACEMNRN_DTRACE) return (ENXIO); #else cred_t *cred_p = NULL; cred_p = dev->si_cred; /* * If no DTRACE_PRIV_* bits are set in the credential, then the * caller lacks sufficient permission to do anything with DTrace. */ dtrace_cred2priv(cred_p, &priv, &uid, &zoneid); if (priv == DTRACE_PRIV_NONE) { #endif return (EACCES); } /* * Ask all providers to provide all their probes. */ mutex_enter(&dtrace_provider_lock); dtrace_probe_provide(NULL, NULL); mutex_exit(&dtrace_provider_lock); mutex_enter(&cpu_lock); mutex_enter(&dtrace_lock); dtrace_opens++; dtrace_membar_producer(); #ifdef illumos /* * If the kernel debugger is active (that is, if the kernel debugger * modified text in some way), we won't allow the open. */ if (kdi_dtrace_set(KDI_DTSET_DTRACE_ACTIVATE) != 0) { dtrace_opens--; mutex_exit(&cpu_lock); mutex_exit(&dtrace_lock); return (EBUSY); } if (dtrace_helptrace_enable && dtrace_helptrace_buffer == NULL) { /* * If DTrace helper tracing is enabled, we need to allocate the * trace buffer and initialize the values. */ dtrace_helptrace_buffer = kmem_zalloc(dtrace_helptrace_bufsize, KM_SLEEP); dtrace_helptrace_next = 0; dtrace_helptrace_wrapped = 0; dtrace_helptrace_enable = 0; } state = dtrace_state_create(devp, cred_p); #else state = dtrace_state_create(dev, NULL); devfs_set_cdevpriv(state, dtrace_dtr); #endif mutex_exit(&cpu_lock); if (state == NULL) { #ifdef illumos if (--dtrace_opens == 0 && dtrace_anon.dta_enabling == NULL) (void) kdi_dtrace_set(KDI_DTSET_DTRACE_DEACTIVATE); #else --dtrace_opens; #endif mutex_exit(&dtrace_lock); return (EAGAIN); } mutex_exit(&dtrace_lock); return (0); } /*ARGSUSED*/ #ifdef illumos static int dtrace_close(dev_t dev, int flag, int otyp, cred_t *cred_p) #else static void dtrace_dtr(void *data) #endif { #ifdef illumos minor_t minor = getminor(dev); dtrace_state_t *state; #endif dtrace_helptrace_t *buf = NULL; #ifdef illumos if (minor == DTRACEMNRN_HELPER) return (0); state = ddi_get_soft_state(dtrace_softstate, minor); #else dtrace_state_t *state = data; #endif mutex_enter(&cpu_lock); mutex_enter(&dtrace_lock); #ifdef illumos if (state->dts_anon) #else if (state != NULL && state->dts_anon) #endif { /* * There is anonymous state. Destroy that first. */ ASSERT(dtrace_anon.dta_state == NULL); dtrace_state_destroy(state->dts_anon); } if (dtrace_helptrace_disable) { /* * If we have been told to disable helper tracing, set the * buffer to NULL before calling into dtrace_state_destroy(); * we take advantage of its dtrace_sync() to know that no * CPU is in probe context with enabled helper tracing * after it returns. */ buf = dtrace_helptrace_buffer; dtrace_helptrace_buffer = NULL; } #ifdef illumos dtrace_state_destroy(state); #else if (state != NULL) { dtrace_state_destroy(state); kmem_free(state, 0); } #endif ASSERT(dtrace_opens > 0); #ifdef illumos /* * Only relinquish control of the kernel debugger interface when there * are no consumers and no anonymous enablings.
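 * That is, kdi_dtrace_set() is reference-counted by hand here:
 * KDI_DTSET_DTRACE_ACTIVATE is issued on first open, and
 * KDI_DTSET_DTRACE_DEACTIVATE is issued only once dtrace_opens drops
 * to zero and no anonymous enabling still needs the machinery.
 *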
*/ if (--dtrace_opens == 0 && dtrace_anon.dta_enabling == NULL) (void) kdi_dtrace_set(KDI_DTSET_DTRACE_DEACTIVATE); #else --dtrace_opens; #endif if (buf != NULL) { kmem_free(buf, dtrace_helptrace_bufsize); dtrace_helptrace_disable = 0; } mutex_exit(&dtrace_lock); mutex_exit(&cpu_lock); #ifdef illumos return (0); #endif } #ifdef illumos /*ARGSUSED*/ static int dtrace_ioctl_helper(int cmd, intptr_t arg, int *rv) { int rval; dof_helper_t help, *dhp = NULL; switch (cmd) { case DTRACEHIOC_ADDDOF: if (copyin((void *)arg, &help, sizeof (help)) != 0) { dtrace_dof_error(NULL, "failed to copyin DOF helper"); return (EFAULT); } dhp = &help; arg = (intptr_t)help.dofhp_dof; /*FALLTHROUGH*/ case DTRACEHIOC_ADD: { dof_hdr_t *dof = dtrace_dof_copyin(arg, &rval); if (dof == NULL) return (rval); mutex_enter(&dtrace_lock); /* * dtrace_helper_slurp() takes responsibility for the dof -- * it may free it now or it may save it and free it later. */ if ((rval = dtrace_helper_slurp(dof, dhp)) != -1) { *rv = rval; rval = 0; } else { rval = EINVAL; } mutex_exit(&dtrace_lock); return (rval); } case DTRACEHIOC_REMOVE: { mutex_enter(&dtrace_lock); rval = dtrace_helper_destroygen(NULL, arg); mutex_exit(&dtrace_lock); return (rval); } default: break; } return (ENOTTY); } /*ARGSUSED*/ static int dtrace_ioctl(dev_t dev, int cmd, intptr_t arg, int md, cred_t *cr, int *rv) { minor_t minor = getminor(dev); dtrace_state_t *state; int rval; if (minor == DTRACEMNRN_HELPER) return (dtrace_ioctl_helper(cmd, arg, rv)); state = ddi_get_soft_state(dtrace_softstate, minor); if (state->dts_anon) { ASSERT(dtrace_anon.dta_state == NULL); state = state->dts_anon; } switch (cmd) { case DTRACEIOC_PROVIDER: { dtrace_providerdesc_t pvd; dtrace_provider_t *pvp; if (copyin((void *)arg, &pvd, sizeof (pvd)) != 0) return (EFAULT); pvd.dtvd_name[DTRACE_PROVNAMELEN - 1] = '\0'; mutex_enter(&dtrace_provider_lock); for (pvp = dtrace_provider; pvp != NULL; pvp = pvp->dtpv_next) { if (strcmp(pvp->dtpv_name, pvd.dtvd_name) == 0) break; } mutex_exit(&dtrace_provider_lock); if (pvp == NULL) return (ESRCH); bcopy(&pvp->dtpv_priv, &pvd.dtvd_priv, sizeof (dtrace_ppriv_t)); bcopy(&pvp->dtpv_attr, &pvd.dtvd_attr, sizeof (dtrace_pattr_t)); if (copyout(&pvd, (void *)arg, sizeof (pvd)) != 0) return (EFAULT); return (0); } case DTRACEIOC_EPROBE: { dtrace_eprobedesc_t epdesc; dtrace_ecb_t *ecb; dtrace_action_t *act; void *buf; size_t size; uintptr_t dest; int nrecs; if (copyin((void *)arg, &epdesc, sizeof (epdesc)) != 0) return (EFAULT); mutex_enter(&dtrace_lock); if ((ecb = dtrace_epid2ecb(state, epdesc.dtepd_epid)) == NULL) { mutex_exit(&dtrace_lock); return (EINVAL); } if (ecb->dte_probe == NULL) { mutex_exit(&dtrace_lock); return (EINVAL); } epdesc.dtepd_probeid = ecb->dte_probe->dtpr_id; epdesc.dtepd_uarg = ecb->dte_uarg; epdesc.dtepd_size = ecb->dte_size; nrecs = epdesc.dtepd_nrecs; epdesc.dtepd_nrecs = 0; for (act = ecb->dte_action; act != NULL; act = act->dta_next) { if (DTRACEACT_ISAGG(act->dta_kind) || act->dta_intuple) continue; epdesc.dtepd_nrecs++; } /* * Now that we have the size, we need to allocate a temporary * buffer in which to store the complete description. We need * the temporary buffer to be able to drop dtrace_lock() * across the copyout(), below. 
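 * The pattern is: build the complete description into a kernel-side
 * buffer while the lock keeps the state consistent, drop the lock, and
 * copyout() from the stable snapshot.  Hedged sketch (generic names):
 *
 *	mutex_enter(&lock);
 *	kbuf = kmem_alloc(len, KM_SLEEP);
 *	fill_snapshot(kbuf);
 *	mutex_exit(&lock);
 *	error = copyout(kbuf, uaddr, len) != 0 ? EFAULT : 0;
 *	kmem_free(kbuf, len);
 *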
*/ size = sizeof (dtrace_eprobedesc_t) + (epdesc.dtepd_nrecs * sizeof (dtrace_recdesc_t)); buf = kmem_alloc(size, KM_SLEEP); dest = (uintptr_t)buf; bcopy(&epdesc, (void *)dest, sizeof (epdesc)); dest += offsetof(dtrace_eprobedesc_t, dtepd_rec[0]); for (act = ecb->dte_action; act != NULL; act = act->dta_next) { if (DTRACEACT_ISAGG(act->dta_kind) || act->dta_intuple) continue; if (nrecs-- == 0) break; bcopy(&act->dta_rec, (void *)dest, sizeof (dtrace_recdesc_t)); dest += sizeof (dtrace_recdesc_t); } mutex_exit(&dtrace_lock); if (copyout(buf, (void *)arg, dest - (uintptr_t)buf) != 0) { kmem_free(buf, size); return (EFAULT); } kmem_free(buf, size); return (0); } case DTRACEIOC_AGGDESC: { dtrace_aggdesc_t aggdesc; dtrace_action_t *act; dtrace_aggregation_t *agg; int nrecs; uint32_t offs; dtrace_recdesc_t *lrec; void *buf; size_t size; uintptr_t dest; if (copyin((void *)arg, &aggdesc, sizeof (aggdesc)) != 0) return (EFAULT); mutex_enter(&dtrace_lock); if ((agg = dtrace_aggid2agg(state, aggdesc.dtagd_id)) == NULL) { mutex_exit(&dtrace_lock); return (EINVAL); } aggdesc.dtagd_epid = agg->dtag_ecb->dte_epid; nrecs = aggdesc.dtagd_nrecs; aggdesc.dtagd_nrecs = 0; offs = agg->dtag_base; lrec = &agg->dtag_action.dta_rec; aggdesc.dtagd_size = lrec->dtrd_offset + lrec->dtrd_size - offs; for (act = agg->dtag_first; ; act = act->dta_next) { ASSERT(act->dta_intuple || DTRACEACT_ISAGG(act->dta_kind)); /* * If this action has a record size of zero, it * denotes an argument to the aggregating action. * Because the presence of this record doesn't (or * shouldn't) affect the way the data is interpreted, * we don't copy it out to save user-level the * confusion of dealing with a zero-length record. */ if (act->dta_rec.dtrd_size == 0) { ASSERT(agg->dtag_hasarg); continue; } aggdesc.dtagd_nrecs++; if (act == &agg->dtag_action) break; } /* * Now that we have the size, we need to allocate a temporary * buffer in which to store the complete description. We need * the temporary buffer to be able to drop dtrace_lock() * across the copyout(), below. */ size = sizeof (dtrace_aggdesc_t) + (aggdesc.dtagd_nrecs * sizeof (dtrace_recdesc_t)); buf = kmem_alloc(size, KM_SLEEP); dest = (uintptr_t)buf; bcopy(&aggdesc, (void *)dest, sizeof (aggdesc)); dest += offsetof(dtrace_aggdesc_t, dtagd_rec[0]); for (act = agg->dtag_first; ; act = act->dta_next) { dtrace_recdesc_t rec = act->dta_rec; /* * See the comment in the above loop for why we pass * over zero-length records. */ if (rec.dtrd_size == 0) { ASSERT(agg->dtag_hasarg); continue; } if (nrecs-- == 0) break; rec.dtrd_offset -= offs; bcopy(&rec, (void *)dest, sizeof (rec)); dest += sizeof (dtrace_recdesc_t); if (act == &agg->dtag_action) break; } mutex_exit(&dtrace_lock); if (copyout(buf, (void *)arg, dest - (uintptr_t)buf) != 0) { kmem_free(buf, size); return (EFAULT); } kmem_free(buf, size); return (0); } case DTRACEIOC_ENABLE: { dof_hdr_t *dof; dtrace_enabling_t *enab = NULL; dtrace_vstate_t *vstate; int err = 0; *rv = 0; /* * If a NULL argument has been passed, we take this as our * cue to reevaluate our enablings. 
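 * From user level this looks like the following (hedged sketch; fd is
 * an open /dev/dtrace descriptor, error handling omitted):
 *
 *	(void) ioctl(fd, DTRACEIOC_ENABLE, NULL);
 *
 * which does nothing but trigger the dtrace_enabling_matchall() call
 * below.
 *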
*/ if (arg == NULL) { dtrace_enabling_matchall(); return (0); } if ((dof = dtrace_dof_copyin(arg, &rval)) == NULL) return (rval); mutex_enter(&cpu_lock); mutex_enter(&dtrace_lock); vstate = &state->dts_vstate; if (state->dts_activity != DTRACE_ACTIVITY_INACTIVE) { mutex_exit(&dtrace_lock); mutex_exit(&cpu_lock); dtrace_dof_destroy(dof); return (EBUSY); } if (dtrace_dof_slurp(dof, vstate, cr, &enab, 0, B_TRUE) != 0) { mutex_exit(&dtrace_lock); mutex_exit(&cpu_lock); dtrace_dof_destroy(dof); return (EINVAL); } if ((rval = dtrace_dof_options(dof, state)) != 0) { dtrace_enabling_destroy(enab); mutex_exit(&dtrace_lock); mutex_exit(&cpu_lock); dtrace_dof_destroy(dof); return (rval); } if ((err = dtrace_enabling_match(enab, rv)) == 0) { err = dtrace_enabling_retain(enab); } else { dtrace_enabling_destroy(enab); } mutex_exit(&cpu_lock); mutex_exit(&dtrace_lock); dtrace_dof_destroy(dof); return (err); } case DTRACEIOC_REPLICATE: { dtrace_repldesc_t desc; dtrace_probedesc_t *match = &desc.dtrpd_match; dtrace_probedesc_t *create = &desc.dtrpd_create; int err; if (copyin((void *)arg, &desc, sizeof (desc)) != 0) return (EFAULT); match->dtpd_provider[DTRACE_PROVNAMELEN - 1] = '\0'; match->dtpd_mod[DTRACE_MODNAMELEN - 1] = '\0'; match->dtpd_func[DTRACE_FUNCNAMELEN - 1] = '\0'; match->dtpd_name[DTRACE_NAMELEN - 1] = '\0'; create->dtpd_provider[DTRACE_PROVNAMELEN - 1] = '\0'; create->dtpd_mod[DTRACE_MODNAMELEN - 1] = '\0'; create->dtpd_func[DTRACE_FUNCNAMELEN - 1] = '\0'; create->dtpd_name[DTRACE_NAMELEN - 1] = '\0'; mutex_enter(&dtrace_lock); err = dtrace_enabling_replicate(state, match, create); mutex_exit(&dtrace_lock); return (err); } case DTRACEIOC_PROBEMATCH: case DTRACEIOC_PROBES: { dtrace_probe_t *probe = NULL; dtrace_probedesc_t desc; dtrace_probekey_t pkey; dtrace_id_t i; int m = 0; uint32_t priv; uid_t uid; zoneid_t zoneid; if (copyin((void *)arg, &desc, sizeof (desc)) != 0) return (EFAULT); desc.dtpd_provider[DTRACE_PROVNAMELEN - 1] = '\0'; desc.dtpd_mod[DTRACE_MODNAMELEN - 1] = '\0'; desc.dtpd_func[DTRACE_FUNCNAMELEN - 1] = '\0'; desc.dtpd_name[DTRACE_NAMELEN - 1] = '\0'; /* * Before we attempt to match this probe, we want to give * all providers the opportunity to provide it. 
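 * Note for consumers: DTRACEIOC_PROBES/DTRACEIOC_PROBEMATCH act as an
 * iterator -- each call returns the first (matching) probe whose id is
 * at least dtpd_id, so user level can walk the table by feeding the
 * returned id back, incremented by one.  Hedged sketch (consume() is
 * hypothetical):
 *
 *	desc.dtpd_id = DTRACE_IDNONE;
 *	while (ioctl(fd, DTRACEIOC_PROBES, &desc) == 0) {
 *		consume(&desc);
 *		desc.dtpd_id++;
 *	}
 *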
*/ if (desc.dtpd_id == DTRACE_IDNONE) { mutex_enter(&dtrace_provider_lock); dtrace_probe_provide(&desc, NULL); mutex_exit(&dtrace_provider_lock); desc.dtpd_id++; } if (cmd == DTRACEIOC_PROBEMATCH) { dtrace_probekey(&desc, &pkey); pkey.dtpk_id = DTRACE_IDNONE; } dtrace_cred2priv(cr, &priv, &uid, &zoneid); mutex_enter(&dtrace_lock); if (cmd == DTRACEIOC_PROBEMATCH) { for (i = desc.dtpd_id; i <= dtrace_nprobes; i++) { if ((probe = dtrace_probes[i - 1]) != NULL && (m = dtrace_match_probe(probe, &pkey, priv, uid, zoneid)) != 0) break; } if (m < 0) { mutex_exit(&dtrace_lock); return (EINVAL); } } else { for (i = desc.dtpd_id; i <= dtrace_nprobes; i++) { if ((probe = dtrace_probes[i - 1]) != NULL && dtrace_match_priv(probe, priv, uid, zoneid)) break; } } if (probe == NULL) { mutex_exit(&dtrace_lock); return (ESRCH); } dtrace_probe_description(probe, &desc); mutex_exit(&dtrace_lock); if (copyout(&desc, (void *)arg, sizeof (desc)) != 0) return (EFAULT); return (0); } case DTRACEIOC_PROBEARG: { dtrace_argdesc_t desc; dtrace_probe_t *probe; dtrace_provider_t *prov; if (copyin((void *)arg, &desc, sizeof (desc)) != 0) return (EFAULT); if (desc.dtargd_id == DTRACE_IDNONE) return (EINVAL); if (desc.dtargd_ndx == DTRACE_ARGNONE) return (EINVAL); mutex_enter(&dtrace_provider_lock); mutex_enter(&mod_lock); mutex_enter(&dtrace_lock); if (desc.dtargd_id > dtrace_nprobes) { mutex_exit(&dtrace_lock); mutex_exit(&mod_lock); mutex_exit(&dtrace_provider_lock); return (EINVAL); } if ((probe = dtrace_probes[desc.dtargd_id - 1]) == NULL) { mutex_exit(&dtrace_lock); mutex_exit(&mod_lock); mutex_exit(&dtrace_provider_lock); return (EINVAL); } mutex_exit(&dtrace_lock); prov = probe->dtpr_provider; if (prov->dtpv_pops.dtps_getargdesc == NULL) { /* * There isn't any typed information for this probe. * Set the argument number to DTRACE_ARGNONE. */ desc.dtargd_ndx = DTRACE_ARGNONE; } else { desc.dtargd_native[0] = '\0'; desc.dtargd_xlate[0] = '\0'; desc.dtargd_mapping = desc.dtargd_ndx; prov->dtpv_pops.dtps_getargdesc(prov->dtpv_arg, probe->dtpr_id, probe->dtpr_arg, &desc); } mutex_exit(&mod_lock); mutex_exit(&dtrace_provider_lock); if (copyout(&desc, (void *)arg, sizeof (desc)) != 0) return (EFAULT); return (0); } case DTRACEIOC_GO: { processorid_t cpuid; rval = dtrace_state_go(state, &cpuid); if (rval != 0) return (rval); if (copyout(&cpuid, (void *)arg, sizeof (cpuid)) != 0) return (EFAULT); return (0); } case DTRACEIOC_STOP: { processorid_t cpuid; mutex_enter(&dtrace_lock); rval = dtrace_state_stop(state, &cpuid); mutex_exit(&dtrace_lock); if (rval != 0) return (rval); if (copyout(&cpuid, (void *)arg, sizeof (cpuid)) != 0) return (EFAULT); return (0); } case DTRACEIOC_DOFGET: { dof_hdr_t hdr, *dof; uint64_t len; if (copyin((void *)arg, &hdr, sizeof (hdr)) != 0) return (EFAULT); mutex_enter(&dtrace_lock); dof = dtrace_dof_create(state); mutex_exit(&dtrace_lock); len = MIN(hdr.dofh_loadsz, dof->dofh_loadsz); rval = copyout(dof, (void *)arg, len); dtrace_dof_destroy(dof); return (rval == 0 ? 
0 : EFAULT); } case DTRACEIOC_AGGSNAP: case DTRACEIOC_BUFSNAP: { dtrace_bufdesc_t desc; caddr_t cached; dtrace_buffer_t *buf; if (copyin((void *)arg, &desc, sizeof (desc)) != 0) return (EFAULT); if (desc.dtbd_cpu < 0 || desc.dtbd_cpu >= NCPU) return (EINVAL); mutex_enter(&dtrace_lock); if (cmd == DTRACEIOC_BUFSNAP) { buf = &state->dts_buffer[desc.dtbd_cpu]; } else { buf = &state->dts_aggbuffer[desc.dtbd_cpu]; } if (buf->dtb_flags & (DTRACEBUF_RING | DTRACEBUF_FILL)) { size_t sz = buf->dtb_offset; if (state->dts_activity != DTRACE_ACTIVITY_STOPPED) { mutex_exit(&dtrace_lock); return (EBUSY); } /* * If this buffer has already been consumed, we're * going to indicate that there's nothing left here * to consume. */ if (buf->dtb_flags & DTRACEBUF_CONSUMED) { mutex_exit(&dtrace_lock); desc.dtbd_size = 0; desc.dtbd_drops = 0; desc.dtbd_errors = 0; desc.dtbd_oldest = 0; sz = sizeof (desc); if (copyout(&desc, (void *)arg, sz) != 0) return (EFAULT); return (0); } /* * If this is a ring buffer that has wrapped, we want * to copy the whole thing out. */ if (buf->dtb_flags & DTRACEBUF_WRAPPED) { dtrace_buffer_polish(buf); sz = buf->dtb_size; } if (copyout(buf->dtb_tomax, desc.dtbd_data, sz) != 0) { mutex_exit(&dtrace_lock); return (EFAULT); } desc.dtbd_size = sz; desc.dtbd_drops = buf->dtb_drops; desc.dtbd_errors = buf->dtb_errors; desc.dtbd_oldest = buf->dtb_xamot_offset; desc.dtbd_timestamp = dtrace_gethrtime(); mutex_exit(&dtrace_lock); if (copyout(&desc, (void *)arg, sizeof (desc)) != 0) return (EFAULT); buf->dtb_flags |= DTRACEBUF_CONSUMED; return (0); } if (buf->dtb_tomax == NULL) { ASSERT(buf->dtb_xamot == NULL); mutex_exit(&dtrace_lock); return (ENOENT); } cached = buf->dtb_tomax; ASSERT(!(buf->dtb_flags & DTRACEBUF_NOSWITCH)); dtrace_xcall(desc.dtbd_cpu, (dtrace_xcall_t)dtrace_buffer_switch, buf); state->dts_errors += buf->dtb_xamot_errors; /* * If the buffers did not actually switch, then the cross call * did not take place -- presumably because the given CPU is * not in the ready set. If this is the case, we'll return * ENOENT. */ if (buf->dtb_tomax == cached) { ASSERT(buf->dtb_xamot != cached); mutex_exit(&dtrace_lock); return (ENOENT); } ASSERT(cached == buf->dtb_xamot); /* * We have our snapshot; now copy it out. */ if (copyout(buf->dtb_xamot, desc.dtbd_data, buf->dtb_xamot_offset) != 0) { mutex_exit(&dtrace_lock); return (EFAULT); } desc.dtbd_size = buf->dtb_xamot_offset; desc.dtbd_drops = buf->dtb_xamot_drops; desc.dtbd_errors = buf->dtb_xamot_errors; desc.dtbd_oldest = 0; desc.dtbd_timestamp = buf->dtb_switched; mutex_exit(&dtrace_lock); /* * Finally, copy out the buffer description. */ if (copyout(&desc, (void *)arg, sizeof (desc)) != 0) return (EFAULT); return (0); } case DTRACEIOC_CONF: { dtrace_conf_t conf; bzero(&conf, sizeof (conf)); conf.dtc_difversion = DIF_VERSION; conf.dtc_difintregs = DIF_DIR_NREGS; conf.dtc_diftupregs = DIF_DTR_NREGS; conf.dtc_ctfmodel = CTF_MODEL_NATIVE; if (copyout(&conf, (void *)arg, sizeof (conf)) != 0) return (EFAULT); return (0); } case DTRACEIOC_STATUS: { dtrace_status_t stat; dtrace_dstate_t *dstate; int i, j; uint64_t nerrs; /* * See the comment in dtrace_state_deadman() for the reason * for setting dts_laststatus to INT64_MAX before setting * it to the correct value. 
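 * Briefly: the INT64_MAX store plus the producer barrier below make
 * the timestamp update safe against a concurrent deadman -- an
 * observer sees either the old value, or a value too far in the future
 * to trigger on, and only then the new value; it can never act on a
 * half-written timestamp.
 *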
*/ state->dts_laststatus = INT64_MAX; dtrace_membar_producer(); state->dts_laststatus = dtrace_gethrtime(); bzero(&stat, sizeof (stat)); mutex_enter(&dtrace_lock); if (state->dts_activity == DTRACE_ACTIVITY_INACTIVE) { mutex_exit(&dtrace_lock); return (ENOENT); } if (state->dts_activity == DTRACE_ACTIVITY_DRAINING) stat.dtst_exiting = 1; nerrs = state->dts_errors; dstate = &state->dts_vstate.dtvs_dynvars; for (i = 0; i < NCPU; i++) { dtrace_dstate_percpu_t *dcpu = &dstate->dtds_percpu[i]; stat.dtst_dyndrops += dcpu->dtdsc_drops; stat.dtst_dyndrops_dirty += dcpu->dtdsc_dirty_drops; stat.dtst_dyndrops_rinsing += dcpu->dtdsc_rinsing_drops; if (state->dts_buffer[i].dtb_flags & DTRACEBUF_FULL) stat.dtst_filled++; nerrs += state->dts_buffer[i].dtb_errors; for (j = 0; j < state->dts_nspeculations; j++) { dtrace_speculation_t *spec; dtrace_buffer_t *buf; spec = &state->dts_speculations[j]; buf = &spec->dtsp_buffer[i]; stat.dtst_specdrops += buf->dtb_xamot_drops; } } stat.dtst_specdrops_busy = state->dts_speculations_busy; stat.dtst_specdrops_unavail = state->dts_speculations_unavail; stat.dtst_stkstroverflows = state->dts_stkstroverflows; stat.dtst_dblerrors = state->dts_dblerrors; stat.dtst_killed = (state->dts_activity == DTRACE_ACTIVITY_KILLED); stat.dtst_errors = nerrs; mutex_exit(&dtrace_lock); if (copyout(&stat, (void *)arg, sizeof (stat)) != 0) return (EFAULT); return (0); } case DTRACEIOC_FORMAT: { dtrace_fmtdesc_t fmt; char *str; int len; if (copyin((void *)arg, &fmt, sizeof (fmt)) != 0) return (EFAULT); mutex_enter(&dtrace_lock); if (fmt.dtfd_format == 0 || fmt.dtfd_format > state->dts_nformats) { mutex_exit(&dtrace_lock); return (EINVAL); } /* * Format strings are allocated contiguously and they are * never freed; if a format index is less than the number * of formats, we can assert that the format map is non-NULL * and that the format for the specified index is non-NULL. */ ASSERT(state->dts_formats != NULL); str = state->dts_formats[fmt.dtfd_format - 1]; ASSERT(str != NULL); len = strlen(str) + 1; if (len > fmt.dtfd_length) { fmt.dtfd_length = len; if (copyout(&fmt, (void *)arg, sizeof (fmt)) != 0) { mutex_exit(&dtrace_lock); return (EINVAL); } } else { if (copyout(str, fmt.dtfd_string, len) != 0) { mutex_exit(&dtrace_lock); return (EINVAL); } } mutex_exit(&dtrace_lock); return (0); } default: break; } return (ENOTTY); } /*ARGSUSED*/ static int dtrace_detach(dev_info_t *dip, ddi_detach_cmd_t cmd) { dtrace_state_t *state; switch (cmd) { case DDI_DETACH: break; case DDI_SUSPEND: return (DDI_SUCCESS); default: return (DDI_FAILURE); } mutex_enter(&cpu_lock); mutex_enter(&dtrace_provider_lock); mutex_enter(&dtrace_lock); ASSERT(dtrace_opens == 0); if (dtrace_helpers > 0) { mutex_exit(&dtrace_provider_lock); mutex_exit(&dtrace_lock); mutex_exit(&cpu_lock); return (DDI_FAILURE); } if (dtrace_unregister((dtrace_provider_id_t)dtrace_provider) != 0) { mutex_exit(&dtrace_provider_lock); mutex_exit(&dtrace_lock); mutex_exit(&cpu_lock); return (DDI_FAILURE); } dtrace_provider = NULL; if ((state = dtrace_anon_grab()) != NULL) { /* * If there were ECBs on this state, the provider should * have not been allowed to detach; assert that there is * none. */ ASSERT(state->dts_necbs == 0); dtrace_state_destroy(state); /* * If we're being detached with anonymous state, we need to * indicate to the kernel debugger that DTrace is now inactive. 
*/ (void) kdi_dtrace_set(KDI_DTSET_DTRACE_DEACTIVATE); } bzero(&dtrace_anon, sizeof (dtrace_anon_t)); unregister_cpu_setup_func((cpu_setup_func_t *)dtrace_cpu_setup, NULL); dtrace_cpu_init = NULL; dtrace_helpers_cleanup = NULL; dtrace_helpers_fork = NULL; dtrace_cpustart_init = NULL; dtrace_cpustart_fini = NULL; dtrace_debugger_init = NULL; dtrace_debugger_fini = NULL; dtrace_modload = NULL; dtrace_modunload = NULL; ASSERT(dtrace_getf == 0); ASSERT(dtrace_closef == NULL); mutex_exit(&cpu_lock); kmem_free(dtrace_probes, dtrace_nprobes * sizeof (dtrace_probe_t *)); dtrace_probes = NULL; dtrace_nprobes = 0; dtrace_hash_destroy(dtrace_bymod); dtrace_hash_destroy(dtrace_byfunc); dtrace_hash_destroy(dtrace_byname); dtrace_bymod = NULL; dtrace_byfunc = NULL; dtrace_byname = NULL; kmem_cache_destroy(dtrace_state_cache); vmem_destroy(dtrace_minor); vmem_destroy(dtrace_arena); if (dtrace_toxrange != NULL) { kmem_free(dtrace_toxrange, dtrace_toxranges_max * sizeof (dtrace_toxrange_t)); dtrace_toxrange = NULL; dtrace_toxranges = 0; dtrace_toxranges_max = 0; } ddi_remove_minor_node(dtrace_devi, NULL); dtrace_devi = NULL; ddi_soft_state_fini(&dtrace_softstate); ASSERT(dtrace_vtime_references == 0); ASSERT(dtrace_opens == 0); ASSERT(dtrace_retained == NULL); mutex_exit(&dtrace_lock); mutex_exit(&dtrace_provider_lock); /* * We don't destroy the task queue until after we have dropped our * locks (taskq_destroy() may block on running tasks). To prevent * attempting to do work after we have effectively detached but before * the task queue has been destroyed, all tasks dispatched via the * task queue must check that DTrace is still attached before * performing any operation. */ taskq_destroy(dtrace_taskq); dtrace_taskq = NULL; return (DDI_SUCCESS); } #endif #ifdef illumos /*ARGSUSED*/ static int dtrace_info(dev_info_t *dip, ddi_info_cmd_t infocmd, void *arg, void **result) { int error; switch (infocmd) { case DDI_INFO_DEVT2DEVINFO: *result = (void *)dtrace_devi; error = DDI_SUCCESS; break; case DDI_INFO_DEVT2INSTANCE: *result = (void *)0; error = DDI_SUCCESS; break; default: error = DDI_FAILURE; } return (error); } #endif #ifdef illumos static struct cb_ops dtrace_cb_ops = { dtrace_open, /* open */ dtrace_close, /* close */ nulldev, /* strategy */ nulldev, /* print */ nodev, /* dump */ nodev, /* read */ nodev, /* write */ dtrace_ioctl, /* ioctl */ nodev, /* devmap */ nodev, /* mmap */ nodev, /* segmap */ nochpoll, /* poll */ ddi_prop_op, /* cb_prop_op */ 0, /* streamtab */ D_NEW | D_MP /* Driver compatibility flag */ }; static struct dev_ops dtrace_ops = { DEVO_REV, /* devo_rev */ 0, /* refcnt */ dtrace_info, /* get_dev_info */ nulldev, /* identify */ nulldev, /* probe */ dtrace_attach, /* attach */ dtrace_detach, /* detach */ nodev, /* reset */ &dtrace_cb_ops, /* driver operations */ NULL, /* bus operations */ nodev /* dev power */ }; static struct modldrv modldrv = { &mod_driverops, /* module type (this is a pseudo driver) */ "Dynamic Tracing", /* name of module */ &dtrace_ops, /* driver ops */ }; static struct modlinkage modlinkage = { MODREV_1, (void *)&modldrv, NULL }; int _init(void) { return (mod_install(&modlinkage)); } int _info(struct modinfo *modinfop) { return (mod_info(&modlinkage, modinfop)); } int _fini(void) { return (mod_remove(&modlinkage)); } #else static d_ioctl_t dtrace_ioctl; static d_ioctl_t dtrace_ioctl_helper; static void dtrace_load(void *); static int dtrace_unload(void); static struct cdev *dtrace_dev; static struct cdev *helper_dev; void dtrace_invop_init(void); void 
dtrace_invop_uninit(void); static struct cdevsw dtrace_cdevsw = { .d_version = D_VERSION, .d_ioctl = dtrace_ioctl, .d_open = dtrace_open, .d_name = "dtrace", }; static struct cdevsw helper_cdevsw = { .d_version = D_VERSION, .d_ioctl = dtrace_ioctl_helper, .d_name = "helper", }; #include #include #include #include #include #include #include #include #include SYSINIT(dtrace_load, SI_SUB_DTRACE, SI_ORDER_FIRST, dtrace_load, NULL); SYSUNINIT(dtrace_unload, SI_SUB_DTRACE, SI_ORDER_FIRST, dtrace_unload, NULL); SYSINIT(dtrace_anon_init, SI_SUB_DTRACE_ANON, SI_ORDER_FIRST, dtrace_anon_init, NULL); DEV_MODULE(dtrace, dtrace_modevent, NULL); MODULE_VERSION(dtrace, 1); MODULE_DEPEND(dtrace, opensolaris, 1, 1, 1); #endif Index: user/markj/netdump/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c =================================================================== --- user/markj/netdump/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (revision 332407) +++ user/markj/netdump/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (revision 332408) @@ -1,7933 +1,7933 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2018, Joyent, Inc. * Copyright (c) 2011, 2017 by Delphix. All rights reserved. * Copyright (c) 2014 by Saso Kiselkov. All rights reserved. * Copyright 2017 Nexenta Systems, Inc. All rights reserved. */ /* * DVA-based Adjustable Replacement Cache * * While much of the theory of operation used here is * based on the self-tuning, low overhead replacement cache * presented by Megiddo and Modha at FAST 2003, there are some * significant differences: * * 1. The Megiddo and Modha model assumes any page is evictable. * Pages in its cache cannot be "locked" into memory. This makes * the eviction algorithm simple: evict the last page in the list. * This also makes the performance characteristics easy to reason * about. Our cache is not so simple. At any given moment, some * subset of the blocks in the cache are un-evictable because we * have handed out a reference to them. Blocks are only evictable * when there are no external references active. This makes * eviction far more problematic: we choose to evict the evictable * blocks that are the "lowest" in the list. * * There are times when it is not possible to evict the requested * space. In these circumstances we are unable to adjust the cache * size. To prevent the cache growing unbounded at these times we * implement a "cache throttle" that slows the flow of new data * into the cache until we can make space available. * * 2. The Megiddo and Modha model assumes a fixed cache size. * Pages are evicted when the cache is full and there is a cache * miss. Our model has a variable sized cache.
It grows with * high use, but also tries to react to memory pressure from the * operating system: decreasing its size when system memory is * tight. * * 3. The Megiddo and Modha model assumes a fixed page size. All * elements of the cache are therefore exactly the same size. So * when adjusting the cache size following a cache miss, it's simply * a matter of choosing a single page to evict. In our model, we * have variable sized cache blocks (ranging from 512 bytes to * 128K bytes). We therefore choose a set of blocks to evict to make * space for a cache miss that approximates as closely as possible * the space used by the new block. * * See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache" * by N. Megiddo & D. Modha, FAST 2003 */ /* * The locking model: * * A new reference to a cache buffer can be obtained in two * ways: 1) via a hash table lookup using the DVA as a key, * or 2) via one of the ARC lists. The arc_read() interface * uses method 1, while the internal ARC algorithms for * adjusting the cache use method 2. We therefore provide two * types of locks: 1) the hash table lock array, and 2) the * ARC list locks. * * Buffers do not have their own mutexes, rather they rely on the * hash table mutexes for the bulk of their protection (i.e. most * fields in the arc_buf_hdr_t are protected by these mutexes). * * buf_hash_find() returns the appropriate mutex (held) when it * locates the requested buffer in the hash table. It returns * NULL for the mutex if the buffer was not in the table. * * buf_hash_remove() expects the appropriate hash mutex to be * already held before it is invoked. * * Each ARC state also has a mutex which is used to protect the * buffer list associated with the state. When attempting to * obtain a hash table lock while holding an ARC list lock you * must use: mutex_tryenter() to avoid deadlock. Also note that * the active state mutex must be held before the ghost state mutex. * * Note that the majority of the performance stats are manipulated * with atomic operations. * * The L2ARC uses the l2ad_mtx on each vdev for the following: * * - L2ARC buflist creation * - L2ARC buflist eviction * - L2ARC write completion, which walks L2ARC buflists * - ARC header destruction, as it removes from L2ARC buflists * - ARC header release, as it removes from L2ARC buflists */ /* * ARC operation: * * Every block that is in the ARC is tracked by an arc_buf_hdr_t structure. * This structure can point either to a block that is still in the cache or to * one that is only accessible in an L2 ARC device, or it can provide * information about a block that was recently evicted. If a block is * only accessible in the L2ARC, then the arc_buf_hdr_t only has enough * information to retrieve it from the L2ARC device. This information is * stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t. A block * that is in this state cannot access the data directly. * * Blocks that are actively being referenced or have not been evicted * are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within * the arc_buf_hdr_t that will point to the data block in memory. A block can * only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC * caches data in two ways -- in a list of ARC buffers (arc_buf_t) and * also in the arc_buf_hdr_t's private physical data block pointer (b_pabd). * * The L1ARC's data pointer may or may not be uncompressed. The ARC has the * ability to store the physical data (b_pabd) associated with the DVA of the * arc_buf_hdr_t.
Since the b_pabd is a copy of the on-disk physical block, * it will match its on-disk compression characteristics. This behavior can be * disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the * compressed ARC functionality is disabled, the b_pabd will point to an * uncompressed version of the on-disk data. * * Data in the L1ARC is not accessed by consumers of the ARC directly. Each * arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it. * Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC * consumer. The ARC will provide references to this data and will keep it * cached until it is no longer in use. The ARC caches only the L1ARC's physical * data block and will evict any arc_buf_t that is no longer referenced. The * amount of memory consumed by the arc_buf_ts' data buffers can be seen via the * "overhead_size" kstat. * * Depending on the consumer, an arc_buf_t can be requested in uncompressed or * compressed form. The typical case is that consumers will want uncompressed * data, and when that happens a new data buffer is allocated where the data is * decompressed for them to use. Currently the only consumer who wants * compressed arc_buf_t's is "zfs send", when it streams data exactly as it * exists on disk. When this happens, the arc_buf_t's data buffer is shared * with the arc_buf_hdr_t. * * Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The * first one is owned by a compressed send consumer (and therefore references * the same compressed data buffer as the arc_buf_hdr_t) and the second could be * used by any other consumer (and has its own uncompressed copy of the data * buffer). * * arc_buf_hdr_t * +-----------+ * | fields | * | common to | * | L1- and | * | L2ARC | * +-----------+ * | l2arc_buf_hdr_t * | | * +-----------+ * | l1arc_buf_hdr_t * | | arc_buf_t * | b_buf +------------>+-----------+ arc_buf_t * | b_pabd +-+ |b_next +---->+-----------+ * +-----------+ | |-----------| |b_next +-->NULL * | |b_comp = T | +-----------+ * | |b_data +-+ |b_comp = F | * | +-----------+ | |b_data +-+ * +->+------+ | +-----------+ | * compressed | | | | * data | |<--------------+ | uncompressed * +------+ compressed, | data * shared +-->+------+ * data | | * | | * +------+ * * When a consumer reads a block, the ARC must first look to see if the * arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new * arc_buf_t and either copies uncompressed data into a new data buffer from an * existing uncompressed arc_buf_t, decompresses the hdr's b_pabd buffer into a * new data buffer, or shares the hdr's b_pabd buffer, depending on whether the * hdr is compressed and the desired compression characteristics of the * arc_buf_t consumer. If the arc_buf_t ends up sharing data with the * arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be * the last buffer in the hdr's b_buf list, however a shared compressed buf can * be anywhere in the hdr's list. 
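 * As a concrete (illustrative) example: with compressed ARC enabled, a
 * 128K logical block that compresses to 40K on disk has a 40K b_pabd.
 * A "zfs send" consumer can share that 40K buffer outright, while an
 * ordinary reader gets its own 128K arc_buf_t of decompressed data;
 * only an uncompressed shared buf is constrained to sit last on the
 * b_buf list.
 *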
* * The diagram below shows an example of an uncompressed ARC hdr that is * sharing its data with an arc_buf_t (note that the shared uncompressed buf is * the last element in the buf list): * * arc_buf_hdr_t * +-----------+ * | | * | | * | | * +-----------+ * l2arc_buf_hdr_t| | * | | * +-----------+ * l1arc_buf_hdr_t| | * | | arc_buf_t (shared) * | b_buf +------------>+---------+ arc_buf_t * | | |b_next +---->+---------+ * | b_pabd +-+ |---------| |b_next +-->NULL * +-----------+ | | | +---------+ * | |b_data +-+ | | * | +---------+ | |b_data +-+ * +->+------+ | +---------+ | * | | | | * uncompressed | | | | * data +------+ | | * ^ +->+------+ | * | uncompressed | | | * | data | | | * | +------+ | * +---------------------------------+ * * Writing to the ARC requires that the ARC first discard the hdr's b_pabd * since the physical block is about to be rewritten. The new data contents * will be contained in the arc_buf_t. As the I/O pipeline performs the write, * it may compress the data before writing it to disk. The ARC will be called * with the transformed data and will bcopy the transformed on-disk block into * a newly allocated b_pabd. Writes are always done into buffers which have * either been loaned (and hence are new and don't have other readers) or * buffers which have been released (and hence have their own hdr, if there * were originally other readers of the buf's original hdr). This ensures that * the ARC only needs to update a single buf and its hdr after a write occurs. * * When the L2ARC is in use, it will also take advantage of the b_pabd. The * L2ARC will always write the contents of b_pabd to the L2ARC. This means * that when compressed ARC is enabled, the L2ARC blocks are identical * to the on-disk block in the main data pool. This provides a significant * advantage since the ARC can leverage the bp's checksum when reading from the * L2ARC to determine if the contents are valid. However, if the compressed * ARC is disabled, then the L2ARC's block must be transformed to look * like the physical block in the main data pool before comparing the * checksum and determining its validity. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef _KERNEL #include #include #endif #include #include #include #include #include #include #include #include #ifdef illumos #ifndef _KERNEL /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */ boolean_t arc_watch = B_FALSE; int arc_procfd; #endif #endif /* illumos */ static kmutex_t arc_reclaim_lock; static kcondvar_t arc_reclaim_thread_cv; static boolean_t arc_reclaim_thread_exit; static kcondvar_t arc_reclaim_waiters_cv; static kmutex_t arc_dnlc_evicts_lock; static kcondvar_t arc_dnlc_evicts_cv; static boolean_t arc_dnlc_evicts_thread_exit; uint_t arc_reduce_dnlc_percent = 3; /* * The number of headers to evict in arc_evict_state_impl() before * dropping the sublist lock and evicting from another sublist. A lower * value means we're more likely to evict the "correct" header (i.e. the * oldest header in the arc state), but comes with higher overhead * (i.e. more invocations of arc_evict_state_impl()).
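 * For example, with the default value of 10 below, up to ten headers are
 * evicted under one sublist lock before the lock is dropped and eviction
 * moves on to another sublist, trading strict eviction order for fewer
 * lock acquisitions.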
*/ int zfs_arc_evict_batch_limit = 10; /* number of seconds before growing cache again */ static int arc_grow_retry = 60; /* number of milliseconds before attempting a kmem-cache-reap */ static int arc_kmem_cache_reap_retry_ms = 1000; /* shift of arc_c for calculating overflow limit in arc_get_data_impl */ int zfs_arc_overflow_shift = 8; /* shift of arc_c for calculating both min and max arc_p */ static int arc_p_min_shift = 4; /* log2(fraction of arc to reclaim) */ static int arc_shrink_shift = 7; /* * log2(fraction of ARC which must be free to allow growing). * I.e. If there is less than arc_c >> arc_no_grow_shift free memory, * when reading a new block into the ARC, we will evict an equal-sized block * from the ARC. * * This must be less than arc_shrink_shift, so that when we shrink the ARC, * we will still not allow it to grow. */ int arc_no_grow_shift = 5; /* * minimum lifespan of a prefetch block in clock ticks * (initialized in arc_init()) */ static int arc_min_prefetch_lifespan; /* * If this percent of memory is free, don't throttle. */ int arc_lotsfree_percent = 10; static int arc_dead; extern boolean_t zfs_prefetch_disable; /* * The arc has filled available memory and has now warmed up. */ static boolean_t arc_warm; /* * log2 fraction of the zio arena to keep free. */ int arc_zio_arena_free_shift = 2; /* * These tunables are for performance analysis. */ uint64_t zfs_arc_max; uint64_t zfs_arc_min; uint64_t zfs_arc_meta_limit = 0; uint64_t zfs_arc_meta_min = 0; int zfs_arc_grow_retry = 0; int zfs_arc_shrink_shift = 0; int zfs_arc_no_grow_shift = 0; int zfs_arc_p_min_shift = 0; uint64_t zfs_arc_average_blocksize = 8 * 1024; /* 8KB */ u_int zfs_arc_free_target = 0; /* Absolute min for arc min / max is 16MB. */ static uint64_t arc_abs_min = 16 << 20; boolean_t zfs_compressed_arc_enabled = B_TRUE; static int sysctl_vfs_zfs_arc_free_target(SYSCTL_HANDLER_ARGS); static int sysctl_vfs_zfs_arc_meta_limit(SYSCTL_HANDLER_ARGS); static int sysctl_vfs_zfs_arc_max(SYSCTL_HANDLER_ARGS); static int sysctl_vfs_zfs_arc_min(SYSCTL_HANDLER_ARGS); static int sysctl_vfs_zfs_arc_no_grow_shift(SYSCTL_HANDLER_ARGS); #if defined(__FreeBSD__) && defined(_KERNEL) static void arc_free_target_init(void *unused __unused) { - zfs_arc_free_target = (vm_cnt.v_free_min / 10) * 11; + zfs_arc_free_target = vm_cnt.v_free_target; } SYSINIT(arc_free_target_init, SI_SUB_KTHREAD_PAGE, SI_ORDER_ANY, arc_free_target_init, NULL); TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit); TUNABLE_QUAD("vfs.zfs.arc_meta_min", &zfs_arc_meta_min); TUNABLE_INT("vfs.zfs.arc_shrink_shift", &zfs_arc_shrink_shift); TUNABLE_INT("vfs.zfs.arc_grow_retry", &zfs_arc_grow_retry); TUNABLE_INT("vfs.zfs.arc_no_grow_shift", &zfs_arc_no_grow_shift); SYSCTL_DECL(_vfs_zfs); SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_max, CTLTYPE_U64 | CTLFLAG_RWTUN, 0, sizeof(uint64_t), sysctl_vfs_zfs_arc_max, "QU", "Maximum ARC size"); SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_min, CTLTYPE_U64 | CTLFLAG_RWTUN, 0, sizeof(uint64_t), sysctl_vfs_zfs_arc_min, "QU", "Minimum ARC size"); SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_no_grow_shift, CTLTYPE_U32 | CTLFLAG_RWTUN, 0, sizeof(uint32_t), sysctl_vfs_zfs_arc_no_grow_shift, "U", "log2(fraction of ARC which must be free to allow growing)"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_average_blocksize, CTLFLAG_RDTUN, &zfs_arc_average_blocksize, 0, "ARC average blocksize"); SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_shrink_shift, CTLFLAG_RW, &arc_shrink_shift, 0, "log2(fraction of arc to reclaim)"); SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_grow_retry, 
CTLFLAG_RW, &arc_grow_retry, 0, "Wait in seconds before considering growing ARC"); SYSCTL_INT(_vfs_zfs, OID_AUTO, compressed_arc_enabled, CTLFLAG_RDTUN, &zfs_compressed_arc_enabled, 0, "Enable compressed ARC"); /* * We don't have a tunable for arc_free_target due to the dependency on * pagedaemon initialisation. */ SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_free_target, CTLTYPE_UINT | CTLFLAG_MPSAFE | CTLFLAG_RW, 0, sizeof(u_int), sysctl_vfs_zfs_arc_free_target, "IU", "Desired number of free pages below which ARC triggers reclaim"); static int sysctl_vfs_zfs_arc_free_target(SYSCTL_HANDLER_ARGS) { u_int val; int err; val = zfs_arc_free_target; err = sysctl_handle_int(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (val < minfree) return (EINVAL); if (val > vm_cnt.v_page_count) return (EINVAL); zfs_arc_free_target = val; return (0); } /* * Must be declared here, before the definition of the corresponding kstat * macro; that macro uses the same names and would otherwise confuse the * compiler. */ SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_meta_limit, CTLTYPE_U64 | CTLFLAG_MPSAFE | CTLFLAG_RW, 0, sizeof(uint64_t), sysctl_vfs_zfs_arc_meta_limit, "QU", "ARC metadata limit"); #endif /* * Note that buffers can be in one of 6 states: * ARC_anon - anonymous (discussed below) * ARC_mru - recently used, currently cached * ARC_mru_ghost - recently used, no longer in cache * ARC_mfu - frequently used, currently cached * ARC_mfu_ghost - frequently used, no longer in cache * ARC_l2c_only - exists in L2ARC but not other states * When there are no active references to the buffer, they are * linked onto a list in one of these arc states. These are * the only buffers that can be evicted or deleted. Within each * state there are multiple lists, one for meta-data and one for * non-meta-data. Meta-data (indirect blocks, blocks of dnodes, * etc.) is tracked separately so that it can be managed more * explicitly: favored over data, limited explicitly. * * Anonymous buffers are buffers that are not associated with * a DVA. These are buffers that hold dirty block copies * before they are written to stable storage. By definition, * they are "ref'd" and are considered part of arc_mru * that cannot be freed. Generally, they will acquire a DVA * as they are written and migrate onto the arc_mru list. * * The ARC_l2c_only state is for buffers that are in the second * level ARC but no longer in any of the ARC_m* lists. The second * level ARC itself may also contain buffers that are in any of * the ARC_m* states - meaning that a buffer can exist in two * places. The reason for the ARC_l2c_only state is to keep the * buffer header in the hash table, so that reads that hit the * second level ARC benefit from these fast lookups. */ typedef struct arc_state { /* * list of evictable buffers */ multilist_t *arcs_list[ARC_BUFC_NUMTYPES]; /* * total amount of evictable data in this state */ refcount_t arcs_esize[ARC_BUFC_NUMTYPES]; /* * total amount of data in this state; this includes: evictable, * non-evictable, ARC_BUFC_DATA, and ARC_BUFC_METADATA.
*/ refcount_t arcs_size; } arc_state_t; /* The 6 states: */ static arc_state_t ARC_anon; static arc_state_t ARC_mru; static arc_state_t ARC_mru_ghost; static arc_state_t ARC_mfu; static arc_state_t ARC_mfu_ghost; static arc_state_t ARC_l2c_only; typedef struct arc_stats { kstat_named_t arcstat_hits; kstat_named_t arcstat_misses; kstat_named_t arcstat_demand_data_hits; kstat_named_t arcstat_demand_data_misses; kstat_named_t arcstat_demand_metadata_hits; kstat_named_t arcstat_demand_metadata_misses; kstat_named_t arcstat_prefetch_data_hits; kstat_named_t arcstat_prefetch_data_misses; kstat_named_t arcstat_prefetch_metadata_hits; kstat_named_t arcstat_prefetch_metadata_misses; kstat_named_t arcstat_mru_hits; kstat_named_t arcstat_mru_ghost_hits; kstat_named_t arcstat_mfu_hits; kstat_named_t arcstat_mfu_ghost_hits; kstat_named_t arcstat_allocated; kstat_named_t arcstat_deleted; /* * Number of buffers that could not be evicted because the hash lock * was held by another thread. The lock may not necessarily be held * by something using the same buffer, since hash locks are shared * by multiple buffers. */ kstat_named_t arcstat_mutex_miss; /* * Number of buffers skipped because they have I/O in progress, are * indirect prefetch buffers that have not lived long enough, or are * not from the spa we're trying to evict from. */ kstat_named_t arcstat_evict_skip; /* * Number of times arc_evict_state() was unable to evict enough * buffers to reach its target amount. */ kstat_named_t arcstat_evict_not_enough; kstat_named_t arcstat_evict_l2_cached; kstat_named_t arcstat_evict_l2_eligible; kstat_named_t arcstat_evict_l2_ineligible; kstat_named_t arcstat_evict_l2_skip; kstat_named_t arcstat_hash_elements; kstat_named_t arcstat_hash_elements_max; kstat_named_t arcstat_hash_collisions; kstat_named_t arcstat_hash_chains; kstat_named_t arcstat_hash_chain_max; kstat_named_t arcstat_p; kstat_named_t arcstat_c; kstat_named_t arcstat_c_min; kstat_named_t arcstat_c_max; /* Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_size; /* * Number of compressed bytes stored in the arc_buf_hdr_t's b_pabd. * Note that the compressed bytes may match the uncompressed bytes * if the block is either not compressed or compressed arc is disabled. */ kstat_named_t arcstat_compressed_size; /* * Uncompressed size of the data stored in b_pabd. If compressed * arc is disabled then this value will be identical to the stat * above. */ kstat_named_t arcstat_uncompressed_size; /* * Number of bytes stored in all the arc_buf_t's. This is classified * as "overhead" since this data is typically short-lived and will * be evicted from the arc when it becomes unreferenced unless the * zfs_keep_uncompressed_metadata or zfs_keep_uncompressed_level * values have been set (see comment in dbuf.c for more information). */ kstat_named_t arcstat_overhead_size; /* * Number of bytes consumed by internal ARC structures necessary * for tracking purposes; these structures are not actually * backed by ARC buffers. This includes arc_buf_hdr_t structures * (allocated via arc_buf_hdr_t_full and arc_buf_hdr_t_l2only * caches), and arc_buf_t structures (allocated via arc_buf_t * cache). * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_hdr_size; /* * Number of bytes consumed by ARC buffers of type equal to * ARC_BUFC_DATA. This is generally consumed by buffers backing * on disk user data (e.g. plain file contents). * Not updated directly; only synced in arc_kstat_update.
*/ kstat_named_t arcstat_data_size; /* * Number of bytes consumed by ARC buffers of type equal to * ARC_BUFC_METADATA. This is generally consumed by buffers * backing on disk data that is used for internal ZFS * structures (e.g. ZAP, dnode, indirect blocks, etc). * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_metadata_size; /* * Number of bytes consumed by various buffers and structures * not actually backed with ARC buffers. This includes bonus * buffers (allocated directly via zio_buf_* functions), * dmu_buf_impl_t structures (allocated via dmu_buf_impl_t * cache), and dnode_t structures (allocated via dnode_t cache). * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_other_size; /* * Total number of bytes consumed by ARC buffers residing in the * arc_anon state. This includes *all* buffers in the arc_anon * state; e.g. data, metadata, evictable, and unevictable buffers * are all included in this value. * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_anon_size; /* * Number of bytes consumed by ARC buffers that meet the * following criteria: backing buffers of type ARC_BUFC_DATA, * residing in the arc_anon state, and are eligible for eviction * (e.g. have no outstanding holds on the buffer). * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_anon_evictable_data; /* * Number of bytes consumed by ARC buffers that meet the * following criteria: backing buffers of type ARC_BUFC_METADATA, * residing in the arc_anon state, and are eligible for eviction * (e.g. have no outstanding holds on the buffer). * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_anon_evictable_metadata; /* * Total number of bytes consumed by ARC buffers residing in the * arc_mru state. This includes *all* buffers in the arc_mru * state; e.g. data, metadata, evictable, and unevictable buffers * are all included in this value. * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_size; /* * Number of bytes consumed by ARC buffers that meet the * following criteria: backing buffers of type ARC_BUFC_DATA, * residing in the arc_mru state, and are eligible for eviction * (e.g. have no outstanding holds on the buffer). * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_evictable_data; /* * Number of bytes consumed by ARC buffers that meet the * following criteria: backing buffers of type ARC_BUFC_METADATA, * residing in the arc_mru state, and are eligible for eviction * (e.g. have no outstanding holds on the buffer). * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_evictable_metadata; /* * Total number of bytes that *would have been* consumed by ARC * buffers in the arc_mru_ghost state. The key thing to note * here, is the fact that this size doesn't actually indicate * RAM consumption. The ghost lists only consist of headers and * don't actually have ARC buffers linked off of these headers. * Thus, *if* the headers had associated ARC buffers, these * buffers *would have* consumed this number of bytes. * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_ghost_size; /* * Number of bytes that *would have been* consumed by ARC * buffers that are eligible for eviction, of type * ARC_BUFC_DATA, and linked off the arc_mru_ghost state. * Not updated directly; only synced in arc_kstat_update. 
*/ kstat_named_t arcstat_mru_ghost_evictable_data; /* * Number of bytes that *would have been* consumed by ARC * buffers that are eligible for eviction, of type * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state. * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_ghost_evictable_metadata; /* * Total number of bytes consumed by ARC buffers residing in the * arc_mfu state. This includes *all* buffers in the arc_mfu * state; e.g. data, metadata, evictable, and unevictable buffers * are all included in this value. * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_size; /* * Number of bytes consumed by ARC buffers that are eligible for * eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu * state. * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_evictable_data; /* * Number of bytes consumed by ARC buffers that are eligible for * eviction, of type ARC_BUFC_METADATA, and reside in the * arc_mfu state. * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_evictable_metadata; /* * Total number of bytes that *would have been* consumed by ARC * buffers in the arc_mfu_ghost state. See the comment above * arcstat_mru_ghost_size for more details. * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_ghost_size; /* * Number of bytes that *would have been* consumed by ARC * buffers that are eligible for eviction, of type * ARC_BUFC_DATA, and linked off the arc_mfu_ghost state. * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_ghost_evictable_data; /* * Number of bytes that *would have been* consumed by ARC * buffers that are eligible for eviction, of type * ARC_BUFC_METADATA, and linked off the arc_mfu_ghost state. * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_ghost_evictable_metadata; kstat_named_t arcstat_l2_hits; kstat_named_t arcstat_l2_misses; kstat_named_t arcstat_l2_feeds; kstat_named_t arcstat_l2_rw_clash; kstat_named_t arcstat_l2_read_bytes; kstat_named_t arcstat_l2_write_bytes; kstat_named_t arcstat_l2_writes_sent; kstat_named_t arcstat_l2_writes_done; kstat_named_t arcstat_l2_writes_error; kstat_named_t arcstat_l2_writes_lock_retry; kstat_named_t arcstat_l2_evict_lock_retry; kstat_named_t arcstat_l2_evict_reading; kstat_named_t arcstat_l2_evict_l1cached; kstat_named_t arcstat_l2_free_on_write; kstat_named_t arcstat_l2_abort_lowmem; kstat_named_t arcstat_l2_cksum_bad; kstat_named_t arcstat_l2_io_error; kstat_named_t arcstat_l2_lsize; kstat_named_t arcstat_l2_psize; /* Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_l2_hdr_size; kstat_named_t arcstat_l2_write_trylock_fail; kstat_named_t arcstat_l2_write_passed_headroom; kstat_named_t arcstat_l2_write_spa_mismatch; kstat_named_t arcstat_l2_write_in_l2; kstat_named_t arcstat_l2_write_hdr_io_in_progress; kstat_named_t arcstat_l2_write_not_cacheable; kstat_named_t arcstat_l2_write_full; kstat_named_t arcstat_l2_write_buffer_iter; kstat_named_t arcstat_l2_write_pios; kstat_named_t arcstat_l2_write_buffer_bytes_scanned; kstat_named_t arcstat_l2_write_buffer_list_iter; kstat_named_t arcstat_l2_write_buffer_list_null_iter; kstat_named_t arcstat_memory_throttle_count; /* Not updated directly; only synced in arc_kstat_update.
*/ kstat_named_t arcstat_meta_used; kstat_named_t arcstat_meta_limit; kstat_named_t arcstat_meta_max; kstat_named_t arcstat_meta_min; kstat_named_t arcstat_sync_wait_for_async; kstat_named_t arcstat_demand_hit_predictive_prefetch; } arc_stats_t; static arc_stats_t arc_stats = { { "hits", KSTAT_DATA_UINT64 }, { "misses", KSTAT_DATA_UINT64 }, { "demand_data_hits", KSTAT_DATA_UINT64 }, { "demand_data_misses", KSTAT_DATA_UINT64 }, { "demand_metadata_hits", KSTAT_DATA_UINT64 }, { "demand_metadata_misses", KSTAT_DATA_UINT64 }, { "prefetch_data_hits", KSTAT_DATA_UINT64 }, { "prefetch_data_misses", KSTAT_DATA_UINT64 }, { "prefetch_metadata_hits", KSTAT_DATA_UINT64 }, { "prefetch_metadata_misses", KSTAT_DATA_UINT64 }, { "mru_hits", KSTAT_DATA_UINT64 }, { "mru_ghost_hits", KSTAT_DATA_UINT64 }, { "mfu_hits", KSTAT_DATA_UINT64 }, { "mfu_ghost_hits", KSTAT_DATA_UINT64 }, { "allocated", KSTAT_DATA_UINT64 }, { "deleted", KSTAT_DATA_UINT64 }, { "mutex_miss", KSTAT_DATA_UINT64 }, { "evict_skip", KSTAT_DATA_UINT64 }, { "evict_not_enough", KSTAT_DATA_UINT64 }, { "evict_l2_cached", KSTAT_DATA_UINT64 }, { "evict_l2_eligible", KSTAT_DATA_UINT64 }, { "evict_l2_ineligible", KSTAT_DATA_UINT64 }, { "evict_l2_skip", KSTAT_DATA_UINT64 }, { "hash_elements", KSTAT_DATA_UINT64 }, { "hash_elements_max", KSTAT_DATA_UINT64 }, { "hash_collisions", KSTAT_DATA_UINT64 }, { "hash_chains", KSTAT_DATA_UINT64 }, { "hash_chain_max", KSTAT_DATA_UINT64 }, { "p", KSTAT_DATA_UINT64 }, { "c", KSTAT_DATA_UINT64 }, { "c_min", KSTAT_DATA_UINT64 }, { "c_max", KSTAT_DATA_UINT64 }, { "size", KSTAT_DATA_UINT64 }, { "compressed_size", KSTAT_DATA_UINT64 }, { "uncompressed_size", KSTAT_DATA_UINT64 }, { "overhead_size", KSTAT_DATA_UINT64 }, { "hdr_size", KSTAT_DATA_UINT64 }, { "data_size", KSTAT_DATA_UINT64 }, { "metadata_size", KSTAT_DATA_UINT64 }, { "other_size", KSTAT_DATA_UINT64 }, { "anon_size", KSTAT_DATA_UINT64 }, { "anon_evictable_data", KSTAT_DATA_UINT64 }, { "anon_evictable_metadata", KSTAT_DATA_UINT64 }, { "mru_size", KSTAT_DATA_UINT64 }, { "mru_evictable_data", KSTAT_DATA_UINT64 }, { "mru_evictable_metadata", KSTAT_DATA_UINT64 }, { "mru_ghost_size", KSTAT_DATA_UINT64 }, { "mru_ghost_evictable_data", KSTAT_DATA_UINT64 }, { "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 }, { "mfu_size", KSTAT_DATA_UINT64 }, { "mfu_evictable_data", KSTAT_DATA_UINT64 }, { "mfu_evictable_metadata", KSTAT_DATA_UINT64 }, { "mfu_ghost_size", KSTAT_DATA_UINT64 }, { "mfu_ghost_evictable_data", KSTAT_DATA_UINT64 }, { "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 }, { "l2_hits", KSTAT_DATA_UINT64 }, { "l2_misses", KSTAT_DATA_UINT64 }, { "l2_feeds", KSTAT_DATA_UINT64 }, { "l2_rw_clash", KSTAT_DATA_UINT64 }, { "l2_read_bytes", KSTAT_DATA_UINT64 }, { "l2_write_bytes", KSTAT_DATA_UINT64 }, { "l2_writes_sent", KSTAT_DATA_UINT64 }, { "l2_writes_done", KSTAT_DATA_UINT64 }, { "l2_writes_error", KSTAT_DATA_UINT64 }, { "l2_writes_lock_retry", KSTAT_DATA_UINT64 }, { "l2_evict_lock_retry", KSTAT_DATA_UINT64 }, { "l2_evict_reading", KSTAT_DATA_UINT64 }, { "l2_evict_l1cached", KSTAT_DATA_UINT64 }, { "l2_free_on_write", KSTAT_DATA_UINT64 }, { "l2_abort_lowmem", KSTAT_DATA_UINT64 }, { "l2_cksum_bad", KSTAT_DATA_UINT64 }, { "l2_io_error", KSTAT_DATA_UINT64 }, { "l2_size", KSTAT_DATA_UINT64 }, { "l2_asize", KSTAT_DATA_UINT64 }, { "l2_hdr_size", KSTAT_DATA_UINT64 }, { "l2_write_trylock_fail", KSTAT_DATA_UINT64 }, { "l2_write_passed_headroom", KSTAT_DATA_UINT64 }, { "l2_write_spa_mismatch", KSTAT_DATA_UINT64 }, { "l2_write_in_l2", KSTAT_DATA_UINT64 }, { 
"l2_write_io_in_progress", KSTAT_DATA_UINT64 }, { "l2_write_not_cacheable", KSTAT_DATA_UINT64 }, { "l2_write_full", KSTAT_DATA_UINT64 }, { "l2_write_buffer_iter", KSTAT_DATA_UINT64 }, { "l2_write_pios", KSTAT_DATA_UINT64 }, { "l2_write_buffer_bytes_scanned", KSTAT_DATA_UINT64 }, { "l2_write_buffer_list_iter", KSTAT_DATA_UINT64 }, { "l2_write_buffer_list_null_iter", KSTAT_DATA_UINT64 }, { "memory_throttle_count", KSTAT_DATA_UINT64 }, { "arc_meta_used", KSTAT_DATA_UINT64 }, { "arc_meta_limit", KSTAT_DATA_UINT64 }, { "arc_meta_max", KSTAT_DATA_UINT64 }, { "arc_meta_min", KSTAT_DATA_UINT64 }, { "sync_wait_for_async", KSTAT_DATA_UINT64 }, { "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 }, }; #define ARCSTAT(stat) (arc_stats.stat.value.ui64) #define ARCSTAT_INCR(stat, val) \ atomic_add_64(&arc_stats.stat.value.ui64, (val)) #define ARCSTAT_BUMP(stat) ARCSTAT_INCR(stat, 1) #define ARCSTAT_BUMPDOWN(stat) ARCSTAT_INCR(stat, -1) #define ARCSTAT_MAX(stat, val) { \ uint64_t m; \ while ((val) > (m = arc_stats.stat.value.ui64) && \ (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \ continue; \ } #define ARCSTAT_MAXSTAT(stat) \ ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64) /* * We define a macro to allow ARC hits/misses to be easily broken down by * two separate conditions, giving a total of four different subtypes for * each of hits and misses (so eight statistics total). */ #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \ if (cond1) { \ if (cond2) { \ ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \ } else { \ ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \ } \ } else { \ if (cond2) { \ ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \ } else { \ ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\ } \ } kstat_t *arc_ksp; static arc_state_t *arc_anon; static arc_state_t *arc_mru; static arc_state_t *arc_mru_ghost; static arc_state_t *arc_mfu; static arc_state_t *arc_mfu_ghost; static arc_state_t *arc_l2c_only; /* * There are several ARC variables that are critical to export as kstats -- * but we don't want to have to grovel around in the kstat whenever we wish to * manipulate them. For these variables, we therefore define them to be in * terms of the statistic variable. This assures that we are not introducing * the possibility of inconsistency by having shadow copies of the variables, * while still allowing the code to be readable. */ #define arc_p ARCSTAT(arcstat_p) /* target size of MRU */ #define arc_c ARCSTAT(arcstat_c) /* target size of cache */ #define arc_c_min ARCSTAT(arcstat_c_min) /* min target cache size */ #define arc_c_max ARCSTAT(arcstat_c_max) /* max target cache size */ #define arc_meta_limit ARCSTAT(arcstat_meta_limit) /* max size for metadata */ #define arc_meta_min ARCSTAT(arcstat_meta_min) /* min size for metadata */ #define arc_meta_max ARCSTAT(arcstat_meta_max) /* max size of metadata */ /* compressed size of entire arc */ #define arc_compressed_size ARCSTAT(arcstat_compressed_size) /* uncompressed size of entire arc */ #define arc_uncompressed_size ARCSTAT(arcstat_uncompressed_size) /* number of bytes in the arc from arc_buf_t's */ #define arc_overhead_size ARCSTAT(arcstat_overhead_size) /* * There are also some ARC variables that we want to export, but that are * updated so often that having the canonical representation be the statistic * variable causes a performance bottleneck. We want to use aggsum_t's for these * instead, but still be able to export the kstat in the same way as before. 
* The solution is to always use the aggsum version, except in the kstat update * callback. */ aggsum_t arc_size; aggsum_t arc_meta_used; aggsum_t astat_data_size; aggsum_t astat_metadata_size; aggsum_t astat_hdr_size; aggsum_t astat_other_size; aggsum_t astat_l2_hdr_size; static int arc_no_grow; /* Don't try to grow cache size */ static uint64_t arc_tempreserve; static uint64_t arc_loaned_bytes; typedef struct arc_callback arc_callback_t; struct arc_callback { void *acb_private; arc_done_func_t *acb_done; arc_buf_t *acb_buf; boolean_t acb_compressed; zio_t *acb_zio_dummy; arc_callback_t *acb_next; }; typedef struct arc_write_callback arc_write_callback_t; struct arc_write_callback { void *awcb_private; arc_done_func_t *awcb_ready; arc_done_func_t *awcb_children_ready; arc_done_func_t *awcb_physdone; arc_done_func_t *awcb_done; arc_buf_t *awcb_buf; }; /* * ARC buffers are separated into multiple structs as a memory saving measure: * - Common fields struct, always defined, and embedded within it: * - L2-only fields, always allocated but undefined when not in L2ARC * - L1-only fields, only allocated when in L1ARC * * Buffer in L1 Buffer only in L2 * +------------------------+ +------------------------+ * | arc_buf_hdr_t | | arc_buf_hdr_t | * | | | | * | | | | * | | | | * +------------------------+ +------------------------+ * | l2arc_buf_hdr_t | | l2arc_buf_hdr_t | * | (undefined if L1-only) | | | * +------------------------+ +------------------------+ * | l1arc_buf_hdr_t | * | | * | | * | | * | | * +------------------------+ * * Because it's possible for the L2ARC to become extremely large, we can wind * up eating a lot of memory in L2ARC buffer headers, so the size of a header * is minimized by only allocating the fields necessary for an L1-cached buffer * when a header is actually in the L1 cache. The sub-headers (l1arc_buf_hdr and * l2arc_buf_hdr) are embedded rather than allocated separately to save a couple * words in pointers. arc_hdr_realloc() is used to switch a header between * these two allocation states. */ typedef struct l1arc_buf_hdr { kmutex_t b_freeze_lock; zio_cksum_t *b_freeze_cksum; #ifdef ZFS_DEBUG /* * Used for debugging with kmem_flags - by allocating and freeing * b_thawed when the buffer is thawed, we get a record of the stack * trace that thawed it. */ void *b_thawed; #endif arc_buf_t *b_buf; uint32_t b_bufcnt; /* for waiting on writes to complete */ kcondvar_t b_cv; uint8_t b_byteswap; /* protected by arc state mutex */ arc_state_t *b_state; multilist_node_t b_arc_node; /* updated atomically */ clock_t b_arc_access; /* self protecting */ refcount_t b_refcnt; arc_callback_t *b_acb; abd_t *b_pabd; } l1arc_buf_hdr_t; typedef struct l2arc_dev l2arc_dev_t; typedef struct l2arc_buf_hdr { /* protected by arc_buf_hdr mutex */ l2arc_dev_t *b_dev; /* L2ARC device */ uint64_t b_daddr; /* disk address, offset byte */ list_node_t b_l2node; } l2arc_buf_hdr_t; struct arc_buf_hdr { /* protected by hash lock */ dva_t b_dva; uint64_t b_birth; arc_buf_contents_t b_type; arc_buf_hdr_t *b_hash_next; arc_flags_t b_flags; /* * This field stores the size of the data buffer after * compression, and is set in the arc's zio completion handlers. * It is in units of SPA_MINBLOCKSIZE (e.g. 1 == 512 bytes). * * While the block pointers can store up to 32MB in their psize * field, we can only store up to 32MB minus 512B. This is due * to the bp using a bias of 1, whereas we use a bias of 0 (i.e. * a field of zeros represents 512B in the bp). 
We can't use a * bias of 1 since we need to reserve a psize of zero, here, to * represent holes and embedded blocks. * * This isn't a problem in practice, since the maximum size of a * buffer is limited to 16MB, so we never need to store 32MB in * this field. Even in the upstream illumos code base, the * maximum size of a buffer is limited to 16MB. */ uint16_t b_psize; /* * This field stores the size of the data buffer before * compression, and cannot change once set. It is in units * of SPA_MINBLOCKSIZE (e.g. 2 == 1024 bytes) */ uint16_t b_lsize; /* immutable */ uint64_t b_spa; /* immutable */ /* L2ARC fields. Undefined when not in L2ARC. */ l2arc_buf_hdr_t b_l2hdr; /* L1ARC fields. Undefined when in l2arc_only state */ l1arc_buf_hdr_t b_l1hdr; }; #if defined(__FreeBSD__) && defined(_KERNEL) static int sysctl_vfs_zfs_arc_meta_limit(SYSCTL_HANDLER_ARGS) { uint64_t val; int err; val = arc_meta_limit; err = sysctl_handle_64(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (val <= 0 || val > arc_c_max) return (EINVAL); arc_meta_limit = val; return (0); } static int sysctl_vfs_zfs_arc_no_grow_shift(SYSCTL_HANDLER_ARGS) { uint32_t val; int err; val = arc_no_grow_shift; err = sysctl_handle_32(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (val >= arc_shrink_shift) return (EINVAL); arc_no_grow_shift = val; return (0); } static int sysctl_vfs_zfs_arc_max(SYSCTL_HANDLER_ARGS) { uint64_t val; int err; val = zfs_arc_max; err = sysctl_handle_64(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (zfs_arc_max == 0) { /* Loader tunable so blindly set */ zfs_arc_max = val; return (0); } if (val < arc_abs_min || val > kmem_size()) return (EINVAL); if (val < arc_c_min) return (EINVAL); if (zfs_arc_meta_limit > 0 && val < zfs_arc_meta_limit) return (EINVAL); arc_c_max = val; arc_c = arc_c_max; arc_p = (arc_c >> 1); if (zfs_arc_meta_limit == 0) { /* limit meta-data to 1/4 of the arc capacity */ arc_meta_limit = arc_c_max / 4; } /* if kmem_flags are set, lets try to use less memory */ if (kmem_debugging()) arc_c = arc_c / 2; zfs_arc_max = arc_c; return (0); } static int sysctl_vfs_zfs_arc_min(SYSCTL_HANDLER_ARGS) { uint64_t val; int err; val = zfs_arc_min; err = sysctl_handle_64(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (zfs_arc_min == 0) { /* Loader tunable so blindly set */ zfs_arc_min = val; return (0); } if (val < arc_abs_min || val > arc_c_max) return (EINVAL); arc_c_min = val; if (zfs_arc_meta_min == 0) arc_meta_min = arc_c_min / 2; if (arc_c < arc_c_min) arc_c = arc_c_min; zfs_arc_min = arc_c_min; return (0); } #endif #define GHOST_STATE(state) \ ((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \ (state) == arc_l2c_only) #define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE) #define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) #define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_FLAG_IO_ERROR) #define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_FLAG_PREFETCH) #define HDR_COMPRESSION_ENABLED(hdr) \ ((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC) #define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_FLAG_L2CACHE) #define HDR_L2_READING(hdr) \ (((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) && \ ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)) #define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITING) #define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_FLAG_L2_EVICTED) #define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD) #define HDR_SHARED_DATA(hdr) 
((hdr)->b_flags & ARC_FLAG_SHARED_DATA) #define HDR_ISTYPE_METADATA(hdr) \ ((hdr)->b_flags & ARC_FLAG_BUFC_METADATA) #define HDR_ISTYPE_DATA(hdr) (!HDR_ISTYPE_METADATA(hdr)) #define HDR_HAS_L1HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR) #define HDR_HAS_L2HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR) /* For storing compression mode in b_flags */ #define HDR_COMPRESS_OFFSET (highbit64(ARC_FLAG_COMPRESS_0) - 1) #define HDR_GET_COMPRESS(hdr) ((enum zio_compress)BF32_GET((hdr)->b_flags, \ HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS)) #define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \ HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp)); #define ARC_BUF_LAST(buf) ((buf)->b_next == NULL) #define ARC_BUF_SHARED(buf) ((buf)->b_flags & ARC_BUF_FLAG_SHARED) #define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED) /* * Other sizes */ #define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t)) #define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr)) /* * Hash table routines */ #define HT_LOCK_PAD CACHE_LINE_SIZE struct ht_lock { kmutex_t ht_lock; #ifdef _KERNEL unsigned char pad[(HT_LOCK_PAD - sizeof (kmutex_t))]; #endif }; #define BUF_LOCKS 256 typedef struct buf_hash_table { uint64_t ht_mask; arc_buf_hdr_t **ht_table; struct ht_lock ht_locks[BUF_LOCKS] __aligned(CACHE_LINE_SIZE); } buf_hash_table_t; static buf_hash_table_t buf_hash_table; #define BUF_HASH_INDEX(spa, dva, birth) \ (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask) #define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)]) #define BUF_HASH_LOCK(idx) (&(BUF_HASH_LOCK_NTRY(idx).ht_lock)) #define HDR_LOCK(hdr) \ (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth))) uint64_t zfs_crc64_table[256]; /* * Level 2 ARC */ #define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */ #define L2ARC_HEADROOM 2 /* num of writes */ /* * If we discover during ARC scan any buffers to be compressed, we boost * our headroom for the next scanning cycle by this percentage multiple. 
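 * For example, with the default L2ARC_HEADROOM_BOOST of 200, a scan
 * headroom of N bytes becomes (N * 200) / 100 = 2N bytes on the next
 * cycle, i.e. the headroom is doubled while compressed buffers continue
 * to be discovered.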
*/ #define L2ARC_HEADROOM_BOOST 200 #define L2ARC_FEED_SECS 1 /* caching interval secs */ #define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */ #define l2arc_writes_sent ARCSTAT(arcstat_l2_writes_sent) #define l2arc_writes_done ARCSTAT(arcstat_l2_writes_done) /* L2ARC Performance Tunables */ uint64_t l2arc_write_max = L2ARC_WRITE_SIZE; /* default max write size */ uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra write during warmup */ uint64_t l2arc_headroom = L2ARC_HEADROOM; /* number of dev writes */ uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST; uint64_t l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */ uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */ boolean_t l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */ boolean_t l2arc_feed_again = B_TRUE; /* turbo warmup */ boolean_t l2arc_norw = B_TRUE; /* no reads during writes */ SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2arc_write_max, CTLFLAG_RW, &l2arc_write_max, 0, "max write size"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2arc_write_boost, CTLFLAG_RW, &l2arc_write_boost, 0, "extra write during warmup"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2arc_headroom, CTLFLAG_RW, &l2arc_headroom, 0, "number of dev writes"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2arc_feed_secs, CTLFLAG_RW, &l2arc_feed_secs, 0, "interval seconds"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2arc_feed_min_ms, CTLFLAG_RW, &l2arc_feed_min_ms, 0, "min interval milliseconds"); SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_noprefetch, CTLFLAG_RW, &l2arc_noprefetch, 0, "don't cache prefetch bufs"); SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_feed_again, CTLFLAG_RW, &l2arc_feed_again, 0, "turbo warmup"); SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_norw, CTLFLAG_RW, &l2arc_norw, 0, "no reads during writes"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, anon_size, CTLFLAG_RD, &ARC_anon.arcs_size.rc_count, 0, "size of anonymous state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, anon_metadata_esize, CTLFLAG_RD, &ARC_anon.arcs_esize[ARC_BUFC_METADATA].rc_count, 0, "size of evictable metadata in anonymous state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, anon_data_esize, CTLFLAG_RD, &ARC_anon.arcs_esize[ARC_BUFC_DATA].rc_count, 0, "size of evictable data in anonymous state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_size, CTLFLAG_RD, &ARC_mru.arcs_size.rc_count, 0, "size of mru state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_metadata_esize, CTLFLAG_RD, &ARC_mru.arcs_esize[ARC_BUFC_METADATA].rc_count, 0, "size of metadata in mru state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_data_esize, CTLFLAG_RD, &ARC_mru.arcs_esize[ARC_BUFC_DATA].rc_count, 0, "size of data in mru state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_ghost_size, CTLFLAG_RD, &ARC_mru_ghost.arcs_size.rc_count, 0, "size of mru ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_ghost_metadata_esize, CTLFLAG_RD, &ARC_mru_ghost.arcs_esize[ARC_BUFC_METADATA].rc_count, 0, "size of metadata in mru ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_ghost_data_esize, CTLFLAG_RD, &ARC_mru_ghost.arcs_esize[ARC_BUFC_DATA].rc_count, 0, "size of data in mru ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_size, CTLFLAG_RD, &ARC_mfu.arcs_size.rc_count, 0, "size of mfu state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_metadata_esize, CTLFLAG_RD, &ARC_mfu.arcs_esize[ARC_BUFC_METADATA].rc_count, 0, "size of metadata in mfu state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_data_esize, CTLFLAG_RD, &ARC_mfu.arcs_esize[ARC_BUFC_DATA].rc_count, 0, "size of data in mfu state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_ghost_size, CTLFLAG_RD, &ARC_mfu_ghost.arcs_size.rc_count, 0, "size of mfu ghost state");
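/*
 * Illustrative sketch (not part of the original source): the write_max and
 * write_boost tunables above are typically combined to size one L2ARC feed
 * cycle roughly as follows. The helper name and its warmed_up parameter are
 * hypothetical; the authoritative logic lives in l2arc_write_size().
 */
static inline uint64_t
l2arc_write_size_sketch(boolean_t warmed_up)
{
	uint64_t size;

	size = l2arc_write_max;
	if (size == 0)		/* guard against a nonsensical setting */
		size = L2ARC_WRITE_SIZE;
	if (!warmed_up)		/* extra write headroom during warmup */
		size += l2arc_write_boost;
	return (size);
}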
SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_ghost_metadata_esize, CTLFLAG_RD, &ARC_mfu_ghost.arcs_esize[ARC_BUFC_METADATA].rc_count, 0, "size of metadata in mfu ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_ghost_data_esize, CTLFLAG_RD, &ARC_mfu_ghost.arcs_esize[ARC_BUFC_DATA].rc_count, 0, "size of data in mfu ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2c_only_size, CTLFLAG_RD, &ARC_l2c_only.arcs_size.rc_count, 0, "size of l2c_only state"); /* * L2ARC Internals */ struct l2arc_dev { vdev_t *l2ad_vdev; /* vdev */ spa_t *l2ad_spa; /* spa */ uint64_t l2ad_hand; /* next write location */ uint64_t l2ad_start; /* first addr on device */ uint64_t l2ad_end; /* last addr on device */ boolean_t l2ad_first; /* first sweep through */ boolean_t l2ad_writing; /* currently writing */ kmutex_t l2ad_mtx; /* lock for buffer list */ list_t l2ad_buflist; /* buffer list */ list_node_t l2ad_node; /* device list node */ refcount_t l2ad_alloc; /* allocated bytes */ }; static list_t L2ARC_dev_list; /* device list */ static list_t *l2arc_dev_list; /* device list pointer */ static kmutex_t l2arc_dev_mtx; /* device list mutex */ static l2arc_dev_t *l2arc_dev_last; /* last device used */ static list_t L2ARC_free_on_write; /* free after write buf list */ static list_t *l2arc_free_on_write; /* free after write list ptr */ static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */ static uint64_t l2arc_ndev; /* number of devices */ typedef struct l2arc_read_callback { arc_buf_hdr_t *l2rcb_hdr; /* read header */ blkptr_t l2rcb_bp; /* original blkptr */ zbookmark_phys_t l2rcb_zb; /* original bookmark */ int l2rcb_flags; /* original flags */ abd_t *l2rcb_abd; /* temporary buffer */ } l2arc_read_callback_t; typedef struct l2arc_write_callback { l2arc_dev_t *l2wcb_dev; /* device info */ arc_buf_hdr_t *l2wcb_head; /* head of write buflist */ } l2arc_write_callback_t; typedef struct l2arc_data_free { /* protected by l2arc_free_on_write_mtx */ abd_t *l2df_abd; size_t l2df_size; arc_buf_contents_t l2df_type; list_node_t l2df_list_node; } l2arc_data_free_t; static kmutex_t l2arc_feed_thr_lock; static kcondvar_t l2arc_feed_thr_cv; static uint8_t l2arc_thread_exit; static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, void *); static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, void *); static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, void *); static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, void *); static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, void *); static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag); static void arc_hdr_free_pabd(arc_buf_hdr_t *); static void arc_hdr_alloc_pabd(arc_buf_hdr_t *); static void arc_access(arc_buf_hdr_t *, kmutex_t *); static boolean_t arc_is_overflowing(); static void arc_buf_watch(arc_buf_t *); static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *); static uint32_t arc_bufc_to_flags(arc_buf_contents_t); static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags); static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags); static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *); static void l2arc_read_done(zio_t *); static void l2arc_trim(const arc_buf_hdr_t *hdr) { l2arc_dev_t *dev = hdr->b_l2hdr.b_dev; ASSERT(HDR_HAS_L2HDR(hdr)); ASSERT(MUTEX_HELD(&dev->l2ad_mtx)); if (HDR_GET_PSIZE(hdr) != 0) { trim_map_free(dev->l2ad_vdev, hdr->b_l2hdr.b_daddr, HDR_GET_PSIZE(hdr), 0); } } /* * We use Cityhash for this.
It's fast, and has good hash properties without * requiring any large static buffers. */ static uint64_t buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth) { return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth)); } #define HDR_EMPTY(hdr) \ ((hdr)->b_dva.dva_word[0] == 0 && \ (hdr)->b_dva.dva_word[1] == 0) #define HDR_EQUAL(spa, dva, birth, hdr) \ ((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \ ((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \ ((hdr)->b_birth == birth) && ((hdr)->b_spa == spa) static void buf_discard_identity(arc_buf_hdr_t *hdr) { hdr->b_dva.dva_word[0] = 0; hdr->b_dva.dva_word[1] = 0; hdr->b_birth = 0; } static arc_buf_hdr_t * buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp) { const dva_t *dva = BP_IDENTITY(bp); uint64_t birth = BP_PHYSICAL_BIRTH(bp); uint64_t idx = BUF_HASH_INDEX(spa, dva, birth); kmutex_t *hash_lock = BUF_HASH_LOCK(idx); arc_buf_hdr_t *hdr; mutex_enter(hash_lock); for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL; hdr = hdr->b_hash_next) { if (HDR_EQUAL(spa, dva, birth, hdr)) { *lockp = hash_lock; return (hdr); } } mutex_exit(hash_lock); *lockp = NULL; return (NULL); } /* * Insert an entry into the hash table. If there is already an element * equal to elem in the hash table, then the already existing element * will be returned and the new element will not be inserted. * Otherwise returns NULL. * If lockp == NULL, the caller is assumed to already hold the hash lock. */ static arc_buf_hdr_t * buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp) { uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth); kmutex_t *hash_lock = BUF_HASH_LOCK(idx); arc_buf_hdr_t *fhdr; uint32_t i; ASSERT(!DVA_IS_EMPTY(&hdr->b_dva)); ASSERT(hdr->b_birth != 0); ASSERT(!HDR_IN_HASH_TABLE(hdr)); if (lockp != NULL) { *lockp = hash_lock; mutex_enter(hash_lock); } else { ASSERT(MUTEX_HELD(hash_lock)); } for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL; fhdr = fhdr->b_hash_next, i++) { if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr)) return (fhdr); } hdr->b_hash_next = buf_hash_table.ht_table[idx]; buf_hash_table.ht_table[idx] = hdr; arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE); /* collect some hash table performance data */ if (i > 0) { ARCSTAT_BUMP(arcstat_hash_collisions); if (i == 1) ARCSTAT_BUMP(arcstat_hash_chains); ARCSTAT_MAX(arcstat_hash_chain_max, i); } ARCSTAT_BUMP(arcstat_hash_elements); ARCSTAT_MAXSTAT(arcstat_hash_elements); return (NULL); } static void buf_hash_remove(arc_buf_hdr_t *hdr) { arc_buf_hdr_t *fhdr, **hdrp; uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth); ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx))); ASSERT(HDR_IN_HASH_TABLE(hdr)); hdrp = &buf_hash_table.ht_table[idx]; while ((fhdr = *hdrp) != hdr) { ASSERT3P(fhdr, !=, NULL); hdrp = &fhdr->b_hash_next; } *hdrp = hdr->b_hash_next; hdr->b_hash_next = NULL; arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE); /* collect some hash table performance data */ ARCSTAT_BUMPDOWN(arcstat_hash_elements); if (buf_hash_table.ht_table[idx] && buf_hash_table.ht_table[idx]->b_hash_next == NULL) ARCSTAT_BUMPDOWN(arcstat_hash_chains); } /* * Global data structures and functions for the buf kmem cache. 
*/ static kmem_cache_t *hdr_full_cache; static kmem_cache_t *hdr_l2only_cache; static kmem_cache_t *buf_cache; static void buf_fini(void) { int i; kmem_free(buf_hash_table.ht_table, (buf_hash_table.ht_mask + 1) * sizeof (void *)); for (i = 0; i < BUF_LOCKS; i++) mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock); kmem_cache_destroy(hdr_full_cache); kmem_cache_destroy(hdr_l2only_cache); kmem_cache_destroy(buf_cache); } /* * Constructor callback - called when the cache is empty * and a new buf is requested. */ /* ARGSUSED */ static int hdr_full_cons(void *vbuf, void *unused, int kmflag) { arc_buf_hdr_t *hdr = vbuf; bzero(hdr, HDR_FULL_SIZE); cv_init(&hdr->b_l1hdr.b_cv, NULL, CV_DEFAULT, NULL); refcount_create(&hdr->b_l1hdr.b_refcnt); mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL); multilist_link_init(&hdr->b_l1hdr.b_arc_node); arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS); return (0); } /* ARGSUSED */ static int hdr_l2only_cons(void *vbuf, void *unused, int kmflag) { arc_buf_hdr_t *hdr = vbuf; bzero(hdr, HDR_L2ONLY_SIZE); arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS); return (0); } /* ARGSUSED */ static int buf_cons(void *vbuf, void *unused, int kmflag) { arc_buf_t *buf = vbuf; bzero(buf, sizeof (arc_buf_t)); mutex_init(&buf->b_evict_lock, NULL, MUTEX_DEFAULT, NULL); arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS); return (0); } /* * Destructor callback - called when a cached buf is * no longer required. */ /* ARGSUSED */ static void hdr_full_dest(void *vbuf, void *unused) { arc_buf_hdr_t *hdr = vbuf; ASSERT(HDR_EMPTY(hdr)); cv_destroy(&hdr->b_l1hdr.b_cv); refcount_destroy(&hdr->b_l1hdr.b_refcnt); mutex_destroy(&hdr->b_l1hdr.b_freeze_lock); ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS); } /* ARGSUSED */ static void hdr_l2only_dest(void *vbuf, void *unused) { arc_buf_hdr_t *hdr = vbuf; ASSERT(HDR_EMPTY(hdr)); arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS); } /* ARGSUSED */ static void buf_dest(void *vbuf, void *unused) { arc_buf_t *buf = vbuf; mutex_destroy(&buf->b_evict_lock); arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS); } /* * Reclaim callback -- invoked when memory is low. */ /* ARGSUSED */ static void hdr_recl(void *unused) { dprintf("hdr_recl called\n"); /* * umem calls the reclaim func when we destroy the buf cache, * which is after we do arc_fini(). */ if (!arc_dead) cv_signal(&arc_reclaim_thread_cv); } static void buf_init(void) { uint64_t *ct; uint64_t hsize = 1ULL << 12; int i, j; /* * The hash table is big enough to fill all of physical memory * with an average block size of zfs_arc_average_blocksize (default 8K). * By default, the table will take up * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers). 
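 *
 * For example, on a machine with 16GB of physical memory and the default
 * 8K average block size, hsize grows to 2^21 (2M) entries, i.e. 16MB of
 * table with 8-byte pointers -- consistent with the 1MB per GB rule of
 * thumb above.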
*/ while (hsize * zfs_arc_average_blocksize < (uint64_t)physmem * PAGESIZE) hsize <<= 1; retry: buf_hash_table.ht_mask = hsize - 1; buf_hash_table.ht_table = kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP); if (buf_hash_table.ht_table == NULL) { ASSERT(hsize > (1ULL << 8)); hsize >>= 1; goto retry; } hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE, 0, hdr_full_cons, hdr_full_dest, hdr_recl, NULL, NULL, 0); hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only", HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, hdr_recl, NULL, NULL, 0); buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t), 0, buf_cons, buf_dest, NULL, NULL, NULL, 0); for (i = 0; i < 256; i++) for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--) *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY); for (i = 0; i < BUF_LOCKS; i++) { mutex_init(&buf_hash_table.ht_locks[i].ht_lock, NULL, MUTEX_DEFAULT, NULL); } } /* * This is the size that the buf occupies in memory. If the buf is compressed, * it will correspond to the compressed size. You should use this method of * getting the buf size unless you explicitly need the logical size. */ int32_t arc_buf_size(arc_buf_t *buf) { return (ARC_BUF_COMPRESSED(buf) ? HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr)); } int32_t arc_buf_lsize(arc_buf_t *buf) { return (HDR_GET_LSIZE(buf->b_hdr)); } enum zio_compress arc_get_compression(arc_buf_t *buf) { return (ARC_BUF_COMPRESSED(buf) ? HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF); } #define ARC_MINTIME (hz>>4) /* 62 ms */ static inline boolean_t arc_buf_is_shared(arc_buf_t *buf) { boolean_t shared = (buf->b_data != NULL && buf->b_hdr->b_l1hdr.b_pabd != NULL && abd_is_linear(buf->b_hdr->b_l1hdr.b_pabd) && buf->b_data == abd_to_buf(buf->b_hdr->b_l1hdr.b_pabd)); IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr)); IMPLY(shared, ARC_BUF_SHARED(buf)); IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf)); /* * It would be nice to assert arc_can_share() too, but the "hdr isn't * already being shared" requirement prevents us from doing that. */ return (shared); } /* * Free the checksum associated with this header. If there is no checksum, this * is a no-op. */ static inline void arc_cksum_free(arc_buf_hdr_t *hdr) { ASSERT(HDR_HAS_L1HDR(hdr)); mutex_enter(&hdr->b_l1hdr.b_freeze_lock); if (hdr->b_l1hdr.b_freeze_cksum != NULL) { kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t)); hdr->b_l1hdr.b_freeze_cksum = NULL; } mutex_exit(&hdr->b_l1hdr.b_freeze_lock); } /* * Return true iff at least one of the bufs on hdr is not compressed. */ static boolean_t arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr) { for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) { if (!ARC_BUF_COMPRESSED(b)) { return (B_TRUE); } } return (B_FALSE); } /* * If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data * matches the checksum that is stored in the hdr. If there is no checksum, * or if the buf is compressed, this is a no-op. 
*/ static void arc_cksum_verify(arc_buf_t *buf) { arc_buf_hdr_t *hdr = buf->b_hdr; zio_cksum_t zc; if (!(zfs_flags & ZFS_DEBUG_MODIFY)) return; if (ARC_BUF_COMPRESSED(buf)) { ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL || arc_hdr_has_uncompressed_buf(hdr)); return; } ASSERT(HDR_HAS_L1HDR(hdr)); mutex_enter(&hdr->b_l1hdr.b_freeze_lock); if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) { mutex_exit(&hdr->b_l1hdr.b_freeze_lock); return; } fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc); if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc)) panic("buffer modified while frozen!"); mutex_exit(&hdr->b_l1hdr.b_freeze_lock); } static boolean_t arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio) { enum zio_compress compress = BP_GET_COMPRESS(zio->io_bp); boolean_t valid_cksum; ASSERT(!BP_IS_EMBEDDED(zio->io_bp)); VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr)); /* * We rely on the blkptr's checksum to determine if the block * is valid or not. When compressed arc is enabled, the l2arc * writes the block to the l2arc just as it appears in the pool. * This allows us to use the blkptr's checksum to validate the * data that we just read off of the l2arc without having to store * a separate checksum in the arc_buf_hdr_t. However, if compressed * arc is disabled, then the data written to the l2arc is always * uncompressed and won't match the block as it exists in the main * pool. When this is the case, we must first compress it if it is * compressed on the main pool before we can validate the checksum. */ if (!HDR_COMPRESSION_ENABLED(hdr) && compress != ZIO_COMPRESS_OFF) { ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF); uint64_t lsize = HDR_GET_LSIZE(hdr); uint64_t csize; abd_t *cdata = abd_alloc_linear(HDR_GET_PSIZE(hdr), B_TRUE); csize = zio_compress_data(compress, zio->io_abd, abd_to_buf(cdata), lsize); ASSERT3U(csize, <=, HDR_GET_PSIZE(hdr)); if (csize < HDR_GET_PSIZE(hdr)) { /* * Compressed blocks are always a multiple of the * smallest ashift in the pool. Ideally, we would * like to round up the csize to the next * spa_min_ashift but that value may have changed * since the block was last written. Instead, * we rely on the fact that the hdr's psize * was set to the psize of the block when it was * last written. We set the csize to that value * and zero out any part that should not contain * data. */ abd_zero_off(cdata, csize, HDR_GET_PSIZE(hdr) - csize); csize = HDR_GET_PSIZE(hdr); } zio_push_transform(zio, cdata, csize, HDR_GET_PSIZE(hdr), NULL); } /* * Block pointers always store the checksum for the logical data. * If the block pointer has the gang bit set, then the checksum * it represents is for the reconstituted data and not for an * individual gang member. The zio pipeline, however, must be able to * determine the checksum of each of the gang constituents so it * treats the checksum comparison differently than what we need * for l2arc blocks. This prevents us from using the * zio_checksum_error() interface directly. Instead we must call the * zio_checksum_error_impl() so that we can ensure the checksum is * generated using the correct checksum algorithm and accounts for the * logical I/O size and not just a gang fragment. 
*/ valid_cksum = (zio_checksum_error_impl(zio->io_spa, zio->io_bp, BP_GET_CHECKSUM(zio->io_bp), zio->io_abd, zio->io_size, zio->io_offset, NULL) == 0); zio_pop_transforms(zio); return (valid_cksum); } /* * Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a * checksum and attaches it to the buf's hdr so that we can ensure that the buf * isn't modified later on. If buf is compressed or there is already a checksum * on the hdr, this is a no-op (we only checksum uncompressed bufs). */ static void arc_cksum_compute(arc_buf_t *buf) { arc_buf_hdr_t *hdr = buf->b_hdr; if (!(zfs_flags & ZFS_DEBUG_MODIFY)) return; ASSERT(HDR_HAS_L1HDR(hdr)); mutex_enter(&buf->b_hdr->b_l1hdr.b_freeze_lock); if (hdr->b_l1hdr.b_freeze_cksum != NULL) { ASSERT(arc_hdr_has_uncompressed_buf(hdr)); mutex_exit(&hdr->b_l1hdr.b_freeze_lock); return; } else if (ARC_BUF_COMPRESSED(buf)) { mutex_exit(&hdr->b_l1hdr.b_freeze_lock); return; } ASSERT(!ARC_BUF_COMPRESSED(buf)); hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP); fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, hdr->b_l1hdr.b_freeze_cksum); mutex_exit(&hdr->b_l1hdr.b_freeze_lock); #ifdef illumos arc_buf_watch(buf); #endif } #ifdef illumos #ifndef _KERNEL typedef struct procctl { long cmd; prwatch_t prwatch; } procctl_t; #endif /* ARGSUSED */ static void arc_buf_unwatch(arc_buf_t *buf) { #ifndef _KERNEL if (arc_watch) { int result; procctl_t ctl; ctl.cmd = PCWATCH; ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data; ctl.prwatch.pr_size = 0; ctl.prwatch.pr_wflags = 0; result = write(arc_procfd, &ctl, sizeof (ctl)); ASSERT3U(result, ==, sizeof (ctl)); } #endif } /* ARGSUSED */ static void arc_buf_watch(arc_buf_t *buf) { #ifndef _KERNEL if (arc_watch) { int result; procctl_t ctl; ctl.cmd = PCWATCH; ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data; ctl.prwatch.pr_size = arc_buf_size(buf); ctl.prwatch.pr_wflags = WA_WRITE; result = write(arc_procfd, &ctl, sizeof (ctl)); ASSERT3U(result, ==, sizeof (ctl)); } #endif } #endif /* illumos */ static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *hdr) { arc_buf_contents_t type; if (HDR_ISTYPE_METADATA(hdr)) { type = ARC_BUFC_METADATA; } else { type = ARC_BUFC_DATA; } VERIFY3U(hdr->b_type, ==, type); return (type); } boolean_t arc_is_metadata(arc_buf_t *buf) { return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0); } static uint32_t arc_bufc_to_flags(arc_buf_contents_t type) { switch (type) { case ARC_BUFC_DATA: /* metadata field is 0 if buffer contains normal data */ return (0); case ARC_BUFC_METADATA: return (ARC_FLAG_BUFC_METADATA); default: break; } panic("undefined ARC buffer type!"); return ((uint32_t)-1); } void arc_buf_thaw(arc_buf_t *buf) { arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); arc_cksum_verify(buf); /* * Compressed buffers do not manipulate the b_freeze_cksum or * allocate b_thawed. 
	 */
	if (ARC_BUF_COMPRESSED(buf)) {
		ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
		    arc_hdr_has_uncompressed_buf(hdr));
		return;
	}

	ASSERT(HDR_HAS_L1HDR(hdr));

	arc_cksum_free(hdr);

	mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
#ifdef ZFS_DEBUG
	if (zfs_flags & ZFS_DEBUG_MODIFY) {
		if (hdr->b_l1hdr.b_thawed != NULL)
			kmem_free(hdr->b_l1hdr.b_thawed, 1);
		hdr->b_l1hdr.b_thawed = kmem_alloc(1, KM_SLEEP);
	}
#endif
	mutex_exit(&hdr->b_l1hdr.b_freeze_lock);

#ifdef illumos
	arc_buf_unwatch(buf);
#endif
}

void
arc_buf_freeze(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	kmutex_t *hash_lock;

	if (!(zfs_flags & ZFS_DEBUG_MODIFY))
		return;

	if (ARC_BUF_COMPRESSED(buf)) {
		ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
		    arc_hdr_has_uncompressed_buf(hdr));
		return;
	}

	hash_lock = HDR_LOCK(hdr);
	mutex_enter(hash_lock);

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT(hdr->b_l1hdr.b_freeze_cksum != NULL ||
	    hdr->b_l1hdr.b_state == arc_anon);
	arc_cksum_compute(buf);
	mutex_exit(hash_lock);
}

/*
 * The arc_buf_hdr_t's b_flags should never be modified directly.  Instead,
 * the following functions should be used to ensure that the flags are
 * updated in a thread-safe way.  When manipulating the flags either
 * the hash_lock must be held or the hdr must be undiscoverable.  This
 * ensures that we're not racing with any other threads when updating
 * the flags.
 */
static inline void
arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
{
	ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
	hdr->b_flags |= flags;
}

static inline void
arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
{
	ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
	hdr->b_flags &= ~flags;
}

/*
 * Setting the compression bits in the arc_buf_hdr_t's b_flags is
 * done in a special way since we have to clear and set bits
 * at the same time.  Consumers that wish to set the compression bits
 * must use this function to ensure that the flags are updated in
 * a thread-safe manner.
 */
static void
arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp)
{
	ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));

	/*
	 * Holes and embedded blocks will always have a psize = 0 so
	 * we ignore the compression of the blkptr and set the
	 * arc_buf_hdr_t's compression to ZIO_COMPRESS_OFF.
	 * Holes and embedded blocks remain anonymous so we don't
	 * want to uncompress them.  Mark them as uncompressed.
	 */
	if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) {
		arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
		HDR_SET_COMPRESS(hdr, ZIO_COMPRESS_OFF);
		ASSERT(!HDR_COMPRESSION_ENABLED(hdr));
		ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
	} else {
		arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
		HDR_SET_COMPRESS(hdr, cmp);
		ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp);
		ASSERT(HDR_COMPRESSION_ENABLED(hdr));
	}
}

/*
 * Looks for another buf on the same hdr which has the data decompressed,
 * copies from it, and returns true.  If no such buf exists, returns false.
 */
static boolean_t
arc_buf_try_copy_decompressed_data(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	boolean_t copied = B_FALSE;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT3P(buf->b_data, !=, NULL);
	ASSERT(!ARC_BUF_COMPRESSED(buf));

	for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL;
	    from = from->b_next) {
		/* can't use our own data buffer */
		if (from == buf) {
			continue;
		}

		if (!ARC_BUF_COMPRESSED(from)) {
			bcopy(from->b_data, buf->b_data, arc_buf_size(buf));
			copied = B_TRUE;
			break;
		}
	}

	/*
	 * There were no decompressed bufs, so there should not be a
	 * checksum on the hdr either.
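	 * (arc_cksum_compute() only ever attaches checksums to
	 * uncompressed bufs, which is exactly the invariant the EQUIV
	 * below asserts.)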
	 */
	EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);

	return (copied);
}

/*
 * Given a buf that has a data buffer attached to it, this function will
 * efficiently fill the buf with data of the specified compression setting from
 * the hdr and update the hdr's b_freeze_cksum if necessary.  If the buf and hdr
 * are already sharing a data buf, no copy is performed.
 *
 * If the buf is marked as compressed but uncompressed data was requested, this
 * will allocate a new data buffer for the buf, remove that flag, and fill the
 * buf with uncompressed data.  You can't request a compressed buf on a hdr with
 * uncompressed data, and (since we haven't added support for it yet) if you
 * want compressed data your buf must already be marked as compressed and have
 * the correct-sized data buffer.
 */
static int
arc_buf_fill(arc_buf_t *buf, boolean_t compressed)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	boolean_t hdr_compressed = (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
	dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap;

	ASSERT3P(buf->b_data, !=, NULL);
	IMPLY(compressed, hdr_compressed);
	IMPLY(compressed, ARC_BUF_COMPRESSED(buf));

	if (hdr_compressed == compressed) {
		if (!arc_buf_is_shared(buf)) {
			abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd,
			    arc_buf_size(buf));
		}
	} else {
		ASSERT(hdr_compressed);
		ASSERT(!compressed);
		ASSERT3U(HDR_GET_LSIZE(hdr), !=, HDR_GET_PSIZE(hdr));

		/*
		 * If the buf is sharing its data with the hdr, unlink it and
		 * allocate a new data buffer for the buf.
		 */
		if (arc_buf_is_shared(buf)) {
			ASSERT(ARC_BUF_COMPRESSED(buf));

			/* We need to give the buf its own b_data */
			buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
			buf->b_data =
			    arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
			arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);

			/* Previously overhead was 0; just add new overhead */
			ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr));
		} else if (ARC_BUF_COMPRESSED(buf)) {
			/* We need to reallocate the buf's b_data */
			arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr),
			    buf);
			buf->b_data =
			    arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);

			/* We increased the size of b_data; update overhead */
			ARCSTAT_INCR(arcstat_overhead_size,
			    HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));
		}

		/*
		 * Regardless of the buf's previous compression settings, it
		 * should not be compressed at the end of this function.
		 */
		buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;

		/*
		 * Try copying the data from another buf which already has a
		 * decompressed version.  If that's not possible, it's time to
		 * bite the bullet and decompress the data from the hdr.
		 */
		if (arc_buf_try_copy_decompressed_data(buf)) {
			/* Skip byteswapping and checksumming (already done) */
			ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, !=, NULL);
			return (0);
		} else {
			int error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
			    hdr->b_l1hdr.b_pabd, buf->b_data,
			    HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));

			/*
			 * Absent hardware errors or software bugs, this should
			 * be impossible, but log it anyway so we can debug it.
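			 * The zfs_dbgmsg() below records the hdr's
			 * compression type and sizes so a failed
			 * decompression can be diagnosed after the fact.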
*/ if (error != 0) { zfs_dbgmsg( "hdr %p, compress %d, psize %d, lsize %d", hdr, HDR_GET_COMPRESS(hdr), HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr)); return (SET_ERROR(EIO)); } } } /* Byteswap the buf's data if necessary */ if (bswap != DMU_BSWAP_NUMFUNCS) { ASSERT(!HDR_SHARED_DATA(hdr)); ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS); dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr)); } /* Compute the hdr's checksum if necessary */ arc_cksum_compute(buf); return (0); } int arc_decompress(arc_buf_t *buf) { return (arc_buf_fill(buf, B_FALSE)); } /* * Return the size of the block, b_pabd, that is stored in the arc_buf_hdr_t. */ static uint64_t arc_hdr_size(arc_buf_hdr_t *hdr) { uint64_t size; if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && HDR_GET_PSIZE(hdr) > 0) { size = HDR_GET_PSIZE(hdr); } else { ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0); size = HDR_GET_LSIZE(hdr); } return (size); } /* * Increment the amount of evictable space in the arc_state_t's refcount. * We account for the space used by the hdr and the arc buf individually * so that we can add and remove them from the refcount individually. */ static void arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state) { arc_buf_contents_t type = arc_buf_type(hdr); ASSERT(HDR_HAS_L1HDR(hdr)); if (GHOST_STATE(state)) { ASSERT0(hdr->b_l1hdr.b_bufcnt); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); (void) refcount_add_many(&state->arcs_esize[type], HDR_GET_LSIZE(hdr), hdr); return; } ASSERT(!GHOST_STATE(state)); if (hdr->b_l1hdr.b_pabd != NULL) { (void) refcount_add_many(&state->arcs_esize[type], arc_hdr_size(hdr), hdr); } for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) { if (arc_buf_is_shared(buf)) continue; (void) refcount_add_many(&state->arcs_esize[type], arc_buf_size(buf), buf); } } /* * Decrement the amount of evictable space in the arc_state_t's refcount. * We account for the space used by the hdr and the arc buf individually * so that we can add and remove them from the refcount individually. */ static void arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state) { arc_buf_contents_t type = arc_buf_type(hdr); ASSERT(HDR_HAS_L1HDR(hdr)); if (GHOST_STATE(state)) { ASSERT0(hdr->b_l1hdr.b_bufcnt); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); (void) refcount_remove_many(&state->arcs_esize[type], HDR_GET_LSIZE(hdr), hdr); return; } ASSERT(!GHOST_STATE(state)); if (hdr->b_l1hdr.b_pabd != NULL) { (void) refcount_remove_many(&state->arcs_esize[type], arc_hdr_size(hdr), hdr); } for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) { if (arc_buf_is_shared(buf)) continue; (void) refcount_remove_many(&state->arcs_esize[type], arc_buf_size(buf), buf); } } /* * Add a reference to this hdr indicating that someone is actively * referencing that memory. When the refcount transitions from 0 to 1, * we remove it from the respective arc_state_t list to indicate that * it is not evictable. */ static void add_reference(arc_buf_hdr_t *hdr, void *tag) { ASSERT(HDR_HAS_L1HDR(hdr)); if (!MUTEX_HELD(HDR_LOCK(hdr))) { ASSERT(hdr->b_l1hdr.b_state == arc_anon); ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); } arc_state_t *state = hdr->b_l1hdr.b_state; if ((refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) && (state != arc_anon)) { /* We don't use the L2-only state list. 
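		 * (Headers in the arc_l2c_only state carry no L1 data
		 * accounted for on an eviction list, so there would be
		 * nothing to remove here.)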
*/ if (state != arc_l2c_only) { multilist_remove(state->arcs_list[arc_buf_type(hdr)], hdr); arc_evictable_space_decrement(hdr, state); } /* remove the prefetch flag if we get a reference */ arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH); } } /* * Remove a reference from this hdr. When the reference transitions from * 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's * list making it eligible for eviction. */ static int remove_reference(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, void *tag) { int cnt; arc_state_t *state = hdr->b_l1hdr.b_state; ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT(state == arc_anon || MUTEX_HELD(hash_lock)); ASSERT(!GHOST_STATE(state)); /* * arc_l2c_only counts as a ghost state so we don't need to explicitly * check to prevent usage of the arc_l2c_only list. */ if (((cnt = refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) == 0) && (state != arc_anon)) { multilist_insert(state->arcs_list[arc_buf_type(hdr)], hdr); ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0); arc_evictable_space_increment(hdr, state); } return (cnt); } /* * Move the supplied buffer to the indicated state. The hash lock * for the buffer must be held by the caller. */ static void arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr, kmutex_t *hash_lock) { arc_state_t *old_state; int64_t refcnt; uint32_t bufcnt; boolean_t update_old, update_new; arc_buf_contents_t buftype = arc_buf_type(hdr); /* * We almost always have an L1 hdr here, since we call arc_hdr_realloc() * in arc_read() when bringing a buffer out of the L2ARC. However, the * L1 hdr doesn't always exist when we change state to arc_anon before * destroying a header, in which case reallocating to add the L1 hdr is * pointless. */ if (HDR_HAS_L1HDR(hdr)) { old_state = hdr->b_l1hdr.b_state; refcnt = refcount_count(&hdr->b_l1hdr.b_refcnt); bufcnt = hdr->b_l1hdr.b_bufcnt; update_old = (bufcnt > 0 || hdr->b_l1hdr.b_pabd != NULL); } else { old_state = arc_l2c_only; refcnt = 0; bufcnt = 0; update_old = B_FALSE; } update_new = update_old; ASSERT(MUTEX_HELD(hash_lock)); ASSERT3P(new_state, !=, old_state); ASSERT(!GHOST_STATE(new_state) || bufcnt == 0); ASSERT(old_state != arc_anon || bufcnt <= 1); /* * If this buffer is evictable, transfer it from the * old state list to the new state list. */ if (refcnt == 0) { if (old_state != arc_anon && old_state != arc_l2c_only) { ASSERT(HDR_HAS_L1HDR(hdr)); multilist_remove(old_state->arcs_list[buftype], hdr); if (GHOST_STATE(old_state)) { ASSERT0(bufcnt); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); update_old = B_TRUE; } arc_evictable_space_decrement(hdr, old_state); } if (new_state != arc_anon && new_state != arc_l2c_only) { /* * An L1 header always exists here, since if we're * moving to some L1-cached state (i.e. not l2c_only or * anonymous), we realloc the header to add an L1hdr * beforehand. */ ASSERT(HDR_HAS_L1HDR(hdr)); multilist_insert(new_state->arcs_list[buftype], hdr); if (GHOST_STATE(new_state)) { ASSERT0(bufcnt); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); update_new = B_TRUE; } arc_evictable_space_increment(hdr, new_state); } } ASSERT(!HDR_EMPTY(hdr)); if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr)) buf_hash_remove(hdr); /* adjust state sizes (ignore arc_l2c_only) */ if (update_new && new_state != arc_l2c_only) { ASSERT(HDR_HAS_L1HDR(hdr)); if (GHOST_STATE(new_state)) { ASSERT0(bufcnt); /* * When moving a header to a ghost state, we first * remove all arc buffers. Thus, we'll have a * bufcnt of zero, and no arc buffer to use for * the reference. 
As a result, we use the arc * header pointer for the reference. */ (void) refcount_add_many(&new_state->arcs_size, HDR_GET_LSIZE(hdr), hdr); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); } else { uint32_t buffers = 0; /* * Each individual buffer holds a unique reference, * thus we must remove each of these references one * at a time. */ for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) { ASSERT3U(bufcnt, !=, 0); buffers++; /* * When the arc_buf_t is sharing the data * block with the hdr, the owner of the * reference belongs to the hdr. Only * add to the refcount if the arc_buf_t is * not shared. */ if (arc_buf_is_shared(buf)) continue; (void) refcount_add_many(&new_state->arcs_size, arc_buf_size(buf), buf); } ASSERT3U(bufcnt, ==, buffers); if (hdr->b_l1hdr.b_pabd != NULL) { (void) refcount_add_many(&new_state->arcs_size, arc_hdr_size(hdr), hdr); } else { ASSERT(GHOST_STATE(old_state)); } } } if (update_old && old_state != arc_l2c_only) { ASSERT(HDR_HAS_L1HDR(hdr)); if (GHOST_STATE(old_state)) { ASSERT0(bufcnt); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); /* * When moving a header off of a ghost state, * the header will not contain any arc buffers. * We use the arc header pointer for the reference * which is exactly what we did when we put the * header on the ghost state. */ (void) refcount_remove_many(&old_state->arcs_size, HDR_GET_LSIZE(hdr), hdr); } else { uint32_t buffers = 0; /* * Each individual buffer holds a unique reference, * thus we must remove each of these references one * at a time. */ for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) { ASSERT3U(bufcnt, !=, 0); buffers++; /* * When the arc_buf_t is sharing the data * block with the hdr, the owner of the * reference belongs to the hdr. Only * add to the refcount if the arc_buf_t is * not shared. */ if (arc_buf_is_shared(buf)) continue; (void) refcount_remove_many( &old_state->arcs_size, arc_buf_size(buf), buf); } ASSERT3U(bufcnt, ==, buffers); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); (void) refcount_remove_many( &old_state->arcs_size, arc_hdr_size(hdr), hdr); } } if (HDR_HAS_L1HDR(hdr)) hdr->b_l1hdr.b_state = new_state; /* * L2 headers should never be on the L2 state list since they don't * have L1 headers allocated. 
*/ ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]) && multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA])); } void arc_space_consume(uint64_t space, arc_space_type_t type) { ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES); switch (type) { case ARC_SPACE_DATA: aggsum_add(&astat_data_size, space); break; case ARC_SPACE_META: aggsum_add(&astat_metadata_size, space); break; case ARC_SPACE_OTHER: aggsum_add(&astat_other_size, space); break; case ARC_SPACE_HDRS: aggsum_add(&astat_hdr_size, space); break; case ARC_SPACE_L2HDRS: aggsum_add(&astat_l2_hdr_size, space); break; } if (type != ARC_SPACE_DATA) aggsum_add(&arc_meta_used, space); aggsum_add(&arc_size, space); } void arc_space_return(uint64_t space, arc_space_type_t type) { ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES); switch (type) { case ARC_SPACE_DATA: aggsum_add(&astat_data_size, -space); break; case ARC_SPACE_META: aggsum_add(&astat_metadata_size, -space); break; case ARC_SPACE_OTHER: aggsum_add(&astat_other_size, -space); break; case ARC_SPACE_HDRS: aggsum_add(&astat_hdr_size, -space); break; case ARC_SPACE_L2HDRS: aggsum_add(&astat_l2_hdr_size, -space); break; } if (type != ARC_SPACE_DATA) { ASSERT(aggsum_compare(&arc_meta_used, space) >= 0); /* * We use the upper bound here rather than the precise value * because the arc_meta_max value doesn't need to be * precise. It's only consumed by humans via arcstats. */ if (arc_meta_max < aggsum_upper_bound(&arc_meta_used)) arc_meta_max = aggsum_upper_bound(&arc_meta_used); aggsum_add(&arc_meta_used, -space); } ASSERT(aggsum_compare(&arc_size, space) >= 0); aggsum_add(&arc_size, -space); } /* * Given a hdr and a buf, returns whether that buf can share its b_data buffer * with the hdr's b_pabd. */ static boolean_t arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf) { /* * The criteria for sharing a hdr's data are: * 1. the hdr's compression matches the buf's compression * 2. the hdr doesn't need to be byteswapped * 3. the hdr isn't already being shared * 4. the buf is either compressed or it is the last buf in the hdr list * * Criterion #4 maintains the invariant that shared uncompressed * bufs must be the final buf in the hdr's b_buf list. Reading this, you * might ask, "if a compressed buf is allocated first, won't that be the * last thing in the list?", but in that case it's impossible to create * a shared uncompressed buf anyway (because the hdr must be compressed * to have the compressed buf). You might also think that #3 is * sufficient to make this guarantee, however it's possible * (specifically in the rare L2ARC write race mentioned in * arc_buf_alloc_impl()) there will be an existing uncompressed buf that * is sharable, but wasn't at the time of its allocation. Rather than * allow a new shared uncompressed buf to be created and then shuffle * the list around to make it the last element, this simply disallows * sharing if the new buf isn't the first to be added. */ ASSERT3P(buf->b_hdr, ==, hdr); boolean_t hdr_compressed = HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF; boolean_t buf_compressed = ARC_BUF_COMPRESSED(buf) != 0; return (buf_compressed == hdr_compressed && hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS && !HDR_SHARED_DATA(hdr) && (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf))); } /* * Allocate a buf for this hdr. If you care about the data that's in the hdr, * or if you want a compressed buffer, pass those flags in. Returns 0 if the * copy was made successfully, or an error code otherwise. 
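 * For example, arc_alloc_buf() below passes compressed == B_FALSE and
 * fill == B_FALSE to obtain a fresh, writable, uncompressed buf.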
 */
static int
arc_buf_alloc_impl(arc_buf_hdr_t *hdr, void *tag, boolean_t compressed,
    boolean_t fill, arc_buf_t **ret)
{
	arc_buf_t *buf;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
	VERIFY(hdr->b_type == ARC_BUFC_DATA ||
	    hdr->b_type == ARC_BUFC_METADATA);
	ASSERT3P(ret, !=, NULL);
	ASSERT3P(*ret, ==, NULL);

	buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
	buf->b_hdr = hdr;
	buf->b_data = NULL;
	buf->b_next = hdr->b_l1hdr.b_buf;
	buf->b_flags = 0;

	add_reference(hdr, tag);

	/*
	 * We're about to change the hdr's b_flags.  We must either
	 * hold the hash_lock or be undiscoverable.
	 */
	ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));

	/*
	 * Only honor requests for compressed bufs if the hdr is actually
	 * compressed.
	 */
	if (compressed && HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF)
		buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;

	/*
	 * If the hdr's data can be shared then we share the data buffer and
	 * set the appropriate bit in the hdr's b_flags to indicate the hdr is
	 * sharing its b_pabd with the arc_buf_t.  Otherwise, we allocate a new
	 * buffer to store the buf's data.
	 *
	 * There are two additional restrictions here because we're sharing
	 * hdr -> buf instead of the usual buf -> hdr.  First, the hdr can't be
	 * actively involved in an L2ARC write, because if this buf is used by
	 * an arc_write() then the hdr's data buffer will be released when the
	 * write completes, even though the L2ARC write might still be using it.
	 * Second, the hdr's ABD must be linear so that the buf's user doesn't
	 * need to be ABD-aware.
	 */
	boolean_t can_share = arc_can_share(hdr, buf) &&
	    !HDR_L2_WRITING(hdr) && abd_is_linear(hdr->b_l1hdr.b_pabd);

	/* Set up b_data and sharing */
	if (can_share) {
		buf->b_data = abd_to_buf(hdr->b_l1hdr.b_pabd);
		buf->b_flags |= ARC_BUF_FLAG_SHARED;
		arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
	} else {
		buf->b_data =
		    arc_get_data_buf(hdr, arc_buf_size(buf), buf);
		ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
	}
	VERIFY3P(buf->b_data, !=, NULL);

	hdr->b_l1hdr.b_buf = buf;
	hdr->b_l1hdr.b_bufcnt += 1;

	/*
	 * If the user wants the data from the hdr, we need to either copy or
	 * decompress the data.
	 */
	if (fill) {
		return (arc_buf_fill(buf, ARC_BUF_COMPRESSED(buf) != 0));
	}

	return (0);
}

static char *arc_onloan_tag = "onloan";

static inline void
arc_loaned_bytes_update(int64_t delta)
{
	atomic_add_64(&arc_loaned_bytes, delta);

	/* assert that it did not wrap around */
	ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
}

/*
 * Loan out an anonymous arc buffer.  Loaned buffers are not counted as in
 * flight data by arc_tempreserve_space() until they are "returned".  Loaned
 * buffers must be returned to the arc before they can be used by the DMU or
 * freed.
 */
arc_buf_t *
arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
{
	arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
	    is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size);

	arc_loaned_bytes_update(arc_buf_size(buf));

	return (buf);
}

arc_buf_t *
arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,
    enum zio_compress compression_type)
{
	arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag,
	    psize, lsize, compression_type);

	arc_loaned_bytes_update(arc_buf_size(buf));

	return (buf);
}

/*
 * Return a loaned arc buffer to the arc.
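 * (The loan is undone below by moving the refcount from arc_onloan_tag
 * to the caller's tag and decrementing arc_loaned_bytes.)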
*/ void arc_return_buf(arc_buf_t *buf, void *tag) { arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT3P(buf->b_data, !=, NULL); ASSERT(HDR_HAS_L1HDR(hdr)); (void) refcount_add(&hdr->b_l1hdr.b_refcnt, tag); (void) refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag); arc_loaned_bytes_update(-arc_buf_size(buf)); } /* Detach an arc_buf from a dbuf (tag) */ void arc_loan_inuse_buf(arc_buf_t *buf, void *tag) { arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT3P(buf->b_data, !=, NULL); ASSERT(HDR_HAS_L1HDR(hdr)); (void) refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag); (void) refcount_remove(&hdr->b_l1hdr.b_refcnt, tag); arc_loaned_bytes_update(arc_buf_size(buf)); } static void l2arc_free_abd_on_write(abd_t *abd, size_t size, arc_buf_contents_t type) { l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP); df->l2df_abd = abd; df->l2df_size = size; df->l2df_type = type; mutex_enter(&l2arc_free_on_write_mtx); list_insert_head(l2arc_free_on_write, df); mutex_exit(&l2arc_free_on_write_mtx); } static void arc_hdr_free_on_write(arc_buf_hdr_t *hdr) { arc_state_t *state = hdr->b_l1hdr.b_state; arc_buf_contents_t type = arc_buf_type(hdr); uint64_t size = arc_hdr_size(hdr); /* protected by hash lock, if in the hash table */ if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) { ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); ASSERT(state != arc_anon && state != arc_l2c_only); (void) refcount_remove_many(&state->arcs_esize[type], size, hdr); } (void) refcount_remove_many(&state->arcs_size, size, hdr); if (type == ARC_BUFC_METADATA) { arc_space_return(size, ARC_SPACE_META); } else { ASSERT(type == ARC_BUFC_DATA); arc_space_return(size, ARC_SPACE_DATA); } l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type); } /* * Share the arc_buf_t's data with the hdr. Whenever we are sharing the * data buffer, we transfer the refcount ownership to the hdr and update * the appropriate kstats. */ static void arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf) { arc_state_t *state = hdr->b_l1hdr.b_state; ASSERT(arc_can_share(hdr, buf)); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr)); /* * Start sharing the data buffer. We transfer the * refcount ownership to the hdr since it always owns * the refcount whenever an arc_buf_t is shared. */ refcount_transfer_ownership(&state->arcs_size, buf, hdr); hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf)); abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd, HDR_ISTYPE_METADATA(hdr)); arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA); buf->b_flags |= ARC_BUF_FLAG_SHARED; /* * Since we've transferred ownership to the hdr we need * to increment its compressed and uncompressed kstats and * decrement the overhead size. */ ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr)); ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr)); ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf)); } static void arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf) { arc_state_t *state = hdr->b_l1hdr.b_state; ASSERT(arc_buf_is_shared(buf)); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr)); /* * We are no longer sharing this buffer so we need * to transfer its ownership to the rightful owner. 
	 */
	refcount_transfer_ownership(&state->arcs_size, hdr, buf);
	arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
	abd_release_ownership_of_buf(hdr->b_l1hdr.b_pabd);
	abd_put(hdr->b_l1hdr.b_pabd);
	hdr->b_l1hdr.b_pabd = NULL;
	buf->b_flags &= ~ARC_BUF_FLAG_SHARED;

	/*
	 * Since the buffer is no longer shared between
	 * the arc buf and the hdr, count it as overhead.
	 */
	ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
	ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
	ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
}

/*
 * Remove an arc_buf_t from the hdr's buf list and return the last
 * arc_buf_t on the list.  If no buffers remain on the list then return
 * NULL.
 */
static arc_buf_t *
arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf)
{
	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));

	arc_buf_t **bufp = &hdr->b_l1hdr.b_buf;
	arc_buf_t *lastbuf = NULL;

	/*
	 * Remove the buf from the hdr list and locate the last
	 * remaining buffer on the list.
	 */
	while (*bufp != NULL) {
		if (*bufp == buf)
			*bufp = buf->b_next;

		/*
		 * If we've removed a buffer in the middle of
		 * the list then update the lastbuf and update
		 * bufp.
		 */
		if (*bufp != NULL) {
			lastbuf = *bufp;
			bufp = &(*bufp)->b_next;
		}
	}
	buf->b_next = NULL;
	ASSERT3P(lastbuf, !=, buf);
	IMPLY(hdr->b_l1hdr.b_bufcnt > 0, lastbuf != NULL);
	IMPLY(hdr->b_l1hdr.b_bufcnt > 0, hdr->b_l1hdr.b_buf != NULL);
	IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf));

	return (lastbuf);
}

/*
 * Free up buf->b_data and pull the arc_buf_t off of the arc_buf_hdr_t's
 * list and free it.
 */
static void
arc_buf_destroy_impl(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	/*
	 * Free up the data associated with the buf but only if we're not
	 * sharing this with the hdr.  If we are sharing it with the hdr, the
	 * hdr is responsible for doing the free.
	 */
	if (buf->b_data != NULL) {
		/*
		 * We're about to change the hdr's b_flags.  We must either
		 * hold the hash_lock or be undiscoverable.
		 */
		ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));

		arc_cksum_verify(buf);
#ifdef illumos
		arc_buf_unwatch(buf);
#endif

		if (arc_buf_is_shared(buf)) {
			arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
		} else {
			uint64_t size = arc_buf_size(buf);
			arc_free_data_buf(hdr, buf->b_data, size, buf);
			ARCSTAT_INCR(arcstat_overhead_size, -size);
		}
		buf->b_data = NULL;

		ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
		hdr->b_l1hdr.b_bufcnt -= 1;
	}

	arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);

	if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {
		/*
		 * If the current arc_buf_t is sharing its data buffer with the
		 * hdr, then reassign the hdr's b_pabd to share it with the new
		 * buffer at the end of the list.  The shared buffer is always
		 * the last one on the hdr's buffer list.
		 *
		 * There is an equivalent case for compressed bufs, but since
		 * they aren't guaranteed to be the last buf in the list and
		 * that is an exceedingly rare case, we just allow that space
		 * to be wasted temporarily.
		 */
		if (lastbuf != NULL) {
			/* Only one buf can be shared at once */
			VERIFY(!arc_buf_is_shared(lastbuf));
			/* hdr is uncompressed so can't have compressed buf */
			VERIFY(!ARC_BUF_COMPRESSED(lastbuf));

			ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
			arc_hdr_free_pabd(hdr);

			/*
			 * We must setup a new shared block between the
			 * last buffer and the hdr.  The data would have
			 * been allocated by the arc buf so we need to transfer
			 * ownership to the hdr since it's now being shared.
			 */
			arc_share_buf(hdr, lastbuf);
		}
	} else if (HDR_SHARED_DATA(hdr)) {
		/*
		 * Uncompressed shared buffers are always at the end
		 * of the list.
Compressed buffers don't have the * same requirements. This makes it hard to * simply assert that the lastbuf is shared so * we rely on the hdr's compression flags to determine * if we have a compressed, shared buffer. */ ASSERT3P(lastbuf, !=, NULL); ASSERT(arc_buf_is_shared(lastbuf) || HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF); } /* * Free the checksum if we're removing the last uncompressed buf from * this hdr. */ if (!arc_hdr_has_uncompressed_buf(hdr)) { arc_cksum_free(hdr); } /* clean up the buf */ buf->b_hdr = NULL; kmem_cache_free(buf_cache, buf); } static void arc_hdr_alloc_pabd(arc_buf_hdr_t *hdr) { ASSERT3U(HDR_GET_LSIZE(hdr), >, 0); ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT(!HDR_SHARED_DATA(hdr)); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr); hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS; ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr)); ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr)); } static void arc_hdr_free_pabd(arc_buf_hdr_t *hdr) { ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); /* * If the hdr is currently being written to the l2arc then * we defer freeing the data by adding it to the l2arc_free_on_write * list. The l2arc will free the data once it's finished * writing it to the l2arc device. */ if (HDR_L2_WRITING(hdr)) { arc_hdr_free_on_write(hdr); ARCSTAT_BUMP(arcstat_l2_free_on_write); } else { arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, arc_hdr_size(hdr), hdr); } hdr->b_l1hdr.b_pabd = NULL; hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS; ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr)); ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr)); } static arc_buf_hdr_t * arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize, enum zio_compress compression_type, arc_buf_contents_t type) { arc_buf_hdr_t *hdr; VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA); hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE); ASSERT(HDR_EMPTY(hdr)); ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); ASSERT3P(hdr->b_l1hdr.b_thawed, ==, NULL); HDR_SET_PSIZE(hdr, psize); HDR_SET_LSIZE(hdr, lsize); hdr->b_spa = spa; hdr->b_type = type; hdr->b_flags = 0; arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR); arc_hdr_set_compress(hdr, compression_type); hdr->b_l1hdr.b_state = arc_anon; hdr->b_l1hdr.b_arc_access = 0; hdr->b_l1hdr.b_bufcnt = 0; hdr->b_l1hdr.b_buf = NULL; /* * Allocate the hdr's buffer. This will contain either * the compressed or uncompressed data depending on the block * it references and compressed arc enablement. */ arc_hdr_alloc_pabd(hdr); ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); return (hdr); } /* * Transition between the two allocation states for the arc_buf_hdr struct. * The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without * (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller * version is used when a cache buffer is only in the L2ARC in order to reduce * memory usage. 
 */
static arc_buf_hdr_t *
arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new)
{
	ASSERT(HDR_HAS_L2HDR(hdr));

	arc_buf_hdr_t *nhdr;
	l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;

	ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) ||
	    (old == hdr_l2only_cache && new == hdr_full_cache));

	nhdr = kmem_cache_alloc(new, KM_PUSHPAGE);

	ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
	buf_hash_remove(hdr);

	bcopy(hdr, nhdr, HDR_L2ONLY_SIZE);

	if (new == hdr_full_cache) {
		arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);
		/*
		 * arc_access and arc_change_state need to be aware that a
		 * header has just come out of L2ARC, so we set its state to
		 * l2c_only even though it's about to change.
		 */
		nhdr->b_l1hdr.b_state = arc_l2c_only;

		/* Verify previous threads set to NULL before freeing */
		ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL);
	} else {
		ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
		ASSERT0(hdr->b_l1hdr.b_bufcnt);
		ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);

		/*
		 * If we've reached here, we must have been called from
		 * arc_evict_hdr(), as such we should have already been
		 * removed from any ghost list we were previously on
		 * (which protects us from racing with arc_evict_state),
		 * thus no locking is needed during this check.
		 */
		ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));

		/*
		 * A buffer must not be moved into the arc_l2c_only
		 * state if it's not finished being written out to the
		 * l2arc device.  Otherwise, the b_l1hdr.b_pabd field
		 * might try to be accessed, even though it was removed.
		 */
		VERIFY(!HDR_L2_WRITING(hdr));
		VERIFY3P(hdr->b_l1hdr.b_pabd, ==, NULL);

#ifdef ZFS_DEBUG
		if (hdr->b_l1hdr.b_thawed != NULL) {
			kmem_free(hdr->b_l1hdr.b_thawed, 1);
			hdr->b_l1hdr.b_thawed = NULL;
		}
#endif

		arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR);
	}
	/*
	 * The header has been reallocated so we need to re-insert it into any
	 * lists it was on.
	 */
	(void) buf_hash_insert(nhdr, NULL);

	ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node));

	mutex_enter(&dev->l2ad_mtx);

	/*
	 * We must place the realloc'ed header back into the list at
	 * the same spot.  Otherwise, if it's placed earlier in the list,
	 * l2arc_write_buffers() could find it during the function's
	 * write phase, and try to write it out to the l2arc.
	 */
	list_insert_after(&dev->l2ad_buflist, hdr, nhdr);
	list_remove(&dev->l2ad_buflist, hdr);

	mutex_exit(&dev->l2ad_mtx);

	/*
	 * Since we're using the pointer address as the tag when
	 * incrementing and decrementing the l2ad_alloc refcount, we
	 * must remove the old pointer (that we're about to destroy) and
	 * add the new pointer to the refcount.  Otherwise we'd remove
	 * the wrong pointer address when calling arc_hdr_destroy() later.
	 */
	(void) refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
	(void) refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(nhdr), nhdr);

	buf_discard_identity(hdr);
	kmem_cache_free(old, hdr);

	return (nhdr);
}

/*
 * Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller.
 * The buf is returned thawed since we expect the consumer to modify it.
 */
arc_buf_t *
arc_alloc_buf(spa_t *spa, void *tag, arc_buf_contents_t type, int32_t size)
{
	arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size,
	    ZIO_COMPRESS_OFF, type);
	ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));

	arc_buf_t *buf = NULL;
	VERIFY0(arc_buf_alloc_impl(hdr, tag, B_FALSE, B_FALSE, &buf));
	arc_buf_thaw(buf);

	return (buf);
}

/*
 * Allocate a compressed buf in the same manner as arc_alloc_buf.  Don't use
 * this for bufs containing metadata.
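 * (The hdr below is allocated with type ARC_BUFC_DATA, which is why
 * metadata bufs are not supported here.)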
*/ arc_buf_t * arc_alloc_compressed_buf(spa_t *spa, void *tag, uint64_t psize, uint64_t lsize, enum zio_compress compression_type) { ASSERT3U(lsize, >, 0); ASSERT3U(lsize, >=, psize); ASSERT(compression_type > ZIO_COMPRESS_OFF); ASSERT(compression_type < ZIO_COMPRESS_FUNCTIONS); arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, compression_type, ARC_BUFC_DATA); ASSERT(!MUTEX_HELD(HDR_LOCK(hdr))); arc_buf_t *buf = NULL; VERIFY0(arc_buf_alloc_impl(hdr, tag, B_TRUE, B_FALSE, &buf)); arc_buf_thaw(buf); ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); if (!arc_buf_is_shared(buf)) { /* * To ensure that the hdr has the correct data in it if we call * arc_decompress() on this buf before it's been written to * disk, it's easiest if we just set up sharing between the * buf and the hdr. */ ASSERT(!abd_is_linear(hdr->b_l1hdr.b_pabd)); arc_hdr_free_pabd(hdr); arc_share_buf(hdr, buf); } return (buf); } static void arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr) { l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr; l2arc_dev_t *dev = l2hdr->b_dev; uint64_t psize = arc_hdr_size(hdr); ASSERT(MUTEX_HELD(&dev->l2ad_mtx)); ASSERT(HDR_HAS_L2HDR(hdr)); list_remove(&dev->l2ad_buflist, hdr); ARCSTAT_INCR(arcstat_l2_psize, -psize); ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr)); vdev_space_update(dev->l2ad_vdev, -psize, 0, 0); (void) refcount_remove_many(&dev->l2ad_alloc, psize, hdr); arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR); } static void arc_hdr_destroy(arc_buf_hdr_t *hdr) { if (HDR_HAS_L1HDR(hdr)) { ASSERT(hdr->b_l1hdr.b_buf == NULL || hdr->b_l1hdr.b_bufcnt > 0); ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); } ASSERT(!HDR_IO_IN_PROGRESS(hdr)); ASSERT(!HDR_IN_HASH_TABLE(hdr)); if (!HDR_EMPTY(hdr)) buf_discard_identity(hdr); if (HDR_HAS_L2HDR(hdr)) { l2arc_dev_t *dev = hdr->b_l2hdr.b_dev; boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx); if (!buflist_held) mutex_enter(&dev->l2ad_mtx); /* * Even though we checked this conditional above, we * need to check this again now that we have the * l2ad_mtx. This is because we could be racing with * another thread calling l2arc_evict() which might have * destroyed this header's L2 portion as we were waiting * to acquire the l2ad_mtx. If that happens, we don't * want to re-destroy the header's L2 portion. 
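		 * (This is the classic check-then-recheck pattern: the
		 * flag tested before taking l2ad_mtx is only advisory
		 * and must be confirmed once the mutex is held.)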
		 */
		if (HDR_HAS_L2HDR(hdr)) {
			l2arc_trim(hdr);
			arc_hdr_l2hdr_destroy(hdr);
		}

		if (!buflist_held)
			mutex_exit(&dev->l2ad_mtx);
	}

	if (HDR_HAS_L1HDR(hdr)) {
		arc_cksum_free(hdr);

		while (hdr->b_l1hdr.b_buf != NULL)
			arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);

#ifdef ZFS_DEBUG
		if (hdr->b_l1hdr.b_thawed != NULL) {
			kmem_free(hdr->b_l1hdr.b_thawed, 1);
			hdr->b_l1hdr.b_thawed = NULL;
		}
#endif

		if (hdr->b_l1hdr.b_pabd != NULL) {
			arc_hdr_free_pabd(hdr);
		}
	}

	ASSERT3P(hdr->b_hash_next, ==, NULL);
	if (HDR_HAS_L1HDR(hdr)) {
		ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
		ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
		kmem_cache_free(hdr_full_cache, hdr);
	} else {
		kmem_cache_free(hdr_l2only_cache, hdr);
	}
}

void
arc_buf_destroy(arc_buf_t *buf, void* tag)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	kmutex_t *hash_lock = HDR_LOCK(hdr);

	if (hdr->b_l1hdr.b_state == arc_anon) {
		ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
		ASSERT(!HDR_IO_IN_PROGRESS(hdr));
		VERIFY0(remove_reference(hdr, NULL, tag));
		arc_hdr_destroy(hdr);
		return;
	}

	mutex_enter(hash_lock);
	ASSERT3P(hdr, ==, buf->b_hdr);
	ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
	ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
	ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon);
	ASSERT3P(buf->b_data, !=, NULL);

	(void) remove_reference(hdr, hash_lock, tag);
	arc_buf_destroy_impl(buf);
	mutex_exit(hash_lock);
}

/*
 * Evict the arc_buf_hdr that is provided as a parameter.  The resultant
 * state of the header is dependent on its state prior to entering this
 * function.  The following transitions are possible:
 *
 *    - arc_mru -> arc_mru_ghost
 *    - arc_mfu -> arc_mfu_ghost
 *    - arc_mru_ghost -> arc_l2c_only
 *    - arc_mru_ghost -> deleted
 *    - arc_mfu_ghost -> arc_l2c_only
 *    - arc_mfu_ghost -> deleted
 */
static int64_t
arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
{
	arc_state_t *evicted_state, *state;
	int64_t bytes_evicted = 0;

	ASSERT(MUTEX_HELD(hash_lock));
	ASSERT(HDR_HAS_L1HDR(hdr));

	state = hdr->b_l1hdr.b_state;
	if (GHOST_STATE(state)) {
		ASSERT(!HDR_IO_IN_PROGRESS(hdr));
		ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);

		/*
		 * l2arc_write_buffers() relies on a header's L1 portion
		 * (i.e. its b_pabd field) during its write phase.
		 * Thus, we cannot push a header onto the arc_l2c_only
		 * state (removing its L1 piece) until the header is
		 * done being written to the l2arc.
		 */
		if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {
			ARCSTAT_BUMP(arcstat_evict_l2_skip);
			return (bytes_evicted);
		}

		ARCSTAT_BUMP(arcstat_deleted);
		bytes_evicted += HDR_GET_LSIZE(hdr);

		DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr);

		ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
		if (HDR_HAS_L2HDR(hdr)) {
			/*
			 * This buffer is cached on the 2nd Level ARC;
			 * don't destroy the header.
			 */
			arc_change_state(arc_l2c_only, hdr, hash_lock);
			/*
			 * dropping from L1+L2 cached to L2-only,
			 * realloc to remove the L1 header.
			 */
			hdr = arc_hdr_realloc(hdr, hdr_full_cache,
			    hdr_l2only_cache);
		} else {
			arc_change_state(arc_anon, hdr, hash_lock);
			arc_hdr_destroy(hdr);
		}
		return (bytes_evicted);
	}

	ASSERT(state == arc_mru || state == arc_mfu);
	evicted_state = (state == arc_mru) ?
arc_mru_ghost : arc_mfu_ghost; /* prefetch buffers have a minimum lifespan */ if (HDR_IO_IN_PROGRESS(hdr) || ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) && ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access < arc_min_prefetch_lifespan)) { ARCSTAT_BUMP(arcstat_evict_skip); return (bytes_evicted); } ASSERT0(refcount_count(&hdr->b_l1hdr.b_refcnt)); while (hdr->b_l1hdr.b_buf) { arc_buf_t *buf = hdr->b_l1hdr.b_buf; if (!mutex_tryenter(&buf->b_evict_lock)) { ARCSTAT_BUMP(arcstat_mutex_miss); break; } if (buf->b_data != NULL) bytes_evicted += HDR_GET_LSIZE(hdr); mutex_exit(&buf->b_evict_lock); arc_buf_destroy_impl(buf); } if (HDR_HAS_L2HDR(hdr)) { ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr)); } else { if (l2arc_write_eligible(hdr->b_spa, hdr)) { ARCSTAT_INCR(arcstat_evict_l2_eligible, HDR_GET_LSIZE(hdr)); } else { ARCSTAT_INCR(arcstat_evict_l2_ineligible, HDR_GET_LSIZE(hdr)); } } if (hdr->b_l1hdr.b_bufcnt == 0) { arc_cksum_free(hdr); bytes_evicted += arc_hdr_size(hdr); /* * If this hdr is being evicted and has a compressed * buffer then we discard it here before we change states. * This ensures that the accounting is updated correctly * in arc_free_data_impl(). */ arc_hdr_free_pabd(hdr); arc_change_state(evicted_state, hdr, hash_lock); ASSERT(HDR_IN_HASH_TABLE(hdr)); arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE); DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr); } return (bytes_evicted); } static uint64_t arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker, uint64_t spa, int64_t bytes) { multilist_sublist_t *mls; uint64_t bytes_evicted = 0; arc_buf_hdr_t *hdr; kmutex_t *hash_lock; int evict_count = 0; ASSERT3P(marker, !=, NULL); IMPLY(bytes < 0, bytes == ARC_EVICT_ALL); mls = multilist_sublist_lock(ml, idx); for (hdr = multilist_sublist_prev(mls, marker); hdr != NULL; hdr = multilist_sublist_prev(mls, marker)) { if ((bytes != ARC_EVICT_ALL && bytes_evicted >= bytes) || (evict_count >= zfs_arc_evict_batch_limit)) break; /* * To keep our iteration location, move the marker * forward. Since we're not holding hdr's hash lock, we * must be very careful and not remove 'hdr' from the * sublist. Otherwise, other consumers might mistake the * 'hdr' as not being on a sublist when they call the * multilist_link_active() function (they all rely on * the hash lock protecting concurrent insertions and * removals). multilist_sublist_move_forward() was * specifically implemented to ensure this is the case * (only 'marker' will be removed and re-inserted). */ multilist_sublist_move_forward(mls, marker); /* * The only case where the b_spa field should ever be * zero, is the marker headers inserted by * arc_evict_state(). It's possible for multiple threads * to be calling arc_evict_state() concurrently (e.g. * dsl_pool_close() and zio_inject_fault()), so we must * skip any markers we see from these other threads. */ if (hdr->b_spa == 0) continue; /* we're only interested in evicting buffers of a certain spa */ if (spa != 0 && hdr->b_spa != spa) { ARCSTAT_BUMP(arcstat_evict_skip); continue; } hash_lock = HDR_LOCK(hdr); /* * We aren't calling this function from any code path * that would already be holding a hash lock, so we're * asserting on this assumption to be defensive in case * this ever changes. Without this check, it would be * possible to incorrectly increment arcstat_mutex_miss * below (e.g. if the code changed such that we called * this function with a hash lock held). 
*/ ASSERT(!MUTEX_HELD(hash_lock)); if (mutex_tryenter(hash_lock)) { uint64_t evicted = arc_evict_hdr(hdr, hash_lock); mutex_exit(hash_lock); bytes_evicted += evicted; /* * If evicted is zero, arc_evict_hdr() must have * decided to skip this header, don't increment * evict_count in this case. */ if (evicted != 0) evict_count++; /* * If arc_size isn't overflowing, signal any * threads that might happen to be waiting. * * For each header evicted, we wake up a single * thread. If we used cv_broadcast, we could * wake up "too many" threads causing arc_size * to significantly overflow arc_c; since * arc_get_data_impl() doesn't check for overflow * when it's woken up (it doesn't because it's * possible for the ARC to be overflowing while * full of un-evictable buffers, and the * function should proceed in this case). * * If threads are left sleeping, due to not * using cv_broadcast, they will be woken up * just before arc_reclaim_thread() sleeps. */ mutex_enter(&arc_reclaim_lock); if (!arc_is_overflowing()) cv_signal(&arc_reclaim_waiters_cv); mutex_exit(&arc_reclaim_lock); } else { ARCSTAT_BUMP(arcstat_mutex_miss); } } multilist_sublist_unlock(mls); return (bytes_evicted); } /* * Evict buffers from the given arc state, until we've removed the * specified number of bytes. Move the removed buffers to the * appropriate evict state. * * This function makes a "best effort". It skips over any buffers * it can't get a hash_lock on, and so, may not catch all candidates. * It may also return without evicting as much space as requested. * * If bytes is specified using the special value ARC_EVICT_ALL, this * will evict all available (i.e. unlocked and evictable) buffers from * the given arc state; which is used by arc_flush(). */ static uint64_t arc_evict_state(arc_state_t *state, uint64_t spa, int64_t bytes, arc_buf_contents_t type) { uint64_t total_evicted = 0; multilist_t *ml = state->arcs_list[type]; int num_sublists; arc_buf_hdr_t **markers; IMPLY(bytes < 0, bytes == ARC_EVICT_ALL); num_sublists = multilist_get_num_sublists(ml); /* * If we've tried to evict from each sublist, made some * progress, but still have not hit the target number of bytes * to evict, we want to keep trying. The markers allow us to * pick up where we left off for each individual sublist, rather * than starting from the tail each time. */ markers = kmem_zalloc(sizeof (*markers) * num_sublists, KM_SLEEP); for (int i = 0; i < num_sublists; i++) { markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP); /* * A b_spa of 0 is used to indicate that this header is * a marker. This fact is used in arc_adjust_type() and * arc_evict_state_impl(). */ markers[i]->b_spa = 0; multilist_sublist_t *mls = multilist_sublist_lock(ml, i); multilist_sublist_insert_tail(mls, markers[i]); multilist_sublist_unlock(mls); } /* * While we haven't hit our target number of bytes to evict, or * we're evicting all available buffers. */ while (total_evicted < bytes || bytes == ARC_EVICT_ALL) { /* * Start eviction using a randomly selected sublist, * this is to try and evenly balance eviction across all * sublists. Always starting at the same sublist * (e.g. index 0) would cause evictions to favor certain * sublists over others. 
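		 * (The for loop below wraps sublist_idx back to zero once
		 * it reaches num_sublists, so every sublist is still
		 * visited once per scan.)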
*/ int sublist_idx = multilist_get_random_index(ml); uint64_t scan_evicted = 0; for (int i = 0; i < num_sublists; i++) { uint64_t bytes_remaining; uint64_t bytes_evicted; if (bytes == ARC_EVICT_ALL) bytes_remaining = ARC_EVICT_ALL; else if (total_evicted < bytes) bytes_remaining = bytes - total_evicted; else break; bytes_evicted = arc_evict_state_impl(ml, sublist_idx, markers[sublist_idx], spa, bytes_remaining); scan_evicted += bytes_evicted; total_evicted += bytes_evicted; /* we've reached the end, wrap to the beginning */ if (++sublist_idx >= num_sublists) sublist_idx = 0; } /* * If we didn't evict anything during this scan, we have * no reason to believe we'll evict more during another * scan, so break the loop. */ if (scan_evicted == 0) { /* This isn't possible, let's make that obvious */ ASSERT3S(bytes, !=, 0); /* * When bytes is ARC_EVICT_ALL, the only way to * break the loop is when scan_evicted is zero. * In that case, we actually have evicted enough, * so we don't want to increment the kstat. */ if (bytes != ARC_EVICT_ALL) { ASSERT3S(total_evicted, <, bytes); ARCSTAT_BUMP(arcstat_evict_not_enough); } break; } } for (int i = 0; i < num_sublists; i++) { multilist_sublist_t *mls = multilist_sublist_lock(ml, i); multilist_sublist_remove(mls, markers[i]); multilist_sublist_unlock(mls); kmem_cache_free(hdr_full_cache, markers[i]); } kmem_free(markers, sizeof (*markers) * num_sublists); return (total_evicted); } /* * Flush all "evictable" data of the given type from the arc state * specified. This will not evict any "active" buffers (i.e. referenced). * * When 'retry' is set to B_FALSE, the function will make a single pass * over the state and evict any buffers that it can. Since it doesn't * continually retry the eviction, it might end up leaving some buffers * in the ARC due to lock misses. * * When 'retry' is set to B_TRUE, the function will continually retry the * eviction until *all* evictable buffers have been removed from the * state. As a result, if concurrent insertions into the state are * allowed (e.g. if the ARC isn't shutting down), this function might * wind up in an infinite loop, continually trying to evict buffers. */ static uint64_t arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type, boolean_t retry) { uint64_t evicted = 0; while (refcount_count(&state->arcs_esize[type]) != 0) { evicted += arc_evict_state(state, spa, ARC_EVICT_ALL, type); if (!retry) break; } return (evicted); } /* * Evict the specified number of bytes from the state specified, * restricting eviction to the spa and type given. This function * prevents us from trying to evict more from a state's list than * is "evictable", and to skip evicting altogether when passed a * negative value for "bytes". In contrast, arc_evict_state() will * evict everything it can, when passed a negative value for "bytes". */ static uint64_t arc_adjust_impl(arc_state_t *state, uint64_t spa, int64_t bytes, arc_buf_contents_t type) { int64_t delta; if (bytes > 0 && refcount_count(&state->arcs_esize[type]) > 0) { delta = MIN(refcount_count(&state->arcs_esize[type]), bytes); return (arc_evict_state(state, spa, delta, type)); } return (0); } /* * Evict metadata buffers from the cache, such that arc_meta_used is * capped by the arc_meta_limit tunable. */ static uint64_t arc_adjust_meta(uint64_t meta_used) { uint64_t total_evicted = 0; int64_t target; /* * If we're over the meta limit, we want to evict enough * metadata to get back under the meta limit. 
We don't want to * evict so much that we drop the MRU below arc_p, though. If * we're over the meta limit more than we're over arc_p, we * evict some from the MRU here, and some from the MFU below. */ target = MIN((int64_t)(meta_used - arc_meta_limit), (int64_t)(refcount_count(&arc_anon->arcs_size) + refcount_count(&arc_mru->arcs_size) - arc_p)); total_evicted += arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA); /* * Similar to the above, we want to evict enough bytes to get us * below the meta limit, but not so much as to drop us below the * space allotted to the MFU (which is defined as arc_c - arc_p). */ target = MIN((int64_t)(meta_used - arc_meta_limit), (int64_t)(refcount_count(&arc_mfu->arcs_size) - (arc_c - arc_p))); total_evicted += arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA); return (total_evicted); } /* * Return the type of the oldest buffer in the given arc state * * This function will select a random sublist of type ARC_BUFC_DATA and * a random sublist of type ARC_BUFC_METADATA. The tail of each sublist * is compared, and the type which contains the "older" buffer will be * returned. */ static arc_buf_contents_t arc_adjust_type(arc_state_t *state) { multilist_t *data_ml = state->arcs_list[ARC_BUFC_DATA]; multilist_t *meta_ml = state->arcs_list[ARC_BUFC_METADATA]; int data_idx = multilist_get_random_index(data_ml); int meta_idx = multilist_get_random_index(meta_ml); multilist_sublist_t *data_mls; multilist_sublist_t *meta_mls; arc_buf_contents_t type; arc_buf_hdr_t *data_hdr; arc_buf_hdr_t *meta_hdr; /* * We keep the sublist lock until we're finished, to prevent * the headers from being destroyed via arc_evict_state(). */ data_mls = multilist_sublist_lock(data_ml, data_idx); meta_mls = multilist_sublist_lock(meta_ml, meta_idx); /* * These two loops are to ensure we skip any markers that * might be at the tail of the lists due to arc_evict_state(). */ for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL; data_hdr = multilist_sublist_prev(data_mls, data_hdr)) { if (data_hdr->b_spa != 0) break; } for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL; meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) { if (meta_hdr->b_spa != 0) break; } if (data_hdr == NULL && meta_hdr == NULL) { type = ARC_BUFC_DATA; } else if (data_hdr == NULL) { ASSERT3P(meta_hdr, !=, NULL); type = ARC_BUFC_METADATA; } else if (meta_hdr == NULL) { ASSERT3P(data_hdr, !=, NULL); type = ARC_BUFC_DATA; } else { ASSERT3P(data_hdr, !=, NULL); ASSERT3P(meta_hdr, !=, NULL); /* The headers can't be on the sublist without an L1 header */ ASSERT(HDR_HAS_L1HDR(data_hdr)); ASSERT(HDR_HAS_L1HDR(meta_hdr)); if (data_hdr->b_l1hdr.b_arc_access < meta_hdr->b_l1hdr.b_arc_access) { type = ARC_BUFC_DATA; } else { type = ARC_BUFC_METADATA; } } multilist_sublist_unlock(meta_mls); multilist_sublist_unlock(data_mls); return (type); } /* * Evict buffers from the cache, such that arc_size is capped by arc_c. */ static uint64_t arc_adjust(void) { uint64_t total_evicted = 0; uint64_t bytes; int64_t target; uint64_t asize = aggsum_value(&arc_size); uint64_t ameta = aggsum_value(&arc_meta_used); /* * If we're over arc_meta_limit, we want to correct that before * potentially evicting data buffers below. */ total_evicted += arc_adjust_meta(ameta); /* * Adjust MRU size * * If we're over the target cache size, we want to evict enough * from the list to get back to our target size. We don't want * to evict too much from the MRU, such that it drops below * arc_p. 
	 * So, if we're over our target cache size more than
	 * the MRU is over arc_p, we'll evict enough to get back to
	 * arc_p here, and then evict more from the MFU below.
	 */
	target = MIN((int64_t)(asize - arc_c),
	    (int64_t)(refcount_count(&arc_anon->arcs_size) +
	    refcount_count(&arc_mru->arcs_size) + ameta - arc_p));

	/*
	 * If we're below arc_meta_min, always prefer to evict data.
	 * Otherwise, try to satisfy the requested number of bytes to
	 * evict from the type which contains older buffers; in an
	 * effort to keep newer buffers in the cache regardless of their
	 * type.  If we cannot satisfy the number of bytes from this
	 * type, spill over into the next type.
	 */
	if (arc_adjust_type(arc_mru) == ARC_BUFC_METADATA &&
	    ameta > arc_meta_min) {
		bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
		total_evicted += bytes;

		/*
		 * If we couldn't evict our target number of bytes from
		 * metadata, we try to get the rest from data.
		 */
		target -= bytes;

		total_evicted +=
		    arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
	} else {
		bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
		total_evicted += bytes;

		/*
		 * If we couldn't evict our target number of bytes from
		 * data, we try to get the rest from metadata.
		 */
		target -= bytes;

		total_evicted +=
		    arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
	}

	/*
	 * Adjust MFU size
	 *
	 * Now that we've tried to evict enough from the MRU to get its
	 * size back to arc_p, if we're still above the target cache
	 * size, we evict the rest from the MFU.
	 */
	target = asize - arc_c;

	if (arc_adjust_type(arc_mfu) == ARC_BUFC_METADATA &&
	    ameta > arc_meta_min) {
		bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
		total_evicted += bytes;

		/*
		 * If we couldn't evict our target number of bytes from
		 * metadata, we try to get the rest from data.
		 */
		target -= bytes;

		total_evicted +=
		    arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
	} else {
		bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
		total_evicted += bytes;

		/*
		 * If we couldn't evict our target number of bytes from
		 * data, we try to get the rest from metadata.
		 */
		target -= bytes;

		total_evicted +=
		    arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
	}

	/*
	 * Adjust ghost lists
	 *
	 * In addition to the above, the ARC also defines target values
	 * for the ghost lists.  The sum of the mru list and mru ghost
	 * list should never exceed the target size of the cache, and
	 * the sum of the mru list, mfu list, mru ghost list, and mfu
	 * ghost list should never exceed twice the target size of the
	 * cache.  The following logic enforces these limits on the ghost
	 * caches, and evicts from them as needed.
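	 *
	 * For example (illustrative numbers only): with arc_c = 1 GB and
	 * an mru list of 700 MB, the mru ghost list is trimmed to at most
	 * 300 MB, and the mru ghost + mfu ghost total is then trimmed to
	 * at most a further 1 GB.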
*/ target = refcount_count(&arc_mru->arcs_size) + refcount_count(&arc_mru_ghost->arcs_size) - arc_c; bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA); total_evicted += bytes; target -= bytes; total_evicted += arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA); /* * We assume the sum of the mru list and mfu list is less than * or equal to arc_c (we enforced this above), which means we * can use the simpler of the two equations below: * * mru + mfu + mru ghost + mfu ghost <= 2 * arc_c * mru ghost + mfu ghost <= arc_c */ target = refcount_count(&arc_mru_ghost->arcs_size) + refcount_count(&arc_mfu_ghost->arcs_size) - arc_c; bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA); total_evicted += bytes; target -= bytes; total_evicted += arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA); return (total_evicted); } void arc_flush(spa_t *spa, boolean_t retry) { uint64_t guid = 0; /* * If retry is B_TRUE, a spa must not be specified since we have * no good way to determine if all of a spa's buffers have been * evicted from an arc state. */ ASSERT(!retry || spa == 0); if (spa != NULL) guid = spa_load_guid(spa); (void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry); (void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry); (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry); (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry); (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry); (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry); (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry); (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry); } void arc_shrink(int64_t to_free) { uint64_t asize = aggsum_value(&arc_size); if (arc_c > arc_c_min) { DTRACE_PROBE4(arc__shrink, uint64_t, arc_c, uint64_t, arc_c_min, uint64_t, arc_p, uint64_t, to_free); if (arc_c > arc_c_min + to_free) atomic_add_64(&arc_c, -to_free); else arc_c = arc_c_min; atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift)); if (asize < arc_c) arc_c = MAX(asize, arc_c_min); if (arc_p > arc_c) arc_p = (arc_c >> 1); DTRACE_PROBE2(arc__shrunk, uint64_t, arc_c, uint64_t, arc_p); ASSERT(arc_c >= arc_c_min); ASSERT((int64_t)arc_p >= 0); } if (asize > arc_c) { DTRACE_PROBE2(arc__shrink_adjust, uint64_t, asize, uint64_t, arc_c); (void) arc_adjust(); } } typedef enum free_memory_reason_t { FMR_UNKNOWN, FMR_NEEDFREE, FMR_LOTSFREE, FMR_SWAPFS_MINFREE, FMR_PAGES_PP_MAXIMUM, FMR_HEAP_ARENA, FMR_ZIO_ARENA, } free_memory_reason_t; int64_t last_free_memory; free_memory_reason_t last_free_reason; /* * Additional reserve of pages for pp_reserve. */ int64_t arc_pages_pp_reserve = 64; /* * Additional reserve of pages for swapfs. */ int64_t arc_swapfs_reserve = 64; /* * Return the amount of memory that can be consumed before reclaim will be * needed. Positive if there is sufficient free memory, negative indicates * the amount of memory that needs to be freed up. */ static int64_t arc_available_memory(void) { int64_t lowest = INT64_MAX; int64_t n; free_memory_reason_t r = FMR_UNKNOWN; #ifdef _KERNEL #ifdef __FreeBSD__ /* * Cooperate with pagedaemon when it's time for it to scan * and reclaim some pages. */ n = PAGESIZE * ((int64_t)freemem - zfs_arc_free_target); if (n < lowest) { lowest = n; r = FMR_LOTSFREE; } #else if (needfree > 0) { n = PAGESIZE * (-needfree); if (n < lowest) { lowest = n; r = FMR_NEEDFREE; } } /* * check that we're out of range of the pageout scanner. 
It starts to * schedule paging if freemem is less than lotsfree and needfree. * lotsfree is the high-water mark for pageout, and needfree is the * number of needed free pages. We add extra pages here to make sure * the scanner doesn't start up while we're freeing memory. */ n = PAGESIZE * (freemem - lotsfree - needfree - desfree); if (n < lowest) { lowest = n; r = FMR_LOTSFREE; } /* * check to make sure that swapfs has enough space so that anon * reservations can still succeed. anon_resvmem() checks that the * availrmem is greater than swapfs_minfree, and the number of reserved * swap pages. We also add a bit of extra here just to prevent * circumstances from getting really dire. */ n = PAGESIZE * (availrmem - swapfs_minfree - swapfs_reserve - desfree - arc_swapfs_reserve); if (n < lowest) { lowest = n; r = FMR_SWAPFS_MINFREE; } /* * Check that we have enough availrmem that memory locking (e.g., via * mlock(3C) or memcntl(2)) can still succeed. (pages_pp_maximum * stores the number of pages that cannot be locked; when availrmem * drops below pages_pp_maximum, page locking mechanisms such as * page_pp_lock() will fail.) */ n = PAGESIZE * (availrmem - pages_pp_maximum - arc_pages_pp_reserve); if (n < lowest) { lowest = n; r = FMR_PAGES_PP_MAXIMUM; } #endif /* __FreeBSD__ */ #if defined(__i386) || !defined(UMA_MD_SMALL_ALLOC) /* * If we're on an i386 platform, it's possible that we'll exhaust the * kernel heap space before we ever run out of available physical * memory. Most checks of the size of the heap_area compare against * tune.t_minarmem, which is the minimum available real memory that we * can have in the system. However, this is generally fixed at 25 pages * which is so low that it's useless. In this comparison, we seek to * calculate the total heap-size, and reclaim if more than 3/4ths of the * heap is allocated. (Or, in the calculation, if less than 1/4th is * free) */ n = uma_avail() - (long)(uma_limit() / 4); if (n < lowest) { lowest = n; r = FMR_HEAP_ARENA; } #endif /* * If zio data pages are being allocated out of a separate heap segment, * then enforce that the size of available vmem for this arena remains * above about 1/4th (1/(2^arc_zio_arena_free_shift)) free. * * Note that reducing the arc_zio_arena_free_shift keeps more virtual * memory (in the zio_arena) free, which can avoid memory * fragmentation issues. */ if (zio_arena != NULL) { n = (int64_t)vmem_size(zio_arena, VMEM_FREE) - (vmem_size(zio_arena, VMEM_ALLOC) >> arc_zio_arena_free_shift); if (n < lowest) { lowest = n; r = FMR_ZIO_ARENA; } } #else /* _KERNEL */ /* Every 100 calls, free a small amount */ if (spa_get_random(100) == 0) lowest = -1024; #endif /* _KERNEL */ last_free_memory = lowest; last_free_reason = r; DTRACE_PROBE2(arc__available_memory, int64_t, lowest, int, r); return (lowest); } /* * Determine if the system is under memory pressure and is asking * to reclaim memory. A return value of B_TRUE indicates that the system * is under memory pressure and that the arc should adjust accordingly. 
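 *
 * For example (reason and value assumed for illustration): if
 * the tightest margin found by arc_available_memory() is the
 * zio arena check at -64KB, then last_free_reason is set to
 * FMR_ZIO_ARENA and arc_reclaim_needed() returns B_TRUE,
 * prompting the reclaim thread to shrink the ARC.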
*/ static boolean_t arc_reclaim_needed(void) { return (arc_available_memory() < 0); } extern kmem_cache_t *zio_buf_cache[]; extern kmem_cache_t *zio_data_buf_cache[]; extern kmem_cache_t *range_seg_cache; extern kmem_cache_t *abd_chunk_cache; static __noinline void arc_kmem_reap_now(void) { size_t i; kmem_cache_t *prev_cache = NULL; kmem_cache_t *prev_data_cache = NULL; DTRACE_PROBE(arc__kmem_reap_start); #ifdef _KERNEL if (aggsum_compare(&arc_meta_used, arc_meta_limit) >= 0) { /* * We are exceeding our meta-data cache limit. * Purge some DNLC entries to release holds on meta-data. */ dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent); } #if defined(__i386) /* * Reclaim unused memory from all kmem caches. */ kmem_reap(); #endif #endif /* * If a kmem reap is already active, don't schedule more. We must * check for this because kmem_cache_reap_soon() won't actually * block on the cache being reaped (this is to prevent callers from * becoming implicitly blocked by a system-wide kmem reap -- which, * on a system with many, many full magazines, can take minutes). */ if (kmem_cache_reap_active()) return; for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) { if (zio_buf_cache[i] != prev_cache) { prev_cache = zio_buf_cache[i]; kmem_cache_reap_soon(zio_buf_cache[i]); } if (zio_data_buf_cache[i] != prev_data_cache) { prev_data_cache = zio_data_buf_cache[i]; kmem_cache_reap_soon(zio_data_buf_cache[i]); } } kmem_cache_reap_soon(abd_chunk_cache); kmem_cache_reap_soon(buf_cache); kmem_cache_reap_soon(hdr_full_cache); kmem_cache_reap_soon(hdr_l2only_cache); kmem_cache_reap_soon(range_seg_cache); #ifdef illumos if (zio_arena != NULL) { /* * Ask the vmem arena to reclaim unused memory from its * quantum caches. */ vmem_qcache_reap(zio_arena); } #endif DTRACE_PROBE(arc__kmem_reap_end); } /* * Threads can block in arc_get_data_impl() waiting for this thread to evict * enough data and signal them to proceed. When this happens, the threads in * arc_get_data_impl() are sleeping while holding the hash lock for their * particular arc header. Thus, we must be careful to never sleep on a * hash lock in this thread. This is to prevent the following deadlock: * * - Thread A sleeps on CV in arc_get_data_impl() holding hash lock "L", * waiting for the reclaim thread to signal it. * * - arc_reclaim_thread() tries to acquire hash lock "L" using mutex_enter, * fails, and goes to sleep forever. * * This possible deadlock is avoided by always acquiring a hash lock * using mutex_tryenter() from arc_reclaim_thread(). */ /* ARGSUSED */ static void arc_reclaim_thread(void *unused __unused) { hrtime_t growtime = 0; hrtime_t kmem_reap_time = 0; callb_cpr_t cpr; CALLB_CPR_INIT(&cpr, &arc_reclaim_lock, callb_generic_cpr, FTAG); mutex_enter(&arc_reclaim_lock); while (!arc_reclaim_thread_exit) { uint64_t evicted = 0; /* * This is necessary in order for the mdb ::arc dcmd to * show up to date information. Since the ::arc command * does not call the kstat's update function, without * this call, the command may show stale stats for the * anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even * with this change, the data might be up to 1 second * out of date; but that should suffice. The arc_state_t * structures can be queried directly if more accurate * information is needed. */ if (arc_ksp != NULL) arc_ksp->ks_update(arc_ksp, KSTAT_READ); mutex_exit(&arc_reclaim_lock); /* * We call arc_adjust() before (possibly) calling * arc_kmem_reap_now(), so that we can wake up * arc_get_data_impl() sooner. 
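 *
 * For example (illustrative scenario): a thread blocked in
 * arc_get_data_impl() on arc_reclaim_waiters_cv can be woken
 * as soon as arc_adjust() frees enough space, rather than
 * having to wait out a kmem cache reap that may take far
 * longer to complete.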
*/ evicted = arc_adjust(); int64_t free_memory = arc_available_memory(); if (free_memory < 0) { hrtime_t curtime = gethrtime(); arc_no_grow = B_TRUE; arc_warm = B_TRUE; /* * Wait at least zfs_grow_retry (default 60) seconds * before considering growing. */ growtime = curtime + SEC2NSEC(arc_grow_retry); /* * Wait at least arc_kmem_cache_reap_retry_ms * between arc_kmem_reap_now() calls. Without * this check it is possible to end up in a * situation where we spend lots of time * reaping caches, while we're near arc_c_min. */ if (curtime >= kmem_reap_time) { arc_kmem_reap_now(); kmem_reap_time = gethrtime() + MSEC2NSEC(arc_kmem_cache_reap_retry_ms); } /* * If we are still low on memory, shrink the ARC * so that we have arc_shrink_min free space. */ free_memory = arc_available_memory(); int64_t to_free = (arc_c >> arc_shrink_shift) - free_memory; if (to_free > 0) { #ifdef _KERNEL #ifdef illumos to_free = MAX(to_free, ptob(needfree)); #endif #endif arc_shrink(to_free); } } else if (free_memory < arc_c >> arc_no_grow_shift) { arc_no_grow = B_TRUE; } else if (gethrtime() >= growtime) { arc_no_grow = B_FALSE; } mutex_enter(&arc_reclaim_lock); /* * If evicted is zero, we couldn't evict anything via * arc_adjust(). This could be due to hash lock * collisions, but more likely due to the majority of * arc buffers being unevictable. Therefore, even if * arc_size is above arc_c, another pass is unlikely to * be helpful and could potentially cause us to enter an * infinite loop. */ if (aggsum_compare(&arc_size, arc_c) <= 0 || evicted == 0) { /* * We're either no longer overflowing, or we * can't evict anything more, so we should wake * up any threads before we go to sleep. */ cv_broadcast(&arc_reclaim_waiters_cv); /* * Block until signaled, or after one second (we * might need to perform arc_kmem_reap_now() * even if we aren't being signalled) */ CALLB_CPR_SAFE_BEGIN(&cpr); (void) cv_timedwait_hires(&arc_reclaim_thread_cv, &arc_reclaim_lock, SEC2NSEC(1), MSEC2NSEC(1), 0); CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_lock); } } arc_reclaim_thread_exit = B_FALSE; cv_broadcast(&arc_reclaim_thread_cv); CALLB_CPR_EXIT(&cpr); /* drops arc_reclaim_lock */ thread_exit(); } static u_int arc_dnlc_evicts_arg; extern struct vfsops zfs_vfsops; static void arc_dnlc_evicts_thread(void *dummy __unused) { callb_cpr_t cpr; u_int percent; CALLB_CPR_INIT(&cpr, &arc_dnlc_evicts_lock, callb_generic_cpr, FTAG); mutex_enter(&arc_dnlc_evicts_lock); while (!arc_dnlc_evicts_thread_exit) { CALLB_CPR_SAFE_BEGIN(&cpr); (void) cv_wait(&arc_dnlc_evicts_cv, &arc_dnlc_evicts_lock); CALLB_CPR_SAFE_END(&cpr, &arc_dnlc_evicts_lock); if (arc_dnlc_evicts_arg != 0) { percent = arc_dnlc_evicts_arg; mutex_exit(&arc_dnlc_evicts_lock); #ifdef _KERNEL vnlru_free(desiredvnodes * percent / 100, &zfs_vfsops); #endif mutex_enter(&arc_dnlc_evicts_lock); /* * Clear our token only after vnlru_free() * pass is done, to avoid false queueing of * the requests. */ arc_dnlc_evicts_arg = 0; } } arc_dnlc_evicts_thread_exit = FALSE; cv_broadcast(&arc_dnlc_evicts_cv); CALLB_CPR_EXIT(&cpr); thread_exit(); } void dnlc_reduce_cache(void *arg) { u_int percent; percent = (u_int)(uintptr_t)arg; mutex_enter(&arc_dnlc_evicts_lock); if (arc_dnlc_evicts_arg == 0) { arc_dnlc_evicts_arg = percent; cv_broadcast(&arc_dnlc_evicts_cv); } mutex_exit(&arc_dnlc_evicts_lock); } /* * Adapt arc info given the number of bytes we are trying to add and * the state that we are coming from. This function is only called * when we are adding new content to the cache.
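 *
 * A minimal sketch of the two adjustments performed below (the
 * variables are those of arc_adapt(); the ghost-list sizes are
 * assumptions chosen purely for illustration, and the block is
 * not compiled):
 */
#if 0	/* illustrative sketch only */
	/* Hit in the MRU ghost list, with mrug_size = 10, mfug_size = 40: */
	mult = MIN(mfug_size / mrug_size, 10);		/* mult = 4 */
	arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);

	/* Hit in the MFU ghost list, with mfug_size = 10, mrug_size = 40: */
	mult = MIN(mrug_size / mfug_size, 10);		/* mult = 4 */
	delta = MIN(bytes * mult, arc_p);
	arc_p = MAX(arc_p_min, arc_p - delta);
#endif
/*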
*/ static void arc_adapt(int bytes, arc_state_t *state) { int mult; uint64_t arc_p_min = (arc_c >> arc_p_min_shift); int64_t mrug_size = refcount_count(&arc_mru_ghost->arcs_size); int64_t mfug_size = refcount_count(&arc_mfu_ghost->arcs_size); if (state == arc_l2c_only) return; ASSERT(bytes > 0); /* * Adapt the target size of the MRU list: * - if we just hit in the MRU ghost list, then increase * the target size of the MRU list. * - if we just hit in the MFU ghost list, then increase * the target size of the MFU list by decreasing the * target size of the MRU list. */ if (state == arc_mru_ghost) { mult = (mrug_size >= mfug_size) ? 1 : (mfug_size / mrug_size); mult = MIN(mult, 10); /* avoid wild arc_p adjustment */ arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult); } else if (state == arc_mfu_ghost) { uint64_t delta; mult = (mfug_size >= mrug_size) ? 1 : (mrug_size / mfug_size); mult = MIN(mult, 10); delta = MIN(bytes * mult, arc_p); arc_p = MAX(arc_p_min, arc_p - delta); } ASSERT((int64_t)arc_p >= 0); if (arc_reclaim_needed()) { cv_signal(&arc_reclaim_thread_cv); return; } if (arc_no_grow) return; if (arc_c >= arc_c_max) return; /* * If we're within (2 * maxblocksize) bytes of the target * cache size, increment the target cache size */ if (aggsum_compare(&arc_size, arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) > 0) { DTRACE_PROBE1(arc__inc_adapt, int, bytes); atomic_add_64(&arc_c, (int64_t)bytes); if (arc_c > arc_c_max) arc_c = arc_c_max; else if (state == arc_anon) atomic_add_64(&arc_p, (int64_t)bytes); if (arc_p > arc_c) arc_p = arc_c; } ASSERT((int64_t)arc_p >= 0); } /* * Check if arc_size has grown past our upper threshold, determined by * zfs_arc_overflow_shift. */ static boolean_t arc_is_overflowing(void) { /* Always allow at least one block of overflow */ uint64_t overflow = MAX(SPA_MAXBLOCKSIZE, arc_c >> zfs_arc_overflow_shift); /* * We just compare the lower bound here for performance reasons. Our * primary goals are to make sure that the arc never grows without * bound, and that it can reach its maximum size. This check * accomplishes both goals. The maximum amount we could run over by is * 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block * in the ARC. In practice, that's in the tens of MB, which is low * enough to be safe. */ return (aggsum_lower_bound(&arc_size) >= arc_c + overflow); } static abd_t * arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, void *tag) { arc_buf_contents_t type = arc_buf_type(hdr); arc_get_data_impl(hdr, size, tag); if (type == ARC_BUFC_METADATA) { return (abd_alloc(size, B_TRUE)); } else { ASSERT(type == ARC_BUFC_DATA); return (abd_alloc(size, B_FALSE)); } } static void * arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, void *tag) { arc_buf_contents_t type = arc_buf_type(hdr); arc_get_data_impl(hdr, size, tag); if (type == ARC_BUFC_METADATA) { return (zio_buf_alloc(size)); } else { ASSERT(type == ARC_BUFC_DATA); return (zio_data_buf_alloc(size)); } } /* * Allocate a block and return it to the caller. If we are hitting the * hard limit for the cache size, we must sleep, waiting for the eviction * thread to catch up. If we're past the target size but below the hard * limit, we'll only signal the reclaim thread and continue on. 
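 *
 * For example (values assumed for illustration): with
 * arc_c = 1GB and zfs_arc_overflow_shift = 8, the margin used
 * by arc_is_overflowing() is MAX(SPA_MAXBLOCKSIZE, 1GB >> 8) =
 * 16MB, so allocating threads only block once arc_size has
 * grown past arc_c + 16MB.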
*/ static void arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag) { arc_state_t *state = hdr->b_l1hdr.b_state; arc_buf_contents_t type = arc_buf_type(hdr); arc_adapt(size, state); /* * If arc_size is currently overflowing, and has grown past our * upper limit, we must be adding data faster than the evict * thread can evict. Thus, to ensure we don't compound the * problem by adding more data and forcing arc_size to grow even * further past its target size, we halt and wait for the * eviction thread to catch up. * * It's also possible that the reclaim thread is unable to evict * enough buffers to get arc_size below the overflow limit (e.g. * due to buffers being un-evictable, or hash lock collisions). * In this case, we want to proceed regardless of whether we're * overflowing; thus we don't use a while loop here. */ if (arc_is_overflowing()) { mutex_enter(&arc_reclaim_lock); /* * Now that we've acquired the lock, we may no longer be * over the overflow limit, let's check. * * We're ignoring the case of spurious wake ups. If that * were to happen, it'd let this thread consume an ARC * buffer before it should have (i.e. before we're under * the overflow limit and were signalled by the reclaim * thread). As long as that is a rare occurrence, it * shouldn't cause any harm. */ if (arc_is_overflowing()) { cv_signal(&arc_reclaim_thread_cv); cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock); } mutex_exit(&arc_reclaim_lock); } VERIFY3U(hdr->b_type, ==, type); if (type == ARC_BUFC_METADATA) { arc_space_consume(size, ARC_SPACE_META); } else { arc_space_consume(size, ARC_SPACE_DATA); } /* * Update the state size. Note that ghost states have a * "ghost size" and so don't need to be updated. */ if (!GHOST_STATE(state)) { (void) refcount_add_many(&state->arcs_size, size, tag); /* * If this is reached via arc_read, the link is * protected by the hash lock. If reached via * arc_buf_alloc, the header should not be accessed by * any other thread. And, if reached via arc_read_done, * the hash lock will protect it if it's found in the * hash table; otherwise no other thread should be * trying to [add|remove]_reference it. */ if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) { ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); (void) refcount_add_many(&state->arcs_esize[type], size, tag); } /* * If we are growing the cache, and we are adding anonymous * data, and we have outgrown arc_p, update arc_p */ if (aggsum_compare(&arc_size, arc_c) < 0 && hdr->b_l1hdr.b_state == arc_anon && (refcount_count(&arc_anon->arcs_size) + refcount_count(&arc_mru->arcs_size) > arc_p)) arc_p = MIN(arc_c, arc_p + size); } ARCSTAT_BUMP(arcstat_allocated); } static void arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size, void *tag) { arc_free_data_impl(hdr, size, tag); abd_free(abd); } static void arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, void *tag) { arc_buf_contents_t type = arc_buf_type(hdr); arc_free_data_impl(hdr, size, tag); if (type == ARC_BUFC_METADATA) { zio_buf_free(buf, size); } else { ASSERT(type == ARC_BUFC_DATA); zio_data_buf_free(buf, size); } } /* * Free the arc data buffer.
*/ static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag) { arc_state_t *state = hdr->b_l1hdr.b_state; arc_buf_contents_t type = arc_buf_type(hdr); /* protected by hash lock, if in the hash table */ if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) { ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); ASSERT(state != arc_anon && state != arc_l2c_only); (void) refcount_remove_many(&state->arcs_esize[type], size, tag); } (void) refcount_remove_many(&state->arcs_size, size, tag); VERIFY3U(hdr->b_type, ==, type); if (type == ARC_BUFC_METADATA) { arc_space_return(size, ARC_SPACE_META); } else { ASSERT(type == ARC_BUFC_DATA); arc_space_return(size, ARC_SPACE_DATA); } } /* * This routine is called whenever a buffer is accessed. * NOTE: the hash lock is dropped in this function. */ static void arc_access(arc_buf_hdr_t *hdr, kmutex_t *hash_lock) { clock_t now; ASSERT(MUTEX_HELD(hash_lock)); ASSERT(HDR_HAS_L1HDR(hdr)); if (hdr->b_l1hdr.b_state == arc_anon) { /* * This buffer is not in the cache, and does not * appear in our "ghost" list. Add the new buffer * to the MRU state. */ ASSERT0(hdr->b_l1hdr.b_arc_access); hdr->b_l1hdr.b_arc_access = ddi_get_lbolt(); DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr); arc_change_state(arc_mru, hdr, hash_lock); } else if (hdr->b_l1hdr.b_state == arc_mru) { now = ddi_get_lbolt(); /* * If this buffer is here because of a prefetch, then either: * - clear the flag if this is a "referencing" read * (any subsequent access will bump this into the MFU state). * or * - move the buffer to the head of the list if this is * another prefetch (to make it less likely to be evicted). */ if (HDR_PREFETCH(hdr)) { if (refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) { /* link protected by hash lock */ ASSERT(multilist_link_active( &hdr->b_l1hdr.b_arc_node)); } else { arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH); ARCSTAT_BUMP(arcstat_mru_hits); } hdr->b_l1hdr.b_arc_access = now; return; } /* * This buffer has been "accessed" only once so far, * but it is still in the cache. Move it to the MFU * state. */ if (now > hdr->b_l1hdr.b_arc_access + ARC_MINTIME) { /* * More than 125ms have passed since we * instantiated this buffer. Move it to the * most frequently used state. */ hdr->b_l1hdr.b_arc_access = now; DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); arc_change_state(arc_mfu, hdr, hash_lock); } ARCSTAT_BUMP(arcstat_mru_hits); } else if (hdr->b_l1hdr.b_state == arc_mru_ghost) { arc_state_t *new_state; /* * This buffer has been "accessed" recently, but * was evicted from the cache. Move it to the * MFU state. */ if (HDR_PREFETCH(hdr)) { new_state = arc_mru; if (refcount_count(&hdr->b_l1hdr.b_refcnt) > 0) arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH); DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr); } else { new_state = arc_mfu; DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); } hdr->b_l1hdr.b_arc_access = ddi_get_lbolt(); arc_change_state(new_state, hdr, hash_lock); ARCSTAT_BUMP(arcstat_mru_ghost_hits); } else if (hdr->b_l1hdr.b_state == arc_mfu) { /* * This buffer has been accessed more than once and is * still in the cache. Keep it in the MFU state. * * NOTE: an add_reference() that occurred when we did * the arc_read() will have kicked this off the list. * If it was a prefetch, we will explicitly move it to * the head of the list now. 
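 *
 * As an aside, the state transitions implemented by
 * arc_access() can be summarized as follows (informal recap,
 * not an exhaustive statement of the rules):
 *
 *	current state	a demand access moves the header to
 *	-------------	-----------------------------------
 *	anon		mru
 *	mru		mfu (once ARC_MINTIME has elapsed)
 *	mru_ghost	mfu (mru if it was a prefetch)
 *	mfu		mfu (no change)
 *	mfu_ghost	mfu (mru if it was a prefetch)
 *	l2c_only	mfu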
*/ if ((HDR_PREFETCH(hdr)) != 0) { ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); /* link protected by hash_lock */ ASSERT(multilist_link_active(&hdr->b_l1hdr.b_arc_node)); } ARCSTAT_BUMP(arcstat_mfu_hits); hdr->b_l1hdr.b_arc_access = ddi_get_lbolt(); } else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) { arc_state_t *new_state = arc_mfu; /* * This buffer has been accessed more than once but has * been evicted from the cache. Move it back to the * MFU state. */ if (HDR_PREFETCH(hdr)) { /* * This is a prefetch access... * move this block back to the MRU state. */ ASSERT0(refcount_count(&hdr->b_l1hdr.b_refcnt)); new_state = arc_mru; } hdr->b_l1hdr.b_arc_access = ddi_get_lbolt(); DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); arc_change_state(new_state, hdr, hash_lock); ARCSTAT_BUMP(arcstat_mfu_ghost_hits); } else if (hdr->b_l1hdr.b_state == arc_l2c_only) { /* * This buffer is on the 2nd Level ARC. */ hdr->b_l1hdr.b_arc_access = ddi_get_lbolt(); DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); arc_change_state(arc_mfu, hdr, hash_lock); } else { ASSERT(!"invalid arc state"); } } /* a generic arc_done_func_t which you can use */ /* ARGSUSED */ void arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg) { if (zio == NULL || zio->io_error == 0) bcopy(buf->b_data, arg, arc_buf_size(buf)); arc_buf_destroy(buf, arg); } /* a generic arc_done_func_t */ void arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg) { arc_buf_t **bufp = arg; if (zio && zio->io_error) { arc_buf_destroy(buf, arg); *bufp = NULL; } else { *bufp = buf; ASSERT(buf->b_data); } } static void arc_hdr_verify(arc_buf_hdr_t *hdr, blkptr_t *bp) { if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) { ASSERT3U(HDR_GET_PSIZE(hdr), ==, 0); ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF); } else { if (HDR_COMPRESSION_ENABLED(hdr)) { ASSERT3U(HDR_GET_COMPRESS(hdr), ==, BP_GET_COMPRESS(bp)); } ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp)); ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp)); } } static void arc_read_done(zio_t *zio) { arc_buf_hdr_t *hdr = zio->io_private; kmutex_t *hash_lock = NULL; arc_callback_t *callback_list; arc_callback_t *acb; boolean_t freeable = B_FALSE; boolean_t no_zio_error = (zio->io_error == 0); /* * The hdr was inserted into hash-table and removed from lists * prior to starting I/O. We should find this header, since * it's in the hash table, and it should be legit since it's * not possible to evict it during the I/O. The only possible * reason for it not to be found is if we were freed during the * read. 
*/ if (HDR_IN_HASH_TABLE(hdr)) { ASSERT3U(hdr->b_birth, ==, BP_PHYSICAL_BIRTH(zio->io_bp)); ASSERT3U(hdr->b_dva.dva_word[0], ==, BP_IDENTITY(zio->io_bp)->dva_word[0]); ASSERT3U(hdr->b_dva.dva_word[1], ==, BP_IDENTITY(zio->io_bp)->dva_word[1]); arc_buf_hdr_t *found = buf_hash_find(hdr->b_spa, zio->io_bp, &hash_lock); ASSERT((found == hdr && DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) || (found == hdr && HDR_L2_READING(hdr))); ASSERT3P(hash_lock, !=, NULL); } if (no_zio_error) { /* byteswap if necessary */ if (BP_SHOULD_BYTESWAP(zio->io_bp)) { if (BP_GET_LEVEL(zio->io_bp) > 0) { hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64; } else { hdr->b_l1hdr.b_byteswap = DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp)); } } else { hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS; } } arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED); if (l2arc_noprefetch && HDR_PREFETCH(hdr)) arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE); callback_list = hdr->b_l1hdr.b_acb; ASSERT3P(callback_list, !=, NULL); if (hash_lock && no_zio_error && hdr->b_l1hdr.b_state == arc_anon) { /* * Only call arc_access on anonymous buffers. This is because * if we've issued an I/O for an evicted buffer, we've already * called arc_access (to prevent any simultaneous readers from * getting confused). */ arc_access(hdr, hash_lock); } /* * If a read request has a callback (i.e. acb_done is not NULL), then we * make a buf containing the data according to the parameters which were * passed in. The implementation of arc_buf_alloc_impl() ensures that we * aren't needlessly decompressing the data multiple times. */ int callback_cnt = 0; for (acb = callback_list; acb != NULL; acb = acb->acb_next) { if (!acb->acb_done) continue; /* This is a demand read since prefetches don't use callbacks */ callback_cnt++; int error = arc_buf_alloc_impl(hdr, acb->acb_private, acb->acb_compressed, no_zio_error, &acb->acb_buf); if (no_zio_error) { zio->io_error = error; } } hdr->b_l1hdr.b_acb = NULL; arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); if (callback_cnt == 0) { ASSERT(HDR_PREFETCH(hdr)); ASSERT0(hdr->b_l1hdr.b_bufcnt); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); } ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt) || callback_list != NULL); if (no_zio_error) { arc_hdr_verify(hdr, zio->io_bp); } else { arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR); if (hdr->b_l1hdr.b_state != arc_anon) arc_change_state(arc_anon, hdr, hash_lock); if (HDR_IN_HASH_TABLE(hdr)) buf_hash_remove(hdr); freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt); } /* * Broadcast before we drop the hash_lock to avoid the possibility * that the hdr (and hence the cv) might be freed before we get to * the cv_broadcast(). */ cv_broadcast(&hdr->b_l1hdr.b_cv); if (hash_lock != NULL) { mutex_exit(hash_lock); } else { /* * This block was freed while we waited for the read to * complete. It has been removed from the hash table and * moved to the anonymous state (so that it won't show up * in the cache). */ ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt); } /* execute each callback and free its structure */ while ((acb = callback_list) != NULL) { if (acb->acb_done) acb->acb_done(zio, acb->acb_buf, acb->acb_private); if (acb->acb_zio_dummy != NULL) { acb->acb_zio_dummy->io_error = zio->io_error; zio_nowait(acb->acb_zio_dummy); } callback_list = acb->acb_next; kmem_free(acb, sizeof (arc_callback_t)); } if (freeable) arc_hdr_destroy(hdr); } /* * "Read" the block at the specified DVA (in bp) via the * cache. 
If the block is found in the cache, invoke the provided * callback immediately and return. Note that the `zio' parameter * in the callback will be NULL in this case, since no IO was * required. If the block is not in the cache pass the read request * on to the spa with a substitute callback function, so that the * requested block will be added to the cache. * * If a read request arrives for a block that has a read in-progress, * either wait for the in-progress read to complete (and return the * results); or, if this is a read with a "done" func, add a record * to the read to invoke the "done" func when the read completes, * and return; or just return. * * arc_read_done() will invoke all the requested "done" functions * for readers of this block. */ int arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_done_func_t *done, void *private, zio_priority_t priority, int zio_flags, arc_flags_t *arc_flags, const zbookmark_phys_t *zb) { arc_buf_hdr_t *hdr = NULL; kmutex_t *hash_lock = NULL; zio_t *rzio; uint64_t guid = spa_load_guid(spa); boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW) != 0; ASSERT(!BP_IS_EMBEDDED(bp) || BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA); top: if (!BP_IS_EMBEDDED(bp)) { /* * Embedded BP's have no DVA and require no I/O to "read". * Create an anonymous arc buf to back it. */ hdr = buf_hash_find(guid, bp, &hash_lock); } if (hdr != NULL && HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_pabd != NULL) { arc_buf_t *buf = NULL; *arc_flags |= ARC_FLAG_CACHED; if (HDR_IO_IN_PROGRESS(hdr)) { if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) && priority == ZIO_PRIORITY_SYNC_READ) { /* * This sync read must wait for an * in-progress async read (e.g. a predictive * prefetch). Async reads are queued * separately at the vdev_queue layer, so * this is a form of priority inversion. * Ideally, we would "inherit" the demand * i/o's priority by moving the i/o from * the async queue to the synchronous queue, * but there is currently no mechanism to do * so. Track this so that we can evaluate * the magnitude of this potential performance * problem. * * Note that if the prefetch i/o is already * active (has been issued to the device), * the prefetch improved performance, because * we issued it sooner than we would have * without the prefetch. */ DTRACE_PROBE1(arc__sync__wait__for__async, arc_buf_hdr_t *, hdr); ARCSTAT_BUMP(arcstat_sync_wait_for_async); } if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) { arc_hdr_clear_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH); } if (*arc_flags & ARC_FLAG_WAIT) { cv_wait(&hdr->b_l1hdr.b_cv, hash_lock); mutex_exit(hash_lock); goto top; } ASSERT(*arc_flags & ARC_FLAG_NOWAIT); if (done) { arc_callback_t *acb = NULL; acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP); acb->acb_done = done; acb->acb_private = private; acb->acb_compressed = compressed_read; if (pio != NULL) acb->acb_zio_dummy = zio_null(pio, spa, NULL, NULL, NULL, zio_flags); ASSERT3P(acb->acb_done, !=, NULL); acb->acb_next = hdr->b_l1hdr.b_acb; hdr->b_l1hdr.b_acb = acb; mutex_exit(hash_lock); return (0); } mutex_exit(hash_lock); return (0); } ASSERT(hdr->b_l1hdr.b_state == arc_mru || hdr->b_l1hdr.b_state == arc_mfu); if (done) { if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) { /* * This is a demand read which does not have to * wait for i/o because we did a predictive * prefetch i/o for it, which has completed. 
*/ DTRACE_PROBE1( arc__demand__hit__predictive__prefetch, arc_buf_hdr_t *, hdr); ARCSTAT_BUMP( arcstat_demand_hit_predictive_prefetch); arc_hdr_clear_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH); } ASSERT(!BP_IS_EMBEDDED(bp) || !BP_IS_HOLE(bp)); /* Get a buf with the desired data in it. */ VERIFY0(arc_buf_alloc_impl(hdr, private, compressed_read, B_TRUE, &buf)); } else if (*arc_flags & ARC_FLAG_PREFETCH && refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) { arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH); } DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr); arc_access(hdr, hash_lock); if (*arc_flags & ARC_FLAG_L2CACHE) arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE); mutex_exit(hash_lock); ARCSTAT_BUMP(arcstat_hits); ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr), demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data, metadata, hits); if (done) done(NULL, buf, private); } else { uint64_t lsize = BP_GET_LSIZE(bp); uint64_t psize = BP_GET_PSIZE(bp); arc_callback_t *acb; vdev_t *vd = NULL; uint64_t addr = 0; boolean_t devw = B_FALSE; uint64_t size; if (hdr == NULL) { /* this block is not in the cache */ arc_buf_hdr_t *exists = NULL; arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp); hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, BP_GET_COMPRESS(bp), type); if (!BP_IS_EMBEDDED(bp)) { hdr->b_dva = *BP_IDENTITY(bp); hdr->b_birth = BP_PHYSICAL_BIRTH(bp); exists = buf_hash_insert(hdr, &hash_lock); } if (exists != NULL) { /* somebody beat us to the hash insert */ mutex_exit(hash_lock); buf_discard_identity(hdr); arc_hdr_destroy(hdr); goto top; /* restart the IO request */ } } else { /* * This block is in the ghost cache. If it was L2-only * (and thus didn't have an L1 hdr), we realloc the * header to add an L1 hdr. */ if (!HDR_HAS_L1HDR(hdr)) { hdr = arc_hdr_realloc(hdr, hdr_l2only_cache, hdr_full_cache); } ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(GHOST_STATE(hdr->b_l1hdr.b_state)); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); /* * This is a delicate dance that we play here. * This hdr is in the ghost list so we access it * to move it out of the ghost list before we * initiate the read. If it's a prefetch then * it won't have a callback so we'll remove the * reference that arc_buf_alloc_impl() created. We * do this after we've called arc_access() to * avoid hitting an assert in remove_reference(). */ arc_access(hdr, hash_lock); arc_hdr_alloc_pabd(hdr); } ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); size = arc_hdr_size(hdr); /* * If compression is enabled on the hdr, then we will do * RAW I/O and will store the compressed data in the hdr's * data block. Otherwise, the hdr's data block will contain * the uncompressed data.
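 *
 * For example (sizes assumed for illustration): an
 * LZ4-compressed block with lsize = 128K and psize = 8K is
 * read with ZIO_FLAG_RAW into an 8K b_pabd, while an
 * uncompressed block of the same lsize is read into a full
 * 128K b_pabd without the flag.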
*/ if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF) { zio_flags |= ZIO_FLAG_RAW; } if (*arc_flags & ARC_FLAG_PREFETCH) arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH); if (*arc_flags & ARC_FLAG_L2CACHE) arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE); if (BP_GET_LEVEL(bp) > 0) arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT); if (*arc_flags & ARC_FLAG_PREDICTIVE_PREFETCH) arc_hdr_set_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH); ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state)); acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP); acb->acb_done = done; acb->acb_private = private; acb->acb_compressed = compressed_read; ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); hdr->b_l1hdr.b_acb = acb; arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); if (HDR_HAS_L2HDR(hdr) && (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) { devw = hdr->b_l2hdr.b_dev->l2ad_writing; addr = hdr->b_l2hdr.b_daddr; /* * Lock out L2ARC device removal. */ if (vdev_is_dead(vd) || !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER)) vd = NULL; } if (priority == ZIO_PRIORITY_ASYNC_READ) arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ); else arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ); if (hash_lock != NULL) mutex_exit(hash_lock); /* * At this point, we have a level 1 cache miss. Try again in * L2ARC if possible. */ ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize); DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp, uint64_t, lsize, zbookmark_phys_t *, zb); ARCSTAT_BUMP(arcstat_misses); ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr), demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data, metadata, misses); #ifdef _KERNEL #ifdef RACCT if (racct_enable) { PROC_LOCK(curproc); racct_add_force(curproc, RACCT_READBPS, size); racct_add_force(curproc, RACCT_READIOPS, 1); PROC_UNLOCK(curproc); } #endif /* RACCT */ curthread->td_ru.ru_inblock++; #endif if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) { /* * Read from the L2ARC if the following are true: * 1. The L2ARC vdev was previously cached. * 2. This buffer still has L2ARC metadata. * 3. This buffer isn't currently being written to the L2ARC. * 4. The L2ARC entry wasn't evicted, which may * also have invalidated the vdev. * 5. This isn't a prefetch with l2arc_noprefetch set. */ if (HDR_HAS_L2HDR(hdr) && !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) && !(l2arc_noprefetch && HDR_PREFETCH(hdr))) { l2arc_read_callback_t *cb; abd_t *abd; uint64_t asize; DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr); ARCSTAT_BUMP(arcstat_l2_hits); cb = kmem_zalloc(sizeof (l2arc_read_callback_t), KM_SLEEP); cb->l2rcb_hdr = hdr; cb->l2rcb_bp = *bp; cb->l2rcb_zb = *zb; cb->l2rcb_flags = zio_flags; asize = vdev_psize_to_asize(vd, size); if (asize != size) { abd = abd_alloc_for_io(asize, HDR_ISTYPE_METADATA(hdr)); cb->l2rcb_abd = abd; } else { abd = hdr->b_l1hdr.b_pabd; } ASSERT(addr >= VDEV_LABEL_START_SIZE && addr + asize <= vd->vdev_psize - VDEV_LABEL_END_SIZE); /* * l2arc read. The SCL_L2ARC lock will be * released by l2arc_read_done(). * Issue a null zio if the underlying buffer * was squashed to zero size by compression.
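 *
 * For example (values assumed for illustration): with
 * size = 3584 bytes on an L2ARC vdev whose allocation size
 * rounds up to 4096, a separate 4096-byte ABD is allocated
 * for the device read and the relevant bytes are copied back
 * in l2arc_read_done().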
*/ ASSERT3U(HDR_GET_COMPRESS(hdr), !=, ZIO_COMPRESS_EMPTY); rzio = zio_read_phys(pio, vd, addr, asize, abd, ZIO_CHECKSUM_OFF, l2arc_read_done, cb, priority, zio_flags | ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE); DTRACE_PROBE2(l2arc__read, vdev_t *, vd, zio_t *, rzio); ARCSTAT_INCR(arcstat_l2_read_bytes, size); if (*arc_flags & ARC_FLAG_NOWAIT) { zio_nowait(rzio); return (0); } ASSERT(*arc_flags & ARC_FLAG_WAIT); if (zio_wait(rzio) == 0) return (0); /* l2arc read error; goto zio_read() */ } else { DTRACE_PROBE1(l2arc__miss, arc_buf_hdr_t *, hdr); ARCSTAT_BUMP(arcstat_l2_misses); if (HDR_L2_WRITING(hdr)) ARCSTAT_BUMP(arcstat_l2_rw_clash); spa_config_exit(spa, SCL_L2ARC, vd); } } else { if (vd != NULL) spa_config_exit(spa, SCL_L2ARC, vd); if (l2arc_ndev != 0) { DTRACE_PROBE1(l2arc__miss, arc_buf_hdr_t *, hdr); ARCSTAT_BUMP(arcstat_l2_misses); } } rzio = zio_read(pio, spa, bp, hdr->b_l1hdr.b_pabd, size, arc_read_done, hdr, priority, zio_flags, zb); if (*arc_flags & ARC_FLAG_WAIT) return (zio_wait(rzio)); ASSERT(*arc_flags & ARC_FLAG_NOWAIT); zio_nowait(rzio); } return (0); } /* * Notify the arc that a block was freed, and thus will never be used again. */ void arc_freed(spa_t *spa, const blkptr_t *bp) { arc_buf_hdr_t *hdr; kmutex_t *hash_lock; uint64_t guid = spa_load_guid(spa); ASSERT(!BP_IS_EMBEDDED(bp)); hdr = buf_hash_find(guid, bp, &hash_lock); if (hdr == NULL) return; /* * We might be trying to free a block that is still doing I/O * (i.e. prefetch) or has a reference (i.e. a dedup-ed, * dmu_sync-ed block). If this block is being prefetched, then it * would still have the ARC_FLAG_IO_IN_PROGRESS flag set on the hdr * until the I/O completes. A block may also have a reference if it is * part of a dedup-ed, dmu_synced write. The dmu_sync() function would * have written the new block to its final resting place on disk but * without the dedup flag set. This would have left the hdr in the MRU * state and discoverable. When the txg finally syncs it detects that * the block was overridden in open context and issues an override I/O. * Since this is a dedup block, the override I/O will determine if the * block is already in the DDT. If so, then it will replace the io_bp * with the bp from the DDT and allow the I/O to finish. When the I/O * reaches the done callback, dbuf_write_override_done, it will * check to see if the io_bp and io_bp_override are identical. * If they are not, then it indicates that the bp was replaced with * the bp in the DDT and the override bp is freed. This allows * us to arrive here with a reference on a block that is being * freed. So if we have an I/O in progress, or a reference to * this hdr, then we don't destroy the hdr. */ if (!HDR_HAS_L1HDR(hdr) || (!HDR_IO_IN_PROGRESS(hdr) && refcount_is_zero(&hdr->b_l1hdr.b_refcnt))) { arc_change_state(arc_anon, hdr, hash_lock); arc_hdr_destroy(hdr); mutex_exit(hash_lock); } else { mutex_exit(hash_lock); } } /* * Release this buffer from the cache, making it an anonymous buffer. This * must be done after a read and prior to modifying the buffer contents. * If the buffer has more than one reference, we must make * a new hdr for the buffer. */ void arc_release(arc_buf_t *buf, void *tag) { arc_buf_hdr_t *hdr = buf->b_hdr; /* * It would be nice to assert that if it's DMU metadata (level > * 0 || it's the dnode file), then it must be syncing context. * But we don't know that information at this level. 
*/ mutex_enter(&buf->b_evict_lock); ASSERT(HDR_HAS_L1HDR(hdr)); /* * We don't grab the hash lock prior to this check, because if * the buffer's header is in the arc_anon state, it won't be * linked into the hash table. */ if (hdr->b_l1hdr.b_state == arc_anon) { mutex_exit(&buf->b_evict_lock); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); ASSERT(!HDR_IN_HASH_TABLE(hdr)); ASSERT(!HDR_HAS_L2HDR(hdr)); ASSERT(HDR_EMPTY(hdr)); ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1); ASSERT3S(refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1); ASSERT(!list_link_active(&hdr->b_l1hdr.b_arc_node)); hdr->b_l1hdr.b_arc_access = 0; /* * If the buf is being overridden then it may already * have a hdr that is not empty. */ buf_discard_identity(hdr); arc_buf_thaw(buf); return; } kmutex_t *hash_lock = HDR_LOCK(hdr); mutex_enter(hash_lock); /* * This assignment is only valid as long as the hash_lock is * held, we must be careful not to reference state or the * b_state field after dropping the lock. */ arc_state_t *state = hdr->b_l1hdr.b_state; ASSERT3P(hash_lock, ==, HDR_LOCK(hdr)); ASSERT3P(state, !=, arc_anon); /* this buffer is not on any list */ ASSERT3S(refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0); if (HDR_HAS_L2HDR(hdr)) { mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx); /* * We have to recheck this conditional again now that * we're holding the l2ad_mtx to prevent a race with * another thread which might be concurrently calling * l2arc_evict(). In that case, l2arc_evict() might have * destroyed the header's L2 portion as we were waiting * to acquire the l2ad_mtx. */ if (HDR_HAS_L2HDR(hdr)) { l2arc_trim(hdr); arc_hdr_l2hdr_destroy(hdr); } mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx); } /* * Do we have more than one buf? */ if (hdr->b_l1hdr.b_bufcnt > 1) { arc_buf_hdr_t *nhdr; uint64_t spa = hdr->b_spa; uint64_t psize = HDR_GET_PSIZE(hdr); uint64_t lsize = HDR_GET_LSIZE(hdr); enum zio_compress compress = HDR_GET_COMPRESS(hdr); arc_buf_contents_t type = arc_buf_type(hdr); VERIFY3U(hdr->b_type, ==, type); ASSERT(hdr->b_l1hdr.b_buf != buf || buf->b_next != NULL); (void) remove_reference(hdr, hash_lock, tag); if (arc_buf_is_shared(buf) && !ARC_BUF_COMPRESSED(buf)) { ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf); ASSERT(ARC_BUF_LAST(buf)); } /* * Pull the data off of this hdr and attach it to * a new anonymous hdr. Also find the last buffer * in the hdr's buffer list. */ arc_buf_t *lastbuf = arc_buf_remove(hdr, buf); ASSERT3P(lastbuf, !=, NULL); /* * If the current arc_buf_t and the hdr are sharing their data * buffer, then we must stop sharing that block. */ if (arc_buf_is_shared(buf)) { VERIFY(!arc_buf_is_shared(lastbuf)); /* * First, sever the block sharing relationship between * buf and the arc_buf_hdr_t. */ arc_unshare_buf(hdr, buf); /* * Now we need to recreate the hdr's b_pabd. Since we * have lastbuf handy, we try to share with it, but if * we can't then we allocate a new b_pabd and copy the * data from buf into it. */ if (arc_can_share(hdr, lastbuf)) { arc_share_buf(hdr, lastbuf); } else { arc_hdr_alloc_pabd(hdr); abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data, psize); } VERIFY3P(lastbuf->b_data, !=, NULL); } else if (HDR_SHARED_DATA(hdr)) { /* * Uncompressed shared buffers are always at the end * of the list. Compressed buffers don't have the * same requirements. This makes it hard to * simply assert that the lastbuf is shared so * we rely on the hdr's compression flags to determine * if we have a compressed, shared buffer. 
*/ ASSERT(arc_buf_is_shared(lastbuf) || HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF); ASSERT(!ARC_BUF_SHARED(buf)); } ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); ASSERT3P(state, !=, arc_l2c_only); (void) refcount_remove_many(&state->arcs_size, arc_buf_size(buf), buf); if (refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) { ASSERT3P(state, !=, arc_l2c_only); (void) refcount_remove_many(&state->arcs_esize[type], arc_buf_size(buf), buf); } hdr->b_l1hdr.b_bufcnt -= 1; arc_cksum_verify(buf); #ifdef illumos arc_buf_unwatch(buf); #endif mutex_exit(hash_lock); /* * Allocate a new hdr. The new hdr will contain a b_pabd * buffer which will be freed in arc_write(). */ nhdr = arc_hdr_alloc(spa, psize, lsize, compress, type); ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL); ASSERT0(nhdr->b_l1hdr.b_bufcnt); ASSERT0(refcount_count(&nhdr->b_l1hdr.b_refcnt)); VERIFY3U(nhdr->b_type, ==, type); ASSERT(!HDR_SHARED_DATA(nhdr)); nhdr->b_l1hdr.b_buf = buf; nhdr->b_l1hdr.b_bufcnt = 1; (void) refcount_add(&nhdr->b_l1hdr.b_refcnt, tag); buf->b_hdr = nhdr; mutex_exit(&buf->b_evict_lock); (void) refcount_add_many(&arc_anon->arcs_size, arc_buf_size(buf), buf); } else { mutex_exit(&buf->b_evict_lock); ASSERT(refcount_count(&hdr->b_l1hdr.b_refcnt) == 1); /* protected by hash lock, or hdr is on arc_anon */ ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); arc_change_state(arc_anon, hdr, hash_lock); hdr->b_l1hdr.b_arc_access = 0; mutex_exit(hash_lock); buf_discard_identity(hdr); arc_buf_thaw(buf); } } int arc_released(arc_buf_t *buf) { int released; mutex_enter(&buf->b_evict_lock); released = (buf->b_data != NULL && buf->b_hdr->b_l1hdr.b_state == arc_anon); mutex_exit(&buf->b_evict_lock); return (released); } #ifdef ZFS_DEBUG int arc_referenced(arc_buf_t *buf) { int referenced; mutex_enter(&buf->b_evict_lock); referenced = (refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt)); mutex_exit(&buf->b_evict_lock); return (referenced); } #endif static void arc_write_ready(zio_t *zio) { arc_write_callback_t *callback = zio->io_private; arc_buf_t *buf = callback->awcb_buf; arc_buf_hdr_t *hdr = buf->b_hdr; uint64_t psize = BP_IS_HOLE(zio->io_bp) ? 0 : BP_GET_PSIZE(zio->io_bp); ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT(!refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt)); ASSERT(hdr->b_l1hdr.b_bufcnt > 0); /* * If we're reexecuting this zio because the pool suspended, then * cleanup any state that was previously set the first time the * callback was invoked. */ if (zio->io_flags & ZIO_FLAG_REEXECUTED) { arc_cksum_free(hdr); #ifdef illumos arc_buf_unwatch(buf); #endif if (hdr->b_l1hdr.b_pabd != NULL) { if (arc_buf_is_shared(buf)) { arc_unshare_buf(hdr, buf); } else { arc_hdr_free_pabd(hdr); } } } ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(!HDR_SHARED_DATA(hdr)); ASSERT(!arc_buf_is_shared(buf)); callback->awcb_ready(zio, buf, callback->awcb_private); if (HDR_IO_IN_PROGRESS(hdr)) ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED); arc_cksum_compute(buf); arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); enum zio_compress compress; if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) { compress = ZIO_COMPRESS_OFF; } else { ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(zio->io_bp)); compress = BP_GET_COMPRESS(zio->io_bp); } HDR_SET_PSIZE(hdr, psize); arc_hdr_set_compress(hdr, compress); /* * Fill the hdr with data. If the hdr is compressed, the data we want * is available from the zio, otherwise we can take it from the buf. * * We might be able to share the buf's data with the hdr here. 
However, * doing so would cause the ARC to be full of linear ABDs if we write a * lot of shareable data. As a compromise, we check whether scattered * ABDs are allowed, and assume that if they are then the user wants * the ARC to be primarily filled with them regardless of the data being * written. Therefore, if they're allowed then we allocate one and copy * the data into it; otherwise, we share the data directly if we can. */ if (zfs_abd_scatter_enabled || !arc_can_share(hdr, buf)) { arc_hdr_alloc_pabd(hdr); /* * Ideally, we would always copy the io_abd into b_pabd, but the * user may have disabled compressed ARC, thus we must check the * hdr's compression setting rather than the io_bp's. */ if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF) { ASSERT3U(BP_GET_COMPRESS(zio->io_bp), !=, ZIO_COMPRESS_OFF); ASSERT3U(psize, >, 0); abd_copy(hdr->b_l1hdr.b_pabd, zio->io_abd, psize); } else { ASSERT3U(zio->io_orig_size, ==, arc_hdr_size(hdr)); abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data, arc_buf_size(buf)); } } else { ASSERT3P(buf->b_data, ==, abd_to_buf(zio->io_orig_abd)); ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf)); ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1); arc_share_buf(hdr, buf); } arc_hdr_verify(hdr, zio->io_bp); } static void arc_write_children_ready(zio_t *zio) { arc_write_callback_t *callback = zio->io_private; arc_buf_t *buf = callback->awcb_buf; callback->awcb_children_ready(zio, buf, callback->awcb_private); } /* * The SPA calls this callback for each physical write that happens on behalf * of a logical write. See the comment in dbuf_write_physdone() for details. */ static void arc_write_physdone(zio_t *zio) { arc_write_callback_t *cb = zio->io_private; if (cb->awcb_physdone != NULL) cb->awcb_physdone(zio, cb->awcb_buf, cb->awcb_private); } static void arc_write_done(zio_t *zio) { arc_write_callback_t *callback = zio->io_private; arc_buf_t *buf = callback->awcb_buf; arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); if (zio->io_error == 0) { arc_hdr_verify(hdr, zio->io_bp); if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) { buf_discard_identity(hdr); } else { hdr->b_dva = *BP_IDENTITY(zio->io_bp); hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp); } } else { ASSERT(HDR_EMPTY(hdr)); } /* * If the block to be written was all-zero or compressed enough to be * embedded in the BP, no write was performed so there will be no * dva/birth/checksum. The buffer must therefore remain anonymous * (and uncached). */ if (!HDR_EMPTY(hdr)) { arc_buf_hdr_t *exists; kmutex_t *hash_lock; ASSERT3U(zio->io_error, ==, 0); arc_cksum_verify(buf); exists = buf_hash_insert(hdr, &hash_lock); if (exists != NULL) { /* * This can only happen if we overwrite for * sync-to-convergence, because we remove * buffers from the hash table when we arc_free(). 
*/ if (zio->io_flags & ZIO_FLAG_IO_REWRITE) { if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp)) panic("bad overwrite, hdr=%p exists=%p", (void *)hdr, (void *)exists); ASSERT(refcount_is_zero( &exists->b_l1hdr.b_refcnt)); arc_change_state(arc_anon, exists, hash_lock); mutex_exit(hash_lock); arc_hdr_destroy(exists); exists = buf_hash_insert(hdr, &hash_lock); ASSERT3P(exists, ==, NULL); } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) { /* nopwrite */ ASSERT(zio->io_prop.zp_nopwrite); if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp)) panic("bad nopwrite, hdr=%p exists=%p", (void *)hdr, (void *)exists); } else { /* Dedup */ ASSERT(hdr->b_l1hdr.b_bufcnt == 1); ASSERT(hdr->b_l1hdr.b_state == arc_anon); ASSERT(BP_GET_DEDUP(zio->io_bp)); ASSERT(BP_GET_LEVEL(zio->io_bp) == 0); } } arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); /* if it's not anon, we are doing a scrub */ if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon) arc_access(hdr, hash_lock); mutex_exit(hash_lock); } else { arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); } ASSERT(!refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); callback->awcb_done(zio, buf, callback->awcb_private); abd_put(zio->io_abd); kmem_free(callback, sizeof (arc_write_callback_t)); } zio_t * arc_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, arc_buf_t *buf, boolean_t l2arc, const zio_prop_t *zp, arc_done_func_t *ready, arc_done_func_t *children_ready, arc_done_func_t *physdone, arc_done_func_t *done, void *private, zio_priority_t priority, int zio_flags, const zbookmark_phys_t *zb) { arc_buf_hdr_t *hdr = buf->b_hdr; arc_write_callback_t *callback; zio_t *zio; zio_prop_t localprop = *zp; ASSERT3P(ready, !=, NULL); ASSERT3P(done, !=, NULL); ASSERT(!HDR_IO_ERROR(hdr)); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0); if (l2arc) arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE); if (ARC_BUF_COMPRESSED(buf)) { /* * We're writing a pre-compressed buffer. Make the * compression algorithm requested by the zio_prop_t match * the pre-compressed buffer's compression algorithm. */ localprop.zp_compress = HDR_GET_COMPRESS(hdr); ASSERT3U(HDR_GET_LSIZE(hdr), !=, arc_buf_size(buf)); zio_flags |= ZIO_FLAG_RAW; } callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP); callback->awcb_ready = ready; callback->awcb_children_ready = children_ready; callback->awcb_physdone = physdone; callback->awcb_done = done; callback->awcb_private = private; callback->awcb_buf = buf; /* * The hdr's b_pabd is now stale, free it now. A new data block * will be allocated when the zio pipeline calls arc_write_ready(). */ if (hdr->b_l1hdr.b_pabd != NULL) { /* * If the buf is currently sharing the data block with * the hdr then we need to break that relationship here. * The hdr will remain with a NULL data pointer and the * buf will take sole ownership of the block. */ if (arc_buf_is_shared(buf)) { arc_unshare_buf(hdr, buf); } else { arc_hdr_free_pabd(hdr); } VERIFY3P(buf->b_data, !=, NULL); arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF); } ASSERT(!arc_buf_is_shared(buf)); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); zio = zio_write(pio, spa, txg, bp, abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)), HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready, (children_ready != NULL) ? 
arc_write_children_ready : NULL, arc_write_physdone, arc_write_done, callback, priority, zio_flags, zb); return (zio); } static int arc_memory_throttle(uint64_t reserve, uint64_t txg) { #ifdef _KERNEL uint64_t available_memory = ptob(freemem); static uint64_t page_load = 0; static uint64_t last_txg = 0; #if defined(__i386) || !defined(UMA_MD_SMALL_ALLOC) available_memory = MIN(available_memory, uma_avail()); #endif if (freemem > (uint64_t)physmem * arc_lotsfree_percent / 100) return (0); if (txg > last_txg) { last_txg = txg; page_load = 0; } /* * If we are in pageout, we know that memory is already tight, * the arc is already going to be evicting, so we just want to * continue to let page writes occur as quickly as possible. */ if (curproc == pageproc) { if (page_load > MAX(ptob(minfree), available_memory) / 4) return (SET_ERROR(ERESTART)); /* Note: reserve is inflated, so we deflate */ page_load += reserve / 8; return (0); } else if (page_load > 0 && arc_reclaim_needed()) { /* memory is low, delay before restarting */ ARCSTAT_INCR(arcstat_memory_throttle_count, 1); return (SET_ERROR(EAGAIN)); } page_load = 0; #endif return (0); } void arc_tempreserve_clear(uint64_t reserve) { atomic_add_64(&arc_tempreserve, -reserve); ASSERT((int64_t)arc_tempreserve >= 0); } int arc_tempreserve_space(uint64_t reserve, uint64_t txg) { int error; uint64_t anon_size; if (reserve > arc_c/4 && !arc_no_grow) { arc_c = MIN(arc_c_max, reserve * 4); DTRACE_PROBE1(arc__set_reserve, uint64_t, arc_c); } if (reserve > arc_c) return (SET_ERROR(ENOMEM)); /* * Don't count loaned bufs as in flight dirty data to prevent long * network delays from blocking transactions that are ready to be * assigned to a txg. */ /* assert that it has not wrapped around */ ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0); anon_size = MAX((int64_t)(refcount_count(&arc_anon->arcs_size) - arc_loaned_bytes), 0); /* * Writes will, almost always, require additional memory allocations * in order to compress/encrypt/etc the data. We therefore need to * make sure that there is sufficient available memory for this. */ error = arc_memory_throttle(reserve, txg); if (error != 0) return (error); /* * Throttle writes when the amount of dirty data in the cache * gets too large. We try to keep the cache less than half full * of dirty blocks so that our sync times don't grow too large. * Note: if two requests come in concurrently, we might let them * both succeed, when one of them should fail. Not a huge deal. 
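 *
 * As a worked example (all values assumed for illustration):
 * with arc_c = 1GB, a reserve of 64MB, arc_tempreserve = 300MB
 * and anon_size = 300MB, the sum 664MB exceeds arc_c / 2 =
 * 512MB and anon_size exceeds arc_c / 4 = 256MB, so ERESTART
 * is returned and the caller must retry later.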
*/ if (reserve + arc_tempreserve + anon_size > arc_c / 2 && anon_size > arc_c / 4) { uint64_t meta_esize = refcount_count(&arc_anon->arcs_esize[ARC_BUFC_METADATA]); uint64_t data_esize = refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]); dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK " "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n", arc_tempreserve >> 10, meta_esize >> 10, data_esize >> 10, reserve >> 10, arc_c >> 10); return (SET_ERROR(ERESTART)); } atomic_add_64(&arc_tempreserve, reserve); return (0); } static void arc_kstat_update_state(arc_state_t *state, kstat_named_t *size, kstat_named_t *evict_data, kstat_named_t *evict_metadata) { size->value.ui64 = refcount_count(&state->arcs_size); evict_data->value.ui64 = refcount_count(&state->arcs_esize[ARC_BUFC_DATA]); evict_metadata->value.ui64 = refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]); } static int arc_kstat_update(kstat_t *ksp, int rw) { arc_stats_t *as = ksp->ks_data; if (rw == KSTAT_WRITE) { return (EACCES); } else { arc_kstat_update_state(arc_anon, &as->arcstat_anon_size, &as->arcstat_anon_evictable_data, &as->arcstat_anon_evictable_metadata); arc_kstat_update_state(arc_mru, &as->arcstat_mru_size, &as->arcstat_mru_evictable_data, &as->arcstat_mru_evictable_metadata); arc_kstat_update_state(arc_mru_ghost, &as->arcstat_mru_ghost_size, &as->arcstat_mru_ghost_evictable_data, &as->arcstat_mru_ghost_evictable_metadata); arc_kstat_update_state(arc_mfu, &as->arcstat_mfu_size, &as->arcstat_mfu_evictable_data, &as->arcstat_mfu_evictable_metadata); arc_kstat_update_state(arc_mfu_ghost, &as->arcstat_mfu_ghost_size, &as->arcstat_mfu_ghost_evictable_data, &as->arcstat_mfu_ghost_evictable_metadata); ARCSTAT(arcstat_size) = aggsum_value(&arc_size); ARCSTAT(arcstat_meta_used) = aggsum_value(&arc_meta_used); ARCSTAT(arcstat_data_size) = aggsum_value(&astat_data_size); ARCSTAT(arcstat_metadata_size) = aggsum_value(&astat_metadata_size); ARCSTAT(arcstat_hdr_size) = aggsum_value(&astat_hdr_size); ARCSTAT(arcstat_other_size) = aggsum_value(&astat_other_size); ARCSTAT(arcstat_l2_hdr_size) = aggsum_value(&astat_l2_hdr_size); } return (0); } /* * This function *must* return indices evenly distributed between all * sublists of the multilist. This is needed due to how the ARC eviction * code is laid out; arc_evict_state() assumes ARC buffers are evenly * distributed between all sublists and uses this assumption when * deciding which sublist to evict from and how much to evict from it. */ unsigned int arc_state_multilist_index_func(multilist_t *ml, void *obj) { arc_buf_hdr_t *hdr = obj; /* * We rely on b_dva to generate evenly distributed index * numbers using buf_hash below. So, as an added precaution, * let's make sure we never add empty buffers to the arc lists. */ ASSERT(!HDR_EMPTY(hdr)); /* * The assumption here is that the hash value for a given * arc_buf_hdr_t will remain constant throughout its lifetime * (i.e., its b_spa, b_dva, and b_birth fields don't change). * Thus, we don't need to store the header's sublist index * on insertion, as this index can be recalculated on removal. * * Also, the low-order bits of the hash value are thought to be * distributed evenly. Otherwise, in the case that the multilist * has a power-of-two number of sublists, each sublist's usage * would not be evenly distributed.
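 * As a hypothetical illustration: with 8 sublists, the modulo below keeps only the three low-order bits of buf_hash(), so any bias in those bits would concentrate headers on a few sublists and defeat arc_evict_state()'s assumption.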
*/ return (buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) % multilist_get_num_sublists(ml)); } #ifdef _KERNEL static eventhandler_tag arc_event_lowmem = NULL; static void arc_lowmem(void *arg __unused, int howto __unused) { mutex_enter(&arc_reclaim_lock); DTRACE_PROBE1(arc__needfree, int64_t, ((int64_t)freemem - zfs_arc_free_target) * PAGESIZE); cv_signal(&arc_reclaim_thread_cv); /* * It is unsafe to block here in arbitrary threads, because we can come * here from ARC itself and may hold ARC locks and thus risk a deadlock * with ARC reclaim thread. */ if (curproc == pageproc) (void) cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock); mutex_exit(&arc_reclaim_lock); } #endif static void arc_state_init(void) { arc_anon = &ARC_anon; arc_mru = &ARC_mru; arc_mru_ghost = &ARC_mru_ghost; arc_mfu = &ARC_mfu; arc_mfu_ghost = &ARC_mfu_ghost; arc_l2c_only = &ARC_l2c_only; arc_mru->arcs_list[ARC_BUFC_METADATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mru->arcs_list[ARC_BUFC_DATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mru_ghost->arcs_list[ARC_BUFC_METADATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mru_ghost->arcs_list[ARC_BUFC_DATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mfu->arcs_list[ARC_BUFC_METADATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mfu->arcs_list[ARC_BUFC_DATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mfu_ghost->arcs_list[ARC_BUFC_DATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_l2c_only->arcs_list[ARC_BUFC_METADATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_l2c_only->arcs_list[ARC_BUFC_DATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]); refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]); refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]); refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]); refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]); refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]); refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]); refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]); refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]); refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]); refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]); refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]); refcount_create(&arc_anon->arcs_size); refcount_create(&arc_mru->arcs_size); refcount_create(&arc_mru_ghost->arcs_size); refcount_create(&arc_mfu->arcs_size); refcount_create(&arc_mfu_ghost->arcs_size); refcount_create(&arc_l2c_only->arcs_size); aggsum_init(&arc_meta_used, 0); aggsum_init(&arc_size, 0); aggsum_init(&astat_data_size, 0); 
aggsum_init(&astat_metadata_size, 0); aggsum_init(&astat_hdr_size, 0); aggsum_init(&astat_other_size, 0); aggsum_init(&astat_l2_hdr_size, 0); } static void arc_state_fini(void) { refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]); refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]); refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]); refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]); refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]); refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]); refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]); refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]); refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]); refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]); refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]); refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]); refcount_destroy(&arc_anon->arcs_size); refcount_destroy(&arc_mru->arcs_size); refcount_destroy(&arc_mru_ghost->arcs_size); refcount_destroy(&arc_mfu->arcs_size); refcount_destroy(&arc_mfu_ghost->arcs_size); refcount_destroy(&arc_l2c_only->arcs_size); multilist_destroy(arc_mru->arcs_list[ARC_BUFC_METADATA]); multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]); multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_METADATA]); multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]); multilist_destroy(arc_mru->arcs_list[ARC_BUFC_DATA]); multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_DATA]); multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_DATA]); multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]); } uint64_t arc_max_bytes(void) { return (arc_c_max); } void arc_init(void) { int i, prefetch_tunable_set = 0; /* * allmem is "all memory that we could possibly use". */ #ifdef illumos #ifdef _KERNEL uint64_t allmem = ptob(physmem - swapfs_minfree); #else uint64_t allmem = (physmem * PAGESIZE) / 2; #endif #else uint64_t allmem = kmem_size(); #endif mutex_init(&arc_reclaim_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&arc_reclaim_thread_cv, NULL, CV_DEFAULT, NULL); cv_init(&arc_reclaim_waiters_cv, NULL, CV_DEFAULT, NULL); mutex_init(&arc_dnlc_evicts_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&arc_dnlc_evicts_cv, NULL, CV_DEFAULT, NULL); /* Convert seconds to clock ticks */ arc_min_prefetch_lifespan = 1 * hz; /* set min cache to 1/32 of all memory, or arc_abs_min, whichever is more */ arc_c_min = MAX(allmem / 32, arc_abs_min); /* set max to 5/8 of all memory, or all but 1GB, whichever is more */ if (allmem >= 1 << 30) arc_c_max = allmem - (1 << 30); else arc_c_max = arc_c_min; arc_c_max = MAX(allmem * 5 / 8, arc_c_max); /* * In userland, there's only the memory pressure that we artificially * create (see arc_available_memory()). Don't let arc_c get too * small, because it can cause transactions to be larger than * arc_c, causing arc_tempreserve_space() to fail. */ #ifndef _KERNEL arc_c_min = arc_c_max / 2; #endif #ifdef _KERNEL /* * Allow the tunables to override our calculations if they are * reasonable. */ if (zfs_arc_max > arc_abs_min && zfs_arc_max < allmem) { arc_c_max = zfs_arc_max; arc_c_min = MIN(arc_c_min, arc_c_max); } if (zfs_arc_min > arc_abs_min && zfs_arc_min <= arc_c_max) arc_c_min = zfs_arc_min; #endif arc_c = arc_c_max; arc_p = (arc_c >> 1); /* limit meta-data to 1/4 of the arc capacity */ arc_meta_limit = arc_c_max / 4; #ifdef _KERNEL /* * Metadata is stored in the kernel's heap. Don't let us * use more than half the heap for the ARC. 
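 * (Illustrative arithmetic only, not a measured value: on a machine whose UMA limit is 16 GB, the FreeBSD branch below would cap arc_meta_limit at 8 GB.)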
*/ #ifdef __FreeBSD__ arc_meta_limit = MIN(arc_meta_limit, uma_limit() / 2); #else arc_meta_limit = MIN(arc_meta_limit, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 2); #endif #endif /* Allow the tunable to override if it is reasonable */ if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max) arc_meta_limit = zfs_arc_meta_limit; if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0) arc_c_min = arc_meta_limit / 2; if (zfs_arc_meta_min > 0) { arc_meta_min = zfs_arc_meta_min; } else { arc_meta_min = arc_c_min / 2; } if (zfs_arc_grow_retry > 0) arc_grow_retry = zfs_arc_grow_retry; if (zfs_arc_shrink_shift > 0) arc_shrink_shift = zfs_arc_shrink_shift; if (zfs_arc_no_grow_shift > 0) arc_no_grow_shift = zfs_arc_no_grow_shift; /* * Ensure that arc_no_grow_shift is less than arc_shrink_shift. */ if (arc_no_grow_shift >= arc_shrink_shift) arc_no_grow_shift = arc_shrink_shift - 1; if (zfs_arc_p_min_shift > 0) arc_p_min_shift = zfs_arc_p_min_shift; /* if kmem_flags are set, let's try to use less memory */ if (kmem_debugging()) arc_c = arc_c / 2; if (arc_c < arc_c_min) arc_c = arc_c_min; zfs_arc_min = arc_c_min; zfs_arc_max = arc_c_max; arc_state_init(); buf_init(); arc_reclaim_thread_exit = B_FALSE; arc_dnlc_evicts_thread_exit = FALSE; arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED, sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL); if (arc_ksp != NULL) { arc_ksp->ks_data = &arc_stats; arc_ksp->ks_update = arc_kstat_update; kstat_install(arc_ksp); } (void) thread_create(NULL, 0, arc_reclaim_thread, NULL, 0, &p0, TS_RUN, minclsyspri); #ifdef _KERNEL arc_event_lowmem = EVENTHANDLER_REGISTER(vm_lowmem, arc_lowmem, NULL, EVENTHANDLER_PRI_FIRST); #endif (void) thread_create(NULL, 0, arc_dnlc_evicts_thread, NULL, 0, &p0, TS_RUN, minclsyspri); arc_dead = B_FALSE; arc_warm = B_FALSE; /* * Calculate the maximum amount of dirty data per pool. * * If it has been set by /etc/system, take that. * Otherwise, use a percentage of physical memory defined by * zfs_dirty_data_max_percent (default 10%) with a cap at * zfs_dirty_data_max_max (default 4GB). */ if (zfs_dirty_data_max == 0) { zfs_dirty_data_max = ptob(physmem) * zfs_dirty_data_max_percent / 100; zfs_dirty_data_max = MIN(zfs_dirty_data_max, zfs_dirty_data_max_max); } #ifdef _KERNEL if (TUNABLE_INT_FETCH("vfs.zfs.prefetch_disable", &zfs_prefetch_disable)) prefetch_tunable_set = 1; #ifdef __i386__ if (prefetch_tunable_set == 0) { printf("ZFS NOTICE: Prefetch is disabled by default on i386 " "-- to enable,\n"); printf(" add \"vfs.zfs.prefetch_disable=0\" " "to /boot/loader.conf.\n"); zfs_prefetch_disable = 1; } #else if ((((uint64_t)physmem * PAGESIZE) < (1ULL << 32)) && prefetch_tunable_set == 0) { printf("ZFS NOTICE: Prefetch is disabled by default if less " "than 4GB of RAM is present;\n" " to enable, add \"vfs.zfs.prefetch_disable=0\" " "to /boot/loader.conf.\n"); zfs_prefetch_disable = 1; } #endif /* Warn about ZFS memory and address space requirements.
*/ if (((uint64_t)physmem * PAGESIZE) < (256 + 128 + 64) * (1 << 20)) { printf("ZFS WARNING: Recommended minimum RAM size is 512MB; " "expect unstable behavior.\n"); } if (allmem < 512 * (1 << 20)) { printf("ZFS WARNING: Recommended minimum kmem_size is 512MB; " "expect unstable behavior.\n"); printf(" Consider tuning vm.kmem_size and " "vm.kmem_size_max\n"); printf(" in /boot/loader.conf.\n"); } #endif } void arc_fini(void) { #ifdef _KERNEL if (arc_event_lowmem != NULL) EVENTHANDLER_DEREGISTER(vm_lowmem, arc_event_lowmem); #endif mutex_enter(&arc_reclaim_lock); arc_reclaim_thread_exit = B_TRUE; /* * The reclaim thread will set arc_reclaim_thread_exit back to * B_FALSE when it is finished exiting; we're waiting for that. */ while (arc_reclaim_thread_exit) { cv_signal(&arc_reclaim_thread_cv); cv_wait(&arc_reclaim_thread_cv, &arc_reclaim_lock); } mutex_exit(&arc_reclaim_lock); /* Use B_TRUE to ensure *all* buffers are evicted */ arc_flush(NULL, B_TRUE); mutex_enter(&arc_dnlc_evicts_lock); arc_dnlc_evicts_thread_exit = TRUE; /* * The dnlc evicts thread will set arc_dnlc_evicts_thread_exit * to FALSE when it is finished exiting; we're waiting for that. */ while (arc_dnlc_evicts_thread_exit) { cv_signal(&arc_dnlc_evicts_cv); cv_wait(&arc_dnlc_evicts_cv, &arc_dnlc_evicts_lock); } mutex_exit(&arc_dnlc_evicts_lock); arc_dead = B_TRUE; if (arc_ksp != NULL) { kstat_delete(arc_ksp); arc_ksp = NULL; } mutex_destroy(&arc_reclaim_lock); cv_destroy(&arc_reclaim_thread_cv); cv_destroy(&arc_reclaim_waiters_cv); mutex_destroy(&arc_dnlc_evicts_lock); cv_destroy(&arc_dnlc_evicts_cv); arc_state_fini(); buf_fini(); ASSERT0(arc_loaned_bytes); } /* * Level 2 ARC * * The level 2 ARC (L2ARC) is a cache layer in between main memory and disk. * It uses dedicated storage devices to hold cached data, which are populated * using large infrequent writes. The main role of this cache is to boost * the performance of random read workloads. The intended L2ARC devices * include short-stroked disks, solid-state disks, and other media with * substantially faster read latency than disk. * * +-----------------------+ * | ARC | * +-----------------------+ * | ^ ^ * | | | * l2arc_feed_thread() arc_read() * | | | * | l2arc read | * V | | * +---------------+ | * | L2ARC | | * +---------------+ | * | ^ | * l2arc_write() | | * | | | * V | | * +-------+ +-------+ * | vdev | | vdev | * | cache | | cache | * +-------+ +-------+ * +=========+ .-----. * : L2ARC : |-_____-| * : devices : | Disks | * +=========+ `-_____-' * * Read requests are satisfied from the following sources, in order: * * 1) ARC * 2) vdev cache of L2ARC devices * 3) L2ARC devices * 4) vdev cache of disks * 5) disks * * Some L2ARC device types exhibit extremely slow write performance. * To accommodate this, there are some significant differences between * the L2ARC and traditional cache design: * * 1. There is no eviction path from the ARC to the L2ARC. Evictions from * the ARC behave as usual, freeing buffers and placing headers on ghost * lists. The ARC does not send buffers to the L2ARC during eviction as * this would add inflated write latencies for all ARC memory pressure. * * 2. The L2ARC attempts to cache data from the ARC before it is evicted. * It does this by periodically scanning buffers from the eviction-end of * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are * not already there. It scans until a headroom of buffers is satisfied, * which itself is a buffer for ARC eviction.
If a compressible buffer is * found during scanning and selected for writing to an L2ARC device, we * temporarily boost scanning headroom during the next scan cycle to make * sure we adapt to compression effects (which might significantly reduce * the data volume we write to L2ARC). The thread that does this is * l2arc_feed_thread(), illustrated below; example sizes are included to * provide a better sense of ratio than this diagram: * * head --> tail * +---------------------+----------+ * ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC * +---------------------+----------+ | o L2ARC eligible * ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer * +---------------------+----------+ | * 15.9 Gbytes ^ 32 Mbytes | * headroom | * l2arc_feed_thread() * | * l2arc write hand <--[oooo]--' * | 8 Mbyte * | write max * V * +==============================+ * L2ARC dev |####|#|###|###| |####| ... | * +==============================+ * 32 Gbytes * * 3. If an ARC buffer is copied to the L2ARC but then hit instead of * evicted, then the L2ARC has cached a buffer much sooner than it probably * needed to, potentially wasting L2ARC device bandwidth and storage. It is * safe to say that this is an uncommon case, since buffers at the end of * the ARC lists have moved there due to inactivity. * * 4. If the ARC evicts faster than the L2ARC can maintain a headroom, * then the L2ARC simply misses copying some buffers. This serves as a * pressure valve to prevent heavy read workloads from both stalling the ARC * with waits and clogging the L2ARC with writes. This also helps prevent * the potential for the L2ARC to churn if it attempts to cache content too * quickly, such as during backups of the entire pool. * * 5. After system boot and before the ARC has filled main memory, there are * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru * lists can remain mostly static. Instead of searching from the tail of these * lists as pictured, l2arc_feed_thread() will search from the list heads * for eligible buffers, greatly increasing its chance of finding them. * * The L2ARC device write speed is also boosted during this time so that * the L2ARC warms up faster. Since there have been no ARC evictions yet, * there are no L2ARC reads, and no fear of degrading read performance * through increased writes. * * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that * the vdev queue can aggregate them into larger and fewer writes. Each * device is written to in a rotor fashion, sweeping writes through * available space then repeating. * * 7. The L2ARC does not store dirty content. It never needs to flush * write buffers back to disk-based storage. * * 8. If an ARC buffer is written (and dirtied) that also exists in the * L2ARC, the now-stale L2ARC buffer is immediately dropped.
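 * * As an illustrative example (assuming the stock FreeBSD sysctl names for these tunables): doubling vfs.zfs.l2arc_write_max from its 8 MB default would double the per-interval write size computed by l2arc_write_size() below, at the cost of more device wear and write bandwidth.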
* * The performance of the L2ARC can be tweaked by a number of tunables, which * may be necessary for different workloads: * * l2arc_write_max max write bytes per interval * l2arc_write_boost extra write bytes during device warmup * l2arc_noprefetch skip caching prefetched buffers * l2arc_headroom number of max device writes to precache * l2arc_headroom_boost when we find compressed buffers during ARC * scanning, we multiply headroom by this * percentage factor for the next scan cycle, * since more compressed buffers are likely to * be present * l2arc_feed_secs seconds between L2ARC writing * * Tunables may be removed or added as future performance improvements are * integrated, and also may become zpool properties. * * There are three key functions that control how the L2ARC warms up: * * l2arc_write_eligible() check if a buffer is eligible to cache * l2arc_write_size() calculate how much to write * l2arc_write_interval() calculate sleep delay between writes * * These three functions determine what to write, how much, and how quickly * to send writes. */ static boolean_t l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr) { /* * A buffer is *not* eligible for the L2ARC if it: * 1. belongs to a different spa. * 2. is already cached on the L2ARC. * 3. has an I/O in progress (it may be an incomplete read). * 4. is flagged not eligible (zfs property). */ if (hdr->b_spa != spa_guid) { ARCSTAT_BUMP(arcstat_l2_write_spa_mismatch); return (B_FALSE); } if (HDR_HAS_L2HDR(hdr)) { ARCSTAT_BUMP(arcstat_l2_write_in_l2); return (B_FALSE); } if (HDR_IO_IN_PROGRESS(hdr)) { ARCSTAT_BUMP(arcstat_l2_write_hdr_io_in_progress); return (B_FALSE); } if (!HDR_L2CACHE(hdr)) { ARCSTAT_BUMP(arcstat_l2_write_not_cacheable); return (B_FALSE); } return (B_TRUE); } static uint64_t l2arc_write_size(void) { uint64_t size; /* * Make sure our globals have meaningful values in case the user * altered them. */ size = l2arc_write_max; if (size == 0) { cmn_err(CE_NOTE, "Bad value for l2arc_write_max, value must " "be greater than zero, resetting it to the default (%d)", L2ARC_WRITE_SIZE); size = l2arc_write_max = L2ARC_WRITE_SIZE; } if (arc_warm == B_FALSE) size += l2arc_write_boost; return (size); } static clock_t l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote) { clock_t interval, next, now; /* * If the ARC lists are busy, increase our write rate; if the * lists are stale, idle back. This is achieved by checking * how much we previously wrote - if it was more than half of * what we wanted, schedule the next write much sooner. */ if (l2arc_feed_again && wrote > (wanted / 2)) interval = (hz * l2arc_feed_min_ms) / 1000; else interval = hz * l2arc_feed_secs; now = ddi_get_lbolt(); next = MAX(now, MIN(now + interval, began + interval)); return (next); } /* * Cycle through L2ARC devices. This is how L2ARC load balances. * If a device is returned, this also returns holding the spa config lock. */ static l2arc_dev_t * l2arc_dev_get_next(void) { l2arc_dev_t *first, *next = NULL; /* * Lock out the removal of spas (spa_namespace_lock), then removal * of cache devices (l2arc_dev_mtx). Once a device has been selected, * both locks will be dropped and a spa config lock held instead. 
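 * (Concretely, in the code below: l2arc_dev_mtx is released first, the SCL_L2ARC config lock is then acquired while spa_namespace_lock is still held, and spa_namespace_lock is released last, so the selected device can neither be removed nor have its spa disappear in the meantime.)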
*/ mutex_enter(&spa_namespace_lock); mutex_enter(&l2arc_dev_mtx); /* if there are no vdevs, there is nothing to do */ if (l2arc_ndev == 0) goto out; first = NULL; next = l2arc_dev_last; do { /* loop around the list looking for a non-faulted vdev */ if (next == NULL) { next = list_head(l2arc_dev_list); } else { next = list_next(l2arc_dev_list, next); if (next == NULL) next = list_head(l2arc_dev_list); } /* if we have come back to the start, bail out */ if (first == NULL) first = next; else if (next == first) break; } while (vdev_is_dead(next->l2ad_vdev)); /* if we were unable to find any usable vdevs, return NULL */ if (vdev_is_dead(next->l2ad_vdev)) next = NULL; l2arc_dev_last = next; out: mutex_exit(&l2arc_dev_mtx); /* * Grab the config lock to prevent the 'next' device from being * removed while we are writing to it. */ if (next != NULL) spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER); mutex_exit(&spa_namespace_lock); return (next); } /* * Free buffers that were tagged for destruction. */ static void l2arc_do_free_on_write() { list_t *buflist; l2arc_data_free_t *df, *df_prev; mutex_enter(&l2arc_free_on_write_mtx); buflist = l2arc_free_on_write; for (df = list_tail(buflist); df; df = df_prev) { df_prev = list_prev(buflist, df); ASSERT3P(df->l2df_abd, !=, NULL); abd_free(df->l2df_abd); list_remove(buflist, df); kmem_free(df, sizeof (l2arc_data_free_t)); } mutex_exit(&l2arc_free_on_write_mtx); } /* * A write to a cache device has completed. Update all headers to allow * reads from these buffers to begin. */ static void l2arc_write_done(zio_t *zio) { l2arc_write_callback_t *cb; l2arc_dev_t *dev; list_t *buflist; arc_buf_hdr_t *head, *hdr, *hdr_prev; kmutex_t *hash_lock; int64_t bytes_dropped = 0; cb = zio->io_private; ASSERT3P(cb, !=, NULL); dev = cb->l2wcb_dev; ASSERT3P(dev, !=, NULL); head = cb->l2wcb_head; ASSERT3P(head, !=, NULL); buflist = &dev->l2ad_buflist; ASSERT3P(buflist, !=, NULL); DTRACE_PROBE2(l2arc__iodone, zio_t *, zio, l2arc_write_callback_t *, cb); if (zio->io_error != 0) ARCSTAT_BUMP(arcstat_l2_writes_error); /* * All writes completed, or an error was hit. */ top: mutex_enter(&dev->l2ad_mtx); for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) { hdr_prev = list_prev(buflist, hdr); hash_lock = HDR_LOCK(hdr); /* * We cannot use mutex_enter or else we can deadlock * with l2arc_write_buffers (due to swapping the order * the hash lock and l2ad_mtx are taken). */ if (!mutex_tryenter(hash_lock)) { /* * Missed the hash lock. We must retry so we * don't leave the ARC_FLAG_L2_WRITING bit set. */ ARCSTAT_BUMP(arcstat_l2_writes_lock_retry); /* * We don't want to rescan the headers we've * already marked as having been written out, so * we reinsert the head node so we can pick up * where we left off. */ list_remove(buflist, head); list_insert_after(buflist, hdr, head); mutex_exit(&dev->l2ad_mtx); /* * We wait for the hash lock to become available * to try and prevent busy waiting, and increase * the chance we'll be able to acquire the lock * the next time around. */ mutex_enter(hash_lock); mutex_exit(hash_lock); goto top; } /* * We could not have been moved into the arc_l2c_only * state while in-flight due to our ARC_FLAG_L2_WRITING * bit being set. Let's just ensure that's being enforced. */ ASSERT(HDR_HAS_L1HDR(hdr)); if (zio->io_error != 0) { /* * Error - drop L2ARC entry. 
*/ list_remove(buflist, hdr); l2arc_trim(hdr); arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR); ARCSTAT_INCR(arcstat_l2_psize, -arc_hdr_size(hdr)); ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr)); bytes_dropped += arc_hdr_size(hdr); (void) refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr); } /* * Allow ARC to begin reads and ghost list evictions to * this L2ARC entry. */ arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING); mutex_exit(hash_lock); } atomic_inc_64(&l2arc_writes_done); list_remove(buflist, head); ASSERT(!HDR_HAS_L1HDR(head)); kmem_cache_free(hdr_l2only_cache, head); mutex_exit(&dev->l2ad_mtx); vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0); l2arc_do_free_on_write(); kmem_free(cb, sizeof (l2arc_write_callback_t)); } /* * A read to a cache device completed. Validate buffer contents before * handing over to the regular ARC routines. */ static void l2arc_read_done(zio_t *zio) { l2arc_read_callback_t *cb; arc_buf_hdr_t *hdr; kmutex_t *hash_lock; boolean_t valid_cksum; ASSERT3P(zio->io_vd, !=, NULL); ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE); spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd); cb = zio->io_private; ASSERT3P(cb, !=, NULL); hdr = cb->l2rcb_hdr; ASSERT3P(hdr, !=, NULL); hash_lock = HDR_LOCK(hdr); mutex_enter(hash_lock); ASSERT3P(hash_lock, ==, HDR_LOCK(hdr)); /* * If the data was read into a temporary buffer, * move it and free the buffer. */ if (cb->l2rcb_abd != NULL) { ASSERT3U(arc_hdr_size(hdr), <, zio->io_size); if (zio->io_error == 0) { abd_copy(hdr->b_l1hdr.b_pabd, cb->l2rcb_abd, arc_hdr_size(hdr)); } /* * The following must be done regardless of whether * there was an error: * - free the temporary buffer * - point zio to the real ARC buffer * - set zio size accordingly * These are required because zio is either re-used for * an I/O of the block in the case of the error * or the zio is passed to arc_read_done() and it * needs real data. */ abd_free(cb->l2rcb_abd); zio->io_size = zio->io_orig_size = arc_hdr_size(hdr); zio->io_abd = zio->io_orig_abd = hdr->b_l1hdr.b_pabd; } ASSERT3P(zio->io_abd, !=, NULL); /* * Check this survived the L2ARC journey. */ ASSERT3P(zio->io_abd, ==, hdr->b_l1hdr.b_pabd); zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */ zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */ valid_cksum = arc_cksum_is_equal(hdr, zio); if (valid_cksum && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) { mutex_exit(hash_lock); zio->io_private = hdr; arc_read_done(zio); } else { mutex_exit(hash_lock); /* * Buffer didn't survive caching. Increment stats and * reissue to the original storage device. */ if (zio->io_error != 0) { ARCSTAT_BUMP(arcstat_l2_io_error); } else { zio->io_error = SET_ERROR(EIO); } if (!valid_cksum) ARCSTAT_BUMP(arcstat_l2_cksum_bad); /* * If there's no waiter, issue an async i/o to the primary * storage now. If there *is* a waiter, the caller must * issue the i/o in a context where it's OK to block. */ if (zio->io_waiter == NULL) { zio_t *pio = zio_unique_parent(zio); ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL); zio_nowait(zio_read(pio, zio->io_spa, zio->io_bp, hdr->b_l1hdr.b_pabd, zio->io_size, arc_read_done, hdr, zio->io_priority, cb->l2rcb_flags, &cb->l2rcb_zb)); } } kmem_free(cb, sizeof (l2arc_read_callback_t)); } /* * This is the list priority from which the L2ARC will search for pages to * cache. This is used within loops (0..3) to cycle through lists in the * desired order. This order can have a significant effect on cache * performance. 
* * Currently the metadata lists are hit first, MFU then MRU, followed by * the data lists. This function returns a locked list, and also returns * the lock pointer. */ static multilist_sublist_t * l2arc_sublist_lock(int list_num) { multilist_t *ml = NULL; unsigned int idx; ASSERT(list_num >= 0 && list_num <= 3); switch (list_num) { case 0: ml = arc_mfu->arcs_list[ARC_BUFC_METADATA]; break; case 1: ml = arc_mru->arcs_list[ARC_BUFC_METADATA]; break; case 2: ml = arc_mfu->arcs_list[ARC_BUFC_DATA]; break; case 3: ml = arc_mru->arcs_list[ARC_BUFC_DATA]; break; } /* * Return a randomly-selected sublist. This is acceptable * because the caller feeds only a little bit of data for each * call (8MB). Subsequent calls will result in different * sublists being selected. */ idx = multilist_get_random_index(ml); return (multilist_sublist_lock(ml, idx)); } /* * Evict buffers from the device write hand to the distance specified in * bytes. This distance may span populated buffers, or it may span nothing. * This clears a region on the L2ARC device, making it ready for writing. * If the 'all' boolean is set, every buffer is evicted. */ static void l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all) { list_t *buflist; arc_buf_hdr_t *hdr, *hdr_prev; kmutex_t *hash_lock; uint64_t taddr; buflist = &dev->l2ad_buflist; if (!all && dev->l2ad_first) { /* * This is the first sweep through the device. There is * nothing to evict. */ return; } if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) { /* * When nearing the end of the device, evict to the end * before the device write hand jumps to the start. */ taddr = dev->l2ad_end; } else { taddr = dev->l2ad_hand + distance; } DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist, uint64_t, taddr, boolean_t, all); top: mutex_enter(&dev->l2ad_mtx); for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) { hdr_prev = list_prev(buflist, hdr); hash_lock = HDR_LOCK(hdr); /* * We cannot use mutex_enter or else we can deadlock * with l2arc_write_buffers (due to swapping the order * the hash lock and l2ad_mtx are taken). */ if (!mutex_tryenter(hash_lock)) { /* * Missed the hash lock. Retry. */ ARCSTAT_BUMP(arcstat_l2_evict_lock_retry); mutex_exit(&dev->l2ad_mtx); mutex_enter(hash_lock); mutex_exit(hash_lock); goto top; } /* * A header can't be on this list if it doesn't have an L2 header. */ ASSERT(HDR_HAS_L2HDR(hdr)); /* Ensure this header has finished being written. */ ASSERT(!HDR_L2_WRITING(hdr)); ASSERT(!HDR_L2_WRITE_HEAD(hdr)); if (!all && (hdr->b_l2hdr.b_daddr >= taddr || hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) { /* * We've evicted to the target address, * or the end of the device. */ mutex_exit(hash_lock); break; } if (!HDR_HAS_L1HDR(hdr)) { ASSERT(!HDR_L2_READING(hdr)); /* * This doesn't exist in the ARC. Destroy. * arc_hdr_destroy() will call list_remove() * and decrement arcstat_l2_lsize. */ arc_change_state(arc_anon, hdr, hash_lock); arc_hdr_destroy(hdr); } else { ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only); ARCSTAT_BUMP(arcstat_l2_evict_l1cached); /* * Invalidate issued or about to be issued * reads, since we may be about to write * over this location. */ if (HDR_L2_READING(hdr)) { ARCSTAT_BUMP(arcstat_l2_evict_reading); arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED); } arc_hdr_l2hdr_destroy(hdr); } mutex_exit(hash_lock); } mutex_exit(&dev->l2ad_mtx); } /* * Find and write ARC buffers to the L2ARC device. * * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid * for reading until they have completed writing.
* The l2arc_headroom_boost tunable is applied to the scan headroom below * when compressed ARC buffers are enabled. * * Returns the number of bytes actually written (which may be smaller than * the delta by which the device hand has changed due to alignment). */ static uint64_t l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz) { arc_buf_hdr_t *hdr, *hdr_prev, *head; uint64_t write_asize, write_psize, write_lsize, headroom; boolean_t full; l2arc_write_callback_t *cb; zio_t *pio, *wzio; uint64_t guid = spa_load_guid(spa); int try; ASSERT3P(dev->l2ad_vdev, !=, NULL); pio = NULL; write_lsize = write_asize = write_psize = 0; full = B_FALSE; head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE); arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR); ARCSTAT_BUMP(arcstat_l2_write_buffer_iter); /* * Copy buffers for L2ARC writing. */ for (try = 0; try <= 3; try++) { multilist_sublist_t *mls = l2arc_sublist_lock(try); uint64_t passed_sz = 0; ARCSTAT_BUMP(arcstat_l2_write_buffer_list_iter); /* * L2ARC fast warmup. * * Until the ARC is warm and starts to evict, read from the * head of the ARC lists rather than the tail. */ if (arc_warm == B_FALSE) hdr = multilist_sublist_head(mls); else hdr = multilist_sublist_tail(mls); if (hdr == NULL) ARCSTAT_BUMP(arcstat_l2_write_buffer_list_null_iter); headroom = target_sz * l2arc_headroom; if (zfs_compressed_arc_enabled) headroom = (headroom * l2arc_headroom_boost) / 100; for (; hdr; hdr = hdr_prev) { kmutex_t *hash_lock; if (arc_warm == B_FALSE) hdr_prev = multilist_sublist_next(mls, hdr); else hdr_prev = multilist_sublist_prev(mls, hdr); ARCSTAT_INCR(arcstat_l2_write_buffer_bytes_scanned, HDR_GET_LSIZE(hdr)); hash_lock = HDR_LOCK(hdr); if (!mutex_tryenter(hash_lock)) { ARCSTAT_BUMP(arcstat_l2_write_trylock_fail); /* * Skip this buffer rather than waiting. */ continue; } passed_sz += HDR_GET_LSIZE(hdr); if (passed_sz > headroom) { /* * Searched too far. */ mutex_exit(hash_lock); ARCSTAT_BUMP(arcstat_l2_write_passed_headroom); break; } if (!l2arc_write_eligible(guid, hdr)) { mutex_exit(hash_lock); continue; } /* * We rely on the L1 portion of the header below, so * it's invalid for this header to have been evicted out * of the ghost cache prior to being written out. The * ARC_FLAG_L2_WRITING bit ensures this won't happen. */ ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT3U(HDR_GET_PSIZE(hdr), >, 0); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); ASSERT3U(arc_hdr_size(hdr), >, 0); uint64_t psize = arc_hdr_size(hdr); uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize); if ((write_asize + asize) > target_sz) { full = B_TRUE; mutex_exit(hash_lock); ARCSTAT_BUMP(arcstat_l2_write_full); break; } if (pio == NULL) { /* * Insert a dummy header on the buflist so * l2arc_write_done() can find where the * write buffers begin without searching.
*/ mutex_enter(&dev->l2ad_mtx); list_insert_head(&dev->l2ad_buflist, head); mutex_exit(&dev->l2ad_mtx); cb = kmem_alloc( sizeof (l2arc_write_callback_t), KM_SLEEP); cb->l2wcb_dev = dev; cb->l2wcb_head = head; pio = zio_root(spa, l2arc_write_done, cb, ZIO_FLAG_CANFAIL); ARCSTAT_BUMP(arcstat_l2_write_pios); } hdr->b_l2hdr.b_dev = dev; hdr->b_l2hdr.b_daddr = dev->l2ad_hand; arc_hdr_set_flags(hdr, ARC_FLAG_L2_WRITING | ARC_FLAG_HAS_L2HDR); mutex_enter(&dev->l2ad_mtx); list_insert_head(&dev->l2ad_buflist, hdr); mutex_exit(&dev->l2ad_mtx); (void) refcount_add_many(&dev->l2ad_alloc, psize, hdr); /* * Normally the L2ARC can use the hdr's data, but if * we're sharing data between the hdr and one of its * bufs, L2ARC needs its own copy of the data so that * the ZIO below can't race with the buf consumer. * Another case where we need to create a copy of the * data is when the buffer size is not device-aligned * and we need to pad the block to make it such. * That also keeps the clock hand suitably aligned. * * To ensure that the copy will be available for the * lifetime of the ZIO and be cleaned up afterwards, we * add it to the l2arc_free_on_write queue. */ abd_t *to_write; if (!HDR_SHARED_DATA(hdr) && psize == asize) { to_write = hdr->b_l1hdr.b_pabd; } else { to_write = abd_alloc_for_io(asize, HDR_ISTYPE_METADATA(hdr)); abd_copy(to_write, hdr->b_l1hdr.b_pabd, psize); if (asize != psize) { abd_zero_off(to_write, psize, asize - psize); } l2arc_free_abd_on_write(to_write, asize, arc_buf_type(hdr)); } wzio = zio_write_phys(pio, dev->l2ad_vdev, hdr->b_l2hdr.b_daddr, asize, to_write, ZIO_CHECKSUM_OFF, NULL, hdr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE); write_lsize += HDR_GET_LSIZE(hdr); DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio); write_psize += psize; write_asize += asize; dev->l2ad_hand += asize; mutex_exit(hash_lock); (void) zio_nowait(wzio); } multilist_sublist_unlock(mls); if (full == B_TRUE) break; } /* No buffers selected for writing? */ if (pio == NULL) { ASSERT0(write_lsize); ASSERT(!HDR_HAS_L1HDR(head)); kmem_cache_free(hdr_l2only_cache, head); return (0); } ASSERT3U(write_psize, <=, target_sz); ARCSTAT_BUMP(arcstat_l2_writes_sent); ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize); ARCSTAT_INCR(arcstat_l2_lsize, write_lsize); ARCSTAT_INCR(arcstat_l2_psize, write_psize); vdev_space_update(dev->l2ad_vdev, write_psize, 0, 0); /* * Bump device hand to the device start if it is approaching the end. * l2arc_evict() will already have evicted ahead for this case. */ if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) { dev->l2ad_hand = dev->l2ad_start; dev->l2ad_first = B_FALSE; } dev->l2ad_writing = B_TRUE; (void) zio_wait(pio); dev->l2ad_writing = B_FALSE; return (write_asize); } /* * This thread feeds the L2ARC at regular intervals. This is the beating * heart of the L2ARC. */ /* ARGSUSED */ static void l2arc_feed_thread(void *unused __unused) { callb_cpr_t cpr; l2arc_dev_t *dev; spa_t *spa; uint64_t size, wrote; clock_t begin, next = ddi_get_lbolt(); CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG); mutex_enter(&l2arc_feed_thr_lock); while (l2arc_thread_exit == 0) { CALLB_CPR_SAFE_BEGIN(&cpr); (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock, next - ddi_get_lbolt()); CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock); next = ddi_get_lbolt() + hz; /* * Quick check for L2ARC devices. 
*/ mutex_enter(&l2arc_dev_mtx); if (l2arc_ndev == 0) { mutex_exit(&l2arc_dev_mtx); continue; } mutex_exit(&l2arc_dev_mtx); begin = ddi_get_lbolt(); /* * This selects the next l2arc device to write to, and in * doing so the next spa to feed from: dev->l2ad_spa. This * will return NULL if there are now no l2arc devices or if * they are all faulted. * * If a device is returned, its spa's config lock is also * held to prevent device removal. l2arc_dev_get_next() * will grab and release l2arc_dev_mtx. */ if ((dev = l2arc_dev_get_next()) == NULL) continue; spa = dev->l2ad_spa; ASSERT3P(spa, !=, NULL); /* * If the pool is read-only then force the feed thread to * sleep a little longer. */ if (!spa_writeable(spa)) { next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz; spa_config_exit(spa, SCL_L2ARC, dev); continue; } /* * Avoid contributing to memory pressure. */ if (arc_reclaim_needed()) { ARCSTAT_BUMP(arcstat_l2_abort_lowmem); spa_config_exit(spa, SCL_L2ARC, dev); continue; } ARCSTAT_BUMP(arcstat_l2_feeds); size = l2arc_write_size(); /* * Evict L2ARC buffers that will be overwritten. */ l2arc_evict(dev, size, B_FALSE); /* * Write ARC buffers. */ wrote = l2arc_write_buffers(spa, dev, size); /* * Calculate interval between writes. */ next = l2arc_write_interval(begin, size, wrote); spa_config_exit(spa, SCL_L2ARC, dev); } l2arc_thread_exit = 0; cv_broadcast(&l2arc_feed_thr_cv); CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */ thread_exit(); } boolean_t l2arc_vdev_present(vdev_t *vd) { l2arc_dev_t *dev; mutex_enter(&l2arc_dev_mtx); for (dev = list_head(l2arc_dev_list); dev != NULL; dev = list_next(l2arc_dev_list, dev)) { if (dev->l2ad_vdev == vd) break; } mutex_exit(&l2arc_dev_mtx); return (dev != NULL); } /* * Add a vdev for use by the L2ARC. By this point the spa has already * validated the vdev and opened it. */ void l2arc_add_vdev(spa_t *spa, vdev_t *vd) { l2arc_dev_t *adddev; ASSERT(!l2arc_vdev_present(vd)); vdev_ashift_optimize(vd); /* * Create a new l2arc device entry. */ adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP); adddev->l2ad_spa = spa; adddev->l2ad_vdev = vd; adddev->l2ad_start = VDEV_LABEL_START_SIZE; adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd); adddev->l2ad_hand = adddev->l2ad_start; adddev->l2ad_first = B_TRUE; adddev->l2ad_writing = B_FALSE; mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL); /* * This is a list of all ARC buffers that are still valid on the * device. */ list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node)); vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand); refcount_create(&adddev->l2ad_alloc); /* * Add device to global list */ mutex_enter(&l2arc_dev_mtx); list_insert_head(l2arc_dev_list, adddev); atomic_inc_64(&l2arc_ndev); mutex_exit(&l2arc_dev_mtx); } /* * Remove a vdev from the L2ARC. */ void l2arc_remove_vdev(vdev_t *vd) { l2arc_dev_t *dev, *nextdev, *remdev = NULL; /* * Find the device by vdev */ mutex_enter(&l2arc_dev_mtx); for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) { nextdev = list_next(l2arc_dev_list, dev); if (vd == dev->l2ad_vdev) { remdev = dev; break; } } ASSERT3P(remdev, !=, NULL); /* * Remove device from global list */ list_remove(l2arc_dev_list, remdev); l2arc_dev_last = NULL; /* may have been invalidated */ atomic_dec_64(&l2arc_ndev); mutex_exit(&l2arc_dev_mtx); /* * Clear all buflists and ARC references. L2ARC device flush. 
*/ l2arc_evict(remdev, 0, B_TRUE); list_destroy(&remdev->l2ad_buflist); mutex_destroy(&remdev->l2ad_mtx); refcount_destroy(&remdev->l2ad_alloc); kmem_free(remdev, sizeof (l2arc_dev_t)); } void l2arc_init(void) { l2arc_thread_exit = 0; l2arc_ndev = 0; l2arc_writes_sent = 0; l2arc_writes_done = 0; mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL); mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL); mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL); l2arc_dev_list = &L2ARC_dev_list; l2arc_free_on_write = &L2ARC_free_on_write; list_create(l2arc_dev_list, sizeof (l2arc_dev_t), offsetof(l2arc_dev_t, l2ad_node)); list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t), offsetof(l2arc_data_free_t, l2df_list_node)); } void l2arc_fini(void) { /* * This is called from dmu_fini(), which is called from spa_fini(). * Because of this, we can assume that all l2arc devices have * already been removed when the pools themselves were removed. */ l2arc_do_free_on_write(); mutex_destroy(&l2arc_feed_thr_lock); cv_destroy(&l2arc_feed_thr_cv); mutex_destroy(&l2arc_dev_mtx); mutex_destroy(&l2arc_free_on_write_mtx); list_destroy(l2arc_dev_list); list_destroy(l2arc_free_on_write); } void l2arc_start(void) { if (!(spa_mode_global & FWRITE)) return; (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0, TS_RUN, minclsyspri); } void l2arc_stop(void) { if (!(spa_mode_global & FWRITE)) return; mutex_enter(&l2arc_feed_thr_lock); cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */ l2arc_thread_exit = 1; while (l2arc_thread_exit != 0) cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock); mutex_exit(&l2arc_feed_thr_lock); } Index: user/markj/netdump/sys/cddl/contrib/opensolaris =================================================================== --- user/markj/netdump/sys/cddl/contrib/opensolaris (revision 332407) +++ user/markj/netdump/sys/cddl/contrib/opensolaris (revision 332408) Property changes on: user/markj/netdump/sys/cddl/contrib/opensolaris ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/sys/cddl/contrib/opensolaris:r332339-332407 Index: user/markj/netdump/sys/cddl/dev/dtrace/dtrace_cddl.h =================================================================== --- user/markj/netdump/sys/cddl/dev/dtrace/dtrace_cddl.h (revision 332407) +++ user/markj/netdump/sys/cddl/dev/dtrace/dtrace_cddl.h (revision 332408) @@ -1,169 +1,171 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END * * $FreeBSD$ * */ #ifndef _DTRACE_CDDL_H_ #define _DTRACE_CDDL_H_ #include <sys/proc.h> #define LOCK_LEVEL 10 /* * Kernel DTrace extension to 'struct proc' for FreeBSD. */ typedef struct kdtrace_proc { int p_dtrace_probes; /* Are there probes for this proc?
*/ u_int64_t p_dtrace_count; /* Number of DTrace tracepoints */ void *p_dtrace_helpers; /* DTrace helpers, if any */ int p_dtrace_model; } kdtrace_proc_t; /* * Kernel DTrace extension to 'struct thread' for FreeBSD. */ typedef struct kdtrace_thread { u_int8_t td_dtrace_stop; /* Indicates a DTrace-desired stop */ u_int8_t td_dtrace_sig; /* Signal sent via DTrace's raise() */ + u_int8_t td_dtrace_inprobe; /* Are we in a probe? */ u_int td_predcache; /* DTrace predicate cache */ u_int64_t td_dtrace_vtime; /* DTrace virtual time */ u_int64_t td_dtrace_start; /* DTrace slice start time */ union __tdu { struct __tds { u_int8_t _td_dtrace_on; /* Hit a fasttrap tracepoint. */ u_int8_t _td_dtrace_step; /* About to return to kernel. */ u_int8_t _td_dtrace_ret; /* Handling a return probe. */ u_int8_t _td_dtrace_ast; /* Saved ast flag. */ #ifdef __amd64__ u_int8_t _td_dtrace_reg; #endif } _tds; u_long _td_dtrace_ft; /* Bitwise or of these flags. */ } _tdu; #define td_dtrace_ft _tdu._td_dtrace_ft #define td_dtrace_on _tdu._tds._td_dtrace_on #define td_dtrace_step _tdu._tds._td_dtrace_step #define td_dtrace_ret _tdu._tds._td_dtrace_ret #define td_dtrace_ast _tdu._tds._td_dtrace_ast #define td_dtrace_reg _tdu._tds._td_dtrace_reg uintptr_t td_dtrace_pc; /* DTrace saved pc from fasttrap. */ uintptr_t td_dtrace_npc; /* DTrace next pc from fasttrap. */ uintptr_t td_dtrace_scrpc; /* DTrace per-thread scratch location. */ uintptr_t td_dtrace_astpc; /* DTrace return sequence location. */ #ifdef __amd64__ uintptr_t td_dtrace_regv; #endif u_int64_t td_hrtime; /* Last time on cpu. */ void *td_dtrace_sscr; /* Saved scratch space location. */ void *td_systrace_args; /* syscall probe arguments. */ } kdtrace_thread_t; /* * Definitions to reference fields in the FreeBSD DTrace structures defined * above using the names of fields in similar structures in Solaris. Note * that the separation on FreeBSD is a licensing constraint designed to * keep the GENERIC kernel BSD licensed. */ #define t_dtrace_vtime td_dtrace->td_dtrace_vtime #define t_dtrace_start td_dtrace->td_dtrace_start #define t_dtrace_stop td_dtrace->td_dtrace_stop #define t_dtrace_sig td_dtrace->td_dtrace_sig +#define t_dtrace_inprobe td_dtrace->td_dtrace_inprobe #define t_predcache td_dtrace->td_predcache #define t_dtrace_ft td_dtrace->td_dtrace_ft #define t_dtrace_on td_dtrace->td_dtrace_on #define t_dtrace_step td_dtrace->td_dtrace_step #define t_dtrace_ret td_dtrace->td_dtrace_ret #define t_dtrace_ast td_dtrace->td_dtrace_ast #define t_dtrace_reg td_dtrace->td_dtrace_reg #define t_dtrace_pc td_dtrace->td_dtrace_pc #define t_dtrace_npc td_dtrace->td_dtrace_npc #define t_dtrace_scrpc td_dtrace->td_dtrace_scrpc #define t_dtrace_astpc td_dtrace->td_dtrace_astpc #define t_dtrace_regv td_dtrace->td_dtrace_regv #define t_dtrace_sscr td_dtrace->td_dtrace_sscr #define t_dtrace_systrace_args td_dtrace->td_systrace_args #define p_dtrace_helpers p_dtrace->p_dtrace_helpers #define p_dtrace_count p_dtrace->p_dtrace_count #define p_dtrace_probes p_dtrace->p_dtrace_probes #define p_model p_dtrace->p_dtrace_model #define DATAMODEL_NATIVE 0 #ifdef __amd64__ #define DATAMODEL_LP64 0 #define DATAMODEL_ILP32 1 #else #define DATAMODEL_LP64 1 #define DATAMODEL_ILP32 0 #endif /* * Definitions for fields in struct proc which are named differently in FreeBSD. */ #define p_cred p_ucred #define p_parent p_pptr /* * Definitions for fields in struct thread which are named differently in FreeBSD. 
*/ #define t_procp td_proc #define t_tid td_tid #define t_did td_tid #define t_cred td_ucred int priv_policy(const cred_t *, int, boolean_t, int, const char *); boolean_t priv_policy_only(const cred_t *, int, boolean_t); boolean_t priv_policy_choice(const cred_t *, int, boolean_t); /* * Test privilege. Audit success or failure, allow privilege debugging. * Returns 0 for success, err for failure. */ #define PRIV_POLICY(cred, priv, all, err, reason) \ priv_policy((cred), (priv), (all), (err), (reason)) /* * Test privilege. Audit success only, no privilege debugging. * Returns 1 for success, and 0 for failure. */ #define PRIV_POLICY_CHOICE(cred, priv, all) \ priv_policy_choice((cred), (priv), (all)) /* * Test privilege. No priv_debugging, no auditing. * Returns 1 for success, and 0 for failure. */ #define PRIV_POLICY_ONLY(cred, priv, all) \ priv_policy_only((cred), (priv), (all)) #endif /* !_DTRACE_CDDL_H_ */ Index: user/markj/netdump/sys/conf/files.arm64 =================================================================== --- user/markj/netdump/sys/conf/files.arm64 (revision 332407) +++ user/markj/netdump/sys/conf/files.arm64 (revision 332408) @@ -1,249 +1,250 @@ # $FreeBSD$ cloudabi32_vdso.o optional compat_cloudabi32 \ dependency "$S/contrib/cloudabi/cloudabi_vdso_armv6_on_64bit.S" \ compile-with "${CC} -x assembler-with-cpp -m32 -shared -nostdinc -nostdlib -Wl,-T$S/compat/cloudabi/cloudabi_vdso.lds $S/contrib/cloudabi/cloudabi_vdso_armv6_on_64bit.S -o ${.TARGET}" \ no-obj no-implicit-rule \ clean "cloudabi32_vdso.o" # cloudabi32_vdso_blob.o optional compat_cloudabi32 \ dependency "cloudabi32_vdso.o" \ compile-with "${OBJCOPY} --input-target binary --output-target elf64-littleaarch64 --binary-architecture aarch64 cloudabi32_vdso.o ${.TARGET}" \ no-implicit-rule \ clean "cloudabi32_vdso_blob.o" # cloudabi64_vdso.o optional compat_cloudabi64 \ dependency "$S/contrib/cloudabi/cloudabi_vdso_aarch64.S" \ compile-with "${CC} -x assembler-with-cpp -shared -nostdinc -nostdlib -Wl,-T$S/compat/cloudabi/cloudabi_vdso.lds $S/contrib/cloudabi/cloudabi_vdso_aarch64.S -o ${.TARGET}" \ no-obj no-implicit-rule \ clean "cloudabi64_vdso.o" # cloudabi64_vdso_blob.o optional compat_cloudabi64 \ dependency "cloudabi64_vdso.o" \ compile-with "${OBJCOPY} --input-target binary --output-target elf64-littleaarch64 --binary-architecture aarch64 cloudabi64_vdso.o ${.TARGET}" \ no-implicit-rule \ clean "cloudabi64_vdso_blob.o" # # Allwinner common files arm/allwinner/a10_ehci.c optional ehci aw_ehci fdt arm/allwinner/aw_gpio.c optional gpio aw_gpio fdt arm/allwinner/aw_mmc.c optional mmc aw_mmc fdt arm/allwinner/aw_nmi.c optional aw_nmi fdt \ compile-with "${NORMAL_C} -I$S/gnu/dts/include" arm/allwinner/aw_rsb.c optional aw_rsb fdt arm/allwinner/aw_rtc.c optional aw_rtc fdt arm/allwinner/aw_sid.c optional aw_sid fdt arm/allwinner/aw_thermal.c optional aw_thermal fdt arm/allwinner/aw_usbphy.c optional ehci aw_usbphy fdt arm/allwinner/aw_wdog.c optional aw_wdog fdt arm/allwinner/axp81x.c optional axp81x fdt arm/allwinner/if_awg.c optional awg ext_resources syscon fdt # Allwinner clock driver arm/allwinner/clkng/aw_ccung.c optional aw_ccu fdt arm/allwinner/clkng/aw_clk_nkmp.c optional aw_ccu fdt arm/allwinner/clkng/aw_clk_nm.c optional aw_ccu fdt arm/allwinner/clkng/aw_clk_prediv_mux.c optional aw_ccu fdt arm/allwinner/clkng/ccu_a64.c optional soc_allwinner_a64 aw_ccu fdt arm/allwinner/clkng/ccu_h3.c optional soc_allwinner_h5 aw_ccu fdt arm/allwinner/clkng/ccu_sun8i_r.c optional aw_ccu fdt # Allwinner padconf files 
arm/allwinner/a64/a64_padconf.c optional soc_allwinner_a64 fdt arm/allwinner/a64/a64_r_padconf.c optional soc_allwinner_a64 fdt arm/allwinner/h3/h3_padconf.c optional soc_allwinner_h5 fdt arm/allwinner/h3/h3_r_padconf.c optional soc_allwinner_h5 fdt arm/annapurna/alpine/alpine_ccu.c optional al_ccu fdt arm/annapurna/alpine/alpine_nb_service.c optional al_nb_service fdt arm/annapurna/alpine/alpine_pci.c optional al_pci fdt arm/annapurna/alpine/alpine_pci_msix.c optional al_pci fdt arm/annapurna/alpine/alpine_serdes.c optional al_serdes fdt \ no-depend \ compile-with "${CC} -c -o ${.TARGET} ${CFLAGS} -I$S/contrib/alpine-hal -I$S/contrib/alpine-hal/eth ${PROF} ${.IMPSRC}" arm/arm/generic_timer.c standard arm/arm/gic.c standard arm/arm/gic_acpi.c optional acpi arm/arm/gic_fdt.c optional fdt arm/arm/pmu.c standard arm/broadcom/bcm2835/bcm2835_audio.c optional sound vchiq fdt \ compile-with "${NORMAL_C} -DUSE_VCHIQ_ARM -D__VCCOREVER__=0x04000000 -I$S/contrib/vchiq" arm/broadcom/bcm2835/bcm2835_bsc.c optional bcm2835_bsc soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_cpufreq.c optional soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_dma.c optional soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_fbd.c optional vt soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_ft5406.c optional evdev bcm2835_ft5406 soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_gpio.c optional gpio soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_intr.c optional soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_mbox.c optional soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_rng.c optional random soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_sdhci.c optional sdhci soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_spi.c optional bcm2835_spi soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_vcio.c optional soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2835_wdog.c optional soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm2836.c optional soc_brcm_bcm2837 fdt arm/broadcom/bcm2835/bcm283x_dwc_fdt.c optional dwcotg fdt soc_brcm_bcm2837 arm/mv/armada38x/armada38x_rtc.c optional mv_rtc fdt arm64/acpica/acpi_machdep.c optional acpi arm64/acpica/OsdEnvironment.c optional acpi arm64/acpica/acpi_wakeup.c optional acpi arm64/acpica/pci_cfgreg.c optional acpi pci arm64/arm64/autoconf.c standard arm64/arm64/bus_machdep.c standard arm64/arm64/bus_space_asm.S standard arm64/arm64/busdma_bounce.c standard arm64/arm64/busdma_machdep.c standard arm64/arm64/bzero.S standard arm64/arm64/clock.c standard arm64/arm64/copyinout.S standard arm64/arm64/copystr.c standard arm64/arm64/cpu_errata.c standard arm64/arm64/cpufunc_asm.S standard arm64/arm64/db_disasm.c optional ddb arm64/arm64/db_interface.c optional ddb arm64/arm64/db_trace.c optional ddb arm64/arm64/debug_monitor.c optional ddb arm64/arm64/disassem.c optional ddb arm64/arm64/dump_machdep.c standard arm64/arm64/efirt_machdep.c optional efirt arm64/arm64/elf32_machdep.c optional compat_freebsd32 arm64/arm64/elf_machdep.c standard arm64/arm64/exception.S standard arm64/arm64/freebsd32_machdep.c optional compat_freebsd32 arm64/arm64/gicv3_its.c optional intrng fdt arm64/arm64/gic_v3.c standard arm64/arm64/gic_v3_acpi.c optional acpi arm64/arm64/gic_v3_fdt.c optional fdt arm64/arm64/identcpu.c standard arm64/arm64/in_cksum.c optional inet | inet6 arm64/arm64/locore.S standard no-obj arm64/arm64/machdep.c standard arm64/arm64/mem.c standard arm64/arm64/memcpy.S standard arm64/arm64/memmove.S standard arm64/arm64/minidump_machdep.c standard arm64/arm64/mp_machdep.c optional smp 
arm64/arm64/nexus.c standard arm64/arm64/ofw_machdep.c optional fdt arm64/arm64/pmap.c standard arm64/arm64/stack_machdep.c optional ddb | stack arm64/arm64/support.S standard arm64/arm64/swtch.S standard arm64/arm64/sys_machdep.c standard arm64/arm64/trap.c standard arm64/arm64/uio_machdep.c standard arm64/arm64/uma_machdep.c standard arm64/arm64/undefined.c standard arm64/arm64/unwind.c optional ddb | kdtrace_hooks | stack arm64/arm64/vfp.c standard arm64/arm64/vm_machdep.c standard arm64/cavium/thunder_pcie_fdt.c optional soc_cavm_thunderx pci fdt arm64/cavium/thunder_pcie_pem.c optional soc_cavm_thunderx pci arm64/cavium/thunder_pcie_pem_fdt.c optional soc_cavm_thunderx pci fdt arm64/cavium/thunder_pcie_common.c optional soc_cavm_thunderx pci arm64/cloudabi32/cloudabi32_sysvec.c optional compat_cloudabi32 arm64/cloudabi64/cloudabi64_sysvec.c optional compat_cloudabi64 arm64/coresight/coresight.c standard arm64/coresight/coresight_if.m standard arm64/coresight/coresight-cmd.c standard arm64/coresight/coresight-cpu-debug.c standard arm64/coresight/coresight-dynamic-replicator.c standard arm64/coresight/coresight-etm4x.c standard arm64/coresight/coresight-funnel.c standard arm64/coresight/coresight-tmc.c standard +arm64/qualcomm/qcom_gcc.c optional qcom_gcc fdt contrib/vchiq/interface/compat/vchi_bsd.c optional vchiq soc_brcm_bcm2837 \ compile-with "${NORMAL_C} -DUSE_VCHIQ_ARM -D__VCCOREVER__=0x04000000 -I$S/contrib/vchiq" contrib/vchiq/interface/vchiq_arm/vchiq_2835_arm.c optional vchiq soc_brcm_bcm2837 \ compile-with "${NORMAL_C} -Wno-unused -DUSE_VCHIQ_ARM -D__VCCOREVER__=0x04000000 -I$S/contrib/vchiq" contrib/vchiq/interface/vchiq_arm/vchiq_arm.c optional vchiq soc_brcm_bcm2837 \ compile-with "${NORMAL_C} -Wno-unused -DUSE_VCHIQ_ARM -D__VCCOREVER__=0x04000000 -I$S/contrib/vchiq" contrib/vchiq/interface/vchiq_arm/vchiq_connected.c optional vchiq soc_brcm_bcm2837 \ compile-with "${NORMAL_C} -DUSE_VCHIQ_ARM -D__VCCOREVER__=0x04000000 -I$S/contrib/vchiq" contrib/vchiq/interface/vchiq_arm/vchiq_core.c optional vchiq soc_brcm_bcm2837 \ compile-with "${NORMAL_C} -DUSE_VCHIQ_ARM -D__VCCOREVER__=0x04000000 -I$S/contrib/vchiq" contrib/vchiq/interface/vchiq_arm/vchiq_kern_lib.c optional vchiq soc_brcm_bcm2837 \ compile-with "${NORMAL_C} -DUSE_VCHIQ_ARM -D__VCCOREVER__=0x04000000 -I$S/contrib/vchiq" contrib/vchiq/interface/vchiq_arm/vchiq_kmod.c optional vchiq soc_brcm_bcm2837 \ compile-with "${NORMAL_C} -DUSE_VCHIQ_ARM -D__VCCOREVER__=0x04000000 -I$S/contrib/vchiq" contrib/vchiq/interface/vchiq_arm/vchiq_shim.c optional vchiq soc_brcm_bcm2837 \ compile-with "${NORMAL_C} -DUSE_VCHIQ_ARM -D__VCCOREVER__=0x04000000 -I$S/contrib/vchiq" contrib/vchiq/interface/vchiq_arm/vchiq_util.c optional vchiq soc_brcm_bcm2837 \ compile-with "${NORMAL_C} -DUSE_VCHIQ_ARM -D__VCCOREVER__=0x04000000 -I$S/contrib/vchiq" crypto/armv8/armv8_crypto.c optional armv8crypto armv8_crypto_wrap.o optional armv8crypto \ dependency "$S/crypto/armv8/armv8_crypto_wrap.c" \ compile-with "${CC} -c ${CFLAGS:C/^-O2$/-O3/:N-nostdinc:N-mgeneral-regs-only} ${WERROR} ${NO_WCAST_QUAL} ${PROF} -march=armv8-a+crypto ${.IMPSRC}" \ no-implicit-rule \ clean "armv8_crypto_wrap.o" crypto/blowfish/bf_enc.c optional crypto | ipsec | ipsec_support crypto/des/des_enc.c optional crypto | ipsec | ipsec_support | netsmb dev/acpica/acpi_bus_if.m optional acpi dev/acpica/acpi_if.m optional acpi dev/acpica/acpi_pci_link.c optional acpi pci dev/acpica/acpi_pcib.c optional acpi pci dev/ahci/ahci_generic.c optional ahci dev/axgbe/if_axgbe.c optional axgbe 
dev/axgbe/xgbe-desc.c optional axgbe dev/axgbe/xgbe-dev.c optional axgbe dev/axgbe/xgbe-drv.c optional axgbe dev/axgbe/xgbe-mdio.c optional axgbe dev/cpufreq/cpufreq_dt.c optional cpufreq fdt dev/iicbus/twsi/a10_twsi.c optional twsi fdt dev/iicbus/twsi/twsi.c optional twsi fdt dev/hwpmc/hwpmc_arm64.c optional hwpmc dev/hwpmc/hwpmc_arm64_md.c optional hwpmc dev/mbox/mbox_if.m optional soc_brcm_bcm2837 dev/mmc/host/dwmmc.c optional dwmmc fdt dev/mmc/host/dwmmc_hisi.c optional dwmmc fdt soc_hisi_hi6220 dev/mmc/host/dwmmc_rockchip.c optional dwmmc fdt soc_rockchip_rk3328 dev/neta/if_mvneta_fdt.c optional neta fdt dev/neta/if_mvneta.c optional neta mdio mii dev/ofw/ofw_cpu.c optional fdt dev/ofw/ofwpci.c optional fdt pci dev/pci/pci_host_generic.c optional pci dev/pci/pci_host_generic_acpi.c optional pci acpi dev/pci/pci_host_generic_fdt.c optional pci fdt dev/psci/psci.c optional psci dev/psci/psci_arm64.S optional psci dev/uart/uart_cpu_arm64.c optional uart dev/uart/uart_dev_pl011.c optional uart pl011 dev/usb/controller/dwc_otg_hisi.c optional dwcotg fdt soc_hisi_hi6220 dev/usb/controller/ehci_mv.c optional ehci_mv fdt dev/usb/controller/generic_ehci.c optional ehci acpi dev/usb/controller/generic_ohci.c optional ohci fdt dev/usb/controller/generic_usb_if.m optional ohci fdt dev/usb/controller/xhci_mv.c optional xhci_mv fdt dev/vnic/mrml_bridge.c optional vnic fdt dev/vnic/nic_main.c optional vnic pci dev/vnic/nicvf_main.c optional vnic pci pci_iov dev/vnic/nicvf_queues.c optional vnic pci pci_iov dev/vnic/thunder_bgx_fdt.c optional vnic fdt dev/vnic/thunder_bgx.c optional vnic pci dev/vnic/thunder_mdio_fdt.c optional vnic fdt dev/vnic/thunder_mdio.c optional vnic dev/vnic/lmac_if.m optional inet | inet6 | vnic kern/kern_clocksource.c standard kern/msi_if.m optional intrng kern/pic_if.m optional intrng kern/subr_devmap.c standard kern/subr_intr.c optional intrng libkern/bcmp.c standard libkern/ffs.c standard libkern/ffsl.c standard libkern/ffsll.c standard libkern/fls.c standard libkern/flsl.c standard libkern/flsll.c standard libkern/memset.c standard libkern/arm64/crc32c_armv8.S standard cddl/contrib/opensolaris/common/atomic/aarch64/opensolaris_atomic.S optional zfs | dtrace compile-with "${CDDL_C}" cddl/dev/dtrace/aarch64/dtrace_asm.S optional dtrace compile-with "${DTRACE_S}" cddl/dev/dtrace/aarch64/dtrace_subr.c optional dtrace compile-with "${DTRACE_C}" cddl/dev/fbt/aarch64/fbt_isa.c optional dtrace_fbt | dtraceall compile-with "${FBT_C}" arm64/rockchip/clk/rk_cru.c optional fdt soc_rockchip_rk3328 arm64/rockchip/clk/rk_clk_composite.c optional fdt soc_rockchip_rk3328 arm64/rockchip/clk/rk_clk_gate.c optional fdt soc_rockchip_rk3328 arm64/rockchip/clk/rk_clk_mux.c optional fdt soc_rockchip_rk3328 arm64/rockchip/clk/rk_clk_pll.c optional fdt soc_rockchip_rk3328 arm64/rockchip/clk/rk3328_cru.c optional fdt soc_rockchip_rk3328 Index: user/markj/netdump/sys/dev/cesa/cesa.c =================================================================== --- user/markj/netdump/sys/dev/cesa/cesa.c (revision 332407) +++ user/markj/netdump/sys/dev/cesa/cesa.c (revision 332408) @@ -1,1701 +1,1894 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (C) 2009-2011 Semihalf. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * CESA SRAM Memory Map: * * +------------------------+ <= sc->sc_sram_base_va + CESA_SRAM_SIZE * | | * | DATA | * | | * +------------------------+ <= sc->sc_sram_base_va + CESA_DATA(0) * | struct cesa_sa_data | * +------------------------+ * | struct cesa_sa_hdesc | * +------------------------+ <= sc->sc_sram_base_va */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include +#include #include #include #include #include #include #include #include #include #include "cryptodev_if.h" #include #include #include "cesa.h" static int cesa_probe(device_t); static int cesa_attach(device_t); +static int cesa_attach_late(device_t); static int cesa_detach(device_t); static void cesa_intr(void *); static int cesa_newsession(device_t, u_int32_t *, struct cryptoini *); static int cesa_freesession(device_t, u_int64_t); static int cesa_process(device_t, struct cryptop *, int); static struct resource_spec cesa_res_spec[] = { { SYS_RES_MEMORY, 0, RF_ACTIVE }, { SYS_RES_MEMORY, 1, RF_ACTIVE }, { SYS_RES_IRQ, 0, RF_ACTIVE | RF_SHAREABLE }, { -1, 0 } }; static device_method_t cesa_methods[] = { /* Device interface */ DEVMETHOD(device_probe, cesa_probe), DEVMETHOD(device_attach, cesa_attach), DEVMETHOD(device_detach, cesa_detach), /* Crypto device methods */ DEVMETHOD(cryptodev_newsession, cesa_newsession), DEVMETHOD(cryptodev_freesession,cesa_freesession), DEVMETHOD(cryptodev_process, cesa_process), DEVMETHOD_END }; static driver_t cesa_driver = { "cesa", cesa_methods, sizeof (struct cesa_softc) }; static devclass_t cesa_devclass; DRIVER_MODULE(cesa, simplebus, cesa_driver, cesa_devclass, 0, 0); MODULE_DEPEND(cesa, crypto, 1, 1, 1); static void cesa_dump_cshd(struct cesa_softc *sc, struct cesa_sa_hdesc *cshd) { #ifdef DEBUG device_t dev; dev = sc->sc_dev; device_printf(dev, "CESA SA Hardware Descriptor:\n"); device_printf(dev, "\t\tconfig: 0x%08X\n", cshd->cshd_config); device_printf(dev, "\t\te_src: 0x%08X\n", cshd->cshd_enc_src); device_printf(dev, "\t\te_dst: 0x%08X\n", cshd->cshd_enc_dst); device_printf(dev, "\t\te_dlen: 0x%08X\n", cshd->cshd_enc_dlen); device_printf(dev, "\t\te_key: 0x%08X\n", cshd->cshd_enc_key); device_printf(dev, "\t\te_iv_1: 0x%08X\n", cshd->cshd_enc_iv); device_printf(dev, "\t\te_iv_2: 0x%08X\n", cshd->cshd_enc_iv_buf); device_printf(dev, "\t\tm_src: 0x%08X\n", cshd->cshd_mac_src); device_printf(dev, "\t\tm_dst: 0x%08X\n", cshd->cshd_mac_dst); device_printf(dev, "\t\tm_dlen: 0x%08X\n", cshd->cshd_mac_dlen); device_printf(dev, 
"\t\tm_tlen: 0x%08X\n", cshd->cshd_mac_total_dlen); device_printf(dev, "\t\tm_iv_i: 0x%08X\n", cshd->cshd_mac_iv_in); device_printf(dev, "\t\tm_iv_o: 0x%08X\n", cshd->cshd_mac_iv_out); #endif } static void cesa_alloc_dma_mem_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error) { struct cesa_dma_mem *cdm; if (error) return; KASSERT(nseg == 1, ("Got wrong number of DMA segments, should be 1.")); cdm = arg; cdm->cdm_paddr = segs->ds_addr; } static int cesa_alloc_dma_mem(struct cesa_softc *sc, struct cesa_dma_mem *cdm, bus_size_t size) { int error; KASSERT(cdm->cdm_vaddr == NULL, ("%s(): DMA memory descriptor in use.", __func__)); error = bus_dma_tag_create(bus_get_dma_tag(sc->sc_dev), /* parent */ PAGE_SIZE, 0, /* alignment, boundary */ BUS_SPACE_MAXADDR_32BIT, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filtfunc, filtfuncarg */ size, 1, /* maxsize, nsegments */ size, 0, /* maxsegsz, flags */ NULL, NULL, /* lockfunc, lockfuncarg */ &cdm->cdm_tag); /* dmat */ if (error) { device_printf(sc->sc_dev, "failed to allocate busdma tag, error" " %i!\n", error); goto err1; } error = bus_dmamem_alloc(cdm->cdm_tag, &cdm->cdm_vaddr, BUS_DMA_NOWAIT | BUS_DMA_ZERO, &cdm->cdm_map); if (error) { device_printf(sc->sc_dev, "failed to allocate DMA safe" " memory, error %i!\n", error); goto err2; } error = bus_dmamap_load(cdm->cdm_tag, cdm->cdm_map, cdm->cdm_vaddr, size, cesa_alloc_dma_mem_cb, cdm, BUS_DMA_NOWAIT); if (error) { device_printf(sc->sc_dev, "cannot get address of the DMA" " memory, error %i\n", error); goto err3; } return (0); err3: bus_dmamem_free(cdm->cdm_tag, cdm->cdm_vaddr, cdm->cdm_map); err2: bus_dma_tag_destroy(cdm->cdm_tag); err1: cdm->cdm_vaddr = NULL; return (error); } static void cesa_free_dma_mem(struct cesa_dma_mem *cdm) { bus_dmamap_unload(cdm->cdm_tag, cdm->cdm_map); bus_dmamem_free(cdm->cdm_tag, cdm->cdm_vaddr, cdm->cdm_map); bus_dma_tag_destroy(cdm->cdm_tag); cdm->cdm_vaddr = NULL; } static void cesa_sync_dma_mem(struct cesa_dma_mem *cdm, bus_dmasync_op_t op) { /* Sync only if dma memory is valid */ if (cdm->cdm_vaddr != NULL) bus_dmamap_sync(cdm->cdm_tag, cdm->cdm_map, op); } static void cesa_sync_desc(struct cesa_softc *sc, bus_dmasync_op_t op) { cesa_sync_dma_mem(&sc->sc_tdesc_cdm, op); cesa_sync_dma_mem(&sc->sc_sdesc_cdm, op); cesa_sync_dma_mem(&sc->sc_requests_cdm, op); } static struct cesa_session * cesa_alloc_session(struct cesa_softc *sc) { struct cesa_session *cs; CESA_GENERIC_ALLOC_LOCKED(sc, cs, sessions); return (cs); } static struct cesa_session * cesa_get_session(struct cesa_softc *sc, uint32_t sid) { if (sid >= CESA_SESSIONS) return (NULL); return (&sc->sc_sessions[sid]); } static void cesa_free_session(struct cesa_softc *sc, struct cesa_session *cs) { CESA_GENERIC_FREE_LOCKED(sc, cs, sessions); } static struct cesa_request * cesa_alloc_request(struct cesa_softc *sc) { struct cesa_request *cr; CESA_GENERIC_ALLOC_LOCKED(sc, cr, requests); if (!cr) return (NULL); STAILQ_INIT(&cr->cr_tdesc); STAILQ_INIT(&cr->cr_sdesc); return (cr); } static void cesa_free_request(struct cesa_softc *sc, struct cesa_request *cr) { /* Free TDMA descriptors assigned to this request */ CESA_LOCK(sc, tdesc); STAILQ_CONCAT(&sc->sc_free_tdesc, &cr->cr_tdesc); CESA_UNLOCK(sc, tdesc); /* Free SA descriptors assigned to this request */ CESA_LOCK(sc, sdesc); STAILQ_CONCAT(&sc->sc_free_sdesc, &cr->cr_sdesc); CESA_UNLOCK(sc, sdesc); /* Unload DMA memory associated with request */ if (cr->cr_dmap_loaded) { bus_dmamap_unload(sc->sc_data_dtag, cr->cr_dmap); cr->cr_dmap_loaded = 
0; } CESA_GENERIC_FREE_LOCKED(sc, cr, requests); } static void cesa_enqueue_request(struct cesa_softc *sc, struct cesa_request *cr) { CESA_LOCK(sc, requests); STAILQ_INSERT_TAIL(&sc->sc_ready_requests, cr, cr_stq); CESA_UNLOCK(sc, requests); } static struct cesa_tdma_desc * cesa_alloc_tdesc(struct cesa_softc *sc) { struct cesa_tdma_desc *ctd; CESA_GENERIC_ALLOC_LOCKED(sc, ctd, tdesc); if (!ctd) device_printf(sc->sc_dev, "TDMA descriptors pool exhausted. " "Consider increasing CESA_TDMA_DESCRIPTORS.\n"); return (ctd); } static struct cesa_sa_desc * cesa_alloc_sdesc(struct cesa_softc *sc, struct cesa_request *cr) { struct cesa_sa_desc *csd; CESA_GENERIC_ALLOC_LOCKED(sc, csd, sdesc); if (!csd) { device_printf(sc->sc_dev, "SA descriptors pool exhausted. " "Consider increasing CESA_SA_DESCRIPTORS.\n"); return (NULL); } STAILQ_INSERT_TAIL(&cr->cr_sdesc, csd, csd_stq); /* Fill in SA descriptor with default values */ csd->csd_cshd->cshd_enc_key = CESA_SA_DATA(csd_key); csd->csd_cshd->cshd_enc_iv = CESA_SA_DATA(csd_iv); csd->csd_cshd->cshd_enc_iv_buf = CESA_SA_DATA(csd_iv); csd->csd_cshd->cshd_enc_src = 0; csd->csd_cshd->cshd_enc_dst = 0; csd->csd_cshd->cshd_enc_dlen = 0; csd->csd_cshd->cshd_mac_dst = CESA_SA_DATA(csd_hash); csd->csd_cshd->cshd_mac_iv_in = CESA_SA_DATA(csd_hiv_in); csd->csd_cshd->cshd_mac_iv_out = CESA_SA_DATA(csd_hiv_out); csd->csd_cshd->cshd_mac_src = 0; csd->csd_cshd->cshd_mac_dlen = 0; return (csd); } static struct cesa_tdma_desc * cesa_tdma_copy(struct cesa_softc *sc, bus_addr_t dst, bus_addr_t src, bus_size_t size) { struct cesa_tdma_desc *ctd; ctd = cesa_alloc_tdesc(sc); if (!ctd) return (NULL); ctd->ctd_cthd->cthd_dst = dst; ctd->ctd_cthd->cthd_src = src; ctd->ctd_cthd->cthd_byte_count = size; /* Handle special control packet */ if (size != 0) ctd->ctd_cthd->cthd_flags = CESA_CTHD_OWNED; else ctd->ctd_cthd->cthd_flags = 0; return (ctd); } static struct cesa_tdma_desc * cesa_tdma_copyin_sa_data(struct cesa_softc *sc, struct cesa_request *cr) { return (cesa_tdma_copy(sc, sc->sc_sram_base_pa + sizeof(struct cesa_sa_hdesc), cr->cr_csd_paddr, sizeof(struct cesa_sa_data))); } static struct cesa_tdma_desc * cesa_tdma_copyout_sa_data(struct cesa_softc *sc, struct cesa_request *cr) { return (cesa_tdma_copy(sc, cr->cr_csd_paddr, sc->sc_sram_base_pa + sizeof(struct cesa_sa_hdesc), sizeof(struct cesa_sa_data))); } static struct cesa_tdma_desc * cesa_tdma_copy_sdesc(struct cesa_softc *sc, struct cesa_sa_desc *csd) { return (cesa_tdma_copy(sc, sc->sc_sram_base_pa, csd->csd_cshd_paddr, sizeof(struct cesa_sa_hdesc))); } static void cesa_append_tdesc(struct cesa_request *cr, struct cesa_tdma_desc *ctd) { struct cesa_tdma_desc *ctd_prev; if (!STAILQ_EMPTY(&cr->cr_tdesc)) { ctd_prev = STAILQ_LAST(&cr->cr_tdesc, cesa_tdma_desc, ctd_stq); ctd_prev->ctd_cthd->cthd_next = ctd->ctd_cthd_paddr; } ctd->ctd_cthd->cthd_next = 0; STAILQ_INSERT_TAIL(&cr->cr_tdesc, ctd, ctd_stq); } static int cesa_append_packet(struct cesa_softc *sc, struct cesa_request *cr, struct cesa_packet *cp, struct cesa_sa_desc *csd) { struct cesa_tdma_desc *ctd, *tmp; /* Copy SA descriptor for this packet */ ctd = cesa_tdma_copy_sdesc(sc, csd); if (!ctd) return (ENOMEM); cesa_append_tdesc(cr, ctd); /* Copy data to be processed */ STAILQ_FOREACH_SAFE(ctd, &cp->cp_copyin, ctd_stq, tmp) cesa_append_tdesc(cr, ctd); STAILQ_INIT(&cp->cp_copyin); /* Insert control descriptor */ ctd = cesa_tdma_copy(sc, 0, 0, 0); if (!ctd) return (ENOMEM); cesa_append_tdesc(cr, ctd); /* Copy back results */ STAILQ_FOREACH_SAFE(ctd, &cp->cp_copyout,
ctd_stq, tmp) cesa_append_tdesc(cr, ctd); STAILQ_INIT(&cp->cp_copyout); return (0); } static int cesa_set_mkey(struct cesa_session *cs, int alg, const uint8_t *mkey, int mklen) { uint8_t ipad[CESA_MAX_HMAC_BLOCK_LEN]; uint8_t opad[CESA_MAX_HMAC_BLOCK_LEN]; SHA1_CTX sha1ctx; SHA256_CTX sha256ctx; MD5_CTX md5ctx; uint32_t *hout; uint32_t *hin; int i; memset(ipad, HMAC_IPAD_VAL, CESA_MAX_HMAC_BLOCK_LEN); memset(opad, HMAC_OPAD_VAL, CESA_MAX_HMAC_BLOCK_LEN); for (i = 0; i < mklen; i++) { ipad[i] ^= mkey[i]; opad[i] ^= mkey[i]; } hin = (uint32_t *)cs->cs_hiv_in; hout = (uint32_t *)cs->cs_hiv_out; switch (alg) { case CRYPTO_MD5_HMAC: MD5Init(&md5ctx); MD5Update(&md5ctx, ipad, MD5_HMAC_BLOCK_LEN); memcpy(hin, md5ctx.state, sizeof(md5ctx.state)); MD5Init(&md5ctx); MD5Update(&md5ctx, opad, MD5_HMAC_BLOCK_LEN); memcpy(hout, md5ctx.state, sizeof(md5ctx.state)); break; case CRYPTO_SHA1_HMAC: SHA1Init(&sha1ctx); SHA1Update(&sha1ctx, ipad, SHA1_HMAC_BLOCK_LEN); memcpy(hin, sha1ctx.h.b32, sizeof(sha1ctx.h.b32)); SHA1Init(&sha1ctx); SHA1Update(&sha1ctx, opad, SHA1_HMAC_BLOCK_LEN); memcpy(hout, sha1ctx.h.b32, sizeof(sha1ctx.h.b32)); break; case CRYPTO_SHA2_256_HMAC: SHA256_Init(&sha256ctx); SHA256_Update(&sha256ctx, ipad, SHA2_256_HMAC_BLOCK_LEN); memcpy(hin, sha256ctx.state, sizeof(sha256ctx.state)); SHA256_Init(&sha256ctx); SHA256_Update(&sha256ctx, opad, SHA2_256_HMAC_BLOCK_LEN); memcpy(hout, sha256ctx.state, sizeof(sha256ctx.state)); break; default: return (EINVAL); } for (i = 0; i < CESA_MAX_HASH_LEN / sizeof(uint32_t); i++) { hin[i] = htobe32(hin[i]); hout[i] = htobe32(hout[i]); } return (0); } static int cesa_prep_aes_key(struct cesa_session *cs) { uint32_t ek[4 * (RIJNDAEL_MAXNR + 1)]; uint32_t *dkey; int i; rijndaelKeySetupEnc(ek, cs->cs_key, cs->cs_klen * 8); cs->cs_config &= ~CESA_CSH_AES_KLEN_MASK; dkey = (uint32_t *)cs->cs_aes_dkey; switch (cs->cs_klen) { case 16: cs->cs_config |= CESA_CSH_AES_KLEN_128; for (i = 0; i < 4; i++) *dkey++ = htobe32(ek[4 * 10 + i]); break; case 24: cs->cs_config |= CESA_CSH_AES_KLEN_192; for (i = 0; i < 4; i++) *dkey++ = htobe32(ek[4 * 12 + i]); for (i = 0; i < 2; i++) *dkey++ = htobe32(ek[4 * 11 + 2 + i]); break; case 32: cs->cs_config |= CESA_CSH_AES_KLEN_256; for (i = 0; i < 4; i++) *dkey++ = htobe32(ek[4 * 14 + i]); for (i = 0; i < 4; i++) *dkey++ = htobe32(ek[4 * 13 + i]); break; default: return (EINVAL); } return (0); } static int cesa_is_hash(int alg) { switch (alg) { case CRYPTO_MD5: case CRYPTO_MD5_HMAC: case CRYPTO_SHA1: case CRYPTO_SHA1_HMAC: case CRYPTO_SHA2_256_HMAC: return (1); default: return (0); } } static void cesa_start_packet(struct cesa_packet *cp, unsigned int size) { cp->cp_size = size; cp->cp_offset = 0; STAILQ_INIT(&cp->cp_copyin); STAILQ_INIT(&cp->cp_copyout); } static int cesa_fill_packet(struct cesa_softc *sc, struct cesa_packet *cp, bus_dma_segment_t *seg) { struct cesa_tdma_desc *ctd; unsigned int bsize; /* Calculate size of block copy */ bsize = MIN(seg->ds_len, cp->cp_size - cp->cp_offset); if (bsize > 0) { ctd = cesa_tdma_copy(sc, sc->sc_sram_base_pa + CESA_DATA(cp->cp_offset), seg->ds_addr, bsize); if (!ctd) return (-ENOMEM); STAILQ_INSERT_TAIL(&cp->cp_copyin, ctd, ctd_stq); ctd = cesa_tdma_copy(sc, seg->ds_addr, sc->sc_sram_base_pa + CESA_DATA(cp->cp_offset), bsize); if (!ctd) return (-ENOMEM); STAILQ_INSERT_TAIL(&cp->cp_copyout, ctd, ctd_stq); seg->ds_len -= bsize; seg->ds_addr += bsize; cp->cp_offset += bsize; } return (bsize); } static void cesa_create_chain_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error) { unsigned 
int mpsize, fragmented; unsigned int mlen, mskip, tmlen; struct cesa_chain_info *cci; unsigned int elen, eskip; unsigned int skip, len; struct cesa_sa_desc *csd; struct cesa_request *cr; struct cesa_softc *sc; struct cesa_packet cp; bus_dma_segment_t seg; uint32_t config; int size; cci = arg; sc = cci->cci_sc; cr = cci->cci_cr; if (error) { cci->cci_error = error; return; } elen = cci->cci_enc ? cci->cci_enc->crd_len : 0; eskip = cci->cci_enc ? cci->cci_enc->crd_skip : 0; mlen = cci->cci_mac ? cci->cci_mac->crd_len : 0; mskip = cci->cci_mac ? cci->cci_mac->crd_skip : 0; if (elen && mlen && ((eskip > mskip && ((eskip - mskip) & (cr->cr_cs->cs_ivlen - 1))) || (mskip > eskip && ((mskip - eskip) & (cr->cr_cs->cs_mblen - 1))) || (eskip > (mskip + mlen)) || (mskip > (eskip + elen)))) { /* * Data alignment in the request does not meet CESA requirements * for combined encryption/decryption and hashing. We have to * split the request into separate operations and process them * one by one. */ config = cci->cci_config; if ((config & CESA_CSHD_OP_MASK) == CESA_CSHD_MAC_AND_ENC) { config &= ~CESA_CSHD_OP_MASK; cci->cci_config = config | CESA_CSHD_MAC; cci->cci_enc = NULL; cci->cci_mac = cr->cr_mac; cesa_create_chain_cb(cci, segs, nseg, cci->cci_error); cci->cci_config = config | CESA_CSHD_ENC; cci->cci_enc = cr->cr_enc; cci->cci_mac = NULL; cesa_create_chain_cb(cci, segs, nseg, cci->cci_error); } else { config &= ~CESA_CSHD_OP_MASK; cci->cci_config = config | CESA_CSHD_ENC; cci->cci_enc = cr->cr_enc; cci->cci_mac = NULL; cesa_create_chain_cb(cci, segs, nseg, cci->cci_error); cci->cci_config = config | CESA_CSHD_MAC; cci->cci_enc = NULL; cci->cci_mac = cr->cr_mac; cesa_create_chain_cb(cci, segs, nseg, cci->cci_error); } return; } tmlen = mlen; fragmented = 0; mpsize = CESA_MAX_PACKET_SIZE; mpsize &= ~((cr->cr_cs->cs_ivlen - 1) | (cr->cr_cs->cs_mblen - 1)); if (elen && mlen) { skip = MIN(eskip, mskip); len = MAX(elen + eskip, mlen + mskip) - skip; } else if (elen) { skip = eskip; len = elen; } else { skip = mskip; len = mlen; } /* Start first packet in chain */ cesa_start_packet(&cp, MIN(mpsize, len)); while (nseg-- && len > 0) { seg = *(segs++); /* * Skip data in buffer on which neither ENC nor MAC operation * is requested. */ if (skip > 0) { size = MIN(skip, seg.ds_len); skip -= size; seg.ds_addr += size; seg.ds_len -= size; if (eskip > 0) eskip -= size; if (mskip > 0) mskip -= size; if (seg.ds_len == 0) continue; } while (1) { /* * Fill in current packet with data. Break if there is * no more data in current DMA segment or an error * occurred. */ size = cesa_fill_packet(sc, &cp, &seg); if (size <= 0) { error = -size; break; } len -= size; /* If packet is full, append it to the chain */ if (cp.cp_size == cp.cp_offset) { csd = cesa_alloc_sdesc(sc, cr); if (!csd) { error = ENOMEM; break; } /* Create SA descriptor for this packet */ csd->csd_cshd->cshd_config = cci->cci_config; csd->csd_cshd->cshd_mac_total_dlen = tmlen; /* * Enable fragmentation if the request will not fit * into one packet.
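 * The hardware marks a multi-packet request with CESA_CSHD_FRAG_FIRST on
 * the first packet, CESA_CSHD_FRAG_MIDDLE on intermediate ones and
 * CESA_CSHD_FRAG_LAST on the final one; the code below picks the flag
 * from the amount of data left (len) and whether fragmentation has
 * already started.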
*/ if (len > 0) { if (!fragmented) { fragmented = 1; csd->csd_cshd->cshd_config |= CESA_CSHD_FRAG_FIRST; } else csd->csd_cshd->cshd_config |= CESA_CSHD_FRAG_MIDDLE; } else if (fragmented) csd->csd_cshd->cshd_config |= CESA_CSHD_FRAG_LAST; if (eskip < cp.cp_size && elen > 0) { csd->csd_cshd->cshd_enc_src = CESA_DATA(eskip); csd->csd_cshd->cshd_enc_dst = CESA_DATA(eskip); csd->csd_cshd->cshd_enc_dlen = MIN(elen, cp.cp_size - eskip); } if (mskip < cp.cp_size && mlen > 0) { csd->csd_cshd->cshd_mac_src = CESA_DATA(mskip); csd->csd_cshd->cshd_mac_dlen = MIN(mlen, cp.cp_size - mskip); } elen -= csd->csd_cshd->cshd_enc_dlen; eskip -= MIN(eskip, cp.cp_size); mlen -= csd->csd_cshd->cshd_mac_dlen; mskip -= MIN(mskip, cp.cp_size); cesa_dump_cshd(sc, csd->csd_cshd); /* Append packet to the request */ error = cesa_append_packet(sc, cr, &cp, csd); if (error) break; /* Start a new packet, as current is full */ cesa_start_packet(&cp, MIN(mpsize, len)); } } if (error) break; } if (error) { /* * Move all allocated resources to the request. They will be * freed later. */ STAILQ_CONCAT(&cr->cr_tdesc, &cp.cp_copyin); STAILQ_CONCAT(&cr->cr_tdesc, &cp.cp_copyout); cci->cci_error = error; } } static void cesa_create_chain_cb2(void *arg, bus_dma_segment_t *segs, int nseg, bus_size_t size, int error) { cesa_create_chain_cb(arg, segs, nseg, error); } static int cesa_create_chain(struct cesa_softc *sc, struct cesa_request *cr) { struct cesa_chain_info cci; struct cesa_tdma_desc *ctd; uint32_t config; int error; error = 0; CESA_LOCK_ASSERT(sc, sessions); /* Create request metadata */ if (cr->cr_enc) { if (cr->cr_enc->crd_alg == CRYPTO_AES_CBC && (cr->cr_enc->crd_flags & CRD_F_ENCRYPT) == 0) memcpy(cr->cr_csd->csd_key, cr->cr_cs->cs_aes_dkey, cr->cr_cs->cs_klen); else memcpy(cr->cr_csd->csd_key, cr->cr_cs->cs_key, cr->cr_cs->cs_klen); } if (cr->cr_mac) { memcpy(cr->cr_csd->csd_hiv_in, cr->cr_cs->cs_hiv_in, CESA_MAX_HASH_LEN); memcpy(cr->cr_csd->csd_hiv_out, cr->cr_cs->cs_hiv_out, CESA_MAX_HASH_LEN); } ctd = cesa_tdma_copyin_sa_data(sc, cr); if (!ctd) return (ENOMEM); cesa_append_tdesc(cr, ctd); /* Prepare SA configuration */ config = cr->cr_cs->cs_config; if (cr->cr_enc && (cr->cr_enc->crd_flags & CRD_F_ENCRYPT) == 0) config |= CESA_CSHD_DECRYPT; if (cr->cr_enc && !cr->cr_mac) config |= CESA_CSHD_ENC; if (!cr->cr_enc && cr->cr_mac) config |= CESA_CSHD_MAC; if (cr->cr_enc && cr->cr_mac) config |= (config & CESA_CSHD_DECRYPT) ? 
CESA_CSHD_MAC_AND_ENC : CESA_CSHD_ENC_AND_MAC; /* Create data packets */ cci.cci_sc = sc; cci.cci_cr = cr; cci.cci_enc = cr->cr_enc; cci.cci_mac = cr->cr_mac; cci.cci_config = config; cci.cci_error = 0; if (cr->cr_crp->crp_flags & CRYPTO_F_IOV) error = bus_dmamap_load_uio(sc->sc_data_dtag, cr->cr_dmap, (struct uio *)cr->cr_crp->crp_buf, cesa_create_chain_cb2, &cci, BUS_DMA_NOWAIT); else if (cr->cr_crp->crp_flags & CRYPTO_F_IMBUF) error = bus_dmamap_load_mbuf(sc->sc_data_dtag, cr->cr_dmap, (struct mbuf *)cr->cr_crp->crp_buf, cesa_create_chain_cb2, &cci, BUS_DMA_NOWAIT); else error = bus_dmamap_load(sc->sc_data_dtag, cr->cr_dmap, cr->cr_crp->crp_buf, cr->cr_crp->crp_ilen, cesa_create_chain_cb, &cci, BUS_DMA_NOWAIT); if (!error) cr->cr_dmap_loaded = 1; if (cci.cci_error) error = cci.cci_error; if (error) return (error); /* Read back request metadata */ ctd = cesa_tdma_copyout_sa_data(sc, cr); if (!ctd) return (ENOMEM); cesa_append_tdesc(cr, ctd); return (0); } static void cesa_execute(struct cesa_softc *sc) { struct cesa_tdma_desc *prev_ctd, *ctd; struct cesa_request *prev_cr, *cr; CESA_LOCK(sc, requests); /* * If ready list is empty, there is nothing to execute. If queued list * is not empty, the hardware is busy and we cannot start another * execution. */ if (STAILQ_EMPTY(&sc->sc_ready_requests) || !STAILQ_EMPTY(&sc->sc_queued_requests)) { CESA_UNLOCK(sc, requests); return; } /* Move all ready requests to queued list */ STAILQ_CONCAT(&sc->sc_queued_requests, &sc->sc_ready_requests); STAILQ_INIT(&sc->sc_ready_requests); /* Create one execution chain from all requests on the list */ if (STAILQ_FIRST(&sc->sc_queued_requests) != STAILQ_LAST(&sc->sc_queued_requests, cesa_request, cr_stq)) { prev_cr = NULL; cesa_sync_dma_mem(&sc->sc_tdesc_cdm, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); STAILQ_FOREACH(cr, &sc->sc_queued_requests, cr_stq) { if (prev_cr) { ctd = STAILQ_FIRST(&cr->cr_tdesc); prev_ctd = STAILQ_LAST(&prev_cr->cr_tdesc, cesa_tdma_desc, ctd_stq); prev_ctd->ctd_cthd->cthd_next = ctd->ctd_cthd_paddr; } prev_cr = cr; } cesa_sync_dma_mem(&sc->sc_tdesc_cdm, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); } /* Start chain execution in hardware */ cr = STAILQ_FIRST(&sc->sc_queued_requests); ctd = STAILQ_FIRST(&cr->cr_tdesc); CESA_TDMA_WRITE(sc, CESA_TDMA_ND, ctd->ctd_cthd_paddr); if (sc->sc_soc_id == MV_DEV_88F6828 || sc->sc_soc_id == MV_DEV_88F6820 || sc->sc_soc_id == MV_DEV_88F6810) CESA_REG_WRITE(sc, CESA_SA_CMD, CESA_SA_CMD_ACTVATE | CESA_SA_CMD_SHA2); else CESA_REG_WRITE(sc, CESA_SA_CMD, CESA_SA_CMD_ACTVATE); CESA_UNLOCK(sc, requests); } static int cesa_setup_sram(struct cesa_softc *sc) { phandle_t sram_node; ihandle_t sram_ihandle; pcell_t sram_handle, sram_reg[2]; void *sram_va; int rv; rv = OF_getencprop(ofw_bus_get_node(sc->sc_dev), "sram-handle", (void *)&sram_handle, sizeof(sram_handle)); if (rv <= 0) return (rv); sram_ihandle = (ihandle_t)sram_handle; sram_node = OF_instance_to_package(sram_ihandle); rv = OF_getencprop(sram_node, "reg", (void *)sram_reg, sizeof(sram_reg)); if (rv <= 0) return (rv); sc->sc_sram_base_pa = sram_reg[0]; /* Store SRAM size to be able to unmap in detach() */ sc->sc_sram_size = sram_reg[1]; if (sc->sc_soc_id != MV_DEV_88F6828 && sc->sc_soc_id != MV_DEV_88F6820 && sc->sc_soc_id != MV_DEV_88F6810) return (0); /* SRAM memory was not mapped in platform_sram_devmap(), map it now */ sram_va = pmap_mapdev(sc->sc_sram_base_pa, sc->sc_sram_size); if (sram_va == NULL) return (ENOMEM); sc->sc_sram_base_va = (vm_offset_t)sram_va; return (0); } +/* + * Function: 
device_from_node + * This function returns the device_t that corresponds to the given + * phandle_t. + * Parameters: + * root - device at which the search starts; + * if NULL is provided, the function takes + * the "root0" device as the root. + * node - the node that each device_t is + * checked against. + */ +static device_t +device_from_node(device_t root, phandle_t node) +{ + device_t *children, retval; + int nkid, i; + + /* Nothing can match a nonexistent node */ + if (node == -1) + return (NULL); + + if (root == NULL) + /* Get root of device tree */ + if ((root = device_lookup_by_name("root0")) == NULL) + return (NULL); + + if (device_get_children(root, &children, &nkid) != 0) + return (NULL); + + retval = NULL; + for (i = 0; i < nkid; i++) { + /* Check if device and node match */ + if (OFW_BUS_GET_NODE(root, children[i]) == node) { + retval = children[i]; + break; + } + /* or go deeper */ + if ((retval = device_from_node(children[i], node)) != NULL) + break; + } + free(children, M_TEMP); + + return (retval); +} + static int +cesa_setup_sram_armada(struct cesa_softc *sc) +{ + phandle_t sram_node; + ihandle_t sram_ihandle; + pcell_t sram_handle[2]; + void *sram_va; + int rv, j; + struct resource_list rl; + struct resource_list_entry *rle; + struct simplebus_softc *ssc; + device_t sdev; + + /* Get refs to SRAMs from CESA node */ + rv = OF_getencprop(ofw_bus_get_node(sc->sc_dev), "marvell,crypto-srams", + (void *)sram_handle, sizeof(sram_handle)); + if (rv <= 0) + return (rv); + + if (sc->sc_cesa_engine_id >= 2) + return (ENXIO); + + /* Get SRAM node based on sc_cesa_engine_id */ + sram_ihandle = (ihandle_t)sram_handle[sc->sc_cesa_engine_id]; + sram_node = OF_instance_to_package(sram_ihandle); + + /* Get device_t of simplebus (sram_node parent) */ + sdev = device_from_node(NULL, OF_parent(sram_node)); + if (!sdev) + return (ENXIO); + + ssc = device_get_softc(sdev); + + resource_list_init(&rl); + /* Parse reg property into resource list */ + ofw_bus_reg_to_rl(sdev, sram_node, ssc->acells, + ssc->scells, &rl); + + /* We expect only one resource */ + rle = resource_list_find(&rl, SYS_RES_MEMORY, 0); + if (rle == NULL) + return (ENXIO); + + /* Remap through ranges property */ + for (j = 0; j < ssc->nranges; j++) { + if (rle->start >= ssc->ranges[j].bus && + rle->end < ssc->ranges[j].bus + ssc->ranges[j].size) { + rle->start -= ssc->ranges[j].bus; + rle->start += ssc->ranges[j].host; + rle->end -= ssc->ranges[j].bus; + rle->end += ssc->ranges[j].host; + } + } + + sc->sc_sram_base_pa = rle->start; + sc->sc_sram_size = rle->count; + + /* SRAM memory was not mapped in platform_sram_devmap(), map it now */ + sram_va = pmap_mapdev(sc->sc_sram_base_pa, sc->sc_sram_size); + if (sram_va == NULL) + return (ENOMEM); + sc->sc_sram_base_va = (vm_offset_t)sram_va; + + return (0); +} + +struct ofw_compat_data cesa_devices[] = { + { "mrvl,cesa", (uintptr_t)true }, + { "marvell,armada-38x-crypto", (uintptr_t)true }, + { NULL, 0 } +}; + +static int cesa_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); - if (!ofw_bus_is_compatible(dev, "mrvl,cesa")) + if (!ofw_bus_search_compatible(dev, cesa_devices)->ocd_data) return (ENXIO); device_set_desc(dev, "Marvell Cryptographic Engine and Security " "Accelerator"); return (BUS_PROBE_DEFAULT); } static int cesa_attach(device_t dev) { + static int engine_idx = 0; + struct simplebus_devinfo *ndi; + struct resource_list *rl; struct cesa_softc *sc; + + if (!ofw_bus_is_compatible(dev, "marvell,armada-38x-crypto")) + return (cesa_attach_late(dev)); + + /* + * Get
simplebus_devinfo, which contains + * a resource list filled with the addresses and + * interrupts read from the FDT. + * Let's correct it by splitting the resources + * for each engine. + */ + if ((ndi = device_get_ivars(dev)) == NULL) + return (ENXIO); + + rl = &ndi->rl; + + switch (engine_idx) { + case 0: + /* Update register values */ + resource_list_add(rl, SYS_RES_MEMORY, 0, CESA0_TDMA_ADDR, + CESA0_TDMA_ADDR + CESA_TDMA_SIZE - 1, CESA_TDMA_SIZE); + resource_list_add(rl, SYS_RES_MEMORY, 1, CESA0_CESA_ADDR, + CESA0_CESA_ADDR + CESA_CESA_SIZE - 1, CESA_CESA_SIZE); + + /* Remove unused interrupt */ + resource_list_delete(rl, SYS_RES_IRQ, 1); + break; + + case 1: + /* Update register values */ + resource_list_add(rl, SYS_RES_MEMORY, 0, CESA1_TDMA_ADDR, + CESA1_TDMA_ADDR + CESA_TDMA_SIZE - 1, CESA_TDMA_SIZE); + resource_list_add(rl, SYS_RES_MEMORY, 1, CESA1_CESA_ADDR, + CESA1_CESA_ADDR + CESA_CESA_SIZE - 1, CESA_CESA_SIZE); + + /* Remove unused interrupt */ + resource_list_delete(rl, SYS_RES_IRQ, 0); + resource_list_find(rl, SYS_RES_IRQ, 1)->rid = 0; + break; + + default: + device_printf(dev, "Bad cesa engine_idx\n"); + return (ENXIO); + } + + sc = device_get_softc(dev); + sc->sc_cesa_engine_id = engine_idx; + + /* + * Call simplebus_add_device only once. + * It will create a second cesa driver instance + * with the same FDT node as the first one. + * When the second instance reaches this function, + * it will be configured to use the second cesa engine. + */ + if (engine_idx == 0) + simplebus_add_device(device_get_parent(dev), ofw_bus_get_node(dev), + 0, "cesa", 1, NULL); + + engine_idx++; + + return (cesa_attach_late(dev)); +} + +static int +cesa_attach_late(device_t dev) +{ + struct cesa_softc *sc; uint32_t d, r, val; int error; int i; sc = device_get_softc(dev); sc->sc_blocked = 0; sc->sc_error = 0; sc->sc_dev = dev; soc_id(&d, &r); switch (d) { case MV_DEV_88F6281: case MV_DEV_88F6282: /* Check if CESA peripheral device has power turned on */ if (soc_power_ctrl_get(CPU_PM_CTRL_CRYPTO) == CPU_PM_CTRL_CRYPTO) { device_printf(dev, "not powered on\n"); return (ENXIO); } sc->sc_tperr = 0; break; case MV_DEV_88F6828: case MV_DEV_88F6820: case MV_DEV_88F6810: sc->sc_tperr = 0; break; case MV_DEV_MV78100: case MV_DEV_MV78100_Z0: /* Check if CESA peripheral device has power turned on */ if (soc_power_ctrl_get(CPU_PM_CTRL_CRYPTO) != CPU_PM_CTRL_CRYPTO) { device_printf(dev, "not powered on\n"); return (ENXIO); } sc->sc_tperr = CESA_ICR_TPERR; break; default: return (ENXIO); } sc->sc_soc_id = d; /* Initialize mutexes */ mtx_init(&sc->sc_sc_lock, device_get_nameunit(dev), "CESA Shared Data", MTX_DEF); mtx_init(&sc->sc_tdesc_lock, device_get_nameunit(dev), "CESA TDMA Descriptors Pool", MTX_DEF); mtx_init(&sc->sc_sdesc_lock, device_get_nameunit(dev), "CESA SA Descriptors Pool", MTX_DEF); mtx_init(&sc->sc_requests_lock, device_get_nameunit(dev), "CESA Requests Pool", MTX_DEF); mtx_init(&sc->sc_sessions_lock, device_get_nameunit(dev), "CESA Sessions Pool", MTX_DEF); /* Allocate I/O and IRQ resources */ error = bus_alloc_resources(dev, cesa_res_spec, sc->sc_res); if (error) { device_printf(dev, "could not allocate resources\n"); goto err0; } /* Acquire SRAM base address */ - error = cesa_setup_sram(sc); + if (!ofw_bus_is_compatible(dev, "marvell,armada-38x-crypto")) + error = cesa_setup_sram(sc); + else + error = cesa_setup_sram_armada(sc); + if (error) { device_printf(dev, "could not setup SRAM\n"); goto err1; } /* Setup interrupt handler */ error = bus_setup_intr(dev, sc->sc_res[RES_CESA_IRQ], INTR_TYPE_NET | INTR_MPSAFE, NULL,
cesa_intr, sc, &(sc->sc_icookie)); if (error) { device_printf(dev, "could not setup engine completion irq\n"); goto err2; } /* Create DMA tag for processed data */ error = bus_dma_tag_create(bus_get_dma_tag(dev), /* parent */ 1, 0, /* alignment, boundary */ BUS_SPACE_MAXADDR_32BIT, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filtfunc, filtfuncarg */ CESA_MAX_REQUEST_SIZE, /* maxsize */ CESA_MAX_FRAGMENTS, /* nsegments */ CESA_MAX_REQUEST_SIZE, 0, /* maxsegsz, flags */ NULL, NULL, /* lockfunc, lockfuncarg */ &sc->sc_data_dtag); /* dmat */ if (error) goto err3; /* Initialize data structures: TDMA Descriptors Pool */ error = cesa_alloc_dma_mem(sc, &sc->sc_tdesc_cdm, CESA_TDMA_DESCRIPTORS * sizeof(struct cesa_tdma_hdesc)); if (error) goto err4; STAILQ_INIT(&sc->sc_free_tdesc); for (i = 0; i < CESA_TDMA_DESCRIPTORS; i++) { sc->sc_tdesc[i].ctd_cthd = (struct cesa_tdma_hdesc *)(sc->sc_tdesc_cdm.cdm_vaddr) + i; sc->sc_tdesc[i].ctd_cthd_paddr = sc->sc_tdesc_cdm.cdm_paddr + (i * sizeof(struct cesa_tdma_hdesc)); STAILQ_INSERT_TAIL(&sc->sc_free_tdesc, &sc->sc_tdesc[i], ctd_stq); } /* Initialize data structures: SA Descriptors Pool */ error = cesa_alloc_dma_mem(sc, &sc->sc_sdesc_cdm, CESA_SA_DESCRIPTORS * sizeof(struct cesa_sa_hdesc)); if (error) goto err5; STAILQ_INIT(&sc->sc_free_sdesc); for (i = 0; i < CESA_SA_DESCRIPTORS; i++) { sc->sc_sdesc[i].csd_cshd = (struct cesa_sa_hdesc *)(sc->sc_sdesc_cdm.cdm_vaddr) + i; sc->sc_sdesc[i].csd_cshd_paddr = sc->sc_sdesc_cdm.cdm_paddr + (i * sizeof(struct cesa_sa_hdesc)); STAILQ_INSERT_TAIL(&sc->sc_free_sdesc, &sc->sc_sdesc[i], csd_stq); } /* Initialize data structures: Requests Pool */ error = cesa_alloc_dma_mem(sc, &sc->sc_requests_cdm, CESA_REQUESTS * sizeof(struct cesa_sa_data)); if (error) goto err6; STAILQ_INIT(&sc->sc_free_requests); STAILQ_INIT(&sc->sc_ready_requests); STAILQ_INIT(&sc->sc_queued_requests); for (i = 0; i < CESA_REQUESTS; i++) { sc->sc_requests[i].cr_csd = (struct cesa_sa_data *)(sc->sc_requests_cdm.cdm_vaddr) + i; sc->sc_requests[i].cr_csd_paddr = sc->sc_requests_cdm.cdm_paddr + (i * sizeof(struct cesa_sa_data)); /* Preallocate DMA maps */ error = bus_dmamap_create(sc->sc_data_dtag, 0, &sc->sc_requests[i].cr_dmap); if (error && i > 0) { i--; do { bus_dmamap_destroy(sc->sc_data_dtag, sc->sc_requests[i].cr_dmap); } while (i--); goto err7; } STAILQ_INSERT_TAIL(&sc->sc_free_requests, &sc->sc_requests[i], cr_stq); } /* Initialize data structures: Sessions Pool */ STAILQ_INIT(&sc->sc_free_sessions); for (i = 0; i < CESA_SESSIONS; i++) { sc->sc_sessions[i].cs_sid = i; STAILQ_INSERT_TAIL(&sc->sc_free_sessions, &sc->sc_sessions[i], cs_stq); } /* * Initialize TDMA: * - Burst limit: 128 bytes, * - Outstanding reads enabled, * - No byte-swap. */ val = CESA_TDMA_CR_DBL128 | CESA_TDMA_CR_SBL128 | CESA_TDMA_CR_ORDEN | CESA_TDMA_CR_NBS | CESA_TDMA_CR_ENABLE; if (sc->sc_soc_id == MV_DEV_88F6828 || sc->sc_soc_id == MV_DEV_88F6820 || sc->sc_soc_id == MV_DEV_88F6810) val |= CESA_TDMA_NUM_OUTSTAND; CESA_TDMA_WRITE(sc, CESA_TDMA_CR, val); /* * Initialize SA: * - SA descriptor is present at beginning of CESA SRAM, * - Multi-packet chain mode, * - Cooperation with TDMA enabled. 
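 * (This is why CESA_SA_DPR is written with 0 below: the SA descriptor
 * pointer is an offset into CESA SRAM, and each request's SA descriptor
 * is copied by the TDMA chain to the very beginning of that SRAM, as
 * shown in the memory map at the top of this file.)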
*/ CESA_REG_WRITE(sc, CESA_SA_DPR, 0); CESA_REG_WRITE(sc, CESA_SA_CR, CESA_SA_CR_ACTIVATE_TDMA | CESA_SA_CR_WAIT_FOR_TDMA | CESA_SA_CR_MULTI_MODE); /* Unmask interrupts */ CESA_REG_WRITE(sc, CESA_ICR, 0); CESA_REG_WRITE(sc, CESA_ICM, CESA_ICM_ACCTDMA | sc->sc_tperr); CESA_TDMA_WRITE(sc, CESA_TDMA_ECR, 0); CESA_TDMA_WRITE(sc, CESA_TDMA_EMR, CESA_TDMA_EMR_MISS | CESA_TDMA_EMR_DOUBLE_HIT | CESA_TDMA_EMR_BOTH_HIT | CESA_TDMA_EMR_DATA_ERROR); /* Register in OCF */ sc->sc_cid = crypto_get_driverid(dev, CRYPTOCAP_F_HARDWARE); if (sc->sc_cid < 0) { device_printf(dev, "could not get crypto driver id\n"); goto err8; } crypto_register(sc->sc_cid, CRYPTO_AES_CBC, 0, 0); crypto_register(sc->sc_cid, CRYPTO_DES_CBC, 0, 0); crypto_register(sc->sc_cid, CRYPTO_3DES_CBC, 0, 0); crypto_register(sc->sc_cid, CRYPTO_MD5, 0, 0); crypto_register(sc->sc_cid, CRYPTO_MD5_HMAC, 0, 0); crypto_register(sc->sc_cid, CRYPTO_SHA1, 0, 0); crypto_register(sc->sc_cid, CRYPTO_SHA1_HMAC, 0, 0); if (sc->sc_soc_id == MV_DEV_88F6828 || sc->sc_soc_id == MV_DEV_88F6820 || sc->sc_soc_id == MV_DEV_88F6810) crypto_register(sc->sc_cid, CRYPTO_SHA2_256_HMAC, 0, 0); return (0); err8: for (i = 0; i < CESA_REQUESTS; i++) bus_dmamap_destroy(sc->sc_data_dtag, sc->sc_requests[i].cr_dmap); err7: cesa_free_dma_mem(&sc->sc_requests_cdm); err6: cesa_free_dma_mem(&sc->sc_sdesc_cdm); err5: cesa_free_dma_mem(&sc->sc_tdesc_cdm); err4: bus_dma_tag_destroy(sc->sc_data_dtag); err3: bus_teardown_intr(dev, sc->sc_res[RES_CESA_IRQ], sc->sc_icookie); err2: if (sc->sc_soc_id == MV_DEV_88F6828 || sc->sc_soc_id == MV_DEV_88F6820 || sc->sc_soc_id == MV_DEV_88F6810) pmap_unmapdev(sc->sc_sram_base_va, sc->sc_sram_size); err1: bus_release_resources(dev, cesa_res_spec, sc->sc_res); err0: mtx_destroy(&sc->sc_sessions_lock); mtx_destroy(&sc->sc_requests_lock); mtx_destroy(&sc->sc_sdesc_lock); mtx_destroy(&sc->sc_tdesc_lock); mtx_destroy(&sc->sc_sc_lock); return (ENXIO); } static int cesa_detach(device_t dev) { struct cesa_softc *sc; int i; sc = device_get_softc(dev); /* TODO: Wait for queued requests completion before shutdown. 
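 * One possible (untested) sketch, not implemented here: under
 * CESA_LOCK(sc, requests), poll or sleep until sc_queued_requests and
 * sc_ready_requests drain before masking interrupts below; the current
 * code tears the engine down immediately.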
*/ /* Mask interrupts */ CESA_REG_WRITE(sc, CESA_ICM, 0); CESA_TDMA_WRITE(sc, CESA_TDMA_EMR, 0); /* Unregister from OCF */ crypto_unregister_all(sc->sc_cid); /* Free DMA Maps */ for (i = 0; i < CESA_REQUESTS; i++) bus_dmamap_destroy(sc->sc_data_dtag, sc->sc_requests[i].cr_dmap); /* Free DMA Memory */ cesa_free_dma_mem(&sc->sc_requests_cdm); cesa_free_dma_mem(&sc->sc_sdesc_cdm); cesa_free_dma_mem(&sc->sc_tdesc_cdm); /* Free DMA Tag */ bus_dma_tag_destroy(sc->sc_data_dtag); /* Stop interrupt */ bus_teardown_intr(dev, sc->sc_res[RES_CESA_IRQ], sc->sc_icookie); /* Release I/O and IRQ resources */ bus_release_resources(dev, cesa_res_spec, sc->sc_res); /* Unmap SRAM memory */ if (sc->sc_soc_id == MV_DEV_88F6828 || sc->sc_soc_id == MV_DEV_88F6820 || sc->sc_soc_id == MV_DEV_88F6810) pmap_unmapdev(sc->sc_sram_base_va, sc->sc_sram_size); /* Destroy mutexes */ mtx_destroy(&sc->sc_sessions_lock); mtx_destroy(&sc->sc_requests_lock); mtx_destroy(&sc->sc_sdesc_lock); mtx_destroy(&sc->sc_tdesc_lock); mtx_destroy(&sc->sc_sc_lock); return (0); } static void cesa_intr(void *arg) { STAILQ_HEAD(, cesa_request) requests; struct cesa_request *cr, *tmp; struct cesa_softc *sc; uint32_t ecr, icr; int blocked; sc = arg; /* Ack interrupt */ ecr = CESA_TDMA_READ(sc, CESA_TDMA_ECR); CESA_TDMA_WRITE(sc, CESA_TDMA_ECR, 0); icr = CESA_REG_READ(sc, CESA_ICR); CESA_REG_WRITE(sc, CESA_ICR, 0); /* Check for TDMA errors */ if (ecr & CESA_TDMA_ECR_MISS) { device_printf(sc->sc_dev, "TDMA Miss error detected!\n"); sc->sc_error = EIO; } if (ecr & CESA_TDMA_ECR_DOUBLE_HIT) { device_printf(sc->sc_dev, "TDMA Double Hit error detected!\n"); sc->sc_error = EIO; } if (ecr & CESA_TDMA_ECR_BOTH_HIT) { device_printf(sc->sc_dev, "TDMA Both Hit error detected!\n"); sc->sc_error = EIO; } if (ecr & CESA_TDMA_ECR_DATA_ERROR) { device_printf(sc->sc_dev, "TDMA Data error detected!\n"); sc->sc_error = EIO; } /* Check for CESA errors */ if (icr & sc->sc_tperr) { device_printf(sc->sc_dev, "CESA SRAM Parity error detected!\n"); sc->sc_error = EIO; } /* If there is nothing more to do, return */ if ((icr & CESA_ICR_ACCTDMA) == 0) return; /* Get all finished requests */ CESA_LOCK(sc, requests); STAILQ_INIT(&requests); STAILQ_CONCAT(&requests, &sc->sc_queued_requests); STAILQ_INIT(&sc->sc_queued_requests); CESA_UNLOCK(sc, requests); /* Execute all ready requests */ cesa_execute(sc); /* Process completed requests */ cesa_sync_dma_mem(&sc->sc_requests_cdm, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); STAILQ_FOREACH_SAFE(cr, &requests, cr_stq, tmp) { bus_dmamap_sync(sc->sc_data_dtag, cr->cr_dmap, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); cr->cr_crp->crp_etype = sc->sc_error; if (cr->cr_mac) crypto_copyback(cr->cr_crp->crp_flags, cr->cr_crp->crp_buf, cr->cr_mac->crd_inject, cr->cr_cs->cs_hlen, cr->cr_csd->csd_hash); crypto_done(cr->cr_crp); cesa_free_request(sc, cr); } cesa_sync_dma_mem(&sc->sc_requests_cdm, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); sc->sc_error = 0; /* Unblock driver if it ran out of resources */ CESA_LOCK(sc, sc); blocked = sc->sc_blocked; sc->sc_blocked = 0; CESA_UNLOCK(sc, sc); if (blocked) crypto_unblock(sc->sc_cid, blocked); } static int cesa_newsession(device_t dev, uint32_t *sidp, struct cryptoini *cri) { struct cesa_session *cs; struct cesa_softc *sc; struct cryptoini *enc; struct cryptoini *mac; int error; sc = device_get_softc(dev); enc = NULL; mac = NULL; error = 0; /* Check and parse input */ if (cesa_is_hash(cri->cri_alg)) mac = cri; else enc = cri; cri = cri->cri_next; if (cri) { if (!enc &&
!cesa_is_hash(cri->cri_alg)) enc = cri; if (!mac && cesa_is_hash(cri->cri_alg)) mac = cri; if (cri->cri_next || !(enc && mac)) return (EINVAL); } if ((enc && (enc->cri_klen / 8) > CESA_MAX_KEY_LEN) || (mac && (mac->cri_klen / 8) > CESA_MAX_MKEY_LEN)) return (E2BIG); /* Allocate session */ cs = cesa_alloc_session(sc); if (!cs) return (ENOMEM); /* Prepare CESA configuration */ cs->cs_config = 0; cs->cs_ivlen = 1; cs->cs_mblen = 1; if (enc) { switch (enc->cri_alg) { case CRYPTO_AES_CBC: cs->cs_config |= CESA_CSHD_AES | CESA_CSHD_CBC; cs->cs_ivlen = AES_BLOCK_LEN; break; case CRYPTO_DES_CBC: cs->cs_config |= CESA_CSHD_DES | CESA_CSHD_CBC; cs->cs_ivlen = DES_BLOCK_LEN; break; case CRYPTO_3DES_CBC: cs->cs_config |= CESA_CSHD_3DES | CESA_CSHD_3DES_EDE | CESA_CSHD_CBC; cs->cs_ivlen = DES3_BLOCK_LEN; break; default: error = EINVAL; break; } } if (!error && mac) { switch (mac->cri_alg) { case CRYPTO_MD5: cs->cs_mblen = 1; cs->cs_hlen = (mac->cri_mlen == 0) ? MD5_HASH_LEN : mac->cri_mlen; cs->cs_config |= CESA_CSHD_MD5; break; case CRYPTO_MD5_HMAC: cs->cs_mblen = MD5_HMAC_BLOCK_LEN; cs->cs_hlen = (mac->cri_mlen == 0) ? MD5_HASH_LEN : mac->cri_mlen; cs->cs_config |= CESA_CSHD_MD5_HMAC; if (cs->cs_hlen == CESA_HMAC_TRUNC_LEN) cs->cs_config |= CESA_CSHD_96_BIT_HMAC; break; case CRYPTO_SHA1: cs->cs_mblen = 1; cs->cs_hlen = (mac->cri_mlen == 0) ? SHA1_HASH_LEN : mac->cri_mlen; cs->cs_config |= CESA_CSHD_SHA1; break; case CRYPTO_SHA1_HMAC: cs->cs_mblen = SHA1_HMAC_BLOCK_LEN; cs->cs_hlen = (mac->cri_mlen == 0) ? SHA1_HASH_LEN : mac->cri_mlen; cs->cs_config |= CESA_CSHD_SHA1_HMAC; if (cs->cs_hlen == CESA_HMAC_TRUNC_LEN) cs->cs_config |= CESA_CSHD_96_BIT_HMAC; break; case CRYPTO_SHA2_256_HMAC: cs->cs_mblen = SHA2_256_HMAC_BLOCK_LEN; cs->cs_hlen = (mac->cri_mlen == 0) ? SHA2_256_HASH_LEN : mac->cri_mlen; cs->cs_config |= CESA_CSHD_SHA2_256_HMAC; break; default: error = EINVAL; break; } } /* Save cipher key */ if (!error && enc && enc->cri_key) { cs->cs_klen = enc->cri_klen / 8; memcpy(cs->cs_key, enc->cri_key, cs->cs_klen); if (enc->cri_alg == CRYPTO_AES_CBC) error = cesa_prep_aes_key(cs); } /* Save digest key */ if (!error && mac && mac->cri_key) error = cesa_set_mkey(cs, mac->cri_alg, mac->cri_key, mac->cri_klen / 8); if (error) { cesa_free_session(sc, cs); return (EINVAL); } *sidp = cs->cs_sid; return (0); } static int cesa_freesession(device_t dev, uint64_t tid) { struct cesa_session *cs; struct cesa_softc *sc; sc = device_get_softc(dev); cs = cesa_get_session(sc, CRYPTO_SESID2LID(tid)); if (!cs) return (EINVAL); /* Free session */ cesa_free_session(sc, cs); return (0); } static int cesa_process(device_t dev, struct cryptop *crp, int hint) { struct cesa_request *cr; struct cesa_session *cs; struct cryptodesc *crd; struct cryptodesc *enc; struct cryptodesc *mac; struct cesa_softc *sc; int error; sc = device_get_softc(dev); crd = crp->crp_desc; enc = NULL; mac = NULL; error = 0; /* Check session ID */ cs = cesa_get_session(sc, CRYPTO_SESID2LID(crp->crp_sid)); if (!cs) { crp->crp_etype = EINVAL; crypto_done(crp); return (0); } /* Check and parse input */ if (crp->crp_ilen > CESA_MAX_REQUEST_SIZE) { crp->crp_etype = E2BIG; crypto_done(crp); return (0); } if (cesa_is_hash(crd->crd_alg)) mac = crd; else enc = crd; crd = crd->crd_next; if (crd) { if (!enc && !cesa_is_hash(crd->crd_alg)) enc = crd; if (!mac && cesa_is_hash(crd->crd_alg)) mac = crd; if (crd->crd_next || !(enc && mac)) { crp->crp_etype = EINVAL; crypto_done(crp); return (0); } } /* * Get request descriptor. 
Block the driver if there are no free * descriptors in the pool. */ cr = cesa_alloc_request(sc); if (!cr) { CESA_LOCK(sc, sc); sc->sc_blocked = CRYPTO_SYMQ; CESA_UNLOCK(sc, sc); return (ERESTART); } /* Prepare request */ cr->cr_crp = crp; cr->cr_enc = enc; cr->cr_mac = mac; cr->cr_cs = cs; CESA_LOCK(sc, sessions); cesa_sync_desc(sc, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); if (enc && enc->crd_flags & CRD_F_ENCRYPT) { if (enc->crd_flags & CRD_F_IV_EXPLICIT) memcpy(cr->cr_csd->csd_iv, enc->crd_iv, cs->cs_ivlen); else arc4rand(cr->cr_csd->csd_iv, cs->cs_ivlen, 0); if ((enc->crd_flags & CRD_F_IV_PRESENT) == 0) crypto_copyback(crp->crp_flags, crp->crp_buf, enc->crd_inject, cs->cs_ivlen, cr->cr_csd->csd_iv); } else if (enc) { if (enc->crd_flags & CRD_F_IV_EXPLICIT) memcpy(cr->cr_csd->csd_iv, enc->crd_iv, cs->cs_ivlen); else crypto_copydata(crp->crp_flags, crp->crp_buf, enc->crd_inject, cs->cs_ivlen, cr->cr_csd->csd_iv); } if (enc && enc->crd_flags & CRD_F_KEY_EXPLICIT) { if ((enc->crd_klen / 8) <= CESA_MAX_KEY_LEN) { cs->cs_klen = enc->crd_klen / 8; memcpy(cs->cs_key, enc->crd_key, cs->cs_klen); if (enc->crd_alg == CRYPTO_AES_CBC) error = cesa_prep_aes_key(cs); } else error = E2BIG; } if (!error && mac && mac->crd_flags & CRD_F_KEY_EXPLICIT) { if ((mac->crd_klen / 8) <= CESA_MAX_MKEY_LEN) error = cesa_set_mkey(cs, mac->crd_alg, mac->crd_key, mac->crd_klen / 8); else error = E2BIG; } /* Convert request to chain of TDMA and SA descriptors */ if (!error) error = cesa_create_chain(sc, cr); cesa_sync_desc(sc, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); CESA_UNLOCK(sc, sessions); if (error) { cesa_free_request(sc, cr); crp->crp_etype = error; crypto_done(crp); return (0); } bus_dmamap_sync(sc->sc_data_dtag, cr->cr_dmap, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* Enqueue request for execution */ cesa_enqueue_request(sc, cr); /* Start execution if there are no more requests in the queue */ if ((hint & CRYPTO_HINT_MORE) == 0) cesa_execute(sc); return (0); } Index: user/markj/netdump/sys/dev/cesa/cesa.h =================================================================== --- user/markj/netdump/sys/dev/cesa/cesa.h (revision 332407) +++ user/markj/netdump/sys/dev/cesa/cesa.h (revision 332408) @@ -1,370 +1,377 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (C) 2009-2011 Semihalf. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _DEV_CESA_H_ #define _DEV_CESA_H_ /* Maximum number of allocated sessions */ #define CESA_SESSIONS 64 /* Maximum number of queued requests */ #define CESA_REQUESTS 256 /* * CESA is able to process data only in CESA SRAM, which is quite small (2 kB). * We have to fit a packet there, which contains SA descriptor, keys, IV * and data to be processed. Every request must be converted into a chain of * packets and each packet can hold about 1.75 kB of data. * * To process each packet we need at least 1 SA descriptor and at least 4 TDMA * descriptors. However there are cases when we use 2 SA and 8 TDMA descriptors * per packet. Number of used TDMA descriptors can increase beyond given values * if data in the request is fragmented in physical memory. * * The driver uses preallocated SA and TDMA descriptors pools to get the best * performance. Size of these pools should match expected request size. Example: * * Expected average request size: 1.5 kB (Ethernet MTU) * Packets per average request: (1.5 kB / 1.75 kB) = 1 * SA descriptors per average request (worst case): 1 * 2 = 2 * TDMA descriptors per average request (worst case): 1 * 8 = 8 * * More TDMA descriptors should be allocated if data fragmentation is expected * (for example while processing mbufs larger than MCLBYTES). The driver may use * 2 additional TDMA descriptors for each discontinuity in the physical data * layout. */ /* Values below are optimized for requests containing about 1.5 kB of data */ #define CESA_SA_DESC_PER_REQ 2 #define CESA_TDMA_DESC_PER_REQ 8 #define CESA_SA_DESCRIPTORS (CESA_SA_DESC_PER_REQ * CESA_REQUESTS) #define CESA_TDMA_DESCRIPTORS (CESA_TDMA_DESC_PER_REQ * CESA_REQUESTS) /* Useful constants */ #define CESA_HMAC_TRUNC_LEN 12 #define CESA_MAX_FRAGMENTS 64 #define CESA_SRAM_SIZE 2048 /* * CESA_MAX_HASH_LEN is the maximum length of a hash generated by CESA. * As CESA supports MD5, SHA1 and SHA-256 this equals 32 bytes.
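 * (SHA-256 yields the largest digest of the three: 256 bits, i.e. 32 bytes.)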
*/ #define CESA_MAX_HASH_LEN 32 #define CESA_MAX_KEY_LEN 32 #define CESA_MAX_IV_LEN 16 #define CESA_MAX_HMAC_BLOCK_LEN 64 #define CESA_MAX_MKEY_LEN CESA_MAX_HMAC_BLOCK_LEN #define CESA_MAX_PACKET_SIZE (CESA_SRAM_SIZE - CESA_DATA(0)) #define CESA_MAX_REQUEST_SIZE 65535 /* Locking macros */ #define CESA_LOCK(sc, what) mtx_lock(&(sc)->sc_ ## what ## _lock) #define CESA_UNLOCK(sc, what) mtx_unlock(&(sc)->sc_ ## what ## _lock) #define CESA_LOCK_ASSERT(sc, what) \ mtx_assert(&(sc)->sc_ ## what ## _lock, MA_OWNED) /* Registers read/write macros */ #define CESA_REG_READ(sc, reg) \ bus_read_4((sc)->sc_res[RES_CESA_REGS], (reg)) #define CESA_REG_WRITE(sc, reg, val) \ bus_write_4((sc)->sc_res[RES_CESA_REGS], (reg), (val)) #define CESA_TDMA_READ(sc, reg) \ bus_read_4((sc)->sc_res[RES_TDMA_REGS], (reg)) #define CESA_TDMA_WRITE(sc, reg, val) \ bus_write_4((sc)->sc_res[RES_TDMA_REGS], (reg), (val)) /* Generic allocator for objects */ #define CESA_GENERIC_ALLOC_LOCKED(sc, obj, pool) do { \ CESA_LOCK(sc, pool); \ \ if (STAILQ_EMPTY(&(sc)->sc_free_ ## pool)) \ obj = NULL; \ else { \ obj = STAILQ_FIRST(&(sc)->sc_free_ ## pool); \ STAILQ_REMOVE_HEAD(&(sc)->sc_free_ ## pool, \ obj ## _stq); \ } \ \ CESA_UNLOCK(sc, pool); \ } while (0) #define CESA_GENERIC_FREE_LOCKED(sc, obj, pool) do { \ CESA_LOCK(sc, pool); \ STAILQ_INSERT_TAIL(&(sc)->sc_free_ ## pool, obj, \ obj ## _stq); \ CESA_UNLOCK(sc, pool); \ } while (0) /* CESA SRAM offset calculation macros */ #define CESA_SA_DATA(member) \ (sizeof(struct cesa_sa_hdesc) + offsetof(struct cesa_sa_data, member)) #define CESA_DATA(offset) \ (sizeof(struct cesa_sa_hdesc) + sizeof(struct cesa_sa_data) + offset) /* CESA memory and IRQ resources */ enum cesa_res_type { RES_TDMA_REGS, RES_CESA_REGS, RES_CESA_IRQ, RES_CESA_NUM }; struct cesa_tdma_hdesc { uint16_t cthd_byte_count; uint16_t cthd_flags; uint32_t cthd_src; uint32_t cthd_dst; uint32_t cthd_next; }; struct cesa_sa_hdesc { uint32_t cshd_config; uint16_t cshd_enc_src; uint16_t cshd_enc_dst; uint32_t cshd_enc_dlen; uint32_t cshd_enc_key; uint16_t cshd_enc_iv; uint16_t cshd_enc_iv_buf; uint16_t cshd_mac_src; uint16_t cshd_mac_total_dlen; uint16_t cshd_mac_dst; uint16_t cshd_mac_dlen; uint16_t cshd_mac_iv_in; uint16_t cshd_mac_iv_out; }; struct cesa_sa_data { uint8_t csd_key[CESA_MAX_KEY_LEN]; uint8_t csd_iv[CESA_MAX_IV_LEN]; uint8_t csd_hiv_in[CESA_MAX_HASH_LEN]; uint8_t csd_hiv_out[CESA_MAX_HASH_LEN]; uint8_t csd_hash[CESA_MAX_HASH_LEN]; }; struct cesa_dma_mem { void *cdm_vaddr; bus_addr_t cdm_paddr; bus_dma_tag_t cdm_tag; bus_dmamap_t cdm_map; }; struct cesa_tdma_desc { struct cesa_tdma_hdesc *ctd_cthd; bus_addr_t ctd_cthd_paddr; STAILQ_ENTRY(cesa_tdma_desc) ctd_stq; }; struct cesa_sa_desc { struct cesa_sa_hdesc *csd_cshd; bus_addr_t csd_cshd_paddr; STAILQ_ENTRY(cesa_sa_desc) csd_stq; }; struct cesa_session { uint32_t cs_sid; uint32_t cs_config; unsigned int cs_klen; unsigned int cs_ivlen; unsigned int cs_hlen; unsigned int cs_mblen; uint8_t cs_key[CESA_MAX_KEY_LEN]; uint8_t cs_aes_dkey[CESA_MAX_KEY_LEN]; uint8_t cs_hiv_in[CESA_MAX_HASH_LEN]; uint8_t cs_hiv_out[CESA_MAX_HASH_LEN]; STAILQ_ENTRY(cesa_session) cs_stq; }; struct cesa_request { struct cesa_sa_data *cr_csd; bus_addr_t cr_csd_paddr; struct cryptop *cr_crp; struct cryptodesc *cr_enc; struct cryptodesc *cr_mac; struct cesa_session *cr_cs; bus_dmamap_t cr_dmap; int cr_dmap_loaded; STAILQ_HEAD(, cesa_tdma_desc) cr_tdesc; STAILQ_HEAD(, cesa_sa_desc) cr_sdesc; STAILQ_ENTRY(cesa_request) cr_stq; }; struct cesa_packet { STAILQ_HEAD(, cesa_tdma_desc) cp_copyin; 
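	/* cp_copyin holds TDMA descriptors that move packet data into CESA
	 * SRAM; cp_copyout (below) holds those that move results back out.
	 * Both lists are filled by cesa_fill_packet(). */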
STAILQ_HEAD(, cesa_tdma_desc) cp_copyout; unsigned int cp_size; unsigned int cp_offset; }; struct cesa_softc { device_t sc_dev; int32_t sc_cid; uint32_t sc_soc_id; struct resource *sc_res[RES_CESA_NUM]; void *sc_icookie; bus_dma_tag_t sc_data_dtag; int sc_error; int sc_tperr; + uint8_t sc_cesa_engine_id; struct mtx sc_sc_lock; int sc_blocked; /* TDMA descriptors pool */ struct mtx sc_tdesc_lock; struct cesa_tdma_desc sc_tdesc[CESA_TDMA_DESCRIPTORS]; struct cesa_dma_mem sc_tdesc_cdm; STAILQ_HEAD(, cesa_tdma_desc) sc_free_tdesc; /* SA descriptors pool */ struct mtx sc_sdesc_lock; struct cesa_sa_desc sc_sdesc[CESA_SA_DESCRIPTORS]; struct cesa_dma_mem sc_sdesc_cdm; STAILQ_HEAD(, cesa_sa_desc) sc_free_sdesc; /* Requests pool */ struct mtx sc_requests_lock; struct cesa_request sc_requests[CESA_REQUESTS]; struct cesa_dma_mem sc_requests_cdm; STAILQ_HEAD(, cesa_request) sc_free_requests; STAILQ_HEAD(, cesa_request) sc_ready_requests; STAILQ_HEAD(, cesa_request) sc_queued_requests; /* Sessions pool */ struct mtx sc_sessions_lock; struct cesa_session sc_sessions[CESA_SESSIONS]; STAILQ_HEAD(, cesa_session) sc_free_sessions; /* CESA SRAM Address */ bus_addr_t sc_sram_base_pa; vm_offset_t sc_sram_base_va; bus_size_t sc_sram_size; }; struct cesa_chain_info { struct cesa_softc *cci_sc; struct cesa_request *cci_cr; struct cryptodesc *cci_enc; struct cryptodesc *cci_mac; uint32_t cci_config; int cci_error; }; /* CESA descriptors flags definitions */ #define CESA_CTHD_OWNED (1 << 15) #define CESA_CSHD_MAC (0 << 0) #define CESA_CSHD_ENC (1 << 0) #define CESA_CSHD_MAC_AND_ENC (2 << 0) #define CESA_CSHD_ENC_AND_MAC (3 << 0) #define CESA_CSHD_OP_MASK (3 << 0) #define CESA_CSHD_MD5 (4 << 4) #define CESA_CSHD_SHA1 (5 << 4) #define CESA_CSHD_SHA2_256 (1 << 4) #define CESA_CSHD_MD5_HMAC (6 << 4) #define CESA_CSHD_SHA1_HMAC (7 << 4) #define CESA_CSHD_SHA2_256_HMAC (3 << 4) #define CESA_CSHD_96_BIT_HMAC (1 << 7) #define CESA_CSHD_DES (1 << 8) #define CESA_CSHD_3DES (2 << 8) #define CESA_CSHD_AES (3 << 8) #define CESA_CSHD_DECRYPT (1 << 12) #define CESA_CSHD_CBC (1 << 16) #define CESA_CSHD_3DES_EDE (1 << 20) #define CESA_CSH_AES_KLEN_128 (0 << 24) #define CESA_CSH_AES_KLEN_192 (1 << 24) #define CESA_CSH_AES_KLEN_256 (2 << 24) #define CESA_CSH_AES_KLEN_MASK (3 << 24) #define CESA_CSHD_FRAG_FIRST (1 << 30) #define CESA_CSHD_FRAG_LAST (2U << 30) #define CESA_CSHD_FRAG_MIDDLE (3U << 30) /* CESA registers definitions */ #define CESA_ICR 0x0E20 #define CESA_ICR_ACCTDMA (1 << 7) #define CESA_ICR_TPERR (1 << 12) #define CESA_ICM 0x0E24 #define CESA_ICM_ACCTDMA CESA_ICR_ACCTDMA #define CESA_ICM_TPERR CESA_ICR_TPERR /* CESA TDMA registers definitions */ #define CESA_TDMA_ND 0x0830 #define CESA_TDMA_CR 0x0840 #define CESA_TDMA_CR_DBL128 (4 << 0) #define CESA_TDMA_CR_ORDEN (1 << 4) #define CESA_TDMA_CR_SBL128 (4 << 6) #define CESA_TDMA_CR_NBS (1 << 11) #define CESA_TDMA_CR_ENABLE (1 << 12) #define CESA_TDMA_CR_FETCHND (1 << 13) #define CESA_TDMA_CR_ACTIVE (1 << 14) #define CESA_TDMA_NUM_OUTSTAND (2 << 16) #define CESA_TDMA_ECR 0x08C8 #define CESA_TDMA_ECR_MISS (1 << 0) #define CESA_TDMA_ECR_DOUBLE_HIT (1 << 1) #define CESA_TDMA_ECR_BOTH_HIT (1 << 2) #define CESA_TDMA_ECR_DATA_ERROR (1 << 3) #define CESA_TDMA_EMR 0x08CC #define CESA_TDMA_EMR_MISS CESA_TDMA_ECR_MISS #define CESA_TDMA_EMR_DOUBLE_HIT CESA_TDMA_ECR_DOUBLE_HIT #define CESA_TDMA_EMR_BOTH_HIT CESA_TDMA_ECR_BOTH_HIT #define CESA_TDMA_EMR_DATA_ERROR CESA_TDMA_ECR_DATA_ERROR /* CESA SA registers definitions */ #define CESA_SA_CMD 0x0E00 #define CESA_SA_CMD_ACTVATE (1 << 0) 
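For context on the allocator macros defined earlier in this header: CESA_GENERIC_ALLOC_LOCKED() and CESA_GENERIC_FREE_LOCKED() paste the name of the 'obj' argument onto '_stq', so the caller's variable must be named after the STAILQ_ENTRY field prefix of the pool element (ctd for struct cesa_tdma_desc, csd for struct cesa_sa_desc). A minimal caller sketch, not part of this commit and with a hypothetical function name:

static struct cesa_tdma_desc *
cesa_tdesc_alloc_sketch(struct cesa_softc *sc)
{
	/* The variable must be named "ctd" so obj ## _stq expands to ctd_stq. */
	struct cesa_tdma_desc *ctd;

	/* Pops the head of sc->sc_free_tdesc under sc_tdesc_lock, or NULL. */
	CESA_GENERIC_ALLOC_LOCKED(sc, ctd, tdesc);
	if (ctd == NULL)
		device_printf(sc->sc_dev, "TDMA descriptor pool exhausted\n");
	return (ctd);
}

The matching CESA_GENERIC_FREE_LOCKED(sc, ctd, tdesc) re-inserts the descriptor at the tail of the free list under the same lock.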
#define CESA_SA_CMD_SHA2 (1 << 31) #define CESA_SA_DPR 0x0E04 #define CESA_SA_CR 0x0E08 #define CESA_SA_CR_WAIT_FOR_TDMA (1 << 7) #define CESA_SA_CR_ACTIVATE_TDMA (1 << 9) #define CESA_SA_CR_MULTI_MODE (1 << 11) #define CESA_SA_SR 0x0E0C #define CESA_SA_SR_ACTIVE (1 << 0) +#define CESA_TDMA_SIZE 0x1000 +#define CESA_CESA_SIZE 0x1000 +#define CESA0_TDMA_ADDR 0x90000 +#define CESA0_CESA_ADDR 0x9D000 +#define CESA1_TDMA_ADDR 0x92000 +#define CESA1_CESA_ADDR 0x9F000 #endif Index: user/markj/netdump/sys/dev/cpufreq/cpufreq_dt.c =================================================================== --- user/markj/netdump/sys/dev/cpufreq/cpufreq_dt.c (revision 332407) +++ user/markj/netdump/sys/dev/cpufreq/cpufreq_dt.c (revision 332408) @@ -1,360 +1,360 @@ /*- * Copyright (c) 2016 Jared McNeill * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
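The constants added above describe two fixed 4 kB register windows, one TDMA/CESA pair per engine, and pair up with the sc_cesa_engine_id field added to struct cesa_softc earlier in this diff; presumably this lets each attached instance address its own engine on SoCs that provide two. A minimal selection sketch under that assumption (helper name hypothetical, not part of this commit):

#include <sys/types.h>
#include <machine/bus.h>

static void
cesa_engine_window_sketch(uint8_t engine_id, bus_addr_t *tdma_addr,
    bus_addr_t *cesa_addr)
{
	/* Each engine owns one CESA_TDMA_SIZE/CESA_CESA_SIZE (4 kB) window. */
	if (engine_id == 0) {
		*tdma_addr = CESA0_TDMA_ADDR;
		*cesa_addr = CESA0_CESA_ADDR;
	} else {
		*tdma_addr = CESA1_TDMA_ADDR;
		*cesa_addr = CESA1_CESA_ADDR;
	}
}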
* * $FreeBSD$ */ /* * Generic DT based cpufreq driver */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include "cpufreq_if.h" struct cpufreq_dt_opp { uint32_t freq_khz; uint32_t voltage_uv; }; struct cpufreq_dt_softc { clk_t clk; regulator_t reg; struct cpufreq_dt_opp *opp; ssize_t nopp; int clk_latency; cpuset_t cpus; }; static void cpufreq_dt_notify(device_t dev, uint64_t freq) { #ifdef __aarch64__ struct cpufreq_dt_softc *sc; struct pcpu *pc; int cpu; sc = device_get_softc(dev); CPU_FOREACH(cpu) { if (CPU_ISSET(cpu, &sc->cpus)) { pc = pcpu_find(cpu); pc->pc_clock = freq; } } #endif } static const struct cpufreq_dt_opp * cpufreq_dt_find_opp(device_t dev, uint32_t freq_mhz) { struct cpufreq_dt_softc *sc; ssize_t n; sc = device_get_softc(dev); for (n = 0; n < sc->nopp; n++) if (CPUFREQ_CMP(sc->opp[n].freq_khz / 1000, freq_mhz)) return (&sc->opp[n]); return (NULL); } static void cpufreq_dt_opp_to_setting(device_t dev, const struct cpufreq_dt_opp *opp, struct cf_setting *set) { struct cpufreq_dt_softc *sc; sc = device_get_softc(dev); memset(set, 0, sizeof(*set)); set->freq = opp->freq_khz / 1000; set->volts = opp->voltage_uv / 1000; set->power = CPUFREQ_VAL_UNKNOWN; set->lat = sc->clk_latency; set->dev = dev; } static int cpufreq_dt_get(device_t dev, struct cf_setting *set) { struct cpufreq_dt_softc *sc; const struct cpufreq_dt_opp *opp; uint64_t freq; sc = device_get_softc(dev); if (clk_get_freq(sc->clk, &freq) != 0) return (ENXIO); opp = cpufreq_dt_find_opp(dev, freq / 1000000); if (opp == NULL) return (ENOENT); cpufreq_dt_opp_to_setting(dev, opp, set); return (0); } static int cpufreq_dt_set(device_t dev, const struct cf_setting *set) { struct cpufreq_dt_softc *sc; const struct cpufreq_dt_opp *opp, *copp; uint64_t freq; int error; sc = device_get_softc(dev); if (clk_get_freq(sc->clk, &freq) != 0) return (ENXIO); copp = cpufreq_dt_find_opp(dev, freq / 1000000); if (copp == NULL) return (ENOENT); opp = cpufreq_dt_find_opp(dev, set->freq); if (opp == NULL) return (EINVAL); if (copp->voltage_uv < opp->voltage_uv) { error = regulator_set_voltage(sc->reg, opp->voltage_uv, opp->voltage_uv); if (error != 0) return (ENXIO); } error = clk_set_freq(sc->clk, (uint64_t)opp->freq_khz * 1000, 0); if (error != 0) { /* Restore previous voltage (best effort) */ (void)regulator_set_voltage(sc->reg, copp->voltage_uv, copp->voltage_uv); return (ENXIO); } if (copp->voltage_uv > opp->voltage_uv) { error = regulator_set_voltage(sc->reg, opp->voltage_uv, opp->voltage_uv); if (error != 0) { /* Restore previous CPU frequency (best effort) */ (void)clk_set_freq(sc->clk, (uint64_t)copp->freq_khz * 1000, 0); return (ENXIO); } } if (clk_get_freq(sc->clk, &freq) == 0) cpufreq_dt_notify(dev, freq); return (0); } static int cpufreq_dt_type(device_t dev, int *type) { if (type == NULL) return (EINVAL); *type = CPUFREQ_TYPE_ABSOLUTE; return (0); } static int cpufreq_dt_settings(device_t dev, struct cf_setting *sets, int *count) { struct cpufreq_dt_softc *sc; ssize_t n; if (sets == NULL || count == NULL) return (EINVAL); sc = device_get_softc(dev); if (*count < sc->nopp) { *count = (int)sc->nopp; return (E2BIG); } for (n = 0; n < sc->nopp; n++) cpufreq_dt_opp_to_setting(dev, &sc->opp[n], &sets[n]); *count = (int)sc->nopp; return (0); } static void cpufreq_dt_identify(driver_t *driver, device_t parent) { phandle_t node; /* Properties must be listed under node /cpus/cpu@0 */ node = ofw_bus_get_node(parent); /* The cpu@0 node 
must have the following properties */ if (!OF_hasprop(node, "operating-points") || !OF_hasprop(node, "clocks") || !OF_hasprop(node, "cpu-supply")) return; if (device_find_child(parent, "cpufreq_dt", -1) != NULL) return; if (BUS_ADD_CHILD(parent, 0, "cpufreq_dt", -1) == NULL) device_printf(parent, "add cpufreq_dt child failed\n"); } static int cpufreq_dt_probe(device_t dev) { phandle_t node; node = ofw_bus_get_node(device_get_parent(dev)); if (!OF_hasprop(node, "operating-points") || !OF_hasprop(node, "clocks") || !OF_hasprop(node, "cpu-supply")) return (ENXIO); device_set_desc(dev, "Generic cpufreq driver"); return (BUS_PROBE_GENERIC); } static int cpufreq_dt_attach(device_t dev) { struct cpufreq_dt_softc *sc; uint32_t *opp, lat; phandle_t node, cnode; uint64_t freq; ssize_t n; int cpu; sc = device_get_softc(dev); node = ofw_bus_get_node(device_get_parent(dev)); if (regulator_get_by_ofw_property(dev, node, "cpu-supply", &sc->reg) != 0) { device_printf(dev, "no regulator for %s\n", ofw_bus_get_name(device_get_parent(dev))); return (ENXIO); } if (clk_get_by_ofw_index(dev, node, 0, &sc->clk) != 0) { device_printf(dev, "no clock for %s\n", ofw_bus_get_name(device_get_parent(dev))); regulator_release(sc->reg); return (ENXIO); } - sc->nopp = OF_getencprop_alloc(node, "operating-points", + sc->nopp = OF_getencprop_alloc_multi(node, "operating-points", sizeof(*sc->opp), (void **)&opp); if (sc->nopp == -1) return (ENXIO); sc->opp = malloc(sizeof(*sc->opp) * sc->nopp, M_DEVBUF, M_WAITOK); for (n = 0; n < sc->nopp; n++) { sc->opp[n].freq_khz = opp[n * 2 + 0]; sc->opp[n].voltage_uv = opp[n * 2 + 1]; if (bootverbose) device_printf(dev, "%u.%03u MHz, %u uV\n", sc->opp[n].freq_khz / 1000, sc->opp[n].freq_khz % 1000, sc->opp[n].voltage_uv); } free(opp, M_OFWPROP); if (OF_getencprop(node, "clock-latency", &lat, sizeof(lat)) == -1) sc->clk_latency = CPUFREQ_VAL_UNKNOWN; else sc->clk_latency = (int)lat; /* * Find all CPUs that share the same voltage and CPU frequency * controls. Start with the current node and move forward until * the end is reached or a peer has an "operating-points" property. */ CPU_ZERO(&sc->cpus); cpu = device_get_unit(device_get_parent(dev)); for (cnode = node; cnode > 0; cnode = OF_peer(cnode), cpu++) { if (cnode != node && OF_hasprop(cnode, "operating-points")) break; CPU_SET(cpu, &sc->cpus); } if (clk_get_freq(sc->clk, &freq) == 0) cpufreq_dt_notify(dev, freq); cpufreq_register(dev); return (0); } static device_method_t cpufreq_dt_methods[] = { /* Device interface */ DEVMETHOD(device_identify, cpufreq_dt_identify), DEVMETHOD(device_probe, cpufreq_dt_probe), DEVMETHOD(device_attach, cpufreq_dt_attach), /* cpufreq interface */ DEVMETHOD(cpufreq_drv_get, cpufreq_dt_get), DEVMETHOD(cpufreq_drv_set, cpufreq_dt_set), DEVMETHOD(cpufreq_drv_type, cpufreq_dt_type), DEVMETHOD(cpufreq_drv_settings, cpufreq_dt_settings), DEVMETHOD_END }; static driver_t cpufreq_dt_driver = { "cpufreq_dt", cpufreq_dt_methods, sizeof(struct cpufreq_dt_softc), }; static devclass_t cpufreq_dt_devclass; DRIVER_MODULE(cpufreq_dt, cpu, cpufreq_dt_driver, cpufreq_dt_devclass, 0, 0); MODULE_VERSION(cpufreq_dt, 1); Index: user/markj/netdump/sys/dev/dpaa/qman_fdt.c =================================================================== --- user/markj/netdump/sys/dev/dpaa/qman_fdt.c (revision 332407) +++ user/markj/netdump/sys/dev/dpaa/qman_fdt.c (revision 332408) @@ -1,267 +1,267 @@ /*- * Copyright (c) 2011-2012 Semihalf. * All rights reserved. 
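The hunk above follows the OFW API rename from OF_getencprop_alloc() to OF_getencprop_alloc_multi() for properties holding several fixed-size elements; the call returns the element count (here, the number of (frequency, voltage) pairs) rather than a byte count, and -1 when the property is absent. A minimal usage sketch under those assumptions, with a hypothetical helper name (not part of this commit):

#include <sys/param.h>
#include <sys/systm.h>
#include <dev/ofw/openfirm.h>

static int
opp_dump_sketch(phandle_t node)
{
	uint32_t *opp;
	ssize_t n, nopp;

	/* Each "operating-points" element is a (kHz, uV) pair of cells. */
	nopp = OF_getencprop_alloc_multi(node, "operating-points",
	    2 * sizeof(uint32_t), (void **)&opp);
	if (nopp == -1)
		return (ENXIO);
	for (n = 0; n < nopp; n++)
		printf("%u kHz at %u uV\n", opp[n * 2 + 0], opp[n * 2 + 1]);
	OF_prop_free(opp);
	return (0);
}

The same rename is applied to the "ranges" parsing in qman_fdt.c further down in this revision.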
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include "opt_platform.h" #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include "qman.h" #include "portals.h" #define FQMAN_DEVSTR "Freescale Queue Manager" static int qman_fdt_probe(device_t); static device_method_t qman_methods[] = { /* Device interface */ DEVMETHOD(device_probe, qman_fdt_probe), DEVMETHOD(device_attach, qman_attach), DEVMETHOD(device_detach, qman_detach), DEVMETHOD(device_suspend, qman_suspend), DEVMETHOD(device_resume, qman_resume), DEVMETHOD(device_shutdown, qman_shutdown), { 0, 0 } }; static driver_t qman_driver = { "qman", qman_methods, sizeof(struct qman_softc), }; static devclass_t qman_devclass; DRIVER_MODULE(qman, simplebus, qman_driver, qman_devclass, 0, 0); static int qman_fdt_probe(device_t dev) { if (!ofw_bus_is_compatible(dev, "fsl,qman")) return (ENXIO); device_set_desc(dev, FQMAN_DEVSTR); return (BUS_PROBE_DEFAULT); } /* * QMAN Portals */ #define QMAN_PORT_DEVSTR "Freescale Queue Manager - Portals" static device_probe_t qman_portals_fdt_probe; static device_attach_t qman_portals_fdt_attach; static device_method_t qm_portals_methods[] = { /* Device interface */ DEVMETHOD(device_probe, qman_portals_fdt_probe), DEVMETHOD(device_attach, qman_portals_fdt_attach), DEVMETHOD(device_detach, qman_portals_detach), { 0, 0 } }; static driver_t qm_portals_driver = { "qman-portals", qm_portals_methods, sizeof(struct dpaa_portals_softc), }; static devclass_t qm_portals_devclass; EARLY_DRIVER_MODULE(qman_portals, ofwbus, qm_portals_driver, qm_portals_devclass, 0, 0, BUS_PASS_BUS); static void get_addr_props(phandle_t node, uint32_t *addrp, uint32_t *sizep) { *addrp = 2; *sizep = 1; OF_getencprop(node, "#address-cells", addrp, sizeof(*addrp)); OF_getencprop(node, "#size-cells", sizep, sizeof(*sizep)); } static int qman_portals_fdt_probe(device_t dev) { phandle_t node; if (ofw_bus_is_compatible(dev, "simple-bus")) { node = ofw_bus_get_node(dev); for (node = OF_child(node); node > 0; node = OF_peer(node)) { if (ofw_bus_node_is_compatible(node, "fsl,qman-portal")) break; } if (node <= 0) return (ENXIO); } else if (!ofw_bus_is_compatible(dev, "fsl,qman-portals")) return (ENXIO); device_set_desc(dev, QMAN_PORT_DEVSTR); return (BUS_PROBE_DEFAULT); } static phandle_t 
qman_portal_find_cpu(int cpu) { phandle_t node; pcell_t reg; node = OF_finddevice("/cpus"); if (node == -1) return (-1); for (node = OF_child(node); node != 0; node = OF_peer(node)) { if (OF_getprop(node, "reg", ®, sizeof(reg)) <= 0) continue; if (reg == cpu) return (node); } return (-1); } static int qman_portals_fdt_attach(device_t dev) { struct dpaa_portals_softc *sc; phandle_t node, child, cpu_node; vm_paddr_t portal_pa, portal_par_pa; vm_size_t portal_size; uint32_t addr, paddr, size; ihandle_t cpu; int cpu_num, cpus, intr_rid; struct dpaa_portals_devinfo di; struct ofw_bus_devinfo ofw_di = {}; cell_t *range; int nrange; int i; cpus = 0; sc = device_get_softc(dev); sc->sc_dev = dev; node = ofw_bus_get_node(dev); /* Get this node's range */ get_addr_props(ofw_bus_get_node(device_get_parent(dev)), &paddr, &size); get_addr_props(node, &addr, &size); - nrange = OF_getencprop_alloc(node, "ranges", + nrange = OF_getencprop_alloc_multi(node, "ranges", sizeof(*range), (void **)&range); if (nrange < addr + paddr + size) return (ENXIO); portal_pa = portal_par_pa = 0; portal_size = 0; for (i = 0; i < addr; i++) { portal_pa <<= 32; portal_pa |= range[i]; } for (; i < paddr + addr; i++) { portal_par_pa <<= 32; portal_par_pa |= range[i]; } portal_pa += portal_par_pa; for (; i < size + paddr + addr; i++) { portal_size = (uintmax_t)portal_size << 32; portal_size |= range[i]; } OF_prop_free(range); sc->sc_dp_size = portal_size; sc->sc_dp_pa = portal_pa; /* Find portals tied to CPUs */ for (child = OF_child(node); child != 0; child = OF_peer(child)) { if (cpus >= mp_ncpus) break; if (!ofw_bus_node_is_compatible(child, "fsl,qman-portal")) { continue; } /* Checkout related cpu */ if (OF_getprop(child, "cpu-handle", (void *)&cpu, sizeof(cpu)) <= 0) { cpu = qman_portal_find_cpu(cpus); if (cpu <= 0) continue; } /* Acquire cpu number */ cpu_node = OF_instance_to_package(cpu); if (OF_getencprop(cpu_node, "reg", &cpu_num, sizeof(cpu_num)) <= 0) { device_printf(dev, "Could not retrieve CPU number.\n"); return (ENXIO); } cpus++; if (ofw_bus_gen_setup_devinfo(&ofw_di, child) != 0) { device_printf(dev, "could not set up devinfo\n"); continue; } resource_list_init(&di.di_res); if (ofw_bus_reg_to_rl(dev, child, addr, size, &di.di_res)) { device_printf(dev, "%s: could not process 'reg' " "property\n", ofw_di.obd_name); ofw_bus_gen_destroy_devinfo(&ofw_di); continue; } if (ofw_bus_intr_to_rl(dev, child, &di.di_res, &intr_rid)) { device_printf(dev, "%s: could not process " "'interrupts' property\n", ofw_di.obd_name); resource_list_free(&di.di_res); ofw_bus_gen_destroy_devinfo(&ofw_di); continue; } di.di_intr_rid = intr_rid; if (dpaa_portal_alloc_res(dev, &di, cpu_num)) goto err; } ofw_bus_gen_destroy_devinfo(&ofw_di); return (qman_portals_attach(dev)); err: resource_list_free(&di.di_res); ofw_bus_gen_destroy_devinfo(&ofw_di); qman_portals_detach(dev); return (ENXIO); } Index: user/markj/netdump/sys/dev/etherswitch/e6000sw/e6000sw.c =================================================================== --- user/markj/netdump/sys/dev/etherswitch/e6000sw/e6000sw.c (revision 332407) +++ user/markj/netdump/sys/dev/etherswitch/e6000sw/e6000sw.c (revision 332408) @@ -1,1300 +1,1302 @@ /*- * Copyright (c) 2015 Semihalf * Copyright (c) 2015 Stormshield * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "e6000swreg.h" #include "etherswitch_if.h" #include "miibus_if.h" #include "mdio_if.h" MALLOC_DECLARE(M_E6000SW); MALLOC_DEFINE(M_E6000SW, "e6000sw", "e6000sw switch"); #define E6000SW_LOCK(_sc) sx_xlock(&(_sc)->sx) #define E6000SW_UNLOCK(_sc) sx_unlock(&(_sc)->sx) #define E6000SW_LOCK_ASSERT(_sc, _what) sx_assert(&(_sc)->sx, (_what)) #define E6000SW_TRYLOCK(_sc) sx_tryxlock(&(_sc)->sx) typedef struct e6000sw_softc { device_t dev; phandle_t node; struct sx sx; struct ifnet *ifp[E6000SW_MAX_PORTS]; char *ifname[E6000SW_MAX_PORTS]; device_t miibus[E6000SW_MAX_PORTS]; struct proc *kproc; uint32_t swid; uint32_t vlan_mode; uint32_t cpuports_mask; uint32_t fixed_mask; uint32_t fixed25_mask; uint32_t ports_mask; int phy_base; int sw_addr; int num_ports; boolean_t multi_chip; } e6000sw_softc_t; static etherswitch_info_t etherswitch_info = { .es_nports = 0, .es_nvlangroups = 0, .es_vlan_caps = ETHERSWITCH_VLAN_PORT, .es_name = "Marvell 6000 series switch" }; static void e6000sw_identify(driver_t *, device_t); static int e6000sw_probe(device_t); static int e6000sw_attach(device_t); static int e6000sw_detach(device_t); static int e6000sw_readphy(device_t, int, int); static int e6000sw_writephy(device_t, int, int, int); static etherswitch_info_t* e6000sw_getinfo(device_t); static int e6000sw_getconf(device_t, etherswitch_conf_t *); static void e6000sw_lock(device_t); static void e6000sw_unlock(device_t); static int e6000sw_getport(device_t, etherswitch_port_t *); static int e6000sw_setport(device_t, etherswitch_port_t *); static int e6000sw_readreg_wrapper(device_t, int); static int e6000sw_writereg_wrapper(device_t, int, int); static int e6000sw_readphy_wrapper(device_t, int, int); static int e6000sw_writephy_wrapper(device_t, int, int, int); static int e6000sw_getvgroup_wrapper(device_t, etherswitch_vlangroup_t *); static int e6000sw_setvgroup_wrapper(device_t, etherswitch_vlangroup_t *); static int e6000sw_setvgroup(device_t, etherswitch_vlangroup_t *); static int e6000sw_getvgroup(device_t, etherswitch_vlangroup_t *); static void e6000sw_setup(device_t, e6000sw_softc_t *); static void e6000sw_port_vlan_conf(e6000sw_softc_t *); static void e6000sw_tick(void *); static void e6000sw_set_atustat(device_t, e6000sw_softc_t *, int, int); 
static int e6000sw_atu_flush(device_t, e6000sw_softc_t *, int); static __inline void e6000sw_writereg(e6000sw_softc_t *, int, int, int); static __inline uint32_t e6000sw_readreg(e6000sw_softc_t *, int, int); static int e6000sw_ifmedia_upd(struct ifnet *); static void e6000sw_ifmedia_sts(struct ifnet *, struct ifmediareq *); static int e6000sw_atu_mac_table(device_t, e6000sw_softc_t *, struct atu_opt *, int); static int e6000sw_get_pvid(e6000sw_softc_t *, int, int *); static int e6000sw_set_pvid(e6000sw_softc_t *, int, int); static __inline bool e6000sw_is_cpuport(e6000sw_softc_t *, int); static __inline bool e6000sw_is_fixedport(e6000sw_softc_t *, int); static __inline bool e6000sw_is_fixed25port(e6000sw_softc_t *, int); static __inline bool e6000sw_is_phyport(e6000sw_softc_t *, int); static __inline bool e6000sw_is_portenabled(e6000sw_softc_t *, int); static __inline struct mii_data *e6000sw_miiforphy(e6000sw_softc_t *, unsigned int); static device_method_t e6000sw_methods[] = { /* device interface */ DEVMETHOD(device_identify, e6000sw_identify), DEVMETHOD(device_probe, e6000sw_probe), DEVMETHOD(device_attach, e6000sw_attach), DEVMETHOD(device_detach, e6000sw_detach), /* bus interface */ DEVMETHOD(bus_add_child, device_add_child_ordered), /* mii interface */ DEVMETHOD(miibus_readreg, e6000sw_readphy), DEVMETHOD(miibus_writereg, e6000sw_writephy), /* etherswitch interface */ DEVMETHOD(etherswitch_getinfo, e6000sw_getinfo), DEVMETHOD(etherswitch_getconf, e6000sw_getconf), DEVMETHOD(etherswitch_lock, e6000sw_lock), DEVMETHOD(etherswitch_unlock, e6000sw_unlock), DEVMETHOD(etherswitch_getport, e6000sw_getport), DEVMETHOD(etherswitch_setport, e6000sw_setport), DEVMETHOD(etherswitch_readreg, e6000sw_readreg_wrapper), DEVMETHOD(etherswitch_writereg, e6000sw_writereg_wrapper), DEVMETHOD(etherswitch_readphyreg, e6000sw_readphy_wrapper), DEVMETHOD(etherswitch_writephyreg, e6000sw_writephy_wrapper), DEVMETHOD(etherswitch_setvgroup, e6000sw_setvgroup_wrapper), DEVMETHOD(etherswitch_getvgroup, e6000sw_getvgroup_wrapper), DEVMETHOD_END }; static devclass_t e6000sw_devclass; DEFINE_CLASS_0(e6000sw, e6000sw_driver, e6000sw_methods, sizeof(e6000sw_softc_t)); DRIVER_MODULE(e6000sw, mdio, e6000sw_driver, e6000sw_devclass, 0, 0); DRIVER_MODULE(etherswitch, e6000sw, etherswitch_driver, etherswitch_devclass, 0, 0); DRIVER_MODULE(miibus, e6000sw, miibus_driver, miibus_devclass, 0, 0); MODULE_DEPEND(e6000sw, mdio, 1, 1, 1); #define SMI_CMD 0 #define SMI_CMD_BUSY (1 << 15) #define SMI_CMD_OP_READ ((2 << 10) | SMI_CMD_BUSY | (1 << 12)) #define SMI_CMD_OP_WRITE ((1 << 10) | SMI_CMD_BUSY | (1 << 12)) #define SMI_DATA 1 #define MDIO_READ(dev, addr, reg) \ MDIO_READREG(device_get_parent(dev), (addr), (reg)) #define MDIO_WRITE(dev, addr, reg, val) \ MDIO_WRITEREG(device_get_parent(dev), (addr), (reg), (val)) static void e6000sw_identify(driver_t *driver, device_t parent) { if (device_find_child(parent, "e6000sw", -1) == NULL) BUS_ADD_CHILD(parent, 0, "e6000sw", -1); } static int e6000sw_probe(device_t dev) { e6000sw_softc_t *sc; const char *description; phandle_t dsa_node, switch_node; dsa_node = fdt_find_compatible(OF_finddevice("/"), "marvell,dsa", 0); switch_node = OF_child(dsa_node); if (switch_node == 0) return (ENXIO); sc = device_get_softc(dev); sc->dev = dev; sc->node = switch_node; if (OF_getencprop(sc->node, "reg", &sc->sw_addr, sizeof(sc->sw_addr)) < 0) return (ENXIO); - if (sc->sw_addr != 0 && (sc->sw_addr % 2) == 0) + + if (!OF_hasprop(sc->node, "single-chip-addressing") && + (sc->sw_addr != 0 && 
(sc->sw_addr % 2) == 0)) sc->multi_chip = true; /* * Create temporary lock, just to satisfy assertions, * when obtaining the switch ID. Destroy immediately afterwards. */ sx_init(&sc->sx, "e6000sw_tmp"); E6000SW_LOCK(sc); sc->swid = e6000sw_readreg(sc, REG_PORT(0), SWITCH_ID) & 0xfff0; E6000SW_UNLOCK(sc); sx_destroy(&sc->sx); switch (sc->swid) { case MV88E6141: description = "Marvell 88E6141"; sc->phy_base = 0x10; sc->num_ports = 6; break; case MV88E6341: description = "Marvell 88E6341"; sc->phy_base = 0x10; sc->num_ports = 6; break; case MV88E6352: description = "Marvell 88E6352"; sc->num_ports = 7; break; case MV88E6172: description = "Marvell 88E6172"; sc->num_ports = 7; break; case MV88E6176: description = "Marvell 88E6176"; sc->num_ports = 7; break; default: device_printf(dev, "Unrecognized device, id 0x%x.\n", sc->swid); return (ENXIO); } device_set_desc(dev, description); return (BUS_PROBE_DEFAULT); } static int e6000sw_parse_child_fdt(e6000sw_softc_t *sc, phandle_t child, int *pport) { char *name, *portlabel; int speed; phandle_t fixed_link; uint32_t port; if (pport == NULL) return (ENXIO); if (OF_getencprop(child, "reg", (void *)&port, sizeof(port)) < 0) return (ENXIO); if (port >= sc->num_ports) return (ENXIO); *pport = port; if (OF_getprop_alloc(child, "label", (void **)&portlabel) > 0) { if (strncmp(portlabel, "cpu", 3) == 0) { device_printf(sc->dev, "CPU port at %d\n", port); sc->cpuports_mask |= (1 << port); sc->fixed_mask |= (1 << port); } free(portlabel, M_OFWPROP); } fixed_link = OF_child(child); if (fixed_link != 0 && OF_getprop_alloc(fixed_link, "name", (void **)&name) > 0) { if (strncmp(name, "fixed-link", 10) == 0) { /* Assume defaults: 1g - full-duplex. */ sc->fixed_mask |= (1 << port); if (OF_getencprop(fixed_link, "speed", &speed, sizeof(speed)) > 0) { if (speed == 2500 && (MVSWITCH(sc, MV88E6141) || MVSWITCH(sc, MV88E6341))) { sc->fixed25_mask |= (1 << port); } } } free(name, M_OFWPROP); } if ((sc->fixed_mask & (1 << port)) != 0) device_printf(sc->dev, "fixed port at %d\n", port); else device_printf(sc->dev, "PHY at port %d\n", port); return (0); } static int e6000sw_init_interface(e6000sw_softc_t *sc, int port) { char name[IFNAMSIZ]; snprintf(name, IFNAMSIZ, "%sport", device_get_nameunit(sc->dev)); sc->ifp[port] = if_alloc(IFT_ETHER); if (sc->ifp[port] == NULL) return (ENOMEM); sc->ifp[port]->if_softc = sc; sc->ifp[port]->if_flags |= IFF_UP | IFF_BROADCAST | IFF_DRV_RUNNING | IFF_SIMPLEX; sc->ifname[port] = malloc(strlen(name) + 1, M_E6000SW, M_NOWAIT); if (sc->ifname[port] == NULL) { if_free(sc->ifp[port]); return (ENOMEM); } memcpy(sc->ifname[port], name, strlen(name) + 1); if_initname(sc->ifp[port], sc->ifname[port], port); return (0); } static int e6000sw_attach_miibus(e6000sw_softc_t *sc, int port) { int err; err = mii_attach(sc->dev, &sc->miibus[port], sc->ifp[port], e6000sw_ifmedia_upd, e6000sw_ifmedia_sts, BMSR_DEFCAPMASK, port + sc->phy_base, MII_OFFSET_ANY, 0); if (err != 0) return (err); return (0); } static int e6000sw_attach(device_t dev) { e6000sw_softc_t *sc; phandle_t child; int err, port; uint32_t reg; err = 0; sc = device_get_softc(dev); if (sc->multi_chip) device_printf(dev, "multi-chip addressing mode\n"); else device_printf(dev, "single-chip addressing mode\n"); sx_init(&sc->sx, "e6000sw"); E6000SW_LOCK(sc); e6000sw_setup(dev, sc); for (child = OF_child(sc->node); child != 0; child = OF_peer(child)) { err = e6000sw_parse_child_fdt(sc, child, &port); if (err != 0) { device_printf(sc->dev, "failed to parse DTS\n"); goto out_fail; } /* Port is in 
use. */ sc->ports_mask |= (1 << port); err = e6000sw_init_interface(sc, port); if (err != 0) { device_printf(sc->dev, "failed to init interface\n"); goto out_fail; } if (e6000sw_is_fixedport(sc, port)) { /* Link must be down to change speed force value. */ reg = e6000sw_readreg(sc, REG_PORT(port), PSC_CONTROL); reg &= ~PSC_CONTROL_LINK_UP; reg |= PSC_CONTROL_FORCED_LINK; e6000sw_writereg(sc, REG_PORT(port), PSC_CONTROL, reg); /* * Force speed, full-duplex, EEE off and flow-control * on. */ if (e6000sw_is_fixed25port(sc, port)) reg = PSC_CONTROL_SPD2500; else reg = PSC_CONTROL_SPD1000; reg |= PSC_CONTROL_FORCED_DPX | PSC_CONTROL_FULLDPX | PSC_CONTROL_FORCED_LINK | PSC_CONTROL_LINK_UP | PSC_CONTROL_FORCED_FC | PSC_CONTROL_FC_ON | PSC_CONTROL_FORCED_SPD; if (MVSWITCH(sc, MV88E6141) || MVSWITCH(sc, MV88E6341)) reg |= PSC_CONTROL_FORCED_EEE; e6000sw_writereg(sc, REG_PORT(port), PSC_CONTROL, reg); } /* Don't attach miibus at CPU/fixed ports */ if (!e6000sw_is_phyport(sc, port)) continue; err = e6000sw_attach_miibus(sc, port); if (err != 0) { device_printf(sc->dev, "failed to attach miibus\n"); goto out_fail; } } etherswitch_info.es_nports = sc->num_ports; /* Default to port vlan. */ e6000sw_port_vlan_conf(sc); E6000SW_UNLOCK(sc); bus_generic_probe(dev); bus_generic_attach(dev); kproc_create(e6000sw_tick, sc, &sc->kproc, 0, 0, "e6000sw tick kproc"); return (0); out_fail: E6000SW_UNLOCK(sc); e6000sw_detach(dev); return (err); } static __inline int e6000sw_poll_done(e6000sw_softc_t *sc) { int i; for (i = 0; i < E6000SW_SMI_TIMEOUT; i++) { if ((e6000sw_readreg(sc, REG_GLOBAL2, SMI_PHY_CMD_REG) & (1 << PHY_CMD_SMI_BUSY)) == 0) return (0); pause("e6000sw PHY poll", hz/1000); } return (ETIMEDOUT); } /* * PHY registers are paged. Put page index in reg 22 (accessible from every * page), then access specific register. 
*/ static int e6000sw_readphy(device_t dev, int phy, int reg) { e6000sw_softc_t *sc; uint32_t val; int err; sc = device_get_softc(dev); if (!e6000sw_is_phyport(sc, phy) || reg >= E6000SW_NUM_PHY_REGS) { device_printf(dev, "Wrong register address.\n"); return (EINVAL); } E6000SW_LOCK_ASSERT(sc, SA_XLOCKED); err = e6000sw_poll_done(sc); if (err != 0) { device_printf(dev, "Timeout while waiting for switch\n"); return (err); } val = 1 << PHY_CMD_SMI_BUSY; val |= PHY_CMD_MODE_MDIO << PHY_CMD_MODE; val |= PHY_CMD_OPCODE_READ << PHY_CMD_OPCODE; val |= (reg << PHY_CMD_REG_ADDR) & PHY_CMD_REG_ADDR_MASK; val |= (phy << PHY_CMD_DEV_ADDR) & PHY_CMD_DEV_ADDR_MASK; e6000sw_writereg(sc, REG_GLOBAL2, SMI_PHY_CMD_REG, val); err = e6000sw_poll_done(sc); if (err != 0) { device_printf(dev, "Timeout while waiting for switch\n"); return (err); } val = e6000sw_readreg(sc, REG_GLOBAL2, SMI_PHY_DATA_REG); return (val & PHY_DATA_MASK); } static int e6000sw_writephy(device_t dev, int phy, int reg, int data) { e6000sw_softc_t *sc; uint32_t val; int err; sc = device_get_softc(dev); if (!e6000sw_is_phyport(sc, phy) || reg >= E6000SW_NUM_PHY_REGS) { device_printf(dev, "Wrong register address.\n"); return (EINVAL); } E6000SW_LOCK_ASSERT(sc, SA_XLOCKED); err = e6000sw_poll_done(sc); if (err != 0) { device_printf(dev, "Timeout while waiting for switch\n"); return (err); } val = 1 << PHY_CMD_SMI_BUSY; val |= PHY_CMD_MODE_MDIO << PHY_CMD_MODE; val |= PHY_CMD_OPCODE_WRITE << PHY_CMD_OPCODE; val |= (reg << PHY_CMD_REG_ADDR) & PHY_CMD_REG_ADDR_MASK; val |= (phy << PHY_CMD_DEV_ADDR) & PHY_CMD_DEV_ADDR_MASK; e6000sw_writereg(sc, REG_GLOBAL2, SMI_PHY_DATA_REG, data & PHY_DATA_MASK); e6000sw_writereg(sc, REG_GLOBAL2, SMI_PHY_CMD_REG, val); err = e6000sw_poll_done(sc); if (err != 0) device_printf(dev, "Timeout while waiting for switch\n"); return (err); } static int e6000sw_detach(device_t dev) { int phy; e6000sw_softc_t *sc; sc = device_get_softc(dev); bus_generic_detach(dev); sx_destroy(&sc->sx); for (phy = 0; phy < sc->num_ports; phy++) { if (sc->miibus[phy] != NULL) device_delete_child(dev, sc->miibus[phy]); if (sc->ifp[phy] != NULL) if_free(sc->ifp[phy]); if (sc->ifname[phy] != NULL) free(sc->ifname[phy], M_E6000SW); } return (0); } static etherswitch_info_t* e6000sw_getinfo(device_t dev) { return (ðerswitch_info); } static int e6000sw_getconf(device_t dev, etherswitch_conf_t *conf) { struct e6000sw_softc *sc; /* Return the VLAN mode. 
*/ sc = device_get_softc(dev); conf->cmd = ETHERSWITCH_CONF_VLAN_MODE; conf->vlan_mode = sc->vlan_mode; return (0); } static void e6000sw_lock(device_t dev) { struct e6000sw_softc *sc; sc = device_get_softc(dev); E6000SW_LOCK_ASSERT(sc, SA_UNLOCKED); E6000SW_LOCK(sc); } static void e6000sw_unlock(device_t dev) { struct e6000sw_softc *sc; sc = device_get_softc(dev); E6000SW_LOCK_ASSERT(sc, SA_XLOCKED); E6000SW_UNLOCK(sc); } static int e6000sw_getport(device_t dev, etherswitch_port_t *p) { struct mii_data *mii; int err; struct ifmediareq *ifmr; e6000sw_softc_t *sc = device_get_softc(dev); E6000SW_LOCK_ASSERT(sc, SA_UNLOCKED); if (p->es_port >= sc->num_ports || p->es_port < 0) return (EINVAL); if (!e6000sw_is_portenabled(sc, p->es_port)) return (0); err = 0; E6000SW_LOCK(sc); e6000sw_get_pvid(sc, p->es_port, &p->es_pvid); if (e6000sw_is_fixedport(sc, p->es_port)) { if (e6000sw_is_cpuport(sc, p->es_port)) p->es_flags |= ETHERSWITCH_PORT_CPU; ifmr = &p->es_ifmr; ifmr->ifm_status = IFM_ACTIVE | IFM_AVALID; ifmr->ifm_count = 0; if (e6000sw_is_fixed25port(sc, p->es_port)) ifmr->ifm_active = IFM_2500_T; else ifmr->ifm_active = IFM_1000_T; ifmr->ifm_active |= IFM_ETHER | IFM_FDX; ifmr->ifm_current = ifmr->ifm_active; ifmr->ifm_mask = 0; } else { mii = e6000sw_miiforphy(sc, p->es_port); err = ifmedia_ioctl(mii->mii_ifp, &p->es_ifr, &mii->mii_media, SIOCGIFMEDIA); } E6000SW_UNLOCK(sc); return (err); } static int e6000sw_setport(device_t dev, etherswitch_port_t *p) { e6000sw_softc_t *sc; int err; struct mii_data *mii; sc = device_get_softc(dev); E6000SW_LOCK_ASSERT(sc, SA_UNLOCKED); if (p->es_port >= sc->num_ports || p->es_port < 0) return (EINVAL); if (!e6000sw_is_portenabled(sc, p->es_port)) return (0); err = 0; E6000SW_LOCK(sc); if (p->es_pvid != 0) e6000sw_set_pvid(sc, p->es_port, p->es_pvid); if (e6000sw_is_phyport(sc, p->es_port)) { mii = e6000sw_miiforphy(sc, p->es_port); err = ifmedia_ioctl(mii->mii_ifp, &p->es_ifr, &mii->mii_media, SIOCSIFMEDIA); } E6000SW_UNLOCK(sc); return (err); } /* * Registers in this switch are divided into sections, specified in * documentation. To access any of them, both a section index and a register * index are necessary. etherswitchcfg uses only one variable, so indexes were * compressed into addr_reg: 32 * section_index + reg_index. */ static int e6000sw_readreg_wrapper(device_t dev, int addr_reg) { if ((addr_reg > (REG_GLOBAL2 * 32 + REG_NUM_MAX)) || (addr_reg < (REG_PORT(0) * 32))) { device_printf(dev, "Wrong register address.\n"); return (EINVAL); } return (e6000sw_readreg(device_get_softc(dev), addr_reg / 32, addr_reg % 32)); } static int e6000sw_writereg_wrapper(device_t dev, int addr_reg, int val) { if ((addr_reg > (REG_GLOBAL2 * 32 + REG_NUM_MAX)) || (addr_reg < (REG_PORT(0) * 32))) { device_printf(dev, "Wrong register address.\n"); return (EINVAL); } e6000sw_writereg(device_get_softc(dev), addr_reg / 32, addr_reg % 32, val); return (0); } /* * These wrappers are necessary because PHY accesses from etherswitchcfg * need to be synchronized with locks, while miibus PHY accesses do not. */
static int e6000sw_readphy_wrapper(device_t dev, int phy, int reg) { e6000sw_softc_t *sc; int ret; sc = device_get_softc(dev); E6000SW_LOCK_ASSERT(sc, SA_UNLOCKED); E6000SW_LOCK(sc); ret = e6000sw_readphy(dev, phy, reg); E6000SW_UNLOCK(sc); return (ret); } static int e6000sw_writephy_wrapper(device_t dev, int phy, int reg, int data) { e6000sw_softc_t *sc; int ret; sc = device_get_softc(dev); E6000SW_LOCK_ASSERT(sc, SA_UNLOCKED); E6000SW_LOCK(sc); ret = e6000sw_writephy(dev, phy, reg, data); E6000SW_UNLOCK(sc); return (ret); } /* * setvgroup/getvgroup calls from etherswitchcfg need to be locked, * while internal calls do not. */ static int e6000sw_setvgroup_wrapper(device_t dev, etherswitch_vlangroup_t *vg) { e6000sw_softc_t *sc; int ret; sc = device_get_softc(dev); E6000SW_LOCK_ASSERT(sc, SA_UNLOCKED); E6000SW_LOCK(sc); ret = e6000sw_setvgroup(dev, vg); E6000SW_UNLOCK(sc); return (ret); } static int e6000sw_getvgroup_wrapper(device_t dev, etherswitch_vlangroup_t *vg) { e6000sw_softc_t *sc; int ret; sc = device_get_softc(dev); E6000SW_LOCK_ASSERT(sc, SA_UNLOCKED); E6000SW_LOCK(sc); ret = e6000sw_getvgroup(dev, vg); E6000SW_UNLOCK(sc); return (ret); } static __inline void e6000sw_port_vlan_assign(e6000sw_softc_t *sc, int port, uint32_t fid, uint32_t members) { uint32_t reg; reg = e6000sw_readreg(sc, REG_PORT(port), PORT_VLAN_MAP); reg &= ~PORT_VLAN_MAP_TABLE_MASK; reg &= ~PORT_VLAN_MAP_FID_MASK; reg |= members & PORT_VLAN_MAP_TABLE_MASK & ~(1 << port); reg |= (fid << PORT_VLAN_MAP_FID) & PORT_VLAN_MAP_FID_MASK; e6000sw_writereg(sc, REG_PORT(port), PORT_VLAN_MAP, reg); reg = e6000sw_readreg(sc, REG_PORT(port), PORT_CONTROL_1); reg &= ~PORT_CONTROL_1_FID_MASK; reg |= (fid >> 4) & PORT_CONTROL_1_FID_MASK; e6000sw_writereg(sc, REG_PORT(port), PORT_CONTROL_1, reg); } static int e6000sw_set_port_vlan(e6000sw_softc_t *sc, etherswitch_vlangroup_t *vg) { uint32_t port; port = vg->es_vlangroup; if (port > sc->num_ports) return (EINVAL); if (vg->es_member_ports != vg->es_untagged_ports) { device_printf(sc->dev, "Tagged ports not supported.\n"); return (EINVAL); } e6000sw_port_vlan_assign(sc, port, port + 1, vg->es_untagged_ports); vg->es_vid = port | ETHERSWITCH_VID_VALID; return (0); } static int e6000sw_setvgroup(device_t dev, etherswitch_vlangroup_t *vg) { e6000sw_softc_t *sc; sc = device_get_softc(dev); E6000SW_LOCK_ASSERT(sc, SA_XLOCKED); if (sc->vlan_mode == ETHERSWITCH_VLAN_PORT) return (e6000sw_set_port_vlan(sc, vg)); return (EINVAL); } static int e6000sw_get_port_vlan(e6000sw_softc_t *sc, etherswitch_vlangroup_t *vg) { uint32_t port, reg; port = vg->es_vlangroup; if (port > sc->num_ports) return (EINVAL); if (!e6000sw_is_portenabled(sc, port)) { vg->es_vid = port; return (0); } reg = e6000sw_readreg(sc, REG_PORT(port), PORT_VLAN_MAP); vg->es_untagged_ports = vg->es_member_ports = reg & PORT_VLAN_MAP_TABLE_MASK; vg->es_vid = port | ETHERSWITCH_VID_VALID; vg->es_fid = (reg & PORT_VLAN_MAP_FID_MASK) >> PORT_VLAN_MAP_FID; reg = e6000sw_readreg(sc, REG_PORT(port), PORT_CONTROL_1); vg->es_fid |= (reg & PORT_CONTROL_1_FID_MASK) << 4; return (0); } static int e6000sw_getvgroup(device_t dev, etherswitch_vlangroup_t *vg) { e6000sw_softc_t *sc; sc = device_get_softc(dev); E6000SW_LOCK_ASSERT(sc, SA_XLOCKED); if (sc->vlan_mode == ETHERSWITCH_VLAN_PORT) return (e6000sw_get_port_vlan(sc, vg)); return (EINVAL); } static __inline struct mii_data* e6000sw_miiforphy(e6000sw_softc_t *sc, unsigned int phy) { if (!e6000sw_is_phyport(sc, phy)) return (NULL); return (device_get_softc(sc->miibus[phy])); } static
int e6000sw_ifmedia_upd(struct ifnet *ifp) { e6000sw_softc_t *sc; struct mii_data *mii; sc = ifp->if_softc; mii = e6000sw_miiforphy(sc, ifp->if_dunit); if (mii == NULL) return (ENXIO); mii_mediachg(mii); return (0); } static void e6000sw_ifmedia_sts(struct ifnet *ifp, struct ifmediareq *ifmr) { e6000sw_softc_t *sc; struct mii_data *mii; sc = ifp->if_softc; mii = e6000sw_miiforphy(sc, ifp->if_dunit); if (mii == NULL) return; mii_pollstat(mii); ifmr->ifm_active = mii->mii_media_active; ifmr->ifm_status = mii->mii_media_status; } static int e6000sw_smi_waitready(e6000sw_softc_t *sc, int phy) { int i; for (i = 0; i < E6000SW_SMI_TIMEOUT; i++) { if ((MDIO_READ(sc->dev, phy, SMI_CMD) & SMI_CMD_BUSY) == 0) return (0); DELAY(1); } return (1); } static __inline uint32_t e6000sw_readreg(e6000sw_softc_t *sc, int addr, int reg) { E6000SW_LOCK_ASSERT(sc, SA_XLOCKED); if (!sc->multi_chip) return (MDIO_READ(sc->dev, addr, reg) & 0xffff); if (e6000sw_smi_waitready(sc, sc->sw_addr)) { printf("e6000sw: readreg timeout\n"); return (0xffff); } MDIO_WRITE(sc->dev, sc->sw_addr, SMI_CMD, SMI_CMD_OP_READ | (addr << 5) | reg); if (e6000sw_smi_waitready(sc, sc->sw_addr)) { printf("e6000sw: readreg timeout\n"); return (0xffff); } return (MDIO_READ(sc->dev, sc->sw_addr, SMI_DATA) & 0xffff); } static __inline void e6000sw_writereg(e6000sw_softc_t *sc, int addr, int reg, int val) { E6000SW_LOCK_ASSERT(sc, SA_XLOCKED); if (!sc->multi_chip) { MDIO_WRITE(sc->dev, addr, reg, val); return; } if (e6000sw_smi_waitready(sc, sc->sw_addr)) { printf("e6000sw: writereg timeout\n"); return; } MDIO_WRITE(sc->dev, sc->sw_addr, SMI_DATA, val); MDIO_WRITE(sc->dev, sc->sw_addr, SMI_CMD, SMI_CMD_OP_WRITE | (addr << 5) | reg); if (e6000sw_smi_waitready(sc, sc->sw_addr)) { printf("e6000sw: writereg timeout\n"); return; } } static __inline bool e6000sw_is_cpuport(e6000sw_softc_t *sc, int port) { return ((sc->cpuports_mask & (1 << port)) ? true : false); } static __inline bool e6000sw_is_fixedport(e6000sw_softc_t *sc, int port) { return ((sc->fixed_mask & (1 << port)) ? true : false); } static __inline bool e6000sw_is_fixed25port(e6000sw_softc_t *sc, int port) { return ((sc->fixed25_mask & (1 << port)) ? true : false); } static __inline bool e6000sw_is_phyport(e6000sw_softc_t *sc, int port) { uint32_t phy_mask; phy_mask = ~(sc->fixed_mask | sc->cpuports_mask); return ((phy_mask & (1 << port)) ? true : false); } static __inline bool e6000sw_is_portenabled(e6000sw_softc_t *sc, int port) { return ((sc->ports_mask & (1 << port)) ? true : false); } static __inline int e6000sw_set_pvid(e6000sw_softc_t *sc, int port, int pvid) { e6000sw_writereg(sc, REG_PORT(port), PORT_VID, pvid & PORT_VID_DEF_VID_MASK); return (0); } static __inline int e6000sw_get_pvid(e6000sw_softc_t *sc, int port, int *pvid) { if (pvid == NULL) return (ENXIO); *pvid = e6000sw_readreg(sc, REG_PORT(port), PORT_VID) & PORT_VID_DEF_VID_MASK; return (0); } /* * Convert port status to ifmedia.
*/ static void e6000sw_update_ifmedia(uint16_t portstatus, u_int *media_status, u_int *media_active) { *media_active = IFM_ETHER; *media_status = IFM_AVALID; if ((portstatus & PORT_STATUS_LINK_MASK) != 0) *media_status |= IFM_ACTIVE; else { *media_active |= IFM_NONE; return; } switch (portstatus & PORT_STATUS_SPEED_MASK) { case PORT_STATUS_SPEED_10: *media_active |= IFM_10_T; break; case PORT_STATUS_SPEED_100: *media_active |= IFM_100_TX; break; case PORT_STATUS_SPEED_1000: *media_active |= IFM_1000_T; break; } if ((portstatus & PORT_STATUS_DUPLEX_MASK) == 0) *media_active |= IFM_FDX; else *media_active |= IFM_HDX; } static void e6000sw_tick (void *arg) { e6000sw_softc_t *sc; struct mii_data *mii; struct mii_softc *miisc; uint16_t portstatus; int port; sc = arg; E6000SW_LOCK_ASSERT(sc, SA_UNLOCKED); for (;;) { E6000SW_LOCK(sc); for (port = 0; port < sc->num_ports; port++) { /* Tick only on PHY ports */ if (!e6000sw_is_portenabled(sc, port) || !e6000sw_is_phyport(sc, port)) continue; mii = e6000sw_miiforphy(sc, port); if (mii == NULL) continue; portstatus = e6000sw_readreg(sc, REG_PORT(port), PORT_STATUS); e6000sw_update_ifmedia(portstatus, &mii->mii_media_status, &mii->mii_media_active); LIST_FOREACH(miisc, &mii->mii_phys, mii_list) { if (IFM_INST(mii->mii_media.ifm_cur->ifm_media) != miisc->mii_inst) continue; mii_phy_update(miisc, MII_POLLSTAT); } } E6000SW_UNLOCK(sc); pause("e6000sw tick", 1000); } } static void e6000sw_setup(device_t dev, e6000sw_softc_t *sc) { uint16_t atu_ctrl, atu_age; /* Set aging time */ e6000sw_writereg(sc, REG_GLOBAL, ATU_CONTROL, (E6000SW_DEFAULT_AGETIME << ATU_CONTROL_AGETIME) | (1 << ATU_CONTROL_LEARN2ALL)); /* Send all with specific mac address to cpu port */ e6000sw_writereg(sc, REG_GLOBAL2, MGMT_EN_2x, MGMT_EN_ALL); e6000sw_writereg(sc, REG_GLOBAL2, MGMT_EN_0x, MGMT_EN_ALL); /* Disable Remote Management */ e6000sw_writereg(sc, REG_GLOBAL, SWITCH_GLOBAL_CONTROL2, 0); /* Disable loopback filter and flow control messages */ e6000sw_writereg(sc, REG_GLOBAL2, SWITCH_MGMT, SWITCH_MGMT_PRI_MASK | (1 << SWITCH_MGMT_RSVD2CPU) | SWITCH_MGMT_FC_PRI_MASK | (1 << SWITCH_MGMT_FORCEFLOW)); e6000sw_atu_flush(dev, sc, NO_OPERATION); e6000sw_atu_mac_table(dev, sc, NULL, NO_OPERATION); e6000sw_set_atustat(dev, sc, 0, COUNT_ALL); /* Set ATU AgeTime to 15 seconds */ atu_age = 1; atu_ctrl = e6000sw_readreg(sc, REG_GLOBAL, ATU_CONTROL); /* Set new AgeTime field */ atu_ctrl &= ~ATU_CONTROL_AGETIME_MASK; e6000sw_writereg(sc, REG_GLOBAL, ATU_CONTROL, atu_ctrl | (atu_age << ATU_CONTROL_AGETIME)); } static void e6000sw_port_vlan_conf(e6000sw_softc_t *sc) { int i, port, ret; uint32_t members; /* Disable all ports */ for (port = 0; port < sc->num_ports; port++) { ret = e6000sw_readreg(sc, REG_PORT(port), PORT_CONTROL); e6000sw_writereg(sc, REG_PORT(port), PORT_CONTROL, (ret & ~PORT_CONTROL_ENABLE)); } /* Set port priority */ for (port = 0; port < sc->num_ports; port++) { if (!e6000sw_is_portenabled(sc, port)) continue; ret = e6000sw_readreg(sc, REG_PORT(port), PORT_VID); ret &= ~PORT_VID_PRIORITY_MASK; e6000sw_writereg(sc, REG_PORT(port), PORT_VID, ret); } /* Set VID map */ for (port = 0; port < sc->num_ports; port++) { if (!e6000sw_is_portenabled(sc, port)) continue; ret = e6000sw_readreg(sc, REG_PORT(port), PORT_VID); ret &= ~PORT_VID_DEF_VID_MASK; ret |= (port + 1); e6000sw_writereg(sc, REG_PORT(port), PORT_VID, ret); } /* Enable all ports */ for (port = 0; port < sc->num_ports; port++) { if (!e6000sw_is_portenabled(sc, port)) continue; ret = e6000sw_readreg(sc, REG_PORT(port), 
PORT_CONTROL); e6000sw_writereg(sc, REG_PORT(port), PORT_CONTROL, (ret | PORT_CONTROL_ENABLE)); } /* Set VLAN mode. */ sc->vlan_mode = ETHERSWITCH_VLAN_PORT; etherswitch_info.es_nvlangroups = sc->num_ports; for (port = 0; port < sc->num_ports; port++) { members = 0; if (e6000sw_is_portenabled(sc, port)) { for (i = 0; i < sc->num_ports; i++) { if (i == port || !e6000sw_is_portenabled(sc, i)) continue; members |= (1 << i); } } e6000sw_port_vlan_assign(sc, port, port + 1, members); } } static void e6000sw_set_atustat(device_t dev, e6000sw_softc_t *sc, int bin, int flag) { uint16_t ret; ret = e6000sw_readreg(sc, REG_GLOBAL2, ATU_STATS); e6000sw_writereg(sc, REG_GLOBAL2, ATU_STATS, (bin << ATU_STATS_BIN) | (flag << ATU_STATS_FLAG)); } static int e6000sw_atu_mac_table(device_t dev, e6000sw_softc_t *sc, struct atu_opt *atu, int flag) { uint16_t ret_opt; uint16_t ret_data; int retries; if (flag == NO_OPERATION) return (0); else if ((flag & (LOAD_FROM_FIB | PURGE_FROM_FIB | GET_NEXT_IN_FIB | GET_VIOLATION_DATA | CLEAR_VIOLATION_DATA)) == 0) { device_printf(dev, "Wrong opcode for ATU operation\n"); return (EINVAL); } ret_opt = e6000sw_readreg(sc, REG_GLOBAL, ATU_OPERATION); if (ret_opt & ATU_UNIT_BUSY) { device_printf(dev, "ATU unit is busy, cannot access register\n"); return (EBUSY); } else { if (flag & LOAD_FROM_FIB) { ret_data = e6000sw_readreg(sc, REG_GLOBAL, ATU_DATA); e6000sw_writereg(sc, REG_GLOBAL2, ATU_DATA, (ret_data & ~ENTRY_STATE)); } e6000sw_writereg(sc, REG_GLOBAL, ATU_MAC_ADDR01, atu->mac_01); e6000sw_writereg(sc, REG_GLOBAL, ATU_MAC_ADDR23, atu->mac_23); e6000sw_writereg(sc, REG_GLOBAL, ATU_MAC_ADDR45, atu->mac_45); e6000sw_writereg(sc, REG_GLOBAL, ATU_FID, atu->fid); e6000sw_writereg(sc, REG_GLOBAL, ATU_OPERATION, (ret_opt | ATU_UNIT_BUSY | flag)); retries = E6000SW_RETRIES; while (--retries > 0 && (e6000sw_readreg(sc, REG_GLOBAL, ATU_OPERATION) & ATU_UNIT_BUSY)) DELAY(1); if (retries == 0) device_printf(dev, "Timeout while flushing\n"); else if (flag & GET_NEXT_IN_FIB) { atu->mac_01 = e6000sw_readreg(sc, REG_GLOBAL, ATU_MAC_ADDR01); atu->mac_23 = e6000sw_readreg(sc, REG_GLOBAL, ATU_MAC_ADDR23); atu->mac_45 = e6000sw_readreg(sc, REG_GLOBAL, ATU_MAC_ADDR45); } } return (0); } static int e6000sw_atu_flush(device_t dev, e6000sw_softc_t *sc, int flag) { uint16_t ret; int retries; if (flag == NO_OPERATION) return (0); ret = e6000sw_readreg(sc, REG_GLOBAL, ATU_OPERATION); if (ret & ATU_UNIT_BUSY) { device_printf(dev, "ATU unit is busy, cannot flush\n"); return (EBUSY); } else { e6000sw_writereg(sc, REG_GLOBAL, ATU_OPERATION, (ret | ATU_UNIT_BUSY | flag)); retries = E6000SW_RETRIES; while (--retries > 0 && (e6000sw_readreg(sc, REG_GLOBAL, ATU_OPERATION) & ATU_UNIT_BUSY)) DELAY(1); if (retries == 0) device_printf(dev, "Timeout while flushing\n"); } return (0); } Index: user/markj/netdump/sys/dev/extres/clk/clk.c =================================================================== --- user/markj/netdump/sys/dev/extres/clk/clk.c (revision 332407) +++ user/markj/netdump/sys/dev/extres/clk/clk.c (revision 332408) @@ -1,1514 +1,1514 @@ /*- * Copyright 2016 Michal Meloun * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2.
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_platform.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef FDT #include #include #include #endif #include MALLOC_DEFINE(M_CLOCK, "clocks", "Clock framework"); /* Forward declarations. */ struct clk; struct clknodenode; struct clkdom; typedef TAILQ_HEAD(clknode_list, clknode) clknode_list_t; typedef TAILQ_HEAD(clkdom_list, clkdom) clkdom_list_t; /* Default clock methods. */ static int clknode_method_init(struct clknode *clk, device_t dev); static int clknode_method_recalc_freq(struct clknode *clk, uint64_t *freq); static int clknode_method_set_freq(struct clknode *clk, uint64_t fin, uint64_t *fout, int flags, int *stop); static int clknode_method_set_gate(struct clknode *clk, bool enable); static int clknode_method_set_mux(struct clknode *clk, int idx); /* * Clock controller methods. */ static clknode_method_t clknode_methods[] = { CLKNODEMETHOD(clknode_init, clknode_method_init), CLKNODEMETHOD(clknode_recalc_freq, clknode_method_recalc_freq), CLKNODEMETHOD(clknode_set_freq, clknode_method_set_freq), CLKNODEMETHOD(clknode_set_gate, clknode_method_set_gate), CLKNODEMETHOD(clknode_set_mux, clknode_method_set_mux), CLKNODEMETHOD_END }; DEFINE_CLASS_0(clknode, clknode_class, clknode_methods, 0); /* * Clock node - basic element for modeling SOC clock graph. It holds the clock * provider's data about the clock, and the links for the clock's membership in * various lists. */ struct clknode { KOBJ_FIELDS; /* Clock nodes topology. */ struct clkdom *clkdom; /* Owning clock domain */ TAILQ_ENTRY(clknode) clkdom_link; /* Domain list entry */ TAILQ_ENTRY(clknode) clklist_link; /* Global list entry */ /* String based parent list. */ const char **parent_names; /* Array of parent names */ int parent_cnt; /* Number of parents */ int parent_idx; /* Parent index or -1 */ /* Cache for already resolved names. */ struct clknode **parents; /* Array of potential parents */ struct clknode *parent; /* Current parent */ /* Parent/child relationship links. */ clknode_list_t children; /* List of our children */ TAILQ_ENTRY(clknode) sibling_link; /* Our entry in parent's list */ /* Details of this device. */ void *softc; /* Instance softc */ const char *name; /* Globally unique name */ intptr_t id; /* Per domain unique id */ int flags; /* CLK_FLAG_* */ struct sx lock; /* Lock for this clock */ int ref_cnt; /* Reference counter */ int enable_cnt; /* Enabled counter */ /* Cached values. 
*/ uint64_t freq; /* Actual frequency */ struct sysctl_ctx_list sysctl_ctx; }; /* * Per-consumer data: information about how a consumer is using a clock node. * A pointer to this structure is used as a handle in the consumer interface. */ struct clk { device_t dev; struct clknode *clknode; int enable_cnt; }; /* * Clock domain - a group of clocks provided by one clock device. */ struct clkdom { device_t dev; /* Link to provider device */ TAILQ_ENTRY(clkdom) link; /* Global domain list entry */ clknode_list_t clknode_list; /* All clocks in the domain */ #ifdef FDT clknode_ofw_mapper_func *ofw_mapper; /* Find clock using FDT xref */ #endif }; /* * The system-wide list of clock domains. */ static clkdom_list_t clkdom_list = TAILQ_HEAD_INITIALIZER(clkdom_list); /* * Each clock node is linked on a system-wide list and can be searched by name. */ static clknode_list_t clknode_list = TAILQ_HEAD_INITIALIZER(clknode_list); /* * Locking - we use three levels of locking: * - First, the topology lock is taken. This one protects all lists. * - Second level is the per-clknode lock. It protects clknode data. * - The third level is outside of this file; it protects the clock device * registers. * The first two levels use sleepable locks; the clock device can use a mutex * or an sx lock. */ static struct sx clk_topo_lock; SX_SYSINIT(clock_topology, &clk_topo_lock, "Clock topology lock"); #define CLK_TOPO_SLOCK() sx_slock(&clk_topo_lock) #define CLK_TOPO_XLOCK() sx_xlock(&clk_topo_lock) #define CLK_TOPO_UNLOCK() sx_unlock(&clk_topo_lock) #define CLK_TOPO_ASSERT() sx_assert(&clk_topo_lock, SA_LOCKED) #define CLK_TOPO_XASSERT() sx_assert(&clk_topo_lock, SA_XLOCKED) #define CLKNODE_SLOCK(_sc) sx_slock(&((_sc)->lock)) #define CLKNODE_XLOCK(_sc) sx_xlock(&((_sc)->lock)) #define CLKNODE_UNLOCK(_sc) sx_unlock(&((_sc)->lock)) static void clknode_adjust_parent(struct clknode *clknode, int idx); enum clknode_sysctl_type { CLKNODE_SYSCTL_PARENT, CLKNODE_SYSCTL_PARENTS_LIST, CLKNODE_SYSCTL_CHILDREN_LIST, }; static int clknode_sysctl(SYSCTL_HANDLER_ARGS); static int clkdom_sysctl(SYSCTL_HANDLER_ARGS); /* * Default clock methods for base class. */ static int clknode_method_init(struct clknode *clknode, device_t dev) { return (0); } static int clknode_method_recalc_freq(struct clknode *clknode, uint64_t *freq) { return (0); } static int clknode_method_set_freq(struct clknode *clknode, uint64_t fin, uint64_t *fout, int flags, int *stop) { *stop = 0; return (0); } static int clknode_method_set_gate(struct clknode *clk, bool enable) { return (0); } static int clknode_method_set_mux(struct clknode *clk, int idx) { return (0); } /* * Internal functions. */ /* * Duplicate an array of parent names. * * Compute the total size and allocate a single block which holds both the array * of pointers to strings and the copied strings themselves. Returns a pointer to * the start of the block where the array of copied string pointers lives. * * XXX Revisit this, no need for the DECONST stuff.
 */
static const char **
strdup_list(const char **names, int num)
{
	size_t len, slen;
	const char **outptr, *ptr;
	int i;

	len = sizeof(char *) * num;
	for (i = 0; i < num; i++) {
		if (names[i] == NULL)
			continue;
		slen = strlen(names[i]);
		if (slen == 0)
			panic("Clock parent names array has an empty string");
		len += slen + 1;
	}
	outptr = malloc(len, M_CLOCK, M_WAITOK | M_ZERO);
	ptr = (char *)(outptr + num);
	for (i = 0; i < num; i++) {
		if (names[i] == NULL)
			continue;
		outptr[i] = ptr;
		slen = strlen(names[i]) + 1;
		bcopy(names[i], __DECONST(void *, outptr[i]), slen);
		ptr += slen;
	}
	return (outptr);
}

/*
 * Recompute the cached frequency for this node and all its children.
 */
static int
clknode_refresh_cache(struct clknode *clknode, uint64_t freq)
{
	int rv;
	struct clknode *entry;

	CLK_TOPO_XASSERT();

	/* Compute generated frequency. */
	rv = CLKNODE_RECALC_FREQ(clknode, &freq);
	if (rv != 0) {
		/*
		 * XXX If an error happens while refreshing children
		 * this leaves the world in a partially-updated state.
		 * Panic for now.
		 */
		panic("clknode_refresh_cache failed for '%s'\n",
		    clknode->name);
		return (rv);
	}
	/* Refresh cache for this node. */
	clknode->freq = freq;

	/* Refresh cache for all children. */
	TAILQ_FOREACH(entry, &(clknode->children), sibling_link) {
		rv = clknode_refresh_cache(entry, freq);
		if (rv != 0)
			return (rv);
	}
	return (0);
}

/*
 * Public interface.
 */

struct clknode *
clknode_find_by_name(const char *name)
{
	struct clknode *entry;

	CLK_TOPO_ASSERT();

	TAILQ_FOREACH(entry, &clknode_list, clklist_link) {
		if (strcmp(entry->name, name) == 0)
			return (entry);
	}
	return (NULL);
}

struct clknode *
clknode_find_by_id(struct clkdom *clkdom, intptr_t id)
{
	struct clknode *entry;

	CLK_TOPO_ASSERT();

	TAILQ_FOREACH(entry, &clkdom->clknode_list, clkdom_link) {
		if (entry->id == id)
			return (entry);
	}
	return (NULL);
}

/* -------------------------------------------------------------------------- */
/*
 * Clock domain functions
 */

/* Find the clock domain associated with a device in the global list. */
struct clkdom *
clkdom_get_by_dev(const device_t dev)
{
	struct clkdom *entry;

	CLK_TOPO_ASSERT();

	TAILQ_FOREACH(entry, &clkdom_list, link) {
		if (entry->dev == dev)
			return (entry);
	}
	return (NULL);
}

#ifdef FDT
/* Default DT mapper. */
static int
clknode_default_ofw_map(struct clkdom *clkdom, uint32_t ncells,
    phandle_t *cells, struct clknode **clk)
{

	CLK_TOPO_ASSERT();

	if (ncells == 0)
		*clk = clknode_find_by_id(clkdom, 1);
	else if (ncells == 1)
		*clk = clknode_find_by_id(clkdom, cells[0]);
	else
		return (ERANGE);

	if (*clk == NULL)
		return (ENXIO);
	return (0);
}
#endif

/*
 * Create a clock domain.
 */
struct clkdom *
clkdom_create(device_t dev)
{
	struct clkdom *clkdom;

	clkdom = malloc(sizeof(struct clkdom), M_CLOCK, M_WAITOK | M_ZERO);
	clkdom->dev = dev;
	TAILQ_INIT(&clkdom->clknode_list);
#ifdef FDT
	clkdom->ofw_mapper = clknode_default_ofw_map;
#endif

	SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev),
	    SYSCTL_CHILDREN(device_get_sysctl_tree(dev)),
	    OID_AUTO, "clocks",
	    CTLTYPE_STRING | CTLFLAG_RD, clkdom, 0, clkdom_sysctl, "A",
	    "Clock list for the domain");

	return (clkdom);
}

void
clkdom_unlock(struct clkdom *clkdom)
{

	CLK_TOPO_UNLOCK();
}

void
clkdom_xlock(struct clkdom *clkdom)
{

	CLK_TOPO_XLOCK();
}

/*
 * Finalize initialization of clock domain; the topology lock is taken and
 * released internally.
 * XXX Revisit failure handling.
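 *
 * A provider driver is expected to go through roughly this sequence
 * (illustrative sketch only; "my_class" and "def" stand for a
 * driver-supplied clknode class and clknode_init_def):
 *
 *	struct clkdom *clkdom;
 *	struct clknode *clknode;
 *
 *	clkdom = clkdom_create(dev);
 *	clknode = clknode_create(clkdom, &my_class, &def);
 *	clknode_register(clkdom, clknode);
 *	clkdom_finit(clkdom);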
*/ int clkdom_finit(struct clkdom *clkdom) { struct clknode *clknode; int i, rv; #ifdef FDT phandle_t node; if ((node = ofw_bus_get_node(clkdom->dev)) == -1) { device_printf(clkdom->dev, "%s called on not ofw based device\n", __func__); return (ENXIO); } #endif rv = 0; /* Make clock domain globally visible. */ CLK_TOPO_XLOCK(); TAILQ_INSERT_TAIL(&clkdom_list, clkdom, link); #ifdef FDT OF_device_register_xref(OF_xref_from_node(node), clkdom->dev); #endif /* Register all clock names into global list. */ TAILQ_FOREACH(clknode, &clkdom->clknode_list, clkdom_link) { TAILQ_INSERT_TAIL(&clknode_list, clknode, clklist_link); } /* * At this point all domain nodes must be registered and all * parents must be valid. */ TAILQ_FOREACH(clknode, &clkdom->clknode_list, clkdom_link) { if (clknode->parent_cnt == 0) continue; for (i = 0; i < clknode->parent_cnt; i++) { if (clknode->parents[i] != NULL) continue; if (clknode->parent_names[i] == NULL) continue; clknode->parents[i] = clknode_find_by_name( clknode->parent_names[i]); if (clknode->parents[i] == NULL) { device_printf(clkdom->dev, "Clock %s have unknown parent: %s\n", clknode->name, clknode->parent_names[i]); rv = ENODEV; } } /* If parent index is not set yet... */ if (clknode->parent_idx == CLKNODE_IDX_NONE) { device_printf(clkdom->dev, "Clock %s have not set parent idx\n", clknode->name); rv = ENXIO; continue; } if (clknode->parents[clknode->parent_idx] == NULL) { device_printf(clkdom->dev, "Clock %s have unknown parent(idx %d): %s\n", clknode->name, clknode->parent_idx, clknode->parent_names[clknode->parent_idx]); rv = ENXIO; continue; } clknode_adjust_parent(clknode, clknode->parent_idx); } CLK_TOPO_UNLOCK(); return (rv); } /* Dump clock domain. */ void clkdom_dump(struct clkdom * clkdom) { struct clknode *clknode; int rv; uint64_t freq; CLK_TOPO_SLOCK(); TAILQ_FOREACH(clknode, &clkdom->clknode_list, clkdom_link) { rv = clknode_get_freq(clknode, &freq); printf("Clock: %s, parent: %s(%d), freq: %ju\n", clknode->name, clknode->parent == NULL ? "(NULL)" : clknode->parent->name, clknode->parent_idx, (uintmax_t)((rv == 0) ? freq: rv)); } CLK_TOPO_UNLOCK(); } /* * Create and initialize clock object, but do not register it. */ struct clknode * clknode_create(struct clkdom * clkdom, clknode_class_t clknode_class, const struct clknode_init_def *def) { struct clknode *clknode; struct sysctl_oid *clknode_oid; KASSERT(def->name != NULL, ("clock name is NULL")); KASSERT(def->name[0] != '\0', ("clock name is empty")); #ifdef INVARIANTS CLK_TOPO_SLOCK(); if (clknode_find_by_name(def->name) != NULL) panic("Duplicated clock registration: %s\n", def->name); CLK_TOPO_UNLOCK(); #endif /* Create object and initialize it. */ clknode = malloc(sizeof(struct clknode), M_CLOCK, M_WAITOK | M_ZERO); kobj_init((kobj_t)clknode, (kobj_class_t)clknode_class); sx_init(&clknode->lock, "Clocknode lock"); /* Allocate softc if required. */ if (clknode_class->size > 0) { clknode->softc = malloc(clknode_class->size, M_CLOCK, M_WAITOK | M_ZERO); } /* Prepare array for ptrs to parent clocks. */ clknode->parents = malloc(sizeof(struct clknode *) * def->parent_cnt, M_CLOCK, M_WAITOK | M_ZERO); /* Copy all strings unless they're flagged as static. */ if (def->flags & CLK_NODE_STATIC_STRINGS) { clknode->name = def->name; clknode->parent_names = def->parent_names; } else { clknode->name = strdup(def->name, M_CLOCK); clknode->parent_names = strdup_list(def->parent_names, def->parent_cnt); } /* Rest of init. 
*/ clknode->id = def->id; clknode->clkdom = clkdom; clknode->flags = def->flags; clknode->parent_cnt = def->parent_cnt; clknode->parent = NULL; clknode->parent_idx = CLKNODE_IDX_NONE; TAILQ_INIT(&clknode->children); sysctl_ctx_init(&clknode->sysctl_ctx); clknode_oid = SYSCTL_ADD_NODE(&clknode->sysctl_ctx, SYSCTL_STATIC_CHILDREN(_clock), OID_AUTO, clknode->name, CTLFLAG_RD, 0, "A clock node"); SYSCTL_ADD_U64(&clknode->sysctl_ctx, SYSCTL_CHILDREN(clknode_oid), OID_AUTO, "frequency", CTLFLAG_RD, &clknode->freq, 0, "The clock frequency"); SYSCTL_ADD_PROC(&clknode->sysctl_ctx, SYSCTL_CHILDREN(clknode_oid), OID_AUTO, "parent", CTLTYPE_STRING | CTLFLAG_RD, clknode, CLKNODE_SYSCTL_PARENT, clknode_sysctl, "A", "The clock parent"); SYSCTL_ADD_PROC(&clknode->sysctl_ctx, SYSCTL_CHILDREN(clknode_oid), OID_AUTO, "parents", CTLTYPE_STRING | CTLFLAG_RD, clknode, CLKNODE_SYSCTL_PARENTS_LIST, clknode_sysctl, "A", "The clock parents list"); SYSCTL_ADD_PROC(&clknode->sysctl_ctx, SYSCTL_CHILDREN(clknode_oid), OID_AUTO, "childrens", CTLTYPE_STRING | CTLFLAG_RD, clknode, CLKNODE_SYSCTL_CHILDREN_LIST, clknode_sysctl, "A", "The clock childrens list"); SYSCTL_ADD_INT(&clknode->sysctl_ctx, SYSCTL_CHILDREN(clknode_oid), OID_AUTO, "enable_cnt", CTLFLAG_RD, &clknode->enable_cnt, 0, "The clock enable counter"); return (clknode); } /* * Register clock object into clock domain hierarchy. */ struct clknode * clknode_register(struct clkdom * clkdom, struct clknode *clknode) { int rv; rv = CLKNODE_INIT(clknode, clknode_get_device(clknode)); if (rv != 0) { printf(" CLKNODE_INIT failed: %d\n", rv); return (NULL); } TAILQ_INSERT_TAIL(&clkdom->clknode_list, clknode, clkdom_link); return (clknode); } /* * Clock providers interface. */ /* * Reparent clock node. */ static void clknode_adjust_parent(struct clknode *clknode, int idx) { CLK_TOPO_XASSERT(); if (clknode->parent_cnt == 0) return; if ((idx == CLKNODE_IDX_NONE) || (idx >= clknode->parent_cnt)) panic("%s: Invalid parent index %d for clock %s", __func__, idx, clknode->name); if (clknode->parents[idx] == NULL) panic("%s: Invalid parent index %d for clock %s", __func__, idx, clknode->name); /* Remove me from old children list. */ if (clknode->parent != NULL) { TAILQ_REMOVE(&clknode->parent->children, clknode, sibling_link); } /* Insert into children list of new parent. */ clknode->parent_idx = idx; clknode->parent = clknode->parents[idx]; TAILQ_INSERT_TAIL(&clknode->parent->children, clknode, sibling_link); } /* * Set parent index - init function. */ void clknode_init_parent_idx(struct clknode *clknode, int idx) { if (clknode->parent_cnt == 0) { clknode->parent_idx = CLKNODE_IDX_NONE; clknode->parent = NULL; return; } if ((idx == CLKNODE_IDX_NONE) || (idx >= clknode->parent_cnt) || (clknode->parent_names[idx] == NULL)) panic("%s: Invalid parent index %d for clock %s", __func__, idx, clknode->name); clknode->parent_idx = idx; } int clknode_set_parent_by_idx(struct clknode *clknode, int idx) { int rv; uint64_t freq; int oldidx; /* We have exclusive topology lock, node lock is not needed. 
 */
	CLK_TOPO_XASSERT();

	if (clknode->parent_cnt == 0)
		return (0);
	if (clknode->parent_idx == idx)
		return (0);

	oldidx = clknode->parent_idx;
	clknode_adjust_parent(clknode, idx);
	rv = CLKNODE_SET_MUX(clknode, idx);
	if (rv != 0) {
		clknode_adjust_parent(clknode, oldidx);
		return (rv);
	}

	rv = clknode_get_freq(clknode->parent, &freq);
	if (rv != 0)
		return (rv);
	rv = clknode_refresh_cache(clknode, freq);
	return (rv);
}

int
clknode_set_parent_by_name(struct clknode *clknode, const char *name)
{
	int rv;
	uint64_t freq;
	int oldidx, idx;

	/* We have exclusive topology lock, node lock is not needed. */
	CLK_TOPO_XASSERT();

	if (clknode->parent_cnt == 0)
		return (0);

	/*
	 * If this node doesn't have a mux, pass the request through to the
	 * parent.  This feature is used in clock domain initialization and
	 * allows us to set the clock source and target frequency on the tail
	 * node of the clock chain.
	 */
	if (clknode->parent_cnt == 1) {
		rv = clknode_set_parent_by_name(clknode->parent, name);
		return (rv);
	}

	for (idx = 0; idx < clknode->parent_cnt; idx++) {
		if (clknode->parent_names[idx] == NULL)
			continue;
		if (strcmp(clknode->parent_names[idx], name) == 0)
			break;
	}
	if (idx >= clknode->parent_cnt) {
		return (ENXIO);
	}
	if (clknode->parent_idx == idx)
		return (0);

	oldidx = clknode->parent_idx;
	clknode_adjust_parent(clknode, idx);
	rv = CLKNODE_SET_MUX(clknode, idx);
	if (rv != 0) {
		clknode_adjust_parent(clknode, oldidx);
		return (rv);
	}

	rv = clknode_get_freq(clknode->parent, &freq);
	if (rv != 0)
		return (rv);
	rv = clknode_refresh_cache(clknode, freq);
	return (rv);
}

struct clknode *
clknode_get_parent(struct clknode *clknode)
{

	return (clknode->parent);
}

const char *
clknode_get_name(struct clknode *clknode)
{

	return (clknode->name);
}

const char **
clknode_get_parent_names(struct clknode *clknode)
{

	return (clknode->parent_names);
}

int
clknode_get_parents_num(struct clknode *clknode)
{

	return (clknode->parent_cnt);
}

int
clknode_get_parent_idx(struct clknode *clknode)
{

	return (clknode->parent_idx);
}

int
clknode_get_flags(struct clknode *clknode)
{

	return (clknode->flags);
}

void *
clknode_get_softc(struct clknode *clknode)
{

	return (clknode->softc);
}

device_t
clknode_get_device(struct clknode *clknode)
{

	return (clknode->clkdom->dev);
}

#ifdef FDT
void
clkdom_set_ofw_mapper(struct clkdom * clkdom, clknode_ofw_mapper_func *map)
{

	clkdom->ofw_mapper = map;
}
#endif

/*
 * Real consumers executive
 */
int
clknode_get_freq(struct clknode *clknode, uint64_t *freq)
{
	int rv;

	CLK_TOPO_ASSERT();

	/* Use cached value, if it exists. */
	*freq = clknode->freq;
	if (*freq != 0)
		return (0);

	/* Get frequency from parent, if the clock has a parent. */
	if (clknode->parent_cnt > 0) {
		rv = clknode_get_freq(clknode->parent, freq);
		if (rv != 0) {
			return (rv);
		}
	}

	/* And recalculate my output frequency. */
	CLKNODE_XLOCK(clknode);
	rv = CLKNODE_RECALC_FREQ(clknode, freq);
	if (rv != 0) {
		CLKNODE_UNLOCK(clknode);
		printf("Cannot get frequency for clk: %s, error: %d\n",
		    clknode->name, rv);
		return (rv);
	}

	/* Save new frequency to cache. */
	clknode->freq = *freq;
	CLKNODE_UNLOCK(clknode);
	return (0);
}

int
clknode_set_freq(struct clknode *clknode, uint64_t freq, int flags,
    int enablecnt)
{
	int rv, done;
	uint64_t parent_freq;

	/* We have exclusive topology lock, node lock is not needed.
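	 * The request is offered to this node first; if its set_freq
	 * method leaves the done/stop flag at zero, the request is passed
	 * on to the parent until some node in the chain accepts it or the
	 * chain ends.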
*/ CLK_TOPO_XASSERT(); /* Check for no change */ if (clknode->freq == freq) return (0); parent_freq = 0; /* * We can set frequency only if * clock is disabled * OR * clock is glitch free and is enabled by calling consumer only */ if ((flags & CLK_SET_DRYRUN) == 0 && clknode->enable_cnt > 1 && clknode->enable_cnt > enablecnt && (clknode->flags & CLK_NODE_GLITCH_FREE) == 0) { return (EBUSY); } /* Get frequency from parent, if the clock has a parent. */ if (clknode->parent_cnt > 0) { rv = clknode_get_freq(clknode->parent, &parent_freq); if (rv != 0) { return (rv); } } /* Set frequency for this clock. */ rv = CLKNODE_SET_FREQ(clknode, parent_freq, &freq, flags, &done); if (rv != 0) { printf("Cannot set frequency for clk: %s, error: %d\n", clknode->name, rv); if ((flags & CLK_SET_DRYRUN) == 0) clknode_refresh_cache(clknode, parent_freq); return (rv); } if (done) { /* Success - invalidate frequency cache for all children. */ if ((flags & CLK_SET_DRYRUN) == 0) { clknode->freq = freq; /* Clock might have reparent during set_freq */ if (clknode->parent_cnt > 0) { rv = clknode_get_freq(clknode->parent, &parent_freq); if (rv != 0) { return (rv); } } clknode_refresh_cache(clknode, parent_freq); } } else if (clknode->parent != NULL) { /* Nothing changed, pass request to parent. */ rv = clknode_set_freq(clknode->parent, freq, flags, enablecnt); } else { /* End of chain without action. */ printf("Cannot set frequency for clk: %s, end of chain\n", clknode->name); rv = ENXIO; } return (rv); } int clknode_enable(struct clknode *clknode) { int rv; CLK_TOPO_ASSERT(); /* Enable clock for each node in chain, starting from source. */ if (clknode->parent_cnt > 0) { rv = clknode_enable(clknode->parent); if (rv != 0) { return (rv); } } /* Handle this node */ CLKNODE_XLOCK(clknode); if (clknode->enable_cnt == 0) { rv = CLKNODE_SET_GATE(clknode, 1); if (rv != 0) { CLKNODE_UNLOCK(clknode); return (rv); } } clknode->enable_cnt++; CLKNODE_UNLOCK(clknode); return (0); } int clknode_disable(struct clknode *clknode) { int rv; CLK_TOPO_ASSERT(); rv = 0; CLKNODE_XLOCK(clknode); /* Disable clock for each node in chain, starting from consumer. */ if ((clknode->enable_cnt == 1) && ((clknode->flags & CLK_NODE_CANNOT_STOP) == 0)) { rv = CLKNODE_SET_GATE(clknode, 0); if (rv != 0) { CLKNODE_UNLOCK(clknode); return (rv); } } clknode->enable_cnt--; CLKNODE_UNLOCK(clknode); if (clknode->parent_cnt > 0) { rv = clknode_disable(clknode->parent); } return (rv); } int clknode_stop(struct clknode *clknode, int depth) { int rv; CLK_TOPO_ASSERT(); rv = 0; CLKNODE_XLOCK(clknode); /* The first node cannot be enabled. */ if ((clknode->enable_cnt != 0) && (depth == 0)) { CLKNODE_UNLOCK(clknode); return (EBUSY); } /* Stop clock for each node in chain, starting from consumer. */ if ((clknode->enable_cnt == 0) && ((clknode->flags & CLK_NODE_CANNOT_STOP) == 0)) { rv = CLKNODE_SET_GATE(clknode, 0); if (rv != 0) { CLKNODE_UNLOCK(clknode); return (rv); } } CLKNODE_UNLOCK(clknode); if (clknode->parent_cnt > 0) rv = clknode_stop(clknode->parent, depth + 1); return (rv); } /* -------------------------------------------------------------------------- * * Clock consumers interface. 
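 *
 * Typical consumer usage (illustrative sketch, error handling omitted;
 * "ahb" is a made-up clock-names entry):
 *
 *	clk_t clk;
 *	uint64_t freq;
 *
 *	clk_get_by_ofw_name(dev, 0, "ahb", &clk);
 *	clk_enable(clk);
 *	clk_get_freq(clk, &freq);
 *	...
 *	clk_release(clk);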
* */ /* Helper function for clk_get*() */ static clk_t clk_create(struct clknode *clknode, device_t dev) { struct clk *clk; CLK_TOPO_ASSERT(); clk = malloc(sizeof(struct clk), M_CLOCK, M_WAITOK); clk->dev = dev; clk->clknode = clknode; clk->enable_cnt = 0; clknode->ref_cnt++; return (clk); } int clk_get_freq(clk_t clk, uint64_t *freq) { int rv; struct clknode *clknode; clknode = clk->clknode; KASSERT(clknode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); CLK_TOPO_SLOCK(); rv = clknode_get_freq(clknode, freq); CLK_TOPO_UNLOCK(); return (rv); } int clk_set_freq(clk_t clk, uint64_t freq, int flags) { int rv; struct clknode *clknode; flags &= CLK_SET_USER_MASK; clknode = clk->clknode; KASSERT(clknode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); CLK_TOPO_XLOCK(); rv = clknode_set_freq(clknode, freq, flags, clk->enable_cnt); CLK_TOPO_UNLOCK(); return (rv); } int clk_test_freq(clk_t clk, uint64_t freq, int flags) { int rv; struct clknode *clknode; flags &= CLK_SET_USER_MASK; clknode = clk->clknode; KASSERT(clknode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); CLK_TOPO_XLOCK(); rv = clknode_set_freq(clknode, freq, flags | CLK_SET_DRYRUN, 0); CLK_TOPO_UNLOCK(); return (rv); } int clk_get_parent(clk_t clk, clk_t *parent) { struct clknode *clknode; struct clknode *parentnode; clknode = clk->clknode; KASSERT(clknode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); CLK_TOPO_SLOCK(); parentnode = clknode_get_parent(clknode); if (parentnode == NULL) { CLK_TOPO_UNLOCK(); return (ENODEV); } *parent = clk_create(parentnode, clk->dev); CLK_TOPO_UNLOCK(); return (0); } int clk_set_parent_by_clk(clk_t clk, clk_t parent) { int rv; struct clknode *clknode; struct clknode *parentnode; clknode = clk->clknode; parentnode = parent->clknode; KASSERT(clknode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); KASSERT(parentnode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); CLK_TOPO_XLOCK(); rv = clknode_set_parent_by_name(clknode, parentnode->name); CLK_TOPO_UNLOCK(); return (rv); } int clk_enable(clk_t clk) { int rv; struct clknode *clknode; clknode = clk->clknode; KASSERT(clknode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); CLK_TOPO_SLOCK(); rv = clknode_enable(clknode); if (rv == 0) clk->enable_cnt++; CLK_TOPO_UNLOCK(); return (rv); } int clk_disable(clk_t clk) { int rv; struct clknode *clknode; clknode = clk->clknode; KASSERT(clknode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); KASSERT(clk->enable_cnt > 0, ("Attempt to disable already disabled clock: %s\n", clknode->name)); CLK_TOPO_SLOCK(); rv = clknode_disable(clknode); if (rv == 0) clk->enable_cnt--; CLK_TOPO_UNLOCK(); return (rv); } int clk_stop(clk_t clk) { int rv; struct clknode *clknode; clknode = clk->clknode; KASSERT(clknode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); KASSERT(clk->enable_cnt == 0, ("Attempt to stop already enabled clock: %s\n", clknode->name)); CLK_TOPO_SLOCK(); rv = clknode_stop(clknode, 0); CLK_TOPO_UNLOCK(); return (rv); } int clk_release(clk_t clk) { struct clknode *clknode; clknode = clk->clknode; KASSERT(clknode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); CLK_TOPO_SLOCK(); while (clk->enable_cnt > 0) { clknode_disable(clknode); clk->enable_cnt--; } CLKNODE_XLOCK(clknode); clknode->ref_cnt--; CLKNODE_UNLOCK(clknode); 
CLK_TOPO_UNLOCK(); free(clk, M_CLOCK); return (0); } const char * clk_get_name(clk_t clk) { const char *name; struct clknode *clknode; clknode = clk->clknode; KASSERT(clknode->ref_cnt > 0, ("Attempt to access unreferenced clock: %s\n", clknode->name)); name = clknode_get_name(clknode); return (name); } int clk_get_by_name(device_t dev, const char *name, clk_t *clk) { struct clknode *clknode; CLK_TOPO_SLOCK(); clknode = clknode_find_by_name(name); if (clknode == NULL) { CLK_TOPO_UNLOCK(); return (ENODEV); } *clk = clk_create(clknode, dev); CLK_TOPO_UNLOCK(); return (0); } int clk_get_by_id(device_t dev, struct clkdom *clkdom, intptr_t id, clk_t *clk) { struct clknode *clknode; CLK_TOPO_SLOCK(); clknode = clknode_find_by_id(clkdom, id); if (clknode == NULL) { CLK_TOPO_UNLOCK(); return (ENODEV); } *clk = clk_create(clknode, dev); CLK_TOPO_UNLOCK(); return (0); } #ifdef FDT int clk_set_assigned(device_t dev, phandle_t node) { clk_t clk, clk_parent; int error, nclocks, i; error = ofw_bus_parse_xref_list_get_length(node, "assigned-clock-parents", "#clock-cells", &nclocks); if (error != 0) { if (error != ENOENT) device_printf(dev, "cannot parse assigned-clock-parents property\n"); return (error); } for (i = 0; i < nclocks; i++) { error = clk_get_by_ofw_index_prop(dev, 0, "assigned-clock-parents", i, &clk_parent); if (error != 0) { device_printf(dev, "cannot get parent %d\n", i); return (error); } error = clk_get_by_ofw_index_prop(dev, 0, "assigned-clocks", i, &clk); if (error != 0) { device_printf(dev, "cannot get assigned clock %d\n", i); clk_release(clk_parent); return (error); } error = clk_set_parent_by_clk(clk, clk_parent); clk_release(clk_parent); clk_release(clk); if (error != 0) return (error); } return (0); } int clk_get_by_ofw_index_prop(device_t dev, phandle_t cnode, const char *prop, int idx, clk_t *clk) { phandle_t parent, *cells; device_t clockdev; int ncells, rv; struct clkdom *clkdom; struct clknode *clknode; *clk = NULL; if (cnode <= 0) cnode = ofw_bus_get_node(dev); if (cnode <= 0) { device_printf(dev, "%s called on not ofw based device\n", __func__); return (ENXIO); } rv = ofw_bus_parse_xref_list_alloc(cnode, prop, "#clock-cells", idx, &parent, &ncells, &cells); if (rv != 0) { return (rv); } clockdev = OF_device_from_xref(parent); if (clockdev == NULL) { rv = ENODEV; goto done; } CLK_TOPO_SLOCK(); clkdom = clkdom_get_by_dev(clockdev); if (clkdom == NULL){ CLK_TOPO_UNLOCK(); rv = ENXIO; goto done; } rv = clkdom->ofw_mapper(clkdom, ncells, cells, &clknode); if (rv == 0) { *clk = clk_create(clknode, dev); } CLK_TOPO_UNLOCK(); done: if (cells != NULL) OF_prop_free(cells); return (rv); } int clk_get_by_ofw_index(device_t dev, phandle_t cnode, int idx, clk_t *clk) { return (clk_get_by_ofw_index_prop(dev, cnode, "clocks", idx, clk)); } int clk_get_by_ofw_name(device_t dev, phandle_t cnode, const char *name, clk_t *clk) { int rv, idx; if (cnode <= 0) cnode = ofw_bus_get_node(dev); if (cnode <= 0) { device_printf(dev, "%s called on not ofw based device\n", __func__); return (ENXIO); } rv = ofw_bus_find_string_index(cnode, "clock-names", name, &idx); if (rv != 0) return (rv); return (clk_get_by_ofw_index(dev, cnode, idx, clk)); } /* -------------------------------------------------------------------------- * * Support functions for parsing various clock related OFW things. */ /* * Get "clock-output-names" and (optional) "clock-indices" lists. * Both lists are alocated using M_OFWPROP specifier. * * Returns number of items or 0. 
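 *
 * Example of the properties being parsed (hypothetical FDT fragment):
 *
 *	clock-output-names = "pll", "pll-half";
 *	clock-indices = <0 2>;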
*/ int clk_parse_ofw_out_names(device_t dev, phandle_t node, const char ***out_names, uint32_t **indices) { int name_items, rv; *out_names = NULL; *indices = NULL; if (!OF_hasprop(node, "clock-output-names")) return (0); rv = ofw_bus_string_list_to_array(node, "clock-output-names", out_names); if (rv <= 0) return (0); name_items = rv; if (!OF_hasprop(node, "clock-indices")) return (name_items); - rv = OF_getencprop_alloc(node, "clock-indices", sizeof (uint32_t), + rv = OF_getencprop_alloc_multi(node, "clock-indices", sizeof (uint32_t), (void **)indices); if (rv != name_items) { device_printf(dev, " Size of 'clock-output-names' and " "'clock-indices' differs\n"); OF_prop_free(*out_names); OF_prop_free(*indices); return (0); } return (name_items); } /* * Get output clock name for single output clock node. */ int clk_parse_ofw_clk_name(device_t dev, phandle_t node, const char **name) { const char **out_names; const char *tmp_name; int rv; *name = NULL; if (!OF_hasprop(node, "clock-output-names")) { tmp_name = ofw_bus_get_name(dev); if (tmp_name == NULL) return (ENXIO); *name = strdup(tmp_name, M_OFWPROP); return (0); } rv = ofw_bus_string_list_to_array(node, "clock-output-names", &out_names); if (rv != 1) { OF_prop_free(out_names); device_printf(dev, "Malformed 'clock-output-names' property\n"); return (ENXIO); } *name = strdup(out_names[0], M_OFWPROP); OF_prop_free(out_names); return (0); } #endif static int clkdom_sysctl(SYSCTL_HANDLER_ARGS) { struct clkdom *clkdom = arg1; struct clknode *clknode; struct sbuf *sb; int ret; sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); CLK_TOPO_SLOCK(); TAILQ_FOREACH(clknode, &clkdom->clknode_list, clkdom_link) { sbuf_printf(sb, "%s ", clknode->name); } CLK_TOPO_UNLOCK(); ret = sbuf_finish(sb); sbuf_delete(sb); return (ret); } static int clknode_sysctl(SYSCTL_HANDLER_ARGS) { struct clknode *clknode, *children; enum clknode_sysctl_type type = arg2; struct sbuf *sb; const char **parent_names; int ret, i; clknode = arg1; sb = sbuf_new_for_sysctl(NULL, NULL, 512, req); if (sb == NULL) return (ENOMEM); CLK_TOPO_SLOCK(); switch (type) { case CLKNODE_SYSCTL_PARENT: if (clknode->parent) sbuf_printf(sb, "%s", clknode->parent->name); break; case CLKNODE_SYSCTL_PARENTS_LIST: parent_names = clknode_get_parent_names(clknode); for (i = 0; i < clknode->parent_cnt; i++) sbuf_printf(sb, "%s ", parent_names[i]); break; case CLKNODE_SYSCTL_CHILDREN_LIST: TAILQ_FOREACH(children, &(clknode->children), sibling_link) { sbuf_printf(sb, "%s ", children->name); } break; } CLK_TOPO_UNLOCK(); ret = sbuf_finish(sb); sbuf_delete(sb); return (ret); } Index: user/markj/netdump/sys/dev/extres/phy/phy.c =================================================================== --- user/markj/netdump/sys/dev/extres/phy/phy.c (revision 332407) +++ user/markj/netdump/sys/dev/extres/phy/phy.c (revision 332408) @@ -1,585 +1,585 @@ /*- * Copyright 2016 Michal Meloun * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_platform.h" #include #include #include #include #include #include #include #include #ifdef FDT #include #include #endif #include #include "phydev_if.h" MALLOC_DEFINE(M_PHY, "phy", "Phy framework"); /* Forward declarations. */ struct phy; struct phynode; typedef TAILQ_HEAD(phynode_list, phynode) phynode_list_t; typedef TAILQ_HEAD(phy_list, phy) phy_list_t; /* Default phy methods. */ static int phynode_method_init(struct phynode *phynode); static int phynode_method_enable(struct phynode *phynode, bool disable); static int phynode_method_status(struct phynode *phynode, int *status); /* * Phy controller methods. */ static phynode_method_t phynode_methods[] = { PHYNODEMETHOD(phynode_init, phynode_method_init), PHYNODEMETHOD(phynode_enable, phynode_method_enable), PHYNODEMETHOD(phynode_status, phynode_method_status), PHYNODEMETHOD_END }; DEFINE_CLASS_0(phynode, phynode_class, phynode_methods, 0); /* * Phy node */ struct phynode { KOBJ_FIELDS; TAILQ_ENTRY(phynode) phylist_link; /* Global list entry */ phy_list_t consumers_list; /* Consumers list */ /* Details of this device. */ const char *name; /* Globally unique name */ device_t pdev; /* Producer device_t */ void *softc; /* Producer softc */ intptr_t id; /* Per producer unique id */ #ifdef FDT phandle_t ofw_node; /* OFW node of phy */ #endif struct sx lock; /* Lock for this phy */ int ref_cnt; /* Reference counter */ int enable_cnt; /* Enabled counter */ }; struct phy { device_t cdev; /* consumer device*/ struct phynode *phynode; TAILQ_ENTRY(phy) link; /* Consumers list entry */ int enable_cnt; }; static phynode_list_t phynode_list = TAILQ_HEAD_INITIALIZER(phynode_list); static struct sx phynode_topo_lock; SX_SYSINIT(phy_topology, &phynode_topo_lock, "Phy topology lock"); #define PHY_TOPO_SLOCK() sx_slock(&phynode_topo_lock) #define PHY_TOPO_XLOCK() sx_xlock(&phynode_topo_lock) #define PHY_TOPO_UNLOCK() sx_unlock(&phynode_topo_lock) #define PHY_TOPO_ASSERT() sx_assert(&phynode_topo_lock, SA_LOCKED) #define PHY_TOPO_XASSERT() sx_assert(&phynode_topo_lock, SA_XLOCKED) #define PHYNODE_SLOCK(_sc) sx_slock(&((_sc)->lock)) #define PHYNODE_XLOCK(_sc) sx_xlock(&((_sc)->lock)) #define PHYNODE_UNLOCK(_sc) sx_unlock(&((_sc)->lock)) /* ---------------------------------------------------------------------------- * * Default phy methods for base class. * */ static int phynode_method_init(struct phynode *phynode) { return (0); } static int phynode_method_enable(struct phynode *phynode, bool enable) { if (!enable) return (ENXIO); return (0); } static int phynode_method_status(struct phynode *phynode, int *status) { *status = PHY_STATUS_ENABLED; return (0); } /* ---------------------------------------------------------------------------- * * Internal functions. 
* */ /* * Create and initialize phy object, but do not register it. */ struct phynode * phynode_create(device_t pdev, phynode_class_t phynode_class, struct phynode_init_def *def) { struct phynode *phynode; /* Create object and initialize it. */ phynode = malloc(sizeof(struct phynode), M_PHY, M_WAITOK | M_ZERO); kobj_init((kobj_t)phynode, (kobj_class_t)phynode_class); sx_init(&phynode->lock, "Phy node lock"); /* Allocate softc if required. */ if (phynode_class->size > 0) { phynode->softc = malloc(phynode_class->size, M_PHY, M_WAITOK | M_ZERO); } /* Rest of init. */ TAILQ_INIT(&phynode->consumers_list); phynode->id = def->id; phynode->pdev = pdev; #ifdef FDT phynode->ofw_node = def->ofw_node; #endif return (phynode); } /* Register phy object. */ struct phynode * phynode_register(struct phynode *phynode) { int rv; #ifdef FDT if (phynode->ofw_node <= 0) phynode->ofw_node = ofw_bus_get_node(phynode->pdev); if (phynode->ofw_node <= 0) return (NULL); #endif rv = PHYNODE_INIT(phynode); if (rv != 0) { printf("PHYNODE_INIT failed: %d\n", rv); return (NULL); } PHY_TOPO_XLOCK(); TAILQ_INSERT_TAIL(&phynode_list, phynode, phylist_link); PHY_TOPO_UNLOCK(); #ifdef FDT OF_device_register_xref(OF_xref_from_node(phynode->ofw_node), phynode->pdev); #endif return (phynode); } static struct phynode * phynode_find_by_id(device_t dev, intptr_t id) { struct phynode *entry; PHY_TOPO_ASSERT(); TAILQ_FOREACH(entry, &phynode_list, phylist_link) { if ((entry->pdev == dev) && (entry->id == id)) return (entry); } return (NULL); } /* -------------------------------------------------------------------------- * * Phy providers interface * */ void * phynode_get_softc(struct phynode *phynode) { return (phynode->softc); } device_t phynode_get_device(struct phynode *phynode) { return (phynode->pdev); } intptr_t phynode_get_id(struct phynode *phynode) { return (phynode->id); } #ifdef FDT phandle_t phynode_get_ofw_node(struct phynode *phynode) { return (phynode->ofw_node); } #endif /* -------------------------------------------------------------------------- * * Real consumers executive * */ /* * Enable phy. */ int phynode_enable(struct phynode *phynode) { int rv; PHY_TOPO_ASSERT(); PHYNODE_XLOCK(phynode); if (phynode->enable_cnt == 0) { rv = PHYNODE_ENABLE(phynode, true); if (rv != 0) { PHYNODE_UNLOCK(phynode); return (rv); } } phynode->enable_cnt++; PHYNODE_UNLOCK(phynode); return (0); } /* * Disable phy. */ int phynode_disable(struct phynode *phynode) { int rv; PHY_TOPO_ASSERT(); PHYNODE_XLOCK(phynode); if (phynode->enable_cnt == 1) { rv = PHYNODE_ENABLE(phynode, false); if (rv != 0) { PHYNODE_UNLOCK(phynode); return (rv); } } phynode->enable_cnt--; PHYNODE_UNLOCK(phynode); return (0); } /* * Get phy status. (PHY_STATUS_*) */ int phynode_status(struct phynode *phynode, int *status) { int rv; PHY_TOPO_ASSERT(); PHYNODE_XLOCK(phynode); rv = PHYNODE_STATUS(phynode, status); PHYNODE_UNLOCK(phynode); return (rv); } /* -------------------------------------------------------------------------- * * Phy consumers interface. 
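 *
 * Typical consumer usage (illustrative sketch, error handling omitted;
 * "usb" is a made-up phy-names entry):
 *
 *	phy_t phy;
 *
 *	phy_get_by_ofw_name(dev, 0, "usb", &phy);
 *	phy_enable(phy);
 *	...
 *	phy_release(phy);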
* */ /* Helper function for phy_get*() */ static phy_t phy_create(struct phynode *phynode, device_t cdev) { struct phy *phy; PHY_TOPO_ASSERT(); phy = malloc(sizeof(struct phy), M_PHY, M_WAITOK | M_ZERO); phy->cdev = cdev; phy->phynode = phynode; phy->enable_cnt = 0; PHYNODE_XLOCK(phynode); phynode->ref_cnt++; TAILQ_INSERT_TAIL(&phynode->consumers_list, phy, link); PHYNODE_UNLOCK(phynode); return (phy); } int phy_enable(phy_t phy) { int rv; struct phynode *phynode; phynode = phy->phynode; KASSERT(phynode->ref_cnt > 0, ("Attempt to access unreferenced phy.\n")); PHY_TOPO_SLOCK(); rv = phynode_enable(phynode); if (rv == 0) phy->enable_cnt++; PHY_TOPO_UNLOCK(); return (rv); } int phy_disable(phy_t phy) { int rv; struct phynode *phynode; phynode = phy->phynode; KASSERT(phynode->ref_cnt > 0, ("Attempt to access unreferenced phy.\n")); KASSERT(phy->enable_cnt > 0, ("Attempt to disable already disabled phy.\n")); PHY_TOPO_SLOCK(); rv = phynode_disable(phynode); if (rv == 0) phy->enable_cnt--; PHY_TOPO_UNLOCK(); return (rv); } int phy_status(phy_t phy, int *status) { int rv; struct phynode *phynode; phynode = phy->phynode; KASSERT(phynode->ref_cnt > 0, ("Attempt to access unreferenced phy.\n")); PHY_TOPO_SLOCK(); rv = phynode_status(phynode, status); PHY_TOPO_UNLOCK(); return (rv); } int phy_get_by_id(device_t consumer_dev, device_t provider_dev, intptr_t id, phy_t *phy) { struct phynode *phynode; PHY_TOPO_SLOCK(); phynode = phynode_find_by_id(provider_dev, id); if (phynode == NULL) { PHY_TOPO_UNLOCK(); return (ENODEV); } *phy = phy_create(phynode, consumer_dev); PHY_TOPO_UNLOCK(); return (0); } void phy_release(phy_t phy) { struct phynode *phynode; phynode = phy->phynode; KASSERT(phynode->ref_cnt > 0, ("Attempt to access unreferenced phy.\n")); PHY_TOPO_SLOCK(); while (phy->enable_cnt > 0) { phynode_disable(phynode); phy->enable_cnt--; } PHYNODE_XLOCK(phynode); TAILQ_REMOVE(&phynode->consumers_list, phy, link); phynode->ref_cnt--; PHYNODE_UNLOCK(phynode); PHY_TOPO_UNLOCK(); free(phy, M_PHY); } #ifdef FDT int phydev_default_ofw_map(device_t provider, phandle_t xref, int ncells, pcell_t *cells, intptr_t *id) { struct phynode *entry; phandle_t node; /* Single device can register multiple subnodes. */ if (ncells == 0) { node = OF_node_from_xref(xref); PHY_TOPO_XLOCK(); TAILQ_FOREACH(entry, &phynode_list, phylist_link) { if ((entry->pdev == provider) && (entry->ofw_node == node)) { *id = entry->id; PHY_TOPO_UNLOCK(); return (0); } } PHY_TOPO_UNLOCK(); return (ERANGE); } /* First cell is ID. */ if (ncells == 1) { *id = cells[0]; return (0); } /* No default way how to get ID, custom mapper is required. */ return (ERANGE); } int phy_get_by_ofw_idx(device_t consumer_dev, phandle_t cnode, int idx, phy_t *phy) { phandle_t xnode; pcell_t *cells; device_t phydev; int ncells, rv; intptr_t id; if (cnode <= 0) cnode = ofw_bus_get_node(consumer_dev); if (cnode <= 0) { device_printf(consumer_dev, "%s called on not ofw based device\n", __func__); return (ENXIO); } rv = ofw_bus_parse_xref_list_alloc(cnode, "phys", "#phy-cells", idx, &xnode, &ncells, &cells); if (rv != 0) return (rv); /* Tranlate provider to device. */ phydev = OF_device_from_xref(xnode); if (phydev == NULL) { OF_prop_free(cells); return (ENODEV); } /* Map phy to number. 
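	 * The provider's PHYDEV_MAP method translates the FDT specifier
	 * cells into its per-provider id; the default mapper above accepts
	 * either zero cells (match by OFW node) or a single cell holding
	 * the id directly.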
*/ rv = PHYDEV_MAP(phydev, xnode, ncells, cells, &id); OF_prop_free(cells); if (rv != 0) return (rv); return (phy_get_by_id(consumer_dev, phydev, id, phy)); } int phy_get_by_ofw_name(device_t consumer_dev, phandle_t cnode, char *name, phy_t *phy) { int rv, idx; if (cnode <= 0) cnode = ofw_bus_get_node(consumer_dev); if (cnode <= 0) { device_printf(consumer_dev, "%s called on not ofw based device\n", __func__); return (ENXIO); } rv = ofw_bus_find_string_index(cnode, "phy-names", name, &idx); if (rv != 0) return (rv); return (phy_get_by_ofw_idx(consumer_dev, cnode, idx, phy)); } int phy_get_by_ofw_property(device_t consumer_dev, phandle_t cnode, char *name, phy_t *phy) { pcell_t *cells; device_t phydev; int ncells, rv; intptr_t id; if (cnode <= 0) cnode = ofw_bus_get_node(consumer_dev); if (cnode <= 0) { device_printf(consumer_dev, "%s called on not ofw based device\n", __func__); return (ENXIO); } - ncells = OF_getencprop_alloc(cnode, name, sizeof(pcell_t), + ncells = OF_getencprop_alloc_multi(cnode, name, sizeof(pcell_t), (void **)&cells); if (ncells < 1) return (ENXIO); /* Tranlate provider to device. */ phydev = OF_device_from_xref(cells[0]); if (phydev == NULL) { OF_prop_free(cells); return (ENODEV); } /* Map phy to number. */ rv = PHYDEV_MAP(phydev, cells[0], ncells - 1 , cells + 1, &id); OF_prop_free(cells); if (rv != 0) return (rv); return (phy_get_by_id(consumer_dev, phydev, id, phy)); } #endif Index: user/markj/netdump/sys/dev/extres/regulator/regulator.c =================================================================== --- user/markj/netdump/sys/dev/extres/regulator/regulator.c (revision 332407) +++ user/markj/netdump/sys/dev/extres/regulator/regulator.c (revision 332408) @@ -1,1191 +1,1191 @@ /*- * Copyright 2016 Michal Meloun * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_platform.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef FDT #include #include #include #endif #include #include "regdev_if.h" SYSCTL_NODE(_hw, OID_AUTO, regulator, CTLFLAG_RD, NULL, "Regulators"); MALLOC_DEFINE(M_REGULATOR, "regulator", "Regulator framework"); #define DIV_ROUND_UP(n,d) howmany(n, d) /* Forward declarations. 
 */
struct regulator;
struct regnode;

typedef TAILQ_HEAD(regnode_list, regnode) regnode_list_t;
typedef TAILQ_HEAD(regulator_list, regulator) regulator_list_t;

/* Default regulator methods. */
static int regnode_method_enable(struct regnode *regnode, bool enable,
    int *udelay);
static int regnode_method_status(struct regnode *regnode, int *status);
static int regnode_method_set_voltage(struct regnode *regnode, int min_uvolt,
    int max_uvolt, int *udelay);
static int regnode_method_get_voltage(struct regnode *regnode, int *uvolt);
static void regulator_shutdown(void *dummy);

/*
 * Regulator controller methods.
 */
static regnode_method_t regnode_methods[] = {
	REGNODEMETHOD(regnode_enable,		regnode_method_enable),
	REGNODEMETHOD(regnode_status,		regnode_method_status),
	REGNODEMETHOD(regnode_set_voltage,	regnode_method_set_voltage),
	REGNODEMETHOD(regnode_get_voltage,	regnode_method_get_voltage),

	REGNODEMETHOD_END
};
DEFINE_CLASS_0(regnode, regnode_class, regnode_methods, 0);

/*
 * Regulator node - basic element for modelling SOC and board power supply
 * chains.  It contains producer data.
 */
struct regnode {
	KOBJ_FIELDS;

	TAILQ_ENTRY(regnode)	reglist_link;	/* Global list entry */
	regulator_list_t	consumers_list;	/* Consumers list */

	/* Cache for already resolved names */
	struct regnode		*parent;	/* Resolved parent */

	/* Details of this device. */
	const char		*name;		/* Globally unique name */
	const char		*parent_name;	/* Parent name */
	device_t		pdev;		/* Producer device_t */
	void			*softc;		/* Producer softc */
	intptr_t		id;		/* Per producer unique id */
#ifdef FDT
	phandle_t		ofw_node;	/* OFW node of regulator */
#endif
	int			flags;		/* REGULATOR_FLAGS_ */
	struct sx		lock;		/* Lock for this regulator */
	int			ref_cnt;	/* Reference counter */
	int			enable_cnt;	/* Enabled counter */

	struct regnode_std_param std_param;	/* Standard parameters */

	struct sysctl_ctx_list	sysctl_ctx;
};

/*
 * Per consumer data, information about how a consumer is using a regulator
 * node.
 * A pointer to this structure is used as a handle in the consumer interface.
 */
struct regulator {
	device_t		cdev;		/* Consumer device */
	struct regnode		*regnode;
	TAILQ_ENTRY(regulator)	link;		/* Consumers list entry */

	int			enable_cnt;
	int			min_uvolt;	/* Requested uvolt range */
	int			max_uvolt;
};

/*
 * Regulator names must be system wide unique.
 */
static regnode_list_t regnode_list = TAILQ_HEAD_INITIALIZER(regnode_list);

static struct sx regnode_topo_lock;
SX_SYSINIT(regulator_topology, &regnode_topo_lock, "Regulator topology lock");

#define REG_TOPO_SLOCK()	sx_slock(&regnode_topo_lock)
#define REG_TOPO_XLOCK()	sx_xlock(&regnode_topo_lock)
#define REG_TOPO_UNLOCK()	sx_unlock(&regnode_topo_lock)
#define REG_TOPO_ASSERT()	sx_assert(&regnode_topo_lock, SA_LOCKED)
#define REG_TOPO_XASSERT()	sx_assert(&regnode_topo_lock, SA_XLOCKED)

#define REGNODE_SLOCK(_sc)	sx_slock(&((_sc)->lock))
#define REGNODE_XLOCK(_sc)	sx_xlock(&((_sc)->lock))
#define REGNODE_UNLOCK(_sc)	sx_unlock(&((_sc)->lock))

SYSINIT(regulator_shutdown, SI_SUB_LAST, SI_ORDER_ANY, regulator_shutdown,
    NULL);

/*
 * Disable unused regulators.
 * This runs at SI_SUB_LAST, which means that every driver that needs a
 * regulator should already have enabled it.
 * The remaining enabled regulators should be those left enabled by the
 * bootloader or enabled by default by the PMIC.
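 *
 * The hw.regulator.disable_unused tunable (fetched below) can be set to 0
 * to leave even unused regulators enabled.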
 */
static void
regulator_shutdown(void *dummy)
{
	struct regnode *entry;
	int disable = 1;

	REG_TOPO_SLOCK();
	TUNABLE_INT_FETCH("hw.regulator.disable_unused", &disable);
	TAILQ_FOREACH(entry, &regnode_list, reglist_link) {
		if (entry->enable_cnt == 0 &&
		    entry->std_param.always_on == 0 && disable) {
			if (bootverbose)
				printf("regulator: shutting down %s\n",
				    entry->name);
			regnode_stop(entry, 0);
		}
	}
	REG_TOPO_UNLOCK();
}

/*
 * sysctl handler
 */
static int
regnode_uvolt_sysctl(SYSCTL_HANDLER_ARGS)
{
	struct regnode *regnode = arg1;
	int rv, uvolt;

	if (regnode->std_param.min_uvolt == regnode->std_param.max_uvolt) {
		uvolt = regnode->std_param.min_uvolt;
	} else {
		REG_TOPO_SLOCK();
		if ((rv = regnode_get_voltage(regnode, &uvolt)) != 0) {
			REG_TOPO_UNLOCK();
			return (rv);
		}
		REG_TOPO_UNLOCK();
	}

	return sysctl_handle_int(oidp, &uvolt, sizeof(uvolt), req);
}

/* ----------------------------------------------------------------------------
 *
 * Default regulator methods for base class.
 *
 */
static int
regnode_method_enable(struct regnode *regnode, bool enable, int *udelay)
{

	if (!enable)
		return (ENXIO);

	*udelay = 0;
	return (0);
}

static int
regnode_method_status(struct regnode *regnode, int *status)
{
	*status = REGULATOR_STATUS_ENABLED;
	return (0);
}

static int
regnode_method_set_voltage(struct regnode *regnode, int min_uvolt,
    int max_uvolt, int *udelay)
{

	if ((min_uvolt > regnode->std_param.max_uvolt) ||
	    (max_uvolt < regnode->std_param.min_uvolt))
		return (ERANGE);
	*udelay = 0;
	return (0);
}

static int
regnode_method_get_voltage(struct regnode *regnode, int *uvolt)
{

	*uvolt = regnode->std_param.min_uvolt +
	    (regnode->std_param.max_uvolt - regnode->std_param.min_uvolt) / 2;
	return (0);
}

/* ----------------------------------------------------------------------------
 *
 * Internal functions.
 *
 */

static struct regnode *
regnode_find_by_name(const char *name)
{
	struct regnode *entry;

	REG_TOPO_ASSERT();

	TAILQ_FOREACH(entry, &regnode_list, reglist_link) {
		if (strcmp(entry->name, name) == 0)
			return (entry);
	}
	return (NULL);
}

static struct regnode *
regnode_find_by_id(device_t dev, intptr_t id)
{
	struct regnode *entry;

	REG_TOPO_ASSERT();

	TAILQ_FOREACH(entry, &regnode_list, reglist_link) {
		if ((entry->pdev == dev) && (entry->id == id))
			return (entry);
	}
	return (NULL);
}

/*
 * Create and initialize regulator object, but do not register it.
 */
struct regnode *
regnode_create(device_t pdev, regnode_class_t regnode_class,
    struct regnode_init_def *def)
{
	struct regnode *regnode;
	struct sysctl_oid *regnode_oid;

	KASSERT(def->name != NULL, ("regulator name is NULL"));
	KASSERT(def->name[0] != '\0', ("regulator name is empty"));

	REG_TOPO_SLOCK();
	if (regnode_find_by_name(def->name) != NULL)
		panic("Duplicated regulator registration: %s\n", def->name);
	REG_TOPO_UNLOCK();

	/* Create object and initialize it. */
	regnode = malloc(sizeof(struct regnode), M_REGULATOR,
	    M_WAITOK | M_ZERO);
	kobj_init((kobj_t)regnode, (kobj_class_t)regnode_class);
	sx_init(&regnode->lock, "Regulator node lock");

	/* Allocate softc if required. */
	if (regnode_class->size > 0) {
		regnode->softc = malloc(regnode_class->size, M_REGULATOR,
		    M_WAITOK | M_ZERO);
	}

	/* Copy all strings unless they're flagged as static. */
	if (def->flags & REGULATOR_FLAGS_STATIC) {
		regnode->name = def->name;
		regnode->parent_name = def->parent_name;
	} else {
		regnode->name = strdup(def->name, M_REGULATOR);
		if (def->parent_name != NULL)
			regnode->parent_name = strdup(def->parent_name,
			    M_REGULATOR);
	}

	/* Rest of init.
*/ TAILQ_INIT(®node->consumers_list); regnode->id = def->id; regnode->pdev = pdev; regnode->flags = def->flags; regnode->parent = NULL; regnode->std_param = def->std_param; #ifdef FDT regnode->ofw_node = def->ofw_node; #endif sysctl_ctx_init(®node->sysctl_ctx); regnode_oid = SYSCTL_ADD_NODE(®node->sysctl_ctx, SYSCTL_STATIC_CHILDREN(_hw_regulator), OID_AUTO, regnode->name, CTLFLAG_RD, 0, "A regulator node"); SYSCTL_ADD_INT(®node->sysctl_ctx, SYSCTL_CHILDREN(regnode_oid), OID_AUTO, "min_uvolt", CTLFLAG_RD, ®node->std_param.min_uvolt, 0, "Minimal voltage (in uV)"); SYSCTL_ADD_INT(®node->sysctl_ctx, SYSCTL_CHILDREN(regnode_oid), OID_AUTO, "max_uvolt", CTLFLAG_RD, ®node->std_param.max_uvolt, 0, "Maximal voltage (in uV)"); SYSCTL_ADD_INT(®node->sysctl_ctx, SYSCTL_CHILDREN(regnode_oid), OID_AUTO, "min_uamp", CTLFLAG_RD, ®node->std_param.min_uamp, 0, "Minimal amperage (in uA)"); SYSCTL_ADD_INT(®node->sysctl_ctx, SYSCTL_CHILDREN(regnode_oid), OID_AUTO, "max_uamp", CTLFLAG_RD, ®node->std_param.max_uamp, 0, "Maximal amperage (in uA)"); SYSCTL_ADD_INT(®node->sysctl_ctx, SYSCTL_CHILDREN(regnode_oid), OID_AUTO, "ramp_delay", CTLFLAG_RD, ®node->std_param.ramp_delay, 0, "Ramp delay (in uV/us)"); SYSCTL_ADD_INT(®node->sysctl_ctx, SYSCTL_CHILDREN(regnode_oid), OID_AUTO, "enable_delay", CTLFLAG_RD, ®node->std_param.enable_delay, 0, "Enable delay (in us)"); SYSCTL_ADD_INT(®node->sysctl_ctx, SYSCTL_CHILDREN(regnode_oid), OID_AUTO, "enable_cnt", CTLFLAG_RD, ®node->enable_cnt, 0, "The regulator enable counter"); SYSCTL_ADD_INT(®node->sysctl_ctx, SYSCTL_CHILDREN(regnode_oid), OID_AUTO, "boot_on", CTLFLAG_RD, (int *) ®node->std_param.boot_on, 0, "Is enabled on boot"); SYSCTL_ADD_INT(®node->sysctl_ctx, SYSCTL_CHILDREN(regnode_oid), OID_AUTO, "always_on", CTLFLAG_RD, (int *)®node->std_param.always_on, 0, "Is always enabled"); SYSCTL_ADD_PROC(®node->sysctl_ctx, SYSCTL_CHILDREN(regnode_oid), OID_AUTO, "uvolt", CTLTYPE_INT | CTLFLAG_RD, regnode, 0, regnode_uvolt_sysctl, "I", "Current voltage (in uV)"); return (regnode); } /* Register regulator object. */ struct regnode * regnode_register(struct regnode *regnode) { int rv; #ifdef FDT if (regnode->ofw_node <= 0) regnode->ofw_node = ofw_bus_get_node(regnode->pdev); if (regnode->ofw_node <= 0) return (NULL); #endif rv = REGNODE_INIT(regnode); if (rv != 0) { printf("REGNODE_INIT failed: %d\n", rv); return (NULL); } REG_TOPO_XLOCK(); TAILQ_INSERT_TAIL(®node_list, regnode, reglist_link); REG_TOPO_UNLOCK(); #ifdef FDT OF_device_register_xref(OF_xref_from_node(regnode->ofw_node), regnode->pdev); #endif return (regnode); } static int regnode_resolve_parent(struct regnode *regnode) { /* All ready resolved or no parent? 
*/ if ((regnode->parent != NULL) || (regnode->parent_name == NULL)) return (0); regnode->parent = regnode_find_by_name(regnode->parent_name); if (regnode->parent == NULL) return (ENODEV); return (0); } static void regnode_delay(int usec) { int ticks; if (usec == 0) return; ticks = (usec * hz + 999999) / 1000000; if (cold || ticks < 2) DELAY(usec); else pause("REGULATOR", ticks); } /* -------------------------------------------------------------------------- * * Regulator providers interface * */ const char * regnode_get_name(struct regnode *regnode) { return (regnode->name); } const char * regnode_get_parent_name(struct regnode *regnode) { return (regnode->parent_name); } int regnode_get_flags(struct regnode *regnode) { return (regnode->flags); } void * regnode_get_softc(struct regnode *regnode) { return (regnode->softc); } device_t regnode_get_device(struct regnode *regnode) { return (regnode->pdev); } struct regnode_std_param *regnode_get_stdparam(struct regnode *regnode) { return (®node->std_param); } void regnode_topo_unlock(void) { REG_TOPO_UNLOCK(); } void regnode_topo_xlock(void) { REG_TOPO_XLOCK(); } void regnode_topo_slock(void) { REG_TOPO_SLOCK(); } /* -------------------------------------------------------------------------- * * Real consumers executive * */ struct regnode * regnode_get_parent(struct regnode *regnode) { int rv; REG_TOPO_ASSERT(); rv = regnode_resolve_parent(regnode); if (rv != 0) return (NULL); return (regnode->parent); } /* * Enable regulator. */ int regnode_enable(struct regnode *regnode) { int udelay; int rv; REG_TOPO_ASSERT(); /* Enable regulator for each node in chain, starting from source. */ rv = regnode_resolve_parent(regnode); if (rv != 0) return (rv); if (regnode->parent != NULL) { rv = regnode_enable(regnode->parent); if (rv != 0) return (rv); } /* Handle this node. */ REGNODE_XLOCK(regnode); if (regnode->enable_cnt == 0) { rv = REGNODE_ENABLE(regnode, true, &udelay); if (rv != 0) { REGNODE_UNLOCK(regnode); return (rv); } regnode_delay(udelay); } regnode->enable_cnt++; REGNODE_UNLOCK(regnode); return (0); } /* * Disable regulator. */ int regnode_disable(struct regnode *regnode) { int udelay; int rv; REG_TOPO_ASSERT(); rv = 0; REGNODE_XLOCK(regnode); /* Disable regulator for each node in chain, starting from consumer. */ if ((regnode->enable_cnt == 1) && ((regnode->flags & REGULATOR_FLAGS_NOT_DISABLE) == 0)) { rv = REGNODE_ENABLE(regnode, false, &udelay); if (rv != 0) { REGNODE_UNLOCK(regnode); return (rv); } regnode_delay(udelay); } regnode->enable_cnt--; REGNODE_UNLOCK(regnode); rv = regnode_resolve_parent(regnode); if (rv != 0) return (rv); if (regnode->parent != NULL) rv = regnode_disable(regnode->parent); return (rv); } /* * Stop regulator. */ int regnode_stop(struct regnode *regnode, int depth) { int udelay; int rv; REG_TOPO_ASSERT(); rv = 0; REGNODE_XLOCK(regnode); /* The first node must not be enabled. */ if ((regnode->enable_cnt != 0) && (depth == 0)) { REGNODE_UNLOCK(regnode); return (EBUSY); } /* Disable regulator for each node in chain, starting from consumer */ if ((regnode->enable_cnt == 0) && ((regnode->flags & REGULATOR_FLAGS_NOT_DISABLE) == 0)) { rv = REGNODE_ENABLE(regnode, false, &udelay); if (rv != 0) { REGNODE_UNLOCK(regnode); return (rv); } regnode_delay(udelay); } REGNODE_UNLOCK(regnode); rv = regnode_resolve_parent(regnode); if (rv != 0) return (rv); if (regnode->parent != NULL) rv = regnode_stop(regnode->parent, depth + 1); return (rv); } /* * Get regulator status. 
(REGULATOR_STATUS_*) */ int regnode_status(struct regnode *regnode, int *status) { int rv; REG_TOPO_ASSERT(); REGNODE_XLOCK(regnode); rv = REGNODE_STATUS(regnode, status); REGNODE_UNLOCK(regnode); return (rv); } /* * Get actual regulator voltage. */ int regnode_get_voltage(struct regnode *regnode, int *uvolt) { int rv; REG_TOPO_ASSERT(); REGNODE_XLOCK(regnode); rv = REGNODE_GET_VOLTAGE(regnode, uvolt); REGNODE_UNLOCK(regnode); /* Pass call into parent, if regulator is in bypass mode. */ if (rv == ENOENT) { rv = regnode_resolve_parent(regnode); if (rv != 0) return (rv); if (regnode->parent != NULL) rv = regnode_get_voltage(regnode->parent, uvolt); } return (rv); } /* * Set regulator voltage. */ int regnode_set_voltage(struct regnode *regnode, int min_uvolt, int max_uvolt) { int udelay; int rv; REG_TOPO_ASSERT(); REGNODE_XLOCK(regnode); rv = REGNODE_SET_VOLTAGE(regnode, min_uvolt, max_uvolt, &udelay); if (rv == 0) regnode_delay(udelay); REGNODE_UNLOCK(regnode); return (rv); } /* * Consumer variant of regnode_set_voltage(). */ static int regnode_set_voltage_checked(struct regnode *regnode, struct regulator *reg, int min_uvolt, int max_uvolt) { int udelay; int all_max_uvolt; int all_min_uvolt; struct regulator *tmp; int rv; REG_TOPO_ASSERT(); REGNODE_XLOCK(regnode); /* Return error if requested range is outside of regulator range. */ if ((min_uvolt > regnode->std_param.max_uvolt) || (max_uvolt < regnode->std_param.min_uvolt)) { REGNODE_UNLOCK(regnode); return (ERANGE); } /* Get actual voltage range for all consumers. */ all_min_uvolt = regnode->std_param.min_uvolt; all_max_uvolt = regnode->std_param.max_uvolt; TAILQ_FOREACH(tmp, ®node->consumers_list, link) { /* Don't take requestor in account. */ if (tmp == reg) continue; if (all_min_uvolt < tmp->min_uvolt) all_min_uvolt = tmp->min_uvolt; if (all_max_uvolt > tmp->max_uvolt) all_max_uvolt = tmp->max_uvolt; } /* Test if request fits to actual contract. */ if ((min_uvolt > all_max_uvolt) || (max_uvolt < all_min_uvolt)) { REGNODE_UNLOCK(regnode); return (ERANGE); } /* Adjust new range.*/ if (min_uvolt < all_min_uvolt) min_uvolt = all_min_uvolt; if (max_uvolt > all_max_uvolt) max_uvolt = all_max_uvolt; rv = REGNODE_SET_VOLTAGE(regnode, min_uvolt, max_uvolt, &udelay); regnode_delay(udelay); REGNODE_UNLOCK(regnode); return (rv); } #ifdef FDT phandle_t regnode_get_ofw_node(struct regnode *regnode) { return (regnode->ofw_node); } #endif /* -------------------------------------------------------------------------- * * Regulator consumers interface. 
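 *
 * Typical consumer usage (illustrative sketch, error handling omitted;
 * "vin-supply" is one property name a board might use):
 *
 *	regulator_t reg;
 *
 *	regulator_get_by_ofw_property(dev, 0, "vin-supply", &reg);
 *	regulator_set_voltage(reg, 1800000, 1800000);
 *	regulator_enable(reg);
 *	...
 *	regulator_release(reg);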
* */ /* Helper function for regulator_get*() */ static regulator_t regulator_create(struct regnode *regnode, device_t cdev) { struct regulator *reg; REG_TOPO_ASSERT(); reg = malloc(sizeof(struct regulator), M_REGULATOR, M_WAITOK | M_ZERO); reg->cdev = cdev; reg->regnode = regnode; reg->enable_cnt = 0; REGNODE_XLOCK(regnode); regnode->ref_cnt++; TAILQ_INSERT_TAIL(®node->consumers_list, reg, link); reg ->min_uvolt = regnode->std_param.min_uvolt; reg ->max_uvolt = regnode->std_param.max_uvolt; REGNODE_UNLOCK(regnode); return (reg); } int regulator_enable(regulator_t reg) { int rv; struct regnode *regnode; regnode = reg->regnode; KASSERT(regnode->ref_cnt > 0, ("Attempt to access unreferenced regulator: %s\n", regnode->name)); REG_TOPO_SLOCK(); rv = regnode_enable(regnode); if (rv == 0) reg->enable_cnt++; REG_TOPO_UNLOCK(); return (rv); } int regulator_disable(regulator_t reg) { int rv; struct regnode *regnode; regnode = reg->regnode; KASSERT(regnode->ref_cnt > 0, ("Attempt to access unreferenced regulator: %s\n", regnode->name)); KASSERT(reg->enable_cnt > 0, ("Attempt to disable already disabled regulator: %s\n", regnode->name)); REG_TOPO_SLOCK(); rv = regnode_disable(regnode); if (rv == 0) reg->enable_cnt--; REG_TOPO_UNLOCK(); return (rv); } int regulator_stop(regulator_t reg) { int rv; struct regnode *regnode; regnode = reg->regnode; KASSERT(regnode->ref_cnt > 0, ("Attempt to access unreferenced regulator: %s\n", regnode->name)); KASSERT(reg->enable_cnt == 0, ("Attempt to stop already enabled regulator: %s\n", regnode->name)); REG_TOPO_SLOCK(); rv = regnode_stop(regnode, 0); REG_TOPO_UNLOCK(); return (rv); } int regulator_status(regulator_t reg, int *status) { int rv; struct regnode *regnode; regnode = reg->regnode; KASSERT(regnode->ref_cnt > 0, ("Attempt to access unreferenced regulator: %s\n", regnode->name)); REG_TOPO_SLOCK(); rv = regnode_status(regnode, status); REG_TOPO_UNLOCK(); return (rv); } int regulator_get_voltage(regulator_t reg, int *uvolt) { int rv; struct regnode *regnode; regnode = reg->regnode; KASSERT(regnode->ref_cnt > 0, ("Attempt to access unreferenced regulator: %s\n", regnode->name)); REG_TOPO_SLOCK(); rv = regnode_get_voltage(regnode, uvolt); REG_TOPO_UNLOCK(); return (rv); } int regulator_set_voltage(regulator_t reg, int min_uvolt, int max_uvolt) { struct regnode *regnode; int rv; regnode = reg->regnode; KASSERT(regnode->ref_cnt > 0, ("Attempt to access unreferenced regulator: %s\n", regnode->name)); REG_TOPO_SLOCK(); rv = regnode_set_voltage_checked(regnode, reg, min_uvolt, max_uvolt); if (rv == 0) { reg->min_uvolt = min_uvolt; reg->max_uvolt = max_uvolt; } REG_TOPO_UNLOCK(); return (rv); } const char * regulator_get_name(regulator_t reg) { struct regnode *regnode; regnode = reg->regnode; KASSERT(regnode->ref_cnt > 0, ("Attempt to access unreferenced regulator: %s\n", regnode->name)); return (regnode->name); } int regulator_get_by_name(device_t cdev, const char *name, regulator_t *reg) { struct regnode *regnode; REG_TOPO_SLOCK(); regnode = regnode_find_by_name(name); if (regnode == NULL) { REG_TOPO_UNLOCK(); return (ENODEV); } *reg = regulator_create(regnode, cdev); REG_TOPO_UNLOCK(); return (0); } int regulator_get_by_id(device_t cdev, device_t pdev, intptr_t id, regulator_t *reg) { struct regnode *regnode; REG_TOPO_SLOCK(); regnode = regnode_find_by_id(pdev, id); if (regnode == NULL) { REG_TOPO_UNLOCK(); return (ENODEV); } *reg = regulator_create(regnode, cdev); REG_TOPO_UNLOCK(); return (0); } int regulator_release(regulator_t reg) { struct regnode *regnode; 
regnode = reg->regnode; KASSERT(regnode->ref_cnt > 0, ("Attempt to access unreferenced regulator: %s\n", regnode->name)); REG_TOPO_SLOCK(); while (reg->enable_cnt > 0) { regnode_disable(regnode); reg->enable_cnt--; } REGNODE_XLOCK(regnode); TAILQ_REMOVE(&regnode->consumers_list, reg, link); regnode->ref_cnt--; REGNODE_UNLOCK(regnode); REG_TOPO_UNLOCK(); free(reg, M_REGULATOR); return (0); } #ifdef FDT /* Default DT mapper. */ int regdev_default_ofw_map(device_t dev, phandle_t xref, int ncells, pcell_t *cells, intptr_t *id) { if (ncells == 0) *id = 1; else if (ncells == 1) *id = cells[0]; else return (ERANGE); return (0); } int regulator_parse_ofw_stdparam(device_t pdev, phandle_t node, struct regnode_init_def *def) { phandle_t supply_xref; struct regnode_std_param *par; int rv; par = &def->std_param; rv = OF_getprop_alloc(node, "regulator-name", (void **)&def->name); if (rv <= 0) { device_printf(pdev, "%s: Missing regulator name\n", __func__); return (ENXIO); } rv = OF_getencprop(node, "regulator-min-microvolt", &par->min_uvolt, sizeof(par->min_uvolt)); if (rv <= 0) par->min_uvolt = 0; rv = OF_getencprop(node, "regulator-max-microvolt", &par->max_uvolt, sizeof(par->max_uvolt)); if (rv <= 0) par->max_uvolt = 0; rv = OF_getencprop(node, "regulator-min-microamp", &par->min_uamp, sizeof(par->min_uamp)); if (rv <= 0) par->min_uamp = 0; rv = OF_getencprop(node, "regulator-max-microamp", &par->max_uamp, sizeof(par->max_uamp)); if (rv <= 0) par->max_uamp = 0; rv = OF_getencprop(node, "regulator-ramp-delay", &par->ramp_delay, sizeof(par->ramp_delay)); if (rv <= 0) par->ramp_delay = 0; rv = OF_getencprop(node, "regulator-enable-ramp-delay", &par->enable_delay, sizeof(par->enable_delay)); if (rv <= 0) par->enable_delay = 0; if (OF_hasprop(node, "regulator-boot-on")) par->boot_on = 1; if (OF_hasprop(node, "regulator-always-on")) par->always_on = 1; if (OF_hasprop(node, "enable-active-high")) par->enable_active_high = 1; rv = OF_getencprop(node, "vin-supply", &supply_xref, sizeof(supply_xref)); if (rv >= 0) { rv = OF_getprop_alloc(supply_xref, "regulator-name", (void **)&def->parent_name); if (rv <= 0) def->parent_name = NULL; } return (0); } int regulator_get_by_ofw_property(device_t cdev, phandle_t cnode, char *name, regulator_t *reg) { phandle_t *cells; device_t regdev; int ncells, rv; intptr_t id; *reg = NULL; if (cnode <= 0) cnode = ofw_bus_get_node(cdev); if (cnode <= 0) { device_printf(cdev, "%s called on not ofw based device\n", __func__); return (ENXIO); } cells = NULL; - ncells = OF_getencprop_alloc(cnode, name, sizeof(*cells), + ncells = OF_getencprop_alloc_multi(cnode, name, sizeof(*cells), (void **)&cells); if (ncells <= 0) return (ENXIO); /* Translate xref to device */ regdev = OF_device_from_xref(cells[0]); if (regdev == NULL) { OF_prop_free(cells); return (ENODEV); } /* Map regulator to number */ rv = REGDEV_MAP(regdev, cells[0], ncells - 1, cells + 1, &id); OF_prop_free(cells); if (rv != 0) return (rv); return (regulator_get_by_id(cdev, regdev, id, reg)); } #endif /* -------------------------------------------------------------------------- * * Regulator utility functions.
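 *
 * A worked example of the selector <-> voltage mapping implemented
 * below (the numbers are hypothetical): for a range with min_sel = 0,
 * max_sel = 15, min_uvolt = 800000 and step_uvolt = 50000, selector 4
 * maps to 800000 + 4 * 50000 = 1000000 uV, and a request for
 * 1000000..1100000 uV maps back to selector
 * DIV_ROUND_UP(1000000 - 800000, 50000) = 4.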
* */ /* Convert raw selector value to real voltage */ int regulator_range_sel8_to_volt(struct regulator_range *ranges, int nranges, uint8_t sel, int *volt) { struct regulator_range *range; int i; if (nranges == 0) panic("Voltage regulator has zero ranges\n"); for (i = 0; i < nranges; i++) { range = ranges + i; if (!(sel >= range->min_sel && sel <= range->max_sel)) continue; sel -= range->min_sel; *volt = range->min_uvolt + sel * range->step_uvolt; return (0); } return (ERANGE); } int regulator_range_volt_to_sel8(struct regulator_range *ranges, int nranges, int min_uvolt, int max_uvolt, uint8_t *out_sel) { struct regulator_range *range; uint8_t sel; int uvolt; int rv, i; if (nranges == 0) panic("Voltage regulator has zero ranges\n"); for (i = 0; i < nranges; i++) { range = ranges + i; uvolt = range->min_uvolt + (range->max_sel - range->min_sel) * range->step_uvolt; if ((min_uvolt > uvolt) || (max_uvolt < range->min_uvolt)) continue; if (min_uvolt <= range->min_uvolt) min_uvolt = range->min_uvolt; /* if step == 0 -> fixed voltage range. */ if (range->step_uvolt == 0) sel = 0; else sel = DIV_ROUND_UP(min_uvolt - range->min_uvolt, range->step_uvolt); sel += range->min_sel; break; } if (i >= nranges) return (ERANGE); /* Verify new settings. */ rv = regulator_range_sel8_to_volt(ranges, nranges, sel, &uvolt); if (rv != 0) return (rv); if ((uvolt < min_uvolt) || (uvolt > max_uvolt)) return (ERANGE); *out_sel = sel; return (0); } Index: user/markj/netdump/sys/dev/extres/syscon/syscon.c =================================================================== --- user/markj/netdump/sys/dev/extres/syscon/syscon.c (revision 332407) +++ user/markj/netdump/sys/dev/extres/syscon/syscon.c (revision 332408) @@ -1,257 +1,257 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2017 Kyle Evans * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * This is a generic syscon driver, whose purpose is to provide access to * various unrelated bits packed in a single register space. It is usually used * as a fallback to a more specific driver, but works well enough for simple * access.
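 *
 * A minimal consumer sketch, assuming the usual syscon_if accessors
 * (the "syscon" property name and the register offset are
 * hypothetical):
 *
 *	struct syscon *syscon;
 *	uint32_t val;
 *
 *	if (syscon_get_by_ofw_property(dev, 0, "syscon", &syscon) == 0)
 *		val = SYSCON_READ_4(syscon, 0x30);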
*/ #include __FBSDID("$FreeBSD$"); #include "opt_platform.h" #include #include #include #include #include #include #include #include #include #include #include #ifdef FDT #include #include #endif #include "syscon_if.h" #include "syscon.h" /* * Syscon interface details */ typedef TAILQ_HEAD(syscon_list, syscon) syscon_list_t; /* * Declarations */ static int syscon_method_init(struct syscon *syscon); static int syscon_method_uninit(struct syscon *syscon); MALLOC_DEFINE(M_SYSCON, "syscon", "Syscon driver"); static syscon_list_t syscon_list = TAILQ_HEAD_INITIALIZER(syscon_list); static struct sx syscon_topo_lock; SX_SYSINIT(syscon_topology, &syscon_topo_lock, "Syscon topology lock"); /* * Syscon methods. */ static syscon_method_t syscon_methods[] = { SYSCONMETHOD(syscon_init, syscon_method_init), SYSCONMETHOD(syscon_uninit, syscon_method_uninit), SYSCONMETHOD_END }; DEFINE_CLASS_0(syscon, syscon_class, syscon_methods, 0); #define SYSCON_TOPO_SLOCK() sx_slock(&syscon_topo_lock) #define SYSCON_TOPO_XLOCK() sx_xlock(&syscon_topo_lock) #define SYSCON_TOPO_UNLOCK() sx_unlock(&syscon_topo_lock) #define SYSCON_TOPO_ASSERT() sx_assert(&syscon_topo_lock, SA_LOCKED) #define SYSCON_TOPO_XASSERT() sx_assert(&syscon_topo_lock, SA_XLOCKED) /* * Default syscon methods for base class. */ static int syscon_method_init(struct syscon *syscon) { return (0); }; static int syscon_method_uninit(struct syscon *syscon) { return (0); }; void * syscon_get_softc(struct syscon *syscon) { return (syscon->softc); }; /* * Create and initialize syscon object, but do not register it. */ struct syscon * syscon_create(device_t pdev, syscon_class_t syscon_class) { struct syscon *syscon; /* Create object and initialize it. */ syscon = malloc(sizeof(struct syscon), M_SYSCON, M_WAITOK | M_ZERO); kobj_init((kobj_t)syscon, (kobj_class_t)syscon_class); /* Allocate softc if required. */ if (syscon_class->size > 0) syscon->softc = malloc(syscon_class->size, M_SYSCON, M_WAITOK | M_ZERO); /* Rest of init. */ syscon->pdev = pdev; return (syscon); } /* Register syscon object. 
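 * Registration first runs the class SYSCON_INIT() method, then (for
 * FDT providers) publishes the node xref so consumers can translate a
 * phandle back to this device, and finally links the object into the
 * global syscon list under the topology lock.  Providers that have an
 * OFW node usually call syscon_create_ofw_node() instead, which wraps
 * creation and registration in one step.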
*/ struct syscon * syscon_register(struct syscon *syscon) { int rv; #ifdef FDT if (syscon->ofw_node <= 0) syscon->ofw_node = ofw_bus_get_node(syscon->pdev); if (syscon->ofw_node <= 0) return (NULL); #endif rv = SYSCON_INIT(syscon); if (rv != 0) { printf("SYSCON_INIT failed: %d\n", rv); return (NULL); } #ifdef FDT OF_device_register_xref(OF_xref_from_node(syscon->ofw_node), syscon->pdev); #endif SYSCON_TOPO_XLOCK(); TAILQ_INSERT_TAIL(&syscon_list, syscon, syscon_link); SYSCON_TOPO_UNLOCK(); return (syscon); } int syscon_unregister(struct syscon *syscon) { SYSCON_TOPO_XLOCK(); TAILQ_REMOVE(&syscon_list, syscon, syscon_link); SYSCON_TOPO_UNLOCK(); #ifdef FDT OF_device_register_xref(OF_xref_from_node(syscon->ofw_node), NULL); #endif return (SYSCON_UNINIT(syscon)); } /** * Provider methods */ #ifdef FDT static struct syscon * syscon_find_by_ofw_node(phandle_t node) { struct syscon *entry; SYSCON_TOPO_ASSERT(); TAILQ_FOREACH(entry, &syscon_list, syscon_link) { if (entry->ofw_node == node) return (entry); } return (NULL); } struct syscon * syscon_create_ofw_node(device_t pdev, syscon_class_t syscon_class, phandle_t node) { struct syscon *syscon; syscon = syscon_create(pdev, syscon_class); if (syscon == NULL) return (NULL); syscon->ofw_node = node; if (syscon_register(syscon) == NULL) return (NULL); return (syscon); } phandle_t syscon_get_ofw_node(struct syscon *syscon) { return (syscon->ofw_node); } int syscon_get_by_ofw_property(device_t cdev, phandle_t cnode, char *name, struct syscon **syscon) { pcell_t *cells; int ncells; if (cnode <= 0) cnode = ofw_bus_get_node(cdev); if (cnode <= 0) { device_printf(cdev, "%s called on not ofw based device\n", __func__); return (ENXIO); } - ncells = OF_getencprop_alloc(cnode, name, sizeof(pcell_t), + ncells = OF_getencprop_alloc_multi(cnode, name, sizeof(pcell_t), (void **)&cells); if (ncells < 1) return (ENXIO); /* Translate to syscon node. */ SYSCON_TOPO_SLOCK(); *syscon = syscon_find_by_ofw_node(OF_node_from_xref(cells[0])); if (*syscon == NULL) { SYSCON_TOPO_UNLOCK(); device_printf(cdev, "Failed to find syscon node\n"); OF_prop_free(cells); return (ENODEV); } SYSCON_TOPO_UNLOCK(); OF_prop_free(cells); return (0); } #endif Index: user/markj/netdump/sys/dev/fdt/fdt_clock.c =================================================================== --- user/markj/netdump/sys/dev/fdt/fdt_clock.c (revision 332407) +++ user/markj/netdump/sys/dev/fdt/fdt_clock.c (revision 332408) @@ -1,162 +1,162 @@ /*- * Copyright (c) 2014 Ian Lepore * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include #include #include #include #include #include #include #include #include #include "fdt_clock_if.h" #include /* * Loop through all the tuples in the clocks= property for a device, enabling or * disabling each clock. * * Be liberal about errors for now: warn about a failure to enable but keep * trying with any other clocks in the list. Return ENXIO if any errors were * found, and let the caller decide whether the problem is fatal. */ static int enable_disable_all(device_t consumer, boolean_t enable) { phandle_t cnode; device_t clockdev; int clocknum, err, i, ncells; uint32_t *clks; boolean_t anyerrors; cnode = ofw_bus_get_node(consumer); - ncells = OF_getencprop_alloc(cnode, "clocks", sizeof(*clks), + ncells = OF_getencprop_alloc_multi(cnode, "clocks", sizeof(*clks), (void **)&clks); if (enable && ncells < 2) { device_printf(consumer, "Warning: No clocks specified in fdt " "data; device may not function."); return (ENXIO); } anyerrors = false; for (i = 0; i < ncells; i += 2) { clockdev = OF_device_from_xref(clks[i]); clocknum = clks[i + 1]; if (clockdev == NULL) { if (enable) device_printf(consumer, "Warning: can not find " "driver for clock number %u; device may not " "function\n", clocknum); anyerrors = true; continue; } if (enable) err = FDT_CLOCK_ENABLE(clockdev, clocknum); else err = FDT_CLOCK_DISABLE(clockdev, clocknum); if (err != 0) { if (enable) device_printf(consumer, "Warning: failed to " "enable clock number %u; device may not " "function\n", clocknum); anyerrors = true; } } OF_prop_free(clks); return (anyerrors ? ENXIO : 0); } int fdt_clock_get_info(device_t consumer, int n, struct fdt_clock_info *info) { phandle_t cnode; device_t clockdev; int clocknum, err, ncells; uint32_t *clks; cnode = ofw_bus_get_node(consumer); - ncells = OF_getencprop_alloc(cnode, "clocks", sizeof(*clks), + ncells = OF_getencprop_alloc_multi(cnode, "clocks", sizeof(*clks), (void **)&clks); if (ncells <= 0) return (ENXIO); n *= 2; if (ncells <= n) err = ENXIO; else { clockdev = OF_device_from_xref(clks[n]); if (clockdev == NULL) err = ENXIO; else { /* * Make struct contents minimally valid, then call * provider to fill in what it knows (provider can * override anything it wants to). 
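 *
 * For reference, clks[] holds the raw "clocks" property, a list of
 * (controller xref, clock number) pairs; with a hypothetical
 * controller label this might be:
 *
 *	clocks = <&ccu 32>, <&ccu 45>;
 *
 * so fdt_clock_get_info(dev, 1, &info) would describe clock 45 of the
 * controller referenced by &ccu.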
*/ clocknum = clks[n + 1]; bzero(info, sizeof(*info)); info->provider = clockdev; info->index = clocknum; info->name = ""; err = FDT_CLOCK_GET_INFO(clockdev, clocknum, info); } } OF_prop_free(clks); return (err); } int fdt_clock_enable_all(device_t consumer) { return (enable_disable_all(consumer, true)); } int fdt_clock_disable_all(device_t consumer) { return (enable_disable_all(consumer, false)); } void fdt_clock_register_provider(device_t provider) { OF_device_register_xref( OF_xref_from_node(ofw_bus_get_node(provider)), provider); } void fdt_clock_unregister_provider(device_t provider) { OF_device_register_xref(OF_xref_from_device(provider), NULL); } Index: user/markj/netdump/sys/dev/fdt/fdt_pinctrl.c =================================================================== --- user/markj/netdump/sys/dev/fdt/fdt_pinctrl.c (revision 332407) +++ user/markj/netdump/sys/dev/fdt/fdt_pinctrl.c (revision 332408) @@ -1,150 +1,150 @@ /*- * Copyright (c) 2014 Ian Lepore * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include #include #include #include #include "fdt_pinctrl_if.h" #include #include int fdt_pinctrl_configure(device_t client, u_int index) { device_t pinctrl; phandle_t *configs; int i, nconfigs; char name[16]; snprintf(name, sizeof(name), "pinctrl-%u", index); - nconfigs = OF_getencprop_alloc(ofw_bus_get_node(client), name, + nconfigs = OF_getencprop_alloc_multi(ofw_bus_get_node(client), name, sizeof(*configs), (void **)&configs); if (nconfigs < 0) return (ENOENT); if (nconfigs == 0) return (0); /* Empty property is documented as valid. 
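 *
 * For reference, the properties involved look like this hypothetical
 * node fragment:
 *
 *	pinctrl-names = "default", "sleep";
 *	pinctrl-0 = <&uart1_pins>;
 *	pinctrl-1 = <&uart1_sleep_pins>;
 *
 * fdt_pinctrl_configure_by_name(dev, "sleep") below resolves "sleep"
 * to index 1 and then applies every xref listed in "pinctrl-1".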
*/ for (i = 0; i < nconfigs; i++) { if ((pinctrl = OF_device_from_xref(configs[i])) != NULL) FDT_PINCTRL_CONFIGURE(pinctrl, configs[i]); } OF_prop_free(configs); return (0); } int fdt_pinctrl_configure_by_name(device_t client, const char * name) { char * names; int i, offset, nameslen; nameslen = OF_getprop_alloc(ofw_bus_get_node(client), "pinctrl-names", (void **)&names); if (nameslen <= 0) return (ENOENT); for (i = 0, offset = 0; offset < nameslen; i++) { if (strcmp(name, &names[offset]) == 0) break; offset += strlen(&names[offset]) + 1; } OF_prop_free(names); if (offset < nameslen) return (fdt_pinctrl_configure(client, i)); else return (ENOENT); } static int pinctrl_register_children(device_t pinctrl, phandle_t parent, const char *pinprop) { phandle_t node; /* * Recursively descend from parent, looking for nodes that have the * given property, and associate the pinctrl device_t with each one. */ for (node = OF_child(parent); node != 0; node = OF_peer(node)) { pinctrl_register_children(pinctrl, node, pinprop); if (pinprop == NULL || OF_hasprop(node, pinprop)) { OF_device_register_xref(OF_xref_from_node(node), pinctrl); } } return (0); } int fdt_pinctrl_register(device_t pinctrl, const char *pinprop) { phandle_t node; node = ofw_bus_get_node(pinctrl); OF_device_register_xref(OF_xref_from_node(node), pinctrl); return (pinctrl_register_children(pinctrl, node, pinprop)); } static int pinctrl_configure_children(device_t pinctrl, phandle_t parent) { phandle_t node, *configs; int i, nconfigs; for (node = OF_child(parent); node != 0; node = OF_peer(node)) { if (!ofw_bus_node_status_okay(node)) continue; pinctrl_configure_children(pinctrl, node); - nconfigs = OF_getencprop_alloc(node, "pinctrl-0", + nconfigs = OF_getencprop_alloc_multi(node, "pinctrl-0", sizeof(*configs), (void **)&configs); if (nconfigs <= 0) continue; if (bootverbose) { char name[32]; OF_getprop(node, "name", &name, sizeof(name)); printf("Processing %d pin-config node(s) in pinctrl-0 for %s\n", nconfigs, name); } for (i = 0; i < nconfigs; i++) { if (OF_device_from_xref(configs[i]) == pinctrl) FDT_PINCTRL_CONFIGURE(pinctrl, configs[i]); } OF_prop_free(configs); } return (0); } int fdt_pinctrl_configure_tree(device_t pinctrl) { return (pinctrl_configure_children(pinctrl, OF_peer(0))); } Index: user/markj/netdump/sys/dev/gpio/gpioregulator.c =================================================================== --- user/markj/netdump/sys/dev/gpio/gpioregulator.c (revision 332407) +++ user/markj/netdump/sys/dev/gpio/gpioregulator.c (revision 332408) @@ -1,348 +1,348 @@ /*- * Copyright (c) 2016 Jared McNeill * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /* * GPIO controlled regulators */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include "regdev_if.h" struct gpioregulator_state { int val; uint32_t mask; }; struct gpioregulator_init_def { struct regnode_init_def reg_init_def; struct gpiobus_pin *enable_pin; int enable_pin_valid; int startup_delay_us; int nstates; struct gpioregulator_state *states; int npins; struct gpiobus_pin **pins; }; struct gpioregulator_reg_sc { struct regnode *regnode; device_t base_dev; struct regnode_std_param *param; struct gpioregulator_init_def *def; }; struct gpioregulator_softc { device_t dev; struct gpioregulator_reg_sc *reg_sc; struct gpioregulator_init_def init_def; }; static int gpioregulator_regnode_init(struct regnode *regnode) { struct gpioregulator_reg_sc *sc; int error, n; sc = regnode_get_softc(regnode); if (sc->def->enable_pin_valid == 1) { error = gpio_pin_setflags(sc->def->enable_pin, GPIO_PIN_OUTPUT); if (error != 0) return (error); } for (n = 0; n < sc->def->npins; n++) { error = gpio_pin_setflags(sc->def->pins[n], GPIO_PIN_OUTPUT); if (error != 0) return (error); } return (0); } static int gpioregulator_regnode_enable(struct regnode *regnode, bool enable, int *udelay) { struct gpioregulator_reg_sc *sc; bool active; int error; sc = regnode_get_softc(regnode); if (sc->def->enable_pin_valid == 1) { active = enable; if (!sc->param->enable_active_high) active = !active; error = gpio_pin_set_active(sc->def->enable_pin, active); if (error != 0) return (error); } *udelay = sc->def->startup_delay_us; return (0); } static int gpioregulator_regnode_set_voltage(struct regnode *regnode, int min_uvolt, int max_uvolt, int *udelay) { struct gpioregulator_reg_sc *sc; const struct gpioregulator_state *state; int error, n; sc = regnode_get_softc(regnode); state = NULL; for (n = 0; n < sc->def->nstates; n++) { if (sc->def->states[n].val >= min_uvolt && sc->def->states[n].val <= max_uvolt) { state = &sc->def->states[n]; break; } } if (state == NULL) return (EINVAL); for (n = 0; n < sc->def->npins; n++) { error = gpio_pin_set_active(sc->def->pins[n], (state->mask >> n) & 1); if (error != 0) return (error); } *udelay = sc->def->startup_delay_us; return (0); } static int gpioregulator_regnode_get_voltage(struct regnode *regnode, int *uvolt) { struct gpioregulator_reg_sc *sc; uint32_t mask; int error, n; bool active; sc = regnode_get_softc(regnode); mask = 0; for (n = 0; n < sc->def->npins; n++) { error = gpio_pin_is_active(sc->def->pins[n], &active); if (error != 0) return (error); mask |= (active << n); } for (n = 0; n < sc->def->nstates; n++) { if (sc->def->states[n].mask == mask) { *uvolt = sc->def->states[n].val; return (0); } } return (EIO); } static regnode_method_t gpioregulator_regnode_methods[] = { /* Regulator interface */ REGNODEMETHOD(regnode_init, gpioregulator_regnode_init), REGNODEMETHOD(regnode_enable, gpioregulator_regnode_enable), REGNODEMETHOD(regnode_set_voltage, gpioregulator_regnode_set_voltage), 
REGNODEMETHOD(regnode_get_voltage, gpioregulator_regnode_get_voltage), REGNODEMETHOD_END }; DEFINE_CLASS_1(gpioregulator_regnode, gpioregulator_regnode_class, gpioregulator_regnode_methods, sizeof(struct gpioregulator_reg_sc), regnode_class); static int gpioregulator_parse_fdt(struct gpioregulator_softc *sc) { uint32_t *pstates, mask; phandle_t node; ssize_t len; int error, n; node = ofw_bus_get_node(sc->dev); pstates = NULL; mask = 0; error = regulator_parse_ofw_stdparam(sc->dev, node, &sc->init_def.reg_init_def); if (error != 0) return (error); /* "states" property (required) */ - len = OF_getencprop_alloc(node, "states", sizeof(*pstates), + len = OF_getencprop_alloc_multi(node, "states", sizeof(*pstates), (void **)&pstates); if (len < 2) { device_printf(sc->dev, "invalid 'states' property\n"); error = EINVAL; goto done; } sc->init_def.nstates = len / 2; sc->init_def.states = malloc(sc->init_def.nstates * sizeof(*sc->init_def.states), M_DEVBUF, M_WAITOK); for (n = 0; n < sc->init_def.nstates; n++) { sc->init_def.states[n].val = pstates[n * 2 + 0]; sc->init_def.states[n].mask = pstates[n * 2 + 1]; mask |= sc->init_def.states[n].mask; } /* "startup-delay-us" property (optional) */ len = OF_getencprop(node, "startup-delay-us", &sc->init_def.startup_delay_us, sizeof(sc->init_def.startup_delay_us)); if (len <= 0) sc->init_def.startup_delay_us = 0; /* "enable-gpio" property (optional) */ error = gpio_pin_get_by_ofw_property(sc->dev, node, "enable-gpio", &sc->init_def.enable_pin); if (error == 0) sc->init_def.enable_pin_valid = 1; /* "gpios" property */ sc->init_def.npins = 32 - __builtin_clz(mask); sc->init_def.pins = malloc(sc->init_def.npins * sizeof(sc->init_def.pins), M_DEVBUF, M_WAITOK); for (n = 0; n < sc->init_def.npins; n++) { error = gpio_pin_get_by_ofw_idx(sc->dev, node, n, &sc->init_def.pins[n]); if (error != 0) { device_printf(sc->dev, "cannot get pin %d\n", n); goto done; } } done: if (error != 0) { for (n = 0; n < sc->init_def.npins; n++) { if (sc->init_def.pins[n] != NULL) gpio_pin_release(sc->init_def.pins[n]); } free(sc->init_def.states, M_DEVBUF); free(sc->init_def.pins, M_DEVBUF); } OF_prop_free(pstates); return (error); } static int gpioregulator_probe(device_t dev) { if (!ofw_bus_is_compatible(dev, "regulator-gpio")) return (ENXIO); device_set_desc(dev, "GPIO controlled regulator"); return (BUS_PROBE_GENERIC); } static int gpioregulator_attach(device_t dev) { struct gpioregulator_softc *sc; struct regnode *regnode; phandle_t node; int error; sc = device_get_softc(dev); sc->dev = dev; node = ofw_bus_get_node(dev); error = gpioregulator_parse_fdt(sc); if (error != 0) { device_printf(dev, "cannot parse parameters\n"); return (ENXIO); } sc->init_def.reg_init_def.id = 1; sc->init_def.reg_init_def.ofw_node = node; regnode = regnode_create(dev, &gpioregulator_regnode_class, &sc->init_def.reg_init_def); if (regnode == NULL) { device_printf(dev, "cannot create regulator\n"); return (ENXIO); } sc->reg_sc = regnode_get_softc(regnode); sc->reg_sc->regnode = regnode; sc->reg_sc->base_dev = dev; sc->reg_sc->param = regnode_get_stdparam(regnode); sc->reg_sc->def = &sc->init_def; regnode_register(regnode); return (0); } static device_method_t gpioregulator_methods[] = { /* Device interface */ DEVMETHOD(device_probe, gpioregulator_probe), DEVMETHOD(device_attach, gpioregulator_attach), /* Regdev interface */ DEVMETHOD(regdev_map, regdev_default_ofw_map), DEVMETHOD_END }; static driver_t gpioregulator_driver = { "gpioregulator", gpioregulator_methods, sizeof(struct gpioregulator_softc), }; 
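/*
 * For reference, the FDT binding parsed by gpioregulator_parse_fdt()
 * pairs a voltage with a GPIO state mask; a hypothetical two-state,
 * two-pin regulator:
 *
 *	states = <1100000 0x0>, <1300000 0x3>;
 *	gpios = <&gpio 11 0>, <&gpio 12 0>;
 *
 * Requesting 1300000 uV drives both pins active (mask 0x3); the number
 * of "gpios" entries used is derived from the highest bit set in any
 * state mask.
 */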
static devclass_t gpioregulator_devclass; EARLY_DRIVER_MODULE(gpioregulator, simplebus, gpioregulator_driver, gpioregulator_devclass, 0, 0, BUS_PASS_INTERRUPT + BUS_PASS_ORDER_LAST); MODULE_VERSION(gpioregulator, 1); Index: user/markj/netdump/sys/dev/gpio/ofw_gpiobus.c =================================================================== --- user/markj/netdump/sys/dev/gpio/ofw_gpiobus.c (revision 332407) +++ user/markj/netdump/sys/dev/gpio/ofw_gpiobus.c (revision 332408) @@ -1,593 +1,593 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2009, Nathan Whitehorn * Copyright (c) 2013, Luiz Otavio O Souza * Copyright (c) 2013 The FreeBSD Foundation * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include "gpiobus_if.h" #define GPIO_ACTIVE_LOW 1 static struct ofw_gpiobus_devinfo *ofw_gpiobus_setup_devinfo(device_t, device_t, phandle_t); static void ofw_gpiobus_destroy_devinfo(device_t, struct ofw_gpiobus_devinfo *); static int ofw_gpiobus_parse_gpios_impl(device_t, phandle_t, char *, struct gpiobus_softc *, struct gpiobus_pin **); /* * Utility functions for easier handling of OFW GPIO pins. * * !!! BEWARE !!! * GPIOBUS uses children's IVARs, so we cannot use this interface for cross * tree consumers. * */ int gpio_pin_get_by_ofw_propidx(device_t consumer, phandle_t cnode, char *prop_name, int idx, gpio_pin_t *out_pin) { phandle_t xref; pcell_t *cells; device_t busdev; struct gpiobus_pin pin; int ncells, rv; KASSERT(consumer != NULL && cnode > 0, ("both consumer and cnode required")); rv = ofw_bus_parse_xref_list_alloc(cnode, prop_name, "#gpio-cells", idx, &xref, &ncells, &cells); if (rv != 0) return (rv); /* Translate provider to device. */ pin.dev = OF_device_from_xref(xref); if (pin.dev == NULL) { OF_prop_free(cells); return (ENODEV); } /* Test if GPIO bus already exist. */ busdev = GPIO_GET_BUS(pin.dev); if (busdev == NULL) { OF_prop_free(cells); return (ENODEV); } /* Map GPIO pin. */ rv = gpio_map_gpios(pin.dev, cnode, OF_node_from_xref(xref), ncells, cells, &pin.pin, &pin.flags); OF_prop_free(cells); if (rv != 0) return (ENXIO); /* Reserve GPIO pin. 
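 * Reserving marks the pin as mapped on its gpiobus, so another
 * consumer trying to acquire the same pin fails (EBUSY is returned
 * below); the reservation is dropped again by gpio_pin_release().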
*/ rv = gpiobus_acquire_pin(busdev, pin.pin); if (rv != 0) return (EBUSY); *out_pin = malloc(sizeof(struct gpiobus_pin), M_DEVBUF, M_WAITOK | M_ZERO); **out_pin = pin; return (0); } int gpio_pin_get_by_ofw_idx(device_t consumer, phandle_t node, int idx, gpio_pin_t *pin) { return (gpio_pin_get_by_ofw_propidx(consumer, node, "gpios", idx, pin)); } int gpio_pin_get_by_ofw_property(device_t consumer, phandle_t node, char *name, gpio_pin_t *pin) { return (gpio_pin_get_by_ofw_propidx(consumer, node, name, 0, pin)); } int gpio_pin_get_by_ofw_name(device_t consumer, phandle_t node, char *name, gpio_pin_t *pin) { int rv, idx; KASSERT(consumer != NULL && node > 0, ("both consumer and node required")); rv = ofw_bus_find_string_index(node, "gpio-names", name, &idx); if (rv != 0) return (rv); return (gpio_pin_get_by_ofw_idx(consumer, node, idx, pin)); } void gpio_pin_release(gpio_pin_t gpio) { device_t busdev; if (gpio == NULL) return; KASSERT(gpio->dev != NULL, ("invalid pin state")); busdev = GPIO_GET_BUS(gpio->dev); if (busdev != NULL) gpiobus_release_pin(busdev, gpio->pin); /* XXXX Unreserve pin. */ free(gpio, M_DEVBUF); } int gpio_pin_getcaps(gpio_pin_t pin, uint32_t *caps) { KASSERT(pin != NULL, ("GPIO pin is NULL.")); KASSERT(pin->dev != NULL, ("GPIO pin device is NULL.")); return (GPIO_PIN_GETCAPS(pin->dev, pin->pin, caps)); } int gpio_pin_is_active(gpio_pin_t pin, bool *active) { int rv; uint32_t tmp; KASSERT(pin != NULL, ("GPIO pin is NULL.")); KASSERT(pin->dev != NULL, ("GPIO pin device is NULL.")); rv = GPIO_PIN_GET(pin->dev, pin->pin, &tmp); if (rv != 0) { return (rv); } if (pin->flags & GPIO_ACTIVE_LOW) *active = tmp == 0; else *active = tmp != 0; return (0); } int gpio_pin_set_active(gpio_pin_t pin, bool active) { int rv; uint32_t tmp; if (pin->flags & GPIO_ACTIVE_LOW) tmp = active ? 0 : 1; else tmp = active ? 1 : 0; KASSERT(pin != NULL, ("GPIO pin is NULL.")); KASSERT(pin->dev != NULL, ("GPIO pin device is NULL.")); rv = GPIO_PIN_SET(pin->dev, pin->pin, tmp); return (rv); } int gpio_pin_setflags(gpio_pin_t pin, uint32_t flags) { int rv; KASSERT(pin != NULL, ("GPIO pin is NULL.")); KASSERT(pin->dev != NULL, ("GPIO pin device is NULL.")); rv = GPIO_PIN_SETFLAGS(pin->dev, pin->pin, flags); return (rv); } /* * OFW_GPIOBUS driver. */ device_t ofw_gpiobus_add_fdt_child(device_t bus, const char *drvname, phandle_t child) { device_t childdev; int i; struct gpiobus_ivar *devi; struct ofw_gpiobus_devinfo *dinfo; /* * Check to see if we already have a child for @p child, and if so * return it. */ childdev = ofw_bus_find_child_device_by_phandle(bus, child); if (childdev != NULL) return (childdev); /* * Set up the GPIO child and OFW bus layer devinfo and add it to bus. */ childdev = device_add_child(bus, drvname, -1); if (childdev == NULL) return (NULL); dinfo = ofw_gpiobus_setup_devinfo(bus, childdev, child); if (dinfo == NULL) { device_delete_child(bus, childdev); return (NULL); } if (device_probe_and_attach(childdev) != 0) { ofw_gpiobus_destroy_devinfo(bus, dinfo); device_delete_child(bus, childdev); return (NULL); } /* Use the child name as pin name. 
*/ devi = &dinfo->opd_dinfo; for (i = 0; i < devi->npins; i++) GPIOBUS_PIN_SETNAME(bus, devi->pins[i], device_get_nameunit(childdev)); return (childdev); } int ofw_gpiobus_parse_gpios(device_t consumer, char *pname, struct gpiobus_pin **pins) { return (ofw_gpiobus_parse_gpios_impl(consumer, ofw_bus_get_node(consumer), pname, NULL, pins)); } void ofw_gpiobus_register_provider(device_t provider) { phandle_t node; node = ofw_bus_get_node(provider); OF_device_register_xref(OF_xref_from_node(node), provider); } void ofw_gpiobus_unregister_provider(device_t provider) { phandle_t node; node = ofw_bus_get_node(provider); OF_device_register_xref(OF_xref_from_node(node), NULL); } static struct ofw_gpiobus_devinfo * ofw_gpiobus_setup_devinfo(device_t bus, device_t child, phandle_t node) { int i, npins; struct gpiobus_ivar *devi; struct gpiobus_pin *pins; struct gpiobus_softc *sc; struct ofw_gpiobus_devinfo *dinfo; sc = device_get_softc(bus); dinfo = malloc(sizeof(*dinfo), M_DEVBUF, M_NOWAIT | M_ZERO); if (dinfo == NULL) return (NULL); if (ofw_bus_gen_setup_devinfo(&dinfo->opd_obdinfo, node) != 0) { free(dinfo, M_DEVBUF); return (NULL); } /* Parse the gpios property for the child. */ npins = ofw_gpiobus_parse_gpios_impl(child, node, "gpios", sc, &pins); if (npins <= 0) { ofw_bus_gen_destroy_devinfo(&dinfo->opd_obdinfo); free(dinfo, M_DEVBUF); return (NULL); } /* Initialize the irq resource list. */ resource_list_init(&dinfo->opd_dinfo.rl); /* Allocate the child ivars and copy the parsed pin data. */ devi = &dinfo->opd_dinfo; devi->npins = (uint32_t)npins; if (gpiobus_alloc_ivars(devi) != 0) { free(pins, M_DEVBUF); ofw_gpiobus_destroy_devinfo(bus, dinfo); return (NULL); } for (i = 0; i < devi->npins; i++) { devi->flags[i] = pins[i].flags; devi->pins[i] = pins[i].pin; } free(pins, M_DEVBUF); /* Parse the interrupt resources. */ if (ofw_bus_intr_to_rl(bus, node, &dinfo->opd_dinfo.rl, NULL) != 0) { ofw_gpiobus_destroy_devinfo(bus, dinfo); return (NULL); } device_set_ivars(child, dinfo); return (dinfo); } static void ofw_gpiobus_destroy_devinfo(device_t bus, struct ofw_gpiobus_devinfo *dinfo) { int i; struct gpiobus_ivar *devi; struct gpiobus_softc *sc; sc = device_get_softc(bus); devi = &dinfo->opd_dinfo; for (i = 0; i < devi->npins; i++) { if (devi->pins[i] > sc->sc_npins) continue; sc->sc_pins[devi->pins[i]].mapped = 0; } gpiobus_free_ivars(devi); resource_list_free(&dinfo->opd_dinfo.rl); ofw_bus_gen_destroy_devinfo(&dinfo->opd_obdinfo); free(dinfo, M_DEVBUF); } static int ofw_gpiobus_parse_gpios_impl(device_t consumer, phandle_t cnode, char *pname, struct gpiobus_softc *bussc, struct gpiobus_pin **pins) { int gpiocells, i, j, ncells, npins; pcell_t *gpios; phandle_t gpio; - ncells = OF_getencprop_alloc(cnode, pname, sizeof(*gpios), + ncells = OF_getencprop_alloc_multi(cnode, pname, sizeof(*gpios), (void **)&gpios); if (ncells == -1) { device_printf(consumer, "Warning: No %s specified in fdt data; " "device may not function.\n", pname); return (-1); } /* * The gpio-specifier is controller independent, the first pcell has * the reference to the GPIO controller phandler. * Count the number of encoded gpio-specifiers on the first pass. */ i = 0; npins = 0; while (i < ncells) { /* Allow NULL specifiers. */ if (gpios[i] == 0) { npins++; i++; continue; } gpio = OF_node_from_xref(gpios[i]); /* If we have bussc, ignore devices from other gpios. 
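 *
 * For reference, a "gpios" property where both controllers have
 * #gpio-cells = <2> and the middle entry is a NULL (hole) specifier
 * looks like:
 *
 *	gpios = <&gpio0 17 1>, <0>, <&gpio1 4 0>;
 *
 * The first cell of each tuple is the controller xref; the following
 * #gpio-cells cells are decoded by that controller's gpio_map_gpios().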
*/ if (bussc != NULL) if (ofw_bus_get_node(bussc->sc_dev) != gpio) return (0); /* * Check for gpio-controller property and read the #gpio-cells * for this GPIO controller. */ if (!OF_hasprop(gpio, "gpio-controller") || OF_getencprop(gpio, "#gpio-cells", &gpiocells, sizeof(gpiocells)) < 0) { device_printf(consumer, "gpio reference is not a gpio-controller.\n"); OF_prop_free(gpios); return (-1); } if (ncells - i < gpiocells + 1) { device_printf(consumer, "%s cells doesn't match #gpio-cells.\n", pname); return (-1); } npins++; i += gpiocells + 1; } if (npins == 0 || pins == NULL) { if (npins == 0) device_printf(consumer, "no pin specified in %s.\n", pname); OF_prop_free(gpios); return (npins); } *pins = malloc(sizeof(struct gpiobus_pin) * npins, M_DEVBUF, M_NOWAIT | M_ZERO); if (*pins == NULL) { OF_prop_free(gpios); return (-1); } /* Decode the gpio specifier on the second pass. */ i = 0; j = 0; while (i < ncells) { /* Allow NULL specifiers. */ if (gpios[i] == 0) { j++; i++; continue; } gpio = OF_node_from_xref(gpios[i]); /* Read gpio-cells property for this GPIO controller. */ if (OF_getencprop(gpio, "#gpio-cells", &gpiocells, sizeof(gpiocells)) < 0) { device_printf(consumer, "gpio does not have the #gpio-cells property.\n"); goto fail; } /* Return the device reference for the GPIO controller. */ (*pins)[j].dev = OF_device_from_xref(gpios[i]); if ((*pins)[j].dev == NULL) { device_printf(consumer, "no device registered for the gpio controller.\n"); goto fail; } /* * If the gpiobus softc is NULL we use the GPIO_GET_BUS() to * retrieve it. The GPIO_GET_BUS() method is only valid after * the child is probed and attached. */ if (bussc == NULL) { if (GPIO_GET_BUS((*pins)[j].dev) == NULL) { device_printf(consumer, "no gpiobus reference for %s.\n", device_get_nameunit((*pins)[j].dev)); goto fail; } bussc = device_get_softc(GPIO_GET_BUS((*pins)[j].dev)); } /* Get the GPIO pin number and flags. */ if (gpio_map_gpios((*pins)[j].dev, cnode, gpio, gpiocells, &gpios[i + 1], &(*pins)[j].pin, &(*pins)[j].flags) != 0) { device_printf(consumer, "cannot map the gpios specifier.\n"); goto fail; } /* Reserve the GPIO pin. */ if (gpiobus_acquire_pin(bussc->sc_busdev, (*pins)[j].pin) != 0) goto fail; j++; i += gpiocells + 1; } OF_prop_free(gpios); return (npins); fail: OF_prop_free(gpios); free(*pins, M_DEVBUF); return (-1); } static int ofw_gpiobus_probe(device_t dev) { if (ofw_bus_get_node(dev) == -1) return (ENXIO); device_set_desc(dev, "OFW GPIO bus"); return (0); } static int ofw_gpiobus_attach(device_t dev) { int err; phandle_t child; err = gpiobus_init_softc(dev); if (err != 0) return (err); bus_generic_probe(dev); bus_enumerate_hinted_children(dev); /* * Attach the children represented in the device tree. */ for (child = OF_child(ofw_bus_get_node(dev)); child != 0; child = OF_peer(child)) { if (!OF_hasprop(child, "gpios")) continue; if (ofw_gpiobus_add_fdt_child(dev, NULL, child) == NULL) continue; } return (bus_generic_attach(dev)); } static device_t ofw_gpiobus_add_child(device_t dev, u_int order, const char *name, int unit) { device_t child; struct ofw_gpiobus_devinfo *devi; child = device_add_child_ordered(dev, order, name, unit); if (child == NULL) return (child); devi = malloc(sizeof(struct ofw_gpiobus_devinfo), M_DEVBUF, M_NOWAIT | M_ZERO); if (devi == NULL) { device_delete_child(dev, child); return (0); } /* * NULL all the OFW-related parts of the ivars for non-OFW * children. 
devi->opd_obdinfo.obd_node = -1; devi->opd_obdinfo.obd_name = NULL; devi->opd_obdinfo.obd_compat = NULL; devi->opd_obdinfo.obd_type = NULL; devi->opd_obdinfo.obd_model = NULL; device_set_ivars(child, devi); return (child); } static const struct ofw_bus_devinfo * ofw_gpiobus_get_devinfo(device_t bus, device_t dev) { struct ofw_gpiobus_devinfo *dinfo; dinfo = device_get_ivars(dev); return (&dinfo->opd_obdinfo); } static device_method_t ofw_gpiobus_methods[] = { /* Device interface */ DEVMETHOD(device_probe, ofw_gpiobus_probe), DEVMETHOD(device_attach, ofw_gpiobus_attach), /* Bus interface */ DEVMETHOD(bus_child_pnpinfo_str, ofw_bus_gen_child_pnpinfo_str), DEVMETHOD(bus_add_child, ofw_gpiobus_add_child), /* ofw_bus interface */ DEVMETHOD(ofw_bus_get_devinfo, ofw_gpiobus_get_devinfo), DEVMETHOD(ofw_bus_get_compat, ofw_bus_gen_get_compat), DEVMETHOD(ofw_bus_get_model, ofw_bus_gen_get_model), DEVMETHOD(ofw_bus_get_name, ofw_bus_gen_get_name), DEVMETHOD(ofw_bus_get_node, ofw_bus_gen_get_node), DEVMETHOD(ofw_bus_get_type, ofw_bus_gen_get_type), DEVMETHOD_END }; devclass_t ofwgpiobus_devclass; DEFINE_CLASS_1(gpiobus, ofw_gpiobus_driver, ofw_gpiobus_methods, sizeof(struct gpiobus_softc), gpiobus_driver); EARLY_DRIVER_MODULE(ofw_gpiobus, gpio, ofw_gpiobus_driver, ofwgpiobus_devclass, 0, 0, BUS_PASS_BUS); MODULE_VERSION(ofw_gpiobus, 1); MODULE_DEPEND(ofw_gpiobus, gpiobus, 1, 1, 1); Index: user/markj/netdump/sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c =================================================================== --- user/markj/netdump/sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c (revision 332407) +++ user/markj/netdump/sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c (revision 332408) @@ -1,2400 +1,2387 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2009-2012,2016-2017 Microsoft Corp. * Copyright (c) 2012 NetApp Inc. * Copyright (c) 2012 Citrix Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /** * StorVSC driver for Hyper-V. This driver presents a SCSI HBA interface * to the Common Access Method (CAM) layer. CAM control blocks (CCBs) are * converted into VSCSI protocol messages which are delivered to the parent * partition StorVSP driver over the Hyper-V VMBUS.
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "hv_vstorage.h" #include "vmbus_if.h" #define STORVSC_MAX_LUNS_PER_TARGET (64) #define STORVSC_MAX_IO_REQUESTS (STORVSC_MAX_LUNS_PER_TARGET * 2) #define BLKVSC_MAX_IDE_DISKS_PER_TARGET (1) #define BLKVSC_MAX_IO_REQUESTS STORVSC_MAX_IO_REQUESTS #define STORVSC_MAX_TARGETS (2) #define VSTOR_PKT_SIZE (sizeof(struct vstor_packet) - vmscsi_size_delta) /* * 33 segments are needed to allow 128KB maxio, in case the data * in the first page is _not_ PAGE_SIZE aligned, e.g. * * |<----------- 128KB ----------->| * | | * 0 2K 4K 8K 16K 124K 128K 130K * | | | | | | | | * +--+--+-----+-----+.......+-----+--+--+ * | | | | | | | | | DATA * | | | | | | | | | * +--+--+-----+-----+.......------+--+--+ * | | | | * | 1| 31 | 1| ...... # of segments */ #define STORVSC_DATA_SEGCNT_MAX 33 #define STORVSC_DATA_SEGSZ_MAX PAGE_SIZE #define STORVSC_DATA_SIZE_MAX \ ((STORVSC_DATA_SEGCNT_MAX - 1) * STORVSC_DATA_SEGSZ_MAX) struct storvsc_softc; struct hv_sgl_node { LIST_ENTRY(hv_sgl_node) link; struct sglist *sgl_data; }; struct hv_sgl_page_pool{ LIST_HEAD(, hv_sgl_node) in_use_sgl_list; LIST_HEAD(, hv_sgl_node) free_sgl_list; boolean_t is_init; } g_hv_sgl_page_pool; enum storvsc_request_type { WRITE_TYPE, READ_TYPE, UNKNOWN_TYPE }; SYSCTL_NODE(_hw, OID_AUTO, storvsc, CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, "Hyper-V storage interface"); static u_int hv_storvsc_use_win8ext_flags = 1; SYSCTL_UINT(_hw_storvsc, OID_AUTO, use_win8ext_flags, CTLFLAG_RW, &hv_storvsc_use_win8ext_flags, 0, "Use win8 extension flags or not"); static u_int hv_storvsc_use_pim_unmapped = 1; SYSCTL_UINT(_hw_storvsc, OID_AUTO, use_pim_unmapped, CTLFLAG_RDTUN, &hv_storvsc_use_pim_unmapped, 0, "Optimize storvsc by using unmapped I/O"); static u_int hv_storvsc_ringbuffer_size = (64 * PAGE_SIZE); SYSCTL_UINT(_hw_storvsc, OID_AUTO, ringbuffer_size, CTLFLAG_RDTUN, &hv_storvsc_ringbuffer_size, 0, "Hyper-V storage ringbuffer size"); static u_int hv_storvsc_max_io = 512; SYSCTL_UINT(_hw_storvsc, OID_AUTO, max_io, CTLFLAG_RDTUN, &hv_storvsc_max_io, 0, "Hyper-V storage max io limit"); static int hv_storvsc_chan_cnt = 0; SYSCTL_INT(_hw_storvsc, OID_AUTO, chan_cnt, CTLFLAG_RDTUN, &hv_storvsc_chan_cnt, 0, "# of channels to use"); #define STORVSC_MAX_IO \ vmbus_chan_prplist_nelem(hv_storvsc_ringbuffer_size, \ STORVSC_DATA_SEGCNT_MAX, VSTOR_PKT_SIZE) struct hv_storvsc_sysctl { u_long data_bio_cnt; u_long data_vaddr_cnt; u_long data_sg_cnt; u_long chan_send_cnt[MAXCPU]; }; struct storvsc_gpa_range { struct vmbus_gpa_range gpa_range; uint64_t gpa_page[STORVSC_DATA_SEGCNT_MAX]; } __packed; struct hv_storvsc_request { LIST_ENTRY(hv_storvsc_request) link; struct vstor_packet vstor_packet; int prp_cnt; struct storvsc_gpa_range prp_list; void *sense_data; uint8_t sense_info_len; uint8_t retries; union ccb *ccb; struct storvsc_softc *softc; struct callout callout; struct sema synch_sema; /*Synchronize the request/response if needed */ struct sglist *bounce_sgl; unsigned int bounce_sgl_count; uint64_t not_aligned_seg_bits; bus_dmamap_t data_dmap; }; struct storvsc_softc { struct vmbus_channel *hs_chan; LIST_HEAD(, hv_storvsc_request) hs_free_list; struct mtx hs_lock; struct storvsc_driver_props 
*hs_drv_props; int hs_unit; uint32_t hs_frozen; struct cam_sim *hs_sim; struct cam_path *hs_path; uint32_t hs_num_out_reqs; boolean_t hs_destroy; boolean_t hs_drain_notify; struct sema hs_drain_sema; struct hv_storvsc_request hs_init_req; struct hv_storvsc_request hs_reset_req; device_t hs_dev; bus_dma_tag_t storvsc_req_dtag; struct hv_storvsc_sysctl sysctl_data; uint32_t hs_nchan; struct vmbus_channel *hs_sel_chan[MAXCPU]; }; static eventhandler_tag storvsc_handler_tag; /* * The size of the vmscsi_request has changed in win8. The * additional size is for the newly added elements in the * structure. These elements are valid only when we are talking * to a win8 host. * Track the correct size we need to apply. */ static int vmscsi_size_delta = sizeof(struct vmscsi_win8_extension); /** * HyperV storvsc timeout testing cases: * a. IO returned after first timeout; * b. IO returned after second timeout and queue freeze; * c. IO returned while timer handler is running * The first can be tested by "sg_senddiag -vv /dev/daX", * and the second and third can be done by * "sg_wr_mode -v -p 08 -c 0,1a -m 0,ff /dev/daX". */ #define HVS_TIMEOUT_TEST 0 /* * Bus/adapter reset functionality on the Hyper-V host is * buggy and it will be disabled until * it can be further tested. */ #define HVS_HOST_RESET 0 struct storvsc_driver_props { char *drv_name; char *drv_desc; uint8_t drv_max_luns_per_target; uint32_t drv_max_ios_per_target; uint32_t drv_ringbuffer_size; }; enum hv_storage_type { DRIVER_BLKVSC, DRIVER_STORVSC, DRIVER_UNKNOWN }; #define HS_MAX_ADAPTERS 10 #define HV_STORAGE_SUPPORTS_MULTI_CHANNEL 0x1 /* {ba6163d9-04a1-4d29-b605-72e2ffb1dc7f} */ static const struct hyperv_guid gStorVscDeviceType={ .hv_guid = {0xd9, 0x63, 0x61, 0xba, 0xa1, 0x04, 0x29, 0x4d, 0xb6, 0x05, 0x72, 0xe2, 0xff, 0xb1, 0xdc, 0x7f} }; /* {32412632-86cb-44a2-9b5c-50d1417354f5} */ static const struct hyperv_guid gBlkVscDeviceType={ .hv_guid = {0x32, 0x26, 0x41, 0x32, 0xcb, 0x86, 0xa2, 0x44, 0x9b, 0x5c, 0x50, 0xd1, 0x41, 0x73, 0x54, 0xf5} }; static struct storvsc_driver_props g_drv_props_table[] = { {"blkvsc", "Hyper-V IDE", BLKVSC_MAX_IDE_DISKS_PER_TARGET, BLKVSC_MAX_IO_REQUESTS, 20*PAGE_SIZE}, {"storvsc", "Hyper-V SCSI", STORVSC_MAX_LUNS_PER_TARGET, STORVSC_MAX_IO_REQUESTS, 20*PAGE_SIZE} }; /* * Sense buffer size changed in win8; have a run-time * variable to track the size we should use. */ static int sense_buffer_size = PRE_WIN8_STORVSC_SENSE_BUFFER_SIZE; /* * The storage protocol version is determined during the * initial exchange with the host. It will indicate which * storage functionality is available in the host. 
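 * Negotiation (see hv_storvsc_channel_init()) walks vmstor_proto_list
 * from the newest entry downward and settles on the first version the
 * host accepts; that choice also fixes sense_buffer_size and
 * vmscsi_size_delta for the life of the connection.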
*/ static int vmstor_proto_version; struct vmstor_proto { int proto_version; int sense_buffer_size; int vmscsi_size_delta; }; static const struct vmstor_proto vmstor_proto_list[] = { { VMSTOR_PROTOCOL_VERSION_WIN10, POST_WIN7_STORVSC_SENSE_BUFFER_SIZE, 0 }, { VMSTOR_PROTOCOL_VERSION_WIN8_1, POST_WIN7_STORVSC_SENSE_BUFFER_SIZE, 0 }, { VMSTOR_PROTOCOL_VERSION_WIN8, POST_WIN7_STORVSC_SENSE_BUFFER_SIZE, 0 }, { VMSTOR_PROTOCOL_VERSION_WIN7, PRE_WIN8_STORVSC_SENSE_BUFFER_SIZE, sizeof(struct vmscsi_win8_extension), }, { VMSTOR_PROTOCOL_VERSION_WIN6, PRE_WIN8_STORVSC_SENSE_BUFFER_SIZE, sizeof(struct vmscsi_win8_extension), } }; /* static functions */ static int storvsc_probe(device_t dev); static int storvsc_attach(device_t dev); static int storvsc_detach(device_t dev); static void storvsc_poll(struct cam_sim * sim); static void storvsc_action(struct cam_sim * sim, union ccb * ccb); static int create_storvsc_request(union ccb *ccb, struct hv_storvsc_request *reqp); static void storvsc_free_request(struct storvsc_softc *sc, struct hv_storvsc_request *reqp); static enum hv_storage_type storvsc_get_storage_type(device_t dev); static void hv_storvsc_rescan_target(struct storvsc_softc *sc); static void hv_storvsc_on_channel_callback(struct vmbus_channel *chan, void *xsc); static void hv_storvsc_on_iocompletion( struct storvsc_softc *sc, struct vstor_packet *vstor_packet, struct hv_storvsc_request *request); static int hv_storvsc_connect_vsp(struct storvsc_softc *); static void storvsc_io_done(struct hv_storvsc_request *reqp); static void storvsc_copy_sgl_to_bounce_buf(struct sglist *bounce_sgl, bus_dma_segment_t *orig_sgl, unsigned int orig_sgl_count, uint64_t seg_bits); void storvsc_copy_from_bounce_buf_to_sgl(bus_dma_segment_t *dest_sgl, unsigned int dest_sgl_count, struct sglist* src_sgl, uint64_t seg_bits); static device_method_t storvsc_methods[] = { /* Device interface */ DEVMETHOD(device_probe, storvsc_probe), DEVMETHOD(device_attach, storvsc_attach), DEVMETHOD(device_detach, storvsc_detach), DEVMETHOD(device_shutdown, bus_generic_shutdown), DEVMETHOD_END }; static driver_t storvsc_driver = { "storvsc", storvsc_methods, sizeof(struct storvsc_softc), }; static devclass_t storvsc_devclass; DRIVER_MODULE(storvsc, vmbus, storvsc_driver, storvsc_devclass, 0, 0); MODULE_VERSION(storvsc, 1); MODULE_DEPEND(storvsc, vmbus, 1, 1, 1); static void storvsc_subchan_attach(struct storvsc_softc *sc, struct vmbus_channel *new_channel) { struct vmstor_chan_props props; int ret = 0; memset(&props, 0, sizeof(props)); vmbus_chan_cpu_rr(new_channel); ret = vmbus_chan_open(new_channel, sc->hs_drv_props->drv_ringbuffer_size, sc->hs_drv_props->drv_ringbuffer_size, (void *)&props, sizeof(struct vmstor_chan_props), hv_storvsc_on_channel_callback, sc); } /** * @brief Send multi-channel creation request to host * * @param device a Hyper-V device pointer * @param max_chans the max channels supported by vmbus */ static void storvsc_send_multichannel_request(struct storvsc_softc *sc, int max_subch) { struct vmbus_channel **subchan; struct hv_storvsc_request *request; struct vstor_packet *vstor_packet; int request_subch; int ret, i; /* get sub-channel count that need to create */ request_subch = MIN(max_subch, mp_ncpus - 1); request = &sc->hs_init_req; /* request the host to create multi-channel */ memset(request, 0, sizeof(struct hv_storvsc_request)); sema_init(&request->synch_sema, 0, ("stor_synch_sema")); vstor_packet = &request->vstor_packet; vstor_packet->operation = VSTOR_OPERATION_CREATE_MULTI_CHANNELS; 
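	/*
	 * The exchange is synchronous: the channel callback posts
	 * synch_sema when the host acknowledges the request, so the
	 * sema_wait() below blocks until the sub-channel offers have
	 * been processed.
	 */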
vstor_packet->flags = REQUEST_COMPLETION_FLAG; vstor_packet->u.multi_channels_cnt = request_subch; ret = vmbus_chan_send(sc->hs_chan, VMBUS_CHANPKT_TYPE_INBAND, VMBUS_CHANPKT_FLAG_RC, vstor_packet, VSTOR_PKT_SIZE, (uint64_t)(uintptr_t)request); sema_wait(&request->synch_sema); if (vstor_packet->operation != VSTOR_OPERATION_COMPLETEIO || vstor_packet->status != 0) { printf("Storvsc_error: create multi-channel invalid operation " "(%d) or statue (%u)\n", vstor_packet->operation, vstor_packet->status); return; } /* Update channel count */ sc->hs_nchan = request_subch + 1; /* Wait for sub-channels setup to complete. */ subchan = vmbus_subchan_get(sc->hs_chan, request_subch); /* Attach the sub-channels. */ for (i = 0; i < request_subch; ++i) storvsc_subchan_attach(sc, subchan[i]); /* Release the sub-channels. */ vmbus_subchan_rel(subchan, request_subch); if (bootverbose) printf("Storvsc create multi-channel success!\n"); } /** * @brief initialize channel connection to parent partition * * @param dev a Hyper-V device pointer * @returns 0 on success, non-zero error on failure */ static int hv_storvsc_channel_init(struct storvsc_softc *sc) { int ret = 0, i; struct hv_storvsc_request *request; struct vstor_packet *vstor_packet; uint16_t max_subch; boolean_t support_multichannel; uint32_t version; max_subch = 0; support_multichannel = FALSE; request = &sc->hs_init_req; memset(request, 0, sizeof(struct hv_storvsc_request)); vstor_packet = &request->vstor_packet; request->softc = sc; /** * Initiate the vsc/vsp initialization protocol on the open channel */ sema_init(&request->synch_sema, 0, ("stor_synch_sema")); vstor_packet->operation = VSTOR_OPERATION_BEGININITIALIZATION; vstor_packet->flags = REQUEST_COMPLETION_FLAG; ret = vmbus_chan_send(sc->hs_chan, VMBUS_CHANPKT_TYPE_INBAND, VMBUS_CHANPKT_FLAG_RC, vstor_packet, VSTOR_PKT_SIZE, (uint64_t)(uintptr_t)request); if (ret != 0) goto cleanup; sema_wait(&request->synch_sema); if (vstor_packet->operation != VSTOR_OPERATION_COMPLETEIO || vstor_packet->status != 0) { goto cleanup; } for (i = 0; i < nitems(vmstor_proto_list); i++) { /* reuse the packet for version range supported */ memset(vstor_packet, 0, sizeof(struct vstor_packet)); vstor_packet->operation = VSTOR_OPERATION_QUERYPROTOCOLVERSION; vstor_packet->flags = REQUEST_COMPLETION_FLAG; vstor_packet->u.version.major_minor = vmstor_proto_list[i].proto_version; /* revision is only significant for Windows guests */ vstor_packet->u.version.revision = 0; ret = vmbus_chan_send(sc->hs_chan, VMBUS_CHANPKT_TYPE_INBAND, VMBUS_CHANPKT_FLAG_RC, vstor_packet, VSTOR_PKT_SIZE, (uint64_t)(uintptr_t)request); if (ret != 0) goto cleanup; sema_wait(&request->synch_sema); if (vstor_packet->operation != VSTOR_OPERATION_COMPLETEIO) { ret = EINVAL; goto cleanup; } if (vstor_packet->status == 0) { vmstor_proto_version = vmstor_proto_list[i].proto_version; sense_buffer_size = vmstor_proto_list[i].sense_buffer_size; vmscsi_size_delta = vmstor_proto_list[i].vmscsi_size_delta; break; } } if (vstor_packet->status != 0) { ret = EINVAL; goto cleanup; } /** * Query channel properties */ memset(vstor_packet, 0, sizeof(struct vstor_packet)); vstor_packet->operation = VSTOR_OPERATION_QUERYPROPERTIES; vstor_packet->flags = REQUEST_COMPLETION_FLAG; ret = vmbus_chan_send(sc->hs_chan, VMBUS_CHANPKT_TYPE_INBAND, VMBUS_CHANPKT_FLAG_RC, vstor_packet, VSTOR_PKT_SIZE, (uint64_t)(uintptr_t)request); if ( ret != 0) goto cleanup; sema_wait(&request->synch_sema); /* TODO: Check returned version */ if (vstor_packet->operation != 
        VSTOR_OPERATION_COMPLETEIO ||
        vstor_packet->status != 0) {
        goto cleanup;
    }

    max_subch = vstor_packet->u.chan_props.max_channel_cnt;
    if (hv_storvsc_chan_cnt > 0 && hv_storvsc_chan_cnt < (max_subch + 1))
        max_subch = hv_storvsc_chan_cnt - 1;

    /* The multi-channel feature is supported on WIN8 and later hosts. */
    version = VMBUS_GET_VERSION(device_get_parent(sc->hs_dev), sc->hs_dev);
    if (version != VMBUS_VERSION_WIN7 && version != VMBUS_VERSION_WS2008 &&
        (vstor_packet->u.chan_props.flags &
         HV_STORAGE_SUPPORTS_MULTI_CHANNEL)) {
        support_multichannel = TRUE;
    }
    if (bootverbose) {
        device_printf(sc->hs_dev, "max chans %d%s\n", max_subch + 1,
            support_multichannel ? ", multi-chan capable" : "");
    }

    memset(vstor_packet, 0, sizeof(struct vstor_packet));
    vstor_packet->operation = VSTOR_OPERATION_ENDINITIALIZATION;
    vstor_packet->flags = REQUEST_COMPLETION_FLAG;

    ret = vmbus_chan_send(sc->hs_chan,
        VMBUS_CHANPKT_TYPE_INBAND, VMBUS_CHANPKT_FLAG_RC,
        vstor_packet, VSTOR_PKT_SIZE, (uint64_t)(uintptr_t)request);

    if (ret != 0) {
        goto cleanup;
    }

    sema_wait(&request->synch_sema);

    if (vstor_packet->operation != VSTOR_OPERATION_COMPLETEIO ||
        vstor_packet->status != 0)
        goto cleanup;

    /*
     * If multi-channel is supported, send a multichannel create
     * request to the host.
     */
    if (support_multichannel && max_subch > 0)
        storvsc_send_multichannel_request(sc, max_subch);

cleanup:
    sema_destroy(&request->synch_sema);
    return (ret);
}

/**
 * @brief Open channel connection to the parent partition StorVSP driver
 *
 * Open and initialize the channel connection to the parent partition
 * StorVSP driver.
 *
 * @param sc the storvsc softc
 * @returns 0 on success, non-zero error on failure
 */
static int
hv_storvsc_connect_vsp(struct storvsc_softc *sc)
{
    int ret = 0;
    struct vmstor_chan_props props;

    memset(&props, 0, sizeof(struct vmstor_chan_props));

    /*
     * Open the channel
     */
    vmbus_chan_cpu_rr(sc->hs_chan);
    ret = vmbus_chan_open(
        sc->hs_chan,
        sc->hs_drv_props->drv_ringbuffer_size,
        sc->hs_drv_props->drv_ringbuffer_size,
        (void *)&props,
        sizeof(struct vmstor_chan_props),
        hv_storvsc_on_channel_callback, sc);
    if (ret != 0) {
        return (ret);
    }

    ret = hv_storvsc_channel_init(sc);
    return (ret);
}

#if HVS_HOST_RESET
static int
hv_storvsc_host_reset(struct storvsc_softc *sc)
{
    int ret = 0;

    struct hv_storvsc_request *request;
    struct vstor_packet *vstor_packet;

    request = &sc->hs_reset_req;
    request->softc = sc;
    vstor_packet = &request->vstor_packet;

    sema_init(&request->synch_sema, 0, "stor synch sema");

    vstor_packet->operation = VSTOR_OPERATION_RESETBUS;
    vstor_packet->flags = REQUEST_COMPLETION_FLAG;

    ret = vmbus_chan_send(sc->hs_chan,
        VMBUS_CHANPKT_TYPE_INBAND, VMBUS_CHANPKT_FLAG_RC,
        vstor_packet, VSTOR_PKT_SIZE,
        (uint64_t)(uintptr_t)&sc->hs_reset_req);

    if (ret != 0) {
        goto cleanup;
    }

    sema_wait(&request->synch_sema);

    /*
     * At this point, all outstanding requests in the adapter
     * should have been flushed out and returned to us.
     */

cleanup:
    sema_destroy(&request->synch_sema);
    return (ret);
}
#endif /* HVS_HOST_RESET */

/**
 * @brief Function to initiate an I/O request
 *
 * @param sc the storvsc softc
 * @param request pointer to a request structure
 * @returns 0 on success, non-zero error on failure
 */
static int
hv_storvsc_io_request(struct storvsc_softc *sc,
    struct hv_storvsc_request *request)
{
    struct vstor_packet *vstor_packet = &request->vstor_packet;
    struct vmbus_channel *outgoing_channel = NULL;
    int ret = 0, ch_sel;

    vstor_packet->flags |= REQUEST_COMPLETION_FLAG;

    vstor_packet->u.vm_srb.length =
        sizeof(struct vmscsi_req) - vmscsi_size_delta;
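    /*
     * A short sketch of the channel selection performed a few lines
     * below (all fields are the driver's own):
     *
     *	ch_sel = (lun + curcpu) % sc->hs_nchan;
     *
     * so requests for a given LUN issued on different CPUs are spread
     * round-robin across the primary channel and the sub-channels.
     */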
    vstor_packet->u.vm_srb.sense_info_len = sense_buffer_size;

    vstor_packet->u.vm_srb.transfer_len =
        request->prp_list.gpa_range.gpa_len;

    vstor_packet->operation = VSTOR_OPERATION_EXECUTESRB;

    ch_sel = (vstor_packet->u.vm_srb.lun + curcpu) % sc->hs_nchan;
    outgoing_channel = sc->hs_sel_chan[ch_sel];

    mtx_unlock(&request->softc->hs_lock);
    if (request->prp_list.gpa_range.gpa_len) {
        ret = vmbus_chan_send_prplist(outgoing_channel,
            &request->prp_list.gpa_range, request->prp_cnt,
            vstor_packet, VSTOR_PKT_SIZE, (uint64_t)(uintptr_t)request);
    } else {
        ret = vmbus_chan_send(outgoing_channel,
            VMBUS_CHANPKT_TYPE_INBAND, VMBUS_CHANPKT_FLAG_RC,
            vstor_packet, VSTOR_PKT_SIZE, (uint64_t)(uintptr_t)request);
    }
    /* Count successful sends on each channel for the sysctl statistics. */
    if (!ret) {
        sc->sysctl_data.chan_send_cnt[ch_sel]++;
    }
    mtx_lock(&request->softc->hs_lock);

    if (ret != 0) {
        printf("Unable to send packet %p ret %d\n", vstor_packet, ret);
    } else {
        atomic_add_int(&sc->hs_num_out_reqs, 1);
    }

    return (ret);
}

/**
 * Process a COMPLETEIO response from the host and ready the result
 * for upper-layer processing by the CAM layer.
 */
static void
hv_storvsc_on_iocompletion(struct storvsc_softc *sc,
    struct vstor_packet *vstor_packet, struct hv_storvsc_request *request)
{
    struct vmscsi_req *vm_srb;

    vm_srb = &vstor_packet->u.vm_srb;

    /*
     * Copy some fields of the host's response into the request structure,
     * because the fields will be used later in storvsc_io_done().
     */
    request->vstor_packet.u.vm_srb.scsi_status = vm_srb->scsi_status;
    request->vstor_packet.u.vm_srb.srb_status = vm_srb->srb_status;
    request->vstor_packet.u.vm_srb.transfer_len = vm_srb->transfer_len;

    if (((vm_srb->scsi_status & 0xFF) == SCSI_STATUS_CHECK_COND) &&
        (vm_srb->srb_status & SRB_STATUS_AUTOSENSE_VALID)) {
        /* Autosense data available */

        KASSERT(vm_srb->sense_info_len <= request->sense_info_len,
            ("vm_srb->sense_info_len <= "
             "request->sense_info_len"));

        memcpy(request->sense_data, vm_srb->u.sense_data,
            vm_srb->sense_info_len);

        request->sense_info_len = vm_srb->sense_info_len;
    }

    /* Complete request by passing to the CAM layer */
    storvsc_io_done(request);
    atomic_subtract_int(&sc->hs_num_out_reqs, 1);
    if (sc->hs_drain_notify && (sc->hs_num_out_reqs == 0)) {
        sema_post(&sc->hs_drain_sema);
    }
}

static void
hv_storvsc_rescan_target(struct storvsc_softc *sc)
{
    path_id_t pathid;
    target_id_t targetid;
    union ccb *ccb;

    pathid = cam_sim_path(sc->hs_sim);
    targetid = CAM_TARGET_WILDCARD;

    /*
     * Allocate a CCB and schedule a rescan.
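     * This runs from the VMBus channel callback, so the non-sleeping
     * xpt_alloc_ccb_nowait() allocator is used below and an allocation
     * failure simply skips the rescan.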
     */
    ccb = xpt_alloc_ccb_nowait();
    if (ccb == NULL) {
        printf("unable to alloc CCB for rescan\n");
        return;
    }

    if (xpt_create_path(&ccb->ccb_h.path, NULL, pathid, targetid,
        CAM_LUN_WILDCARD) != CAM_REQ_CMP) {
        printf("unable to create path for rescan, pathid: %u, "
            "targetid: %u\n", pathid, targetid);
        xpt_free_ccb(ccb);
        return;
    }

    if (targetid == CAM_TARGET_WILDCARD)
        ccb->ccb_h.func_code = XPT_SCAN_BUS;
    else
        ccb->ccb_h.func_code = XPT_SCAN_TGT;

    xpt_rescan(ccb);
}

static void
hv_storvsc_on_channel_callback(struct vmbus_channel *channel, void *xsc)
{
    int ret = 0;
    struct storvsc_softc *sc = xsc;
    uint32_t bytes_recvd;
    uint64_t request_id;
    uint8_t packet[roundup2(sizeof(struct vstor_packet), 8)];
    struct hv_storvsc_request *request;
    struct vstor_packet *vstor_packet;

    bytes_recvd = roundup2(VSTOR_PKT_SIZE, 8);
    ret = vmbus_chan_recv(channel, packet, &bytes_recvd, &request_id);
    KASSERT(ret != ENOBUFS, ("storvsc recvbuf is not large enough"));
    /* XXX check bytes_recvd to make sure that it contains enough data */

    while ((ret == 0) && (bytes_recvd > 0)) {
        request = (struct hv_storvsc_request *)(uintptr_t)request_id;

        if ((request == &sc->hs_init_req) ||
            (request == &sc->hs_reset_req)) {
            memcpy(&request->vstor_packet, packet,
                sizeof(struct vstor_packet));
            sema_post(&request->synch_sema);
        } else {
            vstor_packet = (struct vstor_packet *)packet;
            switch (vstor_packet->operation) {
            case VSTOR_OPERATION_COMPLETEIO:
                if (request == NULL)
                    panic("VMBUS: storvsc received a "
                        "packet with NULL request id in "
                        "COMPLETEIO operation.");

                hv_storvsc_on_iocompletion(sc,
                    vstor_packet, request);
                break;
            case VSTOR_OPERATION_REMOVEDEVICE:
                printf("VMBUS: storvsc operation %d not "
                    "implemented.\n", vstor_packet->operation);
                /* TODO: implement */
                break;
            case VSTOR_OPERATION_ENUMERATE_BUS:
                hv_storvsc_rescan_target(sc);
                break;
            default:
                break;
            }
        }

        bytes_recvd = roundup2(VSTOR_PKT_SIZE, 8);
        ret = vmbus_chan_recv(channel, packet, &bytes_recvd,
            &request_id);
        KASSERT(ret != ENOBUFS,
            ("storvsc recvbuf is not large enough"));
        /*
         * XXX check bytes_recvd to make sure that it contains
         * enough data
         */
    }
}

/**
 * @brief StorVSC probe function
 *
 * Device probe function.  Returns 0 if the input device is a StorVSC
 * device.  Otherwise, ENXIO is returned.  If the input device is a
 * BlkVSC (paravirtual IDE) device and this support is disabled in
 * favor of the emulated ATA/IDE device, return ENXIO.
 *
 * @param dev a device
 * @returns 0 on success, ENXIO if not a matching StorVSC device
 */
static int
storvsc_probe(device_t dev)
{
    int ret = ENXIO;

    switch (storvsc_get_storage_type(dev)) {
    case DRIVER_BLKVSC:
        if (bootverbose)
            device_printf(dev,
                "Enlightened ATA/IDE detected\n");
        device_set_desc(dev, g_drv_props_table[DRIVER_BLKVSC].drv_desc);
        ret = BUS_PROBE_DEFAULT;
        break;
    case DRIVER_STORVSC:
        if (bootverbose)
            device_printf(dev, "Enlightened SCSI device detected\n");
        device_set_desc(dev, g_drv_props_table[DRIVER_STORVSC].drv_desc);
        ret = BUS_PROBE_DEFAULT;
        break;
    default:
        ret = ENXIO;
    }

    return (ret);
}

static void
storvsc_create_chan_sel(struct storvsc_softc *sc)
{
    struct vmbus_channel **subch;
    int i, nsubch;

    sc->hs_sel_chan[0] = sc->hs_chan;
    nsubch = sc->hs_nchan - 1;
    if (nsubch == 0)
        return;

    subch = vmbus_subchan_get(sc->hs_chan, nsubch);
    for (i = 0; i < nsubch; i++)
        sc->hs_sel_chan[i + 1] = subch[i];
    vmbus_subchan_rel(subch, nsubch);
}

static int
storvsc_init_requests(device_t dev)
{
    struct storvsc_softc *sc = device_get_softc(dev);
    struct hv_storvsc_request *reqp;
    int error, i;

    LIST_INIT(&sc->hs_free_list);

    error = bus_dma_tag_create(
        bus_get_dma_tag(dev),		/* parent */
        1,				/* alignment */
        PAGE_SIZE,			/* boundary */
        BUS_SPACE_MAXADDR,		/* lowaddr */
        BUS_SPACE_MAXADDR,		/* highaddr */
        NULL, NULL,			/* filter, filterarg */
        STORVSC_DATA_SIZE_MAX,		/* maxsize */
        STORVSC_DATA_SEGCNT_MAX,	/* nsegments */
        STORVSC_DATA_SEGSZ_MAX,		/* maxsegsize */
        0,				/* flags */
        NULL,				/* lockfunc */
        NULL,				/* lockfuncarg */
        &sc->storvsc_req_dtag);
    if (error) {
        device_printf(dev, "failed to create storvsc dma tag\n");
        return (error);
    }

    for (i = 0; i < sc->hs_drv_props->drv_max_ios_per_target; ++i) {
        reqp = malloc(sizeof(struct hv_storvsc_request),
            M_DEVBUF, M_WAITOK|M_ZERO);
        reqp->softc = sc;
        error = bus_dmamap_create(sc->storvsc_req_dtag, 0,
            &reqp->data_dmap);
        if (error) {
            device_printf(dev, "failed to allocate storvsc "
                "data dmamap\n");
            goto cleanup;
        }
        LIST_INSERT_HEAD(&sc->hs_free_list, reqp, link);
    }
    return (0);

cleanup:
    while ((reqp = LIST_FIRST(&sc->hs_free_list)) != NULL) {
        LIST_REMOVE(reqp, link);
        bus_dmamap_destroy(sc->storvsc_req_dtag, reqp->data_dmap);
        free(reqp, M_DEVBUF);
    }
    return (error);
}

static void
storvsc_sysctl(device_t dev)
{
    struct sysctl_oid_list *child;
    struct sysctl_ctx_list *ctx;
    struct sysctl_oid *ch_tree, *chid_tree;
    struct storvsc_softc *sc;
    char name[16];
    int i;

    sc = device_get_softc(dev);
    ctx = device_get_sysctl_ctx(dev);
    child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev));

    SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "data_bio_cnt", CTLFLAG_RW,
        &sc->sysctl_data.data_bio_cnt, "# of bio data blocks");
    SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "data_vaddr_cnt", CTLFLAG_RW,
        &sc->sysctl_data.data_vaddr_cnt, "# of vaddr data blocks");
    SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "data_sg_cnt", CTLFLAG_RW,
        &sc->sysctl_data.data_sg_cnt, "# of sg data blocks");

    /* dev.storvsc.UNIT.channel */
    ch_tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "channel",
        CTLFLAG_RD | CTLFLAG_MPSAFE, 0, "");
    if (ch_tree == NULL)
        return;

    for (i = 0; i < sc->hs_nchan; i++) {
        uint32_t ch_id;

        ch_id = vmbus_chan_id(sc->hs_sel_chan[i]);
        snprintf(name, sizeof(name), "%d", ch_id);
        /* dev.storvsc.UNIT.channel.CHID */
        chid_tree = SYSCTL_ADD_NODE(ctx, SYSCTL_CHILDREN(ch_tree),
            OID_AUTO, name, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, "");
        if (chid_tree == NULL)
            return;
        /* dev.storvsc.UNIT.channel.CHID.send_req */
        SYSCTL_ADD_ULONG(ctx, SYSCTL_CHILDREN(chid_tree), OID_AUTO,
            "send_req", CTLFLAG_RD,
            &sc->sysctl_data.chan_send_cnt[i], "# of requests sent on this 
channel"); } } /** * @brief StorVSC attach function * * Function responsible for allocating per-device structures, * setting up CAM interfaces and scanning for available LUNs to * be used for SCSI device peripherals. * * @param a device * @returns 0 on success or an error on failure */ static int storvsc_attach(device_t dev) { enum hv_storage_type stor_type; struct storvsc_softc *sc; struct cam_devq *devq; int ret, i, j; struct hv_storvsc_request *reqp; struct root_hold_token *root_mount_token = NULL; struct hv_sgl_node *sgl_node = NULL; void *tmp_buff = NULL; /* * We need to serialize storvsc attach calls. */ root_mount_token = root_mount_hold("storvsc"); sc = device_get_softc(dev); sc->hs_nchan = 1; sc->hs_chan = vmbus_get_channel(dev); stor_type = storvsc_get_storage_type(dev); if (stor_type == DRIVER_UNKNOWN) { ret = ENODEV; goto cleanup; } /* fill in driver specific properties */ sc->hs_drv_props = &g_drv_props_table[stor_type]; sc->hs_drv_props->drv_ringbuffer_size = hv_storvsc_ringbuffer_size; sc->hs_drv_props->drv_max_ios_per_target = MIN(STORVSC_MAX_IO, hv_storvsc_max_io); if (bootverbose) { printf("storvsc ringbuffer size: %d, max_io: %d\n", sc->hs_drv_props->drv_ringbuffer_size, sc->hs_drv_props->drv_max_ios_per_target); } /* fill in device specific properties */ sc->hs_unit = device_get_unit(dev); sc->hs_dev = dev; mtx_init(&sc->hs_lock, "hvslck", NULL, MTX_DEF); ret = storvsc_init_requests(dev); if (ret != 0) goto cleanup; /* create sg-list page pool */ if (FALSE == g_hv_sgl_page_pool.is_init) { g_hv_sgl_page_pool.is_init = TRUE; LIST_INIT(&g_hv_sgl_page_pool.in_use_sgl_list); LIST_INIT(&g_hv_sgl_page_pool.free_sgl_list); /* * Pre-create SG list, each SG list with * STORVSC_DATA_SEGCNT_MAX segments, each * segment has one page buffer */ for (i = 0; i < sc->hs_drv_props->drv_max_ios_per_target; i++) { sgl_node = malloc(sizeof(struct hv_sgl_node), M_DEVBUF, M_WAITOK|M_ZERO); sgl_node->sgl_data = sglist_alloc(STORVSC_DATA_SEGCNT_MAX, M_WAITOK|M_ZERO); for (j = 0; j < STORVSC_DATA_SEGCNT_MAX; j++) { tmp_buff = malloc(PAGE_SIZE, M_DEVBUF, M_WAITOK|M_ZERO); sgl_node->sgl_data->sg_segs[j].ss_paddr = (vm_paddr_t)tmp_buff; } LIST_INSERT_HEAD(&g_hv_sgl_page_pool.free_sgl_list, sgl_node, link); } } sc->hs_destroy = FALSE; sc->hs_drain_notify = FALSE; sema_init(&sc->hs_drain_sema, 0, "Store Drain Sema"); ret = hv_storvsc_connect_vsp(sc); if (ret != 0) { goto cleanup; } /* Construct cpu to channel mapping */ storvsc_create_chan_sel(sc); /* * Create the device queue. * Hyper-V maps each target to one SCSI HBA */ devq = cam_simq_alloc(sc->hs_drv_props->drv_max_ios_per_target); if (devq == NULL) { device_printf(dev, "Failed to alloc device queue\n"); ret = ENOMEM; goto cleanup; } sc->hs_sim = cam_sim_alloc(storvsc_action, storvsc_poll, sc->hs_drv_props->drv_name, sc, sc->hs_unit, &sc->hs_lock, 1, sc->hs_drv_props->drv_max_ios_per_target, devq); if (sc->hs_sim == NULL) { device_printf(dev, "Failed to alloc sim\n"); cam_simq_free(devq); ret = ENOMEM; goto cleanup; } mtx_lock(&sc->hs_lock); /* bus_id is set to 0, need to get it from VMBUS channel query? 
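     * (For now, xpt_bus_register() below always registers bus 0, and
     * nothing in this driver consumes a non-zero bus id yet.)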
*/ if (xpt_bus_register(sc->hs_sim, dev, 0) != CAM_SUCCESS) { cam_sim_free(sc->hs_sim, /*free_devq*/TRUE); mtx_unlock(&sc->hs_lock); device_printf(dev, "Unable to register SCSI bus\n"); ret = ENXIO; goto cleanup; } if (xpt_create_path(&sc->hs_path, /*periph*/NULL, cam_sim_path(sc->hs_sim), CAM_TARGET_WILDCARD, CAM_LUN_WILDCARD) != CAM_REQ_CMP) { xpt_bus_deregister(cam_sim_path(sc->hs_sim)); cam_sim_free(sc->hs_sim, /*free_devq*/TRUE); mtx_unlock(&sc->hs_lock); device_printf(dev, "Unable to create path\n"); ret = ENXIO; goto cleanup; } mtx_unlock(&sc->hs_lock); storvsc_sysctl(dev); root_mount_rel(root_mount_token); return (0); cleanup: root_mount_rel(root_mount_token); while (!LIST_EMPTY(&sc->hs_free_list)) { reqp = LIST_FIRST(&sc->hs_free_list); LIST_REMOVE(reqp, link); bus_dmamap_destroy(sc->storvsc_req_dtag, reqp->data_dmap); free(reqp, M_DEVBUF); } while (!LIST_EMPTY(&g_hv_sgl_page_pool.free_sgl_list)) { sgl_node = LIST_FIRST(&g_hv_sgl_page_pool.free_sgl_list); LIST_REMOVE(sgl_node, link); for (j = 0; j < STORVSC_DATA_SEGCNT_MAX; j++) { if (NULL != (void*)sgl_node->sgl_data->sg_segs[j].ss_paddr) { free((void*)sgl_node->sgl_data->sg_segs[j].ss_paddr, M_DEVBUF); } } sglist_free(sgl_node->sgl_data); free(sgl_node, M_DEVBUF); } return (ret); } /** * @brief StorVSC device detach function * * This function is responsible for safely detaching a * StorVSC device. This includes waiting for inbound responses * to complete and freeing associated per-device structures. * * @param dev a device * returns 0 on success */ static int storvsc_detach(device_t dev) { struct storvsc_softc *sc = device_get_softc(dev); struct hv_storvsc_request *reqp = NULL; struct hv_sgl_node *sgl_node = NULL; int j = 0; sc->hs_destroy = TRUE; /* * At this point, all outbound traffic should be disabled. We * only allow inbound traffic (responses) to proceed so that * outstanding requests can be completed. */ sc->hs_drain_notify = TRUE; sema_wait(&sc->hs_drain_sema); sc->hs_drain_notify = FALSE; /* * Since we have already drained, we don't need to busy wait. * The call to close the channel will reset the callback * under the protection of the incoming channel lock. */ vmbus_chan_close(sc->hs_chan); mtx_lock(&sc->hs_lock); while (!LIST_EMPTY(&sc->hs_free_list)) { reqp = LIST_FIRST(&sc->hs_free_list); LIST_REMOVE(reqp, link); bus_dmamap_destroy(sc->storvsc_req_dtag, reqp->data_dmap); free(reqp, M_DEVBUF); } mtx_unlock(&sc->hs_lock); while (!LIST_EMPTY(&g_hv_sgl_page_pool.free_sgl_list)) { sgl_node = LIST_FIRST(&g_hv_sgl_page_pool.free_sgl_list); LIST_REMOVE(sgl_node, link); for (j = 0; j < STORVSC_DATA_SEGCNT_MAX; j++){ if (NULL != (void*)sgl_node->sgl_data->sg_segs[j].ss_paddr) { free((void*)sgl_node->sgl_data->sg_segs[j].ss_paddr, M_DEVBUF); } } sglist_free(sgl_node->sgl_data); free(sgl_node, M_DEVBUF); } return (0); } #if HVS_TIMEOUT_TEST /** * @brief unit test for timed out operations * * This function provides unit testing capability to simulate * timed out operations. Recompilation with HV_TIMEOUT_TEST=1 * is required. 
* * @param reqp pointer to a request structure * @param opcode SCSI operation being performed * @param wait if 1, wait for I/O to complete */ static void storvsc_timeout_test(struct hv_storvsc_request *reqp, uint8_t opcode, int wait) { int ret; union ccb *ccb = reqp->ccb; struct storvsc_softc *sc = reqp->softc; if (reqp->vstor_packet.vm_srb.cdb[0] != opcode) { return; } if (wait) { mtx_lock(&reqp->event.mtx); } ret = hv_storvsc_io_request(sc, reqp); if (ret != 0) { if (wait) { mtx_unlock(&reqp->event.mtx); } printf("%s: io_request failed with %d.\n", __func__, ret); ccb->ccb_h.status = CAM_PROVIDE_FAIL; mtx_lock(&sc->hs_lock); storvsc_free_request(sc, reqp); xpt_done(ccb); mtx_unlock(&sc->hs_lock); return; } if (wait) { xpt_print(ccb->ccb_h.path, "%u: %s: waiting for IO return.\n", ticks, __func__); ret = cv_timedwait(&reqp->event.cv, &reqp->event.mtx, 60*hz); mtx_unlock(&reqp->event.mtx); xpt_print(ccb->ccb_h.path, "%u: %s: %s.\n", ticks, __func__, (ret == 0)? "IO return detected" : "IO return not detected"); /* * Now both the timer handler and io done are running * simultaneously. We want to confirm the io done always * finishes after the timer handler exits. So reqp used by * timer handler is not freed or stale. Do busy loop for * another 1/10 second to make sure io done does * wait for the timer handler to complete. */ DELAY(100*1000); mtx_lock(&sc->hs_lock); xpt_print(ccb->ccb_h.path, "%u: %s: finishing, queue frozen %d, " "ccb status 0x%x scsi_status 0x%x.\n", ticks, __func__, sc->hs_frozen, ccb->ccb_h.status, ccb->csio.scsi_status); mtx_unlock(&sc->hs_lock); } } #endif /* HVS_TIMEOUT_TEST */ #ifdef notyet /** * @brief timeout handler for requests * * This function is called as a result of a callout expiring. * * @param arg pointer to a request */ static void storvsc_timeout(void *arg) { struct hv_storvsc_request *reqp = arg; struct storvsc_softc *sc = reqp->softc; union ccb *ccb = reqp->ccb; if (reqp->retries == 0) { mtx_lock(&sc->hs_lock); xpt_print(ccb->ccb_h.path, "%u: IO timed out (req=0x%p), wait for another %u secs.\n", ticks, reqp, ccb->ccb_h.timeout / 1000); cam_error_print(ccb, CAM_ESF_ALL, CAM_EPF_ALL); mtx_unlock(&sc->hs_lock); reqp->retries++; callout_reset_sbt(&reqp->callout, SBT_1MS * ccb->ccb_h.timeout, 0, storvsc_timeout, reqp, 0); #if HVS_TIMEOUT_TEST storvsc_timeout_test(reqp, SEND_DIAGNOSTIC, 0); #endif return; } mtx_lock(&sc->hs_lock); xpt_print(ccb->ccb_h.path, "%u: IO (reqp = 0x%p) did not return for %u seconds, %s.\n", ticks, reqp, ccb->ccb_h.timeout * (reqp->retries+1) / 1000, (sc->hs_frozen == 0)? "freezing the queue" : "the queue is already frozen"); if (sc->hs_frozen == 0) { sc->hs_frozen = 1; xpt_freeze_simq(xpt_path_sim(ccb->ccb_h.path), 1); } mtx_unlock(&sc->hs_lock); #if HVS_TIMEOUT_TEST storvsc_timeout_test(reqp, MODE_SELECT_10, 1); #endif } #endif /** * @brief StorVSC device poll function * * This function is responsible for servicing requests when * interrupts are disabled (i.e when we are dumping core.) * * @param sim a pointer to a CAM SCSI interface module */ static void storvsc_poll(struct cam_sim *sim) { struct storvsc_softc *sc = cam_sim_softc(sim); mtx_assert(&sc->hs_lock, MA_OWNED); mtx_unlock(&sc->hs_lock); hv_storvsc_on_channel_callback(sc->hs_chan, sc); mtx_lock(&sc->hs_lock); } /** * @brief StorVSC device action function * * This function is responsible for handling SCSI operations which * are passed from the CAM layer. The requests are in the form of * CAM control blocks which indicate the action being performed. 
* Not all actions require converting the request to a VSCSI protocol * message - these actions can be responded to by this driver. * Requests which are destined for a backend storage device are converted * to a VSCSI protocol message and sent on the channel connection associated * with this device. * * @param sim pointer to a CAM SCSI interface module * @param ccb pointer to a CAM control block */ static void storvsc_action(struct cam_sim *sim, union ccb *ccb) { struct storvsc_softc *sc = cam_sim_softc(sim); int res; mtx_assert(&sc->hs_lock, MA_OWNED); switch (ccb->ccb_h.func_code) { case XPT_PATH_INQ: { struct ccb_pathinq *cpi = &ccb->cpi; cpi->version_num = 1; cpi->hba_inquiry = PI_TAG_ABLE|PI_SDTR_ABLE; cpi->target_sprt = 0; cpi->hba_misc = PIM_NOBUSRESET; if (hv_storvsc_use_pim_unmapped) cpi->hba_misc |= PIM_UNMAPPED; cpi->maxio = STORVSC_DATA_SIZE_MAX; cpi->hba_eng_cnt = 0; cpi->max_target = STORVSC_MAX_TARGETS; cpi->max_lun = sc->hs_drv_props->drv_max_luns_per_target; cpi->initiator_id = cpi->max_target; cpi->bus_id = cam_sim_bus(sim); cpi->base_transfer_speed = 300000; cpi->transport = XPORT_SAS; cpi->transport_version = 0; cpi->protocol = PROTO_SCSI; cpi->protocol_version = SCSI_REV_SPC2; strlcpy(cpi->sim_vid, "FreeBSD", SIM_IDLEN); strlcpy(cpi->hba_vid, sc->hs_drv_props->drv_name, HBA_IDLEN); strlcpy(cpi->dev_name, cam_sim_name(sim), DEV_IDLEN); cpi->unit_number = cam_sim_unit(sim); ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; } case XPT_GET_TRAN_SETTINGS: { struct ccb_trans_settings *cts = &ccb->cts; cts->transport = XPORT_SAS; cts->transport_version = 0; cts->protocol = PROTO_SCSI; cts->protocol_version = SCSI_REV_SPC2; /* enable tag queuing and disconnected mode */ cts->proto_specific.valid = CTS_SCSI_VALID_TQ; cts->proto_specific.scsi.valid = CTS_SCSI_VALID_TQ; cts->proto_specific.scsi.flags = CTS_SCSI_FLAGS_TAG_ENB; cts->xport_specific.valid = CTS_SPI_VALID_DISC; cts->xport_specific.spi.flags = CTS_SPI_FLAGS_DISC_ENB; ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; } case XPT_SET_TRAN_SETTINGS: { ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; } case XPT_CALC_GEOMETRY:{ cam_calc_geometry(&ccb->ccg, 1); xpt_done(ccb); return; } case XPT_RESET_BUS: case XPT_RESET_DEV:{ #if HVS_HOST_RESET if ((res = hv_storvsc_host_reset(sc)) != 0) { xpt_print(ccb->ccb_h.path, "hv_storvsc_host_reset failed with %d\n", res); ccb->ccb_h.status = CAM_PROVIDE_FAIL; xpt_done(ccb); return; } ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; #else xpt_print(ccb->ccb_h.path, "%s reset not supported.\n", (ccb->ccb_h.func_code == XPT_RESET_BUS)? 
"bus" : "dev"); ccb->ccb_h.status = CAM_REQ_INVALID; xpt_done(ccb); return; #endif /* HVS_HOST_RESET */ } case XPT_SCSI_IO: case XPT_IMMED_NOTIFY: { struct hv_storvsc_request *reqp = NULL; bus_dmamap_t dmap_saved; if (ccb->csio.cdb_len == 0) { panic("cdl_len is 0\n"); } if (LIST_EMPTY(&sc->hs_free_list)) { ccb->ccb_h.status = CAM_REQUEUE_REQ; if (sc->hs_frozen == 0) { sc->hs_frozen = 1; xpt_freeze_simq(sim, /* count*/1); } xpt_done(ccb); return; } reqp = LIST_FIRST(&sc->hs_free_list); LIST_REMOVE(reqp, link); /* Save the data_dmap before reset request */ dmap_saved = reqp->data_dmap; /* XXX this is ugly */ bzero(reqp, sizeof(struct hv_storvsc_request)); /* Restore necessary bits */ reqp->data_dmap = dmap_saved; reqp->softc = sc; ccb->ccb_h.status |= CAM_SIM_QUEUED; if ((res = create_storvsc_request(ccb, reqp)) != 0) { ccb->ccb_h.status = CAM_REQ_INVALID; xpt_done(ccb); return; } #ifdef notyet if (ccb->ccb_h.timeout != CAM_TIME_INFINITY) { callout_init(&reqp->callout, 1); callout_reset_sbt(&reqp->callout, SBT_1MS * ccb->ccb_h.timeout, 0, storvsc_timeout, reqp, 0); #if HVS_TIMEOUT_TEST cv_init(&reqp->event.cv, "storvsc timeout cv"); mtx_init(&reqp->event.mtx, "storvsc timeout mutex", NULL, MTX_DEF); switch (reqp->vstor_packet.vm_srb.cdb[0]) { case MODE_SELECT_10: case SEND_DIAGNOSTIC: /* To have timer send the request. */ return; default: break; } #endif /* HVS_TIMEOUT_TEST */ } #endif if ((res = hv_storvsc_io_request(sc, reqp)) != 0) { xpt_print(ccb->ccb_h.path, "hv_storvsc_io_request failed with %d\n", res); ccb->ccb_h.status = CAM_PROVIDE_FAIL; storvsc_free_request(sc, reqp); xpt_done(ccb); return; } return; } default: ccb->ccb_h.status = CAM_REQ_INVALID; xpt_done(ccb); return; } } /** * @brief destroy bounce buffer * * This function is responsible for destroy a Scatter/Gather list * that create by storvsc_create_bounce_buffer() * * @param sgl- the Scatter/Gather need be destroy * @param sg_count- page count of the SG list. * */ static void storvsc_destroy_bounce_buffer(struct sglist *sgl) { struct hv_sgl_node *sgl_node = NULL; if (LIST_EMPTY(&g_hv_sgl_page_pool.in_use_sgl_list)) { printf("storvsc error: not enough in use sgl\n"); return; } sgl_node = LIST_FIRST(&g_hv_sgl_page_pool.in_use_sgl_list); LIST_REMOVE(sgl_node, link); sgl_node->sgl_data = sgl; LIST_INSERT_HEAD(&g_hv_sgl_page_pool.free_sgl_list, sgl_node, link); } /** * @brief create bounce buffer * * This function is responsible for create a Scatter/Gather list, * which hold several pages that can be aligned with page size. * * @param seg_count- SG-list segments count * @param write - if WRITE_TYPE, set SG list page used size to 0, * otherwise set used size to page size. * * return NULL if create failed */ static struct sglist * storvsc_create_bounce_buffer(uint16_t seg_count, int write) { int i = 0; struct sglist *bounce_sgl = NULL; unsigned int buf_len = ((write == WRITE_TYPE) ? 
        0 : PAGE_SIZE);
    struct hv_sgl_node *sgl_node = NULL;

    /* get a struct sglist from the free_sgl_list */
    if (LIST_EMPTY(&g_hv_sgl_page_pool.free_sgl_list)) {
        printf("storvsc error: not enough free sgl\n");
        return (NULL);
    }
    sgl_node = LIST_FIRST(&g_hv_sgl_page_pool.free_sgl_list);
    LIST_REMOVE(sgl_node, link);
    bounce_sgl = sgl_node->sgl_data;
    LIST_INSERT_HEAD(&g_hv_sgl_page_pool.in_use_sgl_list,
        sgl_node, link);

    bounce_sgl->sg_maxseg = seg_count;

    if (write == WRITE_TYPE)
        bounce_sgl->sg_nseg = 0;
    else
        bounce_sgl->sg_nseg = seg_count;

    for (i = 0; i < seg_count; i++)
        bounce_sgl->sg_segs[i].ss_len = buf_len;

    return (bounce_sgl);
}

/**
 * @brief copy data from an SG list to the bounce buffer
 *
 * This function copies data from the segments of a source SG list
 * into another SG list that is used as the bounce buffer.
 *
 * @param bounce_sgl the destination SG list
 * @param orig_sgl the segments of the source SG list
 * @param orig_sgl_count the number of segments
 * @param seg_bits bitmask indicating which segments need the bounce
 *		   buffer; a set bit means the segment is bounced
 */
static void
storvsc_copy_sgl_to_bounce_buf(struct sglist *bounce_sgl,
    bus_dma_segment_t *orig_sgl, unsigned int orig_sgl_count,
    uint64_t seg_bits)
{
    int src_sgl_idx = 0;

    for (src_sgl_idx = 0; src_sgl_idx < orig_sgl_count; src_sgl_idx++) {
        if (seg_bits & (1 << src_sgl_idx)) {
            memcpy((void*)bounce_sgl->sg_segs[src_sgl_idx].ss_paddr,
                (void*)orig_sgl[src_sgl_idx].ds_addr,
                orig_sgl[src_sgl_idx].ds_len);
            bounce_sgl->sg_segs[src_sgl_idx].ss_len =
                orig_sgl[src_sgl_idx].ds_len;
        }
    }
}

/**
 * @brief copy data from the SG list used as the bounce buffer back to
 *	  another SG list
 *
 * This function copies data from the SG list holding the bounce buffer
 * back into the destination SG list's segments.
 *
 * @param dest_sgl the destination SG list's segments
 * @param dest_sgl_count the number of destination SG list segments
 * @param src_sgl the source SG list
 * @param seg_bits bitmask indicating which segments of the source SG
 *		   list used the bounce buffer
 */
void
storvsc_copy_from_bounce_buf_to_sgl(bus_dma_segment_t *dest_sgl,
    unsigned int dest_sgl_count, struct sglist *src_sgl, uint64_t seg_bits)
{
    int sgl_idx = 0;

    for (sgl_idx = 0; sgl_idx < dest_sgl_count; sgl_idx++) {
        if (seg_bits & (1 << sgl_idx)) {
            memcpy((void*)(dest_sgl[sgl_idx].ds_addr),
                (void*)(src_sgl->sg_segs[sgl_idx].ss_paddr),
                src_sgl->sg_segs[sgl_idx].ss_len);
        }
    }
}

/**
 * @brief check whether an SG list needs a bounce buffer
 *
 * This function checks whether a bounce buffer is needed for the given
 * SG list.
 *
 * @param sgl the SG list's segments
 * @param sg_count the number of SG list segments
 * @param bits bitmask of the segments that need the bounce buffer
 *
 * @returns -1 if the SG list does not need a bounce buffer, 0 otherwise
 */
static int
storvsc_check_bounce_buffer_sgl(bus_dma_segment_t *sgl,
    unsigned int sg_count, uint64_t *bits)
{
    int i = 0;
    int offset = 0;
    uint64_t phys_addr = 0;
    uint64_t tmp_bits = 0;
    boolean_t found_hole = FALSE;
    boolean_t pre_aligned = TRUE;

    if (sg_count < 2) {
        return (-1);
    }

    *bits = 0;

    phys_addr = vtophys(sgl[0].ds_addr);
    offset = phys_addr - trunc_page(phys_addr);

    if (offset != 0) {
        pre_aligned = FALSE;
        tmp_bits |= 1;
    }

    for (i = 1; i < sg_count; i++) {
        phys_addr = vtophys(sgl[i].ds_addr);
        offset = phys_addr - trunc_page(phys_addr);

        if (offset == 0) {
            if (FALSE == pre_aligned) {
                /*
                 * This segment is aligned; if the previous
                 * one was not, we have found a hole.
                 */
                found_hole = TRUE;
            }
            pre_aligned = TRUE;
        } else {
            tmp_bits |= 1ULL << i;
            if (!pre_aligned) {
                if (phys_addr != vtophys(sgl[i-1].ds_addr +
                    sgl[i-1].ds_len)) {
                    /*
                     * Check whether this segment is
                     * contiguous with the previous one;
                     * if not, we have found a hole.
                     */
                    found_hole = TRUE;
                }
            } else {
                found_hole = TRUE;
            }
            pre_aligned = FALSE;
        }
    }

    if (!found_hole) {
        return (-1);
    } else {
        *bits = tmp_bits;
        return (0);
    }
}

/**
 * Copy bus_dma segments into the multi-page buffer, which requires
 * that the pages be contiguous except for the first and last ones.
 */
static void
storvsc_xferbuf_prepare(void *arg, bus_dma_segment_t *segs, int nsegs, int error)
{
    struct hv_storvsc_request *reqp = arg;
    union ccb *ccb = reqp->ccb;
    struct ccb_scsiio *csio = &ccb->csio;
    struct storvsc_gpa_range *prplist;
    int i;

    prplist = &reqp->prp_list;
    prplist->gpa_range.gpa_len = csio->dxfer_len;
    prplist->gpa_range.gpa_ofs = segs[0].ds_addr & PAGE_MASK;

    for (i = 0; i < nsegs; i++) {
#ifdef INVARIANTS
        if (nsegs > 1) {
            if (i == 0) {
                KASSERT((segs[i].ds_addr & PAGE_MASK) +
                    segs[i].ds_len == PAGE_SIZE,
                    ("invalid 1st page, ofs 0x%jx, len %zu",
                     (uintmax_t)segs[i].ds_addr,
                     segs[i].ds_len));
            } else if (i == nsegs - 1) {
                KASSERT((segs[i].ds_addr & PAGE_MASK) == 0,
                    ("invalid last page, ofs 0x%jx",
                     (uintmax_t)segs[i].ds_addr));
            } else {
                KASSERT((segs[i].ds_addr & PAGE_MASK) == 0 &&
                    segs[i].ds_len == PAGE_SIZE,
                    ("not a full page, ofs 0x%jx, len %zu",
                     (uintmax_t)segs[i].ds_addr,
                     segs[i].ds_len));
            }
        }
#endif
        prplist->gpa_page[i] = atop(segs[i].ds_addr);
    }
    reqp->prp_cnt = nsegs;
}

/**
 * @brief Fill in a request structure based on a CAM control block
 *
 * Fills in a request structure based on the contents of a CAM control
 * block.  The request structure holds the payload information for a
 * VSCSI protocol request.
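 * (In outline: the CDB and addressing fields are copied into the
 * vmscsi_req, the CAM data direction is translated into host SRB
 * flags, and the data buffer is described to the host as a list of
 * guest physical page numbers, bouncing unaligned SG segments when
 * necessary.)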
 *
 * @param ccb pointer to a CAM control block
 * @param reqp pointer to a request structure
 */
static int
create_storvsc_request(union ccb *ccb, struct hv_storvsc_request *reqp)
{
    struct ccb_scsiio *csio = &ccb->csio;
    uint64_t phys_addr;
    uint32_t pfn;
    uint64_t not_aligned_seg_bits = 0;
    int error;

    /* refer to struct vmscsi_req for the meanings of these two fields */
    reqp->vstor_packet.u.vm_srb.port =
        cam_sim_unit(xpt_path_sim(ccb->ccb_h.path));
    reqp->vstor_packet.u.vm_srb.path_id =
        cam_sim_bus(xpt_path_sim(ccb->ccb_h.path));

    reqp->vstor_packet.u.vm_srb.target_id = ccb->ccb_h.target_id;
    reqp->vstor_packet.u.vm_srb.lun = ccb->ccb_h.target_lun;

    reqp->vstor_packet.u.vm_srb.cdb_len = csio->cdb_len;
    if (ccb->ccb_h.flags & CAM_CDB_POINTER) {
        memcpy(&reqp->vstor_packet.u.vm_srb.u.cdb, csio->cdb_io.cdb_ptr,
            csio->cdb_len);
    } else {
        memcpy(&reqp->vstor_packet.u.vm_srb.u.cdb, csio->cdb_io.cdb_bytes,
            csio->cdb_len);
    }

    if (hv_storvsc_use_win8ext_flags) {
        reqp->vstor_packet.u.vm_srb.win8_extension.time_out_value = 60;
        reqp->vstor_packet.u.vm_srb.win8_extension.srb_flags |=
            SRB_FLAGS_DISABLE_SYNCH_TRANSFER;
    }
    switch (ccb->ccb_h.flags & CAM_DIR_MASK) {
    case CAM_DIR_OUT:
        reqp->vstor_packet.u.vm_srb.data_in = WRITE_TYPE;
        if (hv_storvsc_use_win8ext_flags) {
            reqp->vstor_packet.u.vm_srb.win8_extension.srb_flags |=
                SRB_FLAGS_DATA_OUT;
        }
        break;
    case CAM_DIR_IN:
        reqp->vstor_packet.u.vm_srb.data_in = READ_TYPE;
        if (hv_storvsc_use_win8ext_flags) {
            reqp->vstor_packet.u.vm_srb.win8_extension.srb_flags |=
                SRB_FLAGS_DATA_IN;
        }
        break;
    case CAM_DIR_NONE:
        reqp->vstor_packet.u.vm_srb.data_in = UNKNOWN_TYPE;
        if (hv_storvsc_use_win8ext_flags) {
            reqp->vstor_packet.u.vm_srb.win8_extension.srb_flags |=
                SRB_FLAGS_NO_DATA_TRANSFER;
        }
        break;
    default:
        printf("Error: unexpected data direction: 0x%x\n",
            ccb->ccb_h.flags & CAM_DIR_MASK);
        return (EINVAL);
    }

    reqp->sense_data = &csio->sense_data;
    reqp->sense_info_len = csio->sense_len;

    reqp->ccb = ccb;

    if (0 == csio->dxfer_len) {
        return (0);
    }

    switch (ccb->ccb_h.flags & CAM_DATA_MASK) {
    case CAM_DATA_BIO:
    case CAM_DATA_VADDR:
        error = bus_dmamap_load_ccb(reqp->softc->storvsc_req_dtag,
            reqp->data_dmap, ccb, storvsc_xferbuf_prepare, reqp,
            BUS_DMA_NOWAIT);
        if (error) {
            xpt_print(ccb->ccb_h.path,
                "bus_dmamap_load_ccb failed: %d\n", error);
            return (error);
        }
        if ((ccb->ccb_h.flags & CAM_DATA_MASK) == CAM_DATA_BIO)
            reqp->softc->sysctl_data.data_bio_cnt++;
        else
            reqp->softc->sysctl_data.data_vaddr_cnt++;
        break;

    case CAM_DATA_SG:
    {
        struct storvsc_gpa_range *prplist;
        int i = 0;
        int offset = 0;
        int ret;

        bus_dma_segment_t *storvsc_sglist =
            (bus_dma_segment_t *)ccb->csio.data_ptr;
        u_int16_t storvsc_sg_count = ccb->csio.sglist_cnt;

        prplist = &reqp->prp_list;
        prplist->gpa_range.gpa_len = csio->dxfer_len;

        printf("Storvsc: get SG I/O operation, %d\n",
            reqp->vstor_packet.u.vm_srb.data_in);

        if (storvsc_sg_count > STORVSC_DATA_SEGCNT_MAX) {
            printf("Storvsc: %d segments are too many; "
                "only %d segments are supported\n",
                storvsc_sg_count, STORVSC_DATA_SEGCNT_MAX);
            return (EINVAL);
        }

        /*
         * We create our own bounce buffer function currently.  Ideally
         * we should use the BUS_DMA(9) framework.  But with the
         * current BUS_DMA code there is no callback API to check the
         * page alignment of middle segments before busdma can decide
         * if a bounce buffer is needed for a particular segment.
         * There is a callback, "bus_dma_filter_t *filter", but its
         * parameters are not sufficient for the storvsc driver.
         * TODO:
         * Add a page alignment check in the BUS_DMA(9) callback.
         * Once this is complete, switch the following code to use
         * BUS_DMA(9) for storvsc bounce buffer support.
         */
        /* check if we need to create a bounce buffer */
        ret = storvsc_check_bounce_buffer_sgl(storvsc_sglist,
            storvsc_sg_count, &not_aligned_seg_bits);
        if (ret != -1) {
            reqp->bounce_sgl =
                storvsc_create_bounce_buffer(storvsc_sg_count,
                reqp->vstor_packet.u.vm_srb.data_in);
            if (NULL == reqp->bounce_sgl) {
                printf("Storvsc_error: "
                    "create bounce buffer failed.\n");
                return (ENOMEM);
            }

            reqp->bounce_sgl_count = storvsc_sg_count;
            reqp->not_aligned_seg_bits = not_aligned_seg_bits;

            /*
             * If this is a write, we need to copy the original
             * data to the bounce buffer.
             */
            if (WRITE_TYPE == reqp->vstor_packet.u.vm_srb.data_in) {
                storvsc_copy_sgl_to_bounce_buf(
                    reqp->bounce_sgl,
                    storvsc_sglist,
                    storvsc_sg_count,
                    reqp->not_aligned_seg_bits);
            }

            /* translate virtual addresses to physical frame numbers */
            if (reqp->not_aligned_seg_bits & 0x1) {
                phys_addr =
                    vtophys(reqp->bounce_sgl->sg_segs[0].ss_paddr);
            } else {
                phys_addr =
                    vtophys(storvsc_sglist[0].ds_addr);
            }
            prplist->gpa_range.gpa_ofs = phys_addr & PAGE_MASK;

            pfn = phys_addr >> PAGE_SHIFT;
            prplist->gpa_page[0] = pfn;

            for (i = 1; i < storvsc_sg_count; i++) {
                if (reqp->not_aligned_seg_bits & (1 << i)) {
                    phys_addr =
                        vtophys(reqp->bounce_sgl->sg_segs[i].ss_paddr);
                } else {
                    phys_addr =
                        vtophys(storvsc_sglist[i].ds_addr);
                }

                pfn = phys_addr >> PAGE_SHIFT;
                prplist->gpa_page[i] = pfn;
            }
            reqp->prp_cnt = i;
        } else {
            phys_addr = vtophys(storvsc_sglist[0].ds_addr);

            prplist->gpa_range.gpa_ofs = phys_addr & PAGE_MASK;

            for (i = 0; i < storvsc_sg_count; i++) {
                phys_addr =
                    vtophys(storvsc_sglist[i].ds_addr);
                pfn = phys_addr >> PAGE_SHIFT;
                prplist->gpa_page[i] = pfn;
            }
            reqp->prp_cnt = i;

            /* check whether the last segment crosses a page boundary */
            offset = phys_addr & PAGE_MASK;
            if (offset) {
                /* Add one more PRP entry */
                phys_addr =
                    vtophys(storvsc_sglist[i-1].ds_addr +
                    PAGE_SIZE - offset);
                pfn = phys_addr >> PAGE_SHIFT;
                prplist->gpa_page[i] = pfn;
                reqp->prp_cnt++;
            }

            reqp->bounce_sgl_count = 0;
        }

        reqp->softc->sysctl_data.data_sg_cnt++;
        break;
    }
    default:
        printf("Unknown flags: %d\n", ccb->ccb_h.flags);
        return (EINVAL);
    }

    return (0);
}

static uint32_t
is_scsi_valid(const struct scsi_inquiry_data *inq_data)
{
    u_int8_t type;

    type = SID_TYPE(inq_data);
    if (type == T_NODEVICE)
        return (0);
    if (SID_QUAL(inq_data) == SID_QUAL_BAD_LU)
        return (0);
    return (1);
}

/**
 * @brief completion function before returning to CAM
 *
 * The I/O has been completed and the result needs to be passed to the
 * CAM layer.  Free resources related to this request.
 *
 * @param reqp pointer to a request structure
 */
static void
storvsc_io_done(struct hv_storvsc_request *reqp)
{
    union ccb *ccb = reqp->ccb;
    struct ccb_scsiio *csio = &ccb->csio;
    struct storvsc_softc *sc = reqp->softc;
    struct vmscsi_req *vm_srb = &reqp->vstor_packet.u.vm_srb;
    bus_dma_segment_t *ori_sglist = NULL;
    int ori_sg_count = 0;
    const struct scsi_generic *cmd;

    /* destroy the bounce buffer if it was used */
    if (reqp->bounce_sgl_count) {
        ori_sglist = (bus_dma_segment_t *)ccb->csio.data_ptr;
        ori_sg_count = ccb->csio.sglist_cnt;

        /*
         * If this was a READ operation, copy the data back
         * to the original SG list.
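         * (Only the segments recorded in not_aligned_seg_bits
         * were bounced, so only those are copied back.)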
*/ if (READ_TYPE == reqp->vstor_packet.u.vm_srb.data_in) { storvsc_copy_from_bounce_buf_to_sgl(ori_sglist, ori_sg_count, reqp->bounce_sgl, reqp->not_aligned_seg_bits); } storvsc_destroy_bounce_buffer(reqp->bounce_sgl); reqp->bounce_sgl_count = 0; } if (reqp->retries > 0) { mtx_lock(&sc->hs_lock); #if HVS_TIMEOUT_TEST xpt_print(ccb->ccb_h.path, "%u: IO returned after timeout, " "waking up timer handler if any.\n", ticks); mtx_lock(&reqp->event.mtx); cv_signal(&reqp->event.cv); mtx_unlock(&reqp->event.mtx); #endif reqp->retries = 0; xpt_print(ccb->ccb_h.path, "%u: IO returned after timeout, " "stopping timer if any.\n", ticks); mtx_unlock(&sc->hs_lock); } #ifdef notyet /* * callout_drain() will wait for the timer handler to finish * if it is running. So we don't need any lock to synchronize * between this routine and the timer handler. * Note that we need to make sure reqp is not freed when timer * handler is using or will use it. */ if (ccb->ccb_h.timeout != CAM_TIME_INFINITY) { callout_drain(&reqp->callout); } #endif cmd = (const struct scsi_generic *) ((ccb->ccb_h.flags & CAM_CDB_POINTER) ? csio->cdb_io.cdb_ptr : csio->cdb_io.cdb_bytes); ccb->ccb_h.status &= ~CAM_SIM_QUEUED; ccb->ccb_h.status &= ~CAM_STATUS_MASK; int srb_status = SRB_STATUS(vm_srb->srb_status); if (vm_srb->scsi_status == SCSI_STATUS_OK) { if (srb_status != SRB_STATUS_SUCCESS) { /* * If there are errors, for example, invalid LUN, * host will inform VM through SRB status. */ if (bootverbose) { if (srb_status == SRB_STATUS_INVALID_LUN) { xpt_print(ccb->ccb_h.path, "invalid LUN %d for op: %s\n", vm_srb->lun, scsi_op_desc(cmd->opcode, NULL)); } else { xpt_print(ccb->ccb_h.path, "Unknown SRB flag: %d for op: %s\n", srb_status, scsi_op_desc(cmd->opcode, NULL)); } } - - /* - * XXX For a selection timeout, all of the LUNs - * on the target will be gone. It works for SCSI - * disks, but does not work for IDE disks. - * - * For CAM_DEV_NOT_THERE, CAM will only get - * rid of the device(s) specified by the path. - */ - if (storvsc_get_storage_type(sc->hs_dev) == - DRIVER_STORVSC) - ccb->ccb_h.status |= CAM_SEL_TIMEOUT; - else - ccb->ccb_h.status |= CAM_DEV_NOT_THERE; + ccb->ccb_h.status |= CAM_DEV_NOT_THERE; } else { ccb->ccb_h.status |= CAM_REQ_CMP; } if (cmd->opcode == INQUIRY && srb_status == SRB_STATUS_SUCCESS) { int resp_xfer_len, resp_buf_len, data_len; uint8_t *resp_buf = (uint8_t *)csio->data_ptr; struct scsi_inquiry_data *inq_data = (struct scsi_inquiry_data *)csio->data_ptr; /* Get the buffer length reported by host */ resp_xfer_len = vm_srb->transfer_len; /* Get the available buffer length */ resp_buf_len = resp_xfer_len >= 5 ? resp_buf[4] + 5 : 0; data_len = (resp_buf_len < resp_xfer_len) ? resp_buf_len : resp_xfer_len; if (bootverbose && data_len >= 5) { xpt_print(ccb->ccb_h.path, "storvsc inquiry " "(%d) [%x %x %x %x %x ... ]\n", data_len, resp_buf[0], resp_buf[1], resp_buf[2], resp_buf[3], resp_buf[4]); } /* * XXX: Hyper-V (since win2012r2) responses inquiry with * unknown version (0) for GEN-2 DVD device. * Manually set the version number to SPC3 in order to * ask CAM to continue probing with "PROBE_REPORT_LUNS". 
			 * see probedone() in scsi_xpt.c
			 */
			if (SID_TYPE(inq_data) == T_CDROM &&
			    inq_data->version == 0 &&
			    (vmstor_proto_version >=
			    VMSTOR_PROTOCOL_VERSION_WIN8)) {
				inq_data->version = SCSI_REV_SPC3;
				if (bootverbose) {
					xpt_print(ccb->ccb_h.path,
					    "set version from 0 to %d\n",
					    inq_data->version);
				}
			}
			/*
			 * XXX: Manually fix the wrong response returned from WS2012
			 */
			if (!is_scsi_valid(inq_data) &&
			    (vmstor_proto_version ==
			    VMSTOR_PROTOCOL_VERSION_WIN8_1 ||
			    vmstor_proto_version ==
			    VMSTOR_PROTOCOL_VERSION_WIN8 ||
			    vmstor_proto_version ==
			    VMSTOR_PROTOCOL_VERSION_WIN7)) {
				if (data_len >= 4 &&
				    (resp_buf[2] == 0 || resp_buf[3] == 0)) {
					resp_buf[2] = SCSI_REV_SPC3;
					resp_buf[3] = 2; /* resp fmt must be 2 */
					if (bootverbose)
						xpt_print(ccb->ccb_h.path,
						    "fix version and resp fmt for 0x%x\n",
						    vmstor_proto_version);
				}
			} else if (data_len >= SHORT_INQUIRY_LENGTH) {
				char vendor[16];

				cam_strvis(vendor, inq_data->vendor,
				    sizeof(inq_data->vendor), sizeof(vendor));
				/*
				 * XXX: Upgrade SPC2 to SPC3 if host is WIN8 or
				 * WIN2012 R2 in order to support UNMAP feature.
				 */
				if (!strncmp(vendor, "Msft", 4) &&
				    SID_ANSI_REV(inq_data) == SCSI_REV_SPC2 &&
				    (vmstor_proto_version ==
				     VMSTOR_PROTOCOL_VERSION_WIN8_1 ||
				     vmstor_proto_version ==
				     VMSTOR_PROTOCOL_VERSION_WIN8)) {
					inq_data->version = SCSI_REV_SPC3;
					if (bootverbose) {
						xpt_print(ccb->ccb_h.path,
						    "storvsc upgrades "
						    "SPC2 to SPC3\n");
					}
				}
			}
		}
	} else {
		/**
		 * On some Windows hosts the TEST_UNIT_READY command can
		 * return SRB_STATUS_ERROR and sense data, for example,
		 * asc=0x3a,1 "(Medium not present - tray closed)".  This
		 * error can be ignored since the command is reissued
		 * periodically.
		 */
		boolean_t unit_not_ready =
		    vm_srb->scsi_status == SCSI_STATUS_CHECK_COND &&
		    cmd->opcode == TEST_UNIT_READY &&
		    srb_status == SRB_STATUS_ERROR;
		if (!unit_not_ready && bootverbose) {
			mtx_lock(&sc->hs_lock);
			xpt_print(ccb->ccb_h.path,
				"storvsc scsi_status = %d, srb_status = %d\n",
				vm_srb->scsi_status, srb_status);
			mtx_unlock(&sc->hs_lock);
		}
		ccb->ccb_h.status |= CAM_SCSI_STATUS_ERROR;
	}

	ccb->csio.scsi_status = (vm_srb->scsi_status & 0xFF);
	ccb->csio.resid = ccb->csio.dxfer_len - vm_srb->transfer_len;

	if (reqp->sense_info_len != 0) {
		csio->sense_resid = csio->sense_len - reqp->sense_info_len;
		ccb->ccb_h.status |= CAM_AUTOSNS_VALID;
	}

	mtx_lock(&sc->hs_lock);
	if (reqp->softc->hs_frozen == 1) {
		xpt_print(ccb->ccb_h.path,
			"%u: storvsc unfreezing softc 0x%p.\n",
			ticks, reqp->softc);
		ccb->ccb_h.status |= CAM_RELEASE_SIMQ;
		reqp->softc->hs_frozen = 0;
	}
	storvsc_free_request(sc, reqp);
	mtx_unlock(&sc->hs_lock);

	xpt_done_direct(ccb);
}

/**
 * @brief Free a request structure
 *
 * Free a request structure by returning it to the free list
 *
 * @param sc pointer to a softc
 * @param reqp pointer to a request structure
 */
static void
storvsc_free_request(struct storvsc_softc *sc, struct hv_storvsc_request *reqp)
{

	LIST_INSERT_HEAD(&sc->hs_free_list, reqp, link);
}

/**
 * @brief Determine type of storage device from GUID
 *
 * Using the type GUID, determine if this is a StorVSC (paravirtual
 * SCSI) or BlkVSC (paravirtual IDE) device.
 *
 * @param dev a device
 * @returns the storage type (an enum hv_storage_type)
 */
static enum hv_storage_type
storvsc_get_storage_type(device_t dev)
{
	device_t parent = device_get_parent(dev);

	if (VMBUS_PROBE_GUID(parent, dev, &gBlkVscDeviceType) == 0)
		return (DRIVER_BLKVSC);
	if (VMBUS_PROBE_GUID(parent, dev, &gStorVscDeviceType) == 0)
		return (DRIVER_STORVSC);
	return (DRIVER_UNKNOWN);
}

#define	PCI_VENDOR_INTEL	0x8086
#define	PCI_PRODUCT_PIIX4	0x7111

static void
storvsc_ada_probe_veto(void *arg __unused, struct cam_path *path,
    struct ata_params *ident_buf __unused, int *veto)
{

	/*
	 * The ATA disks are shared with the controllers managed
	 * by this driver, so veto the ATA disks' attachment; the
	 * ATA disks will be attached as SCSI disks once this driver
	 * has attached.
	 */
	if (path->device->protocol == PROTO_ATA) {
		struct ccb_pathinq cpi;

		xpt_path_inq(&cpi, path);
		if (cpi.ccb_h.status == CAM_REQ_CMP &&
		    cpi.hba_vendor == PCI_VENDOR_INTEL &&
		    cpi.hba_device == PCI_PRODUCT_PIIX4) {
			(*veto)++;
			if (bootverbose) {
				xpt_print(path,
				    "Disable ATA disks on "
				    "simulated ATA controller (0x%04x%04x)\n",
				    cpi.hba_device, cpi.hba_vendor);
			}
		}
	}
}

static void
storvsc_sysinit(void *arg __unused)
{
	if (vm_guest == VM_GUEST_HV) {
		storvsc_handler_tag = EVENTHANDLER_REGISTER(ada_probe_veto,
		    storvsc_ada_probe_veto, NULL, EVENTHANDLER_PRI_ANY);
	}
}
SYSINIT(storvsc_sys_init, SI_SUB_DRIVERS, SI_ORDER_SECOND, storvsc_sysinit,
    NULL);

static void
storvsc_sysuninit(void *arg __unused)
{
	if (storvsc_handler_tag != NULL)
		EVENTHANDLER_DEREGISTER(ada_probe_veto, storvsc_handler_tag);
}
SYSUNINIT(storvsc_sys_uninit, SI_SUB_DRIVERS, SI_ORDER_SECOND,
    storvsc_sysuninit, NULL);

Index: user/markj/netdump/sys/dev/ofw/ofw_bus_subr.c
===================================================================
--- user/markj/netdump/sys/dev/ofw/ofw_bus_subr.c	(revision 332407)
+++ user/markj/netdump/sys/dev/ofw/ofw_bus_subr.c	(revision 332408)
@@ -1,980 +1,982 @@
/*-
 * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
 *
 * Copyright (c) 2001 - 2003 by Thomas Moestl .
 * Copyright (c) 2005 Marius Strobl
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions, and the following disclaimer,
 *    without modification, immediately at the beginning of the file.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR
 * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");

#include "opt_platform.h"

#include #include #include #include #include #include #include #include #include

#include "ofw_bus_if.h"

#define	OFW_COMPAT_LEN	255
#define	OFW_STATUS_LEN	16

int
ofw_bus_gen_setup_devinfo(struct ofw_bus_devinfo *obd, phandle_t node)
{

	if (obd == NULL)
		return (ENOMEM);
	/* The 'name' property is considered mandatory. */
	if ((OF_getprop_alloc(node, "name", (void **)&obd->obd_name)) == -1)
		return (EINVAL);
	OF_getprop_alloc(node, "compatible", (void **)&obd->obd_compat);
	OF_getprop_alloc(node, "device_type", (void **)&obd->obd_type);
	OF_getprop_alloc(node, "model", (void **)&obd->obd_model);
	OF_getprop_alloc(node, "status", (void **)&obd->obd_status);
	obd->obd_node = node;
	return (0);
}

void
ofw_bus_gen_destroy_devinfo(struct ofw_bus_devinfo *obd)
{

	if (obd == NULL)
		return;
	if (obd->obd_compat != NULL)
		free(obd->obd_compat, M_OFWPROP);
	if (obd->obd_model != NULL)
		free(obd->obd_model, M_OFWPROP);
	if (obd->obd_name != NULL)
		free(obd->obd_name, M_OFWPROP);
	if (obd->obd_type != NULL)
		free(obd->obd_type, M_OFWPROP);
	if (obd->obd_status != NULL)
		free(obd->obd_status, M_OFWPROP);
}

int
ofw_bus_gen_child_pnpinfo_str(device_t cbdev, device_t child, char *buf,
    size_t buflen)
{

	*buf = '\0';
	if (ofw_bus_get_name(child) != NULL) {
		strlcat(buf, "name=", buflen);
		strlcat(buf, ofw_bus_get_name(child), buflen);
	}
	if (ofw_bus_get_compat(child) != NULL) {
		strlcat(buf, " compat=", buflen);
		strlcat(buf, ofw_bus_get_compat(child), buflen);
	}
	return (0);
}

const char *
ofw_bus_gen_get_compat(device_t bus, device_t dev)
{
	const struct ofw_bus_devinfo *obd;

	obd = OFW_BUS_GET_DEVINFO(bus, dev);
	if (obd == NULL)
		return (NULL);
	return (obd->obd_compat);
}

const char *
ofw_bus_gen_get_model(device_t bus, device_t dev)
{
	const struct ofw_bus_devinfo *obd;

	obd = OFW_BUS_GET_DEVINFO(bus, dev);
	if (obd == NULL)
		return (NULL);
	return (obd->obd_model);
}

const char *
ofw_bus_gen_get_name(device_t bus, device_t dev)
{
	const struct ofw_bus_devinfo *obd;

	obd = OFW_BUS_GET_DEVINFO(bus, dev);
	if (obd == NULL)
		return (NULL);
	return (obd->obd_name);
}

phandle_t
ofw_bus_gen_get_node(device_t bus, device_t dev)
{
	const struct ofw_bus_devinfo *obd;

	obd = OFW_BUS_GET_DEVINFO(bus, dev);
	if (obd == NULL)
		return (0);
	return (obd->obd_node);
}

const char *
ofw_bus_gen_get_type(device_t bus, device_t dev)
{
	const struct ofw_bus_devinfo *obd;

	obd = OFW_BUS_GET_DEVINFO(bus, dev);
	if (obd == NULL)
		return (NULL);
	return (obd->obd_type);
}

const char *
ofw_bus_get_status(device_t dev)
{
	const struct ofw_bus_devinfo *obd;

	obd = OFW_BUS_GET_DEVINFO(device_get_parent(dev), dev);
	if (obd == NULL)
		return (NULL);

	return (obd->obd_status);
}

int
ofw_bus_status_okay(device_t dev)
{
	const char *status;

	status = ofw_bus_get_status(dev);
	if (status == NULL || strcmp(status, "okay") == 0 ||
	    strcmp(status, "ok") == 0)
		return (1);

	return (0);
}

int
ofw_bus_node_status_okay(phandle_t node)
{
	char status[OFW_STATUS_LEN];
	int len;

	len = OF_getproplen(node, "status");
	if (len <= 0)
		return (1);

	OF_getprop(node, "status", status, OFW_STATUS_LEN);
	if ((len == 5 && (bcmp(status, "okay", len) == 0)) ||
	    (len == 3 && (bcmp(status, "ok", len) == 0)))
		return (1);

	return (0);
}

static int
ofw_bus_node_is_compatible_int(const char *compat, int len,
    const char *onecompat)
{
	int onelen, l, ret;

	onelen = strlen(onecompat);
	ret = 0;
	while (len > 0) {
		if (strlen(compat) == onelen &&
		    strncasecmp(compat, onecompat, onelen) == 0) {
			/* Found it. */
			ret = 1;
			break;
		}

		/* Slide to the next sub-string.
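		 * (The "compatible" property is a sequence of NUL-terminated
		 * strings, so strlen() + 1 below steps over exactly one of
		 * them, including its terminator.)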
*/ l = strlen(compat) + 1; compat += l; len -= l; } return (ret); } int ofw_bus_node_is_compatible(phandle_t node, const char *compatstr) { char compat[OFW_COMPAT_LEN]; int len, rv; if ((len = OF_getproplen(node, "compatible")) <= 0) return (0); bzero(compat, OFW_COMPAT_LEN); if (OF_getprop(node, "compatible", compat, OFW_COMPAT_LEN) < 0) return (0); rv = ofw_bus_node_is_compatible_int(compat, len, compatstr); return (rv); } int ofw_bus_is_compatible(device_t dev, const char *onecompat) { phandle_t node; const char *compat; int len; if ((compat = ofw_bus_get_compat(dev)) == NULL) return (0); if ((node = ofw_bus_get_node(dev)) == -1) return (0); /* Get total 'compatible' prop len */ if ((len = OF_getproplen(node, "compatible")) <= 0) return (0); return (ofw_bus_node_is_compatible_int(compat, len, onecompat)); } int ofw_bus_is_compatible_strict(device_t dev, const char *compatible) { const char *compat; size_t len; if ((compat = ofw_bus_get_compat(dev)) == NULL) return (0); len = strlen(compatible); if (strlen(compat) == len && strncasecmp(compat, compatible, len) == 0) return (1); return (0); } const struct ofw_compat_data * ofw_bus_search_compatible(device_t dev, const struct ofw_compat_data *compat) { if (compat == NULL) return NULL; for (; compat->ocd_str != NULL; ++compat) { if (ofw_bus_is_compatible(dev, compat->ocd_str)) break; } return (compat); } int ofw_bus_has_prop(device_t dev, const char *propname) { phandle_t node; if ((node = ofw_bus_get_node(dev)) == -1) return (0); return (OF_hasprop(node, propname)); } void ofw_bus_setup_iinfo(phandle_t node, struct ofw_bus_iinfo *ii, int intrsz) { pcell_t addrc; int msksz; if (OF_getencprop(node, "#address-cells", &addrc, sizeof(addrc)) == -1) addrc = 2; ii->opi_addrc = addrc * sizeof(pcell_t); - ii->opi_imapsz = OF_getencprop_alloc(node, "interrupt-map", 1, + ii->opi_imapsz = OF_getencprop_alloc(node, "interrupt-map", (void **)&ii->opi_imap); if (ii->opi_imapsz > 0) { - msksz = OF_getencprop_alloc(node, "interrupt-map-mask", 1, + msksz = OF_getencprop_alloc(node, "interrupt-map-mask", (void **)&ii->opi_imapmsk); /* * Failure to get the mask is ignored; a full mask is used * then. We barf on bad mask sizes, however. */ if (msksz != -1 && msksz != ii->opi_addrc + intrsz) panic("ofw_bus_setup_iinfo: bad interrupt-map-mask " "property!"); } } int ofw_bus_lookup_imap(phandle_t node, struct ofw_bus_iinfo *ii, void *reg, int regsz, void *pintr, int pintrsz, void *mintr, int mintrsz, phandle_t *iparent) { uint8_t maskbuf[regsz + pintrsz]; int rv; if (ii->opi_imapsz <= 0) return (0); KASSERT(regsz >= ii->opi_addrc, ("ofw_bus_lookup_imap: register size too small: %d < %d", regsz, ii->opi_addrc)); if (node != -1) { rv = OF_getencprop(node, "reg", reg, regsz); if (rv < regsz) panic("ofw_bus_lookup_imap: cannot get reg property"); } return (ofw_bus_search_intrmap(pintr, pintrsz, reg, ii->opi_addrc, ii->opi_imap, ii->opi_imapsz, ii->opi_imapmsk, maskbuf, mintr, mintrsz, iparent)); } /* * Map an interrupt using the firmware reg, interrupt-map and * interrupt-map-mask properties. * The interrupt property to be mapped must be of size intrsz, and pointed to * by intr. The regs property of the node for which the mapping is done must * be passed as regs. This property is an array of register specifications; * the size of the address part of such a specification must be passed as * physsz. Only the first element of the property is used. * imap and imapsz hold the interrupt mask and it's size. 
* imapmsk is a pointer to the interrupt-map-mask property, which must have * a size of physsz + intrsz; it may be NULL, in which case a full mask is * assumed. * maskbuf must point to a buffer of length physsz + intrsz. * The interrupt is returned in result, which must point to a buffer of length * rintrsz (which gives the expected size of the mapped interrupt). * Returns number of cells in the interrupt if a mapping was found, 0 otherwise. */ int ofw_bus_search_intrmap(void *intr, int intrsz, void *regs, int physsz, void *imap, int imapsz, void *imapmsk, void *maskbuf, void *result, int rintrsz, phandle_t *iparent) { phandle_t parent; uint8_t *ref = maskbuf; uint8_t *uiintr = intr; uint8_t *uiregs = regs; uint8_t *uiimapmsk = imapmsk; uint8_t *mptr; pcell_t paddrsz; pcell_t pintrsz; int i, tsz; if (imapmsk != NULL) { for (i = 0; i < physsz; i++) ref[i] = uiregs[i] & uiimapmsk[i]; for (i = 0; i < intrsz; i++) ref[physsz + i] = uiintr[i] & uiimapmsk[physsz + i]; } else { bcopy(regs, ref, physsz); bcopy(intr, ref + physsz, intrsz); } mptr = imap; i = imapsz; paddrsz = 0; while (i > 0) { bcopy(mptr + physsz + intrsz, &parent, sizeof(parent)); #ifndef OFW_IMAP_NO_IPARENT_ADDR_CELLS /* * Find if we need to read the parent address data. * CHRP-derived OF bindings, including ePAPR-compliant FDTs, * use this as an optional part of the specifier. */ if (OF_getencprop(OF_node_from_xref(parent), "#address-cells", &paddrsz, sizeof(paddrsz)) == -1) paddrsz = 0; /* default */ paddrsz *= sizeof(pcell_t); #endif if (OF_searchencprop(OF_node_from_xref(parent), "#interrupt-cells", &pintrsz, sizeof(pintrsz)) == -1) pintrsz = 1; /* default */ pintrsz *= sizeof(pcell_t); /* Compute the map stride size. */ tsz = physsz + intrsz + sizeof(phandle_t) + paddrsz + pintrsz; KASSERT(i >= tsz, ("ofw_bus_search_intrmap: truncated map")); if (bcmp(ref, mptr, physsz + intrsz) == 0) { bcopy(mptr + physsz + intrsz + sizeof(parent) + paddrsz, result, MIN(rintrsz, pintrsz)); if (iparent != NULL) *iparent = parent; return (pintrsz/sizeof(pcell_t)); } mptr += tsz; i -= tsz; } return (0); } int ofw_bus_msimap(phandle_t node, uint16_t pci_rid, phandle_t *msi_parent, uint32_t *msi_rid) { pcell_t *map, mask, msi_base, rid_base, rid_length; ssize_t len; uint32_t masked_rid; int err, i; /* TODO: This should be OF_searchprop_alloc if we had it */ - len = OF_getencprop_alloc(node, "msi-map", sizeof(*map), (void **)&map); + len = OF_getencprop_alloc_multi(node, "msi-map", sizeof(*map), + (void **)&map); if (len < 0) { if (msi_parent != NULL) { *msi_parent = 0; OF_getencprop(node, "msi-parent", msi_parent, sizeof(*msi_parent)); } if (msi_rid != NULL) *msi_rid = pci_rid; return (0); } err = ENOENT; mask = 0xffffffff; OF_getencprop(node, "msi-map-mask", &mask, sizeof(mask)); masked_rid = pci_rid & mask; for (i = 0; i < len; i += 4) { rid_base = map[i + 0]; rid_length = map[i + 3]; if (masked_rid < rid_base || masked_rid >= (rid_base + rid_length)) continue; msi_base = map[i + 2]; if (msi_parent != NULL) *msi_parent = map[i + 1]; if (msi_rid != NULL) *msi_rid = masked_rid - rid_base + msi_base; err = 0; break; } free(map, M_OFWPROP); return (err); } static int ofw_bus_reg_to_rl_helper(device_t dev, phandle_t node, pcell_t acells, pcell_t scells, struct resource_list *rl, const char *reg_source) { uint64_t phys, size; ssize_t i, j, rid, nreg, ret; uint32_t *reg; char *name; /* * This may be just redundant when having ofw_bus_devinfo * but makes this routine independent of it. 
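 * As a worked example of the decode loop below (hypothetical values):
 * with acells = 2 and scells = 2, a reg entry <0x0 0x10000000 0x0 0x1000>
 * decodes to phys = 0x10000000 and size = 0x1000, yielding one
 * SYS_RES_MEMORY entry covering [0x10000000, 0x10000fff].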
*/ ret = OF_getprop_alloc(node, "name", (void **)&name); if (ret == -1) name = NULL; - ret = OF_getencprop_alloc(node, reg_source, sizeof(*reg), (void **)®); + ret = OF_getencprop_alloc_multi(node, reg_source, sizeof(*reg), + (void **)®); nreg = (ret == -1) ? 0 : ret; if (nreg % (acells + scells) != 0) { if (bootverbose) device_printf(dev, "Malformed reg property on <%s>\n", (name == NULL) ? "unknown" : name); nreg = 0; } for (i = 0, rid = 0; i < nreg; i += acells + scells, rid++) { phys = size = 0; for (j = 0; j < acells; j++) { phys <<= 32; phys |= reg[i + j]; } for (j = 0; j < scells; j++) { size <<= 32; size |= reg[i + acells + j]; } /* Skip the dummy reg property of glue devices like ssm(4). */ if (size != 0) resource_list_add(rl, SYS_RES_MEMORY, rid, phys, phys + size - 1, size); } free(name, M_OFWPROP); free(reg, M_OFWPROP); return (0); } int ofw_bus_reg_to_rl(device_t dev, phandle_t node, pcell_t acells, pcell_t scells, struct resource_list *rl) { return (ofw_bus_reg_to_rl_helper(dev, node, acells, scells, rl, "reg")); } int ofw_bus_assigned_addresses_to_rl(device_t dev, phandle_t node, pcell_t acells, pcell_t scells, struct resource_list *rl) { return (ofw_bus_reg_to_rl_helper(dev, node, acells, scells, rl, "assigned-addresses")); } /* * Get interrupt parent for given node. * Returns 0 if interrupt parent doesn't exist. */ phandle_t ofw_bus_find_iparent(phandle_t node) { phandle_t iparent; if (OF_searchencprop(node, "interrupt-parent", &iparent, sizeof(iparent)) == -1) { for (iparent = node; iparent != 0; iparent = OF_parent(iparent)) { if (OF_hasprop(iparent, "interrupt-controller")) break; } iparent = OF_xref_from_node(iparent); } return (iparent); } int ofw_bus_intr_to_rl(device_t dev, phandle_t node, struct resource_list *rl, int *rlen) { phandle_t iparent; uint32_t icells, *intr; int err, i, irqnum, nintr, rid; boolean_t extended; - nintr = OF_getencprop_alloc(node, "interrupts", sizeof(*intr), + nintr = OF_getencprop_alloc_multi(node, "interrupts", sizeof(*intr), (void **)&intr); if (nintr > 0) { iparent = ofw_bus_find_iparent(node); if (iparent == 0) { device_printf(dev, "No interrupt-parent found, " "assuming direct parent\n"); iparent = OF_parent(node); iparent = OF_xref_from_node(iparent); } if (OF_searchencprop(OF_node_from_xref(iparent), "#interrupt-cells", &icells, sizeof(icells)) == -1) { device_printf(dev, "Missing #interrupt-cells " "property, assuming <1>\n"); icells = 1; } if (icells < 1 || icells > nintr) { device_printf(dev, "Invalid #interrupt-cells property " "value <%d>, assuming <1>\n", icells); icells = 1; } extended = false; } else { - nintr = OF_getencprop_alloc(node, "interrupts-extended", + nintr = OF_getencprop_alloc_multi(node, "interrupts-extended", sizeof(*intr), (void **)&intr); if (nintr <= 0) return (0); extended = true; } err = 0; rid = 0; for (i = 0; i < nintr; i += icells) { if (extended) { iparent = intr[i++]; if (OF_searchencprop(OF_node_from_xref(iparent), "#interrupt-cells", &icells, sizeof(icells)) == -1) { device_printf(dev, "Missing #interrupt-cells " "property\n"); err = ENOENT; break; } if (icells < 1 || (i + icells) > nintr) { device_printf(dev, "Invalid #interrupt-cells " "property value <%d>\n", icells); err = ERANGE; break; } } irqnum = ofw_bus_map_intr(dev, iparent, icells, &intr[i]); resource_list_add(rl, SYS_RES_IRQ, rid++, irqnum, irqnum, 1); } if (rlen != NULL) *rlen = rid; free(intr, M_OFWPROP); return (err); } int ofw_bus_intr_by_rid(device_t dev, phandle_t node, int wanted_rid, phandle_t *producer, int *ncells, pcell_t 
**cells) { phandle_t iparent; uint32_t icells, *intr; int err, i, nintr, rid; boolean_t extended; - nintr = OF_getencprop_alloc(node, "interrupts", sizeof(*intr), + nintr = OF_getencprop_alloc_multi(node, "interrupts", sizeof(*intr), (void **)&intr); if (nintr > 0) { iparent = ofw_bus_find_iparent(node); if (iparent == 0) { device_printf(dev, "No interrupt-parent found, " "assuming direct parent\n"); iparent = OF_parent(node); iparent = OF_xref_from_node(iparent); } if (OF_searchencprop(OF_node_from_xref(iparent), "#interrupt-cells", &icells, sizeof(icells)) == -1) { device_printf(dev, "Missing #interrupt-cells " "property, assuming <1>\n"); icells = 1; } if (icells < 1 || icells > nintr) { device_printf(dev, "Invalid #interrupt-cells property " "value <%d>, assuming <1>\n", icells); icells = 1; } extended = false; } else { - nintr = OF_getencprop_alloc(node, "interrupts-extended", + nintr = OF_getencprop_alloc_multi(node, "interrupts-extended", sizeof(*intr), (void **)&intr); if (nintr <= 0) return (ESRCH); extended = true; } err = ESRCH; rid = 0; for (i = 0; i < nintr; i += icells, rid++) { if (extended) { iparent = intr[i++]; if (OF_searchencprop(OF_node_from_xref(iparent), "#interrupt-cells", &icells, sizeof(icells)) == -1) { device_printf(dev, "Missing #interrupt-cells " "property\n"); err = ENOENT; break; } if (icells < 1 || (i + icells) > nintr) { device_printf(dev, "Invalid #interrupt-cells " "property value <%d>\n", icells); err = ERANGE; break; } } if (rid == wanted_rid) { *cells = malloc(icells * sizeof(**cells), M_OFWPROP, M_WAITOK); *producer = iparent; *ncells= icells; memcpy(*cells, intr + i, icells * sizeof(**cells)); err = 0; break; } } free(intr, M_OFWPROP); return (err); } phandle_t ofw_bus_find_child(phandle_t start, const char *child_name) { char *name; int ret; phandle_t child; for (child = OF_child(start); child != 0; child = OF_peer(child)) { ret = OF_getprop_alloc(child, "name", (void **)&name); if (ret == -1) continue; if (strcmp(name, child_name) == 0) { free(name, M_OFWPROP); return (child); } free(name, M_OFWPROP); } return (0); } phandle_t ofw_bus_find_compatible(phandle_t node, const char *onecompat) { phandle_t child, ret; /* * Traverse all children of 'start' node, and find first with * matching 'compatible' property. */ for (child = OF_child(node); child != 0; child = OF_peer(child)) { if (ofw_bus_node_is_compatible(child, onecompat) != 0) return (child); ret = ofw_bus_find_compatible(child, onecompat); if (ret != 0) return (ret); } return (0); } /** * @brief Return child of bus whose phandle is node * * A direct child of @p will be returned if it its phandle in the * OFW tree is @p node. Otherwise, NULL is returned. * * @param bus The bus to examine * @param node The phandle_t to look for. */ device_t ofw_bus_find_child_device_by_phandle(device_t bus, phandle_t node) { device_t *children, retval, child; int nkid, i; /* * Nothing can match the flag value for no node. */ if (node == -1) return (NULL); /* * Search the children for a match. We microoptimize * a bit by not using ofw_bus_get since we already know * the parent. We do not recurse. 
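Both decoders above accept the plain and the extended interrupt encodings; only the source of #interrupt-cells differs. A sketch of the two property shapes (values hypothetical) and a typical attach-time call, with dev assumed to be the consumer device:

/*
 *	interrupts = <5 4>, <6 4>;                   one parent for the
 *	                                             whole list, icells fixed
 *	interrupts-extended = <&gic 5 4>, <&aux 7>;  per-entry parent,
 *	                                             icells re-read each time
 */
struct resource_list rl;
int nirq;

resource_list_init(&rl);
/* One SYS_RES_IRQ entry is added per decoded specifier. */
if (ofw_bus_intr_to_rl(dev, ofw_bus_get_node(dev), &rl, &nirq) == 0)
	device_printf(dev, "%d interrupt(s) mapped\n", nirq);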
*/ if (device_get_children(bus, &children, &nkid) != 0) return (NULL); retval = NULL; for (i = 0; i < nkid; i++) { child = children[i]; if (OFW_BUS_GET_NODE(bus, child) == node) { retval = child; break; } } free(children, M_TEMP); return (retval); } /* * Parse property that contain list of xrefs and values * (like standard "clocks" and "resets" properties) * Input arguments: * node - consumers device node * list_name - name of parsed list - "clocks" * cells_name - name of size property - "#clock-cells" * idx - the index of the requested list entry, or, if -1, an indication * to return the number of entries in the parsed list. * Output arguments: * producer - handle of producer * ncells - number of cells in result or the number of items in the list when * idx == -1. * cells - array of decoded cells */ static int ofw_bus_parse_xref_list_internal(phandle_t node, const char *list_name, const char *cells_name, int idx, phandle_t *producer, int *ncells, pcell_t **cells) { phandle_t pnode; phandle_t *elems; uint32_t pcells; int rv, i, j, nelems, cnt; elems = NULL; - nelems = OF_getencprop_alloc(node, list_name, sizeof(*elems), + nelems = OF_getencprop_alloc_multi(node, list_name, sizeof(*elems), (void **)&elems); if (nelems <= 0) return (ENOENT); rv = (idx == -1) ? 0 : ENOENT; for (i = 0, cnt = 0; i < nelems; i += pcells, cnt++) { pnode = elems[i++]; if (OF_getencprop(OF_node_from_xref(pnode), cells_name, &pcells, sizeof(pcells)) == -1) { printf("Missing %s property\n", cells_name); rv = ENOENT; break; } if ((i + pcells) > nelems) { printf("Invalid %s property value <%d>\n", cells_name, pcells); rv = ERANGE; break; } if (cnt == idx) { *cells= malloc(pcells * sizeof(**cells), M_OFWPROP, M_WAITOK); *producer = pnode; *ncells = pcells; for (j = 0; j < pcells; j++) (*cells)[j] = elems[i + j]; rv = 0; break; } } if (elems != NULL) free(elems, M_OFWPROP); if (idx == -1 && rv == 0) *ncells = cnt; return (rv); } /* * Parse property that contain list of xrefs and values * (like standard "clocks" and "resets" properties) * Input arguments: * node - consumers device node * list_name - name of parsed list - "clocks" * cells_name - name of size property - "#clock-cells" * idx - the index of the requested list entry (>= 0) * Output arguments: * producer - handle of producer * ncells - number of cells in result * cells - array of decoded cells */ int ofw_bus_parse_xref_list_alloc(phandle_t node, const char *list_name, const char *cells_name, int idx, phandle_t *producer, int *ncells, pcell_t **cells) { KASSERT(idx >= 0, ("ofw_bus_parse_xref_list_alloc: negative index supplied")); return (ofw_bus_parse_xref_list_internal(node, list_name, cells_name, idx, producer, ncells, cells)); } /* * Parse property that contain list of xrefs and values * (like standard "clocks" and "resets" properties) * and determine the number of items in the list * Input arguments: * node - consumers device node * list_name - name of parsed list - "clocks" * cells_name - name of size property - "#clock-cells" * Output arguments: * count - number of items in list */ int ofw_bus_parse_xref_list_get_length(phandle_t node, const char *list_name, const char *cells_name, int *count) { return (ofw_bus_parse_xref_list_internal(node, list_name, cells_name, -1, NULL, count, NULL)); } /* * Find index of string in string list property (case sensitive). 
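A sketch of a consumer pulling apart a standard "clocks" list with ofw_bus_parse_xref_list_get_length() and ofw_bus_parse_xref_list_alloc(); node is assumed and the property contents are hypothetical:

/*
 *	clocks = <&ccu 14>, <&osc>;	(&ccu: #clock-cells = <1>,
 *					 &osc: #clock-cells = <0>)
 */
phandle_t producer;
int count, ncells;
pcell_t *cells;

if (ofw_bus_parse_xref_list_get_length(node, "clocks",
    "#clock-cells", &count) == 0 &&
    ofw_bus_parse_xref_list_alloc(node, "clocks", "#clock-cells",
    0, &producer, &ncells, &cells) == 0) {
	/* count == 2, producer == xref of &ccu, ncells == 1, cells[0] == 14 */
	free(cells, M_OFWPROP);
}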
*/ int ofw_bus_find_string_index(phandle_t node, const char *list_name, const char *name, int *idx) { char *elems; int rv, i, cnt, nelems; elems = NULL; nelems = OF_getprop_alloc(node, list_name, (void **)&elems); if (nelems <= 0) return (ENOENT); rv = ENOENT; for (i = 0, cnt = 0; i < nelems; cnt++) { if (strcmp(elems + i, name) == 0) { *idx = cnt; rv = 0; break; } i += strlen(elems + i) + 1; } if (elems != NULL) free(elems, M_OFWPROP); return (rv); } /* * Create zero terminated array of strings from string list property. */ int ofw_bus_string_list_to_array(phandle_t node, const char *list_name, const char ***out_array) { char *elems, *tptr; const char **array; int i, cnt, nelems, len; elems = NULL; nelems = OF_getprop_alloc(node, list_name, (void **)&elems); if (nelems <= 0) return (nelems); /* Count number of strings. */ for (i = 0, cnt = 0; i < nelems; cnt++) i += strlen(elems + i) + 1; /* Allocate space for arrays and all strings. */ array = malloc((cnt + 1) * sizeof(char *) + nelems, M_OFWPROP, M_WAITOK); /* Get address of first string. */ tptr = (char *)(array + cnt + 1); /* Copy strings. */ memcpy(tptr, elems, nelems); free(elems, M_OFWPROP); /* Fill string pointers. */ for (i = 0, cnt = 0; i < nelems; cnt++) { len = strlen(tptr) + 1; array[cnt] = tptr; i += len; tptr += len; } array[cnt] = NULL; *out_array = array; return (cnt); } Index: user/markj/netdump/sys/dev/ofw/openfirm.c =================================================================== --- user/markj/netdump/sys/dev/ofw/openfirm.c (revision 332407) +++ user/markj/netdump/sys/dev/ofw/openfirm.c (revision 332408) @@ -1,840 +1,848 @@ /* $NetBSD: Locore.c,v 1.7 2000/08/20 07:04:59 tsubai Exp $ */ /*- * SPDX-License-Identifier: BSD-4-Clause * * Copyright (C) 1995, 1996 Wolfgang Solfrank. * Copyright (C) 1995, 1996 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /*- * Copyright (C) 2000 Benno Rice. * All rights reserved. 
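The two string-list helpers above pair naturally with *-names properties; a sketch with hypothetical contents, node assumed:

/*
 *	clocks = <&ccu 14>, <&ccu 27>;
 *	clock-names = "bus", "mod";
 */
const char **names;
int idx, n;

if (ofw_bus_find_string_index(node, "clock-names", "mod", &idx) == 0) {
	/* idx == 1: entry 1 of "clocks" is the "mod" clock. */
}
n = ofw_bus_string_list_to_array(node, "clock-names", &names);
if (n > 0) {
	/* names[0] == "bus", names[1] == "mod", names[n] == NULL */
	free(names, M_OFWPROP);
}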
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_platform.h" #include #include #include #include #include #include #include #include #include #include #include #include "ofw_if.h" static void OF_putchar(int c, void *arg); MALLOC_DEFINE(M_OFWPROP, "openfirm", "Open Firmware properties"); static ihandle_t stdout; static ofw_def_t *ofw_def_impl = NULL; static ofw_t ofw_obj; static struct ofw_kobj ofw_kernel_obj; static struct kobj_ops ofw_kernel_kops; struct xrefinfo { phandle_t xref; phandle_t node; device_t dev; SLIST_ENTRY(xrefinfo) next_entry; }; static SLIST_HEAD(, xrefinfo) xreflist = SLIST_HEAD_INITIALIZER(xreflist); static struct mtx xreflist_lock; static boolean_t xref_init_done; #define FIND_BY_XREF 0 #define FIND_BY_NODE 1 #define FIND_BY_DEV 2 /* * xref-phandle-device lookup helper routines. * * As soon as we are able to use malloc(), walk the node tree and build a list * of info that cross-references node handles, xref handles, and device_t * instances. This list exists primarily to allow association of a device_t * with an xref handle, but it is also used to speed up translation between xref * and node handles. Before malloc() is available we have to recursively search * the node tree each time we want to translate between a node and xref handle. * Afterwards we can do the translations by searching this much shorter list. */ static void xrefinfo_create(phandle_t node) { struct xrefinfo * xi; phandle_t child, xref; /* * Recursively descend from parent, looking for nodes with a property * named either "phandle", "ibm,phandle", or "linux,phandle". For each * such node found create an entry in the xreflist. */ for (child = OF_child(node); child != 0; child = OF_peer(child)) { xrefinfo_create(child); if (OF_getencprop(child, "phandle", &xref, sizeof(xref)) == -1 && OF_getencprop(child, "ibm,phandle", &xref, sizeof(xref)) == -1 && OF_getencprop(child, "linux,phandle", &xref, sizeof(xref)) == -1) continue; xi = malloc(sizeof(*xi), M_OFWPROP, M_WAITOK | M_ZERO); xi->node = child; xi->xref = xref; SLIST_INSERT_HEAD(&xreflist, xi, next_entry); } } static void xrefinfo_init(void *unsed) { /* * There is no locking during this init because it runs much earlier * than any of the clients/consumers of the xref list data, but we do * initialize the mutex that will be used for access later. 
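The list built by xrefinfo_create() above is what makes the xref translations below cheap; a sketch of what it cross-references, against a hypothetical tree fragment (serial_node is assumed to be the consumer's node handle):

/*
 *	intc: interrupt-controller {
 *		phandle = <0x1>;		xref handle
 *	};
 *	serial {
 *		interrupt-parent = <0x1>;	refers to intc by xref
 *	};
 */
phandle_t xref, node;

OF_getencprop(serial_node, "interrupt-parent", &xref, sizeof(xref));
node = OF_node_from_xref(xref);		/* intc's real node handle */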
*/ mtx_init(&xreflist_lock, "OF xreflist lock", NULL, MTX_DEF); xrefinfo_create(OF_peer(0)); xref_init_done = true; } SYSINIT(xrefinfo, SI_SUB_KMEM, SI_ORDER_ANY, xrefinfo_init, NULL); static struct xrefinfo * xrefinfo_find(uintptr_t key, int find_by) { struct xrefinfo *rv, *xi; rv = NULL; mtx_lock(&xreflist_lock); SLIST_FOREACH(xi, &xreflist, next_entry) { if ((find_by == FIND_BY_XREF && (phandle_t)key == xi->xref) || (find_by == FIND_BY_NODE && (phandle_t)key == xi->node) || (find_by == FIND_BY_DEV && key == (uintptr_t)xi->dev)) { rv = xi; break; } } mtx_unlock(&xreflist_lock); return (rv); } static struct xrefinfo * xrefinfo_add(phandle_t node, phandle_t xref, device_t dev) { struct xrefinfo *xi; xi = malloc(sizeof(*xi), M_OFWPROP, M_WAITOK); xi->node = node; xi->xref = xref; xi->dev = dev; mtx_lock(&xreflist_lock); SLIST_INSERT_HEAD(&xreflist, xi, next_entry); mtx_unlock(&xreflist_lock); return (xi); } /* * OFW install routines. Highest priority wins, equal priority also * overrides allowing last-set to win. */ SET_DECLARE(ofw_set, ofw_def_t); boolean_t OF_install(char *name, int prio) { ofw_def_t *ofwp, **ofwpp; static int curr_prio = 0; /* Allow OF layer to be uninstalled */ if (name == NULL) { ofw_def_impl = NULL; return (FALSE); } /* * Try and locate the OFW kobj corresponding to the name. */ SET_FOREACH(ofwpp, ofw_set) { ofwp = *ofwpp; if (ofwp->name && !strcmp(ofwp->name, name) && prio >= curr_prio) { curr_prio = prio; ofw_def_impl = ofwp; return (TRUE); } } return (FALSE); } /* Initializer */ int OF_init(void *cookie) { phandle_t chosen; int rv; if (ofw_def_impl == NULL) return (-1); ofw_obj = &ofw_kernel_obj; /* * Take care of compiling the selected class, and * then statically initialize the OFW object. */ kobj_class_compile_static(ofw_def_impl, &ofw_kernel_kops); kobj_init_static((kobj_t)ofw_obj, ofw_def_impl); rv = OFW_INIT(ofw_obj, cookie); if ((chosen = OF_finddevice("/chosen")) != -1) if (OF_getencprop(chosen, "stdout", &stdout, sizeof(stdout)) == -1) stdout = -1; return (rv); } static void OF_putchar(int c, void *arg __unused) { char cbuf; if (c == '\n') { cbuf = '\r'; OF_write(stdout, &cbuf, 1); } cbuf = c; OF_write(stdout, &cbuf, 1); } void OF_printf(const char *fmt, ...) { va_list va; va_start(va, fmt); (void)kvprintf(fmt, OF_putchar, NULL, 10, va); va_end(va); } /* * Generic functions */ /* Test to see if a service exists. */ int OF_test(const char *name) { if (ofw_def_impl == NULL) return (-1); return (OFW_TEST(ofw_obj, name)); } int OF_interpret(const char *cmd, int nreturns, ...) { va_list ap; cell_t slots[16]; int i = 0; int status; if (ofw_def_impl == NULL) return (-1); status = OFW_INTERPRET(ofw_obj, cmd, nreturns, slots); if (status == -1) return (status); va_start(ap, nreturns); while (i < nreturns) *va_arg(ap, cell_t *) = slots[i++]; va_end(ap); return (status); } /* * Device tree functions */ /* Return the next sibling of this node or 0. */ phandle_t OF_peer(phandle_t node) { if (ofw_def_impl == NULL) return (0); return (OFW_PEER(ofw_obj, node)); } /* Return the first child of this node or 0. */ phandle_t OF_child(phandle_t node) { if (ofw_def_impl == NULL) return (0); return (OFW_CHILD(ofw_obj, node)); } /* Return the parent of this node or 0. */ phandle_t OF_parent(phandle_t node) { if (ofw_def_impl == NULL) return (0); return (OFW_PARENT(ofw_obj, node)); } /* Return the package handle that corresponds to an instance handle. 
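OF_peer() and OF_child() below are sufficient to traverse the whole tree; a minimal sketch, with a hypothetical per-node visit() callback:

static void
walk(phandle_t node)
{
	phandle_t child;

	visit(node);		/* hypothetical per-node callback */
	for (child = OF_child(node); child != 0; child = OF_peer(child))
		walk(child);
}

/* walk(OF_peer(0)) visits every node, starting from the root. */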
*/ phandle_t OF_instance_to_package(ihandle_t instance) { if (ofw_def_impl == NULL) return (-1); return (OFW_INSTANCE_TO_PACKAGE(ofw_obj, instance)); } /* Get the length of a property of a package. */ ssize_t OF_getproplen(phandle_t package, const char *propname) { if (ofw_def_impl == NULL) return (-1); return (OFW_GETPROPLEN(ofw_obj, package, propname)); } /* Check existence of a property of a package. */ int OF_hasprop(phandle_t package, const char *propname) { return (OF_getproplen(package, propname) >= 0 ? 1 : 0); } /* Get the value of a property of a package. */ ssize_t OF_getprop(phandle_t package, const char *propname, void *buf, size_t buflen) { if (ofw_def_impl == NULL) return (-1); return (OFW_GETPROP(ofw_obj, package, propname, buf, buflen)); } ssize_t OF_getencprop(phandle_t node, const char *propname, pcell_t *buf, size_t len) { ssize_t retval; int i; KASSERT(len % 4 == 0, ("Need a multiple of 4 bytes")); retval = OF_getprop(node, propname, buf, len); if (retval <= 0) return (retval); for (i = 0; i < len/4; i++) buf[i] = be32toh(buf[i]); return (retval); } /* * Recursively search the node and its parent for the given property, working * downward from the node to the device tree root. Returns the value of the * first match. */ ssize_t OF_searchprop(phandle_t node, const char *propname, void *buf, size_t len) { ssize_t rv; for (; node != 0; node = OF_parent(node)) if ((rv = OF_getprop(node, propname, buf, len)) != -1) return (rv); return (-1); } ssize_t OF_searchencprop(phandle_t node, const char *propname, pcell_t *buf, size_t len) { ssize_t rv; for (; node != 0; node = OF_parent(node)) if ((rv = OF_getencprop(node, propname, buf, len)) != -1) return (rv); return (-1); } /* * Store the value of a property of a package into newly allocated memory * (using the M_OFWPROP malloc pool and M_WAITOK). */ ssize_t OF_getprop_alloc(phandle_t package, const char *propname, void **buf) { int len; *buf = NULL; if ((len = OF_getproplen(package, propname)) == -1) return (-1); if (len > 0) { *buf = malloc(len, M_OFWPROP, M_WAITOK); if (OF_getprop(package, propname, *buf, len) == -1) { free(*buf, M_OFWPROP); *buf = NULL; return (-1); } } return (len); } /* * Store the value of a property of a package into newly allocated memory * (using the M_OFWPROP malloc pool and M_WAITOK). elsz is the size of a * single element, the number of elements is return in number. 
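A sketch of the usual pattern around the property getters above, node assumed; OF_getencprop() converts whole 32-bit cells to host order (hence the multiple-of-4 assertion), and OF_searchencprop() walks toward the root until an ancestor supplies the property:

pcell_t addr_cells;

if (OF_searchencprop(node, "#address-cells", &addr_cells,
    sizeof(addr_cells)) == -1)
	addr_cells = 2;		/* common fallback when absent */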
*/ ssize_t OF_getprop_alloc_multi(phandle_t package, const char *propname, int elsz, void **buf) { int len; *buf = NULL; if ((len = OF_getproplen(package, propname)) == -1 || len % elsz != 0) return (-1); if (len > 0) { *buf = malloc(len, M_OFWPROP, M_WAITOK); if (OF_getprop(package, propname, *buf, len) == -1) { free(*buf, M_OFWPROP); *buf = NULL; return (-1); } } return (len / elsz); } +ssize_t +OF_getencprop_alloc(phandle_t package, const char *name, void **buf) +{ + ssize_t ret; + ret = OF_getencprop_alloc_multi(package, name, sizeof(pcell_t), + buf); + if (ret < 0) + return (ret); + else + return (ret * sizeof(pcell_t)); +} + ssize_t -OF_getencprop_alloc(phandle_t package, const char *name, int elsz, void **buf) +OF_getencprop_alloc_multi(phandle_t package, const char *name, int elsz, + void **buf) { ssize_t retval; pcell_t *cell; int i; retval = OF_getprop_alloc_multi(package, name, elsz, buf); if (retval == -1) return (-1); - if (retval * elsz % 4 != 0) { - free(*buf, M_OFWPROP); - *buf = NULL; - return (-1); - } cell = *buf; for (i = 0; i < retval * elsz / 4; i++) cell[i] = be32toh(cell[i]); return (retval); } /* Free buffer allocated by OF_getencprop_alloc or OF_getprop_alloc */ void OF_prop_free(void *buf) { free(buf, M_OFWPROP); } /* Get the next property of a package. */ int OF_nextprop(phandle_t package, const char *previous, char *buf, size_t size) { if (ofw_def_impl == NULL) return (-1); return (OFW_NEXTPROP(ofw_obj, package, previous, buf, size)); } /* Set the value of a property of a package. */ int OF_setprop(phandle_t package, const char *propname, const void *buf, size_t len) { if (ofw_def_impl == NULL) return (-1); return (OFW_SETPROP(ofw_obj, package, propname, buf,len)); } /* Convert a device specifier to a fully qualified pathname. */ ssize_t OF_canon(const char *device, char *buf, size_t len) { if (ofw_def_impl == NULL) return (-1); return (OFW_CANON(ofw_obj, device, buf, len)); } /* Return a package handle for the specified device. */ phandle_t OF_finddevice(const char *device) { if (ofw_def_impl == NULL) return (-1); return (OFW_FINDDEVICE(ofw_obj, device)); } /* Return the fully qualified pathname corresponding to an instance. */ ssize_t OF_instance_to_path(ihandle_t instance, char *buf, size_t len) { if (ofw_def_impl == NULL) return (-1); return (OFW_INSTANCE_TO_PATH(ofw_obj, instance, buf, len)); } /* Return the fully qualified pathname corresponding to a package. */ ssize_t OF_package_to_path(phandle_t package, char *buf, size_t len) { if (ofw_def_impl == NULL) return (-1); return (OFW_PACKAGE_TO_PATH(ofw_obj, package, buf, len)); } /* Look up effective phandle (see FDT/PAPR spec) */ static phandle_t OF_child_xref_phandle(phandle_t parent, phandle_t xref) { phandle_t child, rxref; /* * Recursively descend from parent, looking for a node with a property * named either "phandle", "ibm,phandle", or "linux,phandle" that * matches the xref we are looking for. 
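The split made here keeps a byte-count contract for the elsz-free OF_getencprop_alloc() and moves element-count callers to OF_getencprop_alloc_multi(); a sketch of both against the same property, node assumed:

pcell_t *cells;
ssize_t nbytes, ncells;

/* Element size is implicitly sizeof(pcell_t); returns a byte count. */
nbytes = OF_getencprop_alloc(node, "reg", (void **)&cells);
if (nbytes >= 0)
	OF_prop_free(cells);

/* Caller-chosen element size; returns an element count. */
ncells = OF_getencprop_alloc_multi(node, "reg", sizeof(*cells),
    (void **)&cells);
if (ncells >= 0)
	OF_prop_free(cells);

/* For the same property, nbytes == ncells * sizeof(pcell_t). */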
*/ for (child = OF_child(parent); child != 0; child = OF_peer(child)) { rxref = OF_child_xref_phandle(child, xref); if (rxref != -1) return (rxref); if (OF_getencprop(child, "phandle", &rxref, sizeof(rxref)) == -1 && OF_getencprop(child, "ibm,phandle", &rxref, sizeof(rxref)) == -1 && OF_getencprop(child, "linux,phandle", &rxref, sizeof(rxref)) == -1) continue; if (rxref == xref) return (child); } return (-1); } phandle_t OF_node_from_xref(phandle_t xref) { struct xrefinfo *xi; phandle_t node; if (xref_init_done) { if ((xi = xrefinfo_find(xref, FIND_BY_XREF)) == NULL) return (xref); return (xi->node); } if ((node = OF_child_xref_phandle(OF_peer(0), xref)) == -1) return (xref); return (node); } phandle_t OF_xref_from_node(phandle_t node) { struct xrefinfo *xi; phandle_t xref; if (xref_init_done) { if ((xi = xrefinfo_find(node, FIND_BY_NODE)) == NULL) return (node); return (xi->xref); } if (OF_getencprop(node, "phandle", &xref, sizeof(xref)) == -1 && OF_getencprop(node, "ibm,phandle", &xref, sizeof(xref)) == -1 && OF_getencprop(node, "linux,phandle", &xref, sizeof(xref)) == -1) return (node); return (xref); } device_t OF_device_from_xref(phandle_t xref) { struct xrefinfo *xi; if (xref_init_done) { if ((xi = xrefinfo_find(xref, FIND_BY_XREF)) == NULL) return (NULL); return (xi->dev); } panic("Attempt to find device before xreflist_init"); } phandle_t OF_xref_from_device(device_t dev) { struct xrefinfo *xi; if (xref_init_done) { if ((xi = xrefinfo_find((uintptr_t)dev, FIND_BY_DEV)) == NULL) return (0); return (xi->xref); } panic("Attempt to find xref before xreflist_init"); } int OF_device_register_xref(phandle_t xref, device_t dev) { struct xrefinfo *xi; /* * If the given xref handle doesn't already exist in the list then we * add a list entry. In theory this can only happen on a system where * nodes don't contain phandle properties and xref and node handles are * synonymous, so the xref handle is added as the node handle as well. */ if (xref_init_done) { if ((xi = xrefinfo_find(xref, FIND_BY_XREF)) == NULL) xrefinfo_add(xref, xref, dev); else xi->dev = dev; return (0); } panic("Attempt to register device before xreflist_init"); } /* Call the method in the scope of a given instance. */ int OF_call_method(const char *method, ihandle_t instance, int nargs, int nreturns, ...) { va_list ap; cell_t args_n_results[12]; int n, status; if (nargs > 6 || ofw_def_impl == NULL) return (-1); va_start(ap, nreturns); for (n = 0; n < nargs; n++) args_n_results[n] = va_arg(ap, cell_t); status = OFW_CALL_METHOD(ofw_obj, instance, method, nargs, nreturns, args_n_results); if (status != 0) return (status); for (; n < nargs + nreturns; n++) *va_arg(ap, cell_t *) = args_n_results[n]; va_end(ap); return (0); } /* * Device I/O functions */ /* Open an instance for a device. */ ihandle_t OF_open(const char *device) { if (ofw_def_impl == NULL) return (0); return (OFW_OPEN(ofw_obj, device)); } /* Close an instance. */ void OF_close(ihandle_t instance) { if (ofw_def_impl == NULL) return; OFW_CLOSE(ofw_obj, instance); } /* Read from an instance. */ ssize_t OF_read(ihandle_t instance, void *addr, size_t len) { if (ofw_def_impl == NULL) return (-1); return (OFW_READ(ofw_obj, instance, addr, len)); } /* Write to an instance. */ ssize_t OF_write(ihandle_t instance, const void *addr, size_t len) { if (ofw_def_impl == NULL) return (-1); return (OFW_WRITE(ofw_obj, instance, addr, len)); } /* Seek to a position. 
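OF_device_register_xref() below is the producer side of the xref-to-device_t mapping; a sketch, with hypothetical driver names:

static int
myintc_attach(device_t dev)
{
	phandle_t xref;

	xref = OF_xref_from_node(ofw_bus_get_node(dev));
	OF_device_register_xref(xref, dev);
	/* Consumers may now resolve this with OF_device_from_xref(xref). */
	return (0);
}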
*/ int OF_seek(ihandle_t instance, uint64_t pos) { if (ofw_def_impl == NULL) return (-1); return (OFW_SEEK(ofw_obj, instance, pos)); } /* * Memory functions */ /* Claim an area of memory. */ void * OF_claim(void *virt, size_t size, u_int align) { if (ofw_def_impl == NULL) return ((void *)-1); return (OFW_CLAIM(ofw_obj, virt, size, align)); } /* Release an area of memory. */ void OF_release(void *virt, size_t size) { if (ofw_def_impl == NULL) return; OFW_RELEASE(ofw_obj, virt, size); } /* * Control transfer functions */ /* Suspend and drop back to the Open Firmware interface. */ void OF_enter() { if (ofw_def_impl == NULL) return; OFW_ENTER(ofw_obj); } /* Shut down and drop back to the Open Firmware interface. */ void OF_exit() { if (ofw_def_impl == NULL) panic("OF_exit: Open Firmware not available"); /* Should not return */ OFW_EXIT(ofw_obj); for (;;) /* just in case */ ; } Index: user/markj/netdump/sys/dev/ofw/openfirm.h =================================================================== --- user/markj/netdump/sys/dev/ofw/openfirm.h (revision 332407) +++ user/markj/netdump/sys/dev/ofw/openfirm.h (revision 332408) @@ -1,188 +1,190 @@ /* $NetBSD: openfirm.h,v 1.1 1998/05/15 10:16:00 tsubai Exp $ */ /*- * SPDX-License-Identifier: BSD-4-Clause * * Copyright (C) 1995, 1996 Wolfgang Solfrank. * Copyright (C) 1995, 1996 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /* * Copyright (C) 2000 Benno Rice. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _DEV_OPENFIRM_H_ #define _DEV_OPENFIRM_H_ #include #include /* * Prototypes for Open Firmware Interface Routines */ typedef uint32_t ihandle_t; typedef uint32_t phandle_t; typedef uint32_t pcell_t; #ifdef _KERNEL #include #include MALLOC_DECLARE(M_OFWPROP); /* * Open Firmware interface initialization. OF_install installs the named * interface as the Open Firmware access mechanism, OF_init initializes it. */ boolean_t OF_install(char *name, int prio); int OF_init(void *cookie); /* * Known Open Firmware interface names */ #define OFW_STD_DIRECT "ofw_std" /* Standard OF interface */ #define OFW_STD_REAL "ofw_real" /* Real-mode OF interface */ #define OFW_STD_32BIT "ofw_32bit" /* 32-bit OF interface */ #define OFW_FDT "ofw_fdt" /* Flattened Device Tree */ /* Generic functions */ int OF_test(const char *name); void OF_printf(const char *fmt, ...); /* Device tree functions */ phandle_t OF_peer(phandle_t node); phandle_t OF_child(phandle_t node); phandle_t OF_parent(phandle_t node); ssize_t OF_getproplen(phandle_t node, const char *propname); ssize_t OF_getprop(phandle_t node, const char *propname, void *buf, size_t len); ssize_t OF_getencprop(phandle_t node, const char *prop, pcell_t *buf, size_t len); /* Same as getprop, but maintains endianness */ int OF_hasprop(phandle_t node, const char *propname); ssize_t OF_searchprop(phandle_t node, const char *propname, void *buf, size_t len); ssize_t OF_searchencprop(phandle_t node, const char *propname, pcell_t *buf, size_t len); ssize_t OF_getprop_alloc(phandle_t node, const char *propname, void **buf); ssize_t OF_getprop_alloc_multi(phandle_t node, const char *propname, int elsz, void **buf); ssize_t OF_getencprop_alloc(phandle_t node, const char *propname, + void **buf); +ssize_t OF_getencprop_alloc_multi(phandle_t node, const char *propname, int elsz, void **buf); void OF_prop_free(void *buf); int OF_nextprop(phandle_t node, const char *propname, char *buf, size_t len); int OF_setprop(phandle_t node, const char *name, const void *buf, size_t len); ssize_t OF_canon(const char *path, char *buf, size_t len); phandle_t OF_finddevice(const char *path); ssize_t OF_package_to_path(phandle_t node, char *buf, size_t len); /* * Some OF implementations (IBM, FDT) have a concept of effective phandles * used for device-tree cross-references. Given one of these, returns the * real phandle. If one can't be found (or running on OF implementations * without this property), returns its input. */ phandle_t OF_node_from_xref(phandle_t xref); phandle_t OF_xref_from_node(phandle_t node); /* * When properties contain references to other nodes using xref handles it is * often necessary to use interfaces provided by the driver for the referenced * instance. 
These routines allow a driver that provides such an interface to * register its association with an xref handle, and for other drivers to obtain * the device_t associated with an xref handle. */ device_t OF_device_from_xref(phandle_t xref); phandle_t OF_xref_from_device(device_t dev); int OF_device_register_xref(phandle_t xref, device_t dev); /* Device I/O functions */ ihandle_t OF_open(const char *path); void OF_close(ihandle_t instance); ssize_t OF_read(ihandle_t instance, void *buf, size_t len); ssize_t OF_write(ihandle_t instance, const void *buf, size_t len); int OF_seek(ihandle_t instance, uint64_t where); phandle_t OF_instance_to_package(ihandle_t instance); ssize_t OF_instance_to_path(ihandle_t instance, char *buf, size_t len); int OF_call_method(const char *method, ihandle_t instance, int nargs, int nreturns, ...); /* Memory functions */ void *OF_claim(void *virtrequest, size_t size, u_int align); void OF_release(void *virt, size_t size); /* Control transfer functions */ void OF_enter(void); void OF_exit(void) __attribute__((noreturn)); /* User interface functions */ int OF_interpret(const char *cmd, int nreturns, ...); /* * Decode the Nth register property of the given device node and create a bus * space tag and handle for accessing it. This is for use in setting up things * like early console output before newbus is available. The implementation is * machine-dependent, and sparc uses a different function signature as well. */ #ifndef __sparc64__ int OF_decode_addr(phandle_t dev, int regno, bus_space_tag_t *ptag, bus_space_handle_t *phandle, bus_size_t *sz); #endif #endif /* _KERNEL */ #endif /* _DEV_OPENFIRM_H_ */ Index: user/markj/netdump/sys/dev/pci/pci_user.c =================================================================== --- user/markj/netdump/sys/dev/pci/pci_user.c (revision 332407) +++ user/markj/netdump/sys/dev/pci/pci_user.c (revision 332408) @@ -1,1025 +1,1062 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 1997, Stefan Esser * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
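OF_decode_addr() declared above is the early-console hook; a sketch, assuming a hypothetical UART path with byte-wide ns16550-style registers:

bus_space_tag_t bst;
bus_space_handle_t bsh;
bus_size_t sz;
phandle_t uart;

uart = OF_finddevice("/soc/serial@1c28000");	/* hypothetical path */
if (uart != -1 && OF_decode_addr(uart, 0, &bst, &bsh, &sz) == 0)
	bus_space_write_1(bst, bsh, 0, 'A');	/* THR at offset 0 */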
*/ #include __FBSDID("$FreeBSD$"); #include "opt_bus.h" /* XXX trim includes */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "pcib_if.h" #include "pci_if.h" /* * This is the user interface to PCI configuration space. */ static d_open_t pci_open; static d_close_t pci_close; -static int pci_conf_match(struct pci_match_conf *matches, int num_matches, - struct pci_conf *match_buf); static d_ioctl_t pci_ioctl; struct cdevsw pcicdev = { .d_version = D_VERSION, .d_flags = D_NEEDGIANT, .d_open = pci_open, .d_close = pci_close, .d_ioctl = pci_ioctl, .d_name = "pci", }; static int pci_open(struct cdev *dev, int oflags, int devtype, struct thread *td) { int error; if (oflags & FWRITE) { error = securelevel_gt(td->td_ucred, 0); if (error) return (error); } return (0); } static int pci_close(struct cdev *dev, int flag, int devtype, struct thread *td) { return 0; } /* * Match a single pci_conf structure against an array of pci_match_conf * structures. The first argument, 'matches', is an array of num_matches * pci_match_conf structures. match_buf is a pointer to the pci_conf * structure that will be compared to every entry in the matches array. * This function returns 1 on failure, 0 on success. */ static int -pci_conf_match(struct pci_match_conf *matches, int num_matches, +pci_conf_match_native(struct pci_match_conf *matches, int num_matches, struct pci_conf *match_buf) { int i; if ((matches == NULL) || (match_buf == NULL) || (num_matches <= 0)) return(1); for (i = 0; i < num_matches; i++) { /* * I'm not sure why someone would do this...but... */ if (matches[i].flags == PCI_GETCONF_NO_MATCH) continue; /* * Look at each of the match flags. If it's set, do the * comparison. If the comparison fails, we don't have a * match, go on to the next item if there is one. 
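The matcher above backs PCIOCGETCONF; userland (cf. pciconf(8)) fills a pci_conf_io, optionally with match patterns. A sketch, with a hypothetical vendor ID and fd assumed to come from open("/dev/pci", O_RDONLY):

struct pci_match_conf pat = { 0 };
struct pci_conf res[32];
struct pci_conf_io cio = { 0 };

pat.flags = PCI_GETCONF_MATCH_VENDOR;
pat.pc_vendor = 0x8086;			/* hypothetical vendor */
cio.patterns = &pat;
cio.num_patterns = 1;
cio.pat_buf_len = sizeof(pat);
cio.matches = res;
cio.match_buf_len = sizeof(res);
if (ioctl(fd, PCIOCGETCONF, &cio) == 0) {
	/* The first cio.num_matches entries of res[] are valid. */
}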
*/ if (((matches[i].flags & PCI_GETCONF_MATCH_DOMAIN) != 0) && (match_buf->pc_sel.pc_domain != matches[i].pc_sel.pc_domain)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_BUS) != 0) && (match_buf->pc_sel.pc_bus != matches[i].pc_sel.pc_bus)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_DEV) != 0) && (match_buf->pc_sel.pc_dev != matches[i].pc_sel.pc_dev)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_FUNC) != 0) && (match_buf->pc_sel.pc_func != matches[i].pc_sel.pc_func)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_VENDOR) != 0) && (match_buf->pc_vendor != matches[i].pc_vendor)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_DEVICE) != 0) && (match_buf->pc_device != matches[i].pc_device)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_CLASS) != 0) && (match_buf->pc_class != matches[i].pc_class)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_UNIT) != 0) && (match_buf->pd_unit != matches[i].pd_unit)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_NAME) != 0) && (strncmp(matches[i].pd_name, match_buf->pd_name, sizeof(match_buf->pd_name)) != 0)) continue; return(0); } return(1); } #if defined(COMPAT_FREEBSD4) || defined(COMPAT_FREEBSD5) || \ defined(COMPAT_FREEBSD6) #define PRE7_COMPAT typedef enum { PCI_GETCONF_NO_MATCH_OLD = 0x00, PCI_GETCONF_MATCH_BUS_OLD = 0x01, PCI_GETCONF_MATCH_DEV_OLD = 0x02, PCI_GETCONF_MATCH_FUNC_OLD = 0x04, PCI_GETCONF_MATCH_NAME_OLD = 0x08, PCI_GETCONF_MATCH_UNIT_OLD = 0x10, PCI_GETCONF_MATCH_VENDOR_OLD = 0x20, PCI_GETCONF_MATCH_DEVICE_OLD = 0x40, PCI_GETCONF_MATCH_CLASS_OLD = 0x80 } pci_getconf_flags_old; struct pcisel_old { u_int8_t pc_bus; /* bus number */ u_int8_t pc_dev; /* device on this bus */ u_int8_t pc_func; /* function on this device */ }; struct pci_conf_old { struct pcisel_old pc_sel; /* bus+slot+function */ u_int8_t pc_hdr; /* PCI header type */ u_int16_t pc_subvendor; /* card vendor ID */ u_int16_t pc_subdevice; /* card device ID, assigned by card vendor */ u_int16_t pc_vendor; /* chip vendor ID */ u_int16_t pc_device; /* chip device ID, assigned by chip vendor */ u_int8_t pc_class; /* chip PCI class */ u_int8_t pc_subclass; /* chip PCI subclass */ u_int8_t pc_progif; /* chip PCI programming interface */ u_int8_t pc_revid; /* chip revision ID */ char pd_name[PCI_MAXNAMELEN + 1]; /* device name */ u_long pd_unit; /* device unit number */ }; struct pci_match_conf_old { struct pcisel_old pc_sel; /* bus+slot+function */ char pd_name[PCI_MAXNAMELEN + 1]; /* device name */ u_long pd_unit; /* Unit number */ u_int16_t pc_vendor; /* PCI Vendor ID */ u_int16_t pc_device; /* PCI Device ID */ u_int8_t pc_class; /* PCI class */ pci_getconf_flags_old flags; /* Matching expression */ }; struct pci_io_old { struct pcisel_old pi_sel; /* device to operate on */ int pi_reg; /* configuration register to examine */ int pi_width; /* width (in bytes) of read or write */ u_int32_t pi_data; /* data to write or result of read */ }; #ifdef COMPAT_FREEBSD32 struct pci_conf_old32 { struct pcisel_old pc_sel; /* bus+slot+function */ uint8_t pc_hdr; /* PCI header type */ uint16_t pc_subvendor; /* card vendor ID */ uint16_t pc_subdevice; /* card device ID, assigned by card vendor */ uint16_t pc_vendor; /* chip vendor ID */ uint16_t pc_device; /* chip device ID, assigned by chip vendor */ uint8_t pc_class; /* chip PCI class */ uint8_t pc_subclass; /* chip PCI subclass */ uint8_t pc_progif; /* chip PCI programming interface */ uint8_t pc_revid; /* chip revision ID */ char pd_name[PCI_MAXNAMELEN + 1]; /* device name */ uint32_t 
pd_unit; /* device unit number (u_long) */ }; struct pci_match_conf_old32 { struct pcisel_old pc_sel; /* bus+slot+function */ char pd_name[PCI_MAXNAMELEN + 1]; /* device name */ uint32_t pd_unit; /* Unit number (u_long) */ uint16_t pc_vendor; /* PCI Vendor ID */ uint16_t pc_device; /* PCI Device ID */ uint8_t pc_class; /* PCI class */ pci_getconf_flags_old flags; /* Matching expression */ }; struct pci_conf_io32 { uint32_t pat_buf_len; /* pattern buffer length */ uint32_t num_patterns; /* number of patterns */ uint32_t patterns; /* pattern buffer (struct pci_match_conf_old32 *) */ uint32_t match_buf_len; /* match buffer length */ uint32_t num_matches; /* number of matches returned */ uint32_t matches; /* match buffer (struct pci_conf_old32 *) */ uint32_t offset; /* offset into device list */ uint32_t generation; /* device list generation */ pci_getconf_status status; /* request status */ }; #define PCIOCGETCONF_OLD32 _IOWR('p', 1, struct pci_conf_io32) #endif /* COMPAT_FREEBSD32 */ #define PCIOCGETCONF_OLD _IOWR('p', 1, struct pci_conf_io) #define PCIOCREAD_OLD _IOWR('p', 2, struct pci_io_old) #define PCIOCWRITE_OLD _IOWR('p', 3, struct pci_io_old) -static int pci_conf_match_old(struct pci_match_conf_old *matches, - int num_matches, struct pci_conf *match_buf); - static int pci_conf_match_old(struct pci_match_conf_old *matches, int num_matches, struct pci_conf *match_buf) { int i; if ((matches == NULL) || (match_buf == NULL) || (num_matches <= 0)) return(1); for (i = 0; i < num_matches; i++) { if (match_buf->pc_sel.pc_domain != 0) continue; /* * I'm not sure why someone would do this...but... */ if (matches[i].flags == PCI_GETCONF_NO_MATCH_OLD) continue; /* * Look at each of the match flags. If it's set, do the * comparison. If the comparison fails, we don't have a * match, go on to the next item if there is one. */ if (((matches[i].flags & PCI_GETCONF_MATCH_BUS_OLD) != 0) && (match_buf->pc_sel.pc_bus != matches[i].pc_sel.pc_bus)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_DEV_OLD) != 0) && (match_buf->pc_sel.pc_dev != matches[i].pc_sel.pc_dev)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_FUNC_OLD) != 0) && (match_buf->pc_sel.pc_func != matches[i].pc_sel.pc_func)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_VENDOR_OLD) != 0) && (match_buf->pc_vendor != matches[i].pc_vendor)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_DEVICE_OLD) != 0) && (match_buf->pc_device != matches[i].pc_device)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_CLASS_OLD) != 0) && (match_buf->pc_class != matches[i].pc_class)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_UNIT_OLD) != 0) && (match_buf->pd_unit != matches[i].pd_unit)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_NAME_OLD) != 0) && (strncmp(matches[i].pd_name, match_buf->pd_name, sizeof(match_buf->pd_name)) != 0)) continue; return(0); } return(1); } #ifdef COMPAT_FREEBSD32 static int pci_conf_match_old32(struct pci_match_conf_old32 *matches, int num_matches, struct pci_conf *match_buf) { int i; if ((matches == NULL) || (match_buf == NULL) || (num_matches <= 0)) return(1); for (i = 0; i < num_matches; i++) { if (match_buf->pc_sel.pc_domain != 0) continue; /* * I'm not sure why someone would do this...but... */ if (matches[i].flags == PCI_GETCONF_NO_MATCH_OLD) continue; /* * Look at each of the match flags. If it's set, do the * comparison. If the comparison fails, we don't have a * match, go on to the next item if there is one. 
*/ if (((matches[i].flags & PCI_GETCONF_MATCH_BUS_OLD) != 0) && (match_buf->pc_sel.pc_bus != matches[i].pc_sel.pc_bus)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_DEV_OLD) != 0) && (match_buf->pc_sel.pc_dev != matches[i].pc_sel.pc_dev)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_FUNC_OLD) != 0) && (match_buf->pc_sel.pc_func != matches[i].pc_sel.pc_func)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_VENDOR_OLD) != 0) && (match_buf->pc_vendor != matches[i].pc_vendor)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_DEVICE_OLD) != 0) && (match_buf->pc_device != matches[i].pc_device)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_CLASS_OLD) != 0) && (match_buf->pc_class != matches[i].pc_class)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_UNIT_OLD) != 0) && ((u_int32_t)match_buf->pd_unit != matches[i].pd_unit)) continue; if (((matches[i].flags & PCI_GETCONF_MATCH_NAME_OLD) != 0) && (strncmp(matches[i].pd_name, match_buf->pd_name, sizeof(match_buf->pd_name)) != 0)) continue; return (0); } return (1); } #endif /* COMPAT_FREEBSD32 */ -#endif /* PRE7_COMPAT */ +#endif /* !PRE7_COMPAT */ +union pci_conf_union { + struct pci_conf pc; +#ifdef PRE7_COMPAT + struct pci_conf_old pco; +#ifdef COMPAT_FREEBSD32 + struct pci_conf_old32 pco32; +#endif +#endif +}; + static int +pci_conf_match(u_long cmd, struct pci_match_conf *matches, int num_matches, + struct pci_conf *match_buf) +{ + + switch (cmd) { + case PCIOCGETCONF: + return (pci_conf_match_native( + (struct pci_match_conf *)matches, num_matches, match_buf)); +#ifdef PRE7_COMPAT + case PCIOCGETCONF_OLD: + return (pci_conf_match_old( + (struct pci_match_conf_old *)matches, num_matches, + match_buf)); +#ifdef COMPAT_FREEBSD32 + case PCIOCGETCONF_OLD32: + return (pci_conf_match_old32( + (struct pci_match_conf_old32 *)matches, num_matches, + match_buf)); +#endif +#endif + default: + /* programmer error */ + return (0); + } +} + +static int pci_list_vpd(device_t dev, struct pci_list_vpd_io *lvio) { struct pci_vpd_element vpd_element, *vpd_user; struct pcicfg_vpd *vpd; size_t len; int error, i; vpd = pci_fetch_vpd_list(dev); if (vpd->vpd_reg == 0 || vpd->vpd_ident == NULL) return (ENXIO); /* * Calculate the amount of space needed in the data buffer. An * identifier element is always present followed by the read-only * and read-write keywords. */ len = sizeof(struct pci_vpd_element) + strlen(vpd->vpd_ident); for (i = 0; i < vpd->vpd_rocnt; i++) len += sizeof(struct pci_vpd_element) + vpd->vpd_ros[i].len; for (i = 0; i < vpd->vpd_wcnt; i++) len += sizeof(struct pci_vpd_element) + vpd->vpd_w[i].len; if (lvio->plvi_len == 0) { lvio->plvi_len = len; return (0); } if (lvio->plvi_len < len) { lvio->plvi_len = len; return (ENOMEM); } /* * Copyout the identifier string followed by each keyword and * value. 
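pci_list_vpd() above implements a two-call sizing protocol: a first call with plvi_len == 0 reports the required buffer size, a second call copies the elements out. A userland sketch (sel and fd assumed, error handling omitted):

struct pci_list_vpd_io lv = { 0 };

lv.plvi_sel = sel;		/* pcisel of the target device */
lv.plvi_len = 0;
ioctl(fd, PCIOCLISTVPD, &lv);	/* sizing call: sets lv.plvi_len */
lv.plvi_data = malloc(lv.plvi_len);
ioctl(fd, PCIOCLISTVPD, &lv);	/* fills lv.plvi_data */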
*/ vpd_user = lvio->plvi_data; vpd_element.pve_keyword[0] = '\0'; vpd_element.pve_keyword[1] = '\0'; vpd_element.pve_flags = PVE_FLAG_IDENT; vpd_element.pve_datalen = strlen(vpd->vpd_ident); error = copyout(&vpd_element, vpd_user, sizeof(vpd_element)); if (error) return (error); error = copyout(vpd->vpd_ident, vpd_user->pve_data, strlen(vpd->vpd_ident)); if (error) return (error); vpd_user = PVE_NEXT(vpd_user); vpd_element.pve_flags = 0; for (i = 0; i < vpd->vpd_rocnt; i++) { vpd_element.pve_keyword[0] = vpd->vpd_ros[i].keyword[0]; vpd_element.pve_keyword[1] = vpd->vpd_ros[i].keyword[1]; vpd_element.pve_datalen = vpd->vpd_ros[i].len; error = copyout(&vpd_element, vpd_user, sizeof(vpd_element)); if (error) return (error); error = copyout(vpd->vpd_ros[i].value, vpd_user->pve_data, vpd->vpd_ros[i].len); if (error) return (error); vpd_user = PVE_NEXT(vpd_user); } vpd_element.pve_flags = PVE_FLAG_RW; for (i = 0; i < vpd->vpd_wcnt; i++) { vpd_element.pve_keyword[0] = vpd->vpd_w[i].keyword[0]; vpd_element.pve_keyword[1] = vpd->vpd_w[i].keyword[1]; vpd_element.pve_datalen = vpd->vpd_w[i].len; error = copyout(&vpd_element, vpd_user, sizeof(vpd_element)); if (error) return (error); error = copyout(vpd->vpd_w[i].value, vpd_user->pve_data, vpd->vpd_w[i].len); if (error) return (error); vpd_user = PVE_NEXT(vpd_user); } KASSERT((char *)vpd_user - (char *)lvio->plvi_data == len, ("length mismatch")); lvio->plvi_len = len; return (0); } +static size_t +pci_match_conf_size(u_long cmd) +{ + + switch (cmd) { + case PCIOCGETCONF: + return (sizeof(struct pci_match_conf)); +#ifdef PRE7_COMPAT + case PCIOCGETCONF_OLD: + return (sizeof(struct pci_match_conf_old)); +#ifdef COMPAT_FREEBSD32 + case PCIOCGETCONF_OLD32: + return (sizeof(struct pci_match_conf_old32)); +#endif +#endif + default: + /* programmer error */ + return (0); + } +} + +static size_t +pci_conf_size(u_long cmd) +{ + + switch (cmd) { + case PCIOCGETCONF: + return (sizeof(struct pci_conf)); +#ifdef PRE7_COMPAT + case PCIOCGETCONF_OLD: + return (sizeof(struct pci_conf_old)); +#ifdef COMPAT_FREEBSD32 + case PCIOCGETCONF_OLD32: + return (sizeof(struct pci_conf_old32)); +#endif +#endif + default: + /* programmer error */ + return (0); + } +} + +static void +pci_conf_io_init(struct pci_conf_io *cio, caddr_t data, u_long cmd) +{ +#if defined(PRE7_COMPAT) && defined(COMPAT_FREEBSD32) + struct pci_conf_io32 *cio32; +#endif + + switch (cmd) { + case PCIOCGETCONF: +#ifdef PRE7_COMPAT + case PCIOCGETCONF_OLD: +#endif + *cio = *(struct pci_conf_io *)data; + return; + +#if defined(PRE7_COMPAT) && defined(COMPAT_FREEBSD32) + case PCIOCGETCONF_OLD32: + cio32 = (struct pci_conf_io32 *)data; + cio->pat_buf_len = cio32->pat_buf_len; + cio->num_patterns = cio32->num_patterns; + cio->patterns = (void *)(uintptr_t)cio32->patterns; + cio->match_buf_len = cio32->match_buf_len; + cio->num_matches = cio32->num_matches; + cio->matches = (void *)(uintptr_t)cio32->matches; + cio->offset = cio32->offset; + cio->generation = cio32->generation; + cio->status = cio32->status; + return; +#endif + + default: + /* programmer error */ + return; + } +} + +static void +pci_conf_io_update_data(const struct pci_conf_io *cio, caddr_t data, + u_long cmd) +{ + struct pci_conf_io *d_cio; +#if defined(PRE7_COMPAT) && defined(COMPAT_FREEBSD32) + struct pci_conf_io32 *cio32; +#endif + + switch (cmd) { + case PCIOCGETCONF: +#ifdef PRE7_COMPAT + case PCIOCGETCONF_OLD: +#endif + d_cio = (struct pci_conf_io *)data; + d_cio->status = cio->status; + d_cio->generation = cio->generation; + d_cio->offset 
= cio->offset; + d_cio->num_matches = cio->num_matches; + return; + +#if defined(PRE7_COMPAT) && defined(COMPAT_FREEBSD32) + case PCIOCGETCONF_OLD32: + cio32 = (struct pci_conf_io32 *)data; + + cio32->status = cio->status; + cio32->generation = cio->generation; + cio32->offset = cio->offset; + cio32->num_matches = cio->num_matches; + return; +#endif + + default: + /* programmer error */ + return; + } +} + +static void +pci_conf_for_copyout(const struct pci_conf *pcp, union pci_conf_union *pcup, + u_long cmd) +{ + + memset(pcup, 0, sizeof(*pcup)); + + switch (cmd) { + case PCIOCGETCONF: + pcup->pc = *pcp; + return; + +#ifdef PRE7_COMPAT +#ifdef COMPAT_FREEBSD32 + case PCIOCGETCONF_OLD32: + pcup->pco32.pc_sel.pc_bus = pcp->pc_sel.pc_bus; + pcup->pco32.pc_sel.pc_dev = pcp->pc_sel.pc_dev; + pcup->pco32.pc_sel.pc_func = pcp->pc_sel.pc_func; + pcup->pco32.pc_hdr = pcp->pc_hdr; + pcup->pco32.pc_subvendor = pcp->pc_subvendor; + pcup->pco32.pc_subdevice = pcp->pc_subdevice; + pcup->pco32.pc_vendor = pcp->pc_vendor; + pcup->pco32.pc_device = pcp->pc_device; + pcup->pco32.pc_class = pcp->pc_class; + pcup->pco32.pc_subclass = pcp->pc_subclass; + pcup->pco32.pc_progif = pcp->pc_progif; + pcup->pco32.pc_revid = pcp->pc_revid; + strlcpy(pcup->pco32.pd_name, pcp->pd_name, + sizeof(pcup->pco32.pd_name)); + pcup->pco32.pd_unit = (uint32_t)pcp->pd_unit; + return; + +#endif /* COMPAT_FREEBSD32 */ + case PCIOCGETCONF_OLD: + pcup->pco.pc_sel.pc_bus = pcp->pc_sel.pc_bus; + pcup->pco.pc_sel.pc_dev = pcp->pc_sel.pc_dev; + pcup->pco.pc_sel.pc_func = pcp->pc_sel.pc_func; + pcup->pco.pc_hdr = pcp->pc_hdr; + pcup->pco.pc_subvendor = pcp->pc_subvendor; + pcup->pco.pc_subdevice = pcp->pc_subdevice; + pcup->pco.pc_vendor = pcp->pc_vendor; + pcup->pco.pc_device = pcp->pc_device; + pcup->pco.pc_class = pcp->pc_class; + pcup->pco.pc_subclass = pcp->pc_subclass; + pcup->pco.pc_progif = pcp->pc_progif; + pcup->pco.pc_revid = pcp->pc_revid; + strlcpy(pcup->pco.pd_name, pcp->pd_name, + sizeof(pcup->pco.pd_name)); + pcup->pco.pd_unit = pcp->pd_unit; + return; +#endif /* PRE7_COMPAT */ + + default: + /* programmer error */ + return; + } +} + static int pci_ioctl(struct cdev *dev, u_long cmd, caddr_t data, int flag, struct thread *td) { device_t pcidev; - void *confdata; const char *name; struct devlist *devlist_head; struct pci_conf_io *cio = NULL; struct pci_devinfo *dinfo; struct pci_io *io; struct pci_bar_io *bio; struct pci_list_vpd_io *lvio; struct pci_match_conf *pattern_buf; struct pci_map *pm; - size_t confsz, iolen, pbufsz; + size_t confsz, iolen; int error, ionum, i, num_patterns; + union pci_conf_union pcu; #ifdef PRE7_COMPAT -#ifdef COMPAT_FREEBSD32 - struct pci_conf_io32 *cio32 = NULL; - struct pci_conf_old32 conf_old32; - struct pci_match_conf_old32 *pattern_buf_old32 = NULL; -#endif - struct pci_conf_old conf_old; struct pci_io iodata; struct pci_io_old *io_old; - struct pci_match_conf_old *pattern_buf_old = NULL; io_old = NULL; #endif if (!(flag & FWRITE)) { switch (cmd) { + case PCIOCGETCONF: #ifdef PRE7_COMPAT + case PCIOCGETCONF_OLD: #ifdef COMPAT_FREEBSD32 case PCIOCGETCONF_OLD32: #endif - case PCIOCGETCONF_OLD: #endif - case PCIOCGETCONF: case PCIOCGETBAR: case PCIOCLISTVPD: break; default: return (EPERM); } } - switch (cmd) { -#ifdef PRE7_COMPAT -#ifdef COMPAT_FREEBSD32 - case PCIOCGETCONF_OLD32: - cio32 = (struct pci_conf_io32 *)data; - cio = malloc(sizeof(struct pci_conf_io), M_TEMP, M_WAITOK); - cio->pat_buf_len = cio32->pat_buf_len; - cio->num_patterns = cio32->num_patterns; - cio->patterns = (void 
*)(uintptr_t)cio32->patterns; - cio->match_buf_len = cio32->match_buf_len; - cio->num_matches = cio32->num_matches; - cio->matches = (void *)(uintptr_t)cio32->matches; - cio->offset = cio32->offset; - cio->generation = cio32->generation; - cio->status = cio32->status; - cio32->num_matches = 0; - break; -#endif - case PCIOCGETCONF_OLD: -#endif - case PCIOCGETCONF: - cio = (struct pci_conf_io *)data; - } switch (cmd) { + case PCIOCGETCONF: #ifdef PRE7_COMPAT + case PCIOCGETCONF_OLD: #ifdef COMPAT_FREEBSD32 case PCIOCGETCONF_OLD32: #endif - case PCIOCGETCONF_OLD: #endif - case PCIOCGETCONF: - + cio = malloc(sizeof(struct pci_conf_io), M_TEMP, + M_WAITOK | M_ZERO); + pci_conf_io_init(cio, data, cmd); pattern_buf = NULL; num_patterns = 0; dinfo = NULL; cio->num_matches = 0; /* * If the user specified an offset into the device list, * but the list has changed since they last called this * ioctl, tell them that the list has changed. They will * have to get the list from the beginning. */ if ((cio->offset != 0) && (cio->generation != pci_generation)){ cio->status = PCI_GETCONF_LIST_CHANGED; error = 0; goto getconfexit; } /* * Check to see whether the user has asked for an offset * past the end of our list. */ if (cio->offset >= pci_numdevs) { cio->status = PCI_GETCONF_LAST_DEVICE; error = 0; goto getconfexit; } /* get the head of the device queue */ devlist_head = &pci_devq; /* * Determine how much room we have for pci_conf structures. * Round the user's buffer size down to the nearest * multiple of sizeof(struct pci_conf) in case the user * didn't specify a multiple of that size. */ -#ifdef PRE7_COMPAT -#ifdef COMPAT_FREEBSD32 - if (cmd == PCIOCGETCONF_OLD32) - confsz = sizeof(struct pci_conf_old32); - else -#endif - if (cmd == PCIOCGETCONF_OLD) - confsz = sizeof(struct pci_conf_old); - else -#endif - confsz = sizeof(struct pci_conf); + confsz = pci_conf_size(cmd); iolen = min(cio->match_buf_len - (cio->match_buf_len % confsz), pci_numdevs * confsz); /* * Since we know that iolen is a multiple of the size of * the pciconf union, it's okay to do this. */ ionum = iolen / confsz; /* * If this test is true, the user wants the pci_conf * structures returned to match the supplied entries. */ if ((cio->num_patterns > 0) && (cio->num_patterns < pci_numdevs) && (cio->pat_buf_len > 0)) { /* * pat_buf_len needs to be: * num_patterns * sizeof(struct pci_match_conf) * While it is certainly possible the user just * allocated a large buffer, but set the number of * matches correctly, it is far more likely that * their kernel doesn't match the userland utility * they're using. It's also possible that the user * forgot to initialize some variables. Yes, this * may be overly picky, but I hazard to guess that * it's far more likely to just catch folks that * updated their kernel but not their userland. */ -#ifdef PRE7_COMPAT -#ifdef COMPAT_FREEBSD32 - if (cmd == PCIOCGETCONF_OLD32) - pbufsz = sizeof(struct pci_match_conf_old32); - else -#endif - if (cmd == PCIOCGETCONF_OLD) - pbufsz = sizeof(struct pci_match_conf_old); - else -#endif - pbufsz = sizeof(struct pci_match_conf); - if (cio->num_patterns * pbufsz != cio->pat_buf_len) { + if (cio->num_patterns * pci_match_conf_size(cmd) != + cio->pat_buf_len) { /* The user made a mistake, return an error. */ cio->status = PCI_GETCONF_ERROR; error = EINVAL; goto getconfexit; } /* * Allocate a buffer to hold the patterns. 
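* For illustration (caller values are hypothetical): a caller issuing
* PCIOCGETCONF with num_patterns = 2 must pass pat_buf_len of exactly
* 2 * sizeof(struct pci_match_conf); any other size fails the check
* above with EINVAL before this allocation is reached.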
*/ -#ifdef PRE7_COMPAT -#ifdef COMPAT_FREEBSD32 - if (cmd == PCIOCGETCONF_OLD32) { - pattern_buf_old32 = malloc(cio->pat_buf_len, - M_TEMP, M_WAITOK); - error = copyin(cio->patterns, - pattern_buf_old32, cio->pat_buf_len); - } else -#endif /* COMPAT_FREEBSD32 */ - if (cmd == PCIOCGETCONF_OLD) { - pattern_buf_old = malloc(cio->pat_buf_len, - M_TEMP, M_WAITOK); - error = copyin(cio->patterns, - pattern_buf_old, cio->pat_buf_len); - } else -#endif /* PRE7_COMPAT */ - { - pattern_buf = malloc(cio->pat_buf_len, M_TEMP, - M_WAITOK); - error = copyin(cio->patterns, pattern_buf, - cio->pat_buf_len); - } + pattern_buf = malloc(cio->pat_buf_len, M_TEMP, + M_WAITOK); + error = copyin(cio->patterns, pattern_buf, + cio->pat_buf_len); if (error != 0) { error = EINVAL; goto getconfexit; } num_patterns = cio->num_patterns; } else if ((cio->num_patterns > 0) || (cio->pat_buf_len > 0)) { /* * The user made a mistake, spit out an error. */ cio->status = PCI_GETCONF_ERROR; error = EINVAL; goto getconfexit; } /* * Go through the list of devices and copy out the devices * that match the user's criteria. */ for (cio->num_matches = 0, i = 0, dinfo = STAILQ_FIRST(devlist_head); dinfo != NULL; dinfo = STAILQ_NEXT(dinfo, pci_links), i++) { if (i < cio->offset) continue; /* Populate pd_name and pd_unit */ name = NULL; if (dinfo->cfg.dev) name = device_get_name(dinfo->cfg.dev); if (name) { strncpy(dinfo->conf.pd_name, name, sizeof(dinfo->conf.pd_name)); dinfo->conf.pd_name[PCI_MAXNAMELEN] = 0; dinfo->conf.pd_unit = device_get_unit(dinfo->cfg.dev); } else { dinfo->conf.pd_name[0] = '\0'; dinfo->conf.pd_unit = 0; } -#ifdef PRE7_COMPAT - if ( -#ifdef COMPAT_FREEBSD32 - (cmd == PCIOCGETCONF_OLD32 && - (pattern_buf_old32 == NULL || - pci_conf_match_old32(pattern_buf_old32, - num_patterns, &dinfo->conf) == 0)) || -#endif - (cmd == PCIOCGETCONF_OLD && - (pattern_buf_old == NULL || - pci_conf_match_old(pattern_buf_old, num_patterns, - &dinfo->conf) == 0)) || - (cmd == PCIOCGETCONF && - (pattern_buf == NULL || - pci_conf_match(pattern_buf, num_patterns, - &dinfo->conf) == 0))) { -#else if (pattern_buf == NULL || - pci_conf_match(pattern_buf, num_patterns, + pci_conf_match(cmd, pattern_buf, num_patterns, &dinfo->conf) == 0) { -#endif /* * If we've filled up the user's buffer, * break out at this point. Since we've * got a match here, we'll pick right back * up at the matching entry. We can also * tell the user that there are more matches * left. 
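* (For example, if match_buf_len has room for 10 entries and 25
* devices match, the first call copies out 10 matches and returns
* PCI_GETCONF_MORE_DEVS; reissuing the ioctl with the returned
* offset and generation resumes the scan at the next match.)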
*/ if (cio->num_matches >= ionum) { error = 0; break; } -#ifdef PRE7_COMPAT -#ifdef COMPAT_FREEBSD32 - if (cmd == PCIOCGETCONF_OLD32) { - memset(&conf_old32, 0, - sizeof(conf_old32)); - conf_old32.pc_sel.pc_bus = - dinfo->conf.pc_sel.pc_bus; - conf_old32.pc_sel.pc_dev = - dinfo->conf.pc_sel.pc_dev; - conf_old32.pc_sel.pc_func = - dinfo->conf.pc_sel.pc_func; - conf_old32.pc_hdr = dinfo->conf.pc_hdr; - conf_old32.pc_subvendor = - dinfo->conf.pc_subvendor; - conf_old32.pc_subdevice = - dinfo->conf.pc_subdevice; - conf_old32.pc_vendor = - dinfo->conf.pc_vendor; - conf_old32.pc_device = - dinfo->conf.pc_device; - conf_old32.pc_class = - dinfo->conf.pc_class; - conf_old32.pc_subclass = - dinfo->conf.pc_subclass; - conf_old32.pc_progif = - dinfo->conf.pc_progif; - conf_old32.pc_revid = - dinfo->conf.pc_revid; - strncpy(conf_old32.pd_name, - dinfo->conf.pd_name, - sizeof(conf_old32.pd_name)); - conf_old32.pd_name[PCI_MAXNAMELEN] = 0; - conf_old32.pd_unit = - (uint32_t)dinfo->conf.pd_unit; - confdata = &conf_old32; - } else -#endif /* COMPAT_FREEBSD32 */ - if (cmd == PCIOCGETCONF_OLD) { - memset(&conf_old, 0, sizeof(conf_old)); - conf_old.pc_sel.pc_bus = - dinfo->conf.pc_sel.pc_bus; - conf_old.pc_sel.pc_dev = - dinfo->conf.pc_sel.pc_dev; - conf_old.pc_sel.pc_func = - dinfo->conf.pc_sel.pc_func; - conf_old.pc_hdr = dinfo->conf.pc_hdr; - conf_old.pc_subvendor = - dinfo->conf.pc_subvendor; - conf_old.pc_subdevice = - dinfo->conf.pc_subdevice; - conf_old.pc_vendor = - dinfo->conf.pc_vendor; - conf_old.pc_device = - dinfo->conf.pc_device; - conf_old.pc_class = - dinfo->conf.pc_class; - conf_old.pc_subclass = - dinfo->conf.pc_subclass; - conf_old.pc_progif = - dinfo->conf.pc_progif; - conf_old.pc_revid = - dinfo->conf.pc_revid; - strncpy(conf_old.pd_name, - dinfo->conf.pd_name, - sizeof(conf_old.pd_name)); - conf_old.pd_name[PCI_MAXNAMELEN] = 0; - conf_old.pd_unit = - dinfo->conf.pd_unit; - confdata = &conf_old; - } else -#endif /* PRE7_COMPAT */ - confdata = &dinfo->conf; - error = copyout(confdata, + pci_conf_for_copyout(&dinfo->conf, &pcu, cmd); + error = copyout(&pcu, (caddr_t)cio->matches + confsz * cio->num_matches, confsz); if (error) break; cio->num_matches++; } } /* * Set the pointer into the list, so if the user is getting * n records at a time, where n < pci_numdevs, */ cio->offset = i; /* * Set the generation, the user will need this if they make * another ioctl call with offset != 0. */ cio->generation = pci_generation; /* * If this is the last device, inform the user so he won't * bother asking for more devices. If dinfo isn't NULL, we * know that there are more matches in the list because of * the way the traversal is done. 
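* (The traversal leaves dinfo non-NULL only when the loop breaks
* early, on a full output buffer or a copyout() failure; walking off
* the end of the STAILQ sets dinfo to NULL, so the test below is
* sufficient to distinguish the two cases.)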
*/ if (dinfo == NULL) cio->status = PCI_GETCONF_LAST_DEVICE; else cio->status = PCI_GETCONF_MORE_DEVS; getconfexit: -#ifdef PRE7_COMPAT -#ifdef COMPAT_FREEBSD32 - if (cmd == PCIOCGETCONF_OLD32) { - cio32->status = cio->status; - cio32->generation = cio->generation; - cio32->offset = cio->offset; - cio32->num_matches = cio->num_matches; - free(cio, M_TEMP); - } - if (pattern_buf_old32 != NULL) - free(pattern_buf_old32, M_TEMP); -#endif - if (pattern_buf_old != NULL) - free(pattern_buf_old, M_TEMP); -#endif - if (pattern_buf != NULL) - free(pattern_buf, M_TEMP); + pci_conf_io_update_data(cio, data, cmd); + free(cio, M_TEMP); + free(pattern_buf, M_TEMP); break; #ifdef PRE7_COMPAT case PCIOCREAD_OLD: case PCIOCWRITE_OLD: io_old = (struct pci_io_old *)data; iodata.pi_sel.pc_domain = 0; iodata.pi_sel.pc_bus = io_old->pi_sel.pc_bus; iodata.pi_sel.pc_dev = io_old->pi_sel.pc_dev; iodata.pi_sel.pc_func = io_old->pi_sel.pc_func; iodata.pi_reg = io_old->pi_reg; iodata.pi_width = io_old->pi_width; iodata.pi_data = io_old->pi_data; data = (caddr_t)&iodata; /* FALLTHROUGH */ #endif case PCIOCREAD: case PCIOCWRITE: io = (struct pci_io *)data; switch(io->pi_width) { case 4: case 2: case 1: /* Make sure register is not negative and aligned. */ if (io->pi_reg < 0 || io->pi_reg & (io->pi_width - 1)) { error = EINVAL; break; } /* * Assume that the user-level bus number is * in fact the physical PCI bus number. * Look up the grandparent, i.e. the bridge device, * so that we can issue configuration space cycles. */ pcidev = pci_find_dbsf(io->pi_sel.pc_domain, io->pi_sel.pc_bus, io->pi_sel.pc_dev, io->pi_sel.pc_func); if (pcidev) { #ifdef PRE7_COMPAT if (cmd == PCIOCWRITE || cmd == PCIOCWRITE_OLD) #else if (cmd == PCIOCWRITE) #endif pci_write_config(pcidev, io->pi_reg, io->pi_data, io->pi_width); #ifdef PRE7_COMPAT else if (cmd == PCIOCREAD_OLD) io_old->pi_data = pci_read_config(pcidev, io->pi_reg, io->pi_width); #endif else io->pi_data = pci_read_config(pcidev, io->pi_reg, io->pi_width); error = 0; } else { #ifdef COMPAT_FREEBSD4 if (cmd == PCIOCREAD_OLD) { io_old->pi_data = -1; error = 0; } else #endif error = ENODEV; } break; default: error = EINVAL; break; } break; case PCIOCGETBAR: bio = (struct pci_bar_io *)data; /* * Assume that the user-level bus number is * in fact the physical PCI bus number. */ pcidev = pci_find_dbsf(bio->pbi_sel.pc_domain, bio->pbi_sel.pc_bus, bio->pbi_sel.pc_dev, bio->pbi_sel.pc_func); if (pcidev == NULL) { error = ENODEV; break; } pm = pci_find_bar(pcidev, bio->pbi_reg); if (pm == NULL) { error = EINVAL; break; } bio->pbi_base = pm->pm_value; bio->pbi_length = (pci_addr_t)1 << pm->pm_size; bio->pbi_enabled = pci_bar_enabled(pcidev, pm); error = 0; break; case PCIOCATTACHED: error = 0; io = (struct pci_io *)data; pcidev = pci_find_dbsf(io->pi_sel.pc_domain, io->pi_sel.pc_bus, io->pi_sel.pc_dev, io->pi_sel.pc_func); if (pcidev != NULL) io->pi_data = device_is_attached(pcidev); else error = ENODEV; break; case PCIOCLISTVPD: lvio = (struct pci_list_vpd_io *)data; /* * Assume that the user-level bus number is * in fact the physical PCI bus number. 
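* (pci_find_dbsf() maps the domain/bus/slot/function selector to a
* device_t; if no attached device matches, the lookup below yields
* NULL and the ioctl returns ENODEV.)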
*/ pcidev = pci_find_dbsf(lvio->plvi_sel.pc_domain, lvio->plvi_sel.pc_bus, lvio->plvi_sel.pc_dev, lvio->plvi_sel.pc_func); if (pcidev == NULL) { error = ENODEV; break; } error = pci_list_vpd(pcidev, lvio); break; default: error = ENOTTY; break; } return (error); } Index: user/markj/netdump/sys/dev/vnic/thunder_bgx_fdt.c =================================================================== --- user/markj/netdump/sys/dev/vnic/thunder_bgx_fdt.c (revision 332407) +++ user/markj/netdump/sys/dev/vnic/thunder_bgx_fdt.c (revision 332408) @@ -1,460 +1,460 @@ /*- * Copyright (c) 2015 The FreeBSD Foundation * All rights reserved. * * This software was developed by Semihalf under * the sponsorship of the FreeBSD Foundation. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "thunder_bgx.h" #include "thunder_bgx_var.h" #define CONN_TYPE_MAXLEN 16 #define CONN_TYPE_OFFSET 2 #define BGX_NODE_NAME "bgx" #define BGX_MAXID 9 /* BGX func. 
0, i.e.: reg = <0x8000 0 0 0 0>; DEVFN = 0x80 */ #define BGX_DEVFN_0 0x80 #define FDT_NAME_MAXLEN 31 int bgx_fdt_init_phy(struct bgx *); static void bgx_fdt_get_macaddr(phandle_t phy, uint8_t *hwaddr) { uint8_t addr[ETHER_ADDR_LEN]; if (OF_getprop(phy, "local-mac-address", addr, ETHER_ADDR_LEN) == -1) { /* Missing MAC address should be marked by clearing it */ memset(hwaddr, 0, ETHER_ADDR_LEN); } else memcpy(hwaddr, addr, ETHER_ADDR_LEN); } static boolean_t bgx_fdt_phy_mode_match(struct bgx *bgx, char *qlm_mode, ssize_t size) { const char *type; ssize_t sz; ssize_t offset; switch (bgx->qlm_mode) { case QLM_MODE_SGMII: type = "sgmii"; sz = sizeof("sgmii") - 1; offset = size - sz; break; case QLM_MODE_XAUI_1X4: type = "xaui"; sz = sizeof("xaui") - 1; offset = size - sz; if (offset < 0) return (FALSE); if (strncmp(&qlm_mode[offset], type, sz) == 0) return (TRUE); type = "dxaui"; sz = sizeof("dxaui") - 1; offset = size - sz; break; case QLM_MODE_RXAUI_2X2: type = "raui"; sz = sizeof("raui") - 1; offset = size - sz; break; case QLM_MODE_XFI_4X1: type = "xfi"; sz = sizeof("xfi") - 1; offset = size - sz; break; case QLM_MODE_XLAUI_1X4: type = "xlaui"; sz = sizeof("xlaui") - 1; offset = size - sz; break; case QLM_MODE_10G_KR_4X1: type = "xfi-10g-kr"; sz = sizeof("xfi-10g-kr") - 1; offset = size - sz; break; case QLM_MODE_40G_KR4_1X4: type = "xlaui-40g-kr"; sz = sizeof("xlaui-40g-kr") - 1; offset = size - sz; break; default: return (FALSE); } if (offset < 0) return (FALSE); if (strncmp(&qlm_mode[offset], type, sz) == 0) return (TRUE); return (FALSE); } static boolean_t bgx_fdt_phy_name_match(struct bgx *bgx, char *phy_name, ssize_t size) { const char *type; ssize_t sz; switch (bgx->qlm_mode) { case QLM_MODE_SGMII: type = "sgmii"; sz = sizeof("sgmii") - 1; break; case QLM_MODE_XAUI_1X4: type = "xaui"; sz = sizeof("xaui") - 1; if (sz < size) return (FALSE); if (strncmp(phy_name, type, sz) == 0) return (TRUE); type = "dxaui"; sz = sizeof("dxaui") - 1; break; case QLM_MODE_RXAUI_2X2: type = "raui"; sz = sizeof("raui") - 1; break; case QLM_MODE_XFI_4X1: type = "xfi"; sz = sizeof("xfi") - 1; break; case QLM_MODE_XLAUI_1X4: type = "xlaui"; sz = sizeof("xlaui") - 1; break; case QLM_MODE_10G_KR_4X1: type = "xfi-10g-kr"; sz = sizeof("xfi-10g-kr") - 1; break; case QLM_MODE_40G_KR4_1X4: type = "xlaui-40g-kr"; sz = sizeof("xlaui-40g-kr") - 1; break; default: return (FALSE); } if (sz > size) return (FALSE); if (strncmp(phy_name, type, sz) == 0) return (TRUE); return (FALSE); } static phandle_t bgx_fdt_traverse_nodes(uint8_t unit, phandle_t start, char *name, size_t len) { phandle_t node, ret; uint32_t *reg; size_t buf_size; ssize_t proplen; char *node_name; int err; /* * Traverse all subordinate nodes of 'start' to find BGX instance. * This supports both old (by name) and new (by reg) methods. */ buf_size = sizeof(*node_name) * FDT_NAME_MAXLEN; if (len > buf_size) { /* * This is an erroneous situation since the string * to compare cannot be longer than FDT_NAME_MAXLEN. 
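* (FDT_NAME_MAXLEN is 31, so the scratch buffer allocated below can
* hold any node name this driver expects; a longer search string
* could never match and would only risk overrunning the comparison,
* hence the early bailout.)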
*/ return (0); } node_name = malloc(buf_size, M_BGX, M_WAITOK); for (node = OF_child(start); node != 0; node = OF_peer(node)) { /* Clean-up the buffer */ memset(node_name, 0, buf_size); /* Recurse to children */ if (OF_child(node) != 0) { ret = bgx_fdt_traverse_nodes(unit, node, name, len); if (ret != 0) { free(node_name, M_BGX); return (ret); } } /* * Old way - by name */ proplen = OF_getproplen(node, "name"); if ((proplen <= 0) || (proplen < len)) continue; err = OF_getprop(node, "name", node_name, proplen); if (err <= 0) continue; if (strncmp(node_name, name, len) == 0) { free(node_name, M_BGX); return (node); } /* * New way - by reg */ /* Check if even BGX */ if (strncmp(node_name, BGX_NODE_NAME, sizeof(BGX_NODE_NAME) - 1) != 0) continue; /* Get reg */ - err = OF_getencprop_alloc(node, "reg", sizeof(*reg), + err = OF_getencprop_alloc_multi(node, "reg", sizeof(*reg), (void **)®); if (err == -1) { free(reg, M_OFWPROP); continue; } /* Match BGX device function */ if ((BGX_DEVFN_0 + unit) == (reg[0] >> 8)) { free(reg, M_OFWPROP); free(node_name, M_BGX); return (node); } free(reg, M_OFWPROP); } free(node_name, M_BGX); return (0); } /* * Similar functionality to pci_find_pcie_root_port() * but this one works for ThunderX. */ static device_t bgx_find_root_pcib(device_t dev) { devclass_t pci_class; device_t pcib, bus; pci_class = devclass_find("pci"); KASSERT(device_get_devclass(device_get_parent(dev)) == pci_class, ("%s: non-pci device %s", __func__, device_get_nameunit(dev))); /* Walk the bridge hierarchy until we find a non-PCI device */ for (;;) { bus = device_get_parent(dev); KASSERT(bus != NULL, ("%s: null parent of %s", __func__, device_get_nameunit(dev))); if (device_get_devclass(bus) != pci_class) return (NULL); pcib = device_get_parent(bus); KASSERT(pcib != NULL, ("%s: null bridge of %s", __func__, device_get_nameunit(bus))); /* * If the parent of this PCIB is not PCI * then we found our root PCIB. */ if (device_get_devclass(device_get_parent(pcib)) != pci_class) return (pcib); dev = pcib; } } static __inline phandle_t bgx_fdt_find_node(struct bgx *bgx) { device_t root_pcib; phandle_t node; char *bgx_sel; size_t len; KASSERT(bgx->bgx_id <= BGX_MAXID, ("Invalid BGX ID: %d, max: %d", bgx->bgx_id, BGX_MAXID)); len = sizeof(BGX_NODE_NAME) + 1; /* ++<\0> */ /* Allocate memory for BGX node name + "/" character */ bgx_sel = malloc(sizeof(*bgx_sel) * (len + 1), M_BGX, M_ZERO | M_WAITOK); /* Prepare node's name */ snprintf(bgx_sel, len + 1, "/"BGX_NODE_NAME"%d", bgx->bgx_id); /* First try the root node */ node = OF_finddevice(bgx_sel); if (node != -1) { /* Found relevant node */ goto out; } /* * Clean-up and try to find BGX in DT * starting from the parent PCI bridge node. 
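* (The absolute "/bgx%d" lookup above handles trees that place the
* BGX nodes at the root; this fallback instead walks the subtree
* under the root PCI bridge, matching either by node name or by the
* BGX device/function encoded in the "reg" property.)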
*/ memset(bgx_sel, 0, sizeof(*bgx_sel) * (len + 1)); snprintf(bgx_sel, len, BGX_NODE_NAME"%d", bgx->bgx_id); /* Find PCI bridge that we are connected to */ root_pcib = bgx_find_root_pcib(bgx->dev); if (root_pcib == NULL) { device_printf(bgx->dev, "Unable to find BGX root bridge\n"); node = 0; goto out; } node = ofw_bus_get_node(root_pcib); if ((int)node <= 0) { device_printf(bgx->dev, "No parent FDT node for BGX\n"); goto out; } node = bgx_fdt_traverse_nodes(bgx->bgx_id, node, bgx_sel, len); out: free(bgx_sel, M_BGX); return (node); } int bgx_fdt_init_phy(struct bgx *bgx) { char *node_name; phandle_t node, child; phandle_t phy, mdio; ssize_t len; uint8_t lmac; char qlm_mode[CONN_TYPE_MAXLEN]; node = bgx_fdt_find_node(bgx); if (node == 0) { device_printf(bgx->dev, "Could not find bgx%d node in FDT\n", bgx->bgx_id); return (ENXIO); } lmac = 0; for (child = OF_child(node); child > 0; child = OF_peer(child)) { len = OF_getprop(child, "qlm-mode", qlm_mode, sizeof(qlm_mode)); if (len > 0) { if (!bgx_fdt_phy_mode_match(bgx, qlm_mode, len)) { /* * Connection type not match with BGX mode. */ continue; } } else { len = OF_getprop_alloc(child, "name", (void **)&node_name); if (len <= 0) { continue; } if (!bgx_fdt_phy_name_match(bgx, node_name, len)) { free(node_name, M_OFWPROP); continue; } free(node_name, M_OFWPROP); } /* Acquire PHY address */ if (OF_getencprop(child, "reg", &bgx->lmac[lmac].phyaddr, sizeof(bgx->lmac[lmac].phyaddr)) <= 0) { if (bootverbose) { device_printf(bgx->dev, "Could not retrieve PHY address\n"); } bgx->lmac[lmac].phyaddr = MII_PHY_ANY; } if (OF_getencprop(child, "phy-handle", &phy, sizeof(phy)) <= 0) { if (bootverbose) { device_printf(bgx->dev, "No phy-handle in PHY node. Skipping...\n"); } continue; } phy = OF_instance_to_package(phy); /* * Get PHY interface (MDIO bus) device. * Driver must be already attached. */ mdio = OF_parent(phy); bgx->lmac[lmac].phy_if_dev = OF_device_from_xref(OF_xref_from_node(mdio)); if (bgx->lmac[lmac].phy_if_dev == NULL) { if (bootverbose) { device_printf(bgx->dev, "Could not find interface to PHY\n"); } continue; } /* Get mac address from FDT */ bgx_fdt_get_macaddr(child, bgx->lmac[lmac].mac); bgx->lmac[lmac].lmacid = lmac; lmac++; if (lmac == MAX_LMAC_PER_BGX) break; } if (lmac == 0) { device_printf(bgx->dev, "Could not find matching PHY\n"); return (ENXIO); } return (0); } Index: user/markj/netdump/sys/dts/arm/armada-38x.dtsi =================================================================== --- user/markj/netdump/sys/dts/arm/armada-38x.dtsi (revision 332407) +++ user/markj/netdump/sys/dts/arm/armada-38x.dtsi (nonexistent) @@ -1,664 +0,0 @@ -/* - * Device Tree Include file for Marvell Armada 38x family of SoCs. - * - * Copyright (C) 2014 Marvell - * - * Lior Amsalem - * Gregory CLEMENT - * Thomas Petazzoni - * - * This file is dual-licensed: you can use it either under the terms - * of the GPL or the X11 license, at your option. Note that this dual - * licensing only applies to this file, and not this project as a - * whole. - * - * a) This file is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License as - * published by the Free Software Foundation; either version 2 of the - * License, or (at your option) any later version. - * - * This file is distributed in the hope that it will be useful - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. 
- * - * Or, alternatively - * - * b) Permission is hereby granted, free of charge, to any person - * obtaining a copy of this software and associated documentation - * files (the "Software"), to deal in the Software without - * restriction, including without limitation the rights to use - * copy, modify, merge, publish, distribute, sublicense, and/or - * sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following - * conditions: - * - * The above copyright notice and this permission notice shall be - * included in all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED , WITHOUT WARRANTY OF ANY KIND - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES - * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT - * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY - * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING - * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE. - * - * $FreeBSD$ - */ - -#include "skeleton.dtsi" -#include -#include - -#define MBUS_ID(target,attributes) (((target) << 24) | ((attributes) << 16)) - -/ { - model = "Marvell Armada 38x family SoC"; - compatible = "marvell,armada380"; - - aliases { - gpio0 = &gpio0; - gpio1 = &gpio1; - serial0 = &uart0; - serial1 = &uart1; - sram0 = &SRAM0; - sram1 = &SRAM1; - }; - - pmu { - compatible = "arm,cortex-a9-pmu"; - interrupts-extended = <&mpic 3>; - }; - - SRAM0: sram@f1100000 { - compatible = "mrvl,cesa-sram"; - reg = <0xf1100000 0x0010000>; - }; - - SRAM1: sram@f1110000 { - compatible = "mrvl,cesa-sram"; - reg = <0xf1110000 0x0010000>; - }; - - soc { - compatible = "marvell,armada380-mbus", "simple-bus"; - #address-cells = <2>; - #size-cells = <1>; - controller = <&mbusc>; - interrupt-parent = <&gic>; - pcie-mem-aperture = <0xe0000000 0x8000000>; - pcie-io-aperture = <0xe8000000 0x100000>; - - bootrom { - compatible = "marvell,bootrom"; - reg = ; - }; - - devbus-bootcs { - compatible = "marvell,mvebu-devbus"; - reg = ; - ranges = <0 MBUS_ID(0x01, 0x2f) 0 0xffffffff>; - #address-cells = <1>; - #size-cells = <1>; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - devbus-cs0 { - compatible = "marvell,mvebu-devbus"; - reg = ; - ranges = <0 MBUS_ID(0x01, 0x3e) 0 0xffffffff>; - #address-cells = <1>; - #size-cells = <1>; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - devbus-cs1 { - compatible = "marvell,mvebu-devbus"; - reg = ; - ranges = <0 MBUS_ID(0x01, 0x3d) 0 0xffffffff>; - #address-cells = <1>; - #size-cells = <1>; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - devbus-cs2 { - compatible = "marvell,mvebu-devbus"; - reg = ; - ranges = <0 MBUS_ID(0x01, 0x3b) 0 0xffffffff>; - #address-cells = <1>; - #size-cells = <1>; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - devbus-cs3 { - compatible = "marvell,mvebu-devbus"; - reg = ; - ranges = <0 MBUS_ID(0x01, 0x37) 0 0xffffffff>; - #address-cells = <1>; - #size-cells = <1>; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - internal-regs { - compatible = "simple-bus"; - #address-cells = <1>; - #size-cells = <1>; - ranges = <0 MBUS_ID(0xf0, 0x01) 0 0x100000>; - - crypto@90000 { - compatible = "mrvl,cesa"; - reg = <0x90000 0x1000 /* tdma base reg chan 0 */ - 0x9D000 0x1000>; /* cesa base reg chan 0 */ - interrupts = ; - interrupt-parent = <&gic>; - sram-handle = <&SRAM0>; - status = "disabled"; - }; - - 
crypto@92000 { - compatible = "mrvl,cesa"; - reg = <0x92000 0x1000 /* tdma base reg chan 1 */ - 0x9F000 0x1000>; /* cesa base reg chan 1 */ - interrupts = ; - interrupt-parent = <&gic>; - sram-handle = <&SRAM1>; - status = "disabled"; - }; - - L2: cache-controller@8000 { - compatible = "arm,pl310-cache"; - reg = <0x8000 0x1000>; - cache-unified; - cache-level = <2>; - arm,double-linefill-incr = <1>; - arm,double-linefill-wrap = <0>; - arm,double-linefill = <1>; - prefetch-data = <1>; - }; - - scu@c000 { - compatible = "arm,cortex-a9-scu"; - reg = <0xc000 0x58>; - }; - - timer@c200 { - compatible = "arm,cortex-a9-global-timer"; - reg = <0xc200 0x20>; - interrupts = ; - clocks = <&coreclk 2>; - }; - - timer@c600 { - compatible = "arm,cortex-a9-twd-timer"; - reg = <0xc600 0x20>; - interrupts = ; - clocks = <&coreclk 2>; - }; - - gic: interrupt-controller@d000 { - compatible = "arm,cortex-a9-gic"; - #interrupt-cells = <3>; - #size-cells = <0>; - interrupt-controller; - reg = <0xd000 0x1000>, - <0xc100 0x100>; - }; - - spi0: spi@10600 { - compatible = "marvell,armada-380-spi", - "marvell,orion-spi"; - reg = <0x10600 0x50>; - #address-cells = <1>; - #size-cells = <0>; - cell-index = <0>; - interrupts = ; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - spi1: spi@10680 { - compatible = "marvell,armada-380-spi", - "marvell,orion-spi"; - reg = <0x10680 0x50>; - #address-cells = <1>; - #size-cells = <0>; - cell-index = <1>; - interrupts = ; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - i2c0: i2c@11000 { - compatible = "marvell,mv64xxx-i2c"; - reg = <0x11000 0x20>; - #address-cells = <1>; - #size-cells = <0>; - interrupts = ; - timeout-ms = <1000>; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - i2c1: i2c@11100 { - compatible = "marvell,mv64xxx-i2c"; - reg = <0x11100 0x20>; - #address-cells = <1>; - #size-cells = <0>; - interrupts = ; - timeout-ms = <1000>; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - uart0: serial@12000 { - compatible = "snps,dw-apb-uart"; - reg = <0x12000 0x100>; - reg-shift = <2>; - interrupts = ; - reg-io-width = <1>; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - uart1: serial@12100 { - compatible = "snps,dw-apb-uart"; - reg = <0x12100 0x100>; - reg-shift = <2>; - interrupts = ; - reg-io-width = <1>; - clocks = <&coreclk 0>; - status = "disabled"; - }; - - pinctrl: pinctrl@18000 { - reg = <0x18000 0x20>; - - ge0_rgmii_pins: ge-rgmii-pins-0 { - marvell,pins = "mpp6", "mpp7", "mpp8", - "mpp9", "mpp10", "mpp11", - "mpp12", "mpp13", "mpp14", - "mpp15", "mpp16", "mpp17"; - marvell,function = "ge0"; - }; - - ge1_rgmii_pins: ge-rgmii-pins-1 { - marvell,pins = "mpp21", "mpp27", "mpp28", - "mpp29", "mpp30", "mpp31", - "mpp32", "mpp37", "mpp38", - "mpp39", "mpp40", "mpp41"; - marvell,function = "ge1"; - }; - - i2c0_pins: i2c-pins-0 { - marvell,pins = "mpp2", "mpp3"; - marvell,function = "i2c0"; - }; - - mdio_pins: mdio-pins { - marvell,pins = "mpp4", "mpp5"; - marvell,function = "ge"; - }; - - ref_clk0_pins: ref-clk-pins-0 { - marvell,pins = "mpp45"; - marvell,function = "ref"; - }; - - ref_clk1_pins: ref-clk-pins-1 { - marvell,pins = "mpp46"; - marvell,function = "ref"; - }; - - spi0_pins: spi-pins-0 { - marvell,pins = "mpp22", "mpp23", "mpp24", - "mpp25"; - marvell,function = "spi0"; - }; - - spi1_pins: spi-pins-1 { - marvell,pins = "mpp56", "mpp57", "mpp58", - "mpp59"; - marvell,function = "spi1"; - }; - - uart0_pins: uart-pins-0 { - marvell,pins = "mpp0", "mpp1"; - marvell,function = "ua0"; - }; - - uart1_pins: uart-pins-1 { - 
marvell,pins = "mpp19", "mpp20"; - marvell,function = "ua1"; - }; - - sdhci_pins: sdhci-pins { - marvell,pins = "mpp48", "mpp49", "mpp50", - "mpp52", "mpp53", "mpp54", - "mpp55", "mpp57", "mpp58", - "mpp59"; - marvell,function = "sd0"; - }; - - sata0_pins: sata-pins-0 { - marvell,pins = "mpp20"; - marvell,function = "sata0"; - }; - - sata1_pins: sata-pins-1 { - marvell,pins = "mpp19"; - marvell,function = "sata1"; - }; - - sata2_pins: sata-pins-2 { - marvell,pins = "mpp47"; - marvell,function = "sata2"; - }; - - sata3_pins: sata-pins-3 { - marvell,pins = "mpp44"; - marvell,function = "sata3"; - }; - }; - - gpio0: gpio@18100 { - compatible = "marvell,orion-gpio"; - reg = <0x18100 0x40>; - ngpios = <32>; - gpio-controller; - #gpio-cells = <2>; - interrupt-controller; - #interrupt-cells = <2>; - interrupts = , - , - , - ; - }; - - gpio1: gpio@18140 { - compatible = "marvell,orion-gpio"; - reg = <0x18140 0x40>; - ngpios = <28>; - gpio-controller; - #gpio-cells = <2>; - interrupt-controller; - #interrupt-cells = <2>; - interrupts = , - , - , - ; - }; - - system-controller@18200 { - compatible = "marvell,armada-380-system-controller", - "marvell,armada-370-xp-system-controller"; - reg = <0x18200 0x100>; - }; - - gateclk: clock-gating-control@18220 { - compatible = "marvell,armada-380-gating-clock"; - reg = <0x18220 0x4>; - clocks = <&coreclk 0>; - #clock-cells = <1>; - }; - - coreclk: mvebu-sar@18600 { - compatible = "marvell,armada-380-core-clock"; - reg = <0x18600 0x04>; - #clock-cells = <1>; - }; - - mbusc: mbus-controller@20000 { - compatible = "marvell,mbus-controller"; - reg = <0x20000 0x100>, <0x20180 0x20>; - }; - - mpic: interrupt-controller@20a00 { - compatible = "marvell,mpic"; - reg = <0x20a00 0x2d0>, <0x21870 0x300>; - #interrupt-cells = <1>; - #size-cells = <1>; - interrupt-controller; - msi-controller; - interrupts = ; - }; - - timer@20300 { - compatible = "marvell,armada-380-timer", - "marvell,armada-xp-timer"; - reg = <0x20300 0x30>, <0x21040 0x30>; - interrupts-extended = <&gic GIC_SPI 8 IRQ_TYPE_LEVEL_HIGH>, - <&gic GIC_SPI 9 IRQ_TYPE_LEVEL_HIGH>, - <&gic GIC_SPI 10 IRQ_TYPE_LEVEL_HIGH>, - <&gic GIC_SPI 11 IRQ_TYPE_LEVEL_HIGH>, - <&mpic 5>, - <&mpic 6>; - clocks = <&coreclk 2>, <&refclk>; - clock-names = "nbclk", "fixed"; - }; - - watchdog@20300 { - compatible = "marvell,armada-380-wdt"; - reg = <0x20300 0x34>, <0x20704 0x4>, <0x18260 0x4>; - clocks = <&coreclk 2>, <&refclk>; - clock-names = "nbclk", "fixed"; - }; - - cpurst@20800 { - compatible = "marvell,armada-370-cpu-reset"; - reg = <0x20800 0x10>; - }; - - mpcore-soc-ctrl@20d20 { - compatible = "marvell,armada-380-mpcore-soc-ctrl"; - reg = <0x20d20 0x6c>; - }; - - coherency-fabric@21010 { - compatible = "marvell,armada-380-coherency-fabric"; - reg = <0x21010 0x1c>; - }; - - pmsu@22000 { - compatible = "marvell,armada-380-pmsu"; - reg = <0x22000 0x1000>; - }; - - eth1: ethernet@30000 { - compatible = "marvell,armada-370-neta"; - reg = <0x30000 0x4000>; - interrupts-extended = <&mpic 10>; - clocks = <&gateclk 3>; - status = "disabled"; - }; - - eth2: ethernet@34000 { - compatible = "marvell,armada-370-neta"; - reg = <0x34000 0x4000>; - interrupts-extended = <&mpic 12>; - clocks = <&gateclk 2>; - status = "disabled"; - }; - - usb@58000 { - compatible = "marvell,orion-ehci"; - reg = <0x58000 0x500>; - interrupts = ; - clocks = <&gateclk 18>; - status = "disabled"; - }; - - xor@60800 { - compatible = "marvell,orion-xor"; - reg = <0x60800 0x100 - 0x60a00 0x100>; - clocks = <&gateclk 22>; - status = "okay"; - - xor00 { - 
interrupts = ; - dmacap,memcpy; - dmacap,xor; - }; - xor01 { - interrupts = ; - dmacap,memcpy; - dmacap,xor; - dmacap,memset; - }; - }; - - xor@60900 { - compatible = "marvell,orion-xor"; - reg = <0x60900 0x100 - 0x60b00 0x100>; - clocks = <&gateclk 28>; - status = "okay"; - - xor10 { - interrupts = ; - dmacap,memcpy; - dmacap,xor; - }; - xor11 { - interrupts = ; - dmacap,memcpy; - dmacap,xor; - dmacap,memset; - }; - }; - - eth0: ethernet@70000 { - compatible = "marvell,armada-370-neta"; - reg = <0x70000 0x4000>; - interrupts-extended = <&mpic 8>; - clocks = <&gateclk 4>; - status = "disabled"; - }; - - mdio: mdio@72004 { - #address-cells = <1>; - #size-cells = <0>; - compatible = "marvell,orion-mdio"; - reg = <0x72004 0x4>; - clocks = <&gateclk 4>; - }; - - rtc@a3800 { - compatible = "marvell,armada-380-rtc"; - reg = <0xa3800 0x20>, <0x184a0 0x0c>; - reg-names = "rtc", "rtc-soc"; - interrupts = ; - }; - - sata@a8000 { - compatible = "marvell,armada-380-ahci"; - reg = <0xa8000 0x2000>; - interrupts = ; - clocks = <&gateclk 15>; - status = "disabled"; - }; - - bm: bm@c8000 { - compatible = "marvell,armada-380-neta-bm"; - reg = <0xc8000 0xac>; - clocks = <&gateclk 13>; - internal-mem = <&bm_bppi>; - status = "disabled"; - }; - - sata@e0000 { - compatible = "marvell,armada-380-ahci"; - reg = <0xe0000 0x2000>; - interrupts = ; - clocks = <&gateclk 30>; - status = "disabled"; - }; - - coredivclk: clock@e4250 { - compatible = "marvell,armada-380-corediv-clock"; - reg = <0xe4250 0xc>; - #clock-cells = <1>; - clocks = <&mainpll>; - clock-output-names = "nand"; - }; - - thermal@e8078 { - compatible = "marvell,armada380-thermal"; - reg = <0xe4078 0x4>, <0xe4074 0x4>; - status = "okay"; - }; - - flash@d0000 { - compatible = "marvell,armada370-nand"; - reg = <0xd0000 0x54>; - #address-cells = <1>; - #size-cells = <1>; - interrupts = ; - clocks = <&coredivclk 0>; - status = "disabled"; - }; - - sdhci@d8000 { - compatible = "marvell,armada-380-sdhci"; - reg-names = "sdhci", "mbus", "conf-sdio3"; - reg = <0xd8000 0x1000>, - <0xdc000 0x100>, - <0x18454 0x4>; - interrupts = ; - clocks = <&gateclk 17>; - mrvl,clk-delay-cycles = <0x1F>; - status = "disabled"; - }; - - usb3@f0000 { - compatible = "marvell,armada-380-xhci"; - reg = <0xf0000 0x4000>,<0xf4000 0x4000>; - interrupts = ; - clocks = <&gateclk 9>; - status = "disabled"; - }; - - usb3@f8000 { - compatible = "marvell,armada-380-xhci"; - reg = <0xf8000 0x4000>,<0xfc000 0x4000>; - interrupts = ; - clocks = <&gateclk 10>; - status = "disabled"; - }; - }; - - bm_bppi: bm-bppi { - compatible = "mmio-sram"; - reg = ; - ranges = <0 MBUS_ID(0x0c, 0x04) 0 0x100000>; - #address-cells = <1>; - #size-cells = <1>; - clocks = <&gateclk 13>; - no-memory-wc; - status = "disabled"; - }; - }; - - clocks { - /* 2 GHz fixed main PLL */ - mainpll: mainpll { - compatible = "fixed-clock"; - #clock-cells = <0>; - clock-frequency = <1000000000>; - }; - - /* 25 MHz reference crystal */ - refclk: oscillator { - compatible = "fixed-clock"; - #clock-cells = <0>; - clock-frequency = <25000000>; - }; - }; -}; Property changes on: user/markj/netdump/sys/dts/arm/armada-38x.dtsi ___________________________________________________________________ Deleted: svn:eol-style ## -1 +0,0 ## -native \ No newline at end of property Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Deleted: svn:mime-type ## -1 +0,0 ## -text/plain \ No newline at end of property Index: user/markj/netdump/sys/dts/arm/armada-388.dtsi 
=================================================================== --- user/markj/netdump/sys/dts/arm/armada-388.dtsi (revision 332407) +++ user/markj/netdump/sys/dts/arm/armada-388.dtsi (nonexistent) @@ -1,72 +0,0 @@ -/* - * Device Tree Include file for Marvell Armada 388 SoC. - * - * Copyright (C) 2015 Marvell - * - * Gregory CLEMENT - * - * This file is dual-licensed: you can use it either under the terms - * of the GPL or the X11 license, at your option. Note that this dual - * licensing only applies to this file, and not this project as a - * whole. - * - * a) This file is licensed under the terms of the GNU General Public - * License version 2. This program is licensed "as is" without - * any warranty of any kind, whether express or implied. - * - * Or, alternatively, - * - * b) Permission is hereby granted, free of charge, to any person - * obtaining a copy of this software and associated documentation - * files (the "Software"), to deal in the Software without - * restriction, including without limitation the rights to use, - * copy, modify, merge, publish, distribute, sublicense, and/or - * sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following - * conditions: - * - * The above copyright notice and this permission notice shall be - * included in all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES - * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT - * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, - * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING - * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE. - * - * - * The main difference with the Armada 385 is that the 388 can handle two more - * SATA ports. So we can reuse the dtsi of the Armada 385, override the pinctrl - * property and the name of the SoC, and add the second SATA host which control - * the 2 other ports. - * - * $FreeBSD$ - */ - -#include "armada-385.dtsi" - -/ { - model = "Marvell Armada 388 family SoC"; - compatible = "marvell,armada388", "marvell,armada385", - "marvell,armada380"; - - soc { - internal-regs { - pinctrl@18000 { - compatible = "marvell,mv88f6828-pinctrl"; - }; - - sata@e0000 { - compatible = "marvell,armada-380-ahci"; - reg = <0xe0000 0x2000>; - interrupts = ; - clocks = <&gateclk 30>; - status = "disabled"; - }; - - }; - }; -}; Property changes on: user/markj/netdump/sys/dts/arm/armada-388.dtsi ___________________________________________________________________ Deleted: svn:eol-style ## -1 +0,0 ## -native \ No newline at end of property Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Deleted: svn:mime-type ## -1 +0,0 ## -text/plain \ No newline at end of property Index: user/markj/netdump/sys/dts/arm/armada-385.dtsi =================================================================== --- user/markj/netdump/sys/dts/arm/armada-385.dtsi (revision 332407) +++ user/markj/netdump/sys/dts/arm/armada-385.dtsi (nonexistent) @@ -1,187 +0,0 @@ -/* - * Device Tree Include file for Marvell Armada 385 SoC. - * - * Copyright (C) 2014 Marvell - * - * Lior Amsalem - * Gregory CLEMENT - * Thomas Petazzoni - * - * This file is dual-licensed: you can use it either under the terms - * of the GPL or the X11 license, at your option. 
Note that this dual - * licensing only applies to this file, and not this project as a - * whole. - * - * a) This file is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License as - * published by the Free Software Foundation; either version 2 of the - * License, or (at your option) any later version. - * - * This file is distributed in the hope that it will be useful - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * Or, alternatively - * - * b) Permission is hereby granted, free of charge, to any person - * obtaining a copy of this software and associated documentation - * files (the "Software"), to deal in the Software without - * restriction, including without limitation the rights to use - * copy, modify, merge, publish, distribute, sublicense, and/or - * sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following - * conditions: - * - * The above copyright notice and this permission notice shall be - * included in all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED , WITHOUT WARRANTY OF ANY KIND - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES - * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT - * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY - * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING - * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE. - * - * $FreeBSD$ - */ - -#include "armada-38x.dtsi" - -/ { - model = "Marvell Armada 385 family SoC"; - compatible = "marvell,armada385", "marvell,armada380"; - - cpus { - #address-cells = <1>; - #size-cells = <0>; - enable-method = "marvell,armada-380-smp"; - - cpu@0 { - device_type = "cpu"; - compatible = "arm,cortex-a9"; - reg = <0>; - }; - cpu@1 { - device_type = "cpu"; - compatible = "arm,cortex-a9"; - reg = <1>; - }; - }; - - soc { - internal-regs { - pinctrl@18000 { - compatible = "marvell,mv88f6820-pinctrl"; - }; - }; - - pcie-controller { - compatible = "marvell,armada-370-pcie"; - status = "disabled"; - device_type = "pci"; - - #address-cells = <3>; - #size-cells = <2>; - - msi-parent = <&mpic>; - bus-range = <0x00 0xff>; - - ranges = - <0x82000000 0 0x80000 MBUS_ID(0xf0, 0x01) 0x80000 0 0x00002000 - 0x82000000 0 0x40000 MBUS_ID(0xf0, 0x01) 0x40000 0 0x00002000 - 0x82000000 0 0x44000 MBUS_ID(0xf0, 0x01) 0x44000 0 0x00002000 - 0x82000000 0 0x48000 MBUS_ID(0xf0, 0x01) 0x48000 0 0x00002000 - 0x82000000 0x1 0 MBUS_ID(0x08, 0xe8) 0 1 0 /* Port 0 MEM */ - 0x81000000 0x1 0 MBUS_ID(0x08, 0xe0) 0 1 0 /* Port 0 IO */ - 0x82000000 0x2 0 MBUS_ID(0x04, 0xe8) 0 1 0 /* Port 1 MEM */ - 0x81000000 0x2 0 MBUS_ID(0x04, 0xe0) 0 1 0 /* Port 1 IO */ - 0x82000000 0x3 0 MBUS_ID(0x04, 0xd8) 0 1 0 /* Port 2 MEM */ - 0x81000000 0x3 0 MBUS_ID(0x04, 0xd0) 0 1 0 /* Port 2 IO */ - 0x82000000 0x4 0 MBUS_ID(0x04, 0xb8) 0 1 0 /* Port 3 MEM */ - 0x81000000 0x4 0 MBUS_ID(0x04, 0xb0) 0 1 0 /* Port 3 IO */>; - - /* - * This port can be either x4 or x1. When - * configured in x4 by the bootloader, then - * pcie@4,0 is not available. 
- */ - pcie@1,0 { - device_type = "pci"; - assigned-addresses = <0x82000800 0 0x80000 0 0x2000>; - reg = <0x0800 0 0 0 0>; - #address-cells = <3>; - #size-cells = <2>; - #interrupt-cells = <1>; - ranges = <0x82000000 0 0 0x82000000 0x1 0 1 0 - 0x81000000 0 0 0x81000000 0x1 0 1 0>; - interrupt-map-mask = <0 0 0 0>; - interrupt-map = <0 0 0 0 &gic GIC_SPI 29 IRQ_TYPE_LEVEL_HIGH>; - marvell,pcie-port = <0>; - marvell,pcie-lane = <0>; - clocks = <&gateclk 8>; - status = "disabled"; - }; - - /* x1 port */ - pcie@2,0 { - device_type = "pci"; - assigned-addresses = <0x82000800 0 0x40000 0 0x2000>; - reg = <0x1000 0 0 0 0>; - #address-cells = <3>; - #size-cells = <2>; - #interrupt-cells = <1>; - ranges = <0x82000000 0 0 0x82000000 0x2 0 1 0 - 0x81000000 0 0 0x81000000 0x2 0 1 0>; - interrupt-map-mask = <0 0 0 0>; - interrupt-map = <0 0 0 0 &gic GIC_SPI 33 IRQ_TYPE_LEVEL_HIGH>; - marvell,pcie-port = <1>; - marvell,pcie-lane = <0>; - clocks = <&gateclk 5>; - status = "disabled"; - }; - - /* x1 port */ - pcie@3,0 { - device_type = "pci"; - assigned-addresses = <0x82000800 0 0x44000 0 0x2000>; - reg = <0x1800 0 0 0 0>; - #address-cells = <3>; - #size-cells = <2>; - #interrupt-cells = <1>; - ranges = <0x82000000 0 0 0x82000000 0x3 0 1 0 - 0x81000000 0 0 0x81000000 0x3 0 1 0>; - interrupt-map-mask = <0 0 0 0>; - interrupt-map = <0 0 0 0 &gic GIC_SPI 70 IRQ_TYPE_LEVEL_HIGH>; - marvell,pcie-port = <2>; - marvell,pcie-lane = <0>; - clocks = <&gateclk 6>; - status = "disabled"; - }; - - /* - * x1 port only available when pcie@1,0 is - * configured as a x1 port - */ - pcie@4,0 { - device_type = "pci"; - assigned-addresses = <0x82000800 0 0x48000 0 0x2000>; - reg = <0x2000 0 0 0 0>; - #address-cells = <3>; - #size-cells = <2>; - #interrupt-cells = <1>; - ranges = <0x82000000 0 0 0x82000000 0x4 0 1 0 - 0x81000000 0 0 0x81000000 0x4 0 1 0>; - interrupt-map-mask = <0 0 0 0>; - interrupt-map = <0 0 0 0 &gic GIC_SPI 71 IRQ_TYPE_LEVEL_HIGH>; - marvell,pcie-port = <3>; - marvell,pcie-lane = <0>; - clocks = <&gateclk 7>; - status = "disabled"; - }; - }; - }; - -}; Property changes on: user/markj/netdump/sys/dts/arm/armada-385.dtsi ___________________________________________________________________ Deleted: svn:eol-style ## -1 +0,0 ## -native \ No newline at end of property Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Deleted: svn:mime-type ## -1 +0,0 ## -text/plain \ No newline at end of property Index: user/markj/netdump/sys/dts/arm/armada-380.dtsi =================================================================== --- user/markj/netdump/sys/dts/arm/armada-380.dtsi (revision 332407) +++ user/markj/netdump/sys/dts/arm/armada-380.dtsi (nonexistent) @@ -1,156 +0,0 @@ -/* - * Device Tree Include file for Marvell Armada 380 SoC. - * - * Copyright (C) 2014 Marvell - * - * Lior Amsalem - * Gregory CLEMENT - * Thomas Petazzoni - * - * This file is dual-licensed: you can use it either under the terms - * of the GPL or the X11 license, at your option. Note that this dual - * licensing only applies to this file, and not this project as a - * whole. - * - * a) This file is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License as - * published by the Free Software Foundation; either version 2 of the - * License, or (at your option) any later version. - * - * This file is distributed in the hope that it will be useful - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the - * GNU General Public License for more details. - * - * Or, alternatively - * - * b) Permission is hereby granted, free of charge, to any person - * obtaining a copy of this software and associated documentation - * files (the "Software"), to deal in the Software without - * restriction, including without limitation the rights to use - * copy, modify, merge, publish, distribute, sublicense, and/or - * sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following - * conditions: - * - * The above copyright notice and this permission notice shall be - * included in all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED , WITHOUT WARRANTY OF ANY KIND - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES - * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT - * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY - * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING - * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE. - * - * $FreeBSD$ - */ - -#include "armada-38x.dtsi" - -/ { - model = "Marvell Armada 380 family SoC"; - compatible = "marvell,armada380"; - - cpus { - #address-cells = <1>; - #size-cells = <0>; - enable-method = "marvell,armada-380-smp"; - - cpu@0 { - device_type = "cpu"; - compatible = "arm,cortex-a9"; - reg = <0>; - }; - }; - - soc { - internal-regs { - pinctrl@18000 { - compatible = "marvell,mv88f6810-pinctrl"; - }; - }; - - pcie-controller { - compatible = "marvell,armada-370-pcie"; - status = "disabled"; - device_type = "pci"; - - #address-cells = <3>; - #size-cells = <2>; - - msi-parent = <&mpic>; - bus-range = <0x00 0xff>; - - ranges = - <0x82000000 0 0x80000 MBUS_ID(0xf0, 0x01) 0x80000 0 0x00002000 - 0x82000000 0 0x40000 MBUS_ID(0xf0, 0x01) 0x40000 0 0x00002000 - 0x82000000 0 0x44000 MBUS_ID(0xf0, 0x01) 0x44000 0 0x00002000 - 0x82000000 0x1 0 MBUS_ID(0x08, 0xe8) 0 1 0 /* Port 0 MEM */ - 0x81000000 0x1 0 MBUS_ID(0x08, 0xe0) 0 1 0 /* Port 0 IO */ - 0x82000000 0x2 0 MBUS_ID(0x04, 0xe8) 0 1 0 /* Port 1 MEM */ - 0x81000000 0x2 0 MBUS_ID(0x04, 0xe0) 0 1 0 /* Port 1 IO */ - 0x82000000 0x3 0 MBUS_ID(0x04, 0xd8) 0 1 0 /* Port 2 MEM */ - 0x81000000 0x3 0 MBUS_ID(0x04, 0xd0) 0 1 0 /* Port 2 IO */>; - - /* x1 port */ - pcie@1,0 { - device_type = "pci"; - assigned-addresses = <0x82000800 0 0x80000 0 0x2000>; - reg = <0x0800 0 0 0 0>; - #address-cells = <3>; - #size-cells = <2>; - #interrupt-cells = <1>; - ranges = <0x82000000 0 0 0x82000000 0x1 0 1 0 - 0x81000000 0 0 0x81000000 0x1 0 1 0>; - interrupt-map-mask = <0 0 0 0>; - interrupt-map = <0 0 0 0 &gic GIC_SPI 29 IRQ_TYPE_LEVEL_HIGH>; - interrupt-parent = <&gic>; - marvell,pcie-port = <0>; - marvell,pcie-lane = <0>; - clocks = <&gateclk 8>; - status = "disabled"; - }; - - /* x1 port */ - pcie@2,0 { - device_type = "pci"; - assigned-addresses = <0x82000800 0 0x40000 0 0x2000>; - reg = <0x1000 0 0 0 0>; - #address-cells = <3>; - #size-cells = <2>; - #interrupt-cells = <1>; - ranges = <0x82000000 0 0 0x82000000 0x2 0 1 0 - 0x81000000 0 0 0x81000000 0x2 0 1 0>; - interrupt-map-mask = <0 0 0 0>; - interrupt-map = <0 0 0 0 &gic GIC_SPI 33 IRQ_TYPE_LEVEL_HIGH>; - interrupt-parent = <&gic>; - marvell,pcie-port = <1>; - marvell,pcie-lane = <0>; - clocks = <&gateclk 5>; - status = "disabled"; - }; - - /* x1 port */ - pcie@3,0 { - device_type = "pci"; - assigned-addresses = <0x82000800 0 
0x44000 0 0x2000>; - reg = <0x1800 0 0 0 0>; - #address-cells = <3>; - #size-cells = <2>; - #interrupt-cells = <1>; - ranges = <0x82000000 0 0 0x82000000 0x3 0 1 0 - 0x81000000 0 0 0x81000000 0x3 0 1 0>; - interrupt-map-mask = <0 0 0 0>; - interrupt-map = <0 0 0 0 &gic GIC_SPI 70 IRQ_TYPE_LEVEL_HIGH>; - interrupt-parent = <&gic>; - marvell,pcie-port = <2>; - marvell,pcie-lane = <0>; - clocks = <&gateclk 6>; - status = "disabled"; - }; - }; - }; -}; Property changes on: user/markj/netdump/sys/dts/arm/armada-380.dtsi ___________________________________________________________________ Deleted: svn:eol-style ## -1 +0,0 ## -native \ No newline at end of property Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Deleted: svn:mime-type ## -1 +0,0 ## -text/plain \ No newline at end of property Index: user/markj/netdump/sys/dts/arm/armada-388-gp.dts =================================================================== --- user/markj/netdump/sys/dts/arm/armada-388-gp.dts (revision 332407) +++ user/markj/netdump/sys/dts/arm/armada-388-gp.dts (nonexistent) @@ -1,425 +0,0 @@ -/* - * Device Tree file for Marvell Armada 385 development board - * (RD-88F6820-GP) - * - * Copyright (C) 2014 Marvell - * - * Gregory CLEMENT - * - * This file is dual-licensed: you can use it either under the terms - * of the GPL or the X11 license, at your option. Note that this dual - * licensing only applies to this file, and not this project as a - * whole. - * - * a) This file is licensed under the terms of the GNU General Public - * License version 2. This program is licensed "as is" without - * any warranty of any kind, whether express or implied. - * - * Or, alternatively, - * - * b) Permission is hereby granted, free of charge, to any person - * obtaining a copy of this software and associated documentation - * files (the "Software"), to deal in the Software without - * restriction, including without limitation the rights to use, - * copy, modify, merge, publish, distribute, sublicense, and/or - * sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following - * conditions: - * - * The above copyright notice and this permission notice shall be - * included in all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES - * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT - * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, - * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING - * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE. 
- * - * $FreeBSD$ - */ - -/dts-v1/; -#include "armada-388.dtsi" -#include - -/ { - model = "Marvell Armada 385 GP"; - compatible = "marvell,a385-gp", "marvell,armada388", "marvell,armada380"; - - chosen { - stdout-path = "serial0:115200n8"; - }; - - memory { - device_type = "memory"; - reg = <0x00000000 0x80000000>; /* 2 GB */ - }; - - soc { - ranges = ; - - internal-regs { - crypto@90000 { - status = "okay"; - }; - crypto@92000 { - status = "okay"; - }; - - spi@10600 { - pinctrl-names = "default"; - pinctrl-0 = <&spi0_pins>; - status = "okay"; - - spi-flash@0 { - #address-cells = <1>; - #size-cells = <1>; - compatible = "st,m25p128", "jedec,spi-nor"; - reg = <0>; /* Chip select 0 */ - spi-max-frequency = <50000000>; - m25p,fast-read; - }; - }; - - i2c@11000 { - pinctrl-names = "default"; - pinctrl-0 = <&i2c0_pins>; - status = "okay"; - clock-frequency = <100000>; - /* - * The EEPROM located at adresse 54 is needed - * for the boot - DO NOT ERASE IT - - */ - - expander0: pca9555@20 { - compatible = "nxp,pca9555"; - pinctrl-names = "default"; - pinctrl-0 = <&pca0_pins>; - interrupt-parent = <&gpio0>; - interrupts = <18 IRQ_TYPE_EDGE_FALLING>; - gpio-controller; - #gpio-cells = <2>; - interrupt-controller; - #interrupt-cells = <2>; - reg = <0x20>; - }; - - expander1: pca9555@21 { - compatible = "nxp,pca9555"; - pinctrl-names = "default"; - interrupt-parent = <&gpio0>; - interrupts = <18 IRQ_TYPE_EDGE_FALLING>; - gpio-controller; - #gpio-cells = <2>; - interrupt-controller; - #interrupt-cells = <2>; - reg = <0x21>; - }; - - }; - - serial@12000 { - /* - * Exported on the micro USB connector CON16 - * through an FTDI - */ - - pinctrl-names = "default"; - pinctrl-0 = <&uart0_pins>; - status = "okay"; - }; - - /* GE1 CON15 */ - ethernet@30000 { - pinctrl-names = "default"; - pinctrl-0 = <&ge1_rgmii_pins>; - status = "okay"; - phy = <&phy1>; - phy-mode = "rgmii-id"; - }; - - /* CON4 */ - usb@58000 { - vcc-supply = <®_usb2_0_vbus>; - status = "okay"; - }; - - /* GE0 CON1 */ - ethernet@70000 { - pinctrl-names = "default"; - /* - * The Reference Clock 0 is used to provide a - * clock to the PHY - */ - pinctrl-0 = <&ge0_rgmii_pins>, <&ref_clk0_pins>; - status = "okay"; - phy = <&phy0>; - phy-mode = "rgmii-id"; - }; - - - mdio@72004 { - pinctrl-names = "default"; - pinctrl-0 = <&mdio_pins>; - - phy0: ethernet-phy@1 { - reg = <1>; - }; - - phy1: ethernet-phy@0 { - reg = <0>; - }; - }; - - sata@a8000 { - pinctrl-names = "default"; - pinctrl-0 = <&sata0_pins>, <&sata1_pins>; - status = "okay"; - #address-cells = <1>; - #size-cells = <0>; - - sata0: sata-port@0 { - reg = <0>; - target-supply = <®_5v_sata0>; - }; - - sata1: sata-port@1 { - reg = <1>; - target-supply = <®_5v_sata1>; - }; - }; - - sata@e0000 { - pinctrl-names = "default"; - pinctrl-0 = <&sata2_pins>, <&sata3_pins>; - status = "okay"; - #address-cells = <1>; - #size-cells = <0>; - - sata2: sata-port@0 { - reg = <0>; - target-supply = <®_5v_sata2>; - }; - - sata3: sata-port@1 { - reg = <1>; - target-supply = <®_5v_sata3>; - }; - }; - - sdhci@d8000 { - pinctrl-names = "default"; - pinctrl-0 = <&sdhci_pins>; - cd-gpios = <&expander0 5 GPIO_ACTIVE_LOW>; - no-1-8-v; - wp-inverted; - bus-width = <8>; - status = "okay"; - }; - - /* CON5 */ - usb3@f0000 { - vcc-supply = <®_usb2_1_vbus>; - status = "okay"; - }; - - /* CON7 */ - usb3@f8000 { - vcc-supply = <®_usb3_vbus>; - status = "okay"; - }; - }; - - gpio-fan { - compatible = "gpio-fan"; - gpios = <&expander1 3 GPIO_ACTIVE_HIGH>; - gpio-fan,speed-map = < 0 0 - 3000 1>; - }; - pcie-controller { - 
status = "okay"; - /* - * One PCIe units is accessible through - * standard PCIe slot on the board. - */ - pcie@1,0 { - /* Port 0, Lane 0 */ - status = "okay"; - }; - - /* - * The two other PCIe units are accessible - * through mini PCIe slot on the board. - */ - pcie@2,0 { - /* Port 1, Lane 0 */ - status = "okay"; - }; - pcie@3,0 { - /* Port 2, Lane 0 */ - status = "okay"; - }; - }; - }; - - - reg_usb3_vbus: usb3-vbus { - compatible = "regulator-fixed"; - regulator-name = "usb3-vbus"; - regulator-min-microvolt = <5000000>; - regulator-max-microvolt = <5000000>; - enable-active-high; - regulator-always-on; - gpio = <&expander1 15 GPIO_ACTIVE_HIGH>; - }; - - reg_usb2_0_vbus: v5-vbus0 { - compatible = "regulator-fixed"; - regulator-name = "v5.0-vbus0"; - regulator-min-microvolt = <5000000>; - regulator-max-microvolt = <5000000>; - enable-active-high; - regulator-always-on; - gpio = <&expander1 14 GPIO_ACTIVE_HIGH>; - }; - - reg_usb2_1_vbus: v5-vbus1 { - compatible = "regulator-fixed"; - regulator-name = "v5.0-vbus1"; - regulator-min-microvolt = <5000000>; - regulator-max-microvolt = <5000000>; - enable-active-high; - regulator-always-on; - gpio = <&expander0 4 GPIO_ACTIVE_HIGH>; - }; - - reg_usb2_1_vbus: v5-vbus1 { - compatible = "regulator-fixed"; - regulator-name = "v5.0-vbus1"; - regulator-min-microvolt = <5000000>; - regulator-max-microvolt = <5000000>; - enable-active-high; - regulator-always-on; - gpio = <&expander0 4 GPIO_ACTIVE_HIGH>; - }; - - reg_sata0: pwr-sata0 { - compatible = "regulator-fixed"; - regulator-name = "pwr_en_sata0"; - enable-active-high; - regulator-always-on; - - }; - - reg_5v_sata0: v5-sata0 { - compatible = "regulator-fixed"; - regulator-name = "v5.0-sata0"; - regulator-min-microvolt = <5000000>; - regulator-max-microvolt = <5000000>; - regulator-always-on; - vin-supply = <®_sata0>; - }; - - reg_12v_sata0: v12-sata0 { - compatible = "regulator-fixed"; - regulator-name = "v12.0-sata0"; - regulator-min-microvolt = <12000000>; - regulator-max-microvolt = <12000000>; - regulator-always-on; - vin-supply = <®_sata0>; - }; - - reg_sata1: pwr-sata1 { - regulator-name = "pwr_en_sata1"; - compatible = "regulator-fixed"; - regulator-min-microvolt = <12000000>; - regulator-max-microvolt = <12000000>; - enable-active-high; - regulator-always-on; - gpio = <&expander0 3 GPIO_ACTIVE_HIGH>; - }; - - reg_5v_sata1: v5-sata1 { - compatible = "regulator-fixed"; - regulator-name = "v5.0-sata1"; - regulator-min-microvolt = <5000000>; - regulator-max-microvolt = <5000000>; - regulator-always-on; - vin-supply = <®_sata1>; - }; - - reg_12v_sata1: v12-sata1 { - compatible = "regulator-fixed"; - regulator-name = "v12.0-sata1"; - regulator-min-microvolt = <12000000>; - regulator-max-microvolt = <12000000>; - regulator-always-on; - vin-supply = <®_sata1>; - }; - - reg_sata2: pwr-sata2 { - compatible = "regulator-fixed"; - regulator-name = "pwr_en_sata2"; - enable-active-high; - regulator-always-on; - gpio = <&expander0 11 GPIO_ACTIVE_HIGH>; - }; - - reg_5v_sata2: v5-sata2 { - compatible = "regulator-fixed"; - regulator-name = "v5.0-sata2"; - regulator-min-microvolt = <5000000>; - regulator-max-microvolt = <5000000>; - regulator-always-on; - vin-supply = <®_sata2>; - }; - - reg_12v_sata2: v12-sata2 { - compatible = "regulator-fixed"; - regulator-name = "v12.0-sata2"; - regulator-min-microvolt = <12000000>; - regulator-max-microvolt = <12000000>; - regulator-always-on; - vin-supply = <®_sata2>; - }; - - reg_sata3: pwr-sata3 { - compatible = "regulator-fixed"; - regulator-name = "pwr_en_sata3"; 
- enable-active-high; - regulator-always-on; - gpio = <&expander0 12 GPIO_ACTIVE_HIGH>; - }; - - reg_5v_sata3: v5-sata3 { - compatible = "regulator-fixed"; - regulator-name = "v5.0-sata3"; - regulator-min-microvolt = <5000000>; - regulator-max-microvolt = <5000000>; - regulator-always-on; - vin-supply = <&reg_sata3>; - }; - - reg_12v_sata3: v12-sata3 { - compatible = "regulator-fixed"; - regulator-name = "v12.0-sata3"; - regulator-min-microvolt = <12000000>; - regulator-max-microvolt = <12000000>; - regulator-always-on; - vin-supply = <&reg_sata3>; - }; -}; - -&pinctrl { - pca0_pins: pca0_pins { - marvell,pins = "mpp18"; - marvell,function = "gpio"; - }; -}; Property changes on: user/markj/netdump/sys/dts/arm/armada-388-gp.dts ___________________________________________________________________ Deleted: svn:eol-style ## -1 +0,0 ## -native \ No newline at end of property Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Deleted: svn:mime-type ## -1 +0,0 ## -text/plain \ No newline at end of property Index: user/markj/netdump/sys/dts/arm/armada-38x-solidrun-microsom.dtsi =================================================================== --- user/markj/netdump/sys/dts/arm/armada-38x-solidrun-microsom.dtsi (revision 332407) +++ user/markj/netdump/sys/dts/arm/armada-38x-solidrun-microsom.dtsi (nonexistent) @@ -1,130 +0,0 @@ -/* - * Device Tree file for SolidRun Armada 38x Microsom - * - * Copyright (C) 2015 Russell King - * - * This board is in development; the contents of this file work with - * the A1 rev 2.0 of the board, which does not represent final - * production board. Things will change, don't expect this file to - * remain compatible into the future. - * - * This file is dual-licensed: you can use it either under the terms - * of the GPL or the X11 license, at your option. Note that this dual - * licensing only applies to this file, and not this project as a - * whole. - * - * a) This file is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * version 2 as published by the Free Software Foundation. - * - * This file is distributed in the hope that it will be useful - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * Or, alternatively - * - * b) Permission is hereby granted, free of charge, to any person - * obtaining a copy of this software and associated documentation - * files (the "Software"), to deal in the Software without - * restriction, including without limitation the rights to use, - * copy, modify, merge, publish, distribute, sublicense, and/or - * sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following - * conditions: - * - * The above copyright notice and this permission notice shall be - * included in all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES - * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT - * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, - * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING - * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE.
- * - * $FreeBSD$ - */ -#include -#include - -/ { - memory { - device_type = "memory"; - reg = <0x00000000 0x10000000>; /* 256 MB */ - }; - - soc { - ranges = ; - - internal-regs { - ethernet@70000 { - pinctrl-0 = <&ge0_rgmii_pins>; - pinctrl-names = "default"; - phy = <&phy_dedicated>; - phy-mode = "rgmii-id"; - buffer-manager = <&bm>; - bm,pool-long = <0>; - bm,pool-short = <1>; - status = "okay"; - }; - - mdio@72004 { - /* - * Add the phy clock here, so the phy can be - * accessed to read its IDs prior to binding - * with the driver. - */ - pinctrl-0 = <&mdio_pins &microsom_phy_clk_pins>; - pinctrl-names = "default"; - - phy_dedicated: ethernet-phy@0 { - /* - * Annoyingly, the marvell phy driver - * configures the LED register, rather - * than preserving reset-loaded setting. - * We undo that rubbish here. - */ - marvell,reg-init = <3 16 0 0x101e>; - reg = <0>; - }; - }; - - pinctrl@18000 { - microsom_phy_clk_pins: microsom-phy-clk-pins { - marvell,pins = "mpp45"; - marvell,function = "ref"; - }; - }; - - rtc@a3800 { - /* - * If the rtc doesn't work, run "date reset" - * twice in u-boot. - */ - status = "okay"; - }; - - serial@12000 { - pinctrl-0 = <&uart0_pins>; - pinctrl-names = "default"; - status = "okay"; - }; - - bm@c8000 { - status = "okay"; - }; - }; - - bm-bppi { - status = "okay"; - }; - - }; -}; Property changes on: user/markj/netdump/sys/dts/arm/armada-38x-solidrun-microsom.dtsi ___________________________________________________________________ Deleted: svn:eol-style ## -1 +0,0 ## -native \ No newline at end of property Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Deleted: svn:mime-type ## -1 +0,0 ## -text/plain \ No newline at end of property Index: user/markj/netdump/sys/dts/arm/armada-385-db-ap.dts =================================================================== --- user/markj/netdump/sys/dts/arm/armada-385-db-ap.dts (revision 332407) +++ user/markj/netdump/sys/dts/arm/armada-385-db-ap.dts (nonexistent) @@ -1,275 +0,0 @@ -/* - * Device Tree file for Marvell Armada 385 Access Point Development board - * (DB-88F6820-AP) - * - * Copyright (C) 2014 Marvell - * - * Nadav Haklai - * - * This file is dual-licensed: you can use it either under the terms - * of the GPL or the X11 license, at your option. Note that this dual - * licensing only applies to this file, and not this project as a - * whole. - * - * a) This file is licensed under the terms of the GNU General Public - * License version 2. This program is licensed "as is" without - * any warranty of any kind, whether express or implied. - * - * Or, alternatively, - * - * b) Permission is hereby granted, free of charge, to any person - * obtaining a copy of this software and associated documentation - * files (the "Software"), to deal in the Software without - * restriction, including without limitation the rights to use, - * copy, modify, merge, publish, distribute, sublicense, and/or - * sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following - * conditions: - * - * The above copyright notice and this permission notice shall be - * included in all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES - * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT - * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, - * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING - * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE. - * - * $FreeBSD$ - */ - -/dts-v1/; -#include "armada-385.dtsi" - -#include - -/ { - model = "Marvell Armada 385 Access Point Development Board"; - compatible = "marvell,a385-db-ap", "marvell,armada385", "marvell,armada380"; - - chosen { - stdout-path = "serial1"; - }; - - memory { - device_type = "memory"; - reg = <0x00000000 0x80000000>; /* 2GB */ - }; - - soc { - ranges = ; - - internal-regs { - i2c0: i2c@11000 { - pinctrl-names = "default"; - pinctrl-0 = <&i2c0_pins>; - status = "okay"; - - /* - * This bus is wired to two EEPROM - * sockets, one of which holding the - * board ID used by the bootloader. - * Erasing this EEPROM's content will - * brick the board. - * Use this bus with caution. - */ - }; - - mdio@72004 { - pinctrl-names = "default"; - pinctrl-0 = <&mdio_pins>; - - phy0: ethernet-phy@1 { - reg = <1>; - }; - - phy1: ethernet-phy@4 { - reg = <4>; - }; - - phy2: ethernet-phy@6 { - reg = <6>; - }; - }; - - /* UART0 is exposed through the JP8 connector */ - uart0: serial@12000 { - pinctrl-names = "default"; - pinctrl-0 = <&uart0_pins>; - status = "okay"; - }; - - /* - * UART1 is exposed through a FTDI chip - * wired to the mini-USB connector - */ - uart1: serial@12100 { - pinctrl-names = "default"; - pinctrl-0 = <&uart1_pins>; - status = "okay"; - }; - - pinctrl@18000 { - xhci0_vbus_pins: xhci0-vbus-pins { - marvell,pins = "mpp44"; - marvell,function = "gpio"; - }; - }; - - /* CON3 */ - ethernet@30000 { - status = "okay"; - phy = <&phy2>; - phy-mode = "sgmii"; - buffer-manager = <&bm>; - bm,pool-long = <1>; - bm,pool-short = <3>; - }; - - /* CON2 */ - ethernet@34000 { - status = "okay"; - phy = <&phy1>; - phy-mode = "sgmii"; - buffer-manager = <&bm>; - bm,pool-long = <2>; - bm,pool-short = <3>; - }; - - usb@58000 { - status = "okay"; - }; - - /* CON4 */ - ethernet@70000 { - pinctrl-names = "default"; - - /* - * The Reference Clock 0 is used to - * provide a clock to the PHY - */ - pinctrl-0 = <&ge0_rgmii_pins>, <&ref_clk0_pins>; - status = "okay"; - phy = <&phy0>; - phy-mode = "rgmii-id"; - buffer-manager = <&bm>; - bm,pool-long = <0>; - bm,pool-short = <3>; - }; - - crypto@90000 { - status = "okay"; - }; - - crypto@92000 { - status = "okay"; - }; - - bm@c8000 { - status = "okay"; - }; - - nfc: flash@d0000 { - status = "okay"; - num-cs = <1>; - nand-ecc-strength = <4>; - nand-ecc-step-size = <512>; - marvell,nand-keep-config; - marvell,nand-enable-arbiter; - nand-on-flash-bbt; - - partitions { - compatible = "fixed-partitions"; - #address-cells = <1>; - #size-cells = <1>; - - partition@0 { - label = "U-Boot"; - reg = <0x00000000 0x00800000>; - read-only; - }; - - partition@800000 { - label = "uImage"; - reg = <0x00800000 0x00400000>; - read-only; - }; - - partition@c00000 { - label = "Root"; - reg = <0x00c00000 0x3f400000>; - }; - }; - }; - - usb3@f0000 { - status = "okay"; - usb-phy = <&usb3_phy>; - }; - }; - - bm-bppi { - status = "okay"; - }; - - pcie-controller { - status = "okay"; - - /* - * The three PCIe units are accessible through - * standard mini-PCIe slots on the board. 
- */ - pcie@1,0 { - /* Port 0, Lane 0 */ - status = "okay"; - }; - - pcie@2,0 { - /* Port 1, Lane 0 */ - status = "okay"; - }; - - pcie@3,0 { - /* Port 2, Lane 0 */ - status = "okay"; - }; - }; - }; - - usb3_phy: usb3_phy { - compatible = "usb-nop-xceiv"; - vcc-supply = <&reg_xhci0_vbus>; - }; - - reg_xhci0_vbus: xhci0-vbus { - compatible = "regulator-fixed"; - pinctrl-names = "default"; - pinctrl-0 = <&xhci0_vbus_pins>; - regulator-name = "xhci0-vbus"; - regulator-min-microvolt = <5000000>; - regulator-max-microvolt = <5000000>; - enable-active-high; - gpio = <&gpio1 12 GPIO_ACTIVE_HIGH>; - }; -}; - -&spi1 { - pinctrl-names = "default"; - pinctrl-0 = <&spi1_pins>; - status = "okay"; - - spi-flash@0 { - #address-cells = <1>; - #size-cells = <1>; - compatible = "st,m25p128", "jedec,spi-nor"; - reg = <0>; /* Chip select 0 */ - spi-max-frequency = <54000000>; - }; -}; Property changes on: user/markj/netdump/sys/dts/arm/armada-385-db-ap.dts ___________________________________________________________________ Deleted: svn:eol-style ## -1 +0,0 ## -native \ No newline at end of property Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Deleted: svn:mime-type ## -1 +0,0 ## -text/plain \ No newline at end of property Index: user/markj/netdump/sys/dts/arm/armada-388-clearfog.dts =================================================================== --- user/markj/netdump/sys/dts/arm/armada-388-clearfog.dts (revision 332407) +++ user/markj/netdump/sys/dts/arm/armada-388-clearfog.dts (nonexistent) @@ -1,459 +0,0 @@ -/* - * Device Tree file for SolidRun Clearfog revision A1 rev 2.0 (88F6828) - * - * Copyright (C) 2015 Russell King - * - * This board is in development; the contents of this file work with - * the A1 rev 2.0 of the board, which does not represent final - * production board. Things will change, don't expect this file to - * remain compatible into the future. - * - * This file is dual-licensed: you can use it either under the terms - * of the GPL or the X11 license, at your option. Note that this dual - * licensing only applies to this file, and not this project as a - * whole. - * - * a) This file is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * version 2 as published by the Free Software Foundation. - * - * This file is distributed in the hope that it will be useful - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * Or, alternatively - * - * b) Permission is hereby granted, free of charge, to any person - * obtaining a copy of this software and associated documentation - * files (the "Software"), to deal in the Software without - * restriction, including without limitation the rights to use, - * copy, modify, merge, publish, distribute, sublicense, and/or - * sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following - * conditions: - * - * The above copyright notice and this permission notice shall be - * included in all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES - * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT - * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, - * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING - * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE. - * - * $FreeBSD$ - */ - -/dts-v1/; -#include "armada-388.dtsi" -#include "armada-38x-solidrun-microsom.dtsi" - -/ { - model = "SolidRun Clearfog A1"; - compatible = "solidrun,clearfog-a1", "marvell,armada388", - "marvell,armada385", "marvell,armada380"; - - aliases { - /* So that mvebu u-boot can update the MAC addresses */ - ethernet1 = &eth0; - ethernet2 = &eth1; - ethernet3 = &eth2; - }; - - chosen { - stdout-path = "serial0:115200n8"; - }; - - reg_3p3v: regulator-3p3v { - compatible = "regulator-fixed"; - regulator-name = "3P3V"; - regulator-min-microvolt = <3300000>; - regulator-max-microvolt = <3300000>; - regulator-always-on; - }; - - soc { - internal-regs { - ethernet@30000 { - phy-mode = "sgmii"; - buffer-manager = <&bm>; - bm,pool-long = <2>; - bm,pool-short = <1>; - status = "okay"; - - fixed-link { - speed = <1000>; - full-duplex; - }; - }; - - ethernet@34000 { - phy-mode = "sgmii"; - buffer-manager = <&bm>; - bm,pool-long = <3>; - bm,pool-short = <1>; - status = "okay"; - managed = "in-band-status"; - }; - - i2c@11000 { - /* Is there anything on this? */ - clock-frequency = <100000>; - pinctrl-0 = <&i2c0_pins>; - pinctrl-names = "default"; - status = "okay"; - - /* - * PCA9655 GPIO expander, up to 1MHz clock. - * 0-CON3 CLKREQ# - * 1-CON3 PERST# - * 2-CON2 PERST# - * 3-CON3 W_DISABLE - * 4-CON2 CLKREQ# - * 5-USB3 overcurrent - * 6-USB3 power - * 7-CON2 W_DISABLE - * 8-JP4 P1 - * 9-JP4 P4 - * 10-JP4 P5 - * 11-m.2 DEVSLP - * 12-SFP_LOS - * 13-SFP_TX_FAULT - * 14-SFP_TX_DISABLE - * 15-SFP_MOD_DEF0 - */ - expander0: gpio-expander@20 { - /* - * This is how it should be: - * compatible = "onnn,pca9655", - * "nxp,pca9555"; - * but you can't do this because of - * the way I2C works.
- */ - compatible = "nxp,pca9555"; - gpio-controller; - #gpio-cells = <2>; - reg = <0x20>; - - pcie1_0_clkreq { - gpio-hog; - gpios = <0 GPIO_ACTIVE_LOW>; - input; - line-name = "pcie1.0-clkreq"; - }; - pcie1_0_w_disable { - gpio-hog; - gpios = <3 GPIO_ACTIVE_LOW>; - output-low; - line-name = "pcie1.0-w-disable"; - }; - pcie2_0_clkreq { - gpio-hog; - gpios = <4 GPIO_ACTIVE_LOW>; - input; - line-name = "pcie2.0-clkreq"; - }; - pcie2_0_w_disable { - gpio-hog; - gpios = <7 GPIO_ACTIVE_LOW>; - output-low; - line-name = "pcie2.0-w-disable"; - }; - usb3_ilimit { - gpio-hog; - gpios = <5 GPIO_ACTIVE_LOW>; - input; - line-name = "usb3-current-limit"; - }; - usb3_power { - gpio-hog; - gpios = <6 GPIO_ACTIVE_HIGH>; - output-high; - line-name = "usb3-power"; - }; - m2_devslp { - gpio-hog; - gpios = <11 GPIO_ACTIVE_HIGH>; - output-low; - line-name = "m.2 devslp"; - }; - sfp_los { - /* SFP loss of signal */ - gpio-hog; - gpios = <12 GPIO_ACTIVE_HIGH>; - input; - line-name = "sfp-los"; - }; - sfp_tx_fault { - /* SFP laser fault */ - gpio-hog; - gpios = <13 GPIO_ACTIVE_HIGH>; - input; - line-name = "sfp-tx-fault"; - }; - sfp_tx_disable { - /* SFP transmit disable */ - gpio-hog; - gpios = <14 GPIO_ACTIVE_HIGH>; - output-low; - line-name = "sfp-tx-disable"; - }; - sfp_mod_def0 { - /* SFP module present */ - gpio-hog; - gpios = <15 GPIO_ACTIVE_LOW>; - input; - line-name = "sfp-mod-def0"; - }; - }; - - /* The MCP3021 is 100kHz clock only */ - mikrobus_adc: mcp3021@4c { - compatible = "microchip,mcp3021"; - reg = <0x4c>; - }; - - /* Also something at 0x64 */ - }; - - i2c@11100 { - /* - * Routed to SFP, mikrobus, and PCIe. - * SFP limits this to 100kHz, and requires - * an AT24C01A/02/04 with address pins tied - * low, which takes addresses 0x50 and 0x51. - * Mikrobus doesn't specify beyond an I2C - * bus being present. - * PCIe uses ARP to assign addresses, or - * 0x63-0x64. - */ - clock-frequency = <100000>; - pinctrl-0 = <&clearfog_i2c1_pins>; - pinctrl-names = "default"; - status = "okay"; - }; - - pinctrl@18000 { - clearfog_dsa0_clk_pins: clearfog-dsa0-clk-pins { - marvell,pins = "mpp46"; - marvell,function = "ref"; - }; - clearfog_dsa0_pins: clearfog-dsa0-pins { - marvell,pins = "mpp23", "mpp41"; - marvell,function = "gpio"; - }; - clearfog_i2c1_pins: i2c1-pins { - /* SFP, PCIe, mSATA, mikrobus */ - marvell,pins = "mpp26", "mpp27"; - marvell,function = "i2c1"; - }; - clearfog_sdhci_cd_pins: clearfog-sdhci-cd-pins { - marvell,pins = "mpp20"; - marvell,function = "gpio"; - }; - clearfog_sdhci_pins: clearfog-sdhci-pins { - marvell,pins = "mpp21", "mpp28", - "mpp37", "mpp38", - "mpp39", "mpp40"; - marvell,function = "sd0"; - }; - clearfog_spi1_cs_pins: spi1-cs-pins { - marvell,pins = "mpp55"; - marvell,function = "spi1"; - }; - mikro_pins: mikro-pins { - /* int: mpp22 rst: mpp29 */ - marvell,pins = "mpp22", "mpp29"; - marvell,function = "gpio"; - }; - mikro_spi_pins: mikro-spi-pins { - marvell,pins = "mpp43"; - marvell,function = "spi1"; - }; - mikro_uart_pins: mikro-uart-pins { - marvell,pins = "mpp24", "mpp25"; - marvell,function = "ua1"; - }; - rear_button_pins: rear-button-pins { - marvell,pins = "mpp34"; - marvell,function = "gpio"; - }; - }; - - sata@a8000 { - /* pinctrl? */ - status = "okay"; - }; - - sata@e0000 { - /* pinctrl? 
*/ - status = "okay"; - }; - - sdhci@d8000 { - bus-width = <4>; - cd-gpios = <&gpio0 20 GPIO_ACTIVE_LOW>; - no-1-8-v; - pinctrl-0 = <&clearfog_sdhci_pins - &clearfog_sdhci_cd_pins>; - pinctrl-names = "default"; - status = "okay"; - vmmc = <®_3p3v>; - wp-inverted; - }; - - serial@12100 { - /* mikrobus uart */ - pinctrl-0 = <&mikro_uart_pins>; - pinctrl-names = "default"; - status = "okay"; - }; - - usb@58000 { - /* CON3, nearest power. */ - status = "okay"; - }; - - crypto@90000 { - status = "okay"; - }; - - crypto@92000 { - status = "okay"; - }; - - usb3@f0000 { - /* CON2, nearest CPU, USB2 only. */ - status = "okay"; - }; - - usb3@f8000 { - /* CON7 */ - status = "okay"; - }; - }; - - pcie-controller { - status = "okay"; - /* - * The two PCIe units are accessible through - * the mini-PCIe connectors on the board. - */ - pcie@2,0 { - /* Port 1, Lane 0. CON3, nearest power. */ - reset-gpios = <&expander0 1 GPIO_ACTIVE_LOW>; - status = "okay"; - }; - pcie@3,0 { - /* Port 2, Lane 0. CON2, nearest CPU. */ - reset-gpios = <&expander0 2 GPIO_ACTIVE_LOW>; - status = "okay"; - }; - }; - }; - - dsa@0 { - compatible = "marvell,dsa"; - dsa,ethernet = <ð1>; - dsa,mii-bus = <&mdio>; - pinctrl-0 = <&clearfog_dsa0_clk_pins &clearfog_dsa0_pins>; - pinctrl-names = "default"; - #address-cells = <2>; - #size-cells = <0>; - - switch@0 { - #address-cells = <1>; - #size-cells = <0>; - reg = <4 0>; - - port@0 { - reg = <0>; - label = "lan5"; - vlangroup = <0>; - }; - - port@1 { - reg = <1>; - label = "lan4"; - vlangroup = <0>; - }; - - port@2 { - reg = <2>; - label = "lan3"; - vlangroup = <0>; - }; - - port@3 { - reg = <3>; - label = "lan2"; - vlangroup = <0>; - }; - - port@4 { - reg = <4>; - label = "lan1"; - vlangroup = <0>; - }; - - port@5 { - reg = <5>; - label = "cpu"; - vlangroup = <0>; - }; - - port@6 { - /* 88E1512 external phy */ - reg = <6>; - label = "lan6"; - vlangroup = <0>; - fixed-link { - speed = <1000>; - full-duplex; - }; - }; - }; - }; - - gpio-keys { - compatible = "gpio-keys"; - pinctrl-0 = <&rear_button_pins>; - pinctrl-names = "default"; - - button_0 { - /* The rear SW3 button */ - label = "Rear Button"; - gpios = <&gpio1 2 GPIO_ACTIVE_LOW>; - linux,can-disable; - linux,code = ; - }; - }; -}; - -&spi1 { - /* - * We don't seem to have the W25Q32 on the - * A1 Rev 2.0 boards, so disable SPI. - * CS0: W25Q32 (doesn't appear to be present) - * CS1: - * CS2: mikrobus - */ - pinctrl-0 = <&spi1_pins - &clearfog_spi1_cs_pins - &mikro_spi_pins>; - pinctrl-names = "default"; - status = "okay"; - - spi-flash@0 { - #address-cells = <1>; - #size-cells = <0>; - compatible = "w25q32", "jedec,spi-nor"; - reg = <0>; /* Chip select 0 */ - spi-max-frequency = <3000000>; - status = "disabled"; - }; -}; Property changes on: user/markj/netdump/sys/dts/arm/armada-388-clearfog.dts ___________________________________________________________________ Deleted: svn:eol-style ## -1 +0,0 ## -native \ No newline at end of property Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Deleted: svn:mime-type ## -1 +0,0 ## -text/plain \ No newline at end of property Index: user/markj/netdump/sys/geom/bde/g_bde.c =================================================================== --- user/markj/netdump/sys/geom/bde/g_bde.c (revision 332407) +++ user/markj/netdump/sys/geom/bde/g_bde.c (revision 332408) @@ -1,294 +1,295 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2002 Poul-Henning Kamp * Copyright (c) 2002 Networks Associates Technology, Inc. * All rights reserved. 
* * This software was developed for the FreeBSD Project by Poul-Henning Kamp * and NAI Labs, the Security Research Division of Network Associates, Inc. * under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the * DARPA CHATS research program. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ * */ #include #include #include #include #include #include #include #include #include #include #include #include #include #define BDE_CLASS_NAME "BDE" FEATURE(geom_bde, "GEOM-based Disk Encryption"); static void g_bde_start(struct bio *bp) { switch (bp->bio_cmd) { case BIO_DELETE: case BIO_READ: case BIO_WRITE: g_bde_start1(bp); break; case BIO_GETATTR: g_io_deliver(bp, EOPNOTSUPP); break; default: g_io_deliver(bp, EOPNOTSUPP); return; } return; } static void g_bde_orphan(struct g_consumer *cp) { struct g_geom *gp; struct g_provider *pp; struct g_bde_softc *sc; g_trace(G_T_TOPOLOGY, "g_bde_orphan(%p/%s)", cp, cp->provider->name); g_topology_assert(); gp = cp->geom; sc = gp->softc; gp->flags |= G_GEOM_WITHER; LIST_FOREACH(pp, &gp->provider, provider) g_wither_provider(pp, ENXIO); bzero(sc, sizeof(struct g_bde_softc)); /* destroy evidence */ return; } static int g_bde_access(struct g_provider *pp, int dr, int dw, int de) { struct g_geom *gp; struct g_consumer *cp; gp = pp->geom; cp = LIST_FIRST(&gp->consumer); if (cp->acr == 0 && cp->acw == 0 && cp->ace == 0) { de++; dr++; } /* ... 
and let go of it on last close */ if ((cp->acr + dr) == 0 && (cp->acw + dw) == 0 && (cp->ace + de) == 1) { de--; dr--; } return (g_access(cp, dr, dw, de)); } static void g_bde_create_geom(struct gctl_req *req, struct g_class *mp, struct g_provider *pp) { struct g_geom *gp; struct g_consumer *cp; struct g_bde_key *kp; int error, i; u_int sectorsize; off_t mediasize; struct g_bde_softc *sc; void *pass; void *key; g_trace(G_T_TOPOLOGY, "g_bde_create_geom(%s, %s)", mp->name, pp->name); g_topology_assert(); gp = NULL; gp = g_new_geomf(mp, "%s.bde", pp->name); cp = g_new_consumer(gp); g_attach(cp, pp); error = g_access(cp, 1, 1, 1); if (error) { g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); gctl_error(req, "could not access consumer"); return; } pass = NULL; key = NULL; do { pass = gctl_get_param(req, "pass", &i); if (pass == NULL || i != SHA512_DIGEST_LENGTH) { gctl_error(req, "No usable key presented"); break; } key = gctl_get_param(req, "key", &i); if (key != NULL && i != 16) { gctl_error(req, "Invalid key presented"); break; } sectorsize = cp->provider->sectorsize; mediasize = cp->provider->mediasize; sc = g_malloc(sizeof(struct g_bde_softc), M_WAITOK | M_ZERO); gp->softc = sc; sc->geom = gp; sc->consumer = cp; error = g_bde_decrypt_lock(sc, pass, key, mediasize, sectorsize, NULL); bzero(sc->sha2, sizeof sc->sha2); if (error) break; kp = &sc->key; /* Initialize helper-fields */ kp->keys_per_sector = kp->sectorsize / G_BDE_SKEYLEN; kp->zone_cont = kp->keys_per_sector * kp->sectorsize; kp->zone_width = kp->zone_cont + kp->sectorsize; kp->media_width = kp->sectorN - kp->sector0 - G_BDE_MAXKEYS * kp->sectorsize; /* Our external parameters */ sc->zone_cont = kp->zone_cont; sc->mediasize = g_bde_max_sector(kp); sc->sectorsize = kp->sectorsize; TAILQ_INIT(&sc->freelist); TAILQ_INIT(&sc->worklist); mtx_init(&sc->worklist_mutex, "g_bde_worklist", NULL, MTX_DEF); /* XXX: error check */ kproc_create(g_bde_worker, gp, &sc->thread, 0, 0, "g_bde %s", gp->name); pp = g_new_providerf(gp, "%s", gp->name); pp->stripesize = kp->zone_cont; pp->stripeoffset = 0; pp->mediasize = sc->mediasize; pp->sectorsize = sc->sectorsize; g_error_provider(pp, 0); break; } while (0); if (pass != NULL) bzero(pass, SHA512_DIGEST_LENGTH); if (key != NULL) bzero(key, 16); if (error == 0) return; g_access(cp, -1, -1, -1); g_detach(cp); g_destroy_consumer(cp); if (gp->softc != NULL) g_free(gp->softc); g_destroy_geom(gp); switch (error) { case ENOENT: gctl_error(req, "Lock was destroyed"); break; case ESRCH: gctl_error(req, "Lock was nuked"); break; case EINVAL: gctl_error(req, "Could not open lock"); break; case ENOTDIR: gctl_error(req, "Lock not found"); break; default: gctl_error(req, "Could not open lock (%d)", error); break; } return; } static int g_bde_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp) { struct g_consumer *cp; struct g_provider *pp; struct g_bde_softc *sc; g_trace(G_T_TOPOLOGY, "g_bde_destroy_geom(%s, %s)", mp->name, gp->name); g_topology_assert(); /* * Orderly detachment. 
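[Editorial note, not part of the diff: the helper-field arithmetic in g_bde_create_geom() above is easier to follow with concrete numbers. Below is a minimal standalone sketch of the same computation; the 512-byte sector and the 16-byte sector-key record size standing in for G_BDE_SKEYLEN are assumptions, as are all names.]

/* Editorial sketch: g_bde zone geometry, with assumed constants. */
#include <stdio.h>

#define SECTORSIZE	512	/* assumed media sector size */
#define SKEYLEN		16	/* assumed on-disk sector-key record size */

int
main(void)
{
	int keys_per_sector = SECTORSIZE / SKEYLEN;
	/* Payload covered by one sector full of sector keys... */
	long zone_cont = (long)keys_per_sector * SECTORSIZE;
	/* ...plus the key sector itself gives the on-disk zone width. */
	long zone_width = zone_cont + SECTORSIZE;

	printf("keys/sector %d, zone payload %ld, zone width %ld\n",
	    keys_per_sector, zone_cont, zone_width);
	return (0);
}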
*/ KASSERT(gp != NULL, ("NULL geom")); pp = LIST_FIRST(&gp->provider); KASSERT(pp != NULL, ("NULL provider")); if (pp->acr > 0 || pp->acw > 0 || pp->ace > 0) return (EBUSY); sc = gp->softc; cp = LIST_FIRST(&gp->consumer); KASSERT(cp != NULL, ("NULL consumer")); sc->dead = 1; wakeup(sc); g_access(cp, -1, -1, -1); g_detach(cp); g_destroy_consumer(cp); while (sc->dead != 2 && !LIST_EMPTY(&pp->consumers)) tsleep(sc, PRIBIO, "g_bdedie", hz); mtx_destroy(&sc->worklist_mutex); bzero(&sc->key, sizeof sc->key); g_free(sc); g_wither_geom(gp, ENXIO); return (0); } static void g_bde_ctlreq(struct gctl_req *req, struct g_class *mp, char const *verb) { struct g_geom *gp; struct g_provider *pp; if (!strcmp(verb, "create geom")) { pp = gctl_get_provider(req, "provider"); if (pp != NULL) g_bde_create_geom(req, mp, pp); } else if (!strcmp(verb, "destroy geom")) { gp = gctl_get_geom(req, mp, "geom"); if (gp != NULL) g_bde_destroy_geom(req, mp, gp); } else { gctl_error(req, "unknown verb"); } } static struct g_class g_bde_class = { .name = BDE_CLASS_NAME, .version = G_VERSION, .destroy_geom = g_bde_destroy_geom, .ctlreq = g_bde_ctlreq, .start = g_bde_start, .orphan = g_bde_orphan, .access = g_bde_access, .spoiled = g_std_spoiled, }; DECLARE_GEOM_CLASS(g_bde_class, g_bde); +MODULE_VERSION(geom_bde, 0); Index: user/markj/netdump/sys/geom/cache/g_cache.c =================================================================== --- user/markj/netdump/sys/geom/cache/g_cache.c (revision 332407) +++ user/markj/netdump/sys/geom/cache/g_cache.c (revision 332408) @@ -1,1018 +1,1019 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2006 Ruslan Ermilov * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
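[Editorial note, not part of the diff: the functional change to g_bde.c above boils down to the added MODULE_VERSION(geom_bde, 0) line. As a sketch of the pattern, here is how a GEOM class typically pairs the two declarations; the class and module names are hypothetical, and the MODULE_DEPEND comment only illustrates what a version declaration enables. A real class would also fill in .start, .access, and .orphan.]

/* Editorial sketch (hypothetical names): declaring a versioned GEOM class. */
#include <sys/param.h>
#include <sys/module.h>
#include <geom/geom.h>

static struct g_class g_example_class = {
	.name = "EXAMPLE",
	.version = G_VERSION,	/* GEOM infrastructure version, not the module's */
};

/* Registers the class and creates the kernel module "g_example". */
DECLARE_GEOM_CLASS(g_example_class, g_example);

/* Advertise a module version so other modules can depend on this one, */
/* e.g.: MODULE_DEPEND(geom_example_helper, geom_example, 0, 0, 0); */
MODULE_VERSION(geom_example, 0);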
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_cache, "GEOM cache module"); static MALLOC_DEFINE(M_GCACHE, "gcache_data", "GEOM_CACHE Data"); SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, cache, CTLFLAG_RW, 0, "GEOM_CACHE stuff"); static u_int g_cache_debug = 0; SYSCTL_UINT(_kern_geom_cache, OID_AUTO, debug, CTLFLAG_RW, &g_cache_debug, 0, "Debug level"); static u_int g_cache_enable = 1; SYSCTL_UINT(_kern_geom_cache, OID_AUTO, enable, CTLFLAG_RW, &g_cache_enable, 0, ""); static u_int g_cache_timeout = 10; SYSCTL_UINT(_kern_geom_cache, OID_AUTO, timeout, CTLFLAG_RW, &g_cache_timeout, 0, ""); static u_int g_cache_idletime = 5; SYSCTL_UINT(_kern_geom_cache, OID_AUTO, idletime, CTLFLAG_RW, &g_cache_idletime, 0, ""); static u_int g_cache_used_lo = 5; static u_int g_cache_used_hi = 20; static int sysctl_handle_pct(SYSCTL_HANDLER_ARGS) { u_int val = *(u_int *)arg1; int error; error = sysctl_handle_int(oidp, &val, 0, req); if (error || !req->newptr) return (error); if (val > 100) return (EINVAL); if ((arg1 == &g_cache_used_lo && val > g_cache_used_hi) || (arg1 == &g_cache_used_hi && g_cache_used_lo > val)) return (EINVAL); *(u_int *)arg1 = val; return (0); } SYSCTL_PROC(_kern_geom_cache, OID_AUTO, used_lo, CTLTYPE_UINT|CTLFLAG_RW, &g_cache_used_lo, 0, sysctl_handle_pct, "IU", ""); SYSCTL_PROC(_kern_geom_cache, OID_AUTO, used_hi, CTLTYPE_UINT|CTLFLAG_RW, &g_cache_used_hi, 0, sysctl_handle_pct, "IU", ""); static int g_cache_destroy(struct g_cache_softc *sc, boolean_t force); static g_ctl_destroy_geom_t g_cache_destroy_geom; static g_taste_t g_cache_taste; static g_ctl_req_t g_cache_config; static g_dumpconf_t g_cache_dumpconf; struct g_class g_cache_class = { .name = G_CACHE_CLASS_NAME, .version = G_VERSION, .ctlreq = g_cache_config, .taste = g_cache_taste, .destroy_geom = g_cache_destroy_geom }; #define OFF2BNO(off, sc) ((off) >> (sc)->sc_bshift) #define BNO2OFF(bno, sc) ((bno) << (sc)->sc_bshift) static struct g_cache_desc * g_cache_alloc(struct g_cache_softc *sc) { struct g_cache_desc *dp; mtx_assert(&sc->sc_mtx, MA_OWNED); if (!TAILQ_EMPTY(&sc->sc_usedlist)) { dp = TAILQ_FIRST(&sc->sc_usedlist); TAILQ_REMOVE(&sc->sc_usedlist, dp, d_used); sc->sc_nused--; dp->d_flags = 0; LIST_REMOVE(dp, d_next); return (dp); } if (sc->sc_nent > sc->sc_maxent) { sc->sc_cachefull++; return (NULL); } dp = malloc(sizeof(*dp), M_GCACHE, M_NOWAIT | M_ZERO); if (dp == NULL) return (NULL); dp->d_data = uma_zalloc(sc->sc_zone, M_NOWAIT); if (dp->d_data == NULL) { free(dp, M_GCACHE); return (NULL); } sc->sc_nent++; return (dp); } static void g_cache_free(struct g_cache_softc *sc, struct g_cache_desc *dp) { mtx_assert(&sc->sc_mtx, MA_OWNED); uma_zfree(sc->sc_zone, dp->d_data); free(dp, M_GCACHE); sc->sc_nent--; } static void g_cache_free_used(struct g_cache_softc *sc) { struct g_cache_desc *dp; u_int n; mtx_assert(&sc->sc_mtx, MA_OWNED); n = g_cache_used_lo * sc->sc_maxent / 100; while (sc->sc_nused > n) { KASSERT(!TAILQ_EMPTY(&sc->sc_usedlist), ("used list empty")); dp = TAILQ_FIRST(&sc->sc_usedlist); TAILQ_REMOVE(&sc->sc_usedlist, dp, d_used); sc->sc_nused--; LIST_REMOVE(dp, d_next); g_cache_free(sc, dp); } } static void g_cache_deliver(struct g_cache_softc *sc, struct bio *bp, struct g_cache_desc *dp, int error) { off_t off1, off, len; mtx_assert(&sc->sc_mtx, MA_OWNED); KASSERT(OFF2BNO(bp->bio_offset, sc) <= dp->d_bno, ("wrong entry")); KASSERT(OFF2BNO(bp->bio_offset 
+ bp->bio_length - 1, sc) >= dp->d_bno, ("wrong entry")); off1 = BNO2OFF(dp->d_bno, sc); off = MAX(bp->bio_offset, off1); len = MIN(bp->bio_offset + bp->bio_length, off1 + sc->sc_bsize) - off; if (bp->bio_error == 0) bp->bio_error = error; if (bp->bio_error == 0) { bcopy(dp->d_data + (off - off1), bp->bio_data + (off - bp->bio_offset), len); } bp->bio_completed += len; KASSERT(bp->bio_completed <= bp->bio_length, ("extra data")); if (bp->bio_completed == bp->bio_length) { if (bp->bio_error != 0) bp->bio_completed = 0; g_io_deliver(bp, bp->bio_error); } if (dp->d_flags & D_FLAG_USED) { TAILQ_REMOVE(&sc->sc_usedlist, dp, d_used); TAILQ_INSERT_TAIL(&sc->sc_usedlist, dp, d_used); } else if (OFF2BNO(off + len, sc) > dp->d_bno) { TAILQ_INSERT_TAIL(&sc->sc_usedlist, dp, d_used); sc->sc_nused++; dp->d_flags |= D_FLAG_USED; } dp->d_atime = time_uptime; } static void g_cache_done(struct bio *bp) { struct g_cache_softc *sc; struct g_cache_desc *dp; struct bio *bp2, *tmpbp; sc = bp->bio_from->geom->softc; KASSERT(G_CACHE_DESC1(bp) == sc, ("corrupt bio_caller in g_cache_done()")); dp = G_CACHE_DESC2(bp); mtx_lock(&sc->sc_mtx); bp2 = dp->d_biolist; while (bp2 != NULL) { KASSERT(G_CACHE_NEXT_BIO1(bp2) == sc, ("corrupt bio_driver in g_cache_done()")); tmpbp = G_CACHE_NEXT_BIO2(bp2); g_cache_deliver(sc, bp2, dp, bp->bio_error); bp2 = tmpbp; } dp->d_biolist = NULL; if (dp->d_flags & D_FLAG_INVALID) { sc->sc_invalid--; g_cache_free(sc, dp); } else if (bp->bio_error) { LIST_REMOVE(dp, d_next); if (dp->d_flags & D_FLAG_USED) { TAILQ_REMOVE(&sc->sc_usedlist, dp, d_used); sc->sc_nused--; } g_cache_free(sc, dp); } mtx_unlock(&sc->sc_mtx); g_destroy_bio(bp); } static struct g_cache_desc * g_cache_lookup(struct g_cache_softc *sc, off_t bno) { struct g_cache_desc *dp; mtx_assert(&sc->sc_mtx, MA_OWNED); LIST_FOREACH(dp, &sc->sc_desclist[G_CACHE_BUCKET(bno)], d_next) if (dp->d_bno == bno) return (dp); return (NULL); } static int g_cache_read(struct g_cache_softc *sc, struct bio *bp) { struct bio *cbp; struct g_cache_desc *dp; mtx_lock(&sc->sc_mtx); dp = g_cache_lookup(sc, OFF2BNO(bp->bio_offset + bp->bio_completed, sc)); if (dp != NULL) { /* Add to waiters list or deliver. */ sc->sc_cachehits++; if (dp->d_biolist != NULL) { G_CACHE_NEXT_BIO1(bp) = sc; G_CACHE_NEXT_BIO2(bp) = dp->d_biolist; dp->d_biolist = bp; } else g_cache_deliver(sc, bp, dp, 0); mtx_unlock(&sc->sc_mtx); return (0); } /* Cache miss. Allocate entry and schedule bio. 
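[Editorial note, not part of the diff: g_cache addresses its cache in power-of-two blocks through the OFF2BNO()/BNO2OFF() shift macros quoted earlier, and g_cache_deliver() clips each request to the part that falls inside one block. A minimal userland sketch of the same arithmetic, assuming a 64 KB block size:]

/* Editorial sketch: g_cache block arithmetic with an assumed 64 KB block. */
#include <stdio.h>

#define BSHIFT		16		/* assumed: log2 of the block size */
#define BSIZE		(1L << BSHIFT)	/* 65536 */
#define OFF2BNO(off)	((off) >> BSHIFT)
#define BNO2OFF(bno)	((bno) << BSHIFT)
#define MAX(a, b)	((a) > (b) ? (a) : (b))
#define MIN(a, b)	((a) < (b) ? (a) : (b))

int
main(void)
{
	long bio_offset = 100000, bio_length = 30000;
	long bno = OFF2BNO(bio_offset);	/* block holding the start offset */
	long off1 = BNO2OFF(bno);	/* that block's base offset */
	/* Clip the request to the part that lies inside this one block. */
	long off = MAX(bio_offset, off1);
	long len = MIN(bio_offset + bio_length, off1 + BSIZE) - off;

	printf("bno %ld, copy %ld bytes at offset %ld\n", bno, len, off);
	return (0);
}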
*/ sc->sc_cachemisses++; dp = g_cache_alloc(sc); if (dp == NULL) { mtx_unlock(&sc->sc_mtx); return (ENOMEM); } cbp = g_clone_bio(bp); if (cbp == NULL) { g_cache_free(sc, dp); mtx_unlock(&sc->sc_mtx); return (ENOMEM); } dp->d_bno = OFF2BNO(bp->bio_offset + bp->bio_completed, sc); G_CACHE_NEXT_BIO1(bp) = sc; G_CACHE_NEXT_BIO2(bp) = NULL; dp->d_biolist = bp; LIST_INSERT_HEAD(&sc->sc_desclist[G_CACHE_BUCKET(dp->d_bno)], dp, d_next); mtx_unlock(&sc->sc_mtx); G_CACHE_DESC1(cbp) = sc; G_CACHE_DESC2(cbp) = dp; cbp->bio_done = g_cache_done; cbp->bio_offset = BNO2OFF(dp->d_bno, sc); cbp->bio_data = dp->d_data; cbp->bio_length = sc->sc_bsize; g_io_request(cbp, LIST_FIRST(&bp->bio_to->geom->consumer)); return (0); } static void g_cache_invalidate(struct g_cache_softc *sc, struct bio *bp) { struct g_cache_desc *dp; off_t bno, lim; mtx_lock(&sc->sc_mtx); bno = OFF2BNO(bp->bio_offset, sc); lim = OFF2BNO(bp->bio_offset + bp->bio_length - 1, sc); do { if ((dp = g_cache_lookup(sc, bno)) != NULL) { LIST_REMOVE(dp, d_next); if (dp->d_flags & D_FLAG_USED) { TAILQ_REMOVE(&sc->sc_usedlist, dp, d_used); sc->sc_nused--; } if (dp->d_biolist == NULL) g_cache_free(sc, dp); else { dp->d_flags = D_FLAG_INVALID; sc->sc_invalid++; } } bno++; } while (bno <= lim); mtx_unlock(&sc->sc_mtx); } static void g_cache_start(struct bio *bp) { struct g_cache_softc *sc; struct g_geom *gp; struct g_cache_desc *dp; struct bio *cbp; gp = bp->bio_to->geom; sc = gp->softc; G_CACHE_LOGREQ(bp, "Request received."); switch (bp->bio_cmd) { case BIO_READ: sc->sc_reads++; sc->sc_readbytes += bp->bio_length; if (!g_cache_enable) break; if (bp->bio_offset + bp->bio_length > sc->sc_tail) break; if (OFF2BNO(bp->bio_offset, sc) == OFF2BNO(bp->bio_offset + bp->bio_length - 1, sc)) { sc->sc_cachereads++; sc->sc_cachereadbytes += bp->bio_length; if (g_cache_read(sc, bp) == 0) return; sc->sc_cachereads--; sc->sc_cachereadbytes -= bp->bio_length; break; } else if (OFF2BNO(bp->bio_offset, sc) + 1 == OFF2BNO(bp->bio_offset + bp->bio_length - 1, sc)) { mtx_lock(&sc->sc_mtx); dp = g_cache_lookup(sc, OFF2BNO(bp->bio_offset, sc)); if (dp == NULL || dp->d_biolist != NULL) { mtx_unlock(&sc->sc_mtx); break; } sc->sc_cachereads++; sc->sc_cachereadbytes += bp->bio_length; g_cache_deliver(sc, bp, dp, 0); mtx_unlock(&sc->sc_mtx); if (g_cache_read(sc, bp) == 0) return; sc->sc_cachereads--; sc->sc_cachereadbytes -= bp->bio_length; break; } break; case BIO_WRITE: sc->sc_writes++; sc->sc_wrotebytes += bp->bio_length; g_cache_invalidate(sc, bp); break; } cbp = g_clone_bio(bp); if (cbp == NULL) { g_io_deliver(bp, ENOMEM); return; } cbp->bio_done = g_std_done; G_CACHE_LOGREQ(cbp, "Sending request."); g_io_request(cbp, LIST_FIRST(&gp->consumer)); } static void g_cache_go(void *arg) { struct g_cache_softc *sc = arg; struct g_cache_desc *dp; int i; mtx_assert(&sc->sc_mtx, MA_OWNED); /* Forcibly mark idle ready entries as used. */ for (i = 0; i < G_CACHE_BUCKETS; i++) { LIST_FOREACH(dp, &sc->sc_desclist[i], d_next) { if (dp->d_flags & D_FLAG_USED || dp->d_biolist != NULL || time_uptime - dp->d_atime < g_cache_idletime) continue; TAILQ_INSERT_TAIL(&sc->sc_usedlist, dp, d_used); sc->sc_nused++; dp->d_flags |= D_FLAG_USED; } } /* Keep the number of used entries low. 
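[Editorial note, not part of the diff: g_cache_go() above ages idle entries onto the used list and then trims that list between two percentage watermarks, g_cache_used_lo and g_cache_used_hi, both range-checked by sysctl_handle_pct() earlier in this file. The watermark policy in isolation, as a small sketch with hypothetical names:]

/* Editorial sketch: low/high watermark trimming as used by g_cache. */
#include <assert.h>
#include <stdio.h>

static unsigned used_lo = 5, used_hi = 20;	/* percentages, lo <= hi */

/* Trim "used" down to the low watermark once it crosses the high one. */
static unsigned
trim(unsigned used, unsigned maxent)
{
	if (used > used_hi * maxent / 100)
		used = used_lo * maxent / 100;
	return (used);
}

int
main(void)
{
	assert(trim(50, 1000) == 50);	/* below 200: left alone */
	assert(trim(250, 1000) == 50);	/* above 200: cut back to 50 */
	printf("watermarks ok\n");
	return (0);
}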
*/ if (sc->sc_nused > g_cache_used_hi * sc->sc_maxent / 100) g_cache_free_used(sc); callout_reset(&sc->sc_callout, g_cache_timeout * hz, g_cache_go, sc); } static int g_cache_access(struct g_provider *pp, int dr, int dw, int de) { struct g_geom *gp; struct g_consumer *cp; int error; gp = pp->geom; cp = LIST_FIRST(&gp->consumer); error = g_access(cp, dr, dw, de); return (error); } static void g_cache_orphan(struct g_consumer *cp) { g_topology_assert(); g_cache_destroy(cp->geom->softc, 1); } static struct g_cache_softc * g_cache_find_device(struct g_class *mp, const char *name) { struct g_geom *gp; LIST_FOREACH(gp, &mp->geom, geom) { if (strcmp(gp->name, name) == 0) return (gp->softc); } return (NULL); } static struct g_geom * g_cache_create(struct g_class *mp, struct g_provider *pp, const struct g_cache_metadata *md, u_int type) { struct g_cache_softc *sc; struct g_geom *gp; struct g_provider *newpp; struct g_consumer *cp; u_int bshift; int i; g_topology_assert(); gp = NULL; newpp = NULL; cp = NULL; G_CACHE_DEBUG(1, "Creating device %s.", md->md_name); /* Cache size is minimum 100. */ if (md->md_size < 100) { G_CACHE_DEBUG(0, "Invalid size for device %s.", md->md_name); return (NULL); } /* Block size restrictions. */ bshift = ffs(md->md_bsize) - 1; if (md->md_bsize == 0 || md->md_bsize > MAXPHYS || md->md_bsize != 1 << bshift || (md->md_bsize % pp->sectorsize) != 0) { G_CACHE_DEBUG(0, "Invalid blocksize for provider %s.", pp->name); return (NULL); } /* Check for duplicate unit. */ if (g_cache_find_device(mp, (const char *)&md->md_name) != NULL) { G_CACHE_DEBUG(0, "Provider %s already exists.", md->md_name); return (NULL); } gp = g_new_geomf(mp, "%s", md->md_name); sc = g_malloc(sizeof(*sc), M_WAITOK | M_ZERO); sc->sc_type = type; sc->sc_bshift = bshift; sc->sc_bsize = 1 << bshift; sc->sc_zone = uma_zcreate("gcache", sc->sc_bsize, NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); mtx_init(&sc->sc_mtx, "GEOM CACHE mutex", NULL, MTX_DEF); for (i = 0; i < G_CACHE_BUCKETS; i++) LIST_INIT(&sc->sc_desclist[i]); TAILQ_INIT(&sc->sc_usedlist); sc->sc_maxent = md->md_size; callout_init_mtx(&sc->sc_callout, &sc->sc_mtx, 0); gp->softc = sc; sc->sc_geom = gp; gp->start = g_cache_start; gp->orphan = g_cache_orphan; gp->access = g_cache_access; gp->dumpconf = g_cache_dumpconf; newpp = g_new_providerf(gp, "cache/%s", gp->name); newpp->sectorsize = pp->sectorsize; newpp->mediasize = pp->mediasize; if (type == G_CACHE_TYPE_AUTOMATIC) newpp->mediasize -= pp->sectorsize; sc->sc_tail = BNO2OFF(OFF2BNO(newpp->mediasize, sc), sc); cp = g_new_consumer(gp); if (g_attach(cp, pp) != 0) { G_CACHE_DEBUG(0, "Cannot attach to provider %s.", pp->name); g_destroy_consumer(cp); g_destroy_provider(newpp); mtx_destroy(&sc->sc_mtx); g_free(sc); g_destroy_geom(gp); return (NULL); } g_error_provider(newpp, 0); G_CACHE_DEBUG(0, "Device %s created.", gp->name); callout_reset(&sc->sc_callout, g_cache_timeout * hz, g_cache_go, sc); return (gp); } static int g_cache_destroy(struct g_cache_softc *sc, boolean_t force) { struct g_geom *gp; struct g_provider *pp; struct g_cache_desc *dp, *dp2; int i; g_topology_assert(); if (sc == NULL) return (ENXIO); gp = sc->sc_geom; pp = LIST_FIRST(&gp->provider); if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { if (force) { G_CACHE_DEBUG(0, "Device %s is still open, so it " "can't be definitely removed.", pp->name); } else { G_CACHE_DEBUG(1, "Device %s is still open (r%dw%de%d).", pp->name, pp->acr, pp->acw, pp->ace); return (EBUSY); } } else { G_CACHE_DEBUG(0, "Device %s removed.", 
gp->name); } callout_drain(&sc->sc_callout); mtx_lock(&sc->sc_mtx); for (i = 0; i < G_CACHE_BUCKETS; i++) { dp = LIST_FIRST(&sc->sc_desclist[i]); while (dp != NULL) { dp2 = LIST_NEXT(dp, d_next); g_cache_free(sc, dp); dp = dp2; } } mtx_unlock(&sc->sc_mtx); mtx_destroy(&sc->sc_mtx); uma_zdestroy(sc->sc_zone); g_free(sc); gp->softc = NULL; g_wither_geom(gp, ENXIO); return (0); } static int g_cache_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp) { return (g_cache_destroy(gp->softc, 0)); } static int g_cache_read_metadata(struct g_consumer *cp, struct g_cache_metadata *md) { struct g_provider *pp; u_char *buf; int error; g_topology_assert(); error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) return (error); /* Decode metadata. */ cache_metadata_decode(buf, md); g_free(buf); return (0); } static int g_cache_write_metadata(struct g_consumer *cp, struct g_cache_metadata *md) { struct g_provider *pp; u_char *buf; int error; g_topology_assert(); error = g_access(cp, 0, 1, 0); if (error != 0) return (error); pp = cp->provider; buf = malloc((size_t)pp->sectorsize, M_GCACHE, M_WAITOK | M_ZERO); cache_metadata_encode(md, buf); g_topology_unlock(); error = g_write_data(cp, pp->mediasize - pp->sectorsize, buf, pp->sectorsize); g_topology_lock(); g_access(cp, 0, -1, 0); free(buf, M_GCACHE); return (error); } static struct g_geom * g_cache_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_cache_metadata md; struct g_consumer *cp; struct g_geom *gp; int error; g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); g_topology_assert(); G_CACHE_DEBUG(3, "Tasting %s.", pp->name); gp = g_new_geomf(mp, "cache:taste"); gp->start = g_cache_start; gp->orphan = g_cache_orphan; gp->access = g_cache_access; cp = g_new_consumer(gp); g_attach(cp, pp); error = g_cache_read_metadata(cp, &md); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); if (error != 0) return (NULL); if (strcmp(md.md_magic, G_CACHE_MAGIC) != 0) return (NULL); if (md.md_version > G_CACHE_VERSION) { printf("geom_cache.ko module is too old to handle %s.\n", pp->name); return (NULL); } if (md.md_provsize != pp->mediasize) return (NULL); gp = g_cache_create(mp, pp, &md, G_CACHE_TYPE_AUTOMATIC); if (gp == NULL) { G_CACHE_DEBUG(0, "Can't create %s.", md.md_name); return (NULL); } return (gp); } static void g_cache_ctl_create(struct gctl_req *req, struct g_class *mp) { struct g_cache_metadata md; struct g_provider *pp; struct g_geom *gp; intmax_t *bsize, *size; const char *name; int *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs != 2) { gctl_error(req, "Invalid number of arguments."); return; } strlcpy(md.md_magic, G_CACHE_MAGIC, sizeof(md.md_magic)); md.md_version = G_CACHE_VERSION; name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg0' argument"); return; } strlcpy(md.md_name, name, sizeof(md.md_name)); size = gctl_get_paraml(req, "size", sizeof(*size)); if (size == NULL) { gctl_error(req, "No '%s' argument", "size"); return; } if ((u_int)*size < 100) { gctl_error(req, "Invalid '%s' argument", "size"); return; } md.md_size = (u_int)*size; bsize = gctl_get_paraml(req, "blocksize", sizeof(*bsize)); if (bsize == NULL) { gctl_error(req, "No 
'%s' argument", "blocksize"); return; } if (*bsize < 0) { gctl_error(req, "Invalid '%s' argument", "blocksize"); return; } md.md_bsize = (u_int)*bsize; /* This field is not important here. */ md.md_provsize = 0; name = gctl_get_asciiparam(req, "arg1"); if (name == NULL) { gctl_error(req, "No 'arg1' argument"); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); if (pp == NULL) { G_CACHE_DEBUG(1, "Provider %s is invalid.", name); gctl_error(req, "Provider %s is invalid.", name); return; } gp = g_cache_create(mp, pp, &md, G_CACHE_TYPE_MANUAL); if (gp == NULL) { gctl_error(req, "Can't create %s.", md.md_name); return; } } static void g_cache_ctl_configure(struct gctl_req *req, struct g_class *mp) { struct g_cache_metadata md; struct g_cache_softc *sc; struct g_consumer *cp; intmax_t *bsize, *size; const char *name; int error, *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs != 1) { gctl_error(req, "Missing device."); return; } name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg0' argument"); return; } sc = g_cache_find_device(mp, name); if (sc == NULL) { G_CACHE_DEBUG(1, "Device %s is invalid.", name); gctl_error(req, "Device %s is invalid.", name); return; } size = gctl_get_paraml(req, "size", sizeof(*size)); if (size == NULL) { gctl_error(req, "No '%s' argument", "size"); return; } if ((u_int)*size != 0 && (u_int)*size < 100) { gctl_error(req, "Invalid '%s' argument", "size"); return; } if ((u_int)*size != 0) sc->sc_maxent = (u_int)*size; bsize = gctl_get_paraml(req, "blocksize", sizeof(*bsize)); if (bsize == NULL) { gctl_error(req, "No '%s' argument", "blocksize"); return; } if (*bsize < 0) { gctl_error(req, "Invalid '%s' argument", "blocksize"); return; } if (sc->sc_type != G_CACHE_TYPE_AUTOMATIC) return; strlcpy(md.md_name, name, sizeof(md.md_name)); strlcpy(md.md_magic, G_CACHE_MAGIC, sizeof(md.md_magic)); md.md_version = G_CACHE_VERSION; if ((u_int)*size != 0) md.md_size = (u_int)*size; else md.md_size = sc->sc_maxent; if ((u_int)*bsize != 0) md.md_bsize = (u_int)*bsize; else md.md_bsize = sc->sc_bsize; cp = LIST_FIRST(&sc->sc_geom->consumer); md.md_provsize = cp->provider->mediasize; error = g_cache_write_metadata(cp, &md); if (error == 0) G_CACHE_DEBUG(2, "Metadata on %s updated.", cp->provider->name); else G_CACHE_DEBUG(0, "Cannot update metadata on %s (error=%d).", cp->provider->name, error); } static void g_cache_ctl_destroy(struct gctl_req *req, struct g_class *mp) { int *nargs, *force, error, i; struct g_cache_softc *sc; const char *name; char param[16]; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } force = gctl_get_paraml(req, "force", sizeof(*force)); if (force == NULL) { gctl_error(req, "No 'force' argument"); return; } for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", i); return; } sc = g_cache_find_device(mp, name); if (sc == NULL) { G_CACHE_DEBUG(1, "Device %s is invalid.", name); gctl_error(req, "Device %s is invalid.", name); return; } error = g_cache_destroy(sc, *force); if (error != 0) { gctl_error(req, "Cannot destroy device %s (error=%d).", sc->sc_name, 
error); return; } } } static void g_cache_ctl_reset(struct gctl_req *req, struct g_class *mp) { struct g_cache_softc *sc; const char *name; char param[16]; int i, *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", i); return; } sc = g_cache_find_device(mp, name); if (sc == NULL) { G_CACHE_DEBUG(1, "Device %s is invalid.", name); gctl_error(req, "Device %s is invalid.", name); return; } sc->sc_reads = 0; sc->sc_readbytes = 0; sc->sc_cachereads = 0; sc->sc_cachereadbytes = 0; sc->sc_cachehits = 0; sc->sc_cachemisses = 0; sc->sc_cachefull = 0; sc->sc_writes = 0; sc->sc_wrotebytes = 0; } } static void g_cache_config(struct gctl_req *req, struct g_class *mp, const char *verb) { uint32_t *version; g_topology_assert(); version = gctl_get_paraml(req, "version", sizeof(*version)); if (version == NULL) { gctl_error(req, "No '%s' argument.", "version"); return; } if (*version != G_CACHE_VERSION) { gctl_error(req, "Userland and kernel parts are out of sync."); return; } if (strcmp(verb, "create") == 0) { g_cache_ctl_create(req, mp); return; } else if (strcmp(verb, "configure") == 0) { g_cache_ctl_configure(req, mp); return; } else if (strcmp(verb, "destroy") == 0 || strcmp(verb, "stop") == 0) { g_cache_ctl_destroy(req, mp); return; } else if (strcmp(verb, "reset") == 0) { g_cache_ctl_reset(req, mp); return; } gctl_error(req, "Unknown verb."); } static void g_cache_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_cache_softc *sc; if (pp != NULL || cp != NULL) return; sc = gp->softc; sbuf_printf(sb, "%s%u\n", indent, sc->sc_maxent); sbuf_printf(sb, "%s%u\n", indent, sc->sc_bsize); sbuf_printf(sb, "%s%ju\n", indent, (uintmax_t)sc->sc_tail); sbuf_printf(sb, "%s%u\n", indent, sc->sc_nent); sbuf_printf(sb, "%s%u\n", indent, sc->sc_nused); sbuf_printf(sb, "%s%u\n", indent, sc->sc_invalid); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_reads); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_readbytes); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_cachereads); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_cachereadbytes); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_cachehits); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_cachemisses); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_cachefull); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_writes); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_wrotebytes); } DECLARE_GEOM_CLASS(g_cache_class, g_cache); +MODULE_VERSION(geom_cache, 0); Index: user/markj/netdump/sys/geom/concat/g_concat.c =================================================================== --- user/markj/netdump/sys/geom/concat/g_concat.c (revision 332407) +++ user/markj/netdump/sys/geom/concat/g_concat.c (revision 332408) @@ -1,995 +1,996 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004-2005 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_concat, "GEOM concatenation support"); static MALLOC_DEFINE(M_CONCAT, "concat_data", "GEOM_CONCAT Data"); SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, concat, CTLFLAG_RW, 0, "GEOM_CONCAT stuff"); static u_int g_concat_debug = 0; SYSCTL_UINT(_kern_geom_concat, OID_AUTO, debug, CTLFLAG_RWTUN, &g_concat_debug, 0, "Debug level"); static int g_concat_destroy(struct g_concat_softc *sc, boolean_t force); static int g_concat_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp); static g_taste_t g_concat_taste; static g_ctl_req_t g_concat_config; static g_dumpconf_t g_concat_dumpconf; struct g_class g_concat_class = { .name = G_CONCAT_CLASS_NAME, .version = G_VERSION, .ctlreq = g_concat_config, .taste = g_concat_taste, .destroy_geom = g_concat_destroy_geom }; /* * Greatest Common Divisor. */ static u_int gcd(u_int a, u_int b) { u_int c; while (b != 0) { c = a; a = b; b = (c % b); } return (a); } /* * Least Common Multiple. */ static u_int lcm(u_int a, u_int b) { return ((a * b) / gcd(a, b)); } /* * Return the number of valid disks. */ static u_int g_concat_nvalid(struct g_concat_softc *sc) { u_int i, no; no = 0; for (i = 0; i < sc->sc_ndisks; i++) { if (sc->sc_disks[i].d_consumer != NULL) no++; } return (no); } static void g_concat_remove_disk(struct g_concat_disk *disk) { struct g_consumer *cp; struct g_concat_softc *sc; g_topology_assert(); KASSERT(disk->d_consumer != NULL, ("Non-valid disk in %s.", __func__)); sc = disk->d_softc; cp = disk->d_consumer; if (!disk->d_removed) { G_CONCAT_DEBUG(0, "Disk %s removed from %s.", cp->provider->name, sc->sc_name); disk->d_removed = 1; } if (sc->sc_provider != NULL) { G_CONCAT_DEBUG(0, "Device %s deactivated.", sc->sc_provider->name); g_wither_provider(sc->sc_provider, ENXIO); sc->sc_provider = NULL; } if (cp->acr > 0 || cp->acw > 0 || cp->ace > 0) return; disk->d_consumer = NULL; g_detach(cp); g_destroy_consumer(cp); /* If there are no valid disks anymore, remove device. */ if (LIST_EMPTY(&sc->sc_geom->consumer)) g_concat_destroy(sc, 1); } static void g_concat_orphan(struct g_consumer *cp) { struct g_concat_softc *sc; struct g_concat_disk *disk; struct g_geom *gp; g_topology_assert(); gp = cp->geom; sc = gp->softc; if (sc == NULL) return; disk = cp->private; if (disk == NULL) /* Possible? 
 */
		return;
	g_concat_remove_disk(disk);
}

static int
g_concat_access(struct g_provider *pp, int dr, int dw, int de)
{
	struct g_consumer *cp1, *cp2, *tmp;
	struct g_concat_disk *disk;
	struct g_geom *gp;
	int error;

	g_topology_assert();
	gp = pp->geom;

	/* On first open, grab an extra "exclusive" bit */
	if (pp->acr == 0 && pp->acw == 0 && pp->ace == 0)
		de++;
	/* ... and let go of it on last close */
	if ((pp->acr + dr) == 0 && (pp->acw + dw) == 0 && (pp->ace + de) == 0)
		de--;

	LIST_FOREACH_SAFE(cp1, &gp->consumer, consumer, tmp) {
		error = g_access(cp1, dr, dw, de);
		if (error != 0)
			goto fail;
		disk = cp1->private;
		if (cp1->acr == 0 && cp1->acw == 0 && cp1->ace == 0 &&
		    disk->d_removed) {
			g_concat_remove_disk(disk); /* May destroy geom. */
		}
	}
	return (0);

fail:
	LIST_FOREACH(cp2, &gp->consumer, consumer) {
		if (cp1 == cp2)
			break;
		g_access(cp2, -dr, -dw, -de);
	}
	return (error);
}

static void
g_concat_kernel_dump(struct bio *bp)
{
	struct g_concat_softc *sc;
	struct g_concat_disk *disk;
	struct bio *cbp;
	struct g_kerneldump *gkd;
	u_int i;

	sc = bp->bio_to->geom->softc;
	gkd = (struct g_kerneldump *)bp->bio_data;
	for (i = 0; i < sc->sc_ndisks; i++) {
		if (sc->sc_disks[i].d_start <= gkd->offset &&
		    sc->sc_disks[i].d_end > gkd->offset)
			break;
	}
	if (i == sc->sc_ndisks) {
		/*
		 * No component contains the requested offset; return here
		 * so we do not index one past the end of sc_disks below.
		 */
		g_io_deliver(bp, EOPNOTSUPP);
		return;
	}
	disk = &sc->sc_disks[i];
	gkd->offset -= disk->d_start;
	if (gkd->length > disk->d_end - disk->d_start - gkd->offset)
		gkd->length = disk->d_end - disk->d_start - gkd->offset;
	cbp = g_clone_bio(bp);
	if (cbp == NULL) {
		g_io_deliver(bp, ENOMEM);
		return;
	}
	cbp->bio_done = g_std_done;
	g_io_request(cbp, disk->d_consumer);
	G_CONCAT_DEBUG(1, "Kernel dump will go to %s.",
	    disk->d_consumer->provider->name);
}

static void
g_concat_done(struct bio *bp)
{
	struct g_concat_softc *sc;
	struct bio *pbp;

	pbp = bp->bio_parent;
	sc = pbp->bio_to->geom->softc;
	mtx_lock(&sc->sc_lock);
	if (pbp->bio_error == 0)
		pbp->bio_error = bp->bio_error;
	pbp->bio_completed += bp->bio_completed;
	pbp->bio_inbed++;
	if (pbp->bio_children == pbp->bio_inbed) {
		mtx_unlock(&sc->sc_lock);
		g_io_deliver(pbp, pbp->bio_error);
	} else
		mtx_unlock(&sc->sc_lock);
	g_destroy_bio(bp);
}

static void
g_concat_flush(struct g_concat_softc *sc, struct bio *bp)
{
	struct bio_queue_head queue;
	struct g_consumer *cp;
	struct bio *cbp;
	u_int no;

	bioq_init(&queue);
	for (no = 0; no < sc->sc_ndisks; no++) {
		cbp = g_clone_bio(bp);
		if (cbp == NULL) {
			while ((cbp = bioq_takefirst(&queue)) != NULL)
				g_destroy_bio(cbp);
			if (bp->bio_error == 0)
				bp->bio_error = ENOMEM;
			g_io_deliver(bp, bp->bio_error);
			return;
		}
		bioq_insert_tail(&queue, cbp);
		cbp->bio_done = g_concat_done;
		cbp->bio_caller1 = sc->sc_disks[no].d_consumer;
		cbp->bio_to = sc->sc_disks[no].d_consumer->provider;
	}
	while ((cbp = bioq_takefirst(&queue)) != NULL) {
		G_CONCAT_LOGREQ(cbp, "Sending request.");
		cp = cbp->bio_caller1;
		cbp->bio_caller1 = NULL;
		g_io_request(cbp, cp);
	}
}

static void
g_concat_start(struct bio *bp)
{
	struct bio_queue_head queue;
	struct g_concat_softc *sc;
	struct g_concat_disk *disk;
	struct g_provider *pp;
	off_t offset, end, length, off, len;
	struct bio *cbp;
	char *addr;
	u_int no;

	pp = bp->bio_to;
	sc = pp->geom->softc;
	/*
	 * If sc == NULL, provider's error should be set and g_concat_start()
	 * should not be called at all.
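	 *
	 * The splitting loop below maps the logical range
	 * [offset, offset + length) onto the component windows
	 * [d_start, d_end).  A minimal worked example (editorial sketch;
	 * the sizes are assumptions): with two 1024 KB components the
	 * windows are [0, 1024 KB) and [1024 KB, 2048 KB), and a request
	 * at offset 960 KB with length 128 KB is split as
	 *
	 *	disk 0: off = 960 KB - 0 = 960 KB
	 *		len = MIN(128 KB, 1024 KB - 960 KB) = 64 KB
	 *	disk 1: off = 1024 KB - 1024 KB = 0
	 *		len = MIN(64 KB, 2048 KB - 1024 KB) = 64 KB
	 *
	 * matching "off = offset - disk->d_start" and
	 * "len = MIN(length, disk->d_end - offset)" in the loop.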
*/ KASSERT(sc != NULL, ("Provider's error should be set (error=%d)(device=%s).", bp->bio_to->error, bp->bio_to->name)); G_CONCAT_LOGREQ(bp, "Request received."); switch (bp->bio_cmd) { case BIO_READ: case BIO_WRITE: case BIO_DELETE: break; case BIO_FLUSH: g_concat_flush(sc, bp); return; case BIO_GETATTR: if (strcmp("GEOM::kerneldump", bp->bio_attribute) == 0) { g_concat_kernel_dump(bp); return; } /* To which provider it should be delivered? */ /* FALLTHROUGH */ default: g_io_deliver(bp, EOPNOTSUPP); return; } offset = bp->bio_offset; length = bp->bio_length; if ((bp->bio_flags & BIO_UNMAPPED) != 0) addr = NULL; else addr = bp->bio_data; end = offset + length; bioq_init(&queue); for (no = 0; no < sc->sc_ndisks; no++) { disk = &sc->sc_disks[no]; if (disk->d_end <= offset) continue; if (disk->d_start >= end) break; off = offset - disk->d_start; len = MIN(length, disk->d_end - offset); length -= len; offset += len; cbp = g_clone_bio(bp); if (cbp == NULL) { while ((cbp = bioq_takefirst(&queue)) != NULL) g_destroy_bio(cbp); if (bp->bio_error == 0) bp->bio_error = ENOMEM; g_io_deliver(bp, bp->bio_error); return; } bioq_insert_tail(&queue, cbp); /* * Fill in the component buf structure. */ if (len == bp->bio_length) cbp->bio_done = g_std_done; else cbp->bio_done = g_concat_done; cbp->bio_offset = off; cbp->bio_length = len; if ((bp->bio_flags & BIO_UNMAPPED) != 0) { cbp->bio_ma_offset += (uintptr_t)addr; cbp->bio_ma += cbp->bio_ma_offset / PAGE_SIZE; cbp->bio_ma_offset %= PAGE_SIZE; cbp->bio_ma_n = round_page(cbp->bio_ma_offset + cbp->bio_length) / PAGE_SIZE; } else cbp->bio_data = addr; addr += len; cbp->bio_to = disk->d_consumer->provider; cbp->bio_caller1 = disk; if (length == 0) break; } KASSERT(length == 0, ("Length is still greater than 0 (class=%s, name=%s).", bp->bio_to->geom->class->name, bp->bio_to->geom->name)); while ((cbp = bioq_takefirst(&queue)) != NULL) { G_CONCAT_LOGREQ(cbp, "Sending request."); disk = cbp->bio_caller1; cbp->bio_caller1 = NULL; g_io_request(cbp, disk->d_consumer); } } static void g_concat_check_and_run(struct g_concat_softc *sc) { struct g_concat_disk *disk; struct g_provider *dp, *pp; u_int no, sectorsize = 0; off_t start; g_topology_assert(); if (g_concat_nvalid(sc) != sc->sc_ndisks) return; pp = g_new_providerf(sc->sc_geom, "concat/%s", sc->sc_name); pp->flags |= G_PF_DIRECT_SEND | G_PF_DIRECT_RECEIVE | G_PF_ACCEPT_UNMAPPED; start = 0; for (no = 0; no < sc->sc_ndisks; no++) { disk = &sc->sc_disks[no]; dp = disk->d_consumer->provider; disk->d_start = start; disk->d_end = disk->d_start + dp->mediasize; if (sc->sc_type == G_CONCAT_TYPE_AUTOMATIC) disk->d_end -= dp->sectorsize; start = disk->d_end; if (no == 0) sectorsize = dp->sectorsize; else sectorsize = lcm(sectorsize, dp->sectorsize); /* A provider underneath us doesn't support unmapped */ if ((dp->flags & G_PF_ACCEPT_UNMAPPED) == 0) { G_CONCAT_DEBUG(1, "Cancelling unmapped " "because of %s.", dp->name); pp->flags &= ~G_PF_ACCEPT_UNMAPPED; } } pp->sectorsize = sectorsize; /* We have sc->sc_disks[sc->sc_ndisks - 1].d_end in 'start'. 
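 *
 * A worked example of the size math above (an editorial sketch; the
 * sizes are assumptions, not taken from this change): two
 * automatic-type components of 10 GB with 512-byte sectors each give
 * up one sector to the metadata block, so
 *
 *	pp->mediasize  = 2 * (10 GB - 512)
 *	pp->sectorsize = lcm(512, 512) = 512
 *
 * With mixed 512- and 4096-byte components, lcm() yields 4096 and the
 * concatenated provider accepts only I/O aligned to the larger sector
 * size.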
*/ pp->mediasize = start; pp->stripesize = sc->sc_disks[0].d_consumer->provider->stripesize; pp->stripeoffset = sc->sc_disks[0].d_consumer->provider->stripeoffset; sc->sc_provider = pp; g_error_provider(pp, 0); G_CONCAT_DEBUG(0, "Device %s activated.", sc->sc_provider->name); } static int g_concat_read_metadata(struct g_consumer *cp, struct g_concat_metadata *md) { struct g_provider *pp; u_char *buf; int error; g_topology_assert(); error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) return (error); /* Decode metadata. */ concat_metadata_decode(buf, md); g_free(buf); return (0); } /* * Add disk to given device. */ static int g_concat_add_disk(struct g_concat_softc *sc, struct g_provider *pp, u_int no) { struct g_concat_disk *disk; struct g_consumer *cp, *fcp; struct g_geom *gp; int error; g_topology_assert(); /* Metadata corrupted? */ if (no >= sc->sc_ndisks) return (EINVAL); disk = &sc->sc_disks[no]; /* Check if disk is not already attached. */ if (disk->d_consumer != NULL) return (EEXIST); gp = sc->sc_geom; fcp = LIST_FIRST(&gp->consumer); cp = g_new_consumer(gp); cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE; error = g_attach(cp, pp); if (error != 0) { g_destroy_consumer(cp); return (error); } if (fcp != NULL && (fcp->acr > 0 || fcp->acw > 0 || fcp->ace > 0)) { error = g_access(cp, fcp->acr, fcp->acw, fcp->ace); if (error != 0) { g_detach(cp); g_destroy_consumer(cp); return (error); } } if (sc->sc_type == G_CONCAT_TYPE_AUTOMATIC) { struct g_concat_metadata md; /* Re-read metadata. */ error = g_concat_read_metadata(cp, &md); if (error != 0) goto fail; if (strcmp(md.md_magic, G_CONCAT_MAGIC) != 0 || strcmp(md.md_name, sc->sc_name) != 0 || md.md_id != sc->sc_id) { G_CONCAT_DEBUG(0, "Metadata on %s changed.", pp->name); goto fail; } } cp->private = disk; disk->d_consumer = cp; disk->d_softc = sc; disk->d_start = 0; /* not yet */ disk->d_end = 0; /* not yet */ disk->d_removed = 0; G_CONCAT_DEBUG(0, "Disk %s attached to %s.", pp->name, sc->sc_name); g_concat_check_and_run(sc); return (0); fail: if (fcp != NULL && (fcp->acr > 0 || fcp->acw > 0 || fcp->ace > 0)) g_access(cp, -fcp->acr, -fcp->acw, -fcp->ace); g_detach(cp); g_destroy_consumer(cp); return (error); } static struct g_geom * g_concat_create(struct g_class *mp, const struct g_concat_metadata *md, u_int type) { struct g_concat_softc *sc; struct g_geom *gp; u_int no; G_CONCAT_DEBUG(1, "Creating device %s (id=%u).", md->md_name, md->md_id); /* One disks is minimum. 
*/ if (md->md_all < 1) return (NULL); /* Check for duplicate unit */ LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc != NULL && strcmp(sc->sc_name, md->md_name) == 0) { G_CONCAT_DEBUG(0, "Device %s already configured.", gp->name); return (NULL); } } gp = g_new_geomf(mp, "%s", md->md_name); sc = malloc(sizeof(*sc), M_CONCAT, M_WAITOK | M_ZERO); gp->start = g_concat_start; gp->spoiled = g_concat_orphan; gp->orphan = g_concat_orphan; gp->access = g_concat_access; gp->dumpconf = g_concat_dumpconf; sc->sc_id = md->md_id; sc->sc_ndisks = md->md_all; sc->sc_disks = malloc(sizeof(struct g_concat_disk) * sc->sc_ndisks, M_CONCAT, M_WAITOK | M_ZERO); for (no = 0; no < sc->sc_ndisks; no++) sc->sc_disks[no].d_consumer = NULL; sc->sc_type = type; mtx_init(&sc->sc_lock, "gconcat lock", NULL, MTX_DEF); gp->softc = sc; sc->sc_geom = gp; sc->sc_provider = NULL; G_CONCAT_DEBUG(0, "Device %s created (id=%u).", sc->sc_name, sc->sc_id); return (gp); } static int g_concat_destroy(struct g_concat_softc *sc, boolean_t force) { struct g_provider *pp; struct g_consumer *cp, *cp1; struct g_geom *gp; g_topology_assert(); if (sc == NULL) return (ENXIO); pp = sc->sc_provider; if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { if (force) { G_CONCAT_DEBUG(0, "Device %s is still open, so it " "can't be definitely removed.", pp->name); } else { G_CONCAT_DEBUG(1, "Device %s is still open (r%dw%de%d).", pp->name, pp->acr, pp->acw, pp->ace); return (EBUSY); } } gp = sc->sc_geom; LIST_FOREACH_SAFE(cp, &gp->consumer, consumer, cp1) { g_concat_remove_disk(cp->private); if (cp1 == NULL) return (0); /* Recursion happened. */ } if (!LIST_EMPTY(&gp->consumer)) return (EINPROGRESS); gp->softc = NULL; KASSERT(sc->sc_provider == NULL, ("Provider still exists? (device=%s)", gp->name)); free(sc->sc_disks, M_CONCAT); mtx_destroy(&sc->sc_lock); free(sc, M_CONCAT); G_CONCAT_DEBUG(0, "Device %s destroyed.", gp->name); g_wither_geom(gp, ENXIO); return (0); } static int g_concat_destroy_geom(struct gctl_req *req __unused, struct g_class *mp __unused, struct g_geom *gp) { struct g_concat_softc *sc; sc = gp->softc; return (g_concat_destroy(sc, 0)); } static struct g_geom * g_concat_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_concat_metadata md; struct g_concat_softc *sc; struct g_consumer *cp; struct g_geom *gp; int error; g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); g_topology_assert(); /* Skip providers that are already open for writing. */ if (pp->acw > 0) return (NULL); G_CONCAT_DEBUG(3, "Tasting %s.", pp->name); gp = g_new_geomf(mp, "concat:taste"); gp->start = g_concat_start; gp->access = g_concat_access; gp->orphan = g_concat_orphan; cp = g_new_consumer(gp); g_attach(cp, pp); error = g_concat_read_metadata(cp, &md); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); if (error != 0) return (NULL); gp = NULL; if (strcmp(md.md_magic, G_CONCAT_MAGIC) != 0) return (NULL); if (md.md_version > G_CONCAT_VERSION) { printf("geom_concat.ko module is too old to handle %s.\n", pp->name); return (NULL); } /* * Backward compatibility: */ /* There was no md_provider field in earlier versions of metadata. */ if (md.md_version < 3) bzero(md.md_provider, sizeof(md.md_provider)); /* There was no md_provsize field in earlier versions of metadata. 
*/ if (md.md_version < 4) md.md_provsize = pp->mediasize; if (md.md_provider[0] != '\0' && !g_compare_names(md.md_provider, pp->name)) return (NULL); if (md.md_provsize != pp->mediasize) return (NULL); /* * Let's check if device already exists. */ sc = NULL; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; if (sc->sc_type != G_CONCAT_TYPE_AUTOMATIC) continue; if (strcmp(md.md_name, sc->sc_name) != 0) continue; if (md.md_id != sc->sc_id) continue; break; } if (gp != NULL) { G_CONCAT_DEBUG(1, "Adding disk %s to %s.", pp->name, gp->name); error = g_concat_add_disk(sc, pp, md.md_no); if (error != 0) { G_CONCAT_DEBUG(0, "Cannot add disk %s to %s (error=%d).", pp->name, gp->name, error); return (NULL); } } else { gp = g_concat_create(mp, &md, G_CONCAT_TYPE_AUTOMATIC); if (gp == NULL) { G_CONCAT_DEBUG(0, "Cannot create device %s.", md.md_name); return (NULL); } sc = gp->softc; G_CONCAT_DEBUG(1, "Adding disk %s to %s.", pp->name, gp->name); error = g_concat_add_disk(sc, pp, md.md_no); if (error != 0) { G_CONCAT_DEBUG(0, "Cannot add disk %s to %s (error=%d).", pp->name, gp->name, error); g_concat_destroy(sc, 1); return (NULL); } } return (gp); } static void g_concat_ctl_create(struct gctl_req *req, struct g_class *mp) { u_int attached, no; struct g_concat_metadata md; struct g_provider *pp; struct g_concat_softc *sc; struct g_geom *gp; struct sbuf *sb; const char *name; char param[16]; int *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } if (*nargs < 2) { gctl_error(req, "Too few arguments."); return; } strlcpy(md.md_magic, G_CONCAT_MAGIC, sizeof(md.md_magic)); md.md_version = G_CONCAT_VERSION; name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", 0); return; } strlcpy(md.md_name, name, sizeof(md.md_name)); md.md_id = arc4random(); md.md_no = 0; md.md_all = *nargs - 1; bzero(md.md_provider, sizeof(md.md_provider)); /* This field is not important here. 
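 *
 * Request-layout sketch (an editorial illustration of the parameter
 * names used below; the device and disk names are made up): a
 * userland "gconcat create data da0 da1" request arrives as
 *
 *	nargs = 3
 *	arg0  = "data"			(device name)
 *	arg1  = "da0", arg2 = "da1"	(components; "/dev/" is optional)
 *
 * which is why md_all is *nargs - 1 and the attach loop below starts
 * at no = 1, attaching provider arg<no> as disk number no - 1.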
*/ md.md_provsize = 0; /* Check all providers are valid */ for (no = 1; no < *nargs; no++) { snprintf(param, sizeof(param), "arg%u", no); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", no); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); if (pp == NULL) { G_CONCAT_DEBUG(1, "Disk %s is invalid.", name); gctl_error(req, "Disk %s is invalid.", name); return; } } gp = g_concat_create(mp, &md, G_CONCAT_TYPE_MANUAL); if (gp == NULL) { gctl_error(req, "Can't configure %s.", md.md_name); return; } sc = gp->softc; sb = sbuf_new_auto(); sbuf_printf(sb, "Can't attach disk(s) to %s:", gp->name); for (attached = 0, no = 1; no < *nargs; no++) { snprintf(param, sizeof(param), "arg%u", no); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument.", no); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); KASSERT(pp != NULL, ("Provider %s disappear?!", name)); if (g_concat_add_disk(sc, pp, no - 1) != 0) { G_CONCAT_DEBUG(1, "Disk %u (%s) not attached to %s.", no, pp->name, gp->name); sbuf_printf(sb, " %s", pp->name); continue; } attached++; } sbuf_finish(sb); if (md.md_all != attached) { g_concat_destroy(gp->softc, 1); gctl_error(req, "%s", sbuf_data(sb)); } sbuf_delete(sb); } static struct g_concat_softc * g_concat_find_device(struct g_class *mp, const char *name) { struct g_concat_softc *sc; struct g_geom *gp; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; if (strcmp(sc->sc_name, name) == 0) return (sc); } return (NULL); } static void g_concat_ctl_destroy(struct gctl_req *req, struct g_class *mp) { struct g_concat_softc *sc; int *force, *nargs, error; const char *name; char param[16]; u_int i; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } force = gctl_get_paraml(req, "force", sizeof(*force)); if (force == NULL) { gctl_error(req, "No '%s' argument.", "force"); return; } for (i = 0; i < (u_int)*nargs; i++) { snprintf(param, sizeof(param), "arg%u", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", i); return; } sc = g_concat_find_device(mp, name); if (sc == NULL) { gctl_error(req, "No such device: %s.", name); return; } error = g_concat_destroy(sc, *force); if (error != 0) { gctl_error(req, "Cannot destroy device %s (error=%d).", sc->sc_name, error); return; } } } static void g_concat_config(struct gctl_req *req, struct g_class *mp, const char *verb) { uint32_t *version; g_topology_assert(); version = gctl_get_paraml(req, "version", sizeof(*version)); if (version == NULL) { gctl_error(req, "No '%s' argument.", "version"); return; } if (*version != G_CONCAT_VERSION) { gctl_error(req, "Userland and kernel parts are out of sync."); return; } if (strcmp(verb, "create") == 0) { g_concat_ctl_create(req, mp); return; } else if (strcmp(verb, "destroy") == 0 || strcmp(verb, "stop") == 0) { g_concat_ctl_destroy(req, mp); return; } gctl_error(req, "Unknown verb."); } static void g_concat_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_concat_softc *sc; g_topology_assert(); sc = gp->softc; if (sc == NULL) return; if (pp != NULL) { /* Nothing here. 
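 *
 * An illustrative shape of the per-geom XML this function emits for
 * "gconcat list" (the element names here are editorial assumptions;
 * only the printed values are taken from the code below):
 *
 *	<ID>3545118819</ID>
 *	<Type>AUTOMATIC</Type>
 *	<Status>Total=2, Online=2</Status>
 *	<State>UP</State>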
*/ } else if (cp != NULL) { struct g_concat_disk *disk; disk = cp->private; if (disk == NULL) return; sbuf_printf(sb, "%s%jd\n", indent, (intmax_t)disk->d_end); sbuf_printf(sb, "%s%jd\n", indent, (intmax_t)disk->d_start); } else { sbuf_printf(sb, "%s%u\n", indent, (u_int)sc->sc_id); sbuf_printf(sb, "%s", indent); switch (sc->sc_type) { case G_CONCAT_TYPE_AUTOMATIC: sbuf_printf(sb, "AUTOMATIC"); break; case G_CONCAT_TYPE_MANUAL: sbuf_printf(sb, "MANUAL"); break; default: sbuf_printf(sb, "UNKNOWN"); break; } sbuf_printf(sb, "\n"); sbuf_printf(sb, "%sTotal=%u, Online=%u\n", indent, sc->sc_ndisks, g_concat_nvalid(sc)); sbuf_printf(sb, "%s", indent); if (sc->sc_provider != NULL && sc->sc_provider->error == 0) sbuf_printf(sb, "UP"); else sbuf_printf(sb, "DOWN"); sbuf_printf(sb, "\n"); } } DECLARE_GEOM_CLASS(g_concat_class, g_concat); +MODULE_VERSION(geom_concat, 0); Index: user/markj/netdump/sys/geom/eli/g_eli.c =================================================================== --- user/markj/netdump/sys/geom/eli/g_eli.c (revision 332407) +++ user/markj/netdump/sys/geom/eli/g_eli.c (revision 332408) @@ -1,1335 +1,1336 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2005-2011 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_eli, "GEOM crypto module"); MALLOC_DEFINE(M_ELI, "eli data", "GEOM_ELI Data"); SYSCTL_DECL(_kern_geom); SYSCTL_NODE(_kern_geom, OID_AUTO, eli, CTLFLAG_RW, 0, "GEOM_ELI stuff"); static int g_eli_version = G_ELI_VERSION; SYSCTL_INT(_kern_geom_eli, OID_AUTO, version, CTLFLAG_RD, &g_eli_version, 0, "GELI version"); int g_eli_debug = 0; SYSCTL_INT(_kern_geom_eli, OID_AUTO, debug, CTLFLAG_RWTUN, &g_eli_debug, 0, "Debug level"); static u_int g_eli_tries = 3; SYSCTL_UINT(_kern_geom_eli, OID_AUTO, tries, CTLFLAG_RWTUN, &g_eli_tries, 0, "Number of tries for entering the passphrase"); static u_int g_eli_visible_passphrase = GETS_NOECHO; SYSCTL_UINT(_kern_geom_eli, OID_AUTO, visible_passphrase, CTLFLAG_RWTUN, &g_eli_visible_passphrase, 0, "Visibility of passphrase prompt (0 = invisible, 1 = visible, 2 = asterisk)"); u_int g_eli_overwrites = G_ELI_OVERWRITES; SYSCTL_UINT(_kern_geom_eli, OID_AUTO, overwrites, CTLFLAG_RWTUN, &g_eli_overwrites, 0, "Number of times on-disk keys should be overwritten when destroying them"); static u_int g_eli_threads = 0; SYSCTL_UINT(_kern_geom_eli, OID_AUTO, threads, CTLFLAG_RWTUN, &g_eli_threads, 0, "Number of threads doing crypto work"); u_int g_eli_batch = 0; SYSCTL_UINT(_kern_geom_eli, OID_AUTO, batch, CTLFLAG_RWTUN, &g_eli_batch, 0, "Use crypto operations batching"); /* * Passphrase cached during boot, in order to be more user-friendly if * there are multiple providers using the same passphrase. */ static char cached_passphrase[256]; static u_int g_eli_boot_passcache = 1; TUNABLE_INT("kern.geom.eli.boot_passcache", &g_eli_boot_passcache); SYSCTL_UINT(_kern_geom_eli, OID_AUTO, boot_passcache, CTLFLAG_RD, &g_eli_boot_passcache, 0, "Passphrases are cached during boot process for possible reuse"); static void fetch_loader_passphrase(void * dummy) { char * env_passphrase; KASSERT(dynamic_kenv, ("need dynamic kenv")); if ((env_passphrase = kern_getenv("kern.geom.eli.passphrase")) != NULL) { /* Extract passphrase from the environment. */ strlcpy(cached_passphrase, env_passphrase, sizeof(cached_passphrase)); freeenv(env_passphrase); /* Wipe the passphrase from the environment. */ kern_unsetenv("kern.geom.eli.passphrase"); } } SYSINIT(geli_fetch_loader_passphrase, SI_SUB_KMEM + 1, SI_ORDER_ANY, fetch_loader_passphrase, NULL); static void zero_boot_passcache(void) { explicit_bzero(cached_passphrase, sizeof(cached_passphrase)); } static void zero_geli_intake_keys(void) { struct keybuf *keybuf; int i; if ((keybuf = get_keybuf()) != NULL) { /* Scan the key buffer, clear all GELI keys. 
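 *
 * Boot-time handoff sketch: the loader can pass a passphrase through
 * the kern.geom.eli.passphrase kenv, consumed and unset by
 * fetch_loader_passphrase() above, and key material through the
 * keybuf scanned here.  Both caches are wiped once root is mounted,
 * via the mountroot handler registered below.  The documented
 * loader.conf usage (see geli(8)) is simply:
 *
 *	kern.geom.eli.passphrase="top secret"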
 */
		for (i = 0; i < keybuf->kb_nents; i++) {
			if (keybuf->kb_ents[i].ke_type == KEYBUF_TYPE_GELI) {
				explicit_bzero(keybuf->kb_ents[i].ke_data,
				    sizeof(keybuf->kb_ents[i].ke_data));
				keybuf->kb_ents[i].ke_type = KEYBUF_TYPE_NONE;
			}
		}
	}
}

static void
zero_intake_passcache(void *dummy)
{
	zero_boot_passcache();
	zero_geli_intake_keys();
}
EVENTHANDLER_DEFINE(mountroot, zero_intake_passcache, NULL, 0);

static eventhandler_tag g_eli_pre_sync = NULL;

static int g_eli_destroy_geom(struct gctl_req *req, struct g_class *mp,
    struct g_geom *gp);
static void g_eli_init(struct g_class *mp);
static void g_eli_fini(struct g_class *mp);

static g_taste_t g_eli_taste;
static g_dumpconf_t g_eli_dumpconf;

struct g_class g_eli_class = {
	.name = G_ELI_CLASS_NAME,
	.version = G_VERSION,
	.ctlreq = g_eli_config,
	.taste = g_eli_taste,
	.destroy_geom = g_eli_destroy_geom,
	.init = g_eli_init,
	.fini = g_eli_fini
};

/*
 * Code paths:
 * BIO_READ:
 *	g_eli_start -> g_eli_crypto_read -> g_io_request -> g_eli_read_done -> g_eli_crypto_run -> g_eli_crypto_read_done -> g_io_deliver
 * BIO_WRITE:
 *	g_eli_start -> g_eli_crypto_run -> g_eli_crypto_write_done -> g_io_request -> g_eli_write_done -> g_io_deliver
 */

/*
 * EAGAIN from crypto(9) means that the request was probably rebalanced to
 * another crypto accelerator.
 * The function updates the SID and reruns the operation.
 */
int
g_eli_crypto_rerun(struct cryptop *crp)
{
	struct g_eli_softc *sc;
	struct g_eli_worker *wr;
	struct bio *bp;
	int error;

	bp = (struct bio *)crp->crp_opaque;
	sc = bp->bio_to->geom->softc;
	LIST_FOREACH(wr, &sc->sc_workers, w_next) {
		if (wr->w_number == bp->bio_pflags)
			break;
	}
	KASSERT(wr != NULL, ("Invalid worker (%u).", bp->bio_pflags));
	G_ELI_DEBUG(1, "Rerunning crypto %s request (sid: %ju -> %ju).",
	    bp->bio_cmd == BIO_READ ? "READ" : "WRITE", (uintmax_t)wr->w_sid,
	    (uintmax_t)crp->crp_sid);
	wr->w_sid = crp->crp_sid;
	crp->crp_etype = 0;
	error = crypto_dispatch(crp);
	if (error == 0)
		return (0);
	G_ELI_DEBUG(1, "%s: crypto_dispatch() returned %d.", __func__, error);
	crp->crp_etype = error;
	return (error);
}

static void
g_eli_getattr_done(struct bio *bp)
{
	if (bp->bio_error == 0 &&
	    !strcmp(bp->bio_attribute, "GEOM::physpath")) {
		strlcat(bp->bio_data, "/eli", bp->bio_length);
	}
	g_std_done(bp);
}

/*
 * The function is called after reading encrypted data from the provider.
 *
 * g_eli_start -> g_eli_crypto_read -> g_io_request -> G_ELI_READ_DONE -> g_eli_crypto_run -> g_eli_crypto_read_done -> g_io_deliver
 */
void
g_eli_read_done(struct bio *bp)
{
	struct g_eli_softc *sc;
	struct bio *pbp;

	G_ELI_LOGREQ(2, bp, "Request done.");
	pbp = bp->bio_parent;
	if (pbp->bio_error == 0 && bp->bio_error != 0)
		pbp->bio_error = bp->bio_error;
	g_destroy_bio(bp);
	/*
	 * Do we have all sectors already?
	 */
	pbp->bio_inbed++;
	if (pbp->bio_inbed < pbp->bio_children)
		return;
	sc = pbp->bio_to->geom->softc;
	if (pbp->bio_error != 0) {
		G_ELI_LOGREQ(0, pbp, "%s() failed (error=%d)", __func__,
		    pbp->bio_error);
		pbp->bio_completed = 0;
		if (pbp->bio_driver2 != NULL) {
			free(pbp->bio_driver2, M_ELI);
			pbp->bio_driver2 = NULL;
		}
		g_io_deliver(pbp, pbp->bio_error);
		atomic_subtract_int(&sc->sc_inflight, 1);
		return;
	}
	mtx_lock(&sc->sc_queue_mtx);
	bioq_insert_tail(&sc->sc_queue, pbp);
	mtx_unlock(&sc->sc_queue_mtx);
	wakeup(sc);
}

/*
 * The function is called after we encrypt and write data.
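 *
 * As in g_eli_read_done() above, completion of the parent bio is
 * detected by counting children; the pattern is equivalent to the
 * shorthand (an editorial restatement, not new behaviour):
 *
 *	if (++pbp->bio_inbed == pbp->bio_children)
 *		g_io_deliver(pbp, pbp->bio_error);	(last child only)
 *
 * so a write split into three children reaches g_io_deliver() exactly
 * once.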
* * g_eli_start -> g_eli_crypto_run -> g_eli_crypto_write_done -> g_io_request -> G_ELI_WRITE_DONE -> g_io_deliver */ void g_eli_write_done(struct bio *bp) { struct g_eli_softc *sc; struct bio *pbp; G_ELI_LOGREQ(2, bp, "Request done."); pbp = bp->bio_parent; if (pbp->bio_error == 0 && bp->bio_error != 0) pbp->bio_error = bp->bio_error; g_destroy_bio(bp); /* * Do we have all sectors already? */ pbp->bio_inbed++; if (pbp->bio_inbed < pbp->bio_children) return; free(pbp->bio_driver2, M_ELI); pbp->bio_driver2 = NULL; if (pbp->bio_error != 0) { G_ELI_LOGREQ(0, pbp, "%s() failed (error=%d)", __func__, pbp->bio_error); pbp->bio_completed = 0; } else pbp->bio_completed = pbp->bio_length; /* * Write is finished, send it up. */ sc = pbp->bio_to->geom->softc; g_io_deliver(pbp, pbp->bio_error); atomic_subtract_int(&sc->sc_inflight, 1); } /* * This function should never be called, but GEOM made as it set ->orphan() * method for every geom. */ static void g_eli_orphan_spoil_assert(struct g_consumer *cp) { panic("Function %s() called for %s.", __func__, cp->geom->name); } static void g_eli_orphan(struct g_consumer *cp) { struct g_eli_softc *sc; g_topology_assert(); sc = cp->geom->softc; if (sc == NULL) return; g_eli_destroy(sc, TRUE); } /* * BIO_READ: * G_ELI_START -> g_eli_crypto_read -> g_io_request -> g_eli_read_done -> g_eli_crypto_run -> g_eli_crypto_read_done -> g_io_deliver * BIO_WRITE: * G_ELI_START -> g_eli_crypto_run -> g_eli_crypto_write_done -> g_io_request -> g_eli_write_done -> g_io_deliver */ static void g_eli_start(struct bio *bp) { struct g_eli_softc *sc; struct g_consumer *cp; struct bio *cbp; sc = bp->bio_to->geom->softc; KASSERT(sc != NULL, ("Provider's error should be set (error=%d)(device=%s).", bp->bio_to->error, bp->bio_to->name)); G_ELI_LOGREQ(2, bp, "Request received."); switch (bp->bio_cmd) { case BIO_READ: case BIO_WRITE: case BIO_GETATTR: case BIO_FLUSH: case BIO_ZONE: break; case BIO_DELETE: /* * If the user hasn't set the NODELETE flag, we just pass * it down the stack and let the layers beneath us do (or * not) whatever they do with it. If they have, we * reject it. A possible extension would be an * additional flag to take it as a hint to shred the data * with [multiple?] overwrites. 
*/ if (!(sc->sc_flags & G_ELI_FLAG_NODELETE)) break; default: g_io_deliver(bp, EOPNOTSUPP); return; } cbp = g_clone_bio(bp); if (cbp == NULL) { g_io_deliver(bp, ENOMEM); return; } bp->bio_driver1 = cbp; bp->bio_pflags = G_ELI_NEW_BIO; switch (bp->bio_cmd) { case BIO_READ: if (!(sc->sc_flags & G_ELI_FLAG_AUTH)) { g_eli_crypto_read(sc, bp, 0); break; } /* FALLTHROUGH */ case BIO_WRITE: mtx_lock(&sc->sc_queue_mtx); bioq_insert_tail(&sc->sc_queue, bp); mtx_unlock(&sc->sc_queue_mtx); wakeup(sc); break; case BIO_GETATTR: case BIO_FLUSH: case BIO_DELETE: case BIO_ZONE: if (bp->bio_cmd == BIO_GETATTR) cbp->bio_done = g_eli_getattr_done; else cbp->bio_done = g_std_done; cp = LIST_FIRST(&sc->sc_geom->consumer); cbp->bio_to = cp->provider; G_ELI_LOGREQ(2, cbp, "Sending request."); g_io_request(cbp, cp); break; } } static int g_eli_newsession(struct g_eli_worker *wr) { struct g_eli_softc *sc; struct cryptoini crie, cria; int error; sc = wr->w_softc; bzero(&crie, sizeof(crie)); crie.cri_alg = sc->sc_ealgo; crie.cri_klen = sc->sc_ekeylen; if (sc->sc_ealgo == CRYPTO_AES_XTS) crie.cri_klen <<= 1; if ((sc->sc_flags & G_ELI_FLAG_FIRST_KEY) != 0) { crie.cri_key = g_eli_key_hold(sc, 0, LIST_FIRST(&sc->sc_geom->consumer)->provider->sectorsize); } else { crie.cri_key = sc->sc_ekey; } if (sc->sc_flags & G_ELI_FLAG_AUTH) { bzero(&cria, sizeof(cria)); cria.cri_alg = sc->sc_aalgo; cria.cri_klen = sc->sc_akeylen; cria.cri_key = sc->sc_akey; crie.cri_next = &cria; } switch (sc->sc_crypto) { case G_ELI_CRYPTO_SW: error = crypto_newsession(&wr->w_sid, &crie, CRYPTOCAP_F_SOFTWARE); break; case G_ELI_CRYPTO_HW: error = crypto_newsession(&wr->w_sid, &crie, CRYPTOCAP_F_HARDWARE); break; case G_ELI_CRYPTO_UNKNOWN: error = crypto_newsession(&wr->w_sid, &crie, CRYPTOCAP_F_HARDWARE); if (error == 0) { mtx_lock(&sc->sc_queue_mtx); if (sc->sc_crypto == G_ELI_CRYPTO_UNKNOWN) sc->sc_crypto = G_ELI_CRYPTO_HW; mtx_unlock(&sc->sc_queue_mtx); } else { error = crypto_newsession(&wr->w_sid, &crie, CRYPTOCAP_F_SOFTWARE); mtx_lock(&sc->sc_queue_mtx); if (sc->sc_crypto == G_ELI_CRYPTO_UNKNOWN) sc->sc_crypto = G_ELI_CRYPTO_SW; mtx_unlock(&sc->sc_queue_mtx); } break; default: panic("%s: invalid condition", __func__); } if ((sc->sc_flags & G_ELI_FLAG_FIRST_KEY) != 0) g_eli_key_drop(sc, crie.cri_key); return (error); } static void g_eli_freesession(struct g_eli_worker *wr) { crypto_freesession(wr->w_sid); } static void g_eli_cancel(struct g_eli_softc *sc) { struct bio *bp; mtx_assert(&sc->sc_queue_mtx, MA_OWNED); while ((bp = bioq_takefirst(&sc->sc_queue)) != NULL) { KASSERT(bp->bio_pflags == G_ELI_NEW_BIO, ("Not new bio when canceling (bp=%p).", bp)); g_io_deliver(bp, ENXIO); } } static struct bio * g_eli_takefirst(struct g_eli_softc *sc) { struct bio *bp; mtx_assert(&sc->sc_queue_mtx, MA_OWNED); if (!(sc->sc_flags & G_ELI_FLAG_SUSPEND)) return (bioq_takefirst(&sc->sc_queue)); /* * Device suspended, so we skip new I/O requests. */ TAILQ_FOREACH(bp, &sc->sc_queue.queue, bio_queue) { if (bp->bio_pflags != G_ELI_NEW_BIO) break; } if (bp != NULL) bioq_remove(&sc->sc_queue, bp); return (bp); } /* * This is the main function for kernel worker thread when we don't have * hardware acceleration and we have to do cryptography in software. * Dedicated thread is needed, so we don't slow down g_up/g_down GEOM * threads with crypto work. 
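 *
 * Sizing sketch (the CPU count is an example): with the
 * kern.geom.eli.threads sysctl left at its default of 0, one worker
 * is started per CPU and, when the counts match, bound to it, so on a
 * 4-CPU machine workers 0..3 run on CPUs 0..3 via
 *
 *	sched_bind(curthread, wr->w_number % mp_ncpus);
 *
 * Setting the sysctl lower trades parallelism for fewer kernel
 * threads per attached provider.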
*/ static void g_eli_worker(void *arg) { struct g_eli_softc *sc; struct g_eli_worker *wr; struct bio *bp; int error; wr = arg; sc = wr->w_softc; #ifdef EARLY_AP_STARTUP MPASS(!sc->sc_cpubind || smp_started); #elif defined(SMP) /* Before sched_bind() to a CPU, wait for all CPUs to go on-line. */ if (sc->sc_cpubind) { while (!smp_started) tsleep(wr, 0, "geli:smp", hz / 4); } #endif thread_lock(curthread); sched_prio(curthread, PUSER); if (sc->sc_cpubind) sched_bind(curthread, wr->w_number % mp_ncpus); thread_unlock(curthread); G_ELI_DEBUG(1, "Thread %s started.", curthread->td_proc->p_comm); for (;;) { mtx_lock(&sc->sc_queue_mtx); again: bp = g_eli_takefirst(sc); if (bp == NULL) { if (sc->sc_flags & G_ELI_FLAG_DESTROY) { g_eli_cancel(sc); LIST_REMOVE(wr, w_next); g_eli_freesession(wr); free(wr, M_ELI); G_ELI_DEBUG(1, "Thread %s exiting.", curthread->td_proc->p_comm); wakeup(&sc->sc_workers); mtx_unlock(&sc->sc_queue_mtx); kproc_exit(0); } while (sc->sc_flags & G_ELI_FLAG_SUSPEND) { if (sc->sc_inflight > 0) { G_ELI_DEBUG(0, "inflight=%d", sc->sc_inflight); /* * We still have inflight BIOs, so * sleep and retry. */ msleep(sc, &sc->sc_queue_mtx, PRIBIO, "geli:inf", hz / 5); goto again; } /* * Suspend requested, mark the worker as * suspended and go to sleep. */ if (wr->w_active) { g_eli_freesession(wr); wr->w_active = FALSE; } wakeup(&sc->sc_workers); msleep(sc, &sc->sc_queue_mtx, PRIBIO, "geli:suspend", 0); if (!wr->w_active && !(sc->sc_flags & G_ELI_FLAG_SUSPEND)) { error = g_eli_newsession(wr); KASSERT(error == 0, ("g_eli_newsession() failed on resume (error=%d)", error)); wr->w_active = TRUE; } goto again; } msleep(sc, &sc->sc_queue_mtx, PDROP, "geli:w", 0); continue; } if (bp->bio_pflags == G_ELI_NEW_BIO) atomic_add_int(&sc->sc_inflight, 1); mtx_unlock(&sc->sc_queue_mtx); if (bp->bio_pflags == G_ELI_NEW_BIO) { bp->bio_pflags = 0; if (sc->sc_flags & G_ELI_FLAG_AUTH) { if (bp->bio_cmd == BIO_READ) g_eli_auth_read(sc, bp); else g_eli_auth_run(wr, bp); } else { if (bp->bio_cmd == BIO_READ) g_eli_crypto_read(sc, bp, 1); else g_eli_crypto_run(wr, bp); } } else { if (sc->sc_flags & G_ELI_FLAG_AUTH) g_eli_auth_run(wr, bp); else g_eli_crypto_run(wr, bp); } } } int g_eli_read_metadata(struct g_class *mp, struct g_provider *pp, struct g_eli_metadata *md) { struct g_geom *gp; struct g_consumer *cp; u_char *buf = NULL; int error; g_topology_assert(); gp = g_new_geomf(mp, "eli:taste"); gp->start = g_eli_start; gp->access = g_std_access; /* * g_eli_read_metadata() is always called from the event thread. * Our geom is created and destroyed in the same event, so there * could be no orphan nor spoil event in the meantime. */ gp->orphan = g_eli_orphan_spoil_assert; gp->spoiled = g_eli_orphan_spoil_assert; cp = g_new_consumer(gp); error = g_attach(cp, pp); if (error != 0) goto end; error = g_access(cp, 1, 0, 0); if (error != 0) goto end; g_topology_unlock(); buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); g_topology_lock(); if (buf == NULL) goto end; error = eli_metadata_decode(buf, md); if (error != 0) goto end; /* Metadata was read and decoded successfully. */ end: if (buf != NULL) g_free(buf); if (cp->provider != NULL) { if (cp->acr == 1) g_access(cp, -1, 0, 0); g_detach(cp); } g_destroy_consumer(cp); g_destroy_geom(gp); return (error); } /* * The function is called when we had last close on provider and user requested * to close it when this situation occur. 
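 *
 * Usage sketch: "geli attach -d da0" (the documented geli(8)
 * detach-on-last-close flag; the provider name is an example) sets
 * G_ELI_FLAG_WO_DETACH, which makes the geom use g_eli_access()
 * below, and that in turn posts this event once the device is closed
 * after having been opened for writing.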
*/ static void g_eli_last_close(void *arg, int flags __unused) { struct g_geom *gp; char gpname[64]; int error; g_topology_assert(); gp = arg; strlcpy(gpname, gp->name, sizeof(gpname)); error = g_eli_destroy(gp->softc, TRUE); KASSERT(error == 0, ("Cannot detach %s on last close (error=%d).", gpname, error)); G_ELI_DEBUG(0, "Detached %s on last close.", gpname); } int g_eli_access(struct g_provider *pp, int dr, int dw, int de) { struct g_eli_softc *sc; struct g_geom *gp; gp = pp->geom; sc = gp->softc; if (dw > 0) { if (sc->sc_flags & G_ELI_FLAG_RO) { /* Deny write attempts. */ return (EROFS); } /* Someone is opening us for write, we need to remember that. */ sc->sc_flags |= G_ELI_FLAG_WOPEN; return (0); } /* Is this the last close? */ if (pp->acr + dr > 0 || pp->acw + dw > 0 || pp->ace + de > 0) return (0); /* * Automatically detach on last close if requested. */ if ((sc->sc_flags & G_ELI_FLAG_RW_DETACH) || (sc->sc_flags & G_ELI_FLAG_WOPEN)) { g_post_event(g_eli_last_close, gp, M_WAITOK, NULL); } return (0); } static int g_eli_cpu_is_disabled(int cpu) { #ifdef SMP return (CPU_ISSET(cpu, &hlt_cpus_mask)); #else return (0); #endif } struct g_geom * g_eli_create(struct gctl_req *req, struct g_class *mp, struct g_provider *bpp, const struct g_eli_metadata *md, const u_char *mkey, int nkey) { struct g_eli_softc *sc; struct g_eli_worker *wr; struct g_geom *gp; struct g_provider *pp; struct g_consumer *cp; u_int i, threads; int error; G_ELI_DEBUG(1, "Creating device %s%s.", bpp->name, G_ELI_SUFFIX); gp = g_new_geomf(mp, "%s%s", bpp->name, G_ELI_SUFFIX); sc = malloc(sizeof(*sc), M_ELI, M_WAITOK | M_ZERO); gp->start = g_eli_start; /* * Spoiling can happen even though we have the provider open * exclusively, e.g. through media change events. */ gp->spoiled = g_eli_orphan; gp->orphan = g_eli_orphan; gp->dumpconf = g_eli_dumpconf; /* * If detach-on-last-close feature is not enabled and we don't operate * on read-only provider, we can simply use g_std_access(). */ if (md->md_flags & (G_ELI_FLAG_WO_DETACH | G_ELI_FLAG_RO)) gp->access = g_eli_access; else gp->access = g_std_access; eli_metadata_softc(sc, md, bpp->sectorsize, bpp->mediasize); sc->sc_nkey = nkey; gp->softc = sc; sc->sc_geom = gp; bioq_init(&sc->sc_queue); mtx_init(&sc->sc_queue_mtx, "geli:queue", NULL, MTX_DEF); mtx_init(&sc->sc_ekeys_lock, "geli:ekeys", NULL, MTX_DEF); pp = NULL; cp = g_new_consumer(gp); error = g_attach(cp, bpp); if (error != 0) { if (req != NULL) { gctl_error(req, "Cannot attach to %s (error=%d).", bpp->name, error); } else { G_ELI_DEBUG(1, "Cannot attach to %s (error=%d).", bpp->name, error); } goto failed; } /* * Keep provider open all the time, so we can run critical tasks, * like Master Keys deletion, without wondering if we can open * provider or not. * We don't open provider for writing only when user requested read-only * access. */ if (sc->sc_flags & G_ELI_FLAG_RO) error = g_access(cp, 1, 0, 1); else error = g_access(cp, 1, 1, 1); if (error != 0) { if (req != NULL) { gctl_error(req, "Cannot access %s (error=%d).", bpp->name, error); } else { G_ELI_DEBUG(1, "Cannot access %s (error=%d).", bpp->name, error); } goto failed; } /* * Remember the keys in our softc structure. 
*/ g_eli_mkey_propagate(sc, mkey); LIST_INIT(&sc->sc_workers); threads = g_eli_threads; if (threads == 0) threads = mp_ncpus; sc->sc_cpubind = (mp_ncpus > 1 && threads == mp_ncpus); for (i = 0; i < threads; i++) { if (g_eli_cpu_is_disabled(i)) { G_ELI_DEBUG(1, "%s: CPU %u disabled, skipping.", bpp->name, i); continue; } wr = malloc(sizeof(*wr), M_ELI, M_WAITOK | M_ZERO); wr->w_softc = sc; wr->w_number = i; wr->w_active = TRUE; error = g_eli_newsession(wr); if (error != 0) { free(wr, M_ELI); if (req != NULL) { gctl_error(req, "Cannot set up crypto session " "for %s (error=%d).", bpp->name, error); } else { G_ELI_DEBUG(1, "Cannot set up crypto session " "for %s (error=%d).", bpp->name, error); } goto failed; } error = kproc_create(g_eli_worker, wr, &wr->w_proc, 0, 0, "g_eli[%u] %s", i, bpp->name); if (error != 0) { g_eli_freesession(wr); free(wr, M_ELI); if (req != NULL) { gctl_error(req, "Cannot create kernel thread " "for %s (error=%d).", bpp->name, error); } else { G_ELI_DEBUG(1, "Cannot create kernel thread " "for %s (error=%d).", bpp->name, error); } goto failed; } LIST_INSERT_HEAD(&sc->sc_workers, wr, w_next); } /* * Create decrypted provider. */ pp = g_new_providerf(gp, "%s%s", bpp->name, G_ELI_SUFFIX); pp->mediasize = sc->sc_mediasize; pp->sectorsize = sc->sc_sectorsize; g_error_provider(pp, 0); G_ELI_DEBUG(0, "Device %s created.", pp->name); G_ELI_DEBUG(0, "Encryption: %s %u", g_eli_algo2str(sc->sc_ealgo), sc->sc_ekeylen); if (sc->sc_flags & G_ELI_FLAG_AUTH) G_ELI_DEBUG(0, " Integrity: %s", g_eli_algo2str(sc->sc_aalgo)); G_ELI_DEBUG(0, " Crypto: %s", sc->sc_crypto == G_ELI_CRYPTO_SW ? "software" : "hardware"); return (gp); failed: mtx_lock(&sc->sc_queue_mtx); sc->sc_flags |= G_ELI_FLAG_DESTROY; wakeup(sc); /* * Wait for kernel threads self destruction. 
*/ while (!LIST_EMPTY(&sc->sc_workers)) { msleep(&sc->sc_workers, &sc->sc_queue_mtx, PRIBIO, "geli:destroy", 0); } mtx_destroy(&sc->sc_queue_mtx); if (cp->provider != NULL) { if (cp->acr == 1) g_access(cp, -1, -1, -1); g_detach(cp); } g_destroy_consumer(cp); g_destroy_geom(gp); g_eli_key_destroy(sc); bzero(sc, sizeof(*sc)); free(sc, M_ELI); return (NULL); } int g_eli_destroy(struct g_eli_softc *sc, boolean_t force) { struct g_geom *gp; struct g_provider *pp; g_topology_assert(); if (sc == NULL) return (ENXIO); gp = sc->sc_geom; pp = LIST_FIRST(&gp->provider); if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { if (force) { G_ELI_DEBUG(1, "Device %s is still open, so it " "cannot be definitely removed.", pp->name); sc->sc_flags |= G_ELI_FLAG_RW_DETACH; gp->access = g_eli_access; g_wither_provider(pp, ENXIO); return (EBUSY); } else { G_ELI_DEBUG(1, "Device %s is still open (r%dw%de%d).", pp->name, pp->acr, pp->acw, pp->ace); return (EBUSY); } } mtx_lock(&sc->sc_queue_mtx); sc->sc_flags |= G_ELI_FLAG_DESTROY; wakeup(sc); while (!LIST_EMPTY(&sc->sc_workers)) { msleep(&sc->sc_workers, &sc->sc_queue_mtx, PRIBIO, "geli:destroy", 0); } mtx_destroy(&sc->sc_queue_mtx); gp->softc = NULL; g_eli_key_destroy(sc); bzero(sc, sizeof(*sc)); free(sc, M_ELI); if (pp == NULL || (pp->acr == 0 && pp->acw == 0 && pp->ace == 0)) G_ELI_DEBUG(0, "Device %s destroyed.", gp->name); g_wither_geom_close(gp, ENXIO); return (0); } static int g_eli_destroy_geom(struct gctl_req *req __unused, struct g_class *mp __unused, struct g_geom *gp) { struct g_eli_softc *sc; sc = gp->softc; return (g_eli_destroy(sc, FALSE)); } static int g_eli_keyfiles_load(struct hmac_ctx *ctx, const char *provider) { u_char *keyfile, *data; char *file, name[64]; size_t size; int i; for (i = 0; ; i++) { snprintf(name, sizeof(name), "%s:geli_keyfile%d", provider, i); keyfile = preload_search_by_type(name); if (keyfile == NULL && i == 0) { /* * If there is only one keyfile, allow simpler name. */ snprintf(name, sizeof(name), "%s:geli_keyfile", provider); keyfile = preload_search_by_type(name); } if (keyfile == NULL) return (i); /* Return number of loaded keyfiles. */ data = preload_fetch_addr(keyfile); if (data == NULL) { G_ELI_DEBUG(0, "Cannot find key file data for %s.", name); return (0); } size = preload_fetch_size(keyfile); if (size == 0) { G_ELI_DEBUG(0, "Cannot find key file size for %s.", name); return (0); } file = preload_search_info(keyfile, MODINFO_NAME); if (file == NULL) { G_ELI_DEBUG(0, "Cannot find key file name for %s.", name); return (0); } G_ELI_DEBUG(1, "Loaded keyfile %s for %s (type: %s).", file, provider, name); g_eli_crypto_hmac_update(ctx, data, size); } } static void g_eli_keyfiles_clear(const char *provider) { u_char *keyfile, *data; char name[64]; size_t size; int i; for (i = 0; ; i++) { snprintf(name, sizeof(name), "%s:geli_keyfile%d", provider, i); keyfile = preload_search_by_type(name); if (keyfile == NULL) return; data = preload_fetch_addr(keyfile); size = preload_fetch_size(keyfile); if (data != NULL && size != 0) bzero(data, size); } } /* * Tasting is only made on boot. * We detect providers which should be attached before root is mounted. 
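 *
 * The keyfiles consumed by g_eli_keyfiles_load() above are staged by
 * the loader; the conventional loader.conf stanza for a single
 * keyfile on da0 (documented in geli(8); the file name is an example)
 * is
 *
 *	geli_da0_keyfile0_load="YES"
 *	geli_da0_keyfile0_type="da0:geli_keyfile0"
 *	geli_da0_keyfile0_name="/boot/keys/da0.key"
 *
 * which makes preload_search_by_type("da0:geli_keyfile0") succeed
 * during taste.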
*/ static struct g_geom * g_eli_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_eli_metadata md; struct g_geom *gp; struct hmac_ctx ctx; char passphrase[256]; u_char key[G_ELI_USERKEYLEN], mkey[G_ELI_DATAIVKEYLEN]; u_int i, nkey, nkeyfiles, tries, showpass; int error; struct keybuf *keybuf; g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); g_topology_assert(); if (root_mounted() || g_eli_tries == 0) return (NULL); G_ELI_DEBUG(3, "Tasting %s.", pp->name); error = g_eli_read_metadata(mp, pp, &md); if (error != 0) return (NULL); gp = NULL; if (strcmp(md.md_magic, G_ELI_MAGIC) != 0) return (NULL); if (md.md_version > G_ELI_VERSION) { printf("geom_eli.ko module is too old to handle %s.\n", pp->name); return (NULL); } if (md.md_provsize != pp->mediasize) return (NULL); /* Should we attach it on boot? */ if (!(md.md_flags & G_ELI_FLAG_BOOT)) return (NULL); if (md.md_keys == 0x00) { G_ELI_DEBUG(0, "No valid keys on %s.", pp->name); return (NULL); } if (md.md_iterations == -1) { /* If there is no passphrase, we try only once. */ tries = 1; } else { /* Ask for the passphrase no more than g_eli_tries times. */ tries = g_eli_tries; } if ((keybuf = get_keybuf()) != NULL) { /* Scan the key buffer, try all GELI keys. */ for (i = 0; i < keybuf->kb_nents; i++) { if (keybuf->kb_ents[i].ke_type == KEYBUF_TYPE_GELI) { memcpy(key, keybuf->kb_ents[i].ke_data, sizeof(key)); if (g_eli_mkey_decrypt(&md, key, mkey, &nkey) == 0 ) { explicit_bzero(key, sizeof(key)); goto have_key; } } } } for (i = 0; i <= tries; i++) { g_eli_crypto_hmac_init(&ctx, NULL, 0); /* * Load all key files. */ nkeyfiles = g_eli_keyfiles_load(&ctx, pp->name); if (nkeyfiles == 0 && md.md_iterations == -1) { /* * No key files and no passphrase, something is * definitely wrong here. * geli(8) doesn't allow for such situation, so assume * that there was really no passphrase and in that case * key files are no properly defined in loader.conf. */ G_ELI_DEBUG(0, "Found no key files in loader.conf for %s.", pp->name); return (NULL); } /* Ask for the passphrase if defined. */ if (md.md_iterations >= 0) { /* Try first with cached passphrase. */ if (i == 0) { if (!g_eli_boot_passcache) continue; memcpy(passphrase, cached_passphrase, sizeof(passphrase)); } else { printf("Enter passphrase for %s: ", pp->name); showpass = g_eli_visible_passphrase; if ((md.md_flags & G_ELI_FLAG_GELIDISPLAYPASS) != 0) showpass = GETS_ECHOPASS; cngets(passphrase, sizeof(passphrase), showpass); memcpy(cached_passphrase, passphrase, sizeof(passphrase)); } } /* * Prepare Derived-Key from the user passphrase. */ if (md.md_iterations == 0) { g_eli_crypto_hmac_update(&ctx, md.md_salt, sizeof(md.md_salt)); g_eli_crypto_hmac_update(&ctx, passphrase, strlen(passphrase)); explicit_bzero(passphrase, sizeof(passphrase)); } else if (md.md_iterations > 0) { u_char dkey[G_ELI_USERKEYLEN]; pkcs5v2_genkey(dkey, sizeof(dkey), md.md_salt, sizeof(md.md_salt), passphrase, md.md_iterations); bzero(passphrase, sizeof(passphrase)); g_eli_crypto_hmac_update(&ctx, dkey, sizeof(dkey)); explicit_bzero(dkey, sizeof(dkey)); } g_eli_crypto_hmac_final(&ctx, key, 0); /* * Decrypt Master-Key. */ error = g_eli_mkey_decrypt(&md, key, mkey, &nkey); bzero(key, sizeof(key)); if (error == -1) { if (i == tries) { G_ELI_DEBUG(0, "Wrong key for %s. No tries left.", pp->name); g_eli_keyfiles_clear(pp->name); return (NULL); } if (i > 0) { G_ELI_DEBUG(0, "Wrong key for %s. Tries left: %u.", pp->name, tries - i); } /* Try again. 
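 *
 * Key-derivation recap for this loop (an editorial summary of the
 * code above, not new behaviour): the user key handed to
 * g_eli_mkey_decrypt() is
 *
 *	HMAC(keyfiles || salt || passphrase)	if md_iterations == 0
 *	HMAC(keyfiles || PKCS#5v2(salt, passphrase, md_iterations))
 *						if md_iterations > 0
 *
 * and each pass of the loop retries with a freshly entered
 * passphrase until the Master Key decrypts or the tries run out.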
*/ continue; } else if (error > 0) { G_ELI_DEBUG(0, "Cannot decrypt Master Key for %s (error=%d).", pp->name, error); g_eli_keyfiles_clear(pp->name); return (NULL); } g_eli_keyfiles_clear(pp->name); G_ELI_DEBUG(1, "Using Master Key %u for %s.", nkey, pp->name); break; } have_key: /* * We have correct key, let's attach provider. */ gp = g_eli_create(NULL, mp, pp, &md, mkey, nkey); bzero(mkey, sizeof(mkey)); bzero(&md, sizeof(md)); if (gp == NULL) { G_ELI_DEBUG(0, "Cannot create device %s%s.", pp->name, G_ELI_SUFFIX); return (NULL); } return (gp); } static void g_eli_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_eli_softc *sc; g_topology_assert(); sc = gp->softc; if (sc == NULL) return; if (pp != NULL || cp != NULL) return; /* Nothing here. */ sbuf_printf(sb, "%s%ju\n", indent, (uintmax_t)sc->sc_ekeys_total); sbuf_printf(sb, "%s%ju\n", indent, (uintmax_t)sc->sc_ekeys_allocated); sbuf_printf(sb, "%s", indent); if (sc->sc_flags == 0) sbuf_printf(sb, "NONE"); else { int first = 1; #define ADD_FLAG(flag, name) do { \ if (sc->sc_flags & (flag)) { \ if (!first) \ sbuf_printf(sb, ", "); \ else \ first = 0; \ sbuf_printf(sb, name); \ } \ } while (0) ADD_FLAG(G_ELI_FLAG_SUSPEND, "SUSPEND"); ADD_FLAG(G_ELI_FLAG_SINGLE_KEY, "SINGLE-KEY"); ADD_FLAG(G_ELI_FLAG_NATIVE_BYTE_ORDER, "NATIVE-BYTE-ORDER"); ADD_FLAG(G_ELI_FLAG_ONETIME, "ONETIME"); ADD_FLAG(G_ELI_FLAG_BOOT, "BOOT"); ADD_FLAG(G_ELI_FLAG_WO_DETACH, "W-DETACH"); ADD_FLAG(G_ELI_FLAG_RW_DETACH, "RW-DETACH"); ADD_FLAG(G_ELI_FLAG_AUTH, "AUTH"); ADD_FLAG(G_ELI_FLAG_WOPEN, "W-OPEN"); ADD_FLAG(G_ELI_FLAG_DESTROY, "DESTROY"); ADD_FLAG(G_ELI_FLAG_RO, "READ-ONLY"); ADD_FLAG(G_ELI_FLAG_NODELETE, "NODELETE"); ADD_FLAG(G_ELI_FLAG_GELIBOOT, "GELIBOOT"); ADD_FLAG(G_ELI_FLAG_GELIDISPLAYPASS, "GELIDISPLAYPASS"); #undef ADD_FLAG } sbuf_printf(sb, "\n"); if (!(sc->sc_flags & G_ELI_FLAG_ONETIME)) { sbuf_printf(sb, "%s%u\n", indent, sc->sc_nkey); } sbuf_printf(sb, "%s%u\n", indent, sc->sc_version); sbuf_printf(sb, "%s", indent); switch (sc->sc_crypto) { case G_ELI_CRYPTO_HW: sbuf_printf(sb, "hardware"); break; case G_ELI_CRYPTO_SW: sbuf_printf(sb, "software"); break; default: sbuf_printf(sb, "UNKNOWN"); break; } sbuf_printf(sb, "\n"); if (sc->sc_flags & G_ELI_FLAG_AUTH) { sbuf_printf(sb, "%s%s\n", indent, g_eli_algo2str(sc->sc_aalgo)); } sbuf_printf(sb, "%s%u\n", indent, sc->sc_ekeylen); sbuf_printf(sb, "%s%s\n", indent, g_eli_algo2str(sc->sc_ealgo)); sbuf_printf(sb, "%s%s\n", indent, (sc->sc_flags & G_ELI_FLAG_SUSPEND) ? "SUSPENDED" : "ACTIVE"); } static void g_eli_shutdown_pre_sync(void *arg, int howto) { struct g_class *mp; struct g_geom *gp, *gp2; struct g_provider *pp; struct g_eli_softc *sc; int error; mp = arg; g_topology_lock(); LIST_FOREACH_SAFE(gp, &mp->geom, geom, gp2) { sc = gp->softc; if (sc == NULL) continue; pp = LIST_FIRST(&gp->provider); KASSERT(pp != NULL, ("No provider? gp=%p (%s)", gp, gp->name)); if (pp->acr + pp->acw + pp->ace == 0) error = g_eli_destroy(sc, TRUE); else { sc->sc_flags |= G_ELI_FLAG_RW_DETACH; gp->access = g_eli_access; } } g_topology_unlock(); } static void g_eli_init(struct g_class *mp) { g_eli_pre_sync = EVENTHANDLER_REGISTER(shutdown_pre_sync, g_eli_shutdown_pre_sync, mp, SHUTDOWN_PRI_FIRST); if (g_eli_pre_sync == NULL) G_ELI_DEBUG(0, "Warning! 
Cannot register shutdown event."); } static void g_eli_fini(struct g_class *mp) { if (g_eli_pre_sync != NULL) EVENTHANDLER_DEREGISTER(shutdown_pre_sync, g_eli_pre_sync); } DECLARE_GEOM_CLASS(g_eli_class, g_eli); MODULE_DEPEND(g_eli, crypto, 1, 1, 1); +MODULE_VERSION(geom_eli, 0); Index: user/markj/netdump/sys/geom/eli/g_eli_ctl.c =================================================================== --- user/markj/netdump/sys/geom/eli/g_eli_ctl.c (revision 332407) +++ user/markj/netdump/sys/geom/eli/g_eli_ctl.c (revision 332408) @@ -1,1165 +1,1172 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2005-2011 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include MALLOC_DECLARE(M_ELI); static void g_eli_ctl_attach(struct gctl_req *req, struct g_class *mp) { struct g_eli_metadata md; struct g_provider *pp; const char *name; u_char *key, mkey[G_ELI_DATAIVKEYLEN]; - int *nargs, *detach, *readonly; + int *nargs, *detach, *readonly, *dryrun; int keysize, error; u_int nkey; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } if (*nargs != 1) { gctl_error(req, "Invalid number of arguments."); return; } detach = gctl_get_paraml(req, "detach", sizeof(*detach)); if (detach == NULL) { gctl_error(req, "No '%s' argument.", "detach"); return; } readonly = gctl_get_paraml(req, "readonly", sizeof(*readonly)); if (readonly == NULL) { gctl_error(req, "No '%s' argument.", "readonly"); return; } + dryrun = gctl_get_paraml(req, "dryrun", sizeof(*dryrun)); + if (dryrun == NULL) { + gctl_error(req, "No '%s' argument.", "dryrun"); + return; + } + if (*detach && *readonly) { gctl_error(req, "Options -d and -r are mutually exclusive."); return; } name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", 0); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); if (pp == NULL) { gctl_error(req, "Provider %s is invalid.", name); return; } error = g_eli_read_metadata(mp, pp, &md); if (error != 0) { gctl_error(req, "Cannot read metadata from %s (error=%d).", name, error); return; } if (md.md_keys == 0x00) { explicit_bzero(&md, sizeof(md)); gctl_error(req, "No valid keys on %s.", pp->name); return; } key = gctl_get_param(req, "key", &keysize); if (key == NULL || keysize != G_ELI_USERKEYLEN) { explicit_bzero(&md, sizeof(md)); gctl_error(req, "No '%s' argument.", "key"); return; } error = g_eli_mkey_decrypt(&md, key, mkey, &nkey); explicit_bzero(key, keysize); if (error == -1) { explicit_bzero(&md, sizeof(md)); gctl_error(req, "Wrong key for %s.", pp->name); return; } else if (error > 0) { explicit_bzero(&md, sizeof(md)); gctl_error(req, "Cannot decrypt Master Key for %s (error=%d).", pp->name, error); return; } G_ELI_DEBUG(1, "Using Master Key %u for %s.", nkey, pp->name); if (*detach) md.md_flags |= G_ELI_FLAG_WO_DETACH; if (*readonly) md.md_flags |= G_ELI_FLAG_RO; - g_eli_create(req, mp, pp, &md, mkey, nkey); + if (!*dryrun) + g_eli_create(req, mp, pp, &md, mkey, nkey); explicit_bzero(mkey, sizeof(mkey)); explicit_bzero(&md, sizeof(md)); } static struct g_eli_softc * g_eli_find_device(struct g_class *mp, const char *prov) { struct g_eli_softc *sc; struct g_geom *gp; struct g_provider *pp; struct g_consumer *cp; if (strncmp(prov, "/dev/", strlen("/dev/")) == 0) prov += strlen("/dev/"); LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; pp = LIST_FIRST(&gp->provider); if (pp != NULL && strcmp(pp->name, prov) == 0) return (sc); cp = LIST_FIRST(&gp->consumer); if (cp != NULL && cp->provider != NULL && strcmp(cp->provider->name, prov) == 0) { return (sc); } } return (NULL); } static void g_eli_ctl_detach(struct gctl_req *req, struct g_class *mp) { struct g_eli_softc *sc; int *force, *last, *nargs, error; const char *prov; char param[16]; int i; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", 
"nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } force = gctl_get_paraml(req, "force", sizeof(*force)); if (force == NULL) { gctl_error(req, "No '%s' argument.", "force"); return; } last = gctl_get_paraml(req, "last", sizeof(*last)); if (last == NULL) { gctl_error(req, "No '%s' argument.", "last"); return; } for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); prov = gctl_get_asciiparam(req, param); if (prov == NULL) { gctl_error(req, "No 'arg%d' argument.", i); return; } sc = g_eli_find_device(mp, prov); if (sc == NULL) { gctl_error(req, "No such device: %s.", prov); return; } if (*last) { sc->sc_flags |= G_ELI_FLAG_RW_DETACH; sc->sc_geom->access = g_eli_access; } else { error = g_eli_destroy(sc, *force ? TRUE : FALSE); if (error != 0) { gctl_error(req, "Cannot destroy device %s (error=%d).", sc->sc_name, error); return; } } } } static void g_eli_ctl_onetime(struct gctl_req *req, struct g_class *mp) { struct g_eli_metadata md; struct g_provider *pp; const char *name; intmax_t *keylen, *sectorsize; u_char mkey[G_ELI_DATAIVKEYLEN]; int *nargs, *detach, *notrim; g_topology_assert(); bzero(&md, sizeof(md)); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } if (*nargs != 1) { gctl_error(req, "Invalid number of arguments."); return; } strlcpy(md.md_magic, G_ELI_MAGIC, sizeof(md.md_magic)); md.md_version = G_ELI_VERSION; md.md_flags |= G_ELI_FLAG_ONETIME; detach = gctl_get_paraml(req, "detach", sizeof(*detach)); if (detach != NULL && *detach) md.md_flags |= G_ELI_FLAG_WO_DETACH; notrim = gctl_get_paraml(req, "notrim", sizeof(*notrim)); if (notrim != NULL && *notrim) md.md_flags |= G_ELI_FLAG_NODELETE; md.md_ealgo = CRYPTO_ALGORITHM_MIN - 1; name = gctl_get_asciiparam(req, "aalgo"); if (name == NULL) { gctl_error(req, "No '%s' argument.", "aalgo"); return; } if (*name != '\0') { md.md_aalgo = g_eli_str2aalgo(name); if (md.md_aalgo >= CRYPTO_ALGORITHM_MIN && md.md_aalgo <= CRYPTO_ALGORITHM_MAX) { md.md_flags |= G_ELI_FLAG_AUTH; } else { /* * For backward compatibility, check if the -a option * was used to provide encryption algorithm. */ md.md_ealgo = g_eli_str2ealgo(name); if (md.md_ealgo < CRYPTO_ALGORITHM_MIN || md.md_ealgo > CRYPTO_ALGORITHM_MAX) { gctl_error(req, "Invalid authentication algorithm."); return; } else { gctl_error(req, "warning: The -e option, not " "the -a option is now used to specify " "encryption algorithm to use."); } } } if (md.md_ealgo < CRYPTO_ALGORITHM_MIN || md.md_ealgo > CRYPTO_ALGORITHM_MAX) { name = gctl_get_asciiparam(req, "ealgo"); if (name == NULL) { gctl_error(req, "No '%s' argument.", "ealgo"); return; } md.md_ealgo = g_eli_str2ealgo(name); if (md.md_ealgo < CRYPTO_ALGORITHM_MIN || md.md_ealgo > CRYPTO_ALGORITHM_MAX) { gctl_error(req, "Invalid encryption algorithm."); return; } } keylen = gctl_get_paraml(req, "keylen", sizeof(*keylen)); if (keylen == NULL) { gctl_error(req, "No '%s' argument.", "keylen"); return; } md.md_keylen = g_eli_keylen(md.md_ealgo, *keylen); if (md.md_keylen == 0) { gctl_error(req, "Invalid '%s' argument.", "keylen"); return; } /* Not important here. */ md.md_provsize = 0; /* Not important here. */ bzero(md.md_salt, sizeof(md.md_salt)); md.md_keys = 0x01; arc4rand(mkey, sizeof(mkey), 0); /* Not important here. 
*/ bzero(md.md_hash, sizeof(md.md_hash)); name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", 0); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); if (pp == NULL) { gctl_error(req, "Provider %s is invalid.", name); return; } sectorsize = gctl_get_paraml(req, "sectorsize", sizeof(*sectorsize)); if (sectorsize == NULL) { gctl_error(req, "No '%s' argument.", "sectorsize"); return; } if (*sectorsize == 0) md.md_sectorsize = pp->sectorsize; else { if (*sectorsize < 0 || (*sectorsize % pp->sectorsize) != 0) { gctl_error(req, "Invalid sector size."); return; } if (*sectorsize > PAGE_SIZE) { gctl_error(req, "warning: Using sectorsize bigger than " "the page size!"); } md.md_sectorsize = *sectorsize; } g_eli_create(req, mp, pp, &md, mkey, -1); explicit_bzero(mkey, sizeof(mkey)); explicit_bzero(&md, sizeof(md)); } static void g_eli_ctl_configure(struct gctl_req *req, struct g_class *mp) { struct g_eli_softc *sc; struct g_eli_metadata md; struct g_provider *pp; struct g_consumer *cp; char param[16]; const char *prov; u_char *sector; int *nargs, *boot, *noboot, *trim, *notrim, *geliboot, *nogeliboot; int *displaypass, *nodisplaypass; int zero, error, changed; u_int i; g_topology_assert(); changed = 0; zero = 0; nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } boot = gctl_get_paraml(req, "boot", sizeof(*boot)); if (boot == NULL) boot = &zero; noboot = gctl_get_paraml(req, "noboot", sizeof(*noboot)); if (noboot == NULL) noboot = &zero; if (*boot && *noboot) { gctl_error(req, "Options -b and -B are mutually exclusive."); return; } if (*boot || *noboot) changed = 1; trim = gctl_get_paraml(req, "trim", sizeof(*trim)); if (trim == NULL) trim = &zero; notrim = gctl_get_paraml(req, "notrim", sizeof(*notrim)); if (notrim == NULL) notrim = &zero; if (*trim && *notrim) { gctl_error(req, "Options -t and -T are mutually exclusive."); return; } if (*trim || *notrim) changed = 1; geliboot = gctl_get_paraml(req, "geliboot", sizeof(*geliboot)); if (geliboot == NULL) geliboot = &zero; nogeliboot = gctl_get_paraml(req, "nogeliboot", sizeof(*nogeliboot)); if (nogeliboot == NULL) nogeliboot = &zero; if (*geliboot && *nogeliboot) { gctl_error(req, "Options -g and -G are mutually exclusive."); return; } if (*geliboot || *nogeliboot) changed = 1; displaypass = gctl_get_paraml(req, "displaypass", sizeof(*displaypass)); if (displaypass == NULL) displaypass = &zero; nodisplaypass = gctl_get_paraml(req, "nodisplaypass", sizeof(*nodisplaypass)); if (nodisplaypass == NULL) nodisplaypass = &zero; if (*displaypass && *nodisplaypass) { gctl_error(req, "Options -d and -D are mutually exclusive."); return; } if (*displaypass || *nodisplaypass) changed = 1; if (!changed) { gctl_error(req, "No option given."); return; } for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); prov = gctl_get_asciiparam(req, param); if (prov == NULL) { gctl_error(req, "No 'arg%d' argument.", i); return; } sc = g_eli_find_device(mp, prov); if (sc == NULL) { /* * We ignore not attached providers, userland part will * take care of them. 
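/*
 * [Editor's sketch] The onetime path above accepts an explicit sector size
 * only when it is a positive multiple of the provider's native sector size;
 * zero means "inherit the provider's size", and sizes above PAGE_SIZE
 * merely draw a warning.  The admission rule in isolation (standalone,
 * names are illustrative):
 */
#include <stdint.h>

static int
sectorsize_ok(intmax_t requested, uint32_t native)
{

	if (requested == 0)
		return (1);	/* 0: fall back to the provider's size */
	return (requested > 0 && requested % native == 0);
}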
*/ G_ELI_DEBUG(1, "Skipping configuration of not attached " "provider %s.", prov); continue; } if (sc->sc_flags & G_ELI_FLAG_RO) { gctl_error(req, "Cannot change configuration of " "read-only provider %s.", prov); continue; } if (*boot && (sc->sc_flags & G_ELI_FLAG_BOOT)) { G_ELI_DEBUG(1, "BOOT flag already configured for %s.", prov); continue; } else if (*noboot && !(sc->sc_flags & G_ELI_FLAG_BOOT)) { G_ELI_DEBUG(1, "BOOT flag not configured for %s.", prov); continue; } if (*notrim && (sc->sc_flags & G_ELI_FLAG_NODELETE)) { G_ELI_DEBUG(1, "TRIM disable flag already configured for %s.", prov); continue; } else if (*trim && !(sc->sc_flags & G_ELI_FLAG_NODELETE)) { G_ELI_DEBUG(1, "TRIM disable flag not configured for %s.", prov); continue; } if (*geliboot && (sc->sc_flags & G_ELI_FLAG_GELIBOOT)) { G_ELI_DEBUG(1, "GELIBOOT flag already configured for %s.", prov); continue; } else if (*nogeliboot && !(sc->sc_flags & G_ELI_FLAG_GELIBOOT)) { G_ELI_DEBUG(1, "GELIBOOT flag not configured for %s.", prov); continue; } if (*displaypass && (sc->sc_flags & G_ELI_FLAG_GELIDISPLAYPASS)) { G_ELI_DEBUG(1, "GELIDISPLAYPASS flag already configured for %s.", prov); continue; } else if (*nodisplaypass && !(sc->sc_flags & G_ELI_FLAG_GELIDISPLAYPASS)) { G_ELI_DEBUG(1, "GELIDISPLAYPASS flag not configured for %s.", prov); continue; } if (!(sc->sc_flags & G_ELI_FLAG_ONETIME)) { /* * ONETIME providers don't write metadata to * disk, so don't try reading it. This means * we're bit-flipping uninitialized memory in md * below, but that's OK; we don't do anything * with it later. */ cp = LIST_FIRST(&sc->sc_geom->consumer); pp = cp->provider; error = g_eli_read_metadata(mp, pp, &md); if (error != 0) { gctl_error(req, "Cannot read metadata from %s (error=%d).", prov, error); continue; } } if (*boot) { md.md_flags |= G_ELI_FLAG_BOOT; sc->sc_flags |= G_ELI_FLAG_BOOT; } else if (*noboot) { md.md_flags &= ~G_ELI_FLAG_BOOT; sc->sc_flags &= ~G_ELI_FLAG_BOOT; } if (*notrim) { md.md_flags |= G_ELI_FLAG_NODELETE; sc->sc_flags |= G_ELI_FLAG_NODELETE; } else if (*trim) { md.md_flags &= ~G_ELI_FLAG_NODELETE; sc->sc_flags &= ~G_ELI_FLAG_NODELETE; } if (*geliboot) { md.md_flags |= G_ELI_FLAG_GELIBOOT; sc->sc_flags |= G_ELI_FLAG_GELIBOOT; } else if (*nogeliboot) { md.md_flags &= ~G_ELI_FLAG_GELIBOOT; sc->sc_flags &= ~G_ELI_FLAG_GELIBOOT; } if (*displaypass) { md.md_flags |= G_ELI_FLAG_GELIDISPLAYPASS; sc->sc_flags |= G_ELI_FLAG_GELIDISPLAYPASS; } else if (*nodisplaypass) { md.md_flags &= ~G_ELI_FLAG_GELIDISPLAYPASS; sc->sc_flags &= ~G_ELI_FLAG_GELIDISPLAYPASS; } if (sc->sc_flags & G_ELI_FLAG_ONETIME) { /* There's no metadata on disk so we are done here. 
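/*
 * [Editor's sketch] Each option pair handled above (-b/-B, -t/-T, -g/-G,
 * -d/-D) sets or clears one flag bit in both the on-disk metadata and the
 * in-core softc; note that the -t/-T pair toggles an inverted NODELETE bit.
 * A hypothetical helper capturing that repeated pattern (the flag-field
 * widths are an assumption here):
 */
#include <stdint.h>

static void
toggle_flag(uint32_t *md_flags, uint32_t *sc_flags, uint32_t bit,
    int set, int clear)
{

	if (set) {
		*md_flags |= bit;
		*sc_flags |= bit;
	} else if (clear) {
		*md_flags &= ~bit;
		*sc_flags &= ~bit;
	}
}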
*/ continue; } sector = malloc(pp->sectorsize, M_ELI, M_WAITOK | M_ZERO); eli_metadata_encode(&md, sector); error = g_write_data(cp, pp->mediasize - pp->sectorsize, sector, pp->sectorsize); if (error != 0) { gctl_error(req, "Cannot store metadata on %s (error=%d).", prov, error); } explicit_bzero(&md, sizeof(md)); explicit_bzero(sector, pp->sectorsize); free(sector, M_ELI); } } static void g_eli_ctl_setkey(struct gctl_req *req, struct g_class *mp) { struct g_eli_softc *sc; struct g_eli_metadata md; struct g_provider *pp; struct g_consumer *cp; const char *name; u_char *key, *mkeydst, *sector; intmax_t *valp; int keysize, nkey, error; g_topology_assert(); name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", 0); return; } key = gctl_get_param(req, "key", &keysize); if (key == NULL || keysize != G_ELI_USERKEYLEN) { gctl_error(req, "No '%s' argument.", "key"); return; } sc = g_eli_find_device(mp, name); if (sc == NULL) { gctl_error(req, "Provider %s is invalid.", name); return; } if (sc->sc_flags & G_ELI_FLAG_RO) { gctl_error(req, "Cannot change keys for read-only provider."); return; } cp = LIST_FIRST(&sc->sc_geom->consumer); pp = cp->provider; error = g_eli_read_metadata(mp, pp, &md); if (error != 0) { gctl_error(req, "Cannot read metadata from %s (error=%d).", name, error); return; } valp = gctl_get_paraml(req, "keyno", sizeof(*valp)); if (valp == NULL) { gctl_error(req, "No '%s' argument.", "keyno"); return; } if (*valp != -1) nkey = *valp; else nkey = sc->sc_nkey; if (nkey < 0 || nkey >= G_ELI_MAXMKEYS) { gctl_error(req, "Invalid '%s' argument.", "keyno"); return; } valp = gctl_get_paraml(req, "iterations", sizeof(*valp)); if (valp == NULL) { gctl_error(req, "No '%s' argument.", "iterations"); return; } /* Check if iterations number should and can be changed. */ if (*valp != -1 && md.md_iterations == -1) { md.md_iterations = *valp; } else if (*valp != -1 && *valp != md.md_iterations) { if (bitcount32(md.md_keys) != 1) { gctl_error(req, "To be able to use '-i' option, only " "one key can be defined."); return; } if (md.md_keys != (1 << nkey)) { gctl_error(req, "Only already defined key can be " "changed when '-i' option is used."); return; } md.md_iterations = *valp; } mkeydst = md.md_mkeys + nkey * G_ELI_MKEYLEN; md.md_keys |= (1 << nkey); bcopy(sc->sc_mkey, mkeydst, sizeof(sc->sc_mkey)); /* Encrypt Master Key with the new key. */ error = g_eli_mkey_encrypt(md.md_ealgo, key, md.md_keylen, mkeydst); explicit_bzero(key, keysize); if (error != 0) { explicit_bzero(&md, sizeof(md)); gctl_error(req, "Cannot encrypt Master Key (error=%d).", error); return; } sector = malloc(pp->sectorsize, M_ELI, M_WAITOK | M_ZERO); /* Store metadata with fresh key. 
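/*
 * [Editor's sketch] Every metadata update in this file encodes the
 * structure into a sector-sized buffer and writes it at mediasize minus
 * sectorsize: GELI metadata lives in the provider's last sector.  The
 * offset computation in isolation ("struct prov" stands in for the few
 * g_provider fields used):
 */
#include <sys/types.h>

struct prov {
	off_t	mediasize;
	u_int	sectorsize;
};

static off_t
geli_md_offset(const struct prov *pp)
{

	return (pp->mediasize - pp->sectorsize);	/* last sector */
}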
*/ eli_metadata_encode(&md, sector); explicit_bzero(&md, sizeof(md)); error = g_write_data(cp, pp->mediasize - pp->sectorsize, sector, pp->sectorsize); explicit_bzero(sector, pp->sectorsize); free(sector, M_ELI); if (error != 0) { gctl_error(req, "Cannot store metadata on %s (error=%d).", pp->name, error); return; } G_ELI_DEBUG(1, "Key %u changed on %s.", nkey, pp->name); } static void g_eli_ctl_delkey(struct gctl_req *req, struct g_class *mp) { struct g_eli_softc *sc; struct g_eli_metadata md; struct g_provider *pp; struct g_consumer *cp; const char *name; u_char *mkeydst, *sector; intmax_t *valp; size_t keysize; int error, nkey, *all, *force; u_int i; g_topology_assert(); nkey = 0; /* fixes causeless gcc warning */ name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", 0); return; } sc = g_eli_find_device(mp, name); if (sc == NULL) { gctl_error(req, "Provider %s is invalid.", name); return; } if (sc->sc_flags & G_ELI_FLAG_RO) { gctl_error(req, "Cannot delete keys for read-only provider."); return; } cp = LIST_FIRST(&sc->sc_geom->consumer); pp = cp->provider; error = g_eli_read_metadata(mp, pp, &md); if (error != 0) { gctl_error(req, "Cannot read metadata from %s (error=%d).", name, error); return; } all = gctl_get_paraml(req, "all", sizeof(*all)); if (all == NULL) { gctl_error(req, "No '%s' argument.", "all"); return; } if (*all) { mkeydst = md.md_mkeys; keysize = sizeof(md.md_mkeys); } else { force = gctl_get_paraml(req, "force", sizeof(*force)); if (force == NULL) { gctl_error(req, "No '%s' argument.", "force"); return; } valp = gctl_get_paraml(req, "keyno", sizeof(*valp)); if (valp == NULL) { gctl_error(req, "No '%s' argument.", "keyno"); return; } if (*valp != -1) nkey = *valp; else nkey = sc->sc_nkey; if (nkey < 0 || nkey >= G_ELI_MAXMKEYS) { gctl_error(req, "Invalid '%s' argument.", "keyno"); return; } if (!(md.md_keys & (1 << nkey)) && !*force) { gctl_error(req, "Master Key %u is not set.", nkey); return; } md.md_keys &= ~(1 << nkey); if (md.md_keys == 0 && !*force) { gctl_error(req, "This is the last Master Key. Use '-f' " "flag if you really want to remove it."); return; } mkeydst = md.md_mkeys + nkey * G_ELI_MKEYLEN; keysize = G_ELI_MKEYLEN; } sector = malloc(pp->sectorsize, M_ELI, M_WAITOK | M_ZERO); for (i = 0; i <= g_eli_overwrites; i++) { if (i == g_eli_overwrites) explicit_bzero(mkeydst, keysize); else arc4rand(mkeydst, keysize, 0); /* Store metadata with destroyed key. */ eli_metadata_encode(&md, sector); error = g_write_data(cp, pp->mediasize - pp->sectorsize, sector, pp->sectorsize); if (error != 0) { G_ELI_DEBUG(0, "Cannot store metadata on %s " "(error=%d).", pp->name, error); } /* * Flush write cache so we don't overwrite data N times in cache * and only once on disk. 
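/*
 * [Editor's sketch] The key-destruction loop in g_eli_ctl_delkey() above
 * runs g_eli_overwrites + 1 passes: random data on every pass, zeros on
 * the last, with a cache flush after each write so every pass reaches the
 * media instead of coalescing in the drive's write cache.  A standalone
 * userland rendition (the write and flush callbacks are placeholders):
 */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static void
scrub_passes(unsigned char *buf, size_t len, unsigned overwrites,
    void (*write_cb)(const unsigned char *, size_t), void (*flush_cb)(void))
{
	unsigned i;

	for (i = 0; i <= overwrites; i++) {
		if (i == overwrites)
			memset(buf, 0, len);		/* final pass: zeros */
		else
			arc4random_buf(buf, len);	/* random fill */
		write_cb(buf, len);
		flush_cb();	/* force this pass out of the write cache */
	}
}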
*/ (void)g_io_flush(cp); } explicit_bzero(&md, sizeof(md)); explicit_bzero(sector, pp->sectorsize); free(sector, M_ELI); if (*all) G_ELI_DEBUG(1, "All keys removed from %s.", pp->name); else G_ELI_DEBUG(1, "Key %d removed from %s.", nkey, pp->name); } static void g_eli_suspend_one(struct g_eli_softc *sc, struct gctl_req *req) { struct g_eli_worker *wr; g_topology_assert(); KASSERT(sc != NULL, ("NULL sc")); if (sc->sc_flags & G_ELI_FLAG_ONETIME) { gctl_error(req, "Device %s is using one-time key, suspend not supported.", sc->sc_name); return; } mtx_lock(&sc->sc_queue_mtx); if (sc->sc_flags & G_ELI_FLAG_SUSPEND) { mtx_unlock(&sc->sc_queue_mtx); gctl_error(req, "Device %s already suspended.", sc->sc_name); return; } sc->sc_flags |= G_ELI_FLAG_SUSPEND; wakeup(sc); for (;;) { LIST_FOREACH(wr, &sc->sc_workers, w_next) { if (wr->w_active) break; } if (wr == NULL) break; /* Not all threads suspended. */ msleep(&sc->sc_workers, &sc->sc_queue_mtx, PRIBIO, "geli:suspend", 0); } /* * Clear sensitive data on suspend, they will be recovered on resume. */ explicit_bzero(sc->sc_mkey, sizeof(sc->sc_mkey)); g_eli_key_destroy(sc); explicit_bzero(sc->sc_akey, sizeof(sc->sc_akey)); explicit_bzero(&sc->sc_akeyctx, sizeof(sc->sc_akeyctx)); explicit_bzero(sc->sc_ivkey, sizeof(sc->sc_ivkey)); explicit_bzero(&sc->sc_ivctx, sizeof(sc->sc_ivctx)); mtx_unlock(&sc->sc_queue_mtx); G_ELI_DEBUG(0, "Device %s has been suspended.", sc->sc_name); } static void g_eli_ctl_suspend(struct gctl_req *req, struct g_class *mp) { struct g_eli_softc *sc; int *all, *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } all = gctl_get_paraml(req, "all", sizeof(*all)); if (all == NULL) { gctl_error(req, "No '%s' argument.", "all"); return; } if (!*all && *nargs == 0) { gctl_error(req, "Too few arguments."); return; } if (*all) { struct g_geom *gp, *gp2; LIST_FOREACH_SAFE(gp, &mp->geom, geom, gp2) { sc = gp->softc; if (sc->sc_flags & G_ELI_FLAG_ONETIME) { G_ELI_DEBUG(0, "Device %s is using one-time key, suspend not supported, skipping.", sc->sc_name); continue; } g_eli_suspend_one(sc, req); } } else { const char *prov; char param[16]; int i; for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); prov = gctl_get_asciiparam(req, param); if (prov == NULL) { G_ELI_DEBUG(0, "No 'arg%d' argument.", i); continue; } sc = g_eli_find_device(mp, prov); if (sc == NULL) { G_ELI_DEBUG(0, "No such provider: %s.", prov); continue; } g_eli_suspend_one(sc, req); } } } static void g_eli_ctl_resume(struct gctl_req *req, struct g_class *mp) { struct g_eli_metadata md; struct g_eli_softc *sc; struct g_provider *pp; struct g_consumer *cp; const char *name; u_char *key, mkey[G_ELI_DATAIVKEYLEN]; int *nargs, keysize, error; u_int nkey; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } if (*nargs != 1) { gctl_error(req, "Invalid number of arguments."); return; } name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", 0); return; } key = gctl_get_param(req, "key", &keysize); if (key == NULL || keysize != G_ELI_USERKEYLEN) { gctl_error(req, "No '%s' argument.", "key"); return; } sc = g_eli_find_device(mp, name); if (sc == NULL) { gctl_error(req, "Provider %s is invalid.", name); return; } cp = LIST_FIRST(&sc->sc_geom->consumer); pp = cp->provider; error = g_eli_read_metadata(mp, pp, &md); if 
(error != 0) { gctl_error(req, "Cannot read metadata from %s (error=%d).", name, error); return; } if (md.md_keys == 0x00) { explicit_bzero(&md, sizeof(md)); gctl_error(req, "No valid keys on %s.", pp->name); return; } error = g_eli_mkey_decrypt(&md, key, mkey, &nkey); explicit_bzero(key, keysize); if (error == -1) { explicit_bzero(&md, sizeof(md)); gctl_error(req, "Wrong key for %s.", pp->name); return; } else if (error > 0) { explicit_bzero(&md, sizeof(md)); gctl_error(req, "Cannot decrypt Master Key for %s (error=%d).", pp->name, error); return; } G_ELI_DEBUG(1, "Using Master Key %u for %s.", nkey, pp->name); mtx_lock(&sc->sc_queue_mtx); if (!(sc->sc_flags & G_ELI_FLAG_SUSPEND)) gctl_error(req, "Device %s is not suspended.", name); else { /* Restore sc_mkey, sc_ekeys, sc_akey and sc_ivkey. */ g_eli_mkey_propagate(sc, mkey); sc->sc_flags &= ~G_ELI_FLAG_SUSPEND; G_ELI_DEBUG(1, "Resumed %s.", pp->name); wakeup(sc); } mtx_unlock(&sc->sc_queue_mtx); explicit_bzero(mkey, sizeof(mkey)); explicit_bzero(&md, sizeof(md)); } static int g_eli_kill_one(struct g_eli_softc *sc) { struct g_provider *pp; struct g_consumer *cp; int error = 0; g_topology_assert(); if (sc == NULL) return (ENOENT); pp = LIST_FIRST(&sc->sc_geom->provider); g_error_provider(pp, ENXIO); cp = LIST_FIRST(&sc->sc_geom->consumer); pp = cp->provider; if (sc->sc_flags & G_ELI_FLAG_RO) { G_ELI_DEBUG(0, "WARNING: Metadata won't be erased on read-only " "provider: %s.", pp->name); } else { u_char *sector; u_int i; int err; sector = malloc(pp->sectorsize, M_ELI, M_WAITOK); for (i = 0; i <= g_eli_overwrites; i++) { if (i == g_eli_overwrites) bzero(sector, pp->sectorsize); else arc4rand(sector, pp->sectorsize, 0); err = g_write_data(cp, pp->mediasize - pp->sectorsize, sector, pp->sectorsize); if (err != 0) { G_ELI_DEBUG(0, "Cannot erase metadata on %s " "(error=%d).", pp->name, err); if (error == 0) error = err; } /* * Flush write cache so we don't overwrite data N times * in cache and only once on disk. 
*/ (void)g_io_flush(cp); } free(sector, M_ELI); } if (error == 0) G_ELI_DEBUG(0, "%s has been killed.", pp->name); g_eli_destroy(sc, TRUE); return (error); } static void g_eli_ctl_kill(struct gctl_req *req, struct g_class *mp) { int *all, *nargs; int error; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } all = gctl_get_paraml(req, "all", sizeof(*all)); if (all == NULL) { gctl_error(req, "No '%s' argument.", "all"); return; } if (!*all && *nargs == 0) { gctl_error(req, "Too few arguments."); return; } if (*all) { struct g_geom *gp, *gp2; LIST_FOREACH_SAFE(gp, &mp->geom, geom, gp2) { error = g_eli_kill_one(gp->softc); if (error != 0) gctl_error(req, "Not fully done."); } } else { struct g_eli_softc *sc; const char *prov; char param[16]; int i; for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); prov = gctl_get_asciiparam(req, param); if (prov == NULL) { G_ELI_DEBUG(0, "No 'arg%d' argument.", i); continue; } sc = g_eli_find_device(mp, prov); if (sc == NULL) { G_ELI_DEBUG(0, "No such provider: %s.", prov); continue; } error = g_eli_kill_one(sc); if (error != 0) gctl_error(req, "Not fully done."); } } } void g_eli_config(struct gctl_req *req, struct g_class *mp, const char *verb) { uint32_t *version; g_topology_assert(); version = gctl_get_paraml(req, "version", sizeof(*version)); if (version == NULL) { gctl_error(req, "No '%s' argument.", "version"); return; } while (*version != G_ELI_VERSION) { if (G_ELI_VERSION == G_ELI_VERSION_06 && *version == G_ELI_VERSION_05) { /* Compatible. */ break; } if (G_ELI_VERSION == G_ELI_VERSION_07 && (*version == G_ELI_VERSION_05 || *version == G_ELI_VERSION_06)) { /* Compatible. */ break; } gctl_error(req, "Userland and kernel parts are out of sync."); return; } if (strcmp(verb, "attach") == 0) g_eli_ctl_attach(req, mp); else if (strcmp(verb, "detach") == 0 || strcmp(verb, "stop") == 0) g_eli_ctl_detach(req, mp); else if (strcmp(verb, "onetime") == 0) g_eli_ctl_onetime(req, mp); else if (strcmp(verb, "configure") == 0) g_eli_ctl_configure(req, mp); else if (strcmp(verb, "setkey") == 0) g_eli_ctl_setkey(req, mp); else if (strcmp(verb, "delkey") == 0) g_eli_ctl_delkey(req, mp); else if (strcmp(verb, "suspend") == 0) g_eli_ctl_suspend(req, mp); else if (strcmp(verb, "resume") == 0) g_eli_ctl_resume(req, mp); else if (strcmp(verb, "kill") == 0) g_eli_ctl_kill(req, mp); else gctl_error(req, "Unknown verb."); } Index: user/markj/netdump/sys/geom/gate/g_gate.c =================================================================== --- user/markj/netdump/sys/geom/gate/g_gate.c (revision 332407) +++ user/markj/netdump/sys/geom/gate/g_gate.c (revision 332408) @@ -1,966 +1,967 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004-2006 Pawel Jakub Dawidek * Copyright (c) 2009-2010 The FreeBSD Foundation * All rights reserved. * * Portions of this software were developed by Pawel Jakub Dawidek * under sponsorship from the FreeBSD Foundation. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_gate, "GEOM Gate module"); static MALLOC_DEFINE(M_GATE, "gg_data", "GEOM Gate Data"); SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, gate, CTLFLAG_RW, 0, "GEOM_GATE configuration"); static int g_gate_debug = 0; SYSCTL_INT(_kern_geom_gate, OID_AUTO, debug, CTLFLAG_RWTUN, &g_gate_debug, 0, "Debug level"); static u_int g_gate_maxunits = 256; SYSCTL_UINT(_kern_geom_gate, OID_AUTO, maxunits, CTLFLAG_RDTUN, &g_gate_maxunits, 0, "Maximum number of ggate devices"); struct g_class g_gate_class = { .name = G_GATE_CLASS_NAME, .version = G_VERSION, }; static struct cdev *status_dev; static d_ioctl_t g_gate_ioctl; static struct cdevsw g_gate_cdevsw = { .d_version = D_VERSION, .d_ioctl = g_gate_ioctl, .d_name = G_GATE_CTL_NAME }; static struct g_gate_softc **g_gate_units; static u_int g_gate_nunits; static struct mtx g_gate_units_lock; static int g_gate_destroy(struct g_gate_softc *sc, boolean_t force) { struct bio_queue_head queue; struct g_provider *pp; struct g_consumer *cp; struct g_geom *gp; struct bio *bp; g_topology_assert(); mtx_assert(&g_gate_units_lock, MA_OWNED); pp = sc->sc_provider; if (!force && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { mtx_unlock(&g_gate_units_lock); return (EBUSY); } mtx_unlock(&g_gate_units_lock); mtx_lock(&sc->sc_queue_mtx); if ((sc->sc_flags & G_GATE_FLAG_DESTROY) == 0) sc->sc_flags |= G_GATE_FLAG_DESTROY; wakeup(sc); mtx_unlock(&sc->sc_queue_mtx); gp = pp->geom; g_wither_provider(pp, ENXIO); callout_drain(&sc->sc_callout); bioq_init(&queue); mtx_lock(&sc->sc_queue_mtx); while ((bp = bioq_takefirst(&sc->sc_inqueue)) != NULL) { sc->sc_queue_count--; bioq_insert_tail(&queue, bp); } while ((bp = bioq_takefirst(&sc->sc_outqueue)) != NULL) { sc->sc_queue_count--; bioq_insert_tail(&queue, bp); } mtx_unlock(&sc->sc_queue_mtx); g_topology_unlock(); while ((bp = bioq_takefirst(&queue)) != NULL) { G_GATE_LOGREQ(1, bp, "Request canceled."); g_io_deliver(bp, ENXIO); } mtx_lock(&g_gate_units_lock); /* One reference is ours. 
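/*
 * [Editor's sketch] g_gate_destroy() below drops its own reference and
 * then sleeps on &sc->sc_ref until g_gate_release() wakes it after the
 * last holder lets go.  The same handshake expressed as a standalone
 * userland analog with pthreads (illustrative types and names):
 */
#include <pthread.h>

struct obj {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	int		ref;
};

static void
obj_destroy_wait(struct obj *o)
{

	pthread_mutex_lock(&o->lock);
	o->ref--;			/* "one reference is ours" */
	while (o->ref > 0)
		pthread_cond_wait(&o->cv, &o->lock);
	pthread_mutex_unlock(&o->lock);
	/* no holders remain; tearing down is now safe */
}

static void
obj_release(struct obj *o)
{

	pthread_mutex_lock(&o->lock);
	if (--o->ref == 0)
		pthread_cond_broadcast(&o->cv);	/* wake the destroyer */
	pthread_mutex_unlock(&o->lock);
}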
*/ sc->sc_ref--; while (sc->sc_ref > 0) msleep(&sc->sc_ref, &g_gate_units_lock, 0, "gg:destroy", 0); g_gate_units[sc->sc_unit] = NULL; KASSERT(g_gate_nunits > 0, ("negative g_gate_nunits?")); g_gate_nunits--; mtx_unlock(&g_gate_units_lock); mtx_destroy(&sc->sc_queue_mtx); g_topology_lock(); if ((cp = sc->sc_readcons) != NULL) { sc->sc_readcons = NULL; (void)g_access(cp, -1, 0, 0); g_detach(cp); g_destroy_consumer(cp); } G_GATE_DEBUG(1, "Device %s destroyed.", gp->name); gp->softc = NULL; g_wither_geom(gp, ENXIO); sc->sc_provider = NULL; free(sc, M_GATE); return (0); } static int g_gate_access(struct g_provider *pp, int dr, int dw, int de) { struct g_gate_softc *sc; if (dr <= 0 && dw <= 0 && de <= 0) return (0); sc = pp->geom->softc; if (sc == NULL || (sc->sc_flags & G_GATE_FLAG_DESTROY) != 0) return (ENXIO); /* XXX: Hack to allow read-only mounts. */ #if 0 if ((sc->sc_flags & G_GATE_FLAG_READONLY) != 0 && dw > 0) return (EPERM); #endif if ((sc->sc_flags & G_GATE_FLAG_WRITEONLY) != 0 && dr > 0) return (EPERM); return (0); } static void g_gate_queue_io(struct bio *bp) { struct g_gate_softc *sc; sc = bp->bio_to->geom->softc; if (sc == NULL || (sc->sc_flags & G_GATE_FLAG_DESTROY) != 0) { g_io_deliver(bp, ENXIO); return; } mtx_lock(&sc->sc_queue_mtx); if (sc->sc_queue_size > 0 && sc->sc_queue_count > sc->sc_queue_size) { mtx_unlock(&sc->sc_queue_mtx); G_GATE_LOGREQ(1, bp, "Queue full, request canceled."); g_io_deliver(bp, ENOMEM); return; } bp->bio_driver1 = (void *)sc->sc_seq; sc->sc_seq++; sc->sc_queue_count++; bioq_insert_tail(&sc->sc_inqueue, bp); wakeup(sc); mtx_unlock(&sc->sc_queue_mtx); } static void g_gate_done(struct bio *cbp) { struct bio *pbp; pbp = cbp->bio_parent; if (cbp->bio_error == 0) { pbp->bio_completed = cbp->bio_completed; g_destroy_bio(cbp); pbp->bio_inbed++; g_io_deliver(pbp, 0); } else { /* If direct read failed, pass it through userland daemon. */ g_destroy_bio(cbp); pbp->bio_children--; g_gate_queue_io(pbp); } } static void g_gate_start(struct bio *pbp) { struct g_gate_softc *sc; sc = pbp->bio_to->geom->softc; if (sc == NULL || (sc->sc_flags & G_GATE_FLAG_DESTROY) != 0) { g_io_deliver(pbp, ENXIO); return; } G_GATE_LOGREQ(2, pbp, "Request received."); switch (pbp->bio_cmd) { case BIO_READ: if (sc->sc_readcons != NULL) { struct bio *cbp; cbp = g_clone_bio(pbp); if (cbp == NULL) { g_io_deliver(pbp, ENOMEM); return; } cbp->bio_done = g_gate_done; cbp->bio_offset = pbp->bio_offset + sc->sc_readoffset; cbp->bio_to = sc->sc_readcons->provider; g_io_request(cbp, sc->sc_readcons); return; } break; case BIO_DELETE: case BIO_WRITE: case BIO_FLUSH: /* XXX: Hack to allow read-only mounts. 
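/*
 * [Editor's sketch] g_gate_queue_io() above applies simple backpressure:
 * when the incoming queue already holds more than sc_queue_size requests,
 * a new bio is failed with ENOMEM rather than queued (a zero limit means
 * unbounded).  The admission test in isolation:
 */
#include <errno.h>

static int
queue_admit(unsigned count, unsigned limit)
{

	if (limit > 0 && count > limit)
		return (ENOMEM);	/* "Queue full, request canceled." */
	return (0);
}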
*/ if ((sc->sc_flags & G_GATE_FLAG_READONLY) != 0) { g_io_deliver(pbp, EPERM); return; } break; case BIO_GETATTR: default: G_GATE_LOGREQ(2, pbp, "Ignoring request."); g_io_deliver(pbp, EOPNOTSUPP); return; } g_gate_queue_io(pbp); } static struct g_gate_softc * g_gate_hold(int unit, const char *name) { struct g_gate_softc *sc = NULL; mtx_lock(&g_gate_units_lock); if (unit >= 0 && unit < g_gate_maxunits) sc = g_gate_units[unit]; else if (unit == G_GATE_NAME_GIVEN) { KASSERT(name != NULL, ("name is NULL")); for (unit = 0; unit < g_gate_maxunits; unit++) { if (g_gate_units[unit] == NULL) continue; if (strcmp(name, g_gate_units[unit]->sc_provider->name) != 0) { continue; } sc = g_gate_units[unit]; break; } } if (sc != NULL) sc->sc_ref++; mtx_unlock(&g_gate_units_lock); return (sc); } static void g_gate_release(struct g_gate_softc *sc) { g_topology_assert_not(); mtx_lock(&g_gate_units_lock); sc->sc_ref--; KASSERT(sc->sc_ref >= 0, ("Negative sc_ref for %s.", sc->sc_name)); if (sc->sc_ref == 0 && (sc->sc_flags & G_GATE_FLAG_DESTROY) != 0) wakeup(&sc->sc_ref); mtx_unlock(&g_gate_units_lock); } static int g_gate_getunit(int unit, int *errorp) { mtx_assert(&g_gate_units_lock, MA_OWNED); if (unit >= 0) { if (unit >= g_gate_maxunits) *errorp = EINVAL; else if (g_gate_units[unit] == NULL) return (unit); else *errorp = EEXIST; } else { for (unit = 0; unit < g_gate_maxunits; unit++) { if (g_gate_units[unit] == NULL) return (unit); } *errorp = ENFILE; } return (-1); } static void g_gate_guard(void *arg) { struct bio_queue_head queue; struct g_gate_softc *sc; struct bintime curtime; struct bio *bp, *bp2; sc = arg; binuptime(&curtime); g_gate_hold(sc->sc_unit, NULL); bioq_init(&queue); mtx_lock(&sc->sc_queue_mtx); TAILQ_FOREACH_SAFE(bp, &sc->sc_inqueue.queue, bio_queue, bp2) { if (curtime.sec - bp->bio_t0.sec < 5) continue; bioq_remove(&sc->sc_inqueue, bp); sc->sc_queue_count--; bioq_insert_tail(&queue, bp); } TAILQ_FOREACH_SAFE(bp, &sc->sc_outqueue.queue, bio_queue, bp2) { if (curtime.sec - bp->bio_t0.sec < 5) continue; bioq_remove(&sc->sc_outqueue, bp); sc->sc_queue_count--; bioq_insert_tail(&queue, bp); } mtx_unlock(&sc->sc_queue_mtx); while ((bp = bioq_takefirst(&queue)) != NULL) { G_GATE_LOGREQ(1, bp, "Request timeout."); g_io_deliver(bp, EIO); } if ((sc->sc_flags & G_GATE_FLAG_DESTROY) == 0) { callout_reset(&sc->sc_callout, sc->sc_timeout * hz, g_gate_guard, sc); } g_gate_release(sc); } static void g_gate_orphan(struct g_consumer *cp) { struct g_gate_softc *sc; struct g_geom *gp; g_topology_assert(); gp = cp->geom; sc = gp->softc; if (sc == NULL) return; KASSERT(cp == sc->sc_readcons, ("cp=%p sc_readcons=%p", cp, sc->sc_readcons)); sc->sc_readcons = NULL; G_GATE_DEBUG(1, "Destroying read consumer on provider %s orphan.", cp->provider->name); (void)g_access(cp, -1, 0, 0); g_detach(cp); g_destroy_consumer(cp); } static void g_gate_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_gate_softc *sc; sc = gp->softc; if (sc == NULL || pp != NULL || cp != NULL) return; sc = g_gate_hold(sc->sc_unit, NULL); if (sc == NULL) return; if ((sc->sc_flags & G_GATE_FLAG_READONLY) != 0) { sbuf_printf(sb, "%s%s\n", indent, "read-only"); } else if ((sc->sc_flags & G_GATE_FLAG_WRITEONLY) != 0) { sbuf_printf(sb, "%s%s\n", indent, "write-only"); } else { sbuf_printf(sb, "%s%s\n", indent, "read-write"); } if (sc->sc_readcons != NULL) { sbuf_printf(sb, "%s%jd\n", indent, (intmax_t)sc->sc_readoffset); sbuf_printf(sb, "%s%s\n", indent, 
sc->sc_readcons->provider->name); } sbuf_printf(sb, "%s%u\n", indent, sc->sc_timeout); sbuf_printf(sb, "%s%s\n", indent, sc->sc_info); sbuf_printf(sb, "%s%u\n", indent, sc->sc_queue_count); sbuf_printf(sb, "%s%u\n", indent, sc->sc_queue_size); sbuf_printf(sb, "%s%u\n", indent, sc->sc_ref); sbuf_printf(sb, "%s%d\n", indent, sc->sc_unit); g_topology_unlock(); g_gate_release(sc); g_topology_lock(); } static int g_gate_create(struct g_gate_ctl_create *ggio) { struct g_gate_softc *sc; struct g_geom *gp; struct g_provider *pp, *ropp; struct g_consumer *cp; char name[NAME_MAX]; int error = 0, unit; if (ggio->gctl_mediasize <= 0) { G_GATE_DEBUG(1, "Invalid media size."); return (EINVAL); } if (ggio->gctl_sectorsize <= 0) { G_GATE_DEBUG(1, "Invalid sector size."); return (EINVAL); } if (!powerof2(ggio->gctl_sectorsize)) { G_GATE_DEBUG(1, "Invalid sector size."); return (EINVAL); } if ((ggio->gctl_mediasize % ggio->gctl_sectorsize) != 0) { G_GATE_DEBUG(1, "Invalid media size."); return (EINVAL); } if ((ggio->gctl_flags & G_GATE_FLAG_READONLY) != 0 && (ggio->gctl_flags & G_GATE_FLAG_WRITEONLY) != 0) { G_GATE_DEBUG(1, "Invalid flags."); return (EINVAL); } if (ggio->gctl_unit != G_GATE_UNIT_AUTO && ggio->gctl_unit != G_GATE_NAME_GIVEN && ggio->gctl_unit < 0) { G_GATE_DEBUG(1, "Invalid unit number."); return (EINVAL); } if (ggio->gctl_unit == G_GATE_NAME_GIVEN && ggio->gctl_name[0] == '\0') { G_GATE_DEBUG(1, "No device name."); return (EINVAL); } sc = malloc(sizeof(*sc), M_GATE, M_WAITOK | M_ZERO); sc->sc_flags = (ggio->gctl_flags & G_GATE_USERFLAGS); strlcpy(sc->sc_info, ggio->gctl_info, sizeof(sc->sc_info)); sc->sc_seq = 1; bioq_init(&sc->sc_inqueue); bioq_init(&sc->sc_outqueue); mtx_init(&sc->sc_queue_mtx, "gg:queue", NULL, MTX_DEF); sc->sc_queue_count = 0; sc->sc_queue_size = ggio->gctl_maxcount; if (sc->sc_queue_size > G_GATE_MAX_QUEUE_SIZE) sc->sc_queue_size = G_GATE_MAX_QUEUE_SIZE; sc->sc_timeout = ggio->gctl_timeout; callout_init(&sc->sc_callout, 1); mtx_lock(&g_gate_units_lock); sc->sc_unit = g_gate_getunit(ggio->gctl_unit, &error); if (sc->sc_unit < 0) goto fail1; if (ggio->gctl_unit == G_GATE_NAME_GIVEN) snprintf(name, sizeof(name), "%s", ggio->gctl_name); else { snprintf(name, sizeof(name), "%s%d", G_GATE_PROVIDER_NAME, sc->sc_unit); } /* Check for name collision. 
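/*
 * [Editor's sketch] Unit selection (g_gate_getunit() above) distinguishes
 * an explicit request, which must name a free in-range slot, from a
 * negative "auto" request, which takes the first free unit and fails with
 * ENFILE when the table is full.  Standalone rendition of that policy:
 */
#include <errno.h>
#include <stddef.h>

static int
getunit(void **tab, int maxunits, int want, int *errp)
{
	int u;

	if (want >= 0) {
		if (want >= maxunits)
			*errp = EINVAL;		/* out of range */
		else if (tab[want] != NULL)
			*errp = EEXIST;		/* slot already taken */
		else
			return (want);
		return (-1);
	}
	for (u = 0; u < maxunits; u++)
		if (tab[u] == NULL)
			return (u);		/* first free unit */
	*errp = ENFILE;				/* table full */
	return (-1);
}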
*/ for (unit = 0; unit < g_gate_maxunits; unit++) { if (g_gate_units[unit] == NULL) continue; if (strcmp(name, g_gate_units[unit]->sc_name) != 0) continue; error = EEXIST; goto fail1; } sc->sc_name = name; g_gate_units[sc->sc_unit] = sc; g_gate_nunits++; mtx_unlock(&g_gate_units_lock); g_topology_lock(); if (ggio->gctl_readprov[0] == '\0') { ropp = NULL; } else { ropp = g_provider_by_name(ggio->gctl_readprov); if (ropp == NULL) { G_GATE_DEBUG(1, "Provider %s doesn't exist.", ggio->gctl_readprov); error = EINVAL; goto fail2; } if ((ggio->gctl_readoffset % ggio->gctl_sectorsize) != 0) { G_GATE_DEBUG(1, "Invalid read offset."); error = EINVAL; goto fail2; } if (ggio->gctl_mediasize + ggio->gctl_readoffset > ropp->mediasize) { G_GATE_DEBUG(1, "Invalid read offset or media size."); error = EINVAL; goto fail2; } } gp = g_new_geomf(&g_gate_class, "%s", name); gp->start = g_gate_start; gp->access = g_gate_access; gp->orphan = g_gate_orphan; gp->dumpconf = g_gate_dumpconf; gp->softc = sc; if (ropp != NULL) { cp = g_new_consumer(gp); cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE; error = g_attach(cp, ropp); if (error != 0) { G_GATE_DEBUG(1, "Unable to attach to %s.", ropp->name); goto fail3; } error = g_access(cp, 1, 0, 0); if (error != 0) { G_GATE_DEBUG(1, "Unable to access %s.", ropp->name); g_detach(cp); goto fail3; } sc->sc_readcons = cp; sc->sc_readoffset = ggio->gctl_readoffset; } ggio->gctl_unit = sc->sc_unit; pp = g_new_providerf(gp, "%s", name); pp->flags |= G_PF_DIRECT_SEND | G_PF_DIRECT_RECEIVE; pp->mediasize = ggio->gctl_mediasize; pp->sectorsize = ggio->gctl_sectorsize; sc->sc_provider = pp; g_error_provider(pp, 0); g_topology_unlock(); mtx_lock(&g_gate_units_lock); sc->sc_name = sc->sc_provider->name; mtx_unlock(&g_gate_units_lock); G_GATE_DEBUG(1, "Device %s created.", gp->name); if (sc->sc_timeout > 0) { callout_reset(&sc->sc_callout, sc->sc_timeout * hz, g_gate_guard, sc); } return (0); fail3: g_destroy_consumer(cp); g_destroy_geom(gp); fail2: g_topology_unlock(); mtx_lock(&g_gate_units_lock); g_gate_units[sc->sc_unit] = NULL; KASSERT(g_gate_nunits > 0, ("negative g_gate_nunits?")); g_gate_nunits--; fail1: mtx_unlock(&g_gate_units_lock); mtx_destroy(&sc->sc_queue_mtx); free(sc, M_GATE); return (error); } static int g_gate_modify(struct g_gate_softc *sc, struct g_gate_ctl_modify *ggio) { struct g_provider *pp; struct g_consumer *cp; int error; if ((ggio->gctl_modify & GG_MODIFY_MEDIASIZE) != 0) { if (ggio->gctl_mediasize <= 0) { G_GATE_DEBUG(1, "Invalid media size."); return (EINVAL); } pp = sc->sc_provider; if ((ggio->gctl_mediasize % pp->sectorsize) != 0) { G_GATE_DEBUG(1, "Invalid media size."); return (EINVAL); } /* TODO */ return (EOPNOTSUPP); } if ((ggio->gctl_modify & GG_MODIFY_INFO) != 0) (void)strlcpy(sc->sc_info, ggio->gctl_info, sizeof(sc->sc_info)); cp = NULL; if ((ggio->gctl_modify & GG_MODIFY_READPROV) != 0) { g_topology_lock(); if (sc->sc_readcons != NULL) { cp = sc->sc_readcons; sc->sc_readcons = NULL; (void)g_access(cp, -1, 0, 0); g_detach(cp); g_destroy_consumer(cp); } if (ggio->gctl_readprov[0] != '\0') { pp = g_provider_by_name(ggio->gctl_readprov); if (pp == NULL) { g_topology_unlock(); G_GATE_DEBUG(1, "Provider %s doesn't exist.", ggio->gctl_readprov); return (EINVAL); } cp = g_new_consumer(sc->sc_provider->geom); cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE; error = g_attach(cp, pp); if (error != 0) { G_GATE_DEBUG(1, "Unable to attach to %s.", pp->name); } else { error = g_access(cp, 1, 0, 0); if (error != 0) { G_GATE_DEBUG(1, "Unable to access 
%s.", pp->name); g_detach(cp); } } if (error != 0) { g_destroy_consumer(cp); g_topology_unlock(); return (error); } } } else { cp = sc->sc_readcons; } if ((ggio->gctl_modify & GG_MODIFY_READOFFSET) != 0) { if (cp == NULL) { G_GATE_DEBUG(1, "No read provider."); return (EINVAL); } pp = sc->sc_provider; if ((ggio->gctl_readoffset % pp->sectorsize) != 0) { G_GATE_DEBUG(1, "Invalid read offset."); return (EINVAL); } if (pp->mediasize + ggio->gctl_readoffset > cp->provider->mediasize) { G_GATE_DEBUG(1, "Invalid read offset or media size."); return (EINVAL); } sc->sc_readoffset = ggio->gctl_readoffset; } if ((ggio->gctl_modify & GG_MODIFY_READPROV) != 0) { sc->sc_readcons = cp; g_topology_unlock(); } return (0); } #define G_GATE_CHECK_VERSION(ggio) do { \ if ((ggio)->gctl_version != G_GATE_VERSION) { \ printf("Version mismatch %d != %d.\n", \ ggio->gctl_version, G_GATE_VERSION); \ return (EINVAL); \ } \ } while (0) static int g_gate_ioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flags, struct thread *td) { struct g_gate_softc *sc; struct bio *bp; int error = 0; G_GATE_DEBUG(4, "ioctl(%s, %lx, %p, %x, %p)", devtoname(dev), cmd, addr, flags, td); switch (cmd) { case G_GATE_CMD_CREATE: { struct g_gate_ctl_create *ggio = (void *)addr; G_GATE_CHECK_VERSION(ggio); error = g_gate_create(ggio); /* * Reset TDP_GEOM flag. * There are pending events for sure, because we just created * new provider and other classes want to taste it, but we * cannot answer on I/O requests until we're here. */ td->td_pflags &= ~TDP_GEOM; return (error); } case G_GATE_CMD_MODIFY: { struct g_gate_ctl_modify *ggio = (void *)addr; G_GATE_CHECK_VERSION(ggio); sc = g_gate_hold(ggio->gctl_unit, NULL); if (sc == NULL) return (ENXIO); error = g_gate_modify(sc, ggio); g_gate_release(sc); return (error); } case G_GATE_CMD_DESTROY: { struct g_gate_ctl_destroy *ggio = (void *)addr; G_GATE_CHECK_VERSION(ggio); sc = g_gate_hold(ggio->gctl_unit, ggio->gctl_name); if (sc == NULL) return (ENXIO); g_topology_lock(); mtx_lock(&g_gate_units_lock); error = g_gate_destroy(sc, ggio->gctl_force); g_topology_unlock(); if (error != 0) g_gate_release(sc); return (error); } case G_GATE_CMD_CANCEL: { struct g_gate_ctl_cancel *ggio = (void *)addr; struct bio *tbp, *lbp; G_GATE_CHECK_VERSION(ggio); sc = g_gate_hold(ggio->gctl_unit, ggio->gctl_name); if (sc == NULL) return (ENXIO); lbp = NULL; mtx_lock(&sc->sc_queue_mtx); TAILQ_FOREACH_SAFE(bp, &sc->sc_outqueue.queue, bio_queue, tbp) { if (ggio->gctl_seq == 0 || ggio->gctl_seq == (uintptr_t)bp->bio_driver1) { G_GATE_LOGREQ(1, bp, "Request canceled."); bioq_remove(&sc->sc_outqueue, bp); /* * Be sure to put requests back onto incoming * queue in the proper order. */ if (lbp == NULL) bioq_insert_head(&sc->sc_inqueue, bp); else { TAILQ_INSERT_AFTER(&sc->sc_inqueue.queue, lbp, bp, bio_queue); } lbp = bp; /* * If only one request was canceled, leave now. 
*/ if (ggio->gctl_seq != 0) break; } } if (ggio->gctl_unit == G_GATE_NAME_GIVEN) ggio->gctl_unit = sc->sc_unit; mtx_unlock(&sc->sc_queue_mtx); g_gate_release(sc); return (error); } case G_GATE_CMD_START: { struct g_gate_ctl_io *ggio = (void *)addr; G_GATE_CHECK_VERSION(ggio); sc = g_gate_hold(ggio->gctl_unit, NULL); if (sc == NULL) return (ENXIO); error = 0; for (;;) { mtx_lock(&sc->sc_queue_mtx); bp = bioq_first(&sc->sc_inqueue); if (bp != NULL) break; if ((sc->sc_flags & G_GATE_FLAG_DESTROY) != 0) { ggio->gctl_error = ECANCELED; mtx_unlock(&sc->sc_queue_mtx); goto start_end; } if (msleep(sc, &sc->sc_queue_mtx, PPAUSE | PDROP | PCATCH, "ggwait", 0) != 0) { ggio->gctl_error = ECANCELED; goto start_end; } } ggio->gctl_cmd = bp->bio_cmd; if (bp->bio_cmd == BIO_WRITE && bp->bio_length > ggio->gctl_length) { mtx_unlock(&sc->sc_queue_mtx); ggio->gctl_length = bp->bio_length; ggio->gctl_error = ENOMEM; goto start_end; } bioq_remove(&sc->sc_inqueue, bp); bioq_insert_tail(&sc->sc_outqueue, bp); mtx_unlock(&sc->sc_queue_mtx); ggio->gctl_seq = (uintptr_t)bp->bio_driver1; ggio->gctl_offset = bp->bio_offset; ggio->gctl_length = bp->bio_length; switch (bp->bio_cmd) { case BIO_READ: case BIO_DELETE: case BIO_FLUSH: break; case BIO_WRITE: error = copyout(bp->bio_data, ggio->gctl_data, bp->bio_length); if (error != 0) { mtx_lock(&sc->sc_queue_mtx); bioq_remove(&sc->sc_outqueue, bp); bioq_insert_head(&sc->sc_inqueue, bp); mtx_unlock(&sc->sc_queue_mtx); goto start_end; } break; } start_end: g_gate_release(sc); return (error); } case G_GATE_CMD_DONE: { struct g_gate_ctl_io *ggio = (void *)addr; G_GATE_CHECK_VERSION(ggio); sc = g_gate_hold(ggio->gctl_unit, NULL); if (sc == NULL) return (ENOENT); error = 0; mtx_lock(&sc->sc_queue_mtx); TAILQ_FOREACH(bp, &sc->sc_outqueue.queue, bio_queue) { if (ggio->gctl_seq == (uintptr_t)bp->bio_driver1) break; } if (bp != NULL) { bioq_remove(&sc->sc_outqueue, bp); sc->sc_queue_count--; } mtx_unlock(&sc->sc_queue_mtx); if (bp == NULL) { /* * Request was probably canceled. 
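/*
 * [Editor's sketch] G_GATE_CMD_START and G_GATE_CMD_DONE form the two
 * halves of the userland protocol: the daemon blocks in START until a bio
 * is queued, services it, and completes it with DONE.  A rough shape of
 * such a daemon loop, inferred from the handlers here; ggatel(8)/ggated(8)
 * do essentially this, but error handling and the actual I/O servicing
 * are omitted:
 */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <geom/gate/g_gate.h>

static void
serve_loop(int unit, void *buf, size_t bufsize)
{
	struct g_gate_ctl_io ggio;
	int fd;

	fd = open("/dev/" G_GATE_CTL_NAME, O_RDWR);
	if (fd == -1)
		return;
	for (;;) {
		memset(&ggio, 0, sizeof(ggio));
		ggio.gctl_version = G_GATE_VERSION;
		ggio.gctl_unit = unit;
		ggio.gctl_data = buf;
		ggio.gctl_length = bufsize;
		if (ioctl(fd, G_GATE_CMD_START, &ggio) == -1)
			break;		/* device destroyed or error */
		/* ... service ggio.gctl_cmd at gctl_offset/gctl_length ... */
		ggio.gctl_error = 0;	/* or EAGAIN to requeue the bio */
		if (ioctl(fd, G_GATE_CMD_DONE, &ggio) == -1)
			break;
	}
	close(fd);
}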
*/ goto done_end; } if (ggio->gctl_error == EAGAIN) { bp->bio_error = 0; G_GATE_LOGREQ(1, bp, "Request desisted."); mtx_lock(&sc->sc_queue_mtx); sc->sc_queue_count++; bioq_insert_head(&sc->sc_inqueue, bp); wakeup(sc); mtx_unlock(&sc->sc_queue_mtx); } else { bp->bio_error = ggio->gctl_error; if (bp->bio_error == 0) { bp->bio_completed = bp->bio_length; switch (bp->bio_cmd) { case BIO_READ: error = copyin(ggio->gctl_data, bp->bio_data, bp->bio_length); if (error != 0) bp->bio_error = error; break; case BIO_DELETE: case BIO_WRITE: case BIO_FLUSH: break; } } G_GATE_LOGREQ(2, bp, "Request done."); g_io_deliver(bp, bp->bio_error); } done_end: g_gate_release(sc); return (error); } } return (ENOIOCTL); } static void g_gate_device(void) { status_dev = make_dev(&g_gate_cdevsw, 0x0, UID_ROOT, GID_WHEEL, 0600, G_GATE_CTL_NAME); } static int g_gate_modevent(module_t mod, int type, void *data) { int error = 0; switch (type) { case MOD_LOAD: mtx_init(&g_gate_units_lock, "gg_units_lock", NULL, MTX_DEF); g_gate_units = malloc(g_gate_maxunits * sizeof(g_gate_units[0]), M_GATE, M_WAITOK | M_ZERO); g_gate_nunits = 0; g_gate_device(); break; case MOD_UNLOAD: mtx_lock(&g_gate_units_lock); if (g_gate_nunits > 0) { mtx_unlock(&g_gate_units_lock); error = EBUSY; break; } mtx_unlock(&g_gate_units_lock); mtx_destroy(&g_gate_units_lock); if (status_dev != NULL) destroy_dev(status_dev); free(g_gate_units, M_GATE); break; default: return (EOPNOTSUPP); break; } return (error); } static moduledata_t g_gate_module = { G_GATE_MOD_NAME, g_gate_modevent, NULL }; DECLARE_MODULE(geom_gate, g_gate_module, SI_SUB_DRIVERS, SI_ORDER_MIDDLE); DECLARE_GEOM_CLASS(g_gate_class, g_gate); +MODULE_VERSION(geom_gate, 0); Index: user/markj/netdump/sys/geom/geom_bsd.c =================================================================== --- user/markj/netdump/sys/geom/geom_bsd.c (revision 332407) +++ user/markj/netdump/sys/geom/geom_bsd.c (revision 332408) @@ -1,616 +1,617 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 2002 Poul-Henning Kamp * Copyright (c) 2002 Networks Associates Technology, Inc. * All rights reserved. * * This software was developed for the FreeBSD Project by Poul-Henning Kamp * and NAI Labs, the Security Research Division of Network Associates, Inc. * under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the * DARPA CHATS research program. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The names of the authors may not be used to endorse or promote * products derived from this software without specific prior written * permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * This is the method for dealing with BSD disklabels. It has been * extensively (by my standards at least) commented, in the vain hope that * it will serve as the source in future copy&paste operations. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_bsd, "GEOM BSD disklabels support"); #define BSD_CLASS_NAME "BSD" #define ALPHA_LABEL_OFFSET 64 #define HISTORIC_LABEL_OFFSET 512 #define LABELSIZE (148 + 16 * MAXPARTITIONS) static int g_bsd_once; static void g_bsd_hotwrite(void *arg, int flag); /* * Our private data about one instance. All the rest is handled by the * slice code and stored in its softc, so this is just the stuff * specific to BSD disklabels. */ struct g_bsd_softc { off_t labeloffset; off_t mbroffset; off_t rawoffset; struct disklabel ondisk; u_char label[LABELSIZE]; u_char labelsum[16]; }; /* * Modify our slicer to match proposed disklabel, if possible. * This is where we make sure we don't do something stupid. */ static int g_bsd_modify(struct g_geom *gp, u_char *label) { int i, error; struct partition *ppp; struct g_slicer *gsp; struct g_consumer *cp; struct g_bsd_softc *ms; u_int secsize, u; off_t rawoffset, o; struct disklabel dl; MD5_CTX md5sum; g_topology_assert(); gsp = gp->softc; ms = gsp->softc; error = bsd_disklabel_le_dec(label, &dl, MAXPARTITIONS); if (error) { return (error); } /* Get dimensions of our device. */ cp = LIST_FIRST(&gp->consumer); secsize = cp->provider->sectorsize; /* ... or a smaller sector size. */ if (dl.d_secsize < secsize) { return (EINVAL); } /* ... or a non-multiple sector size. */ if (dl.d_secsize % secsize != 0) { return (EINVAL); } /* Historical braindamage... */ rawoffset = (off_t)dl.d_partitions[RAW_PART].p_offset * dl.d_secsize; for (i = 0; i < dl.d_npartitions; i++) { ppp = &dl.d_partitions[i]; if (ppp->p_size == 0) continue; o = (off_t)ppp->p_offset * dl.d_secsize; if (o < rawoffset) rawoffset = 0; } if (rawoffset != 0 && (off_t)rawoffset != ms->mbroffset) printf("WARNING: %s expected rawoffset %jd, found %jd\n", gp->name, (intmax_t)ms->mbroffset/dl.d_secsize, (intmax_t)rawoffset/dl.d_secsize); /* Don't munge open partitions. */ for (i = 0; i < dl.d_npartitions; i++) { ppp = &dl.d_partitions[i]; o = (off_t)ppp->p_offset * dl.d_secsize; if (o == 0) o = rawoffset; error = g_slice_config(gp, i, G_SLICE_CONFIG_CHECK, o - rawoffset, (off_t)ppp->p_size * dl.d_secsize, dl.d_secsize, "%s%c", gp->name, 'a' + i); if (error) return (error); } /* Look good, go for it... 
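/*
 * [Editor's sketch] g_bsd_modify() above validates in two phases: every
 * partition is first offered to g_slice_config() with G_SLICE_CONFIG_CHECK,
 * which has no side effects and may fail, and only once all checks pass is
 * the loop rerun with G_SLICE_CONFIG_SET to commit.  The generic shape of
 * that check-then-commit idiom:
 */
static int
two_phase_commit(int n, int (*check)(int), void (*commit)(int))
{
	int error, i;

	for (i = 0; i < n; i++) {
		error = check(i);
		if (error != 0)
			return (error);	/* nothing has been changed yet */
	}
	for (i = 0; i < n; i++)
		commit(i);		/* must not fail after the checks */
	return (0);
}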
*/ for (u = 0; u < gsp->nslice; u++) { ppp = &dl.d_partitions[u]; o = (off_t)ppp->p_offset * dl.d_secsize; if (o == 0) o = rawoffset; g_slice_config(gp, u, G_SLICE_CONFIG_SET, o - rawoffset, (off_t)ppp->p_size * dl.d_secsize, dl.d_secsize, "%s%c", gp->name, 'a' + u); } /* Update our softc */ ms->ondisk = dl; if (label != ms->label) bcopy(label, ms->label, LABELSIZE); ms->rawoffset = rawoffset; /* * In order to avoid recursively attaching to the same * on-disk label (it's usually visible through the 'c' * partition) we calculate an MD5 and ask if other BSD's * below us love that label. If they do, we don't. */ MD5Init(&md5sum); MD5Update(&md5sum, ms->label, sizeof(ms->label)); MD5Final(ms->labelsum, &md5sum); return (0); } /* * This is an internal helper function, called multiple times from the taste * function to try to locate a disklabel on the disk. More civilized formats * will not need this, as there is only one possible place on disk to look * for the magic spot. */ static int g_bsd_try(struct g_geom *gp, struct g_slicer *gsp, struct g_consumer *cp, int secsize, struct g_bsd_softc *ms, off_t offset) { int error; u_char *buf; struct disklabel *dl; off_t secoff; /* * We need to read entire aligned sectors, and we assume that the * disklabel does not span sectors, so one sector is enough. */ secoff = offset % secsize; buf = g_read_data(cp, offset - secoff, secsize, NULL); if (buf == NULL) return (ENOENT); /* Decode into our native format. */ dl = &ms->ondisk; error = bsd_disklabel_le_dec(buf + secoff, dl, MAXPARTITIONS); if (!error) bcopy(buf + secoff, ms->label, LABELSIZE); /* Remember to free the buffer g_read_data() gave us. */ g_free(buf); ms->labeloffset = offset; return (error); } /* * This function writes the current label to disk, possibly updating * the alpha SRM checksum. */ static int g_bsd_writelabel(struct g_geom *gp, u_char *bootcode) { off_t secoff; u_int secsize; struct g_consumer *cp; struct g_slicer *gsp; struct g_bsd_softc *ms; u_char *buf; uint64_t sum; int error, i; gsp = gp->softc; ms = gsp->softc; cp = LIST_FIRST(&gp->consumer); /* Get sector size, we need it to read data. */ secsize = cp->provider->sectorsize; secoff = ms->labeloffset % secsize; if (bootcode == NULL) { buf = g_read_data(cp, ms->labeloffset - secoff, secsize, &error); if (buf == NULL) return (error); bcopy(ms->label, buf + secoff, sizeof(ms->label)); } else { buf = bootcode; bcopy(ms->label, buf + ms->labeloffset, sizeof(ms->label)); } if (ms->labeloffset == ALPHA_LABEL_OFFSET) { sum = 0; for (i = 0; i < 63; i++) sum += le64dec(buf + i * 8); le64enc(buf + 504, sum); } if (bootcode == NULL) { error = g_write_data(cp, ms->labeloffset - secoff, buf, secsize); g_free(buf); } else { error = g_write_data(cp, 0, bootcode, BBSIZE); } return(error); } /* * If the user tries to overwrite our disklabel through an open partition * or via a magicwrite config call, we end up here and try to prevent * footshooting as best we can. */ static void g_bsd_hotwrite(void *arg, int flag) { struct bio *bp; struct g_geom *gp; struct g_slicer *gsp; struct g_slice *gsl; struct g_bsd_softc *ms; u_char *p; int error; g_topology_assert(); /* * We should never get canceled, because that would amount to a removal * of the geom while there was outstanding I/O requests. 
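/*
 * [Editor's sketch] The Alpha SRM checksum written by g_bsd_writelabel()
 * above is the 64-bit sum of the first 63 little-endian quadwords of the
 * boot sector, stored in the 64th quadword (byte offset 504).  Standalone
 * version using FreeBSD's <sys/endian.h>:
 */
#include <stdint.h>
#include <sys/endian.h>

static void
srm_checksum(uint8_t *sector)		/* at least 512 bytes */
{
	uint64_t sum;
	int i;

	sum = 0;
	for (i = 0; i < 63; i++)
		sum += le64dec(sector + i * 8);
	le64enc(sector + 504, sum);	/* 63 * 8 == 504 */
}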
*/ KASSERT(flag != EV_CANCEL, ("g_bsd_hotwrite cancelled")); bp = arg; gp = bp->bio_to->geom; gsp = gp->softc; ms = gsp->softc; gsl = &gsp->slices[bp->bio_to->index]; p = (u_char*)bp->bio_data + ms->labeloffset - (bp->bio_offset + gsl->offset); error = g_bsd_modify(gp, p); if (error) { g_io_deliver(bp, EPERM); return; } g_slice_finish_hot(bp); } static int g_bsd_start(struct bio *bp) { struct g_geom *gp; struct g_bsd_softc *ms; struct g_slicer *gsp; gp = bp->bio_to->geom; gsp = gp->softc; ms = gsp->softc; if (bp->bio_cmd == BIO_GETATTR) { if (g_handleattr(bp, "BSD::labelsum", ms->labelsum, sizeof(ms->labelsum))) return (1); } return (0); } /* * Dump configuration information in XML format. * Notice that the function is called once for the geom and once for each * consumer and provider. We let g_slice_dumpconf() do most of the work. */ static void g_bsd_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_bsd_softc *ms; struct g_slicer *gsp; gsp = gp->softc; ms = gsp->softc; g_slice_dumpconf(sb, indent, gp, cp, pp); if (indent != NULL && pp == NULL && cp == NULL) { sbuf_printf(sb, "%s%jd\n", indent, (intmax_t)ms->labeloffset); sbuf_printf(sb, "%s%jd\n", indent, (intmax_t)ms->rawoffset); sbuf_printf(sb, "%s%jd\n", indent, (intmax_t)ms->mbroffset); } else if (pp != NULL) { if (indent == NULL) sbuf_printf(sb, " ty %d", ms->ondisk.d_partitions[pp->index].p_fstype); else sbuf_printf(sb, "%s%d\n", indent, ms->ondisk.d_partitions[pp->index].p_fstype); } } /* * The taste function is called from the event-handler, with the topology * lock already held and a provider to examine. The flags are unused. * * If flags == G_TF_NORMAL, the idea is to take a bite of the provider and * if we find valid, consistent magic on it, build a geom on it. * * There may be cases where the operator would like to put a BSD-geom on * providers which do not meet all of the requirements. This can be done * by instead passing the G_TF_INSIST flag, which will override these * checks. * * The final flags value is G_TF_TRANSPARENT, which instructs the method * to put a geom on top of the provider and configure it to be as transparent * as possible. This is not really relevant to the BSD method and therefore * not implemented here. */ static struct uuid freebsd_slice = GPT_ENT_TYPE_FREEBSD; static struct g_geom * g_bsd_taste(struct g_class *mp, struct g_provider *pp, int flags) { struct g_geom *gp; struct g_consumer *cp; int error, i; struct g_bsd_softc *ms; u_int secsize; struct g_slicer *gsp; u_char hash[16]; MD5_CTX md5sum; struct uuid uuid; g_trace(G_T_TOPOLOGY, "bsd_taste(%s,%s)", mp->name, pp->name); g_topology_assert(); /* We don't implement transparent inserts. */ if (flags == G_TF_TRANSPARENT) return (NULL); /* * BSD labels are a subclass of the general "slicing" topology so * a lot of the work can be done by the common "slice" code. * Create a geom with space for MAXPARTITIONS providers, one consumer * and a softc structure for us. Specify the provider to attach * the consumer to and our "start" routine for special requests. * The provider is opened with mode (1,0,0) so we can do reads * from it. */ gp = g_slice_new(mp, MAXPARTITIONS, pp, &cp, &ms, sizeof(*ms), g_bsd_start); if (gp == NULL) return (NULL); /* Get the geom_slicer softc from the geom. */ gsp = gp->softc; /* * The do...while loop here allows us to have multiple escapes * using a simple "break". 
This improves code clarity without * ending up in deep nesting and without using goto or come from. */ do { /* * If the provider is an MBR we will only auto attach * to type 165 slices in the G_TF_NORMAL case. We will * attach to any other type. */ error = g_getattr("MBR::type", cp, &i); if (!error) { if (i != 165 && flags == G_TF_NORMAL) break; error = g_getattr("MBR::offset", cp, &ms->mbroffset); if (error) break; } /* Same thing if we are inside a GPT */ error = g_getattr("GPT::type", cp, &uuid); if (!error) { if (memcmp(&uuid, &freebsd_slice, sizeof(uuid)) != 0 && flags == G_TF_NORMAL) break; } /* Get sector size, we need it to read data. */ secsize = cp->provider->sectorsize; if (secsize < 512) break; /* First look for a label at the start of the second sector. */ error = g_bsd_try(gp, gsp, cp, secsize, ms, secsize); /* * If sector size is not 512 the label still can be at * offset 512, not at the start of the second sector. At least * it's true for labels created by the FreeBSD's bsdlabel(8). */ if (error && secsize != HISTORIC_LABEL_OFFSET) error = g_bsd_try(gp, gsp, cp, secsize, ms, HISTORIC_LABEL_OFFSET); /* Next, look for alpha labels */ if (error) error = g_bsd_try(gp, gsp, cp, secsize, ms, ALPHA_LABEL_OFFSET); /* If we didn't find a label, punt. */ if (error) break; /* * In order to avoid recursively attaching to the same * on-disk label (it's usually visible through the 'c' * partition) we calculate an MD5 and ask if other BSD's * below us love that label. If they do, we don't. */ MD5Init(&md5sum); MD5Update(&md5sum, ms->label, sizeof(ms->label)); MD5Final(ms->labelsum, &md5sum); error = g_getattr("BSD::labelsum", cp, &hash); if (!error && !bcmp(ms->labelsum, hash, sizeof(hash))) break; /* * Process the found disklabel, and modify our "slice" * instance to match it, if possible. */ error = g_bsd_modify(gp, ms->label); } while (0); /* Success or failure, we can close our provider now. */ g_access(cp, -1, 0, 0); /* If we have configured any providers, return the new geom. */ if (gsp->nprovider > 0) { g_slice_conf_hot(gp, 0, ms->labeloffset, LABELSIZE, G_SLICE_HOT_ALLOW, G_SLICE_HOT_DENY, G_SLICE_HOT_CALL); gsp->hot = g_bsd_hotwrite; if (!g_bsd_once) { g_bsd_once = 1; printf( "WARNING: geom_bsd (geom %s) is deprecated, " "use gpart instead.\n", gp->name); } return (gp); } /* * ...else push the "self-destruct" button, by spoiling our own * consumer. This triggers a call to g_slice_spoiled which will * dismantle what was setup. */ g_slice_spoiled(cp); return (NULL); } struct h0h0 { struct g_geom *gp; struct g_bsd_softc *ms; u_char *label; int error; }; static void g_bsd_callconfig(void *arg, int flag) { struct h0h0 *hp; hp = arg; hp->error = g_bsd_modify(hp->gp, hp->label); if (!hp->error) hp->error = g_bsd_writelabel(hp->gp, NULL); } /* * NB! curthread is user process which GCTL'ed. 
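The gctl handler that follows performs an open/operate/close dance around every write verb; stripped of error reporting, the pattern looks like this (a sketch, not the verbatim code):

    error = g_access(cp, 1, 1, 1);          /* read + write + exclusive */
    if (error != 0) {
            gctl_error(req, "could not access consumer");
            return;
    }
    error = g_bsd_writelabel(gp, NULL);     /* or the bootcode variant */
    g_access(cp, -1, -1, -1);               /* drop exactly what we took */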
*/ static void g_bsd_config(struct gctl_req *req, struct g_class *mp, char const *verb) { u_char *label; int error; struct h0h0 h0h0; struct g_geom *gp; struct g_slicer *gsp; struct g_consumer *cp; struct g_bsd_softc *ms; g_topology_assert(); gp = gctl_get_geom(req, mp, "geom"); if (gp == NULL) return; cp = LIST_FIRST(&gp->consumer); gsp = gp->softc; ms = gsp->softc; if (!strcmp(verb, "read mbroffset")) { gctl_set_param_err(req, "mbroffset", &ms->mbroffset, sizeof(ms->mbroffset)); return; } else if (!strcmp(verb, "write label")) { label = gctl_get_paraml(req, "label", LABELSIZE); if (label == NULL) return; h0h0.gp = gp; h0h0.ms = gsp->softc; h0h0.label = label; h0h0.error = -1; /* XXX: Does this reference register with our selfdestruct code ? */ error = g_access(cp, 1, 1, 1); if (error) { gctl_error(req, "could not access consumer"); return; } g_bsd_callconfig(&h0h0, 0); error = h0h0.error; g_access(cp, -1, -1, -1); } else if (!strcmp(verb, "write bootcode")) { label = gctl_get_paraml(req, "bootcode", BBSIZE); if (label == NULL) return; /* XXX: Does this reference register with our selfdestruct code ? */ error = g_access(cp, 1, 1, 1); if (error) { gctl_error(req, "could not access consumer"); return; } error = g_bsd_writelabel(gp, label); g_access(cp, -1, -1, -1); } else { gctl_error(req, "Unknown verb parameter"); } return; } /* Finally, register with GEOM infrastructure. */ static struct g_class g_bsd_class = { .name = BSD_CLASS_NAME, .version = G_VERSION, .taste = g_bsd_taste, .ctlreq = g_bsd_config, .dumpconf = g_bsd_dumpconf, }; DECLARE_GEOM_CLASS(g_bsd_class, g_bsd); +MODULE_VERSION(geom_bsd, 0); Index: user/markj/netdump/sys/geom/geom_ccd.c =================================================================== --- user/markj/netdump/sys/geom/geom_ccd.c (revision 332407) +++ user/markj/netdump/sys/geom/geom_ccd.c (revision 332408) @@ -1,938 +1,939 @@ /*- * SPDX-License-Identifier: (BSD-2-Clause-NetBSD AND BSD-3-Clause) * * Copyright (c) 2003 Poul-Henning Kamp. * Copyright (c) 1996, 1997 The NetBSD Foundation, Inc. * All rights reserved. * * This code is derived from software contributed to The NetBSD Foundation * by Jason R. Thorpe. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. 
* * $NetBSD: ccd.c,v 1.22 1995/12/08 19:13:26 thorpej Exp $ */ /*- * Copyright (c) 1988 University of Utah. * Copyright (c) 1990, 1993 * The Regents of the University of California. All rights reserved. * * This code is derived from software contributed to Berkeley by * the Systems Programming Group of the University of Utah Computer * Science Department. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * from: Utah $Hdr: cd.c 1.6 90/11/28$ * * @(#)cd.c 8.2 (Berkeley) 11/16/93 */ /* * Dynamic configuration and disklabel support by: * Jason R. Thorpe * Numerical Aerodynamic Simulation Facility * Mail Stop 258-6 * NASA Ames Research Center * Moffett Field, CA 94035 */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include /* * Number of blocks to untouched in front of a component partition. * This is to avoid violating its disklabel area when it starts at the * beginning of the slice. */ #if !defined(CCD_OFFSET) #define CCD_OFFSET 16 #endif /* sc_flags */ #define CCDF_UNIFORM 0x02 /* use LCCD of sizes for uniform interleave */ #define CCDF_MIRROR 0x04 /* use mirroring */ #define CCDF_NO_OFFSET 0x08 /* do not leave space in front */ #define CCDF_LINUX 0x10 /* use Linux compatibility mode */ /* Mask of user-settable ccd flags. */ #define CCDF_USERMASK (CCDF_UNIFORM|CCDF_MIRROR) /* * Interleave description table. * Computed at boot time to speed irregular-interleave lookups. * The idea is that we interleave in "groups". First we interleave * evenly over all component disks up to the size of the smallest * component (the first group), then we interleave evenly over all * remaining disks up to the size of the next-smallest (second group), * and so on. * * Each table entry describes the interleave characteristics of one * of these groups. 
For example if a concatenated disk consisted of * three components of 5, 3, and 7 DEV_BSIZE blocks interleaved at * DEV_BSIZE (1), the table would have three entries: * * ndisk startblk startoff dev * 3 0 0 0, 1, 2 * 2 9 3 0, 2 * 1 13 5 2 * 0 - - - * * which says that the first nine blocks (0-8) are interleaved over * 3 disks (0, 1, 2) starting at block offset 0 on any component disk, * the next 4 blocks (9-12) are interleaved over 2 disks (0, 2) starting * at component block 3, and the remaining blocks (13-14) are on disk * 2 starting at offset 5. */ struct ccdiinfo { int ii_ndisk; /* # of disks range is interleaved over */ daddr_t ii_startblk; /* starting scaled block # for range */ daddr_t ii_startoff; /* starting component offset (block #) */ int *ii_index; /* ordered list of components in range */ }; /* * Component info table. * Describes a single component of a concatenated disk. */ struct ccdcinfo { daddr_t ci_size; /* size */ struct g_provider *ci_provider; /* provider */ struct g_consumer *ci_consumer; /* consumer */ }; /* * A concatenated disk is described by this structure. */ struct ccd_s { LIST_ENTRY(ccd_s) list; int sc_unit; /* logical unit number */ int sc_flags; /* flags */ daddr_t sc_size; /* size of ccd */ int sc_ileave; /* interleave */ u_int sc_ndisks; /* number of components */ struct ccdcinfo *sc_cinfo; /* component info */ struct ccdiinfo *sc_itable; /* interleave table */ u_int32_t sc_secsize; /* # bytes per sector */ int sc_pick; /* side of mirror picked */ daddr_t sc_blk[2]; /* mirror localization */ u_int32_t sc_offset; /* actual offset used */ }; static g_start_t g_ccd_start; static void ccdiodone(struct bio *bp); static void ccdinterleave(struct ccd_s *); static int ccdinit(struct gctl_req *req, struct ccd_s *); static int ccdbuffer(struct bio **ret, struct ccd_s *, struct bio *, daddr_t, caddr_t, long); static void g_ccd_orphan(struct g_consumer *cp) { /* * XXX: We don't do anything here. It is not obvious * XXX: what DTRT would be, so we do what the previous * XXX: code did: ignore it and let the user cope. */ } static int g_ccd_access(struct g_provider *pp, int dr, int dw, int de) { struct g_geom *gp; struct g_consumer *cp1, *cp2; int error; de += dr; de += dw; gp = pp->geom; error = ENXIO; LIST_FOREACH(cp1, &gp->consumer, consumer) { error = g_access(cp1, dr, dw, de); if (error) { LIST_FOREACH(cp2, &gp->consumer, consumer) { if (cp1 == cp2) break; g_access(cp2, -dr, -dw, -de); } break; } } return (error); } /* * Free the softc and its substructures. 
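The worked table above can be checked with a few lines of ordinary C. The sketch below is a standalone userland program (not kernel code) that replays the 5/3/7 example: for each logical block it walks the interleave table and prints the component disk and component-relative offset, matching the mapping later performed by ccdbuffer():

    #include <stdio.h>

    struct ii { int ndisk, startblk, startoff, idx[3]; };

    static const struct ii tbl[] = {
            { 3,  0, 0, { 0, 1, 2 } },      /* blocks 0-8 over disks 0,1,2 */
            { 2,  9, 3, { 0, 2 } },         /* blocks 9-12 over disks 0,2 */
            { 1, 13, 5, { 2 } },            /* blocks 13-14 on disk 2 */
            { 0 }                           /* sentinel: ndisk == 0 */
    };

    int
    main(void)
    {
            for (int bn = 0; bn < 15; bn++) {
                    const struct ii *ii = tbl;

                    /* Find the last entry whose startblk is <= bn. */
                    while (ii[1].ndisk != 0 && ii[1].startblk <= bn)
                            ii++;
                    int off = bn - ii->startblk;
                    printf("block %2d -> disk %d, offset %d\n", bn,
                        ii->idx[off % ii->ndisk],
                        ii->startoff + off / ii->ndisk);
            }
            return (0);
    }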
*/ static void g_ccd_freesc(struct ccd_s *sc) { struct ccdiinfo *ii; g_free(sc->sc_cinfo); if (sc->sc_itable != NULL) { for (ii = sc->sc_itable; ii->ii_ndisk > 0; ii++) if (ii->ii_index != NULL) g_free(ii->ii_index); g_free(sc->sc_itable); } g_free(sc); } static int ccdinit(struct gctl_req *req, struct ccd_s *cs) { struct ccdcinfo *ci; daddr_t size; int ix; daddr_t minsize; int maxsecsize; off_t mediasize; u_int sectorsize; cs->sc_size = 0; maxsecsize = 0; minsize = 0; if (cs->sc_flags & CCDF_LINUX) { cs->sc_offset = 0; cs->sc_ileave *= 2; if (cs->sc_flags & CCDF_MIRROR && cs->sc_ndisks != 2) gctl_error(req, "Mirror mode for Linux raids is " "only supported with 2 devices"); } else { if (cs->sc_flags & CCDF_NO_OFFSET) cs->sc_offset = 0; else cs->sc_offset = CCD_OFFSET; } for (ix = 0; ix < cs->sc_ndisks; ix++) { ci = &cs->sc_cinfo[ix]; mediasize = ci->ci_provider->mediasize; sectorsize = ci->ci_provider->sectorsize; if (sectorsize > maxsecsize) maxsecsize = sectorsize; size = mediasize / DEV_BSIZE - cs->sc_offset; /* Truncate to interleave boundary */ if (cs->sc_ileave > 1) size -= size % cs->sc_ileave; if (size == 0) { gctl_error(req, "Component %s has effective size zero", ci->ci_provider->name); return(ENODEV); } if (minsize == 0 || size < minsize) minsize = size; ci->ci_size = size; cs->sc_size += size; } /* * Don't allow the interleave to be smaller than * the biggest component sector. */ if ((cs->sc_ileave > 0) && (cs->sc_ileave < (maxsecsize / DEV_BSIZE))) { gctl_error(req, "Interleave to small for sector size"); return(EINVAL); } /* * If uniform interleave is desired set all sizes to that of * the smallest component. This will guarantee that a single * interleave table is generated. * * Lost space must be taken into account when calculating the * overall size. Half the space is lost when CCDF_MIRROR is * specified. */ if (cs->sc_flags & CCDF_UNIFORM) { for (ix = 0; ix < cs->sc_ndisks; ix++) { ci = &cs->sc_cinfo[ix]; ci->ci_size = minsize; } cs->sc_size = cs->sc_ndisks * minsize; } if (cs->sc_flags & CCDF_MIRROR) { /* * Check to see if an even number of components * have been specified. The interleave must also * be non-zero in order for us to be able to * guarantee the topology. */ if (cs->sc_ndisks % 2) { gctl_error(req, "Mirroring requires an even number of disks"); return(EINVAL); } if (cs->sc_ileave == 0) { gctl_error(req, "An interleave must be specified when mirroring"); return(EINVAL); } cs->sc_size = (cs->sc_ndisks/2) * minsize; } /* * Construct the interleave table. */ ccdinterleave(cs); /* * Create pseudo-geometry based on 1MB cylinders. It's * pretty close. */ cs->sc_secsize = maxsecsize; return (0); } static void ccdinterleave(struct ccd_s *cs) { struct ccdcinfo *ci, *smallci; struct ccdiinfo *ii; daddr_t bn, lbn; int ix; daddr_t size; /* * Allocate an interleave table. The worst case occurs when each * of N disks is of a different size, resulting in N interleave * tables. * * Chances are this is too big, but we don't care. */ size = (cs->sc_ndisks + 1) * sizeof(struct ccdiinfo); cs->sc_itable = g_malloc(size, M_WAITOK | M_ZERO); /* * Trivial case: no interleave (actually interleave of disk size). * Each table entry represents a single component in its entirety. * * An interleave of 0 may not be used with a mirror setup. */ if (cs->sc_ileave == 0) { bn = 0; ii = cs->sc_itable; for (ix = 0; ix < cs->sc_ndisks; ix++) { /* Allocate space for ii_index. 
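To make the sizing rules in ccdinit() concrete, here is the arithmetic for a hypothetical ccd built from components of 5000, 3000 and 7000 DEV_BSIZE blocks with an interleave of 16 and the default CCD_OFFSET of 16:

    usable per component = (blocks - CCD_OFFSET) truncated to the interleave
                         = 4976, 2976 and 6976 blocks respectively
    plain concatenation:   sc_size = 4976 + 2976 + 6976 = 14928 blocks
    with CCDF_UNIFORM:     every component clamped to the smallest (2976),
                           so sc_size = 3 * 2976 = 8928 blocks
    with CCDF_MIRROR:      requires an even disk count; for two 5000-block
                           components, sc_size = (2 / 2) * 4976 = 4976 blocks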
*/ ii->ii_index = g_malloc(sizeof(int), M_WAITOK); ii->ii_ndisk = 1; ii->ii_startblk = bn; ii->ii_startoff = 0; ii->ii_index[0] = ix; bn += cs->sc_cinfo[ix].ci_size; ii++; } ii->ii_ndisk = 0; return; } /* * The following isn't fast or pretty; it doesn't have to be. */ size = 0; bn = lbn = 0; for (ii = cs->sc_itable; ; ii++) { /* * Allocate space for ii_index. We might allocate more then * we use. */ ii->ii_index = g_malloc((sizeof(int) * cs->sc_ndisks), M_WAITOK); /* * Locate the smallest of the remaining components */ smallci = NULL; for (ci = cs->sc_cinfo; ci < &cs->sc_cinfo[cs->sc_ndisks]; ci++) { if (ci->ci_size > size && (smallci == NULL || ci->ci_size < smallci->ci_size)) { smallci = ci; } } /* * Nobody left, all done */ if (smallci == NULL) { ii->ii_ndisk = 0; g_free(ii->ii_index); ii->ii_index = NULL; break; } /* * Record starting logical block using an sc_ileave blocksize. */ ii->ii_startblk = bn / cs->sc_ileave; /* * Record starting component block using an sc_ileave * blocksize. This value is relative to the beginning of * a component disk. */ ii->ii_startoff = lbn; /* * Determine how many disks take part in this interleave * and record their indices. */ ix = 0; for (ci = cs->sc_cinfo; ci < &cs->sc_cinfo[cs->sc_ndisks]; ci++) { if (ci->ci_size >= smallci->ci_size) { ii->ii_index[ix++] = ci - cs->sc_cinfo; } } ii->ii_ndisk = ix; bn += ix * (smallci->ci_size - size); lbn = smallci->ci_size / cs->sc_ileave; size = smallci->ci_size; } } static void g_ccd_start(struct bio *bp) { long bcount, rcount; struct bio *cbp[2]; caddr_t addr; daddr_t bn; int err; struct ccd_s *cs; cs = bp->bio_to->geom->softc; /* * Block all GETATTR requests, we wouldn't know which of our * subdevices we should ship it off to. * XXX: this may not be the right policy. */ if(bp->bio_cmd == BIO_GETATTR) { g_io_deliver(bp, EINVAL); return; } /* * Translate the partition-relative block number to an absolute. */ bn = bp->bio_offset / cs->sc_secsize; /* * Allocate component buffers and fire off the requests */ addr = bp->bio_data; for (bcount = bp->bio_length; bcount > 0; bcount -= rcount) { err = ccdbuffer(cbp, cs, bp, bn, addr, bcount); if (err) { bp->bio_completed += bcount; if (bp->bio_error == 0) bp->bio_error = err; if (bp->bio_completed == bp->bio_length) g_io_deliver(bp, bp->bio_error); return; } rcount = cbp[0]->bio_length; if (cs->sc_flags & CCDF_MIRROR) { /* * Mirroring. Writes go to both disks, reads are * taken from whichever disk seems most appropriate. * * We attempt to localize reads to the disk whos arm * is nearest the read request. We ignore seeks due * to writes when making this determination and we * also try to avoid hogging. */ if (cbp[0]->bio_cmd != BIO_READ) { g_io_request(cbp[0], cbp[0]->bio_from); g_io_request(cbp[1], cbp[1]->bio_from); } else { int pick = cs->sc_pick; daddr_t range = cs->sc_size / 16; if (bn < cs->sc_blk[pick] - range || bn > cs->sc_blk[pick] + range ) { cs->sc_pick = pick = 1 - pick; } cs->sc_blk[pick] = bn + btodb(rcount); g_io_request(cbp[pick], cbp[pick]->bio_from); } } else { /* * Not mirroring */ g_io_request(cbp[0], cbp[0]->bio_from); } bn += btodb(rcount); addr += rcount; } } /* * Build a component buffer header. */ static int ccdbuffer(struct bio **cb, struct ccd_s *cs, struct bio *bp, daddr_t bn, caddr_t addr, long bcount) { struct ccdcinfo *ci, *ci2 = NULL; struct bio *cbp; daddr_t cbn, cboff; off_t cbc; /* * Determine which component bn falls in. 
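The read-localization policy in g_ccd_start() above is compact enough to restate as a tiny helper. pick_side() is a hypothetical name; the logic mirrors the code: stay on the current mirror side while requests remain within one sixteenth of the media of its last known head position, otherwise flip sides:

    static int
    pick_side(struct ccd_s *cs, daddr_t bn, long rcount)
    {
            int pick = cs->sc_pick;
            daddr_t range = cs->sc_size / 16;

            if (bn < cs->sc_blk[pick] - range ||
                bn > cs->sc_blk[pick] + range)
                    cs->sc_pick = pick = 1 - pick;  /* arm too far: flip */
            cs->sc_blk[pick] = bn + btodb(rcount);  /* projected position */
            return (pick);
    }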
*/ cbn = bn; cboff = 0; if (cs->sc_ileave == 0) { /* * Serially concatenated and neither a mirror nor a parity * config. This is a special case. */ daddr_t sblk; sblk = 0; for (ci = cs->sc_cinfo; cbn >= sblk + ci->ci_size; ci++) sblk += ci->ci_size; cbn -= sblk; } else { struct ccdiinfo *ii; int ccdisk, off; /* * Calculate cbn, the logical superblock (sc_ileave chunks), * and cboff, a normal block offset (DEV_BSIZE chunks) relative * to cbn. */ cboff = cbn % cs->sc_ileave; /* DEV_BSIZE gran */ cbn = cbn / cs->sc_ileave; /* DEV_BSIZE * ileave gran */ /* * Figure out which interleave table to use. */ for (ii = cs->sc_itable; ii->ii_ndisk; ii++) { if (ii->ii_startblk > cbn) break; } ii--; /* * off is the logical superblock relative to the beginning * of this interleave block. */ off = cbn - ii->ii_startblk; /* * We must calculate which disk component to use (ccdisk), * and recalculate cbn to be the superblock relative to * the beginning of the component. This is typically done by * adding 'off' and ii->ii_startoff together. However, 'off' * must typically be divided by the number of components in * this interleave array to be properly convert it from a * CCD-relative logical superblock number to a * component-relative superblock number. */ if (ii->ii_ndisk == 1) { /* * When we have just one disk, it can't be a mirror * or a parity config. */ ccdisk = ii->ii_index[0]; cbn = ii->ii_startoff + off; } else { if (cs->sc_flags & CCDF_MIRROR) { /* * We have forced a uniform mapping, resulting * in a single interleave array. We double * up on the first half of the available * components and our mirror is in the second * half. This only works with a single * interleave array because doubling up * doubles the number of sectors, so there * cannot be another interleave array because * the next interleave array's calculations * would be off. */ int ndisk2 = ii->ii_ndisk / 2; ccdisk = ii->ii_index[off % ndisk2]; cbn = ii->ii_startoff + off / ndisk2; ci2 = &cs->sc_cinfo[ccdisk + ndisk2]; } else { ccdisk = ii->ii_index[off % ii->ii_ndisk]; cbn = ii->ii_startoff + off / ii->ii_ndisk; } } ci = &cs->sc_cinfo[ccdisk]; /* * Convert cbn from a superblock to a normal block so it * can be used to calculate (along with cboff) the normal * block index into this particular disk. */ cbn *= cs->sc_ileave; } /* * Fill in the component buf structure. */ cbp = g_clone_bio(bp); if (cbp == NULL) return (ENOMEM); cbp->bio_done = g_std_done; cbp->bio_offset = dbtob(cbn + cboff + cs->sc_offset); cbp->bio_data = addr; if (cs->sc_ileave == 0) cbc = dbtob((off_t)(ci->ci_size - cbn)); else cbc = dbtob((off_t)(cs->sc_ileave - cboff)); cbp->bio_length = (cbc < bcount) ? cbc : bcount; cbp->bio_from = ci->ci_consumer; cb[0] = cbp; if (cs->sc_flags & CCDF_MIRROR) { cbp = g_clone_bio(bp); if (cbp == NULL) return (ENOMEM); cbp->bio_done = cb[0]->bio_done = ccdiodone; cbp->bio_offset = cb[0]->bio_offset; cbp->bio_data = cb[0]->bio_data; cbp->bio_length = cb[0]->bio_length; cbp->bio_from = ci2->ci_consumer; cbp->bio_caller1 = cb[0]; cb[0]->bio_caller1 = cbp; cb[1] = cbp; } return (0); } /* * Called only for mirrored operations. 
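ccdiodone() below implements two policies at once; the read half is the interesting one. A simplified sketch of just the read-failover branch (write aggregation is omitted, and example_read_done() is an illustrative name):

    static void
    example_read_done(struct bio *cbp)
    {
            struct bio *mbp = cbp->bio_caller1;     /* mirror partner bio */
            struct bio *pbp = cbp->bio_parent;

            if (cbp->bio_error != 0 && mbp != NULL) {
                    /* This side failed: retry on the other side. */
                    mbp->bio_caller1 = NULL;
                    pbp->bio_inbed++;
                    g_destroy_bio(cbp);
                    g_io_request(mbp, mbp->bio_from);
                    return;
            }
            if (cbp->bio_error == 0 && mbp != NULL) {
                    /* Success: the partner bio is not needed after all. */
                    pbp->bio_inbed++;
                    g_destroy_bio(mbp);
            }
            g_std_done(cbp);        /* success, or both sides failed */
    }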
*/ static void ccdiodone(struct bio *cbp) { struct bio *mbp, *pbp; mbp = cbp->bio_caller1; pbp = cbp->bio_parent; if (pbp->bio_cmd == BIO_READ) { if (cbp->bio_error == 0) { /* We will not be needing the partner bio */ if (mbp != NULL) { pbp->bio_inbed++; g_destroy_bio(mbp); } g_std_done(cbp); return; } if (mbp != NULL) { /* Try the partner bio instead */ mbp->bio_caller1 = NULL; pbp->bio_inbed++; g_destroy_bio(cbp); g_io_request(mbp, mbp->bio_from); /* * XXX: If this comes back OK, we should actually * try to write the good data on the failed mirror */ return; } g_std_done(cbp); return; } if (mbp != NULL) { mbp->bio_caller1 = NULL; pbp->bio_inbed++; if (cbp->bio_error != 0 && pbp->bio_error == 0) pbp->bio_error = cbp->bio_error; g_destroy_bio(cbp); return; } g_std_done(cbp); } static void g_ccd_create(struct gctl_req *req, struct g_class *mp) { int *unit, *ileave, *nprovider; struct g_geom *gp; struct g_consumer *cp; struct g_provider *pp; struct ccd_s *sc; struct sbuf *sb; char buf[20]; int i, error; g_topology_assert(); unit = gctl_get_paraml(req, "unit", sizeof (*unit)); if (unit == NULL) { gctl_error(req, "unit parameter not given"); return; } ileave = gctl_get_paraml(req, "ileave", sizeof (*ileave)); if (ileave == NULL) { gctl_error(req, "ileave parameter not given"); return; } nprovider = gctl_get_paraml(req, "nprovider", sizeof (*nprovider)); if (nprovider == NULL) { gctl_error(req, "nprovider parameter not given"); return; } /* Check for duplicate unit */ LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc != NULL && sc->sc_unit == *unit) { gctl_error(req, "Unit %d already configured", *unit); return; } } if (*nprovider <= 0) { gctl_error(req, "Bogus nprovider argument (= %d)", *nprovider); return; } /* Check that all providers are valid */ for (i = 0; i < *nprovider; i++) { sprintf(buf, "provider%d", i); pp = gctl_get_provider(req, buf); if (pp == NULL) return; } gp = g_new_geomf(mp, "ccd%d", *unit); sc = g_malloc(sizeof *sc, M_WAITOK | M_ZERO); gp->softc = sc; sc->sc_ndisks = *nprovider; /* Allocate space for the component info.
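g_ccd_create() finishes by handing a human-readable summary back to the userland gctl caller; the sbuf pattern it (and g_ccd_list() below) uses is worth isolating. A fragment, with unit and ndisks standing in for the real values:

    struct sbuf *sb;

    sb = sbuf_new_auto();                   /* auto-growing string buffer */
    sbuf_printf(sb, "ccd%d: %d components\n", unit, ndisks);
    sbuf_finish(sb);                        /* seal before reading data */
    gctl_set_param_err(req, "output", sbuf_data(sb), sbuf_len(sb) + 1);
    sbuf_delete(sb);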
*/ sc->sc_cinfo = g_malloc(sc->sc_ndisks * sizeof(struct ccdcinfo), M_WAITOK | M_ZERO); /* Create consumers and attach to all providers */ for (i = 0; i < *nprovider; i++) { sprintf(buf, "provider%d", i); pp = gctl_get_provider(req, buf); cp = g_new_consumer(gp); error = g_attach(cp, pp); KASSERT(error == 0, ("attach to %s failed", pp->name)); sc->sc_cinfo[i].ci_consumer = cp; sc->sc_cinfo[i].ci_provider = pp; } sc->sc_unit = *unit; sc->sc_ileave = *ileave; if (gctl_get_param(req, "no_offset", NULL)) sc->sc_flags |= CCDF_NO_OFFSET; if (gctl_get_param(req, "linux", NULL)) sc->sc_flags |= CCDF_LINUX; if (gctl_get_param(req, "uniform", NULL)) sc->sc_flags |= CCDF_UNIFORM; if (gctl_get_param(req, "mirror", NULL)) sc->sc_flags |= CCDF_MIRROR; if (sc->sc_ileave == 0 && (sc->sc_flags & CCDF_MIRROR)) { printf("%s: disabling mirror, interleave is 0\n", gp->name); sc->sc_flags &= ~(CCDF_MIRROR); } if ((sc->sc_flags & CCDF_MIRROR) && !(sc->sc_flags & CCDF_UNIFORM)) { printf("%s: mirror/parity forces uniform flag\n", gp->name); sc->sc_flags |= CCDF_UNIFORM; } error = ccdinit(req, sc); if (error != 0) { g_ccd_freesc(sc); gp->softc = NULL; g_wither_geom(gp, ENXIO); return; } pp = g_new_providerf(gp, "%s", gp->name); pp->mediasize = sc->sc_size * (off_t)sc->sc_secsize; pp->sectorsize = sc->sc_secsize; g_error_provider(pp, 0); sb = sbuf_new_auto(); sbuf_printf(sb, "ccd%d: %d components ", sc->sc_unit, *nprovider); for (i = 0; i < *nprovider; i++) { sbuf_printf(sb, "%s%s", i == 0 ? "(" : ", ", sc->sc_cinfo[i].ci_provider->name); } sbuf_printf(sb, "), %jd blocks ", (off_t)pp->mediasize / DEV_BSIZE); if (sc->sc_ileave != 0) sbuf_printf(sb, "interleaved at %d blocks\n", sc->sc_ileave); else sbuf_printf(sb, "concatenated\n"); sbuf_finish(sb); gctl_set_param_err(req, "output", sbuf_data(sb), sbuf_len(sb) + 1); sbuf_delete(sb); } static int g_ccd_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp) { struct g_provider *pp; struct ccd_s *sc; g_topology_assert(); sc = gp->softc; pp = LIST_FIRST(&gp->provider); if (sc == NULL || pp == NULL) return (EBUSY); if (pp->acr != 0 || pp->acw != 0 || pp->ace != 0) { gctl_error(req, "%s is open(r%dw%de%d)", gp->name, pp->acr, pp->acw, pp->ace); return (EBUSY); } g_ccd_freesc(sc); gp->softc = NULL; g_wither_geom(gp, ENXIO); return (0); } static void g_ccd_list(struct gctl_req *req, struct g_class *mp) { struct sbuf *sb; struct ccd_s *cs; struct g_geom *gp; int i, unit, *up; up = gctl_get_paraml(req, "unit", sizeof (*up)); if (up == NULL) { gctl_error(req, "unit parameter not given"); return; } unit = *up; sb = sbuf_new_auto(); LIST_FOREACH(gp, &mp->geom, geom) { cs = gp->softc; if (cs == NULL || (unit >= 0 && unit != cs->sc_unit)) continue; sbuf_printf(sb, "ccd%d\t\t%d\t%d\t", cs->sc_unit, cs->sc_ileave, cs->sc_flags & CCDF_USERMASK); for (i = 0; i < cs->sc_ndisks; ++i) { sbuf_printf(sb, "%s/dev/%s", i == 0 ? 
"" : " ", cs->sc_cinfo[i].ci_provider->name); } sbuf_printf(sb, "\n"); } sbuf_finish(sb); gctl_set_param_err(req, "output", sbuf_data(sb), sbuf_len(sb) + 1); sbuf_delete(sb); } static void g_ccd_config(struct gctl_req *req, struct g_class *mp, char const *verb) { struct g_geom *gp; g_topology_assert(); if (!strcmp(verb, "create geom")) { g_ccd_create(req, mp); } else if (!strcmp(verb, "destroy geom")) { gp = gctl_get_geom(req, mp, "geom"); if (gp != NULL) g_ccd_destroy_geom(req, mp, gp); } else if (!strcmp(verb, "list")) { g_ccd_list(req, mp); } else { gctl_error(req, "unknown verb"); } } static struct g_class g_ccd_class = { .name = "CCD", .version = G_VERSION, .ctlreq = g_ccd_config, .destroy_geom = g_ccd_destroy_geom, .start = g_ccd_start, .orphan = g_ccd_orphan, .access = g_ccd_access, }; DECLARE_GEOM_CLASS(g_ccd_class, g_ccd); +MODULE_VERSION(geom_ccd, 0); Index: user/markj/netdump/sys/geom/geom_fox.c =================================================================== --- user/markj/netdump/sys/geom/geom_fox.c (revision 332407) +++ user/markj/netdump/sys/geom/geom_fox.c (revision 332408) @@ -1,487 +1,488 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 2003 Poul-Henning Kamp * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The names of the authors may not be used to endorse or promote * products derived from this software without specific prior written * permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /* This is a GEOM module for handling path selection for multi-path * storage devices. It is named "fox" because it, like they, prefer * to have multiple exits to choose from. * */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define FOX_CLASS_NAME "FOX" #define FOX_MAGIC "GEOM::FOX" static int g_fox_once; FEATURE(geom_fox, "GEOM FOX redundant path mitigation support"); struct g_fox_softc { off_t mediasize; u_int sectorsize; TAILQ_HEAD(, bio) queue; struct mtx lock; u_char magic[16]; struct g_consumer *path; struct g_consumer *opath; int waiting; int cr, cw, ce; }; /* * This function is called whenever we need to select a new path. 
*/ static void g_fox_select_path(void *arg, int flag) { struct g_geom *gp; struct g_fox_softc *sc; struct g_consumer *cp1; struct bio *bp; int error; g_topology_assert(); if (flag == EV_CANCEL) return; gp = arg; sc = gp->softc; if (sc->opath != NULL) { /* * First, close the old path entirely. */ printf("Closing old path (%s) on fox (%s)\n", sc->opath->provider->name, gp->name); cp1 = LIST_NEXT(sc->opath, consumer); g_access(sc->opath, -sc->cr, -sc->cw, -(sc->ce + 1)); /* * The attempt to reopen it with a exclusive count */ error = g_access(sc->opath, 0, 0, 1); if (error) { /* * Ok, ditch this consumer, we can't use it. */ printf("Drop old path (%s) on fox (%s)\n", sc->opath->provider->name, gp->name); g_detach(sc->opath); g_destroy_consumer(sc->opath); if (LIST_EMPTY(&gp->consumer)) { /* No consumers left */ g_wither_geom(gp, ENXIO); for (;;) { bp = TAILQ_FIRST(&sc->queue); if (bp == NULL) break; TAILQ_REMOVE(&sc->queue, bp, bio_queue); bp->bio_error = ENXIO; g_std_done(bp); } return; } } else { printf("Got e-bit on old path (%s) on fox (%s)\n", sc->opath->provider->name, gp->name); } sc->opath = NULL; } else { cp1 = LIST_FIRST(&gp->consumer); } if (cp1 == NULL) cp1 = LIST_FIRST(&gp->consumer); printf("Open new path (%s) on fox (%s)\n", cp1->provider->name, gp->name); error = g_access(cp1, sc->cr, sc->cw, sc->ce); if (error) { /* * If we failed, we take another trip through here */ printf("Open new path (%s) on fox (%s) failed, reselect.\n", cp1->provider->name, gp->name); sc->opath = cp1; g_post_event(g_fox_select_path, gp, M_WAITOK, gp, NULL); } else { printf("Open new path (%s) on fox (%s) succeeded\n", cp1->provider->name, gp->name); mtx_lock(&sc->lock); sc->path = cp1; sc->waiting = 0; for (;;) { bp = TAILQ_FIRST(&sc->queue); if (bp == NULL) break; TAILQ_REMOVE(&sc->queue, bp, bio_queue); g_io_request(bp, sc->path); } mtx_unlock(&sc->lock); } } static void g_fox_orphan(struct g_consumer *cp) { struct g_geom *gp; struct g_fox_softc *sc; int error, mark; g_topology_assert(); gp = cp->geom; sc = gp->softc; printf("Removing path (%s) from fox (%s)\n", cp->provider->name, gp->name); mtx_lock(&sc->lock); if (cp == sc->path) { sc->opath = NULL; sc->path = NULL; sc->waiting = 1; mark = 1; } else { mark = 0; } mtx_unlock(&sc->lock); g_access(cp, -cp->acr, -cp->acw, -cp->ace); error = cp->provider->error; g_detach(cp); g_destroy_consumer(cp); if (!LIST_EMPTY(&gp->consumer)) { if (mark) g_post_event(g_fox_select_path, gp, M_WAITOK, gp, NULL); return; } mtx_destroy(&sc->lock); g_free(gp->softc); gp->softc = NULL; g_wither_geom(gp, ENXIO); } static void g_fox_done(struct bio *bp) { struct g_geom *gp; struct g_fox_softc *sc; int error; if (bp->bio_error == 0) { g_std_done(bp); return; } gp = bp->bio_from->geom; sc = gp->softc; if (bp->bio_from != sc->path) { g_io_request(bp, sc->path); return; } mtx_lock(&sc->lock); sc->opath = sc->path; sc->path = NULL; error = g_post_event(g_fox_select_path, gp, M_NOWAIT, gp, NULL); if (error) { bp->bio_error = ENOMEM; g_std_done(bp); } else { sc->waiting = 1; TAILQ_INSERT_TAIL(&sc->queue, bp, bio_queue); } mtx_unlock(&sc->lock); } static void g_fox_start(struct bio *bp) { struct g_geom *gp; struct bio *bp2; struct g_fox_softc *sc; int error; gp = bp->bio_to->geom; sc = gp->softc; if (sc == NULL) { g_io_deliver(bp, ENXIO); return; } switch(bp->bio_cmd) { case BIO_READ: case BIO_WRITE: case BIO_DELETE: bp2 = g_clone_bio(bp); if (bp2 == NULL) { g_io_deliver(bp, ENOMEM); break; } bp2->bio_offset += sc->sectorsize; bp2->bio_done = g_fox_done; mtx_lock(&sc->lock); if 
(sc->path == NULL || !TAILQ_EMPTY(&sc->queue)) { if (sc->waiting == 0) { error = g_post_event(g_fox_select_path, gp, M_NOWAIT, gp, NULL); if (error) { g_destroy_bio(bp2); bp2 = NULL; g_io_deliver(bp, error); } else { sc->waiting = 1; } } if (bp2 != NULL) TAILQ_INSERT_TAIL(&sc->queue, bp2, bio_queue); } else { g_io_request(bp2, sc->path); } mtx_unlock(&sc->lock); break; default: g_io_deliver(bp, EOPNOTSUPP); break; } return; } static int g_fox_access(struct g_provider *pp, int dr, int dw, int de) { struct g_geom *gp; struct g_fox_softc *sc; struct g_consumer *cp1; int error; g_topology_assert(); gp = pp->geom; sc = gp->softc; if (sc == NULL) { if (dr <= 0 && dw <= 0 && de <= 0) return (0); else return (ENXIO); } if (sc->cr == 0 && sc->cw == 0 && sc->ce == 0) { /* * First open, open all consumers with an exclusive bit */ error = 0; LIST_FOREACH(cp1, &gp->consumer, consumer) { error = g_access(cp1, 0, 0, 1); if (error) { printf("FOX: access(%s,0,0,1) = %d\n", cp1->provider->name, error); break; } } if (error) { LIST_FOREACH(cp1, &gp->consumer, consumer) { if (cp1->ace) g_access(cp1, 0, 0, -1); } return (error); } } if (sc->path == NULL) g_fox_select_path(gp, 0); if (sc->path == NULL) error = ENXIO; else error = g_access(sc->path, dr, dw, de); if (error == 0) { sc->cr += dr; sc->cw += dw; sc->ce += de; if (sc->cr == 0 && sc->cw == 0 && sc->ce == 0) { /* * Last close, remove e-bit on all consumers */ LIST_FOREACH(cp1, &gp->consumer, consumer) g_access(cp1, 0, 0, -1); } } return (error); } static struct g_geom * g_fox_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_geom *gp, *gp2; struct g_provider *pp2; struct g_consumer *cp, *cp2; struct g_fox_softc *sc, *sc2; int error; u_int sectorsize; u_char *buf; g_trace(G_T_TOPOLOGY, "fox_taste(%s, %s)", mp->name, pp->name); g_topology_assert(); if (!strcmp(pp->geom->class->name, mp->name)) return (NULL); gp = g_new_geomf(mp, "%s.fox", pp->name); gp->softc = g_malloc(sizeof(struct g_fox_softc), M_WAITOK | M_ZERO); sc = gp->softc; cp = g_new_consumer(gp); g_attach(cp, pp); error = g_access(cp, 1, 0, 0); if (error) { g_free(sc); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); return(NULL); } do { sectorsize = cp->provider->sectorsize; g_topology_unlock(); buf = g_read_data(cp, 0, sectorsize, NULL); g_topology_lock(); if (buf == NULL) break; if (memcmp(buf, FOX_MAGIC, strlen(FOX_MAGIC))) break; /* * First we need to see if this a new path for an existing fox. */ LIST_FOREACH(gp2, &mp->geom, geom) { sc2 = gp2->softc; if (sc2 == NULL) continue; if (memcmp(buf + 16, sc2->magic, sizeof sc2->magic)) continue; break; } if (gp2 != NULL) { /* * It was. Create a new consumer for that fox, * attach it, and if the fox is open, open this * path with an exclusive count of one. */ printf("Adding path (%s) to fox (%s)\n", pp->name, gp2->name); cp2 = g_new_consumer(gp2); g_attach(cp2, pp); pp2 = LIST_FIRST(&gp2->provider); if (pp2->acr > 0 || pp2->acw > 0 || pp2->ace > 0) { error = g_access(cp2, 0, 0, 1); if (error) { /* * This is bad, or more likely, * the user is doing something stupid */ printf( "WARNING: New path (%s) to fox(%s) not added: %s\n%s", cp2->provider->name, gp2->name, "Could not get exclusive bit.", "WARNING: This indicates a risk of data inconsistency." 
); g_detach(cp2); g_destroy_consumer(cp2); } } break; } printf("Creating new fox (%s)\n", pp->name); sc->path = cp; memcpy(sc->magic, buf + 16, sizeof sc->magic); pp2 = g_new_providerf(gp, "%s", gp->name); pp2->mediasize = sc->mediasize = pp->mediasize - pp->sectorsize; pp2->sectorsize = sc->sectorsize = pp->sectorsize; printf("fox %s lock %p\n", gp->name, &sc->lock); mtx_init(&sc->lock, "fox queue", NULL, MTX_DEF); TAILQ_INIT(&sc->queue); g_error_provider(pp2, 0); } while (0); if (buf != NULL) g_free(buf); g_access(cp, -1, 0, 0); if (!LIST_EMPTY(&gp->provider)) { if (!g_fox_once) { g_fox_once = 1; printf( "WARNING: geom_fox (geom %s) is deprecated, " "use gmultipath instead.\n", gp->name); } return (gp); } g_free(gp->softc); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); return (NULL); } static int g_fox_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp) { struct g_fox_softc *sc; g_topology_assert(); sc = gp->softc; mtx_destroy(&sc->lock); g_free(gp->softc); gp->softc = NULL; g_wither_geom(gp, ENXIO); return (0); } static struct g_class g_fox_class = { .name = FOX_CLASS_NAME, .version = G_VERSION, .taste = g_fox_taste, .destroy_geom = g_fox_destroy_geom, .start = g_fox_start, .spoiled = g_fox_orphan, .orphan = g_fox_orphan, .access= g_fox_access, }; DECLARE_GEOM_CLASS(g_fox_class, g_fox); +MODULE_VERSION(geom_fox, 0); Index: user/markj/netdump/sys/geom/geom_map.c =================================================================== --- user/markj/netdump/sys/geom/geom_map.c (revision 332407) +++ user/markj/netdump/sys/geom/geom_map.c (revision 332408) @@ -1,409 +1,410 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2010-2011 Aleksandr Rybalko * based on geom_redboot.c * Copyright (c) 2009 Sam Leffler, Errno Consulting * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer, * without modification. * 2. Redistributions in binary form must reproduce at minimum a disclaimer * similar to the "NO WARRANTY" disclaimer below ("Disclaimer") and any * redistribution must be conditioned upon including a substantially * similar Disclaimer requirement for further binary redistribution. * * NO WARRANTY * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT * LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTIBILITY * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL * THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF * THE POSSIBILITY OF SUCH DAMAGES. 
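The substantive change in this revision is the MODULE_VERSION declaration appended to each class, like the geom_fox one above; geom_bsd, geom_ccd, geom_map and geom_mbr get the same treatment. The general form, shown for a hypothetical class:

    #include <sys/module.h>

    /*
     * Advertise version 0 of this module so that other kernel modules
     * and tools can discover it, e.g. as the target of a MODULE_DEPEND().
     */
    MODULE_VERSION(geom_example, 0);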
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define MAP_CLASS_NAME "MAP" #define MAP_MAXSLICE 64 #define MAP_MAX_MARKER_LEN 64 struct g_map_softc { off_t offset[MAP_MAXSLICE]; /* offset in flash */ off_t size[MAP_MAXSLICE]; /* image size in bytes */ off_t entry[MAP_MAXSLICE]; off_t dsize[MAP_MAXSLICE]; uint8_t readonly[MAP_MAXSLICE]; g_access_t *parent_access; }; static int g_map_access(struct g_provider *pp, int dread, int dwrite, int dexcl) { struct g_geom *gp; struct g_slicer *gsp; struct g_map_softc *sc; gp = pp->geom; gsp = gp->softc; sc = gsp->softc; if (dwrite > 0 && sc->readonly[pp->index]) return (EPERM); return (sc->parent_access(pp, dread, dwrite, dexcl)); } static int g_map_start(struct bio *bp) { struct g_provider *pp; struct g_geom *gp; struct g_map_softc *sc; struct g_slicer *gsp; int idx; pp = bp->bio_to; idx = pp->index; gp = pp->geom; gsp = gp->softc; sc = gsp->softc; if (bp->bio_cmd == BIO_GETATTR) { if (g_handleattr_int(bp, MAP_CLASS_NAME "::entry", sc->entry[idx])) { return (1); } if (g_handleattr_int(bp, MAP_CLASS_NAME "::dsize", sc->dsize[idx])) { return (1); } } return (0); } static void g_map_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp __unused, struct g_provider *pp) { struct g_map_softc *sc; struct g_slicer *gsp; gsp = gp->softc; sc = gsp->softc; g_slice_dumpconf(sb, indent, gp, cp, pp); if (pp != NULL) { if (indent == NULL) { sbuf_printf(sb, " entry %jd", (intmax_t)sc->entry[pp->index]); sbuf_printf(sb, " dsize %jd", (intmax_t)sc->dsize[pp->index]); } else { sbuf_printf(sb, "%s%jd\n", indent, (intmax_t)sc->entry[pp->index]); sbuf_printf(sb, "%s%jd\n", indent, (intmax_t)sc->dsize[pp->index]); } } } static int find_marker(struct g_consumer *cp, const char *line, off_t *offset) { off_t search_start, search_offset, search_step; size_t sectorsize; uint8_t *buf; char *op, key[MAP_MAX_MARKER_LEN], search_key[MAP_MAX_MARKER_LEN]; int ret, c; /* Try to convert to a number first */ *offset = strtouq(line, &op, 0); if (*op == '\0') return (0); bzero(search_key, MAP_MAX_MARKER_LEN); sectorsize = cp->provider->sectorsize; #ifdef __LP64__ ret = sscanf(line, "search:%li:%li:%63c", &search_start, &search_step, search_key); #else ret = sscanf(line, "search:%qi:%qi:%63c", &search_start, &search_step, search_key); #endif if (ret < 3) return (1); if (bootverbose) { printf("MAP: search %s for key \"%s\" from 0x%jx, step 0x%jx\n", cp->geom->name, search_key, (intmax_t)search_start, (intmax_t)search_step); } /* Error out if search_key is empty */ if (strlen(search_key) < 1) return (1); /* sscanf succeeded, so start the marker search */ for (search_offset = search_start; search_offset < cp->provider->mediasize; search_offset += search_step) { g_topology_unlock(); buf = g_read_data(cp, rounddown(search_offset, sectorsize), roundup(strlen(search_key), sectorsize), NULL); g_topology_lock(); /* * Don't bother doing the rest if buf == NULL; e.g., dereferencing * it to assemble 'key'. */ if (buf == NULL) continue; /* Wildcards: replace each '.' with the corresponding byte from the data */ /* TODO: add support for wildcard escape '\.'
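The wildcard handling that follows copies the key and patches each '.' before a memcmp(); the same test can be written as a direct comparison. A sketch, where marker_match() is a hypothetical helper; like the code below, it has no escape syntax yet, which is what the TODO is about:

    static int
    marker_match(const uint8_t *data, const char *key)
    {
            size_t i;

            for (i = 0; key[i] != '\0'; i++) {
                    if (key[i] == '.')
                            continue;       /* wildcard: any byte matches */
                    if ((uint8_t)key[i] != data[i])
                            return (0);
            }
            return (1);                     /* all literal bytes matched */
    }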
*/ strncpy(key, search_key, MAP_MAX_MARKER_LEN); for (c = 0; c < MAP_MAX_MARKER_LEN && key[c]; c++) { if (key[c] == '.') { key[c] = ((char *)(buf + (search_offset % sectorsize)))[c]; } } /* Assume buf != NULL here */ if (memcmp(buf + search_offset % sectorsize, key, strlen(search_key)) == 0) { g_free(buf); /* Marker found, so return its offset */ *offset = search_offset; return (0); } g_free(buf); } /* Marker not found */ return (1); } static int g_map_parse_part(struct g_class *mp, struct g_provider *pp, struct g_consumer *cp, struct g_geom *gp, struct g_map_softc *sc, int i) { const char *value, *name; char *op; off_t start, end, offset, size, dsize; int readonly, ret; /* hint.map.0.at="cfid0" - bind to cfid0 media */ if (resource_string_value("map", i, "at", &value) != 0) return (1); /* Check if this is the correct provider */ if (strcmp(pp->name, value) != 0) return (1); /* * hint.map.0.name="uboot" - the name of the partition; it will be * available as "/dev/map/uboot" */ if (resource_string_value("map", i, "name", &name) != 0) { if (bootverbose) printf("MAP: hint.map.%d has no name\n", i); return (1); } /* * hint.map.0.start="0x00010000" - the partition starts at 0x00010000, * or hint.map.0.start="search:0x00010000:0x200:marker text" - * search for the text "marker text", beginning at 0x10000 in steps * of 0x200, until the marker is found or the end of the media is * reached */ if (resource_string_value("map", i, "start", &value) != 0) { if (bootverbose) printf("MAP: \"%s\" has no start value\n", name); return (1); } if (find_marker(cp, value, &start) != 0) { if (bootverbose) { printf("MAP: \"%s\" can't parse/use start value\n", name); } return (1); } /* like "start" */ if (resource_string_value("map", i, "end", &value) != 0) { if (bootverbose) printf("MAP: \"%s\" has no end value\n", name); return (1); } if (find_marker(cp, value, &end) != 0) { if (bootverbose) { printf("MAP: \"%s\" can't parse/use end value\n", name); } return (1); } /* The readonly variable is optional; it disables write access */ if (resource_int_value("map", i, "readonly", &readonly) != 0) readonly = 0; /* offset of the partition data, from the beginning of the partition */ if (resource_string_value("map", i, "offset", &value) == 0) { offset = strtouq(value, &op, 0); if (*op != '\0') { if (bootverbose) { printf("MAP: \"%s\" can't parse offset\n", name); } return (1); } } else { offset = 0; } /* partition data size */ if (resource_string_value("map", i, "dsize", &value) == 0) { dsize = strtouq(value, &op, 0); if (*op != '\0') { if (bootverbose) { printf("MAP: \"%s\" can't parse dsize\n", name); } return (1); } } else { dsize = 0; } size = end - start; if (dsize == 0) dsize = size - offset; /* "end" before "start" means no usable map entry - so move on to the next */ if (end < start) { if (bootverbose) { printf("MAP: \"%s\", \"end\" less than " "\"start\"\n", name); } return (1); } if (offset + dsize > size) { if (bootverbose) { printf("MAP: \"%s\", \"dsize\" bigger than " "partition - offset\n", name); } return (1); } ret = g_slice_config(gp, i, G_SLICE_CONFIG_SET, start + offset, dsize, cp->provider->sectorsize, "map/%s", name); if (ret != 0) { if (bootverbose) { printf("MAP: g_slice_config returns %d for \"%s\"\n", ret, name); } return (1); } if (bootverbose) { printf("MAP: %s: %jxx%jx, data=%jxx%jx " "\"/dev/map/%s\"\n", cp->geom->name, (intmax_t)start, (intmax_t)size, (intmax_t)offset, (intmax_t)dsize, name); } sc->offset[i] = start; sc->size[i] = size; sc->entry[i] = offset; sc->dsize[i] = dsize; sc->readonly[i] = readonly ?
1 : 0; return (0); } static struct g_geom * g_map_taste(struct g_class *mp, struct g_provider *pp, int insist __unused) { struct g_map_softc *sc; struct g_consumer *cp; struct g_geom *gp; int i; g_trace(G_T_TOPOLOGY, "map_taste(%s,%s)", mp->name, pp->name); g_topology_assert(); if (strcmp(pp->geom->class->name, MAP_CLASS_NAME) == 0) return (NULL); gp = g_slice_new(mp, MAP_MAXSLICE, pp, &cp, &sc, sizeof(*sc), g_map_start); if (gp == NULL) return (NULL); /* interpose our access method */ sc->parent_access = gp->access; gp->access = g_map_access; for (i = 0; i < MAP_MAXSLICE; i++) g_map_parse_part(mp, pp, cp, gp, sc, i); g_access(cp, -1, 0, 0); if (LIST_EMPTY(&gp->provider)) { if (bootverbose) printf("MAP: No valid partition found at %s\n", pp->name); g_slice_spoiled(cp); return (NULL); } return (gp); } static void g_map_config(struct gctl_req *req, struct g_class *mp, const char *verb) { struct g_geom *gp; g_topology_assert(); gp = gctl_get_geom(req, mp, "geom"); if (gp == NULL) return; gctl_error(req, "Unknown verb"); } static struct g_class g_map_class = { .name = MAP_CLASS_NAME, .version = G_VERSION, .taste = g_map_taste, .dumpconf = g_map_dumpconf, .ctlreq = g_map_config, }; DECLARE_GEOM_CLASS(g_map_class, g_map); +MODULE_VERSION(geom_map, 0); Index: user/markj/netdump/sys/geom/geom_mbr.c =================================================================== --- user/markj/netdump/sys/geom/geom_mbr.c (revision 332407) +++ user/markj/netdump/sys/geom/geom_mbr.c (revision 332408) @@ -1,530 +1,531 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2002 Poul-Henning Kamp * Copyright (c) 2002 Networks Associates Technology, Inc. * All rights reserved. * * This software was developed for the FreeBSD Project by Poul-Henning Kamp * and NAI Labs, the Security Research Division of Network Associates, Inc. * under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the * DARPA CHATS research program. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
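g_map_taste() above layers policy onto the generic slicer by interposing its access method, an idiom that generalizes to any per-slice veto. A sketch with illustrative names (example_access and example_softc stand in for the g_map versions):

    /* At taste time: save the method g_slice_new() installed, then
     * substitute our own. */
    sc->parent_access = gp->access;
    gp->access = example_access;

    static int
    example_access(struct g_provider *pp, int dr, int dw, int de)
    {
            struct g_slicer *gsp = pp->geom->softc;
            struct example_softc *sc = gsp->softc;

            if (dw > 0 && sc->readonly[pp->index])
                    return (EPERM);         /* veto writers on RO slices */
            return (sc->parent_access(pp, dr, dw, de));     /* delegate */
    }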
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_mbr, "GEOM DOS/MBR partitioning support"); #define MBR_CLASS_NAME "MBR" #define MBREXT_CLASS_NAME "MBREXT" static int g_mbr_once = 0; static struct dos_partition historical_bogus_partition_table[NDOSPART] = { { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, }, { 0x80, 0, 1, 0, DOSPTYP_386BSD, 255, 255, 255, 0, 50000, }, }; static struct dos_partition historical_bogus_partition_table_fixed[NDOSPART] = { { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, }, { 0x80, 0, 1, 0, DOSPTYP_386BSD, 254, 255, 255, 0, 50000, }, }; static void g_mbr_print(int i, struct dos_partition *dp) { printf("[%d] f:%02x typ:%d", i, dp->dp_flag, dp->dp_typ); printf(" s(CHS):%d/%d/%d", DPCYL(dp->dp_scyl, dp->dp_ssect), dp->dp_shd, DPSECT(dp->dp_ssect)); printf(" e(CHS):%d/%d/%d", DPCYL(dp->dp_ecyl, dp->dp_esect), dp->dp_ehd, DPSECT(dp->dp_esect)); printf(" s:%d l:%d\n", dp->dp_start, dp->dp_size); } struct g_mbr_softc { int type [NDOSPART]; u_int sectorsize; u_char sec0[512]; u_char slicesum[16]; }; /* * XXX: Add gctl_req arg and give good error msgs. * XXX: Check that length argument does not bring boot code inside any slice. */ static int g_mbr_modify(struct g_geom *gp, struct g_mbr_softc *ms, u_char *sec0, int len __unused) { int i, error; off_t l[NDOSPART]; struct dos_partition ndp[NDOSPART], *dp; MD5_CTX md5sum; g_topology_assert(); if (sec0[0x1fe] != 0x55 && sec0[0x1ff] != 0xaa) return (EBUSY); dp = ndp; for (i = 0; i < NDOSPART; i++) { dos_partition_dec( sec0 + DOSPARTOFF + i * sizeof(struct dos_partition), dp + i); } if ((!bcmp(dp, historical_bogus_partition_table, sizeof historical_bogus_partition_table)) || (!bcmp(dp, historical_bogus_partition_table_fixed, sizeof historical_bogus_partition_table_fixed))) { /* * We will not allow people to write these from "the inside", * Since properly selfdestructing takes too much code. If * people really want to do this, they cannot have any * providers of this geom open, and in that case they can just * as easily overwrite the MBR in the parent device. */ return(EBUSY); } for (i = 0; i < NDOSPART; i++) { /* * A Protective MBR (PMBR) has a single partition of * type 0xEE spanning the whole disk. Such a MBR * protects a GPT on the disk from MBR tools that * don't know anything about GPT. We're interpreting * it a bit more loosely: any partition of type 0xEE * is to be skipped as it doesn't contain any data * that we should care about. We still allow other * partitions to be present in the MBR. A PMBR will * be handled correctly anyway. */ if (dp[i].dp_typ == DOSPTYP_PMBR) l[i] = 0; else if (dp[i].dp_flag != 0 && dp[i].dp_flag != 0x80) l[i] = 0; else if (dp[i].dp_typ == 0) l[i] = 0; else l[i] = (off_t)dp[i].dp_size * ms->sectorsize; error = g_slice_config(gp, i, G_SLICE_CONFIG_CHECK, (off_t)dp[i].dp_start * ms->sectorsize, l[i], ms->sectorsize, "%ss%d", gp->name, 1 + i); if (error) return (error); } for (i = 0; i < NDOSPART; i++) { ms->type[i] = dp[i].dp_typ; g_slice_config(gp, i, G_SLICE_CONFIG_SET, (off_t)dp[i].dp_start * ms->sectorsize, l[i], ms->sectorsize, "%ss%d", gp->name, 1 + i); } bcopy(sec0, ms->sec0, 512); /* * Calculate MD5 from the first sector and use it for avoiding * recursive slices creation. 
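g_mbr_modify() above validates in two passes: every slice is first verified with G_SLICE_CONFIG_CHECK, and only when all checks pass is anything committed with G_SLICE_CONFIG_SET, so a rejected table leaves the geom untouched. The skeleton of the pattern, with off[] and len[] standing in for the decoded partition values:

    for (i = 0; i < NDOSPART; i++) {
            error = g_slice_config(gp, i, G_SLICE_CONFIG_CHECK, off[i],
                len[i], secsize, "%ss%d", gp->name, i + 1);
            if (error != 0)
                    return (error); /* abort before modifying anything */
    }
    for (i = 0; i < NDOSPART; i++)
            g_slice_config(gp, i, G_SLICE_CONFIG_SET, off[i], len[i],
                secsize, "%ss%d", gp->name, i + 1);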
*/ MD5Init(&md5sum); MD5Update(&md5sum, ms->sec0, sizeof(ms->sec0)); MD5Final(ms->slicesum, &md5sum); return (0); } static int g_mbr_ioctl(struct g_provider *pp, u_long cmd, void *data, int fflag, struct thread *td) { struct g_geom *gp; struct g_mbr_softc *ms; struct g_slicer *gsp; struct g_consumer *cp; int error, opened; gp = pp->geom; gsp = gp->softc; ms = gsp->softc; opened = 0; error = 0; switch(cmd) { case DIOCSMBR: { if (!(fflag & FWRITE)) return (EPERM); g_topology_lock(); cp = LIST_FIRST(&gp->consumer); if (cp->acw == 0) { error = g_access(cp, 0, 1, 0); if (error == 0) opened = 1; } if (!error) error = g_mbr_modify(gp, ms, data, 512); if (!error) error = g_write_data(cp, 0, data, 512); if (opened) g_access(cp, 0, -1 , 0); g_topology_unlock(); return(error); } default: return (ENOIOCTL); } } static int g_mbr_start(struct bio *bp) { struct g_provider *pp; struct g_geom *gp; struct g_mbr_softc *mp; struct g_slicer *gsp; int idx; pp = bp->bio_to; idx = pp->index; gp = pp->geom; gsp = gp->softc; mp = gsp->softc; if (bp->bio_cmd == BIO_GETATTR) { if (g_handleattr_int(bp, "MBR::type", mp->type[idx])) return (1); if (g_handleattr_off_t(bp, "MBR::offset", gsp->slices[idx].offset)) return (1); if (g_handleattr(bp, "MBR::slicesum", mp->slicesum, sizeof(mp->slicesum))) return (1); } return (0); } static void g_mbr_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp __unused, struct g_provider *pp) { struct g_mbr_softc *mp; struct g_slicer *gsp; gsp = gp->softc; mp = gsp->softc; g_slice_dumpconf(sb, indent, gp, cp, pp); if (pp != NULL) { if (indent == NULL) sbuf_printf(sb, " ty %d", mp->type[pp->index]); else sbuf_printf(sb, "%s%d\n", indent, mp->type[pp->index]); } } static struct g_geom * g_mbr_taste(struct g_class *mp, struct g_provider *pp, int insist) { struct g_geom *gp; struct g_consumer *cp; int error; struct g_mbr_softc *ms; u_int fwsectors, sectorsize; u_char *buf; u_char hash[16]; MD5_CTX md5sum; g_trace(G_T_TOPOLOGY, "mbr_taste(%s,%s)", mp->name, pp->name); g_topology_assert(); if (!strcmp(pp->geom->class->name, MBR_CLASS_NAME)) return (NULL); gp = g_slice_new(mp, NDOSPART, pp, &cp, &ms, sizeof *ms, g_mbr_start); if (gp == NULL) return (NULL); g_topology_unlock(); do { error = g_getattr("GEOM::fwsectors", cp, &fwsectors); if (error) fwsectors = 17; sectorsize = cp->provider->sectorsize; if (sectorsize < 512) break; ms->sectorsize = sectorsize; buf = g_read_data(cp, 0, sectorsize, NULL); if (buf == NULL) break; /* * Calculate MD5 from the first sector and use it for avoiding * recursive slices creation. 
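Both the DIOCSMBR ioctl above and the "write MBR" gctl verb below share the same "open for write only if nobody else has" dance. A trimmed sketch:

    error = 0;
    opened = 0;
    cp = LIST_FIRST(&gp->consumer);
    if (cp->acw == 0) {                     /* not yet open for writing */
            error = g_access(cp, 0, 1, 0);
            if (error == 0)
                    opened = 1;             /* remember we took the +1 */
    }
    if (error == 0)
            error = g_write_data(cp, 0, data, len);
    if (opened)
            g_access(cp, 0, -1, 0);         /* drop only what we took */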
*/ bcopy(buf, ms->sec0, 512); MD5Init(&md5sum); MD5Update(&md5sum, ms->sec0, sizeof(ms->sec0)); MD5Final(ms->slicesum, &md5sum); error = g_getattr("MBR::slicesum", cp, &hash); if (!error && !bcmp(ms->slicesum, hash, sizeof(hash))) { g_free(buf); break; } g_topology_lock(); g_mbr_modify(gp, ms, buf, 512); g_topology_unlock(); g_free(buf); break; } while (0); g_topology_lock(); g_access(cp, -1, 0, 0); if (LIST_EMPTY(&gp->provider)) { g_slice_spoiled(cp); return (NULL); } if (!g_mbr_once) { g_mbr_once = 1; printf( "WARNING: geom_mbr (geom %s) is deprecated, " "use gpart instead.\n", gp->name); } return (gp); } static void g_mbr_config(struct gctl_req *req, struct g_class *mp, const char *verb) { struct g_geom *gp; struct g_consumer *cp; struct g_mbr_softc *ms; struct g_slicer *gsp; int opened = 0, error = 0; void *data; int len; g_topology_assert(); gp = gctl_get_geom(req, mp, "geom"); if (gp == NULL) return; if (strcmp(verb, "write MBR")) { gctl_error(req, "Unknown verb"); return; } gsp = gp->softc; ms = gsp->softc; data = gctl_get_param(req, "data", &len); if (data == NULL) return; if (len < 512 || (len % 512)) { gctl_error(req, "Wrong request length"); return; } cp = LIST_FIRST(&gp->consumer); if (cp->acw == 0) { error = g_access(cp, 0, 1, 0); if (error == 0) opened = 1; } if (!error) error = g_mbr_modify(gp, ms, data, len); if (error) gctl_error(req, "conflict with open slices"); if (!error) error = g_write_data(cp, 0, data, len); if (error) gctl_error(req, "sector zero write failed"); if (opened) g_access(cp, 0, -1 , 0); return; } static struct g_class g_mbr_class = { .name = MBR_CLASS_NAME, .version = G_VERSION, .taste = g_mbr_taste, .dumpconf = g_mbr_dumpconf, .ctlreq = g_mbr_config, .ioctl = g_mbr_ioctl, }; DECLARE_GEOM_CLASS(g_mbr_class, g_mbr); #define NDOSEXTPART 32 struct g_mbrext_softc { int type [NDOSEXTPART]; }; static int g_mbrext_start(struct bio *bp) { struct g_provider *pp; struct g_geom *gp; struct g_mbrext_softc *mp; struct g_slicer *gsp; int idx; pp = bp->bio_to; idx = pp->index; gp = pp->geom; gsp = gp->softc; mp = gsp->softc; if (bp->bio_cmd == BIO_GETATTR) { if (g_handleattr_int(bp, "MBR::type", mp->type[idx])) return (1); } return (0); } static void g_mbrext_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp __unused, struct g_provider *pp) { struct g_mbrext_softc *mp; struct g_slicer *gsp; g_slice_dumpconf(sb, indent, gp, cp, pp); gsp = gp->softc; mp = gsp->softc; if (pp != NULL) { if (indent == NULL) sbuf_printf(sb, " ty %d", mp->type[pp->index]); else sbuf_printf(sb, "%s%d\n", indent, mp->type[pp->index]); } } static struct g_geom * g_mbrext_taste(struct g_class *mp, struct g_provider *pp, int insist __unused) { struct g_geom *gp; struct g_consumer *cp; int error, i, slice; struct g_mbrext_softc *ms; off_t off; u_char *buf; struct dos_partition dp[4]; u_int fwsectors, sectorsize; g_trace(G_T_TOPOLOGY, "g_mbrext_taste(%s,%s)", mp->name, pp->name); g_topology_assert(); if (strcmp(pp->geom->class->name, MBR_CLASS_NAME)) return (NULL); gp = g_slice_new(mp, NDOSEXTPART, pp, &cp, &ms, sizeof *ms, g_mbrext_start); if (gp == NULL) return (NULL); g_topology_unlock(); off = 0; slice = 0; do { error = g_getattr("MBR::type", cp, &i); if (error || (i != DOSPTYP_EXT && i != DOSPTYP_EXTLBA)) break; error = g_getattr("GEOM::fwsectors", cp, &fwsectors); if (error) fwsectors = 17; sectorsize = cp->provider->sectorsize; if (sectorsize != 512) break; for (;;) { buf = g_read_data(cp, off, sectorsize, NULL); if (buf == NULL) break; if (buf[0x1fe] != 
0x55 && buf[0x1ff] != 0xaa) { g_free(buf); break; } for (i = 0; i < NDOSPART; i++) dos_partition_dec( buf + DOSPARTOFF + i * sizeof(struct dos_partition), dp + i); g_free(buf); if (0 && bootverbose) { printf("MBREXT Slice %d on %s:\n", slice + 5, gp->name); g_mbr_print(0, dp); g_mbr_print(1, dp + 1); } if ((dp[0].dp_flag & 0x7f) == 0 && dp[0].dp_size != 0 && dp[0].dp_typ != 0) { g_topology_lock(); g_slice_config(gp, slice, G_SLICE_CONFIG_SET, (((off_t)dp[0].dp_start) << 9ULL) + off, ((off_t)dp[0].dp_size) << 9ULL, sectorsize, "%*.*s%d", (int)strlen(gp->name) - 1, (int)strlen(gp->name) - 1, gp->name, slice + 5); g_topology_unlock(); ms->type[slice] = dp[0].dp_typ; slice++; } if (dp[1].dp_flag != 0) break; if (dp[1].dp_typ != DOSPTYP_EXT && dp[1].dp_typ != DOSPTYP_EXTLBA) break; if (dp[1].dp_size == 0) break; off = ((off_t)dp[1].dp_start) << 9ULL; } break; } while (0); g_topology_lock(); g_access(cp, -1, 0, 0); if (LIST_EMPTY(&gp->provider)) { g_slice_spoiled(cp); return (NULL); } return (gp); } static struct g_class g_mbrext_class = { .name = MBREXT_CLASS_NAME, .version = G_VERSION, .taste = g_mbrext_taste, .dumpconf = g_mbrext_dumpconf, }; DECLARE_GEOM_CLASS(g_mbrext_class, g_mbrext); +MODULE_VERSION(geom_mbr, 0); Index: user/markj/netdump/sys/geom/geom_redboot.c =================================================================== --- user/markj/netdump/sys/geom/geom_redboot.c (revision 332407) +++ user/markj/netdump/sys/geom/geom_redboot.c (revision 332408) @@ -1,359 +1,360 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2009 Sam Leffler, Errno Consulting * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer, * without modification. * 2. Redistributions in binary form must reproduce at minimum a disclaimer * similar to the "NO WARRANTY" disclaimer below ("Disclaimer") and any * redistribution must be conditioned upon including a substantially * similar Disclaimer requirement for further binary redistribution. * * NO WARRANTY * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT * LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTIBILITY * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL * THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF * THE POSSIBILITY OF SUCH DAMAGES. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define REDBOOT_CLASS_NAME "REDBOOT" struct fis_image_desc { uint8_t name[16]; /* null-terminated name */ uint32_t offset; /* offset in flash */ uint32_t addr; /* address in memory */ uint32_t size; /* image size in bytes */ uint32_t entry; /* offset in image for entry point */ uint32_t dsize; /* data size in bytes */ uint8_t pad[256-(16+7*sizeof(uint32_t)+sizeof(void*))]; struct fis_image_desc *next; /* linked list (in memory) */ uint32_t dsum; /* descriptor checksum */ uint32_t fsum; /* checksum over image data */ }; #define FISDIR_NAME "FIS directory" #define REDBCFG_NAME "RedBoot config" #define REDBOOT_NAME "RedBoot" #define REDBOOT_MAXSLICE 64 #define REDBOOT_MAXOFF \ (REDBOOT_MAXSLICE*sizeof(struct fis_image_desc)) struct g_redboot_softc { uint32_t entry[REDBOOT_MAXSLICE]; uint32_t dsize[REDBOOT_MAXSLICE]; uint8_t readonly[REDBOOT_MAXSLICE]; g_access_t *parent_access; }; static void g_redboot_print(int i, struct fis_image_desc *fd) { printf("[%2d] \"%-15.15s\" %08x:%08x", i, fd->name, fd->offset, fd->size); printf(" addr %08x entry %08x\n", fd->addr, fd->entry); printf(" dsize 0x%x dsum 0x%x fsum 0x%x\n", fd->dsize, fd->dsum, fd->fsum); } static int g_redboot_ioctl(struct g_provider *pp, u_long cmd, void *data, int fflag, struct thread *td) { return (ENOIOCTL); } static int g_redboot_access(struct g_provider *pp, int dread, int dwrite, int dexcl) { struct g_geom *gp = pp->geom; struct g_slicer *gsp = gp->softc; struct g_redboot_softc *sc = gsp->softc; if (dwrite > 0 && sc->readonly[pp->index]) return (EPERM); return (sc->parent_access(pp, dread, dwrite, dexcl)); } static int g_redboot_start(struct bio *bp) { struct g_provider *pp; struct g_geom *gp; struct g_redboot_softc *sc; struct g_slicer *gsp; int idx; pp = bp->bio_to; idx = pp->index; gp = pp->geom; gsp = gp->softc; sc = gsp->softc; if (bp->bio_cmd == BIO_GETATTR) { if (g_handleattr_int(bp, REDBOOT_CLASS_NAME "::entry", sc->entry[idx])) return (1); if (g_handleattr_int(bp, REDBOOT_CLASS_NAME "::dsize", sc->dsize[idx])) return (1); } return (0); } static void g_redboot_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp __unused, struct g_provider *pp) { struct g_redboot_softc *sc; struct g_slicer *gsp; gsp = gp->softc; sc = gsp->softc; g_slice_dumpconf(sb, indent, gp, cp, pp); if (pp != NULL) { if (indent == NULL) { sbuf_printf(sb, " entry %d", sc->entry[pp->index]); sbuf_printf(sb, " dsize %d", sc->dsize[pp->index]); } else { sbuf_printf(sb, "%s%d\n", indent, sc->entry[pp->index]); sbuf_printf(sb, "%s%d\n", indent, sc->dsize[pp->index]); } } } #include static int nameok(const char name[16]) { int i; /* descriptor names are null-terminated printable ascii */ for (i = 0; i < 15; i++) if (!isprint(name[i])) break; return (name[i] == '\0'); } static struct fis_image_desc * parse_fis_directory(u_char *buf, size_t bufsize, off_t offset, uint32_t offmask) { #define match(a,b) (bcmp(a, b, sizeof(b)-1) == 0) struct fis_image_desc *fd, *efd; struct fis_image_desc *fisdir, *redbcfg; struct fis_image_desc *head, **tail; int i; fd = (struct fis_image_desc *)buf; efd = fd + (bufsize / sizeof(struct fis_image_desc)); #if 0 /* * Find the start of the FIS table. 
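 * A slot whose name still reads as 0xff has never been written
 * (erased NOR flash reads back as all-ones), which is why the scans
 * here skip such entries.  Note that struct fis_image_desc pads
 * itself to exactly 256 bytes, so a full directory of
 * REDBOOT_MAXSLICE (64) entries spans 64 * 256 = 16KB, the
 * REDBOOT_MAXOFF bound defined above.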
*/ while (fd < efd && fd->name[0] != 0xff) fd++; if (fd == efd) return (NULL); if (bootverbose) printf("RedBoot FIS table starts at 0x%jx\n", offset + fd - (struct fis_image_desc *) buf); #endif /* * Scan forward collecting entries in a list. */ fisdir = redbcfg = NULL; *(tail = &head) = NULL; for (i = 0; fd < efd; i++, fd++) { if (fd->name[0] == 0xff) continue; if (match(fd->name, FISDIR_NAME)) fisdir = fd; else if (match(fd->name, REDBCFG_NAME)) redbcfg = fd; if (nameok(fd->name)) { /* * NB: flash address includes platform mapping; * strip it so we have only a flash offset. */ fd->offset &= offmask; if (bootverbose) g_redboot_print(i, fd); *tail = fd; *(tail = &fd->next) = NULL; } } if (fisdir == NULL) { if (bootverbose) printf("No RedBoot FIS table located at %lu\n", (long) offset); return (NULL); } if (redbcfg != NULL && fisdir->offset + fisdir->size == redbcfg->offset) { /* * Merged FIS/RedBoot config directory. */ if (bootverbose) printf("FIS/RedBoot merged at 0x%jx (not yet)\n", offset + fisdir->offset); /* XXX */ } return head; #undef match } static struct g_geom * g_redboot_taste(struct g_class *mp, struct g_provider *pp, int insist) { struct g_geom *gp; struct g_consumer *cp; struct g_redboot_softc *sc; int error, sectorsize, i; struct fis_image_desc *fd, *head; uint32_t offmask; u_int blksize; /* NB: flash block size stored as stripesize */ u_char *buf; off_t offset; const char *value; char *op; offset = 0; if (resource_string_value("redboot", 0, "fisoffset", &value) == 0) { offset = strtouq(value, &op, 0); if (*op != '\0') { offset = 0; } } g_trace(G_T_TOPOLOGY, "redboot_taste(%s,%s)", mp->name, pp->name); g_topology_assert(); if (!strcmp(pp->geom->class->name, REDBOOT_CLASS_NAME)) return (NULL); /* XXX only taste flash providers */ if (strncmp(pp->name, "cfi", 3) && strncmp(pp->name, "flash/spi", 9)) return (NULL); gp = g_slice_new(mp, REDBOOT_MAXSLICE, pp, &cp, &sc, sizeof(*sc), g_redboot_start); if (gp == NULL) return (NULL); /* interpose our access method */ sc->parent_access = gp->access; gp->access = g_redboot_access; sectorsize = cp->provider->sectorsize; blksize = cp->provider->stripesize; if (powerof2(cp->provider->mediasize)) offmask = cp->provider->mediasize-1; else offmask = 0xffffffff; /* XXX */ if (bootverbose) printf("%s: mediasize %ld secsize %d blksize %d offmask 0x%x\n", __func__, (long) cp->provider->mediasize, sectorsize, blksize, offmask); if (sectorsize < sizeof(struct fis_image_desc) || (sectorsize % sizeof(struct fis_image_desc))) return (NULL); g_topology_unlock(); head = NULL; if(offset == 0) offset = cp->provider->mediasize - blksize; again: buf = g_read_data(cp, offset, blksize, NULL); if (buf != NULL) head = parse_fis_directory(buf, blksize, offset, offmask); if (head == NULL && offset != 0) { if (buf != NULL) g_free(buf); offset = 0; /* check the front */ goto again; } g_topology_lock(); if (head == NULL) { if (buf != NULL) g_free(buf); return NULL; } /* * Craft a slice for each entry. 
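 * Each usable descriptor is turned into a provider named
 * redboot/<name>; an entry called "kernel" on cfi0, for example,
 * surfaces as /dev/redboot/kernel.  The "FIS directory" and "RedBoot"
 * entries are additionally marked read-only below, since overwriting
 * either tends to leave the board unbootable.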
*/ for (fd = head, i = 0; fd != NULL; fd = fd->next) { if (fd->name[0] == '\0') continue; error = g_slice_config(gp, i, G_SLICE_CONFIG_SET, fd->offset, fd->size, sectorsize, "redboot/%s", fd->name); if (error) printf("%s: g_slice_config returns %d for \"%s\"\n", __func__, error, fd->name); sc->entry[i] = fd->entry; sc->dsize[i] = fd->dsize; /* disallow writing hard-to-recover entries */ sc->readonly[i] = (strcmp(fd->name, FISDIR_NAME) == 0) || (strcmp(fd->name, REDBOOT_NAME) == 0); i++; } g_free(buf); g_access(cp, -1, 0, 0); if (LIST_EMPTY(&gp->provider)) { g_slice_spoiled(cp); return (NULL); } return (gp); } static void g_redboot_config(struct gctl_req *req, struct g_class *mp, const char *verb) { struct g_geom *gp; g_topology_assert(); gp = gctl_get_geom(req, mp, "geom"); if (gp == NULL) return; gctl_error(req, "Unknown verb"); } static struct g_class g_redboot_class = { .name = REDBOOT_CLASS_NAME, .version = G_VERSION, .taste = g_redboot_taste, .dumpconf = g_redboot_dumpconf, .ctlreq = g_redboot_config, .ioctl = g_redboot_ioctl, }; DECLARE_GEOM_CLASS(g_redboot_class, g_redboot); +MODULE_VERSION(geom_redboot, 0); Index: user/markj/netdump/sys/geom/geom_sunlabel.c =================================================================== --- user/markj/netdump/sys/geom/geom_sunlabel.c (revision 332407) +++ user/markj/netdump/sys/geom/geom_sunlabel.c (revision 332408) @@ -1,336 +1,337 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 2002 Poul-Henning Kamp * Copyright (c) 2002 Networks Associates Technology, Inc. * All rights reserved. * * This software was developed for the FreeBSD Project by Poul-Henning Kamp * and NAI Labs, the Security Research Division of Network Associates, Inc. * under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the * DARPA CHATS research program. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The names of the authors may not be used to endorse or promote * products derived from this software without specific prior written * permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_sunlabel, "GEOM Sun/Solaris partitioning support"); #define SUNLABEL_CLASS_NAME "SUN" struct g_sunlabel_softc { int sectorsize; int nheads; int nsects; int nalt; u_char labelsum[16]; }; static int g_sunlabel_once = 0; static int g_sunlabel_modify(struct g_geom *gp, struct g_sunlabel_softc *ms, u_char *sec0) { int i, error; u_int u, v, csize; struct sun_disklabel sl; MD5_CTX md5sum; error = sunlabel_dec(sec0, &sl); if (error) return (error); csize = sl.sl_ntracks * sl.sl_nsectors; for (i = 0; i < SUN_NPART; i++) { v = sl.sl_part[i].sdkp_cyloffset; u = sl.sl_part[i].sdkp_nsectors; error = g_slice_config(gp, i, G_SLICE_CONFIG_CHECK, ((off_t)v * csize) << 9ULL, ((off_t)u) << 9ULL, ms->sectorsize, "%s%c", gp->name, 'a' + i); if (error) return (error); } for (i = 0; i < SUN_NPART; i++) { v = sl.sl_part[i].sdkp_cyloffset; u = sl.sl_part[i].sdkp_nsectors; g_slice_config(gp, i, G_SLICE_CONFIG_SET, ((off_t)v * csize) << 9ULL, ((off_t)u) << 9ULL, ms->sectorsize, "%s%c", gp->name, 'a' + i); } ms->nalt = sl.sl_acylinders; ms->nheads = sl.sl_ntracks; ms->nsects = sl.sl_nsectors; /* * Calculate MD5 from the first sector and use it for avoiding * recursive label creation. */ MD5Init(&md5sum); MD5Update(&md5sum, sec0, ms->sectorsize); MD5Final(ms->labelsum, &md5sum); return (0); } static void g_sunlabel_hotwrite(void *arg, int flag) { struct bio *bp; struct g_geom *gp; struct g_slicer *gsp; struct g_slice *gsl; struct g_sunlabel_softc *ms; u_char *p; int error; KASSERT(flag != EV_CANCEL, ("g_sunlabel_hotwrite cancelled")); bp = arg; gp = bp->bio_to->geom; gsp = gp->softc; ms = gsp->softc; gsl = &gsp->slices[bp->bio_to->index]; /* * XXX: For all practical purposes, this would be equivalent to * XXX: "p = (u_char *)bp->bio_data;" because the label is always * XXX: in the first sector and we refuse sectors smaller than the * XXX: label. */ p = (u_char *)bp->bio_data - (bp->bio_offset + gsl->offset); error = g_sunlabel_modify(gp, ms, p); if (error) { g_io_deliver(bp, EPERM); return; } g_slice_finish_hot(bp); } static void g_sunlabel_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp __unused, struct g_provider *pp) { struct g_slicer *gsp; struct g_sunlabel_softc *ms; gsp = gp->softc; ms = gsp->softc; g_slice_dumpconf(sb, indent, gp, cp, pp); if (indent == NULL) { sbuf_printf(sb, " sc %u hd %u alt %u", ms->nsects, ms->nheads, ms->nalt); } } struct g_hh01 { struct g_geom *gp; struct g_sunlabel_softc *ms; u_char *label; int error; }; static void g_sunlabel_callconfig(void *arg, int flag) { struct g_hh01 *hp; hp = arg; hp->error = g_sunlabel_modify(hp->gp, hp->ms, hp->label); if (!hp->error) hp->error = g_write_data(LIST_FIRST(&hp->gp->consumer), 0, hp->label, SUN_SIZE); } /* * NB! curthread is the user process which GCTL'ed.
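 * Two verbs are handled below: "write label" pushes a SUN_SIZE-byte
 * "label" parameter through g_sunlabel_modify() and onto sector zero,
 * while "write bootcode" copies a SUN_BOOTSIZE-byte image onto every
 * slice big enough to hold it, skipping the first SUN_SIZE bytes so
 * the label itself is left alone.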
*/ static void g_sunlabel_config(struct gctl_req *req, struct g_class *mp, const char *verb) { u_char *label; int error, i; struct g_hh01 h0h0; struct g_slicer *gsp; struct g_geom *gp; struct g_consumer *cp; g_topology_assert(); gp = gctl_get_geom(req, mp, "geom"); if (gp == NULL) return; cp = LIST_FIRST(&gp->consumer); gsp = gp->softc; if (!strcmp(verb, "write label")) { label = gctl_get_paraml(req, "label", SUN_SIZE); if (label == NULL) return; h0h0.gp = gp; h0h0.ms = gsp->softc; h0h0.label = label; h0h0.error = -1; /* XXX: Does this reference register with our selfdestruct code? */ error = g_access(cp, 1, 1, 1); if (error) { gctl_error(req, "could not access consumer"); return; } g_sunlabel_callconfig(&h0h0, 0); g_access(cp, -1, -1, -1); } else if (!strcmp(verb, "write bootcode")) { label = gctl_get_paraml(req, "bootcode", SUN_BOOTSIZE); if (label == NULL) return; /* XXX: Does this reference register with our selfdestruct code? */ error = g_access(cp, 1, 1, 1); if (error) { gctl_error(req, "could not access consumer"); return; } for (i = 0; i < SUN_NPART; i++) { if (gsp->slices[i].length <= SUN_BOOTSIZE) continue; g_write_data(cp, gsp->slices[i].offset + SUN_SIZE, label + SUN_SIZE, SUN_BOOTSIZE - SUN_SIZE); } g_access(cp, -1, -1, -1); } else { gctl_error(req, "Unknown verb parameter"); } } static int g_sunlabel_start(struct bio *bp) { struct g_sunlabel_softc *mp; struct g_slicer *gsp; gsp = bp->bio_to->geom->softc; mp = gsp->softc; if (bp->bio_cmd == BIO_GETATTR) { if (g_handleattr(bp, "SUN::labelsum", mp->labelsum, sizeof(mp->labelsum))) return (1); } return (0); } static struct g_geom * g_sunlabel_taste(struct g_class *mp, struct g_provider *pp, int flags) { struct g_geom *gp; struct g_consumer *cp; struct g_sunlabel_softc *ms; struct g_slicer *gsp; u_char *buf, hash[16]; MD5_CTX md5sum; int error; g_trace(G_T_TOPOLOGY, "g_sunlabel_taste(%s,%s)", mp->name, pp->name); g_topology_assert(); if (flags == G_TF_NORMAL && !strcmp(pp->geom->class->name, SUNLABEL_CLASS_NAME)) return (NULL); gp = g_slice_new(mp, 8, pp, &cp, &ms, sizeof *ms, g_sunlabel_start); if (gp == NULL) return (NULL); gsp = gp->softc; do { ms->sectorsize = cp->provider->sectorsize; if (ms->sectorsize < 512) break; g_topology_unlock(); buf = g_read_data(cp, 0, ms->sectorsize, NULL); g_topology_lock(); if (buf == NULL) break; /* * Calculate MD5 from the first sector and use it for avoiding * recursive label creation.
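 * The sum is answered to BIO_GETATTR "SUN::labelsum" queries by
 * g_sunlabel_start() above, so if the provider being tasted already
 * reports the same sum we would be stacking a SUN geom on top of
 * itself; the comparison below bails out in that case.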
*/ MD5Init(&md5sum); MD5Update(&md5sum, buf, ms->sectorsize); MD5Final(ms->labelsum, &md5sum); error = g_getattr("SUN::labelsum", cp, &hash); if (!error && !bcmp(ms->labelsum, hash, sizeof(hash))) { g_free(buf); break; } g_sunlabel_modify(gp, ms, buf); g_free(buf); break; } while (0); g_access(cp, -1, 0, 0); if (LIST_EMPTY(&gp->provider)) { g_slice_spoiled(cp); return (NULL); } g_slice_conf_hot(gp, 0, 0, SUN_SIZE, G_SLICE_HOT_ALLOW, G_SLICE_HOT_DENY, G_SLICE_HOT_CALL); gsp->hot = g_sunlabel_hotwrite; if (!g_sunlabel_once) { g_sunlabel_once = 1; printf( "WARNING: geom_sunlabel (geom %s) is deprecated, " "use gpart instead.\n", gp->name); } return (gp); } static struct g_class g_sunlabel_class = { .name = SUNLABEL_CLASS_NAME, .version = G_VERSION, .taste = g_sunlabel_taste, .ctlreq = g_sunlabel_config, .dumpconf = g_sunlabel_dumpconf, }; DECLARE_GEOM_CLASS(g_sunlabel_class, g_sunlabel); +MODULE_VERSION(geom_sunlabel, 0); Index: user/markj/netdump/sys/geom/geom_vol_ffs.c =================================================================== --- user/markj/netdump/sys/geom/geom_vol_ffs.c (revision 332407) +++ user/markj/netdump/sys/geom/geom_vol_ffs.c (revision 332408) @@ -1,166 +1,167 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2002, 2003 Gordon Tetlow * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_vol, "GEOM support for volume names from UFS superblock"); #define VOL_FFS_CLASS_NAME "VOL_FFS" static int superblocks[] = SBLOCKSEARCH; static int g_vol_ffs_once; struct g_vol_ffs_softc { char * vol; }; static int g_vol_ffs_start(struct bio *bp __unused) { return(0); } static struct g_geom * g_vol_ffs_taste(struct g_class *mp, struct g_provider *pp, int flags) { struct g_geom *gp; struct g_consumer *cp; struct g_vol_ffs_softc *ms; int sb, superblock; struct fs *fs; g_trace(G_T_TOPOLOGY, "vol_taste(%s,%s)", mp->name, pp->name); g_topology_assert(); /* * XXX This is a really weak way to make sure we don't recurse. * Probably ought to use BIO_GETATTR to check for this. 
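 * Refusing to taste providers of our own class stops the direct loop
 * (the vol/<name> provider spans the whole disk, so it carries the
 * same superblock), but it cannot catch the same file system
 * resurfacing through some other class stacked in between.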
*/ if (flags == G_TF_NORMAL && !strcmp(pp->geom->class->name, VOL_FFS_CLASS_NAME)) return (NULL); gp = g_slice_new(mp, 1, pp, &cp, &ms, sizeof(*ms), g_vol_ffs_start); if (gp == NULL) return (NULL); g_topology_unlock(); /* * Walk through the standard places that superblocks hide and look * for UFS magic. If we find magic, then check that the size in the * superblock corresponds to the size of the underlying provider. * Finally, look for a volume label and create an appropriate * provider based on that. */ for (sb=0; (superblock = superblocks[sb]) != -1; sb++) { /* * Take care not to issue an invalid I/O request. The * offset and size of the superblock candidate must be * multiples of the provider's sector size, otherwise an * FFS can't exist on the provider anyway. */ if (superblock % cp->provider->sectorsize != 0 || SBLOCKSIZE % cp->provider->sectorsize != 0) continue; fs = (struct fs *) g_read_data(cp, superblock, SBLOCKSIZE, NULL); if (fs == NULL) continue; /* Check for magic and make sure things are the right size */ if (fs->fs_magic == FS_UFS1_MAGIC) { if (fs->fs_old_size * fs->fs_fsize != (int32_t) pp->mediasize) { g_free(fs); continue; } } else if (fs->fs_magic == FS_UFS2_MAGIC) { if (fs->fs_size * fs->fs_fsize != (int64_t) pp->mediasize) { g_free(fs); continue; } } else { g_free(fs); continue; } /* Check for volume label */ if (fs->fs_volname[0] == '\0') { g_free(fs); continue; } /* XXX We need to check for namespace conflicts. */ /* XXX How do you handle a mirror set? */ /* XXX We don't validate the volume name. */ g_topology_lock(); /* Alright, we have a label and a volume name, reconfig. */ g_slice_config(gp, 0, G_SLICE_CONFIG_SET, (off_t) 0, pp->mediasize, pp->sectorsize, "vol/%s", fs->fs_volname); g_free(fs); g_topology_unlock(); break; } g_topology_lock(); g_access(cp, -1, 0, 0); if (LIST_EMPTY(&gp->provider)) { g_slice_spoiled(cp); return (NULL); } if (!g_vol_ffs_once) { g_vol_ffs_once = 1; printf( "WARNING: geom_vol_ffs (geom %s) is deprecated, " "use glabel instead.\n", gp->name); } return (gp); } static struct g_class g_vol_ffs_class = { .name = VOL_FFS_CLASS_NAME, .version = G_VERSION, .taste = g_vol_ffs_taste, }; DECLARE_GEOM_CLASS(g_vol_ffs_class, g_vol_ffs); +MODULE_VERSION(geom_vol_ffs, 0); Index: user/markj/netdump/sys/geom/journal/g_journal_ufs.c =================================================================== --- user/markj/netdump/sys/geom/journal/g_journal_ufs.c (revision 332407) +++ user/markj/netdump/sys/geom/journal/g_journal_ufs.c (revision 332408) @@ -1,103 +1,104 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2005-2006 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include static int g_journal_ufs_clean(struct mount *mp) { struct ufsmount *ump; struct fs *fs; int flags; ump = VFSTOUFS(mp); fs = ump->um_fs; flags = fs->fs_flags; fs->fs_flags &= ~(FS_UNCLEAN | FS_NEEDSFSCK); ffs_sbupdate(ump, MNT_WAIT, 1); fs->fs_flags = flags; return (0); } static void g_journal_ufs_dirty(struct g_consumer *cp) { struct fs *fs; int error; fs = NULL; if (SBLOCKSIZE % cp->provider->sectorsize != 0 || ffs_sbget(cp, &fs, -1, M_GEOM, g_use_g_read_data) != 0) { GJ_DEBUG(0, "Cannot find superblock to mark file system %s " "as dirty.", cp->provider->name); KASSERT(fs == NULL, ("g_journal_ufs_dirty: non-NULL fs %p\n", fs)); return; } GJ_DEBUG(0, "clean=%d flags=0x%x", fs->fs_clean, fs->fs_flags); fs->fs_clean = 0; fs->fs_flags |= FS_NEEDSFSCK | FS_UNCLEAN; error = ffs_sbput(cp, fs, fs->fs_sblockloc, g_use_g_write_data); g_free(fs->fs_csp); g_free(fs); if (error != 0) { GJ_DEBUG(0, "Cannot mark file system %s as dirty " "(error=%d).", cp->provider->name, error); } else { GJ_DEBUG(0, "File system %s marked as dirty.", cp->provider->name); } } const struct g_journal_desc g_journal_ufs = { .jd_fstype = "ufs", .jd_clean = g_journal_ufs_clean, .jd_dirty = g_journal_ufs_dirty }; MODULE_DEPEND(g_journal, ufs, 1, 1, 1); +MODULE_VERSION(geom_journal, 0); Index: user/markj/netdump/sys/geom/label/g_label.c =================================================================== --- user/markj/netdump/sys/geom/label/g_label.c (revision 332407) +++ user/markj/netdump/sys/geom/label/g_label.c (revision 332408) @@ -1,558 +1,559 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004-2005 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_geom.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_label, "GEOM labeling support"); SYSCTL_DECL(_kern_geom); SYSCTL_NODE(_kern_geom, OID_AUTO, label, CTLFLAG_RW, 0, "GEOM_LABEL stuff"); u_int g_label_debug = 0; SYSCTL_UINT(_kern_geom_label, OID_AUTO, debug, CTLFLAG_RWTUN, &g_label_debug, 0, "Debug level"); static int g_label_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp); static int g_label_destroy(struct g_geom *gp, boolean_t force); static struct g_geom *g_label_taste(struct g_class *mp, struct g_provider *pp, int flags __unused); static void g_label_config(struct gctl_req *req, struct g_class *mp, const char *verb); struct g_class g_label_class = { .name = G_LABEL_CLASS_NAME, .version = G_VERSION, .ctlreq = g_label_config, .taste = g_label_taste, .destroy_geom = g_label_destroy_geom }; /* * To add a new file system where you want to look for volume labels, * you have to: * 1. Add a file g_label_<file system>.c which implements label recognition. * 2. Add an 'extern const struct g_label_desc g_label_<file system>;' into * g_label.h file. * 3. Add an element to the table below '&g_label_<file system>,'. * 4. Add your file to sys/conf/files. * 5. Add your file to sys/modules/geom/geom_label/Makefile. * 6. Add your file system to manual page sbin/geom/class/label/glabel.8. */ const struct g_label_desc *g_labels[] = { &g_label_gpt, &g_label_gpt_uuid, #ifdef GEOM_LABEL &g_label_ufs_id, &g_label_ufs_volume, &g_label_iso9660, &g_label_msdosfs, &g_label_ext2fs, &g_label_reiserfs, &g_label_ntfs, &g_label_disk_ident, #endif NULL }; void g_label_rtrim(char *label, size_t size) { ptrdiff_t i; for (i = size - 1; i >= 0; i--) { if (label[i] == '\0') continue; else if (label[i] == ' ') label[i] = '\0'; else break; } } static int g_label_destroy_geom(struct gctl_req *req __unused, struct g_class *mp, struct g_geom *gp __unused) { /* * XXX: Unloading a class which is using geom_slice:1.56 is currently * XXX: broken, so we deny unloading when we have geoms.
*/ return (EOPNOTSUPP); } static void g_label_orphan(struct g_consumer *cp) { G_LABEL_DEBUG(1, "Label %s removed.", LIST_FIRST(&cp->geom->provider)->name); g_slice_orphan(cp); } static void g_label_spoiled(struct g_consumer *cp) { G_LABEL_DEBUG(1, "Label %s removed.", LIST_FIRST(&cp->geom->provider)->name); g_slice_spoiled(cp); } static void g_label_resize(struct g_consumer *cp) { G_LABEL_DEBUG(1, "Label %s resized.", LIST_FIRST(&cp->geom->provider)->name); g_slice_config(cp->geom, 0, G_SLICE_CONFIG_FORCE, (off_t)0, cp->provider->mediasize, cp->provider->sectorsize, "notused"); } static int g_label_is_name_ok(const char *label) { const char *s; /* Check if the label starts from ../ */ if (strncmp(label, "../", 3) == 0) return (0); /* Check if the label contains /../ */ if (strstr(label, "/../") != NULL) return (0); /* Check if the label ends at ../ */ if ((s = strstr(label, "/..")) != NULL && s[3] == '\0') return (0); return (1); } static void g_label_mangle_name(char *label, size_t size) { struct sbuf *sb; const u_char *c; sb = sbuf_new(NULL, NULL, size, SBUF_FIXEDLEN); for (c = label; *c != '\0'; c++) { if (!isprint(*c) || isspace(*c) || *c =='"' || *c == '%') sbuf_printf(sb, "%%%02X", *c); else sbuf_putc(sb, *c); } if (sbuf_finish(sb) != 0) label[0] = '\0'; else strlcpy(label, sbuf_data(sb), size); sbuf_delete(sb); } static struct g_geom * g_label_create(struct gctl_req *req, struct g_class *mp, struct g_provider *pp, const char *label, const char *dir, off_t mediasize) { struct g_geom *gp; struct g_provider *pp2; struct g_consumer *cp; char name[64]; g_topology_assert(); if (!g_label_is_name_ok(label)) { G_LABEL_DEBUG(0, "%s contains suspicious label, skipping.", pp->name); G_LABEL_DEBUG(1, "%s suspicious label is: %s", pp->name, label); if (req != NULL) gctl_error(req, "Label name %s is invalid.", label); return (NULL); } gp = NULL; cp = NULL; if (snprintf(name, sizeof(name), "%s/%s", dir, label) >= sizeof(name)) { if (req != NULL) gctl_error(req, "Label name %s is too long.", label); return (NULL); } LIST_FOREACH(gp, &mp->geom, geom) { pp2 = LIST_FIRST(&gp->provider); if (pp2 == NULL) continue; if ((pp2->flags & G_PF_ORPHAN) != 0) continue; if (strcmp(pp2->name, name) == 0) { G_LABEL_DEBUG(1, "Label %s(%s) already exists (%s).", label, name, pp->name); if (req != NULL) { gctl_error(req, "Provider %s already exists.", name); } return (NULL); } } gp = g_slice_new(mp, 1, pp, &cp, NULL, 0, NULL); if (gp == NULL) { G_LABEL_DEBUG(0, "Cannot create slice %s.", label); if (req != NULL) gctl_error(req, "Cannot create slice %s.", label); return (NULL); } gp->orphan = g_label_orphan; gp->spoiled = g_label_spoiled; gp->resize = g_label_resize; g_access(cp, -1, 0, 0); g_slice_config(gp, 0, G_SLICE_CONFIG_SET, (off_t)0, mediasize, pp->sectorsize, "%s", name); G_LABEL_DEBUG(1, "Label for provider %s is %s.", pp->name, name); return (gp); } static int g_label_destroy(struct g_geom *gp, boolean_t force) { struct g_provider *pp; g_topology_assert(); pp = LIST_FIRST(&gp->provider); if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { if (force) { G_LABEL_DEBUG(0, "Provider %s is still open, so it " "can't be definitely removed.", pp->name); } else { G_LABEL_DEBUG(1, "Provider %s is still open (r%dw%de%d).", pp->name, pp->acr, pp->acw, pp->ace); return (EBUSY); } } else if (pp != NULL) G_LABEL_DEBUG(1, "Label %s removed.", pp->name); g_slice_spoiled(LIST_FIRST(&gp->consumer)); return (0); } static int g_label_read_metadata(struct g_consumer *cp, struct g_label_metadata *md) { struct 
g_provider *pp; u_char *buf; int error; g_topology_assert(); pp = cp->provider; g_topology_unlock(); buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); g_topology_lock(); if (buf == NULL) return (error); /* Decode metadata. */ label_metadata_decode(buf, md); g_free(buf); return (0); } static void g_label_orphan_taste(struct g_consumer *cp __unused) { KASSERT(1 == 0, ("%s called?", __func__)); } static void g_label_start_taste(struct bio *bp __unused) { KASSERT(1 == 0, ("%s called?", __func__)); } static int g_label_access_taste(struct g_provider *pp __unused, int dr __unused, int dw __unused, int de __unused) { KASSERT(1 == 0, ("%s called", __func__)); return (EOPNOTSUPP); } static struct g_geom * g_label_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_label_metadata md; struct g_consumer *cp; struct g_geom *gp; int i; g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); g_topology_assert(); G_LABEL_DEBUG(2, "Tasting %s.", pp->name); /* Skip providers that are already open for writing. */ if (pp->acw > 0) return (NULL); if (strcmp(pp->geom->class->name, mp->name) == 0) return (NULL); gp = g_new_geomf(mp, "label:taste"); gp->start = g_label_start_taste; gp->access = g_label_access_taste; gp->orphan = g_label_orphan_taste; cp = g_new_consumer(gp); g_attach(cp, pp); if (g_access(cp, 1, 0, 0) != 0) goto end; do { if (g_label_read_metadata(cp, &md) != 0) break; if (strcmp(md.md_magic, G_LABEL_MAGIC) != 0) break; if (md.md_version > G_LABEL_VERSION) { printf("geom_label.ko module is too old to handle %s.\n", pp->name); break; } /* * Backward compatibility: */ /* * There was no md_provsize field in earlier versions of * metadata. */ if (md.md_version < 2) md.md_provsize = pp->mediasize; if (md.md_provsize != pp->mediasize) break; g_label_create(NULL, mp, pp, md.md_label, G_LABEL_DIR, pp->mediasize - pp->sectorsize); } while (0); for (i = 0; g_labels[i] != NULL; i++) { char label[128]; if (g_labels[i]->ld_enabled == 0) continue; g_topology_unlock(); g_labels[i]->ld_taste(cp, label, sizeof(label)); g_label_mangle_name(label, sizeof(label)); g_topology_lock(); if (label[0] == '\0') continue; g_label_create(NULL, mp, pp, label, g_labels[i]->ld_dir, pp->mediasize); } g_access(cp, -1, 0, 0); end: g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); return (NULL); } static void g_label_ctl_create(struct gctl_req *req, struct g_class *mp) { struct g_provider *pp; const char *name; int *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs != 2) { gctl_error(req, "Invalid number of arguments."); return; } /* * arg1 is the name of provider. */ name = gctl_get_asciiparam(req, "arg1"); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", 1); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); if (pp == NULL) { G_LABEL_DEBUG(1, "Provider %s is invalid.", name); gctl_error(req, "Provider %s is invalid.", name); return; } /* * arg0 is the label. 
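 * For example, "glabel create foo da0" arrives here as verb "create"
 * with arg0 "foo" and arg1 "da0" (a leading /dev/ on the provider
 * name was already stripped above) and yields a /dev/label/foo
 * provider covering the whole of da0.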
*/ name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", 0); return; } g_label_create(req, mp, pp, name, G_LABEL_DIR, pp->mediasize); } static const char * g_label_skip_dir(const char *name) { char path[64]; u_int i; if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); if (strncmp(name, G_LABEL_DIR "/", strlen(G_LABEL_DIR "/")) == 0) name += strlen(G_LABEL_DIR "/"); for (i = 0; g_labels[i] != NULL; i++) { snprintf(path, sizeof(path), "%s/", g_labels[i]->ld_dir); if (strncmp(name, path, strlen(path)) == 0) { name += strlen(path); break; } } return (name); } static struct g_geom * g_label_find_geom(struct g_class *mp, const char *name) { struct g_geom *gp; struct g_provider *pp; const char *pname; name = g_label_skip_dir(name); LIST_FOREACH(gp, &mp->geom, geom) { pp = LIST_FIRST(&gp->provider); pname = g_label_skip_dir(pp->name); if (strcmp(pname, name) == 0) return (gp); } return (NULL); } static void g_label_ctl_destroy(struct gctl_req *req, struct g_class *mp) { int *nargs, *force, error, i; struct g_geom *gp; const char *name; char param[16]; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } force = gctl_get_paraml(req, "force", sizeof(*force)); if (force == NULL) { gctl_error(req, "No 'force' argument"); return; } for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", i); return; } gp = g_label_find_geom(mp, name); if (gp == NULL) { G_LABEL_DEBUG(1, "Label %s is invalid.", name); gctl_error(req, "Label %s is invalid.", name); return; } error = g_label_destroy(gp, *force); if (error != 0) { gctl_error(req, "Cannot destroy label %s (error=%d).", LIST_FIRST(&gp->provider)->name, error); return; } } } static void g_label_config(struct gctl_req *req, struct g_class *mp, const char *verb) { uint32_t *version; g_topology_assert(); version = gctl_get_paraml(req, "version", sizeof(*version)); if (version == NULL) { gctl_error(req, "No '%s' argument.", "version"); return; } if (*version != G_LABEL_VERSION) { gctl_error(req, "Userland and kernel parts are out of sync."); return; } if (strcmp(verb, "create") == 0) { g_label_ctl_create(req, mp); return; } else if (strcmp(verb, "destroy") == 0 || strcmp(verb, "stop") == 0) { g_label_ctl_destroy(req, mp); return; } gctl_error(req, "Unknown verb."); } DECLARE_GEOM_CLASS(g_label_class, g_label); +MODULE_VERSION(geom_label, 0); Index: user/markj/netdump/sys/geom/linux_lvm/g_linux_lvm.c =================================================================== --- user/markj/netdump/sys/geom/linux_lvm/g_linux_lvm.c (revision 332407) +++ user/markj/netdump/sys/geom/linux_lvm/g_linux_lvm.c (revision 332408) @@ -1,1192 +1,1193 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2008 Andrew Thompson * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_linux_lvm, "GEOM Linux LVM partitioning support"); /* Declare malloc(9) label */ static MALLOC_DEFINE(M_GLLVM, "gllvm", "GEOM_LINUX_LVM Data"); /* GEOM class methods */ static g_access_t g_llvm_access; static g_init_t g_llvm_init; static g_orphan_t g_llvm_orphan; static g_orphan_t g_llvm_taste_orphan; static g_start_t g_llvm_start; static g_taste_t g_llvm_taste; static g_ctl_destroy_geom_t g_llvm_destroy_geom; static void g_llvm_done(struct bio *); static void g_llvm_remove_disk(struct g_llvm_vg *, struct g_consumer *); static int g_llvm_activate_lv(struct g_llvm_vg *, struct g_llvm_lv *); static int g_llvm_add_disk(struct g_llvm_vg *, struct g_provider *, char *); static void g_llvm_free_vg(struct g_llvm_vg *); static int g_llvm_destroy(struct g_llvm_vg *, int); static int g_llvm_read_label(struct g_consumer *, struct g_llvm_label *); static int g_llvm_read_md(struct g_consumer *, struct g_llvm_metadata *, struct g_llvm_label *); static int llvm_label_decode(const u_char *, struct g_llvm_label *, int); static int llvm_md_decode(const u_char *, struct g_llvm_metadata *, struct g_llvm_label *); static int llvm_textconf_decode(u_char *, int, struct g_llvm_metadata *); static int llvm_textconf_decode_pv(char **, char *, struct g_llvm_vg *); static int llvm_textconf_decode_lv(char **, char *, struct g_llvm_vg *); static int llvm_textconf_decode_sg(char **, char *, struct g_llvm_lv *); SYSCTL_DECL(_kern_geom); SYSCTL_NODE(_kern_geom, OID_AUTO, linux_lvm, CTLFLAG_RW, 0, "GEOM_LINUX_LVM stuff"); static u_int g_llvm_debug = 0; SYSCTL_UINT(_kern_geom_linux_lvm, OID_AUTO, debug, CTLFLAG_RWTUN, &g_llvm_debug, 0, "Debug level"); LIST_HEAD(, g_llvm_vg) vg_list; /* * Called to notify geom when it's been opened, and for what intent */ static int g_llvm_access(struct g_provider *pp, int dr, int dw, int de) { struct g_consumer *c; struct g_llvm_vg *vg; struct g_geom *gp; int error; KASSERT(pp != NULL, ("%s: NULL provider", __func__)); gp = pp->geom; KASSERT(gp != NULL, ("%s: NULL geom", __func__)); vg = gp->softc; if (vg == NULL) { /* It seems that .access can be called with negative dr,dw,dx * in this case but I want to check for myself */ G_LLVM_DEBUG(0, "access(%d, %d, %d) for %s", dr, dw, de, pp->name); /* This should only happen when geom is withered so * allow only negative requests */ KASSERT(dr <= 0 && dw <= 0 && de <= 0, ("%s: Positive access for %s", __func__, pp->name)); if (pp->acr + dr == 
0 && pp->acw + dw == 0 && pp->ace + de == 0) G_LLVM_DEBUG(0, "Device %s definitely destroyed", pp->name); return (0); } /* Grab an exclusive bit to propagate on our consumers on first open */ if (pp->acr == 0 && pp->acw == 0 && pp->ace == 0) de++; /* ... drop it on close */ if (pp->acr + dr == 0 && pp->acw + dw == 0 && pp->ace + de == 0) de--; error = ENXIO; LIST_FOREACH(c, &gp->consumer, consumer) { KASSERT(c != NULL, ("%s: consumer is NULL", __func__)); error = g_access(c, dr, dw, de); if (error != 0) { struct g_consumer *c2; /* Backout earlier changes */ LIST_FOREACH(c2, &gp->consumer, consumer) { if (c2 == c) /* all earlier components fixed */ return (error); g_access(c2, -dr, -dw, -de); } } } return (error); } /* * Dismantle bio_queue and destroy its components */ static void bioq_dismantle(struct bio_queue_head *bq) { struct bio *b; for (b = bioq_first(bq); b != NULL; b = bioq_first(bq)) { bioq_remove(bq, b); g_destroy_bio(b); } } /* * GEOM .done handler * Can't use standard handler because one requested IO may * fork into additional data IOs */ static void g_llvm_done(struct bio *b) { struct bio *parent_b; parent_b = b->bio_parent; if (b->bio_error != 0) { G_LLVM_DEBUG(0, "Error %d for offset=%ju, length=%ju on %s", b->bio_error, b->bio_offset, b->bio_length, b->bio_to->name); if (parent_b->bio_error == 0) parent_b->bio_error = b->bio_error; } parent_b->bio_inbed++; parent_b->bio_completed += b->bio_completed; if (parent_b->bio_children == parent_b->bio_inbed) { parent_b->bio_completed = parent_b->bio_length; g_io_deliver(parent_b, parent_b->bio_error); } g_destroy_bio(b); } static void g_llvm_start(struct bio *bp) { struct g_provider *pp; struct g_llvm_vg *vg; struct g_llvm_pv *pv; struct g_llvm_lv *lv; struct g_llvm_segment *sg; struct bio *cb; struct bio_queue_head bq; size_t chunk_size; off_t offset, length; char *addr; u_int count; pp = bp->bio_to; lv = pp->private; vg = pp->geom->softc; switch (bp->bio_cmd) { case BIO_READ: case BIO_WRITE: case BIO_DELETE: /* XXX BIO_GETATTR allowed? */ break; default: g_io_deliver(bp, EOPNOTSUPP); return; } bioq_init(&bq); chunk_size = vg->vg_extentsize; addr = bp->bio_data; offset = bp->bio_offset; /* virtual offset and length */ length = bp->bio_length; while (length > 0) { size_t chunk_index, in_chunk_offset, in_chunk_length; pv = NULL; cb = g_clone_bio(bp); if (cb == NULL) { bioq_dismantle(&bq); if (bp->bio_error == 0) bp->bio_error = ENOMEM; g_io_deliver(bp, bp->bio_error); return; } /* get the segment and the pv */ if (lv->lv_sgcount == 1) { /* skip much of the calculations for a single sg */ chunk_index = 0; in_chunk_offset = 0; in_chunk_length = length; sg = lv->lv_firstsg; pv = sg->sg_pv; cb->bio_offset = offset + sg->sg_pvoffset; } else { chunk_index = offset / chunk_size; /* round downwards */ in_chunk_offset = offset % chunk_size; in_chunk_length = min(length, chunk_size - in_chunk_offset); /* XXX could be faster */ LIST_FOREACH(sg, &lv->lv_segs, sg_next) { if (chunk_index >= sg->sg_start && chunk_index <= sg->sg_end) { /* adjust chunk index for sg start */ chunk_index -= sg->sg_start; pv = sg->sg_pv; break; } } cb->bio_offset = (off_t)chunk_index * (off_t)chunk_size + in_chunk_offset + sg->sg_pvoffset; } KASSERT(pv != NULL, ("Can't find PV for chunk %zu", chunk_index)); cb->bio_to = pv->pv_gprov; cb->bio_done = g_llvm_done; cb->bio_length = in_chunk_length; cb->bio_data = addr; cb->bio_caller1 = pv; bioq_disksort(&bq, cb); G_LLVM_DEBUG(5, "Mapped %s(%ju, %ju) on %s to %zu(%zu,%zu) @ %s:%ju", bp->bio_cmd == BIO_READ ?
"R" : "W", offset, length, lv->lv_name, chunk_index, in_chunk_offset, in_chunk_length, pv->pv_name, cb->bio_offset); addr += in_chunk_length; length -= in_chunk_length; offset += in_chunk_length; } /* Fire off bio's here */ count = 0; for (cb = bioq_first(&bq); cb != NULL; cb = bioq_first(&bq)) { bioq_remove(&bq, cb); pv = cb->bio_caller1; cb->bio_caller1 = NULL; G_LLVM_DEBUG(6, "firing bio to %s, offset=%ju, length=%ju", cb->bio_to->name, cb->bio_offset, cb->bio_length); g_io_request(cb, pv->pv_gcons); count++; } if (count == 0) { /* We handled everything locally */ bp->bio_completed = bp->bio_length; g_io_deliver(bp, 0); } } static void g_llvm_remove_disk(struct g_llvm_vg *vg, struct g_consumer *cp) { struct g_llvm_pv *pv; struct g_llvm_lv *lv; struct g_llvm_segment *sg; int found; KASSERT(cp != NULL, ("Non-valid disk in %s.", __func__)); pv = (struct g_llvm_pv *)cp->private; G_LLVM_DEBUG(0, "Disk %s removed from %s.", cp->provider->name, pv->pv_name); LIST_FOREACH(lv, &vg->vg_lvs, lv_next) { /* Find segments that map to this disk */ found = 0; LIST_FOREACH(sg, &lv->lv_segs, sg_next) { if (sg->sg_pv == pv) { sg->sg_pv = NULL; lv->lv_sgactive--; found = 1; break; } } if (found) { G_LLVM_DEBUG(0, "Device %s removed.", lv->lv_gprov->name); g_wither_provider(lv->lv_gprov, ENXIO); lv->lv_gprov = NULL; } } if (cp->acr > 0 || cp->acw > 0 || cp->ace > 0) g_access(cp, -cp->acr, -cp->acw, -cp->ace); g_detach(cp); g_destroy_consumer(cp); } static void g_llvm_orphan(struct g_consumer *cp) { struct g_llvm_vg *vg; struct g_geom *gp; g_topology_assert(); gp = cp->geom; vg = gp->softc; if (vg == NULL) return; g_llvm_remove_disk(vg, cp); g_llvm_destroy(vg, 1); } static int g_llvm_activate_lv(struct g_llvm_vg *vg, struct g_llvm_lv *lv) { struct g_geom *gp; struct g_provider *pp; g_topology_assert(); KASSERT(lv->lv_sgactive == lv->lv_sgcount, ("segment missing")); gp = vg->vg_geom; pp = g_new_providerf(gp, "linux_lvm/%s-%s", vg->vg_name, lv->lv_name); pp->mediasize = vg->vg_extentsize * (off_t)lv->lv_extentcount; pp->sectorsize = vg->vg_sectorsize; g_error_provider(pp, 0); lv->lv_gprov = pp; pp->private = lv; G_LLVM_DEBUG(1, "Created %s, %juM", pp->name, pp->mediasize / (1024*1024)); return (0); } static int g_llvm_add_disk(struct g_llvm_vg *vg, struct g_provider *pp, char *uuid) { struct g_geom *gp; struct g_consumer *cp, *fcp; struct g_llvm_pv *pv; struct g_llvm_lv *lv; struct g_llvm_segment *sg; int error; g_topology_assert(); LIST_FOREACH(pv, &vg->vg_pvs, pv_next) { if (strcmp(pv->pv_uuid, uuid) == 0) break; /* found it */ } if (pv == NULL) { G_LLVM_DEBUG(3, "uuid %s not found in pv list", uuid); return (ENOENT); } if (pv->pv_gprov != NULL) { G_LLVM_DEBUG(0, "disk %s already initialised in %s", pv->pv_name, vg->vg_name); return (EEXIST); } pv->pv_start *= vg->vg_sectorsize; gp = vg->vg_geom; fcp = LIST_FIRST(&gp->consumer); cp = g_new_consumer(gp); error = g_attach(cp, pp); G_LLVM_DEBUG(1, "Attached %s to %s at offset %ju", pp->name, pv->pv_name, pv->pv_start); if (error != 0) { G_LLVM_DEBUG(0, "cannot attach %s to %s", pp->name, vg->vg_name); g_destroy_consumer(cp); return (error); } if (fcp != NULL) { if (fcp->provider->sectorsize != pp->sectorsize) { G_LLVM_DEBUG(0, "Provider %s of %s has invalid " "sector size (%d)", pp->name, vg->vg_name, pp->sectorsize); return (EINVAL); } if (fcp->acr > 0 || fcp->acw || fcp->ace > 0) { /* Replicate access permissions from first "live" * consumer to the new one */ error = g_access(cp, fcp->acr, fcp->acw, fcp->ace); if (error != 0) { g_detach(cp); 
g_destroy_consumer(cp); return (error); } } } cp->private = pv; pv->pv_gcons = cp; pv->pv_gprov = pp; LIST_FOREACH(lv, &vg->vg_lvs, lv_next) { /* Find segments that map to this disk */ LIST_FOREACH(sg, &lv->lv_segs, sg_next) { if (strcmp(sg->sg_pvname, pv->pv_name) == 0) { /* activate the segment */ KASSERT(sg->sg_pv == NULL, ("segment already mapped")); sg->sg_pvoffset = (off_t)sg->sg_pvstart * vg->vg_extentsize + pv->pv_start; sg->sg_pv = pv; lv->lv_sgactive++; G_LLVM_DEBUG(2, "%s: %d to %d @ %s:%d" " offset %ju sector %ju", lv->lv_name, sg->sg_start, sg->sg_end, sg->sg_pvname, sg->sg_pvstart, sg->sg_pvoffset, sg->sg_pvoffset / vg->vg_sectorsize); } } /* Activate any lvs waiting on this disk */ if (lv->lv_gprov == NULL && lv->lv_sgactive == lv->lv_sgcount) { error = g_llvm_activate_lv(vg, lv); if (error) break; } } return (error); } static void g_llvm_init(struct g_class *mp) { LIST_INIT(&vg_list); } static void g_llvm_free_vg(struct g_llvm_vg *vg) { struct g_llvm_pv *pv; struct g_llvm_lv *lv; struct g_llvm_segment *sg; /* Free all the structures */ while ((pv = LIST_FIRST(&vg->vg_pvs)) != NULL) { LIST_REMOVE(pv, pv_next); free(pv, M_GLLVM); } while ((lv = LIST_FIRST(&vg->vg_lvs)) != NULL) { while ((sg = LIST_FIRST(&lv->lv_segs)) != NULL) { LIST_REMOVE(sg, sg_next); free(sg, M_GLLVM); } LIST_REMOVE(lv, lv_next); free(lv, M_GLLVM); } LIST_REMOVE(vg, vg_next); free(vg, M_GLLVM); } static void g_llvm_taste_orphan(struct g_consumer *cp) { KASSERT(1 == 0, ("%s called while tasting %s.", __func__, cp->provider->name)); } static struct g_geom * g_llvm_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_consumer *cp; struct g_geom *gp; struct g_llvm_label ll; struct g_llvm_metadata md; struct g_llvm_vg *vg; int error; bzero(&md, sizeof(md)); g_topology_assert(); g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); gp = g_new_geomf(mp, "linux_lvm:taste"); /* This orphan function should never be called.
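 * Tasting uses a throwaway geom and consumer only to read the label
 * and metadata; both are torn down again below, before the real
 * volume-group geom is created or reused, so no I/O path ever stays
 * attached to them.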
*/ gp->orphan = g_llvm_taste_orphan; cp = g_new_consumer(gp); g_attach(cp, pp); error = g_llvm_read_label(cp, &ll); if (!error) error = g_llvm_read_md(cp, &md, &ll); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); if (error != 0) return (NULL); vg = md.md_vg; if (vg->vg_geom == NULL) { /* new volume group */ gp = g_new_geomf(mp, "%s", vg->vg_name); gp->start = g_llvm_start; gp->spoiled = g_llvm_orphan; gp->orphan = g_llvm_orphan; gp->access = g_llvm_access; vg->vg_sectorsize = pp->sectorsize; vg->vg_extentsize *= vg->vg_sectorsize; vg->vg_geom = gp; gp->softc = vg; G_LLVM_DEBUG(1, "Created volume %s, extent size %zuK", vg->vg_name, vg->vg_extentsize / 1024); } /* initialise this disk in the volume group */ g_llvm_add_disk(vg, pp, ll.ll_uuid); return (vg->vg_geom); } static int g_llvm_destroy(struct g_llvm_vg *vg, int force) { struct g_provider *pp; struct g_geom *gp; g_topology_assert(); if (vg == NULL) return (ENXIO); gp = vg->vg_geom; LIST_FOREACH(pp, &gp->provider, provider) { if (pp->acr != 0 || pp->acw != 0 || pp->ace != 0) { G_LLVM_DEBUG(1, "Device %s is still open (r%dw%de%d)", pp->name, pp->acr, pp->acw, pp->ace); if (!force) return (EBUSY); } } g_llvm_free_vg(gp->softc); gp->softc = NULL; g_wither_geom(gp, ENXIO); return (0); } static int g_llvm_destroy_geom(struct gctl_req *req __unused, struct g_class *mp __unused, struct g_geom *gp) { struct g_llvm_vg *vg; vg = gp->softc; return (g_llvm_destroy(vg, 0)); } int g_llvm_read_label(struct g_consumer *cp, struct g_llvm_label *ll) { struct g_provider *pp; u_char *buf; int i, error = 0; g_topology_assert(); /* The LVM label is stored on the first four sectors */ error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); buf = g_read_data(cp, 0, pp->sectorsize * 4, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) { G_LLVM_DEBUG(1, "Cannot read metadata from %s (error=%d)", pp->name, error); return (error); } /* Search the four sectors for the LVM label. 
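 * LVM2 places its label in one of the first four sectors (the second
 * sector in practice).  llvm_label_decode() below recognizes it by
 * the "LABELONE" magic at offset 0 and the "LVM2 001" type at offset
 * 24, and cross-checks the label's stored sector number against the
 * sector it was actually found in.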
*/ for (i = 0; i < 4; i++) { error = llvm_label_decode(&buf[i * pp->sectorsize], ll, i); if (error == 0) break; /* found it */ } g_free(buf); return (error); } int g_llvm_read_md(struct g_consumer *cp, struct g_llvm_metadata *md, struct g_llvm_label *ll) { struct g_provider *pp; u_char *buf; int error; int size; g_topology_assert(); error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); buf = g_read_data(cp, ll->ll_md_offset, pp->sectorsize, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) { G_LLVM_DEBUG(0, "Cannot read metadata from %s (error=%d)", cp->provider->name, error); return (error); } error = llvm_md_decode(buf, md, ll); g_free(buf); if (error != 0) { return (error); } G_LLVM_DEBUG(1, "reading LVM2 config @ %s:%ju", pp->name, ll->ll_md_offset + md->md_reloffset); error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); /* round up to the nearest sector */ size = md->md_relsize + (pp->sectorsize - md->md_relsize % pp->sectorsize); buf = g_read_data(cp, ll->ll_md_offset + md->md_reloffset, size, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) { G_LLVM_DEBUG(0, "Cannot read LVM2 config from %s (error=%d)", pp->name, error); return (error); } buf[md->md_relsize] = '\0'; G_LLVM_DEBUG(10, "LVM config:\n%s\n", buf); error = llvm_textconf_decode(buf, md->md_relsize, md); g_free(buf); return (error); } static int llvm_label_decode(const u_char *data, struct g_llvm_label *ll, int sector) { uint64_t off; char *uuid; /* Magic string */ if (bcmp("LABELONE", data , 8) != 0) return (EINVAL); /* We only support LVM2 text format */ if (bcmp("LVM2 001", data + 24, 8) != 0) { G_LLVM_DEBUG(0, "Unsupported LVM format"); return (EINVAL); } ll->ll_sector = le64dec(data + 8); ll->ll_crc = le32dec(data + 16); ll->ll_offset = le32dec(data + 20); if (ll->ll_sector != sector) { G_LLVM_DEBUG(0, "Expected sector %ju, found at %d", ll->ll_sector, sector); return (EINVAL); } off = ll->ll_offset; /* * convert the binary uuid to string format, the format is * xxxxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxxxxx (6-4-4-4-4-4-6) */ uuid = ll->ll_uuid; bcopy(data + off, uuid, 6); off += 6; uuid += 6; *uuid++ = '-'; for (int i = 0; i < 5; i++) { bcopy(data + off, uuid, 4); off += 4; uuid += 4; *uuid++ = '-'; } bcopy(data + off, uuid, 6); off += 6; uuid += 6; *uuid++ = '\0'; ll->ll_size = le64dec(data + off); off += 8; ll->ll_pestart = le64dec(data + off); off += 16; /* Only one data section is supported */ if (le64dec(data + off) != 0) { G_LLVM_DEBUG(0, "Only one data section supported"); return (EINVAL); } off += 16; ll->ll_md_offset = le64dec(data + off); off += 8; ll->ll_md_size = le64dec(data + off); off += 8; G_LLVM_DEBUG(1, "LVM metadata: offset=%ju, size=%ju", ll->ll_md_offset, ll->ll_md_size); /* Only one data section is supported */ if (le64dec(data + off) != 0) { G_LLVM_DEBUG(0, "Only one metadata section supported"); return (EINVAL); } G_LLVM_DEBUG(2, "label uuid=%s", ll->ll_uuid); G_LLVM_DEBUG(2, "sector=%ju, crc=%u, offset=%u, size=%ju, pestart=%ju", ll->ll_sector, ll->ll_crc, ll->ll_offset, ll->ll_size, ll->ll_pestart); return (0); } static int llvm_md_decode(const u_char *data, struct g_llvm_metadata *md, struct g_llvm_label *ll) { uint64_t off; char magic[16]; off = 0; md->md_csum = le32dec(data + off); off += 4; bcopy(data + off, magic, 16); off += 16; md->md_version = le32dec(data + off); off += 4; md->md_start = le64dec(data + off); off += 8; md->md_size = le64dec(data + off); 
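/*
 * As decoded above, the metadata area header starts with: csum (4
 * bytes), magic (16), version (4), start (8) and size (8); the raw
 * location descriptors, which point at the text metadata, follow it.
 */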
off += 8; if (bcmp(G_LLVM_MAGIC, magic, 16) != 0) { G_LLVM_DEBUG(0, "Incorrect md magic number"); return (EINVAL); } if (md->md_version != 1) { G_LLVM_DEBUG(0, "Incorrect md version number (%u)", md->md_version); return (EINVAL); } if (md->md_start != ll->ll_md_offset) { G_LLVM_DEBUG(0, "Incorrect md offset (%ju)", md->md_start); return (EINVAL); } /* Apparently only one is ever returned */ md->md_reloffset = le64dec(data + off); off += 8; md->md_relsize = le64dec(data + off); off += 16; /* XXX skipped checksum */ if (le64dec(data + off) != 0) { G_LLVM_DEBUG(0, "Only one reloc supported"); return (EINVAL); } G_LLVM_DEBUG(3, "reloc: offset=%ju, size=%ju", md->md_reloffset, md->md_relsize); G_LLVM_DEBUG(3, "md: version=%u, start=%ju, size=%ju", md->md_version, md->md_start, md->md_size); return (0); } #define GRAB_INT(key, tok1, tok2, v) \ if (tok1 && tok2 && strncmp(tok1, key, sizeof(key)) == 0) { \ v = strtol(tok2, &tok1, 10); \ if (tok1 == tok2) \ /* strtol did not eat any of the buffer */ \ goto bad; \ continue; \ } #define GRAB_STR(key, tok1, tok2, v, len) \ if (tok1 && tok2 && strncmp(tok1, key, sizeof(key)) == 0) { \ strsep(&tok2, "\""); \ if (tok2 == NULL) \ continue; \ tok1 = strsep(&tok2, "\""); \ if (tok2 == NULL) \ continue; \ strncpy(v, tok1, len); \ continue; \ } #define SPLIT(key, value, str) \ key = strsep(&value, str); \ /* strip trailing whitespace on the key */ \ for (char *t = key; *t != '\0'; t++) \ if (isspace(*t)) { \ *t = '\0'; \ break; \ } static size_t llvm_grab_name(char *name, const char *tok) { size_t len; len = 0; if (tok == NULL) return (0); if (tok[0] == '-') return (0); if (strcmp(tok, ".") == 0 || strcmp(tok, "..") == 0) return (0); while (tok[len] && (isalpha(tok[len]) || isdigit(tok[len]) || tok[len] == '.' || tok[len] == '_' || tok[len] == '-' || tok[len] == '+') && len < G_LLVM_NAMELEN - 1) len++; bcopy(tok, name, len); name[len] = '\0'; return (len); } static int llvm_textconf_decode(u_char *data, int buflen, struct g_llvm_metadata *md) { struct g_llvm_vg *vg; char *buf = data; char *tok, *v; char name[G_LLVM_NAMELEN]; char uuid[G_LLVM_UUIDLEN]; size_t len; if (buf == NULL || *buf == '\0') return (EINVAL); tok = strsep(&buf, "\n"); if (tok == NULL) return (EINVAL); len = llvm_grab_name(name, tok); if (len == 0) return (EINVAL); /* check to see if the vg has already been loaded off another disk */ LIST_FOREACH(vg, &vg_list, vg_next) { if (strcmp(vg->vg_name, name) == 0) { uuid[0] = '\0'; /* grab the volume group uuid */ while ((tok = strsep(&buf, "\n")) != NULL) { if (strstr(tok, "{")) break; if (strstr(tok, "=")) { SPLIT(v, tok, "="); GRAB_STR("id", v, tok, uuid, sizeof(uuid)); } } if (strcmp(vg->vg_uuid, uuid) == 0) { /* existing vg */ md->md_vg = vg; return (0); } /* XXX different volume group with name clash!
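Two on-disk groups can legitimately carry the same name with different uuids (for example, a disk moved from another machine), and there is no safe way to guess which one the administrator means, so the group is refused below.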
*/ G_LLVM_DEBUG(0, "%s already exists, volume group not loaded", name); return (EINVAL); } } vg = malloc(sizeof(*vg), M_GLLVM, M_NOWAIT|M_ZERO); if (vg == NULL) return (ENOMEM); strncpy(vg->vg_name, name, sizeof(vg->vg_name)); LIST_INIT(&vg->vg_pvs); LIST_INIT(&vg->vg_lvs); #define VOL_FOREACH(func, tok, buf, p) \ while ((tok = strsep(buf, "\n")) != NULL) { \ if (strstr(tok, "{")) { \ func(buf, tok, p); \ continue; \ } \ if (strstr(tok, "}")) \ break; \ } while ((tok = strsep(&buf, "\n")) != NULL) { if (strcmp(tok, "physical_volumes {") == 0) { VOL_FOREACH(llvm_textconf_decode_pv, tok, &buf, vg); continue; } if (strcmp(tok, "logical_volumes {") == 0) { VOL_FOREACH(llvm_textconf_decode_lv, tok, &buf, vg); continue; } if (strstr(tok, "{")) { G_LLVM_DEBUG(2, "unknown section %s", tok); continue; } /* parse 'key = value' lines */ if (strstr(tok, "=")) { SPLIT(v, tok, "="); GRAB_STR("id", v, tok, vg->vg_uuid, sizeof(vg->vg_uuid)); GRAB_INT("extent_size", v, tok, vg->vg_extentsize); continue; } } /* basic checking */ if (vg->vg_extentsize == 0) goto bad; md->md_vg = vg; LIST_INSERT_HEAD(&vg_list, vg, vg_next); G_LLVM_DEBUG(3, "vg: name=%s uuid=%s", vg->vg_name, vg->vg_uuid); return(0); bad: g_llvm_free_vg(vg); return (-1); } #undef VOL_FOREACH static int llvm_textconf_decode_pv(char **buf, char *tok, struct g_llvm_vg *vg) { struct g_llvm_pv *pv; char *v; size_t len; if (*buf == NULL || **buf == '\0') return (EINVAL); pv = malloc(sizeof(*pv), M_GLLVM, M_NOWAIT|M_ZERO); if (pv == NULL) return (ENOMEM); pv->pv_vg = vg; len = 0; if (tok == NULL) goto bad; len = llvm_grab_name(pv->pv_name, tok); if (len == 0) goto bad; while ((tok = strsep(buf, "\n")) != NULL) { if (strstr(tok, "{")) goto bad; if (strstr(tok, "}")) break; /* parse 'key = value' lines */ if (strstr(tok, "=")) { SPLIT(v, tok, "="); GRAB_STR("id", v, tok, pv->pv_uuid, sizeof(pv->pv_uuid)); GRAB_INT("pe_start", v, tok, pv->pv_start); GRAB_INT("pe_count", v, tok, pv->pv_count); continue; } } if (tok == NULL) goto bad; /* basic checking */ if (pv->pv_count == 0) goto bad; LIST_INSERT_HEAD(&vg->vg_pvs, pv, pv_next); G_LLVM_DEBUG(3, "pv: name=%s uuid=%s", pv->pv_name, pv->pv_uuid); return (0); bad: free(pv, M_GLLVM); return (-1); } static int llvm_textconf_decode_lv(char **buf, char *tok, struct g_llvm_vg *vg) { struct g_llvm_lv *lv; struct g_llvm_segment *sg; char *v; size_t len; if (*buf == NULL || **buf == '\0') return (EINVAL); lv = malloc(sizeof(*lv), M_GLLVM, M_NOWAIT|M_ZERO); if (lv == NULL) return (ENOMEM); lv->lv_vg = vg; LIST_INIT(&lv->lv_segs); if (tok == NULL) goto bad; len = llvm_grab_name(lv->lv_name, tok); if (len == 0) goto bad; while ((tok = strsep(buf, "\n")) != NULL) { if (strstr(tok, "{")) { if (strstr(tok, "segment")) { llvm_textconf_decode_sg(buf, tok, lv); continue; } else /* unexpected section */ goto bad; } if (strstr(tok, "}")) break; /* parse 'key = value' lines */ if (strstr(tok, "=")) { SPLIT(v, tok, "="); GRAB_STR("id", v, tok, lv->lv_uuid, sizeof(lv->lv_uuid)); GRAB_INT("segment_count", v, tok, lv->lv_sgcount); continue; } } if (tok == NULL) goto bad; if (lv->lv_sgcount == 0 || lv->lv_sgcount != lv->lv_numsegs) /* zero or incomplete segment list */ goto bad; /* Optimize for only one segment on the pv */ lv->lv_firstsg = LIST_FIRST(&lv->lv_segs); LIST_INSERT_HEAD(&vg->vg_lvs, lv, lv_next); G_LLVM_DEBUG(3, "lv: name=%s uuid=%s", lv->lv_name, lv->lv_uuid); return (0); bad: while ((sg = LIST_FIRST(&lv->lv_segs)) != NULL) { LIST_REMOVE(sg, sg_next); free(sg, M_GLLVM); } free(lv, M_GLLVM); return (-1); } static int 
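/*
 * For reference, a hedged sketch of the kind of "segment" section this
 * parser consumes from the LVM2 text metadata (names and values are
 * illustrative only):
 *
 *	segment1 {
 *	start_extent = 0
 *	extent_count = 1280
 *	type = "striped"
 *	stripe_count = 1
 *	stripes = [
 *	"pv0", 0
 *	]
 *	}
 *
 * Only a single linear stripe (stripe_count = 1) is accepted.
 */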
llvm_textconf_decode_sg(char **buf, char *tok, struct g_llvm_lv *lv) { struct g_llvm_segment *sg; char *v; int count = 0; if (*buf == NULL || **buf == '\0') return (EINVAL); sg = malloc(sizeof(*sg), M_GLLVM, M_NOWAIT|M_ZERO); if (sg == NULL) return (ENOMEM); while ((tok = strsep(buf, "\n")) != NULL) { /* only a single linear stripe is supported */ if (strstr(tok, "stripe_count")) { SPLIT(v, tok, "="); GRAB_INT("stripe_count", v, tok, count); if (count != 1) goto bad; } if (strstr(tok, "{")) goto bad; if (strstr(tok, "}")) break; if (strcmp(tok, "stripes = [") == 0) { tok = strsep(buf, "\n"); if (tok == NULL) goto bad; strsep(&tok, "\""); if (tok == NULL) goto bad; /* missing open quotes */ v = strsep(&tok, "\""); if (tok == NULL) goto bad; /* missing close quotes */ strncpy(sg->sg_pvname, v, sizeof(sg->sg_pvname)); if (*tok != ',') goto bad; /* missing comma for stripe */ tok++; sg->sg_pvstart = strtol(tok, &v, 10); if (v == tok) /* strtol did not eat any of the buffer */ goto bad; continue; } /* parse 'key = value' lines */ if (strstr(tok, "=")) { SPLIT(v, tok, "="); GRAB_INT("start_extent", v, tok, sg->sg_start); GRAB_INT("extent_count", v, tok, sg->sg_count); continue; } } if (tok == NULL) goto bad; /* basic checking */ if (count != 1 || sg->sg_count == 0) goto bad; sg->sg_end = sg->sg_start + sg->sg_count - 1; lv->lv_numsegs++; lv->lv_extentcount += sg->sg_count; LIST_INSERT_HEAD(&lv->lv_segs, sg, sg_next); return (0); bad: free(sg, M_GLLVM); return (-1); } #undef GRAB_INT #undef GRAB_STR #undef SPLIT static struct g_class g_llvm_class = { .name = G_LLVM_CLASS_NAME, .version = G_VERSION, .init = g_llvm_init, .taste = g_llvm_taste, .destroy_geom = g_llvm_destroy_geom }; DECLARE_GEOM_CLASS(g_llvm_class, g_linux_lvm); +MODULE_VERSION(geom_linux_lvm, 0); Index: user/markj/netdump/sys/geom/mirror/g_mirror.c =================================================================== --- user/markj/netdump/sys/geom/mirror/g_mirror.c (revision 332407) +++ user/markj/netdump/sys/geom/mirror/g_mirror.c (revision 332408) @@ -1,3494 +1,3495 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004-2006 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_mirror, "GEOM mirroring support"); static MALLOC_DEFINE(M_MIRROR, "mirror_data", "GEOM_MIRROR Data"); SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, mirror, CTLFLAG_RW, 0, "GEOM_MIRROR stuff"); int g_mirror_debug = 0; SYSCTL_INT(_kern_geom_mirror, OID_AUTO, debug, CTLFLAG_RWTUN, &g_mirror_debug, 0, "Debug level"); static u_int g_mirror_timeout = 4; SYSCTL_UINT(_kern_geom_mirror, OID_AUTO, timeout, CTLFLAG_RWTUN, &g_mirror_timeout, 0, "Time to wait on all mirror components"); static u_int g_mirror_idletime = 5; SYSCTL_UINT(_kern_geom_mirror, OID_AUTO, idletime, CTLFLAG_RWTUN, &g_mirror_idletime, 0, "Mark components as clean when idling"); static u_int g_mirror_disconnect_on_failure = 1; SYSCTL_UINT(_kern_geom_mirror, OID_AUTO, disconnect_on_failure, CTLFLAG_RWTUN, &g_mirror_disconnect_on_failure, 0, "Disconnect component on I/O failure."); static u_int g_mirror_syncreqs = 2; SYSCTL_UINT(_kern_geom_mirror, OID_AUTO, sync_requests, CTLFLAG_RDTUN, &g_mirror_syncreqs, 0, "Parallel synchronization I/O requests."); static u_int g_mirror_sync_period = 5; SYSCTL_UINT(_kern_geom_mirror, OID_AUTO, sync_update_period, CTLFLAG_RWTUN, &g_mirror_sync_period, 0, "Metadata update period during synchronization, in seconds"); #define MSLEEP(ident, mtx, priority, wmesg, timeout) do { \ G_MIRROR_DEBUG(4, "%s: Sleeping %p.", __func__, (ident)); \ msleep((ident), (mtx), (priority), (wmesg), (timeout)); \ G_MIRROR_DEBUG(4, "%s: Woken up %p.", __func__, (ident)); \ } while (0) static eventhandler_tag g_mirror_post_sync = NULL; static int g_mirror_shutdown = 0; static g_ctl_destroy_geom_t g_mirror_destroy_geom; static g_taste_t g_mirror_taste; static g_init_t g_mirror_init; static g_fini_t g_mirror_fini; static g_provgone_t g_mirror_providergone; static g_resize_t g_mirror_resize; struct g_class g_mirror_class = { .name = G_MIRROR_CLASS_NAME, .version = G_VERSION, .ctlreq = g_mirror_config, .taste = g_mirror_taste, .destroy_geom = g_mirror_destroy_geom, .init = g_mirror_init, .fini = g_mirror_fini, .providergone = g_mirror_providergone, .resize = g_mirror_resize }; static void g_mirror_destroy_provider(struct g_mirror_softc *sc); static int g_mirror_update_disk(struct g_mirror_disk *disk, u_int state); static void g_mirror_update_device(struct g_mirror_softc *sc, bool force); static void g_mirror_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp); static void g_mirror_sync_reinit(const struct g_mirror_disk *disk, struct bio *bp, off_t offset); static void g_mirror_sync_stop(struct g_mirror_disk *disk, int type); static void g_mirror_register_request(struct g_mirror_softc *sc, struct bio *bp); static void g_mirror_sync_release(struct g_mirror_softc *sc); static const char * g_mirror_disk_state2str(int state) { switch (state) { case G_MIRROR_DISK_STATE_NONE: return ("NONE"); case G_MIRROR_DISK_STATE_NEW: return ("NEW"); case G_MIRROR_DISK_STATE_ACTIVE: return ("ACTIVE"); case G_MIRROR_DISK_STATE_STALE: return ("STALE"); case G_MIRROR_DISK_STATE_SYNCHRONIZING: return ("SYNCHRONIZING"); case G_MIRROR_DISK_STATE_DISCONNECTED: return ("DISCONNECTED"); case G_MIRROR_DISK_STATE_DESTROY: return ("DESTROY"); default: return ("INVALID"); } } static const char * g_mirror_device_state2str(int state) { switch (state) { case 
G_MIRROR_DEVICE_STATE_STARTING: return ("STARTING"); case G_MIRROR_DEVICE_STATE_RUNNING: return ("RUNNING"); default: return ("INVALID"); } } static const char * g_mirror_get_diskname(struct g_mirror_disk *disk) { if (disk->d_consumer == NULL || disk->d_consumer->provider == NULL) return ("[unknown]"); return (disk->d_name); } /* * --- Events handling functions --- * Events in geom_mirror are used to maintain disks and device status * from one thread to simplify locking. */ static void g_mirror_event_free(struct g_mirror_event *ep) { free(ep, M_MIRROR); } int g_mirror_event_send(void *arg, int state, int flags) { struct g_mirror_softc *sc; struct g_mirror_disk *disk; struct g_mirror_event *ep; int error; ep = malloc(sizeof(*ep), M_MIRROR, M_WAITOK); G_MIRROR_DEBUG(4, "%s: Sending event %p.", __func__, ep); if ((flags & G_MIRROR_EVENT_DEVICE) != 0) { disk = NULL; sc = arg; } else { disk = arg; sc = disk->d_softc; } ep->e_disk = disk; ep->e_state = state; ep->e_flags = flags; ep->e_error = 0; mtx_lock(&sc->sc_events_mtx); TAILQ_INSERT_TAIL(&sc->sc_events, ep, e_next); mtx_unlock(&sc->sc_events_mtx); G_MIRROR_DEBUG(4, "%s: Waking up %p.", __func__, sc); mtx_lock(&sc->sc_queue_mtx); wakeup(sc); mtx_unlock(&sc->sc_queue_mtx); if ((flags & G_MIRROR_EVENT_DONTWAIT) != 0) return (0); sx_assert(&sc->sc_lock, SX_XLOCKED); G_MIRROR_DEBUG(4, "%s: Sleeping %p.", __func__, ep); sx_xunlock(&sc->sc_lock); while ((ep->e_flags & G_MIRROR_EVENT_DONE) == 0) { mtx_lock(&sc->sc_events_mtx); MSLEEP(ep, &sc->sc_events_mtx, PRIBIO | PDROP, "m:event", hz * 5); } error = ep->e_error; g_mirror_event_free(ep); sx_xlock(&sc->sc_lock); return (error); } static struct g_mirror_event * g_mirror_event_first(struct g_mirror_softc *sc) { struct g_mirror_event *ep; mtx_lock(&sc->sc_events_mtx); ep = TAILQ_FIRST(&sc->sc_events); mtx_unlock(&sc->sc_events_mtx); return (ep); } static void g_mirror_event_remove(struct g_mirror_softc *sc, struct g_mirror_event *ep) { mtx_lock(&sc->sc_events_mtx); TAILQ_REMOVE(&sc->sc_events, ep, e_next); mtx_unlock(&sc->sc_events_mtx); } static void g_mirror_event_cancel(struct g_mirror_disk *disk) { struct g_mirror_softc *sc; struct g_mirror_event *ep, *tmpep; sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_XLOCKED); mtx_lock(&sc->sc_events_mtx); TAILQ_FOREACH_SAFE(ep, &sc->sc_events, e_next, tmpep) { if ((ep->e_flags & G_MIRROR_EVENT_DEVICE) != 0) continue; if (ep->e_disk != disk) continue; TAILQ_REMOVE(&sc->sc_events, ep, e_next); if ((ep->e_flags & G_MIRROR_EVENT_DONTWAIT) != 0) g_mirror_event_free(ep); else { ep->e_error = ECANCELED; wakeup(ep); } } mtx_unlock(&sc->sc_events_mtx); } /* * Return the number of disks in given state. * If state is equal to -1, count all connected disks. */ u_int g_mirror_ndisks(struct g_mirror_softc *sc, int state) { struct g_mirror_disk *disk; u_int n = 0; sx_assert(&sc->sc_lock, SX_LOCKED); LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (state == -1 || disk->d_state == state) n++; } return (n); } /* * Find a disk in mirror by its disk ID. 
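 * The ID is the random 32-bit identifier assigned when the disk's
 * metadata was first written (g_mirror_fill_metadata() draws it from
 * arc4random() for new disks), so it is stable across reboots.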
*/ static struct g_mirror_disk * g_mirror_id2disk(struct g_mirror_softc *sc, uint32_t id) { struct g_mirror_disk *disk; sx_assert(&sc->sc_lock, SX_XLOCKED); LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_id == id) return (disk); } return (NULL); } static u_int g_mirror_nrequests(struct g_mirror_softc *sc, struct g_consumer *cp) { struct bio *bp; u_int nreqs = 0; mtx_lock(&sc->sc_queue_mtx); TAILQ_FOREACH(bp, &sc->sc_queue, bio_queue) { if (bp->bio_from == cp) nreqs++; } mtx_unlock(&sc->sc_queue_mtx); return (nreqs); } static int g_mirror_is_busy(struct g_mirror_softc *sc, struct g_consumer *cp) { if (cp->index > 0) { G_MIRROR_DEBUG(2, "I/O requests for %s exist, can't destroy it now.", cp->provider->name); return (1); } if (g_mirror_nrequests(sc, cp) > 0) { G_MIRROR_DEBUG(2, "I/O requests for %s in queue, can't destroy it now.", cp->provider->name); return (1); } return (0); } static void g_mirror_destroy_consumer(void *arg, int flags __unused) { struct g_consumer *cp; g_topology_assert(); cp = arg; G_MIRROR_DEBUG(1, "Consumer %s destroyed.", cp->provider->name); g_detach(cp); g_destroy_consumer(cp); } static void g_mirror_kill_consumer(struct g_mirror_softc *sc, struct g_consumer *cp) { struct g_provider *pp; int retaste_wait; g_topology_assert(); cp->private = NULL; if (g_mirror_is_busy(sc, cp)) return; pp = cp->provider; retaste_wait = 0; if (cp->acw == 1) { if ((pp->geom->flags & G_GEOM_WITHER) == 0) retaste_wait = 1; } G_MIRROR_DEBUG(2, "Access %s r%dw%de%d = %d", pp->name, -cp->acr, -cp->acw, -cp->ace, 0); if (cp->acr > 0 || cp->acw > 0 || cp->ace > 0) g_access(cp, -cp->acr, -cp->acw, -cp->ace); if (retaste_wait) { /* * After the retaste event has been sent (inside g_access()), we can * post an event to detach and destroy the consumer. * A class that already has a consumer attached to the given provider * will not receive a retaste event for that provider. * This is how retaste events are ignored when consumers opened for * write are closed: the consumer is detached and destroyed only after * the retaste event has been sent. */ g_post_event(g_mirror_destroy_consumer, cp, M_WAITOK, NULL); return; } G_MIRROR_DEBUG(1, "Consumer %s destroyed.", pp->name); g_detach(cp); g_destroy_consumer(cp); } static int g_mirror_connect_disk(struct g_mirror_disk *disk, struct g_provider *pp) { struct g_consumer *cp; int error; g_topology_assert_not(); KASSERT(disk->d_consumer == NULL, ("Disk already connected (device %s).", disk->d_softc->sc_name)); g_topology_lock(); cp = g_new_consumer(disk->d_softc->sc_geom); cp->flags |= G_CF_DIRECT_RECEIVE; error = g_attach(cp, pp); if (error != 0) { g_destroy_consumer(cp); g_topology_unlock(); return (error); } error = g_access(cp, 1, 1, 1); if (error != 0) { g_detach(cp); g_destroy_consumer(cp); g_topology_unlock(); G_MIRROR_DEBUG(0, "Cannot open consumer %s (error=%d).", pp->name, error); return (error); } g_topology_unlock(); disk->d_consumer = cp; disk->d_consumer->private = disk; disk->d_consumer->index = 0; G_MIRROR_DEBUG(2, "Disk %s connected.", g_mirror_get_diskname(disk)); return (0); } static void g_mirror_disconnect_consumer(struct g_mirror_softc *sc, struct g_consumer *cp) { g_topology_assert(); if (cp == NULL) return; if (cp->provider != NULL) g_mirror_kill_consumer(sc, cp); else g_destroy_consumer(cp); } /* * Initialize a disk: allocate memory, create a consumer, attach it * to the provider, and open access (r1w1e1) to it.
*/ static struct g_mirror_disk * g_mirror_init_disk(struct g_mirror_softc *sc, struct g_provider *pp, struct g_mirror_metadata *md, int *errorp) { struct g_mirror_disk *disk; int i, error; disk = malloc(sizeof(*disk), M_MIRROR, M_NOWAIT | M_ZERO); if (disk == NULL) { error = ENOMEM; goto fail; } disk->d_softc = sc; error = g_mirror_connect_disk(disk, pp); if (error != 0) goto fail; disk->d_id = md->md_did; disk->d_state = G_MIRROR_DISK_STATE_NONE; disk->d_priority = md->md_priority; disk->d_flags = md->md_dflags; error = g_getattr("GEOM::candelete", disk->d_consumer, &i); if (error == 0 && i != 0) disk->d_flags |= G_MIRROR_DISK_FLAG_CANDELETE; if (md->md_provider[0] != '\0') disk->d_flags |= G_MIRROR_DISK_FLAG_HARDCODED; disk->d_sync.ds_consumer = NULL; disk->d_sync.ds_offset = md->md_sync_offset; disk->d_sync.ds_offset_done = md->md_sync_offset; disk->d_sync.ds_update_ts = time_uptime; disk->d_genid = md->md_genid; disk->d_sync.ds_syncid = md->md_syncid; if (errorp != NULL) *errorp = 0; return (disk); fail: if (errorp != NULL) *errorp = error; if (disk != NULL) free(disk, M_MIRROR); return (NULL); } static void g_mirror_destroy_disk(struct g_mirror_disk *disk) { struct g_mirror_softc *sc; g_topology_assert_not(); sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_XLOCKED); LIST_REMOVE(disk, d_next); g_mirror_event_cancel(disk); if (sc->sc_hint == disk) sc->sc_hint = NULL; switch (disk->d_state) { case G_MIRROR_DISK_STATE_SYNCHRONIZING: g_mirror_sync_stop(disk, 1); /* FALLTHROUGH */ case G_MIRROR_DISK_STATE_NEW: case G_MIRROR_DISK_STATE_STALE: case G_MIRROR_DISK_STATE_ACTIVE: g_topology_lock(); g_mirror_disconnect_consumer(sc, disk->d_consumer); g_topology_unlock(); free(disk, M_MIRROR); break; default: KASSERT(0 == 1, ("Wrong disk state (%s, %s).", g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); } } static void g_mirror_free_device(struct g_mirror_softc *sc) { mtx_destroy(&sc->sc_queue_mtx); mtx_destroy(&sc->sc_events_mtx); mtx_destroy(&sc->sc_done_mtx); sx_destroy(&sc->sc_lock); free(sc, M_MIRROR); } static void g_mirror_providergone(struct g_provider *pp) { struct g_mirror_softc *sc = pp->private; if ((--sc->sc_refcnt) == 0) g_mirror_free_device(sc); } static void g_mirror_destroy_device(struct g_mirror_softc *sc) { struct g_mirror_disk *disk; struct g_mirror_event *ep; struct g_geom *gp; struct g_consumer *cp, *tmpcp; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); gp = sc->sc_geom; if (sc->sc_provider != NULL) g_mirror_destroy_provider(sc); for (disk = LIST_FIRST(&sc->sc_disks); disk != NULL; disk = LIST_FIRST(&sc->sc_disks)) { disk->d_flags &= ~G_MIRROR_DISK_FLAG_DIRTY; g_mirror_update_metadata(disk); g_mirror_destroy_disk(disk); } while ((ep = g_mirror_event_first(sc)) != NULL) { g_mirror_event_remove(sc, ep); if ((ep->e_flags & G_MIRROR_EVENT_DONTWAIT) != 0) g_mirror_event_free(ep); else { ep->e_error = ECANCELED; ep->e_flags |= G_MIRROR_EVENT_DONE; G_MIRROR_DEBUG(4, "%s: Waking up %p.", __func__, ep); mtx_lock(&sc->sc_events_mtx); wakeup(ep); mtx_unlock(&sc->sc_events_mtx); } } callout_drain(&sc->sc_callout); g_topology_lock(); LIST_FOREACH_SAFE(cp, &sc->sc_sync.ds_geom->consumer, consumer, tmpcp) { g_mirror_disconnect_consumer(sc, cp); } g_wither_geom(sc->sc_sync.ds_geom, ENXIO); G_MIRROR_DEBUG(0, "Device %s destroyed.", gp->name); g_wither_geom(gp, ENXIO); sx_xunlock(&sc->sc_lock); if ((--sc->sc_refcnt) == 0) g_mirror_free_device(sc); g_topology_unlock(); } static void g_mirror_orphan(struct g_consumer *cp) { struct g_mirror_disk *disk; 
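/*
 * The provider beneath one of our consumers has gone away. All that is
 * done here, in the GEOM event thread, is to request a syncid bump and
 * queue a DISCONNECTED event; the worker thread later processes that
 * event and performs the actual teardown.
 */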
g_topology_assert(); disk = cp->private; if (disk == NULL) return; disk->d_softc->sc_bump_id |= G_MIRROR_BUMP_SYNCID; g_mirror_event_send(disk, G_MIRROR_DISK_STATE_DISCONNECTED, G_MIRROR_EVENT_DONTWAIT); } /* * Return the next active disk on the list. * It is possible that it will be the same disk as the given one. * If there are no active disks on the list, NULL is returned. */ static __inline struct g_mirror_disk * g_mirror_find_next(struct g_mirror_softc *sc, struct g_mirror_disk *disk) { struct g_mirror_disk *dp; for (dp = LIST_NEXT(disk, d_next); dp != disk; dp = LIST_NEXT(dp, d_next)) { if (dp == NULL) dp = LIST_FIRST(&sc->sc_disks); if (dp->d_state == G_MIRROR_DISK_STATE_ACTIVE) break; } if (dp->d_state != G_MIRROR_DISK_STATE_ACTIVE) return (NULL); return (dp); } static struct g_mirror_disk * g_mirror_get_disk(struct g_mirror_softc *sc) { struct g_mirror_disk *disk; if (sc->sc_hint == NULL) { sc->sc_hint = LIST_FIRST(&sc->sc_disks); if (sc->sc_hint == NULL) return (NULL); } disk = sc->sc_hint; if (disk->d_state != G_MIRROR_DISK_STATE_ACTIVE) { disk = g_mirror_find_next(sc, disk); if (disk == NULL) return (NULL); } sc->sc_hint = g_mirror_find_next(sc, disk); return (disk); } static int g_mirror_write_metadata(struct g_mirror_disk *disk, struct g_mirror_metadata *md) { struct g_mirror_softc *sc; struct g_consumer *cp; off_t offset, length; u_char *sector; int error = 0; g_topology_assert_not(); sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_LOCKED); cp = disk->d_consumer; KASSERT(cp != NULL, ("NULL consumer (%s).", sc->sc_name)); KASSERT(cp->provider != NULL, ("NULL provider (%s).", sc->sc_name)); KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s closed? (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); length = cp->provider->sectorsize; offset = cp->provider->mediasize - length; sector = malloc((size_t)length, M_MIRROR, M_WAITOK | M_ZERO); if (md != NULL && (sc->sc_flags & G_MIRROR_DEVICE_FLAG_WIPE) == 0) { /* * Handle the case when the size of the parent provider was reduced.
*/ if (offset < md->md_mediasize) error = ENOSPC; else mirror_metadata_encode(md, sector); } KFAIL_POINT_ERROR(DEBUG_FP, g_mirror_metadata_write, error); if (error == 0) error = g_write_data(cp, offset, sector, length); free(sector, M_MIRROR); if (error != 0) { if ((disk->d_flags & G_MIRROR_DISK_FLAG_BROKEN) == 0) { disk->d_flags |= G_MIRROR_DISK_FLAG_BROKEN; G_MIRROR_DEBUG(0, "Cannot write metadata on %s " "(device=%s, error=%d).", g_mirror_get_diskname(disk), sc->sc_name, error); } else { G_MIRROR_DEBUG(1, "Cannot write metadata on %s " "(device=%s, error=%d).", g_mirror_get_diskname(disk), sc->sc_name, error); } if (g_mirror_disconnect_on_failure && g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) > 1) { sc->sc_bump_id |= G_MIRROR_BUMP_GENID; g_mirror_event_send(disk, G_MIRROR_DISK_STATE_DISCONNECTED, G_MIRROR_EVENT_DONTWAIT); } } return (error); } static int g_mirror_clear_metadata(struct g_mirror_disk *disk) { int error; g_topology_assert_not(); sx_assert(&disk->d_softc->sc_lock, SX_LOCKED); if (disk->d_softc->sc_type != G_MIRROR_TYPE_AUTOMATIC) return (0); error = g_mirror_write_metadata(disk, NULL); if (error == 0) { G_MIRROR_DEBUG(2, "Metadata on %s cleared.", g_mirror_get_diskname(disk)); } else { G_MIRROR_DEBUG(0, "Cannot clear metadata on disk %s (error=%d).", g_mirror_get_diskname(disk), error); } return (error); } void g_mirror_fill_metadata(struct g_mirror_softc *sc, struct g_mirror_disk *disk, struct g_mirror_metadata *md) { strlcpy(md->md_magic, G_MIRROR_MAGIC, sizeof(md->md_magic)); md->md_version = G_MIRROR_VERSION; strlcpy(md->md_name, sc->sc_name, sizeof(md->md_name)); md->md_mid = sc->sc_id; md->md_all = sc->sc_ndisks; md->md_slice = sc->sc_slice; md->md_balance = sc->sc_balance; md->md_genid = sc->sc_genid; md->md_mediasize = sc->sc_mediasize; md->md_sectorsize = sc->sc_sectorsize; md->md_mflags = (sc->sc_flags & G_MIRROR_DEVICE_FLAG_MASK); bzero(md->md_provider, sizeof(md->md_provider)); if (disk == NULL) { md->md_did = arc4random(); md->md_priority = 0; md->md_syncid = 0; md->md_dflags = 0; md->md_sync_offset = 0; md->md_provsize = 0; } else { md->md_did = disk->d_id; md->md_priority = disk->d_priority; md->md_syncid = disk->d_sync.ds_syncid; md->md_dflags = (disk->d_flags & G_MIRROR_DISK_FLAG_MASK); if (disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING) md->md_sync_offset = disk->d_sync.ds_offset_done; else md->md_sync_offset = 0; if ((disk->d_flags & G_MIRROR_DISK_FLAG_HARDCODED) != 0) { strlcpy(md->md_provider, disk->d_consumer->provider->name, sizeof(md->md_provider)); } md->md_provsize = disk->d_consumer->provider->mediasize; } } void g_mirror_update_metadata(struct g_mirror_disk *disk) { struct g_mirror_softc *sc; struct g_mirror_metadata md; int error; g_topology_assert_not(); sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_LOCKED); if (sc->sc_type != G_MIRROR_TYPE_AUTOMATIC) return; if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_WIPE) == 0) g_mirror_fill_metadata(sc, disk, &md); error = g_mirror_write_metadata(disk, &md); if (error == 0) { G_MIRROR_DEBUG(2, "Metadata on %s updated.", g_mirror_get_diskname(disk)); } else { G_MIRROR_DEBUG(0, "Cannot update metadata on disk %s (error=%d).", g_mirror_get_diskname(disk), error); } } static void g_mirror_bump_syncid(struct g_mirror_softc *sc) { struct g_mirror_disk *disk; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); KASSERT(g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) > 0, ("%s called with no active disks (device=%s).", __func__, sc->sc_name)); sc->sc_syncid++; G_MIRROR_DEBUG(1, "Device %s: syncid 
bumped to %u.", sc->sc_name, sc->sc_syncid); LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state == G_MIRROR_DISK_STATE_ACTIVE || disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING) { disk->d_sync.ds_syncid = sc->sc_syncid; g_mirror_update_metadata(disk); } } } static void g_mirror_bump_genid(struct g_mirror_softc *sc) { struct g_mirror_disk *disk; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); KASSERT(g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) > 0, ("%s called with no active disks (device=%s).", __func__, sc->sc_name)); sc->sc_genid++; G_MIRROR_DEBUG(1, "Device %s: genid bumped to %u.", sc->sc_name, sc->sc_genid); LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state == G_MIRROR_DISK_STATE_ACTIVE || disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING) { disk->d_genid = sc->sc_genid; g_mirror_update_metadata(disk); } } } static int g_mirror_idle(struct g_mirror_softc *sc, int acw) { struct g_mirror_disk *disk; int timeout; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); if (sc->sc_provider == NULL) return (0); if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_NOFAILSYNC) != 0) return (0); if (sc->sc_idle) return (0); if (sc->sc_writes > 0) return (0); if (acw > 0 || (acw == -1 && sc->sc_provider->acw > 0)) { timeout = g_mirror_idletime - (time_uptime - sc->sc_last_write); if (!g_mirror_shutdown && timeout > 0) return (timeout); } sc->sc_idle = 1; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state != G_MIRROR_DISK_STATE_ACTIVE) continue; G_MIRROR_DEBUG(2, "Disk %s (device %s) marked as clean.", g_mirror_get_diskname(disk), sc->sc_name); disk->d_flags &= ~G_MIRROR_DISK_FLAG_DIRTY; g_mirror_update_metadata(disk); } return (0); } static void g_mirror_unidle(struct g_mirror_softc *sc) { struct g_mirror_disk *disk; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_NOFAILSYNC) != 0) return; sc->sc_idle = 0; sc->sc_last_write = time_uptime; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state != G_MIRROR_DISK_STATE_ACTIVE) continue; G_MIRROR_DEBUG(2, "Disk %s (device %s) marked as dirty.", g_mirror_get_diskname(disk), sc->sc_name); disk->d_flags |= G_MIRROR_DISK_FLAG_DIRTY; g_mirror_update_metadata(disk); } } static void g_mirror_done(struct bio *bp) { struct g_mirror_softc *sc; sc = bp->bio_from->geom->softc; bp->bio_cflags = G_MIRROR_BIO_FLAG_REGULAR; mtx_lock(&sc->sc_queue_mtx); TAILQ_INSERT_TAIL(&sc->sc_queue, bp, bio_queue); mtx_unlock(&sc->sc_queue_mtx); wakeup(sc); } static void g_mirror_regular_request_error(struct g_mirror_softc *sc, struct g_mirror_disk *disk, struct bio *bp) { if (bp->bio_cmd == BIO_FLUSH && bp->bio_error == EOPNOTSUPP) return; if ((disk->d_flags & G_MIRROR_DISK_FLAG_BROKEN) == 0) { disk->d_flags |= G_MIRROR_DISK_FLAG_BROKEN; G_MIRROR_LOGREQ(0, bp, "Request failed (error=%d).", bp->bio_error); } else { G_MIRROR_LOGREQ(1, bp, "Request failed (error=%d).", bp->bio_error); } if (g_mirror_disconnect_on_failure && g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) > 1) { if (bp->bio_error == ENXIO && bp->bio_cmd == BIO_READ) sc->sc_bump_id |= G_MIRROR_BUMP_SYNCID; else if (bp->bio_error == ENXIO) sc->sc_bump_id |= G_MIRROR_BUMP_SYNCID_NOW; else sc->sc_bump_id |= G_MIRROR_BUMP_GENID; g_mirror_event_send(disk, G_MIRROR_DISK_STATE_DISCONNECTED, G_MIRROR_EVENT_DONTWAIT); } } static void g_mirror_regular_request(struct g_mirror_softc *sc, struct bio *bp) { struct g_mirror_disk *disk; struct bio *pbp; g_topology_assert_not(); KASSERT(sc->sc_provider == 
bp->bio_parent->bio_to, ("regular request %p with unexpected origin", bp)); pbp = bp->bio_parent; bp->bio_from->index--; if (bp->bio_cmd == BIO_WRITE || bp->bio_cmd == BIO_DELETE) sc->sc_writes--; disk = bp->bio_from->private; if (disk == NULL) { g_topology_lock(); g_mirror_kill_consumer(sc, bp->bio_from); g_topology_unlock(); } switch (bp->bio_cmd) { case BIO_READ: KFAIL_POINT_ERROR(DEBUG_FP, g_mirror_regular_request_read, bp->bio_error); break; case BIO_WRITE: KFAIL_POINT_ERROR(DEBUG_FP, g_mirror_regular_request_write, bp->bio_error); break; case BIO_DELETE: KFAIL_POINT_ERROR(DEBUG_FP, g_mirror_regular_request_delete, bp->bio_error); break; case BIO_FLUSH: KFAIL_POINT_ERROR(DEBUG_FP, g_mirror_regular_request_flush, bp->bio_error); break; } pbp->bio_inbed++; KASSERT(pbp->bio_inbed <= pbp->bio_children, ("bio_inbed (%u) is bigger than bio_children (%u).", pbp->bio_inbed, pbp->bio_children)); if (bp->bio_error == 0 && pbp->bio_error == 0) { G_MIRROR_LOGREQ(3, bp, "Request delivered."); g_destroy_bio(bp); if (pbp->bio_children == pbp->bio_inbed) { G_MIRROR_LOGREQ(3, pbp, "Request delivered."); pbp->bio_completed = pbp->bio_length; if (pbp->bio_cmd == BIO_WRITE || pbp->bio_cmd == BIO_DELETE) { TAILQ_REMOVE(&sc->sc_inflight, pbp, bio_queue); /* Release delayed sync requests if possible. */ g_mirror_sync_release(sc); } g_io_deliver(pbp, pbp->bio_error); } return; } else if (bp->bio_error != 0) { if (pbp->bio_error == 0) pbp->bio_error = bp->bio_error; if (disk != NULL) g_mirror_regular_request_error(sc, disk, bp); switch (pbp->bio_cmd) { case BIO_DELETE: case BIO_WRITE: case BIO_FLUSH: pbp->bio_inbed--; pbp->bio_children--; break; } } g_destroy_bio(bp); switch (pbp->bio_cmd) { case BIO_READ: if (pbp->bio_inbed < pbp->bio_children) break; if (g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) == 1) g_io_deliver(pbp, pbp->bio_error); else { pbp->bio_error = 0; mtx_lock(&sc->sc_queue_mtx); TAILQ_INSERT_TAIL(&sc->sc_queue, pbp, bio_queue); mtx_unlock(&sc->sc_queue_mtx); G_MIRROR_DEBUG(4, "%s: Waking up %p.", __func__, sc); wakeup(sc); } break; case BIO_DELETE: case BIO_WRITE: case BIO_FLUSH: if (pbp->bio_children == 0) { /* * All requests failed. */ } else if (pbp->bio_inbed < pbp->bio_children) { /* Do nothing. */ break; } else if (pbp->bio_children == pbp->bio_inbed) { /* Some requests succeeded. */ pbp->bio_error = 0; pbp->bio_completed = pbp->bio_length; } if (pbp->bio_cmd == BIO_WRITE || pbp->bio_cmd == BIO_DELETE) { TAILQ_REMOVE(&sc->sc_inflight, pbp, bio_queue); /* Release delayed sync requests if possible. 
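The completed write may have been the very request that forced synchronization requests for the same range onto the delayed queue; now that it is off the inflight queue, g_mirror_sync_release() can reissue them.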
*/ g_mirror_sync_release(sc); } g_io_deliver(pbp, pbp->bio_error); break; default: KASSERT(1 == 0, ("Invalid request: %u.", pbp->bio_cmd)); break; } } static void g_mirror_sync_done(struct bio *bp) { struct g_mirror_softc *sc; G_MIRROR_LOGREQ(3, bp, "Synchronization request delivered."); sc = bp->bio_from->geom->softc; bp->bio_cflags = G_MIRROR_BIO_FLAG_SYNC; mtx_lock(&sc->sc_queue_mtx); TAILQ_INSERT_TAIL(&sc->sc_queue, bp, bio_queue); mtx_unlock(&sc->sc_queue_mtx); wakeup(sc); } static void g_mirror_candelete(struct bio *bp) { struct g_mirror_softc *sc; struct g_mirror_disk *disk; int *val; sc = bp->bio_to->private; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_flags & G_MIRROR_DISK_FLAG_CANDELETE) break; } val = (int *)bp->bio_data; *val = (disk != NULL); g_io_deliver(bp, 0); } static void g_mirror_kernel_dump(struct bio *bp) { struct g_mirror_softc *sc; struct g_mirror_disk *disk; struct bio *cbp; struct g_kerneldump *gkd; /* * We configure dumping to the first component, because this component * will be used for reading with 'prefer' balance algorithm. * If the component with the highest priority is currently disconnected * we will not be able to read the dump after the reboot if it will be * connected and synchronized later. Can we do something better? */ sc = bp->bio_to->private; disk = LIST_FIRST(&sc->sc_disks); gkd = (struct g_kerneldump *)bp->bio_data; if (gkd->length > bp->bio_to->mediasize) gkd->length = bp->bio_to->mediasize; cbp = g_clone_bio(bp); if (cbp == NULL) { g_io_deliver(bp, ENOMEM); return; } cbp->bio_done = g_std_done; g_io_request(cbp, disk->d_consumer); G_MIRROR_DEBUG(1, "Kernel dump will go to %s.", g_mirror_get_diskname(disk)); } static void g_mirror_start(struct bio *bp) { struct g_mirror_softc *sc; sc = bp->bio_to->private; /* * If sc == NULL or there are no valid disks, provider's error * should be set and g_mirror_start() should not be called at all. */ KASSERT(sc != NULL && sc->sc_state == G_MIRROR_DEVICE_STATE_RUNNING, ("Provider's error should be set (error=%d)(mirror=%s).", bp->bio_to->error, bp->bio_to->name)); G_MIRROR_LOGREQ(3, bp, "Request received."); switch (bp->bio_cmd) { case BIO_READ: case BIO_WRITE: case BIO_DELETE: case BIO_FLUSH: break; case BIO_GETATTR: if (!strcmp(bp->bio_attribute, "GEOM::candelete")) { g_mirror_candelete(bp); return; } else if (strcmp("GEOM::kerneldump", bp->bio_attribute) == 0) { g_mirror_kernel_dump(bp); return; } /* FALLTHROUGH */ default: g_io_deliver(bp, EOPNOTSUPP); return; } mtx_lock(&sc->sc_queue_mtx); if (bp->bio_to->error != 0) { mtx_unlock(&sc->sc_queue_mtx); g_io_deliver(bp, bp->bio_to->error); return; } TAILQ_INSERT_TAIL(&sc->sc_queue, bp, bio_queue); mtx_unlock(&sc->sc_queue_mtx); G_MIRROR_DEBUG(4, "%s: Waking up %p.", __func__, sc); wakeup(sc); } /* * Return TRUE if the given request is colliding with a in-progress * synchronization request. 
*/ static bool g_mirror_sync_collision(struct g_mirror_softc *sc, struct bio *bp) { struct g_mirror_disk *disk; struct bio *sbp; off_t rstart, rend, sstart, send; u_int i; if (sc->sc_sync.ds_ndisks == 0) return (false); rstart = bp->bio_offset; rend = bp->bio_offset + bp->bio_length; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state != G_MIRROR_DISK_STATE_SYNCHRONIZING) continue; for (i = 0; i < g_mirror_syncreqs; i++) { sbp = disk->d_sync.ds_bios[i]; if (sbp == NULL) continue; sstart = sbp->bio_offset; send = sbp->bio_offset + sbp->bio_length; if (rend > sstart && rstart < send) return (true); } } return (false); } /* * Return TRUE if the given sync request is colliding with an in-progress * regular request. */ static bool g_mirror_regular_collision(struct g_mirror_softc *sc, struct bio *sbp) { off_t rstart, rend, sstart, send; struct bio *bp; if (sc->sc_sync.ds_ndisks == 0) return (false); sstart = sbp->bio_offset; send = sbp->bio_offset + sbp->bio_length; TAILQ_FOREACH(bp, &sc->sc_inflight, bio_queue) { rstart = bp->bio_offset; rend = bp->bio_offset + bp->bio_length; if (rend > sstart && rstart < send) return (true); } return (false); } /* * Puts a regular request onto the delayed queue. */ static void g_mirror_regular_delay(struct g_mirror_softc *sc, struct bio *bp) { G_MIRROR_LOGREQ(2, bp, "Delaying request."); TAILQ_INSERT_TAIL(&sc->sc_regular_delayed, bp, bio_queue); } /* * Puts a synchronization request onto the delayed queue. */ static void g_mirror_sync_delay(struct g_mirror_softc *sc, struct bio *bp) { G_MIRROR_LOGREQ(2, bp, "Delaying synchronization request."); TAILQ_INSERT_TAIL(&sc->sc_sync_delayed, bp, bio_queue); } /* * Requeue delayed regular requests. */ static void g_mirror_regular_release(struct g_mirror_softc *sc) { struct bio *bp; if ((bp = TAILQ_FIRST(&sc->sc_regular_delayed)) == NULL) return; if (g_mirror_sync_collision(sc, bp)) return; G_MIRROR_DEBUG(2, "Requeuing regular requests after collision."); mtx_lock(&sc->sc_queue_mtx); TAILQ_CONCAT(&sc->sc_regular_delayed, &sc->sc_queue, bio_queue); TAILQ_SWAP(&sc->sc_regular_delayed, &sc->sc_queue, bio, bio_queue); mtx_unlock(&sc->sc_queue_mtx); } /* * Releases delayed sync requests that no longer collide with regular * requests. */ static void g_mirror_sync_release(struct g_mirror_softc *sc) { struct bio *bp, *bp2; TAILQ_FOREACH_SAFE(bp, &sc->sc_sync_delayed, bio_queue, bp2) { if (g_mirror_regular_collision(sc, bp)) continue; TAILQ_REMOVE(&sc->sc_sync_delayed, bp, bio_queue); G_MIRROR_LOGREQ(2, bp, "Releasing delayed synchronization request."); g_io_request(bp, bp->bio_from); } } /* * Free a synchronization request and clear its slot in the array. */ static void g_mirror_sync_request_free(struct g_mirror_disk *disk, struct bio *bp) { int idx; if (disk != NULL && disk->d_sync.ds_bios != NULL) { idx = (int)(uintptr_t)bp->bio_caller1; KASSERT(disk->d_sync.ds_bios[idx] == bp, ("unexpected sync BIO at %p:%d", disk, idx)); disk->d_sync.ds_bios[idx] = NULL; } free(bp->bio_data, M_MIRROR); g_destroy_bio(bp); } /* * Handle synchronization requests. * Every synchronization request is a two-step process: first, a read request is * sent to the mirror provider via the sync consumer. If that request completes * successfully, it is converted to a write and sent to the disk being * synchronized. If the write also completes successfully, the synchronization * offset is advanced and a new read request is submitted.
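 * A rough sketch of how the same bio is recycled between the two steps
 * (see g_mirror_sync_request() below):
 *
 *	read done:  bp->bio_cmd = BIO_WRITE; g_io_request(bp, disk consumer)
 *	write done: g_mirror_sync_reinit(disk, bp, sync->ds_offset);
 *	            g_io_request(bp, sync->ds_consumer)
 *
 * so one bio and its data buffer shuttle back and forth until the
 * synchronization offset reaches the end of the mirror.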
*/ static void g_mirror_sync_request(struct g_mirror_softc *sc, struct bio *bp) { struct g_mirror_disk *disk; struct g_mirror_disk_sync *sync; KASSERT((bp->bio_cmd == BIO_READ && bp->bio_from->geom == sc->sc_sync.ds_geom) || (bp->bio_cmd == BIO_WRITE && bp->bio_from->geom == sc->sc_geom), ("Sync BIO %p with unexpected origin", bp)); bp->bio_from->index--; disk = bp->bio_from->private; if (disk == NULL) { sx_xunlock(&sc->sc_lock); /* Avoid recursion on sc_lock. */ g_topology_lock(); g_mirror_kill_consumer(sc, bp->bio_from); g_topology_unlock(); g_mirror_sync_request_free(NULL, bp); sx_xlock(&sc->sc_lock); return; } sync = &disk->d_sync; /* * Synchronization request. */ switch (bp->bio_cmd) { case BIO_READ: { struct g_consumer *cp; KFAIL_POINT_ERROR(DEBUG_FP, g_mirror_sync_request_read, bp->bio_error); if (bp->bio_error != 0) { G_MIRROR_LOGREQ(0, bp, "Synchronization request failed (error=%d).", bp->bio_error); /* * The read error will trigger a syncid bump, so there's * no need to do that here. * * The read error handling for regular requests will * retry the read from all active mirrors before passing * the error back up, so there's no need to retry here. */ g_mirror_sync_request_free(disk, bp); g_mirror_event_send(disk, G_MIRROR_DISK_STATE_DISCONNECTED, G_MIRROR_EVENT_DONTWAIT); return; } G_MIRROR_LOGREQ(3, bp, "Synchronization request half-finished."); bp->bio_cmd = BIO_WRITE; bp->bio_cflags = 0; cp = disk->d_consumer; KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); cp->index++; g_io_request(bp, cp); return; } case BIO_WRITE: { off_t offset; int i; KFAIL_POINT_ERROR(DEBUG_FP, g_mirror_sync_request_write, bp->bio_error); if (bp->bio_error != 0) { G_MIRROR_LOGREQ(0, bp, "Synchronization request failed (error=%d).", bp->bio_error); g_mirror_sync_request_free(disk, bp); sc->sc_bump_id |= G_MIRROR_BUMP_GENID; g_mirror_event_send(disk, G_MIRROR_DISK_STATE_DISCONNECTED, G_MIRROR_EVENT_DONTWAIT); return; } G_MIRROR_LOGREQ(3, bp, "Synchronization request finished."); if (sync->ds_offset >= sc->sc_mediasize || sync->ds_consumer == NULL || (sc->sc_flags & G_MIRROR_DEVICE_FLAG_DESTROY) != 0) { /* Don't send more synchronization requests. */ sync->ds_inflight--; g_mirror_sync_request_free(disk, bp); if (sync->ds_inflight > 0) return; if (sync->ds_consumer == NULL || (sc->sc_flags & G_MIRROR_DEVICE_FLAG_DESTROY) != 0) { return; } /* Disk up-to-date, activate it. */ g_mirror_event_send(disk, G_MIRROR_DISK_STATE_ACTIVE, G_MIRROR_EVENT_DONTWAIT); return; } /* Send next synchronization request. */ g_mirror_sync_reinit(disk, bp, sync->ds_offset); sync->ds_offset += bp->bio_length; G_MIRROR_LOGREQ(3, bp, "Sending synchronization request."); sync->ds_consumer->index++; /* * Delay the request if it is colliding with a regular request. */ if (g_mirror_regular_collision(sc, bp)) g_mirror_sync_delay(sc, bp); else g_io_request(bp, sync->ds_consumer); /* Requeue delayed requests if possible. 
*/ g_mirror_regular_release(sc); /* Find the smallest offset */ offset = sc->sc_mediasize; for (i = 0; i < g_mirror_syncreqs; i++) { bp = sync->ds_bios[i]; if (bp != NULL && bp->bio_offset < offset) offset = bp->bio_offset; } if (g_mirror_sync_period > 0 && time_uptime - sync->ds_update_ts > g_mirror_sync_period) { sync->ds_offset_done = offset; g_mirror_update_metadata(disk); sync->ds_update_ts = time_uptime; } return; } default: panic("Invalid I/O request %p", bp); } } static void g_mirror_request_prefer(struct g_mirror_softc *sc, struct bio *bp) { struct g_mirror_disk *disk; struct g_consumer *cp; struct bio *cbp; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state == G_MIRROR_DISK_STATE_ACTIVE) break; } if (disk == NULL) { if (bp->bio_error == 0) bp->bio_error = ENXIO; g_io_deliver(bp, bp->bio_error); return; } cbp = g_clone_bio(bp); if (cbp == NULL) { if (bp->bio_error == 0) bp->bio_error = ENOMEM; g_io_deliver(bp, bp->bio_error); return; } /* * Fill in the component buf structure. */ cp = disk->d_consumer; cbp->bio_done = g_mirror_done; cbp->bio_to = cp->provider; G_MIRROR_LOGREQ(3, cbp, "Sending request."); KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); cp->index++; g_io_request(cbp, cp); } static void g_mirror_request_round_robin(struct g_mirror_softc *sc, struct bio *bp) { struct g_mirror_disk *disk; struct g_consumer *cp; struct bio *cbp; disk = g_mirror_get_disk(sc); if (disk == NULL) { if (bp->bio_error == 0) bp->bio_error = ENXIO; g_io_deliver(bp, bp->bio_error); return; } cbp = g_clone_bio(bp); if (cbp == NULL) { if (bp->bio_error == 0) bp->bio_error = ENOMEM; g_io_deliver(bp, bp->bio_error); return; } /* * Fill in the component buf structure. */ cp = disk->d_consumer; cbp->bio_done = g_mirror_done; cbp->bio_to = cp->provider; G_MIRROR_LOGREQ(3, cbp, "Sending request."); KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); cp->index++; g_io_request(cbp, cp); } #define TRACK_SIZE (1 * 1024 * 1024) #define LOAD_SCALE 256 #define ABS(x) (((x) >= 0) ? (x) : (-(x))) static void g_mirror_request_load(struct g_mirror_softc *sc, struct bio *bp) { struct g_mirror_disk *disk, *dp; struct g_consumer *cp; struct bio *cbp; int prio, best; /* Find a disk with the smallest load. */ disk = NULL; best = INT_MAX; LIST_FOREACH(dp, &sc->sc_disks, d_next) { if (dp->d_state != G_MIRROR_DISK_STATE_ACTIVE) continue; prio = dp->load; /* If disk head is precisely in position - highly prefer it. */ if (dp->d_last_offset == bp->bio_offset) prio -= 2 * LOAD_SCALE; else /* If disk head is close to position - prefer it. */ if (ABS(dp->d_last_offset - bp->bio_offset) < TRACK_SIZE) prio -= 1 * LOAD_SCALE; if (prio <= best) { disk = dp; best = prio; } } KASSERT(disk != NULL, ("NULL disk for %s.", sc->sc_name)); cbp = g_clone_bio(bp); if (cbp == NULL) { if (bp->bio_error == 0) bp->bio_error = ENOMEM; g_io_deliver(bp, bp->bio_error); return; } /* * Fill in the component buf structure. */ cp = disk->d_consumer; cbp->bio_done = g_mirror_done; cbp->bio_to = cp->provider; G_MIRROR_LOGREQ(3, cbp, "Sending request."); KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); cp->index++; /* Remember last head position */ disk->d_last_offset = bp->bio_offset + bp->bio_length; /* Update loads. 
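Each disk's load is kept as an exponentially weighted moving average of its number of outstanding requests: load = (index * LOAD_SCALE + load * 7) / 8, i.e. one eighth new sample and seven eighths history. With LOAD_SCALE 256, an idle disk (index 0) whose load is 256 decays to 224 after one update.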
*/ LIST_FOREACH(dp, &sc->sc_disks, d_next) { dp->load = (dp->d_consumer->index * LOAD_SCALE + dp->load * 7) / 8; } g_io_request(cbp, cp); } static void g_mirror_request_split(struct g_mirror_softc *sc, struct bio *bp) { struct bio_queue queue; struct g_mirror_disk *disk; struct g_consumer *cp; struct bio *cbp; off_t left, mod, offset, slice; u_char *data; u_int ndisks; if (bp->bio_length <= sc->sc_slice) { g_mirror_request_round_robin(sc, bp); return; } ndisks = g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE); slice = bp->bio_length / ndisks; mod = slice % sc->sc_provider->sectorsize; if (mod != 0) slice += sc->sc_provider->sectorsize - mod; /* * Allocate all bios before sending any request, so we can * return ENOMEM in a nice and clean way. */ left = bp->bio_length; offset = bp->bio_offset; data = bp->bio_data; TAILQ_INIT(&queue); LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state != G_MIRROR_DISK_STATE_ACTIVE) continue; cbp = g_clone_bio(bp); if (cbp == NULL) { while ((cbp = TAILQ_FIRST(&queue)) != NULL) { TAILQ_REMOVE(&queue, cbp, bio_queue); g_destroy_bio(cbp); } if (bp->bio_error == 0) bp->bio_error = ENOMEM; g_io_deliver(bp, bp->bio_error); return; } TAILQ_INSERT_TAIL(&queue, cbp, bio_queue); cbp->bio_done = g_mirror_done; cbp->bio_caller1 = disk; cbp->bio_to = disk->d_consumer->provider; cbp->bio_offset = offset; cbp->bio_data = data; cbp->bio_length = MIN(left, slice); left -= cbp->bio_length; if (left == 0) break; offset += cbp->bio_length; data += cbp->bio_length; } while ((cbp = TAILQ_FIRST(&queue)) != NULL) { TAILQ_REMOVE(&queue, cbp, bio_queue); G_MIRROR_LOGREQ(3, cbp, "Sending request."); disk = cbp->bio_caller1; cbp->bio_caller1 = NULL; cp = disk->d_consumer; KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); disk->d_consumer->index++; g_io_request(cbp, disk->d_consumer); } } static void g_mirror_register_request(struct g_mirror_softc *sc, struct bio *bp) { struct bio_queue queue; struct bio *cbp; struct g_consumer *cp; struct g_mirror_disk *disk; sx_assert(&sc->sc_lock, SA_XLOCKED); /* * To avoid ordering issues, if a write is deferred because of a * collision with a sync request, all I/O is deferred until that * write is initiated. */ if (bp->bio_from->geom != sc->sc_sync.ds_geom && !TAILQ_EMPTY(&sc->sc_regular_delayed)) { g_mirror_regular_delay(sc, bp); return; } switch (bp->bio_cmd) { case BIO_READ: switch (sc->sc_balance) { case G_MIRROR_BALANCE_LOAD: g_mirror_request_load(sc, bp); break; case G_MIRROR_BALANCE_PREFER: g_mirror_request_prefer(sc, bp); break; case G_MIRROR_BALANCE_ROUND_ROBIN: g_mirror_request_round_robin(sc, bp); break; case G_MIRROR_BALANCE_SPLIT: g_mirror_request_split(sc, bp); break; } return; case BIO_WRITE: case BIO_DELETE: /* * Delay the request if it is colliding with a synchronization * request. */ if (g_mirror_sync_collision(sc, bp)) { g_mirror_regular_delay(sc, bp); return; } if (sc->sc_idle) g_mirror_unidle(sc); else sc->sc_last_write = time_uptime; /* * Bump syncid on first write. */ if ((sc->sc_bump_id & G_MIRROR_BUMP_SYNCID) != 0) { sc->sc_bump_id &= ~G_MIRROR_BUMP_SYNCID; g_mirror_bump_syncid(sc); } /* * Allocate all bios before sending any request, so we can * return ENOMEM in a nice and clean way.
*/ TAILQ_INIT(&queue); LIST_FOREACH(disk, &sc->sc_disks, d_next) { switch (disk->d_state) { case G_MIRROR_DISK_STATE_ACTIVE: break; case G_MIRROR_DISK_STATE_SYNCHRONIZING: if (bp->bio_offset >= disk->d_sync.ds_offset) continue; break; default: continue; } if (bp->bio_cmd == BIO_DELETE && (disk->d_flags & G_MIRROR_DISK_FLAG_CANDELETE) == 0) continue; cbp = g_clone_bio(bp); if (cbp == NULL) { while ((cbp = TAILQ_FIRST(&queue)) != NULL) { TAILQ_REMOVE(&queue, cbp, bio_queue); g_destroy_bio(cbp); } if (bp->bio_error == 0) bp->bio_error = ENOMEM; g_io_deliver(bp, bp->bio_error); return; } TAILQ_INSERT_TAIL(&queue, cbp, bio_queue); cbp->bio_done = g_mirror_done; cp = disk->d_consumer; cbp->bio_caller1 = cp; cbp->bio_to = cp->provider; KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); } if (TAILQ_EMPTY(&queue)) { KASSERT(bp->bio_cmd == BIO_DELETE, ("No consumers for regular request %p", bp)); g_io_deliver(bp, EOPNOTSUPP); return; } while ((cbp = TAILQ_FIRST(&queue)) != NULL) { G_MIRROR_LOGREQ(3, cbp, "Sending request."); TAILQ_REMOVE(&queue, cbp, bio_queue); cp = cbp->bio_caller1; cbp->bio_caller1 = NULL; cp->index++; sc->sc_writes++; g_io_request(cbp, cp); } /* * Put request onto inflight queue, so we can check if new * synchronization requests don't collide with it. */ TAILQ_INSERT_TAIL(&sc->sc_inflight, bp, bio_queue); return; case BIO_FLUSH: TAILQ_INIT(&queue); LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state != G_MIRROR_DISK_STATE_ACTIVE) continue; cbp = g_clone_bio(bp); if (cbp == NULL) { while ((cbp = TAILQ_FIRST(&queue)) != NULL) { TAILQ_REMOVE(&queue, cbp, bio_queue); g_destroy_bio(cbp); } if (bp->bio_error == 0) bp->bio_error = ENOMEM; g_io_deliver(bp, bp->bio_error); return; } TAILQ_INSERT_TAIL(&queue, cbp, bio_queue); cbp->bio_done = g_mirror_done; cbp->bio_caller1 = disk; cbp->bio_to = disk->d_consumer->provider; } KASSERT(!TAILQ_EMPTY(&queue), ("No consumers for regular request %p", bp)); while ((cbp = TAILQ_FIRST(&queue)) != NULL) { G_MIRROR_LOGREQ(3, cbp, "Sending request."); TAILQ_REMOVE(&queue, cbp, bio_queue); disk = cbp->bio_caller1; cbp->bio_caller1 = NULL; cp = disk->d_consumer; KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); cp->index++; g_io_request(cbp, cp); } break; default: KASSERT(1 == 0, ("Invalid command here: %u (device=%s)", bp->bio_cmd, sc->sc_name)); break; } } static int g_mirror_can_destroy(struct g_mirror_softc *sc) { struct g_geom *gp; struct g_consumer *cp; g_topology_assert(); gp = sc->sc_geom; if (gp->softc == NULL) return (1); if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_TASTING) != 0) return (0); LIST_FOREACH(cp, &gp->consumer, consumer) { if (g_mirror_is_busy(sc, cp)) return (0); } gp = sc->sc_sync.ds_geom; LIST_FOREACH(cp, &gp->consumer, consumer) { if (g_mirror_is_busy(sc, cp)) return (0); } G_MIRROR_DEBUG(2, "No I/O requests for %s, it can be destroyed.", sc->sc_name); return (1); } static int g_mirror_try_destroy(struct g_mirror_softc *sc) { if (sc->sc_rootmount != NULL) { G_MIRROR_DEBUG(1, "root_mount_rel[%u] %p", __LINE__, sc->sc_rootmount); root_mount_rel(sc->sc_rootmount); sc->sc_rootmount = NULL; } g_topology_lock(); if (!g_mirror_can_destroy(sc)) { g_topology_unlock(); return (0); } sc->sc_geom->softc = NULL; sc->sc_sync.ds_geom->softc = NULL; if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_DRAIN) != 0) { g_topology_unlock(); G_MIRROR_DEBUG(4, "%s: Waking 
up %p.", __func__, &sc->sc_worker); /* Unlock sc_lock here, as it can be destroyed after wakeup. */ sx_xunlock(&sc->sc_lock); wakeup(&sc->sc_worker); sc->sc_worker = NULL; } else { g_topology_unlock(); g_mirror_destroy_device(sc); } return (1); } /* * Worker thread. */ static void g_mirror_worker(void *arg) { struct g_mirror_softc *sc; struct g_mirror_event *ep; struct bio *bp; int timeout; sc = arg; thread_lock(curthread); sched_prio(curthread, PRIBIO); thread_unlock(curthread); sx_xlock(&sc->sc_lock); for (;;) { G_MIRROR_DEBUG(5, "%s: Let's see...", __func__); /* * First take a look at events. * This is important to handle events before any I/O requests. */ ep = g_mirror_event_first(sc); if (ep != NULL) { g_mirror_event_remove(sc, ep); if ((ep->e_flags & G_MIRROR_EVENT_DEVICE) != 0) { /* Update only device status. */ G_MIRROR_DEBUG(3, "Running event for device %s.", sc->sc_name); ep->e_error = 0; g_mirror_update_device(sc, true); } else { /* Update disk status. */ G_MIRROR_DEBUG(3, "Running event for disk %s.", g_mirror_get_diskname(ep->e_disk)); ep->e_error = g_mirror_update_disk(ep->e_disk, ep->e_state); if (ep->e_error == 0) g_mirror_update_device(sc, false); } if ((ep->e_flags & G_MIRROR_EVENT_DONTWAIT) != 0) { KASSERT(ep->e_error == 0, ("Error cannot be handled.")); g_mirror_event_free(ep); } else { ep->e_flags |= G_MIRROR_EVENT_DONE; G_MIRROR_DEBUG(4, "%s: Waking up %p.", __func__, ep); mtx_lock(&sc->sc_events_mtx); wakeup(ep); mtx_unlock(&sc->sc_events_mtx); } if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_DESTROY) != 0) { if (g_mirror_try_destroy(sc)) { curthread->td_pflags &= ~TDP_GEOM; G_MIRROR_DEBUG(1, "Thread exiting."); kproc_exit(0); } } G_MIRROR_DEBUG(5, "%s: I'm here 1.", __func__); continue; } /* * Check if we can mark array as CLEAN and if we can't take * how much seconds should we wait. */ timeout = g_mirror_idle(sc, -1); /* * Handle I/O requests. */ mtx_lock(&sc->sc_queue_mtx); bp = TAILQ_FIRST(&sc->sc_queue); if (bp != NULL) TAILQ_REMOVE(&sc->sc_queue, bp, bio_queue); else { if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_DESTROY) != 0) { mtx_unlock(&sc->sc_queue_mtx); if (g_mirror_try_destroy(sc)) { curthread->td_pflags &= ~TDP_GEOM; G_MIRROR_DEBUG(1, "Thread exiting."); kproc_exit(0); } mtx_lock(&sc->sc_queue_mtx); if (!TAILQ_EMPTY(&sc->sc_queue)) { mtx_unlock(&sc->sc_queue_mtx); continue; } } if (g_mirror_event_first(sc) != NULL) { mtx_unlock(&sc->sc_queue_mtx); continue; } sx_xunlock(&sc->sc_lock); MSLEEP(sc, &sc->sc_queue_mtx, PRIBIO | PDROP, "m:w1", timeout * hz); sx_xlock(&sc->sc_lock); G_MIRROR_DEBUG(5, "%s: I'm here 4.", __func__); continue; } mtx_unlock(&sc->sc_queue_mtx); if (bp->bio_from->geom == sc->sc_sync.ds_geom && (bp->bio_cflags & G_MIRROR_BIO_FLAG_SYNC) != 0) { /* * Handle completion of the first half (the read) of a * block synchronization operation. */ g_mirror_sync_request(sc, bp); } else if (bp->bio_to != sc->sc_provider) { if ((bp->bio_cflags & G_MIRROR_BIO_FLAG_REGULAR) != 0) /* * Handle completion of a regular I/O request. */ g_mirror_regular_request(sc, bp); else if ((bp->bio_cflags & G_MIRROR_BIO_FLAG_SYNC) != 0) /* * Handle completion of the second half (the * write) of a block synchronization operation. */ g_mirror_sync_request(sc, bp); else { KASSERT(0, ("Invalid request cflags=0x%hx to=%s.", bp->bio_cflags, bp->bio_to->name)); } } else { /* * Initiate an I/O request. 
*/ g_mirror_register_request(sc, bp); } G_MIRROR_DEBUG(5, "%s: I'm here 9.", __func__); } } static void g_mirror_update_idle(struct g_mirror_softc *sc, struct g_mirror_disk *disk) { sx_assert(&sc->sc_lock, SX_LOCKED); if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_NOFAILSYNC) != 0) return; if (!sc->sc_idle && (disk->d_flags & G_MIRROR_DISK_FLAG_DIRTY) == 0) { G_MIRROR_DEBUG(2, "Disk %s (device %s) marked as dirty.", g_mirror_get_diskname(disk), sc->sc_name); disk->d_flags |= G_MIRROR_DISK_FLAG_DIRTY; } else if (sc->sc_idle && (disk->d_flags & G_MIRROR_DISK_FLAG_DIRTY) != 0) { G_MIRROR_DEBUG(2, "Disk %s (device %s) marked as clean.", g_mirror_get_diskname(disk), sc->sc_name); disk->d_flags &= ~G_MIRROR_DISK_FLAG_DIRTY; } } static void g_mirror_sync_reinit(const struct g_mirror_disk *disk, struct bio *bp, off_t offset) { void *data; int idx; data = bp->bio_data; idx = (int)(uintptr_t)bp->bio_caller1; g_reset_bio(bp); bp->bio_cmd = BIO_READ; bp->bio_data = data; bp->bio_done = g_mirror_sync_done; bp->bio_from = disk->d_sync.ds_consumer; bp->bio_to = disk->d_softc->sc_provider; bp->bio_caller1 = (void *)(uintptr_t)idx; bp->bio_offset = offset; bp->bio_length = MIN(MAXPHYS, disk->d_softc->sc_mediasize - bp->bio_offset); } static void g_mirror_sync_start(struct g_mirror_disk *disk) { struct g_mirror_softc *sc; struct g_mirror_disk_sync *sync; struct g_consumer *cp; struct bio *bp; int error, i; g_topology_assert_not(); sc = disk->d_softc; sync = &disk->d_sync; sx_assert(&sc->sc_lock, SX_LOCKED); KASSERT(disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING, ("Disk %s is not marked for synchronization.", g_mirror_get_diskname(disk))); KASSERT(sc->sc_state == G_MIRROR_DEVICE_STATE_RUNNING, ("Device not in RUNNING state (%s, %u).", sc->sc_name, sc->sc_state)); sx_xunlock(&sc->sc_lock); g_topology_lock(); cp = g_new_consumer(sc->sc_sync.ds_geom); cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE; error = g_attach(cp, sc->sc_provider); KASSERT(error == 0, ("Cannot attach to %s (error=%d).", sc->sc_name, error)); error = g_access(cp, 1, 0, 0); KASSERT(error == 0, ("Cannot open %s (error=%d).", sc->sc_name, error)); g_topology_unlock(); sx_xlock(&sc->sc_lock); G_MIRROR_DEBUG(0, "Device %s: rebuilding provider %s.", sc->sc_name, g_mirror_get_diskname(disk)); if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_NOFAILSYNC) == 0) disk->d_flags |= G_MIRROR_DISK_FLAG_DIRTY; KASSERT(sync->ds_consumer == NULL, ("Sync consumer already exists (device=%s, disk=%s).", sc->sc_name, g_mirror_get_diskname(disk))); sync->ds_consumer = cp; sync->ds_consumer->private = disk; sync->ds_consumer->index = 0; /* * Allocate memory for synchronization bios and initialize them. */ sync->ds_bios = malloc(sizeof(struct bio *) * g_mirror_syncreqs, M_MIRROR, M_WAITOK); for (i = 0; i < g_mirror_syncreqs; i++) { bp = g_alloc_bio(); sync->ds_bios[i] = bp; bp->bio_data = malloc(MAXPHYS, M_MIRROR, M_WAITOK); bp->bio_caller1 = (void *)(uintptr_t)i; g_mirror_sync_reinit(disk, bp, sync->ds_offset); sync->ds_offset += bp->bio_length; } /* Increase the number of disks in SYNCHRONIZING state. */ sc->sc_sync.ds_ndisks++; /* Set the number of in-flight synchronization requests. */ sync->ds_inflight = g_mirror_syncreqs; /* * Fire off first synchronization requests. */ for (i = 0; i < g_mirror_syncreqs; i++) { bp = sync->ds_bios[i]; G_MIRROR_LOGREQ(3, bp, "Sending synchronization request."); sync->ds_consumer->index++; /* * Delay the request if it is colliding with a regular request. 
*/ if (g_mirror_regular_collision(sc, bp)) g_mirror_sync_delay(sc, bp); else g_io_request(bp, sync->ds_consumer); } } /* * Stop synchronization process. * type: 0 - synchronization finished * 1 - synchronization stopped */ static void g_mirror_sync_stop(struct g_mirror_disk *disk, int type) { struct g_mirror_softc *sc; struct g_consumer *cp; g_topology_assert_not(); sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_LOCKED); KASSERT(disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING, ("Wrong disk state (%s, %s).", g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); if (disk->d_sync.ds_consumer == NULL) return; if (type == 0) { G_MIRROR_DEBUG(0, "Device %s: rebuilding provider %s finished.", sc->sc_name, g_mirror_get_diskname(disk)); } else /* if (type == 1) */ { G_MIRROR_DEBUG(0, "Device %s: rebuilding provider %s stopped.", sc->sc_name, g_mirror_get_diskname(disk)); } g_mirror_regular_release(sc); free(disk->d_sync.ds_bios, M_MIRROR); disk->d_sync.ds_bios = NULL; cp = disk->d_sync.ds_consumer; disk->d_sync.ds_consumer = NULL; disk->d_flags &= ~G_MIRROR_DISK_FLAG_DIRTY; sc->sc_sync.ds_ndisks--; sx_xunlock(&sc->sc_lock); /* Avoid recursion on sc_lock. */ g_topology_lock(); g_mirror_kill_consumer(sc, cp); g_topology_unlock(); sx_xlock(&sc->sc_lock); } static void g_mirror_launch_provider(struct g_mirror_softc *sc) { struct g_mirror_disk *disk; struct g_provider *pp, *dp; sx_assert(&sc->sc_lock, SX_LOCKED); g_topology_lock(); pp = g_new_providerf(sc->sc_geom, "mirror/%s", sc->sc_name); pp->flags |= G_PF_DIRECT_RECEIVE; pp->mediasize = sc->sc_mediasize; pp->sectorsize = sc->sc_sectorsize; pp->stripesize = 0; pp->stripeoffset = 0; /* Splitting of unmapped BIO's could work but isn't implemented now */ if (sc->sc_balance != G_MIRROR_BALANCE_SPLIT) pp->flags |= G_PF_ACCEPT_UNMAPPED; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_consumer && disk->d_consumer->provider) { dp = disk->d_consumer->provider; if (dp->stripesize > pp->stripesize) { pp->stripesize = dp->stripesize; pp->stripeoffset = dp->stripeoffset; } /* A provider underneath us doesn't support unmapped */ if ((dp->flags & G_PF_ACCEPT_UNMAPPED) == 0) { G_MIRROR_DEBUG(0, "Cancelling unmapped " "because of %s.", dp->name); pp->flags &= ~G_PF_ACCEPT_UNMAPPED; } } } pp->private = sc; sc->sc_refcnt++; sc->sc_provider = pp; g_error_provider(pp, 0); g_topology_unlock(); G_MIRROR_DEBUG(0, "Device %s launched (%u/%u).", pp->name, g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE), sc->sc_ndisks); LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING) g_mirror_sync_start(disk); } } static void g_mirror_destroy_provider(struct g_mirror_softc *sc) { struct g_mirror_disk *disk; struct bio *bp; g_topology_assert_not(); KASSERT(sc->sc_provider != NULL, ("NULL provider (device=%s).", sc->sc_name)); LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING) g_mirror_sync_stop(disk, 1); } g_topology_lock(); g_error_provider(sc->sc_provider, ENXIO); mtx_lock(&sc->sc_queue_mtx); while ((bp = TAILQ_FIRST(&sc->sc_queue)) != NULL) { TAILQ_REMOVE(&sc->sc_queue, bp, bio_queue); /* * Abort any pending I/O that wasn't generated by us. * Synchronization requests and requests destined for individual * mirror components can be destroyed immediately. 
*/ if (bp->bio_to == sc->sc_provider && bp->bio_from->geom != sc->sc_sync.ds_geom) { g_io_deliver(bp, ENXIO); } else { if ((bp->bio_cflags & G_MIRROR_BIO_FLAG_SYNC) != 0) free(bp->bio_data, M_MIRROR); g_destroy_bio(bp); } } mtx_unlock(&sc->sc_queue_mtx); g_wither_provider(sc->sc_provider, ENXIO); sc->sc_provider = NULL; G_MIRROR_DEBUG(0, "Device %s: provider destroyed.", sc->sc_name); g_topology_unlock(); } static void g_mirror_go(void *arg) { struct g_mirror_softc *sc; sc = arg; G_MIRROR_DEBUG(0, "Force device %s start due to timeout.", sc->sc_name); g_mirror_event_send(sc, 0, G_MIRROR_EVENT_DONTWAIT | G_MIRROR_EVENT_DEVICE); } static u_int g_mirror_determine_state(struct g_mirror_disk *disk) { struct g_mirror_softc *sc; u_int state; sc = disk->d_softc; if (sc->sc_syncid == disk->d_sync.ds_syncid) { if ((disk->d_flags & G_MIRROR_DISK_FLAG_SYNCHRONIZING) == 0 && (g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) == 0 || (disk->d_flags & G_MIRROR_DISK_FLAG_DIRTY) == 0)) { /* Disk does not need synchronization. */ state = G_MIRROR_DISK_STATE_ACTIVE; } else { if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_NOAUTOSYNC) == 0 || (disk->d_flags & G_MIRROR_DISK_FLAG_FORCE_SYNC) != 0) { /* * We can start synchronization from * the stored offset. */ state = G_MIRROR_DISK_STATE_SYNCHRONIZING; } else { state = G_MIRROR_DISK_STATE_STALE; } } } else if (disk->d_sync.ds_syncid < sc->sc_syncid) { /* * Reset all synchronization data for this disk, * because if it even was synchronized, it was * synchronized to disks with different syncid. */ disk->d_flags |= G_MIRROR_DISK_FLAG_SYNCHRONIZING; disk->d_sync.ds_offset = 0; disk->d_sync.ds_offset_done = 0; disk->d_sync.ds_syncid = sc->sc_syncid; if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_NOAUTOSYNC) == 0 || (disk->d_flags & G_MIRROR_DISK_FLAG_FORCE_SYNC) != 0) { state = G_MIRROR_DISK_STATE_SYNCHRONIZING; } else { state = G_MIRROR_DISK_STATE_STALE; } } else /* if (sc->sc_syncid < disk->d_sync.ds_syncid) */ { /* * Not good, NOT GOOD! * It means that mirror was started on stale disks * and more fresh disk just arrive. * If there were writes, mirror is broken, sorry. * I think the best choice here is don't touch * this disk and inform the user loudly. */ G_MIRROR_DEBUG(0, "Device %s was started before the freshest " "disk (%s) arrives!! It will not be connected to the " "running device.", sc->sc_name, g_mirror_get_diskname(disk)); g_mirror_destroy_disk(disk); state = G_MIRROR_DISK_STATE_NONE; /* Return immediately, because disk was destroyed. */ return (state); } G_MIRROR_DEBUG(3, "State for %s disk: %s.", g_mirror_get_diskname(disk), g_mirror_disk_state2str(state)); return (state); } /* * Update device state. */ static void g_mirror_update_device(struct g_mirror_softc *sc, bool force) { struct g_mirror_disk *disk; u_int state; sx_assert(&sc->sc_lock, SX_XLOCKED); switch (sc->sc_state) { case G_MIRROR_DEVICE_STATE_STARTING: { struct g_mirror_disk *pdisk, *tdisk; u_int dirty, ndisks, genid, syncid; bool broken; KASSERT(sc->sc_provider == NULL, ("Non-NULL provider in STARTING state (%s).", sc->sc_name)); /* * Are we ready? We are, if all disks are connected or * if we have any disks and 'force' is true. */ ndisks = g_mirror_ndisks(sc, -1); if (sc->sc_ndisks == ndisks || (force && ndisks > 0)) { ; } else if (ndisks == 0) { /* * Disks went down in starting phase, so destroy * device. 
*/ callout_drain(&sc->sc_callout); sc->sc_flags |= G_MIRROR_DEVICE_FLAG_DESTROY; G_MIRROR_DEBUG(1, "root_mount_rel[%u] %p", __LINE__, sc->sc_rootmount); root_mount_rel(sc->sc_rootmount); sc->sc_rootmount = NULL; return; } else { return; } /* * Activate all disks with the biggest syncid. */ if (force) { /* * If 'force' is true, we have been called due to * timeout, so don't bother canceling timeout. */ ndisks = 0; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if ((disk->d_flags & G_MIRROR_DISK_FLAG_SYNCHRONIZING) == 0) { ndisks++; } } if (ndisks == 0) { /* No valid disks found, destroy device. */ sc->sc_flags |= G_MIRROR_DEVICE_FLAG_DESTROY; G_MIRROR_DEBUG(1, "root_mount_rel[%u] %p", __LINE__, sc->sc_rootmount); root_mount_rel(sc->sc_rootmount); sc->sc_rootmount = NULL; return; } } else { /* Cancel timeout. */ callout_drain(&sc->sc_callout); } /* * Find the biggest genid. */ genid = 0; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_genid > genid) genid = disk->d_genid; } sc->sc_genid = genid; /* * Remove all disks without the biggest genid. */ broken = false; LIST_FOREACH_SAFE(disk, &sc->sc_disks, d_next, tdisk) { if (disk->d_genid < genid) { G_MIRROR_DEBUG(0, "Component %s (device %s) broken, skipping.", g_mirror_get_diskname(disk), sc->sc_name); g_mirror_destroy_disk(disk); /* * Bump the syncid in case we discover a healthy * replacement disk after starting the mirror. */ broken = true; } } /* * Find the biggest syncid. */ syncid = 0; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_sync.ds_syncid > syncid) syncid = disk->d_sync.ds_syncid; } /* * Here we need to look for dirty disks and if all disks * with the biggest syncid are dirty, we have to choose * one with the biggest priority and rebuild the rest. */ /* * Find the number of dirty disks with the biggest syncid. * Find the number of disks with the biggest syncid. * While here, find a disk with the biggest priority. */ dirty = ndisks = 0; pdisk = NULL; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_sync.ds_syncid != syncid) continue; if ((disk->d_flags & G_MIRROR_DISK_FLAG_SYNCHRONIZING) != 0) { continue; } ndisks++; if ((disk->d_flags & G_MIRROR_DISK_FLAG_DIRTY) != 0) { dirty++; if (pdisk == NULL || pdisk->d_priority < disk->d_priority) { pdisk = disk; } } } if (dirty == 0) { /* No dirty disks at all, great. */ } else if (dirty == ndisks) { /* * Force synchronization for all dirty disks except one * with the biggest priority. */ KASSERT(pdisk != NULL, ("pdisk == NULL")); G_MIRROR_DEBUG(1, "Using disk %s (device %s) as a " "master disk for synchronization.", g_mirror_get_diskname(pdisk), sc->sc_name); LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_sync.ds_syncid != syncid) continue; if ((disk->d_flags & G_MIRROR_DISK_FLAG_SYNCHRONIZING) != 0) { continue; } KASSERT((disk->d_flags & G_MIRROR_DISK_FLAG_DIRTY) != 0, ("Disk %s isn't marked as dirty.", g_mirror_get_diskname(disk))); /* Skip the disk with the biggest priority. */ if (disk == pdisk) continue; disk->d_sync.ds_syncid = 0; } } else if (dirty < ndisks) { /* * Force synchronization for all dirty disks. * We have some non-dirty disks. */ LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_sync.ds_syncid != syncid) continue; if ((disk->d_flags & G_MIRROR_DISK_FLAG_SYNCHRONIZING) != 0) { continue; } if ((disk->d_flags & G_MIRROR_DISK_FLAG_DIRTY) == 0) { continue; } disk->d_sync.ds_syncid = 0; } } /* Reset hint. */ sc->sc_hint = NULL; sc->sc_syncid = syncid; if (force || broken) { /* Remember to bump syncid on first write. 
*/ sc->sc_bump_id |= G_MIRROR_BUMP_SYNCID; } state = G_MIRROR_DEVICE_STATE_RUNNING; G_MIRROR_DEBUG(1, "Device %s state changed from %s to %s.", sc->sc_name, g_mirror_device_state2str(sc->sc_state), g_mirror_device_state2str(state)); sc->sc_state = state; LIST_FOREACH(disk, &sc->sc_disks, d_next) { state = g_mirror_determine_state(disk); g_mirror_event_send(disk, state, G_MIRROR_EVENT_DONTWAIT); if (state == G_MIRROR_DISK_STATE_STALE) sc->sc_bump_id |= G_MIRROR_BUMP_SYNCID; } break; } case G_MIRROR_DEVICE_STATE_RUNNING: if (g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) == 0 && g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_NEW) == 0) { /* * No usable disks, so destroy the device. */ sc->sc_flags |= G_MIRROR_DEVICE_FLAG_DESTROY; break; } else if (g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) > 0 && g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_NEW) == 0) { /* * We have active disks, launch provider if it doesn't * exist. */ if (sc->sc_provider == NULL) g_mirror_launch_provider(sc); if (sc->sc_rootmount != NULL) { G_MIRROR_DEBUG(1, "root_mount_rel[%u] %p", __LINE__, sc->sc_rootmount); root_mount_rel(sc->sc_rootmount); sc->sc_rootmount = NULL; } } /* * Genid should be bumped immediately, so do it here. */ if ((sc->sc_bump_id & G_MIRROR_BUMP_GENID) != 0) { sc->sc_bump_id &= ~G_MIRROR_BUMP_GENID; g_mirror_bump_genid(sc); } if ((sc->sc_bump_id & G_MIRROR_BUMP_SYNCID_NOW) != 0) { sc->sc_bump_id &= ~G_MIRROR_BUMP_SYNCID_NOW; g_mirror_bump_syncid(sc); } break; default: KASSERT(1 == 0, ("Wrong device state (%s, %s).", sc->sc_name, g_mirror_device_state2str(sc->sc_state))); break; } } /* * Update disk state and device state if needed. */ #define DISK_STATE_CHANGED() G_MIRROR_DEBUG(1, \ "Disk %s state changed from %s to %s (device %s).", \ g_mirror_get_diskname(disk), \ g_mirror_disk_state2str(disk->d_state), \ g_mirror_disk_state2str(state), sc->sc_name) static int g_mirror_update_disk(struct g_mirror_disk *disk, u_int state) { struct g_mirror_softc *sc; sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_XLOCKED); again: G_MIRROR_DEBUG(3, "Changing disk %s state from %s to %s.", g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state), g_mirror_disk_state2str(state)); switch (state) { case G_MIRROR_DISK_STATE_NEW: /* * Possible scenarios: * 1. New disk arrive. */ /* Previous state should be NONE. */ KASSERT(disk->d_state == G_MIRROR_DISK_STATE_NONE, ("Wrong disk state (%s, %s).", g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); DISK_STATE_CHANGED(); disk->d_state = state; if (LIST_EMPTY(&sc->sc_disks)) LIST_INSERT_HEAD(&sc->sc_disks, disk, d_next); else { struct g_mirror_disk *dp; LIST_FOREACH(dp, &sc->sc_disks, d_next) { if (disk->d_priority >= dp->d_priority) { LIST_INSERT_BEFORE(dp, disk, d_next); dp = NULL; break; } if (LIST_NEXT(dp, d_next) == NULL) break; } if (dp != NULL) LIST_INSERT_AFTER(dp, disk, d_next); } G_MIRROR_DEBUG(1, "Device %s: provider %s detected.", sc->sc_name, g_mirror_get_diskname(disk)); if (sc->sc_state == G_MIRROR_DEVICE_STATE_STARTING) break; KASSERT(sc->sc_state == G_MIRROR_DEVICE_STATE_RUNNING, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_mirror_device_state2str(sc->sc_state), g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); state = g_mirror_determine_state(disk); if (state != G_MIRROR_DISK_STATE_NONE) goto again; break; case G_MIRROR_DISK_STATE_ACTIVE: /* * Possible scenarios: * 1. New disk does not need synchronization. * 2. Synchronization process finished successfully. 
*/ KASSERT(sc->sc_state == G_MIRROR_DEVICE_STATE_RUNNING, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_mirror_device_state2str(sc->sc_state), g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); /* Previous state should be NEW or SYNCHRONIZING. */ KASSERT(disk->d_state == G_MIRROR_DISK_STATE_NEW || disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING, ("Wrong disk state (%s, %s).", g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); DISK_STATE_CHANGED(); if (disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING) { disk->d_flags &= ~G_MIRROR_DISK_FLAG_SYNCHRONIZING; disk->d_flags &= ~G_MIRROR_DISK_FLAG_FORCE_SYNC; g_mirror_sync_stop(disk, 0); } disk->d_state = state; disk->d_sync.ds_offset = 0; disk->d_sync.ds_offset_done = 0; g_mirror_update_idle(sc, disk); g_mirror_update_metadata(disk); G_MIRROR_DEBUG(1, "Device %s: provider %s activated.", sc->sc_name, g_mirror_get_diskname(disk)); break; case G_MIRROR_DISK_STATE_STALE: /* * Possible scenarios: * 1. Stale disk was connected. */ /* Previous state should be NEW. */ KASSERT(disk->d_state == G_MIRROR_DISK_STATE_NEW, ("Wrong disk state (%s, %s).", g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); KASSERT(sc->sc_state == G_MIRROR_DEVICE_STATE_RUNNING, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_mirror_device_state2str(sc->sc_state), g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); /* * STALE state is only possible if device is marked * NOAUTOSYNC. */ KASSERT((sc->sc_flags & G_MIRROR_DEVICE_FLAG_NOAUTOSYNC) != 0, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_mirror_device_state2str(sc->sc_state), g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); DISK_STATE_CHANGED(); disk->d_flags &= ~G_MIRROR_DISK_FLAG_DIRTY; disk->d_state = state; g_mirror_update_metadata(disk); G_MIRROR_DEBUG(0, "Device %s: provider %s is stale.", sc->sc_name, g_mirror_get_diskname(disk)); break; case G_MIRROR_DISK_STATE_SYNCHRONIZING: /* * Possible scenarios: * 1. Disk which needs synchronization was connected. */ /* Previous state should be NEW. */ KASSERT(disk->d_state == G_MIRROR_DISK_STATE_NEW, ("Wrong disk state (%s, %s).", g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); KASSERT(sc->sc_state == G_MIRROR_DEVICE_STATE_RUNNING, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_mirror_device_state2str(sc->sc_state), g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); DISK_STATE_CHANGED(); if (disk->d_state == G_MIRROR_DISK_STATE_NEW) disk->d_flags &= ~G_MIRROR_DISK_FLAG_DIRTY; disk->d_state = state; if (sc->sc_provider != NULL) { g_mirror_sync_start(disk); g_mirror_update_metadata(disk); } break; case G_MIRROR_DISK_STATE_DISCONNECTED: /* * Possible scenarios: * 1. Device wasn't running yet, but disk disappear. * 2. Disk was active and disapppear. * 3. Disk disappear during synchronization process. */ if (sc->sc_state == G_MIRROR_DEVICE_STATE_RUNNING) { /* * Previous state should be ACTIVE, STALE or * SYNCHRONIZING. */ KASSERT(disk->d_state == G_MIRROR_DISK_STATE_ACTIVE || disk->d_state == G_MIRROR_DISK_STATE_STALE || disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING, ("Wrong disk state (%s, %s).", g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); } else if (sc->sc_state == G_MIRROR_DEVICE_STATE_STARTING) { /* Previous state should be NEW. 
*/ KASSERT(disk->d_state == G_MIRROR_DISK_STATE_NEW, ("Wrong disk state (%s, %s).", g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); /* * Reset bumping syncid if disk disappeared in STARTING * state. */ if ((sc->sc_bump_id & G_MIRROR_BUMP_SYNCID) != 0) sc->sc_bump_id &= ~G_MIRROR_BUMP_SYNCID; #ifdef INVARIANTS } else { KASSERT(1 == 0, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_mirror_device_state2str(sc->sc_state), g_mirror_get_diskname(disk), g_mirror_disk_state2str(disk->d_state))); #endif } DISK_STATE_CHANGED(); G_MIRROR_DEBUG(0, "Device %s: provider %s disconnected.", sc->sc_name, g_mirror_get_diskname(disk)); g_mirror_destroy_disk(disk); break; case G_MIRROR_DISK_STATE_DESTROY: { int error; error = g_mirror_clear_metadata(disk); if (error != 0) { G_MIRROR_DEBUG(0, "Device %s: failed to clear metadata on %s: %d.", sc->sc_name, g_mirror_get_diskname(disk), error); break; } DISK_STATE_CHANGED(); G_MIRROR_DEBUG(0, "Device %s: provider %s destroyed.", sc->sc_name, g_mirror_get_diskname(disk)); g_mirror_destroy_disk(disk); sc->sc_ndisks--; LIST_FOREACH(disk, &sc->sc_disks, d_next) { g_mirror_update_metadata(disk); } break; } default: KASSERT(1 == 0, ("Unknown state (%u).", state)); break; } return (0); } #undef DISK_STATE_CHANGED int g_mirror_read_metadata(struct g_consumer *cp, struct g_mirror_metadata *md) { struct g_provider *pp; u_char *buf; int error; g_topology_assert(); error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); /* Metadata are stored on last sector. */ buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) { G_MIRROR_DEBUG(1, "Cannot read metadata from %s (error=%d).", cp->provider->name, error); return (error); } /* Decode metadata. 
*/ error = mirror_metadata_decode(buf, md); g_free(buf); if (strcmp(md->md_magic, G_MIRROR_MAGIC) != 0) return (EINVAL); if (md->md_version > G_MIRROR_VERSION) { G_MIRROR_DEBUG(0, "Kernel module is too old to handle metadata from %s.", cp->provider->name); return (EINVAL); } if (error != 0) { G_MIRROR_DEBUG(1, "MD5 metadata hash mismatch for provider %s.", cp->provider->name); return (error); } return (0); } static int g_mirror_check_metadata(struct g_mirror_softc *sc, struct g_provider *pp, struct g_mirror_metadata *md) { if (g_mirror_id2disk(sc, md->md_did) != NULL) { G_MIRROR_DEBUG(1, "Disk %s (id=%u) already exists, skipping.", pp->name, md->md_did); return (EEXIST); } if (md->md_all != sc->sc_ndisks) { G_MIRROR_DEBUG(1, "Invalid '%s' field on disk %s (device %s), skipping.", "md_all", pp->name, sc->sc_name); return (EINVAL); } if (md->md_slice != sc->sc_slice) { G_MIRROR_DEBUG(1, "Invalid '%s' field on disk %s (device %s), skipping.", "md_slice", pp->name, sc->sc_name); return (EINVAL); } if (md->md_balance != sc->sc_balance) { G_MIRROR_DEBUG(1, "Invalid '%s' field on disk %s (device %s), skipping.", "md_balance", pp->name, sc->sc_name); return (EINVAL); } #if 0 if (md->md_mediasize != sc->sc_mediasize) { G_MIRROR_DEBUG(1, "Invalid '%s' field on disk %s (device %s), skipping.", "md_mediasize", pp->name, sc->sc_name); return (EINVAL); } #endif if (sc->sc_mediasize > pp->mediasize) { G_MIRROR_DEBUG(1, "Invalid size of disk %s (device %s), skipping.", pp->name, sc->sc_name); return (EINVAL); } if (md->md_sectorsize != sc->sc_sectorsize) { G_MIRROR_DEBUG(1, "Invalid '%s' field on disk %s (device %s), skipping.", "md_sectorsize", pp->name, sc->sc_name); return (EINVAL); } if ((sc->sc_sectorsize % pp->sectorsize) != 0) { G_MIRROR_DEBUG(1, "Invalid sector size of disk %s (device %s), skipping.", pp->name, sc->sc_name); return (EINVAL); } if ((md->md_mflags & ~G_MIRROR_DEVICE_FLAG_MASK) != 0) { G_MIRROR_DEBUG(1, "Invalid device flags on disk %s (device %s), skipping.", pp->name, sc->sc_name); return (EINVAL); } if ((md->md_dflags & ~G_MIRROR_DISK_FLAG_MASK) != 0) { G_MIRROR_DEBUG(1, "Invalid disk flags on disk %s (device %s), skipping.", pp->name, sc->sc_name); return (EINVAL); } return (0); } int g_mirror_add_disk(struct g_mirror_softc *sc, struct g_provider *pp, struct g_mirror_metadata *md) { struct g_mirror_disk *disk; int error; g_topology_assert_not(); G_MIRROR_DEBUG(2, "Adding disk %s.", pp->name); error = g_mirror_check_metadata(sc, pp, md); if (error != 0) return (error); if (sc->sc_state == G_MIRROR_DEVICE_STATE_RUNNING && md->md_genid < sc->sc_genid) { G_MIRROR_DEBUG(0, "Component %s (device %s) broken, skipping.", pp->name, sc->sc_name); return (EINVAL); } disk = g_mirror_init_disk(sc, pp, md, &error); if (disk == NULL) return (error); error = g_mirror_event_send(disk, G_MIRROR_DISK_STATE_NEW, G_MIRROR_EVENT_WAIT); if (error != 0) return (error); if (md->md_version < G_MIRROR_VERSION) { G_MIRROR_DEBUG(0, "Upgrading metadata on %s (v%d->v%d).", pp->name, md->md_version, G_MIRROR_VERSION); g_mirror_update_metadata(disk); } return (0); } static void g_mirror_destroy_delayed(void *arg, int flag) { struct g_mirror_softc *sc; int error; if (flag == EV_CANCEL) { G_MIRROR_DEBUG(1, "Destroying canceled."); return; } sc = arg; g_topology_unlock(); sx_xlock(&sc->sc_lock); KASSERT((sc->sc_flags & G_MIRROR_DEVICE_FLAG_DESTROY) == 0, ("DESTROY flag set on %s.", sc->sc_name)); KASSERT((sc->sc_flags & G_MIRROR_DEVICE_FLAG_CLOSEWAIT) != 0, ("CLOSEWAIT flag not set on %s.", sc->sc_name)); 
G_MIRROR_DEBUG(1, "Destroying %s (delayed).", sc->sc_name); error = g_mirror_destroy(sc, G_MIRROR_DESTROY_SOFT); if (error != 0) { G_MIRROR_DEBUG(0, "Cannot destroy %s (error=%d).", sc->sc_name, error); sx_xunlock(&sc->sc_lock); } g_topology_lock(); } static int g_mirror_access(struct g_provider *pp, int acr, int acw, int ace) { struct g_mirror_softc *sc; int error = 0; g_topology_assert(); G_MIRROR_DEBUG(2, "Access request for %s: r%dw%de%d.", pp->name, acr, acw, ace); sc = pp->private; KASSERT(sc != NULL, ("NULL softc (provider=%s).", pp->name)); g_topology_unlock(); sx_xlock(&sc->sc_lock); if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_DESTROY) != 0 || (sc->sc_flags & G_MIRROR_DEVICE_FLAG_CLOSEWAIT) != 0 || LIST_EMPTY(&sc->sc_disks)) { if (acr > 0 || acw > 0 || ace > 0) error = ENXIO; goto end; } sc->sc_provider_open += acr + acw + ace; if (pp->acw + acw == 0) g_mirror_idle(sc, 0); if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_CLOSEWAIT) != 0 && sc->sc_provider_open == 0) g_post_event(g_mirror_destroy_delayed, sc, M_WAITOK, sc, NULL); end: sx_xunlock(&sc->sc_lock); g_topology_lock(); return (error); } struct g_geom * g_mirror_create(struct g_class *mp, const struct g_mirror_metadata *md, u_int type) { struct g_mirror_softc *sc; struct g_geom *gp; int error, timeout; g_topology_assert(); G_MIRROR_DEBUG(1, "Creating device %s (id=%u).", md->md_name, md->md_mid); /* One disk is minimum. */ if (md->md_all < 1) return (NULL); /* * Action geom. */ gp = g_new_geomf(mp, "%s", md->md_name); sc = malloc(sizeof(*sc), M_MIRROR, M_WAITOK | M_ZERO); gp->start = g_mirror_start; gp->orphan = g_mirror_orphan; gp->access = g_mirror_access; gp->dumpconf = g_mirror_dumpconf; sc->sc_type = type; sc->sc_id = md->md_mid; sc->sc_slice = md->md_slice; sc->sc_balance = md->md_balance; sc->sc_mediasize = md->md_mediasize; sc->sc_sectorsize = md->md_sectorsize; sc->sc_ndisks = md->md_all; sc->sc_flags = md->md_mflags; sc->sc_bump_id = 0; sc->sc_idle = 1; sc->sc_last_write = time_uptime; sc->sc_writes = 0; sc->sc_refcnt = 1; sx_init(&sc->sc_lock, "gmirror:lock"); TAILQ_INIT(&sc->sc_queue); mtx_init(&sc->sc_queue_mtx, "gmirror:queue", NULL, MTX_DEF); TAILQ_INIT(&sc->sc_regular_delayed); TAILQ_INIT(&sc->sc_inflight); TAILQ_INIT(&sc->sc_sync_delayed); LIST_INIT(&sc->sc_disks); TAILQ_INIT(&sc->sc_events); mtx_init(&sc->sc_events_mtx, "gmirror:events", NULL, MTX_DEF); callout_init(&sc->sc_callout, 1); mtx_init(&sc->sc_done_mtx, "gmirror:done", NULL, MTX_DEF); sc->sc_state = G_MIRROR_DEVICE_STATE_STARTING; gp->softc = sc; sc->sc_geom = gp; sc->sc_provider = NULL; sc->sc_provider_open = 0; /* * Synchronization geom. */ gp = g_new_geomf(mp, "%s.sync", md->md_name); gp->softc = sc; gp->orphan = g_mirror_orphan; sc->sc_sync.ds_geom = gp; sc->sc_sync.ds_ndisks = 0; error = kproc_create(g_mirror_worker, sc, &sc->sc_worker, 0, 0, "g_mirror %s", md->md_name); if (error != 0) { G_MIRROR_DEBUG(1, "Cannot create kernel thread for %s.", sc->sc_name); g_destroy_geom(sc->sc_sync.ds_geom); g_destroy_geom(sc->sc_geom); g_mirror_free_device(sc); return (NULL); } G_MIRROR_DEBUG(1, "Device %s created (%u components, id=%u).", sc->sc_name, sc->sc_ndisks, sc->sc_id); sc->sc_rootmount = root_mount_hold("GMIRROR"); G_MIRROR_DEBUG(1, "root_mount_hold %p", sc->sc_rootmount); /* * Run timeout. 
*/ timeout = g_mirror_timeout * hz; callout_reset(&sc->sc_callout, timeout, g_mirror_go, sc); return (sc->sc_geom); } int g_mirror_destroy(struct g_mirror_softc *sc, int how) { struct g_mirror_disk *disk; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); if (sc->sc_provider_open != 0) { switch (how) { case G_MIRROR_DESTROY_SOFT: G_MIRROR_DEBUG(1, "Device %s is still open (%d).", sc->sc_name, sc->sc_provider_open); return (EBUSY); case G_MIRROR_DESTROY_DELAYED: G_MIRROR_DEBUG(1, "Device %s will be destroyed on last close.", sc->sc_name); LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING) { g_mirror_sync_stop(disk, 1); } } sc->sc_flags |= G_MIRROR_DEVICE_FLAG_CLOSEWAIT; return (EBUSY); case G_MIRROR_DESTROY_HARD: G_MIRROR_DEBUG(1, "Device %s is still open, so it " "can't be definitely removed.", sc->sc_name); } } if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_DESTROY) != 0) { sx_xunlock(&sc->sc_lock); return (0); } sc->sc_flags |= G_MIRROR_DEVICE_FLAG_DESTROY; sc->sc_flags |= G_MIRROR_DEVICE_FLAG_DRAIN; G_MIRROR_DEBUG(4, "%s: Waking up %p.", __func__, sc); sx_xunlock(&sc->sc_lock); mtx_lock(&sc->sc_queue_mtx); wakeup(sc); mtx_unlock(&sc->sc_queue_mtx); G_MIRROR_DEBUG(4, "%s: Sleeping %p.", __func__, &sc->sc_worker); while (sc->sc_worker != NULL) tsleep(&sc->sc_worker, PRIBIO, "m:destroy", hz / 5); G_MIRROR_DEBUG(4, "%s: Woken up %p.", __func__, &sc->sc_worker); sx_xlock(&sc->sc_lock); g_mirror_destroy_device(sc); return (0); } static void g_mirror_taste_orphan(struct g_consumer *cp) { KASSERT(1 == 0, ("%s called while tasting %s.", __func__, cp->provider->name)); } static struct g_geom * g_mirror_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_mirror_metadata md; struct g_mirror_softc *sc; struct g_consumer *cp; struct g_geom *gp; int error; g_topology_assert(); g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); G_MIRROR_DEBUG(2, "Tasting %s.", pp->name); gp = g_new_geomf(mp, "mirror:taste"); /* * This orphan function should be never called. */ gp->orphan = g_mirror_taste_orphan; cp = g_new_consumer(gp); g_attach(cp, pp); error = g_mirror_read_metadata(cp, &md); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); if (error != 0) return (NULL); gp = NULL; if (md.md_provider[0] != '\0' && !g_compare_names(md.md_provider, pp->name)) return (NULL); if (md.md_provsize != 0 && md.md_provsize != pp->mediasize) return (NULL); if ((md.md_dflags & G_MIRROR_DISK_FLAG_INACTIVE) != 0) { G_MIRROR_DEBUG(0, "Device %s: provider %s marked as inactive, skipping.", md.md_name, pp->name); return (NULL); } if (g_mirror_debug >= 2) mirror_metadata_dump(&md); /* * Let's check if device already exists. 
*/ sc = NULL; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; if (sc->sc_type != G_MIRROR_TYPE_AUTOMATIC) continue; if (sc->sc_sync.ds_geom == gp) continue; if (strcmp(md.md_name, sc->sc_name) != 0) continue; if (md.md_mid != sc->sc_id) { G_MIRROR_DEBUG(0, "Device %s already configured.", sc->sc_name); return (NULL); } break; } if (gp == NULL) { gp = g_mirror_create(mp, &md, G_MIRROR_TYPE_AUTOMATIC); if (gp == NULL) { G_MIRROR_DEBUG(0, "Cannot create device %s.", md.md_name); return (NULL); } sc = gp->softc; } G_MIRROR_DEBUG(1, "Adding disk %s to %s.", pp->name, gp->name); g_topology_unlock(); sx_xlock(&sc->sc_lock); sc->sc_flags |= G_MIRROR_DEVICE_FLAG_TASTING; error = g_mirror_add_disk(sc, pp, &md); if (error != 0) { G_MIRROR_DEBUG(0, "Cannot add disk %s to %s (error=%d).", pp->name, gp->name, error); if (LIST_EMPTY(&sc->sc_disks)) { g_cancel_event(sc); g_mirror_destroy(sc, G_MIRROR_DESTROY_HARD); g_topology_lock(); return (NULL); } gp = NULL; } sc->sc_flags &= ~G_MIRROR_DEVICE_FLAG_TASTING; if ((sc->sc_flags & G_MIRROR_DEVICE_FLAG_DESTROY) != 0) { g_mirror_destroy(sc, G_MIRROR_DESTROY_HARD); g_topology_lock(); return (NULL); } sx_xunlock(&sc->sc_lock); g_topology_lock(); return (gp); } static void g_mirror_resize(struct g_consumer *cp) { struct g_mirror_disk *disk; g_topology_assert(); g_trace(G_T_TOPOLOGY, "%s(%s)", __func__, cp->provider->name); disk = cp->private; if (disk == NULL) return; g_topology_unlock(); g_mirror_update_metadata(disk); g_topology_lock(); } static int g_mirror_destroy_geom(struct gctl_req *req __unused, struct g_class *mp __unused, struct g_geom *gp) { struct g_mirror_softc *sc; int error; g_topology_unlock(); sc = gp->softc; sx_xlock(&sc->sc_lock); g_cancel_event(sc); error = g_mirror_destroy(gp->softc, G_MIRROR_DESTROY_SOFT); if (error != 0) sx_xunlock(&sc->sc_lock); g_topology_lock(); return (error); } static void g_mirror_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_mirror_softc *sc; g_topology_assert(); sc = gp->softc; if (sc == NULL) return; /* Skip synchronization geom. */ if (gp == sc->sc_sync.ds_geom) return; if (pp != NULL) { /* Nothing here. 
*/ } else if (cp != NULL) { struct g_mirror_disk *disk; disk = cp->private; if (disk == NULL) return; g_topology_unlock(); sx_xlock(&sc->sc_lock); sbuf_printf(sb, "%s%u\n", indent, (u_int)disk->d_id); if (disk->d_state == G_MIRROR_DISK_STATE_SYNCHRONIZING) { sbuf_printf(sb, "%s", indent); if (disk->d_sync.ds_offset == 0) sbuf_printf(sb, "0%%"); else { sbuf_printf(sb, "%u%%", (u_int)((disk->d_sync.ds_offset * 100) / sc->sc_provider->mediasize)); } sbuf_printf(sb, "\n"); if (disk->d_sync.ds_offset > 0) { sbuf_printf(sb, "%s%jd" "\n", indent, (intmax_t)disk->d_sync.ds_offset); } } sbuf_printf(sb, "%s%u\n", indent, disk->d_sync.ds_syncid); sbuf_printf(sb, "%s%u\n", indent, disk->d_genid); sbuf_printf(sb, "%s", indent); if (disk->d_flags == 0) sbuf_printf(sb, "NONE"); else { int first = 1; #define ADD_FLAG(flag, name) do { \ if ((disk->d_flags & (flag)) != 0) { \ if (!first) \ sbuf_printf(sb, ", "); \ else \ first = 0; \ sbuf_printf(sb, name); \ } \ } while (0) ADD_FLAG(G_MIRROR_DISK_FLAG_DIRTY, "DIRTY"); ADD_FLAG(G_MIRROR_DISK_FLAG_HARDCODED, "HARDCODED"); ADD_FLAG(G_MIRROR_DISK_FLAG_INACTIVE, "INACTIVE"); ADD_FLAG(G_MIRROR_DISK_FLAG_SYNCHRONIZING, "SYNCHRONIZING"); ADD_FLAG(G_MIRROR_DISK_FLAG_FORCE_SYNC, "FORCE_SYNC"); ADD_FLAG(G_MIRROR_DISK_FLAG_BROKEN, "BROKEN"); #undef ADD_FLAG } sbuf_printf(sb, "\n"); sbuf_printf(sb, "%s%u\n", indent, disk->d_priority); sbuf_printf(sb, "%s%s\n", indent, g_mirror_disk_state2str(disk->d_state)); sx_xunlock(&sc->sc_lock); g_topology_lock(); } else { g_topology_unlock(); sx_xlock(&sc->sc_lock); sbuf_printf(sb, "%s", indent); switch (sc->sc_type) { case G_MIRROR_TYPE_AUTOMATIC: sbuf_printf(sb, "AUTOMATIC"); break; case G_MIRROR_TYPE_MANUAL: sbuf_printf(sb, "MANUAL"); break; default: sbuf_printf(sb, "UNKNOWN"); break; } sbuf_printf(sb, "\n"); sbuf_printf(sb, "%s%u\n", indent, (u_int)sc->sc_id); sbuf_printf(sb, "%s%u\n", indent, sc->sc_syncid); sbuf_printf(sb, "%s%u\n", indent, sc->sc_genid); sbuf_printf(sb, "%s", indent); if (sc->sc_flags == 0) sbuf_printf(sb, "NONE"); else { int first = 1; #define ADD_FLAG(flag, name) do { \ if ((sc->sc_flags & (flag)) != 0) { \ if (!first) \ sbuf_printf(sb, ", "); \ else \ first = 0; \ sbuf_printf(sb, name); \ } \ } while (0) ADD_FLAG(G_MIRROR_DEVICE_FLAG_NOFAILSYNC, "NOFAILSYNC"); ADD_FLAG(G_MIRROR_DEVICE_FLAG_NOAUTOSYNC, "NOAUTOSYNC"); #undef ADD_FLAG } sbuf_printf(sb, "\n"); sbuf_printf(sb, "%s%u\n", indent, (u_int)sc->sc_slice); sbuf_printf(sb, "%s%s\n", indent, balance_name(sc->sc_balance)); sbuf_printf(sb, "%s%u\n", indent, sc->sc_ndisks); sbuf_printf(sb, "%s", indent); if (sc->sc_state == G_MIRROR_DEVICE_STATE_STARTING) sbuf_printf(sb, "%s", "STARTING"); else if (sc->sc_ndisks == g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE)) sbuf_printf(sb, "%s", "COMPLETE"); else sbuf_printf(sb, "%s", "DEGRADED"); sbuf_printf(sb, "\n"); sx_xunlock(&sc->sc_lock); g_topology_lock(); } } static void g_mirror_shutdown_post_sync(void *arg, int howto) { struct g_class *mp; struct g_geom *gp, *gp2; struct g_mirror_softc *sc; int error; if (panicstr != NULL) return; mp = arg; g_topology_lock(); g_mirror_shutdown = 1; LIST_FOREACH_SAFE(gp, &mp->geom, geom, gp2) { if ((sc = gp->softc) == NULL) continue; /* Skip synchronization geom. 
*/ if (gp == sc->sc_sync.ds_geom) continue; g_topology_unlock(); sx_xlock(&sc->sc_lock); g_mirror_idle(sc, -1); g_cancel_event(sc); error = g_mirror_destroy(sc, G_MIRROR_DESTROY_DELAYED); if (error != 0) sx_xunlock(&sc->sc_lock); g_topology_lock(); } g_topology_unlock(); } static void g_mirror_init(struct g_class *mp) { g_mirror_post_sync = EVENTHANDLER_REGISTER(shutdown_post_sync, g_mirror_shutdown_post_sync, mp, SHUTDOWN_PRI_FIRST); if (g_mirror_post_sync == NULL) G_MIRROR_DEBUG(0, "Warning! Cannot register shutdown event."); } static void g_mirror_fini(struct g_class *mp) { if (g_mirror_post_sync != NULL) EVENTHANDLER_DEREGISTER(shutdown_post_sync, g_mirror_post_sync); } DECLARE_GEOM_CLASS(g_mirror_class, g_mirror); +MODULE_VERSION(geom_mirror, 0); Index: user/markj/netdump/sys/geom/mountver/g_mountver.c =================================================================== --- user/markj/netdump/sys/geom/mountver/g_mountver.c (revision 332407) +++ user/markj/netdump/sys/geom/mountver/g_mountver.c (revision 332408) @@ -1,662 +1,663 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2010 Edward Tomasz Napierala * Copyright (c) 2004-2006 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, mountver, CTLFLAG_RW, 0, "GEOM_MOUNTVER stuff"); static u_int g_mountver_debug = 0; static u_int g_mountver_check_ident = 1; SYSCTL_UINT(_kern_geom_mountver, OID_AUTO, debug, CTLFLAG_RW, &g_mountver_debug, 0, "Debug level"); SYSCTL_UINT(_kern_geom_mountver, OID_AUTO, check_ident, CTLFLAG_RW, &g_mountver_check_ident, 0, "Check disk ident when reattaching"); static eventhandler_tag g_mountver_pre_sync = NULL; static void g_mountver_queue(struct bio *bp); static void g_mountver_orphan(struct g_consumer *cp); static void g_mountver_resize(struct g_consumer *cp); static int g_mountver_destroy(struct g_geom *gp, boolean_t force); static g_taste_t g_mountver_taste; static int g_mountver_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp); static void g_mountver_config(struct gctl_req *req, struct g_class *mp, const char *verb); static void g_mountver_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp); static void g_mountver_init(struct g_class *mp); static void g_mountver_fini(struct g_class *mp); struct g_class g_mountver_class = { .name = G_MOUNTVER_CLASS_NAME, .version = G_VERSION, .ctlreq = g_mountver_config, .taste = g_mountver_taste, .destroy_geom = g_mountver_destroy_geom, .init = g_mountver_init, .fini = g_mountver_fini }; static void g_mountver_done(struct bio *bp) { struct g_geom *gp; struct bio *pbp; if (bp->bio_error != ENXIO) { g_std_done(bp); return; } /* * When the device goes away, it's possible that few requests * will be completed with ENXIO before g_mountver_orphan() * gets called. To work around that, we have to queue requests * that failed with ENXIO, in order to send them later. 
*/ gp = bp->bio_from->geom; pbp = bp->bio_parent; KASSERT(pbp->bio_to == LIST_FIRST(&gp->provider), ("parent request was for someone else")); g_destroy_bio(bp); pbp->bio_inbed++; g_mountver_queue(pbp); } static void g_mountver_send(struct bio *bp) { struct g_geom *gp; struct bio *cbp; gp = bp->bio_to->geom; cbp = g_clone_bio(bp); if (cbp == NULL) { g_io_deliver(bp, ENOMEM); return; } cbp->bio_done = g_mountver_done; g_io_request(cbp, LIST_FIRST(&gp->consumer)); } static void g_mountver_queue(struct bio *bp) { struct g_mountver_softc *sc; struct g_geom *gp; gp = bp->bio_to->geom; sc = gp->softc; mtx_lock(&sc->sc_mtx); TAILQ_INSERT_TAIL(&sc->sc_queue, bp, bio_queue); mtx_unlock(&sc->sc_mtx); } static void g_mountver_send_queued(struct g_geom *gp) { struct g_mountver_softc *sc; struct bio *bp; sc = gp->softc; mtx_lock(&sc->sc_mtx); while ((bp = TAILQ_FIRST(&sc->sc_queue)) != NULL) { TAILQ_REMOVE(&sc->sc_queue, bp, bio_queue); G_MOUNTVER_LOGREQ(bp, "Sending queued request."); g_mountver_send(bp); } mtx_unlock(&sc->sc_mtx); } static void g_mountver_discard_queued(struct g_geom *gp) { struct g_mountver_softc *sc; struct bio *bp; sc = gp->softc; mtx_lock(&sc->sc_mtx); while ((bp = TAILQ_FIRST(&sc->sc_queue)) != NULL) { TAILQ_REMOVE(&sc->sc_queue, bp, bio_queue); G_MOUNTVER_LOGREQ(bp, "Discarding queued request."); g_io_deliver(bp, ENXIO); } mtx_unlock(&sc->sc_mtx); } static void g_mountver_start(struct bio *bp) { struct g_mountver_softc *sc; struct g_geom *gp; gp = bp->bio_to->geom; sc = gp->softc; G_MOUNTVER_LOGREQ(bp, "Request received."); /* * It is possible that some bios were returned with ENXIO, even though * orphaning didn't happen yet. In that case, queue all subsequent * requests in order to maintain ordering. */ if (sc->sc_orphaned || !TAILQ_EMPTY(&sc->sc_queue)) { if (sc->sc_shutting_down) { G_MOUNTVER_LOGREQ(bp, "Discarding request due to shutdown."); g_io_deliver(bp, ENXIO); return; } G_MOUNTVER_LOGREQ(bp, "Queueing request."); g_mountver_queue(bp); if (!sc->sc_orphaned) g_mountver_send_queued(gp); } else { G_MOUNTVER_LOGREQ(bp, "Sending request."); g_mountver_send(bp); } } static int g_mountver_access(struct g_provider *pp, int dr, int dw, int de) { struct g_mountver_softc *sc; struct g_geom *gp; struct g_consumer *cp; g_topology_assert(); gp = pp->geom; cp = LIST_FIRST(&gp->consumer); sc = gp->softc; if (sc == NULL && dr <= 0 && dw <= 0 && de <= 0) return (0); KASSERT(sc != NULL, ("Trying to access withered provider \"%s\".", pp->name)); sc->sc_access_r += dr; sc->sc_access_w += dw; sc->sc_access_e += de; if (sc->sc_orphaned) return (0); return (g_access(cp, dr, dw, de)); } static int g_mountver_create(struct gctl_req *req, struct g_class *mp, struct g_provider *pp) { struct g_mountver_softc *sc; struct g_geom *gp; struct g_provider *newpp; struct g_consumer *cp; char name[64]; int error; int identsize = DISK_IDENT_SIZE; g_topology_assert(); gp = NULL; newpp = NULL; cp = NULL; snprintf(name, sizeof(name), "%s%s", pp->name, G_MOUNTVER_SUFFIX); LIST_FOREACH(gp, &mp->geom, geom) { if (strcmp(gp->name, name) == 0) { gctl_error(req, "Provider %s already exists.", name); return (EEXIST); } } gp = g_new_geomf(mp, "%s", name); sc = g_malloc(sizeof(*sc), M_WAITOK | M_ZERO); mtx_init(&sc->sc_mtx, "gmountver", NULL, MTX_DEF | MTX_RECURSE); TAILQ_INIT(&sc->sc_queue); sc->sc_provider_name = strdup(pp->name, M_GEOM); gp->softc = sc; gp->start = g_mountver_start; gp->orphan = g_mountver_orphan; gp->resize = g_mountver_resize; gp->access = g_mountver_access; gp->dumpconf = g_mountver_dumpconf; 
newpp = g_new_providerf(gp, "%s", gp->name); newpp->mediasize = pp->mediasize; newpp->sectorsize = pp->sectorsize; newpp->flags |= G_PF_DIRECT_SEND | G_PF_DIRECT_RECEIVE; if ((pp->flags & G_PF_ACCEPT_UNMAPPED) != 0) { G_MOUNTVER_DEBUG(0, "Unmapped supported for %s.", gp->name); newpp->flags |= G_PF_ACCEPT_UNMAPPED; } else { G_MOUNTVER_DEBUG(0, "Unmapped unsupported for %s.", gp->name); newpp->flags &= ~G_PF_ACCEPT_UNMAPPED; } cp = g_new_consumer(gp); cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE; error = g_attach(cp, pp); if (error != 0) { gctl_error(req, "Cannot attach to provider %s.", pp->name); goto fail; } error = g_access(cp, 1, 0, 0); if (error != 0) { gctl_error(req, "Cannot access provider %s.", pp->name); goto fail; } error = g_io_getattr("GEOM::ident", cp, &identsize, sc->sc_ident); g_access(cp, -1, 0, 0); if (error != 0) { if (g_mountver_check_ident) { gctl_error(req, "Cannot get disk ident from %s; error = %d.", pp->name, error); goto fail; } G_MOUNTVER_DEBUG(0, "Cannot get disk ident from %s; error = %d.", pp->name, error); sc->sc_ident[0] = '\0'; } g_error_provider(newpp, 0); G_MOUNTVER_DEBUG(0, "Device %s created.", gp->name); return (0); fail: g_free(sc->sc_provider_name); if (cp->provider != NULL) g_detach(cp); g_destroy_consumer(cp); g_destroy_provider(newpp); g_free(gp->softc); g_destroy_geom(gp); return (error); } static int g_mountver_destroy(struct g_geom *gp, boolean_t force) { struct g_mountver_softc *sc; struct g_provider *pp; g_topology_assert(); if (gp->softc == NULL) return (ENXIO); sc = gp->softc; pp = LIST_FIRST(&gp->provider); if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { if (force) { G_MOUNTVER_DEBUG(0, "Device %s is still open, so it " "can't be definitely removed.", pp->name); } else { G_MOUNTVER_DEBUG(1, "Device %s is still open (r%dw%de%d).", pp->name, pp->acr, pp->acw, pp->ace); return (EBUSY); } } else { G_MOUNTVER_DEBUG(0, "Device %s removed.", gp->name); } if (pp != NULL) g_wither_provider(pp, ENXIO); g_mountver_discard_queued(gp); g_free(sc->sc_provider_name); g_free(gp->softc); gp->softc = NULL; g_wither_geom(gp, ENXIO); return (0); } static int g_mountver_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp) { return (g_mountver_destroy(gp, 0)); } static void g_mountver_ctl_create(struct gctl_req *req, struct g_class *mp) { struct g_provider *pp; const char *name; char param[16]; int i, *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", i); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); if (pp == NULL) { G_MOUNTVER_DEBUG(1, "Provider %s is invalid.", name); gctl_error(req, "Provider %s is invalid.", name); return; } if (g_mountver_create(req, mp, pp) != 0) return; } } static struct g_geom * g_mountver_find_geom(struct g_class *mp, const char *name) { struct g_geom *gp; LIST_FOREACH(gp, &mp->geom, geom) { if (strcmp(gp->name, name) == 0) return (gp); } return (NULL); } static void g_mountver_ctl_destroy(struct gctl_req *req, struct g_class *mp) { int *nargs, *force, error, i; struct g_geom *gp; const char *name; char param[16]; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", 
sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } force = gctl_get_paraml(req, "force", sizeof(*force)); if (force == NULL) { gctl_error(req, "No 'force' argument"); return; } for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", i); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); gp = g_mountver_find_geom(mp, name); if (gp == NULL) { G_MOUNTVER_DEBUG(1, "Device %s is invalid.", name); gctl_error(req, "Device %s is invalid.", name); return; } error = g_mountver_destroy(gp, *force); if (error != 0) { gctl_error(req, "Cannot destroy device %s (error=%d).", gp->name, error); return; } } } static void g_mountver_orphan(struct g_consumer *cp) { struct g_mountver_softc *sc; g_topology_assert(); sc = cp->geom->softc; sc->sc_orphaned = 1; if (cp->acr > 0 || cp->acw > 0 || cp->ace > 0) g_access(cp, -cp->acr, -cp->acw, -cp->ace); g_detach(cp); G_MOUNTVER_DEBUG(0, "%s is offline. Mount verification in progress.", sc->sc_provider_name); } static void g_mountver_resize(struct g_consumer *cp) { struct g_geom *gp; struct g_provider *pp; gp = cp->geom; LIST_FOREACH(pp, &gp->provider, provider) g_resize_provider(pp, cp->provider->mediasize); } static int g_mountver_ident_matches(struct g_geom *gp) { struct g_consumer *cp; struct g_mountver_softc *sc; char ident[DISK_IDENT_SIZE]; int error, identsize = DISK_IDENT_SIZE; sc = gp->softc; cp = LIST_FIRST(&gp->consumer); if (g_mountver_check_ident == 0) return (0); error = g_access(cp, 1, 0, 0); if (error != 0) { G_MOUNTVER_DEBUG(0, "Cannot access %s; " "not attaching; error = %d.", gp->name, error); return (1); } error = g_io_getattr("GEOM::ident", cp, &identsize, ident); g_access(cp, -1, 0, 0); if (error != 0) { G_MOUNTVER_DEBUG(0, "Cannot get disk ident for %s; " "not attaching; error = %d.", gp->name, error); return (1); } if (strcmp(ident, sc->sc_ident) != 0) { G_MOUNTVER_DEBUG(1, "Disk ident for %s (\"%s\") is different " "from expected \"%s\", not attaching.", gp->name, ident, sc->sc_ident); return (1); } return (0); } static struct g_geom * g_mountver_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_mountver_softc *sc; struct g_consumer *cp; struct g_geom *gp; int error; g_topology_assert(); g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); G_MOUNTVER_DEBUG(2, "Tasting %s.", pp->name); /* * Let's check if device already exists. */ LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; /* Already attached? 
*/ if (pp == LIST_FIRST(&gp->provider)) return (NULL); if (sc->sc_orphaned && strcmp(pp->name, sc->sc_provider_name) == 0) break; } if (gp == NULL) return (NULL); cp = LIST_FIRST(&gp->consumer); g_attach(cp, pp); error = g_mountver_ident_matches(gp); if (error != 0) { g_detach(cp); return (NULL); } if (sc->sc_access_r > 0 || sc->sc_access_w > 0 || sc->sc_access_e > 0) { error = g_access(cp, sc->sc_access_r, sc->sc_access_w, sc->sc_access_e); if (error != 0) { G_MOUNTVER_DEBUG(0, "Cannot access %s; error = %d.", pp->name, error); g_detach(cp); return (NULL); } } g_mountver_send_queued(gp); sc->sc_orphaned = 0; G_MOUNTVER_DEBUG(0, "%s has completed mount verification.", sc->sc_provider_name); return (gp); } static void g_mountver_config(struct gctl_req *req, struct g_class *mp, const char *verb) { uint32_t *version; g_topology_assert(); version = gctl_get_paraml(req, "version", sizeof(*version)); if (version == NULL) { gctl_error(req, "No '%s' argument.", "version"); return; } if (*version != G_MOUNTVER_VERSION) { gctl_error(req, "Userland and kernel parts are out of sync."); return; } if (strcmp(verb, "create") == 0) { g_mountver_ctl_create(req, mp); return; } else if (strcmp(verb, "destroy") == 0) { g_mountver_ctl_destroy(req, mp); return; } gctl_error(req, "Unknown verb."); } static void g_mountver_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_mountver_softc *sc; if (pp != NULL || cp != NULL) return; sc = gp->softc; sbuf_printf(sb, "%s%s\n", indent, sc->sc_orphaned ? "OFFLINE" : "ONLINE"); sbuf_printf(sb, "%s%s\n", indent, sc->sc_provider_name); sbuf_printf(sb, "%s%s\n", indent, sc->sc_ident); } static void g_mountver_shutdown_pre_sync(void *arg, int howto) { struct g_mountver_softc *sc; struct g_class *mp; struct g_geom *gp, *gp2; mp = arg; g_topology_lock(); LIST_FOREACH_SAFE(gp, &mp->geom, geom, gp2) { if (gp->softc == NULL) continue; sc = gp->softc; sc->sc_shutting_down = 1; if (sc->sc_orphaned) g_mountver_destroy(gp, 1); } g_topology_unlock(); } static void g_mountver_init(struct g_class *mp) { g_mountver_pre_sync = EVENTHANDLER_REGISTER(shutdown_pre_sync, g_mountver_shutdown_pre_sync, mp, SHUTDOWN_PRI_FIRST); if (g_mountver_pre_sync == NULL) G_MOUNTVER_DEBUG(0, "Warning! Cannot register shutdown event."); } static void g_mountver_fini(struct g_class *mp) { if (g_mountver_pre_sync != NULL) EVENTHANDLER_DEREGISTER(shutdown_pre_sync, g_mountver_pre_sync); } DECLARE_GEOM_CLASS(g_mountver_class, g_mountver); +MODULE_VERSION(geom_mountver, 0); Index: user/markj/netdump/sys/geom/multipath/g_multipath.c =================================================================== --- user/markj/netdump/sys/geom/multipath/g_multipath.c (revision 332407) +++ user/markj/netdump/sys/geom/multipath/g_multipath.c (revision 332408) @@ -1,1534 +1,1535 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2011-2013 Alexander Motin * Copyright (c) 2006-2007 Matthew Jacob * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * Based upon work by Pawel Jakub Dawidek for all of the * fine geom examples, and by Poul Henning Kamp for GEOM * itself, all of which is most gratefully acknowledged. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_multipath, "GEOM multipath support"); SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, multipath, CTLFLAG_RW, 0, "GEOM_MULTIPATH tunables"); static u_int g_multipath_debug = 0; SYSCTL_UINT(_kern_geom_multipath, OID_AUTO, debug, CTLFLAG_RW, &g_multipath_debug, 0, "Debug level"); static u_int g_multipath_exclusive = 1; SYSCTL_UINT(_kern_geom_multipath, OID_AUTO, exclusive, CTLFLAG_RW, &g_multipath_exclusive, 0, "Exclusively open providers"); static enum { GKT_NIL, GKT_RUN, GKT_DIE } g_multipath_kt_state; static struct bio_queue_head gmtbq; static struct mtx gmtbq_mtx; static int g_multipath_read_metadata(struct g_consumer *cp, struct g_multipath_metadata *md); static int g_multipath_write_metadata(struct g_consumer *cp, struct g_multipath_metadata *md); static void g_multipath_orphan(struct g_consumer *); static void g_multipath_resize(struct g_consumer *); static void g_multipath_start(struct bio *); static void g_multipath_done(struct bio *); static void g_multipath_done_error(struct bio *); static void g_multipath_kt(void *); static int g_multipath_destroy(struct g_geom *); static int g_multipath_destroy_geom(struct gctl_req *, struct g_class *, struct g_geom *); static struct g_geom *g_multipath_find_geom(struct g_class *, const char *); static int g_multipath_rotate(struct g_geom *); static g_taste_t g_multipath_taste; static g_ctl_req_t g_multipath_config; static g_init_t g_multipath_init; static g_fini_t g_multipath_fini; static g_dumpconf_t g_multipath_dumpconf; struct g_class g_multipath_class = { .name = G_MULTIPATH_CLASS_NAME, .version = G_VERSION, .ctlreq = g_multipath_config, .taste = g_multipath_taste, .destroy_geom = g_multipath_destroy_geom, .init = g_multipath_init, .fini = g_multipath_fini }; #define MP_FAIL 0x00000001 #define MP_LOST 0x00000002 #define MP_NEW 0x00000004 #define MP_POSTED 0x00000008 #define MP_BAD (MP_FAIL | MP_LOST | MP_NEW) #define MP_WITHER 0x00000010 #define MP_IDLE 0x00000020 #define MP_IDLE_MASK 0xffffffe0 static int g_multipath_good(struct g_geom *gp) { struct g_consumer *cp; int n = 0; LIST_FOREACH(cp, &gp->consumer, consumer) { if ((cp->index & MP_BAD) == 0) n++; } return (n); } static void g_multipath_fault(struct g_consumer *cp, int cause) { struct g_multipath_softc *sc; struct g_consumer *lcp; struct g_geom *gp; gp = cp->geom; sc = gp->softc; cp->index |= cause; if (g_multipath_good(gp) == 0 && sc->sc_ndisks > 0) { LIST_FOREACH(lcp, 
&gp->consumer, consumer) { if (lcp->provider == NULL || (lcp->index & (MP_LOST | MP_NEW))) continue; if (sc->sc_ndisks > 1 && lcp == cp) continue; printf("GEOM_MULTIPATH: " "all paths in %s were marked FAIL, restore %s\n", sc->sc_name, lcp->provider->name); lcp->index &= ~MP_FAIL; } } if (cp != sc->sc_active) return; sc->sc_active = NULL; LIST_FOREACH(lcp, &gp->consumer, consumer) { if ((lcp->index & MP_BAD) == 0) { sc->sc_active = lcp; break; } } if (sc->sc_active == NULL) { printf("GEOM_MULTIPATH: out of providers for %s\n", sc->sc_name); } else if (sc->sc_active_active != 1) { printf("GEOM_MULTIPATH: %s is now active path in %s\n", sc->sc_active->provider->name, sc->sc_name); } } static struct g_consumer * g_multipath_choose(struct g_geom *gp, struct bio *bp) { struct g_multipath_softc *sc; struct g_consumer *best, *cp; sc = gp->softc; if (sc->sc_active_active == 0 || (sc->sc_active_active == 2 && bp->bio_cmd != BIO_READ)) return (sc->sc_active); best = NULL; LIST_FOREACH(cp, &gp->consumer, consumer) { if (cp->index & MP_BAD) continue; cp->index += MP_IDLE; if (best == NULL || cp->private < best->private || (cp->private == best->private && cp->index > best->index)) best = cp; } if (best != NULL) best->index &= ~MP_IDLE_MASK; return (best); } static void g_mpd(void *arg, int flags __unused) { struct g_geom *gp; struct g_multipath_softc *sc; struct g_consumer *cp; int w; g_topology_assert(); cp = arg; gp = cp->geom; if (cp->acr > 0 || cp->acw > 0 || cp->ace > 0) { w = cp->acw; g_access(cp, -cp->acr, -cp->acw, -cp->ace); if (w > 0 && cp->provider != NULL && (cp->provider->geom->flags & G_GEOM_WITHER) == 0) { cp->index |= MP_WITHER; g_post_event(g_mpd, cp, M_WAITOK, NULL); return; } } sc = gp->softc; mtx_lock(&sc->sc_mtx); if (cp->provider) { printf("GEOM_MULTIPATH: %s removed from %s\n", cp->provider->name, gp->name); g_detach(cp); } g_destroy_consumer(cp); mtx_unlock(&sc->sc_mtx); if (LIST_EMPTY(&gp->consumer)) g_multipath_destroy(gp); } static void g_multipath_orphan(struct g_consumer *cp) { struct g_multipath_softc *sc; uintptr_t *cnt; g_topology_assert(); printf("GEOM_MULTIPATH: %s in %s was disconnected\n", cp->provider->name, cp->geom->name); sc = cp->geom->softc; cnt = (uintptr_t *)&cp->private; mtx_lock(&sc->sc_mtx); sc->sc_ndisks--; g_multipath_fault(cp, MP_LOST); if (*cnt == 0 && (cp->index & MP_POSTED) == 0) { cp->index |= MP_POSTED; mtx_unlock(&sc->sc_mtx); g_mpd(cp, 0); } else mtx_unlock(&sc->sc_mtx); } static void g_multipath_resize(struct g_consumer *cp) { struct g_multipath_softc *sc; struct g_geom *gp; struct g_consumer *cp1; struct g_provider *pp; struct g_multipath_metadata md; off_t size, psize, ssize; int error; g_topology_assert(); gp = cp->geom; pp = cp->provider; sc = gp->softc; if (sc->sc_stopping) return; if (pp->mediasize < sc->sc_size) { size = pp->mediasize; ssize = pp->sectorsize; } else { size = ssize = OFF_MAX; mtx_lock(&sc->sc_mtx); LIST_FOREACH(cp1, &gp->consumer, consumer) { pp = cp1->provider; if (pp == NULL) continue; if (pp->mediasize < size) { size = pp->mediasize; ssize = pp->sectorsize; } } mtx_unlock(&sc->sc_mtx); if (size == OFF_MAX || size == sc->sc_size) return; } psize = size - ((sc->sc_uuid[0] != 0) ? 
ssize : 0); printf("GEOM_MULTIPATH: %s size changed from %jd to %jd\n", sc->sc_name, sc->sc_pp->mediasize, psize); if (sc->sc_uuid[0] != 0 && size < sc->sc_size) { error = g_multipath_read_metadata(cp, &md); if (error || (strcmp(md.md_magic, G_MULTIPATH_MAGIC) != 0) || (memcmp(md.md_uuid, sc->sc_uuid, sizeof(sc->sc_uuid)) != 0) || (strcmp(md.md_name, sc->sc_name) != 0) || (md.md_size != 0 && md.md_size != size) || (md.md_sectorsize != 0 && md.md_sectorsize != ssize)) { g_multipath_destroy(gp); return; } } sc->sc_size = size; g_resize_provider(sc->sc_pp, psize); if (sc->sc_uuid[0] != 0) { pp = cp->provider; strlcpy(md.md_magic, G_MULTIPATH_MAGIC, sizeof(md.md_magic)); memcpy(md.md_uuid, sc->sc_uuid, sizeof (sc->sc_uuid)); strlcpy(md.md_name, sc->sc_name, sizeof(md.md_name)); md.md_version = G_MULTIPATH_VERSION; md.md_size = size; md.md_sectorsize = ssize; md.md_active_active = sc->sc_active_active; error = g_multipath_write_metadata(cp, &md); if (error != 0) printf("GEOM_MULTIPATH: Can't update metadata on %s " "(%d)\n", pp->name, error); } } static void g_multipath_start(struct bio *bp) { struct g_multipath_softc *sc; struct g_geom *gp; struct g_consumer *cp; struct bio *cbp; uintptr_t *cnt; gp = bp->bio_to->geom; sc = gp->softc; KASSERT(sc != NULL, ("NULL sc")); cbp = g_clone_bio(bp); if (cbp == NULL) { g_io_deliver(bp, ENOMEM); return; } mtx_lock(&sc->sc_mtx); cp = g_multipath_choose(gp, bp); if (cp == NULL) { mtx_unlock(&sc->sc_mtx); g_destroy_bio(cbp); g_io_deliver(bp, ENXIO); return; } if ((uintptr_t)bp->bio_driver1 < sc->sc_ndisks) bp->bio_driver1 = (void *)(uintptr_t)sc->sc_ndisks; cnt = (uintptr_t *)&cp->private; (*cnt)++; mtx_unlock(&sc->sc_mtx); cbp->bio_done = g_multipath_done; g_io_request(cbp, cp); } static void g_multipath_done(struct bio *bp) { struct g_multipath_softc *sc; struct g_consumer *cp; uintptr_t *cnt; if (bp->bio_error == ENXIO || bp->bio_error == EIO) { mtx_lock(&gmtbq_mtx); bioq_insert_tail(&gmtbq, bp); mtx_unlock(&gmtbq_mtx); wakeup(&g_multipath_kt_state); } else { cp = bp->bio_from; sc = cp->geom->softc; cnt = (uintptr_t *)&cp->private; mtx_lock(&sc->sc_mtx); (*cnt)--; if (*cnt == 0 && (cp->index & MP_LOST)) { if (g_post_event(g_mpd, cp, M_NOWAIT, NULL) == 0) cp->index |= MP_POSTED; mtx_unlock(&sc->sc_mtx); } else mtx_unlock(&sc->sc_mtx); g_std_done(bp); } } static void g_multipath_done_error(struct bio *bp) { struct bio *pbp; struct g_geom *gp; struct g_multipath_softc *sc; struct g_consumer *cp; struct g_provider *pp; uintptr_t *cnt; /* * If we had a failure, we have to check first to see * whether the consumer it failed on was the currently * active consumer (i.e., this is the first in perhaps * a number of failures). If so, we then switch consumers * to the next available consumer. */ pbp = bp->bio_parent; gp = pbp->bio_to->geom; sc = gp->softc; cp = bp->bio_from; pp = cp->provider; cnt = (uintptr_t *)&cp->private; mtx_lock(&sc->sc_mtx); if ((cp->index & MP_FAIL) == 0) { printf("GEOM_MULTIPATH: Error %d, %s in %s marked FAIL\n", bp->bio_error, pp->name, sc->sc_name); g_multipath_fault(cp, MP_FAIL); } (*cnt)--; if (*cnt == 0 && (cp->index & (MP_LOST | MP_POSTED)) == MP_LOST) { cp->index |= MP_POSTED; mtx_unlock(&sc->sc_mtx); g_post_event(g_mpd, cp, M_WAITOK, NULL); } else mtx_unlock(&sc->sc_mtx); /* * If we can fruitfully restart the I/O, do so. 
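* Retrying is bounded: g_multipath_start() records the path count in * bio_driver1, and a request is re-issued only while the number of clones * already created for it (bio_children) is still below that count.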
*/ if (pbp->bio_children < (uintptr_t)pbp->bio_driver1) { pbp->bio_inbed++; g_destroy_bio(bp); g_multipath_start(pbp); } else { g_std_done(bp); } } static void g_multipath_kt(void *arg) { g_multipath_kt_state = GKT_RUN; mtx_lock(&gmtbq_mtx); while (g_multipath_kt_state == GKT_RUN) { for (;;) { struct bio *bp; bp = bioq_takefirst(&gmtbq); if (bp == NULL) break; mtx_unlock(&gmtbq_mtx); g_multipath_done_error(bp); mtx_lock(&gmtbq_mtx); } if (g_multipath_kt_state != GKT_RUN) break; msleep(&g_multipath_kt_state, &gmtbq_mtx, PRIBIO, "gkt:wait", 0); } mtx_unlock(&gmtbq_mtx); wakeup(&g_multipath_kt_state); kproc_exit(0); } static int g_multipath_access(struct g_provider *pp, int dr, int dw, int de) { struct g_geom *gp; struct g_consumer *cp, *badcp = NULL; struct g_multipath_softc *sc; int error; gp = pp->geom; /* Error used if we have no valid consumers. */ error = (dr > 0 || dw > 0 || de > 0) ? ENXIO : 0; LIST_FOREACH(cp, &gp->consumer, consumer) { if (cp->index & MP_WITHER) continue; error = g_access(cp, dr, dw, de); if (error) { badcp = cp; goto fail; } } if (error != 0) return (error); sc = gp->softc; sc->sc_opened += dr + dw + de; if (sc->sc_stopping && sc->sc_opened == 0) g_multipath_destroy(gp); return (0); fail: LIST_FOREACH(cp, &gp->consumer, consumer) { if (cp == badcp) break; if (cp->index & MP_WITHER) continue; (void) g_access(cp, -dr, -dw, -de); } return (error); } static struct g_geom * g_multipath_create(struct g_class *mp, struct g_multipath_metadata *md) { struct g_multipath_softc *sc; struct g_geom *gp; struct g_provider *pp; g_topology_assert(); LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL || sc->sc_stopping) continue; if (strcmp(gp->name, md->md_name) == 0) { printf("GEOM_MULTIPATH: name %s already exists\n", md->md_name); return (NULL); } } gp = g_new_geomf(mp, "%s", md->md_name); sc = g_malloc(sizeof(*sc), M_WAITOK | M_ZERO); mtx_init(&sc->sc_mtx, "multipath", NULL, MTX_DEF); memcpy(sc->sc_uuid, md->md_uuid, sizeof (sc->sc_uuid)); memcpy(sc->sc_name, md->md_name, sizeof (sc->sc_name)); sc->sc_active_active = md->md_active_active; sc->sc_size = md->md_size; gp->softc = sc; gp->start = g_multipath_start; gp->orphan = g_multipath_orphan; gp->resize = g_multipath_resize; gp->access = g_multipath_access; gp->dumpconf = g_multipath_dumpconf; pp = g_new_providerf(gp, "multipath/%s", md->md_name); pp->flags |= G_PF_DIRECT_SEND | G_PF_DIRECT_RECEIVE; if (md->md_size != 0) { pp->mediasize = md->md_size - ((md->md_uuid[0] != 0) ? 
md->md_sectorsize : 0); pp->sectorsize = md->md_sectorsize; } sc->sc_pp = pp; g_error_provider(pp, 0); printf("GEOM_MULTIPATH: %s created\n", gp->name); return (gp); } static int g_multipath_add_disk(struct g_geom *gp, struct g_provider *pp) { struct g_multipath_softc *sc; struct g_consumer *cp, *nxtcp; int error, acr, acw, ace; g_topology_assert(); sc = gp->softc; KASSERT(sc, ("no softc")); /* * Make sure that the passed provider isn't already attached. */ LIST_FOREACH(cp, &gp->consumer, consumer) { if (cp->provider == pp) break; } if (cp) { printf("GEOM_MULTIPATH: provider %s already attached to %s\n", pp->name, gp->name); return (EEXIST); } nxtcp = LIST_FIRST(&gp->consumer); cp = g_new_consumer(gp); cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE; cp->private = NULL; cp->index = MP_NEW; error = g_attach(cp, pp); if (error != 0) { printf("GEOM_MULTIPATH: cannot attach %s to %s\n", pp->name, sc->sc_name); g_destroy_consumer(cp); return (error); } /* * Set access permissions on new consumer to match other consumers. */ if (sc->sc_pp) { acr = sc->sc_pp->acr; acw = sc->sc_pp->acw; ace = sc->sc_pp->ace; } else acr = acw = ace = 0; if (g_multipath_exclusive) { acr++; acw++; ace++; } error = g_access(cp, acr, acw, ace); if (error) { printf("GEOM_MULTIPATH: cannot set access in " "attaching %s to %s (%d)\n", pp->name, sc->sc_name, error); g_detach(cp); g_destroy_consumer(cp); return (error); } if (sc->sc_size == 0) { sc->sc_size = pp->mediasize - ((sc->sc_uuid[0] != 0) ? pp->sectorsize : 0); sc->sc_pp->mediasize = sc->sc_size; sc->sc_pp->sectorsize = pp->sectorsize; } if (sc->sc_pp->stripesize == 0 && sc->sc_pp->stripeoffset == 0) { sc->sc_pp->stripesize = pp->stripesize; sc->sc_pp->stripeoffset = pp->stripeoffset; } sc->sc_pp->flags |= pp->flags & G_PF_ACCEPT_UNMAPPED; mtx_lock(&sc->sc_mtx); cp->index = 0; sc->sc_ndisks++; mtx_unlock(&sc->sc_mtx); printf("GEOM_MULTIPATH: %s added to %s\n", pp->name, sc->sc_name); if (sc->sc_active == NULL) { sc->sc_active = cp; if (sc->sc_active_active != 1) printf("GEOM_MULTIPATH: %s is now active path in %s\n", pp->name, sc->sc_name); } return (0); } static int g_multipath_destroy(struct g_geom *gp) { struct g_multipath_softc *sc; struct g_consumer *cp, *cp1; g_topology_assert(); if (gp->softc == NULL) return (ENXIO); sc = gp->softc; if (!sc->sc_stopping) { printf("GEOM_MULTIPATH: destroying %s\n", gp->name); sc->sc_stopping = 1; } if (sc->sc_opened != 0) { g_wither_provider(sc->sc_pp, ENXIO); sc->sc_pp = NULL; return (EINPROGRESS); } LIST_FOREACH_SAFE(cp, &gp->consumer, consumer, cp1) { mtx_lock(&sc->sc_mtx); if ((cp->index & MP_POSTED) == 0) { cp->index |= MP_POSTED; mtx_unlock(&sc->sc_mtx); g_mpd(cp, 0); if (cp1 == NULL) return (0); /* Recursion happened.
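* g_mpd() detaches the last consumer and can then re-enter * g_multipath_destroy(), which frees the softc and withers the geom, so * when cp1 is NULL the iteration must stop rather than touch the stale list.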
*/ } else mtx_unlock(&sc->sc_mtx); } if (!LIST_EMPTY(&gp->consumer)) return (EINPROGRESS); mtx_destroy(&sc->sc_mtx); g_free(gp->softc); gp->softc = NULL; printf("GEOM_MULTIPATH: %s destroyed\n", gp->name); g_wither_geom(gp, ENXIO); return (0); } static int g_multipath_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp) { return (g_multipath_destroy(gp)); } static int g_multipath_rotate(struct g_geom *gp) { struct g_consumer *lcp, *first_good_cp = NULL; struct g_multipath_softc *sc = gp->softc; int active_cp_seen = 0; g_topology_assert(); if (sc == NULL) return (ENXIO); LIST_FOREACH(lcp, &gp->consumer, consumer) { if ((lcp->index & MP_BAD) == 0) { if (first_good_cp == NULL) first_good_cp = lcp; if (active_cp_seen) break; } if (sc->sc_active == lcp) active_cp_seen = 1; } if (lcp == NULL) lcp = first_good_cp; if (lcp && lcp != sc->sc_active) { sc->sc_active = lcp; if (sc->sc_active_active != 1) printf("GEOM_MULTIPATH: %s is now active path in %s\n", lcp->provider->name, sc->sc_name); } return (0); } static void g_multipath_init(struct g_class *mp) { bioq_init(&gmtbq); mtx_init(&gmtbq_mtx, "gmtbq", NULL, MTX_DEF); kproc_create(g_multipath_kt, mp, NULL, 0, 0, "g_mp_kt"); } static void g_multipath_fini(struct g_class *mp) { if (g_multipath_kt_state == GKT_RUN) { mtx_lock(&gmtbq_mtx); g_multipath_kt_state = GKT_DIE; wakeup(&g_multipath_kt_state); msleep(&g_multipath_kt_state, &gmtbq_mtx, PRIBIO, "gmp:fini", 0); mtx_unlock(&gmtbq_mtx); } } static int g_multipath_read_metadata(struct g_consumer *cp, struct g_multipath_metadata *md) { struct g_provider *pp; u_char *buf; int error; g_topology_assert(); error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) return (error); multipath_metadata_decode(buf, md); g_free(buf); return (0); } static int g_multipath_write_metadata(struct g_consumer *cp, struct g_multipath_metadata *md) { struct g_provider *pp; u_char *buf; int error; g_topology_assert(); error = g_access(cp, 1, 1, 1); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); buf = g_malloc(pp->sectorsize, M_WAITOK | M_ZERO); multipath_metadata_encode(md, buf); error = g_write_data(cp, pp->mediasize - pp->sectorsize, buf, pp->sectorsize); g_topology_lock(); g_access(cp, -1, -1, -1); g_free(buf); return (error); } static struct g_geom * g_multipath_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_multipath_metadata md; struct g_multipath_softc *sc; struct g_consumer *cp; struct g_geom *gp, *gp1; int error, isnew; g_topology_assert(); gp = g_new_geomf(mp, "multipath:taste"); gp->start = g_multipath_start; gp->access = g_multipath_access; gp->orphan = g_multipath_orphan; cp = g_new_consumer(gp); g_attach(cp, pp); error = g_multipath_read_metadata(cp, &md); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); if (error != 0) return (NULL); gp = NULL; if (strcmp(md.md_magic, G_MULTIPATH_MAGIC) != 0) { if (g_multipath_debug) printf("%s is not MULTIPATH\n", pp->name); return (NULL); } if (md.md_version != G_MULTIPATH_VERSION) { printf("%s has version %d multipath id- this module is version %d: rejecting\n", pp->name, md.md_version, G_MULTIPATH_VERSION); return (NULL); } if (md.md_size != 0 && md.md_size != pp->mediasize) return (NULL); if (md.md_sectorsize != 0 && md.md_sectorsize != pp->sectorsize) return (NULL); if (g_multipath_debug)
printf("MULTIPATH: %s/%s\n", md.md_name, md.md_uuid); /* * Let's check if such a device already is present. We check against * uuid alone first because that's the true distinguishor. If that * passes, then we check for name conflicts. If there are conflicts, * modify the name. * * The whole purpose of this is to solve the problem that people don't * pick good unique names, but good unique names (like uuids) are a * pain to use. So, we allow people to build GEOMs with friendly names * and uuids, and modify the names in case there's a collision. */ sc = NULL; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL || sc->sc_stopping) continue; if (strncmp(md.md_uuid, sc->sc_uuid, sizeof(md.md_uuid)) == 0) break; } LIST_FOREACH(gp1, &mp->geom, geom) { if (gp1 == gp) continue; sc = gp1->softc; if (sc == NULL || sc->sc_stopping) continue; if (strncmp(md.md_name, sc->sc_name, sizeof(md.md_name)) == 0) break; } /* * If gp is NULL, we had no extant MULTIPATH geom with this uuid. * * If gp1 is *not* NULL, that means we have a MULTIPATH geom extant * with the same name (but a different UUID). * * If gp is NULL, then modify the name with a random number and * complain, but allow the creation of the geom to continue. * * If gp is *not* NULL, just use the geom's name as we're attaching * this disk to the (previously generated) name. */ if (gp1) { sc = gp1->softc; if (gp == NULL) { char buf[16]; u_long rand = random(); snprintf(buf, sizeof (buf), "%s-%lu", md.md_name, rand); printf("GEOM_MULTIPATH: geom %s/%s exists already\n", sc->sc_name, sc->sc_uuid); printf("GEOM_MULTIPATH: %s will be (temporarily) %s\n", md.md_uuid, buf); strlcpy(md.md_name, buf, sizeof(md.md_name)); } else { strlcpy(md.md_name, sc->sc_name, sizeof(md.md_name)); } } if (gp == NULL) { gp = g_multipath_create(mp, &md); if (gp == NULL) { printf("GEOM_MULTIPATH: cannot create geom %s/%s\n", md.md_name, md.md_uuid); return (NULL); } isnew = 1; } else { isnew = 0; } sc = gp->softc; KASSERT(sc != NULL, ("sc is NULL")); error = g_multipath_add_disk(gp, pp); if (error != 0) { if (isnew) g_multipath_destroy(gp); return (NULL); } return (gp); } static void g_multipath_ctl_add_name(struct gctl_req *req, struct g_class *mp, const char *name) { struct g_multipath_softc *sc; struct g_geom *gp; struct g_consumer *cp; struct g_provider *pp; const char *mpname; static const char devpf[6] = "/dev/"; int error; g_topology_assert(); mpname = gctl_get_asciiparam(req, "arg0"); if (mpname == NULL) { gctl_error(req, "No 'arg0' argument"); return; } gp = g_multipath_find_geom(mp, mpname); if (gp == NULL) { gctl_error(req, "Device %s is invalid", mpname); return; } sc = gp->softc; if (strncmp(name, devpf, 5) == 0) name += 5; pp = g_provider_by_name(name); if (pp == NULL) { gctl_error(req, "Provider %s is invalid", name); return; } /* * Check to make sure parameters match. */ LIST_FOREACH(cp, &gp->consumer, consumer) { if (cp->provider == pp) { gctl_error(req, "provider %s is already there", pp->name); return; } } if (sc->sc_pp->mediasize != 0 && sc->sc_pp->mediasize + (sc->sc_uuid[0] != 0 ? pp->sectorsize : 0) != pp->mediasize) { gctl_error(req, "Providers size mismatch %jd != %jd", (intmax_t) sc->sc_pp->mediasize + (sc->sc_uuid[0] != 0 ? 
pp->sectorsize : 0), (intmax_t) pp->mediasize); return; } if (sc->sc_pp->sectorsize != 0 && sc->sc_pp->sectorsize != pp->sectorsize) { gctl_error(req, "Provider sectorsize mismatch %u != %u", sc->sc_pp->sectorsize, pp->sectorsize); return; } error = g_multipath_add_disk(gp, pp); if (error != 0) gctl_error(req, "Provider addition error: %d", error); } static void g_multipath_ctl_prefer(struct gctl_req *req, struct g_class *mp) { struct g_geom *gp; struct g_multipath_softc *sc; struct g_consumer *cp; const char *name, *mpname; static const char devpf[6] = "/dev/"; int *nargs; g_topology_assert(); mpname = gctl_get_asciiparam(req, "arg0"); if (mpname == NULL) { gctl_error(req, "No 'arg0' argument"); return; } gp = g_multipath_find_geom(mp, mpname); if (gp == NULL) { gctl_error(req, "Device %s is invalid", mpname); return; } sc = gp->softc; nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No 'nargs' argument"); return; } if (*nargs != 2) { gctl_error(req, "Missing device"); return; } name = gctl_get_asciiparam(req, "arg1"); if (name == NULL) { gctl_error(req, "No 'arg1' argument"); return; } if (strncmp(name, devpf, 5) == 0) { name += 5; } LIST_FOREACH(cp, &gp->consumer, consumer) { if (cp->provider != NULL && strcmp(cp->provider->name, name) == 0) break; } if (cp == NULL) { gctl_error(req, "Provider %s not found", name); return; } mtx_lock(&sc->sc_mtx); if (cp->index & MP_BAD) { gctl_error(req, "Consumer %s is invalid", name); mtx_unlock(&sc->sc_mtx); return; } /* Here when the consumer is present and in good shape. */ sc->sc_active = cp; if (!sc->sc_active_active) printf("GEOM_MULTIPATH: %s is now active path in %s\n", sc->sc_active->provider->name, sc->sc_name); mtx_unlock(&sc->sc_mtx); } static void g_multipath_ctl_add(struct gctl_req *req, struct g_class *mp) { struct g_multipath_softc *sc; struct g_geom *gp; const char *mpname, *name; mpname = gctl_get_asciiparam(req, "arg0"); if (mpname == NULL) { gctl_error(req, "No 'arg0' argument"); return; } gp = g_multipath_find_geom(mp, mpname); if (gp == NULL) { gctl_error(req, "Device %s not found", mpname); return; } sc = gp->softc; name = gctl_get_asciiparam(req, "arg1"); if (name == NULL) { gctl_error(req, "No 'arg1' argument"); return; } g_multipath_ctl_add_name(req, mp, name); } static void g_multipath_ctl_create(struct gctl_req *req, struct g_class *mp) { struct g_multipath_metadata md; struct g_multipath_softc *sc; struct g_geom *gp; const char *mpname, *name; char param[16]; int *nargs, i, *val; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No 'nargs' argument"); return; } if (*nargs < 2) { gctl_error(req, "Wrong number of arguments."); return; } mpname = gctl_get_asciiparam(req, "arg0"); if (mpname == NULL) { gctl_error(req, "No 'arg0' argument"); return; } gp = g_multipath_find_geom(mp, mpname); if (gp != NULL) { gctl_error(req, "Device %s already exists", mpname); return; } memset(&md, 0, sizeof(md)); strlcpy(md.md_magic, G_MULTIPATH_MAGIC, sizeof(md.md_magic)); md.md_version = G_MULTIPATH_VERSION; strlcpy(md.md_name, mpname, sizeof(md.md_name)); md.md_size = 0; md.md_sectorsize = 0; md.md_uuid[0] = 0; md.md_active_active = 0; val = gctl_get_paraml(req, "active_active", sizeof(*val)); if (val != NULL && *val != 0) md.md_active_active = 1; val = gctl_get_paraml(req, "active_read", sizeof(*val)); if (val != NULL && *val != 0) md.md_active_active = 2; gp = g_multipath_create(mp, &md); if (gp == NULL) { gctl_error(req, "GEOM_MULTIPATH: cannot create geom %s/%s\n", md.md_name, md.md_uuid); return; } sc
= gp->softc; for (i = 1; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); name = gctl_get_asciiparam(req, param); g_multipath_ctl_add_name(req, mp, name); } if (sc->sc_ndisks != (*nargs - 1)) g_multipath_destroy(gp); } static void g_multipath_ctl_configure(struct gctl_req *req, struct g_class *mp) { struct g_multipath_softc *sc; struct g_geom *gp; struct g_consumer *cp; struct g_provider *pp; struct g_multipath_metadata md; const char *name; int error, *val; g_topology_assert(); name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg0' argument"); return; } gp = g_multipath_find_geom(mp, name); if (gp == NULL) { gctl_error(req, "Device %s is invalid", name); return; } sc = gp->softc; val = gctl_get_paraml(req, "active_active", sizeof(*val)); if (val != NULL && *val != 0) sc->sc_active_active = 1; val = gctl_get_paraml(req, "active_read", sizeof(*val)); if (val != NULL && *val != 0) sc->sc_active_active = 2; val = gctl_get_paraml(req, "active_passive", sizeof(*val)); if (val != NULL && *val != 0) sc->sc_active_active = 0; if (sc->sc_uuid[0] != 0 && sc->sc_active != NULL) { cp = sc->sc_active; pp = cp->provider; strlcpy(md.md_magic, G_MULTIPATH_MAGIC, sizeof(md.md_magic)); memcpy(md.md_uuid, sc->sc_uuid, sizeof (sc->sc_uuid)); strlcpy(md.md_name, name, sizeof(md.md_name)); md.md_version = G_MULTIPATH_VERSION; md.md_size = pp->mediasize; md.md_sectorsize = pp->sectorsize; md.md_active_active = sc->sc_active_active; error = g_multipath_write_metadata(cp, &md); if (error != 0) gctl_error(req, "Can't update metadata on %s (%d)", pp->name, error); } } static void g_multipath_ctl_fail(struct gctl_req *req, struct g_class *mp, int fail) { struct g_multipath_softc *sc; struct g_geom *gp; struct g_consumer *cp; const char *mpname, *name; int found; mpname = gctl_get_asciiparam(req, "arg0"); if (mpname == NULL) { gctl_error(req, "No 'arg0' argument"); return; } gp = g_multipath_find_geom(mp, mpname); if (gp == NULL) { gctl_error(req, "Device %s not found", mpname); return; } sc = gp->softc; name = gctl_get_asciiparam(req, "arg1"); if (name == NULL) { gctl_error(req, "No 'arg1' argument"); return; } found = 0; mtx_lock(&sc->sc_mtx); LIST_FOREACH(cp, &gp->consumer, consumer) { if (cp->provider != NULL && strcmp(cp->provider->name, name) == 0 && (cp->index & MP_LOST) == 0) { found = 1; if (!fail == !(cp->index & MP_FAIL)) continue; printf("GEOM_MULTIPATH: %s in %s is marked %s.\n", name, sc->sc_name, fail ? 
"FAIL" : "OK"); if (fail) { g_multipath_fault(cp, MP_FAIL); } else { cp->index &= ~MP_FAIL; } } } mtx_unlock(&sc->sc_mtx); if (found == 0) gctl_error(req, "Provider %s not found", name); } static void g_multipath_ctl_remove(struct gctl_req *req, struct g_class *mp) { struct g_multipath_softc *sc; struct g_geom *gp; struct g_consumer *cp, *cp1; const char *mpname, *name; uintptr_t *cnt; int found; mpname = gctl_get_asciiparam(req, "arg0"); if (mpname == NULL) { gctl_error(req, "No 'arg0' argument"); return; } gp = g_multipath_find_geom(mp, mpname); if (gp == NULL) { gctl_error(req, "Device %s not found", mpname); return; } sc = gp->softc; name = gctl_get_asciiparam(req, "arg1"); if (name == NULL) { gctl_error(req, "No 'arg1' argument"); return; } found = 0; mtx_lock(&sc->sc_mtx); LIST_FOREACH_SAFE(cp, &gp->consumer, consumer, cp1) { if (cp->provider != NULL && strcmp(cp->provider->name, name) == 0 && (cp->index & MP_LOST) == 0) { found = 1; printf("GEOM_MULTIPATH: removing %s from %s\n", cp->provider->name, cp->geom->name); sc->sc_ndisks--; g_multipath_fault(cp, MP_LOST); cnt = (uintptr_t *)&cp->private; if (*cnt == 0 && (cp->index & MP_POSTED) == 0) { cp->index |= MP_POSTED; mtx_unlock(&sc->sc_mtx); g_mpd(cp, 0); if (cp1 == NULL) return; /* Recursion happened. */ mtx_lock(&sc->sc_mtx); } } } mtx_unlock(&sc->sc_mtx); if (found == 0) gctl_error(req, "Provider %s not found", name); } static struct g_geom * g_multipath_find_geom(struct g_class *mp, const char *name) { struct g_geom *gp; struct g_multipath_softc *sc; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL || sc->sc_stopping) continue; if (strcmp(gp->name, name) == 0) return (gp); } return (NULL); } static void g_multipath_ctl_stop(struct gctl_req *req, struct g_class *mp) { struct g_geom *gp; const char *name; int error; g_topology_assert(); name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg0' argument"); return; } gp = g_multipath_find_geom(mp, name); if (gp == NULL) { gctl_error(req, "Device %s is invalid", name); return; } error = g_multipath_destroy(gp); if (error != 0 && error != EINPROGRESS) gctl_error(req, "failed to stop %s (err=%d)", name, error); } static void g_multipath_ctl_destroy(struct gctl_req *req, struct g_class *mp) { struct g_geom *gp; struct g_multipath_softc *sc; struct g_consumer *cp; struct g_provider *pp; const char *name; uint8_t *buf; int error; g_topology_assert(); name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg0' argument"); return; } gp = g_multipath_find_geom(mp, name); if (gp == NULL) { gctl_error(req, "Device %s is invalid", name); return; } sc = gp->softc; if (sc->sc_uuid[0] != 0 && sc->sc_active != NULL) { cp = sc->sc_active; pp = cp->provider; error = g_access(cp, 1, 1, 1); if (error != 0) { gctl_error(req, "Can't open %s (%d)", pp->name, error); goto destroy; } g_topology_unlock(); buf = g_malloc(pp->sectorsize, M_WAITOK | M_ZERO); error = g_write_data(cp, pp->mediasize - pp->sectorsize, buf, pp->sectorsize); g_topology_lock(); g_access(cp, -1, -1, -1); if (error != 0) gctl_error(req, "Can't erase metadata on %s (%d)", pp->name, error); } destroy: error = g_multipath_destroy(gp); if (error != 0 && error != EINPROGRESS) gctl_error(req, "failed to destroy %s (err=%d)", name, error); } static void g_multipath_ctl_rotate(struct gctl_req *req, struct g_class *mp) { struct g_geom *gp; const char *name; int error; g_topology_assert(); name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 
'arg0' argument"); return; } gp = g_multipath_find_geom(mp, name); if (gp == NULL) { gctl_error(req, "Device %s is invalid", name); return; } error = g_multipath_rotate(gp); if (error != 0) { gctl_error(req, "failed to rotate %s (err=%d)", name, error); } } static void g_multipath_ctl_getactive(struct gctl_req *req, struct g_class *mp) { struct sbuf *sb; struct g_geom *gp; struct g_multipath_softc *sc; struct g_consumer *cp; const char *name; int empty; sb = sbuf_new_auto(); g_topology_assert(); name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg0' argument"); return; } gp = g_multipath_find_geom(mp, name); if (gp == NULL) { gctl_error(req, "Device %s is invalid", name); return; } sc = gp->softc; if (sc->sc_active_active == 1) { empty = 1; LIST_FOREACH(cp, &gp->consumer, consumer) { if (cp->index & MP_BAD) continue; if (!empty) sbuf_cat(sb, " "); sbuf_cat(sb, cp->provider->name); empty = 0; } if (empty) sbuf_cat(sb, "none"); sbuf_cat(sb, "\n"); } else if (sc->sc_active && sc->sc_active->provider) { sbuf_printf(sb, "%s\n", sc->sc_active->provider->name); } else { sbuf_printf(sb, "none\n"); } sbuf_finish(sb); gctl_set_param_err(req, "output", sbuf_data(sb), sbuf_len(sb) + 1); sbuf_delete(sb); } static void g_multipath_config(struct gctl_req *req, struct g_class *mp, const char *verb) { uint32_t *version; g_topology_assert(); version = gctl_get_paraml(req, "version", sizeof(*version)); if (version == NULL) { gctl_error(req, "No 'version' argument"); } else if (*version != G_MULTIPATH_VERSION) { gctl_error(req, "Userland and kernel parts are out of sync"); } else if (strcmp(verb, "add") == 0) { g_multipath_ctl_add(req, mp); } else if (strcmp(verb, "prefer") == 0) { g_multipath_ctl_prefer(req, mp); } else if (strcmp(verb, "create") == 0) { g_multipath_ctl_create(req, mp); } else if (strcmp(verb, "configure") == 0) { g_multipath_ctl_configure(req, mp); } else if (strcmp(verb, "stop") == 0) { g_multipath_ctl_stop(req, mp); } else if (strcmp(verb, "destroy") == 0) { g_multipath_ctl_destroy(req, mp); } else if (strcmp(verb, "fail") == 0) { g_multipath_ctl_fail(req, mp, 1); } else if (strcmp(verb, "restore") == 0) { g_multipath_ctl_fail(req, mp, 0); } else if (strcmp(verb, "remove") == 0) { g_multipath_ctl_remove(req, mp); } else if (strcmp(verb, "rotate") == 0) { g_multipath_ctl_rotate(req, mp); } else if (strcmp(verb, "getactive") == 0) { g_multipath_ctl_getactive(req, mp); } else { gctl_error(req, "Unknown verb %s", verb); } } static void g_multipath_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_multipath_softc *sc; int good; g_topology_assert(); sc = gp->softc; if (sc == NULL) return; if (cp != NULL) { sbuf_printf(sb, "%s%s\n", indent, (cp->index & MP_NEW) ? "NEW" : (cp->index & MP_LOST) ? "LOST" : (cp->index & MP_FAIL) ? "FAIL" : (sc->sc_active_active == 1 || sc->sc_active == cp) ? "ACTIVE" : sc->sc_active_active == 2 ? "READ" : "PASSIVE"); } else { good = g_multipath_good(gp); sbuf_printf(sb, "%s%s\n", indent, good == 0 ? "BROKEN" : (good != sc->sc_ndisks || sc->sc_ndisks == 1) ? "DEGRADED" : "OPTIMAL"); } if (cp == NULL && pp == NULL) { sbuf_printf(sb, "%s%s\n", indent, sc->sc_uuid); sbuf_printf(sb, "%sActive/%s\n", indent, sc->sc_active_active == 2 ? "Read" : sc->sc_active_active == 1 ? "Active" : "Passive"); sbuf_printf(sb, "%s%s\n", indent, sc->sc_uuid[0] == 0 ? 
"MANUAL" : "AUTOMATIC"); } } DECLARE_GEOM_CLASS(g_multipath_class, g_multipath); +MODULE_VERSION(geom_multipath, 0); Index: user/markj/netdump/sys/geom/nop/g_nop.c =================================================================== --- user/markj/netdump/sys/geom/nop/g_nop.c (revision 332407) +++ user/markj/netdump/sys/geom/nop/g_nop.c (revision 332408) @@ -1,719 +1,720 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004-2006 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, nop, CTLFLAG_RW, 0, "GEOM_NOP stuff"); static u_int g_nop_debug = 0; SYSCTL_UINT(_kern_geom_nop, OID_AUTO, debug, CTLFLAG_RW, &g_nop_debug, 0, "Debug level"); static int g_nop_destroy(struct g_geom *gp, boolean_t force); static int g_nop_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp); static void g_nop_config(struct gctl_req *req, struct g_class *mp, const char *verb); static void g_nop_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp); struct g_class g_nop_class = { .name = G_NOP_CLASS_NAME, .version = G_VERSION, .ctlreq = g_nop_config, .destroy_geom = g_nop_destroy_geom }; static void g_nop_orphan(struct g_consumer *cp) { g_topology_assert(); g_nop_destroy(cp->geom, 1); } static void g_nop_resize(struct g_consumer *cp) { struct g_nop_softc *sc; struct g_geom *gp; struct g_provider *pp; off_t size; g_topology_assert(); gp = cp->geom; sc = gp->softc; if (sc->sc_explicitsize != 0) return; if (cp->provider->mediasize < sc->sc_offset) { g_nop_destroy(gp, 1); return; } size = cp->provider->mediasize - sc->sc_offset; LIST_FOREACH(pp, &gp->provider, provider) g_resize_provider(pp, size); } static void g_nop_start(struct bio *bp) { struct g_nop_softc *sc; struct g_geom *gp; struct g_provider *pp; struct bio *cbp; u_int failprob = 0; gp = bp->bio_to->geom; sc = gp->softc; G_NOP_LOGREQ(bp, "Request received."); mtx_lock(&sc->sc_lock); switch (bp->bio_cmd) { case BIO_READ: sc->sc_reads++; sc->sc_readbytes += bp->bio_length; failprob = sc->sc_rfailprob; break; case BIO_WRITE: 
sc->sc_writes++; sc->sc_wrotebytes += bp->bio_length; failprob = sc->sc_wfailprob; break; case BIO_DELETE: sc->sc_deletes++; break; case BIO_GETATTR: sc->sc_getattrs++; if (sc->sc_physpath && g_handleattr_str(bp, "GEOM::physpath", sc->sc_physpath)) { mtx_unlock(&sc->sc_lock); return; } break; case BIO_FLUSH: sc->sc_flushes++; break; case BIO_CMD0: sc->sc_cmd0s++; break; case BIO_CMD1: sc->sc_cmd1s++; break; case BIO_CMD2: sc->sc_cmd2s++; break; } mtx_unlock(&sc->sc_lock); if (failprob > 0) { u_int rval; rval = arc4random() % 100; if (rval < failprob) { G_NOP_LOGREQLVL(1, bp, "Returning error=%d.", sc->sc_error); g_io_deliver(bp, sc->sc_error); return; } } cbp = g_clone_bio(bp); if (cbp == NULL) { g_io_deliver(bp, ENOMEM); return; } cbp->bio_done = g_std_done; cbp->bio_offset = bp->bio_offset + sc->sc_offset; pp = LIST_FIRST(&gp->provider); KASSERT(pp != NULL, ("NULL pp")); cbp->bio_to = pp; G_NOP_LOGREQ(cbp, "Sending request."); g_io_request(cbp, LIST_FIRST(&gp->consumer)); } static int g_nop_access(struct g_provider *pp, int dr, int dw, int de) { struct g_geom *gp; struct g_consumer *cp; int error; gp = pp->geom; cp = LIST_FIRST(&gp->consumer); error = g_access(cp, dr, dw, de); return (error); } static int g_nop_create(struct gctl_req *req, struct g_class *mp, struct g_provider *pp, int ioerror, u_int rfailprob, u_int wfailprob, off_t offset, off_t size, u_int secsize, u_int stripesize, u_int stripeoffset, const char *physpath) { struct g_nop_softc *sc; struct g_geom *gp; struct g_provider *newpp; struct g_consumer *cp; char name[64]; int error; off_t explicitsize; g_topology_assert(); gp = NULL; newpp = NULL; cp = NULL; if ((offset % pp->sectorsize) != 0) { gctl_error(req, "Invalid offset for provider %s.", pp->name); return (EINVAL); } if ((size % pp->sectorsize) != 0) { gctl_error(req, "Invalid size for provider %s.", pp->name); return (EINVAL); } if (offset >= pp->mediasize) { gctl_error(req, "Invalid offset for provider %s.", pp->name); return (EINVAL); } explicitsize = size; if (size == 0) size = pp->mediasize - offset; if (offset + size > pp->mediasize) { gctl_error(req, "Invalid size for provider %s.", pp->name); return (EINVAL); } if (secsize == 0) secsize = pp->sectorsize; else if ((secsize % pp->sectorsize) != 0) { gctl_error(req, "Invalid secsize for provider %s.", pp->name); return (EINVAL); } if (secsize > MAXPHYS) { gctl_error(req, "secsize is too big."); return (EINVAL); } size -= size % secsize; if ((stripesize % pp->sectorsize) != 0) { gctl_error(req, "Invalid stripesize for provider %s.", pp->name); return (EINVAL); } if ((stripeoffset % pp->sectorsize) != 0) { gctl_error(req, "Invalid stripeoffset for provider %s.", pp->name); return (EINVAL); } if (stripesize != 0 && stripeoffset >= stripesize) { gctl_error(req, "stripeoffset is too big."); return (EINVAL); } snprintf(name, sizeof(name), "%s%s", pp->name, G_NOP_SUFFIX); LIST_FOREACH(gp, &mp->geom, geom) { if (strcmp(gp->name, name) == 0) { gctl_error(req, "Provider %s already exists.", name); return (EEXIST); } } gp = g_new_geomf(mp, "%s", name); sc = g_malloc(sizeof(*sc), M_WAITOK | M_ZERO); sc->sc_offset = offset; sc->sc_explicitsize = explicitsize; sc->sc_stripesize = stripesize; sc->sc_stripeoffset = stripeoffset; if (physpath && strcmp(physpath, G_NOP_PHYSPATH_PASSTHROUGH)) { sc->sc_physpath = strndup(physpath, MAXPATHLEN, M_GEOM); } else sc->sc_physpath = NULL; sc->sc_error = ioerror; sc->sc_rfailprob = rfailprob; sc->sc_wfailprob = wfailprob; sc->sc_reads = 0; sc->sc_writes = 0; sc->sc_deletes = 0; 
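/* These counters are updated under sc_lock in g_nop_start(), exported by g_nop_dumpconf() and cleared again by the "reset" verb. */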
sc->sc_getattrs = 0; sc->sc_flushes = 0; sc->sc_cmd0s = 0; sc->sc_cmd1s = 0; sc->sc_cmd2s = 0; sc->sc_readbytes = 0; sc->sc_wrotebytes = 0; mtx_init(&sc->sc_lock, "gnop lock", NULL, MTX_DEF); gp->softc = sc; gp->start = g_nop_start; gp->orphan = g_nop_orphan; gp->resize = g_nop_resize; gp->access = g_nop_access; gp->dumpconf = g_nop_dumpconf; newpp = g_new_providerf(gp, "%s", gp->name); newpp->flags |= G_PF_DIRECT_SEND | G_PF_DIRECT_RECEIVE; newpp->mediasize = size; newpp->sectorsize = secsize; newpp->stripesize = stripesize; newpp->stripeoffset = stripeoffset; cp = g_new_consumer(gp); cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE; error = g_attach(cp, pp); if (error != 0) { gctl_error(req, "Cannot attach to provider %s.", pp->name); goto fail; } newpp->flags |= pp->flags & G_PF_ACCEPT_UNMAPPED; g_error_provider(newpp, 0); G_NOP_DEBUG(0, "Device %s created.", gp->name); return (0); fail: if (cp->provider != NULL) g_detach(cp); g_destroy_consumer(cp); g_destroy_provider(newpp); mtx_destroy(&sc->sc_lock); free(sc->sc_physpath, M_GEOM); g_free(gp->softc); g_destroy_geom(gp); return (error); } static int g_nop_destroy(struct g_geom *gp, boolean_t force) { struct g_nop_softc *sc; struct g_provider *pp; g_topology_assert(); sc = gp->softc; if (sc == NULL) return (ENXIO); free(sc->sc_physpath, M_GEOM); pp = LIST_FIRST(&gp->provider); if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { if (force) { G_NOP_DEBUG(0, "Device %s is still open, so it " "can't be definitely removed.", pp->name); } else { G_NOP_DEBUG(1, "Device %s is still open (r%dw%de%d).", pp->name, pp->acr, pp->acw, pp->ace); return (EBUSY); } } else { G_NOP_DEBUG(0, "Device %s removed.", gp->name); } gp->softc = NULL; mtx_destroy(&sc->sc_lock); g_free(sc); g_wither_geom(gp, ENXIO); return (0); } static int g_nop_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp) { return (g_nop_destroy(gp, 0)); } static void g_nop_ctl_create(struct gctl_req *req, struct g_class *mp) { struct g_provider *pp; intmax_t *error, *rfailprob, *wfailprob, *offset, *secsize, *size, *stripesize, *stripeoffset; const char *name, *physpath; char param[16]; int i, *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } error = gctl_get_paraml(req, "error", sizeof(*error)); if (error == NULL) { gctl_error(req, "No '%s' argument", "error"); return; } rfailprob = gctl_get_paraml(req, "rfailprob", sizeof(*rfailprob)); if (rfailprob == NULL) { gctl_error(req, "No '%s' argument", "rfailprob"); return; } if (*rfailprob < -1 || *rfailprob > 100) { gctl_error(req, "Invalid '%s' argument", "rfailprob"); return; } wfailprob = gctl_get_paraml(req, "wfailprob", sizeof(*wfailprob)); if (wfailprob == NULL) { gctl_error(req, "No '%s' argument", "wfailprob"); return; } if (*wfailprob < -1 || *wfailprob > 100) { gctl_error(req, "Invalid '%s' argument", "wfailprob"); return; } offset = gctl_get_paraml(req, "offset", sizeof(*offset)); if (offset == NULL) { gctl_error(req, "No '%s' argument", "offset"); return; } if (*offset < 0) { gctl_error(req, "Invalid '%s' argument", "offset"); return; } size = gctl_get_paraml(req, "size", sizeof(*size)); if (size == NULL) { gctl_error(req, "No '%s' argument", "size"); return; } if (*size < 0) { gctl_error(req, "Invalid '%s' argument", "size"); return; } secsize = gctl_get_paraml(req, "secsize", sizeof(*secsize)); if (secsize == 
NULL) { gctl_error(req, "No '%s' argument", "secsize"); return; } if (*secsize < 0) { gctl_error(req, "Invalid '%s' argument", "secsize"); return; } stripesize = gctl_get_paraml(req, "stripesize", sizeof(*stripesize)); if (stripesize == NULL) { gctl_error(req, "No '%s' argument", "stripesize"); return; } if (*stripesize < 0) { gctl_error(req, "Invalid '%s' argument", "stripesize"); return; } stripeoffset = gctl_get_paraml(req, "stripeoffset", sizeof(*stripeoffset)); if (stripeoffset == NULL) { gctl_error(req, "No '%s' argument", "stripeoffset"); return; } if (*stripeoffset < 0) { gctl_error(req, "Invalid '%s' argument", "stripeoffset"); return; } physpath = gctl_get_asciiparam(req, "physpath"); for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", i); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); if (pp == NULL) { G_NOP_DEBUG(1, "Provider %s is invalid.", name); gctl_error(req, "Provider %s is invalid.", name); return; } if (g_nop_create(req, mp, pp, *error == -1 ? EIO : (int)*error, *rfailprob == -1 ? 0 : (u_int)*rfailprob, *wfailprob == -1 ? 0 : (u_int)*wfailprob, (off_t)*offset, (off_t)*size, (u_int)*secsize, (u_int)*stripesize, (u_int)*stripeoffset, physpath) != 0) { return; } } } static void g_nop_ctl_configure(struct gctl_req *req, struct g_class *mp) { struct g_nop_softc *sc; struct g_provider *pp; intmax_t *error, *rfailprob, *wfailprob; const char *name; char param[16]; int i, *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } error = gctl_get_paraml(req, "error", sizeof(*error)); if (error == NULL) { gctl_error(req, "No '%s' argument", "error"); return; } rfailprob = gctl_get_paraml(req, "rfailprob", sizeof(*rfailprob)); if (rfailprob == NULL) { gctl_error(req, "No '%s' argument", "rfailprob"); return; } if (*rfailprob < -1 || *rfailprob > 100) { gctl_error(req, "Invalid '%s' argument", "rfailprob"); return; } wfailprob = gctl_get_paraml(req, "wfailprob", sizeof(*wfailprob)); if (wfailprob == NULL) { gctl_error(req, "No '%s' argument", "wfailprob"); return; } if (*wfailprob < -1 || *wfailprob > 100) { gctl_error(req, "Invalid '%s' argument", "wfailprob"); return; } for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", i); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); if (pp == NULL || pp->geom->class != mp) { G_NOP_DEBUG(1, "Provider %s is invalid.", name); gctl_error(req, "Provider %s is invalid.", name); return; } sc = pp->geom->softc; if (*error != -1) sc->sc_error = (int)*error; if (*rfailprob != -1) sc->sc_rfailprob = (u_int)*rfailprob; if (*wfailprob != -1) sc->sc_wfailprob = (u_int)*wfailprob; } } static struct g_geom * g_nop_find_geom(struct g_class *mp, const char *name) { struct g_geom *gp; LIST_FOREACH(gp, &mp->geom, geom) { if (strcmp(gp->name, name) == 0) return (gp); } return (NULL); } static void g_nop_ctl_destroy(struct gctl_req *req, struct g_class *mp) { int *nargs, *force, error, i; struct g_geom *gp; const char *name; char param[16]; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if 
(nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } force = gctl_get_paraml(req, "force", sizeof(*force)); if (force == NULL) { gctl_error(req, "No 'force' argument"); return; } for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", i); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); gp = g_nop_find_geom(mp, name); if (gp == NULL) { G_NOP_DEBUG(1, "Device %s is invalid.", name); gctl_error(req, "Device %s is invalid.", name); return; } error = g_nop_destroy(gp, *force); if (error != 0) { gctl_error(req, "Cannot destroy device %s (error=%d).", gp->name, error); return; } } } static void g_nop_ctl_reset(struct gctl_req *req, struct g_class *mp) { struct g_nop_softc *sc; struct g_provider *pp; const char *name; char param[16]; int i, *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } for (i = 0; i < *nargs; i++) { snprintf(param, sizeof(param), "arg%d", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", i); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); if (pp == NULL || pp->geom->class != mp) { G_NOP_DEBUG(1, "Provider %s is invalid.", name); gctl_error(req, "Provider %s is invalid.", name); return; } sc = pp->geom->softc; sc->sc_reads = 0; sc->sc_writes = 0; sc->sc_deletes = 0; sc->sc_getattrs = 0; sc->sc_flushes = 0; sc->sc_cmd0s = 0; sc->sc_cmd1s = 0; sc->sc_cmd2s = 0; sc->sc_readbytes = 0; sc->sc_wrotebytes = 0; } } static void g_nop_config(struct gctl_req *req, struct g_class *mp, const char *verb) { uint32_t *version; g_topology_assert(); version = gctl_get_paraml(req, "version", sizeof(*version)); if (version == NULL) { gctl_error(req, "No '%s' argument.", "version"); return; } if (*version != G_NOP_VERSION) { gctl_error(req, "Userland and kernel parts are out of sync."); return; } if (strcmp(verb, "create") == 0) { g_nop_ctl_create(req, mp); return; } else if (strcmp(verb, "configure") == 0) { g_nop_ctl_configure(req, mp); return; } else if (strcmp(verb, "destroy") == 0) { g_nop_ctl_destroy(req, mp); return; } else if (strcmp(verb, "reset") == 0) { g_nop_ctl_reset(req, mp); return; } gctl_error(req, "Unknown verb."); } static void g_nop_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_nop_softc *sc; if (pp != NULL || cp != NULL) return; sc = gp->softc; sbuf_printf(sb, "%s%jd\n", indent, (intmax_t)sc->sc_offset); sbuf_printf(sb, "%s%u\n", indent, sc->sc_rfailprob); sbuf_printf(sb, "%s%u\n", indent, sc->sc_wfailprob); sbuf_printf(sb, "%s%d\n", indent, sc->sc_error); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_reads); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_writes); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_deletes); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_getattrs); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_flushes); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_cmd0s); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_cmd1s); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_cmd2s); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_readbytes); sbuf_printf(sb, "%s%ju\n", indent, sc->sc_wrotebytes); } 
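/* * Illustrative userland usage, assuming the stock gnop(8) front end in * which -e, -r and -w map to sc_error, sc_rfailprob and sc_wfailprob: * * gnop create -r 10 -w 10 -e 5 da0 # da0.nop fails ~10% of I/O with error 5 (EIO) * gnop destroy da0.nop */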
DECLARE_GEOM_CLASS(g_nop_class, g_nop); +MODULE_VERSION(geom_nop, 0); Index: user/markj/netdump/sys/geom/part/g_part_apm.c =================================================================== --- user/markj/netdump/sys/geom/part/g_part_apm.c (revision 332407) +++ user/markj/netdump/sys/geom/part/g_part_apm.c (revision 332408) @@ -1,596 +1,597 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2006-2008 Marcel Moolenaar * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "g_part_if.h" FEATURE(geom_part_apm, "GEOM partitioning class for Apple-style partitions"); struct g_part_apm_table { struct g_part_table base; struct apm_ddr ddr; struct apm_ent self; int tivo_series1; }; struct g_part_apm_entry { struct g_part_entry base; struct apm_ent ent; }; static int g_part_apm_add(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static int g_part_apm_create(struct g_part_table *, struct g_part_parms *); static int g_part_apm_destroy(struct g_part_table *, struct g_part_parms *); static void g_part_apm_dumpconf(struct g_part_table *, struct g_part_entry *, struct sbuf *, const char *); static int g_part_apm_dumpto(struct g_part_table *, struct g_part_entry *); static int g_part_apm_modify(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static const char *g_part_apm_name(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_apm_probe(struct g_part_table *, struct g_consumer *); static int g_part_apm_read(struct g_part_table *, struct g_consumer *); static const char *g_part_apm_type(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_apm_write(struct g_part_table *, struct g_consumer *); static int g_part_apm_resize(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static kobj_method_t g_part_apm_methods[] = { KOBJMETHOD(g_part_add, g_part_apm_add), KOBJMETHOD(g_part_create, g_part_apm_create), KOBJMETHOD(g_part_destroy, g_part_apm_destroy), KOBJMETHOD(g_part_dumpconf, g_part_apm_dumpconf), KOBJMETHOD(g_part_dumpto, g_part_apm_dumpto), KOBJMETHOD(g_part_modify, g_part_apm_modify), 
KOBJMETHOD(g_part_resize, g_part_apm_resize), KOBJMETHOD(g_part_name, g_part_apm_name), KOBJMETHOD(g_part_probe, g_part_apm_probe), KOBJMETHOD(g_part_read, g_part_apm_read), KOBJMETHOD(g_part_type, g_part_apm_type), KOBJMETHOD(g_part_write, g_part_apm_write), { 0, 0 } }; static struct g_part_scheme g_part_apm_scheme = { "APM", g_part_apm_methods, sizeof(struct g_part_apm_table), .gps_entrysz = sizeof(struct g_part_apm_entry), .gps_minent = 16, .gps_maxent = 4096, }; G_PART_SCHEME_DECLARE(g_part_apm); +MODULE_VERSION(geom_part_apm, 0); static void swab(char *buf, size_t bufsz) { int i; char ch; for (i = 0; i < bufsz; i += 2) { ch = buf[i]; buf[i] = buf[i + 1]; buf[i + 1] = ch; } } static int apm_parse_type(const char *type, char *buf, size_t bufsz) { const char *alias; if (type[0] == '!') { type++; if (strlen(type) > bufsz) return (EINVAL); if (!strcmp(type, APM_ENT_TYPE_SELF) || !strcmp(type, APM_ENT_TYPE_UNUSED)) return (EINVAL); strncpy(buf, type, bufsz); return (0); } alias = g_part_alias_name(G_PART_ALIAS_APPLE_BOOT); if (!strcasecmp(type, alias)) { strcpy(buf, APM_ENT_TYPE_APPLE_BOOT); return (0); } alias = g_part_alias_name(G_PART_ALIAS_APPLE_HFS); if (!strcasecmp(type, alias)) { strcpy(buf, APM_ENT_TYPE_APPLE_HFS); return (0); } alias = g_part_alias_name(G_PART_ALIAS_APPLE_UFS); if (!strcasecmp(type, alias)) { strcpy(buf, APM_ENT_TYPE_APPLE_UFS); return (0); } alias = g_part_alias_name(G_PART_ALIAS_FREEBSD); if (!strcasecmp(type, alias)) { strcpy(buf, APM_ENT_TYPE_FREEBSD); return (0); } alias = g_part_alias_name(G_PART_ALIAS_FREEBSD_NANDFS); if (!strcasecmp(type, alias)) { strcpy(buf, APM_ENT_TYPE_FREEBSD_NANDFS); return (0); } alias = g_part_alias_name(G_PART_ALIAS_FREEBSD_SWAP); if (!strcasecmp(type, alias)) { strcpy(buf, APM_ENT_TYPE_FREEBSD_SWAP); return (0); } alias = g_part_alias_name(G_PART_ALIAS_FREEBSD_UFS); if (!strcasecmp(type, alias)) { strcpy(buf, APM_ENT_TYPE_FREEBSD_UFS); return (0); } alias = g_part_alias_name(G_PART_ALIAS_FREEBSD_VINUM); if (!strcasecmp(type, alias)) { strcpy(buf, APM_ENT_TYPE_FREEBSD_VINUM); return (0); } alias = g_part_alias_name(G_PART_ALIAS_FREEBSD_ZFS); if (!strcasecmp(type, alias)) { strcpy(buf, APM_ENT_TYPE_FREEBSD_ZFS); return (0); } return (EINVAL); } static int apm_read_ent(struct g_consumer *cp, uint32_t blk, struct apm_ent *ent, int tivo_series1) { struct g_provider *pp; char *buf; int error; pp = cp->provider; buf = g_read_data(cp, pp->sectorsize * blk, pp->sectorsize, &error); if (buf == NULL) return (error); if (tivo_series1) swab(buf, pp->sectorsize); ent->ent_sig = be16dec(buf); ent->ent_pmblkcnt = be32dec(buf + 4); ent->ent_start = be32dec(buf + 8); ent->ent_size = be32dec(buf + 12); bcopy(buf + 16, ent->ent_name, sizeof(ent->ent_name)); bcopy(buf + 48, ent->ent_type, sizeof(ent->ent_type)); g_free(buf); return (0); } static int g_part_apm_add(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_apm_entry *entry; struct g_part_apm_table *table; int error; entry = (struct g_part_apm_entry *)baseentry; table = (struct g_part_apm_table *)basetable; entry->ent.ent_sig = APM_ENT_SIG; entry->ent.ent_pmblkcnt = table->self.ent_pmblkcnt; entry->ent.ent_start = gpp->gpp_start; entry->ent.ent_size = gpp->gpp_size; if (baseentry->gpe_deleted) { bzero(entry->ent.ent_type, sizeof(entry->ent.ent_type)); bzero(entry->ent.ent_name, sizeof(entry->ent.ent_name)); } error = apm_parse_type(gpp->gpp_type, entry->ent.ent_type, sizeof(entry->ent.ent_type)); if (error) return (error); if 
(gpp->gpp_parms & G_PART_PARM_LABEL) { if (strlen(gpp->gpp_label) > sizeof(entry->ent.ent_name)) return (EINVAL); strncpy(entry->ent.ent_name, gpp->gpp_label, sizeof(entry->ent.ent_name)); } if (baseentry->gpe_index >= table->self.ent_pmblkcnt) table->self.ent_pmblkcnt = baseentry->gpe_index + 1; KASSERT(table->self.ent_size >= table->self.ent_pmblkcnt, ("%s", __func__)); KASSERT(table->self.ent_size > baseentry->gpe_index, ("%s", __func__)); return (0); } static int g_part_apm_create(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_provider *pp; struct g_part_apm_table *table; uint32_t last; /* We don't nest, which means that our depth should be 0. */ if (basetable->gpt_depth != 0) return (ENXIO); table = (struct g_part_apm_table *)basetable; pp = gpp->gpp_provider; if (pp->sectorsize != 512 || pp->mediasize < (2 + 2 * basetable->gpt_entries) * pp->sectorsize) return (ENOSPC); /* APM uses 32-bit LBAs. */ last = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX) - 1; basetable->gpt_first = 2 + basetable->gpt_entries; basetable->gpt_last = last; table->ddr.ddr_sig = APM_DDR_SIG; table->ddr.ddr_blksize = pp->sectorsize; table->ddr.ddr_blkcount = last + 1; table->self.ent_sig = APM_ENT_SIG; table->self.ent_pmblkcnt = basetable->gpt_entries + 1; table->self.ent_start = 1; table->self.ent_size = table->self.ent_pmblkcnt; strcpy(table->self.ent_name, "Apple"); strcpy(table->self.ent_type, APM_ENT_TYPE_SELF); return (0); } static int g_part_apm_destroy(struct g_part_table *basetable, struct g_part_parms *gpp) { /* Wipe the first 2 sectors to clear the partitioning. */ basetable->gpt_smhead |= 3; return (0); } static void g_part_apm_dumpconf(struct g_part_table *table, struct g_part_entry *baseentry, struct sbuf *sb, const char *indent) { union { char name[APM_ENT_NAMELEN + 1]; char type[APM_ENT_TYPELEN + 1]; } u; struct g_part_apm_entry *entry; entry = (struct g_part_apm_entry *)baseentry; if (indent == NULL) { /* conftxt: libdisk compatibility */ sbuf_printf(sb, " xs APPLE xt %s", entry->ent.ent_type); } else if (entry != NULL) { /* confxml: partition entry information */ strncpy(u.name, entry->ent.ent_name, APM_ENT_NAMELEN); u.name[APM_ENT_NAMELEN] = '\0'; sbuf_printf(sb, "%s<label>", indent); g_conf_printf_escaped(sb, "%s", u.name); sbuf_printf(sb, "</label>\n"); strncpy(u.type, entry->ent.ent_type, APM_ENT_TYPELEN); u.type[APM_ENT_TYPELEN] = '\0'; sbuf_printf(sb, "%s<rawtype>", indent); g_conf_printf_escaped(sb, "%s", u.type); sbuf_printf(sb, "</rawtype>\n"); } else { /* confxml: scheme information */ } } static int g_part_apm_dumpto(struct g_part_table *table, struct g_part_entry *baseentry) { struct g_part_apm_entry *entry; entry = (struct g_part_apm_entry *)baseentry; return ((!strcmp(entry->ent.ent_type, APM_ENT_TYPE_FREEBSD_SWAP)) ?
1 : 0); } static int g_part_apm_modify(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_apm_entry *entry; int error; entry = (struct g_part_apm_entry *)baseentry; if (gpp->gpp_parms & G_PART_PARM_LABEL) { if (strlen(gpp->gpp_label) > sizeof(entry->ent.ent_name)) return (EINVAL); } if (gpp->gpp_parms & G_PART_PARM_TYPE) { error = apm_parse_type(gpp->gpp_type, entry->ent.ent_type, sizeof(entry->ent.ent_type)); if (error) return (error); } if (gpp->gpp_parms & G_PART_PARM_LABEL) { strncpy(entry->ent.ent_name, gpp->gpp_label, sizeof(entry->ent.ent_name)); } return (0); } static int g_part_apm_resize(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_apm_entry *entry; struct g_provider *pp; if (baseentry == NULL) { pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider; basetable->gpt_last = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX) - 1; return (0); } entry = (struct g_part_apm_entry *)baseentry; baseentry->gpe_end = baseentry->gpe_start + gpp->gpp_size - 1; entry->ent.ent_size = gpp->gpp_size; return (0); } static const char * g_part_apm_name(struct g_part_table *table, struct g_part_entry *baseentry, char *buf, size_t bufsz) { snprintf(buf, bufsz, "s%d", baseentry->gpe_index + 1); return (buf); } static int g_part_apm_probe(struct g_part_table *basetable, struct g_consumer *cp) { struct g_provider *pp; struct g_part_apm_table *table; char *buf; int error; /* We don't nest, which means that our depth should be 0. */ if (basetable->gpt_depth != 0) return (ENXIO); table = (struct g_part_apm_table *)basetable; table->tivo_series1 = 0; pp = cp->provider; /* Sanity-check the provider. */ if (pp->mediasize < 4 * pp->sectorsize) return (ENOSPC); /* Check that there's a Driver Descriptor Record (DDR). */ buf = g_read_data(cp, 0L, pp->sectorsize, &error); if (buf == NULL) return (error); if (be16dec(buf) == APM_DDR_SIG) { /* Normal Apple DDR */ table->ddr.ddr_sig = be16dec(buf); table->ddr.ddr_blksize = be16dec(buf + 2); table->ddr.ddr_blkcount = be32dec(buf + 4); g_free(buf); if (table->ddr.ddr_blksize != pp->sectorsize) return (ENXIO); if (table->ddr.ddr_blkcount > pp->mediasize / pp->sectorsize) return (ENXIO); } else { /* * Check for Tivo drives, which have no DDR and a different * signature. Those whose first two bytes are 14 92 are * Series 2 drives, and aren't supported. Those that start * with 92 14 are series 1 drives and are supported. */ if (be16dec(buf) != 0x9214) { /* If this is 0x1492 it could be a series 2 drive */ g_free(buf); return (ENXIO); } table->ddr.ddr_sig = APM_DDR_SIG; /* XXX */ table->ddr.ddr_blksize = pp->sectorsize; /* XXX */ table->ddr.ddr_blkcount = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX); table->tivo_series1 = 1; g_free(buf); } /* Check that there's a Partition Map. 
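 The map is self-describing: the entry read from block 1 must carry the
 APM_ENT_SIG signature (traditionally the two bytes "PM") and the
 Apple_partition_map type, and its ent_pmblkcnt gives the number of map
 entries.  As an illustrative sketch of the layout being validated here
 (only blocks 0 and 1 are fixed by the checks below):

   block 0: DDR (signature, block size, block count)
   block 1: entry 1 - Apple_partition_map, describing the map itself
   block 2: entry 2 - first real partition
   ...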
*/ error = apm_read_ent(cp, 1, &table->self, table->tivo_series1); if (error) return (error); if (table->self.ent_sig != APM_ENT_SIG) return (ENXIO); if (strcmp(table->self.ent_type, APM_ENT_TYPE_SELF)) return (ENXIO); if (table->self.ent_pmblkcnt >= table->ddr.ddr_blkcount) return (ENXIO); return (G_PART_PROBE_PRI_NORM); } static int g_part_apm_read(struct g_part_table *basetable, struct g_consumer *cp) { struct apm_ent ent; struct g_part_apm_entry *entry; struct g_part_apm_table *table; int error, index; table = (struct g_part_apm_table *)basetable; basetable->gpt_first = table->self.ent_size + 1; basetable->gpt_last = table->ddr.ddr_blkcount - 1; basetable->gpt_entries = table->self.ent_size - 1; for (index = table->self.ent_pmblkcnt - 1; index > 0; index--) { error = apm_read_ent(cp, index + 1, &ent, table->tivo_series1); if (error) continue; if (!strcmp(ent.ent_type, APM_ENT_TYPE_UNUSED)) continue; entry = (struct g_part_apm_entry *)g_part_new_entry(basetable, index, ent.ent_start, ent.ent_start + ent.ent_size - 1); entry->ent = ent; } return (0); } static const char * g_part_apm_type(struct g_part_table *basetable, struct g_part_entry *baseentry, char *buf, size_t bufsz) { struct g_part_apm_entry *entry; const char *type; size_t len; entry = (struct g_part_apm_entry *)baseentry; type = entry->ent.ent_type; if (!strcmp(type, APM_ENT_TYPE_APPLE_BOOT)) return (g_part_alias_name(G_PART_ALIAS_APPLE_BOOT)); if (!strcmp(type, APM_ENT_TYPE_APPLE_HFS)) return (g_part_alias_name(G_PART_ALIAS_APPLE_HFS)); if (!strcmp(type, APM_ENT_TYPE_APPLE_UFS)) return (g_part_alias_name(G_PART_ALIAS_APPLE_UFS)); if (!strcmp(type, APM_ENT_TYPE_FREEBSD)) return (g_part_alias_name(G_PART_ALIAS_FREEBSD)); if (!strcmp(type, APM_ENT_TYPE_FREEBSD_NANDFS)) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_NANDFS)); if (!strcmp(type, APM_ENT_TYPE_FREEBSD_SWAP)) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_SWAP)); if (!strcmp(type, APM_ENT_TYPE_FREEBSD_UFS)) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_UFS)); if (!strcmp(type, APM_ENT_TYPE_FREEBSD_VINUM)) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_VINUM)); if (!strcmp(type, APM_ENT_TYPE_FREEBSD_ZFS)) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_ZFS)); buf[0] = '!'; len = MIN(sizeof(entry->ent.ent_type), bufsz - 2); bcopy(type, buf + 1, len); buf[len + 1] = '\0'; return (buf); } static int g_part_apm_write(struct g_part_table *basetable, struct g_consumer *cp) { struct g_provider *pp; struct g_part_entry *baseentry; struct g_part_apm_entry *entry; struct g_part_apm_table *table; char *buf, *ptr; uint32_t index; int error; size_t tblsz; pp = cp->provider; table = (struct g_part_apm_table *)basetable; /* * Tivo Series 1 disk partitions are currently read-only. */ if (table->tivo_series1) return (EOPNOTSUPP); /* Write the DDR only when we're newly created. 
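 As the encoding just below shows, the DDR occupies block 0 and is laid
 out big-endian: the 16-bit signature at offset 0, the 16-bit block size
 at offset 2 and the 32-bit block count at offset 4.  Assuming
 APM_DDR_SIG is the traditional "ER" signature (0x4552), the first bytes
 of block 0 for a 512-byte-sector disk of 1000 blocks would look like
 (values illustrative):

   45 52  02 00  00 00 03 e8        "ER", 512, 1000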
*/ if (basetable->gpt_created) { buf = g_malloc(pp->sectorsize, M_WAITOK | M_ZERO); be16enc(buf, table->ddr.ddr_sig); be16enc(buf + 2, table->ddr.ddr_blksize); be32enc(buf + 4, table->ddr.ddr_blkcount); error = g_write_data(cp, 0, buf, pp->sectorsize); g_free(buf); if (error) return (error); } /* Allocate the buffer for all entries */ tblsz = table->self.ent_pmblkcnt; buf = g_malloc(tblsz * pp->sectorsize, M_WAITOK | M_ZERO); /* Fill the self entry */ be16enc(buf, APM_ENT_SIG); be32enc(buf + 4, table->self.ent_pmblkcnt); be32enc(buf + 8, table->self.ent_start); be32enc(buf + 12, table->self.ent_size); bcopy(table->self.ent_name, buf + 16, sizeof(table->self.ent_name)); bcopy(table->self.ent_type, buf + 48, sizeof(table->self.ent_type)); baseentry = LIST_FIRST(&basetable->gpt_entry); for (index = 1; index < tblsz; index++) { entry = (baseentry != NULL && index == baseentry->gpe_index) ? (struct g_part_apm_entry *)baseentry : NULL; ptr = buf + index * pp->sectorsize; be16enc(ptr, APM_ENT_SIG); be32enc(ptr + 4, table->self.ent_pmblkcnt); if (entry != NULL && !baseentry->gpe_deleted) { be32enc(ptr + 8, entry->ent.ent_start); be32enc(ptr + 12, entry->ent.ent_size); bcopy(entry->ent.ent_name, ptr + 16, sizeof(entry->ent.ent_name)); bcopy(entry->ent.ent_type, ptr + 48, sizeof(entry->ent.ent_type)); } else { strcpy(ptr + 48, APM_ENT_TYPE_UNUSED); } if (entry != NULL) baseentry = LIST_NEXT(baseentry, gpe_entry); } for (index = 0; index < tblsz; index += MAXPHYS / pp->sectorsize) { error = g_write_data(cp, (1 + index) * pp->sectorsize, buf + index * pp->sectorsize, (tblsz - index > MAXPHYS / pp->sectorsize) ? MAXPHYS: (tblsz - index) * pp->sectorsize); if (error) { g_free(buf); return (error); } } g_free(buf); return (0); } Index: user/markj/netdump/sys/geom/part/g_part_bsd.c =================================================================== --- user/markj/netdump/sys/geom/part/g_part_bsd.c (revision 332407) +++ user/markj/netdump/sys/geom/part/g_part_bsd.c (revision 332408) @@ -1,541 +1,542 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2007 Marcel Moolenaar * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "g_part_if.h" #define BOOT1_SIZE 512 #define LABEL_SIZE 512 #define BOOT2_OFF (BOOT1_SIZE + LABEL_SIZE) #define BOOT2_SIZE (BBSIZE - BOOT2_OFF) FEATURE(geom_part_bsd, "GEOM partitioning class for BSD disklabels"); struct g_part_bsd_table { struct g_part_table base; u_char *bbarea; uint32_t offset; }; struct g_part_bsd_entry { struct g_part_entry base; struct partition part; }; static int g_part_bsd_add(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static int g_part_bsd_bootcode(struct g_part_table *, struct g_part_parms *); static int g_part_bsd_create(struct g_part_table *, struct g_part_parms *); static int g_part_bsd_destroy(struct g_part_table *, struct g_part_parms *); static void g_part_bsd_dumpconf(struct g_part_table *, struct g_part_entry *, struct sbuf *, const char *); static int g_part_bsd_dumpto(struct g_part_table *, struct g_part_entry *); static int g_part_bsd_modify(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static const char *g_part_bsd_name(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_bsd_probe(struct g_part_table *, struct g_consumer *); static int g_part_bsd_read(struct g_part_table *, struct g_consumer *); static const char *g_part_bsd_type(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_bsd_write(struct g_part_table *, struct g_consumer *); static int g_part_bsd_resize(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static kobj_method_t g_part_bsd_methods[] = { KOBJMETHOD(g_part_add, g_part_bsd_add), KOBJMETHOD(g_part_bootcode, g_part_bsd_bootcode), KOBJMETHOD(g_part_create, g_part_bsd_create), KOBJMETHOD(g_part_destroy, g_part_bsd_destroy), KOBJMETHOD(g_part_dumpconf, g_part_bsd_dumpconf), KOBJMETHOD(g_part_dumpto, g_part_bsd_dumpto), KOBJMETHOD(g_part_modify, g_part_bsd_modify), KOBJMETHOD(g_part_resize, g_part_bsd_resize), KOBJMETHOD(g_part_name, g_part_bsd_name), KOBJMETHOD(g_part_probe, g_part_bsd_probe), KOBJMETHOD(g_part_read, g_part_bsd_read), KOBJMETHOD(g_part_type, g_part_bsd_type), KOBJMETHOD(g_part_write, g_part_bsd_write), { 0, 0 } }; static struct g_part_scheme g_part_bsd_scheme = { "BSD", g_part_bsd_methods, sizeof(struct g_part_bsd_table), .gps_entrysz = sizeof(struct g_part_bsd_entry), .gps_minent = 8, .gps_maxent = 20, /* Only 22 entries fit in 512 byte sectors */ .gps_bootcodesz = BBSIZE, }; G_PART_SCHEME_DECLARE(g_part_bsd); +MODULE_VERSION(geom_part_bsd, 0); static struct g_part_bsd_alias { uint8_t type; int alias; } bsd_alias_match[] = { { FS_BSDFFS, G_PART_ALIAS_FREEBSD_UFS }, { FS_SWAP, G_PART_ALIAS_FREEBSD_SWAP }, { FS_ZFS, G_PART_ALIAS_FREEBSD_ZFS }, { FS_VINUM, G_PART_ALIAS_FREEBSD_VINUM }, { FS_NANDFS, G_PART_ALIAS_FREEBSD_NANDFS }, { FS_HAMMER, G_PART_ALIAS_DFBSD_HAMMER }, { FS_HAMMER2, G_PART_ALIAS_DFBSD_HAMMER2 }, }; static int bsd_parse_type(const char *type, uint8_t *fstype) { const char *alias; char *endp; long lt; int i; if (type[0] == '!') { lt = strtol(type + 1, &endp, 0); if (type[1] == '\0' || *endp != '\0' || lt <= 0 || lt >= 256) return (EINVAL); *fstype = (u_int)lt; return (0); } for (i = 0; i < nitems(bsd_alias_match); i++) { alias = g_part_alias_name(bsd_alias_match[i].alias); if (strcasecmp(type, alias) == 0) { *fstype = bsd_alias_match[i].type; return (0); } } return (EINVAL); } static int 
g_part_bsd_add(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_bsd_entry *entry; struct g_part_bsd_table *table; if (gpp->gpp_parms & G_PART_PARM_LABEL) return (EINVAL); entry = (struct g_part_bsd_entry *)baseentry; table = (struct g_part_bsd_table *)basetable; entry->part.p_size = gpp->gpp_size; entry->part.p_offset = gpp->gpp_start + table->offset; entry->part.p_fsize = 0; entry->part.p_frag = 0; entry->part.p_cpg = 0; return (bsd_parse_type(gpp->gpp_type, &entry->part.p_fstype)); } static int g_part_bsd_bootcode(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_part_bsd_table *table; const u_char *codeptr; if (gpp->gpp_codesize != BOOT1_SIZE && gpp->gpp_codesize != BBSIZE) return (ENODEV); table = (struct g_part_bsd_table *)basetable; codeptr = gpp->gpp_codeptr; bcopy(codeptr, table->bbarea, BOOT1_SIZE); if (gpp->gpp_codesize == BBSIZE) bcopy(codeptr + BOOT2_OFF, table->bbarea + BOOT2_OFF, BOOT2_SIZE); return (0); } static int g_part_bsd_create(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_provider *pp; struct g_part_entry *baseentry; struct g_part_bsd_entry *entry; struct g_part_bsd_table *table; u_char *ptr; uint32_t msize, ncyls, secpercyl; pp = gpp->gpp_provider; if (pp->sectorsize < sizeof(struct disklabel)) return (ENOSPC); if (BBSIZE % pp->sectorsize) return (ENOTBLK); msize = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX); secpercyl = basetable->gpt_sectors * basetable->gpt_heads; ncyls = msize / secpercyl; table = (struct g_part_bsd_table *)basetable; table->bbarea = g_malloc(BBSIZE, M_WAITOK | M_ZERO); ptr = table->bbarea + pp->sectorsize; le32enc(ptr + 0, DISKMAGIC); /* d_magic */ le32enc(ptr + 40, pp->sectorsize); /* d_secsize */ le32enc(ptr + 44, basetable->gpt_sectors); /* d_nsectors */ le32enc(ptr + 48, basetable->gpt_heads); /* d_ntracks */ le32enc(ptr + 52, ncyls); /* d_ncylinders */ le32enc(ptr + 56, secpercyl); /* d_secpercyl */ le32enc(ptr + 60, msize); /* d_secperunit */ le16enc(ptr + 72, 3600); /* d_rpm */ le32enc(ptr + 132, DISKMAGIC); /* d_magic2 */ le16enc(ptr + 138, basetable->gpt_entries); /* d_npartitions */ le32enc(ptr + 140, BBSIZE); /* d_bbsize */ basetable->gpt_first = 0; basetable->gpt_last = msize - 1; basetable->gpt_isleaf = 1; baseentry = g_part_new_entry(basetable, RAW_PART + 1, basetable->gpt_first, basetable->gpt_last); baseentry->gpe_internal = 1; entry = (struct g_part_bsd_entry *)baseentry; entry->part.p_size = basetable->gpt_last + 1; entry->part.p_offset = table->offset; return (0); } static int g_part_bsd_destroy(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_part_bsd_table *table; table = (struct g_part_bsd_table *)basetable; if (table->bbarea != NULL) g_free(table->bbarea); table->bbarea = NULL; /* Wipe the second sector to clear the partitioning. 
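 gpt_smhead appears to act as a bitmap of sectors to zero at the head of
 the provider (bit n requests that sector n be wiped): the EBR scheme
 below sets bit 0 only, the APM scheme above sets bits 0 and 1 (|= 3),
 and here bit 1 alone marks the disklabel sector while the boot sector
 is preserved.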
*/
	basetable->gpt_smhead |= 2;
	return (0);
}

static void
g_part_bsd_dumpconf(struct g_part_table *table, struct g_part_entry *baseentry,
    struct sbuf *sb, const char *indent)
{
	struct g_part_bsd_entry *entry;

	entry = (struct g_part_bsd_entry *)baseentry;
	if (indent == NULL) {
		/* conftxt: libdisk compatibility */
		sbuf_printf(sb, " xs BSD xt %u", entry->part.p_fstype);
	} else if (entry != NULL) {
		/* confxml: partition entry information */
		sbuf_printf(sb, "%s<rawtype>%u</rawtype>\n", indent,
		    entry->part.p_fstype);
	} else {
		/* confxml: scheme information */
	}
}

static int
g_part_bsd_dumpto(struct g_part_table *table, struct g_part_entry *baseentry)
{
	struct g_part_bsd_entry *entry;

	/* Allow dumping to a swap partition or an unused partition. */
	entry = (struct g_part_bsd_entry *)baseentry;
	return ((entry->part.p_fstype == FS_UNUSED ||
	    entry->part.p_fstype == FS_SWAP) ? 1 : 0);
}

static int
g_part_bsd_modify(struct g_part_table *basetable,
    struct g_part_entry *baseentry, struct g_part_parms *gpp)
{
	struct g_part_bsd_entry *entry;

	if (gpp->gpp_parms & G_PART_PARM_LABEL)
		return (EINVAL);
	entry = (struct g_part_bsd_entry *)baseentry;
	if (gpp->gpp_parms & G_PART_PARM_TYPE)
		return (bsd_parse_type(gpp->gpp_type, &entry->part.p_fstype));
	return (0);
}

static void
bsd_set_rawsize(struct g_part_table *basetable, struct g_provider *pp)
{
	struct g_part_bsd_table *table;
	struct g_part_bsd_entry *entry;
	struct g_part_entry *baseentry;
	uint32_t msize;

	table = (struct g_part_bsd_table *)basetable;
	msize = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX);
	le32enc(table->bbarea + pp->sectorsize + 60, msize); /* d_secperunit */
	basetable->gpt_last = msize - 1;
	LIST_FOREACH(baseentry, &basetable->gpt_entry, gpe_entry) {
		if (baseentry->gpe_index != RAW_PART + 1)
			continue;
		baseentry->gpe_end = basetable->gpt_last;
		entry = (struct g_part_bsd_entry *)baseentry;
		entry->part.p_size = msize;
		return;
	}
}

static int
g_part_bsd_resize(struct g_part_table *basetable,
    struct g_part_entry *baseentry, struct g_part_parms *gpp)
{
	struct g_part_bsd_entry *entry;
	struct g_provider *pp;

	if (baseentry == NULL) {
		pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider;
		bsd_set_rawsize(basetable, pp);
		return (0);
	}
	entry = (struct g_part_bsd_entry *)baseentry;
	baseentry->gpe_end = baseentry->gpe_start + gpp->gpp_size - 1;
	entry->part.p_size = gpp->gpp_size;
	return (0);
}

static const char *
g_part_bsd_name(struct g_part_table *table, struct g_part_entry *baseentry,
    char *buf, size_t bufsz)
{

	snprintf(buf, bufsz, "%c", 'a' + baseentry->gpe_index - 1);
	return (buf);
}

static int
g_part_bsd_probe(struct g_part_table *table, struct g_consumer *cp)
{
	struct g_provider *pp;
	u_char *buf;
	uint32_t magic1, magic2;
	int error;

	pp = cp->provider;

	/* Sanity-check the provider. */
	if (pp->sectorsize < sizeof(struct disklabel) ||
	    pp->mediasize < BBSIZE)
		return (ENOSPC);
	if (BBSIZE % pp->sectorsize)
		return (ENOTBLK);

	/* Check that there's a disklabel. */
	buf = g_read_data(cp, pp->sectorsize, pp->sectorsize, &error);
	if (buf == NULL)
		return (error);
	magic1 = le32dec(buf + 0);
	magic2 = le32dec(buf + 132);
	g_free(buf);
	return ((magic1 == DISKMAGIC && magic2 == DISKMAGIC) ?
G_PART_PROBE_PRI_HIGH : ENXIO); } static int g_part_bsd_read(struct g_part_table *basetable, struct g_consumer *cp) { struct g_provider *pp; struct g_part_bsd_table *table; struct g_part_entry *baseentry; struct g_part_bsd_entry *entry; struct partition part; u_char *buf, *p; off_t chs, msize; u_int sectors, heads; int error, index; pp = cp->provider; table = (struct g_part_bsd_table *)basetable; msize = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX); table->bbarea = g_read_data(cp, 0, BBSIZE, &error); if (table->bbarea == NULL) return (error); buf = table->bbarea + pp->sectorsize; if (le32dec(buf + 40) != pp->sectorsize) goto invalid_label; sectors = le32dec(buf + 44); if (sectors < 1 || sectors > 255) goto invalid_label; if (sectors != basetable->gpt_sectors && !basetable->gpt_fixgeom) { g_part_geometry_heads(msize, sectors, &chs, &heads); if (chs != 0) { basetable->gpt_sectors = sectors; basetable->gpt_heads = heads; } } heads = le32dec(buf + 48); if (heads < 1 || heads > 255) goto invalid_label; if (heads != basetable->gpt_heads && !basetable->gpt_fixgeom) basetable->gpt_heads = heads; chs = le32dec(buf + 60); if (chs < 1) goto invalid_label; /* Fix-up a sysinstall bug. */ if (chs > msize) { chs = msize; le32enc(buf + 60, msize); } basetable->gpt_first = 0; basetable->gpt_last = msize - 1; basetable->gpt_isleaf = 1; basetable->gpt_entries = le16dec(buf + 138); if (basetable->gpt_entries < g_part_bsd_scheme.gps_minent || basetable->gpt_entries > g_part_bsd_scheme.gps_maxent) goto invalid_label; table->offset = le32dec(buf + 148 + RAW_PART * 16 + 4); for (index = basetable->gpt_entries - 1; index >= 0; index--) { p = buf + 148 + index * 16; part.p_size = le32dec(p + 0); part.p_offset = le32dec(p + 4); part.p_fsize = le32dec(p + 8); part.p_fstype = p[12]; part.p_frag = p[13]; part.p_cpg = le16dec(p + 14); if (part.p_size == 0) continue; if (part.p_offset < table->offset) continue; if (part.p_offset - table->offset > basetable->gpt_last) goto invalid_label; baseentry = g_part_new_entry(basetable, index + 1, part.p_offset - table->offset, part.p_offset - table->offset + part.p_size - 1); entry = (struct g_part_bsd_entry *)baseentry; entry->part = part; if (index == RAW_PART) baseentry->gpe_internal = 1; } return (0); invalid_label: printf("GEOM: %s: invalid disklabel.\n", pp->name); g_free(table->bbarea); table->bbarea = NULL; return (EINVAL); } static const char * g_part_bsd_type(struct g_part_table *basetable, struct g_part_entry *baseentry, char *buf, size_t bufsz) { struct g_part_bsd_entry *entry; int type; entry = (struct g_part_bsd_entry *)baseentry; type = entry->part.p_fstype; if (type == FS_NANDFS) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_NANDFS)); if (type == FS_SWAP) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_SWAP)); if (type == FS_BSDFFS) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_UFS)); if (type == FS_VINUM) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_VINUM)); if (type == FS_ZFS) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_ZFS)); snprintf(buf, bufsz, "!%d", type); return (buf); } static int g_part_bsd_write(struct g_part_table *basetable, struct g_consumer *cp) { struct g_provider *pp; struct g_part_entry *baseentry; struct g_part_bsd_entry *entry; struct g_part_bsd_table *table; uint16_t sum; u_char *label, *p, *pe; int error, index; pp = cp->provider; table = (struct g_part_bsd_table *)basetable; baseentry = LIST_FIRST(&basetable->gpt_entry); label = table->bbarea + pp->sectorsize; for (index = 1; index <= basetable->gpt_entries; index++) { p = label + 
148 + (index - 1) * 16; entry = (baseentry != NULL && index == baseentry->gpe_index) ? (struct g_part_bsd_entry *)baseentry : NULL; if (entry != NULL && !baseentry->gpe_deleted) { le32enc(p + 0, entry->part.p_size); le32enc(p + 4, entry->part.p_offset); le32enc(p + 8, entry->part.p_fsize); p[12] = entry->part.p_fstype; p[13] = entry->part.p_frag; le16enc(p + 14, entry->part.p_cpg); } else bzero(p, 16); if (entry != NULL) baseentry = LIST_NEXT(baseentry, gpe_entry); } /* Calculate checksum. */ le16enc(label + 136, 0); pe = label + 148 + basetable->gpt_entries * 16; sum = 0; for (p = label; p < pe; p += 2) sum ^= le16dec(p); le16enc(label + 136, sum); error = g_write_data(cp, 0, table->bbarea, BBSIZE); return (error); } Index: user/markj/netdump/sys/geom/part/g_part_bsd64.c =================================================================== --- user/markj/netdump/sys/geom/part/g_part_bsd64.c (revision 332407) +++ user/markj/netdump/sys/geom/part/g_part_bsd64.c (revision 332408) @@ -1,664 +1,665 @@ /*- * Copyright (c) 2014 Andrey V. Elsukov * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "g_part_if.h" FEATURE(geom_part_bsd64, "GEOM partitioning class for 64-bit BSD disklabels"); /* XXX: move this to sys/disklabel64.h */ #define DISKMAGIC64 ((uint32_t)0xc4464c59) #define MAXPARTITIONS64 16 #define RESPARTITIONS64 32 struct disklabel64 { char d_reserved0[512]; /* reserved or unused */ u_int32_t d_magic; /* the magic number */ u_int32_t d_crc; /* crc32() d_magic through last part */ u_int32_t d_align; /* partition alignment requirement */ u_int32_t d_npartitions; /* number of partitions */ struct uuid d_stor_uuid; /* unique uuid for label */ u_int64_t d_total_size; /* total size incl everything (bytes) */ u_int64_t d_bbase; /* boot area base offset (bytes) */ /* boot area is pbase - bbase */ u_int64_t d_pbase; /* first allocatable offset (bytes) */ u_int64_t d_pstop; /* last allocatable offset+1 (bytes) */ u_int64_t d_abase; /* location of backup copy if not 0 */ u_char d_packname[64]; u_char d_reserved[64]; /* * Note: offsets are relative to the base of the slice, NOT to * d_pbase. 
Unlike 32 bit disklabels the on-disk format for * a 64 bit disklabel remains slice-relative. * * An uninitialized partition has a p_boffset and p_bsize of 0. * * If p_fstype is not supported for a live partition it is set * to FS_OTHER. This is typically the case when the filesystem * is identified by its uuid. */ struct partition64 { /* the partition table */ u_int64_t p_boffset; /* slice relative offset, in bytes */ u_int64_t p_bsize; /* size of partition, in bytes */ u_int8_t p_fstype; u_int8_t p_unused01; /* reserved, must be 0 */ u_int8_t p_unused02; /* reserved, must be 0 */ u_int8_t p_unused03; /* reserved, must be 0 */ u_int32_t p_unused04; /* reserved, must be 0 */ u_int32_t p_unused05; /* reserved, must be 0 */ u_int32_t p_unused06; /* reserved, must be 0 */ struct uuid p_type_uuid;/* mount type as UUID */ struct uuid p_stor_uuid;/* unique uuid for storage */ } d_partitions[MAXPARTITIONS64];/* actually may be more */ }; struct g_part_bsd64_table { struct g_part_table base; uint32_t d_align; uint64_t d_bbase; uint64_t d_abase; struct uuid d_stor_uuid; char d_reserved0[512]; u_char d_packname[64]; u_char d_reserved[64]; }; struct g_part_bsd64_entry { struct g_part_entry base; uint8_t fstype; struct uuid type_uuid; struct uuid stor_uuid; }; static int g_part_bsd64_add(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static int g_part_bsd64_bootcode(struct g_part_table *, struct g_part_parms *); static int g_part_bsd64_create(struct g_part_table *, struct g_part_parms *); static int g_part_bsd64_destroy(struct g_part_table *, struct g_part_parms *); static void g_part_bsd64_dumpconf(struct g_part_table *, struct g_part_entry *, struct sbuf *, const char *); static int g_part_bsd64_dumpto(struct g_part_table *, struct g_part_entry *); static int g_part_bsd64_modify(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static const char *g_part_bsd64_name(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_bsd64_probe(struct g_part_table *, struct g_consumer *); static int g_part_bsd64_read(struct g_part_table *, struct g_consumer *); static const char *g_part_bsd64_type(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_bsd64_write(struct g_part_table *, struct g_consumer *); static int g_part_bsd64_resize(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static kobj_method_t g_part_bsd64_methods[] = { KOBJMETHOD(g_part_add, g_part_bsd64_add), KOBJMETHOD(g_part_bootcode, g_part_bsd64_bootcode), KOBJMETHOD(g_part_create, g_part_bsd64_create), KOBJMETHOD(g_part_destroy, g_part_bsd64_destroy), KOBJMETHOD(g_part_dumpconf, g_part_bsd64_dumpconf), KOBJMETHOD(g_part_dumpto, g_part_bsd64_dumpto), KOBJMETHOD(g_part_modify, g_part_bsd64_modify), KOBJMETHOD(g_part_resize, g_part_bsd64_resize), KOBJMETHOD(g_part_name, g_part_bsd64_name), KOBJMETHOD(g_part_probe, g_part_bsd64_probe), KOBJMETHOD(g_part_read, g_part_bsd64_read), KOBJMETHOD(g_part_type, g_part_bsd64_type), KOBJMETHOD(g_part_write, g_part_bsd64_write), { 0, 0 } }; static struct g_part_scheme g_part_bsd64_scheme = { "BSD64", g_part_bsd64_methods, sizeof(struct g_part_bsd64_table), .gps_entrysz = sizeof(struct g_part_bsd64_entry), .gps_minent = MAXPARTITIONS64, .gps_maxent = MAXPARTITIONS64 }; G_PART_SCHEME_DECLARE(g_part_bsd64); +MODULE_VERSION(geom_part_bsd64, 0); #define EQUUID(a, b) (memcmp(a, b, sizeof(struct uuid)) == 0) static struct uuid bsd64_uuid_unused = GPT_ENT_TYPE_UNUSED; static struct uuid 
bsd64_uuid_dfbsd_swap = GPT_ENT_TYPE_DRAGONFLY_SWAP; static struct uuid bsd64_uuid_dfbsd_ufs1 = GPT_ENT_TYPE_DRAGONFLY_UFS1; static struct uuid bsd64_uuid_dfbsd_vinum = GPT_ENT_TYPE_DRAGONFLY_VINUM; static struct uuid bsd64_uuid_dfbsd_ccd = GPT_ENT_TYPE_DRAGONFLY_CCD; static struct uuid bsd64_uuid_dfbsd_legacy = GPT_ENT_TYPE_DRAGONFLY_LEGACY; static struct uuid bsd64_uuid_dfbsd_hammer = GPT_ENT_TYPE_DRAGONFLY_HAMMER; static struct uuid bsd64_uuid_dfbsd_hammer2 = GPT_ENT_TYPE_DRAGONFLY_HAMMER2; static struct uuid bsd64_uuid_freebsd_boot = GPT_ENT_TYPE_FREEBSD_BOOT; static struct uuid bsd64_uuid_freebsd_nandfs = GPT_ENT_TYPE_FREEBSD_NANDFS; static struct uuid bsd64_uuid_freebsd_swap = GPT_ENT_TYPE_FREEBSD_SWAP; static struct uuid bsd64_uuid_freebsd_ufs = GPT_ENT_TYPE_FREEBSD_UFS; static struct uuid bsd64_uuid_freebsd_vinum = GPT_ENT_TYPE_FREEBSD_VINUM; static struct uuid bsd64_uuid_freebsd_zfs = GPT_ENT_TYPE_FREEBSD_ZFS; struct bsd64_uuid_alias { struct uuid *uuid; uint8_t fstype; int alias; }; static struct bsd64_uuid_alias dfbsd_alias_match[] = { { &bsd64_uuid_dfbsd_swap, FS_SWAP, G_PART_ALIAS_DFBSD_SWAP }, { &bsd64_uuid_dfbsd_ufs1, FS_BSDFFS, G_PART_ALIAS_DFBSD_UFS }, { &bsd64_uuid_dfbsd_vinum, FS_VINUM, G_PART_ALIAS_DFBSD_VINUM }, { &bsd64_uuid_dfbsd_ccd, FS_CCD, G_PART_ALIAS_DFBSD_CCD }, { &bsd64_uuid_dfbsd_legacy, FS_OTHER, G_PART_ALIAS_DFBSD_LEGACY }, { &bsd64_uuid_dfbsd_hammer, FS_HAMMER, G_PART_ALIAS_DFBSD_HAMMER }, { &bsd64_uuid_dfbsd_hammer2, FS_HAMMER2, G_PART_ALIAS_DFBSD_HAMMER2 }, { NULL, 0, 0} }; static struct bsd64_uuid_alias fbsd_alias_match[] = { { &bsd64_uuid_freebsd_boot, FS_OTHER, G_PART_ALIAS_FREEBSD_BOOT }, { &bsd64_uuid_freebsd_swap, FS_OTHER, G_PART_ALIAS_FREEBSD_SWAP }, { &bsd64_uuid_freebsd_ufs, FS_OTHER, G_PART_ALIAS_FREEBSD_UFS }, { &bsd64_uuid_freebsd_zfs, FS_OTHER, G_PART_ALIAS_FREEBSD_ZFS }, { &bsd64_uuid_freebsd_vinum, FS_OTHER, G_PART_ALIAS_FREEBSD_VINUM }, { &bsd64_uuid_freebsd_nandfs, FS_OTHER, G_PART_ALIAS_FREEBSD_NANDFS }, { NULL, 0, 0} }; static int bsd64_parse_type(const char *type, struct g_part_bsd64_entry *entry) { struct uuid tmp; const struct bsd64_uuid_alias *uap; const char *alias; char *p; long lt; int error; if (type[0] == '!') { if (type[1] == '\0') return (EINVAL); lt = strtol(type + 1, &p, 0); /* The type specified as number */ if (*p == '\0') { if (lt <= 0 || lt > 255) return (EINVAL); entry->fstype = lt; entry->type_uuid = bsd64_uuid_unused; return (0); } /* The type specified as uuid */ error = parse_uuid(type + 1, &tmp); if (error != 0) return (error); if (EQUUID(&tmp, &bsd64_uuid_unused)) return (EINVAL); for (uap = &dfbsd_alias_match[0]; uap->uuid != NULL; uap++) { if (EQUUID(&tmp, uap->uuid)) { /* Prefer fstype for known uuids */ entry->type_uuid = bsd64_uuid_unused; entry->fstype = uap->fstype; return (0); } } entry->type_uuid = tmp; entry->fstype = FS_OTHER; return (0); } /* The type specified as symbolic alias name */ for (uap = &fbsd_alias_match[0]; uap->uuid != NULL; uap++) { alias = g_part_alias_name(uap->alias); if (!strcasecmp(type, alias)) { entry->type_uuid = *uap->uuid; entry->fstype = uap->fstype; return (0); } } for (uap = &dfbsd_alias_match[0]; uap->uuid != NULL; uap++) { alias = g_part_alias_name(uap->alias); if (!strcasecmp(type, alias)) { entry->type_uuid = bsd64_uuid_unused; entry->fstype = uap->fstype; return (0); } } return (EINVAL); } static int g_part_bsd64_add(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_bsd64_entry *entry; if (gpp->gpp_parms & 
G_PART_PARM_LABEL) return (EINVAL); entry = (struct g_part_bsd64_entry *)baseentry; if (bsd64_parse_type(gpp->gpp_type, entry) != 0) return (EINVAL); kern_uuidgen(&entry->stor_uuid, 1); return (0); } static int g_part_bsd64_bootcode(struct g_part_table *basetable, struct g_part_parms *gpp) { return (EOPNOTSUPP); } #define PALIGN_SIZE (1024 * 1024) #define PALIGN_MASK (PALIGN_SIZE - 1) #define BLKSIZE (4 * 1024) #define BOOTSIZE (32 * 1024) #define DALIGN_SIZE (32 * 1024) static int g_part_bsd64_create(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_part_bsd64_table *table; struct g_part_entry *baseentry; struct g_provider *pp; uint64_t blkmask, pbase; uint32_t blksize, ressize; pp = gpp->gpp_provider; if (pp->mediasize < 2* PALIGN_SIZE) return (ENOSPC); /* * Use at least 4KB block size. Blksize is stored in the d_align. * XXX: Actually it is used just for calculate d_bbase and used * for better alignment in bsdlabel64(8). */ blksize = pp->sectorsize < BLKSIZE ? BLKSIZE: pp->sectorsize; blkmask = blksize - 1; /* Reserve enough space for RESPARTITIONS64 partitions. */ ressize = offsetof(struct disklabel64, d_partitions[RESPARTITIONS64]); ressize = (ressize + blkmask) & ~blkmask; /* * Reserve enough space for bootcode and align first allocatable * offset to PALIGN_SIZE. * XXX: Currently DragonFlyBSD has 32KB bootcode, but the size could * be bigger, because it is possible change it (it is equal pbase-bbase) * in the bsdlabel64(8). */ pbase = ressize + ((BOOTSIZE + blkmask) & ~blkmask); pbase = (pbase + PALIGN_MASK) & ~PALIGN_MASK; /* * Take physical offset into account and make first allocatable * offset 32KB aligned to the start of the physical disk. * XXX: Actually there are no such restrictions, this is how * DragonFlyBSD behaves. */ pbase += DALIGN_SIZE - pp->stripeoffset % DALIGN_SIZE; table = (struct g_part_bsd64_table *)basetable; table->d_align = blksize; table->d_bbase = ressize / pp->sectorsize; table->d_abase = ((pp->mediasize - ressize) & ~blkmask) / pp->sectorsize; kern_uuidgen(&table->d_stor_uuid, 1); basetable->gpt_first = pbase / pp->sectorsize; basetable->gpt_last = table->d_abase - 1; /* XXX */ /* * Create 'c' partition and make it internal, so user will not be * able use it. 
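	 * The same convention is used by the 32-bit BSD scheme above: the
	 * entry at index RAW_PART + 1 (the "c" partition) is flagged
	 * gpe_internal, which keeps g_part from exposing it to user
	 * operations such as modify or delete.  Assuming RAW_PART is the
	 * usual index 2, the entry names itself "c" via the
	 * 'a' + gpe_index - 1 computation in g_part_bsd64_name() below.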
*/
	baseentry = g_part_new_entry(basetable, RAW_PART + 1, 0, 0);
	baseentry->gpe_internal = 1;
	return (0);
}

static int
g_part_bsd64_destroy(struct g_part_table *basetable, struct g_part_parms *gpp)
{
	struct g_provider *pp;

	pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider;
	if (pp->sectorsize > offsetof(struct disklabel64, d_magic))
		basetable->gpt_smhead |= 1;
	else
		basetable->gpt_smhead |= 3;
	return (0);
}

static void
g_part_bsd64_dumpconf(struct g_part_table *basetable,
    struct g_part_entry *baseentry, struct sbuf *sb, const char *indent)
{
	struct g_part_bsd64_table *table;
	struct g_part_bsd64_entry *entry;
	char buf[sizeof(table->d_packname)];

	entry = (struct g_part_bsd64_entry *)baseentry;
	if (indent == NULL) {
		/* conftxt: libdisk compatibility */
		sbuf_printf(sb, " xs BSD64 xt %u", entry->fstype);
	} else if (entry != NULL) {
		/* confxml: partition entry information */
		sbuf_printf(sb, "%s<rawtype>%u</rawtype>\n", indent,
		    entry->fstype);
		if (!EQUUID(&bsd64_uuid_unused, &entry->type_uuid)) {
			sbuf_printf(sb, "%s<type_uuid>", indent);
			sbuf_printf_uuid(sb, &entry->type_uuid);
			sbuf_printf(sb, "</type_uuid>\n");
		}
		sbuf_printf(sb, "%s<stor_uuid>", indent);
		sbuf_printf_uuid(sb, &entry->stor_uuid);
		sbuf_printf(sb, "</stor_uuid>\n");
	} else {
		/* confxml: scheme information */
		table = (struct g_part_bsd64_table *)basetable;
		sbuf_printf(sb, "%s<bootbase>%ju</bootbase>\n", indent,
		    (uintmax_t)table->d_bbase);
		if (table->d_abase)
			sbuf_printf(sb, "%s<backupbase>%ju</backupbase>\n",
			    indent, (uintmax_t)table->d_abase);
		sbuf_printf(sb, "%s<disk_uuid>", indent);
		sbuf_printf_uuid(sb, &table->d_stor_uuid);
		sbuf_printf(sb, "</disk_uuid>\n");
		sbuf_printf(sb, "%s<label>", indent);
		strncpy(buf, table->d_packname, sizeof(buf));
		buf[sizeof(buf) - 1] = '\0';
		g_conf_printf_escaped(sb, "%s", buf);
		sbuf_printf(sb, "</label>\n");
	}
}

static int
g_part_bsd64_dumpto(struct g_part_table *table,
    struct g_part_entry *baseentry)
{
	struct g_part_bsd64_entry *entry;

	/* Allow dumping to a swap partition. */
	entry = (struct g_part_bsd64_entry *)baseentry;
	if (entry->fstype == FS_SWAP ||
	    EQUUID(&entry->type_uuid, &bsd64_uuid_dfbsd_swap) ||
	    EQUUID(&entry->type_uuid, &bsd64_uuid_freebsd_swap))
		return (1);
	return (0);
}

static int
g_part_bsd64_modify(struct g_part_table *basetable,
    struct g_part_entry *baseentry, struct g_part_parms *gpp)
{
	struct g_part_bsd64_entry *entry;

	if (gpp->gpp_parms & G_PART_PARM_LABEL)
		return (EINVAL);
	entry = (struct g_part_bsd64_entry *)baseentry;
	if (gpp->gpp_parms & G_PART_PARM_TYPE)
		return (bsd64_parse_type(gpp->gpp_type, entry));
	return (0);
}

static int
g_part_bsd64_resize(struct g_part_table *basetable,
    struct g_part_entry *baseentry, struct g_part_parms *gpp)
{
	struct g_part_bsd64_table *table;
	struct g_provider *pp;

	if (baseentry == NULL) {
		pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider;
		table = (struct g_part_bsd64_table *)basetable;
		table->d_abase = rounddown2(pp->mediasize -
		    table->d_bbase * pp->sectorsize,
		    table->d_align) / pp->sectorsize;
		basetable->gpt_last = table->d_abase - 1;
		return (0);
	}
	baseentry->gpe_end = baseentry->gpe_start + gpp->gpp_size - 1;
	return (0);
}

static const char *
g_part_bsd64_name(struct g_part_table *table, struct g_part_entry *baseentry,
    char *buf, size_t bufsz)
{

	snprintf(buf, bufsz, "%c", 'a' + baseentry->gpe_index - 1);
	return (buf);
}

static int
g_part_bsd64_probe(struct g_part_table *table, struct g_consumer *cp)
{
	struct g_provider *pp;
	uint32_t v;
	int error;
	u_char *buf;

	pp = cp->provider;
	if (pp->mediasize < 2 * PALIGN_SIZE)
		return (ENOSPC);
	v = rounddown2(pp->sectorsize + offsetof(struct disklabel64, d_magic),
	    pp->sectorsize);
	buf = g_read_data(cp, 0, v, &error);
	if (buf == NULL)
		return (error);
	v = le32dec(buf + offsetof(struct disklabel64, d_magic));
	g_free(buf);
	return (v == DISKMAGIC64 ?
G_PART_PROBE_PRI_HIGH: ENXIO); } static int g_part_bsd64_read(struct g_part_table *basetable, struct g_consumer *cp) { struct g_part_bsd64_table *table; struct g_part_bsd64_entry *entry; struct g_part_entry *baseentry; struct g_provider *pp; struct disklabel64 *dlp; uint64_t v64, sz; uint32_t v32; int error, index; u_char *buf; pp = cp->provider; table = (struct g_part_bsd64_table *)basetable; v32 = roundup2(sizeof(struct disklabel64), pp->sectorsize); buf = g_read_data(cp, 0, v32, &error); if (buf == NULL) return (error); dlp = (struct disklabel64 *)buf; basetable->gpt_entries = le32toh(dlp->d_npartitions); if (basetable->gpt_entries > MAXPARTITIONS64 || basetable->gpt_entries < 1) goto invalid_label; v32 = le32toh(dlp->d_crc); dlp->d_crc = 0; if (crc32(&dlp->d_magic, offsetof(struct disklabel64, d_partitions[basetable->gpt_entries]) - offsetof(struct disklabel64, d_magic)) != v32) goto invalid_label; table->d_align = le32toh(dlp->d_align); if (table->d_align == 0 || (table->d_align & (pp->sectorsize - 1))) goto invalid_label; if (le64toh(dlp->d_total_size) > pp->mediasize) goto invalid_label; v64 = le64toh(dlp->d_pbase); if (v64 % pp->sectorsize) goto invalid_label; basetable->gpt_first = v64 / pp->sectorsize; v64 = le64toh(dlp->d_pstop); if (v64 % pp->sectorsize) goto invalid_label; basetable->gpt_last = v64 / pp->sectorsize; basetable->gpt_isleaf = 1; v64 = le64toh(dlp->d_bbase); if (v64 % pp->sectorsize) goto invalid_label; table->d_bbase = v64 / pp->sectorsize; v64 = le64toh(dlp->d_abase); if (v64 % pp->sectorsize) goto invalid_label; table->d_abase = v64 / pp->sectorsize; le_uuid_dec(&dlp->d_stor_uuid, &table->d_stor_uuid); for (index = basetable->gpt_entries - 1; index >= 0; index--) { if (index == RAW_PART) { /* Skip 'c' partition. */ baseentry = g_part_new_entry(basetable, index + 1, 0, 0); baseentry->gpe_internal = 1; continue; } v64 = le64toh(dlp->d_partitions[index].p_boffset); sz = le64toh(dlp->d_partitions[index].p_bsize); if (sz == 0 && v64 == 0) continue; if (sz == 0 || (v64 % pp->sectorsize) || (sz % pp->sectorsize)) goto invalid_label; baseentry = g_part_new_entry(basetable, index + 1, v64 / pp->sectorsize, (v64 + sz) / pp->sectorsize - 1); entry = (struct g_part_bsd64_entry *)baseentry; le_uuid_dec(&dlp->d_partitions[index].p_type_uuid, &entry->type_uuid); le_uuid_dec(&dlp->d_partitions[index].p_stor_uuid, &entry->stor_uuid); entry->fstype = dlp->d_partitions[index].p_fstype; } bcopy(dlp->d_reserved0, table->d_reserved0, sizeof(table->d_reserved0)); bcopy(dlp->d_packname, table->d_packname, sizeof(table->d_packname)); bcopy(dlp->d_reserved, table->d_reserved, sizeof(table->d_reserved)); g_free(buf); return (0); invalid_label: g_free(buf); return (EINVAL); } static const char * g_part_bsd64_type(struct g_part_table *basetable, struct g_part_entry *baseentry, char *buf, size_t bufsz) { struct g_part_bsd64_entry *entry; struct bsd64_uuid_alias *uap; entry = (struct g_part_bsd64_entry *)baseentry; if (entry->fstype != FS_OTHER) { for (uap = &dfbsd_alias_match[0]; uap->uuid != NULL; uap++) if (uap->fstype == entry->fstype) return (g_part_alias_name(uap->alias)); } else { for (uap = &fbsd_alias_match[0]; uap->uuid != NULL; uap++) if (EQUUID(uap->uuid, &entry->type_uuid)) return (g_part_alias_name(uap->alias)); for (uap = &dfbsd_alias_match[0]; uap->uuid != NULL; uap++) if (EQUUID(uap->uuid, &entry->type_uuid)) return (g_part_alias_name(uap->alias)); } if (EQUUID(&bsd64_uuid_unused, &entry->type_uuid)) snprintf(buf, bufsz, "!%d", entry->fstype); else { buf[0] = '!'; 
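		/*
		 * No alias matched: render the type the same way
		 * bsd64_parse_type() accepts it, as '!' followed by the raw
		 * UUID string, e.g. (illustrative value only)
		 * "!01234567-89ab-cdef-0123-456789abcdef".
		 */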
snprintf_uuid(buf + 1, bufsz - 1, &entry->type_uuid); } return (buf); } static int g_part_bsd64_write(struct g_part_table *basetable, struct g_consumer *cp) { struct g_provider *pp; struct g_part_entry *baseentry; struct g_part_bsd64_entry *entry; struct g_part_bsd64_table *table; struct disklabel64 *dlp; uint32_t v, sz; int error, index; pp = cp->provider; table = (struct g_part_bsd64_table *)basetable; sz = roundup2(sizeof(struct disklabel64), pp->sectorsize); dlp = g_malloc(sz, M_WAITOK | M_ZERO); memcpy(dlp->d_reserved0, table->d_reserved0, sizeof(table->d_reserved0)); memcpy(dlp->d_packname, table->d_packname, sizeof(table->d_packname)); memcpy(dlp->d_reserved, table->d_reserved, sizeof(table->d_reserved)); le32enc(&dlp->d_magic, DISKMAGIC64); le32enc(&dlp->d_align, table->d_align); le32enc(&dlp->d_npartitions, basetable->gpt_entries); le_uuid_enc(&dlp->d_stor_uuid, &table->d_stor_uuid); le64enc(&dlp->d_total_size, pp->mediasize); le64enc(&dlp->d_bbase, table->d_bbase * pp->sectorsize); le64enc(&dlp->d_pbase, basetable->gpt_first * pp->sectorsize); le64enc(&dlp->d_pstop, basetable->gpt_last * pp->sectorsize); le64enc(&dlp->d_abase, table->d_abase * pp->sectorsize); LIST_FOREACH(baseentry, &basetable->gpt_entry, gpe_entry) { if (baseentry->gpe_deleted) continue; index = baseentry->gpe_index - 1; entry = (struct g_part_bsd64_entry *)baseentry; if (index == RAW_PART) continue; le64enc(&dlp->d_partitions[index].p_boffset, baseentry->gpe_start * pp->sectorsize); le64enc(&dlp->d_partitions[index].p_bsize, pp->sectorsize * (baseentry->gpe_end - baseentry->gpe_start + 1)); dlp->d_partitions[index].p_fstype = entry->fstype; le_uuid_enc(&dlp->d_partitions[index].p_type_uuid, &entry->type_uuid); le_uuid_enc(&dlp->d_partitions[index].p_stor_uuid, &entry->stor_uuid); } /* Calculate checksum. */ v = offsetof(struct disklabel64, d_partitions[basetable->gpt_entries]) - offsetof(struct disklabel64, d_magic); le32enc(&dlp->d_crc, crc32(&dlp->d_magic, v)); error = g_write_data(cp, 0, dlp, sz); g_free(dlp); return (error); } Index: user/markj/netdump/sys/geom/part/g_part_ebr.c =================================================================== --- user/markj/netdump/sys/geom/part/g_part_ebr.c (revision 332407) +++ user/markj/netdump/sys/geom/part/g_part_ebr.c (revision 332408) @@ -1,696 +1,697 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2007-2009 Marcel Moolenaar * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "opt_geom.h" #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "g_part_if.h" FEATURE(geom_part_ebr, "GEOM partitioning class for extended boot records support"); #if defined(GEOM_PART_EBR_COMPAT) FEATURE(geom_part_ebr_compat, "GEOM EBR partitioning class: backward-compatible partition names"); #endif #define EBRSIZE 512 struct g_part_ebr_table { struct g_part_table base; #ifndef GEOM_PART_EBR_COMPAT u_char ebr[EBRSIZE]; #endif }; struct g_part_ebr_entry { struct g_part_entry base; struct dos_partition ent; }; static int g_part_ebr_add(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static int g_part_ebr_create(struct g_part_table *, struct g_part_parms *); static int g_part_ebr_destroy(struct g_part_table *, struct g_part_parms *); static void g_part_ebr_dumpconf(struct g_part_table *, struct g_part_entry *, struct sbuf *, const char *); static int g_part_ebr_dumpto(struct g_part_table *, struct g_part_entry *); #if defined(GEOM_PART_EBR_COMPAT) static void g_part_ebr_fullname(struct g_part_table *, struct g_part_entry *, struct sbuf *, const char *); #endif static int g_part_ebr_modify(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static const char *g_part_ebr_name(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_ebr_precheck(struct g_part_table *, enum g_part_ctl, struct g_part_parms *); static int g_part_ebr_probe(struct g_part_table *, struct g_consumer *); static int g_part_ebr_read(struct g_part_table *, struct g_consumer *); static int g_part_ebr_setunset(struct g_part_table *, struct g_part_entry *, const char *, unsigned int); static const char *g_part_ebr_type(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_ebr_write(struct g_part_table *, struct g_consumer *); static int g_part_ebr_resize(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static kobj_method_t g_part_ebr_methods[] = { KOBJMETHOD(g_part_add, g_part_ebr_add), KOBJMETHOD(g_part_create, g_part_ebr_create), KOBJMETHOD(g_part_destroy, g_part_ebr_destroy), KOBJMETHOD(g_part_dumpconf, g_part_ebr_dumpconf), KOBJMETHOD(g_part_dumpto, g_part_ebr_dumpto), #if defined(GEOM_PART_EBR_COMPAT) KOBJMETHOD(g_part_fullname, g_part_ebr_fullname), #endif KOBJMETHOD(g_part_modify, g_part_ebr_modify), KOBJMETHOD(g_part_name, g_part_ebr_name), KOBJMETHOD(g_part_precheck, g_part_ebr_precheck), KOBJMETHOD(g_part_probe, g_part_ebr_probe), KOBJMETHOD(g_part_read, g_part_ebr_read), KOBJMETHOD(g_part_resize, g_part_ebr_resize), KOBJMETHOD(g_part_setunset, g_part_ebr_setunset), KOBJMETHOD(g_part_type, g_part_ebr_type), KOBJMETHOD(g_part_write, g_part_ebr_write), { 0, 0 } }; static struct g_part_scheme g_part_ebr_scheme = { "EBR", g_part_ebr_methods, sizeof(struct g_part_ebr_table), .gps_entrysz = sizeof(struct g_part_ebr_entry), .gps_minent = 1, .gps_maxent = INT_MAX, }; 
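/*
 * For orientation (a sketch, not normative): each logical partition sits
 * behind its own EBR sector.  Entry 0 of that sector describes the
 * partition itself, with dp_start relative to the EBR; entry 1, when
 * present, is a type-5 link whose dp_start is relative to the start of
 * the extended partition and points at the next EBR.  With one-track
 * (gpt_sectors) alignment, an illustrative chain looks like:
 *
 *   LBA 0     EBR 1: [data @ +sectors, size s1] [link -> lba2]
 *   LBA lba2  EBR 2: [data @ +sectors, size s2] [link -> lba3]
 *   LBA lba3  EBR 3: [data @ +sectors, size s3] [empty]
 *
 * g_part_ebr_read() below walks exactly this chain, and
 * g_part_ebr_write() re-creates it via ebr_entry_link().
 */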
G_PART_SCHEME_DECLARE(g_part_ebr); +MODULE_VERSION(geom_part_ebr, 0); static struct g_part_ebr_alias { u_char typ; int alias; } ebr_alias_match[] = { { DOSPTYP_386BSD, G_PART_ALIAS_FREEBSD }, { DOSPTYP_NTFS, G_PART_ALIAS_MS_NTFS }, { DOSPTYP_FAT32, G_PART_ALIAS_MS_FAT32 }, { DOSPTYP_LINSWP, G_PART_ALIAS_LINUX_SWAP }, { DOSPTYP_LINUX, G_PART_ALIAS_LINUX_DATA }, { DOSPTYP_LINLVM, G_PART_ALIAS_LINUX_LVM }, { DOSPTYP_LINRAID, G_PART_ALIAS_LINUX_RAID }, }; static void ebr_set_chs(struct g_part_table *, uint32_t, u_char *, u_char *, u_char *); static void ebr_entry_decode(const char *p, struct dos_partition *ent) { ent->dp_flag = p[0]; ent->dp_shd = p[1]; ent->dp_ssect = p[2]; ent->dp_scyl = p[3]; ent->dp_typ = p[4]; ent->dp_ehd = p[5]; ent->dp_esect = p[6]; ent->dp_ecyl = p[7]; ent->dp_start = le32dec(p + 8); ent->dp_size = le32dec(p + 12); } static void ebr_entry_link(struct g_part_table *table, uint32_t start, uint32_t end, u_char *buf) { buf[0] = 0 /* dp_flag */; ebr_set_chs(table, start, &buf[3] /* dp_scyl */, &buf[1] /* dp_shd */, &buf[2] /* dp_ssect */); buf[4] = 5 /* dp_typ */; ebr_set_chs(table, end, &buf[7] /* dp_ecyl */, &buf[5] /* dp_ehd */, &buf[6] /* dp_esect */); le32enc(buf + 8, start); le32enc(buf + 12, end - start + 1); } static int ebr_parse_type(const char *type, u_char *dp_typ) { const char *alias; char *endp; long lt; int i; if (type[0] == '!') { lt = strtol(type + 1, &endp, 0); if (type[1] == '\0' || *endp != '\0' || lt <= 0 || lt >= 256) return (EINVAL); *dp_typ = (u_char)lt; return (0); } for (i = 0; i < nitems(ebr_alias_match); i++) { alias = g_part_alias_name(ebr_alias_match[i].alias); if (strcasecmp(type, alias) == 0) { *dp_typ = ebr_alias_match[i].typ; return (0); } } return (EINVAL); } static void ebr_set_chs(struct g_part_table *table, uint32_t lba, u_char *cylp, u_char *hdp, u_char *secp) { uint32_t cyl, hd, sec; sec = lba % table->gpt_sectors + 1; lba /= table->gpt_sectors; hd = lba % table->gpt_heads; lba /= table->gpt_heads; cyl = lba; if (cyl > 1023) sec = hd = cyl = ~0; *cylp = cyl & 0xff; *hdp = hd & 0xff; *secp = (sec & 0x3f) | ((cyl >> 2) & 0xc0); } static int ebr_align(struct g_part_table *basetable, uint32_t *start, uint32_t *size) { uint32_t sectors; sectors = basetable->gpt_sectors; if (*size < 2 * sectors) return (EINVAL); if (*start % sectors) { *size += (*start % sectors) - sectors; *start -= (*start % sectors) - sectors; } if (*size % sectors) *size -= (*size % sectors); if (*size < 2 * sectors) return (EINVAL); return (0); } static int g_part_ebr_add(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_provider *pp; struct g_part_ebr_entry *entry; uint32_t start, size; if (gpp->gpp_parms & G_PART_PARM_LABEL) return (EINVAL); pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider; entry = (struct g_part_ebr_entry *)baseentry; start = gpp->gpp_start; size = gpp->gpp_size; if (ebr_align(basetable, &start, &size) != 0) return (EINVAL); if (baseentry->gpe_deleted) bzero(&entry->ent, sizeof(entry->ent)); KASSERT(baseentry->gpe_start <= start, ("%s", __func__)); KASSERT(baseentry->gpe_end >= start + size - 1, ("%s", __func__)); baseentry->gpe_index = (start / basetable->gpt_sectors) + 1; baseentry->gpe_offset = (off_t)(start + basetable->gpt_sectors) * pp->sectorsize; baseentry->gpe_start = start; baseentry->gpe_end = start + size - 1; entry->ent.dp_start = basetable->gpt_sectors; entry->ent.dp_size = size - basetable->gpt_sectors; ebr_set_chs(basetable, entry->ent.dp_start, &entry->ent.dp_scyl, 
	    &entry->ent.dp_shd, &entry->ent.dp_ssect);
	ebr_set_chs(basetable, baseentry->gpe_end, &entry->ent.dp_ecyl,
	    &entry->ent.dp_ehd, &entry->ent.dp_esect);
	return (ebr_parse_type(gpp->gpp_type, &entry->ent.dp_typ));
}

static int
g_part_ebr_create(struct g_part_table *basetable, struct g_part_parms *gpp)
{
	char type[64];
	struct g_consumer *cp;
	struct g_provider *pp;
	uint32_t msize;
	int error;

	pp = gpp->gpp_provider;
	if (pp->sectorsize < EBRSIZE)
		return (ENOSPC);
	if (pp->sectorsize > 4096)
		return (ENXIO);

	/* Check that we have a parent and that it's a MBR. */
	if (basetable->gpt_depth == 0)
		return (ENXIO);
	cp = LIST_FIRST(&pp->consumers);
	error = g_getattr("PART::scheme", cp, &type);
	if (error != 0)
		return (error);
	if (strcmp(type, "MBR") != 0)
		return (ENXIO);
	error = g_getattr("PART::type", cp, &type);
	if (error != 0)
		return (error);
	if (strcmp(type, "ebr") != 0)
		return (ENXIO);

	msize = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX);
	basetable->gpt_first = 0;
	basetable->gpt_last = msize - 1;
	basetable->gpt_entries = msize / basetable->gpt_sectors;
	return (0);
}

static int
g_part_ebr_destroy(struct g_part_table *basetable, struct g_part_parms *gpp)
{

	/* Wipe the first sector to clear the partitioning. */
	basetable->gpt_smhead |= 1;
	return (0);
}

static void
g_part_ebr_dumpconf(struct g_part_table *table, struct g_part_entry *baseentry,
    struct sbuf *sb, const char *indent)
{
	struct g_part_ebr_entry *entry;

	entry = (struct g_part_ebr_entry *)baseentry;
	if (indent == NULL) {
		/* conftxt: libdisk compatibility */
		sbuf_printf(sb, " xs MBREXT xt %u", entry->ent.dp_typ);
	} else if (entry != NULL) {
		/* confxml: partition entry information */
		sbuf_printf(sb, "%s<rawtype>%u</rawtype>\n", indent,
		    entry->ent.dp_typ);
		if (entry->ent.dp_flag & 0x80)
			sbuf_printf(sb, "%s<attrib>active</attrib>\n", indent);
	} else {
		/* confxml: scheme information */
	}
}

static int
g_part_ebr_dumpto(struct g_part_table *table, struct g_part_entry *baseentry)
{
	struct g_part_ebr_entry *entry;

	/* Allow dumping to a FreeBSD partition or Linux swap partition only. */
	entry = (struct g_part_ebr_entry *)baseentry;
	return ((entry->ent.dp_typ == DOSPTYP_386BSD ||
	    entry->ent.dp_typ == DOSPTYP_LINSWP) ?
1 : 0); } #if defined(GEOM_PART_EBR_COMPAT) static void g_part_ebr_fullname(struct g_part_table *table, struct g_part_entry *entry, struct sbuf *sb, const char *pfx) { struct g_part_entry *iter; u_int idx; idx = 5; LIST_FOREACH(iter, &table->gpt_entry, gpe_entry) { if (iter == entry) break; idx++; } sbuf_printf(sb, "%.*s%u", (int)strlen(pfx) - 1, pfx, idx); } #endif static int g_part_ebr_modify(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_ebr_entry *entry; if (gpp->gpp_parms & G_PART_PARM_LABEL) return (EINVAL); entry = (struct g_part_ebr_entry *)baseentry; if (gpp->gpp_parms & G_PART_PARM_TYPE) return (ebr_parse_type(gpp->gpp_type, &entry->ent.dp_typ)); return (0); } static int g_part_ebr_resize(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_provider *pp; if (baseentry != NULL) return (EOPNOTSUPP); pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider; basetable->gpt_last = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX) - 1; return (0); } static const char * g_part_ebr_name(struct g_part_table *table, struct g_part_entry *entry, char *buf, size_t bufsz) { snprintf(buf, bufsz, "+%08u", entry->gpe_index); return (buf); } static int g_part_ebr_precheck(struct g_part_table *table, enum g_part_ctl req, struct g_part_parms *gpp) { #if defined(GEOM_PART_EBR_COMPAT) if (req == G_PART_CTL_DESTROY) return (0); return (ECANCELED); #else /* * The index is a function of the start of the partition. * This is not something the user can override, nor is it * something the common code will do right. We can set the * index now so that we get what we need. */ if (req == G_PART_CTL_ADD) gpp->gpp_index = (gpp->gpp_start / table->gpt_sectors) + 1; return (0); #endif } static int g_part_ebr_probe(struct g_part_table *table, struct g_consumer *cp) { char type[64]; struct g_provider *pp; u_char *buf, *p; int error, index, res; uint16_t magic; pp = cp->provider; /* Sanity-check the provider. */ if (pp->sectorsize < EBRSIZE || pp->mediasize < pp->sectorsize) return (ENOSPC); if (pp->sectorsize > 4096) return (ENXIO); /* Check that we have a parent and that it's a MBR. */ if (table->gpt_depth == 0) return (ENXIO); error = g_getattr("PART::scheme", cp, &type); if (error != 0) return (error); if (strcmp(type, "MBR") != 0) return (ENXIO); /* Check that partition has type DOSPTYP_EBR. */ error = g_getattr("PART::type", cp, &type); if (error != 0) return (error); if (strcmp(type, "ebr") != 0) return (ENXIO); /* Check that there's a EBR. */ buf = g_read_data(cp, 0L, pp->sectorsize, &error); if (buf == NULL) return (error); /* We goto out on mismatch. 
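 Two checks follow: the sector must carry the 0x55 0xaa boot signature
 (DOSMAGIC at DOSMAGICOFFSET), and the status byte of the first two
 entry slots must be either 0x00 or 0x80, the only values the flag
 field can legally take.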
*/ res = ENXIO; magic = le16dec(buf + DOSMAGICOFFSET); if (magic != DOSMAGIC) goto out; for (index = 0; index < 2; index++) { p = buf + DOSPARTOFF + index * DOSPARTSIZE; if (p[0] != 0 && p[0] != 0x80) goto out; } res = G_PART_PROBE_PRI_NORM; out: g_free(buf); return (res); } static int g_part_ebr_read(struct g_part_table *basetable, struct g_consumer *cp) { struct dos_partition ent[2]; struct g_provider *pp; struct g_part_entry *baseentry; struct g_part_ebr_table *table; struct g_part_ebr_entry *entry; u_char *buf; off_t ofs, msize; u_int lba; int error, index; pp = cp->provider; table = (struct g_part_ebr_table *)basetable; msize = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX); lba = 0; while (1) { ofs = (off_t)lba * pp->sectorsize; buf = g_read_data(cp, ofs, pp->sectorsize, &error); if (buf == NULL) return (error); ebr_entry_decode(buf + DOSPARTOFF + 0 * DOSPARTSIZE, ent + 0); ebr_entry_decode(buf + DOSPARTOFF + 1 * DOSPARTSIZE, ent + 1); /* The 3rd & 4th entries should be zeroes. */ if (le64dec(buf + DOSPARTOFF + 2 * DOSPARTSIZE) + le64dec(buf + DOSPARTOFF + 3 * DOSPARTSIZE) != 0) { basetable->gpt_corrupt = 1; printf("GEOM: %s: invalid entries in the EBR ignored.\n", pp->name); } #ifndef GEOM_PART_EBR_COMPAT /* Save the first EBR, it can contain a boot code */ if (lba == 0) bcopy(buf, table->ebr, sizeof(table->ebr)); #endif g_free(buf); if (ent[0].dp_typ == 0) break; if (ent[0].dp_typ == 5 && ent[1].dp_typ == 0) { lba = ent[0].dp_start; continue; } index = (lba / basetable->gpt_sectors) + 1; baseentry = (struct g_part_entry *)g_part_new_entry(basetable, index, lba, lba + ent[0].dp_start + ent[0].dp_size - 1); baseentry->gpe_offset = (off_t)(lba + ent[0].dp_start) * pp->sectorsize; entry = (struct g_part_ebr_entry *)baseentry; entry->ent = ent[0]; if (ent[1].dp_typ == 0) break; lba = ent[1].dp_start; } basetable->gpt_entries = msize / basetable->gpt_sectors; basetable->gpt_first = 0; basetable->gpt_last = msize - 1; return (0); } static int g_part_ebr_setunset(struct g_part_table *table, struct g_part_entry *baseentry, const char *attrib, unsigned int set) { struct g_part_entry *iter; struct g_part_ebr_entry *entry; int changed; if (baseentry == NULL) return (ENODEV); if (strcasecmp(attrib, "active") != 0) return (EINVAL); /* Only one entry can have the active attribute. 
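	 * For example (hypothetical names), `gpart set -a active -i 1
	 * da0s2` would set the 0x80 status flag on entry 1, and the loop
	 * below would clear it from every other slice.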
*/ LIST_FOREACH(iter, &table->gpt_entry, gpe_entry) { if (iter->gpe_deleted) continue; changed = 0; entry = (struct g_part_ebr_entry *)iter; if (iter == baseentry) { if (set && (entry->ent.dp_flag & 0x80) == 0) { entry->ent.dp_flag |= 0x80; changed = 1; } else if (!set && (entry->ent.dp_flag & 0x80)) { entry->ent.dp_flag &= ~0x80; changed = 1; } } else { if (set && (entry->ent.dp_flag & 0x80)) { entry->ent.dp_flag &= ~0x80; changed = 1; } } if (changed && !iter->gpe_created) iter->gpe_modified = 1; } return (0); } static const char * g_part_ebr_type(struct g_part_table *basetable, struct g_part_entry *baseentry, char *buf, size_t bufsz) { struct g_part_ebr_entry *entry; int i; entry = (struct g_part_ebr_entry *)baseentry; for (i = 0; i < nitems(ebr_alias_match); i++) { if (ebr_alias_match[i].typ == entry->ent.dp_typ) return (g_part_alias_name(ebr_alias_match[i].alias)); } snprintf(buf, bufsz, "!%d", entry->ent.dp_typ); return (buf); } static int g_part_ebr_write(struct g_part_table *basetable, struct g_consumer *cp) { #ifndef GEOM_PART_EBR_COMPAT struct g_part_ebr_table *table; #endif struct g_provider *pp; struct g_part_entry *baseentry, *next; struct g_part_ebr_entry *entry; u_char *buf; u_char *p; int error; pp = cp->provider; buf = g_malloc(pp->sectorsize, M_WAITOK | M_ZERO); #ifndef GEOM_PART_EBR_COMPAT table = (struct g_part_ebr_table *)basetable; bcopy(table->ebr, buf, DOSPARTOFF); #endif le16enc(buf + DOSMAGICOFFSET, DOSMAGIC); baseentry = LIST_FIRST(&basetable->gpt_entry); while (baseentry != NULL && baseentry->gpe_deleted) baseentry = LIST_NEXT(baseentry, gpe_entry); /* Wipe-out the first EBR when there are no slices. */ if (baseentry == NULL) { error = g_write_data(cp, 0, buf, pp->sectorsize); goto out; } /* * If the first partition is not in LBA 0, we need to * put a "link" EBR in LBA 0. */ if (baseentry->gpe_start != 0) { ebr_entry_link(basetable, (uint32_t)baseentry->gpe_start, (uint32_t)baseentry->gpe_end, buf + DOSPARTOFF); error = g_write_data(cp, 0, buf, pp->sectorsize); if (error) goto out; } do { entry = (struct g_part_ebr_entry *)baseentry; p = buf + DOSPARTOFF; p[0] = entry->ent.dp_flag; p[1] = entry->ent.dp_shd; p[2] = entry->ent.dp_ssect; p[3] = entry->ent.dp_scyl; p[4] = entry->ent.dp_typ; p[5] = entry->ent.dp_ehd; p[6] = entry->ent.dp_esect; p[7] = entry->ent.dp_ecyl; le32enc(p + 8, entry->ent.dp_start); le32enc(p + 12, entry->ent.dp_size); next = LIST_NEXT(baseentry, gpe_entry); while (next != NULL && next->gpe_deleted) next = LIST_NEXT(next, gpe_entry); p += DOSPARTSIZE; if (next != NULL) ebr_entry_link(basetable, (uint32_t)next->gpe_start, (uint32_t)next->gpe_end, p); else bzero(p, DOSPARTSIZE); error = g_write_data(cp, baseentry->gpe_start * pp->sectorsize, buf, pp->sectorsize); #ifndef GEOM_PART_EBR_COMPAT if (baseentry->gpe_start == 0) bzero(buf, DOSPARTOFF); #endif baseentry = next; } while (!error && baseentry != NULL); out: g_free(buf); return (error); } Index: user/markj/netdump/sys/geom/part/g_part_gpt.c =================================================================== --- user/markj/netdump/sys/geom/part/g_part_gpt.c (revision 332407) +++ user/markj/netdump/sys/geom/part/g_part_gpt.c (revision 332408) @@ -1,1410 +1,1411 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2002, 2005-2007, 2011 Marcel Moolenaar * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "g_part_if.h" FEATURE(geom_part_gpt, "GEOM partitioning class for GPT partitions support"); CTASSERT(offsetof(struct gpt_hdr, padding) == 92); CTASSERT(sizeof(struct gpt_ent) == 128); #define EQUUID(a,b) (memcmp(a, b, sizeof(struct uuid)) == 0) #define MBRSIZE 512 enum gpt_elt { GPT_ELT_PRIHDR, GPT_ELT_PRITBL, GPT_ELT_SECHDR, GPT_ELT_SECTBL, GPT_ELT_COUNT }; enum gpt_state { GPT_STATE_UNKNOWN, /* Not determined. */ GPT_STATE_MISSING, /* No signature found. */ GPT_STATE_CORRUPT, /* Checksum mismatch. */ GPT_STATE_INVALID, /* Nonconformant/invalid. */ GPT_STATE_OK /* Perfectly fine. 
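 * gpt_read_hdr() below moves an element through these states as each
 * check passes: MISSING until the "EFI PART" signature is found,
 * CORRUPT until the header size and self-CRC are verified, INVALID
 * until the revision, LBAs and table geometry are validated, and
 * finally OK.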
*/ }; struct g_part_gpt_table { struct g_part_table base; u_char mbr[MBRSIZE]; struct gpt_hdr *hdr; quad_t lba[GPT_ELT_COUNT]; enum gpt_state state[GPT_ELT_COUNT]; int bootcamp; }; struct g_part_gpt_entry { struct g_part_entry base; struct gpt_ent ent; }; static void g_gpt_printf_utf16(struct sbuf *, uint16_t *, size_t); static void g_gpt_utf8_to_utf16(const uint8_t *, uint16_t *, size_t); static void g_gpt_set_defaults(struct g_part_table *, struct g_provider *); static int g_part_gpt_add(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static int g_part_gpt_bootcode(struct g_part_table *, struct g_part_parms *); static int g_part_gpt_create(struct g_part_table *, struct g_part_parms *); static int g_part_gpt_destroy(struct g_part_table *, struct g_part_parms *); static void g_part_gpt_dumpconf(struct g_part_table *, struct g_part_entry *, struct sbuf *, const char *); static int g_part_gpt_dumpto(struct g_part_table *, struct g_part_entry *); static int g_part_gpt_modify(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static const char *g_part_gpt_name(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_gpt_probe(struct g_part_table *, struct g_consumer *); static int g_part_gpt_read(struct g_part_table *, struct g_consumer *); static int g_part_gpt_setunset(struct g_part_table *table, struct g_part_entry *baseentry, const char *attrib, unsigned int set); static const char *g_part_gpt_type(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_gpt_write(struct g_part_table *, struct g_consumer *); static int g_part_gpt_resize(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static int g_part_gpt_recover(struct g_part_table *); static kobj_method_t g_part_gpt_methods[] = { KOBJMETHOD(g_part_add, g_part_gpt_add), KOBJMETHOD(g_part_bootcode, g_part_gpt_bootcode), KOBJMETHOD(g_part_create, g_part_gpt_create), KOBJMETHOD(g_part_destroy, g_part_gpt_destroy), KOBJMETHOD(g_part_dumpconf, g_part_gpt_dumpconf), KOBJMETHOD(g_part_dumpto, g_part_gpt_dumpto), KOBJMETHOD(g_part_modify, g_part_gpt_modify), KOBJMETHOD(g_part_resize, g_part_gpt_resize), KOBJMETHOD(g_part_name, g_part_gpt_name), KOBJMETHOD(g_part_probe, g_part_gpt_probe), KOBJMETHOD(g_part_read, g_part_gpt_read), KOBJMETHOD(g_part_recover, g_part_gpt_recover), KOBJMETHOD(g_part_setunset, g_part_gpt_setunset), KOBJMETHOD(g_part_type, g_part_gpt_type), KOBJMETHOD(g_part_write, g_part_gpt_write), { 0, 0 } }; static struct g_part_scheme g_part_gpt_scheme = { "GPT", g_part_gpt_methods, sizeof(struct g_part_gpt_table), .gps_entrysz = sizeof(struct g_part_gpt_entry), .gps_minent = 128, .gps_maxent = 4096, .gps_bootcodesz = MBRSIZE, }; G_PART_SCHEME_DECLARE(g_part_gpt); +MODULE_VERSION(geom_part_gpt, 0); static struct uuid gpt_uuid_apple_apfs = GPT_ENT_TYPE_APPLE_APFS; static struct uuid gpt_uuid_apple_boot = GPT_ENT_TYPE_APPLE_BOOT; static struct uuid gpt_uuid_apple_core_storage = GPT_ENT_TYPE_APPLE_CORE_STORAGE; static struct uuid gpt_uuid_apple_hfs = GPT_ENT_TYPE_APPLE_HFS; static struct uuid gpt_uuid_apple_label = GPT_ENT_TYPE_APPLE_LABEL; static struct uuid gpt_uuid_apple_raid = GPT_ENT_TYPE_APPLE_RAID; static struct uuid gpt_uuid_apple_raid_offline = GPT_ENT_TYPE_APPLE_RAID_OFFLINE; static struct uuid gpt_uuid_apple_tv_recovery = GPT_ENT_TYPE_APPLE_TV_RECOVERY; static struct uuid gpt_uuid_apple_ufs = GPT_ENT_TYPE_APPLE_UFS; static struct uuid gpt_uuid_bios_boot = GPT_ENT_TYPE_BIOS_BOOT; static struct uuid 
gpt_uuid_chromeos_firmware = GPT_ENT_TYPE_CHROMEOS_FIRMWARE; static struct uuid gpt_uuid_chromeos_kernel = GPT_ENT_TYPE_CHROMEOS_KERNEL; static struct uuid gpt_uuid_chromeos_reserved = GPT_ENT_TYPE_CHROMEOS_RESERVED; static struct uuid gpt_uuid_chromeos_root = GPT_ENT_TYPE_CHROMEOS_ROOT; static struct uuid gpt_uuid_dfbsd_ccd = GPT_ENT_TYPE_DRAGONFLY_CCD; static struct uuid gpt_uuid_dfbsd_hammer = GPT_ENT_TYPE_DRAGONFLY_HAMMER; static struct uuid gpt_uuid_dfbsd_hammer2 = GPT_ENT_TYPE_DRAGONFLY_HAMMER2; static struct uuid gpt_uuid_dfbsd_label32 = GPT_ENT_TYPE_DRAGONFLY_LABEL32; static struct uuid gpt_uuid_dfbsd_label64 = GPT_ENT_TYPE_DRAGONFLY_LABEL64; static struct uuid gpt_uuid_dfbsd_legacy = GPT_ENT_TYPE_DRAGONFLY_LEGACY; static struct uuid gpt_uuid_dfbsd_swap = GPT_ENT_TYPE_DRAGONFLY_SWAP; static struct uuid gpt_uuid_dfbsd_ufs1 = GPT_ENT_TYPE_DRAGONFLY_UFS1; static struct uuid gpt_uuid_dfbsd_vinum = GPT_ENT_TYPE_DRAGONFLY_VINUM; static struct uuid gpt_uuid_efi = GPT_ENT_TYPE_EFI; static struct uuid gpt_uuid_freebsd = GPT_ENT_TYPE_FREEBSD; static struct uuid gpt_uuid_freebsd_boot = GPT_ENT_TYPE_FREEBSD_BOOT; static struct uuid gpt_uuid_freebsd_nandfs = GPT_ENT_TYPE_FREEBSD_NANDFS; static struct uuid gpt_uuid_freebsd_swap = GPT_ENT_TYPE_FREEBSD_SWAP; static struct uuid gpt_uuid_freebsd_ufs = GPT_ENT_TYPE_FREEBSD_UFS; static struct uuid gpt_uuid_freebsd_vinum = GPT_ENT_TYPE_FREEBSD_VINUM; static struct uuid gpt_uuid_freebsd_zfs = GPT_ENT_TYPE_FREEBSD_ZFS; static struct uuid gpt_uuid_linux_data = GPT_ENT_TYPE_LINUX_DATA; static struct uuid gpt_uuid_linux_lvm = GPT_ENT_TYPE_LINUX_LVM; static struct uuid gpt_uuid_linux_raid = GPT_ENT_TYPE_LINUX_RAID; static struct uuid gpt_uuid_linux_swap = GPT_ENT_TYPE_LINUX_SWAP; static struct uuid gpt_uuid_mbr = GPT_ENT_TYPE_MBR; static struct uuid gpt_uuid_ms_basic_data = GPT_ENT_TYPE_MS_BASIC_DATA; static struct uuid gpt_uuid_ms_ldm_data = GPT_ENT_TYPE_MS_LDM_DATA; static struct uuid gpt_uuid_ms_ldm_metadata = GPT_ENT_TYPE_MS_LDM_METADATA; static struct uuid gpt_uuid_ms_recovery = GPT_ENT_TYPE_MS_RECOVERY; static struct uuid gpt_uuid_ms_reserved = GPT_ENT_TYPE_MS_RESERVED; static struct uuid gpt_uuid_ms_spaces = GPT_ENT_TYPE_MS_SPACES; static struct uuid gpt_uuid_netbsd_ccd = GPT_ENT_TYPE_NETBSD_CCD; static struct uuid gpt_uuid_netbsd_cgd = GPT_ENT_TYPE_NETBSD_CGD; static struct uuid gpt_uuid_netbsd_ffs = GPT_ENT_TYPE_NETBSD_FFS; static struct uuid gpt_uuid_netbsd_lfs = GPT_ENT_TYPE_NETBSD_LFS; static struct uuid gpt_uuid_netbsd_raid = GPT_ENT_TYPE_NETBSD_RAID; static struct uuid gpt_uuid_netbsd_swap = GPT_ENT_TYPE_NETBSD_SWAP; static struct uuid gpt_uuid_openbsd_data = GPT_ENT_TYPE_OPENBSD_DATA; static struct uuid gpt_uuid_prep_boot = GPT_ENT_TYPE_PREP_BOOT; static struct uuid gpt_uuid_unused = GPT_ENT_TYPE_UNUSED; static struct uuid gpt_uuid_vmfs = GPT_ENT_TYPE_VMFS; static struct uuid gpt_uuid_vmkdiag = GPT_ENT_TYPE_VMKDIAG; static struct uuid gpt_uuid_vmreserved = GPT_ENT_TYPE_VMRESERVED; static struct uuid gpt_uuid_vmvsanhdr = GPT_ENT_TYPE_VMVSANHDR; static struct g_part_uuid_alias { struct uuid *uuid; int alias; int mbrtype; } gpt_uuid_alias_match[] = { { &gpt_uuid_apple_apfs, G_PART_ALIAS_APPLE_APFS, 0 }, { &gpt_uuid_apple_boot, G_PART_ALIAS_APPLE_BOOT, 0xab }, { &gpt_uuid_apple_core_storage, G_PART_ALIAS_APPLE_CORE_STORAGE, 0 }, { &gpt_uuid_apple_hfs, G_PART_ALIAS_APPLE_HFS, 0xaf }, { &gpt_uuid_apple_label, G_PART_ALIAS_APPLE_LABEL, 0 }, { &gpt_uuid_apple_raid, G_PART_ALIAS_APPLE_RAID, 0 }, { &gpt_uuid_apple_raid_offline, 
G_PART_ALIAS_APPLE_RAID_OFFLINE, 0 }, { &gpt_uuid_apple_tv_recovery, G_PART_ALIAS_APPLE_TV_RECOVERY, 0 }, { &gpt_uuid_apple_ufs, G_PART_ALIAS_APPLE_UFS, 0 }, { &gpt_uuid_bios_boot, G_PART_ALIAS_BIOS_BOOT, 0 }, { &gpt_uuid_chromeos_firmware, G_PART_ALIAS_CHROMEOS_FIRMWARE, 0 }, { &gpt_uuid_chromeos_kernel, G_PART_ALIAS_CHROMEOS_KERNEL, 0 }, { &gpt_uuid_chromeos_reserved, G_PART_ALIAS_CHROMEOS_RESERVED, 0 }, { &gpt_uuid_chromeos_root, G_PART_ALIAS_CHROMEOS_ROOT, 0 }, { &gpt_uuid_dfbsd_ccd, G_PART_ALIAS_DFBSD_CCD, 0 }, { &gpt_uuid_dfbsd_hammer, G_PART_ALIAS_DFBSD_HAMMER, 0 }, { &gpt_uuid_dfbsd_hammer2, G_PART_ALIAS_DFBSD_HAMMER2, 0 }, { &gpt_uuid_dfbsd_label32, G_PART_ALIAS_DFBSD, 0xa5 }, { &gpt_uuid_dfbsd_label64, G_PART_ALIAS_DFBSD64, 0xa5 }, { &gpt_uuid_dfbsd_legacy, G_PART_ALIAS_DFBSD_LEGACY, 0 }, { &gpt_uuid_dfbsd_swap, G_PART_ALIAS_DFBSD_SWAP, 0 }, { &gpt_uuid_dfbsd_ufs1, G_PART_ALIAS_DFBSD_UFS, 0 }, { &gpt_uuid_dfbsd_vinum, G_PART_ALIAS_DFBSD_VINUM, 0 }, { &gpt_uuid_efi, G_PART_ALIAS_EFI, 0xee }, { &gpt_uuid_freebsd, G_PART_ALIAS_FREEBSD, 0xa5 }, { &gpt_uuid_freebsd_boot, G_PART_ALIAS_FREEBSD_BOOT, 0 }, { &gpt_uuid_freebsd_nandfs, G_PART_ALIAS_FREEBSD_NANDFS, 0 }, { &gpt_uuid_freebsd_swap, G_PART_ALIAS_FREEBSD_SWAP, 0 }, { &gpt_uuid_freebsd_ufs, G_PART_ALIAS_FREEBSD_UFS, 0 }, { &gpt_uuid_freebsd_vinum, G_PART_ALIAS_FREEBSD_VINUM, 0 }, { &gpt_uuid_freebsd_zfs, G_PART_ALIAS_FREEBSD_ZFS, 0 }, { &gpt_uuid_linux_data, G_PART_ALIAS_LINUX_DATA, 0x0b }, { &gpt_uuid_linux_lvm, G_PART_ALIAS_LINUX_LVM, 0 }, { &gpt_uuid_linux_raid, G_PART_ALIAS_LINUX_RAID, 0 }, { &gpt_uuid_linux_swap, G_PART_ALIAS_LINUX_SWAP, 0 }, { &gpt_uuid_mbr, G_PART_ALIAS_MBR, 0 }, { &gpt_uuid_ms_basic_data, G_PART_ALIAS_MS_BASIC_DATA, 0x0b }, { &gpt_uuid_ms_ldm_data, G_PART_ALIAS_MS_LDM_DATA, 0 }, { &gpt_uuid_ms_ldm_metadata, G_PART_ALIAS_MS_LDM_METADATA, 0 }, { &gpt_uuid_ms_recovery, G_PART_ALIAS_MS_RECOVERY, 0 }, { &gpt_uuid_ms_reserved, G_PART_ALIAS_MS_RESERVED, 0 }, { &gpt_uuid_ms_spaces, G_PART_ALIAS_MS_SPACES, 0 }, { &gpt_uuid_netbsd_ccd, G_PART_ALIAS_NETBSD_CCD, 0 }, { &gpt_uuid_netbsd_cgd, G_PART_ALIAS_NETBSD_CGD, 0 }, { &gpt_uuid_netbsd_ffs, G_PART_ALIAS_NETBSD_FFS, 0 }, { &gpt_uuid_netbsd_lfs, G_PART_ALIAS_NETBSD_LFS, 0 }, { &gpt_uuid_netbsd_raid, G_PART_ALIAS_NETBSD_RAID, 0 }, { &gpt_uuid_netbsd_swap, G_PART_ALIAS_NETBSD_SWAP, 0 }, { &gpt_uuid_openbsd_data, G_PART_ALIAS_OPENBSD_DATA, 0 }, { &gpt_uuid_prep_boot, G_PART_ALIAS_PREP_BOOT, 0x41 }, { &gpt_uuid_vmfs, G_PART_ALIAS_VMFS, 0 }, { &gpt_uuid_vmkdiag, G_PART_ALIAS_VMKDIAG, 0 }, { &gpt_uuid_vmreserved, G_PART_ALIAS_VMRESERVED, 0 }, { &gpt_uuid_vmvsanhdr, G_PART_ALIAS_VMVSANHDR, 0 }, { NULL, 0, 0 } }; static int gpt_write_mbr_entry(u_char *mbr, int idx, int typ, quad_t start, quad_t end) { if (typ == 0 || start > UINT32_MAX || end > UINT32_MAX) return (EINVAL); mbr += DOSPARTOFF + idx * DOSPARTSIZE; mbr[0] = 0; if (start == 1) { /* * Treat the PMBR partition specially to maximize * interoperability with BIOSes. 
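	 * The resulting PMBR slot then reads, byte for byte: status 0x00,
	 * CHS start 0/0/2 (bytes 00 02 00), type 0xee, CHS end 1023/255/63
	 * (bytes ff ff ff, i.e. "end of disk" to CHS-only BIOSes), LBA
	 * start 1, and a 32-bit size clamped to UINT32_MAX on large media.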
*/ mbr[1] = mbr[3] = 0; mbr[2] = 2; } else mbr[1] = mbr[2] = mbr[3] = 0xff; mbr[4] = typ; mbr[5] = mbr[6] = mbr[7] = 0xff; le32enc(mbr + 8, (uint32_t)start); le32enc(mbr + 12, (uint32_t)(end - start + 1)); return (0); } static int gpt_map_type(struct uuid *t) { struct g_part_uuid_alias *uap; for (uap = &gpt_uuid_alias_match[0]; uap->uuid; uap++) { if (EQUUID(t, uap->uuid)) return (uap->mbrtype); } return (0); } static void gpt_create_pmbr(struct g_part_gpt_table *table, struct g_provider *pp) { bzero(table->mbr + DOSPARTOFF, DOSPARTSIZE * NDOSPART); gpt_write_mbr_entry(table->mbr, 0, 0xee, 1, MIN(pp->mediasize / pp->sectorsize - 1, UINT32_MAX)); le16enc(table->mbr + DOSMAGICOFFSET, DOSMAGIC); } /* * Under Boot Camp the PMBR partition (type 0xEE) doesn't cover the * whole disk anymore. Rather, it covers the GPT table and the EFI * system partition only. This way the HFS+ partition and any FAT * partitions can be added to the MBR without creating an overlap. */ static int gpt_is_bootcamp(struct g_part_gpt_table *table, const char *provname) { uint8_t *p; p = table->mbr + DOSPARTOFF; if (p[4] != 0xee || le32dec(p + 8) != 1) return (0); p += DOSPARTSIZE; if (p[4] != 0xaf) return (0); printf("GEOM: %s: enabling Boot Camp\n", provname); return (1); } static void gpt_update_bootcamp(struct g_part_table *basetable, struct g_provider *pp) { struct g_part_entry *baseentry; struct g_part_gpt_entry *entry; struct g_part_gpt_table *table; int bootable, error, index, slices, typ; table = (struct g_part_gpt_table *)basetable; bootable = -1; for (index = 0; index < NDOSPART; index++) { if (table->mbr[DOSPARTOFF + DOSPARTSIZE * index]) bootable = index; } bzero(table->mbr + DOSPARTOFF, DOSPARTSIZE * NDOSPART); slices = 0; LIST_FOREACH(baseentry, &basetable->gpt_entry, gpe_entry) { if (baseentry->gpe_deleted) continue; index = baseentry->gpe_index - 1; if (index >= NDOSPART) continue; entry = (struct g_part_gpt_entry *)baseentry; switch (index) { case 0: /* This must be the EFI system partition. */ if (!EQUUID(&entry->ent.ent_type, &gpt_uuid_efi)) goto disable; error = gpt_write_mbr_entry(table->mbr, index, 0xee, 1ull, entry->ent.ent_lba_end); break; case 1: /* This must be the HFS+ partition. */ if (!EQUUID(&entry->ent.ent_type, &gpt_uuid_apple_hfs)) goto disable; error = gpt_write_mbr_entry(table->mbr, index, 0xaf, entry->ent.ent_lba_start, entry->ent.ent_lba_end); break; default: typ = gpt_map_type(&entry->ent.ent_type); error = gpt_write_mbr_entry(table->mbr, index, typ, entry->ent.ent_lba_start, entry->ent.ent_lba_end); break; } if (error) continue; if (index == bootable) table->mbr[DOSPARTOFF + DOSPARTSIZE * index] = 0x80; slices |= 1 << index; } if ((slices & 3) == 3) return; disable: table->bootcamp = 0; gpt_create_pmbr(table, pp); } static struct gpt_hdr * gpt_read_hdr(struct g_part_gpt_table *table, struct g_consumer *cp, enum gpt_elt elt) { struct gpt_hdr *buf, *hdr; struct g_provider *pp; quad_t lba, last; int error; uint32_t crc, sz; pp = cp->provider; last = (pp->mediasize / pp->sectorsize) - 1; table->state[elt] = GPT_STATE_MISSING; /* * If the primary header is valid look for secondary * header in AlternateLBA, otherwise in the last medium's LBA. 
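	 * For example, on a 512-byte-sector disk of 8388608 sectors
	 * (4 GiB), a valid primary header at LBA 1 supplies AlternateLBA,
	 * normally pointing at LBA 8388607; without a usable primary we
	 * simply try the medium's last LBA directly.
	 */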
*/ if (elt == GPT_ELT_SECHDR) { if (table->state[GPT_ELT_PRIHDR] != GPT_STATE_OK) table->lba[elt] = last; } else table->lba[elt] = 1; buf = g_read_data(cp, table->lba[elt] * pp->sectorsize, pp->sectorsize, &error); if (buf == NULL) return (NULL); hdr = NULL; if (memcmp(buf->hdr_sig, GPT_HDR_SIG, sizeof(buf->hdr_sig)) != 0) goto fail; table->state[elt] = GPT_STATE_CORRUPT; sz = le32toh(buf->hdr_size); if (sz < 92 || sz > pp->sectorsize) goto fail; hdr = g_malloc(sz, M_WAITOK | M_ZERO); bcopy(buf, hdr, sz); hdr->hdr_size = sz; crc = le32toh(buf->hdr_crc_self); buf->hdr_crc_self = 0; if (crc32(buf, sz) != crc) goto fail; hdr->hdr_crc_self = crc; table->state[elt] = GPT_STATE_INVALID; hdr->hdr_revision = le32toh(buf->hdr_revision); if (hdr->hdr_revision < GPT_HDR_REVISION) goto fail; hdr->hdr_lba_self = le64toh(buf->hdr_lba_self); if (hdr->hdr_lba_self != table->lba[elt]) goto fail; hdr->hdr_lba_alt = le64toh(buf->hdr_lba_alt); if (hdr->hdr_lba_alt == hdr->hdr_lba_self || hdr->hdr_lba_alt > last) goto fail; /* Check the managed area. */ hdr->hdr_lba_start = le64toh(buf->hdr_lba_start); if (hdr->hdr_lba_start < 2 || hdr->hdr_lba_start >= last) goto fail; hdr->hdr_lba_end = le64toh(buf->hdr_lba_end); if (hdr->hdr_lba_end < hdr->hdr_lba_start || hdr->hdr_lba_end >= last) goto fail; /* Check the table location and size of the table. */ hdr->hdr_entries = le32toh(buf->hdr_entries); hdr->hdr_entsz = le32toh(buf->hdr_entsz); if (hdr->hdr_entries == 0 || hdr->hdr_entsz < 128 || (hdr->hdr_entsz & 7) != 0) goto fail; hdr->hdr_lba_table = le64toh(buf->hdr_lba_table); if (hdr->hdr_lba_table < 2 || hdr->hdr_lba_table >= last) goto fail; if (hdr->hdr_lba_table >= hdr->hdr_lba_start && hdr->hdr_lba_table <= hdr->hdr_lba_end) goto fail; lba = hdr->hdr_lba_table + howmany(hdr->hdr_entries * hdr->hdr_entsz, pp->sectorsize) - 1; if (lba >= last) goto fail; if (lba >= hdr->hdr_lba_start && lba <= hdr->hdr_lba_end) goto fail; table->state[elt] = GPT_STATE_OK; le_uuid_dec(&buf->hdr_uuid, &hdr->hdr_uuid); hdr->hdr_crc_table = le32toh(buf->hdr_crc_table); /* save LBA for secondary header */ if (elt == GPT_ELT_PRIHDR) table->lba[GPT_ELT_SECHDR] = hdr->hdr_lba_alt; g_free(buf); return (hdr); fail: if (hdr != NULL) g_free(hdr); g_free(buf); return (NULL); } static struct gpt_ent * gpt_read_tbl(struct g_part_gpt_table *table, struct g_consumer *cp, enum gpt_elt elt, struct gpt_hdr *hdr) { struct g_provider *pp; struct gpt_ent *ent, *tbl; char *buf, *p; unsigned int idx, sectors, tblsz, size; int error; if (hdr == NULL) return (NULL); pp = cp->provider; table->lba[elt] = hdr->hdr_lba_table; table->state[elt] = GPT_STATE_MISSING; tblsz = hdr->hdr_entries * hdr->hdr_entsz; sectors = howmany(tblsz, pp->sectorsize); buf = g_malloc(sectors * pp->sectorsize, M_WAITOK | M_ZERO); for (idx = 0; idx < sectors; idx += MAXPHYS / pp->sectorsize) { size = (sectors - idx > MAXPHYS / pp->sectorsize) ? 
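		    /*
		     * The table is read in MAXPHYS-sized chunks (typically
		     * 128 KiB, i.e. 256 sectors of 512 bytes); a standard
		     * 128-entry table is only 16 KiB, so it normally takes
		     * a single g_read_data() call.
		     */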
MAXPHYS: (sectors - idx) * pp->sectorsize; p = g_read_data(cp, (table->lba[elt] + idx) * pp->sectorsize, size, &error); if (p == NULL) { g_free(buf); return (NULL); } bcopy(p, buf + idx * pp->sectorsize, size); g_free(p); } table->state[elt] = GPT_STATE_CORRUPT; if (crc32(buf, tblsz) != hdr->hdr_crc_table) { g_free(buf); return (NULL); } table->state[elt] = GPT_STATE_OK; tbl = g_malloc(hdr->hdr_entries * sizeof(struct gpt_ent), M_WAITOK | M_ZERO); for (idx = 0, ent = tbl, p = buf; idx < hdr->hdr_entries; idx++, ent++, p += hdr->hdr_entsz) { le_uuid_dec(p, &ent->ent_type); le_uuid_dec(p + 16, &ent->ent_uuid); ent->ent_lba_start = le64dec(p + 32); ent->ent_lba_end = le64dec(p + 40); ent->ent_attr = le64dec(p + 48); /* Keep UTF-16 in little-endian. */ bcopy(p + 56, ent->ent_name, sizeof(ent->ent_name)); } g_free(buf); return (tbl); } static int gpt_matched_hdrs(struct gpt_hdr *pri, struct gpt_hdr *sec) { if (pri == NULL || sec == NULL) return (0); if (!EQUUID(&pri->hdr_uuid, &sec->hdr_uuid)) return (0); return ((pri->hdr_revision == sec->hdr_revision && pri->hdr_size == sec->hdr_size && pri->hdr_lba_start == sec->hdr_lba_start && pri->hdr_lba_end == sec->hdr_lba_end && pri->hdr_entries == sec->hdr_entries && pri->hdr_entsz == sec->hdr_entsz && pri->hdr_crc_table == sec->hdr_crc_table) ? 1 : 0); } static int gpt_parse_type(const char *type, struct uuid *uuid) { struct uuid tmp; const char *alias; int error; struct g_part_uuid_alias *uap; if (type[0] == '!') { error = parse_uuid(type + 1, &tmp); if (error) return (error); if (EQUUID(&tmp, &gpt_uuid_unused)) return (EINVAL); *uuid = tmp; return (0); } for (uap = &gpt_uuid_alias_match[0]; uap->uuid; uap++) { alias = g_part_alias_name(uap->alias); if (!strcasecmp(type, alias)) { *uuid = *uap->uuid; return (0); } } return (EINVAL); } static int g_part_gpt_add(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_gpt_entry *entry; int error; entry = (struct g_part_gpt_entry *)baseentry; error = gpt_parse_type(gpp->gpp_type, &entry->ent.ent_type); if (error) return (error); kern_uuidgen(&entry->ent.ent_uuid, 1); entry->ent.ent_lba_start = baseentry->gpe_start; entry->ent.ent_lba_end = baseentry->gpe_end; if (baseentry->gpe_deleted) { entry->ent.ent_attr = 0; bzero(entry->ent.ent_name, sizeof(entry->ent.ent_name)); } if (gpp->gpp_parms & G_PART_PARM_LABEL) g_gpt_utf8_to_utf16(gpp->gpp_label, entry->ent.ent_name, sizeof(entry->ent.ent_name) / sizeof(entry->ent.ent_name[0])); return (0); } static int g_part_gpt_bootcode(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_part_gpt_table *table; size_t codesz; codesz = DOSPARTOFF; table = (struct g_part_gpt_table *)basetable; bzero(table->mbr, codesz); codesz = MIN(codesz, gpp->gpp_codesize); if (codesz > 0) bcopy(gpp->gpp_codeptr, table->mbr, codesz); return (0); } static int g_part_gpt_create(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_provider *pp; struct g_part_gpt_table *table; size_t tblsz; /* We don't nest, which means that our depth should be 0. 
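 * With the scheme defaults (128 entries of 128 bytes each) on 512-byte
 * sectors, tblsz below works out to 32, so creating a GPT requires a
 * provider of at least 3 + 2 * 32 + 128 = 195 sectors: the PMBR, two
 * headers, two copies of the table, plus one usable sector per possible
 * partition.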
*/ if (basetable->gpt_depth != 0) return (ENXIO); table = (struct g_part_gpt_table *)basetable; pp = gpp->gpp_provider; tblsz = howmany(basetable->gpt_entries * sizeof(struct gpt_ent), pp->sectorsize); if (pp->sectorsize < MBRSIZE || pp->mediasize < (3 + 2 * tblsz + basetable->gpt_entries) * pp->sectorsize) return (ENOSPC); gpt_create_pmbr(table, pp); /* Allocate space for the header */ table->hdr = g_malloc(sizeof(struct gpt_hdr), M_WAITOK | M_ZERO); bcopy(GPT_HDR_SIG, table->hdr->hdr_sig, sizeof(table->hdr->hdr_sig)); table->hdr->hdr_revision = GPT_HDR_REVISION; table->hdr->hdr_size = offsetof(struct gpt_hdr, padding); kern_uuidgen(&table->hdr->hdr_uuid, 1); table->hdr->hdr_entries = basetable->gpt_entries; table->hdr->hdr_entsz = sizeof(struct gpt_ent); g_gpt_set_defaults(basetable, pp); return (0); } static int g_part_gpt_destroy(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_part_gpt_table *table; struct g_provider *pp; table = (struct g_part_gpt_table *)basetable; pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider; g_free(table->hdr); table->hdr = NULL; /* * Wipe the first 2 sectors and last one to clear the partitioning. * Wipe sectors only if they have valid metadata. */ if (table->state[GPT_ELT_PRIHDR] == GPT_STATE_OK) basetable->gpt_smhead |= 3; if (table->state[GPT_ELT_SECHDR] == GPT_STATE_OK && table->lba[GPT_ELT_SECHDR] == pp->mediasize / pp->sectorsize - 1) basetable->gpt_smtail |= 1; return (0); } static void g_part_gpt_dumpconf(struct g_part_table *table, struct g_part_entry *baseentry, struct sbuf *sb, const char *indent) { struct g_part_gpt_entry *entry; entry = (struct g_part_gpt_entry *)baseentry; if (indent == NULL) { /* conftxt: libdisk compatibility */ sbuf_printf(sb, " xs GPT xt "); sbuf_printf_uuid(sb, &entry->ent.ent_type); } else if (entry != NULL) { /* confxml: partition entry information */ sbuf_printf(sb, "%s\n"); if (entry->ent.ent_attr & GPT_ENT_ATTR_BOOTME) sbuf_printf(sb, "%sbootme\n", indent); if (entry->ent.ent_attr & GPT_ENT_ATTR_BOOTONCE) { sbuf_printf(sb, "%sbootonce\n", indent); } if (entry->ent.ent_attr & GPT_ENT_ATTR_BOOTFAILED) { sbuf_printf(sb, "%sbootfailed\n", indent); } sbuf_printf(sb, "%s", indent); sbuf_printf_uuid(sb, &entry->ent.ent_type); sbuf_printf(sb, "\n"); sbuf_printf(sb, "%s", indent); sbuf_printf_uuid(sb, &entry->ent.ent_uuid); sbuf_printf(sb, "\n"); sbuf_printf(sb, "%s", indent); sbuf_printf(sb, "HD(%d,GPT,", entry->base.gpe_index); sbuf_printf_uuid(sb, &entry->ent.ent_uuid); sbuf_printf(sb, ",%#jx,%#jx)", (intmax_t)entry->base.gpe_start, (intmax_t)(entry->base.gpe_end - entry->base.gpe_start + 1)); sbuf_printf(sb, "\n"); } else { /* confxml: scheme information */ } } static int g_part_gpt_dumpto(struct g_part_table *table, struct g_part_entry *baseentry) { struct g_part_gpt_entry *entry; entry = (struct g_part_gpt_entry *)baseentry; return ((EQUUID(&entry->ent.ent_type, &gpt_uuid_freebsd_swap) || EQUUID(&entry->ent.ent_type, &gpt_uuid_linux_swap) || EQUUID(&entry->ent.ent_type, &gpt_uuid_dfbsd_swap)) ? 
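	    /*
	     * i.e. kernel dumps are allowed only onto FreeBSD, Linux or
	     * DragonFly swap partitions.
	     */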
1 : 0); } static int g_part_gpt_modify(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_gpt_entry *entry; int error; entry = (struct g_part_gpt_entry *)baseentry; if (gpp->gpp_parms & G_PART_PARM_TYPE) { error = gpt_parse_type(gpp->gpp_type, &entry->ent.ent_type); if (error) return (error); } if (gpp->gpp_parms & G_PART_PARM_LABEL) g_gpt_utf8_to_utf16(gpp->gpp_label, entry->ent.ent_name, sizeof(entry->ent.ent_name) / sizeof(entry->ent.ent_name[0])); return (0); } static int g_part_gpt_resize(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_gpt_entry *entry; if (baseentry == NULL) return (g_part_gpt_recover(basetable)); entry = (struct g_part_gpt_entry *)baseentry; baseentry->gpe_end = baseentry->gpe_start + gpp->gpp_size - 1; entry->ent.ent_lba_end = baseentry->gpe_end; return (0); } static const char * g_part_gpt_name(struct g_part_table *table, struct g_part_entry *baseentry, char *buf, size_t bufsz) { struct g_part_gpt_entry *entry; char c; entry = (struct g_part_gpt_entry *)baseentry; c = (EQUUID(&entry->ent.ent_type, &gpt_uuid_freebsd)) ? 's' : 'p'; snprintf(buf, bufsz, "%c%d", c, baseentry->gpe_index); return (buf); } static int g_part_gpt_probe(struct g_part_table *table, struct g_consumer *cp) { struct g_provider *pp; u_char *buf; int error, index, pri, res; /* We don't nest, which means that our depth should be 0. */ if (table->gpt_depth != 0) return (ENXIO); pp = cp->provider; /* * Sanity-check the provider. Since the first sector on the provider * must be a PMBR and a PMBR is 512 bytes large, the sector size * must be at least 512 bytes. Also, since the theoretical minimum * number of sectors needed by GPT is 6, any medium that has less * than 6 sectors is never going to be able to hold a GPT. The * number 6 comes from: * 1 sector for the PMBR * 2 sectors for the GPT headers (each 1 sector) * 2 sectors for the GPT tables (each 1 sector) * 1 sector for an actual partition * It's better to catch this pathological case early than behaving * pathologically later on... */ if (pp->sectorsize < MBRSIZE || pp->mediasize < 6 * pp->sectorsize) return (ENOSPC); /* * Check that there's a MBR or a PMBR. If it's a PMBR, we return * as the highest priority on a match, otherwise we assume some * GPT-unaware tool has destroyed the GPT by recreating a MBR and * we really want the MBR scheme to take precedence. */ buf = g_read_data(cp, 0L, pp->sectorsize, &error); if (buf == NULL) return (error); res = le16dec(buf + DOSMAGICOFFSET); pri = G_PART_PROBE_PRI_LOW; if (res == DOSMAGIC) { for (index = 0; index < NDOSPART; index++) { if (buf[DOSPARTOFF + DOSPARTSIZE * index + 4] == 0xee) pri = G_PART_PROBE_PRI_HIGH; } g_free(buf); /* Check that there's a primary header. */ buf = g_read_data(cp, pp->sectorsize, pp->sectorsize, &error); if (buf == NULL) return (error); res = memcmp(buf, GPT_HDR_SIG, 8); g_free(buf); if (res == 0) return (pri); } else g_free(buf); /* No primary? Check that there's a secondary. */ buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); if (buf == NULL) return (error); res = memcmp(buf, GPT_HDR_SIG, 8); g_free(buf); return ((res == 0) ? 
pri : ENXIO); } static int g_part_gpt_read(struct g_part_table *basetable, struct g_consumer *cp) { struct gpt_hdr *prihdr, *sechdr; struct gpt_ent *tbl, *pritbl, *sectbl; struct g_provider *pp; struct g_part_gpt_table *table; struct g_part_gpt_entry *entry; u_char *buf; uint64_t last; int error, index; table = (struct g_part_gpt_table *)basetable; pp = cp->provider; last = (pp->mediasize / pp->sectorsize) - 1; /* Read the PMBR */ buf = g_read_data(cp, 0, pp->sectorsize, &error); if (buf == NULL) return (error); bcopy(buf, table->mbr, MBRSIZE); g_free(buf); /* Read the primary header and table. */ prihdr = gpt_read_hdr(table, cp, GPT_ELT_PRIHDR); if (table->state[GPT_ELT_PRIHDR] == GPT_STATE_OK) { pritbl = gpt_read_tbl(table, cp, GPT_ELT_PRITBL, prihdr); } else { table->state[GPT_ELT_PRITBL] = GPT_STATE_MISSING; pritbl = NULL; } /* Read the secondary header and table. */ sechdr = gpt_read_hdr(table, cp, GPT_ELT_SECHDR); if (table->state[GPT_ELT_SECHDR] == GPT_STATE_OK) { sectbl = gpt_read_tbl(table, cp, GPT_ELT_SECTBL, sechdr); } else { table->state[GPT_ELT_SECTBL] = GPT_STATE_MISSING; sectbl = NULL; } /* Fail if we haven't got any good tables at all. */ if (table->state[GPT_ELT_PRITBL] != GPT_STATE_OK && table->state[GPT_ELT_SECTBL] != GPT_STATE_OK) { printf("GEOM: %s: corrupt or invalid GPT detected.\n", pp->name); printf("GEOM: %s: GPT rejected -- may not be recoverable.\n", pp->name); if (prihdr != NULL) g_free(prihdr); if (pritbl != NULL) g_free(pritbl); if (sechdr != NULL) g_free(sechdr); if (sectbl != NULL) g_free(sectbl); return (EINVAL); } /* * If both headers are good but they disagree with each other, * then invalidate one. We prefer to keep the primary header, * unless the primary table is corrupt. */ if (table->state[GPT_ELT_PRIHDR] == GPT_STATE_OK && table->state[GPT_ELT_SECHDR] == GPT_STATE_OK && !gpt_matched_hdrs(prihdr, sechdr)) { if (table->state[GPT_ELT_PRITBL] == GPT_STATE_OK) { table->state[GPT_ELT_SECHDR] = GPT_STATE_INVALID; table->state[GPT_ELT_SECTBL] = GPT_STATE_MISSING; g_free(sechdr); sechdr = NULL; } else { table->state[GPT_ELT_PRIHDR] = GPT_STATE_INVALID; table->state[GPT_ELT_PRITBL] = GPT_STATE_MISSING; g_free(prihdr); prihdr = NULL; } } if (table->state[GPT_ELT_PRITBL] != GPT_STATE_OK) { printf("GEOM: %s: the primary GPT table is corrupt or " "invalid.\n", pp->name); printf("GEOM: %s: using the secondary instead -- recovery " "strongly advised.\n", pp->name); table->hdr = sechdr; basetable->gpt_corrupt = 1; if (prihdr != NULL) g_free(prihdr); tbl = sectbl; if (pritbl != NULL) g_free(pritbl); } else { if (table->state[GPT_ELT_SECTBL] != GPT_STATE_OK) { printf("GEOM: %s: the secondary GPT table is corrupt " "or invalid.\n", pp->name); printf("GEOM: %s: using the primary only -- recovery " "suggested.\n", pp->name); basetable->gpt_corrupt = 1; } else if (table->lba[GPT_ELT_SECHDR] != last) { printf( "GEOM: %s: the secondary GPT header is not in " "the last LBA.\n", pp->name); basetable->gpt_corrupt = 1; } table->hdr = prihdr; if (sechdr != NULL) g_free(sechdr); tbl = pritbl; if (sectbl != NULL) g_free(sectbl); } basetable->gpt_first = table->hdr->hdr_lba_start; basetable->gpt_last = table->hdr->hdr_lba_end; basetable->gpt_entries = (table->hdr->hdr_lba_start - 2) * pp->sectorsize / table->hdr->hdr_entsz; for (index = table->hdr->hdr_entries - 1; index >= 0; index--) { if (EQUUID(&tbl[index].ent_type, &gpt_uuid_unused)) continue; entry = (struct g_part_gpt_entry *)g_part_new_entry( basetable, index + 1, tbl[index].ent_lba_start, tbl[index].ent_lba_end); 
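		/*
		 * For example, with the usual hdr_lba_start of 34 on
		 * 512-byte sectors, the computation above yields
		 * (34 - 2) * 512 / 128 = 128 usable entries, and each
		 * in-use table slot becomes a g_part entry numbered
		 * index + 1.
		 */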
entry->ent = tbl[index]; } g_free(tbl); /* * Under Mac OS X, the MBR mirrors the first 4 GPT partitions * if (and only if) any FAT32 or FAT16 partitions have been * created. This happens irrespective of whether Boot Camp is * used/enabled, though it's generally understood to be done * to support legacy Windows under Boot Camp. We refer to this * mirroring simply as Boot Camp. We try to detect Boot Camp * so that we can update the MBR if and when GPT changes have * been made. Note that we do not enable Boot Camp if not * previously enabled because we can't assume that we're on a * Mac alongside Mac OS X. */ table->bootcamp = gpt_is_bootcamp(table, pp->name); return (0); } static int g_part_gpt_recover(struct g_part_table *basetable) { struct g_part_gpt_table *table; struct g_provider *pp; table = (struct g_part_gpt_table *)basetable; pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider; gpt_create_pmbr(table, pp); g_gpt_set_defaults(basetable, pp); basetable->gpt_corrupt = 0; return (0); } static int g_part_gpt_setunset(struct g_part_table *basetable, struct g_part_entry *baseentry, const char *attrib, unsigned int set) { struct g_part_gpt_entry *entry; struct g_part_gpt_table *table; struct g_provider *pp; uint8_t *p; uint64_t attr; int i; table = (struct g_part_gpt_table *)basetable; entry = (struct g_part_gpt_entry *)baseentry; if (strcasecmp(attrib, "active") == 0) { if (table->bootcamp) { /* The active flag must be set on a valid entry. */ if (entry == NULL) return (ENXIO); if (baseentry->gpe_index > NDOSPART) return (EINVAL); for (i = 0; i < NDOSPART; i++) { p = &table->mbr[DOSPARTOFF + i * DOSPARTSIZE]; p[0] = (i == baseentry->gpe_index - 1) ? ((set) ? 0x80 : 0) : 0; } } else { /* The PMBR is marked as active without an entry. */ if (entry != NULL) return (ENXIO); for (i = 0; i < NDOSPART; i++) { p = &table->mbr[DOSPARTOFF + i * DOSPARTSIZE]; p[0] = (p[4] == 0xee) ? ((set) ? 0x80 : 0) : 0; } } return (0); } else if (strcasecmp(attrib, "lenovofix") == 0) { /* * Write the 0xee GPT entry to slot #1 (2nd slot) in the pMBR. * This workaround allows Lenovo X220, T420, T520, etc to boot * from GPT Partitions in BIOS mode. */ if (entry != NULL) return (ENXIO); pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider; bzero(table->mbr + DOSPARTOFF, DOSPARTSIZE * NDOSPART); gpt_write_mbr_entry(table->mbr, ((set) ? 1 : 0), 0xee, 1, MIN(pp->mediasize / pp->sectorsize - 1, UINT32_MAX)); return (0); } if (entry == NULL) return (ENODEV); attr = 0; if (strcasecmp(attrib, "bootme") == 0) { attr |= GPT_ENT_ATTR_BOOTME; } else if (strcasecmp(attrib, "bootonce") == 0) { attr |= GPT_ENT_ATTR_BOOTONCE; if (set) attr |= GPT_ENT_ATTR_BOOTME; } else if (strcasecmp(attrib, "bootfailed") == 0) { /* * It should only be possible to unset BOOTFAILED, but it might * be useful for test purposes to also be able to set it. 
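	 * In practice gptboot(8) consumes these bits: e.g. `gpart set -a
	 * bootonce -i 2 ada0` (hypothetical disk) marks a partition
	 * bootme + bootonce for a single trial boot, and a failed attempt
	 * is what later leaves the entry marked bootfailed.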
*/ attr |= GPT_ENT_ATTR_BOOTFAILED; } if (attr == 0) return (EINVAL); if (set) attr = entry->ent.ent_attr | attr; else attr = entry->ent.ent_attr & ~attr; if (attr != entry->ent.ent_attr) { entry->ent.ent_attr = attr; if (!baseentry->gpe_created) baseentry->gpe_modified = 1; } return (0); } static const char * g_part_gpt_type(struct g_part_table *basetable, struct g_part_entry *baseentry, char *buf, size_t bufsz) { struct g_part_gpt_entry *entry; struct uuid *type; struct g_part_uuid_alias *uap; entry = (struct g_part_gpt_entry *)baseentry; type = &entry->ent.ent_type; for (uap = &gpt_uuid_alias_match[0]; uap->uuid; uap++) if (EQUUID(type, uap->uuid)) return (g_part_alias_name(uap->alias)); buf[0] = '!'; snprintf_uuid(buf + 1, bufsz - 1, type); return (buf); } static int g_part_gpt_write(struct g_part_table *basetable, struct g_consumer *cp) { unsigned char *buf, *bp; struct g_provider *pp; struct g_part_entry *baseentry; struct g_part_gpt_entry *entry; struct g_part_gpt_table *table; size_t tblsz; uint32_t crc; int error, index; pp = cp->provider; table = (struct g_part_gpt_table *)basetable; tblsz = howmany(table->hdr->hdr_entries * table->hdr->hdr_entsz, pp->sectorsize); /* Reconstruct the MBR from the GPT if under Boot Camp. */ if (table->bootcamp) gpt_update_bootcamp(basetable, pp); /* Write the PMBR */ buf = g_malloc(pp->sectorsize, M_WAITOK | M_ZERO); bcopy(table->mbr, buf, MBRSIZE); error = g_write_data(cp, 0, buf, pp->sectorsize); g_free(buf); if (error) return (error); /* Allocate space for the header and entries. */ buf = g_malloc((tblsz + 1) * pp->sectorsize, M_WAITOK | M_ZERO); memcpy(buf, table->hdr->hdr_sig, sizeof(table->hdr->hdr_sig)); le32enc(buf + 8, table->hdr->hdr_revision); le32enc(buf + 12, table->hdr->hdr_size); le64enc(buf + 40, table->hdr->hdr_lba_start); le64enc(buf + 48, table->hdr->hdr_lba_end); le_uuid_enc(buf + 56, &table->hdr->hdr_uuid); le32enc(buf + 80, table->hdr->hdr_entries); le32enc(buf + 84, table->hdr->hdr_entsz); LIST_FOREACH(baseentry, &basetable->gpt_entry, gpe_entry) { if (baseentry->gpe_deleted) continue; entry = (struct g_part_gpt_entry *)baseentry; index = baseentry->gpe_index - 1; bp = buf + pp->sectorsize + table->hdr->hdr_entsz * index; le_uuid_enc(bp, &entry->ent.ent_type); le_uuid_enc(bp + 16, &entry->ent.ent_uuid); le64enc(bp + 32, entry->ent.ent_lba_start); le64enc(bp + 40, entry->ent.ent_lba_end); le64enc(bp + 48, entry->ent.ent_attr); memcpy(bp + 56, entry->ent.ent_name, sizeof(entry->ent.ent_name)); } crc = crc32(buf + pp->sectorsize, table->hdr->hdr_entries * table->hdr->hdr_entsz); le32enc(buf + 88, crc); /* Write primary meta-data. */ le32enc(buf + 16, 0); /* hdr_crc_self. */ le64enc(buf + 24, table->lba[GPT_ELT_PRIHDR]); /* hdr_lba_self. */ le64enc(buf + 32, table->lba[GPT_ELT_SECHDR]); /* hdr_lba_alt. */ le64enc(buf + 72, table->lba[GPT_ELT_PRITBL]); /* hdr_lba_table. */ crc = crc32(buf, table->hdr->hdr_size); le32enc(buf + 16, crc); for (index = 0; index < tblsz; index += MAXPHYS / pp->sectorsize) { error = g_write_data(cp, (table->lba[GPT_ELT_PRITBL] + index) * pp->sectorsize, buf + (index + 1) * pp->sectorsize, (tblsz - index > MAXPHYS / pp->sectorsize) ? MAXPHYS: (tblsz - index) * pp->sectorsize); if (error) goto out; } error = g_write_data(cp, table->lba[GPT_ELT_PRIHDR] * pp->sectorsize, buf, pp->sectorsize); if (error) goto out; /* Write secondary meta-data. */ le32enc(buf + 16, 0); /* hdr_crc_self. */ le64enc(buf + 24, table->lba[GPT_ELT_SECHDR]); /* hdr_lba_self. 
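	 * Note that the self/alt LBA pair is swapped relative to the
	 * primary write-out above, and that the self-CRC at offset 16 is
	 * computed with the field zeroed and then patched in.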
*/ le64enc(buf + 32, table->lba[GPT_ELT_PRIHDR]); /* hdr_lba_alt. */ le64enc(buf + 72, table->lba[GPT_ELT_SECTBL]); /* hdr_lba_table. */ crc = crc32(buf, table->hdr->hdr_size); le32enc(buf + 16, crc); for (index = 0; index < tblsz; index += MAXPHYS / pp->sectorsize) { error = g_write_data(cp, (table->lba[GPT_ELT_SECTBL] + index) * pp->sectorsize, buf + (index + 1) * pp->sectorsize, (tblsz - index > MAXPHYS / pp->sectorsize) ? MAXPHYS: (tblsz - index) * pp->sectorsize); if (error) goto out; } error = g_write_data(cp, table->lba[GPT_ELT_SECHDR] * pp->sectorsize, buf, pp->sectorsize); out: g_free(buf); return (error); } static void g_gpt_set_defaults(struct g_part_table *basetable, struct g_provider *pp) { struct g_part_entry *baseentry; struct g_part_gpt_entry *entry; struct g_part_gpt_table *table; quad_t start, end, min, max; quad_t lba, last; size_t spb, tblsz; table = (struct g_part_gpt_table *)basetable; last = pp->mediasize / pp->sectorsize - 1; tblsz = howmany(basetable->gpt_entries * sizeof(struct gpt_ent), pp->sectorsize); table->lba[GPT_ELT_PRIHDR] = 1; table->lba[GPT_ELT_PRITBL] = 2; table->lba[GPT_ELT_SECHDR] = last; table->lba[GPT_ELT_SECTBL] = last - tblsz; table->state[GPT_ELT_PRIHDR] = GPT_STATE_OK; table->state[GPT_ELT_PRITBL] = GPT_STATE_OK; table->state[GPT_ELT_SECHDR] = GPT_STATE_OK; table->state[GPT_ELT_SECTBL] = GPT_STATE_OK; max = start = 2 + tblsz; min = end = last - tblsz - 1; LIST_FOREACH(baseentry, &basetable->gpt_entry, gpe_entry) { if (baseentry->gpe_deleted) continue; entry = (struct g_part_gpt_entry *)baseentry; if (entry->ent.ent_lba_start < min) min = entry->ent.ent_lba_start; if (entry->ent.ent_lba_end > max) max = entry->ent.ent_lba_end; } spb = 4096 / pp->sectorsize; if (spb > 1) { lba = start + ((start % spb) ? spb - start % spb : 0); if (lba <= min) start = lba; lba = end - (end + 1) % spb; if (max <= lba) end = lba; } table->hdr->hdr_lba_start = start; table->hdr->hdr_lba_end = end; basetable->gpt_first = start; basetable->gpt_last = end; } static void g_gpt_printf_utf16(struct sbuf *sb, uint16_t *str, size_t len) { u_int bo; uint32_t ch; uint16_t c; bo = LITTLE_ENDIAN; /* GPT is little-endian */ while (len > 0 && *str != 0) { ch = (bo == BIG_ENDIAN) ? be16toh(*str) : le16toh(*str); str++, len--; if ((ch & 0xf800) == 0xd800) { if (len > 0) { c = (bo == BIG_ENDIAN) ? be16toh(*str) : le16toh(*str); str++, len--; } else c = 0xfffd; if ((ch & 0x400) == 0 && (c & 0xfc00) == 0xdc00) { ch = ((ch & 0x3ff) << 10) + (c & 0x3ff); ch += 0x10000; } else ch = 0xfffd; } else if (ch == 0xfffe) { /* BOM (U+FEFF) swapped. */ bo = (bo == BIG_ENDIAN) ? LITTLE_ENDIAN : BIG_ENDIAN; continue; } else if (ch == 0xfeff) /* BOM (U+FEFF) unswapped. */ continue; /* Write the Unicode character in UTF-8 */ if (ch < 0x80) g_conf_printf_escaped(sb, "%c", ch); else if (ch < 0x800) g_conf_printf_escaped(sb, "%c%c", 0xc0 | (ch >> 6), 0x80 | (ch & 0x3f)); else if (ch < 0x10000) g_conf_printf_escaped(sb, "%c%c%c", 0xe0 | (ch >> 12), 0x80 | ((ch >> 6) & 0x3f), 0x80 | (ch & 0x3f)); else if (ch < 0x200000) g_conf_printf_escaped(sb, "%c%c%c%c", 0xf0 | (ch >> 18), 0x80 | ((ch >> 12) & 0x3f), 0x80 | ((ch >> 6) & 0x3f), 0x80 | (ch & 0x3f)); } } static void g_gpt_utf8_to_utf16(const uint8_t *s8, uint16_t *s16, size_t s16len) { size_t s16idx, s8idx; uint32_t utfchar; unsigned int c, utfbytes; s8idx = s16idx = 0; utfchar = 0; utfbytes = 0; bzero(s16, s16len << 1); while (s8[s8idx] != 0 && s16idx < s16len) { c = s8[s8idx++]; if ((c & 0xc0) != 0x80) { /* Initial characters. 
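			 * Worked example: U+20AC (the euro sign) arrives as
			 * 0xe2 0x82 0xac.  The initial byte matches
			 * (c & 0xf0) == 0xe0, leaving utfchar = 0x2 and
			 * utfbytes = 2; each continuation byte below shifts
			 * in 6 more bits: ((0x2 << 6 | 0x02) << 6) | 0x2c =
			 * 0x20ac, emitted later as one UTF-16 code unit.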
			 */
			if (utfbytes != 0) {
				/* Incomplete encoding of previous char. */
				s16[s16idx++] = htole16(0xfffd);
			}
			if ((c & 0xf8) == 0xf0) {
				utfchar = c & 0x07;
				utfbytes = 3;
			} else if ((c & 0xf0) == 0xe0) {
				utfchar = c & 0x0f;
				utfbytes = 2;
			} else if ((c & 0xe0) == 0xc0) {
				utfchar = c & 0x1f;
				utfbytes = 1;
			} else {
				utfchar = c & 0x7f;
				utfbytes = 0;
			}
		} else {
			/* Followup characters. */
			if (utfbytes > 0) {
				utfchar = (utfchar << 6) + (c & 0x3f);
				utfbytes--;
			} else if (utfbytes == 0)
				utfbytes = ~0;
		}
		/*
		 * Write the complete Unicode character as UTF-16 when we
		 * have all the UTF-8 characters collected.
		 */
		if (utfbytes == 0) {
			/*
			 * If we need to write 2 UTF-16 characters, but
			 * we only have room for 1, then we truncate the
			 * string by writing a 0 instead.
			 */
			if (utfchar >= 0x10000 && s16idx < s16len - 1) {
				s16[s16idx++] =
				    htole16(0xd800 | ((utfchar >> 10) - 0x40));
				s16[s16idx++] =
				    htole16(0xdc00 | (utfchar & 0x3ff));
			} else
				s16[s16idx++] = (utfchar >= 0x10000) ? 0 :
				    htole16(utfchar);
		}
	}
	/*
	 * If our input string was truncated, append an invalid encoding
	 * character to the output string.
	 */
	if (utfbytes != 0 && s16idx < s16len)
		s16[s16idx++] = htole16(0xfffd);
}
Index: user/markj/netdump/sys/geom/part/g_part_ldm.c
===================================================================
--- user/markj/netdump/sys/geom/part/g_part_ldm.c	(revision 332407)
+++ user/markj/netdump/sys/geom/part/g_part_ldm.c	(revision 332408)
@@ -1,1484 +1,1485 @@
/*-
 * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
 *
 * Copyright (c) 2012 Andrey V. Elsukov
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
 * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
 * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

#include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include
#include "g_part_if.h"

FEATURE(geom_part_ldm, "GEOM partitioning class for LDM support");

SYSCTL_DECL(_kern_geom_part);
static SYSCTL_NODE(_kern_geom_part, OID_AUTO, ldm, CTLFLAG_RW, 0,
    "GEOM_PART_LDM Logical Disk Manager");

static u_int ldm_debug = 0;
SYSCTL_UINT(_kern_geom_part_ldm, OID_AUTO, debug,
    CTLFLAG_RWTUN, &ldm_debug, 0, "Debug level");

/*
 * This allows access to mirrored LDM volumes. Since we do not
 * do mirroring here, it is not enabled by default.
 */
static u_int show_mirrors = 0;
SYSCTL_UINT(_kern_geom_part_ldm, OID_AUTO, show_mirrors,
    CTLFLAG_RWTUN, &show_mirrors, 0, "Show mirrored volumes");

#define LDM_DEBUG(lvl, fmt, ...)	do {				\
	if (ldm_debug >= (lvl)) {					\
		printf("GEOM_PART: " fmt "\n", __VA_ARGS__);		\
	}								\
} while (0)
#define LDM_DUMP(buf, size)		do {				\
	if (ldm_debug > 1) {						\
		hexdump(buf, size, NULL, 0);				\
	}								\
} while (0)

/*
 * These are the internal representations of LDM structures.
 *
 * We do not keep all fields of the on-disk structures, only the most
 * useful ones. All numbers in the on-disk structures are in big-endian
 * format.
 */

/*
 * The private header is 512 bytes long. There are three copies on each
 * disk. Offsets and sizes are in sectors. Location of each copy:
 * - the first offset is relative to the disk start;
 * - the second and third offsets are relative to the LDM database start.
 *
 * On a disk partitioned with GPT, the LDM does not have the first
 * private header.
 */
#define	LDM_PH_MBRINDEX		0
#define	LDM_PH_GPTINDEX		2
static const uint64_t ldm_ph_off[] = {6, 1856, 2047};
#define	LDM_VERSION_2K		0x2000b
#define	LDM_VERSION_VISTA	0x2000c
#define	LDM_PH_VERSION_OFF	0x00c
#define	LDM_PH_DISKGUID_OFF	0x030
#define	LDM_PH_DGGUID_OFF	0x0b0
#define	LDM_PH_DGNAME_OFF	0x0f0
#define	LDM_PH_START_OFF	0x11b
#define	LDM_PH_SIZE_OFF		0x123
#define	LDM_PH_DB_OFF		0x12b
#define	LDM_PH_DBSIZE_OFF	0x133
#define	LDM_PH_TH1_OFF		0x13b
#define	LDM_PH_TH2_OFF		0x143
#define	LDM_PH_CONFSIZE_OFF	0x153
#define	LDM_PH_LOGSIZE_OFF	0x15b
#define	LDM_PH_SIGN		"PRIVHEAD"
struct ldm_privhdr {
	struct uuid	disk_guid;
	struct uuid	dg_guid;
	u_char		dg_name[32];
	uint64_t	start;		/* logical disk start */
	uint64_t	size;		/* logical disk size */
	uint64_t	db_offset;	/* LDM database start */
#define	LDM_DB_SIZE		2048
	uint64_t	db_size;	/* LDM database size */
#define	LDM_TH_COUNT		2
	uint64_t	th_offset[LDM_TH_COUNT]; /* TOC header offsets */
	uint64_t	conf_size;	/* configuration size */
	uint64_t	log_size;	/* size of log */
};

/*
 * The table of contents header is 512 bytes long.
 * There are two identical copies at offsets from the private header.
 * Offsets are relative to the LDM database start.
 */
#define	LDM_TH_SIGN		"TOCBLOCK"
#define	LDM_TH_NAME1		"config"
#define	LDM_TH_NAME2		"log"
#define	LDM_TH_NAME1_OFF	0x024
#define	LDM_TH_CONF_OFF		0x02e
#define	LDM_TH_CONFSIZE_OFF	0x036
#define	LDM_TH_NAME2_OFF	0x046
#define	LDM_TH_LOG_OFF		0x050
#define	LDM_TH_LOGSIZE_OFF	0x058
struct ldm_tochdr {
	uint64_t	conf_offset;	/* configuration offset */
	uint64_t	log_offset;	/* log offset */
};

/*
 * The LDM database header is 512 bytes long.
 */
#define	LDM_VMDB_SIGN		"VMDB"
#define	LDM_DB_LASTSEQ_OFF	0x004
#define	LDM_DB_SIZE_OFF		0x008
#define	LDM_DB_STATUS_OFF	0x010
#define	LDM_DB_VERSION_OFF	0x012
#define	LDM_DB_DGNAME_OFF	0x016
#define	LDM_DB_DGGUID_OFF	0x035
struct ldm_vmdbhdr {
	uint32_t	last_seq;	/* sequence number of last VBLK */
	uint32_t	size;		/* size of VBLK */
};

/*
 * The LDM database configuration section contains a VMDB header and
 * many VBLKs. Each VBLK represents a disk group, disk partition,
 * component or volume.
 *
 * The most interesting for us are volumes; they represent partitions
 * in the GEOM_PART sense. But a volume VBLK does not contain all the
 * information needed to create a GEOM provider, so we have to get this
 * information from the related VBLKs. This is how the VBLKs are
 * related:
 *	Volumes <- Components <- Partitions -> Disks
 *
 * One volume can contain several components. In this case LDM
 * mirrors the volume data to each component.
 *
 * Also each component can contain several partitions (spanned or
 * striped volumes).
 */

struct ldm_component {
	uint64_t	id;		/* object id */
	uint64_t	vol_id;		/* parent volume object id */
	int		count;
	LIST_HEAD(, ldm_partition) partitions;
	LIST_ENTRY(ldm_component) entry;
};

struct ldm_volume {
	uint64_t	id;		/* object id */
	uint64_t	size;		/* volume size */
	uint8_t		number;		/* used for ordering */
	uint8_t		part_type;	/* partition type */
	int		count;
	LIST_HEAD(, ldm_component) components;
	LIST_ENTRY(ldm_volume) entry;
};

struct ldm_disk {
	uint64_t	id;		/* object id */
	struct uuid	guid;		/* disk guid */
	LIST_ENTRY(ldm_disk) entry;
};

#if 0
struct ldm_disk_group {
	uint64_t	id;		/* object id */
	struct uuid	guid;		/* disk group guid */
	u_char		name[32];	/* disk group name */
	LIST_ENTRY(ldm_disk_group) entry;
};
#endif

struct ldm_partition {
	uint64_t	id;		/* object id */
	uint64_t	disk_id;	/* disk object id */
	uint64_t	comp_id;	/* parent component object id */
	uint64_t	start;		/* offset relative to disk start */
	uint64_t	offset;		/* offset for spanned volumes */
	uint64_t	size;		/* partition size */
	LIST_ENTRY(ldm_partition) entry;
};

/*
 * Each VBLK is 128 bytes long and has a standard 16-byte header.
 * Some of a VBLK's fields are fixed size, but others have variable
 * size. Fields with variable size are prefixed with a one-byte length
 * marker. Some fields are strings and can likewise be fixed or
 * variable size. Strings with fixed size are NULL-terminated, others
 * are not. All VBLKs share the same first several fields:
 *	Offset		Size		Description
 *	---------------+---------------+--------------------------
 *	0x00		16		standard VBLK header
 *	0x10		2		update status
 *	0x13		1		VBLK type
 *	0x18		PS		object id
 *	0x18+		PN		object name
 *
 *  o Offset 0x18+ means '0x18 + length of all variable-width fields'
 *  o 'P' in the size column means 'prefixed' (variable-width),
 *    'S' - string, 'N' - number.
 */
#define	LDM_VBLK_SIGN		"VBLK"
#define	LDM_VBLK_SEQ_OFF	0x04
#define	LDM_VBLK_GROUP_OFF	0x08
#define	LDM_VBLK_INDEX_OFF	0x0c
#define	LDM_VBLK_COUNT_OFF	0x0e
#define	LDM_VBLK_TYPE_OFF	0x13
#define	LDM_VBLK_OID_OFF	0x18
struct ldm_vblkhdr {
	uint32_t	seq;	/* sequence number */
	uint32_t	group;	/* group number */
	uint16_t	index;	/* index in the group */
	uint16_t	count;	/* number of entries in the group */
};

#define	LDM_VBLK_T_COMPONENT	0x32
#define	LDM_VBLK_T_PARTITION	0x33
#define	LDM_VBLK_T_DISK		0x34
#define	LDM_VBLK_T_DISKGROUP	0x35
#define	LDM_VBLK_T_DISK4	0x44
#define	LDM_VBLK_T_DISKGROUP4	0x45
#define	LDM_VBLK_T_VOLUME	0x51
struct ldm_vblk {
	uint8_t		type;	/* VBLK type */
	union {
		uint64_t		id;
		struct ldm_volume	vol;
		struct ldm_component	comp;
		struct ldm_disk		disk;
		struct ldm_partition	part;
#if 0
		struct ldm_disk_group	disk_group;
#endif
	} u;
	LIST_ENTRY(ldm_vblk) entry;
};

/*
 * Some VBLKs contain a bit more data than fits into 128 bytes. These
 * VBLKs are called eXtended VBLKs. Before parsing, the data from these
 * VBLKs should be placed into a contiguous memory buffer. We can
 * recognize an xVBLK by the count field in the standard VBLK header
 * (count > 1).
 */
struct ldm_xvblk {
	uint32_t	group;	/* xVBLK group number */
	uint32_t	size;	/* the total size of the xVBLK */
	uint8_t		map;	/* bitmask of currently saved VBLKs */
	u_char		*data;	/* xVBLK data */
	LIST_ENTRY(ldm_xvblk) entry;
};

/* The internal representation of the LDM database. */
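/*
 * A minimal worked example, assuming a simple one-disk dynamic volume:
 * parsing produces one ldm_volume holding one ldm_component, which in
 * turn holds one ldm_partition whose disk_id resolves to the ldm_disk
 * whose guid matches this provider's PRIVHEAD disk_guid.
 */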
*/ struct ldm_db { struct ldm_privhdr ph; /* private header */ struct ldm_tochdr th; /* TOC header */ struct ldm_vmdbhdr dh; /* VMDB header */ LIST_HEAD(, ldm_volume) volumes; LIST_HEAD(, ldm_disk) disks; LIST_HEAD(, ldm_vblk) vblks; LIST_HEAD(, ldm_xvblk) xvblks; }; static struct uuid gpt_uuid_ms_ldm_metadata = GPT_ENT_TYPE_MS_LDM_METADATA; struct g_part_ldm_table { struct g_part_table base; uint64_t db_offset; int is_gpt; }; struct g_part_ldm_entry { struct g_part_entry base; uint8_t type; }; static int g_part_ldm_add(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static int g_part_ldm_bootcode(struct g_part_table *, struct g_part_parms *); static int g_part_ldm_create(struct g_part_table *, struct g_part_parms *); static int g_part_ldm_destroy(struct g_part_table *, struct g_part_parms *); static void g_part_ldm_dumpconf(struct g_part_table *, struct g_part_entry *, struct sbuf *, const char *); static int g_part_ldm_dumpto(struct g_part_table *, struct g_part_entry *); static int g_part_ldm_modify(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static const char *g_part_ldm_name(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_ldm_probe(struct g_part_table *, struct g_consumer *); static int g_part_ldm_read(struct g_part_table *, struct g_consumer *); static const char *g_part_ldm_type(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_ldm_write(struct g_part_table *, struct g_consumer *); static kobj_method_t g_part_ldm_methods[] = { KOBJMETHOD(g_part_add, g_part_ldm_add), KOBJMETHOD(g_part_bootcode, g_part_ldm_bootcode), KOBJMETHOD(g_part_create, g_part_ldm_create), KOBJMETHOD(g_part_destroy, g_part_ldm_destroy), KOBJMETHOD(g_part_dumpconf, g_part_ldm_dumpconf), KOBJMETHOD(g_part_dumpto, g_part_ldm_dumpto), KOBJMETHOD(g_part_modify, g_part_ldm_modify), KOBJMETHOD(g_part_name, g_part_ldm_name), KOBJMETHOD(g_part_probe, g_part_ldm_probe), KOBJMETHOD(g_part_read, g_part_ldm_read), KOBJMETHOD(g_part_type, g_part_ldm_type), KOBJMETHOD(g_part_write, g_part_ldm_write), { 0, 0 } }; static struct g_part_scheme g_part_ldm_scheme = { "LDM", g_part_ldm_methods, sizeof(struct g_part_ldm_table), .gps_entrysz = sizeof(struct g_part_ldm_entry) }; G_PART_SCHEME_DECLARE(g_part_ldm); +MODULE_VERSION(geom_part_ldm, 0); static struct g_part_ldm_alias { u_char typ; int alias; } ldm_alias_match[] = { { DOSPTYP_NTFS, G_PART_ALIAS_MS_NTFS }, { DOSPTYP_FAT32, G_PART_ALIAS_MS_FAT32 }, { DOSPTYP_386BSD, G_PART_ALIAS_FREEBSD }, { DOSPTYP_LDM, G_PART_ALIAS_MS_LDM_DATA }, { DOSPTYP_LINSWP, G_PART_ALIAS_LINUX_SWAP }, { DOSPTYP_LINUX, G_PART_ALIAS_LINUX_DATA }, { DOSPTYP_LINLVM, G_PART_ALIAS_LINUX_LVM }, { DOSPTYP_LINRAID, G_PART_ALIAS_LINUX_RAID }, }; static u_char* ldm_privhdr_read(struct g_consumer *cp, uint64_t off, int *error) { struct g_provider *pp; u_char *buf; pp = cp->provider; buf = g_read_data(cp, off, pp->sectorsize, error); if (buf == NULL) return (NULL); if (memcmp(buf, LDM_PH_SIGN, strlen(LDM_PH_SIGN)) != 0) { LDM_DEBUG(1, "%s: invalid LDM private header signature", pp->name); g_free(buf); buf = NULL; *error = EINVAL; } return (buf); } static int ldm_privhdr_parse(struct g_consumer *cp, struct ldm_privhdr *hdr, const u_char *buf) { uint32_t version; int error; memset(hdr, 0, sizeof(*hdr)); version = be32dec(buf + LDM_PH_VERSION_OFF); if (version != LDM_VERSION_2K && version != LDM_VERSION_VISTA) { LDM_DEBUG(0, "%s: unsupported LDM version %u.%u", cp->provider->name, version >> 16, version & 
0xFFFF); return (ENXIO); } error = parse_uuid(buf + LDM_PH_DISKGUID_OFF, &hdr->disk_guid); if (error != 0) return (error); error = parse_uuid(buf + LDM_PH_DGGUID_OFF, &hdr->dg_guid); if (error != 0) return (error); strncpy(hdr->dg_name, buf + LDM_PH_DGNAME_OFF, sizeof(hdr->dg_name)); hdr->start = be64dec(buf + LDM_PH_START_OFF); hdr->size = be64dec(buf + LDM_PH_SIZE_OFF); hdr->db_offset = be64dec(buf + LDM_PH_DB_OFF); hdr->db_size = be64dec(buf + LDM_PH_DBSIZE_OFF); hdr->th_offset[0] = be64dec(buf + LDM_PH_TH1_OFF); hdr->th_offset[1] = be64dec(buf + LDM_PH_TH2_OFF); hdr->conf_size = be64dec(buf + LDM_PH_CONFSIZE_OFF); hdr->log_size = be64dec(buf + LDM_PH_LOGSIZE_OFF); return (0); } static int ldm_privhdr_check(struct ldm_db *db, struct g_consumer *cp, int is_gpt) { struct g_consumer *cp2; struct g_provider *pp; struct ldm_privhdr hdr; uint64_t offset, last; int error, found, i; u_char *buf; pp = cp->provider; if (is_gpt) { /* * The last LBA is used in several checks below, for the * GPT case it should be calculated relative to the whole * disk. */ cp2 = LIST_FIRST(&pp->geom->consumer); last = cp2->provider->mediasize / cp2->provider->sectorsize - 1; } else last = pp->mediasize / pp->sectorsize - 1; for (found = 0, i = is_gpt; i < nitems(ldm_ph_off); i++) { offset = ldm_ph_off[i]; /* * In the GPT case consumer is attached to the LDM metadata * partition and we don't need add db_offset. */ if (!is_gpt) offset += db->ph.db_offset; if (i == LDM_PH_MBRINDEX) { /* * Prepare to errors and setup new base offset * to read backup private headers. Assume that LDM * database is in the last 1Mbyte area. */ db->ph.db_offset = last - LDM_DB_SIZE; } buf = ldm_privhdr_read(cp, offset * pp->sectorsize, &error); if (buf == NULL) { LDM_DEBUG(1, "%s: failed to read private header " "%d at LBA %ju", pp->name, i, (uintmax_t)offset); continue; } error = ldm_privhdr_parse(cp, &hdr, buf); if (error != 0) { LDM_DEBUG(1, "%s: failed to parse private " "header %d", pp->name, i); LDM_DUMP(buf, pp->sectorsize); g_free(buf); continue; } g_free(buf); if (hdr.start > last || hdr.start + hdr.size - 1 > last || (hdr.start + hdr.size - 1 > hdr.db_offset && !is_gpt) || hdr.db_size != LDM_DB_SIZE || hdr.db_offset + LDM_DB_SIZE - 1 > last || hdr.th_offset[0] >= LDM_DB_SIZE || hdr.th_offset[1] >= LDM_DB_SIZE || hdr.conf_size + hdr.log_size >= LDM_DB_SIZE) { LDM_DEBUG(1, "%s: invalid values in the " "private header %d", pp->name, i); LDM_DEBUG(2, "%s: start: %jd, size: %jd, " "db_offset: %jd, db_size: %jd, th_offset0: %jd, " "th_offset1: %jd, conf_size: %jd, log_size: %jd, " "last: %jd", pp->name, hdr.start, hdr.size, hdr.db_offset, hdr.db_size, hdr.th_offset[0], hdr.th_offset[1], hdr.conf_size, hdr.log_size, last); continue; } if (found != 0 && memcmp(&db->ph, &hdr, sizeof(hdr)) != 0) { LDM_DEBUG(0, "%s: private headers are not equal", pp->name); if (i > 1) { /* * We have different headers in the LDM. * We can not trust this metadata. */ LDM_DEBUG(0, "%s: refuse LDM metadata", pp->name); return (EINVAL); } /* * We already have read primary private header * and it differs from this backup one. * Prefer the backup header and save it. 
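 * The two backup copies themselves must agree: a mismatch seen at
 * index i > 1 makes us refuse the metadata entirely.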
*/ found = 0; } if (found == 0) memcpy(&db->ph, &hdr, sizeof(hdr)); found = 1; } if (found == 0) { LDM_DEBUG(1, "%s: valid LDM private header not found", pp->name); return (ENXIO); } return (0); } static int ldm_gpt_check(struct ldm_db *db, struct g_consumer *cp) { struct g_part_table *gpt; struct g_part_entry *e; struct g_consumer *cp2; int error; cp2 = LIST_NEXT(cp, consumer); g_topology_lock(); gpt = cp->provider->geom->softc; error = 0; LIST_FOREACH(e, &gpt->gpt_entry, gpe_entry) { if (cp->provider == e->gpe_pp) { /* ms-ldm-metadata partition */ if (e->gpe_start != db->ph.db_offset || e->gpe_end != db->ph.db_offset + LDM_DB_SIZE - 1) error++; } else if (cp2->provider == e->gpe_pp) { /* ms-ldm-data partition */ if (e->gpe_start != db->ph.start || e->gpe_end != db->ph.start + db->ph.size - 1) error++; } if (error != 0) { LDM_DEBUG(0, "%s: GPT partition %d boundaries " "do not match with the LDM metadata", e->gpe_pp->name, e->gpe_index); error = ENXIO; break; } } g_topology_unlock(); return (error); } static int ldm_tochdr_check(struct ldm_db *db, struct g_consumer *cp) { struct g_provider *pp; struct ldm_tochdr hdr; uint64_t offset, conf_size, log_size; int error, found, i; u_char *buf; pp = cp->provider; for (i = 0, found = 0; i < LDM_TH_COUNT; i++) { offset = db->ph.db_offset + db->ph.th_offset[i]; buf = g_read_data(cp, offset * pp->sectorsize, pp->sectorsize, &error); if (buf == NULL) { LDM_DEBUG(1, "%s: failed to read TOC header " "at LBA %ju", pp->name, (uintmax_t)offset); continue; } if (memcmp(buf, LDM_TH_SIGN, strlen(LDM_TH_SIGN)) != 0 || memcmp(buf + LDM_TH_NAME1_OFF, LDM_TH_NAME1, strlen(LDM_TH_NAME1)) != 0 || memcmp(buf + LDM_TH_NAME2_OFF, LDM_TH_NAME2, strlen(LDM_TH_NAME2)) != 0) { LDM_DEBUG(1, "%s: failed to parse TOC header " "at LBA %ju", pp->name, (uintmax_t)offset); LDM_DUMP(buf, pp->sectorsize); g_free(buf); continue; } hdr.conf_offset = be64dec(buf + LDM_TH_CONF_OFF); hdr.log_offset = be64dec(buf + LDM_TH_LOG_OFF); conf_size = be64dec(buf + LDM_TH_CONFSIZE_OFF); log_size = be64dec(buf + LDM_TH_LOGSIZE_OFF); if (conf_size != db->ph.conf_size || hdr.conf_offset + conf_size >= LDM_DB_SIZE || log_size != db->ph.log_size || hdr.log_offset + log_size >= LDM_DB_SIZE) { LDM_DEBUG(1, "%s: invalid values in the " "TOC header at LBA %ju", pp->name, (uintmax_t)offset); LDM_DUMP(buf, pp->sectorsize); g_free(buf); continue; } g_free(buf); if (found == 0) memcpy(&db->th, &hdr, sizeof(hdr)); found = 1; } if (found == 0) { LDM_DEBUG(0, "%s: valid LDM TOC header not found.", pp->name); return (ENXIO); } return (0); } static int ldm_vmdbhdr_check(struct ldm_db *db, struct g_consumer *cp) { struct g_provider *pp; struct uuid dg_guid; uint64_t offset; uint32_t version; int error; u_char *buf; pp = cp->provider; offset = db->ph.db_offset + db->th.conf_offset; buf = g_read_data(cp, offset * pp->sectorsize, pp->sectorsize, &error); if (buf == NULL) { LDM_DEBUG(0, "%s: failed to read VMDB header at " "LBA %ju", pp->name, (uintmax_t)offset); return (error); } if (memcmp(buf, LDM_VMDB_SIGN, strlen(LDM_VMDB_SIGN)) != 0) { g_free(buf); LDM_DEBUG(0, "%s: failed to parse VMDB header at " "LBA %ju", pp->name, (uintmax_t)offset); return (ENXIO); } /* Check version. 
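 * Only VMDB version 4.10 (0x4000a) is accepted; any other version
 * is rejected with ENXIO.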
*/ version = be32dec(buf + LDM_DB_VERSION_OFF); if (version != 0x4000A) { g_free(buf); LDM_DEBUG(0, "%s: unsupported VMDB version %u.%u", pp->name, version >> 16, version & 0xFFFF); return (ENXIO); } /* * Check VMDB update status: * 1 - in a consistent state; * 2 - in a creation phase; * 3 - in a deletion phase; */ if (be16dec(buf + LDM_DB_STATUS_OFF) != 1) { g_free(buf); LDM_DEBUG(0, "%s: VMDB is not in a consistent state", pp->name); return (ENXIO); } db->dh.last_seq = be32dec(buf + LDM_DB_LASTSEQ_OFF); db->dh.size = be32dec(buf + LDM_DB_SIZE_OFF); error = parse_uuid(buf + LDM_DB_DGGUID_OFF, &dg_guid); /* Compare disk group name and guid from VMDB and private headers */ if (error != 0 || db->dh.size == 0 || pp->sectorsize % db->dh.size != 0 || strncmp(buf + LDM_DB_DGNAME_OFF, db->ph.dg_name, 31) != 0 || memcmp(&dg_guid, &db->ph.dg_guid, sizeof(dg_guid)) != 0 || db->dh.size * db->dh.last_seq > db->ph.conf_size * pp->sectorsize) { LDM_DEBUG(0, "%s: invalid values in the VMDB header", pp->name); LDM_DUMP(buf, pp->sectorsize); g_free(buf); return (EINVAL); } g_free(buf); return (0); } static int ldm_xvblk_handle(struct ldm_db *db, struct ldm_vblkhdr *vh, const u_char *p) { struct ldm_xvblk *blk; size_t size; size = db->dh.size - 16; LIST_FOREACH(blk, &db->xvblks, entry) if (blk->group == vh->group) break; if (blk == NULL) { blk = g_malloc(sizeof(*blk), M_WAITOK | M_ZERO); blk->group = vh->group; blk->size = size * vh->count + 16; blk->data = g_malloc(blk->size, M_WAITOK | M_ZERO); blk->map = 0xFF << vh->count; LIST_INSERT_HEAD(&db->xvblks, blk, entry); } if ((blk->map & (1 << vh->index)) != 0) { /* Block with given index has been already saved. */ return (EINVAL); } /* Copy the data block to the place related to index. */ memcpy(blk->data + size * vh->index + 16, p + 16, size); blk->map |= 1 << vh->index; return (0); } /* Read the variable-width numeric field and return new offset */ static int ldm_vnum_get(const u_char *buf, int offset, uint64_t *result, size_t range) { uint64_t num; uint8_t len; len = buf[offset++]; if (len > sizeof(uint64_t) || len + offset >= range) return (-1); for (num = 0; len > 0; len--) num = (num << 8) | buf[offset++]; *result = num; return (offset); } /* Read the variable-width string and return new offset */ static int ldm_vstr_get(const u_char *buf, int offset, u_char *result, size_t maxlen, size_t range) { uint8_t len; len = buf[offset++]; if (len >= maxlen || len + offset >= range) return (-1); memcpy(result, buf + offset, len); result[len] = '\0'; return (offset + len); } /* Just skip the variable-width variable and return new offset */ static int ldm_vparm_skip(const u_char *buf, int offset, size_t range) { uint8_t len; len = buf[offset++]; if (offset + len >= range) return (-1); return (offset + len); } static int ldm_vblk_handle(struct ldm_db *db, const u_char *p, size_t size) { struct ldm_vblk *blk; struct ldm_volume *volume, *last; const char *errstr; u_char vstr[64]; int error, offset; blk = g_malloc(sizeof(*blk), M_WAITOK | M_ZERO); blk->type = p[LDM_VBLK_TYPE_OFF]; offset = ldm_vnum_get(p, LDM_VBLK_OID_OFF, &blk->u.id, size); if (offset < 0) { errstr = "object id"; goto fail; } offset = ldm_vstr_get(p, offset, vstr, sizeof(vstr), size); if (offset < 0) { errstr = "object name"; goto fail; } switch (blk->type) { /* * Component VBLK fields: * Offset Size Description * ------------+-------+------------------------ * 0x18+ PS volume state * 0x18+5 PN component children count * 0x1D+16 PN parent's volume object id * 0x2D+1 PN stripe size */ case 
LDM_VBLK_T_COMPONENT: offset = ldm_vparm_skip(p, offset, size); if (offset < 0) { errstr = "volume state"; goto fail; } offset = ldm_vparm_skip(p, offset + 5, size); if (offset < 0) { errstr = "children count"; goto fail; } offset = ldm_vnum_get(p, offset + 16, &blk->u.comp.vol_id, size); if (offset < 0) { errstr = "volume id"; goto fail; } break; /* * Partition VBLK fields: * Offset Size Description * ------------+-------+------------------------ * 0x18+12 8 partition start offset * 0x18+20 8 volume offset * 0x18+28 PN partition size * 0x34+ PN parent's component object id * 0x34+ PN disk's object id */ case LDM_VBLK_T_PARTITION: if (offset + 28 >= size) { errstr = "too small buffer"; goto fail; } blk->u.part.start = be64dec(p + offset + 12); blk->u.part.offset = be64dec(p + offset + 20); offset = ldm_vnum_get(p, offset + 28, &blk->u.part.size, size); if (offset < 0) { errstr = "partition size"; goto fail; } offset = ldm_vnum_get(p, offset, &blk->u.part.comp_id, size); if (offset < 0) { errstr = "component id"; goto fail; } offset = ldm_vnum_get(p, offset, &blk->u.part.disk_id, size); if (offset < 0) { errstr = "disk id"; goto fail; } break; /* * Disk VBLK fields: * Offset Size Description * ------------+-------+------------------------ * 0x18+ PS disk GUID */ case LDM_VBLK_T_DISK: errstr = "disk guid"; offset = ldm_vstr_get(p, offset, vstr, sizeof(vstr), size); if (offset < 0) goto fail; error = parse_uuid(vstr, &blk->u.disk.guid); if (error != 0) goto fail; LIST_INSERT_HEAD(&db->disks, &blk->u.disk, entry); break; /* * Disk group VBLK fields: * Offset Size Description * ------------+-------+------------------------ * 0x18+ PS disk group GUID */ case LDM_VBLK_T_DISKGROUP: #if 0 strncpy(blk->u.disk_group.name, vstr, sizeof(blk->u.disk_group.name)); offset = ldm_vstr_get(p, offset, vstr, sizeof(vstr), size); if (offset < 0) { errstr = "disk group guid"; goto fail; } error = parse_uuid(name, &blk->u.disk_group.guid); if (error != 0) { errstr = "disk group guid"; goto fail; } LIST_INSERT_HEAD(&db->groups, &blk->u.disk_group, entry); #endif break; /* * Disk VBLK fields: * Offset Size Description * ------------+-------+------------------------ * 0x18+ 16 disk GUID */ case LDM_VBLK_T_DISK4: be_uuid_dec(p + offset, &blk->u.disk.guid); LIST_INSERT_HEAD(&db->disks, &blk->u.disk, entry); break; /* * Disk group VBLK fields: * Offset Size Description * ------------+-------+------------------------ * 0x18+ 16 disk GUID */ case LDM_VBLK_T_DISKGROUP4: #if 0 strncpy(blk->u.disk_group.name, vstr, sizeof(blk->u.disk_group.name)); be_uuid_dec(p + offset, &blk->u.disk.guid); LIST_INSERT_HEAD(&db->groups, &blk->u.disk_group, entry); #endif break; /* * Volume VBLK fields: * Offset Size Description * ------------+-------+------------------------ * 0x18+ PS volume type * 0x18+ PS unknown * 0x18+ 14(S) volume state * 0x18+16 1 volume number * 0x18+21 PN volume children count * 0x2D+16 PN volume size * 0x3D+4 1 partition type */ case LDM_VBLK_T_VOLUME: offset = ldm_vparm_skip(p, offset, size); if (offset < 0) { errstr = "volume type"; goto fail; } offset = ldm_vparm_skip(p, offset, size); if (offset < 0) { errstr = "unknown param"; goto fail; } if (offset + 21 >= size) { errstr = "too small buffer"; goto fail; } blk->u.vol.number = p[offset + 16]; offset = ldm_vparm_skip(p, offset + 21, size); if (offset < 0) { errstr = "children count"; goto fail; } offset = ldm_vnum_get(p, offset + 16, &blk->u.vol.size, size); if (offset < 0) { errstr = "volume size"; goto fail; } if (offset + 4 >= size) { errstr = "too small 
buffer"; goto fail; } blk->u.vol.part_type = p[offset + 4]; /* keep volumes ordered by volume number */ last = NULL; LIST_FOREACH(volume, &db->volumes, entry) { if (volume->number > blk->u.vol.number) break; last = volume; } if (last != NULL) LIST_INSERT_AFTER(last, &blk->u.vol, entry); else LIST_INSERT_HEAD(&db->volumes, &blk->u.vol, entry); break; default: LDM_DEBUG(1, "unknown VBLK type 0x%02x\n", blk->type); LDM_DUMP(p, size); } LIST_INSERT_HEAD(&db->vblks, blk, entry); return (0); fail: LDM_DEBUG(0, "failed to parse '%s' in VBLK of type 0x%02x\n", errstr, blk->type); LDM_DUMP(p, size); g_free(blk); return (EINVAL); } static void ldm_vmdb_free(struct ldm_db *db) { struct ldm_vblk *vblk; struct ldm_xvblk *xvblk; while (!LIST_EMPTY(&db->xvblks)) { xvblk = LIST_FIRST(&db->xvblks); LIST_REMOVE(xvblk, entry); g_free(xvblk->data); g_free(xvblk); } while (!LIST_EMPTY(&db->vblks)) { vblk = LIST_FIRST(&db->vblks); LIST_REMOVE(vblk, entry); g_free(vblk); } } static int ldm_vmdb_parse(struct ldm_db *db, struct g_consumer *cp) { struct g_provider *pp; struct ldm_vblk *vblk; struct ldm_xvblk *xvblk; struct ldm_volume *volume; struct ldm_component *comp; struct ldm_vblkhdr vh; u_char *buf, *p; size_t size, n, sectors; uint64_t offset; int error; pp = cp->provider; size = howmany(db->dh.last_seq * db->dh.size, pp->sectorsize); size -= 1; /* one sector takes vmdb header */ for (n = 0; n < size; n += MAXPHYS / pp->sectorsize) { offset = db->ph.db_offset + db->th.conf_offset + n + 1; sectors = (size - n) > (MAXPHYS / pp->sectorsize) ? MAXPHYS / pp->sectorsize: size - n; /* read VBLKs */ buf = g_read_data(cp, offset * pp->sectorsize, sectors * pp->sectorsize, &error); if (buf == NULL) { LDM_DEBUG(0, "%s: failed to read VBLK\n", pp->name); goto fail; } for (p = buf; p < buf + sectors * pp->sectorsize; p += db->dh.size) { if (memcmp(p, LDM_VBLK_SIGN, strlen(LDM_VBLK_SIGN)) != 0) { LDM_DEBUG(0, "%s: no VBLK signature\n", pp->name); LDM_DUMP(p, db->dh.size); goto fail; } vh.seq = be32dec(p + LDM_VBLK_SEQ_OFF); vh.group = be32dec(p + LDM_VBLK_GROUP_OFF); /* skip empty blocks */ if (vh.seq == 0 || vh.group == 0) continue; vh.index = be16dec(p + LDM_VBLK_INDEX_OFF); vh.count = be16dec(p + LDM_VBLK_COUNT_OFF); if (vh.count == 0 || vh.count > 4 || vh.seq > db->dh.last_seq) { LDM_DEBUG(0, "%s: invalid values " "in the VBLK header\n", pp->name); LDM_DUMP(p, db->dh.size); goto fail; } if (vh.count > 1) { error = ldm_xvblk_handle(db, &vh, p); if (error != 0) { LDM_DEBUG(0, "%s: xVBLK " "is corrupted\n", pp->name); LDM_DUMP(p, db->dh.size); goto fail; } continue; } if (be16dec(p + 16) != 0) LDM_DEBUG(1, "%s: VBLK update" " status is %u\n", pp->name, be16dec(p + 16)); error = ldm_vblk_handle(db, p, db->dh.size); if (error != 0) goto fail; } g_free(buf); buf = NULL; } /* Parse xVBLKs */ while (!LIST_EMPTY(&db->xvblks)) { xvblk = LIST_FIRST(&db->xvblks); if (xvblk->map == 0xFF) { error = ldm_vblk_handle(db, xvblk->data, xvblk->size); if (error != 0) goto fail; } else { LDM_DEBUG(0, "%s: incomplete or corrupt " "xVBLK found\n", pp->name); goto fail; } LIST_REMOVE(xvblk, entry); g_free(xvblk->data); g_free(xvblk); } /* construct all VBLKs relations */ LIST_FOREACH(volume, &db->volumes, entry) { LIST_FOREACH(vblk, &db->vblks, entry) if (vblk->type == LDM_VBLK_T_COMPONENT && vblk->u.comp.vol_id == volume->id) { LIST_INSERT_HEAD(&volume->components, &vblk->u.comp, entry); volume->count++; } LIST_FOREACH(comp, &volume->components, entry) LIST_FOREACH(vblk, &db->vblks, entry) if (vblk->type == LDM_VBLK_T_PARTITION && 
vblk->u.part.comp_id == comp->id) { LIST_INSERT_HEAD(&comp->partitions, &vblk->u.part, entry); comp->count++; } } return (0); fail: ldm_vmdb_free(db); g_free(buf); return (ENXIO); } static int g_part_ldm_add(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { return (ENOSYS); } static int g_part_ldm_bootcode(struct g_part_table *basetable, struct g_part_parms *gpp) { return (ENOSYS); } static int g_part_ldm_create(struct g_part_table *basetable, struct g_part_parms *gpp) { return (ENOSYS); } static int g_part_ldm_destroy(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_part_ldm_table *table; struct g_provider *pp; table = (struct g_part_ldm_table *)basetable; /* * To destroy LDM on a disk partitioned with GPT we should delete * ms-ldm-metadata partition, but we can't do this via standard * GEOM_PART method. */ if (table->is_gpt) return (ENOSYS); pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider; /* * To destroy LDM we should wipe MBR, first private header and * backup private headers. */ basetable->gpt_smhead = (1 << ldm_ph_off[0]) | 1; /* * Don't touch last backup private header when LDM database is * not located in the last 1MByte area. * XXX: can't remove all blocks. */ if (table->db_offset + LDM_DB_SIZE == pp->mediasize / pp->sectorsize) basetable->gpt_smtail = 1; return (0); } static void g_part_ldm_dumpconf(struct g_part_table *basetable, struct g_part_entry *baseentry, struct sbuf *sb, const char *indent) { struct g_part_ldm_entry *entry; entry = (struct g_part_ldm_entry *)baseentry; if (indent == NULL) { /* conftxt: libdisk compatibility */ sbuf_printf(sb, " xs LDM xt %u", entry->type); } else if (entry != NULL) { /* confxml: partition entry information */ sbuf_printf(sb, "%s%u\n", indent, entry->type); } else { /* confxml: scheme information */ } } static int g_part_ldm_dumpto(struct g_part_table *table, struct g_part_entry *baseentry) { return (0); } static int g_part_ldm_modify(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { return (ENOSYS); } static const char * g_part_ldm_name(struct g_part_table *table, struct g_part_entry *baseentry, char *buf, size_t bufsz) { snprintf(buf, bufsz, "s%d", baseentry->gpe_index); return (buf); } static int ldm_gpt_probe(struct g_part_table *basetable, struct g_consumer *cp) { struct g_part_ldm_table *table; struct g_part_table *gpt; struct g_part_entry *entry; struct g_consumer *cp2; struct gpt_ent *part; u_char *buf; int error; /* * XXX: We use some knowledge about GEOM_PART_GPT internal * structures, but it is easier than parse GPT by himself. 
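 * Concretely, we take the parent geom's softc as a g_part_table and
 * read each entry's GPT type GUID from the gpt_ent that GEOM_PART_GPT
 * stores directly behind the generic g_part_entry.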
*/ g_topology_lock(); gpt = cp->provider->geom->softc; LIST_FOREACH(entry, &gpt->gpt_entry, gpe_entry) { part = (struct gpt_ent *)(entry + 1); /* Search ms-ldm-metadata partition */ if (memcmp(&part->ent_type, &gpt_uuid_ms_ldm_metadata, sizeof(struct uuid)) != 0 || entry->gpe_end - entry->gpe_start < LDM_DB_SIZE - 1) continue; /* Create new consumer and attach it to metadata partition */ cp2 = g_new_consumer(cp->geom); error = g_attach(cp2, entry->gpe_pp); if (error != 0) { g_destroy_consumer(cp2); g_topology_unlock(); return (ENXIO); } error = g_access(cp2, 1, 0, 0); if (error != 0) { g_detach(cp2); g_destroy_consumer(cp2); g_topology_unlock(); return (ENXIO); } g_topology_unlock(); LDM_DEBUG(2, "%s: LDM metadata partition %s found in the GPT", cp->provider->name, cp2->provider->name); /* Read the LDM private header */ buf = ldm_privhdr_read(cp2, ldm_ph_off[LDM_PH_GPTINDEX] * cp2->provider->sectorsize, &error); if (buf != NULL) { table = (struct g_part_ldm_table *)basetable; table->is_gpt = 1; g_free(buf); return (G_PART_PROBE_PRI_HIGH); } /* second consumer is no longer needed. */ g_topology_lock(); g_access(cp2, -1, 0, 0); g_detach(cp2); g_destroy_consumer(cp2); break; } g_topology_unlock(); return (ENXIO); } static int g_part_ldm_probe(struct g_part_table *basetable, struct g_consumer *cp) { struct g_provider *pp; u_char *buf, type[64]; int error, idx; pp = cp->provider; if (pp->sectorsize != 512) return (ENXIO); error = g_getattr("PART::scheme", cp, &type); if (error == 0 && strcmp(type, "GPT") == 0) { if (g_getattr("PART::type", cp, &type) != 0 || strcmp(type, "ms-ldm-data") != 0) return (ENXIO); error = ldm_gpt_probe(basetable, cp); return (error); } if (basetable->gpt_depth != 0) return (ENXIO); /* LDM has 1M metadata area */ if (pp->mediasize <= 1024 * 1024) return (ENOSPC); /* Check that there's a MBR */ buf = g_read_data(cp, 0, pp->sectorsize, &error); if (buf == NULL) return (error); if (le16dec(buf + DOSMAGICOFFSET) != DOSMAGIC) { g_free(buf); return (ENXIO); } error = ENXIO; /* Check that we have LDM partitions in the MBR */ for (idx = 0; idx < NDOSPART && error != 0; idx++) { if (buf[DOSPARTOFF + idx * DOSPARTSIZE + 4] == DOSPTYP_LDM) error = 0; } g_free(buf); if (error == 0) { LDM_DEBUG(2, "%s: LDM data partitions found in MBR", pp->name); /* Read the LDM private header */ buf = ldm_privhdr_read(cp, ldm_ph_off[LDM_PH_MBRINDEX] * pp->sectorsize, &error); if (buf == NULL) return (error); g_free(buf); return (G_PART_PROBE_PRI_HIGH); } return (error); } static int g_part_ldm_read(struct g_part_table *basetable, struct g_consumer *cp) { struct g_part_ldm_table *table; struct g_part_ldm_entry *entry; struct g_consumer *cp2; struct ldm_component *comp; struct ldm_partition *part; struct ldm_volume *vol; struct ldm_disk *disk; struct ldm_db db; int error, index, skipped; table = (struct g_part_ldm_table *)basetable; memset(&db, 0, sizeof(db)); cp2 = cp; /* ms-ldm-data */ if (table->is_gpt) cp = LIST_FIRST(&cp->geom->consumer); /* ms-ldm-metadata */ /* Read and parse LDM private headers. */ error = ldm_privhdr_check(&db, cp, table->is_gpt); if (error != 0) goto gpt_cleanup; basetable->gpt_first = table->is_gpt ? 
0: db.ph.start; basetable->gpt_last = basetable->gpt_first + db.ph.size - 1; table->db_offset = db.ph.db_offset; /* Make additional checks for GPT */ if (table->is_gpt) { error = ldm_gpt_check(&db, cp); if (error != 0) goto gpt_cleanup; /* * Now we should reset database offset to zero, because our * consumer cp is attached to the ms-ldm-metadata partition * and we don't need add db_offset to read from it. */ db.ph.db_offset = 0; } /* Read and parse LDM TOC headers. */ error = ldm_tochdr_check(&db, cp); if (error != 0) goto gpt_cleanup; /* Read and parse LDM VMDB header. */ error = ldm_vmdbhdr_check(&db, cp); if (error != 0) goto gpt_cleanup; error = ldm_vmdb_parse(&db, cp); /* * For the GPT case we must detach and destroy * second consumer before return. */ gpt_cleanup: if (table->is_gpt) { g_topology_lock(); g_access(cp, -1, 0, 0); g_detach(cp); g_destroy_consumer(cp); g_topology_unlock(); cp = cp2; } if (error != 0) return (error); /* Search current disk in the disk list. */ LIST_FOREACH(disk, &db.disks, entry) if (memcmp(&disk->guid, &db.ph.disk_guid, sizeof(struct uuid)) == 0) break; if (disk == NULL) { LDM_DEBUG(1, "%s: no LDM volumes on this disk", cp->provider->name); ldm_vmdb_free(&db); return (ENXIO); } index = 1; LIST_FOREACH(vol, &db.volumes, entry) { LIST_FOREACH(comp, &vol->components, entry) { /* Skip volumes from different disks. */ part = LIST_FIRST(&comp->partitions); if (part->disk_id != disk->id) continue; skipped = 0; /* We don't support spanned and striped volumes. */ if (comp->count > 1 || part->offset != 0) { LDM_DEBUG(1, "%s: LDM volume component " "%ju has %u partitions. Skipped", cp->provider->name, (uintmax_t)comp->id, comp->count); skipped = 1; } /* * Allow mirrored volumes only when they are explicitly * allowed with kern.geom.part.ldm.show_mirrors=1. */ if (vol->count > 1 && show_mirrors == 0) { LDM_DEBUG(1, "%s: LDM volume %ju has %u " "components. Skipped", cp->provider->name, (uintmax_t)vol->id, vol->count); skipped = 1; } entry = (struct g_part_ldm_entry *)g_part_new_entry( basetable, index++, basetable->gpt_first + part->start, basetable->gpt_first + part->start + part->size - 1); /* * Mark skipped partition as ms-ldm-data partition. * We do not support them, but it is better to show * that we have something there, than just show * free space. 
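 * Such skipped entries get the generic DOSPTYP_LDM type and are
 * reported as "ms-ldm-data" rather than as the real volume type.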
*/ if (skipped == 0) entry->type = vol->part_type; else entry->type = DOSPTYP_LDM; LDM_DEBUG(1, "%s: new volume id: %ju, start: %ju," " end: %ju, type: 0x%02x\n", cp->provider->name, (uintmax_t)part->id,(uintmax_t)part->start + basetable->gpt_first, (uintmax_t)part->start + part->size + basetable->gpt_first - 1, vol->part_type); } } ldm_vmdb_free(&db); return (error); } static const char * g_part_ldm_type(struct g_part_table *basetable, struct g_part_entry *baseentry, char *buf, size_t bufsz) { struct g_part_ldm_entry *entry; int i; entry = (struct g_part_ldm_entry *)baseentry; for (i = 0; i < nitems(ldm_alias_match); i++) { if (ldm_alias_match[i].typ == entry->type) return (g_part_alias_name(ldm_alias_match[i].alias)); } snprintf(buf, bufsz, "!%d", entry->type); return (buf); } static int g_part_ldm_write(struct g_part_table *basetable, struct g_consumer *cp) { return (ENOSYS); } Index: user/markj/netdump/sys/geom/part/g_part_mbr.c =================================================================== --- user/markj/netdump/sys/geom/part/g_part_mbr.c (revision 332407) +++ user/markj/netdump/sys/geom/part/g_part_mbr.c (revision 332408) @@ -1,615 +1,616 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2007, 2008 Marcel Moolenaar * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "g_part_if.h" FEATURE(geom_part_mbr, "GEOM partitioning class for MBR support"); SYSCTL_DECL(_kern_geom_part); static SYSCTL_NODE(_kern_geom_part, OID_AUTO, mbr, CTLFLAG_RW, 0, "GEOM_PART_MBR Master Boot Record"); static u_int enforce_chs = 0; SYSCTL_UINT(_kern_geom_part_mbr, OID_AUTO, enforce_chs, CTLFLAG_RWTUN, &enforce_chs, 0, "Enforce alignment to CHS addressing"); #define MBRSIZE 512 struct g_part_mbr_table { struct g_part_table base; u_char mbr[MBRSIZE]; }; struct g_part_mbr_entry { struct g_part_entry base; struct dos_partition ent; }; static int g_part_mbr_add(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static int g_part_mbr_bootcode(struct g_part_table *, struct g_part_parms *); static int g_part_mbr_create(struct g_part_table *, struct g_part_parms *); static int g_part_mbr_destroy(struct g_part_table *, struct g_part_parms *); static void g_part_mbr_dumpconf(struct g_part_table *, struct g_part_entry *, struct sbuf *, const char *); static int g_part_mbr_dumpto(struct g_part_table *, struct g_part_entry *); static int g_part_mbr_modify(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static const char *g_part_mbr_name(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_mbr_probe(struct g_part_table *, struct g_consumer *); static int g_part_mbr_read(struct g_part_table *, struct g_consumer *); static int g_part_mbr_setunset(struct g_part_table *, struct g_part_entry *, const char *, unsigned int); static const char *g_part_mbr_type(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_mbr_write(struct g_part_table *, struct g_consumer *); static int g_part_mbr_resize(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static kobj_method_t g_part_mbr_methods[] = { KOBJMETHOD(g_part_add, g_part_mbr_add), KOBJMETHOD(g_part_bootcode, g_part_mbr_bootcode), KOBJMETHOD(g_part_create, g_part_mbr_create), KOBJMETHOD(g_part_destroy, g_part_mbr_destroy), KOBJMETHOD(g_part_dumpconf, g_part_mbr_dumpconf), KOBJMETHOD(g_part_dumpto, g_part_mbr_dumpto), KOBJMETHOD(g_part_modify, g_part_mbr_modify), KOBJMETHOD(g_part_resize, g_part_mbr_resize), KOBJMETHOD(g_part_name, g_part_mbr_name), KOBJMETHOD(g_part_probe, g_part_mbr_probe), KOBJMETHOD(g_part_read, g_part_mbr_read), KOBJMETHOD(g_part_setunset, g_part_mbr_setunset), KOBJMETHOD(g_part_type, g_part_mbr_type), KOBJMETHOD(g_part_write, g_part_mbr_write), { 0, 0 } }; static struct g_part_scheme g_part_mbr_scheme = { "MBR", g_part_mbr_methods, sizeof(struct g_part_mbr_table), .gps_entrysz = sizeof(struct g_part_mbr_entry), .gps_minent = NDOSPART, .gps_maxent = NDOSPART, .gps_bootcodesz = MBRSIZE, }; G_PART_SCHEME_DECLARE(g_part_mbr); +MODULE_VERSION(geom_part_mbr, 0); static struct g_part_mbr_alias { u_char typ; int alias; } mbr_alias_match[] = { { DOSPTYP_386BSD, G_PART_ALIAS_FREEBSD }, { DOSPTYP_EXT, G_PART_ALIAS_EBR }, { DOSPTYP_NTFS, G_PART_ALIAS_MS_NTFS }, { DOSPTYP_FAT16, G_PART_ALIAS_MS_FAT16 }, { DOSPTYP_FAT32, G_PART_ALIAS_MS_FAT32 }, { DOSPTYP_EXTLBA, G_PART_ALIAS_EBR }, { DOSPTYP_LDM, G_PART_ALIAS_MS_LDM_DATA }, { DOSPTYP_LINSWP, G_PART_ALIAS_LINUX_SWAP }, { DOSPTYP_LINUX, G_PART_ALIAS_LINUX_DATA }, { DOSPTYP_LINLVM, G_PART_ALIAS_LINUX_LVM }, { DOSPTYP_LINRAID, G_PART_ALIAS_LINUX_RAID }, { DOSPTYP_PPCBOOT, 
G_PART_ALIAS_PREP_BOOT }, { DOSPTYP_VMFS, G_PART_ALIAS_VMFS }, { DOSPTYP_VMKDIAG, G_PART_ALIAS_VMKDIAG }, { DOSPTYP_APPLE_UFS, G_PART_ALIAS_APPLE_UFS }, { DOSPTYP_APPLE_BOOT, G_PART_ALIAS_APPLE_BOOT }, { DOSPTYP_HFS, G_PART_ALIAS_APPLE_HFS }, }; static int mbr_parse_type(const char *type, u_char *dp_typ) { const char *alias; char *endp; long lt; int i; if (type[0] == '!') { lt = strtol(type + 1, &endp, 0); if (type[1] == '\0' || *endp != '\0' || lt <= 0 || lt >= 256) return (EINVAL); *dp_typ = (u_char)lt; return (0); } for (i = 0; i < nitems(mbr_alias_match); i++) { alias = g_part_alias_name(mbr_alias_match[i].alias); if (strcasecmp(type, alias) == 0) { *dp_typ = mbr_alias_match[i].typ; return (0); } } return (EINVAL); } static int mbr_probe_bpb(u_char *bpb) { uint16_t secsz; uint8_t clstsz; #define PO2(x) ((x & (x - 1)) == 0) secsz = le16dec(bpb); if (secsz < 512 || secsz > 4096 || !PO2(secsz)) return (0); clstsz = bpb[2]; if (clstsz < 1 || clstsz > 128 || !PO2(clstsz)) return (0); #undef PO2 return (1); } static void mbr_set_chs(struct g_part_table *table, uint32_t lba, u_char *cylp, u_char *hdp, u_char *secp) { uint32_t cyl, hd, sec; sec = lba % table->gpt_sectors + 1; lba /= table->gpt_sectors; hd = lba % table->gpt_heads; lba /= table->gpt_heads; cyl = lba; if (cyl > 1023) sec = hd = cyl = ~0; *cylp = cyl & 0xff; *hdp = hd & 0xff; *secp = (sec & 0x3f) | ((cyl >> 2) & 0xc0); } static int mbr_align(struct g_part_table *basetable, uint32_t *start, uint32_t *size) { uint32_t sectors; if (enforce_chs == 0) return (0); sectors = basetable->gpt_sectors; if (*size < sectors) return (EINVAL); if (start != NULL && (*start % sectors)) { *size += (*start % sectors) - sectors; *start -= (*start % sectors) - sectors; } if (*size % sectors) *size -= (*size % sectors); if (*size < sectors) return (EINVAL); return (0); } static int g_part_mbr_add(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_mbr_entry *entry; uint32_t start, size; if (gpp->gpp_parms & G_PART_PARM_LABEL) return (EINVAL); entry = (struct g_part_mbr_entry *)baseentry; start = gpp->gpp_start; size = gpp->gpp_size; if (mbr_align(basetable, &start, &size) != 0) return (EINVAL); if (baseentry->gpe_deleted) bzero(&entry->ent, sizeof(entry->ent)); KASSERT(baseentry->gpe_start <= start, ("%s", __func__)); KASSERT(baseentry->gpe_end >= start + size - 1, ("%s", __func__)); baseentry->gpe_start = start; baseentry->gpe_end = start + size - 1; entry->ent.dp_start = start; entry->ent.dp_size = size; mbr_set_chs(basetable, baseentry->gpe_start, &entry->ent.dp_scyl, &entry->ent.dp_shd, &entry->ent.dp_ssect); mbr_set_chs(basetable, baseentry->gpe_end, &entry->ent.dp_ecyl, &entry->ent.dp_ehd, &entry->ent.dp_esect); return (mbr_parse_type(gpp->gpp_type, &entry->ent.dp_typ)); } static int g_part_mbr_bootcode(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_part_mbr_table *table; uint32_t dsn; if (gpp->gpp_codesize != MBRSIZE) return (ENODEV); table = (struct g_part_mbr_table *)basetable; dsn = *(uint32_t *)(table->mbr + DOSDSNOFF); bcopy(gpp->gpp_codeptr, table->mbr, DOSPARTOFF); if (dsn != 0) *(uint32_t *)(table->mbr + DOSDSNOFF) = dsn; return (0); } static int g_part_mbr_create(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_provider *pp; struct g_part_mbr_table *table; pp = gpp->gpp_provider; if (pp->sectorsize < MBRSIZE) return (ENOSPC); basetable->gpt_first = basetable->gpt_sectors; basetable->gpt_last = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX) 
- 1; table = (struct g_part_mbr_table *)basetable; le16enc(table->mbr + DOSMAGICOFFSET, DOSMAGIC); return (0); } static int g_part_mbr_destroy(struct g_part_table *basetable, struct g_part_parms *gpp) { /* Wipe the first sector to clear the partitioning. */ basetable->gpt_smhead |= 1; return (0); } static void g_part_mbr_dumpconf(struct g_part_table *basetable, struct g_part_entry *baseentry, struct sbuf *sb, const char *indent) { struct g_part_mbr_entry *entry; struct g_part_mbr_table *table; uint32_t dsn; table = (struct g_part_mbr_table *)basetable; entry = (struct g_part_mbr_entry *)baseentry; if (indent == NULL) { /* conftxt: libdisk compatibility */ sbuf_printf(sb, " xs MBR xt %u", entry->ent.dp_typ); } else if (entry != NULL) { /* confxml: partition entry information */ sbuf_printf(sb, "%s%u\n", indent, entry->ent.dp_typ); if (entry->ent.dp_flag & 0x80) sbuf_printf(sb, "%sactive\n", indent); dsn = le32dec(table->mbr + DOSDSNOFF); sbuf_printf(sb, "%sHD(%d,MBR,%#08x,%#jx,%#jx)", indent, entry->base.gpe_index, dsn, (intmax_t)entry->base.gpe_start, (intmax_t)(entry->base.gpe_end - entry->base.gpe_start + 1)); sbuf_printf(sb, "\n"); } else { /* confxml: scheme information */ } } static int g_part_mbr_dumpto(struct g_part_table *table, struct g_part_entry *baseentry) { struct g_part_mbr_entry *entry; /* Allow dumping to a FreeBSD partition or Linux swap partition only. */ entry = (struct g_part_mbr_entry *)baseentry; return ((entry->ent.dp_typ == DOSPTYP_386BSD || entry->ent.dp_typ == DOSPTYP_LINSWP) ? 1 : 0); } static int g_part_mbr_modify(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_mbr_entry *entry; if (gpp->gpp_parms & G_PART_PARM_LABEL) return (EINVAL); entry = (struct g_part_mbr_entry *)baseentry; if (gpp->gpp_parms & G_PART_PARM_TYPE) return (mbr_parse_type(gpp->gpp_type, &entry->ent.dp_typ)); return (0); } static int g_part_mbr_resize(struct g_part_table *basetable, struct g_part_entry *baseentry, struct g_part_parms *gpp) { struct g_part_mbr_entry *entry; struct g_provider *pp; uint32_t size; if (baseentry == NULL) { pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider; basetable->gpt_last = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX) - 1; return (0); } size = gpp->gpp_size; if (mbr_align(basetable, NULL, &size) != 0) return (EINVAL); /* XXX: prevent unexpected shrinking. */ pp = baseentry->gpe_pp; if ((g_debugflags & 0x10) == 0 && size < gpp->gpp_size && pp->mediasize / pp->sectorsize > size) return (EBUSY); entry = (struct g_part_mbr_entry *)baseentry; baseentry->gpe_end = baseentry->gpe_start + size - 1; entry->ent.dp_size = size; mbr_set_chs(basetable, baseentry->gpe_end, &entry->ent.dp_ecyl, &entry->ent.dp_ehd, &entry->ent.dp_esect); return (0); } static const char * g_part_mbr_name(struct g_part_table *table, struct g_part_entry *baseentry, char *buf, size_t bufsz) { snprintf(buf, bufsz, "s%d", baseentry->gpe_index); return (buf); } static int g_part_mbr_probe(struct g_part_table *table, struct g_consumer *cp) { char psn[8]; struct g_provider *pp; u_char *buf, *p; int error, index, res, sum; uint16_t magic; pp = cp->provider; /* Sanity-check the provider. */ if (pp->sectorsize < MBRSIZE || pp->mediasize < pp->sectorsize) return (ENOSPC); if (pp->sectorsize > 4096) return (ENXIO); /* We don't nest under an MBR (see EBR instead). */ error = g_getattr("PART::scheme", cp, &psn); if (error == 0 && strcmp(psn, g_part_mbr_scheme.name) == 0) return (ELOOP); /* Check that there's a MBR. 
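 * A valid MBR has the 0x55aa signature at the end of the sector and
 * a status byte of 0x00 or 0x80 in every slot; an all-zero table
 * with a plausible BPB is treated as a boot sector instead.
 */
#if 0
	/*
	 * Compiled-out restatement of the per-slot status check
	 * performed below on the freshly read sector (illustration
	 * only; 0x00 and 0x80 are the only values with the low seven
	 * bits clear):
	 */
	for (index = 0; index < NDOSPART; index++)
		if (buf[DOSPARTOFF + index * DOSPARTSIZE] & 0x7f)
			goto out;
#endif
/*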
*/ buf = g_read_data(cp, 0L, pp->sectorsize, &error); if (buf == NULL) return (error); /* We goto out on mismatch. */ res = ENXIO; magic = le16dec(buf + DOSMAGICOFFSET); if (magic != DOSMAGIC) goto out; for (index = 0; index < NDOSPART; index++) { p = buf + DOSPARTOFF + index * DOSPARTSIZE; if (p[0] != 0 && p[0] != 0x80) goto out; } /* * If the partition table does not consist of all zeroes, * assume we have a MBR. If it's all zeroes, we could have * a boot sector. For example, a boot sector that doesn't * have boot code -- common on non-i386 hardware. In that * case we check if we have a possible BPB. If so, then we * assume we have a boot sector instead. */ sum = 0; for (index = 0; index < NDOSPART * DOSPARTSIZE; index++) sum += buf[DOSPARTOFF + index]; if (sum != 0 || !mbr_probe_bpb(buf + 0x0b)) res = G_PART_PROBE_PRI_NORM; out: g_free(buf); return (res); } static int g_part_mbr_read(struct g_part_table *basetable, struct g_consumer *cp) { struct dos_partition ent; struct g_provider *pp; struct g_part_mbr_table *table; struct g_part_mbr_entry *entry; u_char *buf, *p; off_t chs, msize, first; u_int sectors, heads; int error, index; pp = cp->provider; table = (struct g_part_mbr_table *)basetable; first = basetable->gpt_sectors; msize = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX); buf = g_read_data(cp, 0L, pp->sectorsize, &error); if (buf == NULL) return (error); bcopy(buf, table->mbr, sizeof(table->mbr)); for (index = NDOSPART - 1; index >= 0; index--) { p = buf + DOSPARTOFF + index * DOSPARTSIZE; ent.dp_flag = p[0]; ent.dp_shd = p[1]; ent.dp_ssect = p[2]; ent.dp_scyl = p[3]; ent.dp_typ = p[4]; ent.dp_ehd = p[5]; ent.dp_esect = p[6]; ent.dp_ecyl = p[7]; ent.dp_start = le32dec(p + 8); ent.dp_size = le32dec(p + 12); if (ent.dp_typ == 0 || ent.dp_typ == DOSPTYP_PMBR) continue; if (ent.dp_start == 0 || ent.dp_size == 0) continue; sectors = ent.dp_esect & 0x3f; if (sectors > basetable->gpt_sectors && !basetable->gpt_fixgeom) { g_part_geometry_heads(msize, sectors, &chs, &heads); if (chs != 0) { basetable->gpt_sectors = sectors; basetable->gpt_heads = heads; } } if (ent.dp_start < first) first = ent.dp_start; entry = (struct g_part_mbr_entry *)g_part_new_entry(basetable, index + 1, ent.dp_start, ent.dp_start + ent.dp_size - 1); entry->ent = ent; } basetable->gpt_entries = NDOSPART; basetable->gpt_first = basetable->gpt_sectors; basetable->gpt_last = msize - 1; if (first < basetable->gpt_first) basetable->gpt_first = 1; g_free(buf); return (0); } static int g_part_mbr_setunset(struct g_part_table *table, struct g_part_entry *baseentry, const char *attrib, unsigned int set) { struct g_part_entry *iter; struct g_part_mbr_entry *entry; int changed; if (baseentry == NULL) return (ENODEV); if (strcasecmp(attrib, "active") != 0) return (EINVAL); /* Only one entry can have the active attribute. 
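 * Setting it on one entry therefore clears the 0x80 flag on every
 * other non-deleted entry in the table.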
*/ LIST_FOREACH(iter, &table->gpt_entry, gpe_entry) { if (iter->gpe_deleted) continue; changed = 0; entry = (struct g_part_mbr_entry *)iter; if (iter == baseentry) { if (set && (entry->ent.dp_flag & 0x80) == 0) { entry->ent.dp_flag |= 0x80; changed = 1; } else if (!set && (entry->ent.dp_flag & 0x80)) { entry->ent.dp_flag &= ~0x80; changed = 1; } } else { if (set && (entry->ent.dp_flag & 0x80)) { entry->ent.dp_flag &= ~0x80; changed = 1; } } if (changed && !iter->gpe_created) iter->gpe_modified = 1; } return (0); } static const char * g_part_mbr_type(struct g_part_table *basetable, struct g_part_entry *baseentry, char *buf, size_t bufsz) { struct g_part_mbr_entry *entry; int i; entry = (struct g_part_mbr_entry *)baseentry; for (i = 0; i < nitems(mbr_alias_match); i++) { if (mbr_alias_match[i].typ == entry->ent.dp_typ) return (g_part_alias_name(mbr_alias_match[i].alias)); } snprintf(buf, bufsz, "!%d", entry->ent.dp_typ); return (buf); } static int g_part_mbr_write(struct g_part_table *basetable, struct g_consumer *cp) { struct g_part_entry *baseentry; struct g_part_mbr_entry *entry; struct g_part_mbr_table *table; u_char *p; int error, index; table = (struct g_part_mbr_table *)basetable; baseentry = LIST_FIRST(&basetable->gpt_entry); for (index = 1; index <= basetable->gpt_entries; index++) { p = table->mbr + DOSPARTOFF + (index - 1) * DOSPARTSIZE; entry = (baseentry != NULL && index == baseentry->gpe_index) ? (struct g_part_mbr_entry *)baseentry : NULL; if (entry != NULL && !baseentry->gpe_deleted) { p[0] = entry->ent.dp_flag; p[1] = entry->ent.dp_shd; p[2] = entry->ent.dp_ssect; p[3] = entry->ent.dp_scyl; p[4] = entry->ent.dp_typ; p[5] = entry->ent.dp_ehd; p[6] = entry->ent.dp_esect; p[7] = entry->ent.dp_ecyl; le32enc(p + 8, entry->ent.dp_start); le32enc(p + 12, entry->ent.dp_size); } else bzero(p, DOSPARTSIZE); if (entry != NULL) baseentry = LIST_NEXT(baseentry, gpe_entry); } error = g_write_data(cp, 0, table->mbr, cp->provider->sectorsize); return (error); } Index: user/markj/netdump/sys/geom/part/g_part_vtoc8.c =================================================================== --- user/markj/netdump/sys/geom/part/g_part_vtoc8.c (revision 332407) +++ user/markj/netdump/sys/geom/part/g_part_vtoc8.c (revision 332408) @@ -1,601 +1,602 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2008 Marcel Moolenaar * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "g_part_if.h" FEATURE(geom_part_vtoc8, "GEOM partitioning class for SMI VTOC8 disk labels"); struct g_part_vtoc8_table { struct g_part_table base; struct vtoc8 vtoc; uint32_t secpercyl; }; static int g_part_vtoc8_add(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static int g_part_vtoc8_create(struct g_part_table *, struct g_part_parms *); static int g_part_vtoc8_destroy(struct g_part_table *, struct g_part_parms *); static void g_part_vtoc8_dumpconf(struct g_part_table *, struct g_part_entry *, struct sbuf *, const char *); static int g_part_vtoc8_dumpto(struct g_part_table *, struct g_part_entry *); static int g_part_vtoc8_modify(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static const char *g_part_vtoc8_name(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_vtoc8_probe(struct g_part_table *, struct g_consumer *); static int g_part_vtoc8_read(struct g_part_table *, struct g_consumer *); static const char *g_part_vtoc8_type(struct g_part_table *, struct g_part_entry *, char *, size_t); static int g_part_vtoc8_write(struct g_part_table *, struct g_consumer *); static int g_part_vtoc8_resize(struct g_part_table *, struct g_part_entry *, struct g_part_parms *); static kobj_method_t g_part_vtoc8_methods[] = { KOBJMETHOD(g_part_add, g_part_vtoc8_add), KOBJMETHOD(g_part_create, g_part_vtoc8_create), KOBJMETHOD(g_part_destroy, g_part_vtoc8_destroy), KOBJMETHOD(g_part_dumpconf, g_part_vtoc8_dumpconf), KOBJMETHOD(g_part_dumpto, g_part_vtoc8_dumpto), KOBJMETHOD(g_part_modify, g_part_vtoc8_modify), KOBJMETHOD(g_part_resize, g_part_vtoc8_resize), KOBJMETHOD(g_part_name, g_part_vtoc8_name), KOBJMETHOD(g_part_probe, g_part_vtoc8_probe), KOBJMETHOD(g_part_read, g_part_vtoc8_read), KOBJMETHOD(g_part_type, g_part_vtoc8_type), KOBJMETHOD(g_part_write, g_part_vtoc8_write), { 0, 0 } }; static struct g_part_scheme g_part_vtoc8_scheme = { "VTOC8", g_part_vtoc8_methods, sizeof(struct g_part_vtoc8_table), .gps_entrysz = sizeof(struct g_part_entry), .gps_minent = VTOC8_NPARTS, .gps_maxent = VTOC8_NPARTS, }; G_PART_SCHEME_DECLARE(g_part_vtoc8); +MODULE_VERSION(geom_part_vtoc8, 0); static int vtoc8_parse_type(const char *type, uint16_t *tag) { const char *alias; char *endp; long lt; if (type[0] == '!') { lt = strtol(type + 1, &endp, 0); if (type[1] == '\0' || *endp != '\0' || lt <= 0 || lt >= 65536) return (EINVAL); *tag = (uint16_t)lt; return (0); } alias = g_part_alias_name(G_PART_ALIAS_FREEBSD_NANDFS); if (!strcasecmp(type, alias)) { *tag = VTOC_TAG_FREEBSD_NANDFS; return (0); } alias = g_part_alias_name(G_PART_ALIAS_FREEBSD_SWAP); if (!strcasecmp(type, alias)) { *tag = VTOC_TAG_FREEBSD_SWAP; return (0); } alias = g_part_alias_name(G_PART_ALIAS_FREEBSD_UFS); if (!strcasecmp(type, alias)) { *tag = VTOC_TAG_FREEBSD_UFS; return (0); } alias = 
g_part_alias_name(G_PART_ALIAS_FREEBSD_VINUM); if (!strcasecmp(type, alias)) { *tag = VTOC_TAG_FREEBSD_VINUM; return (0); } alias = g_part_alias_name(G_PART_ALIAS_FREEBSD_ZFS); if (!strcasecmp(type, alias)) { *tag = VTOC_TAG_FREEBSD_ZFS; return (0); } return (EINVAL); } static int vtoc8_align(struct g_part_vtoc8_table *table, uint64_t *start, uint64_t *size) { if (*size < table->secpercyl) return (EINVAL); if (start != NULL && (*start % table->secpercyl)) { *size += (*start % table->secpercyl) - table->secpercyl; *start -= (*start % table->secpercyl) - table->secpercyl; } if (*size % table->secpercyl) *size -= (*size % table->secpercyl); if (*size < table->secpercyl) return (EINVAL); return (0); } static int g_part_vtoc8_add(struct g_part_table *basetable, struct g_part_entry *entry, struct g_part_parms *gpp) { struct g_part_vtoc8_table *table; int error, index; uint64_t start, size; uint16_t tag; if (gpp->gpp_parms & G_PART_PARM_LABEL) return (EINVAL); error = vtoc8_parse_type(gpp->gpp_type, &tag); if (error) return (error); table = (struct g_part_vtoc8_table *)basetable; index = entry->gpe_index - 1; start = gpp->gpp_start; size = gpp->gpp_size; if (vtoc8_align(table, &start, &size) != 0) return (EINVAL); KASSERT(entry->gpe_start <= start, (__func__)); KASSERT(entry->gpe_end >= start + size - 1, (__func__)); entry->gpe_start = start; entry->gpe_end = start + size - 1; be16enc(&table->vtoc.part[index].tag, tag); be16enc(&table->vtoc.part[index].flag, 0); be32enc(&table->vtoc.timestamp[index], 0); be32enc(&table->vtoc.map[index].cyl, start / table->secpercyl); be32enc(&table->vtoc.map[index].nblks, size); return (0); } static int g_part_vtoc8_create(struct g_part_table *basetable, struct g_part_parms *gpp) { struct g_provider *pp; struct g_part_entry *entry; struct g_part_vtoc8_table *table; uint64_t msize; uint32_t acyls, ncyls, pcyls; pp = gpp->gpp_provider; if (pp->sectorsize < sizeof(struct vtoc8)) return (ENOSPC); if (pp->sectorsize > sizeof(struct vtoc8)) return (ENXIO); table = (struct g_part_vtoc8_table *)basetable; msize = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX); table->secpercyl = basetable->gpt_sectors * basetable->gpt_heads; pcyls = msize / table->secpercyl; acyls = 2; ncyls = pcyls - acyls; msize = ncyls * table->secpercyl; sprintf(table->vtoc.ascii, "FreeBSD%lldM cyl %u alt %u hd %u sec %u", (long long)(msize / 2048), ncyls, acyls, basetable->gpt_heads, basetable->gpt_sectors); be32enc(&table->vtoc.version, VTOC_VERSION); be16enc(&table->vtoc.nparts, VTOC8_NPARTS); be32enc(&table->vtoc.sanity, VTOC_SANITY); be16enc(&table->vtoc.rpm, 3600); be16enc(&table->vtoc.physcyls, pcyls); be16enc(&table->vtoc.ncyls, ncyls); be16enc(&table->vtoc.altcyls, acyls); be16enc(&table->vtoc.nheads, basetable->gpt_heads); be16enc(&table->vtoc.nsecs, basetable->gpt_sectors); be16enc(&table->vtoc.magic, VTOC_MAGIC); basetable->gpt_first = 0; basetable->gpt_last = msize - 1; basetable->gpt_isleaf = 1; entry = g_part_new_entry(basetable, VTOC_RAW_PART + 1, basetable->gpt_first, basetable->gpt_last); entry->gpe_internal = 1; be16enc(&table->vtoc.part[VTOC_RAW_PART].tag, VTOC_TAG_BACKUP); be32enc(&table->vtoc.map[VTOC_RAW_PART].nblks, msize); return (0); } static int g_part_vtoc8_destroy(struct g_part_table *basetable, struct g_part_parms *gpp) { /* Wipe the first sector to clear the partitioning. 
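 * The entire VTOC8 label lives in sector 0, so flagging that one
 * sector in gpt_smhead suffices.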
*/ basetable->gpt_smhead |= 1; return (0); } static void g_part_vtoc8_dumpconf(struct g_part_table *basetable, struct g_part_entry *entry, struct sbuf *sb, const char *indent) { struct g_part_vtoc8_table *table; table = (struct g_part_vtoc8_table *)basetable; if (indent == NULL) { /* conftxt: libdisk compatibility */ sbuf_printf(sb, " xs SUN sc %u hd %u alt %u", be16dec(&table->vtoc.nsecs), be16dec(&table->vtoc.nheads), be16dec(&table->vtoc.altcyls)); } else if (entry != NULL) { /* confxml: partition entry information */ sbuf_printf(sb, "%s%u\n", indent, be16dec(&table->vtoc.part[entry->gpe_index - 1].tag)); } else { /* confxml: scheme information */ } } static int g_part_vtoc8_dumpto(struct g_part_table *basetable, struct g_part_entry *entry) { struct g_part_vtoc8_table *table; uint16_t tag; /* * Allow dumping to a swap partition or a partition that * has no type. */ table = (struct g_part_vtoc8_table *)basetable; tag = be16dec(&table->vtoc.part[entry->gpe_index - 1].tag); return ((tag == 0 || tag == VTOC_TAG_FREEBSD_SWAP || tag == VTOC_TAG_SWAP) ? 1 : 0); } static int g_part_vtoc8_modify(struct g_part_table *basetable, struct g_part_entry *entry, struct g_part_parms *gpp) { struct g_part_vtoc8_table *table; int error; uint16_t tag; if (gpp->gpp_parms & G_PART_PARM_LABEL) return (EINVAL); table = (struct g_part_vtoc8_table *)basetable; if (gpp->gpp_parms & G_PART_PARM_TYPE) { error = vtoc8_parse_type(gpp->gpp_type, &tag); if (error) return(error); be16enc(&table->vtoc.part[entry->gpe_index - 1].tag, tag); } return (0); } static int vtoc8_set_rawsize(struct g_part_table *basetable, struct g_provider *pp) { struct g_part_vtoc8_table *table; struct g_part_entry *baseentry; off_t msize; uint32_t acyls, ncyls, pcyls; table = (struct g_part_vtoc8_table *)basetable; msize = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX); pcyls = msize / table->secpercyl; if (pcyls > UINT16_MAX) return (ERANGE); acyls = be16dec(&table->vtoc.altcyls); ncyls = pcyls - acyls; msize = ncyls * table->secpercyl; basetable->gpt_last = msize - 1; bzero(table->vtoc.ascii, sizeof(table->vtoc.ascii)); sprintf(table->vtoc.ascii, "FreeBSD%lldM cyl %u alt %u hd %u sec %u", (long long)(msize / 2048), ncyls, acyls, basetable->gpt_heads, basetable->gpt_sectors); be16enc(&table->vtoc.physcyls, pcyls); be16enc(&table->vtoc.ncyls, ncyls); be32enc(&table->vtoc.map[VTOC_RAW_PART].nblks, msize); if (be32dec(&table->vtoc.sanity) == VTOC_SANITY) be16enc(&table->vtoc.part[VTOC_RAW_PART].tag, VTOC_TAG_BACKUP); LIST_FOREACH(baseentry, &basetable->gpt_entry, gpe_entry) { if (baseentry->gpe_index == VTOC_RAW_PART + 1) { baseentry->gpe_end = basetable->gpt_last; return (0); } } return (ENXIO); } static int g_part_vtoc8_resize(struct g_part_table *basetable, struct g_part_entry *entry, struct g_part_parms *gpp) { struct g_part_vtoc8_table *table; struct g_provider *pp; uint64_t size; if (entry == NULL) { pp = LIST_FIRST(&basetable->gpt_gp->consumer)->provider; return (vtoc8_set_rawsize(basetable, pp)); } table = (struct g_part_vtoc8_table *)basetable; size = gpp->gpp_size; if (vtoc8_align(table, NULL, &size) != 0) return (EINVAL); /* XXX: prevent unexpected shrinking. 
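 * Shrinking below the current provider size is refused with EBUSY
 * unless the 0x10 bit is set in g_debugflags.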
*/ pp = entry->gpe_pp; if ((g_debugflags & 0x10) == 0 && size < gpp->gpp_size && pp->mediasize / pp->sectorsize > size) return (EBUSY); entry->gpe_end = entry->gpe_start + size - 1; be32enc(&table->vtoc.map[entry->gpe_index - 1].nblks, size); return (0); } static const char * g_part_vtoc8_name(struct g_part_table *table, struct g_part_entry *baseentry, char *buf, size_t bufsz) { snprintf(buf, bufsz, "%c", 'a' + baseentry->gpe_index - 1); return (buf); } static int g_part_vtoc8_probe(struct g_part_table *table, struct g_consumer *cp) { struct g_provider *pp; u_char *buf; int error, ofs, res; uint16_t cksum, magic; pp = cp->provider; /* Sanity-check the provider. */ if (pp->sectorsize != sizeof(struct vtoc8)) return (ENOSPC); /* Check that there's a disklabel. */ buf = g_read_data(cp, 0, pp->sectorsize, &error); if (buf == NULL) return (error); res = ENXIO; /* Assume mismatch */ /* Check the magic */ magic = be16dec(buf + offsetof(struct vtoc8, magic)); if (magic != VTOC_MAGIC) goto out; /* Check the sum */ cksum = 0; for (ofs = 0; ofs < sizeof(struct vtoc8); ofs += 2) cksum ^= be16dec(buf + ofs); if (cksum != 0) goto out; res = G_PART_PROBE_PRI_NORM; out: g_free(buf); return (res); } static int g_part_vtoc8_read(struct g_part_table *basetable, struct g_consumer *cp) { struct g_provider *pp; struct g_part_vtoc8_table *table; struct g_part_entry *entry; u_char *buf; off_t chs, msize; uint64_t offset, size; u_int cyls, heads, sectors; int error, index, withtags; uint16_t tag; pp = cp->provider; buf = g_read_data(cp, 0, pp->sectorsize, &error); if (buf == NULL) return (error); table = (struct g_part_vtoc8_table *)basetable; bcopy(buf, &table->vtoc, sizeof(table->vtoc)); g_free(buf); msize = MIN(pp->mediasize / pp->sectorsize, UINT32_MAX); sectors = be16dec(&table->vtoc.nsecs); if (sectors < 1) goto invalid_label; if (sectors != basetable->gpt_sectors && !basetable->gpt_fixgeom) { g_part_geometry_heads(msize, sectors, &chs, &heads); if (chs != 0) { basetable->gpt_sectors = sectors; basetable->gpt_heads = heads; } } heads = be16dec(&table->vtoc.nheads); if (heads < 1) goto invalid_label; if (heads != basetable->gpt_heads && !basetable->gpt_fixgeom) basetable->gpt_heads = heads; /* * Except for ATA disks > 32GB, Solaris uses the native geometry * as reported by the target for the labels while da(4) typically * uses a synthetic one so we don't complain too loudly if these * geometries don't match. */ if (bootverbose && (sectors != basetable->gpt_sectors || heads != basetable->gpt_heads)) printf("GEOM: %s: geometry does not match VTOC8 label " "(label: %uh,%us GEOM: %uh,%us).\n", pp->name, heads, sectors, basetable->gpt_heads, basetable->gpt_sectors); table->secpercyl = heads * sectors; cyls = be16dec(&table->vtoc.ncyls); chs = cyls * table->secpercyl; if (chs < 1 || chs > msize) goto invalid_label; basetable->gpt_first = 0; basetable->gpt_last = chs - 1; basetable->gpt_isleaf = 1; withtags = (be32dec(&table->vtoc.sanity) == VTOC_SANITY) ? 
1 : 0; if (!withtags) { printf("GEOM: %s: adding VTOC8 information.\n", pp->name); be32enc(&table->vtoc.version, VTOC_VERSION); bzero(&table->vtoc.volume, VTOC_VOLUME_LEN); be16enc(&table->vtoc.nparts, VTOC8_NPARTS); bzero(&table->vtoc.part, sizeof(table->vtoc.part)); be32enc(&table->vtoc.sanity, VTOC_SANITY); } basetable->gpt_entries = be16dec(&table->vtoc.nparts); if (basetable->gpt_entries < g_part_vtoc8_scheme.gps_minent || basetable->gpt_entries > g_part_vtoc8_scheme.gps_maxent) goto invalid_label; for (index = basetable->gpt_entries - 1; index >= 0; index--) { offset = be32dec(&table->vtoc.map[index].cyl) * table->secpercyl; size = be32dec(&table->vtoc.map[index].nblks); if (size == 0) continue; if (withtags) tag = be16dec(&table->vtoc.part[index].tag); else tag = (index == VTOC_RAW_PART) ? VTOC_TAG_BACKUP : VTOC_TAG_UNASSIGNED; if (index == VTOC_RAW_PART && tag != VTOC_TAG_BACKUP) continue; if (index != VTOC_RAW_PART && tag == VTOC_TAG_BACKUP) continue; entry = g_part_new_entry(basetable, index + 1, offset, offset + size - 1); if (tag == VTOC_TAG_BACKUP) entry->gpe_internal = 1; if (!withtags) be16enc(&table->vtoc.part[index].tag, tag); } return (0); invalid_label: printf("GEOM: %s: invalid VTOC8 label.\n", pp->name); return (EINVAL); } static const char * g_part_vtoc8_type(struct g_part_table *basetable, struct g_part_entry *entry, char *buf, size_t bufsz) { struct g_part_vtoc8_table *table; uint16_t tag; table = (struct g_part_vtoc8_table *)basetable; tag = be16dec(&table->vtoc.part[entry->gpe_index - 1].tag); if (tag == VTOC_TAG_FREEBSD_NANDFS) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_NANDFS)); if (tag == VTOC_TAG_FREEBSD_SWAP) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_SWAP)); if (tag == VTOC_TAG_FREEBSD_UFS) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_UFS)); if (tag == VTOC_TAG_FREEBSD_VINUM) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_VINUM)); if (tag == VTOC_TAG_FREEBSD_ZFS) return (g_part_alias_name(G_PART_ALIAS_FREEBSD_ZFS)); snprintf(buf, bufsz, "!%d", tag); return (buf); } static int g_part_vtoc8_write(struct g_part_table *basetable, struct g_consumer *cp) { struct g_provider *pp; struct g_part_entry *entry; struct g_part_vtoc8_table *table; uint16_t sum; u_char *p; int error, index, match, offset; pp = cp->provider; table = (struct g_part_vtoc8_table *)basetable; entry = LIST_FIRST(&basetable->gpt_entry); for (index = 0; index < basetable->gpt_entries; index++) { match = (entry != NULL && index == entry->gpe_index - 1) ? 1 : 0; if (match) { if (entry->gpe_deleted) { be16enc(&table->vtoc.part[index].tag, 0); be16enc(&table->vtoc.part[index].flag, 0); be32enc(&table->vtoc.map[index].cyl, 0); be32enc(&table->vtoc.map[index].nblks, 0); } entry = LIST_NEXT(entry, gpe_entry); } } /* Calculate checksum. */ sum = 0; p = (void *)&table->vtoc; for (offset = 0; offset < sizeof(table->vtoc) - 2; offset += 2) sum ^= be16dec(p + offset); be16enc(&table->vtoc.cksum, sum); error = g_write_data(cp, 0, p, pp->sectorsize); return (error); } Index: user/markj/netdump/sys/geom/raid3/g_raid3.c =================================================================== --- user/markj/netdump/sys/geom/raid3/g_raid3.c (revision 332407) +++ user/markj/netdump/sys/geom/raid3/g_raid3.c (revision 332408) @@ -1,3585 +1,3586 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004-2006 Pawel Jakub Dawidek * All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_raid3, "GEOM RAID-3 functionality"); static MALLOC_DEFINE(M_RAID3, "raid3_data", "GEOM_RAID3 Data"); SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, raid3, CTLFLAG_RW, 0, "GEOM_RAID3 stuff"); u_int g_raid3_debug = 0; SYSCTL_UINT(_kern_geom_raid3, OID_AUTO, debug, CTLFLAG_RWTUN, &g_raid3_debug, 0, "Debug level"); static u_int g_raid3_timeout = 4; SYSCTL_UINT(_kern_geom_raid3, OID_AUTO, timeout, CTLFLAG_RWTUN, &g_raid3_timeout, 0, "Time to wait on all raid3 components"); static u_int g_raid3_idletime = 5; SYSCTL_UINT(_kern_geom_raid3, OID_AUTO, idletime, CTLFLAG_RWTUN, &g_raid3_idletime, 0, "Mark components as clean when idling"); static u_int g_raid3_disconnect_on_failure = 1; SYSCTL_UINT(_kern_geom_raid3, OID_AUTO, disconnect_on_failure, CTLFLAG_RWTUN, &g_raid3_disconnect_on_failure, 0, "Disconnect component on I/O failure."); static u_int g_raid3_syncreqs = 2; SYSCTL_UINT(_kern_geom_raid3, OID_AUTO, sync_requests, CTLFLAG_RDTUN, &g_raid3_syncreqs, 0, "Parallel synchronization I/O requests."); static u_int g_raid3_use_malloc = 0; SYSCTL_UINT(_kern_geom_raid3, OID_AUTO, use_malloc, CTLFLAG_RDTUN, &g_raid3_use_malloc, 0, "Use malloc(9) instead of uma(9)."); static u_int g_raid3_n64k = 50; SYSCTL_UINT(_kern_geom_raid3, OID_AUTO, n64k, CTLFLAG_RDTUN, &g_raid3_n64k, 0, "Maximum number of 64kB allocations"); static u_int g_raid3_n16k = 200; SYSCTL_UINT(_kern_geom_raid3, OID_AUTO, n16k, CTLFLAG_RDTUN, &g_raid3_n16k, 0, "Maximum number of 16kB allocations"); static u_int g_raid3_n4k = 1200; SYSCTL_UINT(_kern_geom_raid3, OID_AUTO, n4k, CTLFLAG_RDTUN, &g_raid3_n4k, 0, "Maximum number of 4kB allocations"); static SYSCTL_NODE(_kern_geom_raid3, OID_AUTO, stat, CTLFLAG_RW, 0, "GEOM_RAID3 statistics"); static u_int g_raid3_parity_mismatch = 0; SYSCTL_UINT(_kern_geom_raid3_stat, OID_AUTO, parity_mismatch, CTLFLAG_RD, &g_raid3_parity_mismatch, 0, "Number of failures in VERIFY mode"); #define MSLEEP(ident, mtx, priority, wmesg, timeout) do { \ G_RAID3_DEBUG(4, "%s: Sleeping %p.", __func__, (ident)); \ msleep((ident), (mtx), (priority), (wmesg), 
(timeout)); \ G_RAID3_DEBUG(4, "%s: Woken up %p.", __func__, (ident)); \ } while (0) static eventhandler_tag g_raid3_post_sync = NULL; static int g_raid3_shutdown = 0; static int g_raid3_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp); static g_taste_t g_raid3_taste; static void g_raid3_init(struct g_class *mp); static void g_raid3_fini(struct g_class *mp); struct g_class g_raid3_class = { .name = G_RAID3_CLASS_NAME, .version = G_VERSION, .ctlreq = g_raid3_config, .taste = g_raid3_taste, .destroy_geom = g_raid3_destroy_geom, .init = g_raid3_init, .fini = g_raid3_fini }; static void g_raid3_destroy_provider(struct g_raid3_softc *sc); static int g_raid3_update_disk(struct g_raid3_disk *disk, u_int state); static void g_raid3_update_device(struct g_raid3_softc *sc, boolean_t force); static void g_raid3_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp); static void g_raid3_sync_stop(struct g_raid3_softc *sc, int type); static int g_raid3_register_request(struct bio *pbp); static void g_raid3_sync_release(struct g_raid3_softc *sc); static const char * g_raid3_disk_state2str(int state) { switch (state) { case G_RAID3_DISK_STATE_NODISK: return ("NODISK"); case G_RAID3_DISK_STATE_NONE: return ("NONE"); case G_RAID3_DISK_STATE_NEW: return ("NEW"); case G_RAID3_DISK_STATE_ACTIVE: return ("ACTIVE"); case G_RAID3_DISK_STATE_STALE: return ("STALE"); case G_RAID3_DISK_STATE_SYNCHRONIZING: return ("SYNCHRONIZING"); case G_RAID3_DISK_STATE_DISCONNECTED: return ("DISCONNECTED"); default: return ("INVALID"); } } static const char * g_raid3_device_state2str(int state) { switch (state) { case G_RAID3_DEVICE_STATE_STARTING: return ("STARTING"); case G_RAID3_DEVICE_STATE_DEGRADED: return ("DEGRADED"); case G_RAID3_DEVICE_STATE_COMPLETE: return ("COMPLETE"); default: return ("INVALID"); } } const char * g_raid3_get_diskname(struct g_raid3_disk *disk) { if (disk->d_consumer == NULL || disk->d_consumer->provider == NULL) return ("[unknown]"); return (disk->d_name); } static void * g_raid3_alloc(struct g_raid3_softc *sc, size_t size, int flags) { void *ptr; enum g_raid3_zones zone; if (g_raid3_use_malloc || (zone = g_raid3_zone(size)) == G_RAID3_NUM_ZONES) ptr = malloc(size, M_RAID3, flags); else { ptr = uma_zalloc_arg(sc->sc_zones[zone].sz_zone, &sc->sc_zones[zone], flags); sc->sc_zones[zone].sz_requested++; if (ptr == NULL) sc->sc_zones[zone].sz_failed++; } return (ptr); } static void g_raid3_free(struct g_raid3_softc *sc, void *ptr, size_t size) { enum g_raid3_zones zone; if (g_raid3_use_malloc || (zone = g_raid3_zone(size)) == G_RAID3_NUM_ZONES) free(ptr, M_RAID3); else { uma_zfree_arg(sc->sc_zones[zone].sz_zone, ptr, &sc->sc_zones[zone]); } } static int g_raid3_uma_ctor(void *mem, int size, void *arg, int flags) { struct g_raid3_zone *sz = arg; if (sz->sz_max > 0 && sz->sz_inuse == sz->sz_max) return (ENOMEM); sz->sz_inuse++; return (0); } static void g_raid3_uma_dtor(void *mem, int size, void *arg) { struct g_raid3_zone *sz = arg; sz->sz_inuse--; } #define g_raid3_xor(src, dst, size) \ _g_raid3_xor((uint64_t *)(src), \ (uint64_t *)(dst), (size_t)size) static void _g_raid3_xor(uint64_t *src, uint64_t *dst, size_t size) { KASSERT((size % 128) == 0, ("Invalid size: %zu.", size)); for (; size > 0; size -= 128) { *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); 
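		/* 16 64-bit XORs per pass match the 128-byte stride asserted above. */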
*dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); *dst++ ^= (*src++); } } static int g_raid3_is_zero(struct bio *bp) { static const uint64_t zeros[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; u_char *addr; ssize_t size; size = bp->bio_length; addr = (u_char *)bp->bio_data; for (; size > 0; size -= sizeof(zeros), addr += sizeof(zeros)) { if (bcmp(addr, zeros, sizeof(zeros)) != 0) return (0); } return (1); } /* * --- Events handling functions --- * Events in geom_raid3 are used to maintain disks and device status * from one thread to simplify locking. */ static void g_raid3_event_free(struct g_raid3_event *ep) { free(ep, M_RAID3); } int g_raid3_event_send(void *arg, int state, int flags) { struct g_raid3_softc *sc; struct g_raid3_disk *disk; struct g_raid3_event *ep; int error; ep = malloc(sizeof(*ep), M_RAID3, M_WAITOK); G_RAID3_DEBUG(4, "%s: Sending event %p.", __func__, ep); if ((flags & G_RAID3_EVENT_DEVICE) != 0) { disk = NULL; sc = arg; } else { disk = arg; sc = disk->d_softc; } ep->e_disk = disk; ep->e_state = state; ep->e_flags = flags; ep->e_error = 0; mtx_lock(&sc->sc_events_mtx); TAILQ_INSERT_TAIL(&sc->sc_events, ep, e_next); mtx_unlock(&sc->sc_events_mtx); G_RAID3_DEBUG(4, "%s: Waking up %p.", __func__, sc); mtx_lock(&sc->sc_queue_mtx); wakeup(sc); wakeup(&sc->sc_queue); mtx_unlock(&sc->sc_queue_mtx); if ((flags & G_RAID3_EVENT_DONTWAIT) != 0) return (0); sx_assert(&sc->sc_lock, SX_XLOCKED); G_RAID3_DEBUG(4, "%s: Sleeping %p.", __func__, ep); sx_xunlock(&sc->sc_lock); while ((ep->e_flags & G_RAID3_EVENT_DONE) == 0) { mtx_lock(&sc->sc_events_mtx); MSLEEP(ep, &sc->sc_events_mtx, PRIBIO | PDROP, "r3:event", hz * 5); } error = ep->e_error; g_raid3_event_free(ep); sx_xlock(&sc->sc_lock); return (error); } static struct g_raid3_event * g_raid3_event_get(struct g_raid3_softc *sc) { struct g_raid3_event *ep; mtx_lock(&sc->sc_events_mtx); ep = TAILQ_FIRST(&sc->sc_events); mtx_unlock(&sc->sc_events_mtx); return (ep); } static void g_raid3_event_remove(struct g_raid3_softc *sc, struct g_raid3_event *ep) { mtx_lock(&sc->sc_events_mtx); TAILQ_REMOVE(&sc->sc_events, ep, e_next); mtx_unlock(&sc->sc_events_mtx); } static void g_raid3_event_cancel(struct g_raid3_disk *disk) { struct g_raid3_softc *sc; struct g_raid3_event *ep, *tmpep; sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_XLOCKED); mtx_lock(&sc->sc_events_mtx); TAILQ_FOREACH_SAFE(ep, &sc->sc_events, e_next, tmpep) { if ((ep->e_flags & G_RAID3_EVENT_DEVICE) != 0) continue; if (ep->e_disk != disk) continue; TAILQ_REMOVE(&sc->sc_events, ep, e_next); if ((ep->e_flags & G_RAID3_EVENT_DONTWAIT) != 0) g_raid3_event_free(ep); else { ep->e_error = ECANCELED; wakeup(ep); } } mtx_unlock(&sc->sc_events_mtx); } /* * Return the number of disks in the given state. * If state is equal to -1, count all connected disks. 
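 * Slots in NODISK state are never counted, so -1 effectively returns
 * the number of slots with a component attached.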
*/ u_int g_raid3_ndisks(struct g_raid3_softc *sc, int state) { struct g_raid3_disk *disk; u_int n, ndisks; sx_assert(&sc->sc_lock, SX_LOCKED); for (n = ndisks = 0; n < sc->sc_ndisks; n++) { disk = &sc->sc_disks[n]; if (disk->d_state == G_RAID3_DISK_STATE_NODISK) continue; if (state == -1 || disk->d_state == state) ndisks++; } return (ndisks); } static u_int g_raid3_nrequests(struct g_raid3_softc *sc, struct g_consumer *cp) { struct bio *bp; u_int nreqs = 0; mtx_lock(&sc->sc_queue_mtx); TAILQ_FOREACH(bp, &sc->sc_queue.queue, bio_queue) { if (bp->bio_from == cp) nreqs++; } mtx_unlock(&sc->sc_queue_mtx); return (nreqs); } static int g_raid3_is_busy(struct g_raid3_softc *sc, struct g_consumer *cp) { if (cp->index > 0) { G_RAID3_DEBUG(2, "I/O requests for %s exist, can't destroy it now.", cp->provider->name); return (1); } if (g_raid3_nrequests(sc, cp) > 0) { G_RAID3_DEBUG(2, "I/O requests for %s in queue, can't destroy it now.", cp->provider->name); return (1); } return (0); } static void g_raid3_destroy_consumer(void *arg, int flags __unused) { struct g_consumer *cp; g_topology_assert(); cp = arg; G_RAID3_DEBUG(1, "Consumer %s destroyed.", cp->provider->name); g_detach(cp); g_destroy_consumer(cp); } static void g_raid3_kill_consumer(struct g_raid3_softc *sc, struct g_consumer *cp) { struct g_provider *pp; int retaste_wait; g_topology_assert(); cp->private = NULL; if (g_raid3_is_busy(sc, cp)) return; G_RAID3_DEBUG(2, "Consumer %s destroyed.", cp->provider->name); pp = cp->provider; retaste_wait = 0; if (cp->acw == 1) { if ((pp->geom->flags & G_GEOM_WITHER) == 0) retaste_wait = 1; } G_RAID3_DEBUG(2, "Access %s r%dw%de%d = %d", pp->name, -cp->acr, -cp->acw, -cp->ace, 0); if (cp->acr > 0 || cp->acw > 0 || cp->ace > 0) g_access(cp, -cp->acr, -cp->acw, -cp->ace); if (retaste_wait) { /* * After retaste event was send (inside g_access()), we can send * event to detach and destroy consumer. * A class, which has consumer to the given provider connected * will not receive retaste event for the provider. * This is the way how I ignore retaste events when I close * consumers opened for write: I detach and destroy consumer * after retaste event is sent. */ g_post_event(g_raid3_destroy_consumer, cp, M_WAITOK, NULL); return; } G_RAID3_DEBUG(1, "Consumer %s destroyed.", pp->name); g_detach(cp); g_destroy_consumer(cp); } static int g_raid3_connect_disk(struct g_raid3_disk *disk, struct g_provider *pp) { struct g_consumer *cp; int error; g_topology_assert_not(); KASSERT(disk->d_consumer == NULL, ("Disk already connected (device %s).", disk->d_softc->sc_name)); g_topology_lock(); cp = g_new_consumer(disk->d_softc->sc_geom); error = g_attach(cp, pp); if (error != 0) { g_destroy_consumer(cp); g_topology_unlock(); return (error); } error = g_access(cp, 1, 1, 1); g_topology_unlock(); if (error != 0) { g_detach(cp); g_destroy_consumer(cp); G_RAID3_DEBUG(0, "Cannot open consumer %s (error=%d).", pp->name, error); return (error); } disk->d_consumer = cp; disk->d_consumer->private = disk; disk->d_consumer->index = 0; G_RAID3_DEBUG(2, "Disk %s connected.", g_raid3_get_diskname(disk)); return (0); } static void g_raid3_disconnect_consumer(struct g_raid3_softc *sc, struct g_consumer *cp) { g_topology_assert(); if (cp == NULL) return; if (cp->provider != NULL) g_raid3_kill_consumer(sc, cp); else g_destroy_consumer(cp); } /* * Initialize disk. This means allocate memory, create consumer, attach it * to the provider and open access (r1w1e1) to it. 
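 * On failure the consumer is already cleaned up by
 * g_raid3_connect_disk() and the error is returned through *errorp.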
*/ static struct g_raid3_disk * g_raid3_init_disk(struct g_raid3_softc *sc, struct g_provider *pp, struct g_raid3_metadata *md, int *errorp) { struct g_raid3_disk *disk; int error; disk = &sc->sc_disks[md->md_no]; error = g_raid3_connect_disk(disk, pp); if (error != 0) { if (errorp != NULL) *errorp = error; return (NULL); } disk->d_state = G_RAID3_DISK_STATE_NONE; disk->d_flags = md->md_dflags; if (md->md_provider[0] != '\0') disk->d_flags |= G_RAID3_DISK_FLAG_HARDCODED; disk->d_sync.ds_consumer = NULL; disk->d_sync.ds_offset = md->md_sync_offset; disk->d_sync.ds_offset_done = md->md_sync_offset; disk->d_genid = md->md_genid; disk->d_sync.ds_syncid = md->md_syncid; if (errorp != NULL) *errorp = 0; return (disk); } static void g_raid3_destroy_disk(struct g_raid3_disk *disk) { struct g_raid3_softc *sc; g_topology_assert_not(); sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_XLOCKED); if (disk->d_state == G_RAID3_DISK_STATE_NODISK) return; g_raid3_event_cancel(disk); switch (disk->d_state) { case G_RAID3_DISK_STATE_SYNCHRONIZING: if (sc->sc_syncdisk != NULL) g_raid3_sync_stop(sc, 1); /* FALLTHROUGH */ case G_RAID3_DISK_STATE_NEW: case G_RAID3_DISK_STATE_STALE: case G_RAID3_DISK_STATE_ACTIVE: g_topology_lock(); g_raid3_disconnect_consumer(sc, disk->d_consumer); g_topology_unlock(); disk->d_consumer = NULL; break; default: KASSERT(0 == 1, ("Wrong disk state (%s, %s).", g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); } disk->d_state = G_RAID3_DISK_STATE_NODISK; } static void g_raid3_destroy_device(struct g_raid3_softc *sc) { struct g_raid3_event *ep; struct g_raid3_disk *disk; struct g_geom *gp; struct g_consumer *cp; u_int n; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); gp = sc->sc_geom; if (sc->sc_provider != NULL) g_raid3_destroy_provider(sc); for (n = 0; n < sc->sc_ndisks; n++) { disk = &sc->sc_disks[n]; if (disk->d_state != G_RAID3_DISK_STATE_NODISK) { disk->d_flags &= ~G_RAID3_DISK_FLAG_DIRTY; g_raid3_update_metadata(disk); g_raid3_destroy_disk(disk); } } while ((ep = g_raid3_event_get(sc)) != NULL) { g_raid3_event_remove(sc, ep); if ((ep->e_flags & G_RAID3_EVENT_DONTWAIT) != 0) g_raid3_event_free(ep); else { ep->e_error = ECANCELED; ep->e_flags |= G_RAID3_EVENT_DONE; G_RAID3_DEBUG(4, "%s: Waking up %p.", __func__, ep); mtx_lock(&sc->sc_events_mtx); wakeup(ep); mtx_unlock(&sc->sc_events_mtx); } } callout_drain(&sc->sc_callout); cp = LIST_FIRST(&sc->sc_sync.ds_geom->consumer); g_topology_lock(); if (cp != NULL) g_raid3_disconnect_consumer(sc, cp); g_wither_geom(sc->sc_sync.ds_geom, ENXIO); G_RAID3_DEBUG(0, "Device %s destroyed.", gp->name); g_wither_geom(gp, ENXIO); g_topology_unlock(); if (!g_raid3_use_malloc) { uma_zdestroy(sc->sc_zones[G_RAID3_ZONE_64K].sz_zone); uma_zdestroy(sc->sc_zones[G_RAID3_ZONE_16K].sz_zone); uma_zdestroy(sc->sc_zones[G_RAID3_ZONE_4K].sz_zone); } mtx_destroy(&sc->sc_queue_mtx); mtx_destroy(&sc->sc_events_mtx); sx_xunlock(&sc->sc_lock); sx_destroy(&sc->sc_lock); } static void g_raid3_orphan(struct g_consumer *cp) { struct g_raid3_disk *disk; g_topology_assert(); disk = cp->private; if (disk == NULL) return; disk->d_softc->sc_bump_id = G_RAID3_BUMP_SYNCID; g_raid3_event_send(disk, G_RAID3_DISK_STATE_DISCONNECTED, G_RAID3_EVENT_DONTWAIT); } static int g_raid3_write_metadata(struct g_raid3_disk *disk, struct g_raid3_metadata *md) { struct g_raid3_softc *sc; struct g_consumer *cp; off_t offset, length; u_char *sector; int error = 0; g_topology_assert_not(); sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_LOCKED); cp = 
disk->d_consumer; KASSERT(cp != NULL, ("NULL consumer (%s).", sc->sc_name)); KASSERT(cp->provider != NULL, ("NULL provider (%s).", sc->sc_name)); KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s closed? (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); length = cp->provider->sectorsize; offset = cp->provider->mediasize - length; sector = malloc((size_t)length, M_RAID3, M_WAITOK | M_ZERO); if (md != NULL) raid3_metadata_encode(md, sector); error = g_write_data(cp, offset, sector, length); free(sector, M_RAID3); if (error != 0) { if ((disk->d_flags & G_RAID3_DISK_FLAG_BROKEN) == 0) { G_RAID3_DEBUG(0, "Cannot write metadata on %s " "(device=%s, error=%d).", g_raid3_get_diskname(disk), sc->sc_name, error); disk->d_flags |= G_RAID3_DISK_FLAG_BROKEN; } else { G_RAID3_DEBUG(1, "Cannot write metadata on %s " "(device=%s, error=%d).", g_raid3_get_diskname(disk), sc->sc_name, error); } if (g_raid3_disconnect_on_failure && sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE) { sc->sc_bump_id |= G_RAID3_BUMP_GENID; g_raid3_event_send(disk, G_RAID3_DISK_STATE_DISCONNECTED, G_RAID3_EVENT_DONTWAIT); } } return (error); } int g_raid3_clear_metadata(struct g_raid3_disk *disk) { int error; g_topology_assert_not(); sx_assert(&disk->d_softc->sc_lock, SX_LOCKED); error = g_raid3_write_metadata(disk, NULL); if (error == 0) { G_RAID3_DEBUG(2, "Metadata on %s cleared.", g_raid3_get_diskname(disk)); } else { G_RAID3_DEBUG(0, "Cannot clear metadata on disk %s (error=%d).", g_raid3_get_diskname(disk), error); } return (error); } void g_raid3_fill_metadata(struct g_raid3_disk *disk, struct g_raid3_metadata *md) { struct g_raid3_softc *sc; struct g_provider *pp; sc = disk->d_softc; strlcpy(md->md_magic, G_RAID3_MAGIC, sizeof(md->md_magic)); md->md_version = G_RAID3_VERSION; strlcpy(md->md_name, sc->sc_name, sizeof(md->md_name)); md->md_id = sc->sc_id; md->md_all = sc->sc_ndisks; md->md_genid = sc->sc_genid; md->md_mediasize = sc->sc_mediasize; md->md_sectorsize = sc->sc_sectorsize; md->md_mflags = (sc->sc_flags & G_RAID3_DEVICE_FLAG_MASK); md->md_no = disk->d_no; md->md_syncid = disk->d_sync.ds_syncid; md->md_dflags = (disk->d_flags & G_RAID3_DISK_FLAG_MASK); if (disk->d_state != G_RAID3_DISK_STATE_SYNCHRONIZING) md->md_sync_offset = 0; else { md->md_sync_offset = disk->d_sync.ds_offset_done / (sc->sc_ndisks - 1); } if (disk->d_consumer != NULL && disk->d_consumer->provider != NULL) pp = disk->d_consumer->provider; else pp = NULL; if ((disk->d_flags & G_RAID3_DISK_FLAG_HARDCODED) != 0 && pp != NULL) strlcpy(md->md_provider, pp->name, sizeof(md->md_provider)); else bzero(md->md_provider, sizeof(md->md_provider)); if (pp != NULL) md->md_provsize = pp->mediasize; else md->md_provsize = 0; } void g_raid3_update_metadata(struct g_raid3_disk *disk) { struct g_raid3_softc *sc; struct g_raid3_metadata md; int error; g_topology_assert_not(); sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_LOCKED); g_raid3_fill_metadata(disk, &md); error = g_raid3_write_metadata(disk, &md); if (error == 0) { G_RAID3_DEBUG(2, "Metadata on %s updated.", g_raid3_get_diskname(disk)); } else { G_RAID3_DEBUG(0, "Cannot update metadata on disk %s (error=%d).", g_raid3_get_diskname(disk), error); } } static void g_raid3_bump_syncid(struct g_raid3_softc *sc) { struct g_raid3_disk *disk; u_int n; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); KASSERT(g_raid3_ndisks(sc, G_RAID3_DISK_STATE_ACTIVE) > 0, ("%s called with no active disks (device=%s).", __func__, sc->sc_name)); sc->sc_syncid++; G_RAID3_DEBUG(1, "Device 
%s: syncid bumped to %u.", sc->sc_name, sc->sc_syncid); for (n = 0; n < sc->sc_ndisks; n++) { disk = &sc->sc_disks[n]; if (disk->d_state == G_RAID3_DISK_STATE_ACTIVE || disk->d_state == G_RAID3_DISK_STATE_SYNCHRONIZING) { disk->d_sync.ds_syncid = sc->sc_syncid; g_raid3_update_metadata(disk); } } } static void g_raid3_bump_genid(struct g_raid3_softc *sc) { struct g_raid3_disk *disk; u_int n; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); KASSERT(g_raid3_ndisks(sc, G_RAID3_DISK_STATE_ACTIVE) > 0, ("%s called with no active disks (device=%s).", __func__, sc->sc_name)); sc->sc_genid++; G_RAID3_DEBUG(1, "Device %s: genid bumped to %u.", sc->sc_name, sc->sc_genid); for (n = 0; n < sc->sc_ndisks; n++) { disk = &sc->sc_disks[n]; if (disk->d_state == G_RAID3_DISK_STATE_ACTIVE || disk->d_state == G_RAID3_DISK_STATE_SYNCHRONIZING) { disk->d_genid = sc->sc_genid; g_raid3_update_metadata(disk); } } } static int g_raid3_idle(struct g_raid3_softc *sc, int acw) { struct g_raid3_disk *disk; u_int i; int timeout; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); if (sc->sc_provider == NULL) return (0); if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_NOFAILSYNC) != 0) return (0); if (sc->sc_idle) return (0); if (sc->sc_writes > 0) return (0); if (acw > 0 || (acw == -1 && sc->sc_provider->acw > 0)) { timeout = g_raid3_idletime - (time_uptime - sc->sc_last_write); if (!g_raid3_shutdown && timeout > 0) return (timeout); } sc->sc_idle = 1; for (i = 0; i < sc->sc_ndisks; i++) { disk = &sc->sc_disks[i]; if (disk->d_state != G_RAID3_DISK_STATE_ACTIVE) continue; G_RAID3_DEBUG(1, "Disk %s (device %s) marked as clean.", g_raid3_get_diskname(disk), sc->sc_name); disk->d_flags &= ~G_RAID3_DISK_FLAG_DIRTY; g_raid3_update_metadata(disk); } return (0); } static void g_raid3_unidle(struct g_raid3_softc *sc) { struct g_raid3_disk *disk; u_int i; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_NOFAILSYNC) != 0) return; sc->sc_idle = 0; sc->sc_last_write = time_uptime; for (i = 0; i < sc->sc_ndisks; i++) { disk = &sc->sc_disks[i]; if (disk->d_state != G_RAID3_DISK_STATE_ACTIVE) continue; G_RAID3_DEBUG(1, "Disk %s (device %s) marked as dirty.", g_raid3_get_diskname(disk), sc->sc_name); disk->d_flags |= G_RAID3_DISK_FLAG_DIRTY; g_raid3_update_metadata(disk); } } /* * Treat bio_driver1 field in parent bio as list head and field bio_caller1 * in child bio as pointer to the next element on the list. 
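 * This threads the per-component child bios of a parent request into
 * a singly-linked list without extra allocations; the macros below
 * give head access, iteration, and delete-safe iteration over it.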
*/ #define G_RAID3_HEAD_BIO(pbp) (pbp)->bio_driver1 #define G_RAID3_NEXT_BIO(cbp) (cbp)->bio_caller1 #define G_RAID3_FOREACH_BIO(pbp, bp) \ for ((bp) = G_RAID3_HEAD_BIO(pbp); (bp) != NULL; \ (bp) = G_RAID3_NEXT_BIO(bp)) #define G_RAID3_FOREACH_SAFE_BIO(pbp, bp, tmpbp) \ for ((bp) = G_RAID3_HEAD_BIO(pbp); \ (bp) != NULL && ((tmpbp) = G_RAID3_NEXT_BIO(bp), 1); \ (bp) = (tmpbp)) static void g_raid3_init_bio(struct bio *pbp) { G_RAID3_HEAD_BIO(pbp) = NULL; } static void g_raid3_remove_bio(struct bio *cbp) { struct bio *pbp, *bp; pbp = cbp->bio_parent; if (G_RAID3_HEAD_BIO(pbp) == cbp) G_RAID3_HEAD_BIO(pbp) = G_RAID3_NEXT_BIO(cbp); else { G_RAID3_FOREACH_BIO(pbp, bp) { if (G_RAID3_NEXT_BIO(bp) == cbp) { G_RAID3_NEXT_BIO(bp) = G_RAID3_NEXT_BIO(cbp); break; } } } G_RAID3_NEXT_BIO(cbp) = NULL; } static void g_raid3_replace_bio(struct bio *sbp, struct bio *dbp) { struct bio *pbp, *bp; g_raid3_remove_bio(sbp); pbp = dbp->bio_parent; G_RAID3_NEXT_BIO(sbp) = G_RAID3_NEXT_BIO(dbp); if (G_RAID3_HEAD_BIO(pbp) == dbp) G_RAID3_HEAD_BIO(pbp) = sbp; else { G_RAID3_FOREACH_BIO(pbp, bp) { if (G_RAID3_NEXT_BIO(bp) == dbp) { G_RAID3_NEXT_BIO(bp) = sbp; break; } } } G_RAID3_NEXT_BIO(dbp) = NULL; } static void g_raid3_destroy_bio(struct g_raid3_softc *sc, struct bio *cbp) { struct bio *bp, *pbp; size_t size; pbp = cbp->bio_parent; pbp->bio_children--; KASSERT(cbp->bio_data != NULL, ("NULL bio_data")); size = pbp->bio_length / (sc->sc_ndisks - 1); g_raid3_free(sc, cbp->bio_data, size); if (G_RAID3_HEAD_BIO(pbp) == cbp) { G_RAID3_HEAD_BIO(pbp) = G_RAID3_NEXT_BIO(cbp); G_RAID3_NEXT_BIO(cbp) = NULL; g_destroy_bio(cbp); } else { G_RAID3_FOREACH_BIO(pbp, bp) { if (G_RAID3_NEXT_BIO(bp) == cbp) break; } if (bp != NULL) { KASSERT(G_RAID3_NEXT_BIO(bp) != NULL, ("NULL bp->bio_driver1")); G_RAID3_NEXT_BIO(bp) = G_RAID3_NEXT_BIO(cbp); G_RAID3_NEXT_BIO(cbp) = NULL; } g_destroy_bio(cbp); } } static struct bio * g_raid3_clone_bio(struct g_raid3_softc *sc, struct bio *pbp) { struct bio *bp, *cbp; size_t size; int memflag; cbp = g_clone_bio(pbp); if (cbp == NULL) return (NULL); size = pbp->bio_length / (sc->sc_ndisks - 1); if ((pbp->bio_cflags & G_RAID3_BIO_CFLAG_REGULAR) != 0) memflag = M_WAITOK; else memflag = M_NOWAIT; cbp->bio_data = g_raid3_alloc(sc, size, memflag); if (cbp->bio_data == NULL) { pbp->bio_children--; g_destroy_bio(cbp); return (NULL); } G_RAID3_NEXT_BIO(cbp) = NULL; if (G_RAID3_HEAD_BIO(pbp) == NULL) G_RAID3_HEAD_BIO(pbp) = cbp; else { G_RAID3_FOREACH_BIO(pbp, bp) { if (G_RAID3_NEXT_BIO(bp) == NULL) { G_RAID3_NEXT_BIO(bp) = cbp; break; } } } return (cbp); } static void g_raid3_scatter(struct bio *pbp) { struct g_raid3_softc *sc; struct g_raid3_disk *disk; struct bio *bp, *cbp, *tmpbp; off_t atom, cadd, padd, left; int first; sc = pbp->bio_to->geom->softc; bp = NULL; if ((pbp->bio_pflags & G_RAID3_BIO_PFLAG_NOPARITY) == 0) { /* * Find bio for which we should calculate data. */ G_RAID3_FOREACH_BIO(pbp, cbp) { if ((cbp->bio_cflags & G_RAID3_BIO_CFLAG_PARITY) != 0) { bp = cbp; break; } } KASSERT(bp != NULL, ("NULL parity bio.")); } atom = sc->sc_sectorsize / (sc->sc_ndisks - 1); cadd = padd = 0; for (left = pbp->bio_length; left > 0; left -= sc->sc_sectorsize) { G_RAID3_FOREACH_BIO(pbp, cbp) { if (cbp == bp) continue; bcopy(pbp->bio_data + padd, cbp->bio_data + cadd, atom); padd += atom; } cadd += atom; } if ((pbp->bio_pflags & G_RAID3_BIO_PFLAG_NOPARITY) == 0) { /* * Calculate parity. 
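 * Parity is the XOR of all data components: the first data bio is
 * copied into the parity bio and every remaining one is XORed in
 * (e.g., with three data components, parity = d0 ^ d1 ^ d2).
 * Components flagged NODISK exist only for this calculation and are
 * destroyed as soon as they have been folded in.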
*/ first = 1; G_RAID3_FOREACH_SAFE_BIO(pbp, cbp, tmpbp) { if (cbp == bp) continue; if (first) { bcopy(cbp->bio_data, bp->bio_data, bp->bio_length); first = 0; } else { g_raid3_xor(cbp->bio_data, bp->bio_data, bp->bio_length); } if ((cbp->bio_cflags & G_RAID3_BIO_CFLAG_NODISK) != 0) g_raid3_destroy_bio(sc, cbp); } } G_RAID3_FOREACH_SAFE_BIO(pbp, cbp, tmpbp) { struct g_consumer *cp; disk = cbp->bio_caller2; cp = disk->d_consumer; cbp->bio_to = cp->provider; G_RAID3_LOGREQ(3, cbp, "Sending request."); KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); cp->index++; sc->sc_writes++; g_io_request(cbp, cp); } } static void g_raid3_gather(struct bio *pbp) { struct g_raid3_softc *sc; struct g_raid3_disk *disk; struct bio *xbp, *fbp, *cbp; off_t atom, cadd, padd, left; sc = pbp->bio_to->geom->softc; /* * Find bio for which we have to calculate data. * While going through this path, check if all requests * succeeded, if not, deny whole request. * If we're in COMPLETE mode, we allow one request to fail, * so if we find one, we're sending it to the parity consumer. * If there are more failed requests, we deny whole request. */ xbp = fbp = NULL; G_RAID3_FOREACH_BIO(pbp, cbp) { if ((cbp->bio_cflags & G_RAID3_BIO_CFLAG_PARITY) != 0) { KASSERT(xbp == NULL, ("More than one parity bio.")); xbp = cbp; } if (cbp->bio_error == 0) continue; /* * Found failed request. */ if (fbp == NULL) { if ((pbp->bio_pflags & G_RAID3_BIO_PFLAG_DEGRADED) != 0) { /* * We are already in degraded mode, so we can't * accept any failures. */ if (pbp->bio_error == 0) pbp->bio_error = cbp->bio_error; } else { fbp = cbp; } } else { /* * Next failed request, that's too many. */ if (pbp->bio_error == 0) pbp->bio_error = fbp->bio_error; } disk = cbp->bio_caller2; if (disk == NULL) continue; if ((disk->d_flags & G_RAID3_DISK_FLAG_BROKEN) == 0) { disk->d_flags |= G_RAID3_DISK_FLAG_BROKEN; G_RAID3_LOGREQ(0, cbp, "Request failed (error=%d).", cbp->bio_error); } else { G_RAID3_LOGREQ(1, cbp, "Request failed (error=%d).", cbp->bio_error); } if (g_raid3_disconnect_on_failure && sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE) { sc->sc_bump_id |= G_RAID3_BUMP_GENID; g_raid3_event_send(disk, G_RAID3_DISK_STATE_DISCONNECTED, G_RAID3_EVENT_DONTWAIT); } } if (pbp->bio_error != 0) goto finish; if (fbp != NULL && (pbp->bio_pflags & G_RAID3_BIO_PFLAG_VERIFY) != 0) { pbp->bio_pflags &= ~G_RAID3_BIO_PFLAG_VERIFY; if (xbp != fbp) g_raid3_replace_bio(xbp, fbp); g_raid3_destroy_bio(sc, fbp); } else if (fbp != NULL) { struct g_consumer *cp; /* * One request failed, so send the same request to * the parity consumer. */ disk = pbp->bio_driver2; if (disk->d_state != G_RAID3_DISK_STATE_ACTIVE) { pbp->bio_error = fbp->bio_error; goto finish; } pbp->bio_pflags |= G_RAID3_BIO_PFLAG_DEGRADED; pbp->bio_inbed--; fbp->bio_flags &= ~(BIO_DONE | BIO_ERROR); if (disk->d_no == sc->sc_ndisks - 1) fbp->bio_cflags |= G_RAID3_BIO_CFLAG_PARITY; fbp->bio_error = 0; fbp->bio_completed = 0; fbp->bio_children = 0; fbp->bio_inbed = 0; cp = disk->d_consumer; fbp->bio_caller2 = disk; fbp->bio_to = cp->provider; G_RAID3_LOGREQ(3, fbp, "Sending request (recover)."); KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); cp->index++; g_io_request(fbp, cp); return; } if (xbp != NULL) { /* * Calculate parity. 
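 * Rebuild the missing component by XORing all other components into
 * xbp. In VERIFY mode nothing is missing, so the XOR of the data and
 * parity components must come out all-zero; a non-zero result is
 * counted as a parity mismatch and fails the request with EIO.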
*/ G_RAID3_FOREACH_BIO(pbp, cbp) { if ((cbp->bio_cflags & G_RAID3_BIO_CFLAG_PARITY) != 0) continue; g_raid3_xor(cbp->bio_data, xbp->bio_data, xbp->bio_length); } xbp->bio_cflags &= ~G_RAID3_BIO_CFLAG_PARITY; if ((pbp->bio_pflags & G_RAID3_BIO_PFLAG_VERIFY) != 0) { if (!g_raid3_is_zero(xbp)) { g_raid3_parity_mismatch++; pbp->bio_error = EIO; goto finish; } g_raid3_destroy_bio(sc, xbp); } } atom = sc->sc_sectorsize / (sc->sc_ndisks - 1); cadd = padd = 0; for (left = pbp->bio_length; left > 0; left -= sc->sc_sectorsize) { G_RAID3_FOREACH_BIO(pbp, cbp) { bcopy(cbp->bio_data + cadd, pbp->bio_data + padd, atom); pbp->bio_completed += atom; padd += atom; } cadd += atom; } finish: if (pbp->bio_error == 0) G_RAID3_LOGREQ(3, pbp, "Request finished."); else { if ((pbp->bio_pflags & G_RAID3_BIO_PFLAG_VERIFY) != 0) G_RAID3_LOGREQ(1, pbp, "Verification error."); else G_RAID3_LOGREQ(0, pbp, "Request failed."); } pbp->bio_pflags &= ~G_RAID3_BIO_PFLAG_MASK; while ((cbp = G_RAID3_HEAD_BIO(pbp)) != NULL) g_raid3_destroy_bio(sc, cbp); g_io_deliver(pbp, pbp->bio_error); } static void g_raid3_done(struct bio *bp) { struct g_raid3_softc *sc; sc = bp->bio_from->geom->softc; bp->bio_cflags |= G_RAID3_BIO_CFLAG_REGULAR; G_RAID3_LOGREQ(3, bp, "Regular request done (error=%d).", bp->bio_error); mtx_lock(&sc->sc_queue_mtx); bioq_insert_head(&sc->sc_queue, bp); mtx_unlock(&sc->sc_queue_mtx); wakeup(sc); wakeup(&sc->sc_queue); } static void g_raid3_regular_request(struct bio *cbp) { struct g_raid3_softc *sc; struct g_raid3_disk *disk; struct bio *pbp; g_topology_assert_not(); pbp = cbp->bio_parent; sc = pbp->bio_to->geom->softc; cbp->bio_from->index--; if (cbp->bio_cmd == BIO_WRITE) sc->sc_writes--; disk = cbp->bio_from->private; if (disk == NULL) { g_topology_lock(); g_raid3_kill_consumer(sc, cbp->bio_from); g_topology_unlock(); } G_RAID3_LOGREQ(3, cbp, "Request finished."); pbp->bio_inbed++; KASSERT(pbp->bio_inbed <= pbp->bio_children, ("bio_inbed (%u) is bigger than bio_children (%u).", pbp->bio_inbed, pbp->bio_children)); if (pbp->bio_inbed != pbp->bio_children) return; switch (pbp->bio_cmd) { case BIO_READ: g_raid3_gather(pbp); break; case BIO_WRITE: case BIO_DELETE: { int error = 0; pbp->bio_completed = pbp->bio_length; while ((cbp = G_RAID3_HEAD_BIO(pbp)) != NULL) { if (cbp->bio_error == 0) { g_raid3_destroy_bio(sc, cbp); continue; } if (error == 0) error = cbp->bio_error; else if (pbp->bio_error == 0) { /* * Next failed request, that's too many. */ pbp->bio_error = error; } disk = cbp->bio_caller2; if (disk == NULL) { g_raid3_destroy_bio(sc, cbp); continue; } if ((disk->d_flags & G_RAID3_DISK_FLAG_BROKEN) == 0) { disk->d_flags |= G_RAID3_DISK_FLAG_BROKEN; G_RAID3_LOGREQ(0, cbp, "Request failed (error=%d).", cbp->bio_error); } else { G_RAID3_LOGREQ(1, cbp, "Request failed (error=%d).", cbp->bio_error); } if (g_raid3_disconnect_on_failure && sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE) { sc->sc_bump_id |= G_RAID3_BUMP_GENID; g_raid3_event_send(disk, G_RAID3_DISK_STATE_DISCONNECTED, G_RAID3_EVENT_DONTWAIT); } g_raid3_destroy_bio(sc, cbp); } if (pbp->bio_error == 0) G_RAID3_LOGREQ(3, pbp, "Request finished."); else G_RAID3_LOGREQ(0, pbp, "Request failed."); pbp->bio_pflags &= ~G_RAID3_BIO_PFLAG_DEGRADED; pbp->bio_pflags &= ~G_RAID3_BIO_PFLAG_NOPARITY; bioq_remove(&sc->sc_inflight, pbp); /* Release delayed sync requests if possible. 
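 * The write just left the inflight queue, so a synchronization
 * request that was delayed because it overlapped this region may be
 * safe to issue now.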
*/ g_raid3_sync_release(sc); g_io_deliver(pbp, pbp->bio_error); break; } } } static void g_raid3_sync_done(struct bio *bp) { struct g_raid3_softc *sc; G_RAID3_LOGREQ(3, bp, "Synchronization request delivered."); sc = bp->bio_from->geom->softc; bp->bio_cflags |= G_RAID3_BIO_CFLAG_SYNC; mtx_lock(&sc->sc_queue_mtx); bioq_insert_head(&sc->sc_queue, bp); mtx_unlock(&sc->sc_queue_mtx); wakeup(sc); wakeup(&sc->sc_queue); } static void g_raid3_flush(struct g_raid3_softc *sc, struct bio *bp) { struct bio_queue_head queue; struct g_raid3_disk *disk; struct g_consumer *cp; struct bio *cbp; u_int i; bioq_init(&queue); for (i = 0; i < sc->sc_ndisks; i++) { disk = &sc->sc_disks[i]; if (disk->d_state != G_RAID3_DISK_STATE_ACTIVE) continue; cbp = g_clone_bio(bp); if (cbp == NULL) { for (cbp = bioq_first(&queue); cbp != NULL; cbp = bioq_first(&queue)) { bioq_remove(&queue, cbp); g_destroy_bio(cbp); } if (bp->bio_error == 0) bp->bio_error = ENOMEM; g_io_deliver(bp, bp->bio_error); return; } bioq_insert_tail(&queue, cbp); cbp->bio_done = g_std_done; cbp->bio_caller1 = disk; cbp->bio_to = disk->d_consumer->provider; } for (cbp = bioq_first(&queue); cbp != NULL; cbp = bioq_first(&queue)) { bioq_remove(&queue, cbp); G_RAID3_LOGREQ(3, cbp, "Sending request."); disk = cbp->bio_caller1; cbp->bio_caller1 = NULL; cp = disk->d_consumer; KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); g_io_request(cbp, disk->d_consumer); } } static void g_raid3_start(struct bio *bp) { struct g_raid3_softc *sc; sc = bp->bio_to->geom->softc; /* * If sc == NULL or there are no valid disks, provider's error * should be set and g_raid3_start() should not be called at all. */ KASSERT(sc != NULL && (sc->sc_state == G_RAID3_DEVICE_STATE_DEGRADED || sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE), ("Provider's error should be set (error=%d)(device=%s).", bp->bio_to->error, bp->bio_to->name)); G_RAID3_LOGREQ(3, bp, "Request received."); switch (bp->bio_cmd) { case BIO_READ: case BIO_WRITE: case BIO_DELETE: break; case BIO_FLUSH: g_raid3_flush(sc, bp); return; case BIO_GETATTR: default: g_io_deliver(bp, EOPNOTSUPP); return; } mtx_lock(&sc->sc_queue_mtx); bioq_insert_tail(&sc->sc_queue, bp); mtx_unlock(&sc->sc_queue_mtx); G_RAID3_DEBUG(4, "%s: Waking up %p.", __func__, sc); wakeup(sc); } /* * Return TRUE if the given request is colliding with a in-progress * synchronization request. */ static int g_raid3_sync_collision(struct g_raid3_softc *sc, struct bio *bp) { struct g_raid3_disk *disk; struct bio *sbp; off_t rstart, rend, sstart, send; int i; disk = sc->sc_syncdisk; if (disk == NULL) return (0); rstart = bp->bio_offset; rend = bp->bio_offset + bp->bio_length; for (i = 0; i < g_raid3_syncreqs; i++) { sbp = disk->d_sync.ds_bios[i]; if (sbp == NULL) continue; sstart = sbp->bio_offset; send = sbp->bio_length; if (sbp->bio_cmd == BIO_WRITE) { sstart *= sc->sc_ndisks - 1; send *= sc->sc_ndisks - 1; } send += sstart; if (rend > sstart && rstart < send) return (1); } return (0); } /* * Return TRUE if the given sync request is colliding with a in-progress regular * request. 
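 * The sync request's byte range is simply intersected with every
 * regular request still sitting on the inflight queue.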
 */
static int
g_raid3_regular_collision(struct g_raid3_softc *sc, struct bio *sbp)
{
	off_t rstart, rend, sstart, send;
	struct bio *bp;

	if (sc->sc_syncdisk == NULL)
		return (0);
	sstart = sbp->bio_offset;
	send = sstart + sbp->bio_length;
	TAILQ_FOREACH(bp, &sc->sc_inflight.queue, bio_queue) {
		rstart = bp->bio_offset;
		rend = bp->bio_offset + bp->bio_length;
		if (rend > sstart && rstart < send)
			return (1);
	}
	return (0);
}

/*
 * Put the request onto the delayed queue.
 */
static void
g_raid3_regular_delay(struct g_raid3_softc *sc, struct bio *bp)
{

	G_RAID3_LOGREQ(2, bp, "Delaying request.");
	bioq_insert_head(&sc->sc_regular_delayed, bp);
}

/*
 * Put the synchronization request onto the delayed queue.
 */
static void
g_raid3_sync_delay(struct g_raid3_softc *sc, struct bio *bp)
{

	G_RAID3_LOGREQ(2, bp, "Delaying synchronization request.");
	bioq_insert_tail(&sc->sc_sync_delayed, bp);
}

/*
 * Release delayed regular requests which no longer collide with
 * synchronization requests.
 */
static void
g_raid3_regular_release(struct g_raid3_softc *sc)
{
	struct bio *bp, *bp2;

	TAILQ_FOREACH_SAFE(bp, &sc->sc_regular_delayed.queue, bio_queue, bp2) {
		if (g_raid3_sync_collision(sc, bp))
			continue;
		bioq_remove(&sc->sc_regular_delayed, bp);
		G_RAID3_LOGREQ(2, bp, "Releasing delayed request (%p).", bp);
		mtx_lock(&sc->sc_queue_mtx);
		bioq_insert_head(&sc->sc_queue, bp);
#if 0
		/*
		 * wakeup() is not needed, because this function is called
		 * from the worker thread.
		 */
		wakeup(&sc->sc_queue);
#endif
		mtx_unlock(&sc->sc_queue_mtx);
	}
}

/*
 * Release delayed sync requests which no longer collide with regular
 * requests.
 */
static void
g_raid3_sync_release(struct g_raid3_softc *sc)
{
	struct bio *bp, *bp2;

	TAILQ_FOREACH_SAFE(bp, &sc->sc_sync_delayed.queue, bio_queue, bp2) {
		if (g_raid3_regular_collision(sc, bp))
			continue;
		bioq_remove(&sc->sc_sync_delayed, bp);
		G_RAID3_LOGREQ(2, bp,
		    "Releasing delayed synchronization request.");
		g_io_request(bp, bp->bio_from);
	}
}

/*
 * Handle synchronization requests.
 * Every synchronization request is a two-step process: first, a READ
 * request is sent to the active provider, and then a WRITE request
 * (with the read data) is sent to the provider being synchronized.
 * When the WRITE is finished, a new synchronization request is sent.
 */
static void
g_raid3_sync_request(struct bio *bp)
{
	struct g_raid3_softc *sc;
	struct g_raid3_disk *disk;

	bp->bio_from->index--;
	sc = bp->bio_from->geom->softc;
	disk = bp->bio_from->private;
	if (disk == NULL) {
		sx_xunlock(&sc->sc_lock); /* Avoid recursion on sc_lock. */
		g_topology_lock();
		g_raid3_kill_consumer(sc, bp->bio_from);
		g_topology_unlock();
		free(bp->bio_data, M_RAID3);
		g_destroy_bio(bp);
		sx_xlock(&sc->sc_lock);
		return;
	}

	/*
	 * Synchronization request.
	 */
	switch (bp->bio_cmd) {
	case BIO_READ: {
		struct g_consumer *cp;
		u_char *dst, *src;
		off_t left;
		u_int atom;

		if (bp->bio_error != 0) {
			G_RAID3_LOGREQ(0, bp,
			    "Synchronization request failed (error=%d).",
			    bp->bio_error);
			g_destroy_bio(bp);
			return;
		}
		G_RAID3_LOGREQ(3, bp, "Synchronization request finished.");
		atom = sc->sc_sectorsize / (sc->sc_ndisks - 1);
		dst = src = bp->bio_data;
		if (disk->d_no == sc->sc_ndisks - 1) {
			u_int n;

			/* Parity component. */
			for (left = bp->bio_length; left > 0;
			    left -= sc->sc_sectorsize) {
				bcopy(src, dst, atom);
				src += atom;
				for (n = 1; n < sc->sc_ndisks - 1; n++) {
					g_raid3_xor(src, dst, atom);
					src += atom;
				}
				dst += atom;
			}
		} else {
			/* Regular component.
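			 * The READ returned whole striped sectors; pick this
			 * component's atom (at offset atom * d_no within each
			 * sector) out of every sector and pack the atoms
			 * contiguously for the WRITE to the disk being
			 * synchronized.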
*/ src += atom * disk->d_no; for (left = bp->bio_length; left > 0; left -= sc->sc_sectorsize) { bcopy(src, dst, atom); src += sc->sc_sectorsize; dst += atom; } } bp->bio_driver1 = bp->bio_driver2 = NULL; bp->bio_pflags = 0; bp->bio_offset /= sc->sc_ndisks - 1; bp->bio_length /= sc->sc_ndisks - 1; bp->bio_cmd = BIO_WRITE; bp->bio_cflags = 0; bp->bio_children = bp->bio_inbed = 0; cp = disk->d_consumer; KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); cp->index++; g_io_request(bp, cp); return; } case BIO_WRITE: { struct g_raid3_disk_sync *sync; off_t boffset, moffset; void *data; int i; if (bp->bio_error != 0) { G_RAID3_LOGREQ(0, bp, "Synchronization request failed (error=%d).", bp->bio_error); g_destroy_bio(bp); sc->sc_bump_id |= G_RAID3_BUMP_GENID; g_raid3_event_send(disk, G_RAID3_DISK_STATE_DISCONNECTED, G_RAID3_EVENT_DONTWAIT); return; } G_RAID3_LOGREQ(3, bp, "Synchronization request finished."); sync = &disk->d_sync; if (sync->ds_offset == sc->sc_mediasize / (sc->sc_ndisks - 1) || sync->ds_consumer == NULL || (sc->sc_flags & G_RAID3_DEVICE_FLAG_DESTROY) != 0) { /* Don't send more synchronization requests. */ sync->ds_inflight--; if (sync->ds_bios != NULL) { i = (int)(uintptr_t)bp->bio_caller1; sync->ds_bios[i] = NULL; } free(bp->bio_data, M_RAID3); g_destroy_bio(bp); if (sync->ds_inflight > 0) return; if (sync->ds_consumer == NULL || (sc->sc_flags & G_RAID3_DEVICE_FLAG_DESTROY) != 0) { return; } /* * Disk up-to-date, activate it. */ g_raid3_event_send(disk, G_RAID3_DISK_STATE_ACTIVE, G_RAID3_EVENT_DONTWAIT); return; } /* Send next synchronization request. */ data = bp->bio_data; g_reset_bio(bp); bp->bio_cmd = BIO_READ; bp->bio_offset = sync->ds_offset * (sc->sc_ndisks - 1); bp->bio_length = MIN(MAXPHYS, sc->sc_mediasize - bp->bio_offset); sync->ds_offset += bp->bio_length / (sc->sc_ndisks - 1); bp->bio_done = g_raid3_sync_done; bp->bio_data = data; bp->bio_from = sync->ds_consumer; bp->bio_to = sc->sc_provider; G_RAID3_LOGREQ(3, bp, "Sending synchronization request."); sync->ds_consumer->index++; /* * Delay the request if it is colliding with a regular request. */ if (g_raid3_regular_collision(sc, bp)) g_raid3_sync_delay(sc, bp); else g_io_request(bp, sync->ds_consumer); /* Release delayed requests if possible. */ g_raid3_regular_release(sc); /* Find the smallest offset. */ moffset = sc->sc_mediasize; for (i = 0; i < g_raid3_syncreqs; i++) { bp = sync->ds_bios[i]; boffset = bp->bio_offset; if (bp->bio_cmd == BIO_WRITE) boffset *= sc->sc_ndisks - 1; if (boffset < moffset) moffset = boffset; } if (sync->ds_offset_done + (MAXPHYS * 100) < moffset) { /* Update offset_done on every 100 blocks. 
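 * That is, only once the smallest in-flight offset has advanced by
 * 100 * MAXPHYS; this keeps metadata writes infrequent during a
 * rebuild while bounding how much has to be redone after a crash.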
*/ sync->ds_offset_done = moffset; g_raid3_update_metadata(disk); } return; } default: KASSERT(1 == 0, ("Invalid command here: %u (device=%s)", bp->bio_cmd, sc->sc_name)); break; } } static int g_raid3_register_request(struct bio *pbp) { struct g_raid3_softc *sc; struct g_raid3_disk *disk; struct g_consumer *cp; struct bio *cbp, *tmpbp; off_t offset, length; u_int n, ndisks; int round_robin, verify; ndisks = 0; sc = pbp->bio_to->geom->softc; if ((pbp->bio_cflags & G_RAID3_BIO_CFLAG_REGSYNC) != 0 && sc->sc_syncdisk == NULL) { g_io_deliver(pbp, EIO); return (0); } g_raid3_init_bio(pbp); length = pbp->bio_length / (sc->sc_ndisks - 1); offset = pbp->bio_offset / (sc->sc_ndisks - 1); round_robin = verify = 0; switch (pbp->bio_cmd) { case BIO_READ: if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_VERIFY) != 0 && sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE) { pbp->bio_pflags |= G_RAID3_BIO_PFLAG_VERIFY; verify = 1; ndisks = sc->sc_ndisks; } else { verify = 0; ndisks = sc->sc_ndisks - 1; } if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_ROUND_ROBIN) != 0 && sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE) { round_robin = 1; } else { round_robin = 0; } KASSERT(!round_robin || !verify, ("ROUND-ROBIN and VERIFY are mutually exclusive.")); pbp->bio_driver2 = &sc->sc_disks[sc->sc_ndisks - 1]; break; case BIO_WRITE: case BIO_DELETE: /* * Delay the request if it is colliding with a synchronization * request. */ if (g_raid3_sync_collision(sc, pbp)) { g_raid3_regular_delay(sc, pbp); return (0); } if (sc->sc_idle) g_raid3_unidle(sc); else sc->sc_last_write = time_uptime; ndisks = sc->sc_ndisks; break; } for (n = 0; n < ndisks; n++) { disk = &sc->sc_disks[n]; cbp = g_raid3_clone_bio(sc, pbp); if (cbp == NULL) { while ((cbp = G_RAID3_HEAD_BIO(pbp)) != NULL) g_raid3_destroy_bio(sc, cbp); /* * To prevent deadlock, we must run back up * with the ENOMEM for failed requests of any * of our consumers. Our own sync requests * can stick around, as they are finite. */ if ((pbp->bio_cflags & G_RAID3_BIO_CFLAG_REGULAR) != 0) { g_io_deliver(pbp, ENOMEM); return (0); } return (ENOMEM); } cbp->bio_offset = offset; cbp->bio_length = length; cbp->bio_done = g_raid3_done; switch (pbp->bio_cmd) { case BIO_READ: if (disk->d_state != G_RAID3_DISK_STATE_ACTIVE) { /* * Replace invalid component with the parity * component. */ disk = &sc->sc_disks[sc->sc_ndisks - 1]; cbp->bio_cflags |= G_RAID3_BIO_CFLAG_PARITY; pbp->bio_pflags |= G_RAID3_BIO_PFLAG_DEGRADED; } else if (round_robin && disk->d_no == sc->sc_round_robin) { /* * In round-robin mode skip one data component * and use parity component when reading. */ pbp->bio_driver2 = disk; disk = &sc->sc_disks[sc->sc_ndisks - 1]; cbp->bio_cflags |= G_RAID3_BIO_CFLAG_PARITY; sc->sc_round_robin++; round_robin = 0; } else if (verify && disk->d_no == sc->sc_ndisks - 1) { cbp->bio_cflags |= G_RAID3_BIO_CFLAG_PARITY; } break; case BIO_WRITE: case BIO_DELETE: if (disk->d_state == G_RAID3_DISK_STATE_ACTIVE || disk->d_state == G_RAID3_DISK_STATE_SYNCHRONIZING) { if (n == ndisks - 1) { /* * Active parity component, mark it as such. */ cbp->bio_cflags |= G_RAID3_BIO_CFLAG_PARITY; } } else { pbp->bio_pflags |= G_RAID3_BIO_PFLAG_DEGRADED; if (n == ndisks - 1) { /* * Parity component is not connected, * so destroy its request. 
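 * The write still goes out, degraded, to the data components; the
 * NOPARITY flag set below makes g_raid3_scatter() skip the parity
 * calculation entirely.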
*/ pbp->bio_pflags |= G_RAID3_BIO_PFLAG_NOPARITY; g_raid3_destroy_bio(sc, cbp); cbp = NULL; } else { cbp->bio_cflags |= G_RAID3_BIO_CFLAG_NODISK; disk = NULL; } } break; } if (cbp != NULL) cbp->bio_caller2 = disk; } switch (pbp->bio_cmd) { case BIO_READ: if (round_robin) { /* * If we are in round-robin mode and 'round_robin' is * still 1, it means, that we skipped parity component * for this read and must reset sc_round_robin field. */ sc->sc_round_robin = 0; } G_RAID3_FOREACH_SAFE_BIO(pbp, cbp, tmpbp) { disk = cbp->bio_caller2; cp = disk->d_consumer; cbp->bio_to = cp->provider; G_RAID3_LOGREQ(3, cbp, "Sending request."); KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1, ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr, cp->acw, cp->ace)); cp->index++; g_io_request(cbp, cp); } break; case BIO_WRITE: case BIO_DELETE: /* * Put request onto inflight queue, so we can check if new * synchronization requests don't collide with it. */ bioq_insert_tail(&sc->sc_inflight, pbp); /* * Bump syncid on first write. */ if ((sc->sc_bump_id & G_RAID3_BUMP_SYNCID) != 0) { sc->sc_bump_id &= ~G_RAID3_BUMP_SYNCID; g_raid3_bump_syncid(sc); } g_raid3_scatter(pbp); break; } return (0); } static int g_raid3_can_destroy(struct g_raid3_softc *sc) { struct g_geom *gp; struct g_consumer *cp; g_topology_assert(); gp = sc->sc_geom; if (gp->softc == NULL) return (1); LIST_FOREACH(cp, &gp->consumer, consumer) { if (g_raid3_is_busy(sc, cp)) return (0); } gp = sc->sc_sync.ds_geom; LIST_FOREACH(cp, &gp->consumer, consumer) { if (g_raid3_is_busy(sc, cp)) return (0); } G_RAID3_DEBUG(2, "No I/O requests for %s, it can be destroyed.", sc->sc_name); return (1); } static int g_raid3_try_destroy(struct g_raid3_softc *sc) { g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); if (sc->sc_rootmount != NULL) { G_RAID3_DEBUG(1, "root_mount_rel[%u] %p", __LINE__, sc->sc_rootmount); root_mount_rel(sc->sc_rootmount); sc->sc_rootmount = NULL; } g_topology_lock(); if (!g_raid3_can_destroy(sc)) { g_topology_unlock(); return (0); } sc->sc_geom->softc = NULL; sc->sc_sync.ds_geom->softc = NULL; if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_WAIT) != 0) { g_topology_unlock(); G_RAID3_DEBUG(4, "%s: Waking up %p.", __func__, &sc->sc_worker); /* Unlock sc_lock here, as it can be destroyed after wakeup. */ sx_xunlock(&sc->sc_lock); wakeup(&sc->sc_worker); sc->sc_worker = NULL; } else { g_topology_unlock(); g_raid3_destroy_device(sc); free(sc->sc_disks, M_RAID3); free(sc, M_RAID3); } return (1); } /* * Worker thread. */ static void g_raid3_worker(void *arg) { struct g_raid3_softc *sc; struct g_raid3_event *ep; struct bio *bp; int timeout; sc = arg; thread_lock(curthread); sched_prio(curthread, PRIBIO); thread_unlock(curthread); sx_xlock(&sc->sc_lock); for (;;) { G_RAID3_DEBUG(5, "%s: Let's see...", __func__); /* * First take a look at events. * This is important to handle events before any I/O requests. */ ep = g_raid3_event_get(sc); if (ep != NULL) { g_raid3_event_remove(sc, ep); if ((ep->e_flags & G_RAID3_EVENT_DEVICE) != 0) { /* Update only device status. */ G_RAID3_DEBUG(3, "Running event for device %s.", sc->sc_name); ep->e_error = 0; g_raid3_update_device(sc, 1); } else { /* Update disk status. 
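 * A disk state change may cascade into a device state change, so on
 * success re-run the device state machine (without forcing it).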
				 */
				G_RAID3_DEBUG(3, "Running event for disk %s.",
				    g_raid3_get_diskname(ep->e_disk));
				ep->e_error = g_raid3_update_disk(ep->e_disk,
				    ep->e_state);
				if (ep->e_error == 0)
					g_raid3_update_device(sc, 0);
			}
			if ((ep->e_flags & G_RAID3_EVENT_DONTWAIT) != 0) {
				KASSERT(ep->e_error == 0,
				    ("Error cannot be handled."));
				g_raid3_event_free(ep);
			} else {
				ep->e_flags |= G_RAID3_EVENT_DONE;
				G_RAID3_DEBUG(4, "%s: Waking up %p.", __func__,
				    ep);
				mtx_lock(&sc->sc_events_mtx);
				wakeup(ep);
				mtx_unlock(&sc->sc_events_mtx);
			}
			if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_DESTROY) != 0) {
				if (g_raid3_try_destroy(sc)) {
					curthread->td_pflags &= ~TDP_GEOM;
					G_RAID3_DEBUG(1, "Thread exiting.");
					kproc_exit(0);
				}
			}
			G_RAID3_DEBUG(5, "%s: I'm here 1.", __func__);
			continue;
		}
		/*
		 * Check whether we can mark the array as CLEAN and, if we
		 * cannot, how many seconds we should wait before trying again.
		 */
		timeout = g_raid3_idle(sc, -1);
		/*
		 * Now I/O requests.
		 */
		/* Get first request from the queue. */
		mtx_lock(&sc->sc_queue_mtx);
		bp = bioq_first(&sc->sc_queue);
		if (bp == NULL) {
			if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_DESTROY) != 0) {
				mtx_unlock(&sc->sc_queue_mtx);
				if (g_raid3_try_destroy(sc)) {
					curthread->td_pflags &= ~TDP_GEOM;
					G_RAID3_DEBUG(1, "Thread exiting.");
					kproc_exit(0);
				}
				mtx_lock(&sc->sc_queue_mtx);
			}
			sx_xunlock(&sc->sc_lock);
			/*
			 * XXX: We can miss an event here, because an event
			 * can be added without the sx-device-lock and without
			 * the mtx-queue-lock. Maybe I should just stop using
			 * a dedicated mutex for event synchronization and
			 * stick with the queue lock?
			 * The event will hang here until the next I/O request
			 * or the next event is received.
			 */
			MSLEEP(sc, &sc->sc_queue_mtx, PRIBIO | PDROP, "r3:w1",
			    timeout * hz);
			sx_xlock(&sc->sc_lock);
			G_RAID3_DEBUG(5, "%s: I'm here 4.", __func__);
			continue;
		}
process:
		bioq_remove(&sc->sc_queue, bp);
		mtx_unlock(&sc->sc_queue_mtx);

		if (bp->bio_from->geom == sc->sc_sync.ds_geom &&
		    (bp->bio_cflags & G_RAID3_BIO_CFLAG_SYNC) != 0) {
			g_raid3_sync_request(bp);	/* READ */
		} else if (bp->bio_to != sc->sc_provider) {
			if ((bp->bio_cflags & G_RAID3_BIO_CFLAG_REGULAR) != 0)
				g_raid3_regular_request(bp);
			else if ((bp->bio_cflags & G_RAID3_BIO_CFLAG_SYNC) != 0)
				g_raid3_sync_request(bp);	/* WRITE */
			else {
				KASSERT(0,
				    ("Invalid request cflags=0x%hx to=%s.",
				    bp->bio_cflags, bp->bio_to->name));
			}
		} else if (g_raid3_register_request(bp) != 0) {
			mtx_lock(&sc->sc_queue_mtx);
			bioq_insert_head(&sc->sc_queue, bp);
			/*
			 * We are short on memory: see whether there are
			 * finished requests we can free.
			 */
			TAILQ_FOREACH(bp, &sc->sc_queue.queue, bio_queue) {
				if (bp->bio_cflags & G_RAID3_BIO_CFLAG_REGULAR)
					goto process;
			}
			/*
			 * No finished regular request, so at least keep
			 * synchronization running.
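			 * Processing a sync bio needs no new allocation (its
			 * buffer was preallocated in g_raid3_sync_start());
			 * failing even that, sleep briefly (hz / 10) before
			 * retrying the allocation.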
*/ TAILQ_FOREACH(bp, &sc->sc_queue.queue, bio_queue) { if (bp->bio_cflags & G_RAID3_BIO_CFLAG_SYNC) goto process; } sx_xunlock(&sc->sc_lock); MSLEEP(&sc->sc_queue, &sc->sc_queue_mtx, PRIBIO | PDROP, "r3:lowmem", hz / 10); sx_xlock(&sc->sc_lock); } G_RAID3_DEBUG(5, "%s: I'm here 9.", __func__); } } static void g_raid3_update_idle(struct g_raid3_softc *sc, struct g_raid3_disk *disk) { sx_assert(&sc->sc_lock, SX_LOCKED); if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_NOFAILSYNC) != 0) return; if (!sc->sc_idle && (disk->d_flags & G_RAID3_DISK_FLAG_DIRTY) == 0) { G_RAID3_DEBUG(1, "Disk %s (device %s) marked as dirty.", g_raid3_get_diskname(disk), sc->sc_name); disk->d_flags |= G_RAID3_DISK_FLAG_DIRTY; } else if (sc->sc_idle && (disk->d_flags & G_RAID3_DISK_FLAG_DIRTY) != 0) { G_RAID3_DEBUG(1, "Disk %s (device %s) marked as clean.", g_raid3_get_diskname(disk), sc->sc_name); disk->d_flags &= ~G_RAID3_DISK_FLAG_DIRTY; } } static void g_raid3_sync_start(struct g_raid3_softc *sc) { struct g_raid3_disk *disk; struct g_consumer *cp; struct bio *bp; int error; u_int n; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_XLOCKED); KASSERT(sc->sc_state == G_RAID3_DEVICE_STATE_DEGRADED, ("Device not in DEGRADED state (%s, %u).", sc->sc_name, sc->sc_state)); KASSERT(sc->sc_syncdisk == NULL, ("Syncdisk is not NULL (%s, %u).", sc->sc_name, sc->sc_state)); disk = NULL; for (n = 0; n < sc->sc_ndisks; n++) { if (sc->sc_disks[n].d_state != G_RAID3_DISK_STATE_SYNCHRONIZING) continue; disk = &sc->sc_disks[n]; break; } if (disk == NULL) return; sx_xunlock(&sc->sc_lock); g_topology_lock(); cp = g_new_consumer(sc->sc_sync.ds_geom); error = g_attach(cp, sc->sc_provider); KASSERT(error == 0, ("Cannot attach to %s (error=%d).", sc->sc_name, error)); error = g_access(cp, 1, 0, 0); KASSERT(error == 0, ("Cannot open %s (error=%d).", sc->sc_name, error)); g_topology_unlock(); sx_xlock(&sc->sc_lock); G_RAID3_DEBUG(0, "Device %s: rebuilding provider %s.", sc->sc_name, g_raid3_get_diskname(disk)); if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_NOFAILSYNC) == 0) disk->d_flags |= G_RAID3_DISK_FLAG_DIRTY; KASSERT(disk->d_sync.ds_consumer == NULL, ("Sync consumer already exists (device=%s, disk=%s).", sc->sc_name, g_raid3_get_diskname(disk))); disk->d_sync.ds_consumer = cp; disk->d_sync.ds_consumer->private = disk; disk->d_sync.ds_consumer->index = 0; sc->sc_syncdisk = disk; /* * Allocate memory for synchronization bios and initialize them. */ disk->d_sync.ds_bios = malloc(sizeof(struct bio *) * g_raid3_syncreqs, M_RAID3, M_WAITOK); for (n = 0; n < g_raid3_syncreqs; n++) { bp = g_alloc_bio(); disk->d_sync.ds_bios[n] = bp; bp->bio_parent = NULL; bp->bio_cmd = BIO_READ; bp->bio_data = malloc(MAXPHYS, M_RAID3, M_WAITOK); bp->bio_cflags = 0; bp->bio_offset = disk->d_sync.ds_offset * (sc->sc_ndisks - 1); bp->bio_length = MIN(MAXPHYS, sc->sc_mediasize - bp->bio_offset); disk->d_sync.ds_offset += bp->bio_length / (sc->sc_ndisks - 1); bp->bio_done = g_raid3_sync_done; bp->bio_from = disk->d_sync.ds_consumer; bp->bio_to = sc->sc_provider; bp->bio_caller1 = (void *)(uintptr_t)n; } /* Set the number of in-flight synchronization requests. */ disk->d_sync.ds_inflight = g_raid3_syncreqs; /* * Fire off first synchronization requests. */ for (n = 0; n < g_raid3_syncreqs; n++) { bp = disk->d_sync.ds_bios[n]; G_RAID3_LOGREQ(3, bp, "Sending synchronization request."); disk->d_sync.ds_consumer->index++; /* * Delay the request if it is colliding with a regular request. 
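 */
/*
 * Illustrative sketch, not driver code: the per-disk synchronization
 * offset above is a component offset, while bios are addressed in
 * provider (logical) space.  With one component holding parity, every
 * provider byte range is backed by ndisks - 1 data components, which is
 * the scaling both conversions above use.  The same arithmetic as a
 * stand-alone helper ("maxio" plays the role of MAXPHYS here):
 */
#if 0
static off_t
raid3_sync_next_offset(off_t comp_offset, off_t mediasize, u_int ndisks,
    off_t maxio, off_t *prov_offsetp, off_t *prov_lengthp)
{

	*prov_offsetp = comp_offset * (ndisks - 1);
	*prov_lengthp = MIN(maxio, mediasize - *prov_offsetp);
	/* Advance the component offset by the data portion just issued. */
	return (comp_offset + *prov_lengthp / (ndisks - 1));
}
#endif
/*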
*/ if (g_raid3_regular_collision(sc, bp)) g_raid3_sync_delay(sc, bp); else g_io_request(bp, disk->d_sync.ds_consumer); } } /* * Stop synchronization process. * type: 0 - synchronization finished * 1 - synchronization stopped */ static void g_raid3_sync_stop(struct g_raid3_softc *sc, int type) { struct g_raid3_disk *disk; struct g_consumer *cp; g_topology_assert_not(); sx_assert(&sc->sc_lock, SX_LOCKED); KASSERT(sc->sc_state == G_RAID3_DEVICE_STATE_DEGRADED, ("Device not in DEGRADED state (%s, %u).", sc->sc_name, sc->sc_state)); disk = sc->sc_syncdisk; sc->sc_syncdisk = NULL; KASSERT(disk != NULL, ("No disk was synchronized (%s).", sc->sc_name)); KASSERT(disk->d_state == G_RAID3_DISK_STATE_SYNCHRONIZING, ("Wrong disk state (%s, %s).", g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); if (disk->d_sync.ds_consumer == NULL) return; if (type == 0) { G_RAID3_DEBUG(0, "Device %s: rebuilding provider %s finished.", sc->sc_name, g_raid3_get_diskname(disk)); } else /* if (type == 1) */ { G_RAID3_DEBUG(0, "Device %s: rebuilding provider %s stopped.", sc->sc_name, g_raid3_get_diskname(disk)); } free(disk->d_sync.ds_bios, M_RAID3); disk->d_sync.ds_bios = NULL; cp = disk->d_sync.ds_consumer; disk->d_sync.ds_consumer = NULL; disk->d_flags &= ~G_RAID3_DISK_FLAG_DIRTY; sx_xunlock(&sc->sc_lock); /* Avoid recursion on sc_lock. */ g_topology_lock(); g_raid3_kill_consumer(sc, cp); g_topology_unlock(); sx_xlock(&sc->sc_lock); } static void g_raid3_launch_provider(struct g_raid3_softc *sc) { struct g_provider *pp; struct g_raid3_disk *disk; int n; sx_assert(&sc->sc_lock, SX_LOCKED); g_topology_lock(); pp = g_new_providerf(sc->sc_geom, "raid3/%s", sc->sc_name); pp->mediasize = sc->sc_mediasize; pp->sectorsize = sc->sc_sectorsize; pp->stripesize = 0; pp->stripeoffset = 0; for (n = 0; n < sc->sc_ndisks; n++) { disk = &sc->sc_disks[n]; if (disk->d_consumer && disk->d_consumer->provider && disk->d_consumer->provider->stripesize > pp->stripesize) { pp->stripesize = disk->d_consumer->provider->stripesize; pp->stripeoffset = disk->d_consumer->provider->stripeoffset; } } pp->stripesize *= sc->sc_ndisks - 1; pp->stripeoffset *= sc->sc_ndisks - 1; sc->sc_provider = pp; g_error_provider(pp, 0); g_topology_unlock(); G_RAID3_DEBUG(0, "Device %s launched (%u/%u).", pp->name, g_raid3_ndisks(sc, G_RAID3_DISK_STATE_ACTIVE), sc->sc_ndisks); if (sc->sc_state == G_RAID3_DEVICE_STATE_DEGRADED) g_raid3_sync_start(sc); } static void g_raid3_destroy_provider(struct g_raid3_softc *sc) { struct bio *bp; g_topology_assert_not(); KASSERT(sc->sc_provider != NULL, ("NULL provider (device=%s).", sc->sc_name)); g_topology_lock(); g_error_provider(sc->sc_provider, ENXIO); mtx_lock(&sc->sc_queue_mtx); while ((bp = bioq_first(&sc->sc_queue)) != NULL) { bioq_remove(&sc->sc_queue, bp); g_io_deliver(bp, ENXIO); } mtx_unlock(&sc->sc_queue_mtx); G_RAID3_DEBUG(0, "Device %s: provider %s destroyed.", sc->sc_name, sc->sc_provider->name); g_wither_provider(sc->sc_provider, ENXIO); g_topology_unlock(); sc->sc_provider = NULL; if (sc->sc_syncdisk != NULL) g_raid3_sync_stop(sc, 1); } static void g_raid3_go(void *arg) { struct g_raid3_softc *sc; sc = arg; G_RAID3_DEBUG(0, "Force device %s start due to timeout.", sc->sc_name); g_raid3_event_send(sc, 0, G_RAID3_EVENT_DONTWAIT | G_RAID3_EVENT_DEVICE); } static u_int g_raid3_determine_state(struct g_raid3_disk *disk) { struct g_raid3_softc *sc; u_int state; sc = disk->d_softc; if (sc->sc_syncid == disk->d_sync.ds_syncid) { if ((disk->d_flags & G_RAID3_DISK_FLAG_SYNCHRONIZING) == 0) { /* Disk 
does not need synchronization. */ state = G_RAID3_DISK_STATE_ACTIVE; } else { if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_NOAUTOSYNC) == 0 || (disk->d_flags & G_RAID3_DISK_FLAG_FORCE_SYNC) != 0) { /* * We can start synchronization from * the stored offset. */ state = G_RAID3_DISK_STATE_SYNCHRONIZING; } else { state = G_RAID3_DISK_STATE_STALE; } } } else if (disk->d_sync.ds_syncid < sc->sc_syncid) { /* * Reset all synchronization data for this disk, * because if it even was synchronized, it was * synchronized to disks with different syncid. */ disk->d_flags |= G_RAID3_DISK_FLAG_SYNCHRONIZING; disk->d_sync.ds_offset = 0; disk->d_sync.ds_offset_done = 0; disk->d_sync.ds_syncid = sc->sc_syncid; if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_NOAUTOSYNC) == 0 || (disk->d_flags & G_RAID3_DISK_FLAG_FORCE_SYNC) != 0) { state = G_RAID3_DISK_STATE_SYNCHRONIZING; } else { state = G_RAID3_DISK_STATE_STALE; } } else /* if (sc->sc_syncid < disk->d_sync.ds_syncid) */ { /* * Not good, NOT GOOD! * It means that device was started on stale disks * and more fresh disk just arrive. * If there were writes, device is broken, sorry. * I think the best choice here is don't touch * this disk and inform the user loudly. */ G_RAID3_DEBUG(0, "Device %s was started before the freshest " "disk (%s) arrives!! It will not be connected to the " "running device.", sc->sc_name, g_raid3_get_diskname(disk)); g_raid3_destroy_disk(disk); state = G_RAID3_DISK_STATE_NONE; /* Return immediately, because disk was destroyed. */ return (state); } G_RAID3_DEBUG(3, "State for %s disk: %s.", g_raid3_get_diskname(disk), g_raid3_disk_state2str(state)); return (state); } /* * Update device state. */ static void g_raid3_update_device(struct g_raid3_softc *sc, boolean_t force) { struct g_raid3_disk *disk; u_int state; sx_assert(&sc->sc_lock, SX_XLOCKED); switch (sc->sc_state) { case G_RAID3_DEVICE_STATE_STARTING: { u_int n, ndirty, ndisks, genid, syncid; KASSERT(sc->sc_provider == NULL, ("Non-NULL provider in STARTING state (%s).", sc->sc_name)); /* * Are we ready? We are, if all disks are connected or * one disk is missing and 'force' is true. */ if (g_raid3_ndisks(sc, -1) + force == sc->sc_ndisks) { if (!force) callout_drain(&sc->sc_callout); } else { if (force) { /* * Timeout expired, so destroy device. */ sc->sc_flags |= G_RAID3_DEVICE_FLAG_DESTROY; G_RAID3_DEBUG(1, "root_mount_rel[%u] %p", __LINE__, sc->sc_rootmount); root_mount_rel(sc->sc_rootmount); sc->sc_rootmount = NULL; } return; } /* * Find the biggest genid. */ genid = 0; for (n = 0; n < sc->sc_ndisks; n++) { disk = &sc->sc_disks[n]; if (disk->d_state == G_RAID3_DISK_STATE_NODISK) continue; if (disk->d_genid > genid) genid = disk->d_genid; } sc->sc_genid = genid; /* * Remove all disks without the biggest genid. */ for (n = 0; n < sc->sc_ndisks; n++) { disk = &sc->sc_disks[n]; if (disk->d_state == G_RAID3_DISK_STATE_NODISK) continue; if (disk->d_genid < genid) { G_RAID3_DEBUG(0, "Component %s (device %s) broken, skipping.", g_raid3_get_diskname(disk), sc->sc_name); g_raid3_destroy_disk(disk); } } /* * There must be at least 'sc->sc_ndisks - 1' components * with the same syncid and without SYNCHRONIZING flag. */ /* * Find the biggest syncid, number of valid components and * number of dirty components. 
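 */
/*
 * Illustrative sketch, not driver code: g_raid3_determine_state() above
 * is essentially a three-way comparison of syncids.  A matching syncid
 * lets the disk join (ACTIVE, or SYNCHRONIZING/STALE depending on its
 * flags), a smaller one means the disk missed writes and must be
 * resynchronized from scratch, and a larger one means the running device
 * itself is stale, so the disk is refused.  The helper name below is
 * made up for the example:
 */
#if 0
static int
raid3_syncid_verdict(u_int dev_syncid, u_int disk_syncid)
{

	if (disk_syncid == dev_syncid)
		return (0);	/* Usable, possibly after a partial sync. */
	if (disk_syncid < dev_syncid)
		return (1);	/* Stale: full resynchronization needed. */
	return (-1);		/* Fresher than the device: reject it. */
}
#endif
/*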
*/ ndirty = ndisks = syncid = 0; for (n = 0; n < sc->sc_ndisks; n++) { disk = &sc->sc_disks[n]; if (disk->d_state == G_RAID3_DISK_STATE_NODISK) continue; if ((disk->d_flags & G_RAID3_DISK_FLAG_DIRTY) != 0) ndirty++; if (disk->d_sync.ds_syncid > syncid) { syncid = disk->d_sync.ds_syncid; ndisks = 0; } else if (disk->d_sync.ds_syncid < syncid) { continue; } if ((disk->d_flags & G_RAID3_DISK_FLAG_SYNCHRONIZING) != 0) { continue; } ndisks++; } /* * Do we have enough valid components? */ if (ndisks + 1 < sc->sc_ndisks) { G_RAID3_DEBUG(0, "Device %s is broken, too few valid components.", sc->sc_name); sc->sc_flags |= G_RAID3_DEVICE_FLAG_DESTROY; return; } /* * If there is one DIRTY component and all disks are present, * mark it for synchronization. If there is more than one DIRTY * component, mark parity component for synchronization. */ if (ndisks == sc->sc_ndisks && ndirty == 1) { for (n = 0; n < sc->sc_ndisks; n++) { disk = &sc->sc_disks[n]; if ((disk->d_flags & G_RAID3_DISK_FLAG_DIRTY) == 0) { continue; } disk->d_flags |= G_RAID3_DISK_FLAG_SYNCHRONIZING; } } else if (ndisks == sc->sc_ndisks && ndirty > 1) { disk = &sc->sc_disks[sc->sc_ndisks - 1]; disk->d_flags |= G_RAID3_DISK_FLAG_SYNCHRONIZING; } sc->sc_syncid = syncid; if (force) { /* Remember to bump syncid on first write. */ sc->sc_bump_id |= G_RAID3_BUMP_SYNCID; } if (ndisks == sc->sc_ndisks) state = G_RAID3_DEVICE_STATE_COMPLETE; else /* if (ndisks == sc->sc_ndisks - 1) */ state = G_RAID3_DEVICE_STATE_DEGRADED; G_RAID3_DEBUG(1, "Device %s state changed from %s to %s.", sc->sc_name, g_raid3_device_state2str(sc->sc_state), g_raid3_device_state2str(state)); sc->sc_state = state; for (n = 0; n < sc->sc_ndisks; n++) { disk = &sc->sc_disks[n]; if (disk->d_state == G_RAID3_DISK_STATE_NODISK) continue; state = g_raid3_determine_state(disk); g_raid3_event_send(disk, state, G_RAID3_EVENT_DONTWAIT); if (state == G_RAID3_DISK_STATE_STALE) sc->sc_bump_id |= G_RAID3_BUMP_SYNCID; } break; } case G_RAID3_DEVICE_STATE_DEGRADED: /* * Genid need to be bumped immediately, so do it here. */ if ((sc->sc_bump_id & G_RAID3_BUMP_GENID) != 0) { sc->sc_bump_id &= ~G_RAID3_BUMP_GENID; g_raid3_bump_genid(sc); } if (g_raid3_ndisks(sc, G_RAID3_DISK_STATE_NEW) > 0) return; if (g_raid3_ndisks(sc, G_RAID3_DISK_STATE_ACTIVE) < sc->sc_ndisks - 1) { if (sc->sc_provider != NULL) g_raid3_destroy_provider(sc); sc->sc_flags |= G_RAID3_DEVICE_FLAG_DESTROY; return; } if (g_raid3_ndisks(sc, G_RAID3_DISK_STATE_ACTIVE) == sc->sc_ndisks) { state = G_RAID3_DEVICE_STATE_COMPLETE; G_RAID3_DEBUG(1, "Device %s state changed from %s to %s.", sc->sc_name, g_raid3_device_state2str(sc->sc_state), g_raid3_device_state2str(state)); sc->sc_state = state; } if (sc->sc_provider == NULL) g_raid3_launch_provider(sc); if (sc->sc_rootmount != NULL) { G_RAID3_DEBUG(1, "root_mount_rel[%u] %p", __LINE__, sc->sc_rootmount); root_mount_rel(sc->sc_rootmount); sc->sc_rootmount = NULL; } break; case G_RAID3_DEVICE_STATE_COMPLETE: /* * Genid need to be bumped immediately, so do it here. 
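 */
/*
 * Illustrative sketch, not driver code: the STARTING-state scan above
 * keeps a running maximum of the per-disk syncids and counts only
 * components seen at that maximum (the driver additionally skips
 * SYNCHRONIZING components); raising the maximum resets the count.  The
 * same shape over a plain array of ids:
 */
#if 0
static u_int
raid3_count_at_max(const u_int *ids, u_int n, u_int *maxp)
{
	u_int count, i, max;

	count = max = 0;
	for (i = 0; i < n; i++) {
		if (ids[i] > max) {
			max = ids[i];
			count = 0;
		} else if (ids[i] < max)
			continue;
		count++;
	}
	*maxp = max;
	return (count);
}
#endif
/*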
*/ if ((sc->sc_bump_id & G_RAID3_BUMP_GENID) != 0) { sc->sc_bump_id &= ~G_RAID3_BUMP_GENID; g_raid3_bump_genid(sc); } if (g_raid3_ndisks(sc, G_RAID3_DISK_STATE_NEW) > 0) return; KASSERT(g_raid3_ndisks(sc, G_RAID3_DISK_STATE_ACTIVE) >= sc->sc_ndisks - 1, ("Too few ACTIVE components in COMPLETE state (device %s).", sc->sc_name)); if (g_raid3_ndisks(sc, G_RAID3_DISK_STATE_ACTIVE) == sc->sc_ndisks - 1) { state = G_RAID3_DEVICE_STATE_DEGRADED; G_RAID3_DEBUG(1, "Device %s state changed from %s to %s.", sc->sc_name, g_raid3_device_state2str(sc->sc_state), g_raid3_device_state2str(state)); sc->sc_state = state; } if (sc->sc_provider == NULL) g_raid3_launch_provider(sc); if (sc->sc_rootmount != NULL) { G_RAID3_DEBUG(1, "root_mount_rel[%u] %p", __LINE__, sc->sc_rootmount); root_mount_rel(sc->sc_rootmount); sc->sc_rootmount = NULL; } break; default: KASSERT(1 == 0, ("Wrong device state (%s, %s).", sc->sc_name, g_raid3_device_state2str(sc->sc_state))); break; } } /* * Update disk state and device state if needed. */ #define DISK_STATE_CHANGED() G_RAID3_DEBUG(1, \ "Disk %s state changed from %s to %s (device %s).", \ g_raid3_get_diskname(disk), \ g_raid3_disk_state2str(disk->d_state), \ g_raid3_disk_state2str(state), sc->sc_name) static int g_raid3_update_disk(struct g_raid3_disk *disk, u_int state) { struct g_raid3_softc *sc; sc = disk->d_softc; sx_assert(&sc->sc_lock, SX_XLOCKED); again: G_RAID3_DEBUG(3, "Changing disk %s state from %s to %s.", g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state), g_raid3_disk_state2str(state)); switch (state) { case G_RAID3_DISK_STATE_NEW: /* * Possible scenarios: * 1. New disk arrive. */ /* Previous state should be NONE. */ KASSERT(disk->d_state == G_RAID3_DISK_STATE_NONE, ("Wrong disk state (%s, %s).", g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); DISK_STATE_CHANGED(); disk->d_state = state; G_RAID3_DEBUG(1, "Device %s: provider %s detected.", sc->sc_name, g_raid3_get_diskname(disk)); if (sc->sc_state == G_RAID3_DEVICE_STATE_STARTING) break; KASSERT(sc->sc_state == G_RAID3_DEVICE_STATE_DEGRADED || sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_raid3_device_state2str(sc->sc_state), g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); state = g_raid3_determine_state(disk); if (state != G_RAID3_DISK_STATE_NONE) goto again; break; case G_RAID3_DISK_STATE_ACTIVE: /* * Possible scenarios: * 1. New disk does not need synchronization. * 2. Synchronization process finished successfully. */ KASSERT(sc->sc_state == G_RAID3_DEVICE_STATE_DEGRADED || sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_raid3_device_state2str(sc->sc_state), g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); /* Previous state should be NEW or SYNCHRONIZING. 
*/ KASSERT(disk->d_state == G_RAID3_DISK_STATE_NEW || disk->d_state == G_RAID3_DISK_STATE_SYNCHRONIZING, ("Wrong disk state (%s, %s).", g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); DISK_STATE_CHANGED(); if (disk->d_state == G_RAID3_DISK_STATE_SYNCHRONIZING) { disk->d_flags &= ~G_RAID3_DISK_FLAG_SYNCHRONIZING; disk->d_flags &= ~G_RAID3_DISK_FLAG_FORCE_SYNC; g_raid3_sync_stop(sc, 0); } disk->d_state = state; disk->d_sync.ds_offset = 0; disk->d_sync.ds_offset_done = 0; g_raid3_update_idle(sc, disk); g_raid3_update_metadata(disk); G_RAID3_DEBUG(1, "Device %s: provider %s activated.", sc->sc_name, g_raid3_get_diskname(disk)); break; case G_RAID3_DISK_STATE_STALE: /* * Possible scenarios: * 1. Stale disk was connected. */ /* Previous state should be NEW. */ KASSERT(disk->d_state == G_RAID3_DISK_STATE_NEW, ("Wrong disk state (%s, %s).", g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); KASSERT(sc->sc_state == G_RAID3_DEVICE_STATE_DEGRADED || sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_raid3_device_state2str(sc->sc_state), g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); /* * STALE state is only possible if device is marked * NOAUTOSYNC. */ KASSERT((sc->sc_flags & G_RAID3_DEVICE_FLAG_NOAUTOSYNC) != 0, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_raid3_device_state2str(sc->sc_state), g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); DISK_STATE_CHANGED(); disk->d_flags &= ~G_RAID3_DISK_FLAG_DIRTY; disk->d_state = state; g_raid3_update_metadata(disk); G_RAID3_DEBUG(0, "Device %s: provider %s is stale.", sc->sc_name, g_raid3_get_diskname(disk)); break; case G_RAID3_DISK_STATE_SYNCHRONIZING: /* * Possible scenarios: * 1. Disk which needs synchronization was connected. */ /* Previous state should be NEW. */ KASSERT(disk->d_state == G_RAID3_DISK_STATE_NEW, ("Wrong disk state (%s, %s).", g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); KASSERT(sc->sc_state == G_RAID3_DEVICE_STATE_DEGRADED || sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_raid3_device_state2str(sc->sc_state), g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); DISK_STATE_CHANGED(); if (disk->d_state == G_RAID3_DISK_STATE_NEW) disk->d_flags &= ~G_RAID3_DISK_FLAG_DIRTY; disk->d_state = state; if (sc->sc_provider != NULL) { g_raid3_sync_start(sc); g_raid3_update_metadata(disk); } break; case G_RAID3_DISK_STATE_DISCONNECTED: /* * Possible scenarios: * 1. Device wasn't running yet, but disk disappear. * 2. Disk was active and disapppear. * 3. Disk disappear during synchronization process. */ if (sc->sc_state == G_RAID3_DEVICE_STATE_DEGRADED || sc->sc_state == G_RAID3_DEVICE_STATE_COMPLETE) { /* * Previous state should be ACTIVE, STALE or * SYNCHRONIZING. */ KASSERT(disk->d_state == G_RAID3_DISK_STATE_ACTIVE || disk->d_state == G_RAID3_DISK_STATE_STALE || disk->d_state == G_RAID3_DISK_STATE_SYNCHRONIZING, ("Wrong disk state (%s, %s).", g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); } else if (sc->sc_state == G_RAID3_DEVICE_STATE_STARTING) { /* Previous state should be NEW. */ KASSERT(disk->d_state == G_RAID3_DISK_STATE_NEW, ("Wrong disk state (%s, %s).", g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); /* * Reset bumping syncid if disk disappeared in STARTING * state. 
*/ if ((sc->sc_bump_id & G_RAID3_BUMP_SYNCID) != 0) sc->sc_bump_id &= ~G_RAID3_BUMP_SYNCID; #ifdef INVARIANTS } else { KASSERT(1 == 0, ("Wrong device state (%s, %s, %s, %s).", sc->sc_name, g_raid3_device_state2str(sc->sc_state), g_raid3_get_diskname(disk), g_raid3_disk_state2str(disk->d_state))); #endif } DISK_STATE_CHANGED(); G_RAID3_DEBUG(0, "Device %s: provider %s disconnected.", sc->sc_name, g_raid3_get_diskname(disk)); g_raid3_destroy_disk(disk); break; default: KASSERT(1 == 0, ("Unknown state (%u).", state)); break; } return (0); } #undef DISK_STATE_CHANGED int g_raid3_read_metadata(struct g_consumer *cp, struct g_raid3_metadata *md) { struct g_provider *pp; u_char *buf; int error; g_topology_assert(); error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); /* Metadata are stored on last sector. */ buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) { G_RAID3_DEBUG(1, "Cannot read metadata from %s (error=%d).", cp->provider->name, error); return (error); } /* Decode metadata. */ error = raid3_metadata_decode(buf, md); g_free(buf); if (strcmp(md->md_magic, G_RAID3_MAGIC) != 0) return (EINVAL); if (md->md_version > G_RAID3_VERSION) { G_RAID3_DEBUG(0, "Kernel module is too old to handle metadata from %s.", cp->provider->name); return (EINVAL); } if (error != 0) { G_RAID3_DEBUG(1, "MD5 metadata hash mismatch for provider %s.", cp->provider->name); return (error); } if (md->md_sectorsize > MAXPHYS) { G_RAID3_DEBUG(0, "The blocksize is too big."); return (EINVAL); } return (0); } static int g_raid3_check_metadata(struct g_raid3_softc *sc, struct g_provider *pp, struct g_raid3_metadata *md) { if (md->md_no >= sc->sc_ndisks) { G_RAID3_DEBUG(1, "Invalid disk %s number (no=%u), skipping.", pp->name, md->md_no); return (EINVAL); } if (sc->sc_disks[md->md_no].d_state != G_RAID3_DISK_STATE_NODISK) { G_RAID3_DEBUG(1, "Disk %s (no=%u) already exists, skipping.", pp->name, md->md_no); return (EEXIST); } if (md->md_all != sc->sc_ndisks) { G_RAID3_DEBUG(1, "Invalid '%s' field on disk %s (device %s), skipping.", "md_all", pp->name, sc->sc_name); return (EINVAL); } if ((md->md_mediasize % md->md_sectorsize) != 0) { G_RAID3_DEBUG(1, "Invalid metadata (mediasize %% sectorsize != " "0) on disk %s (device %s), skipping.", pp->name, sc->sc_name); return (EINVAL); } if (md->md_mediasize != sc->sc_mediasize) { G_RAID3_DEBUG(1, "Invalid '%s' field on disk %s (device %s), skipping.", "md_mediasize", pp->name, sc->sc_name); return (EINVAL); } if ((md->md_mediasize % (sc->sc_ndisks - 1)) != 0) { G_RAID3_DEBUG(1, "Invalid '%s' field on disk %s (device %s), skipping.", "md_mediasize", pp->name, sc->sc_name); return (EINVAL); } if ((sc->sc_mediasize / (sc->sc_ndisks - 1)) > pp->mediasize) { G_RAID3_DEBUG(1, "Invalid size of disk %s (device %s), skipping.", pp->name, sc->sc_name); return (EINVAL); } if ((md->md_sectorsize / pp->sectorsize) < sc->sc_ndisks - 1) { G_RAID3_DEBUG(1, "Invalid '%s' field on disk %s (device %s), skipping.", "md_sectorsize", pp->name, sc->sc_name); return (EINVAL); } if (md->md_sectorsize != sc->sc_sectorsize) { G_RAID3_DEBUG(1, "Invalid '%s' field on disk %s (device %s), skipping.", "md_sectorsize", pp->name, sc->sc_name); return (EINVAL); } if ((sc->sc_sectorsize % pp->sectorsize) != 0) { G_RAID3_DEBUG(1, "Invalid sector size of disk %s (device %s), skipping.", pp->name, sc->sc_name); return (EINVAL); } if ((md->md_mflags & ~G_RAID3_DEVICE_FLAG_MASK) != 0) { 
G_RAID3_DEBUG(1, "Invalid device flags on disk %s (device %s), skipping.", pp->name, sc->sc_name); return (EINVAL); } if ((md->md_mflags & G_RAID3_DEVICE_FLAG_VERIFY) != 0 && (md->md_mflags & G_RAID3_DEVICE_FLAG_ROUND_ROBIN) != 0) { /* * VERIFY and ROUND-ROBIN options are mutally exclusive. */ G_RAID3_DEBUG(1, "Both VERIFY and ROUND-ROBIN flags exist on " "disk %s (device %s), skipping.", pp->name, sc->sc_name); return (EINVAL); } if ((md->md_dflags & ~G_RAID3_DISK_FLAG_MASK) != 0) { G_RAID3_DEBUG(1, "Invalid disk flags on disk %s (device %s), skipping.", pp->name, sc->sc_name); return (EINVAL); } return (0); } int g_raid3_add_disk(struct g_raid3_softc *sc, struct g_provider *pp, struct g_raid3_metadata *md) { struct g_raid3_disk *disk; int error; g_topology_assert_not(); G_RAID3_DEBUG(2, "Adding disk %s.", pp->name); error = g_raid3_check_metadata(sc, pp, md); if (error != 0) return (error); if (sc->sc_state != G_RAID3_DEVICE_STATE_STARTING && md->md_genid < sc->sc_genid) { G_RAID3_DEBUG(0, "Component %s (device %s) broken, skipping.", pp->name, sc->sc_name); return (EINVAL); } disk = g_raid3_init_disk(sc, pp, md, &error); if (disk == NULL) return (error); error = g_raid3_event_send(disk, G_RAID3_DISK_STATE_NEW, G_RAID3_EVENT_WAIT); if (error != 0) return (error); if (md->md_version < G_RAID3_VERSION) { G_RAID3_DEBUG(0, "Upgrading metadata on %s (v%d->v%d).", pp->name, md->md_version, G_RAID3_VERSION); g_raid3_update_metadata(disk); } return (0); } static void g_raid3_destroy_delayed(void *arg, int flag) { struct g_raid3_softc *sc; int error; if (flag == EV_CANCEL) { G_RAID3_DEBUG(1, "Destroying canceled."); return; } sc = arg; g_topology_unlock(); sx_xlock(&sc->sc_lock); KASSERT((sc->sc_flags & G_RAID3_DEVICE_FLAG_DESTROY) == 0, ("DESTROY flag set on %s.", sc->sc_name)); KASSERT((sc->sc_flags & G_RAID3_DEVICE_FLAG_DESTROYING) != 0, ("DESTROYING flag not set on %s.", sc->sc_name)); G_RAID3_DEBUG(0, "Destroying %s (delayed).", sc->sc_name); error = g_raid3_destroy(sc, G_RAID3_DESTROY_SOFT); if (error != 0) { G_RAID3_DEBUG(0, "Cannot destroy %s.", sc->sc_name); sx_xunlock(&sc->sc_lock); } g_topology_lock(); } static int g_raid3_access(struct g_provider *pp, int acr, int acw, int ace) { struct g_raid3_softc *sc; int dcr, dcw, dce, error = 0; g_topology_assert(); G_RAID3_DEBUG(2, "Access request for %s: r%dw%de%d.", pp->name, acr, acw, ace); sc = pp->geom->softc; if (sc == NULL && acr <= 0 && acw <= 0 && ace <= 0) return (0); KASSERT(sc != NULL, ("NULL softc (provider=%s).", pp->name)); dcr = pp->acr + acr; dcw = pp->acw + acw; dce = pp->ace + ace; g_topology_unlock(); sx_xlock(&sc->sc_lock); if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_DESTROY) != 0 || g_raid3_ndisks(sc, G_RAID3_DISK_STATE_ACTIVE) < sc->sc_ndisks - 1) { if (acr > 0 || acw > 0 || ace > 0) error = ENXIO; goto end; } if (dcw == 0) g_raid3_idle(sc, dcw); if ((sc->sc_flags & G_RAID3_DEVICE_FLAG_DESTROYING) != 0) { if (acr > 0 || acw > 0 || ace > 0) { error = ENXIO; goto end; } if (dcr == 0 && dcw == 0 && dce == 0) { g_post_event(g_raid3_destroy_delayed, sc, M_WAITOK, sc, NULL); } } end: sx_xunlock(&sc->sc_lock); g_topology_lock(); return (error); } static struct g_geom * g_raid3_create(struct g_class *mp, const struct g_raid3_metadata *md) { struct g_raid3_softc *sc; struct g_geom *gp; int error, timeout; u_int n; g_topology_assert(); G_RAID3_DEBUG(1, "Creating device %s (id=%u).", md->md_name, md->md_id); /* One disk is minimum. */ if (md->md_all < 1) return (NULL); /* * Action geom. 
*/ gp = g_new_geomf(mp, "%s", md->md_name); sc = malloc(sizeof(*sc), M_RAID3, M_WAITOK | M_ZERO); sc->sc_disks = malloc(sizeof(struct g_raid3_disk) * md->md_all, M_RAID3, M_WAITOK | M_ZERO); gp->start = g_raid3_start; gp->orphan = g_raid3_orphan; gp->access = g_raid3_access; gp->dumpconf = g_raid3_dumpconf; sc->sc_id = md->md_id; sc->sc_mediasize = md->md_mediasize; sc->sc_sectorsize = md->md_sectorsize; sc->sc_ndisks = md->md_all; sc->sc_round_robin = 0; sc->sc_flags = md->md_mflags; sc->sc_bump_id = 0; sc->sc_idle = 1; sc->sc_last_write = time_uptime; sc->sc_writes = 0; for (n = 0; n < sc->sc_ndisks; n++) { sc->sc_disks[n].d_softc = sc; sc->sc_disks[n].d_no = n; sc->sc_disks[n].d_state = G_RAID3_DISK_STATE_NODISK; } sx_init(&sc->sc_lock, "graid3:lock"); bioq_init(&sc->sc_queue); mtx_init(&sc->sc_queue_mtx, "graid3:queue", NULL, MTX_DEF); bioq_init(&sc->sc_regular_delayed); bioq_init(&sc->sc_inflight); bioq_init(&sc->sc_sync_delayed); TAILQ_INIT(&sc->sc_events); mtx_init(&sc->sc_events_mtx, "graid3:events", NULL, MTX_DEF); callout_init(&sc->sc_callout, 1); sc->sc_state = G_RAID3_DEVICE_STATE_STARTING; gp->softc = sc; sc->sc_geom = gp; sc->sc_provider = NULL; /* * Synchronization geom. */ gp = g_new_geomf(mp, "%s.sync", md->md_name); gp->softc = sc; gp->orphan = g_raid3_orphan; sc->sc_sync.ds_geom = gp; if (!g_raid3_use_malloc) { sc->sc_zones[G_RAID3_ZONE_64K].sz_zone = uma_zcreate("gr3:64k", 65536, g_raid3_uma_ctor, g_raid3_uma_dtor, NULL, NULL, UMA_ALIGN_PTR, 0); sc->sc_zones[G_RAID3_ZONE_64K].sz_inuse = 0; sc->sc_zones[G_RAID3_ZONE_64K].sz_max = g_raid3_n64k; sc->sc_zones[G_RAID3_ZONE_64K].sz_requested = sc->sc_zones[G_RAID3_ZONE_64K].sz_failed = 0; sc->sc_zones[G_RAID3_ZONE_16K].sz_zone = uma_zcreate("gr3:16k", 16384, g_raid3_uma_ctor, g_raid3_uma_dtor, NULL, NULL, UMA_ALIGN_PTR, 0); sc->sc_zones[G_RAID3_ZONE_16K].sz_inuse = 0; sc->sc_zones[G_RAID3_ZONE_16K].sz_max = g_raid3_n16k; sc->sc_zones[G_RAID3_ZONE_16K].sz_requested = sc->sc_zones[G_RAID3_ZONE_16K].sz_failed = 0; sc->sc_zones[G_RAID3_ZONE_4K].sz_zone = uma_zcreate("gr3:4k", 4096, g_raid3_uma_ctor, g_raid3_uma_dtor, NULL, NULL, UMA_ALIGN_PTR, 0); sc->sc_zones[G_RAID3_ZONE_4K].sz_inuse = 0; sc->sc_zones[G_RAID3_ZONE_4K].sz_max = g_raid3_n4k; sc->sc_zones[G_RAID3_ZONE_4K].sz_requested = sc->sc_zones[G_RAID3_ZONE_4K].sz_failed = 0; } error = kproc_create(g_raid3_worker, sc, &sc->sc_worker, 0, 0, "g_raid3 %s", md->md_name); if (error != 0) { G_RAID3_DEBUG(1, "Cannot create kernel thread for %s.", sc->sc_name); if (!g_raid3_use_malloc) { uma_zdestroy(sc->sc_zones[G_RAID3_ZONE_64K].sz_zone); uma_zdestroy(sc->sc_zones[G_RAID3_ZONE_16K].sz_zone); uma_zdestroy(sc->sc_zones[G_RAID3_ZONE_4K].sz_zone); } g_destroy_geom(sc->sc_sync.ds_geom); mtx_destroy(&sc->sc_events_mtx); mtx_destroy(&sc->sc_queue_mtx); sx_destroy(&sc->sc_lock); g_destroy_geom(sc->sc_geom); free(sc->sc_disks, M_RAID3); free(sc, M_RAID3); return (NULL); } G_RAID3_DEBUG(1, "Device %s created (%u components, id=%u).", sc->sc_name, sc->sc_ndisks, sc->sc_id); sc->sc_rootmount = root_mount_hold("GRAID3"); G_RAID3_DEBUG(1, "root_mount_hold %p", sc->sc_rootmount); /* * Run timeout. 
*/ timeout = atomic_load_acq_int(&g_raid3_timeout); callout_reset(&sc->sc_callout, timeout * hz, g_raid3_go, sc); return (sc->sc_geom); } int g_raid3_destroy(struct g_raid3_softc *sc, int how) { struct g_provider *pp; g_topology_assert_not(); if (sc == NULL) return (ENXIO); sx_assert(&sc->sc_lock, SX_XLOCKED); pp = sc->sc_provider; if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { switch (how) { case G_RAID3_DESTROY_SOFT: G_RAID3_DEBUG(1, "Device %s is still open (r%dw%de%d).", pp->name, pp->acr, pp->acw, pp->ace); return (EBUSY); case G_RAID3_DESTROY_DELAYED: G_RAID3_DEBUG(1, "Device %s will be destroyed on last close.", pp->name); if (sc->sc_syncdisk != NULL) g_raid3_sync_stop(sc, 1); sc->sc_flags |= G_RAID3_DEVICE_FLAG_DESTROYING; return (EBUSY); case G_RAID3_DESTROY_HARD: G_RAID3_DEBUG(1, "Device %s is still open, so it " "can't be definitely removed.", pp->name); break; } } g_topology_lock(); if (sc->sc_geom->softc == NULL) { g_topology_unlock(); return (0); } sc->sc_geom->softc = NULL; sc->sc_sync.ds_geom->softc = NULL; g_topology_unlock(); sc->sc_flags |= G_RAID3_DEVICE_FLAG_DESTROY; sc->sc_flags |= G_RAID3_DEVICE_FLAG_WAIT; G_RAID3_DEBUG(4, "%s: Waking up %p.", __func__, sc); sx_xunlock(&sc->sc_lock); mtx_lock(&sc->sc_queue_mtx); wakeup(sc); wakeup(&sc->sc_queue); mtx_unlock(&sc->sc_queue_mtx); G_RAID3_DEBUG(4, "%s: Sleeping %p.", __func__, &sc->sc_worker); while (sc->sc_worker != NULL) tsleep(&sc->sc_worker, PRIBIO, "r3:destroy", hz / 5); G_RAID3_DEBUG(4, "%s: Woken up %p.", __func__, &sc->sc_worker); sx_xlock(&sc->sc_lock); g_raid3_destroy_device(sc); free(sc->sc_disks, M_RAID3); free(sc, M_RAID3); return (0); } static void g_raid3_taste_orphan(struct g_consumer *cp) { KASSERT(1 == 0, ("%s called while tasting %s.", __func__, cp->provider->name)); } static struct g_geom * g_raid3_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_raid3_metadata md; struct g_raid3_softc *sc; struct g_consumer *cp; struct g_geom *gp; int error; g_topology_assert(); g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); G_RAID3_DEBUG(2, "Tasting %s.", pp->name); gp = g_new_geomf(mp, "raid3:taste"); /* This orphan function should be never called. */ gp->orphan = g_raid3_taste_orphan; cp = g_new_consumer(gp); g_attach(cp, pp); error = g_raid3_read_metadata(cp, &md); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); if (error != 0) return (NULL); gp = NULL; if (md.md_provider[0] != '\0' && !g_compare_names(md.md_provider, pp->name)) return (NULL); if (md.md_provsize != 0 && md.md_provsize != pp->mediasize) return (NULL); if (g_raid3_debug >= 2) raid3_metadata_dump(&md); /* * Let's check if device already exists. 
*/ sc = NULL; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; if (sc->sc_sync.ds_geom == gp) continue; if (strcmp(md.md_name, sc->sc_name) != 0) continue; if (md.md_id != sc->sc_id) { G_RAID3_DEBUG(0, "Device %s already configured.", sc->sc_name); return (NULL); } break; } if (gp == NULL) { gp = g_raid3_create(mp, &md); if (gp == NULL) { G_RAID3_DEBUG(0, "Cannot create device %s.", md.md_name); return (NULL); } sc = gp->softc; } G_RAID3_DEBUG(1, "Adding disk %s to %s.", pp->name, gp->name); g_topology_unlock(); sx_xlock(&sc->sc_lock); error = g_raid3_add_disk(sc, pp, &md); if (error != 0) { G_RAID3_DEBUG(0, "Cannot add disk %s to %s (error=%d).", pp->name, gp->name, error); if (g_raid3_ndisks(sc, G_RAID3_DISK_STATE_NODISK) == sc->sc_ndisks) { g_cancel_event(sc); g_raid3_destroy(sc, G_RAID3_DESTROY_HARD); g_topology_lock(); return (NULL); } gp = NULL; } sx_xunlock(&sc->sc_lock); g_topology_lock(); return (gp); } static int g_raid3_destroy_geom(struct gctl_req *req __unused, struct g_class *mp __unused, struct g_geom *gp) { struct g_raid3_softc *sc; int error; g_topology_unlock(); sc = gp->softc; sx_xlock(&sc->sc_lock); g_cancel_event(sc); error = g_raid3_destroy(gp->softc, G_RAID3_DESTROY_SOFT); if (error != 0) sx_xunlock(&sc->sc_lock); g_topology_lock(); return (error); } static void g_raid3_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_raid3_softc *sc; g_topology_assert(); sc = gp->softc; if (sc == NULL) return; /* Skip synchronization geom. */ if (gp == sc->sc_sync.ds_geom) return; if (pp != NULL) { /* Nothing here. */ } else if (cp != NULL) { struct g_raid3_disk *disk; disk = cp->private; if (disk == NULL) return; g_topology_unlock(); sx_xlock(&sc->sc_lock); sbuf_printf(sb, "%s", indent); if (disk->d_no == sc->sc_ndisks - 1) sbuf_printf(sb, "PARITY"); else sbuf_printf(sb, "DATA"); sbuf_printf(sb, "\n"); sbuf_printf(sb, "%s%u\n", indent, (u_int)disk->d_no); if (disk->d_state == G_RAID3_DISK_STATE_SYNCHRONIZING) { sbuf_printf(sb, "%s", indent); if (disk->d_sync.ds_offset == 0) sbuf_printf(sb, "0%%"); else { sbuf_printf(sb, "%u%%", (u_int)((disk->d_sync.ds_offset * 100) / (sc->sc_mediasize / (sc->sc_ndisks - 1)))); } sbuf_printf(sb, "\n"); if (disk->d_sync.ds_offset > 0) { sbuf_printf(sb, "%s%jd" "\n", indent, (intmax_t)disk->d_sync.ds_offset); } } sbuf_printf(sb, "%s%u\n", indent, disk->d_sync.ds_syncid); sbuf_printf(sb, "%s%u\n", indent, disk->d_genid); sbuf_printf(sb, "%s", indent); if (disk->d_flags == 0) sbuf_printf(sb, "NONE"); else { int first = 1; #define ADD_FLAG(flag, name) do { \ if ((disk->d_flags & (flag)) != 0) { \ if (!first) \ sbuf_printf(sb, ", "); \ else \ first = 0; \ sbuf_printf(sb, name); \ } \ } while (0) ADD_FLAG(G_RAID3_DISK_FLAG_DIRTY, "DIRTY"); ADD_FLAG(G_RAID3_DISK_FLAG_HARDCODED, "HARDCODED"); ADD_FLAG(G_RAID3_DISK_FLAG_SYNCHRONIZING, "SYNCHRONIZING"); ADD_FLAG(G_RAID3_DISK_FLAG_FORCE_SYNC, "FORCE_SYNC"); ADD_FLAG(G_RAID3_DISK_FLAG_BROKEN, "BROKEN"); #undef ADD_FLAG } sbuf_printf(sb, "\n"); sbuf_printf(sb, "%s%s\n", indent, g_raid3_disk_state2str(disk->d_state)); sx_xunlock(&sc->sc_lock); g_topology_lock(); } else { g_topology_unlock(); sx_xlock(&sc->sc_lock); if (!g_raid3_use_malloc) { sbuf_printf(sb, "%s%u\n", indent, sc->sc_zones[G_RAID3_ZONE_4K].sz_requested); sbuf_printf(sb, "%s%u\n", indent, sc->sc_zones[G_RAID3_ZONE_4K].sz_failed); sbuf_printf(sb, "%s%u\n", indent, sc->sc_zones[G_RAID3_ZONE_16K].sz_requested); sbuf_printf(sb, "%s%u\n", indent, 
sc->sc_zones[G_RAID3_ZONE_16K].sz_failed); sbuf_printf(sb, "%s%u\n", indent, sc->sc_zones[G_RAID3_ZONE_64K].sz_requested); sbuf_printf(sb, "%s%u\n", indent, sc->sc_zones[G_RAID3_ZONE_64K].sz_failed); } sbuf_printf(sb, "%s%u\n", indent, (u_int)sc->sc_id); sbuf_printf(sb, "%s%u\n", indent, sc->sc_syncid); sbuf_printf(sb, "%s%u\n", indent, sc->sc_genid); sbuf_printf(sb, "%s", indent); if (sc->sc_flags == 0) sbuf_printf(sb, "NONE"); else { int first = 1; #define ADD_FLAG(flag, name) do { \ if ((sc->sc_flags & (flag)) != 0) { \ if (!first) \ sbuf_printf(sb, ", "); \ else \ first = 0; \ sbuf_printf(sb, name); \ } \ } while (0) ADD_FLAG(G_RAID3_DEVICE_FLAG_NOFAILSYNC, "NOFAILSYNC"); ADD_FLAG(G_RAID3_DEVICE_FLAG_NOAUTOSYNC, "NOAUTOSYNC"); ADD_FLAG(G_RAID3_DEVICE_FLAG_ROUND_ROBIN, "ROUND-ROBIN"); ADD_FLAG(G_RAID3_DEVICE_FLAG_VERIFY, "VERIFY"); #undef ADD_FLAG } sbuf_printf(sb, "\n"); sbuf_printf(sb, "%s%u\n", indent, sc->sc_ndisks); sbuf_printf(sb, "%s%s\n", indent, g_raid3_device_state2str(sc->sc_state)); sx_xunlock(&sc->sc_lock); g_topology_lock(); } } static void g_raid3_shutdown_post_sync(void *arg, int howto) { struct g_class *mp; struct g_geom *gp, *gp2; struct g_raid3_softc *sc; int error; mp = arg; g_topology_lock(); g_raid3_shutdown = 1; LIST_FOREACH_SAFE(gp, &mp->geom, geom, gp2) { if ((sc = gp->softc) == NULL) continue; /* Skip synchronization geom. */ if (gp == sc->sc_sync.ds_geom) continue; g_topology_unlock(); sx_xlock(&sc->sc_lock); g_raid3_idle(sc, -1); g_cancel_event(sc); error = g_raid3_destroy(sc, G_RAID3_DESTROY_DELAYED); if (error != 0) sx_xunlock(&sc->sc_lock); g_topology_lock(); } g_topology_unlock(); } static void g_raid3_init(struct g_class *mp) { g_raid3_post_sync = EVENTHANDLER_REGISTER(shutdown_post_sync, g_raid3_shutdown_post_sync, mp, SHUTDOWN_PRI_FIRST); if (g_raid3_post_sync == NULL) G_RAID3_DEBUG(0, "Warning! Cannot register shutdown event."); } static void g_raid3_fini(struct g_class *mp) { if (g_raid3_post_sync != NULL) EVENTHANDLER_DEREGISTER(shutdown_post_sync, g_raid3_post_sync); } DECLARE_GEOM_CLASS(g_raid3_class, g_raid3); +MODULE_VERSION(geom_raid3, 0); Index: user/markj/netdump/sys/geom/shsec/g_shsec.c =================================================================== --- user/markj/netdump/sys/geom/shsec/g_shsec.c (revision 332407) +++ user/markj/netdump/sys/geom/shsec/g_shsec.c (revision 332408) @@ -1,838 +1,839 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2005 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_shsec, "GEOM shared secret device support"); static MALLOC_DEFINE(M_SHSEC, "shsec_data", "GEOM_SHSEC Data"); static uma_zone_t g_shsec_zone; static int g_shsec_destroy(struct g_shsec_softc *sc, boolean_t force); static int g_shsec_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp); static g_taste_t g_shsec_taste; static g_ctl_req_t g_shsec_config; static g_dumpconf_t g_shsec_dumpconf; static g_init_t g_shsec_init; static g_fini_t g_shsec_fini; struct g_class g_shsec_class = { .name = G_SHSEC_CLASS_NAME, .version = G_VERSION, .ctlreq = g_shsec_config, .taste = g_shsec_taste, .destroy_geom = g_shsec_destroy_geom, .init = g_shsec_init, .fini = g_shsec_fini }; SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, shsec, CTLFLAG_RW, 0, "GEOM_SHSEC stuff"); static u_int g_shsec_debug = 0; SYSCTL_UINT(_kern_geom_shsec, OID_AUTO, debug, CTLFLAG_RWTUN, &g_shsec_debug, 0, "Debug level"); static u_int g_shsec_maxmem = MAXPHYS * 100; SYSCTL_UINT(_kern_geom_shsec, OID_AUTO, maxmem, CTLFLAG_RDTUN, &g_shsec_maxmem, 0, "Maximum memory that can be allocated for I/O (in bytes)"); static u_int g_shsec_alloc_failed = 0; SYSCTL_UINT(_kern_geom_shsec, OID_AUTO, alloc_failed, CTLFLAG_RD, &g_shsec_alloc_failed, 0, "How many times I/O allocation failed"); /* * Greatest Common Divisor. */ static u_int gcd(u_int a, u_int b) { u_int c; while (b != 0) { c = a; a = b; b = (c % b); } return (a); } /* * Least Common Multiple. */ static u_int lcm(u_int a, u_int b) { return ((a * b) / gcd(a, b)); } static void g_shsec_init(struct g_class *mp __unused) { g_shsec_zone = uma_zcreate("g_shsec_zone", MAXPHYS, NULL, NULL, NULL, NULL, 0, 0); g_shsec_maxmem -= g_shsec_maxmem % MAXPHYS; uma_zone_set_max(g_shsec_zone, g_shsec_maxmem / MAXPHYS); } static void g_shsec_fini(struct g_class *mp __unused) { uma_zdestroy(g_shsec_zone); } /* * Return the number of valid disks. 
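 */
/*
 * Illustrative aside, not part of the change: lcm() above is later used
 * to derive a provider sector size that every component can serve, e.g.
 * lcm(512, 4096) == 4096.  Note that (a * b) / gcd(a, b) can overflow
 * u_int for large co-prime arguments; dividing first is the usual
 * overflow-safer formulation:
 */
#if 0
static u_int
lcm_safe(u_int a, u_int b)
{

	return (a / gcd(a, b) * b);
}
#endif
/*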
*/ static u_int g_shsec_nvalid(struct g_shsec_softc *sc) { u_int i, no; no = 0; for (i = 0; i < sc->sc_ndisks; i++) { if (sc->sc_disks[i] != NULL) no++; } return (no); } static void g_shsec_remove_disk(struct g_consumer *cp) { struct g_shsec_softc *sc; u_int no; KASSERT(cp != NULL, ("Non-valid disk in %s.", __func__)); sc = (struct g_shsec_softc *)cp->private; KASSERT(sc != NULL, ("NULL sc in %s.", __func__)); no = cp->index; G_SHSEC_DEBUG(0, "Disk %s removed from %s.", cp->provider->name, sc->sc_name); sc->sc_disks[no] = NULL; if (sc->sc_provider != NULL) { g_wither_provider(sc->sc_provider, ENXIO); sc->sc_provider = NULL; G_SHSEC_DEBUG(0, "Device %s removed.", sc->sc_name); } if (cp->acr > 0 || cp->acw > 0 || cp->ace > 0) g_access(cp, -cp->acr, -cp->acw, -cp->ace); g_detach(cp); g_destroy_consumer(cp); } static void g_shsec_orphan(struct g_consumer *cp) { struct g_shsec_softc *sc; struct g_geom *gp; g_topology_assert(); gp = cp->geom; sc = gp->softc; if (sc == NULL) return; g_shsec_remove_disk(cp); /* If there are no valid disks anymore, remove device. */ if (g_shsec_nvalid(sc) == 0) g_shsec_destroy(sc, 1); } static int g_shsec_access(struct g_provider *pp, int dr, int dw, int de) { struct g_consumer *cp1, *cp2; struct g_shsec_softc *sc; struct g_geom *gp; int error; gp = pp->geom; sc = gp->softc; if (sc == NULL) { /* * It looks like geom is being withered. * In that case we allow only negative requests. */ KASSERT(dr <= 0 && dw <= 0 && de <= 0, ("Positive access request (device=%s).", pp->name)); if ((pp->acr + dr) == 0 && (pp->acw + dw) == 0 && (pp->ace + de) == 0) { G_SHSEC_DEBUG(0, "Device %s definitely destroyed.", gp->name); } return (0); } /* On first open, grab an extra "exclusive" bit */ if (pp->acr == 0 && pp->acw == 0 && pp->ace == 0) de++; /* ... and let go of it on last close */ if ((pp->acr + dr) == 0 && (pp->acw + dw) == 0 && (pp->ace + de) == 0) de--; error = ENXIO; LIST_FOREACH(cp1, &gp->consumer, consumer) { error = g_access(cp1, dr, dw, de); if (error == 0) continue; /* * If we fail here, backout all previous changes. 
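 */
/*
 * Illustrative sketch, not driver code: the rollback below re-walks the
 * consumer list, undoing the delta on every consumer that was already
 * updated and returning once it reaches the one that failed.  The same
 * shape over an array of counters:
 */
#if 0
static void
shsec_backout(int *counts, u_int nupdated, int delta)
{
	u_int i;

	for (i = 0; i < nupdated; i++)
		counts[i] -= delta;	/* Undo the successful updates. */
}
#endif
/*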
*/ LIST_FOREACH(cp2, &gp->consumer, consumer) { if (cp1 == cp2) return (error); g_access(cp2, -dr, -dw, -de); } /* NOTREACHED */ } return (error); } static void g_shsec_xor1(uint32_t *src, uint32_t *dst, ssize_t len) { for (; len > 0; len -= sizeof(uint32_t), dst++) *dst = *dst ^ *src++; KASSERT(len == 0, ("len != 0 (len=%zd)", len)); } static void g_shsec_done(struct bio *bp) { struct g_shsec_softc *sc; struct bio *pbp; pbp = bp->bio_parent; sc = pbp->bio_to->geom->softc; if (bp->bio_error == 0) G_SHSEC_LOGREQ(2, bp, "Request done."); else { G_SHSEC_LOGREQ(0, bp, "Request failed (error=%d).", bp->bio_error); if (pbp->bio_error == 0) pbp->bio_error = bp->bio_error; } if (pbp->bio_cmd == BIO_READ) { if ((pbp->bio_pflags & G_SHSEC_BFLAG_FIRST) != 0) { bcopy(bp->bio_data, pbp->bio_data, pbp->bio_length); pbp->bio_pflags = 0; } else { g_shsec_xor1((uint32_t *)bp->bio_data, (uint32_t *)pbp->bio_data, (ssize_t)pbp->bio_length); } } bzero(bp->bio_data, bp->bio_length); uma_zfree(g_shsec_zone, bp->bio_data); g_destroy_bio(bp); pbp->bio_inbed++; if (pbp->bio_children == pbp->bio_inbed) { pbp->bio_completed = pbp->bio_length; g_io_deliver(pbp, pbp->bio_error); } } static void g_shsec_xor2(uint32_t *rand, uint32_t *dst, ssize_t len) { for (; len > 0; len -= sizeof(uint32_t), dst++) { *rand = arc4random(); *dst = *dst ^ *rand++; } KASSERT(len == 0, ("len != 0 (len=%zd)", len)); } static void g_shsec_start(struct bio *bp) { TAILQ_HEAD(, bio) queue = TAILQ_HEAD_INITIALIZER(queue); struct g_shsec_softc *sc; struct bio *cbp; uint32_t *dst; ssize_t len; u_int no; int error; sc = bp->bio_to->geom->softc; /* * If sc == NULL, provider's error should be set and g_shsec_start() * should not be called at all. */ KASSERT(sc != NULL, ("Provider's error should be set (error=%d)(device=%s).", bp->bio_to->error, bp->bio_to->name)); G_SHSEC_LOGREQ(2, bp, "Request received."); switch (bp->bio_cmd) { case BIO_READ: case BIO_WRITE: case BIO_FLUSH: /* * Only those requests are supported. */ break; case BIO_DELETE: case BIO_GETATTR: /* To which provider it should be delivered? */ default: g_io_deliver(bp, EOPNOTSUPP); return; } /* * Allocate all bios first and calculate XOR. */ dst = NULL; len = bp->bio_length; if (bp->bio_cmd == BIO_READ) bp->bio_pflags = G_SHSEC_BFLAG_FIRST; for (no = 0; no < sc->sc_ndisks; no++) { cbp = g_clone_bio(bp); if (cbp == NULL) { error = ENOMEM; goto failure; } TAILQ_INSERT_TAIL(&queue, cbp, bio_queue); /* * Fill in the component buf structure. */ cbp->bio_done = g_shsec_done; cbp->bio_data = uma_zalloc(g_shsec_zone, M_NOWAIT); if (cbp->bio_data == NULL) { g_shsec_alloc_failed++; error = ENOMEM; goto failure; } cbp->bio_caller2 = sc->sc_disks[no]; if (bp->bio_cmd == BIO_WRITE) { if (no == 0) { dst = (uint32_t *)cbp->bio_data; bcopy(bp->bio_data, dst, len); } else { g_shsec_xor2((uint32_t *)cbp->bio_data, dst, len); } } } /* * Fire off all allocated requests! 
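 */
/*
 * Illustrative sketch, not driver code: the write path above is a plain
 * XOR secret split.  Component 0 starts as the plaintext and is XORed
 * with the fresh random block written to every other component, so no
 * single component reveals anything and the XOR of all of them recovers
 * the data; that is exactly the accumulation g_shsec_done() performs
 * with g_shsec_xor1(), in whatever order the component reads complete.
 */
#if 0
static void
shsec_recover(uint32_t *dst, uint32_t *const *shares, u_int nshares,
    size_t words)
{
	size_t w;
	u_int i;

	for (w = 0; w < words; w++) {
		dst[w] = shares[0][w];
		for (i = 1; i < nshares; i++)
			dst[w] ^= shares[i][w];
	}
}
#endif
/*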
*/ while ((cbp = TAILQ_FIRST(&queue)) != NULL) { struct g_consumer *cp; TAILQ_REMOVE(&queue, cbp, bio_queue); cp = cbp->bio_caller2; cbp->bio_caller2 = NULL; cbp->bio_to = cp->provider; G_SHSEC_LOGREQ(2, cbp, "Sending request."); g_io_request(cbp, cp); } return; failure: while ((cbp = TAILQ_FIRST(&queue)) != NULL) { TAILQ_REMOVE(&queue, cbp, bio_queue); bp->bio_children--; if (cbp->bio_data != NULL) { bzero(cbp->bio_data, cbp->bio_length); uma_zfree(g_shsec_zone, cbp->bio_data); } g_destroy_bio(cbp); } if (bp->bio_error == 0) bp->bio_error = error; g_io_deliver(bp, bp->bio_error); } static void g_shsec_check_and_run(struct g_shsec_softc *sc) { off_t mediasize, ms; u_int no, sectorsize = 0; if (g_shsec_nvalid(sc) != sc->sc_ndisks) return; sc->sc_provider = g_new_providerf(sc->sc_geom, "shsec/%s", sc->sc_name); /* * Find the smallest disk. */ mediasize = sc->sc_disks[0]->provider->mediasize; mediasize -= sc->sc_disks[0]->provider->sectorsize; sectorsize = sc->sc_disks[0]->provider->sectorsize; for (no = 1; no < sc->sc_ndisks; no++) { ms = sc->sc_disks[no]->provider->mediasize; ms -= sc->sc_disks[no]->provider->sectorsize; if (ms < mediasize) mediasize = ms; sectorsize = lcm(sectorsize, sc->sc_disks[no]->provider->sectorsize); } sc->sc_provider->sectorsize = sectorsize; sc->sc_provider->mediasize = mediasize; g_error_provider(sc->sc_provider, 0); G_SHSEC_DEBUG(0, "Device %s activated.", sc->sc_name); } static int g_shsec_read_metadata(struct g_consumer *cp, struct g_shsec_metadata *md) { struct g_provider *pp; u_char *buf; int error; g_topology_assert(); error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) return (error); /* Decode metadata. */ shsec_metadata_decode(buf, md); g_free(buf); return (0); } /* * Add disk to given device. */ static int g_shsec_add_disk(struct g_shsec_softc *sc, struct g_provider *pp, u_int no) { struct g_consumer *cp, *fcp; struct g_geom *gp; struct g_shsec_metadata md; int error; /* Metadata corrupted? */ if (no >= sc->sc_ndisks) return (EINVAL); /* Check if disk is not already attached. */ if (sc->sc_disks[no] != NULL) return (EEXIST); gp = sc->sc_geom; fcp = LIST_FIRST(&gp->consumer); cp = g_new_consumer(gp); error = g_attach(cp, pp); if (error != 0) { g_destroy_consumer(cp); return (error); } if (fcp != NULL && (fcp->acr > 0 || fcp->acw > 0 || fcp->ace > 0)) { error = g_access(cp, fcp->acr, fcp->acw, fcp->ace); if (error != 0) { g_detach(cp); g_destroy_consumer(cp); return (error); } } /* Reread metadata. */ error = g_shsec_read_metadata(cp, &md); if (error != 0) goto fail; if (strcmp(md.md_magic, G_SHSEC_MAGIC) != 0 || strcmp(md.md_name, sc->sc_name) != 0 || md.md_id != sc->sc_id) { G_SHSEC_DEBUG(0, "Metadata on %s changed.", pp->name); goto fail; } cp->private = sc; cp->index = no; sc->sc_disks[no] = cp; G_SHSEC_DEBUG(0, "Disk %s attached to %s.", pp->name, sc->sc_name); g_shsec_check_and_run(sc); return (0); fail: if (fcp != NULL && (fcp->acr > 0 || fcp->acw > 0 || fcp->ace > 0)) g_access(cp, -fcp->acr, -fcp->acw, -fcp->ace); g_detach(cp); g_destroy_consumer(cp); return (error); } static struct g_geom * g_shsec_create(struct g_class *mp, const struct g_shsec_metadata *md) { struct g_shsec_softc *sc; struct g_geom *gp; u_int no; G_SHSEC_DEBUG(1, "Creating device %s (id=%u).", md->md_name, md->md_id); /* Two disks is minimum. 
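 */
/*
 * Illustrative sketch, not driver code: g_shsec_check_and_run() above
 * sizes the provider from its components.  Each component reserves its
 * last sector for metadata, the device can be no larger than its
 * smallest component, and the provider sector size must be a multiple
 * of every component's sector size, hence the running lcm().  The same
 * computation over plain arrays:
 */
#if 0
static void
shsec_geometry(const off_t *msizes, const u_int *ssizes, u_int n,
    off_t *mediasizep, u_int *sectorsizep)
{
	off_t mediasize, ms;
	u_int i, sectorsize;

	mediasize = msizes[0] - ssizes[0];	/* Reserve metadata sector. */
	sectorsize = ssizes[0];
	for (i = 1; i < n; i++) {
		ms = msizes[i] - ssizes[i];
		if (ms < mediasize)
			mediasize = ms;
		sectorsize = lcm(sectorsize, ssizes[i]);
	}
	*mediasizep = mediasize;
	*sectorsizep = sectorsize;
}
#endif
/*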
*/ if (md->md_all < 2) { G_SHSEC_DEBUG(0, "Too few disks defined for %s.", md->md_name); return (NULL); } /* Check for duplicate unit */ LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc != NULL && strcmp(sc->sc_name, md->md_name) == 0) { G_SHSEC_DEBUG(0, "Device %s already configured.", sc->sc_name); return (NULL); } } gp = g_new_geomf(mp, "%s", md->md_name); sc = malloc(sizeof(*sc), M_SHSEC, M_WAITOK | M_ZERO); gp->start = g_shsec_start; gp->spoiled = g_shsec_orphan; gp->orphan = g_shsec_orphan; gp->access = g_shsec_access; gp->dumpconf = g_shsec_dumpconf; sc->sc_id = md->md_id; sc->sc_ndisks = md->md_all; sc->sc_disks = malloc(sizeof(struct g_consumer *) * sc->sc_ndisks, M_SHSEC, M_WAITOK | M_ZERO); for (no = 0; no < sc->sc_ndisks; no++) sc->sc_disks[no] = NULL; gp->softc = sc; sc->sc_geom = gp; sc->sc_provider = NULL; G_SHSEC_DEBUG(0, "Device %s created (id=%u).", sc->sc_name, sc->sc_id); return (gp); } static int g_shsec_destroy(struct g_shsec_softc *sc, boolean_t force) { struct g_provider *pp; struct g_geom *gp; u_int no; g_topology_assert(); if (sc == NULL) return (ENXIO); pp = sc->sc_provider; if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { if (force) { G_SHSEC_DEBUG(0, "Device %s is still open, so it " "can't be definitely removed.", pp->name); } else { G_SHSEC_DEBUG(1, "Device %s is still open (r%dw%de%d).", pp->name, pp->acr, pp->acw, pp->ace); return (EBUSY); } } for (no = 0; no < sc->sc_ndisks; no++) { if (sc->sc_disks[no] != NULL) g_shsec_remove_disk(sc->sc_disks[no]); } gp = sc->sc_geom; gp->softc = NULL; KASSERT(sc->sc_provider == NULL, ("Provider still exists? (device=%s)", gp->name)); free(sc->sc_disks, M_SHSEC); free(sc, M_SHSEC); pp = LIST_FIRST(&gp->provider); if (pp == NULL || (pp->acr == 0 && pp->acw == 0 && pp->ace == 0)) G_SHSEC_DEBUG(0, "Device %s destroyed.", gp->name); g_wither_geom(gp, ENXIO); return (0); } static int g_shsec_destroy_geom(struct gctl_req *req __unused, struct g_class *mp __unused, struct g_geom *gp) { struct g_shsec_softc *sc; sc = gp->softc; return (g_shsec_destroy(sc, 0)); } static struct g_geom * g_shsec_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_shsec_metadata md; struct g_shsec_softc *sc; struct g_consumer *cp; struct g_geom *gp; int error; g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); g_topology_assert(); /* Skip providers that are already open for writing. */ if (pp->acw > 0) return (NULL); G_SHSEC_DEBUG(3, "Tasting %s.", pp->name); gp = g_new_geomf(mp, "shsec:taste"); gp->start = g_shsec_start; gp->access = g_shsec_access; gp->orphan = g_shsec_orphan; cp = g_new_consumer(gp); g_attach(cp, pp); error = g_shsec_read_metadata(cp, &md); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); if (error != 0) return (NULL); gp = NULL; if (strcmp(md.md_magic, G_SHSEC_MAGIC) != 0) return (NULL); if (md.md_version > G_SHSEC_VERSION) { G_SHSEC_DEBUG(0, "Kernel module is too old to handle %s.\n", pp->name); return (NULL); } /* * Backward compatibility: */ /* There was no md_provsize field in earlier versions of metadata. */ if (md.md_version < 1) md.md_provsize = pp->mediasize; if (md.md_provider[0] != '\0' && !g_compare_names(md.md_provider, pp->name)) return (NULL); if (md.md_provsize != pp->mediasize) return (NULL); /* * Let's check if device already exists. 
*/ sc = NULL; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; if (strcmp(md.md_name, sc->sc_name) != 0) continue; if (md.md_id != sc->sc_id) continue; break; } if (gp != NULL) { G_SHSEC_DEBUG(1, "Adding disk %s to %s.", pp->name, gp->name); error = g_shsec_add_disk(sc, pp, md.md_no); if (error != 0) { G_SHSEC_DEBUG(0, "Cannot add disk %s to %s (error=%d).", pp->name, gp->name, error); return (NULL); } } else { gp = g_shsec_create(mp, &md); if (gp == NULL) { G_SHSEC_DEBUG(0, "Cannot create device %s.", md.md_name); return (NULL); } sc = gp->softc; G_SHSEC_DEBUG(1, "Adding disk %s to %s.", pp->name, gp->name); error = g_shsec_add_disk(sc, pp, md.md_no); if (error != 0) { G_SHSEC_DEBUG(0, "Cannot add disk %s to %s (error=%d).", pp->name, gp->name, error); g_shsec_destroy(sc, 1); return (NULL); } } return (gp); } static struct g_shsec_softc * g_shsec_find_device(struct g_class *mp, const char *name) { struct g_shsec_softc *sc; struct g_geom *gp; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; if (strcmp(sc->sc_name, name) == 0) return (sc); } return (NULL); } static void g_shsec_ctl_destroy(struct gctl_req *req, struct g_class *mp) { struct g_shsec_softc *sc; int *force, *nargs, error; const char *name; char param[16]; u_int i; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } force = gctl_get_paraml(req, "force", sizeof(*force)); if (force == NULL) { gctl_error(req, "No '%s' argument.", "force"); return; } for (i = 0; i < (u_int)*nargs; i++) { snprintf(param, sizeof(param), "arg%u", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", i); return; } sc = g_shsec_find_device(mp, name); if (sc == NULL) { gctl_error(req, "No such device: %s.", name); return; } error = g_shsec_destroy(sc, *force); if (error != 0) { gctl_error(req, "Cannot destroy device %s (error=%d).", sc->sc_name, error); return; } } } static void g_shsec_config(struct gctl_req *req, struct g_class *mp, const char *verb) { uint32_t *version; g_topology_assert(); version = gctl_get_paraml(req, "version", sizeof(*version)); if (version == NULL) { gctl_error(req, "No '%s' argument.", "version"); return; } if (*version != G_SHSEC_VERSION) { gctl_error(req, "Userland and kernel parts are out of sync."); return; } if (strcmp(verb, "stop") == 0) { g_shsec_ctl_destroy(req, mp); return; } gctl_error(req, "Unknown verb."); } static void g_shsec_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_shsec_softc *sc; sc = gp->softc; if (sc == NULL) return; if (pp != NULL) { /* Nothing here. 
*/ } else if (cp != NULL) { sbuf_printf(sb, "%s%u\n", indent, (u_int)cp->index); } else { sbuf_printf(sb, "%s%u\n", indent, (u_int)sc->sc_id); sbuf_printf(sb, "%sTotal=%u, Online=%u\n", indent, sc->sc_ndisks, g_shsec_nvalid(sc)); sbuf_printf(sb, "%s", indent); if (sc->sc_provider != NULL && sc->sc_provider->error == 0) sbuf_printf(sb, "UP"); else sbuf_printf(sb, "DOWN"); sbuf_printf(sb, "\n"); } } DECLARE_GEOM_CLASS(g_shsec_class, g_shsec); +MODULE_VERSION(geom_shsec, 0); Index: user/markj/netdump/sys/geom/stripe/g_stripe.c =================================================================== --- user/markj/netdump/sys/geom/stripe/g_stripe.c (revision 332407) +++ user/markj/netdump/sys/geom/stripe/g_stripe.c (revision 332408) @@ -1,1272 +1,1273 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004-2005 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(geom_stripe, "GEOM striping support"); static MALLOC_DEFINE(M_STRIPE, "stripe_data", "GEOM_STRIPE Data"); static uma_zone_t g_stripe_zone; static int g_stripe_destroy(struct g_stripe_softc *sc, boolean_t force); static int g_stripe_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp); static g_taste_t g_stripe_taste; static g_ctl_req_t g_stripe_config; static g_dumpconf_t g_stripe_dumpconf; static g_init_t g_stripe_init; static g_fini_t g_stripe_fini; struct g_class g_stripe_class = { .name = G_STRIPE_CLASS_NAME, .version = G_VERSION, .ctlreq = g_stripe_config, .taste = g_stripe_taste, .destroy_geom = g_stripe_destroy_geom, .init = g_stripe_init, .fini = g_stripe_fini }; SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, stripe, CTLFLAG_RW, 0, "GEOM_STRIPE stuff"); static u_int g_stripe_debug = 0; SYSCTL_UINT(_kern_geom_stripe, OID_AUTO, debug, CTLFLAG_RWTUN, &g_stripe_debug, 0, "Debug level"); static int g_stripe_fast = 0; static int g_sysctl_stripe_fast(SYSCTL_HANDLER_ARGS) { int error, fast; fast = g_stripe_fast; error = sysctl_handle_int(oidp, &fast, 0, req); if (error == 0 && req->newptr != NULL) g_stripe_fast = fast; return (error); } SYSCTL_PROC(_kern_geom_stripe, OID_AUTO, fast, CTLTYPE_INT | CTLFLAG_RWTUN, NULL, 0, g_sysctl_stripe_fast, "I", "Fast, but memory-consuming, mode"); static u_int g_stripe_maxmem = MAXPHYS * 100; SYSCTL_UINT(_kern_geom_stripe, OID_AUTO, maxmem, CTLFLAG_RDTUN, &g_stripe_maxmem, 0, "Maximum memory that can be allocated in \"fast\" mode (in bytes)"); static u_int g_stripe_fast_failed = 0; SYSCTL_UINT(_kern_geom_stripe, OID_AUTO, fast_failed, CTLFLAG_RD, &g_stripe_fast_failed, 0, "How many times \"fast\" mode failed"); /* * Greatest Common Divisor. */ static u_int gcd(u_int a, u_int b) { u_int c; while (b != 0) { c = a; a = b; b = (c % b); } return (a); } /* * Least Common Multiple. */ static u_int lcm(u_int a, u_int b) { return ((a * b) / gcd(a, b)); } static void g_stripe_init(struct g_class *mp __unused) { g_stripe_zone = uma_zcreate("g_stripe_zone", MAXPHYS, NULL, NULL, NULL, NULL, 0, 0); g_stripe_maxmem -= g_stripe_maxmem % MAXPHYS; uma_zone_set_max(g_stripe_zone, g_stripe_maxmem / MAXPHYS); } static void g_stripe_fini(struct g_class *mp __unused) { uma_zdestroy(g_stripe_zone); } /* * Return the number of valid disks. */ static u_int g_stripe_nvalid(struct g_stripe_softc *sc) { u_int i, no; no = 0; for (i = 0; i < sc->sc_ndisks; i++) { if (sc->sc_disks[i] != NULL) no++; } return (no); } static void g_stripe_remove_disk(struct g_consumer *cp) { struct g_stripe_softc *sc; g_topology_assert(); KASSERT(cp != NULL, ("Non-valid disk in %s.", __func__)); sc = (struct g_stripe_softc *)cp->geom->softc; KASSERT(sc != NULL, ("NULL sc in %s.", __func__)); if (cp->private == NULL) { G_STRIPE_DEBUG(0, "Disk %s removed from %s.", cp->provider->name, sc->sc_name); cp->private = (void *)(uintptr_t)-1; } if (sc->sc_provider != NULL) { G_STRIPE_DEBUG(0, "Device %s deactivated.", sc->sc_provider->name); g_wither_provider(sc->sc_provider, ENXIO); sc->sc_provider = NULL; } if (cp->acr > 0 || cp->acw > 0 || cp->ace > 0) return; sc->sc_disks[cp->index] = NULL; cp->index = 0; g_detach(cp); g_destroy_consumer(cp); /* If there are no valid disks anymore, remove device. 
*/ if (LIST_EMPTY(&sc->sc_geom->consumer)) g_stripe_destroy(sc, 1); } static void g_stripe_orphan(struct g_consumer *cp) { struct g_stripe_softc *sc; struct g_geom *gp; g_topology_assert(); gp = cp->geom; sc = gp->softc; if (sc == NULL) return; g_stripe_remove_disk(cp); } static int g_stripe_access(struct g_provider *pp, int dr, int dw, int de) { struct g_consumer *cp1, *cp2, *tmp; struct g_stripe_softc *sc; struct g_geom *gp; int error; g_topology_assert(); gp = pp->geom; sc = gp->softc; KASSERT(sc != NULL, ("NULL sc in %s.", __func__)); /* On first open, grab an extra "exclusive" bit */ if (pp->acr == 0 && pp->acw == 0 && pp->ace == 0) de++; /* ... and let go of it on last close */ if ((pp->acr + dr) == 0 && (pp->acw + dw) == 0 && (pp->ace + de) == 0) de--; LIST_FOREACH_SAFE(cp1, &gp->consumer, consumer, tmp) { error = g_access(cp1, dr, dw, de); if (error != 0) goto fail; if (cp1->acr == 0 && cp1->acw == 0 && cp1->ace == 0 && cp1->private != NULL) { g_stripe_remove_disk(cp1); /* May destroy geom. */ } } return (0); fail: LIST_FOREACH(cp2, &gp->consumer, consumer) { if (cp1 == cp2) break; g_access(cp2, -dr, -dw, -de); } return (error); } static void g_stripe_copy(struct g_stripe_softc *sc, char *src, char *dst, off_t offset, off_t length, int mode) { u_int stripesize; size_t len; stripesize = sc->sc_stripesize; len = (size_t)(stripesize - (offset & (stripesize - 1))); do { bcopy(src, dst, len); if (mode) { dst += len + stripesize * (sc->sc_ndisks - 1); src += len; } else { dst += len; src += len + stripesize * (sc->sc_ndisks - 1); } length -= len; KASSERT(length >= 0, ("Length < 0 (stripesize=%zu, offset=%jd, length=%jd).", (size_t)stripesize, (intmax_t)offset, (intmax_t)length)); if (length > stripesize) len = stripesize; else len = length; } while (length > 0); } static void g_stripe_done(struct bio *bp) { struct g_stripe_softc *sc; struct bio *pbp; pbp = bp->bio_parent; sc = pbp->bio_to->geom->softc; if (bp->bio_cmd == BIO_READ && bp->bio_caller1 != NULL) { g_stripe_copy(sc, bp->bio_data, bp->bio_caller1, bp->bio_offset, bp->bio_length, 1); bp->bio_data = bp->bio_caller1; bp->bio_caller1 = NULL; } mtx_lock(&sc->sc_lock); if (pbp->bio_error == 0) pbp->bio_error = bp->bio_error; pbp->bio_completed += bp->bio_completed; pbp->bio_inbed++; if (pbp->bio_children == pbp->bio_inbed) { mtx_unlock(&sc->sc_lock); if (pbp->bio_driver1 != NULL) uma_zfree(g_stripe_zone, pbp->bio_driver1); g_io_deliver(pbp, pbp->bio_error); } else mtx_unlock(&sc->sc_lock); g_destroy_bio(bp); } static int g_stripe_start_fast(struct bio *bp, u_int no, off_t offset, off_t length) { TAILQ_HEAD(, bio) queue = TAILQ_HEAD_INITIALIZER(queue); u_int nparts = 0, stripesize; struct g_stripe_softc *sc; char *addr, *data = NULL; struct bio *cbp; int error; sc = bp->bio_to->geom->softc; addr = bp->bio_data; stripesize = sc->sc_stripesize; cbp = g_clone_bio(bp); if (cbp == NULL) { error = ENOMEM; goto failure; } TAILQ_INSERT_TAIL(&queue, cbp, bio_queue); nparts++; /* * Fill in the component buf structure. 
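* In "fast" mode a single cloned bio per disk carries all of that disk's chunks: the loop below folds later chunks into already allocated clones, and the scattered data is gathered into (or out of) a bounce buffer from g_stripe_zone by g_stripe_copy().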
*/ cbp->bio_done = g_stripe_done; cbp->bio_offset = offset; cbp->bio_data = addr; cbp->bio_caller1 = NULL; cbp->bio_length = length; cbp->bio_caller2 = sc->sc_disks[no]; /* offset -= offset % stripesize; */ offset -= offset & (stripesize - 1); addr += length; length = bp->bio_length - length; for (no++; length > 0; no++, length -= stripesize, addr += stripesize) { if (no > sc->sc_ndisks - 1) { no = 0; offset += stripesize; } if (nparts >= sc->sc_ndisks) { cbp = TAILQ_NEXT(cbp, bio_queue); if (cbp == NULL) cbp = TAILQ_FIRST(&queue); nparts++; /* * Update bio structure. */ /* * MIN() is in case when * (bp->bio_length % sc->sc_stripesize) != 0. */ cbp->bio_length += MIN(stripesize, length); if (cbp->bio_caller1 == NULL) { cbp->bio_caller1 = cbp->bio_data; cbp->bio_data = NULL; if (data == NULL) { data = uma_zalloc(g_stripe_zone, M_NOWAIT); if (data == NULL) { error = ENOMEM; goto failure; } } } } else { cbp = g_clone_bio(bp); if (cbp == NULL) { error = ENOMEM; goto failure; } TAILQ_INSERT_TAIL(&queue, cbp, bio_queue); nparts++; /* * Fill in the component buf structure. */ cbp->bio_done = g_stripe_done; cbp->bio_offset = offset; cbp->bio_data = addr; cbp->bio_caller1 = NULL; /* * MIN() is in case when * (bp->bio_length % sc->sc_stripesize) != 0. */ cbp->bio_length = MIN(stripesize, length); cbp->bio_caller2 = sc->sc_disks[no]; } } if (data != NULL) bp->bio_driver1 = data; /* * Fire off all allocated requests! */ while ((cbp = TAILQ_FIRST(&queue)) != NULL) { struct g_consumer *cp; TAILQ_REMOVE(&queue, cbp, bio_queue); cp = cbp->bio_caller2; cbp->bio_caller2 = NULL; cbp->bio_to = cp->provider; if (cbp->bio_caller1 != NULL) { cbp->bio_data = data; if (bp->bio_cmd == BIO_WRITE) { g_stripe_copy(sc, cbp->bio_caller1, data, cbp->bio_offset, cbp->bio_length, 0); } data += cbp->bio_length; } G_STRIPE_LOGREQ(cbp, "Sending request."); g_io_request(cbp, cp); } return (0); failure: if (data != NULL) uma_zfree(g_stripe_zone, data); while ((cbp = TAILQ_FIRST(&queue)) != NULL) { TAILQ_REMOVE(&queue, cbp, bio_queue); if (cbp->bio_caller1 != NULL) { cbp->bio_data = cbp->bio_caller1; cbp->bio_caller1 = NULL; } bp->bio_children--; g_destroy_bio(cbp); } return (error); } static int g_stripe_start_economic(struct bio *bp, u_int no, off_t offset, off_t length) { TAILQ_HEAD(, bio) queue = TAILQ_HEAD_INITIALIZER(queue); struct g_stripe_softc *sc; uint32_t stripesize; struct bio *cbp; char *addr; int error; sc = bp->bio_to->geom->softc; stripesize = sc->sc_stripesize; cbp = g_clone_bio(bp); if (cbp == NULL) { error = ENOMEM; goto failure; } TAILQ_INSERT_TAIL(&queue, cbp, bio_queue); /* * Fill in the component buf structure. */ if (bp->bio_length == length) cbp->bio_done = g_std_done; /* Optimized lockless case. */ else cbp->bio_done = g_stripe_done; cbp->bio_offset = offset; cbp->bio_length = length; if ((bp->bio_flags & BIO_UNMAPPED) != 0) { bp->bio_ma_n = round_page(bp->bio_ma_offset + bp->bio_length) / PAGE_SIZE; addr = NULL; } else addr = bp->bio_data; cbp->bio_caller2 = sc->sc_disks[no]; /* offset -= offset % stripesize; */ offset -= offset & (stripesize - 1); if (bp->bio_cmd != BIO_DELETE) addr += length; length = bp->bio_length - length; for (no++; length > 0; no++, length -= stripesize) { if (no > sc->sc_ndisks - 1) { no = 0; offset += stripesize; } cbp = g_clone_bio(bp); if (cbp == NULL) { error = ENOMEM; goto failure; } TAILQ_INSERT_TAIL(&queue, cbp, bio_queue); /* * Fill in the component buf structure. 
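* In "economic" mode, by contrast, every stripe-sized chunk gets a clone of its own, pointed either at the parent's data buffer or, for unmapped I/O, at the parent's page list; no bounce buffer is involved.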
*/ cbp->bio_done = g_stripe_done; cbp->bio_offset = offset; /* * MIN() is in case when * (bp->bio_length % sc->sc_stripesize) != 0. */ cbp->bio_length = MIN(stripesize, length); if ((bp->bio_flags & BIO_UNMAPPED) != 0) { cbp->bio_ma_offset += (uintptr_t)addr; cbp->bio_ma += cbp->bio_ma_offset / PAGE_SIZE; cbp->bio_ma_offset %= PAGE_SIZE; cbp->bio_ma_n = round_page(cbp->bio_ma_offset + cbp->bio_length) / PAGE_SIZE; } else cbp->bio_data = addr; cbp->bio_caller2 = sc->sc_disks[no]; if (bp->bio_cmd != BIO_DELETE) addr += stripesize; } /* * Fire off all allocated requests! */ while ((cbp = TAILQ_FIRST(&queue)) != NULL) { struct g_consumer *cp; TAILQ_REMOVE(&queue, cbp, bio_queue); cp = cbp->bio_caller2; cbp->bio_caller2 = NULL; cbp->bio_to = cp->provider; G_STRIPE_LOGREQ(cbp, "Sending request."); g_io_request(cbp, cp); } return (0); failure: while ((cbp = TAILQ_FIRST(&queue)) != NULL) { TAILQ_REMOVE(&queue, cbp, bio_queue); bp->bio_children--; g_destroy_bio(cbp); } return (error); } static void g_stripe_flush(struct g_stripe_softc *sc, struct bio *bp) { struct bio_queue_head queue; struct g_consumer *cp; struct bio *cbp; u_int no; bioq_init(&queue); for (no = 0; no < sc->sc_ndisks; no++) { cbp = g_clone_bio(bp); if (cbp == NULL) { for (cbp = bioq_first(&queue); cbp != NULL; cbp = bioq_first(&queue)) { bioq_remove(&queue, cbp); g_destroy_bio(cbp); } if (bp->bio_error == 0) bp->bio_error = ENOMEM; g_io_deliver(bp, bp->bio_error); return; } bioq_insert_tail(&queue, cbp); cbp->bio_done = g_stripe_done; cbp->bio_caller2 = sc->sc_disks[no]; cbp->bio_to = sc->sc_disks[no]->provider; } for (cbp = bioq_first(&queue); cbp != NULL; cbp = bioq_first(&queue)) { bioq_remove(&queue, cbp); G_STRIPE_LOGREQ(cbp, "Sending request."); cp = cbp->bio_caller2; cbp->bio_caller2 = NULL; g_io_request(cbp, cp); } } static void g_stripe_start(struct bio *bp) { off_t offset, start, length, nstripe; struct g_stripe_softc *sc; u_int no, stripesize; int error, fast = 0; sc = bp->bio_to->geom->softc; /* * If sc == NULL, provider's error should be set and g_stripe_start() * should not be called at all. */ KASSERT(sc != NULL, ("Provider's error should be set (error=%d)(device=%s).", bp->bio_to->error, bp->bio_to->name)); G_STRIPE_LOGREQ(bp, "Request received."); switch (bp->bio_cmd) { case BIO_READ: case BIO_WRITE: case BIO_DELETE: break; case BIO_FLUSH: g_stripe_flush(sc, bp); return; case BIO_GETATTR: /* To which provider should it be delivered? */ default: g_io_deliver(bp, EOPNOTSUPP); return; } stripesize = sc->sc_stripesize; /* * Calculations are quite messy, but fast I hope. */ /* Stripe number. */ /* nstripe = bp->bio_offset / stripesize; */ nstripe = bp->bio_offset >> (off_t)sc->sc_stripebits; /* Disk number. */ no = nstripe % sc->sc_ndisks; /* Start position in stripe. */ /* start = bp->bio_offset % stripesize; */ start = bp->bio_offset & (stripesize - 1); /* Start position in disk. */ /* offset = (nstripe / sc->sc_ndisks) * stripesize + start; */ offset = ((nstripe / sc->sc_ndisks) << sc->sc_stripebits) + start; /* Length of data to operate on. */ length = MIN(bp->bio_length, stripesize - start); /* * Do use "fast" mode when: * 1. "Fast" mode is ON. * and * 2. Request size is less than or equal to MAXPHYS, * which should always be true. * and * 3. Request size is at least stripesize * ndisks. If it is not, * there will be no need to send more than one I/O request to * a provider, so there is nothing to optimize. * and * 4. Request is not unmapped. * and * 5. It is not a BIO_DELETE.
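* * A worked example with hypothetical numbers (and the default 128 KB MAXPHYS): take sc_ndisks = 2 and stripesize = 64 KB. A mapped 128 KB write at bio_offset = 128 KB gives nstripe = 2, disk no = 2 % 2 = 0, start = 0 and a per-disk offset of (2 / 2) * 64 KB = 64 KB; since 128 KB >= 64 KB * 2, rule 3 holds and the "fast" path batches the request into one bio per disk, while a 96 KB request would fail rule 3 and fall through to the "economic" path below.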
*/ if (g_stripe_fast && bp->bio_length <= MAXPHYS && bp->bio_length >= stripesize * sc->sc_ndisks && (bp->bio_flags & BIO_UNMAPPED) == 0 && bp->bio_cmd != BIO_DELETE) { fast = 1; } error = 0; if (fast) { error = g_stripe_start_fast(bp, no, offset, length); if (error != 0) g_stripe_fast_failed++; } /* * Do use "economic" when: * 1. "Economic" mode is ON. * or * 2. "Fast" mode failed. It can only fail if there is no memory. */ if (!fast || error != 0) error = g_stripe_start_economic(bp, no, offset, length); if (error != 0) { if (bp->bio_error == 0) bp->bio_error = error; g_io_deliver(bp, bp->bio_error); } } static void g_stripe_check_and_run(struct g_stripe_softc *sc) { struct g_provider *dp; off_t mediasize, ms; u_int no, sectorsize = 0; g_topology_assert(); if (g_stripe_nvalid(sc) != sc->sc_ndisks) return; sc->sc_provider = g_new_providerf(sc->sc_geom, "stripe/%s", sc->sc_name); sc->sc_provider->flags |= G_PF_DIRECT_SEND | G_PF_DIRECT_RECEIVE; if (g_stripe_fast == 0) sc->sc_provider->flags |= G_PF_ACCEPT_UNMAPPED; /* * Find the smallest disk. */ mediasize = sc->sc_disks[0]->provider->mediasize; if (sc->sc_type == G_STRIPE_TYPE_AUTOMATIC) mediasize -= sc->sc_disks[0]->provider->sectorsize; mediasize -= mediasize % sc->sc_stripesize; sectorsize = sc->sc_disks[0]->provider->sectorsize; for (no = 1; no < sc->sc_ndisks; no++) { dp = sc->sc_disks[no]->provider; ms = dp->mediasize; if (sc->sc_type == G_STRIPE_TYPE_AUTOMATIC) ms -= dp->sectorsize; ms -= ms % sc->sc_stripesize; if (ms < mediasize) mediasize = ms; sectorsize = lcm(sectorsize, dp->sectorsize); /* A provider underneath us doesn't support unmapped */ if ((dp->flags & G_PF_ACCEPT_UNMAPPED) == 0) { G_STRIPE_DEBUG(1, "Cancelling unmapped " "because of %s.", dp->name); sc->sc_provider->flags &= ~G_PF_ACCEPT_UNMAPPED; } } sc->sc_provider->sectorsize = sectorsize; sc->sc_provider->mediasize = mediasize * sc->sc_ndisks; sc->sc_provider->stripesize = sc->sc_stripesize; sc->sc_provider->stripeoffset = 0; g_error_provider(sc->sc_provider, 0); G_STRIPE_DEBUG(0, "Device %s activated.", sc->sc_provider->name); } static int g_stripe_read_metadata(struct g_consumer *cp, struct g_stripe_metadata *md) { struct g_provider *pp; u_char *buf; int error; g_topology_assert(); error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) return (error); /* Decode metadata. */ stripe_metadata_decode(buf, md); g_free(buf); return (0); } /* * Add disk to given device. */ static int g_stripe_add_disk(struct g_stripe_softc *sc, struct g_provider *pp, u_int no) { struct g_consumer *cp, *fcp; struct g_geom *gp; int error; g_topology_assert(); /* Metadata corrupted? */ if (no >= sc->sc_ndisks) return (EINVAL); /* Check if disk is not already attached. */ if (sc->sc_disks[no] != NULL) return (EEXIST); gp = sc->sc_geom; fcp = LIST_FIRST(&gp->consumer); cp = g_new_consumer(gp); cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE; cp->private = NULL; cp->index = no; error = g_attach(cp, pp); if (error != 0) { g_destroy_consumer(cp); return (error); } if (fcp != NULL && (fcp->acr > 0 || fcp->acw > 0 || fcp->ace > 0)) { error = g_access(cp, fcp->acr, fcp->acw, fcp->ace); if (error != 0) { g_detach(cp); g_destroy_consumer(cp); return (error); } } if (sc->sc_type == G_STRIPE_TYPE_AUTOMATIC) { struct g_stripe_metadata md; /* Reread metadata. 
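* The provider could have been withdrawn or relabeled since it was first tasted, so verify that the on-disk magic, name and id still match this device before the consumer is wired in.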
*/ error = g_stripe_read_metadata(cp, &md); if (error != 0) goto fail; if (strcmp(md.md_magic, G_STRIPE_MAGIC) != 0 || strcmp(md.md_name, sc->sc_name) != 0 || md.md_id != sc->sc_id) { G_STRIPE_DEBUG(0, "Metadata on %s changed.", pp->name); goto fail; } } sc->sc_disks[no] = cp; G_STRIPE_DEBUG(0, "Disk %s attached to %s.", pp->name, sc->sc_name); g_stripe_check_and_run(sc); return (0); fail: if (fcp != NULL && (fcp->acr > 0 || fcp->acw > 0 || fcp->ace > 0)) g_access(cp, -fcp->acr, -fcp->acw, -fcp->ace); g_detach(cp); g_destroy_consumer(cp); return (error); } static struct g_geom * g_stripe_create(struct g_class *mp, const struct g_stripe_metadata *md, u_int type) { struct g_stripe_softc *sc; struct g_geom *gp; u_int no; g_topology_assert(); G_STRIPE_DEBUG(1, "Creating device %s (id=%u).", md->md_name, md->md_id); /* At least two disks are required. */ if (md->md_all < 2) { G_STRIPE_DEBUG(0, "Too few disks defined for %s.", md->md_name); return (NULL); } #if 0 /* Stripe size has to be greater than or equal to the sector size. */ if (md->md_stripesize < sectorsize) { G_STRIPE_DEBUG(0, "Invalid stripe size for %s.", md->md_name); return (NULL); } #endif /* Stripe size has to be a power of 2. */ if (!powerof2(md->md_stripesize)) { G_STRIPE_DEBUG(0, "Invalid stripe size for %s.", md->md_name); return (NULL); } /* Check for a duplicate unit. */ LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc != NULL && strcmp(sc->sc_name, md->md_name) == 0) { G_STRIPE_DEBUG(0, "Device %s already configured.", sc->sc_name); return (NULL); } } gp = g_new_geomf(mp, "%s", md->md_name); sc = malloc(sizeof(*sc), M_STRIPE, M_WAITOK | M_ZERO); gp->start = g_stripe_start; gp->spoiled = g_stripe_orphan; gp->orphan = g_stripe_orphan; gp->access = g_stripe_access; gp->dumpconf = g_stripe_dumpconf; sc->sc_id = md->md_id; sc->sc_stripesize = md->md_stripesize; sc->sc_stripebits = bitcount32(sc->sc_stripesize - 1); sc->sc_ndisks = md->md_all; sc->sc_disks = malloc(sizeof(struct g_consumer *) * sc->sc_ndisks, M_STRIPE, M_WAITOK | M_ZERO); for (no = 0; no < sc->sc_ndisks; no++) sc->sc_disks[no] = NULL; sc->sc_type = type; mtx_init(&sc->sc_lock, "gstripe lock", NULL, MTX_DEF); gp->softc = sc; sc->sc_geom = gp; sc->sc_provider = NULL; G_STRIPE_DEBUG(0, "Device %s created (id=%u).", sc->sc_name, sc->sc_id); return (gp); } static int g_stripe_destroy(struct g_stripe_softc *sc, boolean_t force) { struct g_provider *pp; struct g_consumer *cp, *cp1; struct g_geom *gp; g_topology_assert(); if (sc == NULL) return (ENXIO); pp = sc->sc_provider; if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { if (force) { G_STRIPE_DEBUG(0, "Device %s is still open, so it " "can't be definitively removed.", pp->name); } else { G_STRIPE_DEBUG(1, "Device %s is still open (r%dw%de%d).", pp->name, pp->acr, pp->acw, pp->ace); return (EBUSY); } } gp = sc->sc_geom; LIST_FOREACH_SAFE(cp, &gp->consumer, consumer, cp1) { g_stripe_remove_disk(cp); if (cp1 == NULL) return (0); /* Recursion happened. */ } if (!LIST_EMPTY(&gp->consumer)) return (EINPROGRESS); gp->softc = NULL; KASSERT(sc->sc_provider == NULL, ("Provider still exists?
(device=%s)", gp->name)); free(sc->sc_disks, M_STRIPE); mtx_destroy(&sc->sc_lock); free(sc, M_STRIPE); G_STRIPE_DEBUG(0, "Device %s destroyed.", gp->name); g_wither_geom(gp, ENXIO); return (0); } static int g_stripe_destroy_geom(struct gctl_req *req __unused, struct g_class *mp __unused, struct g_geom *gp) { struct g_stripe_softc *sc; sc = gp->softc; return (g_stripe_destroy(sc, 0)); } static struct g_geom * g_stripe_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_stripe_metadata md; struct g_stripe_softc *sc; struct g_consumer *cp; struct g_geom *gp; int error; g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); g_topology_assert(); /* Skip providers that are already open for writing. */ if (pp->acw > 0) return (NULL); G_STRIPE_DEBUG(3, "Tasting %s.", pp->name); gp = g_new_geomf(mp, "stripe:taste"); gp->start = g_stripe_start; gp->access = g_stripe_access; gp->orphan = g_stripe_orphan; cp = g_new_consumer(gp); g_attach(cp, pp); error = g_stripe_read_metadata(cp, &md); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); if (error != 0) return (NULL); gp = NULL; if (strcmp(md.md_magic, G_STRIPE_MAGIC) != 0) return (NULL); if (md.md_version > G_STRIPE_VERSION) { printf("geom_stripe.ko module is too old to handle %s.\n", pp->name); return (NULL); } /* * Backward compatibility: */ /* There was no md_provider field in earlier versions of metadata. */ if (md.md_version < 2) bzero(md.md_provider, sizeof(md.md_provider)); /* There was no md_provsize field in earlier versions of metadata. */ if (md.md_version < 3) md.md_provsize = pp->mediasize; if (md.md_provider[0] != '\0' && !g_compare_names(md.md_provider, pp->name)) return (NULL); if (md.md_provsize != pp->mediasize) return (NULL); /* * Let's check if device already exists. 
*/ sc = NULL; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; if (sc->sc_type != G_STRIPE_TYPE_AUTOMATIC) continue; if (strcmp(md.md_name, sc->sc_name) != 0) continue; if (md.md_id != sc->sc_id) continue; break; } if (gp != NULL) { G_STRIPE_DEBUG(1, "Adding disk %s to %s.", pp->name, gp->name); error = g_stripe_add_disk(sc, pp, md.md_no); if (error != 0) { G_STRIPE_DEBUG(0, "Cannot add disk %s to %s (error=%d).", pp->name, gp->name, error); return (NULL); } } else { gp = g_stripe_create(mp, &md, G_STRIPE_TYPE_AUTOMATIC); if (gp == NULL) { G_STRIPE_DEBUG(0, "Cannot create device %s.", md.md_name); return (NULL); } sc = gp->softc; G_STRIPE_DEBUG(1, "Adding disk %s to %s.", pp->name, gp->name); error = g_stripe_add_disk(sc, pp, md.md_no); if (error != 0) { G_STRIPE_DEBUG(0, "Cannot add disk %s to %s (error=%d).", pp->name, gp->name, error); g_stripe_destroy(sc, 1); return (NULL); } } return (gp); } static void g_stripe_ctl_create(struct gctl_req *req, struct g_class *mp) { u_int attached, no; struct g_stripe_metadata md; struct g_provider *pp; struct g_stripe_softc *sc; struct g_geom *gp; struct sbuf *sb; intmax_t *stripesize; const char *name; char param[16]; int *nargs; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } if (*nargs <= 2) { gctl_error(req, "Too few arguments."); return; } strlcpy(md.md_magic, G_STRIPE_MAGIC, sizeof(md.md_magic)); md.md_version = G_STRIPE_VERSION; name = gctl_get_asciiparam(req, "arg0"); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", 0); return; } strlcpy(md.md_name, name, sizeof(md.md_name)); md.md_id = arc4random(); md.md_no = 0; md.md_all = *nargs - 1; stripesize = gctl_get_paraml(req, "stripesize", sizeof(*stripesize)); if (stripesize == NULL) { gctl_error(req, "No '%s' argument.", "stripesize"); return; } md.md_stripesize = *stripesize; bzero(md.md_provider, sizeof(md.md_provider)); /* This field is not important here. 
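* (md_provider only matters to the taste path, which uses it to pin automatic devices to a named provider; manual creation attaches disks from the command-line arguments instead.)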
*/ md.md_provsize = 0; /* Check all providers are valid */ for (no = 1; no < *nargs; no++) { snprintf(param, sizeof(param), "arg%u", no); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", no); return; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); if (pp == NULL) { G_STRIPE_DEBUG(1, "Disk %s is invalid.", name); gctl_error(req, "Disk %s is invalid.", name); return; } } gp = g_stripe_create(mp, &md, G_STRIPE_TYPE_MANUAL); if (gp == NULL) { gctl_error(req, "Can't configure %s.", md.md_name); return; } sc = gp->softc; sb = sbuf_new_auto(); sbuf_printf(sb, "Can't attach disk(s) to %s:", gp->name); for (attached = 0, no = 1; no < *nargs; no++) { snprintf(param, sizeof(param), "arg%u", no); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", no); continue; } if (strncmp(name, "/dev/", strlen("/dev/")) == 0) name += strlen("/dev/"); pp = g_provider_by_name(name); KASSERT(pp != NULL, ("Provider %s disappear?!", name)); if (g_stripe_add_disk(sc, pp, no - 1) != 0) { G_STRIPE_DEBUG(1, "Disk %u (%s) not attached to %s.", no, pp->name, gp->name); sbuf_printf(sb, " %s", pp->name); continue; } attached++; } sbuf_finish(sb); if (md.md_all != attached) { g_stripe_destroy(gp->softc, 1); gctl_error(req, "%s", sbuf_data(sb)); } sbuf_delete(sb); } static struct g_stripe_softc * g_stripe_find_device(struct g_class *mp, const char *name) { struct g_stripe_softc *sc; struct g_geom *gp; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; if (strcmp(sc->sc_name, name) == 0) return (sc); } return (NULL); } static void g_stripe_ctl_destroy(struct gctl_req *req, struct g_class *mp) { struct g_stripe_softc *sc; int *force, *nargs, error; const char *name; char param[16]; u_int i; g_topology_assert(); nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "No '%s' argument.", "nargs"); return; } if (*nargs <= 0) { gctl_error(req, "Missing device(s)."); return; } force = gctl_get_paraml(req, "force", sizeof(*force)); if (force == NULL) { gctl_error(req, "No '%s' argument.", "force"); return; } for (i = 0; i < (u_int)*nargs; i++) { snprintf(param, sizeof(param), "arg%u", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%u' argument.", i); return; } sc = g_stripe_find_device(mp, name); if (sc == NULL) { gctl_error(req, "No such device: %s.", name); return; } error = g_stripe_destroy(sc, *force); if (error != 0) { gctl_error(req, "Cannot destroy device %s (error=%d).", sc->sc_name, error); return; } } } static void g_stripe_config(struct gctl_req *req, struct g_class *mp, const char *verb) { uint32_t *version; g_topology_assert(); version = gctl_get_paraml(req, "version", sizeof(*version)); if (version == NULL) { gctl_error(req, "No '%s' argument.", "version"); return; } if (*version != G_STRIPE_VERSION) { gctl_error(req, "Userland and kernel parts are out of sync."); return; } if (strcmp(verb, "create") == 0) { g_stripe_ctl_create(req, mp); return; } else if (strcmp(verb, "destroy") == 0 || strcmp(verb, "stop") == 0) { g_stripe_ctl_destroy(req, mp); return; } gctl_error(req, "Unknown verb."); } static void g_stripe_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_stripe_softc *sc; sc = gp->softc; if (sc == NULL) return; if (pp != NULL) { /* Nothing here. 
*/ } else if (cp != NULL) { sbuf_printf(sb, "%s<Number>%u</Number>\n", indent, (u_int)cp->index); } else { sbuf_printf(sb, "%s<ID>%u</ID>\n", indent, (u_int)sc->sc_id); sbuf_printf(sb, "%s<Stripesize>%u</Stripesize>\n", indent, (u_int)sc->sc_stripesize); sbuf_printf(sb, "%s<Type>", indent); switch (sc->sc_type) { case G_STRIPE_TYPE_AUTOMATIC: sbuf_printf(sb, "AUTOMATIC"); break; case G_STRIPE_TYPE_MANUAL: sbuf_printf(sb, "MANUAL"); break; default: sbuf_printf(sb, "UNKNOWN"); break; } sbuf_printf(sb, "</Type>\n"); sbuf_printf(sb, "%s<Status>Total=%u, Online=%u</Status>\n", indent, sc->sc_ndisks, g_stripe_nvalid(sc)); sbuf_printf(sb, "%s<State>", indent); if (sc->sc_provider != NULL && sc->sc_provider->error == 0) sbuf_printf(sb, "UP"); else sbuf_printf(sb, "DOWN"); sbuf_printf(sb, "</State>\n"); } } DECLARE_GEOM_CLASS(g_stripe_class, g_stripe); +MODULE_VERSION(geom_stripe, 0); Index: user/markj/netdump/sys/geom/uzip/g_uzip.c =================================================================== --- user/markj/netdump/sys/geom/uzip/g_uzip.c (revision 332407) +++ user/markj/netdump/sys/geom/uzip/g_uzip.c (revision 332408) @@ -1,924 +1,925 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004 Max Khon <fjoe@FreeBSD.org> * Copyright (c) 2014 Juniper Networks, Inc. * Copyright (c) 2006-2016 Maxim Sobolev <sobomax@FreeBSD.org> * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "opt_geom.h" MALLOC_DEFINE(M_GEOM_UZIP, "geom_uzip", "GEOM UZIP data structures"); FEATURE(geom_uzip, "GEOM read-only compressed disks support"); struct g_uzip_blk { uint64_t offset; uint32_t blen; unsigned char last:1; unsigned char padded:1; #define BLEN_UNDEF UINT32_MAX }; #ifndef ABS #define ABS(a) ((a) < 0 ?
-(a) : (a)) #endif #define BLK_IN_RANGE(mcn, bcn, ilen) \ (((bcn) != BLEN_UNDEF) && ( \ ((ilen) >= 0 && (mcn >= bcn) && (mcn <= ((intmax_t)(bcn) + (ilen)))) || \ ((ilen) < 0 && (mcn <= bcn) && (mcn >= ((intmax_t)(bcn) + (ilen)))) \ )) #ifdef GEOM_UZIP_DEBUG # define GEOM_UZIP_DBG_DEFAULT 3 #else # define GEOM_UZIP_DBG_DEFAULT 0 #endif #define GUZ_DBG_ERR 1 #define GUZ_DBG_INFO 2 #define GUZ_DBG_IO 3 #define GUZ_DBG_TOC 4 #define GUZ_DEV_SUFX ".uzip" #define GUZ_DEV_NAME(p) (p GUZ_DEV_SUFX) static char g_uzip_attach_to[MAXPATHLEN] = {"*"}; static char g_uzip_noattach_to[MAXPATHLEN] = {GUZ_DEV_NAME("*")}; TUNABLE_STR("kern.geom.uzip.attach_to", g_uzip_attach_to, sizeof(g_uzip_attach_to)); TUNABLE_STR("kern.geom.uzip.noattach_to", g_uzip_noattach_to, sizeof(g_uzip_noattach_to)); SYSCTL_DECL(_kern_geom); SYSCTL_NODE(_kern_geom, OID_AUTO, uzip, CTLFLAG_RW, 0, "GEOM_UZIP stuff"); static u_int g_uzip_debug = GEOM_UZIP_DBG_DEFAULT; SYSCTL_UINT(_kern_geom_uzip, OID_AUTO, debug, CTLFLAG_RWTUN, &g_uzip_debug, 0, "Debug level (0-4)"); static u_int g_uzip_debug_block = BLEN_UNDEF; SYSCTL_UINT(_kern_geom_uzip, OID_AUTO, debug_block, CTLFLAG_RWTUN, &g_uzip_debug_block, 0, "Debug operations around specific cluster#"); #define DPRINTF(lvl, a) \ if ((lvl) <= g_uzip_debug) { \ printf a; \ } #define DPRINTF_BLK(lvl, cn, a) \ if ((lvl) <= g_uzip_debug || \ BLK_IN_RANGE(cn, g_uzip_debug_block, 8) || \ BLK_IN_RANGE(cn, g_uzip_debug_block, -8)) { \ printf a; \ } #define DPRINTF_BRNG(lvl, bcn, ecn, a) \ KASSERT(bcn < ecn, ("DPRINTF_BRNG: invalid range (%ju, %ju)", \ (uintmax_t)bcn, (uintmax_t)ecn)); \ if (((lvl) <= g_uzip_debug) || \ BLK_IN_RANGE(g_uzip_debug_block, bcn, \ (intmax_t)ecn - (intmax_t)bcn)) { \ printf a; \ } #define UZIP_CLASS_NAME "UZIP" /* * Maximum allowed valid block size (to prevent foot-shooting) */ #define MAX_BLKSZ (MAXPHYS) static char CLOOP_MAGIC_START[] = "#!/bin/sh\n"; static void g_uzip_read_done(struct bio *bp); static void g_uzip_do(struct g_uzip_softc *, struct bio *bp); static void g_uzip_softc_free(struct g_uzip_softc *sc, struct g_geom *gp) { if (gp != NULL) { DPRINTF(GUZ_DBG_INFO, ("%s: %d requests, %d cached\n", gp->name, sc->req_total, sc->req_cached)); } mtx_lock(&sc->queue_mtx); sc->wrkthr_flags |= GUZ_SHUTDOWN; wakeup(sc); while (!(sc->wrkthr_flags & GUZ_EXITING)) { msleep(sc->procp, &sc->queue_mtx, PRIBIO, "guzfree", hz / 10); } mtx_unlock(&sc->queue_mtx); sc->dcp->free(sc->dcp); free(sc->toc, M_GEOM_UZIP); mtx_destroy(&sc->queue_mtx); mtx_destroy(&sc->last_mtx); free(sc->last_buf, M_GEOM_UZIP); free(sc, M_GEOM_UZIP); } static int g_uzip_cached(struct g_geom *gp, struct bio *bp) { struct g_uzip_softc *sc; off_t ofs; size_t blk, blkofs, usz; sc = gp->softc; ofs = bp->bio_offset + bp->bio_completed; blk = ofs / sc->blksz; mtx_lock(&sc->last_mtx); if (blk == sc->last_blk) { blkofs = ofs % sc->blksz; usz = sc->blksz - blkofs; if (bp->bio_resid < usz) usz = bp->bio_resid; memcpy(bp->bio_data + bp->bio_completed, sc->last_buf + blkofs, usz); sc->req_cached++; mtx_unlock(&sc->last_mtx); DPRINTF(GUZ_DBG_IO, ("%s/%s: %p: offset=%jd: got %jd bytes " "from cache\n", __func__, gp->name, bp, (intmax_t)ofs, (intmax_t)usz)); bp->bio_completed += usz; bp->bio_resid -= usz; if (bp->bio_resid == 0) { g_io_deliver(bp, 0); return (1); } } else mtx_unlock(&sc->last_mtx); return (0); } #define BLK_ENDS(sc, bi) ((sc)->toc[(bi)].offset + \ (sc)->toc[(bi)].blen) #define BLK_IS_CONT(sc, bi) (BLK_ENDS((sc), (bi) - 1) == \ (sc)->toc[(bi)].offset) #define BLK_IS_NIL(sc, bi) ((sc)->toc[(bi)].blen == 0) 
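/* * An illustrative example of the helpers above, with hypothetical TOC values: if toc[4].offset = 1000 and toc[4].blen = 200, then BLK_ENDS(sc, 4) = 1200, and BLK_IS_CONT(sc, 5) holds only when toc[5].offset == 1200, i.e. cluster 5 is stored immediately after cluster 4; BLK_IS_NIL() flags all-zero clusters, which are stored with no payload at all. */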
#define TOFF_2_BOFF(sc, pp, bi) ((sc)->toc[(bi)].offset - \ (sc)->toc[(bi)].offset % (pp)->sectorsize) #define TLEN_2_BLEN(sc, pp, bp, ei) roundup(BLK_ENDS((sc), (ei)) - \ (bp)->bio_offset, (pp)->sectorsize) static int g_uzip_request(struct g_geom *gp, struct bio *bp) { struct g_uzip_softc *sc; struct bio *bp2; struct g_consumer *cp; struct g_provider *pp; off_t ofs, start_blk_ofs; size_t i, start_blk, end_blk, zsize; if (g_uzip_cached(gp, bp) != 0) return (1); sc = gp->softc; cp = LIST_FIRST(&gp->consumer); pp = cp->provider; ofs = bp->bio_offset + bp->bio_completed; start_blk = ofs / sc->blksz; KASSERT(start_blk < sc->nblocks, ("start_blk out of range")); end_blk = howmany(ofs + bp->bio_resid, sc->blksz); KASSERT(end_blk <= sc->nblocks, ("end_blk out of range")); for (; BLK_IS_NIL(sc, start_blk) && start_blk < end_blk; start_blk++) { /* Fill in any leading Nil blocks */ start_blk_ofs = ofs % sc->blksz; zsize = MIN(sc->blksz - start_blk_ofs, bp->bio_resid); DPRINTF_BLK(GUZ_DBG_IO, start_blk, ("%s/%s: %p/%ju: " "filling %ju zero bytes\n", __func__, gp->name, gp, (uintmax_t)bp->bio_completed, (uintmax_t)zsize)); bzero(bp->bio_data + bp->bio_completed, zsize); bp->bio_completed += zsize; bp->bio_resid -= zsize; ofs += zsize; } if (start_blk == end_blk) { KASSERT(bp->bio_resid == 0, ("bp->bio_resid is invalid")); /* * No non-Nil data is left, complete request immediately. */ DPRINTF(GUZ_DBG_IO, ("%s/%s: %p: all done returning %ju " "bytes\n", __func__, gp->name, gp, (uintmax_t)bp->bio_completed)); g_io_deliver(bp, 0); return (1); } for (i = start_blk + 1; i < end_blk; i++) { /* Trim discontinuous areas if any */ if (!BLK_IS_CONT(sc, i)) { end_blk = i; break; } } DPRINTF_BRNG(GUZ_DBG_IO, start_blk, end_blk, ("%s/%s: %p: " "start=%u (%ju[%jd]), end=%u (%ju)\n", __func__, gp->name, bp, (u_int)start_blk, (uintmax_t)sc->toc[start_blk].offset, (intmax_t)sc->toc[start_blk].blen, (u_int)end_blk, (uintmax_t)BLK_ENDS(sc, end_blk - 1))); bp2 = g_clone_bio(bp); if (bp2 == NULL) { g_io_deliver(bp, ENOMEM); return (1); } bp2->bio_done = g_uzip_read_done; bp2->bio_offset = TOFF_2_BOFF(sc, pp, start_blk); while (1) { bp2->bio_length = TLEN_2_BLEN(sc, pp, bp2, end_blk - 1); if (bp2->bio_length <= MAXPHYS) { break; } if (end_blk == (start_blk + 1)) { break; } end_blk--; } DPRINTF(GUZ_DBG_IO, ("%s/%s: bp2->bio_length = %jd, " "bp2->bio_offset = %jd\n", __func__, gp->name, (intmax_t)bp2->bio_length, (intmax_t)bp2->bio_offset)); bp2->bio_data = malloc(bp2->bio_length, M_GEOM_UZIP, M_NOWAIT); if (bp2->bio_data == NULL) { g_destroy_bio(bp2); g_io_deliver(bp, ENOMEM); return (1); } DPRINTF_BRNG(GUZ_DBG_IO, start_blk, end_blk, ("%s/%s: %p: " "reading %jd bytes from offset %jd\n", __func__, gp->name, bp, (intmax_t)bp2->bio_length, (intmax_t)bp2->bio_offset)); g_io_request(bp2, cp); return (0); } static void g_uzip_read_done(struct bio *bp) { struct bio *bp2; struct g_geom *gp; struct g_uzip_softc *sc; bp2 = bp->bio_parent; gp = bp2->bio_to->geom; sc = gp->softc; mtx_lock(&sc->queue_mtx); bioq_disksort(&sc->bio_queue, bp); mtx_unlock(&sc->queue_mtx); wakeup(sc); } static int g_uzip_memvcmp(const void *memory, unsigned char val, size_t size) { const u_char *mm; mm = (const u_char *)memory; return (*mm == val) && memcmp(mm, mm + 1, size - 1) == 0; } static void g_uzip_do(struct g_uzip_softc *sc, struct bio *bp) { struct bio *bp2; struct g_provider *pp; struct g_consumer *cp; struct g_geom *gp; char *data, *data2; off_t ofs; size_t blk, blkofs, len, ulen, firstblk; int err; bp2 = bp->bio_parent; gp = bp2->bio_to->geom; cp 
= LIST_FIRST(&gp->consumer); pp = cp->provider; bp2->bio_error = bp->bio_error; if (bp2->bio_error != 0) goto done; /* Make sure there's forward progress. */ if (bp->bio_completed == 0) { bp2->bio_error = ECANCELED; goto done; } ofs = bp2->bio_offset + bp2->bio_completed; firstblk = blk = ofs / sc->blksz; blkofs = ofs % sc->blksz; data = bp->bio_data + sc->toc[blk].offset % pp->sectorsize; data2 = bp2->bio_data + bp2->bio_completed; while (bp->bio_completed && bp2->bio_resid) { if (blk > firstblk && !BLK_IS_CONT(sc, blk)) { DPRINTF_BLK(GUZ_DBG_IO, blk, ("%s/%s: %p: backref'ed " "cluster #%u requested, looping around\n", __func__, gp->name, bp2, (u_int)blk)); goto done; } ulen = MIN(sc->blksz - blkofs, bp2->bio_resid); len = sc->toc[blk].blen; DPRINTF(GUZ_DBG_IO, ("%s/%s: %p/%ju: data2=%p, ulen=%u, " "data=%p, len=%u\n", __func__, gp->name, gp, bp->bio_completed, data2, (u_int)ulen, data, (u_int)len)); if (len == 0) { /* All zero block: no cache update */ zero_block: bzero(data2, ulen); } else if (len <= bp->bio_completed) { mtx_lock(&sc->last_mtx); err = sc->dcp->decompress(sc->dcp, gp->name, data, len, sc->last_buf); if (err != 0 && sc->toc[blk].last != 0) { /* * Last block decompression has failed, check * if it's just zero padding. */ if (g_uzip_memvcmp(data, '\0', len) == 0) { sc->toc[blk].blen = 0; sc->last_blk = -1; mtx_unlock(&sc->last_mtx); len = 0; goto zero_block; } } if (err != 0) { sc->last_blk = -1; mtx_unlock(&sc->last_mtx); bp2->bio_error = EILSEQ; DPRINTF(GUZ_DBG_ERR, ("%s/%s: decompress" "(%p, %ju, %ju) failed\n", __func__, gp->name, sc->dcp, (uintmax_t)blk, (uintmax_t)len)); goto done; } sc->last_blk = blk; memcpy(data2, sc->last_buf + blkofs, ulen); mtx_unlock(&sc->last_mtx); err = sc->dcp->rewind(sc->dcp, gp->name); if (err != 0) { bp2->bio_error = EILSEQ; DPRINTF(GUZ_DBG_ERR, ("%s/%s: rewind(%p) " "failed\n", __func__, gp->name, sc->dcp)); goto done; } data += len; } else break; data2 += ulen; bp2->bio_completed += ulen; bp2->bio_resid -= ulen; bp->bio_completed -= len; blkofs = 0; blk++; } done: /* Finish processing the request. 
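* Either the parent bio is completed here, or, if this read did not cover all of the remaining clusters, another pass of g_uzip_request() is issued for what is left.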
*/ free(bp->bio_data, M_GEOM_UZIP); g_destroy_bio(bp); if (bp2->bio_error != 0 || bp2->bio_resid == 0) g_io_deliver(bp2, bp2->bio_error); else g_uzip_request(gp, bp2); } static void g_uzip_start(struct bio *bp) { struct g_provider *pp; struct g_geom *gp; struct g_uzip_softc *sc; pp = bp->bio_to; gp = pp->geom; DPRINTF(GUZ_DBG_IO, ("%s/%s: %p: cmd=%d, offset=%jd, length=%jd, " "buffer=%p\n", __func__, gp->name, bp, bp->bio_cmd, (intmax_t)bp->bio_offset, (intmax_t)bp->bio_length, bp->bio_data)); sc = gp->softc; sc->req_total++; if (bp->bio_cmd == BIO_GETATTR) { struct bio *bp2; struct g_consumer *cp; struct g_geom *gp; struct g_provider *pp; /* pass on MNT:* requests and ignore others */ if (strncmp(bp->bio_attribute, "MNT:", 4) == 0) { bp2 = g_clone_bio(bp); if (bp2 == NULL) { g_io_deliver(bp, ENOMEM); return; } bp2->bio_done = g_std_done; pp = bp->bio_to; gp = pp->geom; cp = LIST_FIRST(&gp->consumer); g_io_request(bp2, cp); return; } } if (bp->bio_cmd != BIO_READ) { g_io_deliver(bp, EOPNOTSUPP); return; } bp->bio_resid = bp->bio_length; bp->bio_completed = 0; g_uzip_request(gp, bp); } static void g_uzip_orphan(struct g_consumer *cp) { struct g_geom *gp; g_trace(G_T_TOPOLOGY, "%s(%p/%s)", __func__, cp, cp->provider->name); g_topology_assert(); gp = cp->geom; g_uzip_softc_free(gp->softc, gp); gp->softc = NULL; g_wither_geom(gp, ENXIO); } static int g_uzip_access(struct g_provider *pp, int dr, int dw, int de) { struct g_geom *gp; struct g_consumer *cp; gp = pp->geom; cp = LIST_FIRST(&gp->consumer); KASSERT (cp != NULL, ("g_uzip_access but no consumer")); if (cp->acw + dw > 0) return (EROFS); return (g_access(cp, dr, dw, de)); } static void g_uzip_spoiled(struct g_consumer *cp) { struct g_geom *gp; G_VALID_CONSUMER(cp); gp = cp->geom; g_trace(G_T_TOPOLOGY, "%s(%p/%s)", __func__, cp, gp->name); g_topology_assert(); g_uzip_softc_free(gp->softc, gp); gp->softc = NULL; g_wither_geom(gp, ENXIO); } static int g_uzip_parse_toc(struct g_uzip_softc *sc, struct g_provider *pp, struct g_geom *gp) { uint32_t i, j, backref_to; uint64_t max_offset, min_offset; struct g_uzip_blk *last_blk; min_offset = sizeof(struct cloop_header) + (sc->nblocks + 1) * sizeof(uint64_t); max_offset = sc->toc[0].offset - 1; last_blk = &sc->toc[0]; for (i = 0; i < sc->nblocks; i++) { /* First do some bounds checking */ if ((sc->toc[i].offset < min_offset) || (sc->toc[i].offset > pp->mediasize)) { goto error_offset; } DPRINTF_BLK(GUZ_DBG_IO, i, ("%s: cluster #%u " "offset=%ju max_offset=%ju\n", gp->name, (u_int)i, (uintmax_t)sc->toc[i].offset, (uintmax_t)max_offset)); backref_to = BLEN_UNDEF; if (sc->toc[i].offset < max_offset) { /* * For the backref'ed blocks search already parsed * TOC entries for the matching offset and copy the * size from matched entry. */ for (j = 0; j <= i; j++) { if (sc->toc[j].offset == sc->toc[i].offset && !BLK_IS_NIL(sc, j)) { break; } if (j != i) { continue; } DPRINTF(GUZ_DBG_ERR, ("%s: cannot match " "backref'ed offset at cluster #%u\n", gp->name, i)); return (-1); } sc->toc[i].blen = sc->toc[j].blen; backref_to = j; } else { last_blk = &sc->toc[i]; /* * For the "normal blocks" seek forward until we hit * block whose offset is larger than ours and assume * it's going to be the next one. 
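* * For illustration, with hypothetical offsets { 1024, 1400, 1024, 1900 }: cluster 0 is normal and gets blen = 1400 - 1024 = 376, cluster 1 gets 1900 - 1400 = 500, and cluster 2, whose offset of 1024 lies below max_offset, is a backref that simply copies cluster 0's blen of 376.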
*/ for (j = i + 1; j < sc->nblocks; j++) { if (sc->toc[j].offset > max_offset) { break; } } sc->toc[i].blen = sc->toc[j].offset - sc->toc[i].offset; if (BLK_ENDS(sc, i) > pp->mediasize) { DPRINTF(GUZ_DBG_ERR, ("%s: cluster #%u " "extends past media boundary (%ju > %ju)\n", gp->name, (u_int)i, (uintmax_t)BLK_ENDS(sc, i), (intmax_t)pp->mediasize)); return (-1); } KASSERT(max_offset <= sc->toc[i].offset, ( "%s: max_offset is incorrect: %ju", gp->name, (uintmax_t)max_offset)); max_offset = BLK_ENDS(sc, i) - 1; } DPRINTF_BLK(GUZ_DBG_TOC, i, ("%s: cluster #%u, original %u " "bytes, in %u bytes", gp->name, i, sc->blksz, sc->toc[i].blen)); if (backref_to != BLEN_UNDEF) { DPRINTF_BLK(GUZ_DBG_TOC, i, (" (->#%u)", (u_int)backref_to)); } DPRINTF_BLK(GUZ_DBG_TOC, i, ("\n")); } last_blk->last = 1; /* Do a second pass to validate block lengths */ for (i = 0; i < sc->nblocks; i++) { if (sc->toc[i].blen > sc->dcp->max_blen) { if (sc->toc[i].last == 0) { DPRINTF(GUZ_DBG_ERR, ("%s: cluster #%u " "length (%ju) exceeds " "max_blen (%ju)\n", gp->name, i, (uintmax_t)sc->toc[i].blen, (uintmax_t)sc->dcp->max_blen)); return (-1); } DPRINTF(GUZ_DBG_INFO, ("%s: cluster #%u extra " "padding is detected, trimmed to %ju\n", gp->name, i, (uintmax_t)sc->dcp->max_blen)); sc->toc[i].blen = sc->dcp->max_blen; sc->toc[i].padded = 1; } } return (0); error_offset: DPRINTF(GUZ_DBG_ERR, ("%s: cluster #%u: invalid offset %ju, " "min_offset=%ju mediasize=%jd\n", gp->name, (u_int)i, sc->toc[i].offset, min_offset, pp->mediasize)); return (-1); } static struct g_geom * g_uzip_taste(struct g_class *mp, struct g_provider *pp, int flags) { int error; uint32_t i, total_offsets, offsets_read, blk; void *buf; struct cloop_header *header; struct g_consumer *cp; struct g_geom *gp; struct g_provider *pp2; struct g_uzip_softc *sc; enum { G_UZIP = 1, G_ULZMA } type; g_trace(G_T_TOPOLOGY, "%s(%s,%s)", __func__, mp->name, pp->name); g_topology_assert(); /* Skip providers that are already open for writing. */ if (pp->acw > 0) return (NULL); if ((fnmatch(g_uzip_attach_to, pp->name, 0) != 0) || (fnmatch(g_uzip_noattach_to, pp->name, 0) == 0)) { DPRINTF(GUZ_DBG_INFO, ("%s(%s,%s), ignoring\n", __func__, mp->name, pp->name)); return (NULL); } buf = NULL; /* * Create geom instance. */ gp = g_new_geomf(mp, GUZ_DEV_NAME("%s"), pp->name); cp = g_new_consumer(gp); error = g_attach(cp, pp); if (error == 0) error = g_access(cp, 1, 0, 0); if (error) { goto e1; } g_topology_unlock(); /* * Read cloop header, look for CLOOP magic, perform * other validity checks. 
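* The layout expected below is a header sector beginning with the shell-script magic, followed by nblocks + 1 big-endian 64-bit cluster offsets, followed by the compressed clusters themselves; the one extra offset lets each cluster's length be computed as the distance to the next offset.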
*/ DPRINTF(GUZ_DBG_INFO, ("%s: media sectorsize %u, mediasize %jd\n", gp->name, pp->sectorsize, (intmax_t)pp->mediasize)); buf = g_read_data(cp, 0, pp->sectorsize, NULL); if (buf == NULL) goto e2; header = (struct cloop_header *) buf; if (strncmp(header->magic, CLOOP_MAGIC_START, sizeof(CLOOP_MAGIC_START) - 1) != 0) { DPRINTF(GUZ_DBG_ERR, ("%s: no CLOOP magic\n", gp->name)); goto e3; } switch (header->magic[CLOOP_OFS_COMPR]) { case CLOOP_COMP_LZMA: case CLOOP_COMP_LZMA_DDP: type = G_ULZMA; if (header->magic[CLOOP_OFS_VERSN] < CLOOP_MINVER_LZMA) { DPRINTF(GUZ_DBG_ERR, ("%s: image version too old\n", gp->name)); goto e3; } DPRINTF(GUZ_DBG_INFO, ("%s: GEOM_UZIP_LZMA image found\n", gp->name)); break; case CLOOP_COMP_LIBZ: case CLOOP_COMP_LIBZ_DDP: type = G_UZIP; if (header->magic[CLOOP_OFS_VERSN] < CLOOP_MINVER_ZLIB) { DPRINTF(GUZ_DBG_ERR, ("%s: image version too old\n", gp->name)); goto e3; } DPRINTF(GUZ_DBG_INFO, ("%s: GEOM_UZIP_ZLIB image found\n", gp->name)); break; default: DPRINTF(GUZ_DBG_ERR, ("%s: unsupported image type\n", gp->name)); goto e3; } /* * Initialize softc and read offsets. */ sc = malloc(sizeof(*sc), M_GEOM_UZIP, M_WAITOK | M_ZERO); gp->softc = sc; sc->blksz = ntohl(header->blksz); sc->nblocks = ntohl(header->nblocks); if (sc->blksz % 512 != 0) { printf("%s: block size (%u) should be multiple of 512.\n", gp->name, sc->blksz); goto e4; } if (sc->blksz > MAX_BLKSZ) { printf("%s: block size (%u) should not be larger than %d.\n", gp->name, sc->blksz, MAX_BLKSZ); } total_offsets = sc->nblocks + 1; if (sizeof(struct cloop_header) + total_offsets * sizeof(uint64_t) > pp->mediasize) { printf("%s: media too small for %u blocks\n", gp->name, sc->nblocks); goto e4; } sc->toc = malloc(total_offsets * sizeof(struct g_uzip_blk), M_GEOM_UZIP, M_WAITOK | M_ZERO); offsets_read = MIN(total_offsets, (pp->sectorsize - sizeof(*header)) / sizeof(uint64_t)); for (i = 0; i < offsets_read; i++) { sc->toc[i].offset = be64toh(((uint64_t *) (header + 1))[i]); sc->toc[i].blen = BLEN_UNDEF; } DPRINTF(GUZ_DBG_INFO, ("%s: %u offsets in the first sector\n", gp->name, offsets_read)); for (blk = 1; offsets_read < total_offsets; blk++) { uint32_t nread; free(buf, M_GEOM); buf = g_read_data( cp, blk * pp->sectorsize, pp->sectorsize, NULL); if (buf == NULL) goto e5; nread = MIN(total_offsets - offsets_read, pp->sectorsize / sizeof(uint64_t)); DPRINTF(GUZ_DBG_TOC, ("%s: %u offsets read from sector %d\n", gp->name, nread, blk)); for (i = 0; i < nread; i++) { sc->toc[offsets_read + i].offset = be64toh(((uint64_t *) buf)[i]); sc->toc[offsets_read + i].blen = BLEN_UNDEF; } offsets_read += nread; } free(buf, M_GEOM); buf = NULL; offsets_read -= 1; DPRINTF(GUZ_DBG_INFO, ("%s: done reading %u block offsets from %u " "sectors\n", gp->name, offsets_read, blk)); if (sc->nblocks != offsets_read) { DPRINTF(GUZ_DBG_ERR, ("%s: read %s offsets than expected " "blocks\n", gp->name, sc->nblocks < offsets_read ? "more" : "less")); goto e5; } if (type == G_UZIP) { sc->dcp = g_uzip_zlib_ctor(sc->blksz); } else { sc->dcp = g_uzip_lzma_ctor(sc->blksz); } if (sc->dcp == NULL) { goto e5; } /* * "Fake" last+1 block, to make it easier for the TOC parser to * iterate without making the last element a special case. 
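* Setting toc[nblocks].offset to the media size means the final real cluster's length falls out of the same next-offset subtraction as every other cluster.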
*/ sc->toc[sc->nblocks].offset = pp->mediasize; /* Massage TOC (table of contents), make sure it is sound */ if (g_uzip_parse_toc(sc, pp, gp) != 0) { DPRINTF(GUZ_DBG_ERR, ("%s: TOC error\n", gp->name)); goto e6; } mtx_init(&sc->last_mtx, "geom_uzip cache", NULL, MTX_DEF); mtx_init(&sc->queue_mtx, "geom_uzip wrkthread", NULL, MTX_DEF); bioq_init(&sc->bio_queue); sc->last_blk = -1; sc->last_buf = malloc(sc->blksz, M_GEOM_UZIP, M_WAITOK); sc->req_total = 0; sc->req_cached = 0; sc->uzip_do = &g_uzip_do; error = kproc_create(g_uzip_wrkthr, sc, &sc->procp, 0, 0, "%s", gp->name); if (error != 0) { goto e7; } g_topology_lock(); pp2 = g_new_providerf(gp, "%s", gp->name); pp2->sectorsize = 512; pp2->mediasize = (off_t)sc->nblocks * sc->blksz; pp2->stripesize = pp->stripesize; pp2->stripeoffset = pp->stripeoffset; g_error_provider(pp2, 0); g_access(cp, -1, 0, 0); DPRINTF(GUZ_DBG_INFO, ("%s: taste ok (%d, %jd), (%d, %d), %x\n", gp->name, pp2->sectorsize, (intmax_t)pp2->mediasize, pp2->stripeoffset, pp2->stripesize, pp2->flags)); DPRINTF(GUZ_DBG_INFO, ("%s: %u x %u blocks\n", gp->name, sc->nblocks, sc->blksz)); return (gp); e7: free(sc->last_buf, M_GEOM); mtx_destroy(&sc->queue_mtx); mtx_destroy(&sc->last_mtx); e6: sc->dcp->free(sc->dcp); e5: free(sc->toc, M_GEOM); e4: free(gp->softc, M_GEOM_UZIP); e3: if (buf != NULL) { free(buf, M_GEOM); } e2: g_topology_lock(); g_access(cp, -1, 0, 0); e1: g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); return (NULL); } static int g_uzip_destroy_geom(struct gctl_req *req, struct g_class *mp, struct g_geom *gp) { struct g_provider *pp; g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, gp->name); g_topology_assert(); if (gp->softc == NULL) { DPRINTF(GUZ_DBG_ERR, ("%s(%s): gp->softc == NULL\n", __func__, gp->name)); return (ENXIO); } KASSERT(gp != NULL, ("NULL geom")); pp = LIST_FIRST(&gp->provider); KASSERT(pp != NULL, ("NULL provider")); if (pp->acr > 0 || pp->acw > 0 || pp->ace > 0) return (EBUSY); g_uzip_softc_free(gp->softc, gp); gp->softc = NULL; g_wither_geom(gp, ENXIO); return (0); } static struct g_class g_uzip_class = { .name = UZIP_CLASS_NAME, .version = G_VERSION, .taste = g_uzip_taste, .destroy_geom = g_uzip_destroy_geom, .start = g_uzip_start, .orphan = g_uzip_orphan, .access = g_uzip_access, .spoiled = g_uzip_spoiled, }; DECLARE_GEOM_CLASS(g_uzip_class, g_uzip); MODULE_DEPEND(g_uzip, zlib, 1, 1, 1); +MODULE_VERSION(geom_uzip, 0); Index: user/markj/netdump/sys/geom/vinum/geom_vinum.c =================================================================== --- user/markj/netdump/sys/geom/vinum/geom_vinum.c (revision 332407) +++ user/markj/netdump/sys/geom/vinum/geom_vinum.c (revision 332408) @@ -1,1050 +1,1051 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004, 2007 Lukas Ertl * Copyright (c) 2007, 2009 Ulf Lilleengen * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, vinum, CTLFLAG_RW, 0, "GEOM_VINUM stuff"); u_int g_vinum_debug = 0; SYSCTL_UINT(_kern_geom_vinum, OID_AUTO, debug, CTLFLAG_RWTUN, &g_vinum_debug, 0, "Debug level"); static int gv_create(struct g_geom *, struct gctl_req *); static void gv_attach(struct gv_softc *, struct gctl_req *); static void gv_detach(struct gv_softc *, struct gctl_req *); static void gv_parityop(struct gv_softc *, struct gctl_req *); static void gv_orphan(struct g_consumer *cp) { struct g_geom *gp; struct gv_softc *sc; struct gv_drive *d; g_topology_assert(); KASSERT(cp != NULL, ("gv_orphan: null cp")); gp = cp->geom; KASSERT(gp != NULL, ("gv_orphan: null gp")); sc = gp->softc; KASSERT(sc != NULL, ("gv_orphan: null sc")); d = cp->private; KASSERT(d != NULL, ("gv_orphan: null d")); g_trace(G_T_TOPOLOGY, "gv_orphan(%s)", gp->name); gv_post_event(sc, GV_EVENT_DRIVE_LOST, d, NULL, 0, 0); } void gv_start(struct bio *bp) { struct g_geom *gp; struct gv_softc *sc; gp = bp->bio_to->geom; sc = gp->softc; switch (bp->bio_cmd) { case BIO_READ: case BIO_WRITE: case BIO_DELETE: break; case BIO_GETATTR: default: g_io_deliver(bp, EOPNOTSUPP); return; } mtx_lock(&sc->bqueue_mtx); bioq_disksort(sc->bqueue_down, bp); wakeup(sc); mtx_unlock(&sc->bqueue_mtx); } void gv_done(struct bio *bp) { struct g_geom *gp; struct gv_softc *sc; KASSERT(bp != NULL, ("NULL bp")); gp = bp->bio_from->geom; sc = gp->softc; mtx_lock(&sc->bqueue_mtx); bioq_disksort(sc->bqueue_up, bp); wakeup(sc); mtx_unlock(&sc->bqueue_mtx); } int gv_access(struct g_provider *pp, int dr, int dw, int de) { struct g_geom *gp; struct gv_softc *sc; struct gv_drive *d, *d2; int error; gp = pp->geom; sc = gp->softc; /* * We want to modify the read count with the write count in case we have * plexes in a RAID-5 organization. 
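* * For example, an open for writing that arrives as g_access(pp, 0, 1, 0) is propagated to every drive as g_access(cp, 1, 1, 0): a RAID-5 write has to read the remaining columns to recompute parity, so each write reference implies a read reference.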
*/ dr += dw; LIST_FOREACH(d, &sc->drives, drive) { if (d->consumer == NULL) continue; error = g_access(d->consumer, dr, dw, de); if (error) { LIST_FOREACH(d2, &sc->drives, drive) { if (d == d2) break; g_access(d2->consumer, -dr, -dw, -de); } G_VINUM_DEBUG(0, "g_access '%s' failed: %d", d->name, error); return (error); } } return (0); } static void gv_init(struct g_class *mp) { struct g_geom *gp; struct gv_softc *sc; g_trace(G_T_TOPOLOGY, "gv_init(%p)", mp); gp = g_new_geomf(mp, "VINUM"); gp->spoiled = gv_orphan; gp->orphan = gv_orphan; gp->access = gv_access; gp->start = gv_start; gp->softc = g_malloc(sizeof(struct gv_softc), M_WAITOK | M_ZERO); sc = gp->softc; sc->geom = gp; sc->bqueue_down = g_malloc(sizeof(struct bio_queue_head), M_WAITOK | M_ZERO); sc->bqueue_up = g_malloc(sizeof(struct bio_queue_head), M_WAITOK | M_ZERO); bioq_init(sc->bqueue_down); bioq_init(sc->bqueue_up); LIST_INIT(&sc->drives); LIST_INIT(&sc->subdisks); LIST_INIT(&sc->plexes); LIST_INIT(&sc->volumes); TAILQ_INIT(&sc->equeue); mtx_init(&sc->config_mtx, "gv_config", NULL, MTX_DEF); mtx_init(&sc->equeue_mtx, "gv_equeue", NULL, MTX_DEF); mtx_init(&sc->bqueue_mtx, "gv_bqueue", NULL, MTX_DEF); kproc_create(gv_worker, sc, &sc->worker, 0, 0, "gv_worker"); } static int gv_unload(struct gctl_req *req, struct g_class *mp, struct g_geom *gp) { struct gv_softc *sc; g_trace(G_T_TOPOLOGY, "gv_unload(%p)", mp); g_topology_assert(); sc = gp->softc; if (sc != NULL) { gv_worker_exit(sc); gp->softc = NULL; g_wither_geom(gp, ENXIO); } return (0); } /* Handle userland request of attaching object. */ static void gv_attach(struct gv_softc *sc, struct gctl_req *req) { struct gv_volume *v; struct gv_plex *p; struct gv_sd *s; off_t *offset; int *rename, type_child, type_parent; char *child, *parent; child = gctl_get_param(req, "child", NULL); if (child == NULL) { gctl_error(req, "no child given"); return; } parent = gctl_get_param(req, "parent", NULL); if (parent == NULL) { gctl_error(req, "no parent given"); return; } offset = gctl_get_paraml(req, "offset", sizeof(*offset)); if (offset == NULL) { gctl_error(req, "no offset given"); return; } rename = gctl_get_paraml(req, "rename", sizeof(*rename)); if (rename == NULL) { gctl_error(req, "no rename flag given"); return; } type_child = gv_object_type(sc, child); type_parent = gv_object_type(sc, parent); switch (type_child) { case GV_TYPE_PLEX: if (type_parent != GV_TYPE_VOL) { gctl_error(req, "no such volume to attach to"); return; } v = gv_find_vol(sc, parent); p = gv_find_plex(sc, child); gv_post_event(sc, GV_EVENT_ATTACH_PLEX, p, v, *offset, *rename); break; case GV_TYPE_SD: if (type_parent != GV_TYPE_PLEX) { gctl_error(req, "no such plex to attach to"); return; } p = gv_find_plex(sc, parent); s = gv_find_sd(sc, child); gv_post_event(sc, GV_EVENT_ATTACH_SD, s, p, *offset, *rename); break; default: gctl_error(req, "invalid child type"); break; } } /* Handle userland request of detaching object. 
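* As with attach, the gctl handler only validates its arguments; the actual work is deferred to the worker thread through gv_post_event().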
*/ static void gv_detach(struct gv_softc *sc, struct gctl_req *req) { struct gv_plex *p; struct gv_sd *s; int *flags, type; char *object; object = gctl_get_param(req, "object", NULL); if (object == NULL) { gctl_error(req, "no argument given"); return; } flags = gctl_get_paraml(req, "flags", sizeof(*flags)); type = gv_object_type(sc, object); switch (type) { case GV_TYPE_PLEX: p = gv_find_plex(sc, object); gv_post_event(sc, GV_EVENT_DETACH_PLEX, p, NULL, *flags, 0); break; case GV_TYPE_SD: s = gv_find_sd(sc, object); gv_post_event(sc, GV_EVENT_DETACH_SD, s, NULL, *flags, 0); break; default: gctl_error(req, "invalid object type"); break; } } /* Handle userland requests for creating new objects. */ static int gv_create(struct g_geom *gp, struct gctl_req *req) { struct gv_softc *sc; struct gv_drive *d, *d2; struct gv_plex *p, *p2; struct gv_sd *s, *s2; struct gv_volume *v, *v2; struct g_provider *pp; int error, i, *drives, *flags, *plexes, *subdisks, *volumes; char buf[20]; g_topology_assert(); sc = gp->softc; /* Find out how many of each object have been passed in. */ volumes = gctl_get_paraml(req, "volumes", sizeof(*volumes)); plexes = gctl_get_paraml(req, "plexes", sizeof(*plexes)); subdisks = gctl_get_paraml(req, "subdisks", sizeof(*subdisks)); drives = gctl_get_paraml(req, "drives", sizeof(*drives)); if (volumes == NULL || plexes == NULL || subdisks == NULL || drives == NULL) { gctl_error(req, "number of objects not given"); return (-1); } flags = gctl_get_paraml(req, "flags", sizeof(*flags)); if (flags == NULL) { gctl_error(req, "flags not given"); return (-1); } /* First, handle drive definitions ... */ for (i = 0; i < *drives; i++) { snprintf(buf, sizeof(buf), "drive%d", i); d2 = gctl_get_paraml(req, buf, sizeof(*d2)); if (d2 == NULL) { gctl_error(req, "no drive definition given"); return (-1); } /* * Make sure that the device specified in the drive config is * an active GEOM provider. */ pp = g_provider_by_name(d2->device); if (pp == NULL) { gctl_error(req, "%s: device not found", d2->device); goto error; } if (gv_find_drive(sc, d2->name) != NULL) { /* Ignore error. */ if (*flags & GV_FLAG_F) continue; gctl_error(req, "drive '%s' already exists", d2->name); goto error; } if (gv_find_drive_device(sc, d2->device) != NULL) { gctl_error(req, "device '%s' already configured in " "gvinum", d2->device); goto error; } d = g_malloc(sizeof(*d), M_WAITOK | M_ZERO); bcopy(d2, d, sizeof(*d)); gv_post_event(sc, GV_EVENT_CREATE_DRIVE, d, NULL, 0, 0); } /* ... then volume definitions ... */ for (i = 0; i < *volumes; i++) { error = 0; snprintf(buf, sizeof(buf), "volume%d", i); v2 = gctl_get_paraml(req, buf, sizeof(*v2)); if (v2 == NULL) { gctl_error(req, "no volume definition given"); return (-1); } if (gv_find_vol(sc, v2->name) != NULL) { /* Ignore error. */ if (*flags & GV_FLAG_F) continue; gctl_error(req, "volume '%s' already exists", v2->name); goto error; } v = g_malloc(sizeof(*v), M_WAITOK | M_ZERO); bcopy(v2, v, sizeof(*v)); gv_post_event(sc, GV_EVENT_CREATE_VOLUME, v, NULL, 0, 0); } /* ... then plex definitions ... */ for (i = 0; i < *plexes; i++) { error = 0; snprintf(buf, sizeof(buf), "plex%d", i); p2 = gctl_get_paraml(req, buf, sizeof(*p2)); if (p2 == NULL) { gctl_error(req, "no plex definition given"); return (-1); } if (gv_find_plex(sc, p2->name) != NULL) { /* Ignore error. 
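* With the force flag (GV_FLAG_F) an already existing object is silently skipped instead of failing the whole request.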
*/ if (*flags & GV_FLAG_F) continue; gctl_error(req, "plex '%s' already exists", p2->name); goto error; } p = g_malloc(sizeof(*p), M_WAITOK | M_ZERO); bcopy(p2, p, sizeof(*p)); gv_post_event(sc, GV_EVENT_CREATE_PLEX, p, NULL, 0, 0); } /* ... and, finally, subdisk definitions. */ for (i = 0; i < *subdisks; i++) { error = 0; snprintf(buf, sizeof(buf), "sd%d", i); s2 = gctl_get_paraml(req, buf, sizeof(*s2)); if (s2 == NULL) { gctl_error(req, "no subdisk definition given"); return (-1); } if (gv_find_sd(sc, s2->name) != NULL) { /* Ignore error. */ if (*flags & GV_FLAG_F) continue; gctl_error(req, "sd '%s' already exists", s2->name); goto error; } s = g_malloc(sizeof(*s), M_WAITOK | M_ZERO); bcopy(s2, s, sizeof(*s)); gv_post_event(sc, GV_EVENT_CREATE_SD, s, NULL, 0, 0); } error: gv_post_event(sc, GV_EVENT_SETUP_OBJECTS, sc, NULL, 0, 0); gv_post_event(sc, GV_EVENT_SAVE_CONFIG, sc, NULL, 0, 0); return (0); } static void gv_config(struct gctl_req *req, struct g_class *mp, char const *verb) { struct g_geom *gp; struct gv_softc *sc; struct sbuf *sb; char *comment; g_topology_assert(); gp = LIST_FIRST(&mp->geom); sc = gp->softc; if (!strcmp(verb, "attach")) { gv_attach(sc, req); } else if (!strcmp(verb, "concat")) { gv_concat(gp, req); } else if (!strcmp(verb, "detach")) { gv_detach(sc, req); } else if (!strcmp(verb, "list")) { gv_list(gp, req); /* Save our configuration back to disk. */ } else if (!strcmp(verb, "saveconfig")) { gv_post_event(sc, GV_EVENT_SAVE_CONFIG, sc, NULL, 0, 0); /* Return configuration in string form. */ } else if (!strcmp(verb, "getconfig")) { comment = gctl_get_param(req, "comment", NULL); if (comment == NULL) { gctl_error(req, "no comment parameter given"); return; } sb = sbuf_new(NULL, NULL, GV_CFG_LEN, SBUF_FIXEDLEN); gv_format_config(sc, sb, 0, comment); sbuf_finish(sb); gctl_set_param(req, "config", sbuf_data(sb), sbuf_len(sb) + 1); sbuf_delete(sb); } else if (!strcmp(verb, "create")) { gv_create(gp, req); } else if (!strcmp(verb, "mirror")) { gv_mirror(gp, req); } else if (!strcmp(verb, "move")) { gv_move(gp, req); } else if (!strcmp(verb, "raid5")) { gv_raid5(gp, req); } else if (!strcmp(verb, "rebuildparity") || !strcmp(verb, "checkparity")) { gv_parityop(sc, req); } else if (!strcmp(verb, "remove")) { gv_remove(gp, req); } else if (!strcmp(verb, "rename")) { gv_rename(gp, req); } else if (!strcmp(verb, "resetconfig")) { gv_post_event(sc, GV_EVENT_RESET_CONFIG, sc, NULL, 0, 0); } else if (!strcmp(verb, "start")) { gv_start_obj(gp, req); } else if (!strcmp(verb, "stripe")) { gv_stripe(gp, req); } else if (!strcmp(verb, "setstate")) { gv_setstate(gp, req); } else gctl_error(req, "Unknown verb parameter"); } static void gv_parityop(struct gv_softc *sc, struct gctl_req *req) { struct gv_plex *p; int *flags, *rebuild, type; char *plex; plex = gctl_get_param(req, "plex", NULL); if (plex == NULL) { gctl_error(req, "no plex given"); return; } flags = gctl_get_paraml(req, "flags", sizeof(*flags)); if (flags == NULL) { gctl_error(req, "no flags given"); return; } rebuild = gctl_get_paraml(req, "rebuild", sizeof(*rebuild)); if (rebuild == NULL) { gctl_error(req, "no operation given"); return; } type = gv_object_type(sc, plex); if (type != GV_TYPE_PLEX) { gctl_error(req, "'%s' is not a plex", plex); return; } p = gv_find_plex(sc, plex); if (p->state != GV_PLEX_UP) { gctl_error(req, "plex %s is not completely accessible", p->name); return; } if (p->org != GV_PLEX_RAID5) { gctl_error(req, "plex %s is not a RAID5 plex", p->name); return; } /* Put it in the event queue. 
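* The check or rebuild itself then runs asynchronously in gv_worker() once the event is dequeued.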
*/ /* XXX: The state of the plex might have changed when this event is * picked up ... We should perhaps check this afterwards. */ if (*rebuild) gv_post_event(sc, GV_EVENT_PARITY_REBUILD, p, NULL, 0, 0); else gv_post_event(sc, GV_EVENT_PARITY_CHECK, p, NULL, 0, 0); } static struct g_geom * gv_taste(struct g_class *mp, struct g_provider *pp, int flags __unused) { struct g_geom *gp; struct g_consumer *cp; struct gv_softc *sc; struct gv_hdr vhdr; int error; g_topology_assert(); g_trace(G_T_TOPOLOGY, "gv_taste(%s, %s)", mp->name, pp->name); gp = LIST_FIRST(&mp->geom); if (gp == NULL) { G_VINUM_DEBUG(0, "error: tasting, but not initialized?"); return (NULL); } sc = gp->softc; cp = g_new_consumer(gp); if (g_attach(cp, pp) != 0) { g_destroy_consumer(cp); return (NULL); } if (g_access(cp, 1, 0, 0) != 0) { g_detach(cp); g_destroy_consumer(cp); return (NULL); } g_topology_unlock(); error = gv_read_header(cp, &vhdr); g_topology_lock(); g_access(cp, -1, 0, 0); g_detach(cp); g_destroy_consumer(cp); /* Check if what we've been given is a valid vinum drive. */ if (!error) gv_post_event(sc, GV_EVENT_DRIVE_TASTED, pp, NULL, 0, 0); return (NULL); } void gv_worker(void *arg) { struct g_provider *pp; struct gv_softc *sc; struct gv_event *ev; struct gv_volume *v; struct gv_plex *p; struct gv_sd *s; struct gv_drive *d; struct bio *bp; int newstate, flags, err, rename; char *newname; off_t offset; sc = arg; KASSERT(sc != NULL, ("NULL sc")); for (;;) { /* Look at the events first... */ ev = gv_get_event(sc); if (ev != NULL) { gv_remove_event(sc, ev); switch (ev->type) { case GV_EVENT_DRIVE_TASTED: G_VINUM_DEBUG(2, "event 'drive tasted'"); pp = ev->arg1; gv_drive_tasted(sc, pp); break; case GV_EVENT_DRIVE_LOST: G_VINUM_DEBUG(2, "event 'drive lost'"); d = ev->arg1; gv_drive_lost(sc, d); break; case GV_EVENT_CREATE_DRIVE: G_VINUM_DEBUG(2, "event 'create drive'"); d = ev->arg1; gv_create_drive(sc, d); break; case GV_EVENT_CREATE_VOLUME: G_VINUM_DEBUG(2, "event 'create volume'"); v = ev->arg1; gv_create_volume(sc, v); break; case GV_EVENT_CREATE_PLEX: G_VINUM_DEBUG(2, "event 'create plex'"); p = ev->arg1; gv_create_plex(sc, p); break; case GV_EVENT_CREATE_SD: G_VINUM_DEBUG(2, "event 'create sd'"); s = ev->arg1; gv_create_sd(sc, s); break; case GV_EVENT_RM_DRIVE: G_VINUM_DEBUG(2, "event 'remove drive'"); d = ev->arg1; flags = ev->arg3; gv_rm_drive(sc, d, flags); /*gv_setup_objects(sc);*/ break; case GV_EVENT_RM_VOLUME: G_VINUM_DEBUG(2, "event 'remove volume'"); v = ev->arg1; gv_rm_vol(sc, v); /*gv_setup_objects(sc);*/ break; case GV_EVENT_RM_PLEX: G_VINUM_DEBUG(2, "event 'remove plex'"); p = ev->arg1; gv_rm_plex(sc, p); /*gv_setup_objects(sc);*/ break; case GV_EVENT_RM_SD: G_VINUM_DEBUG(2, "event 'remove sd'"); s = ev->arg1; gv_rm_sd(sc, s); /*gv_setup_objects(sc);*/ break; case GV_EVENT_SAVE_CONFIG: G_VINUM_DEBUG(2, "event 'save config'"); gv_save_config(sc); break; case GV_EVENT_SET_SD_STATE: G_VINUM_DEBUG(2, "event 'setstate sd'"); s = ev->arg1; newstate = ev->arg3; flags = ev->arg4; err = gv_set_sd_state(s, newstate, flags); if (err) G_VINUM_DEBUG(0, "error setting subdisk" " state: error code %d", err); break; case GV_EVENT_SET_DRIVE_STATE: G_VINUM_DEBUG(2, "event 'setstate drive'"); d = ev->arg1; newstate = ev->arg3; flags = ev->arg4; err = gv_set_drive_state(d, newstate, flags); if (err) G_VINUM_DEBUG(0, "error setting drive " "state: error code %d", err); break; case GV_EVENT_SET_VOL_STATE: G_VINUM_DEBUG(2, "event 'setstate volume'"); v = ev->arg1; newstate = ev->arg3; flags = ev->arg4; err = 
gv_set_vol_state(v, newstate, flags); if (err) G_VINUM_DEBUG(0, "error setting volume " "state: error code %d", err); break; case GV_EVENT_SET_PLEX_STATE: G_VINUM_DEBUG(2, "event 'setstate plex'"); p = ev->arg1; newstate = ev->arg3; flags = ev->arg4; err = gv_set_plex_state(p, newstate, flags); if (err) G_VINUM_DEBUG(0, "error setting plex " "state: error code %d", err); break; case GV_EVENT_SETUP_OBJECTS: G_VINUM_DEBUG(2, "event 'setup objects'"); gv_setup_objects(sc); break; case GV_EVENT_RESET_CONFIG: G_VINUM_DEBUG(2, "event 'resetconfig'"); err = gv_resetconfig(sc); if (err) G_VINUM_DEBUG(0, "error resetting " "config: error code %d", err); break; case GV_EVENT_PARITY_REBUILD: /* * Start the rebuild. The gv_plex_done will * handle issuing of the remaining rebuild bio's * until it's finished. */ G_VINUM_DEBUG(2, "event 'rebuild'"); p = ev->arg1; if (p->state != GV_PLEX_UP) { G_VINUM_DEBUG(0, "plex %s is not " "completely accessible", p->name); break; } if (p->flags & GV_PLEX_SYNCING || p->flags & GV_PLEX_REBUILDING || p->flags & GV_PLEX_GROWING) { G_VINUM_DEBUG(0, "plex %s is busy with " "syncing or parity build", p->name); break; } p->synced = 0; p->flags |= GV_PLEX_REBUILDING; g_topology_assert_not(); g_topology_lock(); err = gv_access(p->vol_sc->provider, 1, 1, 0); if (err) { g_topology_unlock(); G_VINUM_DEBUG(0, "unable to access " "provider"); break; } g_topology_unlock(); gv_parity_request(p, GV_BIO_CHECK | GV_BIO_PARITY, 0); break; case GV_EVENT_PARITY_CHECK: /* Start parity check. */ G_VINUM_DEBUG(2, "event 'check'"); p = ev->arg1; if (p->state != GV_PLEX_UP) { G_VINUM_DEBUG(0, "plex %s is not " "completely accessible", p->name); break; } if (p->flags & GV_PLEX_SYNCING || p->flags & GV_PLEX_REBUILDING || p->flags & GV_PLEX_GROWING) { G_VINUM_DEBUG(0, "plex %s is busy with " "syncing or parity build", p->name); break; } p->synced = 0; g_topology_assert_not(); g_topology_lock(); err = gv_access(p->vol_sc->provider, 1, 1, 0); if (err) { g_topology_unlock(); G_VINUM_DEBUG(0, "unable to access " "provider"); break; } g_topology_unlock(); gv_parity_request(p, GV_BIO_CHECK, 0); break; case GV_EVENT_START_PLEX: G_VINUM_DEBUG(2, "event 'start' plex"); p = ev->arg1; gv_start_plex(p); break; case GV_EVENT_START_VOLUME: G_VINUM_DEBUG(2, "event 'start' volume"); v = ev->arg1; gv_start_vol(v); break; case GV_EVENT_ATTACH_PLEX: G_VINUM_DEBUG(2, "event 'attach' plex"); p = ev->arg1; v = ev->arg2; rename = ev->arg4; err = gv_attach_plex(p, v, rename); if (err) G_VINUM_DEBUG(0, "error attaching %s to" " %s: error code %d", p->name, v->name, err); break; case GV_EVENT_ATTACH_SD: G_VINUM_DEBUG(2, "event 'attach' sd"); s = ev->arg1; p = ev->arg2; offset = ev->arg3; rename = ev->arg4; err = gv_attach_sd(s, p, offset, rename); if (err) G_VINUM_DEBUG(0, "error attaching %s to" " %s: error code %d", s->name, p->name, err); break; case GV_EVENT_DETACH_PLEX: G_VINUM_DEBUG(2, "event 'detach' plex"); p = ev->arg1; flags = ev->arg3; err = gv_detach_plex(p, flags); if (err) G_VINUM_DEBUG(0, "error detaching %s: " "error code %d", p->name, err); break; case GV_EVENT_DETACH_SD: G_VINUM_DEBUG(2, "event 'detach' sd"); s = ev->arg1; flags = ev->arg3; err = gv_detach_sd(s, flags); if (err) G_VINUM_DEBUG(0, "error detaching %s: " "error code %d", s->name, err); break; case GV_EVENT_RENAME_VOL: G_VINUM_DEBUG(2, "event 'rename' volume"); v = ev->arg1; newname = ev->arg2; flags = ev->arg3; err = gv_rename_vol(sc, v, newname, flags); if (err) G_VINUM_DEBUG(0, "error renaming %s to " "%s: error code %d", v->name, newname, err); g_free(newname); /* Destroy and
recreate the provider if we can. */ if (gv_provider_is_open(v->provider)) { G_VINUM_DEBUG(0, "unable to rename " "provider to %s: provider in use", v->name); break; } g_topology_lock(); g_wither_provider(v->provider, ENOENT); g_topology_unlock(); v->provider = NULL; gv_post_event(sc, GV_EVENT_SETUP_OBJECTS, sc, NULL, 0, 0); break; case GV_EVENT_RENAME_PLEX: G_VINUM_DEBUG(2, "event 'rename' plex"); p = ev->arg1; newname = ev->arg2; flags = ev->arg3; err = gv_rename_plex(sc, p, newname, flags); if (err) G_VINUM_DEBUG(0, "error renaming %s to " "%s: error code %d", p->name, newname, err); g_free(newname); break; case GV_EVENT_RENAME_SD: G_VINUM_DEBUG(2, "event 'rename' sd"); s = ev->arg1; newname = ev->arg2; flags = ev->arg3; err = gv_rename_sd(sc, s, newname, flags); if (err) G_VINUM_DEBUG(0, "error renaming %s to " "%s: error code %d", s->name, newname, err); g_free(newname); break; case GV_EVENT_RENAME_DRIVE: G_VINUM_DEBUG(2, "event 'rename' drive"); d = ev->arg1; newname = ev->arg2; flags = ev->arg3; err = gv_rename_drive(sc, d, newname, flags); if (err) G_VINUM_DEBUG(0, "error renaming %s to " "%s: error code %d", d->name, newname, err); g_free(newname); break; case GV_EVENT_MOVE_SD: G_VINUM_DEBUG(2, "event 'move' sd"); s = ev->arg1; d = ev->arg2; flags = ev->arg3; err = gv_move_sd(sc, s, d, flags); if (err) G_VINUM_DEBUG(0, "error moving %s to " "%s: error code %d", s->name, d->name, err); break; case GV_EVENT_THREAD_EXIT: G_VINUM_DEBUG(2, "event 'thread exit'"); g_free(ev); mtx_lock(&sc->equeue_mtx); mtx_lock(&sc->bqueue_mtx); gv_cleanup(sc); mtx_destroy(&sc->bqueue_mtx); mtx_destroy(&sc->equeue_mtx); g_free(sc->bqueue_down); g_free(sc->bqueue_up); g_free(sc); kproc_exit(0); /* NOTREACHED */ default: G_VINUM_DEBUG(1, "unknown event %d", ev->type); } g_free(ev); continue; } /* ... then do I/O processing. */ mtx_lock(&sc->bqueue_mtx); /* First do new requests. */ bp = bioq_takefirst(sc->bqueue_down); if (bp != NULL) { mtx_unlock(&sc->bqueue_mtx); /* A bio that interfered with another bio. */ if (bp->bio_pflags & GV_BIO_ONHOLD) { s = bp->bio_caller1; p = s->plex_sc; /* Is it still locked out? */ if (gv_stripe_active(p, bp)) { /* Park the bio on the waiting queue. */ bioq_disksort(p->wqueue, bp); } else { bp->bio_pflags &= ~GV_BIO_ONHOLD; g_io_request(bp, s->drive_sc->consumer); } /* A special request requireing special handling. */ } else if (bp->bio_pflags & GV_BIO_INTERNAL) { p = bp->bio_caller1; gv_plex_start(p, bp); } else { gv_volume_start(sc, bp); } mtx_lock(&sc->bqueue_mtx); } /* Then do completed requests. */ bp = bioq_takefirst(sc->bqueue_up); if (bp == NULL) { msleep(sc, &sc->bqueue_mtx, PRIBIO, "-", hz/10); mtx_unlock(&sc->bqueue_mtx); continue; } mtx_unlock(&sc->bqueue_mtx); gv_bio_done(sc, bp); } } #define VINUM_CLASS_NAME "VINUM" static struct g_class g_vinum_class = { .name = VINUM_CLASS_NAME, .version = G_VERSION, .init = gv_init, .taste = gv_taste, .ctlreq = gv_config, .destroy_geom = gv_unload, }; DECLARE_GEOM_CLASS(g_vinum_class, g_vinum); +MODULE_VERSION(geom_vinum, 0); Index: user/markj/netdump/sys/geom/virstor/g_virstor.c =================================================================== --- user/markj/netdump/sys/geom/virstor/g_virstor.c (revision 332407) +++ user/markj/netdump/sys/geom/virstor/g_virstor.c (revision 332408) @@ -1,1893 +1,1894 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2006-2007 Ivan Voras * All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* Implementation notes: * - "Components" are wrappers around providers that make up the * virtual storage (i.e. a virstor has "physical" components) */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include FEATURE(g_virstor, "GEOM virtual storage support"); /* Declare malloc(9) label */ static MALLOC_DEFINE(M_GVIRSTOR, "gvirstor", "GEOM_VIRSTOR Data"); /* GEOM class methods */ static g_init_t g_virstor_init; static g_fini_t g_virstor_fini; static g_taste_t g_virstor_taste; static g_ctl_req_t g_virstor_config; static g_ctl_destroy_geom_t g_virstor_destroy_geom; /* Declare & initialize class structure ("geom class") */ struct g_class g_virstor_class = { .name = G_VIRSTOR_CLASS_NAME, .version = G_VERSION, .init = g_virstor_init, .fini = g_virstor_fini, .taste = g_virstor_taste, .ctlreq = g_virstor_config, .destroy_geom = g_virstor_destroy_geom /* The .dumpconf and the rest are only usable for a geom instance, so * they will be set when such instance is created. 
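* (See create_virstor_geom(), which assigns .start, .access, .orphan and .dumpconf on the newly created geom.)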
*/ }; /* Declare sysctl's and loader tunables */ SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, virstor, CTLFLAG_RW, 0, "GEOM_GVIRSTOR information"); static u_int g_virstor_debug = 2; /* XXX: lower to 2 when released to public */ SYSCTL_UINT(_kern_geom_virstor, OID_AUTO, debug, CTLFLAG_RWTUN, &g_virstor_debug, 0, "Debug level (2=production, 5=normal, 15=excessive)"); static u_int g_virstor_chunk_watermark = 100; SYSCTL_UINT(_kern_geom_virstor, OID_AUTO, chunk_watermark, CTLFLAG_RWTUN, &g_virstor_chunk_watermark, 0, "Minimum number of free chunks before issuing administrative warning"); static u_int g_virstor_component_watermark = 1; SYSCTL_UINT(_kern_geom_virstor, OID_AUTO, component_watermark, CTLFLAG_RWTUN, &g_virstor_component_watermark, 0, "Minimum number of free components before issuing administrative warning"); static int read_metadata(struct g_consumer *, struct g_virstor_metadata *); static void write_metadata(struct g_consumer *, struct g_virstor_metadata *); static int clear_metadata(struct g_virstor_component *); static int add_provider_to_geom(struct g_virstor_softc *, struct g_provider *, struct g_virstor_metadata *); static struct g_geom *create_virstor_geom(struct g_class *, struct g_virstor_metadata *); static void virstor_check_and_run(struct g_virstor_softc *); static u_int virstor_valid_components(struct g_virstor_softc *); static int virstor_geom_destroy(struct g_virstor_softc *, boolean_t, boolean_t); static void remove_component(struct g_virstor_softc *, struct g_virstor_component *, boolean_t); static void bioq_dismantle(struct bio_queue_head *); static int allocate_chunk(struct g_virstor_softc *, struct g_virstor_component **, u_int *, u_int *); static void delay_destroy_consumer(void *, int); static void dump_component(struct g_virstor_component *comp); #if 0 static void dump_me(struct virstor_map_entry *me, unsigned int nr); #endif static void virstor_ctl_stop(struct gctl_req *, struct g_class *); static void virstor_ctl_add(struct gctl_req *, struct g_class *); static void virstor_ctl_remove(struct gctl_req *, struct g_class *); static struct g_virstor_softc * virstor_find_geom(const struct g_class *, const char *); static void update_metadata(struct g_virstor_softc *); static void fill_metadata(struct g_virstor_softc *, struct g_virstor_metadata *, u_int, u_int); static void g_virstor_orphan(struct g_consumer *); static int g_virstor_access(struct g_provider *, int, int, int); static void g_virstor_start(struct bio *); static void g_virstor_dumpconf(struct sbuf *, const char *, struct g_geom *, struct g_consumer *, struct g_provider *); static void g_virstor_done(struct bio *); static void invalid_call(void); /* * Initialise GEOM class (per-class callback) */ static void g_virstor_init(struct g_class *mp __unused) { /* Catch map struct size mismatch at compile time; Map entries must * fit into MAXPHYS exactly, with no wasted space. 
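* The CTASSERT below turns any mismatch between these two constants into a compile-time error.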
*/ CTASSERT(VIRSTOR_MAP_BLOCK_ENTRIES*VIRSTOR_MAP_ENTRY_SIZE == MAXPHYS); /* Init UMA zones, TAILQ's, other global vars */ } /* * Finalise GEOM class (per-class callback) */ static void g_virstor_fini(struct g_class *mp __unused) { /* Deinit UMA zones & global vars */ } /* * Config (per-class callback) */ static void g_virstor_config(struct gctl_req *req, struct g_class *cp, char const *verb) { uint32_t *version; g_topology_assert(); version = gctl_get_paraml(req, "version", sizeof(*version)); if (version == NULL) { gctl_error(req, "Failed to get 'version' argument"); return; } if (*version != G_VIRSTOR_VERSION) { gctl_error(req, "Userland and kernel versions out of sync"); return; } g_topology_unlock(); if (strcmp(verb, "add") == 0) virstor_ctl_add(req, cp); else if (strcmp(verb, "stop") == 0 || strcmp(verb, "destroy") == 0) virstor_ctl_stop(req, cp); else if (strcmp(verb, "remove") == 0) virstor_ctl_remove(req, cp); else gctl_error(req, "unknown verb: '%s'", verb); g_topology_lock(); } /* * "stop" verb from userland */ static void virstor_ctl_stop(struct gctl_req *req, struct g_class *cp) { int *force, *nargs; int i; nargs = gctl_get_paraml(req, "nargs", sizeof *nargs); if (nargs == NULL) { gctl_error(req, "Error fetching argument '%s'", "nargs"); return; } if (*nargs < 1) { gctl_error(req, "Invalid number of arguments"); return; } force = gctl_get_paraml(req, "force", sizeof *force); if (force == NULL) { gctl_error(req, "Error fetching argument '%s'", "force"); return; } g_topology_lock(); for (i = 0; i < *nargs; i++) { char param[8]; const char *name; struct g_virstor_softc *sc; int error; sprintf(param, "arg%d", i); name = gctl_get_asciiparam(req, param); if (name == NULL) { gctl_error(req, "No 'arg%d' argument", i); g_topology_unlock(); return; } sc = virstor_find_geom(cp, name); if (sc == NULL) { gctl_error(req, "Don't know anything about '%s'", name); g_topology_unlock(); return; } LOG_MSG(LVL_INFO, "Stopping %s by the userland command", sc->geom->name); update_metadata(sc); if ((error = virstor_geom_destroy(sc, TRUE, TRUE)) != 0) { LOG_MSG(LVL_ERROR, "Cannot destroy %s: %d", sc->geom->name, error); } } g_topology_unlock(); } /* * "add" verb from userland - add new component(s) to the structure. * This will be done all at once in here, without going through the * .taste function for new components. */ static void virstor_ctl_add(struct gctl_req *req, struct g_class *cp) { /* Note: while this is going on, I/O is being done on * the g_up and g_down threads. The idea is to make changes * to softc members in a way that can atomically activate * them all at once. 
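* For example, a new component's slot is initialised completely before n_components is incremented, so concurrent readers never observe a half-built entry.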
*/ struct g_virstor_softc *sc; int *hardcode, *nargs; const char *geom_name; /* geom to add a component to */ struct g_consumer *fcp; struct g_virstor_bio_q *bq; u_int added; int error; int i; nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "Error fetching argument '%s'", "nargs"); return; } if (*nargs < 2) { gctl_error(req, "Invalid number of arguments"); return; } hardcode = gctl_get_paraml(req, "hardcode", sizeof(*hardcode)); if (hardcode == NULL) { gctl_error(req, "Error fetching argument '%s'", "hardcode"); return; } /* Find "our" geom */ geom_name = gctl_get_asciiparam(req, "arg0"); if (geom_name == NULL) { gctl_error(req, "Error fetching argument '%s'", "geom_name (arg0)"); return; } sc = virstor_find_geom(cp, geom_name); if (sc == NULL) { gctl_error(req, "Don't know anything about '%s'", geom_name); return; } if (virstor_valid_components(sc) != sc->n_components) { LOG_MSG(LVL_ERROR, "Cannot add components to incomplete " "virstor %s", sc->geom->name); gctl_error(req, "Virstor %s is incomplete", sc->geom->name); return; } fcp = sc->components[0].gcons; added = 0; g_topology_lock(); for (i = 1; i < *nargs; i++) { struct g_virstor_metadata md; char aname[8]; const char *prov_name; struct g_provider *pp; struct g_consumer *cp; u_int nc; u_int j; snprintf(aname, sizeof aname, "arg%d", i); prov_name = gctl_get_asciiparam(req, aname); if (prov_name == NULL) { gctl_error(req, "Error fetching argument '%s'", aname); g_topology_unlock(); return; } if (strncmp(prov_name, _PATH_DEV, sizeof(_PATH_DEV) - 1) == 0) prov_name += sizeof(_PATH_DEV) - 1; pp = g_provider_by_name(prov_name); if (pp == NULL) { /* This is the most common error so be verbose about it */ if (added != 0) { gctl_error(req, "Invalid provider: '%s' (added" " %u components)", prov_name, added); update_metadata(sc); } else { gctl_error(req, "Invalid provider: '%s'", prov_name); } g_topology_unlock(); return; } cp = g_new_consumer(sc->geom); if (cp == NULL) { gctl_error(req, "Cannot create consumer"); g_topology_unlock(); return; } error = g_attach(cp, pp); if (error != 0) { gctl_error(req, "Cannot attach a consumer to %s", pp->name); g_destroy_consumer(cp); g_topology_unlock(); return; } if (fcp->acr != 0 || fcp->acw != 0 || fcp->ace != 0) { error = g_access(cp, fcp->acr, fcp->acw, fcp->ace); if (error != 0) { gctl_error(req, "Access request failed for %s", pp->name); g_destroy_consumer(cp); g_topology_unlock(); return; } } if (fcp->provider->sectorsize != pp->sectorsize) { gctl_error(req, "Sector size doesn't fit for %s", pp->name); g_destroy_consumer(cp); g_topology_unlock(); return; } for (j = 0; j < sc->n_components; j++) { if (strcmp(sc->components[j].gcons->provider->name, pp->name) == 0) { gctl_error(req, "Component %s already in %s", pp->name, sc->geom->name); g_destroy_consumer(cp); g_topology_unlock(); return; } } sc->components = realloc(sc->components, sizeof(*sc->components) * (sc->n_components + 1), M_GVIRSTOR, M_WAITOK); nc = sc->n_components; sc->components[nc].gcons = cp; sc->components[nc].sc = sc; sc->components[nc].index = nc; sc->components[nc].chunk_count = cp->provider->mediasize / sc->chunk_size; sc->components[nc].chunk_next = 0; sc->components[nc].chunk_reserved = 0; if (sc->components[nc].chunk_count < 4) { gctl_error(req, "Provider too small: %s", cp->provider->name); g_destroy_consumer(cp); g_topology_unlock(); return; } fill_metadata(sc, &md, nc, *hardcode); write_metadata(cp, &md); /* The new component becomes visible when n_components is * incremented */ 
sc->n_components++; added++; } /* This call to update_metadata() is critical. In case there's a * power failure in the middle of it and some components are updated * while others are not, there will be trouble on next .taste() iff * a non-updated component is detected first */ update_metadata(sc); g_topology_unlock(); LOG_MSG(LVL_INFO, "Added %d component(s) to %s", added, sc->geom->name); /* Fire off BIOs previously queued because there wasn't any * physical space left. If the BIOs still can't be satisfied * they will again be added to the end of the queue (during * which the mutex will be recursed) */ bq = malloc(sizeof(*bq), M_GVIRSTOR, M_WAITOK); bq->bio = NULL; mtx_lock(&sc->delayed_bio_q_mtx); /* First, insert a sentinel to the queue end, so we don't * end up in an infinite loop if there's still no free * space available. */ STAILQ_INSERT_TAIL(&sc->delayed_bio_q, bq, linkage); while (!STAILQ_EMPTY(&sc->delayed_bio_q)) { bq = STAILQ_FIRST(&sc->delayed_bio_q); if (bq->bio != NULL) { g_virstor_start(bq->bio); STAILQ_REMOVE_HEAD(&sc->delayed_bio_q, linkage); free(bq, M_GVIRSTOR); } else { STAILQ_REMOVE_HEAD(&sc->delayed_bio_q, linkage); free(bq, M_GVIRSTOR); break; } } mtx_unlock(&sc->delayed_bio_q_mtx); } /* * Find a geom handled by the class */ static struct g_virstor_softc * virstor_find_geom(const struct g_class *cp, const char *name) { struct g_geom *gp; LIST_FOREACH(gp, &cp->geom, geom) { if (strcmp(name, gp->name) == 0) return (gp->softc); } return (NULL); } /* * Update metadata on all components to reflect the current state * of these fields: * - chunk_next * - flags * - md_count * Expects things to be set up so write_metadata() can work, i.e. * the topology lock must be held. */ static void update_metadata(struct g_virstor_softc *sc) { struct g_virstor_metadata md; u_int n; if (virstor_valid_components(sc) != sc->n_components) return; /* Incomplete device */ LOG_MSG(LVL_DEBUG, "Updating metadata on components for %s", sc->geom->name); /* Update metadata on components */ g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, sc->geom->class->name, sc->geom->name); g_topology_assert(); for (n = 0; n < sc->n_components; n++) { read_metadata(sc->components[n].gcons, &md); md.chunk_next = sc->components[n].chunk_next; md.flags = sc->components[n].flags; md.md_count = sc->n_components; write_metadata(sc->components[n].gcons, &md); } } /* * Fills metadata (struct md) from information stored in softc and the nc'th * component of virstor */ static void fill_metadata(struct g_virstor_softc *sc, struct g_virstor_metadata *md, u_int nc, u_int hardcode) { struct g_virstor_component *c; bzero(md, sizeof *md); c = &sc->components[nc]; strncpy(md->md_magic, G_VIRSTOR_MAGIC, sizeof md->md_magic); md->md_version = G_VIRSTOR_VERSION; strncpy(md->md_name, sc->geom->name, sizeof md->md_name); md->md_id = sc->id; md->md_virsize = sc->virsize; md->md_chunk_size = sc->chunk_size; md->md_count = sc->n_components; if (hardcode) { strncpy(md->provider, c->gcons->provider->name, sizeof md->provider); } md->no = nc; md->provsize = c->gcons->provider->mediasize; md->chunk_count = c->chunk_count; md->chunk_next = c->chunk_next; md->chunk_reserved = c->chunk_reserved; md->flags = c->flags; } /* * Remove a component from virstor device. * Can only be done if the component is unallocated. */ static void virstor_ctl_remove(struct gctl_req *req, struct g_class *cp) { /* As this is executed in parallel to I/O, operations on virstor * structures must be as atomic as possible. 
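* The component array is therefore rebuilt in a private copy and published with a single pointer assignment; see the "critical section" below.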
*/ struct g_virstor_softc *sc; int *nargs; const char *geom_name; u_int removed; int i; nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs)); if (nargs == NULL) { gctl_error(req, "Error fetching argument '%s'", "nargs"); return; } if (*nargs < 2) { gctl_error(req, "Invalid number of arguments"); return; } /* Find "our" geom */ geom_name = gctl_get_asciiparam(req, "arg0"); if (geom_name == NULL) { gctl_error(req, "Error fetching argument '%s'", "geom_name (arg0)"); return; } sc = virstor_find_geom(cp, geom_name); if (sc == NULL) { gctl_error(req, "Don't know anything about '%s'", geom_name); return; } if (virstor_valid_components(sc) != sc->n_components) { LOG_MSG(LVL_ERROR, "Cannot remove components from incomplete " "virstor %s", sc->geom->name); gctl_error(req, "Virstor %s is incomplete", sc->geom->name); return; } removed = 0; for (i = 1; i < *nargs; i++) { char param[8]; const char *prov_name; int j, found; struct g_virstor_component *newcomp, *compbak; sprintf(param, "arg%d", i); prov_name = gctl_get_asciiparam(req, param); if (prov_name == NULL) { gctl_error(req, "Error fetching argument '%s'", param); return; } if (strncmp(prov_name, _PATH_DEV, sizeof(_PATH_DEV) - 1) == 0) prov_name += sizeof(_PATH_DEV) - 1; found = -1; for (j = 0; j < sc->n_components; j++) { if (strcmp(sc->components[j].gcons->provider->name, prov_name) == 0) { found = j; break; } } if (found == -1) { LOG_MSG(LVL_ERROR, "No %s component in %s", prov_name, sc->geom->name); continue; } compbak = sc->components; newcomp = malloc(sc->n_components * sizeof(*sc->components), M_GVIRSTOR, M_WAITOK | M_ZERO); bcopy(sc->components, newcomp, found * sizeof(*sc->components)); bcopy(&sc->components[found + 1], newcomp + found, (sc->n_components - found - 1) * sizeof(*sc->components)); if ((sc->components[j].flags & VIRSTOR_PROVIDER_ALLOCATED) != 0) { LOG_MSG(LVL_ERROR, "Allocated provider %s cannot be " "removed from %s", prov_name, sc->geom->name); free(newcomp, M_GVIRSTOR); /* We'll consider this non-fatal error */ continue; } /* Renumber unallocated components */ for (j = 0; j < sc->n_components-1; j++) { if ((sc->components[j].flags & VIRSTOR_PROVIDER_ALLOCATED) == 0) { sc->components[j].index = j; } } /* This is the critical section. If a component allocation * event happens while both variables are not yet set, * there will be trouble. Something will panic on encountering * NULL sc->components[x].gcomp member. * Luckily, component allocation happens very rarely and * removing components is an abnormal action in any case. */ sc->components = newcomp; sc->n_components--; /* End critical section */ g_topology_lock(); if (clear_metadata(&compbak[found]) != 0) { LOG_MSG(LVL_WARNING, "Trouble ahead: cannot clear " "metadata on %s", prov_name); } g_detach(compbak[found].gcons); g_destroy_consumer(compbak[found].gcons); g_topology_unlock(); free(compbak, M_GVIRSTOR); removed++; } /* This call to update_metadata() is critical.
In case there's a * power failure in the middle of it and some components are updated * while others are not, there will be trouble on next .taste() iff * a non-updated component is detected first */ g_topology_lock(); update_metadata(sc); g_topology_unlock(); LOG_MSG(LVL_INFO, "Removed %d component(s) from %s", removed, sc->geom->name); } /* * Clear metadata sector on component */ static int clear_metadata(struct g_virstor_component *comp) { char *buf; int error; LOG_MSG(LVL_INFO, "Clearing metadata on %s", comp->gcons->provider->name); g_topology_assert(); error = g_access(comp->gcons, 0, 1, 0); if (error != 0) return (error); buf = malloc(comp->gcons->provider->sectorsize, M_GVIRSTOR, M_WAITOK | M_ZERO); error = g_write_data(comp->gcons, comp->gcons->provider->mediasize - comp->gcons->provider->sectorsize, buf, comp->gcons->provider->sectorsize); free(buf, M_GVIRSTOR); g_access(comp->gcons, 0, -1, 0); return (error); } /* * Destroy geom forcibly. */ static int g_virstor_destroy_geom(struct gctl_req *req __unused, struct g_class *mp, struct g_geom *gp) { struct g_virstor_softc *sc; int exitval; sc = gp->softc; KASSERT(sc != NULL, ("%s: NULL sc", __func__)); exitval = 0; LOG_MSG(LVL_DEBUG, "%s called for %s, sc=%p", __func__, gp->name, gp->softc); if (sc != NULL) { #ifdef INVARIANTS char *buf; int error; off_t off; int isclean, count; int n; LOG_MSG(LVL_INFO, "INVARIANTS detected"); LOG_MSG(LVL_INFO, "Verifying allocation " "table for %s", sc->geom->name); count = 0; for (n = 0; n < sc->chunk_count; n++) { if ((sc->map[n].flags & VIRSTOR_MAP_ALLOCATED) != 0) count++; } LOG_MSG(LVL_INFO, "Device %s has %d allocated chunks", sc->geom->name, count); n = off = count = 0; isclean = 1; if (virstor_valid_components(sc) != sc->n_components) { /* This is an incomplete virstor device (not all * components have been found) */ LOG_MSG(LVL_ERROR, "Device %s is incomplete", sc->geom->name); goto bailout; } error = g_access(sc->components[0].gcons, 1, 0, 0); KASSERT(error == 0, ("%s: g_access failed (%d)", __func__, error)); /* Compare the whole on-disk allocation table with what's * currently in memory */ while (n < sc->chunk_count) { buf = g_read_data(sc->components[0].gcons, off, sc->sectorsize, &error); KASSERT(buf != NULL, ("g_read_data returned NULL (%d) " "for read at %jd", error, off)); if (bcmp(buf, &sc->map[n], sc->sectorsize) != 0) { LOG_MSG(LVL_ERROR, "ERROR in allocation table, " "entry %d, offset %jd", n, off); isclean = 0; count++; } n += sc->me_per_sector; off += sc->sectorsize; g_free(buf); } error = g_access(sc->components[0].gcons, -1, 0, 0); KASSERT(error == 0, ("%s: g_access failed (%d) on exit", __func__, error)); if (isclean != 1) { LOG_MSG(LVL_ERROR, "ALLOCATION TABLE CORRUPTED FOR %s " "(%d sectors don't match, max %zu allocations)", sc->geom->name, count, count * sc->me_per_sector); } else { LOG_MSG(LVL_INFO, "Allocation table ok for %s", sc->geom->name); } bailout: #endif update_metadata(sc); virstor_geom_destroy(sc, FALSE, FALSE); exitval = EAGAIN; } else exitval = 0; return (exitval); } /* * Taste event (per-class callback) * Examines a provider and creates geom instances if needed */ static struct g_geom * g_virstor_taste(struct g_class *mp, struct g_provider *pp, int flags) { struct g_virstor_metadata md; struct g_geom *gp; struct g_consumer *cp; struct g_virstor_softc *sc; int error; g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name); g_topology_assert(); LOG_MSG(LVL_DEBUG, "Tasting %s", pp->name); /* We need a dummy geom to attach a consumer to the given
provider */ gp = g_new_geomf(mp, "virstor:taste.helper"); gp->start = (void *)invalid_call; /* XXX: hacked up so the */ gp->access = (void *)invalid_call; /* compiler doesn't complain. */ gp->orphan = (void *)invalid_call; /* I really want these to fail. */ cp = g_new_consumer(gp); g_attach(cp, pp); error = read_metadata(cp, &md); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); if (error != 0) return (NULL); if (strcmp(md.md_magic, G_VIRSTOR_MAGIC) != 0) return (NULL); if (md.md_version != G_VIRSTOR_VERSION) { LOG_MSG(LVL_ERROR, "Kernel module version invalid " "to handle %s (%s) : %d should be %d", md.md_name, pp->name, md.md_version, G_VIRSTOR_VERSION); return (NULL); } if (md.provsize != pp->mediasize) return (NULL); /* If the provider name is hardcoded, use the offered provider only * if it's been offered with its proper name (the one used in * the label command). */ if (md.provider[0] != '\0' && !g_compare_names(md.provider, pp->name)) return (NULL); /* Iterate all geoms this class already knows about to see if a new * geom instance of this class needs to be created (in case the provider * is first from a (possibly) multi-consumer geom) or it just needs * to be added to an existing instance. */ sc = NULL; gp = NULL; LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc == NULL) continue; if (strcmp(md.md_name, sc->geom->name) != 0) continue; if (md.md_id != sc->id) continue; break; } if (gp != NULL) { /* We found an existing geom instance; add to it */ LOG_MSG(LVL_INFO, "Adding %s to %s", pp->name, md.md_name); error = add_provider_to_geom(sc, pp, &md); if (error != 0) { LOG_MSG(LVL_ERROR, "Error adding %s to %s (error %d)", pp->name, md.md_name, error); return (NULL); } } else { /* New geom instance needs to be created */ gp = create_virstor_geom(mp, &md); if (gp == NULL) { LOG_MSG(LVL_ERROR, "Error creating new instance of " "class %s: %s", mp->name, md.md_name); LOG_MSG(LVL_DEBUG, "Error creating %s at %s", md.md_name, pp->name); return (NULL); } sc = gp->softc; LOG_MSG(LVL_INFO, "Adding %s to %s (first found)", pp->name, md.md_name); error = add_provider_to_geom(sc, pp, &md); if (error != 0) { LOG_MSG(LVL_ERROR, "Error adding %s to %s (error %d)", pp->name, md.md_name, error); virstor_geom_destroy(sc, TRUE, FALSE); return (NULL); } } return (gp); } /* * Destroys the consumer passed to it in arguments. Used as a callback * on g_event queue.
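* remove_component() posts it via g_post_event() when the consumer has to stay intact until after the provider has been re-tasted.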
*/ static void delay_destroy_consumer(void *arg, int flags __unused) { struct g_consumer *c = arg; KASSERT(c != NULL, ("%s: invalid consumer", __func__)); LOG_MSG(LVL_DEBUG, "Consumer %s destroyed with delay", c->provider->name); g_detach(c); g_destroy_consumer(c); } /* * Remove a component (consumer) from geom instance; If it's the first * component being removed, orphan the provider to announce geom's being * dismantled */ static void remove_component(struct g_virstor_softc *sc, struct g_virstor_component *comp, boolean_t delay) { struct g_consumer *c; KASSERT(comp->gcons != NULL, ("Component with no consumer in %s", sc->geom->name)); c = comp->gcons; comp->gcons = NULL; KASSERT(c->provider != NULL, ("%s: no provider", __func__)); LOG_MSG(LVL_DEBUG, "Component %s removed from %s", c->provider->name, sc->geom->name); if (sc->provider != NULL) { LOG_MSG(LVL_INFO, "Removing provider %s", sc->provider->name); g_wither_provider(sc->provider, ENXIO); sc->provider = NULL; } if (c->acr > 0 || c->acw > 0 || c->ace > 0) g_access(c, -c->acr, -c->acw, -c->ace); if (delay) { /* Destroy consumer after it's tasted */ g_post_event(delay_destroy_consumer, c, M_WAITOK, NULL); } else { g_detach(c); g_destroy_consumer(c); } } /* * Destroy geom - called internally * See g_virstor_destroy_geom for the other one */ static int virstor_geom_destroy(struct g_virstor_softc *sc, boolean_t force, boolean_t delay) { struct g_provider *pp; struct g_geom *gp; u_int n; g_topology_assert(); if (sc == NULL) return (ENXIO); pp = sc->provider; if (pp != NULL && (pp->acr != 0 || pp->acw != 0 || pp->ace != 0)) { LOG_MSG(force ? LVL_WARNING : LVL_ERROR, "Device %s is still open.", pp->name); if (!force) return (EBUSY); } for (n = 0; n < sc->n_components; n++) { if (sc->components[n].gcons != NULL) remove_component(sc, &sc->components[n], delay); } gp = sc->geom; gp->softc = NULL; KASSERT(sc->provider == NULL, ("Provider still exists for %s", gp->name)); /* XXX: This might or might not work, since we're called with * the topology lock held. Also, it might panic the kernel if * the error'd BIO is in softupdates code. */ mtx_lock(&sc->delayed_bio_q_mtx); while (!STAILQ_EMPTY(&sc->delayed_bio_q)) { struct g_virstor_bio_q *bq; bq = STAILQ_FIRST(&sc->delayed_bio_q); bq->bio->bio_error = ENOSPC; g_io_deliver(bq->bio, EIO); STAILQ_REMOVE_HEAD(&sc->delayed_bio_q, linkage); free(bq, M_GVIRSTOR); } mtx_unlock(&sc->delayed_bio_q_mtx); mtx_destroy(&sc->delayed_bio_q_mtx); free(sc->map, M_GVIRSTOR); free(sc->components, M_GVIRSTOR); bzero(sc, sizeof *sc); free(sc, M_GVIRSTOR); pp = LIST_FIRST(&gp->provider); /* We only offer one provider */ if (pp == NULL || (pp->acr == 0 && pp->acw == 0 && pp->ace == 0)) LOG_MSG(LVL_DEBUG, "Device %s destroyed", gp->name); g_wither_geom(gp, ENXIO); return (0); } /* * Utility function: read metadata & decode. Wants topology lock to be * held. */ static int read_metadata(struct g_consumer *cp, struct g_virstor_metadata *md) { struct g_provider *pp; char *buf; int error; g_topology_assert(); error = g_access(cp, 1, 0, 0); if (error != 0) return (error); pp = cp->provider; g_topology_unlock(); buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize, &error); g_topology_lock(); g_access(cp, -1, 0, 0); if (buf == NULL) return (error); virstor_metadata_decode(buf, md); g_free(buf); return (0); } /** * Utility function: encode & write metadata. Assumes topology lock is * held. * * There is no useful way of recovering from errors in this function, * not involving panicking the kernel. 
If the metadata cannot be written * the most we can do is notify the operator and hope he spots it and * replaces the broken drive. */ static void write_metadata(struct g_consumer *cp, struct g_virstor_metadata *md) { struct g_provider *pp; char *buf; int error; KASSERT(cp != NULL && md != NULL && cp->provider != NULL, ("Something's fishy in %s", __func__)); LOG_MSG(LVL_DEBUG, "Writing metadata on %s", cp->provider->name); g_topology_assert(); error = g_access(cp, 0, 1, 0); if (error != 0) { LOG_MSG(LVL_ERROR, "g_access(0,1,0) failed for %s: %d", cp->provider->name, error); return; } pp = cp->provider; buf = malloc(pp->sectorsize, M_GVIRSTOR, M_WAITOK); bzero(buf, pp->sectorsize); virstor_metadata_encode(md, buf); g_topology_unlock(); error = g_write_data(cp, pp->mediasize - pp->sectorsize, buf, pp->sectorsize); g_topology_lock(); g_access(cp, 0, -1, 0); free(buf, M_GVIRSTOR); if (error != 0) LOG_MSG(LVL_ERROR, "Error %d writing metadata to %s", error, cp->provider->name); } /* * Creates a new instance of this GEOM class, initialises softc */ static struct g_geom * create_virstor_geom(struct g_class *mp, struct g_virstor_metadata *md) { struct g_geom *gp; struct g_virstor_softc *sc; LOG_MSG(LVL_DEBUG, "Creating geom instance for %s (id=%u)", md->md_name, md->md_id); if (md->md_count < 1 || md->md_chunk_size < 1 || md->md_virsize < md->md_chunk_size) { /* This is bogus configuration, and probably means data is * somehow corrupted. Panic, maybe? */ LOG_MSG(LVL_ERROR, "Nonsensical metadata information for %s", md->md_name); return (NULL); } /* Check if it's already created */ LIST_FOREACH(gp, &mp->geom, geom) { sc = gp->softc; if (sc != NULL && strcmp(sc->geom->name, md->md_name) == 0) { LOG_MSG(LVL_WARNING, "Geom %s already exists", md->md_name); if (sc->id != md->md_id) { LOG_MSG(LVL_ERROR, "Some stale or invalid components " "exist for virstor device named %s. " "You will need to remove all stale " "components and maybe reconfigure " "the virstor device.
Tune " "kern.geom.virstor.debug sysctl up " "for more information.", sc->geom->name); } return (NULL); } } gp = g_new_geomf(mp, "%s", md->md_name); gp->softc = NULL; /* to circumevent races that test softc */ gp->start = g_virstor_start; gp->spoiled = g_virstor_orphan; gp->orphan = g_virstor_orphan; gp->access = g_virstor_access; gp->dumpconf = g_virstor_dumpconf; sc = malloc(sizeof(*sc), M_GVIRSTOR, M_WAITOK | M_ZERO); sc->id = md->md_id; sc->n_components = md->md_count; sc->components = malloc(sizeof(struct g_virstor_component) * md->md_count, M_GVIRSTOR, M_WAITOK | M_ZERO); sc->chunk_size = md->md_chunk_size; sc->virsize = md->md_virsize; STAILQ_INIT(&sc->delayed_bio_q); mtx_init(&sc->delayed_bio_q_mtx, "gvirstor_delayed_bio_q_mtx", "gvirstor", MTX_DEF | MTX_RECURSE); sc->geom = gp; sc->provider = NULL; /* virstor_check_and_run will create it */ gp->softc = sc; LOG_MSG(LVL_ANNOUNCE, "Device %s created", sc->geom->name); return (gp); } /* * Add provider to a GEOM class instance */ static int add_provider_to_geom(struct g_virstor_softc *sc, struct g_provider *pp, struct g_virstor_metadata *md) { struct g_virstor_component *component; struct g_consumer *cp, *fcp; struct g_geom *gp; int error; if (md->no >= sc->n_components) return (EINVAL); /* "Current" compontent */ component = &(sc->components[md->no]); if (component->gcons != NULL) return (EEXIST); gp = sc->geom; fcp = LIST_FIRST(&gp->consumer); cp = g_new_consumer(gp); error = g_attach(cp, pp); if (error != 0) { g_destroy_consumer(cp); return (error); } if (fcp != NULL) { if (fcp->provider->sectorsize != pp->sectorsize) { /* TODO: this can be made to work */ LOG_MSG(LVL_ERROR, "Provider %s of %s has invalid " "sector size (%d)", pp->name, sc->geom->name, pp->sectorsize); return (EINVAL); } if (fcp->acr > 0 || fcp->acw || fcp->ace > 0) { /* Replicate access permissions from first "live" consumer * to the new one */ error = g_access(cp, fcp->acr, fcp->acw, fcp->ace); if (error != 0) { g_detach(cp); g_destroy_consumer(cp); return (error); } } } /* Bring up a new component */ cp->private = component; component->gcons = cp; component->sc = sc; component->index = md->no; component->chunk_count = md->chunk_count; component->chunk_next = md->chunk_next; component->chunk_reserved = md->chunk_reserved; component->flags = md->flags; LOG_MSG(LVL_DEBUG, "%s attached to %s", pp->name, sc->geom->name); virstor_check_and_run(sc); return (0); } /* * Check if everything's ready to create the geom provider & device entry, * create and start provider. 
* Called ultimately by .taste, from g_event thread */ static void virstor_check_and_run(struct g_virstor_softc *sc) { off_t off; size_t n, count; int index; int error; if (virstor_valid_components(sc) != sc->n_components) return; if (virstor_valid_components(sc) == 0) { /* This is actually a candidate for panic() */ LOG_MSG(LVL_ERROR, "No valid components for %s?", sc->provider->name); return; } sc->sectorsize = sc->components[0].gcons->provider->sectorsize; /* Initialise allocation map from the first consumer */ sc->chunk_count = sc->virsize / sc->chunk_size; if (sc->chunk_count * (off_t)sc->chunk_size != sc->virsize) { LOG_MSG(LVL_WARNING, "Device %s truncated to %ju bytes", sc->provider->name, sc->chunk_count * (off_t)sc->chunk_size); } sc->map_size = sc->chunk_count * sizeof *(sc->map); /* The following allocation is in order of 4MB - 8MB */ sc->map = malloc(sc->map_size, M_GVIRSTOR, M_WAITOK); KASSERT(sc->map != NULL, ("%s: Memory allocation error (%zu bytes) for %s", __func__, sc->map_size, sc->provider->name)); sc->map_sectors = sc->map_size / sc->sectorsize; count = 0; for (n = 0; n < sc->n_components; n++) count += sc->components[n].chunk_count; LOG_MSG(LVL_INFO, "Device %s has %zu physical chunks and %zu virtual " "(%zu KB chunks)", sc->geom->name, count, sc->chunk_count, sc->chunk_size / 1024); error = g_access(sc->components[0].gcons, 1, 0, 0); if (error != 0) { LOG_MSG(LVL_ERROR, "Cannot acquire read access for %s to " "read allocation map for %s", sc->components[0].gcons->provider->name, sc->geom->name); return; } /* Read in the allocation map */ LOG_MSG(LVL_DEBUG, "Reading map for %s from %s", sc->geom->name, sc->components[0].gcons->provider->name); off = count = n = 0; while (count < sc->map_size) { struct g_virstor_map_entry *mapbuf; size_t bs; bs = MIN(MAXPHYS, sc->map_size - count); if (bs % sc->sectorsize != 0) { /* Check for alignment errors */ bs = rounddown(bs, sc->sectorsize); if (bs == 0) break; LOG_MSG(LVL_ERROR, "Trouble: map is not sector-aligned " "for %s on %s", sc->geom->name, sc->components[0].gcons->provider->name); } mapbuf = g_read_data(sc->components[0].gcons, off, bs, &error); if (mapbuf == NULL) { free(sc->map, M_GVIRSTOR); LOG_MSG(LVL_ERROR, "Error reading allocation map " "for %s from %s (offset %ju) (error %d)", sc->geom->name, sc->components[0].gcons->provider->name, off, error); return; } bcopy(mapbuf, &sc->map[n], bs); off += bs; count += bs; n += bs / sizeof *(sc->map); g_free(mapbuf); } g_access(sc->components[0].gcons, -1, 0, 0); LOG_MSG(LVL_DEBUG, "Read map for %s", sc->geom->name); /* find first component with allocatable chunks */ index = -1; for (n = 0; n < sc->n_components; n++) { if (sc->components[n].chunk_next < sc->components[n].chunk_count) { index = n; break; } } if (index == -1) /* not found? 
set it to the last component and handle it * later */ index = sc->n_components - 1; if (index >= sc->n_components - g_virstor_component_watermark - 1) { LOG_MSG(LVL_WARNING, "Device %s running out of components " "(%d/%u: %s)", sc->geom->name, index+1, sc->n_components, sc->components[index].gcons->provider->name); } sc->curr_component = index; if (sc->components[index].chunk_next >= sc->components[index].chunk_count - g_virstor_chunk_watermark) { LOG_MSG(LVL_WARNING, "Component %s of %s is running out of free space " "(%u chunks left)", sc->components[index].gcons->provider->name, sc->geom->name, sc->components[index].chunk_count - sc->components[index].chunk_next); } sc->me_per_sector = sc->sectorsize / sizeof *(sc->map); if (sc->sectorsize % sizeof *(sc->map) != 0) { LOG_MSG(LVL_ERROR, "%s: Map entries don't fit exactly in a sector (%s)", __func__, sc->geom->name); return; } /* Recalculate allocated chunks in components & at the same time * verify map data is sane. We could trust metadata on this, but * we want to make sure. */ for (n = 0; n < sc->n_components; n++) sc->components[n].chunk_next = sc->components[n].chunk_reserved; for (n = 0; n < sc->chunk_count; n++) { if (sc->map[n].provider_no >= sc->n_components || sc->map[n].provider_chunk >= sc->components[sc->map[n].provider_no].chunk_count) { LOG_MSG(LVL_ERROR, "%s: Invalid entry %u in map for %s", __func__, (u_int)n, sc->geom->name); LOG_MSG(LVL_ERROR, "%s: provider_no: %u, n_components: %u" " provider_chunk: %u, chunk_count: %u", __func__, sc->map[n].provider_no, sc->n_components, sc->map[n].provider_chunk, sc->components[sc->map[n].provider_no].chunk_count); return; } if (sc->map[n].flags & VIRSTOR_MAP_ALLOCATED) sc->components[sc->map[n].provider_no].chunk_next++; } sc->provider = g_new_providerf(sc->geom, "virstor/%s", sc->geom->name); sc->provider->sectorsize = sc->sectorsize; sc->provider->mediasize = sc->virsize; g_error_provider(sc->provider, 0); LOG_MSG(LVL_INFO, "%s activated", sc->provider->name); LOG_MSG(LVL_DEBUG, "%s starting with current component %u, starting " "chunk %u", sc->provider->name, sc->curr_component, sc->components[sc->curr_component].chunk_next); } /* * Returns count of active providers in this geom instance */ static u_int virstor_valid_components(struct g_virstor_softc *sc) { unsigned int nc, i; nc = 0; KASSERT(sc != NULL, ("%s: softc is NULL", __func__)); KASSERT(sc->components != NULL, ("%s: sc->components is NULL", __func__)); for (i = 0; i < sc->n_components; i++) if (sc->components[i].gcons != NULL) nc++; return (nc); } /* * Called when the consumer gets orphaned (?) 
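* i.e. when the underlying provider goes away. The component is removed, and once no valid components remain the whole geom instance is destroyed.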
*/ static void g_virstor_orphan(struct g_consumer *cp) { struct g_virstor_softc *sc; struct g_virstor_component *comp; struct g_geom *gp; g_topology_assert(); gp = cp->geom; sc = gp->softc; if (sc == NULL) return; comp = cp->private; KASSERT(comp != NULL, ("%s: No component in private part of consumer", __func__)); remove_component(sc, comp, FALSE); if (virstor_valid_components(sc) == 0) virstor_geom_destroy(sc, TRUE, FALSE); } /* * Called to notify geom when it's been opened, and for what intent */ static int g_virstor_access(struct g_provider *pp, int dr, int dw, int de) { struct g_consumer *c; struct g_virstor_softc *sc; struct g_geom *gp; int error; KASSERT(pp != NULL, ("%s: NULL provider", __func__)); gp = pp->geom; KASSERT(gp != NULL, ("%s: NULL geom", __func__)); sc = gp->softc; if (sc == NULL) { /* It seems that .access can be called with negative dr,dw,de * in this case but I want to check for myself */ LOG_MSG(LVL_WARNING, "access(%d, %d, %d) for %s", dr, dw, de, pp->name); /* This should only happen when geom is withered so * allow only negative requests */ KASSERT(dr <= 0 && dw <= 0 && de <= 0, ("%s: Positive access for %s", __func__, pp->name)); if (pp->acr + dr == 0 && pp->acw + dw == 0 && pp->ace + de == 0) LOG_MSG(LVL_DEBUG, "Device %s definitely destroyed", pp->name); return (0); } /* Grab an exclusive bit to propagate on our consumers on first open */ if (pp->acr == 0 && pp->acw == 0 && pp->ace == 0) de++; /* ... drop it on close */ if (pp->acr + dr == 0 && pp->acw + dw == 0 && pp->ace + de == 0) { de--; update_metadata(sc); /* Writes statistical information */ } error = ENXIO; LIST_FOREACH(c, &gp->consumer, consumer) { KASSERT(c != NULL, ("%s: consumer is NULL", __func__)); error = g_access(c, dr, dw, de); if (error != 0) { struct g_consumer *c2; /* Backout earlier changes */ LIST_FOREACH(c2, &gp->consumer, consumer) { if (c2 == c) /* all earlier components fixed */ return (error); g_access(c2, -dr, -dw, -de); } } } return (error); } /* * Generate XML dump of current state */ static void g_virstor_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp, struct g_consumer *cp, struct g_provider *pp) { struct g_virstor_softc *sc; g_topology_assert(); sc = gp->softc; if (sc == NULL || pp != NULL) return; if (cp != NULL) { /* For each component */ struct g_virstor_component *comp; comp = cp->private; if (comp == NULL) return; sbuf_printf(sb, "%s%u\n", indent, comp->index); sbuf_printf(sb, "%s%u\n", indent, comp->chunk_count); sbuf_printf(sb, "%s%u\n", indent, comp->chunk_next); sbuf_printf(sb, "%s%u\n", indent, comp->chunk_reserved); sbuf_printf(sb, "%s%u%%\n", indent, comp->chunk_next > 0 ? 100 - ((comp->chunk_next + comp->chunk_reserved) * 100) / comp->chunk_count : 100); } else { /* For the whole thing */ u_int count, used, i; off_t size; count = used = size = 0; for (i = 0; i < sc->n_components; i++) { if (sc->components[i].gcons != NULL) { count += sc->components[i].chunk_count; used += sc->components[i].chunk_next + sc->components[i].chunk_reserved; size += sc->components[i].gcons-> provider->mediasize; } } sbuf_printf(sb, "%s" "Components=%u, Online=%u\n", indent, sc->n_components, virstor_valid_components(sc)); sbuf_printf(sb, "%s%u%% physical free\n", indent, 100-(used * 100) / count); sbuf_printf(sb, "%s%zu\n", indent, sc->chunk_size); sbuf_printf(sb, "%s%u%%\n", indent, used > 0 ?
100 - (used * 100) / count : 100); sbuf_printf(sb, "%s%u\n", indent, count); sbuf_printf(sb, "%s%zu\n", indent, sc->chunk_count); sbuf_printf(sb, "%s%zu%%\n", indent, (count * 100) / sc->chunk_count); sbuf_printf(sb, "%s%jd\n", indent, size); sbuf_printf(sb, "%s%jd\n", indent, sc->virsize); } } /* * GEOM .done handler * Can't use standard handler because one requested IO may * fork into additional data IOs */ static void g_virstor_done(struct bio *b) { struct g_virstor_softc *sc; struct bio *parent_b; parent_b = b->bio_parent; sc = parent_b->bio_to->geom->softc; if (b->bio_error != 0) { LOG_MSG(LVL_ERROR, "Error %d for offset=%ju, length=%ju, %s", b->bio_error, b->bio_offset, b->bio_length, b->bio_to->name); if (parent_b->bio_error == 0) parent_b->bio_error = b->bio_error; } parent_b->bio_inbed++; parent_b->bio_completed += b->bio_completed; if (parent_b->bio_children == parent_b->bio_inbed) { parent_b->bio_completed = parent_b->bio_length; g_io_deliver(parent_b, parent_b->bio_error); } g_destroy_bio(b); } /* * I/O starts here * Called in g_down thread */ static void g_virstor_start(struct bio *b) { struct g_virstor_softc *sc; struct g_virstor_component *comp; struct bio *cb; struct g_provider *pp; char *addr; off_t offset, length; struct bio_queue_head bq; size_t chunk_size; /* cached for convenience */ u_int count; pp = b->bio_to; sc = pp->geom->softc; KASSERT(sc != NULL, ("%s: no softc (error=%d, device=%s)", __func__, b->bio_to->error, b->bio_to->name)); LOG_REQ(LVL_MOREDEBUG, b, "%s", __func__); switch (b->bio_cmd) { case BIO_READ: case BIO_WRITE: case BIO_DELETE: break; default: g_io_deliver(b, EOPNOTSUPP); return; } LOG_MSG(LVL_DEBUG2, "BIO arrived, size=%ju", b->bio_length); bioq_init(&bq); chunk_size = sc->chunk_size; addr = b->bio_data; offset = b->bio_offset; /* virtual offset and length */ length = b->bio_length; while (length > 0) { size_t chunk_index, in_chunk_offset, in_chunk_length; struct virstor_map_entry *me; chunk_index = offset / chunk_size; /* round downwards */ in_chunk_offset = offset % chunk_size; in_chunk_length = min(length, chunk_size - in_chunk_offset); LOG_MSG(LVL_DEBUG, "Mapped %s(%ju, %ju) to (%zu,%zu,%zu)", b->bio_cmd == BIO_READ ? "R" : "W", offset, length, chunk_index, in_chunk_offset, in_chunk_length); me = &sc->map[chunk_index]; if (b->bio_cmd == BIO_READ || b->bio_cmd == BIO_DELETE) { if ((me->flags & VIRSTOR_MAP_ALLOCATED) == 0) { /* Reads from unallocated chunks return zeroed * buffers */ if (b->bio_cmd == BIO_READ) bzero(addr, in_chunk_length); } else { comp = &sc->components[me->provider_no]; cb = g_clone_bio(b); if (cb == NULL) { bioq_dismantle(&bq); if (b->bio_error == 0) b->bio_error = ENOMEM; g_io_deliver(b, b->bio_error); return; } cb->bio_to = comp->gcons->provider; cb->bio_done = g_virstor_done; cb->bio_offset = (off_t)me->provider_chunk * (off_t)chunk_size + in_chunk_offset; cb->bio_length = in_chunk_length; cb->bio_data = addr; cb->bio_caller1 = comp; bioq_disksort(&bq, cb); } } else { /* handle BIO_WRITE */ KASSERT(b->bio_cmd == BIO_WRITE, ("%s: Unknown command %d", __func__, b->bio_cmd)); if ((me->flags & VIRSTOR_MAP_ALLOCATED) == 0) { /* We have a virtual chunk, represented by * the "me" entry, but it's not yet allocated * (tied to) a physical chunk. So do it now. */ struct virstor_map_entry *data_me; u_int phys_chunk, comp_no; off_t s_offset; int error; error = allocate_chunk(sc, &comp, &comp_no, &phys_chunk); if (error != 0) { /* We cannot allocate a physical chunk * to satisfy this request, so we'll * delay it to when we can... 
* XXX: this will prevent the fs from * being umounted! */ struct g_virstor_bio_q *biq; biq = malloc(sizeof *biq, M_GVIRSTOR, M_NOWAIT); if (biq == NULL) { bioq_dismantle(&bq); if (b->bio_error == 0) b->bio_error = ENOMEM; g_io_deliver(b, b->bio_error); return; } biq->bio = b; mtx_lock(&sc->delayed_bio_q_mtx); STAILQ_INSERT_TAIL(&sc->delayed_bio_q, biq, linkage); mtx_unlock(&sc->delayed_bio_q_mtx); LOG_MSG(LVL_WARNING, "Delaying BIO " "(size=%ju) until free physical " "space can be found on %s", b->bio_length, sc->provider->name); return; } LOG_MSG(LVL_DEBUG, "Allocated chunk %u on %s " "for %s", phys_chunk, comp->gcons->provider->name, sc->provider->name); me->provider_no = comp_no; me->provider_chunk = phys_chunk; me->flags |= VIRSTOR_MAP_ALLOCATED; cb = g_clone_bio(b); if (cb == NULL) { me->flags &= ~VIRSTOR_MAP_ALLOCATED; me->provider_no = 0; me->provider_chunk = 0; bioq_dismantle(&bq); if (b->bio_error == 0) b->bio_error = ENOMEM; g_io_deliver(b, b->bio_error); return; } /* The allocation table is stored continuously * at the start of the drive. We need to * calculate the offset of the sector that holds * this map entry both on the drive and in the * map array. * sc_offset will end up pointing to the drive * sector. */ s_offset = chunk_index * sizeof *me; s_offset = rounddown(s_offset, sc->sectorsize); /* data_me points to map entry sector * in memory (analogous to offset) */ data_me = &sc->map[rounddown(chunk_index, sc->me_per_sector)]; /* Commit sector with map entry to storage */ cb->bio_to = sc->components[0].gcons->provider; cb->bio_done = g_virstor_done; cb->bio_offset = s_offset; cb->bio_data = (char *)data_me; cb->bio_length = sc->sectorsize; cb->bio_caller1 = &sc->components[0]; bioq_disksort(&bq, cb); } comp = &sc->components[me->provider_no]; cb = g_clone_bio(b); if (cb == NULL) { bioq_dismantle(&bq); if (b->bio_error == 0) b->bio_error = ENOMEM; g_io_deliver(b, b->bio_error); return; } /* Finally, handle the data */ cb->bio_to = comp->gcons->provider; cb->bio_done = g_virstor_done; cb->bio_offset = (off_t)me->provider_chunk*(off_t)chunk_size + in_chunk_offset; cb->bio_length = in_chunk_length; cb->bio_data = addr; cb->bio_caller1 = comp; bioq_disksort(&bq, cb); } addr += in_chunk_length; length -= in_chunk_length; offset += in_chunk_length; } /* Fire off bio's here */ count = 0; for (cb = bioq_first(&bq); cb != NULL; cb = bioq_first(&bq)) { bioq_remove(&bq, cb); LOG_REQ(LVL_MOREDEBUG, cb, "Firing request"); comp = cb->bio_caller1; cb->bio_caller1 = NULL; LOG_MSG(LVL_DEBUG, " firing bio, offset=%ju, length=%ju", cb->bio_offset, cb->bio_length); g_io_request(cb, comp->gcons); count++; } if (count == 0) { /* We handled everything locally */ b->bio_completed = b->bio_length; g_io_deliver(b, 0); } } /* * Allocate a chunk from a physical provider. Returns physical component, * chunk index relative to the component and the component's index. */ static int allocate_chunk(struct g_virstor_softc *sc, struct g_virstor_component **comp, u_int *comp_no_p, u_int *chunk) { u_int comp_no; KASSERT(sc->curr_component < sc->n_components, ("%s: Invalid curr_component: %u", __func__, sc->curr_component)); comp_no = sc->curr_component; *comp = &sc->components[comp_no]; dump_component(*comp); if ((*comp)->chunk_next >= (*comp)->chunk_count) { /* This component is full. 
Allocate next component */ if (comp_no >= sc->n_components-1) { LOG_MSG(LVL_ERROR, "All physical space allocated for %s", sc->geom->name); return (-1); } (*comp)->flags &= ~VIRSTOR_PROVIDER_CURRENT; sc->curr_component = ++comp_no; *comp = &sc->components[comp_no]; if (comp_no >= sc->n_components - g_virstor_component_watermark-1) LOG_MSG(LVL_WARNING, "Device %s running out of components " "(switching to %u/%u: %s)", sc->geom->name, comp_no+1, sc->n_components, (*comp)->gcons->provider->name); /* Take care not to overwrite reserved chunks */ if ( (*comp)->chunk_reserved > 0 && (*comp)->chunk_next < (*comp)->chunk_reserved) (*comp)->chunk_next = (*comp)->chunk_reserved; (*comp)->flags |= VIRSTOR_PROVIDER_ALLOCATED | VIRSTOR_PROVIDER_CURRENT; dump_component(*comp); *comp_no_p = comp_no; *chunk = (*comp)->chunk_next++; } else { *comp_no_p = comp_no; *chunk = (*comp)->chunk_next++; } return (0); } /* Dump a component */ static void dump_component(struct g_virstor_component *comp) { if (g_virstor_debug < LVL_DEBUG2) return; printf("Component %d: %s\n", comp->index, comp->gcons->provider->name); printf(" chunk_count: %u\n", comp->chunk_count); printf(" chunk_next: %u\n", comp->chunk_next); printf(" flags: %u\n", comp->flags); } #if 0 /* Dump a map entry */ static void dump_me(struct virstor_map_entry *me, unsigned int nr) { if (g_virstor_debug < LVL_DEBUG) return; printf("VIRT. CHUNK #%d: ", nr); if ((me->flags & VIRSTOR_MAP_ALLOCATED) == 0) printf("(unallocated)\n"); else printf("allocated at provider %u, provider_chunk %u\n", me->provider_no, me->provider_chunk); } #endif /* * Dismantle bio_queue and destroy its components */ static void bioq_dismantle(struct bio_queue_head *bq) { struct bio *b; for (b = bioq_first(bq); b != NULL; b = bioq_first(bq)) { bioq_remove(bq, b); g_destroy_bio(b); } } /* * The function that shouldn't be called. * When this is called, the stack is already garbled because of * argument mismatch. There's nothing to do now but panic, which is * accidentally the whole purpose of this function. * Motivation: to guard from accidentally calling geom methods when * they shouldn't be called. (see g_..._taste) */ static void invalid_call(void) { panic("invalid_call() has just been called. Something's fishy here."); } DECLARE_GEOM_CLASS(g_virstor_class, g_virstor); /* Let there be light */ +MODULE_VERSION(geom_virstor, 0); Index: user/markj/netdump/sys/geom/zero/g_zero.c =================================================================== --- user/markj/netdump/sys/geom/zero/g_zero.c (revision 332407) +++ user/markj/netdump/sys/geom/zero/g_zero.c (revision 332408) @@ -1,145 +1,146 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2005 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #define G_ZERO_CLASS_NAME "ZERO" static int g_zero_clear_sysctl(SYSCTL_HANDLER_ARGS); SYSCTL_DECL(_kern_geom); static SYSCTL_NODE(_kern_geom, OID_AUTO, zero, CTLFLAG_RW, 0, "GEOM_ZERO stuff"); static int g_zero_clear = 1; SYSCTL_PROC(_kern_geom_zero, OID_AUTO, clear, CTLTYPE_INT|CTLFLAG_RW, &g_zero_clear, 0, g_zero_clear_sysctl, "I", "Clear read data buffer"); static int g_zero_byte = 0; SYSCTL_INT(_kern_geom_zero, OID_AUTO, byte, CTLFLAG_RW, &g_zero_byte, 0, "Byte (octet) value to clear the buffers with"); static struct g_provider *gpp; static int g_zero_clear_sysctl(SYSCTL_HANDLER_ARGS) { int error; error = sysctl_handle_int(oidp, &g_zero_clear, 0, req); if (error != 0 || req->newptr == NULL) return (error); if (gpp == NULL) return (ENXIO); if (g_zero_clear) gpp->flags &= ~G_PF_ACCEPT_UNMAPPED; else gpp->flags |= G_PF_ACCEPT_UNMAPPED; return (0); } static void g_zero_start(struct bio *bp) { int error = ENXIO; switch (bp->bio_cmd) { case BIO_READ: if (g_zero_clear && (bp->bio_flags & BIO_UNMAPPED) == 0) memset(bp->bio_data, g_zero_byte, bp->bio_length); /* FALLTHROUGH */ case BIO_DELETE: case BIO_WRITE: bp->bio_completed = bp->bio_length; error = 0; break; case BIO_GETATTR: default: error = EOPNOTSUPP; break; } g_io_deliver(bp, error); } static void g_zero_init(struct g_class *mp) { struct g_geom *gp; struct g_provider *pp; g_topology_assert(); gp = g_new_geomf(mp, "gzero"); gp->start = g_zero_start; gp->access = g_std_access; gpp = pp = g_new_providerf(gp, "%s", gp->name); pp->flags |= G_PF_DIRECT_SEND | G_PF_DIRECT_RECEIVE; if (!g_zero_clear) pp->flags |= G_PF_ACCEPT_UNMAPPED; pp->mediasize = 1152921504606846976LLU; pp->sectorsize = 512; g_error_provider(pp, 0); } static int g_zero_destroy_geom(struct gctl_req *req __unused, struct g_class *mp __unused, struct g_geom *gp) { struct g_provider *pp; g_topology_assert(); if (gp == NULL) return (0); pp = LIST_FIRST(&gp->provider); if (pp == NULL) return (0); if (pp->acr > 0 || pp->acw > 0 || pp->ace > 0) return (EBUSY); gpp = NULL; g_wither_geom(gp, ENXIO); return (0); } static struct g_class g_zero_class = { .name = G_ZERO_CLASS_NAME, .version = G_VERSION, .init = g_zero_init, .destroy_geom = g_zero_destroy_geom }; DECLARE_GEOM_CLASS(g_zero_class, g_zero); +MODULE_VERSION(geom_zero, 0); Index: user/markj/netdump/sys/kern/kern_environment.c =================================================================== --- user/markj/netdump/sys/kern/kern_environment.c (revision 332407) +++ user/markj/netdump/sys/kern/kern_environment.c (revision 332408) @@ -1,714 +1,714 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 1998 Michael Smith * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * The unified bootloader passes us a pointer to a preserved copy of * bootstrap/kernel environment variables. We convert them to a * dynamic array of strings later when the VM subsystem is up. * * We make these available through the kenv(2) syscall for userland * and through kern_getenv()/freeenv() kern_setenv() kern_unsetenv() testenv() for * the kernel. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static MALLOC_DEFINE(M_KENV, "kenv", "kernel environment"); #define KENV_SIZE 512 /* Maximum number of environment strings */ /* pointer to the static environment */ char *kern_envp; static int env_len; static int env_pos; static char *kernenv_next(char *); /* dynamic environment variables */ char **kenvp; struct mtx kenv_lock; /* * No need to protect this with a mutex since SYSINITS are single threaded. */ int dynamic_kenv = 0; #define KENV_CHECK if (!dynamic_kenv) \ panic("%s: called before SI_SUB_KMEM", __func__) int sys_kenv(td, uap) struct thread *td; struct kenv_args /* { int what; const char *name; char *value; int len; } */ *uap; { char *name, *value, *buffer = NULL; size_t len, done, needed, buflen; int error, i; KASSERT(dynamic_kenv, ("kenv: dynamic_kenv = 0")); error = 0; if (uap->what == KENV_DUMP) { #ifdef MAC error = mac_kenv_check_dump(td->td_ucred); if (error) return (error); #endif done = needed = 0; buflen = uap->len; if (buflen > KENV_SIZE * (KENV_MNAMELEN + KENV_MVALLEN + 2)) buflen = KENV_SIZE * (KENV_MNAMELEN + KENV_MVALLEN + 2); if (uap->len > 0 && uap->value != NULL) buffer = malloc(buflen, M_TEMP, M_WAITOK|M_ZERO); mtx_lock(&kenv_lock); for (i = 0; kenvp[i] != NULL; i++) { len = strlen(kenvp[i]) + 1; needed += len; len = min(len, buflen - done); /* * If called with a NULL or insufficiently large * buffer, just keep computing the required size. */ if (uap->value != NULL && buffer != NULL && len > 0) { bcopy(kenvp[i], buffer + done, len); done += len; } } mtx_unlock(&kenv_lock); if (buffer != NULL) { error = copyout(buffer, uap->value, done); free(buffer, M_TEMP); } td->td_retval[0] = ((done == needed) ? 
0 : needed); return (error); } switch (uap->what) { case KENV_SET: error = priv_check(td, PRIV_KENV_SET); if (error) return (error); break; case KENV_UNSET: error = priv_check(td, PRIV_KENV_UNSET); if (error) return (error); break; } name = malloc(KENV_MNAMELEN + 1, M_TEMP, M_WAITOK); error = copyinstr(uap->name, name, KENV_MNAMELEN + 1, NULL); if (error) goto done; switch (uap->what) { case KENV_GET: #ifdef MAC error = mac_kenv_check_get(td->td_ucred, name); if (error) goto done; #endif value = kern_getenv(name); if (value == NULL) { error = ENOENT; goto done; } len = strlen(value) + 1; if (len > uap->len) len = uap->len; error = copyout(value, uap->value, len); freeenv(value); if (error) goto done; td->td_retval[0] = len; break; case KENV_SET: len = uap->len; if (len < 1) { error = EINVAL; goto done; } if (len > KENV_MVALLEN + 1) len = KENV_MVALLEN + 1; value = malloc(len, M_TEMP, M_WAITOK); error = copyinstr(uap->value, value, len, NULL); if (error) { free(value, M_TEMP); goto done; } #ifdef MAC error = mac_kenv_check_set(td->td_ucred, name, value); if (error == 0) #endif kern_setenv(name, value); free(value, M_TEMP); break; case KENV_UNSET: #ifdef MAC error = mac_kenv_check_unset(td->td_ucred, name); if (error) goto done; #endif error = kern_unsetenv(name); if (error) error = ENOENT; break; default: error = EINVAL; break; } done: free(name, M_TEMP); return (error); } /* * Populate the initial kernel environment. * * This is called very early in MD startup, either to provide a copy of the * environment obtained from a boot loader, or to provide an empty buffer into * which MD code can store an initial environment using kern_setenv() calls. * * When a copy of an initial environment is passed in, we start by scanning that * env for overrides to the compiled-in envmode and hintmode variables. * * If the global envmode is 1, the environment is initialized from the global * static_env[], regardless of the arguments passed. This implements the env * keyword described in config(5). In this case env_pos is set to env_len, * causing kern_setenv() to return -1 (if len > 0) or panic (if len == 0) until * the dynamic environment is available. The envmode and static_env variables * are defined in env.c which is generated by config(8). * * If len is non-zero, the caller is providing an empty buffer. The caller will * subsequently use kern_setenv() to add up to len bytes of initial environment * before the dynamic environment is available. * * If len is zero, the caller is providing a pre-loaded buffer containing * environment strings. Additional strings cannot be added until the dynamic * environment is available. The memory pointed to must remain stable at least * until sysinit runs init_dynamic_kenv(). If no initial environment is * available from the boot loader, passing a NULL pointer allows the static_env * to be installed if it is configured. */ void init_static_kenv(char *buf, size_t len) { char *cp; for (cp = buf; cp != NULL && cp[0] != '\0'; cp += strlen(cp) + 1) { if (strcmp(cp, "static_env.disabled=1") == 0) envmode = 0; if (strcmp(cp, "static_hints.disabled=1") == 0) hintmode = 0; } if (envmode == 1) { kern_envp = static_env; env_len = len; env_pos = len; } else { kern_envp = buf; env_len = len; env_pos = 0; } } /* * Setup the dynamic kernel environment. 
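 *
 * A note on the KENV_DUMP handling in sys_kenv() above: it is the usual
 * two-pass sizing idiom.  Called with a NULL buffer it returns the number
 * of bytes needed; called with a large enough buffer it copies the whole
 * environment out as packed NUL-terminated "name=value" strings.  A
 * minimal userland consumer, as a sketch with error handling elided:
 */

#if 0
#include <kenv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
	char *buf, *cp;
	int len;

	len = kenv(KENV_DUMP, NULL, NULL, 0);	/* pass 1: size only */
	buf = malloc(len);
	kenv(KENV_DUMP, NULL, buf, len);	/* pass 2: fetch the strings */
	for (cp = buf; cp < buf + len && *cp != '\0'; cp += strlen(cp) + 1)
		printf("%s\n", cp);
	free(buf);
	return (0);
}
#endif

/*
 * Set up the dynamic kernel environment.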
*/ static void init_dynamic_kenv(void *data __unused) { char *cp, *cpnext; size_t len; int i; kenvp = malloc((KENV_SIZE + 1) * sizeof(char *), M_KENV, M_WAITOK | M_ZERO); i = 0; if (kern_envp && *kern_envp != '\0') { for (cp = kern_envp; cp != NULL; cp = cpnext) { cpnext = kernenv_next(cp); len = strlen(cp) + 1; if (len > KENV_MNAMELEN + 1 + KENV_MVALLEN + 1) { printf( "WARNING: too long kenv string, ignoring %s\n", cp); continue; } if (i < KENV_SIZE) { kenvp[i] = malloc(len, M_KENV, M_WAITOK); strcpy(kenvp[i++], cp); - memset(cp, 0, strlen(cp)); + explicit_bzero(cp, strlen(cp)); } else printf( "WARNING: too many kenv strings, ignoring %s\n", cp); } } kenvp[i] = NULL; mtx_init(&kenv_lock, "kernel environment", NULL, MTX_DEF); dynamic_kenv = 1; } SYSINIT(kenv, SI_SUB_KMEM, SI_ORDER_ANY, init_dynamic_kenv, NULL); void freeenv(char *env) { if (dynamic_kenv && env != NULL) { - memset(env, 0, strlen(env)); + explicit_bzero(env, strlen(env)); free(env, M_KENV); } } /* * Internal functions for string lookup. */ static char * _getenv_dynamic(const char *name, int *idx) { char *cp; int len, i; mtx_assert(&kenv_lock, MA_OWNED); len = strlen(name); for (cp = kenvp[0], i = 0; cp != NULL; cp = kenvp[++i]) { if ((strncmp(cp, name, len) == 0) && (cp[len] == '=')) { if (idx != NULL) *idx = i; return (cp + len + 1); } } return (NULL); } static char * _getenv_static(const char *name) { char *cp, *ep; int len; for (cp = kern_envp; cp != NULL; cp = kernenv_next(cp)) { for (ep = cp; (*ep != '=') && (*ep != 0); ep++) ; if (*ep != '=') continue; len = ep - cp; ep++; if (!strncmp(name, cp, len) && name[len] == 0) return (ep); } return (NULL); } /* * Look up an environment variable by name. * Return a pointer to the string if found. * The pointer has to be freed with freeenv() * after use. */ char * kern_getenv(const char *name) { char buf[KENV_MNAMELEN + 1 + KENV_MVALLEN + 1]; char *ret; if (dynamic_kenv) { if (getenv_string(name, buf, sizeof(buf))) { ret = strdup(buf, M_KENV); } else { ret = NULL; WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "getenv"); } } else ret = _getenv_static(name); return (ret); } /* * Test if an environment variable is defined. */ int testenv(const char *name) { char *cp; if (dynamic_kenv) { mtx_lock(&kenv_lock); cp = _getenv_dynamic(name, NULL); mtx_unlock(&kenv_lock); } else cp = _getenv_static(name); if (cp != NULL) return (1); return (0); } static int setenv_static(const char *name, const char *value) { int len; if (env_pos >= env_len) return (-1); /* Check space for x=y and two nuls */ len = strlen(name) + strlen(value); if (len + 3 < env_len - env_pos) { len = sprintf(&kern_envp[env_pos], "%s=%s", name, value); env_pos += len+1; kern_envp[env_pos] = '\0'; return (0); } else return (-1); } /* * Set an environment variable by name. 
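 *
 * First, why the memset() to explicit_bzero() conversions above matter: a
 * memset() of a buffer that is immediately freed can be elided by the
 * compiler, since the stores are provably never read back, and whatever
 * secrets the zeroing was meant to scrub may then survive in memory.
 * explicit_bzero() may not be optimized away.  A sketch of the pattern;
 * the wrapper name is invented for illustration:
 */

#if 0
#include <strings.h>
#include <string.h>
#include <stdlib.h>

static void
free_scrubbed(char *s)
{
	if (s != NULL) {
		explicit_bzero(s, strlen(s));	/* never elided */
		free(s);	/* a plain memset() before this could be */
	}
}
#endif

/*
 * Set an environment variable by name.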
*/ int kern_setenv(const char *name, const char *value) { char *buf, *cp, *oldenv; int namelen, vallen, i; if (dynamic_kenv == 0 && env_len > 0) return (setenv_static(name, value)); KENV_CHECK; namelen = strlen(name) + 1; if (namelen > KENV_MNAMELEN + 1) return (-1); vallen = strlen(value) + 1; if (vallen > KENV_MVALLEN + 1) return (-1); buf = malloc(namelen + vallen, M_KENV, M_WAITOK); sprintf(buf, "%s=%s", name, value); mtx_lock(&kenv_lock); cp = _getenv_dynamic(name, &i); if (cp != NULL) { oldenv = kenvp[i]; kenvp[i] = buf; mtx_unlock(&kenv_lock); free(oldenv, M_KENV); } else { /* We add the option if it wasn't found */ for (i = 0; (cp = kenvp[i]) != NULL; i++) ; /* Bounds checking */ if (i < 0 || i >= KENV_SIZE) { free(buf, M_KENV); mtx_unlock(&kenv_lock); return (-1); } kenvp[i] = buf; kenvp[i + 1] = NULL; mtx_unlock(&kenv_lock); } return (0); } /* * Unset an environment variable string. */ int kern_unsetenv(const char *name) { char *cp, *oldenv; int i, j; KENV_CHECK; mtx_lock(&kenv_lock); cp = _getenv_dynamic(name, &i); if (cp != NULL) { oldenv = kenvp[i]; for (j = i + 1; kenvp[j] != NULL; j++) kenvp[i++] = kenvp[j]; kenvp[i] = NULL; mtx_unlock(&kenv_lock); - memset(oldenv, 0, strlen(oldenv)); + explicit_bzero(oldenv, strlen(oldenv)); free(oldenv, M_KENV); return (0); } mtx_unlock(&kenv_lock); return (-1); } /* * Return a string value from an environment variable. */ int getenv_string(const char *name, char *data, int size) { char *cp; if (dynamic_kenv) { mtx_lock(&kenv_lock); cp = _getenv_dynamic(name, NULL); if (cp != NULL) strlcpy(data, cp, size); mtx_unlock(&kenv_lock); } else { cp = _getenv_static(name); if (cp != NULL) strlcpy(data, cp, size); } return (cp != NULL); } /* * Return an integer value from an environment variable. */ int getenv_int(const char *name, int *data) { quad_t tmp; int rval; rval = getenv_quad(name, &tmp); if (rval) *data = (int) tmp; return (rval); } /* * Return an unsigned integer value from an environment variable. */ int getenv_uint(const char *name, unsigned int *data) { quad_t tmp; int rval; rval = getenv_quad(name, &tmp); if (rval) *data = (unsigned int) tmp; return (rval); } /* * Return an int64_t value from an environment variable. */ int getenv_int64(const char *name, int64_t *data) { quad_t tmp; int64_t rval; rval = getenv_quad(name, &tmp); if (rval) *data = (int64_t) tmp; return (rval); } /* * Return an uint64_t value from an environment variable. */ int getenv_uint64(const char *name, uint64_t *data) { quad_t tmp; uint64_t rval; rval = getenv_quad(name, &tmp); if (rval) *data = (uint64_t) tmp; return (rval); } /* * Return a long value from an environment variable. */ int getenv_long(const char *name, long *data) { quad_t tmp; int rval; rval = getenv_quad(name, &tmp); if (rval) *data = (long) tmp; return (rval); } /* * Return an unsigned long value from an environment variable. */ int getenv_ulong(const char *name, unsigned long *data) { quad_t tmp; int rval; rval = getenv_quad(name, &tmp); if (rval) *data = (unsigned long) tmp; return (rval); } /* * Return a quad_t value from an environment variable. 
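 *
 * The value parsed by getenv_quad() below may carry one size suffix, and
 * the fall-through cases each multiply by a further 1024: "16k" yields
 * 16384, "16m" yields 16777216, and 'g' and 't' continue the pattern.
 * The typed wrappers above (getenv_int(), getenv_ulong(), ...) all funnel
 * through it, so they inherit the suffixes too.  A hypothetical call
 * site; the tunable name is invented for illustration:
 */

#if 0
	quad_t reserve;

	/* With example.cache.reserve="64m" in the kenv: */
	if (getenv_quad("example.cache.reserve", &reserve))
		printf("%jd\n", (intmax_t)reserve);	/* prints 67108864 */
#endif

/*
 * Return a quad_t value from an environment variable.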
*/ int getenv_quad(const char *name, quad_t *data) { char value[KENV_MNAMELEN + 1 + KENV_MVALLEN + 1]; char *vtp; quad_t iv; if (!getenv_string(name, value, sizeof(value))) return (0); iv = strtoq(value, &vtp, 0); if (vtp == value || (vtp[0] != '\0' && vtp[1] != '\0')) return (0); switch (vtp[0]) { case 't': case 'T': iv *= 1024; case 'g': case 'G': iv *= 1024; case 'm': case 'M': iv *= 1024; case 'k': case 'K': iv *= 1024; case '\0': break; default: return (0); } *data = iv; return (1); } /* * Find the next entry after the one which (cp) falls within, return a * pointer to its start or NULL if there are no more. */ static char * kernenv_next(char *cp) { if (cp != NULL) { while (*cp != 0) cp++; cp++; if (*cp == 0) cp = NULL; } return (cp); } void tunable_int_init(void *data) { struct tunable_int *d = (struct tunable_int *)data; TUNABLE_INT_FETCH(d->path, d->var); } void tunable_long_init(void *data) { struct tunable_long *d = (struct tunable_long *)data; TUNABLE_LONG_FETCH(d->path, d->var); } void tunable_ulong_init(void *data) { struct tunable_ulong *d = (struct tunable_ulong *)data; TUNABLE_ULONG_FETCH(d->path, d->var); } void tunable_int64_init(void *data) { struct tunable_int64 *d = (struct tunable_int64 *)data; TUNABLE_INT64_FETCH(d->path, d->var); } void tunable_uint64_init(void *data) { struct tunable_uint64 *d = (struct tunable_uint64 *)data; TUNABLE_UINT64_FETCH(d->path, d->var); } void tunable_quad_init(void *data) { struct tunable_quad *d = (struct tunable_quad *)data; TUNABLE_QUAD_FETCH(d->path, d->var); } void tunable_str_init(void *data) { struct tunable_str *d = (struct tunable_str *)data; TUNABLE_STR_FETCH(d->path, d->var, d->size); } Index: user/markj/netdump/sys/kern/kern_rwlock.c =================================================================== --- user/markj/netdump/sys/kern/kern_rwlock.c (revision 332407) +++ user/markj/netdump/sys/kern/kern_rwlock.c (revision 332408) @@ -1,1489 +1,1498 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2006 John Baldwin * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * Machine independent bits of reader/writer lock implementation. 
 */
#include
__FBSDID("$FreeBSD$");

#include "opt_ddb.h"
#include "opt_hwpmc_hooks.h"
#include "opt_no_adaptive_rwlocks.h"

#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#if defined(SMP) && !defined(NO_ADAPTIVE_RWLOCKS)
#define	ADAPTIVE_RWLOCKS
#endif

#ifdef HWPMC_HOOKS
#include
PMC_SOFT_DECLARE( , , lock, failed);
#endif

/*
 * Return the rwlock address when the lock cookie address is provided.
 * This functionality assumes that struct rwlock * has a member named rw_lock.
 */
#define	rwlock2rw(c)	(__containerof(c, struct rwlock, rw_lock))

#ifdef DDB
#include

static void	db_show_rwlock(const struct lock_object *lock);
#endif
static void	assert_rw(const struct lock_object *lock, int what);
static void	lock_rw(struct lock_object *lock, uintptr_t how);
#ifdef KDTRACE_HOOKS
static int	owner_rw(const struct lock_object *lock, struct thread **owner);
#endif
static uintptr_t unlock_rw(struct lock_object *lock);

struct lock_class lock_class_rw = {
	.lc_name = "rw",
	.lc_flags = LC_SLEEPLOCK | LC_RECURSABLE | LC_UPGRADABLE,
	.lc_assert = assert_rw,
#ifdef DDB
	.lc_ddb_show = db_show_rwlock,
#endif
	.lc_lock = lock_rw,
	.lc_unlock = unlock_rw,
#ifdef KDTRACE_HOOKS
	.lc_owner = owner_rw,
#endif
};

#ifdef ADAPTIVE_RWLOCKS
-static int __read_frequently rowner_retries = 10;
-static int __read_frequently rowner_loops = 10000;
+static int __read_frequently rowner_retries;
+static int __read_frequently rowner_loops;
static SYSCTL_NODE(_debug, OID_AUTO, rwlock, CTLFLAG_RD, NULL,
    "rwlock debugging");
SYSCTL_INT(_debug_rwlock, OID_AUTO, retry, CTLFLAG_RW, &rowner_retries, 0, "");
SYSCTL_INT(_debug_rwlock, OID_AUTO, loops, CTLFLAG_RW, &rowner_loops, 0, "");

static struct lock_delay_config __read_frequently rw_delay;

SYSCTL_INT(_debug_rwlock, OID_AUTO, delay_base, CTLFLAG_RW, &rw_delay.base,
    0, "");
SYSCTL_INT(_debug_rwlock, OID_AUTO, delay_max, CTLFLAG_RW, &rw_delay.max,
    0, "");

-LOCK_DELAY_SYSINIT_DEFAULT(rw_delay);
+static void
+rw_lock_delay_init(void *arg __unused)
+{
+
+	lock_delay_default_init(&rw_delay);
+	rowner_retries = 10;
+	rowner_loops = max(10000, rw_delay.max);
+}
+LOCK_DELAY_SYSINIT(rw_lock_delay_init);
#endif

/*
 * Return a pointer to the owning thread if the lock is write-locked or
 * NULL if the lock is unlocked or read-locked.
 */
#define	lv_rw_wowner(v)						\
	((v) & RW_LOCK_READ ? NULL :				\
	 (struct thread *)RW_OWNER((v)))

#define	rw_wowner(rw)	lv_rw_wowner(RW_READ_VALUE(rw))

/*
 * Returns true if a write owner is recursed.  Write ownership is not assured
 * here and should be previously checked.
 */
#define	rw_recursed(rw)	((rw)->rw_recurse != 0)

/*
 * Return true if curthread holds the lock.
 */
#define	rw_wlocked(rw)	(rw_wowner((rw)) == curthread)

/*
 * Return a pointer to the owning thread for this lock who should receive
 * any priority lent by threads that block on this lock.  Currently this
 * is identical to rw_wowner().
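 *
 * The macros above all lean on the lock word encoding: the low
 * RW_LOCK_READ bit distinguishes reader state from writer state, so a
 * single uintptr_t holds either a reader count plus flag bits or the
 * owning thread pointer (thread pointers are aligned, which frees the
 * low bits).  A decode sketch under that assumption; the names and the
 * flag mask are illustrative, not the real definitions:
 */

#if 0
#include <stddef.h>
#include <stdint.h>

#define	TOY_LOCK_READ	0x01UL		/* stand-in for RW_LOCK_READ */
#define	TOY_FLAG_MASK	0x0fUL		/* stand-in for the flag bits */

struct toy_thread;

static struct toy_thread *
toy_wowner(uintptr_t v)
{
	if (v & TOY_LOCK_READ)
		return (NULL);		/* unlocked or read-locked */
	return ((struct toy_thread *)(v & ~TOY_FLAG_MASK));
}
#endif

/*
 * Return the thread that should receive any priority lent by threads
 * blocking on this lock; currently identical to rw_wowner().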
*/ #define rw_owner(rw) rw_wowner(rw) #ifndef INVARIANTS #define __rw_assert(c, what, file, line) #endif void assert_rw(const struct lock_object *lock, int what) { rw_assert((const struct rwlock *)lock, what); } void lock_rw(struct lock_object *lock, uintptr_t how) { struct rwlock *rw; rw = (struct rwlock *)lock; if (how) rw_rlock(rw); else rw_wlock(rw); } uintptr_t unlock_rw(struct lock_object *lock) { struct rwlock *rw; rw = (struct rwlock *)lock; rw_assert(rw, RA_LOCKED | LA_NOTRECURSED); if (rw->rw_lock & RW_LOCK_READ) { rw_runlock(rw); return (1); } else { rw_wunlock(rw); return (0); } } #ifdef KDTRACE_HOOKS int owner_rw(const struct lock_object *lock, struct thread **owner) { const struct rwlock *rw = (const struct rwlock *)lock; uintptr_t x = rw->rw_lock; *owner = rw_wowner(rw); return ((x & RW_LOCK_READ) != 0 ? (RW_READERS(x) != 0) : (*owner != NULL)); } #endif void _rw_init_flags(volatile uintptr_t *c, const char *name, int opts) { struct rwlock *rw; int flags; rw = rwlock2rw(c); MPASS((opts & ~(RW_DUPOK | RW_NOPROFILE | RW_NOWITNESS | RW_QUIET | RW_RECURSE | RW_NEW)) == 0); ASSERT_ATOMIC_LOAD_PTR(rw->rw_lock, ("%s: rw_lock not aligned for %s: %p", __func__, name, &rw->rw_lock)); flags = LO_UPGRADABLE; if (opts & RW_DUPOK) flags |= LO_DUPOK; if (opts & RW_NOPROFILE) flags |= LO_NOPROFILE; if (!(opts & RW_NOWITNESS)) flags |= LO_WITNESS; if (opts & RW_RECURSE) flags |= LO_RECURSABLE; if (opts & RW_QUIET) flags |= LO_QUIET; if (opts & RW_NEW) flags |= LO_NEW; lock_init(&rw->lock_object, &lock_class_rw, name, NULL, flags); rw->rw_lock = RW_UNLOCKED; rw->rw_recurse = 0; } void _rw_destroy(volatile uintptr_t *c) { struct rwlock *rw; rw = rwlock2rw(c); KASSERT(rw->rw_lock == RW_UNLOCKED, ("rw lock %p not unlocked", rw)); KASSERT(rw->rw_recurse == 0, ("rw lock %p still recursed", rw)); rw->rw_lock = RW_DESTROYED; lock_destroy(&rw->lock_object); } void rw_sysinit(void *arg) { struct rw_args *args; args = arg; rw_init_flags((struct rwlock *)args->ra_rw, args->ra_desc, args->ra_flags); } int _rw_wowned(const volatile uintptr_t *c) { return (rw_wowner(rwlock2rw(c)) == curthread); } void _rw_wlock_cookie(volatile uintptr_t *c, const char *file, int line) { struct rwlock *rw; uintptr_t tid, v; rw = rwlock2rw(c); KASSERT(kdb_active != 0 || SCHEDULER_STOPPED() || !TD_IS_IDLETHREAD(curthread), ("rw_wlock() by idle thread %p on rwlock %s @ %s:%d", curthread, rw->lock_object.lo_name, file, line)); KASSERT(rw->rw_lock != RW_DESTROYED, ("rw_wlock() of destroyed rwlock @ %s:%d", file, line)); WITNESS_CHECKORDER(&rw->lock_object, LOP_NEWORDER | LOP_EXCLUSIVE, file, line, NULL); tid = (uintptr_t)curthread; v = RW_UNLOCKED; if (!_rw_write_lock_fetch(rw, &v, tid)) _rw_wlock_hard(rw, v, file, line); else LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw, 0, 0, file, line, LOCKSTAT_WRITER); LOCK_LOG_LOCK("WLOCK", &rw->lock_object, 0, rw->rw_recurse, file, line); WITNESS_LOCK(&rw->lock_object, LOP_EXCLUSIVE, file, line); TD_LOCKS_INC(curthread); } int __rw_try_wlock_int(struct rwlock *rw LOCK_FILE_LINE_ARG_DEF) { struct thread *td; uintptr_t tid, v; int rval; bool recursed; td = curthread; tid = (uintptr_t)td; if (SCHEDULER_STOPPED_TD(td)) return (1); KASSERT(kdb_active != 0 || !TD_IS_IDLETHREAD(td), ("rw_try_wlock() by idle thread %p on rwlock %s @ %s:%d", curthread, rw->lock_object.lo_name, file, line)); KASSERT(rw->rw_lock != RW_DESTROYED, ("rw_try_wlock() of destroyed rwlock @ %s:%d", file, line)); rval = 1; recursed = false; v = RW_UNLOCKED; for (;;) { if (atomic_fcmpset_acq_ptr(&rw->rw_lock, &v, 
tid)) break; if (v == RW_UNLOCKED) continue; if (v == tid && (rw->lock_object.lo_flags & LO_RECURSABLE)) { rw->rw_recurse++; atomic_set_ptr(&rw->rw_lock, RW_LOCK_WRITER_RECURSED); break; } rval = 0; break; } LOCK_LOG_TRY("WLOCK", &rw->lock_object, 0, rval, file, line); if (rval) { WITNESS_LOCK(&rw->lock_object, LOP_EXCLUSIVE | LOP_TRYLOCK, file, line); if (!recursed) LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw, 0, 0, file, line, LOCKSTAT_WRITER); TD_LOCKS_INC(curthread); } return (rval); } int __rw_try_wlock(volatile uintptr_t *c, const char *file, int line) { struct rwlock *rw; rw = rwlock2rw(c); return (__rw_try_wlock_int(rw LOCK_FILE_LINE_ARG)); } void _rw_wunlock_cookie(volatile uintptr_t *c, const char *file, int line) { struct rwlock *rw; rw = rwlock2rw(c); KASSERT(rw->rw_lock != RW_DESTROYED, ("rw_wunlock() of destroyed rwlock @ %s:%d", file, line)); __rw_assert(c, RA_WLOCKED, file, line); WITNESS_UNLOCK(&rw->lock_object, LOP_EXCLUSIVE, file, line); LOCK_LOG_LOCK("WUNLOCK", &rw->lock_object, 0, rw->rw_recurse, file, line); #ifdef LOCK_PROFILING _rw_wunlock_hard(rw, (uintptr_t)curthread, file, line); #else __rw_wunlock(rw, curthread, file, line); #endif TD_LOCKS_DEC(curthread); } /* * Determines whether a new reader can acquire a lock. Succeeds if the * reader already owns a read lock and the lock is locked for read to * prevent deadlock from reader recursion. Also succeeds if the lock * is unlocked and has no writer waiters or spinners. Failing otherwise * prioritizes writers before readers. */ static bool __always_inline __rw_can_read(struct thread *td, uintptr_t v, bool fp) { if ((v & (RW_LOCK_READ | RW_LOCK_WRITE_WAITERS | RW_LOCK_WRITE_SPINNER)) == RW_LOCK_READ) return (true); if (!fp && td->td_rw_rlocks && (v & RW_LOCK_READ)) return (true); return (false); } static bool __always_inline __rw_rlock_try(struct rwlock *rw, struct thread *td, uintptr_t *vp, bool fp LOCK_FILE_LINE_ARG_DEF) { /* * Handle the easy case. If no other thread has a write * lock, then try to bump up the count of read locks. Note * that we have to preserve the current state of the * RW_LOCK_WRITE_WAITERS flag. If we fail to acquire a * read lock, then rw_lock must have changed, so restart * the loop. Note that this handles the case of a * completely unlocked rwlock since such a lock is encoded * as a read lock with no waiters. 
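 *
 * Note the atomic_fcmpset_*() idiom used here and throughout the file:
 * unlike atomic_cmpset_*(), a failed compare writes the freshly observed
 * lock word back through the pointer, so the retry loop needs no separate
 * reload.  The same loop shape in portable C11 atomics, as a sketch with
 * an invented word layout and predicate:
 */

#if 0
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define	TOY_ONE_READER	0x10UL	/* assumed: count kept above 4 flag bits */

static bool
toy_can_read(uintptr_t v)
{
	return ((v & 0x01UL) != 0);	/* assumed RW_LOCK_READ analogue */
}

static bool
toy_rlock_try(_Atomic uintptr_t *lockp, uintptr_t *vp)
{
	/* compare_exchange updates *vp on failure, like fcmpset with &v */
	while (toy_can_read(*vp))
		if (atomic_compare_exchange_weak(lockp, vp,
		    *vp + TOY_ONE_READER))
			return (true);
	return (false);
}
#endif

/*
 * Fast path: keep bumping the reader count while __rw_can_read() allows.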
*/ while (__rw_can_read(td, *vp, fp)) { if (atomic_fcmpset_acq_ptr(&rw->rw_lock, vp, *vp + RW_ONE_READER)) { if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR4(KTR_LOCK, "%s: %p succeed %p -> %p", __func__, rw, (void *)*vp, (void *)(*vp + RW_ONE_READER)); td->td_rw_rlocks++; return (true); } } return (false); } static void __noinline __rw_rlock_hard(struct rwlock *rw, struct thread *td, uintptr_t v LOCK_FILE_LINE_ARG_DEF) { struct turnstile *ts; struct thread *owner; #ifdef ADAPTIVE_RWLOCKS int spintries = 0; int i, n; #endif #ifdef LOCK_PROFILING uint64_t waittime = 0; int contested = 0; #endif #if defined(ADAPTIVE_RWLOCKS) || defined(KDTRACE_HOOKS) struct lock_delay_arg lda; #endif #ifdef KDTRACE_HOOKS u_int sleep_cnt = 0; int64_t sleep_time = 0; int64_t all_time = 0; #endif #if defined(KDTRACE_HOOKS) || defined(LOCK_PROFILING) uintptr_t state; int doing_lockprof = 0; #endif #ifdef KDTRACE_HOOKS if (LOCKSTAT_PROFILE_ENABLED(rw__acquire)) { if (__rw_rlock_try(rw, td, &v, false LOCK_FILE_LINE_ARG)) goto out_lockstat; doing_lockprof = 1; all_time -= lockstat_nsecs(&rw->lock_object); state = v; } #endif #ifdef LOCK_PROFILING doing_lockprof = 1; state = v; #endif if (SCHEDULER_STOPPED()) return; #if defined(ADAPTIVE_RWLOCKS) lock_delay_arg_init(&lda, &rw_delay); #elif defined(KDTRACE_HOOKS) lock_delay_arg_init(&lda, NULL); #endif #ifdef HWPMC_HOOKS PMC_SOFT_CALL( , , lock, failed); #endif lock_profile_obtain_lock_failed(&rw->lock_object, &contested, &waittime); for (;;) { if (__rw_rlock_try(rw, td, &v, false LOCK_FILE_LINE_ARG)) break; #ifdef KDTRACE_HOOKS lda.spin_cnt++; #endif #ifdef ADAPTIVE_RWLOCKS /* * If the owner is running on another CPU, spin until * the owner stops running or the state of the lock * changes. */ if ((v & RW_LOCK_READ) == 0) { owner = (struct thread *)RW_OWNER(v); if (TD_IS_RUNNING(owner)) { if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR3(KTR_LOCK, "%s: spinning on %p held by %p", __func__, rw, owner); KTR_STATE1(KTR_SCHED, "thread", sched_tdname(curthread), "spinning", "lockname:\"%s\"", rw->lock_object.lo_name); do { lock_delay(&lda); v = RW_READ_VALUE(rw); owner = lv_rw_wowner(v); } while (owner != NULL && TD_IS_RUNNING(owner)); KTR_STATE0(KTR_SCHED, "thread", sched_tdname(curthread), "running"); continue; } } else if (spintries < rowner_retries) { spintries++; KTR_STATE1(KTR_SCHED, "thread", sched_tdname(curthread), "spinning", "lockname:\"%s\"", rw->lock_object.lo_name); for (i = 0; i < rowner_loops; i += n) { n = RW_READERS(v); lock_delay_spin(n); v = RW_READ_VALUE(rw); if ((v & RW_LOCK_READ) == 0 || __rw_can_read(td, v, false)) break; } #ifdef KDTRACE_HOOKS lda.spin_cnt += rowner_loops - i; #endif KTR_STATE0(KTR_SCHED, "thread", sched_tdname(curthread), "running"); if (i < rowner_loops) continue; } #endif /* * Okay, now it's the hard case. Some other thread already * has a write lock or there are write waiters present, * acquire the turnstile lock so we can begin the process * of blocking. */ ts = turnstile_trywait(&rw->lock_object); /* * The lock might have been released while we spun, so * recheck its state and restart the loop if needed. */ v = RW_READ_VALUE(rw); retry_ts: if (__rw_can_read(td, v, false)) { turnstile_cancel(ts); continue; } owner = lv_rw_wowner(v); #ifdef ADAPTIVE_RWLOCKS /* * The current lock owner might have started executing * on another CPU (or the lock could have changed * owners) while we were waiting on the turnstile * chain lock. If so, drop the turnstile lock and try * again. 
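 *
 * This is the classic "prepare to sleep, then recheck" protocol:
 * turnstile_trywait() publishes the intent to block, the lock word is
 * examined once more, and the thread either backs out with
 * turnstile_cancel() or commits with turnstile_wait().  Compressed to a
 * skeleton (helpers invented; the real loop above also exits once the
 * lock is acquired):
 */

#if 0
	for (;;) {
		ts = turnstile_trywait(&rw->lock_object);
		v = RW_READ_VALUE(rw);
		if (can_still_acquire(v)) {	/* invented predicate */
			turnstile_cancel(ts);	/* raced: retry fast path */
			continue;
		}
		turnstile_wait(ts, owner_of(v), TS_SHARED_QUEUE);
		v = RW_READ_VALUE(rw);		/* woken: re-evaluate */
	}
#endif

/*
 * Recheck the owner now that the turnstile chain lock is held.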
*/ if (owner != NULL) { if (TD_IS_RUNNING(owner)) { turnstile_cancel(ts); continue; } } #endif /* * The lock is held in write mode or it already has waiters. */ MPASS(!__rw_can_read(td, v, false)); /* * If the RW_LOCK_READ_WAITERS flag is already set, then * we can go ahead and block. If it is not set then try * to set it. If we fail to set it drop the turnstile * lock and restart the loop. */ if (!(v & RW_LOCK_READ_WAITERS)) { if (!atomic_fcmpset_ptr(&rw->rw_lock, &v, v | RW_LOCK_READ_WAITERS)) goto retry_ts; if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p set read waiters flag", __func__, rw); } /* * We were unable to acquire the lock and the read waiters * flag is set, so we must block on the turnstile. */ if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p blocking on turnstile", __func__, rw); #ifdef KDTRACE_HOOKS sleep_time -= lockstat_nsecs(&rw->lock_object); #endif MPASS(owner == rw_owner(rw)); turnstile_wait(ts, owner, TS_SHARED_QUEUE); #ifdef KDTRACE_HOOKS sleep_time += lockstat_nsecs(&rw->lock_object); sleep_cnt++; #endif if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p resuming from turnstile", __func__, rw); v = RW_READ_VALUE(rw); } #if defined(KDTRACE_HOOKS) || defined(LOCK_PROFILING) if (__predict_true(!doing_lockprof)) return; #endif #ifdef KDTRACE_HOOKS all_time += lockstat_nsecs(&rw->lock_object); if (sleep_time) LOCKSTAT_RECORD4(rw__block, rw, sleep_time, LOCKSTAT_READER, (state & RW_LOCK_READ) == 0, (state & RW_LOCK_READ) == 0 ? 0 : RW_READERS(state)); /* Record only the loops spinning and not sleeping. */ if (lda.spin_cnt > sleep_cnt) LOCKSTAT_RECORD4(rw__spin, rw, all_time - sleep_time, LOCKSTAT_READER, (state & RW_LOCK_READ) == 0, (state & RW_LOCK_READ) == 0 ? 0 : RW_READERS(state)); out_lockstat: #endif /* * TODO: acquire "owner of record" here. Here be turnstile dragons * however. turnstiles don't like owners changing between calls to * turnstile_wait() currently. 
*/ LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw, contested, waittime, file, line, LOCKSTAT_READER); } void __rw_rlock_int(struct rwlock *rw LOCK_FILE_LINE_ARG_DEF) { struct thread *td; uintptr_t v; td = curthread; KASSERT(kdb_active != 0 || SCHEDULER_STOPPED_TD(td) || !TD_IS_IDLETHREAD(td), ("rw_rlock() by idle thread %p on rwlock %s @ %s:%d", td, rw->lock_object.lo_name, file, line)); KASSERT(rw->rw_lock != RW_DESTROYED, ("rw_rlock() of destroyed rwlock @ %s:%d", file, line)); KASSERT(rw_wowner(rw) != td, ("rw_rlock: wlock already held for %s @ %s:%d", rw->lock_object.lo_name, file, line)); WITNESS_CHECKORDER(&rw->lock_object, LOP_NEWORDER, file, line, NULL); v = RW_READ_VALUE(rw); if (__predict_false(LOCKSTAT_PROFILE_ENABLED(rw__acquire) || !__rw_rlock_try(rw, td, &v, true LOCK_FILE_LINE_ARG))) __rw_rlock_hard(rw, td, v LOCK_FILE_LINE_ARG); else lock_profile_obtain_lock_success(&rw->lock_object, 0, 0, file, line); LOCK_LOG_LOCK("RLOCK", &rw->lock_object, 0, 0, file, line); WITNESS_LOCK(&rw->lock_object, 0, file, line); TD_LOCKS_INC(curthread); } void __rw_rlock(volatile uintptr_t *c, const char *file, int line) { struct rwlock *rw; rw = rwlock2rw(c); __rw_rlock_int(rw LOCK_FILE_LINE_ARG); } int __rw_try_rlock_int(struct rwlock *rw LOCK_FILE_LINE_ARG_DEF) { uintptr_t x; if (SCHEDULER_STOPPED()) return (1); KASSERT(kdb_active != 0 || !TD_IS_IDLETHREAD(curthread), ("rw_try_rlock() by idle thread %p on rwlock %s @ %s:%d", curthread, rw->lock_object.lo_name, file, line)); x = rw->rw_lock; for (;;) { KASSERT(rw->rw_lock != RW_DESTROYED, ("rw_try_rlock() of destroyed rwlock @ %s:%d", file, line)); if (!(x & RW_LOCK_READ)) break; if (atomic_fcmpset_acq_ptr(&rw->rw_lock, &x, x + RW_ONE_READER)) { LOCK_LOG_TRY("RLOCK", &rw->lock_object, 0, 1, file, line); WITNESS_LOCK(&rw->lock_object, LOP_TRYLOCK, file, line); LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw, 0, 0, file, line, LOCKSTAT_READER); TD_LOCKS_INC(curthread); curthread->td_rw_rlocks++; return (1); } } LOCK_LOG_TRY("RLOCK", &rw->lock_object, 0, 0, file, line); return (0); } int __rw_try_rlock(volatile uintptr_t *c, const char *file, int line) { struct rwlock *rw; rw = rwlock2rw(c); return (__rw_try_rlock_int(rw LOCK_FILE_LINE_ARG)); } static bool __always_inline __rw_runlock_try(struct rwlock *rw, struct thread *td, uintptr_t *vp) { for (;;) { /* * See if there is more than one read lock held. If so, * just drop one and return. */ if (RW_READERS(*vp) > 1) { if (atomic_fcmpset_rel_ptr(&rw->rw_lock, vp, *vp - RW_ONE_READER)) { if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR4(KTR_LOCK, "%s: %p succeeded %p -> %p", __func__, rw, (void *)*vp, (void *)(*vp - RW_ONE_READER)); td->td_rw_rlocks--; return (true); } continue; } /* * If there aren't any waiters for a write lock, then try * to drop it quickly. */ if (!(*vp & RW_LOCK_WAITERS)) { MPASS((*vp & ~RW_LOCK_WRITE_SPINNER) == RW_READERS_LOCK(1)); if (atomic_fcmpset_rel_ptr(&rw->rw_lock, vp, RW_UNLOCKED)) { if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p last succeeded", __func__, rw); td->td_rw_rlocks--; return (true); } continue; } break; } return (false); } static void __noinline __rw_runlock_hard(struct rwlock *rw, struct thread *td, uintptr_t v LOCK_FILE_LINE_ARG_DEF) { struct turnstile *ts; uintptr_t setv, queue; if (SCHEDULER_STOPPED()) return; if (__rw_runlock_try(rw, td, &v)) goto out_lockstat; /* * Ok, we know we have waiters and we think we are the * last reader, so grab the turnstile lock. 
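 *
 * __rw_runlock_try() above captured the two cheap exits: with more than
 * one reader the release is a bare decrement, and a single reader with no
 * waiters swaps straight to RW_UNLOCKED; only "last reader with waiters"
 * pays for a turnstile.  The decision table as a sketch in C11 atomics,
 * with an invented word layout (count above four flag bits):
 */

#if 0
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define	TOY_ONE_READER	0x10UL
#define	TOY_WAITERS	0x06UL	/* read-waiters | write-waiters, assumed */
#define	TOY_UNLOCKED	0x01UL	/* "read-locked, zero readers", assumed */

static bool
toy_runlock_try(_Atomic uintptr_t *lockp, uintptr_t *vp)
{
	for (;;) {
		if (*vp >= 2 * TOY_ONE_READER) {
			/* More than one reader: just drop one. */
			if (atomic_compare_exchange_weak(lockp, vp,
			    *vp - TOY_ONE_READER))
				return (true);
			continue;
		}
		if ((*vp & TOY_WAITERS) == 0) {
			/* Last reader and nobody waits: release outright. */
			if (atomic_compare_exchange_weak(lockp, vp,
			    TOY_UNLOCKED))
				return (true);
			continue;
		}
		return (false);	/* last reader with waiters: slow path */
	}
}
#endif

/*
 * Last reader with waiters: grab the turnstile chain lock and hand off.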
 */
	turnstile_chain_lock(&rw->lock_object);
	v = RW_READ_VALUE(rw);
	for (;;) {
		if (__rw_runlock_try(rw, td, &v))
			break;

		v &= (RW_LOCK_WAITERS | RW_LOCK_WRITE_SPINNER);
		MPASS(v & RW_LOCK_WAITERS);

		/*
		 * Try to drop our lock leaving the lock in an unlocked
		 * state.
		 *
		 * If you wanted to do explicit lock handoff you'd have to
		 * do it here.  You'd also want to use turnstile_signal()
		 * and you'd have to handle the race where a higher
		 * priority thread blocks on the write lock before the
		 * thread you wake up actually runs and have the new thread
		 * "steal" the lock.  For now it's a lot simpler to just
		 * wake up all of the waiters.
		 *
		 * As above, if we fail, then another thread might have
		 * acquired a read lock, so drop the turnstile lock and
		 * restart.
		 */
		setv = RW_UNLOCKED;
		queue = TS_SHARED_QUEUE;
		if (v & RW_LOCK_WRITE_WAITERS) {
			queue = TS_EXCLUSIVE_QUEUE;
			setv |= (v & RW_LOCK_READ_WAITERS);
		}
		v |= RW_READERS_LOCK(1);
		if (!atomic_fcmpset_rel_ptr(&rw->rw_lock, &v, setv))
			continue;
		if (LOCK_LOG_TEST(&rw->lock_object, 0))
			CTR2(KTR_LOCK, "%s: %p last succeeded with waiters",
			    __func__, rw);

		/*
		 * Ok.  The lock is released and all that's left is to
		 * wake up the waiters.  Note that the lock might not be
		 * free anymore, but in that case the writers will just
		 * block again if they run before the new lock holder(s)
		 * release the lock.
		 */
		ts = turnstile_lookup(&rw->lock_object);
		MPASS(ts != NULL);
		turnstile_broadcast(ts, queue);
		turnstile_unpend(ts, TS_SHARED_LOCK);
		td->td_rw_rlocks--;
		break;
	}
	turnstile_chain_unlock(&rw->lock_object);
out_lockstat:
	LOCKSTAT_PROFILE_RELEASE_RWLOCK(rw__release, rw, LOCKSTAT_READER);
}

void
_rw_runlock_cookie_int(struct rwlock *rw LOCK_FILE_LINE_ARG_DEF)
{
	struct thread *td;
	uintptr_t v;

	KASSERT(rw->rw_lock != RW_DESTROYED,
	    ("rw_runlock() of destroyed rwlock @ %s:%d", file, line));
	__rw_assert(&rw->rw_lock, RA_RLOCKED, file, line);
	WITNESS_UNLOCK(&rw->lock_object, 0, file, line);
	LOCK_LOG_LOCK("RUNLOCK", &rw->lock_object, 0, 0, file, line);

	td = curthread;
	v = RW_READ_VALUE(rw);

	if (__predict_false(LOCKSTAT_PROFILE_ENABLED(rw__release) ||
	    !__rw_runlock_try(rw, td, &v)))
		__rw_runlock_hard(rw, td, v LOCK_FILE_LINE_ARG);
	else
		lock_profile_release_lock(&rw->lock_object);

	TD_LOCKS_DEC(curthread);
}

void
_rw_runlock_cookie(volatile uintptr_t *c, const char *file, int line)
{
	struct rwlock *rw;

	rw = rwlock2rw(c);
	_rw_runlock_cookie_int(rw LOCK_FILE_LINE_ARG);
}

/*
 * This function is called when we are unable to obtain a write lock on the
 * first try.  This means that at least one other thread holds either a
 * read or write lock.
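 *
 * Before diving in, note the wakeup policy just above in
 * __rw_runlock_hard(): when both queues are populated the writers win,
 * and the read-waiters bit is folded into setv so that information is
 * not lost across the release.  In skeleton form (constants abbreviated):
 */

#if 0
	setv = UNLOCKED;
	queue = SHARED_QUEUE;			/* default: wake readers */
	if (v & WRITE_WAITERS) {
		queue = EXCLUSIVE_QUEUE;	/* prefer writers if present */
		setv |= (v & READ_WAITERS);	/* remember blocked readers */
	}
#endif

/*
 * The hard path for write acquisition: the fast path failed, so at least
 * one other thread holds the lock in read or write mode.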
*/ void __rw_wlock_hard(volatile uintptr_t *c, uintptr_t v LOCK_FILE_LINE_ARG_DEF) { uintptr_t tid; struct rwlock *rw; struct turnstile *ts; struct thread *owner; #ifdef ADAPTIVE_RWLOCKS int spintries = 0; int i, n; enum { READERS, WRITER } sleep_reason; #endif uintptr_t x; #ifdef LOCK_PROFILING uint64_t waittime = 0; int contested = 0; #endif #if defined(ADAPTIVE_RWLOCKS) || defined(KDTRACE_HOOKS) struct lock_delay_arg lda; #endif #ifdef KDTRACE_HOOKS u_int sleep_cnt = 0; int64_t sleep_time = 0; int64_t all_time = 0; #endif #if defined(KDTRACE_HOOKS) || defined(LOCK_PROFILING) uintptr_t state; int doing_lockprof = 0; #endif tid = (uintptr_t)curthread; rw = rwlock2rw(c); #ifdef KDTRACE_HOOKS if (LOCKSTAT_PROFILE_ENABLED(rw__acquire)) { while (v == RW_UNLOCKED) { if (_rw_write_lock_fetch(rw, &v, tid)) goto out_lockstat; } doing_lockprof = 1; all_time -= lockstat_nsecs(&rw->lock_object); state = v; } #endif #ifdef LOCK_PROFILING doing_lockprof = 1; state = v; #endif if (SCHEDULER_STOPPED()) return; #if defined(ADAPTIVE_RWLOCKS) lock_delay_arg_init(&lda, &rw_delay); #elif defined(KDTRACE_HOOKS) lock_delay_arg_init(&lda, NULL); #endif if (__predict_false(v == RW_UNLOCKED)) v = RW_READ_VALUE(rw); if (__predict_false(lv_rw_wowner(v) == (struct thread *)tid)) { KASSERT(rw->lock_object.lo_flags & LO_RECURSABLE, ("%s: recursing but non-recursive rw %s @ %s:%d\n", __func__, rw->lock_object.lo_name, file, line)); rw->rw_recurse++; atomic_set_ptr(&rw->rw_lock, RW_LOCK_WRITER_RECURSED); if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p recursing", __func__, rw); return; } if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR5(KTR_LOCK, "%s: %s contested (lock=%p) at %s:%d", __func__, rw->lock_object.lo_name, (void *)rw->rw_lock, file, line); #ifdef HWPMC_HOOKS PMC_SOFT_CALL( , , lock, failed); #endif lock_profile_obtain_lock_failed(&rw->lock_object, &contested, &waittime); for (;;) { if (v == RW_UNLOCKED) { if (_rw_write_lock_fetch(rw, &v, tid)) break; continue; } #ifdef KDTRACE_HOOKS lda.spin_cnt++; #endif #ifdef ADAPTIVE_RWLOCKS /* * If the lock is write locked and the owner is * running on another CPU, spin until the owner stops * running or the state of the lock changes. 
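 *
 * One case was already peeled off above: recursion.  When the owner is
 * curthread and the lock was created with RW_RECURSE, acquisition is just
 * a counter bump plus the RW_LOCK_WRITER_RECURSED marker.  As a skeleton
 * (the predicate name is invented):
 */

#if 0
	if (owner == curthread && recursion_allowed(rw)) {
		rw->rw_recurse++;	/* private: we already own the lock */
		atomic_set_ptr(&rw->rw_lock, RW_LOCK_WRITER_RECURSED);
		return;
	}
#endif

/*
 * Owner running on another CPU: spin adaptively rather than block.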
*/ if (!(v & RW_LOCK_READ)) { sleep_reason = WRITER; owner = lv_rw_wowner(v); if (!TD_IS_RUNNING(owner)) goto ts; if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR3(KTR_LOCK, "%s: spinning on %p held by %p", __func__, rw, owner); KTR_STATE1(KTR_SCHED, "thread", sched_tdname(curthread), "spinning", "lockname:\"%s\"", rw->lock_object.lo_name); do { lock_delay(&lda); v = RW_READ_VALUE(rw); owner = lv_rw_wowner(v); } while (owner != NULL && TD_IS_RUNNING(owner)); KTR_STATE0(KTR_SCHED, "thread", sched_tdname(curthread), "running"); continue; } else if (RW_READERS(v) > 0) { sleep_reason = READERS; if (spintries == rowner_retries) goto ts; if (!(v & RW_LOCK_WRITE_SPINNER)) { if (!atomic_fcmpset_ptr(&rw->rw_lock, &v, v | RW_LOCK_WRITE_SPINNER)) { continue; } } spintries++; KTR_STATE1(KTR_SCHED, "thread", sched_tdname(curthread), "spinning", "lockname:\"%s\"", rw->lock_object.lo_name); for (i = 0; i < rowner_loops; i += n) { n = RW_READERS(v); lock_delay_spin(n); v = RW_READ_VALUE(rw); if ((v & RW_LOCK_WRITE_SPINNER) == 0) break; } #ifdef KDTRACE_HOOKS lda.spin_cnt += i; #endif KTR_STATE0(KTR_SCHED, "thread", sched_tdname(curthread), "running"); if (i < rowner_loops) continue; } ts: #endif ts = turnstile_trywait(&rw->lock_object); v = RW_READ_VALUE(rw); retry_ts: owner = lv_rw_wowner(v); #ifdef ADAPTIVE_RWLOCKS /* * The current lock owner might have started executing * on another CPU (or the lock could have changed * owners) while we were waiting on the turnstile * chain lock. If so, drop the turnstile lock and try * again. */ if (owner != NULL) { if (TD_IS_RUNNING(owner)) { turnstile_cancel(ts); continue; } } else if (RW_READERS(v) > 0 && sleep_reason == WRITER) { turnstile_cancel(ts); continue; } #endif /* * Check for the waiters flags about this rwlock. * If the lock was released, without maintain any pending * waiters queue, simply try to acquire it. * If a pending waiters queue is present, claim the lock * ownership and maintain the pending queue. */ x = v & (RW_LOCK_WAITERS | RW_LOCK_WRITE_SPINNER); if ((v & ~x) == RW_UNLOCKED) { x &= ~RW_LOCK_WRITE_SPINNER; if (atomic_fcmpset_acq_ptr(&rw->rw_lock, &v, tid | x)) { if (x) turnstile_claim(ts); else turnstile_cancel(ts); break; } goto retry_ts; } /* * If the RW_LOCK_WRITE_WAITERS flag isn't set, then try to * set it. If we fail to set it, then loop back and try * again. */ if (!(v & RW_LOCK_WRITE_WAITERS)) { if (!atomic_fcmpset_ptr(&rw->rw_lock, &v, v | RW_LOCK_WRITE_WAITERS)) goto retry_ts; if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p set write waiters flag", __func__, rw); } /* * We were unable to acquire the lock and the write waiters * flag is set, so we must block on the turnstile. */ if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p blocking on turnstile", __func__, rw); #ifdef KDTRACE_HOOKS sleep_time -= lockstat_nsecs(&rw->lock_object); #endif MPASS(owner == rw_owner(rw)); turnstile_wait(ts, owner, TS_EXCLUSIVE_QUEUE); #ifdef KDTRACE_HOOKS sleep_time += lockstat_nsecs(&rw->lock_object); sleep_cnt++; #endif if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p resuming from turnstile", __func__, rw); #ifdef ADAPTIVE_RWLOCKS spintries = 0; #endif v = RW_READ_VALUE(rw); } #if defined(KDTRACE_HOOKS) || defined(LOCK_PROFILING) if (__predict_true(!doing_lockprof)) return; #endif #ifdef KDTRACE_HOOKS all_time += lockstat_nsecs(&rw->lock_object); if (sleep_time) LOCKSTAT_RECORD4(rw__block, rw, sleep_time, LOCKSTAT_WRITER, (state & RW_LOCK_READ) == 0, (state & RW_LOCK_READ) == 0 ? 
0 : RW_READERS(state)); /* Record only the loops spinning and not sleeping. */ if (lda.spin_cnt > sleep_cnt) LOCKSTAT_RECORD4(rw__spin, rw, all_time - sleep_time, LOCKSTAT_WRITER, (state & RW_LOCK_READ) == 0, (state & RW_LOCK_READ) == 0 ? 0 : RW_READERS(state)); out_lockstat: #endif LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw, contested, waittime, file, line, LOCKSTAT_WRITER); } /* * This function is called if lockstat is active or the first try at releasing * a write lock failed. The latter means that the lock is recursed or one of * the 2 waiter bits must be set indicating that at least one thread is waiting * on this lock. */ void __rw_wunlock_hard(volatile uintptr_t *c, uintptr_t v LOCK_FILE_LINE_ARG_DEF) { struct rwlock *rw; struct turnstile *ts; uintptr_t tid, setv; int queue; tid = (uintptr_t)curthread; if (SCHEDULER_STOPPED()) return; rw = rwlock2rw(c); if (__predict_false(v == tid)) v = RW_READ_VALUE(rw); if (v & RW_LOCK_WRITER_RECURSED) { if (--(rw->rw_recurse) == 0) atomic_clear_ptr(&rw->rw_lock, RW_LOCK_WRITER_RECURSED); if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p unrecursing", __func__, rw); return; } LOCKSTAT_PROFILE_RELEASE_RWLOCK(rw__release, rw, LOCKSTAT_WRITER); if (v == tid && _rw_write_unlock(rw, tid)) return; KASSERT(rw->rw_lock & (RW_LOCK_READ_WAITERS | RW_LOCK_WRITE_WAITERS), ("%s: neither of the waiter flags are set", __func__)); if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p contested", __func__, rw); turnstile_chain_lock(&rw->lock_object); /* * Use the same algo as sx locks for now. Prefer waking up shared * waiters if we have any over writers. This is probably not ideal. * * 'v' is the value we are going to write back to rw_lock. If we * have waiters on both queues, we need to preserve the state of * the waiter flag for the queue we don't wake up. For now this is * hardcoded for the algorithm mentioned above. * * In the case of both readers and writers waiting we wakeup the * readers but leave the RW_LOCK_WRITE_WAITERS flag set. If a * new writer comes in before a reader it will claim the lock up * above. There is probably a potential priority inversion in * there that could be worked around either by waking both queues * of waiters or doing some complicated lock handoff gymnastics. */ setv = RW_UNLOCKED; v = RW_READ_VALUE(rw); queue = TS_SHARED_QUEUE; if (v & RW_LOCK_WRITE_WAITERS) { queue = TS_EXCLUSIVE_QUEUE; setv |= (v & RW_LOCK_READ_WAITERS); } atomic_store_rel_ptr(&rw->rw_lock, setv); /* Wake up all waiters for the specific queue. */ if (LOCK_LOG_TEST(&rw->lock_object, 0)) CTR3(KTR_LOCK, "%s: %p waking up %s waiters", __func__, rw, queue == TS_SHARED_QUEUE ? "read" : "write"); ts = turnstile_lookup(&rw->lock_object); MPASS(ts != NULL); turnstile_broadcast(ts, queue); turnstile_unpend(ts, TS_EXCLUSIVE_LOCK); turnstile_chain_unlock(&rw->lock_object); } /* * Attempt to do a non-blocking upgrade from a read lock to a write * lock. This will only succeed if this thread holds a single read * lock. Returns true if the upgrade succeeded and false otherwise. */ int __rw_try_upgrade_int(struct rwlock *rw LOCK_FILE_LINE_ARG_DEF) { - uintptr_t v, x, tid; + uintptr_t v, setv, tid; struct turnstile *ts; int success; if (SCHEDULER_STOPPED()) return (1); KASSERT(rw->rw_lock != RW_DESTROYED, ("rw_try_upgrade() of destroyed rwlock @ %s:%d", file, line)); __rw_assert(&rw->rw_lock, RA_RLOCKED, file, line); /* * Attempt to switch from one reader to a writer. 
If there * are any write waiters, then we will have to lock the * turnstile first to prevent races with another writer * calling turnstile_wait() before we have claimed this * turnstile. So, do the simple case of no waiters first. */ tid = (uintptr_t)curthread; success = 0; + v = RW_READ_VALUE(rw); for (;;) { - v = rw->rw_lock; if (RW_READERS(v) > 1) break; if (!(v & RW_LOCK_WAITERS)) { - success = atomic_cmpset_acq_ptr(&rw->rw_lock, v, tid); + success = atomic_fcmpset_acq_ptr(&rw->rw_lock, &v, tid); if (!success) continue; break; } /* * Ok, we think we have waiters, so lock the turnstile. */ ts = turnstile_trywait(&rw->lock_object); - v = rw->rw_lock; + v = RW_READ_VALUE(rw); +retry_ts: if (RW_READERS(v) > 1) { turnstile_cancel(ts); break; } /* * Try to switch from one reader to a writer again. This time * we honor the current state of the waiters flags. * If we obtain the lock with the flags set, then claim * ownership of the turnstile. */ - x = rw->rw_lock & RW_LOCK_WAITERS; - success = atomic_cmpset_ptr(&rw->rw_lock, v, tid | x); + setv = tid | (v & RW_LOCK_WAITERS); + success = atomic_fcmpset_ptr(&rw->rw_lock, &v, setv); if (success) { - if (x) + if (v & RW_LOCK_WAITERS) turnstile_claim(ts); else turnstile_cancel(ts); break; } - turnstile_cancel(ts); + goto retry_ts; } LOCK_LOG_TRY("WUPGRADE", &rw->lock_object, 0, success, file, line); if (success) { curthread->td_rw_rlocks--; WITNESS_UPGRADE(&rw->lock_object, LOP_EXCLUSIVE | LOP_TRYLOCK, file, line); LOCKSTAT_RECORD0(rw__upgrade, rw); } return (success); } int __rw_try_upgrade(volatile uintptr_t *c, const char *file, int line) { struct rwlock *rw; rw = rwlock2rw(c); return (__rw_try_upgrade_int(rw LOCK_FILE_LINE_ARG)); } /* * Downgrade a write lock into a single read lock. */ void __rw_downgrade_int(struct rwlock *rw LOCK_FILE_LINE_ARG_DEF) { struct turnstile *ts; uintptr_t tid, v; int rwait, wwait; if (SCHEDULER_STOPPED()) return; KASSERT(rw->rw_lock != RW_DESTROYED, ("rw_downgrade() of destroyed rwlock @ %s:%d", file, line)); __rw_assert(&rw->rw_lock, RA_WLOCKED | RA_NOTRECURSED, file, line); #ifndef INVARIANTS if (rw_recursed(rw)) panic("downgrade of a recursed lock"); #endif WITNESS_DOWNGRADE(&rw->lock_object, 0, file, line); /* * Convert from a writer to a single reader. First we handle * the easy case with no waiters. If there are any waiters, we * lock the turnstile and "disown" the lock. */ tid = (uintptr_t)curthread; if (atomic_cmpset_rel_ptr(&rw->rw_lock, tid, RW_READERS_LOCK(1))) goto out; /* * Ok, we think we have waiters, so lock the turnstile so we can * read the waiter flags without any races. */ turnstile_chain_lock(&rw->lock_object); v = rw->rw_lock & RW_LOCK_WAITERS; rwait = v & RW_LOCK_READ_WAITERS; wwait = v & RW_LOCK_WRITE_WAITERS; MPASS(rwait | wwait); /* * Downgrade from a write lock while preserving waiters flag * and give up ownership of the turnstile. */ ts = turnstile_lookup(&rw->lock_object); MPASS(ts != NULL); if (!wwait) v &= ~RW_LOCK_READ_WAITERS; atomic_store_rel_ptr(&rw->rw_lock, RW_READERS_LOCK(1) | v); /* * Wake other readers if there are no writers pending. Otherwise they * won't be able to acquire the lock anyway. 
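The hunk above converts atomic_cmpset_ptr() calls to atomic_fcmpset_ptr(). The difference is that fcmpset writes the freshly observed lock word back through its second argument when the exchange fails, so the retry_ts: path can branch straight back without an explicit re-read. C11's compare-exchange has the same contract; a userland sketch of the single-reader upgrade under a toy encoding (the word holds the sharer count while read-locked and the owner's id once write-locked; none of this is the kernel API):

	#include <stdatomic.h>
	#include <stdint.h>

	static int
	try_upgrade(_Atomic uintptr_t *lk, uintptr_t self)
	{
		uintptr_t v;

		v = atomic_load_explicit(lk, memory_order_relaxed);
		for (;;) {
			if (v != 1)
				return (0);	/* other sharers joined: give up */
			if (atomic_compare_exchange_weak_explicit(lk, &v, self,
			    memory_order_acquire, memory_order_relaxed))
				return (1);
			/* v now holds the current word; the loop re-examines it. */
		}
	}

The weak variant may fail spuriously, which is harmless here: a failed exchange just refreshes v and the loop retries.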
*/ if (rwait && !wwait) { turnstile_broadcast(ts, TS_SHARED_QUEUE); turnstile_unpend(ts, TS_EXCLUSIVE_LOCK); } else turnstile_disown(ts); turnstile_chain_unlock(&rw->lock_object); out: curthread->td_rw_rlocks++; LOCK_LOG_LOCK("WDOWNGRADE", &rw->lock_object, 0, 0, file, line); LOCKSTAT_RECORD0(rw__downgrade, rw); } void __rw_downgrade(volatile uintptr_t *c, const char *file, int line) { struct rwlock *rw; rw = rwlock2rw(c); __rw_downgrade_int(rw LOCK_FILE_LINE_ARG); } #ifdef INVARIANT_SUPPORT #ifndef INVARIANTS #undef __rw_assert #endif /* * In the non-WITNESS case, rw_assert() can only detect that at least * *some* thread owns an rlock, but it cannot guarantee that *this* * thread owns an rlock. */ void __rw_assert(const volatile uintptr_t *c, int what, const char *file, int line) { const struct rwlock *rw; if (panicstr != NULL) return; rw = rwlock2rw(c); switch (what) { case RA_LOCKED: case RA_LOCKED | RA_RECURSED: case RA_LOCKED | RA_NOTRECURSED: case RA_RLOCKED: case RA_RLOCKED | RA_RECURSED: case RA_RLOCKED | RA_NOTRECURSED: #ifdef WITNESS witness_assert(&rw->lock_object, what, file, line); #else /* * If some other thread has a write lock or we have one * and are asserting a read lock, fail. Also, if no one * has a lock at all, fail. */ if (rw->rw_lock == RW_UNLOCKED || (!(rw->rw_lock & RW_LOCK_READ) && (what & RA_RLOCKED || rw_wowner(rw) != curthread))) panic("Lock %s not %slocked @ %s:%d\n", rw->lock_object.lo_name, (what & RA_RLOCKED) ? "read " : "", file, line); if (!(rw->rw_lock & RW_LOCK_READ) && !(what & RA_RLOCKED)) { if (rw_recursed(rw)) { if (what & RA_NOTRECURSED) panic("Lock %s recursed @ %s:%d\n", rw->lock_object.lo_name, file, line); } else if (what & RA_RECURSED) panic("Lock %s not recursed @ %s:%d\n", rw->lock_object.lo_name, file, line); } #endif break; case RA_WLOCKED: case RA_WLOCKED | RA_RECURSED: case RA_WLOCKED | RA_NOTRECURSED: if (rw_wowner(rw) != curthread) panic("Lock %s not exclusively locked @ %s:%d\n", rw->lock_object.lo_name, file, line); if (rw_recursed(rw)) { if (what & RA_NOTRECURSED) panic("Lock %s recursed @ %s:%d\n", rw->lock_object.lo_name, file, line); } else if (what & RA_RECURSED) panic("Lock %s not recursed @ %s:%d\n", rw->lock_object.lo_name, file, line); break; case RA_UNLOCKED: #ifdef WITNESS witness_assert(&rw->lock_object, what, file, line); #else /* * If we hold a write lock fail. We can't reliably check * to see if we hold a read lock or not. 
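Typical consumer use of these assertions looks as follows (a sketch: the softc, entry type, and list are hypothetical; with WITNESS the check is exact, otherwise only the weaker non-WITNESS checks described above apply):

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/rwlock.h>
	#include <sys/queue.h>

	struct foo_entry {
		LIST_ENTRY(foo_entry)	f_link;
	};

	struct foo_softc {
		struct rwlock		sc_lock;
		LIST_HEAD(, foo_entry)	sc_list;
	};

	static void
	foo_insert(struct foo_softc *sc, struct foo_entry *e)
	{

		rw_assert(&sc->sc_lock, RA_WLOCKED);	/* caller must write-lock */
		LIST_INSERT_HEAD(&sc->sc_list, e, f_link);
	}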
*/ if (rw_wowner(rw) == curthread) panic("Lock %s exclusively locked @ %s:%d\n", rw->lock_object.lo_name, file, line); #endif break; default: panic("Unknown rw lock assertion: %d @ %s:%d", what, file, line); } } #endif /* INVARIANT_SUPPORT */ #ifdef DDB void db_show_rwlock(const struct lock_object *lock) { const struct rwlock *rw; struct thread *td; rw = (const struct rwlock *)lock; db_printf(" state: "); if (rw->rw_lock == RW_UNLOCKED) db_printf("UNLOCKED\n"); else if (rw->rw_lock == RW_DESTROYED) { db_printf("DESTROYED\n"); return; } else if (rw->rw_lock & RW_LOCK_READ) db_printf("RLOCK: %ju locks\n", (uintmax_t)(RW_READERS(rw->rw_lock))); else { td = rw_wowner(rw); db_printf("WLOCK: %p (tid %d, pid %d, \"%s\")\n", td, td->td_tid, td->td_proc->p_pid, td->td_name); if (rw_recursed(rw)) db_printf(" recursed: %u\n", rw->rw_recurse); } db_printf(" waiters: "); switch (rw->rw_lock & (RW_LOCK_READ_WAITERS | RW_LOCK_WRITE_WAITERS)) { case RW_LOCK_READ_WAITERS: db_printf("readers\n"); break; case RW_LOCK_WRITE_WAITERS: db_printf("writers\n"); break; case RW_LOCK_READ_WAITERS | RW_LOCK_WRITE_WAITERS: db_printf("readers and writers\n"); break; default: db_printf("none\n"); break; } } #endif Index: user/markj/netdump/sys/kern/kern_sx.c =================================================================== --- user/markj/netdump/sys/kern/kern_sx.c (revision 332407) +++ user/markj/netdump/sys/kern/kern_sx.c (revision 332408) @@ -1,1456 +1,1464 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2007 Attilio Rao * Copyright (c) 2001 Jason Evans * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice(s), this list of conditions and the following disclaimer as * the first lines of this file unmodified other than the possible * addition of one or more copyright notices. * 2. Redistributions in binary form must reproduce the above copyright * notice(s), this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER(S) ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER(S) BE LIABLE FOR ANY * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH * DAMAGE. */ /* * Shared/exclusive locks. This implementation attempts to ensure * deterministic lock granting behavior, so that slocks and xlocks are * interleaved. * * Priority propagation will not generally raise the priority of lock holders, * so should not be relied upon in combination with sx locks. 
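For orientation before the implementation details, this is the public sx API the rest of the file implements, shown in a minimal hypothetical consumer (sketch only; the description string is what witness and DDB display):

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/sx.h>

	static struct sx map_lock;

	static void
	map_init(void)
	{

		sx_init(&map_lock, "map lock");	/* name shows up in witness/DDB */
	}

	static void
	map_read(void)
	{

		sx_slock(&map_lock);		/* shared; may sleep */
		/* ... look something up ... */
		sx_sunlock(&map_lock);
	}

	static void
	map_write(void)
	{

		sx_xlock(&map_lock);		/* exclusive; may sleep */
		/* ... modify ... */
		sx_xunlock(&map_lock);
	}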
*/ #include "opt_ddb.h" #include "opt_hwpmc_hooks.h" #include "opt_no_adaptive_sx.h" #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #if defined(SMP) && !defined(NO_ADAPTIVE_SX) #include #endif #ifdef DDB #include #endif #if defined(SMP) && !defined(NO_ADAPTIVE_SX) #define ADAPTIVE_SX #endif CTASSERT((SX_NOADAPTIVE & LO_CLASSFLAGS) == SX_NOADAPTIVE); #ifdef HWPMC_HOOKS #include PMC_SOFT_DECLARE( , , lock, failed); #endif /* Handy macros for sleep queues. */ #define SQ_EXCLUSIVE_QUEUE 0 #define SQ_SHARED_QUEUE 1 /* * Variations on DROP_GIANT()/PICKUP_GIANT() for use in this file. We * drop Giant anytime we have to sleep or if we adaptively spin. */ #define GIANT_DECLARE \ int _giantcnt = 0; \ WITNESS_SAVE_DECL(Giant) \ #define GIANT_SAVE(work) do { \ if (__predict_false(mtx_owned(&Giant))) { \ work++; \ WITNESS_SAVE(&Giant.lock_object, Giant); \ while (mtx_owned(&Giant)) { \ _giantcnt++; \ mtx_unlock(&Giant); \ } \ } \ } while (0) #define GIANT_RESTORE() do { \ if (_giantcnt > 0) { \ mtx_assert(&Giant, MA_NOTOWNED); \ while (_giantcnt--) \ mtx_lock(&Giant); \ WITNESS_RESTORE(&Giant.lock_object, Giant); \ } \ } while (0) /* * Returns true if an exclusive lock is recursed. It assumes * curthread currently has an exclusive lock. */ #define sx_recursed(sx) ((sx)->sx_recurse != 0) static void assert_sx(const struct lock_object *lock, int what); #ifdef DDB static void db_show_sx(const struct lock_object *lock); #endif static void lock_sx(struct lock_object *lock, uintptr_t how); #ifdef KDTRACE_HOOKS static int owner_sx(const struct lock_object *lock, struct thread **owner); #endif static uintptr_t unlock_sx(struct lock_object *lock); struct lock_class lock_class_sx = { .lc_name = "sx", .lc_flags = LC_SLEEPLOCK | LC_SLEEPABLE | LC_RECURSABLE | LC_UPGRADABLE, .lc_assert = assert_sx, #ifdef DDB .lc_ddb_show = db_show_sx, #endif .lc_lock = lock_sx, .lc_unlock = unlock_sx, #ifdef KDTRACE_HOOKS .lc_owner = owner_sx, #endif }; #ifndef INVARIANTS #define _sx_assert(sx, what, file, line) #endif #ifdef ADAPTIVE_SX -static __read_frequently u_int asx_retries = 10; -static __read_frequently u_int asx_loops = 10000; +static __read_frequently u_int asx_retries; +static __read_frequently u_int asx_loops; static SYSCTL_NODE(_debug, OID_AUTO, sx, CTLFLAG_RD, NULL, "sxlock debugging"); SYSCTL_UINT(_debug_sx, OID_AUTO, retries, CTLFLAG_RW, &asx_retries, 0, ""); SYSCTL_UINT(_debug_sx, OID_AUTO, loops, CTLFLAG_RW, &asx_loops, 0, ""); static struct lock_delay_config __read_frequently sx_delay; SYSCTL_INT(_debug_sx, OID_AUTO, delay_base, CTLFLAG_RW, &sx_delay.base, 0, ""); SYSCTL_INT(_debug_sx, OID_AUTO, delay_max, CTLFLAG_RW, &sx_delay.max, 0, ""); -LOCK_DELAY_SYSINIT_DEFAULT(sx_delay); +static void +sx_lock_delay_init(void *arg __unused) +{ + + lock_delay_default_init(&sx_delay); + asx_retries = 10; + asx_loops = max(10000, sx_delay.max); +} +LOCK_DELAY_SYSINIT(sx_lock_delay_init); #endif void assert_sx(const struct lock_object *lock, int what) { sx_assert((const struct sx *)lock, what); } void lock_sx(struct lock_object *lock, uintptr_t how) { struct sx *sx; sx = (struct sx *)lock; if (how) sx_slock(sx); else sx_xlock(sx); } uintptr_t unlock_sx(struct lock_object *lock) { struct sx *sx; sx = (struct sx *)lock; sx_assert(sx, SA_LOCKED | SA_NOTRECURSED); if (sx_xlocked(sx)) { sx_xunlock(sx); return (0); } else { sx_sunlock(sx); return (1); } } #ifdef KDTRACE_HOOKS int owner_sx(const struct lock_object *lock, struct 
thread **owner) { const struct sx *sx; uintptr_t x; sx = (const struct sx *)lock; x = sx->sx_lock; *owner = NULL; return ((x & SX_LOCK_SHARED) != 0 ? (SX_SHARERS(x) != 0) : ((*owner = (struct thread *)SX_OWNER(x)) != NULL)); } #endif void sx_sysinit(void *arg) { struct sx_args *sargs = arg; sx_init_flags(sargs->sa_sx, sargs->sa_desc, sargs->sa_flags); } void sx_init_flags(struct sx *sx, const char *description, int opts) { int flags; MPASS((opts & ~(SX_QUIET | SX_RECURSE | SX_NOWITNESS | SX_DUPOK | SX_NOPROFILE | SX_NOADAPTIVE | SX_NEW)) == 0); ASSERT_ATOMIC_LOAD_PTR(sx->sx_lock, ("%s: sx_lock not aligned for %s: %p", __func__, description, &sx->sx_lock)); flags = LO_SLEEPABLE | LO_UPGRADABLE; if (opts & SX_DUPOK) flags |= LO_DUPOK; if (opts & SX_NOPROFILE) flags |= LO_NOPROFILE; if (!(opts & SX_NOWITNESS)) flags |= LO_WITNESS; if (opts & SX_RECURSE) flags |= LO_RECURSABLE; if (opts & SX_QUIET) flags |= LO_QUIET; if (opts & SX_NEW) flags |= LO_NEW; flags |= opts & SX_NOADAPTIVE; lock_init(&sx->lock_object, &lock_class_sx, description, NULL, flags); sx->sx_lock = SX_LOCK_UNLOCKED; sx->sx_recurse = 0; } void sx_destroy(struct sx *sx) { KASSERT(sx->sx_lock == SX_LOCK_UNLOCKED, ("sx lock still held")); KASSERT(sx->sx_recurse == 0, ("sx lock still recursed")); sx->sx_lock = SX_LOCK_DESTROYED; lock_destroy(&sx->lock_object); } int sx_try_slock_int(struct sx *sx LOCK_FILE_LINE_ARG_DEF) { uintptr_t x; if (SCHEDULER_STOPPED()) return (1); KASSERT(kdb_active != 0 || !TD_IS_IDLETHREAD(curthread), ("sx_try_slock() by idle thread %p on sx %s @ %s:%d", curthread, sx->lock_object.lo_name, file, line)); x = sx->sx_lock; for (;;) { KASSERT(x != SX_LOCK_DESTROYED, ("sx_try_slock() of destroyed sx @ %s:%d", file, line)); if (!(x & SX_LOCK_SHARED)) break; if (atomic_fcmpset_acq_ptr(&sx->sx_lock, &x, x + SX_ONE_SHARER)) { LOCK_LOG_TRY("SLOCK", &sx->lock_object, 0, 1, file, line); WITNESS_LOCK(&sx->lock_object, LOP_TRYLOCK, file, line); LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx, 0, 0, file, line, LOCKSTAT_READER); TD_LOCKS_INC(curthread); return (1); } } LOCK_LOG_TRY("SLOCK", &sx->lock_object, 0, 0, file, line); return (0); } int sx_try_slock_(struct sx *sx, const char *file, int line) { return (sx_try_slock_int(sx LOCK_FILE_LINE_ARG)); } int _sx_xlock(struct sx *sx, int opts, const char *file, int line) { uintptr_t tid, x; int error = 0; KASSERT(kdb_active != 0 || SCHEDULER_STOPPED() || !TD_IS_IDLETHREAD(curthread), ("sx_xlock() by idle thread %p on sx %s @ %s:%d", curthread, sx->lock_object.lo_name, file, line)); KASSERT(sx->sx_lock != SX_LOCK_DESTROYED, ("sx_xlock() of destroyed sx @ %s:%d", file, line)); WITNESS_CHECKORDER(&sx->lock_object, LOP_NEWORDER | LOP_EXCLUSIVE, file, line, NULL); tid = (uintptr_t)curthread; x = SX_LOCK_UNLOCKED; if (!atomic_fcmpset_acq_ptr(&sx->sx_lock, &x, tid)) error = _sx_xlock_hard(sx, x, opts LOCK_FILE_LINE_ARG); else LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx, 0, 0, file, line, LOCKSTAT_WRITER); if (!error) { LOCK_LOG_LOCK("XLOCK", &sx->lock_object, 0, sx->sx_recurse, file, line); WITNESS_LOCK(&sx->lock_object, LOP_EXCLUSIVE, file, line); TD_LOCKS_INC(curthread); } return (error); } int sx_try_xlock_int(struct sx *sx LOCK_FILE_LINE_ARG_DEF) { struct thread *td; uintptr_t tid, x; int rval; bool recursed; td = curthread; tid = (uintptr_t)td; if (SCHEDULER_STOPPED_TD(td)) return (1); KASSERT(kdb_active != 0 || !TD_IS_IDLETHREAD(td), ("sx_try_xlock() by idle thread %p on sx %s @ %s:%d", curthread, sx->lock_object.lo_name, file, line)); KASSERT(sx->sx_lock 
!= SX_LOCK_DESTROYED, ("sx_try_xlock() of destroyed sx @ %s:%d", file, line)); rval = 1; recursed = false; x = SX_LOCK_UNLOCKED; for (;;) { if (atomic_fcmpset_acq_ptr(&sx->sx_lock, &x, tid)) break; if (x == SX_LOCK_UNLOCKED) continue; if (x == tid && (sx->lock_object.lo_flags & LO_RECURSABLE)) { sx->sx_recurse++; atomic_set_ptr(&sx->sx_lock, SX_LOCK_RECURSED); break; } rval = 0; break; } LOCK_LOG_TRY("XLOCK", &sx->lock_object, 0, rval, file, line); if (rval) { WITNESS_LOCK(&sx->lock_object, LOP_EXCLUSIVE | LOP_TRYLOCK, file, line); if (!recursed) LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx, 0, 0, file, line, LOCKSTAT_WRITER); TD_LOCKS_INC(curthread); } return (rval); } int sx_try_xlock_(struct sx *sx, const char *file, int line) { return (sx_try_xlock_int(sx LOCK_FILE_LINE_ARG)); } void _sx_xunlock(struct sx *sx, const char *file, int line) { KASSERT(sx->sx_lock != SX_LOCK_DESTROYED, ("sx_xunlock() of destroyed sx @ %s:%d", file, line)); _sx_assert(sx, SA_XLOCKED, file, line); WITNESS_UNLOCK(&sx->lock_object, LOP_EXCLUSIVE, file, line); LOCK_LOG_LOCK("XUNLOCK", &sx->lock_object, 0, sx->sx_recurse, file, line); #if LOCK_DEBUG > 0 _sx_xunlock_hard(sx, (uintptr_t)curthread, file, line); #else __sx_xunlock(sx, curthread, file, line); #endif TD_LOCKS_DEC(curthread); } /* * Try to do a non-blocking upgrade from a shared lock to an exclusive lock. * This will only succeed if this thread holds a single shared lock. * Return 1 if the upgrade succeeds, 0 otherwise. */ int sx_try_upgrade_int(struct sx *sx LOCK_FILE_LINE_ARG_DEF) { uintptr_t x; uintptr_t waiters; int success; if (SCHEDULER_STOPPED()) return (1); KASSERT(sx->sx_lock != SX_LOCK_DESTROYED, ("sx_try_upgrade() of destroyed sx @ %s:%d", file, line)); _sx_assert(sx, SA_SLOCKED, file, line); /* * Try to switch from one shared lock to an exclusive lock. We need * to maintain the SX_LOCK_EXCLUSIVE_WAITERS flag if set so that * we will wake up the exclusive waiters when we drop the lock. */ success = 0; x = SX_READ_VALUE(sx); for (;;) { if (SX_SHARERS(x) > 1) break; waiters = (x & SX_LOCK_EXCLUSIVE_WAITERS); if (atomic_fcmpset_acq_ptr(&sx->sx_lock, &x, (uintptr_t)curthread | waiters)) { success = 1; break; } } LOCK_LOG_TRY("XUPGRADE", &sx->lock_object, 0, success, file, line); if (success) { WITNESS_UPGRADE(&sx->lock_object, LOP_EXCLUSIVE | LOP_TRYLOCK, file, line); LOCKSTAT_RECORD0(sx__upgrade, sx); } return (success); } int sx_try_upgrade_(struct sx *sx, const char *file, int line) { return (sx_try_upgrade_int(sx LOCK_FILE_LINE_ARG)); } /* * Downgrade an unrecursed exclusive lock into a single shared lock. */ void sx_downgrade_int(struct sx *sx LOCK_FILE_LINE_ARG_DEF) { uintptr_t x; int wakeup_swapper; if (SCHEDULER_STOPPED()) return; KASSERT(sx->sx_lock != SX_LOCK_DESTROYED, ("sx_downgrade() of destroyed sx @ %s:%d", file, line)); _sx_assert(sx, SA_XLOCKED | SA_NOTRECURSED, file, line); #ifndef INVARIANTS if (sx_recursed(sx)) panic("downgrade of a recursed lock"); #endif WITNESS_DOWNGRADE(&sx->lock_object, 0, file, line); /* * Try to switch from an exclusive lock with no shared waiters * to one sharer with no shared waiters. If there are * exclusive waiters, we don't need to lock the sleep queue so * long as we preserve the flag. We do one quick try and if * that fails we grab the sleepq lock to keep the flags from * changing and do it the slow way. * * We have to lock the sleep queue if there are shared waiters * so we can wake them up.
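A common consumer idiom built on the upgrade primitive above: take the lock shared and, when the opportunistic upgrade loses the race to other sharers, fall back to dropping the lock and re-acquiring it exclusively, rechecking state afterwards. A sketch; the softc and the needs_update()/do_update() helpers are hypothetical:

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/sx.h>

	struct cache_softc {
		struct sx	sc_lock;
		/* ... cached state ... */
	};

	static int	needs_update(struct cache_softc *);	/* hypothetical */
	static void	do_update(struct cache_softc *);	/* hypothetical */

	static void
	cache_promote(struct cache_softc *sc)
	{

		sx_slock(&sc->sc_lock);
		if (!needs_update(sc)) {
			sx_sunlock(&sc->sc_lock);
			return;
		}
		if (!sx_try_upgrade(&sc->sc_lock)) {
			/* Upgrade raced with other sharers: go exclusive. */
			sx_sunlock(&sc->sc_lock);
			sx_xlock(&sc->sc_lock);
			/* State may have changed while the lock was dropped. */
			if (!needs_update(sc)) {
				sx_xunlock(&sc->sc_lock);
				return;
			}
		}
		do_update(sc);			/* exclusively locked here */
		sx_xunlock(&sc->sc_lock);
	}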
*/ x = sx->sx_lock; if (!(x & SX_LOCK_SHARED_WAITERS) && atomic_cmpset_rel_ptr(&sx->sx_lock, x, SX_SHARERS_LOCK(1) | (x & SX_LOCK_EXCLUSIVE_WAITERS))) goto out; /* * Lock the sleep queue so we can read the waiters bits * without any races and wakeup any shared waiters. */ sleepq_lock(&sx->lock_object); /* * Preserve SX_LOCK_EXCLUSIVE_WAITERS while downgraded to a single * shared lock. If there are any shared waiters, wake them up. */ wakeup_swapper = 0; x = sx->sx_lock; atomic_store_rel_ptr(&sx->sx_lock, SX_SHARERS_LOCK(1) | (x & SX_LOCK_EXCLUSIVE_WAITERS)); if (x & SX_LOCK_SHARED_WAITERS) wakeup_swapper = sleepq_broadcast(&sx->lock_object, SLEEPQ_SX, 0, SQ_SHARED_QUEUE); sleepq_release(&sx->lock_object); if (wakeup_swapper) kick_proc0(); out: LOCK_LOG_LOCK("XDOWNGRADE", &sx->lock_object, 0, 0, file, line); LOCKSTAT_RECORD0(sx__downgrade, sx); } void sx_downgrade_(struct sx *sx, const char *file, int line) { sx_downgrade_int(sx LOCK_FILE_LINE_ARG); } /* * This function represents the so-called 'hard case' for sx_xlock * operation. All 'easy case' failures are redirected to this. Note * that ideally this would be a static function, but it needs to be * accessible from at least sx.h. */ int _sx_xlock_hard(struct sx *sx, uintptr_t x, int opts LOCK_FILE_LINE_ARG_DEF) { GIANT_DECLARE; uintptr_t tid; #ifdef ADAPTIVE_SX volatile struct thread *owner; u_int i, n, spintries = 0; enum { READERS, WRITER } sleep_reason; bool adaptive; #endif #ifdef LOCK_PROFILING uint64_t waittime = 0; int contested = 0; #endif int error = 0; #if defined(ADAPTIVE_SX) || defined(KDTRACE_HOOKS) struct lock_delay_arg lda; #endif #ifdef KDTRACE_HOOKS u_int sleep_cnt = 0; int64_t sleep_time = 0; int64_t all_time = 0; #endif #if defined(KDTRACE_HOOKS) || defined(LOCK_PROFILING) uintptr_t state; #endif int extra_work = 0; tid = (uintptr_t)curthread; #ifdef KDTRACE_HOOKS if (LOCKSTAT_PROFILE_ENABLED(sx__acquire)) { while (x == SX_LOCK_UNLOCKED) { if (atomic_fcmpset_acq_ptr(&sx->sx_lock, &x, tid)) goto out_lockstat; } extra_work = 1; all_time -= lockstat_nsecs(&sx->lock_object); state = x; } #endif #ifdef LOCK_PROFILING extra_work = 1; state = x; #endif if (SCHEDULER_STOPPED()) return (0); #if defined(ADAPTIVE_SX) lock_delay_arg_init(&lda, &sx_delay); #elif defined(KDTRACE_HOOKS) lock_delay_arg_init(&lda, NULL); #endif if (__predict_false(x == SX_LOCK_UNLOCKED)) x = SX_READ_VALUE(sx); /* If we already hold an exclusive lock, then recurse. 
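The recursion path above is only reachable when the lock was created recursable; a hypothetical sketch of the corresponding consumer setup:

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/sx.h>

	static struct sx conf_lock;
	static void conf_set(void);

	static void
	conf_init(void)
	{

		/* SX_RECURSE lets the owning thread re-enter sx_xlock(). */
		sx_init_flags(&conf_lock, "conf lock", SX_RECURSE);
	}

	static void
	conf_reload(void)
	{

		sx_xlock(&conf_lock);
		conf_set();		/* re-acquires conf_lock: recurses */
		sx_xunlock(&conf_lock);
	}

	static void
	conf_set(void)
	{

		sx_xlock(&conf_lock);	/* takes the sx_recurse++ path above */
		/* ... */
		sx_xunlock(&conf_lock);
	}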
*/ if (__predict_false(lv_sx_owner(x) == (struct thread *)tid)) { KASSERT((sx->lock_object.lo_flags & LO_RECURSABLE) != 0, ("_sx_xlock_hard: recursed on non-recursive sx %s @ %s:%d\n", sx->lock_object.lo_name, file, line)); sx->sx_recurse++; atomic_set_ptr(&sx->sx_lock, SX_LOCK_RECURSED); if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p recursing", __func__, sx); return (0); } if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR5(KTR_LOCK, "%s: %s contested (lock=%p) at %s:%d", __func__, sx->lock_object.lo_name, (void *)sx->sx_lock, file, line); #ifdef ADAPTIVE_SX adaptive = ((sx->lock_object.lo_flags & SX_NOADAPTIVE) == 0); #endif #ifdef HWPMC_HOOKS PMC_SOFT_CALL( , , lock, failed); #endif lock_profile_obtain_lock_failed(&sx->lock_object, &contested, &waittime); #ifndef INVARIANTS GIANT_SAVE(extra_work); #endif for (;;) { if (x == SX_LOCK_UNLOCKED) { if (atomic_fcmpset_acq_ptr(&sx->sx_lock, &x, tid)) break; continue; } #ifdef INVARIANTS GIANT_SAVE(extra_work); #endif #ifdef KDTRACE_HOOKS lda.spin_cnt++; #endif #ifdef ADAPTIVE_SX if (__predict_false(!adaptive)) goto sleepq; /* * If the lock is write locked and the owner is * running on another CPU, spin until the owner stops * running or the state of the lock changes. */ if ((x & SX_LOCK_SHARED) == 0) { sleep_reason = WRITER; owner = lv_sx_owner(x); if (!TD_IS_RUNNING(owner)) goto sleepq; if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR3(KTR_LOCK, "%s: spinning on %p held by %p", __func__, sx, owner); KTR_STATE1(KTR_SCHED, "thread", sched_tdname(curthread), "spinning", "lockname:\"%s\"", sx->lock_object.lo_name); do { lock_delay(&lda); x = SX_READ_VALUE(sx); owner = lv_sx_owner(x); } while (owner != NULL && TD_IS_RUNNING(owner)); KTR_STATE0(KTR_SCHED, "thread", sched_tdname(curthread), "running"); continue; } else if (SX_SHARERS(x) > 0) { sleep_reason = READERS; if (spintries == asx_retries) goto sleepq; spintries++; KTR_STATE1(KTR_SCHED, "thread", sched_tdname(curthread), "spinning", "lockname:\"%s\"", sx->lock_object.lo_name); for (i = 0; i < asx_loops; i += n) { n = SX_SHARERS(x); lock_delay_spin(n); x = SX_READ_VALUE(sx); if ((x & SX_LOCK_SHARED) == 0 || SX_SHARERS(x) == 0) break; } #ifdef KDTRACE_HOOKS lda.spin_cnt += i; #endif KTR_STATE0(KTR_SCHED, "thread", sched_tdname(curthread), "running"); if (i < asx_loops) continue; } sleepq: #endif sleepq_lock(&sx->lock_object); x = SX_READ_VALUE(sx); retry_sleepq: /* * If the lock was released while spinning on the * sleep queue chain lock, try again. */ if (x == SX_LOCK_UNLOCKED) { sleepq_release(&sx->lock_object); continue; } #ifdef ADAPTIVE_SX /* * The current lock owner might have started executing * on another CPU (or the lock could have changed * owners) while we were waiting on the sleep queue * chain lock. If so, drop the sleep queue lock and try * again. */ if (adaptive) { if (!(x & SX_LOCK_SHARED)) { owner = (struct thread *)SX_OWNER(x); if (TD_IS_RUNNING(owner)) { sleepq_release(&sx->lock_object); continue; } } else if (SX_SHARERS(x) > 0 && sleep_reason == WRITER) { sleepq_release(&sx->lock_object); continue; } } #endif /* * If an exclusive lock was released with both shared * and exclusive waiters and a shared waiter hasn't * woken up and acquired the lock yet, sx_lock will be * set to SX_LOCK_UNLOCKED | SX_LOCK_EXCLUSIVE_WAITERS. * If we see that value, try to acquire it once. Note * that we have to preserve SX_LOCK_EXCLUSIVE_WAITERS * as there are other exclusive waiters still. If we * fail, restart the loop. 
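The adaptive heuristic above distinguishes a running exclusive owner, which is worth spinning on indefinitely while it stays on CPU, from a set of readers, which is worth only a bounded number of spin rounds; everything else sleeps. A compressed model of just that decision (illustrative only; the kernel additionally bounds reader spinning by asx_loops):

	enum action { SPIN_ON_OWNER, SPIN_ON_READERS, SLEEP };

	static enum action
	decide(int owner_running, unsigned sharers, unsigned spintries,
	    unsigned max_retries)
	{

		if (owner_running)
			return (SPIN_ON_OWNER);	  /* owner should release soon */
		if (sharers > 0 && spintries < max_retries)
			return (SPIN_ON_READERS); /* readers tend to drain fast */
		return (SLEEP);			  /* queue on the sleep queue */
	}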
*/ if (x == (SX_LOCK_UNLOCKED | SX_LOCK_EXCLUSIVE_WAITERS)) { if (!atomic_fcmpset_acq_ptr(&sx->sx_lock, &x, tid | SX_LOCK_EXCLUSIVE_WAITERS)) goto retry_sleepq; sleepq_release(&sx->lock_object); CTR2(KTR_LOCK, "%s: %p claimed by new writer", __func__, sx); break; } /* * Try to set the SX_LOCK_EXCLUSIVE_WAITERS flag. If we fail, * then loop back and retry. */ if (!(x & SX_LOCK_EXCLUSIVE_WAITERS)) { if (!atomic_fcmpset_ptr(&sx->sx_lock, &x, x | SX_LOCK_EXCLUSIVE_WAITERS)) { goto retry_sleepq; } if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p set excl waiters flag", __func__, sx); } /* * Since we have been unable to acquire the exclusive * lock and the exclusive waiters flag is set, we have * to sleep. */ if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p blocking on sleep queue", __func__, sx); #ifdef KDTRACE_HOOKS sleep_time -= lockstat_nsecs(&sx->lock_object); #endif sleepq_add(&sx->lock_object, NULL, sx->lock_object.lo_name, SLEEPQ_SX | ((opts & SX_INTERRUPTIBLE) ? SLEEPQ_INTERRUPTIBLE : 0), SQ_EXCLUSIVE_QUEUE); if (!(opts & SX_INTERRUPTIBLE)) sleepq_wait(&sx->lock_object, 0); else error = sleepq_wait_sig(&sx->lock_object, 0); #ifdef KDTRACE_HOOKS sleep_time += lockstat_nsecs(&sx->lock_object); sleep_cnt++; #endif if (error) { if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: interruptible sleep by %p suspended by signal", __func__, sx); break; } if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p resuming from sleep queue", __func__, sx); x = SX_READ_VALUE(sx); } #if defined(KDTRACE_HOOKS) || defined(LOCK_PROFILING) if (__predict_true(!extra_work)) return (error); #endif #ifdef KDTRACE_HOOKS all_time += lockstat_nsecs(&sx->lock_object); if (sleep_time) LOCKSTAT_RECORD4(sx__block, sx, sleep_time, LOCKSTAT_WRITER, (state & SX_LOCK_SHARED) == 0, (state & SX_LOCK_SHARED) == 0 ? 0 : SX_SHARERS(state)); if (lda.spin_cnt > sleep_cnt) LOCKSTAT_RECORD4(sx__spin, sx, all_time - sleep_time, LOCKSTAT_WRITER, (state & SX_LOCK_SHARED) == 0, (state & SX_LOCK_SHARED) == 0 ? 0 : SX_SHARERS(state)); out_lockstat: #endif if (!error) LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx, contested, waittime, file, line, LOCKSTAT_WRITER); GIANT_RESTORE(); return (error); } /* * This function represents the so-called 'hard case' for sx_xunlock * operation. All 'easy case' failures are redirected to this. Note * that ideally this would be a static function, but it needs to be * accessible from at least sx.h. */ void _sx_xunlock_hard(struct sx *sx, uintptr_t x LOCK_FILE_LINE_ARG_DEF) { uintptr_t tid, setx; int queue, wakeup_swapper; if (SCHEDULER_STOPPED()) return; tid = (uintptr_t)curthread; if (__predict_false(x == tid)) x = SX_READ_VALUE(sx); MPASS(!(x & SX_LOCK_SHARED)); if (__predict_false(x & SX_LOCK_RECURSED)) { /* The lock is recursed, unrecurse one level. */ if ((--sx->sx_recurse) == 0) atomic_clear_ptr(&sx->sx_lock, SX_LOCK_RECURSED); if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p unrecursing", __func__, sx); return; } LOCKSTAT_PROFILE_RELEASE_RWLOCK(sx__release, sx, LOCKSTAT_WRITER); if (x == tid && atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED)) return; if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p contested", __func__, sx); sleepq_lock(&sx->lock_object); x = SX_READ_VALUE(sx); MPASS(x & (SX_LOCK_SHARED_WAITERS | SX_LOCK_EXCLUSIVE_WAITERS)); /* * The wake up algorithm here is quite simple and probably not * ideal. It gives precedence to shared waiters if they are * present.
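On the acquisition side, the SX_INTERRUPTIBLE branch above is what backs sx_xlock_sig(); a consumer treats the resulting error like any other interruptible-sleep error (sketch; the softc is hypothetical):

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/sx.h>

	struct queue_softc {
		struct sx	sc_lock;
	};

	static int
	queue_drain(struct queue_softc *sc)
	{
		int error;

		/* A posted signal aborts the sleep with EINTR or ERESTART. */
		error = sx_xlock_sig(&sc->sc_lock);
		if (error != 0)
			return (error);
		/* ... drain work protected by sc_lock ... */
		sx_xunlock(&sc->sc_lock);
		return (0);
	}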
For this condition, we have to preserve the * state of the exclusive waiters flag. * If interruptible sleeps left the shared queue empty avoid a * starvation for the threads sleeping on the exclusive queue by giving * them precedence and cleaning up the shared waiters bit anyway. */ setx = SX_LOCK_UNLOCKED; queue = SQ_EXCLUSIVE_QUEUE; if ((x & SX_LOCK_SHARED_WAITERS) != 0 && sleepq_sleepcnt(&sx->lock_object, SQ_SHARED_QUEUE) != 0) { queue = SQ_SHARED_QUEUE; setx |= (x & SX_LOCK_EXCLUSIVE_WAITERS); } atomic_store_rel_ptr(&sx->sx_lock, setx); /* Wake up all the waiters for the specific queue. */ if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR3(KTR_LOCK, "%s: %p waking up all threads on %s queue", __func__, sx, queue == SQ_SHARED_QUEUE ? "shared" : "exclusive"); wakeup_swapper = sleepq_broadcast(&sx->lock_object, SLEEPQ_SX, 0, queue); sleepq_release(&sx->lock_object); if (wakeup_swapper) kick_proc0(); } static bool __always_inline __sx_slock_try(struct sx *sx, uintptr_t *xp LOCK_FILE_LINE_ARG_DEF) { /* * If no other thread has an exclusive lock then try to bump up * the count of sharers. Since we have to preserve the state * of SX_LOCK_EXCLUSIVE_WAITERS, if we fail to acquire the * shared lock loop back and retry. */ while (*xp & SX_LOCK_SHARED) { MPASS(!(*xp & SX_LOCK_SHARED_WAITERS)); if (atomic_fcmpset_acq_ptr(&sx->sx_lock, xp, *xp + SX_ONE_SHARER)) { if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR4(KTR_LOCK, "%s: %p succeed %p -> %p", __func__, sx, (void *)*xp, (void *)(*xp + SX_ONE_SHARER)); return (true); } } return (false); } static int __noinline _sx_slock_hard(struct sx *sx, int opts, uintptr_t x LOCK_FILE_LINE_ARG_DEF) { GIANT_DECLARE; #ifdef ADAPTIVE_SX volatile struct thread *owner; bool adaptive; #endif #ifdef LOCK_PROFILING uint64_t waittime = 0; int contested = 0; #endif int error = 0; #if defined(ADAPTIVE_SX) || defined(KDTRACE_HOOKS) struct lock_delay_arg lda; #endif #ifdef KDTRACE_HOOKS u_int sleep_cnt = 0; int64_t sleep_time = 0; int64_t all_time = 0; #endif #if defined(KDTRACE_HOOKS) || defined(LOCK_PROFILING) uintptr_t state; #endif int extra_work = 0; #ifdef KDTRACE_HOOKS if (LOCKSTAT_PROFILE_ENABLED(sx__acquire)) { if (__sx_slock_try(sx, &x LOCK_FILE_LINE_ARG)) goto out_lockstat; extra_work = 1; all_time -= lockstat_nsecs(&sx->lock_object); state = x; } #endif #ifdef LOCK_PROFILING extra_work = 1; state = x; #endif if (SCHEDULER_STOPPED()) return (0); #if defined(ADAPTIVE_SX) lock_delay_arg_init(&lda, &sx_delay); #elif defined(KDTRACE_HOOKS) lock_delay_arg_init(&lda, NULL); #endif #ifdef ADAPTIVE_SX adaptive = ((sx->lock_object.lo_flags & SX_NOADAPTIVE) == 0); #endif #ifdef HWPMC_HOOKS PMC_SOFT_CALL( , , lock, failed); #endif lock_profile_obtain_lock_failed(&sx->lock_object, &contested, &waittime); #ifndef INVARIANTS GIANT_SAVE(extra_work); #endif /* * As with rwlocks, we don't make any attempt to try to block * shared locks once there is an exclusive waiter. */ for (;;) { if (__sx_slock_try(sx, &x LOCK_FILE_LINE_ARG)) break; #ifdef INVARIANTS GIANT_SAVE(extra_work); #endif #ifdef KDTRACE_HOOKS lda.spin_cnt++; #endif #ifdef ADAPTIVE_SX if (__predict_false(!adaptive)) goto sleepq; /* * If the owner is running on another CPU, spin until * the owner stops running or the state of the lock * changes. 
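__sx_slock_try() above keeps the sharer count in the lock word itself and bumps it with a single fcmpset. A userland analogue with C11 atomics under a toy encoding (the low bit marks shared mode and the count lives in higher bits; illustrative only, not the kernel encoding):

	#include <stdatomic.h>
	#include <stdint.h>

	#define	L_SHARED	0x1u	/* low bit: word is in shared mode */
	#define	L_ONE		0x10u	/* one sharer in the count field */

	/* Add one sharer; fails (returns 0) once a writer owns the word. */
	static int
	slock_try(_Atomic uintptr_t *lk, uintptr_t *v)
	{

		while (*v & L_SHARED) {
			if (atomic_compare_exchange_weak_explicit(lk, v,
			    *v + L_ONE, memory_order_acquire,
			    memory_order_relaxed))
				return (1);
			/* *v was refreshed by the failed exchange; retry. */
		}
		return (0);
	}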
*/ owner = lv_sx_owner(x); if (TD_IS_RUNNING(owner)) { if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR3(KTR_LOCK, "%s: spinning on %p held by %p", __func__, sx, owner); KTR_STATE1(KTR_SCHED, "thread", sched_tdname(curthread), "spinning", "lockname:\"%s\"", sx->lock_object.lo_name); do { lock_delay(&lda); x = SX_READ_VALUE(sx); owner = lv_sx_owner(x); } while (owner != NULL && TD_IS_RUNNING(owner)); KTR_STATE0(KTR_SCHED, "thread", sched_tdname(curthread), "running"); continue; } sleepq: #endif /* * Some other thread already has an exclusive lock, so * start the process of blocking. */ sleepq_lock(&sx->lock_object); x = SX_READ_VALUE(sx); retry_sleepq: /* * The lock could have been released while we spun. * In this case loop back and retry. */ if (x & SX_LOCK_SHARED) { sleepq_release(&sx->lock_object); continue; } #ifdef ADAPTIVE_SX /* * If the owner is running on another CPU, spin until * the owner stops running or the state of the lock * changes. */ if (!(x & SX_LOCK_SHARED) && adaptive) { owner = (struct thread *)SX_OWNER(x); if (TD_IS_RUNNING(owner)) { sleepq_release(&sx->lock_object); x = SX_READ_VALUE(sx); continue; } } #endif /* * Try to set the SX_LOCK_SHARED_WAITERS flag. If we * fail to set it drop the sleep queue lock and loop * back. */ if (!(x & SX_LOCK_SHARED_WAITERS)) { if (!atomic_fcmpset_ptr(&sx->sx_lock, &x, x | SX_LOCK_SHARED_WAITERS)) goto retry_sleepq; if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p set shared waiters flag", __func__, sx); } /* * Since we have been unable to acquire the shared lock, * we have to sleep. */ if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p blocking on sleep queue", __func__, sx); #ifdef KDTRACE_HOOKS sleep_time -= lockstat_nsecs(&sx->lock_object); #endif sleepq_add(&sx->lock_object, NULL, sx->lock_object.lo_name, SLEEPQ_SX | ((opts & SX_INTERRUPTIBLE) ? SLEEPQ_INTERRUPTIBLE : 0), SQ_SHARED_QUEUE); if (!(opts & SX_INTERRUPTIBLE)) sleepq_wait(&sx->lock_object, 0); else error = sleepq_wait_sig(&sx->lock_object, 0); #ifdef KDTRACE_HOOKS sleep_time += lockstat_nsecs(&sx->lock_object); sleep_cnt++; #endif if (error) { if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: interruptible sleep by %p suspended by signal", __func__, sx); break; } if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p resuming from sleep queue", __func__, sx); x = SX_READ_VALUE(sx); } #if defined(KDTRACE_HOOKS) || defined(LOCK_PROFILING) if (__predict_true(!extra_work)) return (error); #endif #ifdef KDTRACE_HOOKS all_time += lockstat_nsecs(&sx->lock_object); if (sleep_time) LOCKSTAT_RECORD4(sx__block, sx, sleep_time, LOCKSTAT_READER, (state & SX_LOCK_SHARED) == 0, (state & SX_LOCK_SHARED) == 0 ? 0 : SX_SHARERS(state)); if (lda.spin_cnt > sleep_cnt) LOCKSTAT_RECORD4(sx__spin, sx, all_time - sleep_time, LOCKSTAT_READER, (state & SX_LOCK_SHARED) == 0, (state & SX_LOCK_SHARED) == 0 ? 
0 : SX_SHARERS(state)); out_lockstat: #endif if (error == 0) { LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx, contested, waittime, file, line, LOCKSTAT_READER); } GIANT_RESTORE(); return (error); } int _sx_slock_int(struct sx *sx, int opts LOCK_FILE_LINE_ARG_DEF) { uintptr_t x; int error; KASSERT(kdb_active != 0 || SCHEDULER_STOPPED() || !TD_IS_IDLETHREAD(curthread), ("sx_slock() by idle thread %p on sx %s @ %s:%d", curthread, sx->lock_object.lo_name, file, line)); KASSERT(sx->sx_lock != SX_LOCK_DESTROYED, ("sx_slock() of destroyed sx @ %s:%d", file, line)); WITNESS_CHECKORDER(&sx->lock_object, LOP_NEWORDER, file, line, NULL); error = 0; x = SX_READ_VALUE(sx); if (__predict_false(LOCKSTAT_PROFILE_ENABLED(sx__acquire) || !__sx_slock_try(sx, &x LOCK_FILE_LINE_ARG))) error = _sx_slock_hard(sx, opts, x LOCK_FILE_LINE_ARG); else lock_profile_obtain_lock_success(&sx->lock_object, 0, 0, file, line); if (error == 0) { LOCK_LOG_LOCK("SLOCK", &sx->lock_object, 0, 0, file, line); WITNESS_LOCK(&sx->lock_object, 0, file, line); TD_LOCKS_INC(curthread); } return (error); } int _sx_slock(struct sx *sx, int opts, const char *file, int line) { return (_sx_slock_int(sx, opts LOCK_FILE_LINE_ARG)); } static bool __always_inline _sx_sunlock_try(struct sx *sx, uintptr_t *xp) { for (;;) { /* * We should never have sharers while at least one thread * holds a shared lock. */ KASSERT(!(*xp & SX_LOCK_SHARED_WAITERS), ("%s: waiting sharers", __func__)); /* * See if there is more than one shared lock held. If * so, just drop one and return. */ if (SX_SHARERS(*xp) > 1) { if (atomic_fcmpset_rel_ptr(&sx->sx_lock, xp, *xp - SX_ONE_SHARER)) { if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR4(KTR_LOCK, "%s: %p succeeded %p -> %p", __func__, sx, (void *)*xp, (void *)(*xp - SX_ONE_SHARER)); return (true); } continue; } /* * If there aren't any waiters for an exclusive lock, * then try to drop it quickly. */ if (!(*xp & SX_LOCK_EXCLUSIVE_WAITERS)) { MPASS(*xp == SX_SHARERS_LOCK(1)); *xp = SX_SHARERS_LOCK(1); if (atomic_fcmpset_rel_ptr(&sx->sx_lock, xp, SX_LOCK_UNLOCKED)) { if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p last succeeded", __func__, sx); return (true); } continue; } break; } return (false); } static void __noinline _sx_sunlock_hard(struct sx *sx, uintptr_t x LOCK_FILE_LINE_ARG_DEF) { int wakeup_swapper = 0; uintptr_t setx; if (SCHEDULER_STOPPED()) return; if (_sx_sunlock_try(sx, &x)) goto out_lockstat; /* * At this point, there should just be one sharer with * exclusive waiters. */ MPASS(x == (SX_SHARERS_LOCK(1) | SX_LOCK_EXCLUSIVE_WAITERS)); sleepq_lock(&sx->lock_object); x = SX_READ_VALUE(sx); for (;;) { MPASS(x & SX_LOCK_EXCLUSIVE_WAITERS); MPASS(!(x & SX_LOCK_SHARED_WAITERS)); if (_sx_sunlock_try(sx, &x)) break; /* * Wake up semantic here is quite simple: * Just wake up all the exclusive waiters. * Note that the state of the lock could have changed, * so if it fails loop back and retry. 
*/ setx = x - SX_ONE_SHARER; setx &= ~SX_LOCK_EXCLUSIVE_WAITERS; if (!atomic_fcmpset_rel_ptr(&sx->sx_lock, &x, setx)) continue; if (LOCK_LOG_TEST(&sx->lock_object, 0)) CTR2(KTR_LOCK, "%s: %p waking up all threads on " "exclusive queue", __func__, sx); wakeup_swapper = sleepq_broadcast(&sx->lock_object, SLEEPQ_SX, 0, SQ_EXCLUSIVE_QUEUE); break; } sleepq_release(&sx->lock_object); if (wakeup_swapper) kick_proc0(); out_lockstat: LOCKSTAT_PROFILE_RELEASE_RWLOCK(sx__release, sx, LOCKSTAT_READER); } void _sx_sunlock_int(struct sx *sx LOCK_FILE_LINE_ARG_DEF) { uintptr_t x; KASSERT(sx->sx_lock != SX_LOCK_DESTROYED, ("sx_sunlock() of destroyed sx @ %s:%d", file, line)); _sx_assert(sx, SA_SLOCKED, file, line); WITNESS_UNLOCK(&sx->lock_object, 0, file, line); LOCK_LOG_LOCK("SUNLOCK", &sx->lock_object, 0, 0, file, line); x = SX_READ_VALUE(sx); if (__predict_false(LOCKSTAT_PROFILE_ENABLED(sx__release) || !_sx_sunlock_try(sx, &x))) _sx_sunlock_hard(sx, x LOCK_FILE_LINE_ARG); else lock_profile_release_lock(&sx->lock_object); TD_LOCKS_DEC(curthread); } void _sx_sunlock(struct sx *sx, const char *file, int line) { _sx_sunlock_int(sx LOCK_FILE_LINE_ARG); } #ifdef INVARIANT_SUPPORT #ifndef INVARIANTS #undef _sx_assert #endif /* * In the non-WITNESS case, sx_assert() can only detect that at least * *some* thread owns an slock, but it cannot guarantee that *this* * thread owns an slock. */ void _sx_assert(const struct sx *sx, int what, const char *file, int line) { #ifndef WITNESS int slocked = 0; #endif if (panicstr != NULL) return; switch (what) { case SA_SLOCKED: case SA_SLOCKED | SA_NOTRECURSED: case SA_SLOCKED | SA_RECURSED: #ifndef WITNESS slocked = 1; /* FALLTHROUGH */ #endif case SA_LOCKED: case SA_LOCKED | SA_NOTRECURSED: case SA_LOCKED | SA_RECURSED: #ifdef WITNESS witness_assert(&sx->lock_object, what, file, line); #else /* * If some other thread has an exclusive lock or we * have one and are asserting a shared lock, fail. * Also, if no one has a lock at all, fail. */ if (sx->sx_lock == SX_LOCK_UNLOCKED || (!(sx->sx_lock & SX_LOCK_SHARED) && (slocked || sx_xholder(sx) != curthread))) panic("Lock %s not %slocked @ %s:%d\n", sx->lock_object.lo_name, slocked ? "share " : "", file, line); if (!(sx->sx_lock & SX_LOCK_SHARED)) { if (sx_recursed(sx)) { if (what & SA_NOTRECURSED) panic("Lock %s recursed @ %s:%d\n", sx->lock_object.lo_name, file, line); } else if (what & SA_RECURSED) panic("Lock %s not recursed @ %s:%d\n", sx->lock_object.lo_name, file, line); } #endif break; case SA_XLOCKED: case SA_XLOCKED | SA_NOTRECURSED: case SA_XLOCKED | SA_RECURSED: if (sx_xholder(sx) != curthread) panic("Lock %s not exclusively locked @ %s:%d\n", sx->lock_object.lo_name, file, line); if (sx_recursed(sx)) { if (what & SA_NOTRECURSED) panic("Lock %s recursed @ %s:%d\n", sx->lock_object.lo_name, file, line); } else if (what & SA_RECURSED) panic("Lock %s not recursed @ %s:%d\n", sx->lock_object.lo_name, file, line); break; case SA_UNLOCKED: #ifdef WITNESS witness_assert(&sx->lock_object, what, file, line); #else /* * If we hold an exclusive lock, fail. We can't * reliably check to see if we hold a shared lock or * not.
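In consumer code such an assertion is conventionally the first statement of any function with a locking contract (sketch; the softc is hypothetical):

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/sx.h>

	struct vdev_softc {
		struct sx	sc_lock;
	};

	static void
	vdev_sync(struct vdev_softc *sc)
	{

		sx_assert(&sc->sc_lock, SA_XLOCKED);	/* locking contract */
		/* ... mutate state protected by sc_lock ... */
	}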
*/ if (sx_xholder(sx) == curthread) panic("Lock %s exclusively locked @ %s:%d\n", sx->lock_object.lo_name, file, line); #endif break; default: panic("Unknown sx lock assertion: %d @ %s:%d", what, file, line); } } #endif /* INVARIANT_SUPPORT */ #ifdef DDB static void db_show_sx(const struct lock_object *lock) { struct thread *td; const struct sx *sx; sx = (const struct sx *)lock; db_printf(" state: "); if (sx->sx_lock == SX_LOCK_UNLOCKED) db_printf("UNLOCKED\n"); else if (sx->sx_lock == SX_LOCK_DESTROYED) { db_printf("DESTROYED\n"); return; } else if (sx->sx_lock & SX_LOCK_SHARED) db_printf("SLOCK: %ju\n", (uintmax_t)SX_SHARERS(sx->sx_lock)); else { td = sx_xholder(sx); db_printf("XLOCK: %p (tid %d, pid %d, \"%s\")\n", td, td->td_tid, td->td_proc->p_pid, td->td_name); if (sx_recursed(sx)) db_printf(" recursed: %d\n", sx->sx_recurse); } db_printf(" waiters: "); switch(sx->sx_lock & (SX_LOCK_SHARED_WAITERS | SX_LOCK_EXCLUSIVE_WAITERS)) { case SX_LOCK_SHARED_WAITERS: db_printf("shared\n"); break; case SX_LOCK_EXCLUSIVE_WAITERS: db_printf("exclusive\n"); break; case SX_LOCK_SHARED_WAITERS | SX_LOCK_EXCLUSIVE_WAITERS: db_printf("exclusive and shared\n"); break; default: db_printf("none\n"); } } /* * Check to see if a thread that is blocked on a sleep queue is actually * blocked on an sx lock. If so, output some details and return true. * If the lock has an exclusive owner, return that in *ownerp. */ int sx_chain(struct thread *td, struct thread **ownerp) { struct sx *sx; /* * Check to see if this thread is blocked on an sx lock. * First, we check the lock class. If that is ok, then we * compare the lock name against the wait message. */ sx = td->td_wchan; if (LOCK_CLASS(&sx->lock_object) != &lock_class_sx || sx->lock_object.lo_name != td->td_wmesg) return (0); /* We think we have an sx lock, so output some details. */ db_printf("blocked on sx \"%s\" ", td->td_wmesg); *ownerp = sx_xholder(sx); if (sx->sx_lock & SX_LOCK_SHARED) db_printf("SLOCK (count %ju)\n", (uintmax_t)SX_SHARERS(sx->sx_lock)); else db_printf("XLOCK\n"); return (1); } #endif Index: user/markj/netdump/sys/kern/subr_witness.c =================================================================== --- user/markj/netdump/sys/kern/subr_witness.c (revision 332407) +++ user/markj/netdump/sys/kern/subr_witness.c (revision 332408) @@ -1,3071 +1,3071 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 2008 Isilon Systems, Inc. * Copyright (c) 2008 Ilya Maykov * Copyright (c) 1998 Berkeley Software Design, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Berkeley Software Design Inc's name may not be used to endorse or * promote products derived from this software without specific prior * written permission. * * THIS SOFTWARE IS PROVIDED BY BERKELEY SOFTWARE DESIGN INC ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL BERKELEY SOFTWARE DESIGN INC BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * from BSDI $Id: mutex_witness.c,v 1.1.2.20 2000/04/27 03:10:27 cp Exp $ * and BSDI $Id: synch_machdep.c,v 2.3.2.39 2000/04/27 03:10:25 cp Exp $ */ /* * Implementation of the `witness' lock verifier. Originally implemented for * mutexes in BSD/OS. Extended to handle generic lock objects and lock * classes in FreeBSD. */ /* * Main Entry: witness * Pronunciation: 'wit-n&s * Function: noun * Etymology: Middle English witnesse, from Old English witnes knowledge, * testimony, witness, from 2wit * Date: before 12th century * 1 : attestation of a fact or event : TESTIMONY * 2 : one that gives evidence; specifically : one who testifies in * a cause or before a judicial tribunal * 3 : one asked to be present at a transaction so as to be able to * testify to its having taken place * 4 : one who has personal knowledge of something * 5 a : something serving as evidence or proof : SIGN * b : public affirmation by word or example of usually * religious faith or conviction * 6 capitalized : a member of the Jehovah's Witnesses */ /* * Special rules concerning Giant and lock orders: * * 1) Giant must be acquired before any other mutexes. Stated another way, * no other mutex may be held when Giant is acquired. * * 2) Giant must be released when blocking on a sleepable lock. * * This rule is less obvious, but is a result of Giant providing the same * semantics as spl(). Basically, when a thread sleeps, it must release * Giant. When a thread blocks on a sleepable lock, it sleeps. Hence rule * 2). * * 3) Giant may be acquired before or after sleepable locks. * * This rule is also not quite as obvious. Giant may be acquired after * a sleepable lock because it is a non-sleepable lock and non-sleepable * locks may always be acquired while holding a sleepable lock. The second * case, Giant before a sleepable lock, follows from rule 2) above. Suppose * you have two threads T1 and T2 and a sleepable lock X. Suppose that T1 * acquires X and blocks on Giant. Then suppose that T2 acquires Giant and * blocks on X. When T2 blocks on X, T2 will release Giant allowing T1 to * execute. Thus, acquiring Giant both before and after a sleepable lock * will not result in a lock order reversal. */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_hwpmc_hooks.h" #include "opt_stack.h" #include "opt_witness.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DDB #include #endif #include #if !defined(DDB) && !defined(STACK) #error "DDB or STACK options are required for WITNESS" #endif /* Note that these traces do not work with KTR_ALQ. */ #if 0 #define KTR_WITNESS KTR_SUBSYS #else #define KTR_WITNESS 0 #endif #define LI_RECURSEMASK 0x0000ffff /* Recursion depth of lock instance. */ #define LI_EXCLUSIVE 0x00010000 /* Exclusive lock instance. */ #define LI_NORELEASE 0x00020000 /* Lock not allowed to be released. 
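Rule 3) from the Giant discussion above can be made concrete: both acquisition orders below are accepted by witness, because a thread that blocks on the sleepable lock drops Giant first per rule 2). A sketch; some_sx is hypothetical:

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/mutex.h>
	#include <sys/sx.h>

	static struct sx some_sx;

	static void
	giant_order_demo(void)
	{

		/* Order A: Giant first, then the sleepable lock. */
		mtx_lock(&Giant);
		sx_xlock(&some_sx);	/* blocking here releases Giant (rule 2) */
		sx_xunlock(&some_sx);
		mtx_unlock(&Giant);

		/* Order B: sleepable lock first, then Giant. */
		sx_xlock(&some_sx);
		mtx_lock(&Giant);	/* non-sleepable after sleepable: rule 3 */
		mtx_unlock(&Giant);
		sx_xunlock(&some_sx);
	}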
*/ /* Define this to check for blessed mutexes */ #undef BLESSING #ifndef WITNESS_COUNT #define WITNESS_COUNT 1536 #endif #define WITNESS_HASH_SIZE 251 /* Prime, gives load factor < 2 */ #define WITNESS_PENDLIST (512 + (MAXCPU * 4)) /* Allocate 256 KB of stack data space */ #define WITNESS_LO_DATA_COUNT 2048 /* Prime, gives load factor of ~2 at full load */ #define WITNESS_LO_HASH_SIZE 1021 /* * XXX: This is somewhat bogus, as we assume here that at most 2048 threads * will hold LOCK_NCHILDREN locks. We handle failure ok, and we should * probably be safe for the most part, but it's still a SWAG. */ #define LOCK_NCHILDREN 5 #define LOCK_CHILDCOUNT 2048 #define MAX_W_NAME 64 #define FULLGRAPH_SBUF_SIZE 512 /* * These flags go in the witness relationship matrix and describe the * relationship between any two struct witness objects. */ #define WITNESS_UNRELATED 0x00 /* No lock order relation. */ #define WITNESS_PARENT 0x01 /* Parent, aka direct ancestor. */ #define WITNESS_ANCESTOR 0x02 /* Direct or indirect ancestor. */ #define WITNESS_CHILD 0x04 /* Child, aka direct descendant. */ #define WITNESS_DESCENDANT 0x08 /* Direct or indirect descendant. */ #define WITNESS_ANCESTOR_MASK (WITNESS_PARENT | WITNESS_ANCESTOR) #define WITNESS_DESCENDANT_MASK (WITNESS_CHILD | WITNESS_DESCENDANT) #define WITNESS_RELATED_MASK \ (WITNESS_ANCESTOR_MASK | WITNESS_DESCENDANT_MASK) #define WITNESS_REVERSAL 0x10 /* A lock order reversal has been * observed. */ #define WITNESS_RESERVED1 0x20 /* Unused flag, reserved. */ #define WITNESS_RESERVED2 0x40 /* Unused flag, reserved. */ #define WITNESS_LOCK_ORDER_KNOWN 0x80 /* This lock order is known. */ /* Descendant to ancestor flags */ #define WITNESS_DTOA(x) (((x) & WITNESS_RELATED_MASK) >> 2) /* Ancestor to descendant flags */ #define WITNESS_ATOD(x) (((x) & WITNESS_RELATED_MASK) << 2) #define WITNESS_INDEX_ASSERT(i) \ MPASS((i) > 0 && (i) <= w_max_used_index && (i) < witness_count) static MALLOC_DEFINE(M_WITNESS, "Witness", "Witness"); /* * Lock instances. A lock instance is the data associated with a lock while * it is held by witness. For example, a lock instance will hold the * recursion count of a lock. Lock instances are held in lists. Spin locks * are held in a per-cpu list while sleep locks are held in per-thread list. */ struct lock_instance { struct lock_object *li_lock; const char *li_file; int li_line; u_int li_flags; }; /* * A simple list type used to build the list of locks held by a thread * or CPU. We can't simply embed the list in struct lock_object since a * lock may be held by more than one thread if it is a shared lock. Locks * are added to the head of the list, so we fill up each list entry from * "the back" logically. To ease some of the arithmetic, we actually fill * in each list entry the normal way (children[0] then children[1], etc.) but * when we traverse the list we read children[count-1] as the first entry * down to children[0] as the final entry. */ struct lock_list_entry { struct lock_list_entry *ll_next; struct lock_instance ll_children[LOCK_NCHILDREN]; u_int ll_count; }; /* * The main witness structure. One of these per named lock type in the system * (for example, "vnode interlock"). */ struct witness { char w_name[MAX_W_NAME]; uint32_t w_index; /* Index in the relationship matrix */ struct lock_class *w_class; STAILQ_ENTRY(witness) w_list; /* List of all witnesses. */ STAILQ_ENTRY(witness) w_typelist; /* Witnesses of a type. */ struct witness *w_hash_next; /* Linked list in hash buckets. 
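The back-to-front filling convention for ll_children[] described above is easiest to see as a traversal; this sketch, using the structures defined above (the visit callback is hypothetical), yields held locks most recently acquired first:

	static void
	walk_held_locks(struct lock_list_entry *lle,
	    void (*visit)(struct lock_instance *))
	{
		u_int i;

		for (; lle != NULL; lle = lle->ll_next)
			for (i = lle->ll_count; i > 0; i--)
				visit(&lle->ll_children[i - 1]);
	}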
*/ const char *w_file; /* File where last acquired */ uint32_t w_line; /* Line where last acquired */ uint32_t w_refcount; uint16_t w_num_ancestors; /* direct/indirect * ancestor count */ uint16_t w_num_descendants; /* direct/indirect * descendant count */ int16_t w_ddb_level; unsigned w_displayed:1; unsigned w_reversed:1; }; STAILQ_HEAD(witness_list, witness); /* * The witness hash table. Keys are witness names (const char *), elements are * witness objects (struct witness *). */ struct witness_hash { struct witness *wh_array[WITNESS_HASH_SIZE]; uint32_t wh_size; uint32_t wh_count; }; /* * Key type for the lock order data hash table. */ struct witness_lock_order_key { uint16_t from; uint16_t to; }; struct witness_lock_order_data { struct stack wlod_stack; struct witness_lock_order_key wlod_key; struct witness_lock_order_data *wlod_next; }; /* * The witness lock order data hash table. Keys are witness index tuples * (struct witness_lock_order_key), elements are lock order data objects * (struct witness_lock_order_data). */ struct witness_lock_order_hash { struct witness_lock_order_data *wloh_array[WITNESS_LO_HASH_SIZE]; u_int wloh_size; u_int wloh_count; }; #ifdef BLESSING struct witness_blessed { const char *b_lock1; const char *b_lock2; }; #endif struct witness_pendhelp { const char *wh_type; struct lock_object *wh_lock; }; struct witness_order_list_entry { const char *w_name; struct lock_class *w_class; }; /* * Returns 0 if one of the locks is a spin lock and the other is not. * Returns 1 otherwise. */ static __inline int witness_lock_type_equal(struct witness *w1, struct witness *w2) { return ((w1->w_class->lc_flags & (LC_SLEEPLOCK | LC_SPINLOCK)) == (w2->w_class->lc_flags & (LC_SLEEPLOCK | LC_SPINLOCK))); } static __inline int witness_lock_order_key_equal(const struct witness_lock_order_key *a, const struct witness_lock_order_key *b) { return (a->from == b->from && a->to == b->to); } static int _isitmyx(struct witness *w1, struct witness *w2, int rmask, const char *fname); static void adopt(struct witness *parent, struct witness *child); #ifdef BLESSING static int blessed(struct witness *, struct witness *); #endif static void depart(struct witness *w); static struct witness *enroll(const char *description, struct lock_class *lock_class); static struct lock_instance *find_instance(struct lock_list_entry *list, const struct lock_object *lock); static int isitmychild(struct witness *parent, struct witness *child); static int isitmydescendant(struct witness *parent, struct witness *child); static void itismychild(struct witness *parent, struct witness *child); static int sysctl_debug_witness_badstacks(SYSCTL_HANDLER_ARGS); static int sysctl_debug_witness_watch(SYSCTL_HANDLER_ARGS); static int sysctl_debug_witness_fullgraph(SYSCTL_HANDLER_ARGS); static int sysctl_debug_witness_channel(SYSCTL_HANDLER_ARGS); static void witness_add_fullgraph(struct sbuf *sb, struct witness *parent); #ifdef DDB static void witness_ddb_compute_levels(void); static void witness_ddb_display(int(*)(const char *fmt, ...)); static void witness_ddb_display_descendants(int(*)(const char *fmt, ...), struct witness *, int indent); static void witness_ddb_display_list(int(*prnt)(const char *fmt, ...), struct witness_list *list); static void witness_ddb_level_descendants(struct witness *parent, int l); static void witness_ddb_list(struct thread *td); #endif static void witness_debugger(int cond, const char *msg); static void witness_free(struct witness *m); static struct witness *witness_get(void); static uint32_t 
witness_hash_djb2(const uint8_t *key, uint32_t size); static struct witness *witness_hash_get(const char *key); static void witness_hash_put(struct witness *w); static void witness_init_hash_tables(void); static void witness_increment_graph_generation(void); static void witness_lock_list_free(struct lock_list_entry *lle); static struct lock_list_entry *witness_lock_list_get(void); static int witness_lock_order_add(struct witness *parent, struct witness *child); static int witness_lock_order_check(struct witness *parent, struct witness *child); static struct witness_lock_order_data *witness_lock_order_get( struct witness *parent, struct witness *child); static void witness_list_lock(struct lock_instance *instance, int (*prnt)(const char *fmt, ...)); static int witness_output(const char *fmt, ...) __printflike(1, 2); static int witness_voutput(const char *fmt, va_list ap) __printflike(1, 0); static void witness_setflag(struct lock_object *lock, int flag, int set); static SYSCTL_NODE(_debug, OID_AUTO, witness, CTLFLAG_RW, NULL, "Witness Locking"); /* * If set to 0, lock order checking is disabled. If set to -1, * witness is completely disabled. Otherwise witness performs full * lock order checking for all locks. At runtime, lock order checking * may be toggled. However, witness cannot be reenabled once it is * completely disabled. */ static int witness_watch = 1; SYSCTL_PROC(_debug_witness, OID_AUTO, watch, CTLFLAG_RWTUN | CTLTYPE_INT, NULL, 0, sysctl_debug_witness_watch, "I", "witness is watching lock operations"); #ifdef KDB /* * When KDB is enabled and witness_kdb is 1, it will cause the system * to drop into kdebug() when: * - a lock hierarchy violation occurs * - locks are held when going to sleep. */ #ifdef WITNESS_KDB int witness_kdb = 1; #else int witness_kdb = 0; #endif SYSCTL_INT(_debug_witness, OID_AUTO, kdb, CTLFLAG_RWTUN, &witness_kdb, 0, ""); #endif /* KDB */ #if defined(DDB) || defined(KDB) /* * When DDB or KDB is enabled and witness_trace is 1, it will cause the system * to print a stack trace: * - a lock hierarchy violation occurs * - locks are held when going to sleep. */ int witness_trace = 1; SYSCTL_INT(_debug_witness, OID_AUTO, trace, CTLFLAG_RWTUN, &witness_trace, 0, ""); #endif /* DDB || KDB */ #ifdef WITNESS_SKIPSPIN int witness_skipspin = 1; #else int witness_skipspin = 0; #endif SYSCTL_INT(_debug_witness, OID_AUTO, skipspin, CTLFLAG_RDTUN, &witness_skipspin, 0, ""); int badstack_sbuf_size; int witness_count = WITNESS_COUNT; SYSCTL_INT(_debug_witness, OID_AUTO, witness_count, CTLFLAG_RDTUN, &witness_count, 0, ""); /* * Output channel for witness messages. By default we print to the console. */ enum witness_channel { WITNESS_CONSOLE, WITNESS_LOG, WITNESS_NONE, }; static enum witness_channel witness_channel = WITNESS_CONSOLE; SYSCTL_PROC(_debug_witness, OID_AUTO, output_channel, CTLTYPE_STRING | CTLFLAG_RWTUN, NULL, 0, sysctl_debug_witness_channel, "A", "Output channel for warnings"); /* * Call this to print out the relations between locks. */ SYSCTL_PROC(_debug_witness, OID_AUTO, fullgraph, CTLTYPE_STRING | CTLFLAG_RD, NULL, 0, sysctl_debug_witness_fullgraph, "A", "Show locks relation graphs"); /* * Call this to print out the witness faulty stacks. 
*/ SYSCTL_PROC(_debug_witness, OID_AUTO, badstacks, CTLTYPE_STRING | CTLFLAG_RD, NULL, 0, sysctl_debug_witness_badstacks, "A", "Show bad witness stacks"); static struct mtx w_mtx; /* w_list */ static struct witness_list w_free = STAILQ_HEAD_INITIALIZER(w_free); static struct witness_list w_all = STAILQ_HEAD_INITIALIZER(w_all); /* w_typelist */ static struct witness_list w_spin = STAILQ_HEAD_INITIALIZER(w_spin); static struct witness_list w_sleep = STAILQ_HEAD_INITIALIZER(w_sleep); /* lock list */ static struct lock_list_entry *w_lock_list_free = NULL; static struct witness_pendhelp pending_locks[WITNESS_PENDLIST]; static u_int pending_cnt; static int w_free_cnt, w_spin_cnt, w_sleep_cnt; SYSCTL_INT(_debug_witness, OID_AUTO, free_cnt, CTLFLAG_RD, &w_free_cnt, 0, ""); SYSCTL_INT(_debug_witness, OID_AUTO, spin_cnt, CTLFLAG_RD, &w_spin_cnt, 0, ""); SYSCTL_INT(_debug_witness, OID_AUTO, sleep_cnt, CTLFLAG_RD, &w_sleep_cnt, 0, ""); static struct witness *w_data; static uint8_t **w_rmatrix; static struct lock_list_entry w_locklistdata[LOCK_CHILDCOUNT]; static struct witness_hash w_hash; /* The witness hash table. */ /* The lock order data hash */ static struct witness_lock_order_data w_lodata[WITNESS_LO_DATA_COUNT]; static struct witness_lock_order_data *w_lofree = NULL; static struct witness_lock_order_hash w_lohash; static int w_max_used_index = 0; static unsigned int w_generation = 0; static const char w_notrunning[] = "Witness not running\n"; static const char w_stillcold[] = "Witness is still cold\n"; static struct witness_order_list_entry order_lists[] = { /* * sx locks */ { "proctree", &lock_class_sx }, { "allproc", &lock_class_sx }, { "allprison", &lock_class_sx }, { NULL, NULL }, /* * Various mutexes */ { "Giant", &lock_class_mtx_sleep }, { "pipe mutex", &lock_class_mtx_sleep }, { "sigio lock", &lock_class_mtx_sleep }, { "process group", &lock_class_mtx_sleep }, { "process lock", &lock_class_mtx_sleep }, { "session", &lock_class_mtx_sleep }, { "uidinfo hash", &lock_class_rw }, #ifdef HWPMC_HOOKS { "pmc-sleep", &lock_class_mtx_sleep }, #endif { "time lock", &lock_class_mtx_sleep }, { NULL, NULL }, /* * umtx */ { "umtx lock", &lock_class_mtx_sleep }, { NULL, NULL }, /* * Sockets */ { "accept", &lock_class_mtx_sleep }, { "so_snd", &lock_class_mtx_sleep }, { "so_rcv", &lock_class_mtx_sleep }, { "sellck", &lock_class_mtx_sleep }, { NULL, NULL }, /* * Routing */ { "so_rcv", &lock_class_mtx_sleep }, { "radix node head", &lock_class_rw }, { "rtentry", &lock_class_mtx_sleep }, { "ifaddr", &lock_class_mtx_sleep }, { NULL, NULL }, /* * IPv4 multicast: * protocol locks before interface locks, after UDP locks. */ { "udpinp", &lock_class_rw }, { "in_multi_mtx", &lock_class_mtx_sleep }, { "igmp_mtx", &lock_class_mtx_sleep }, { "if_addr_lock", &lock_class_rw }, { NULL, NULL }, /* * IPv6 multicast: * protocol locks before interface locks, after UDP locks. 
*/ { "udpinp", &lock_class_rw }, { "in6_multi_mtx", &lock_class_mtx_sleep }, { "mld_mtx", &lock_class_mtx_sleep }, { "if_addr_lock", &lock_class_rw }, { NULL, NULL }, /* * UNIX Domain Sockets */ { "unp_link_rwlock", &lock_class_rw }, { "unp_list_lock", &lock_class_mtx_sleep }, { "unp", &lock_class_mtx_sleep }, { "so_snd", &lock_class_mtx_sleep }, { NULL, NULL }, /* * UDP/IP */ { "udp", &lock_class_rw }, { "udpinp", &lock_class_rw }, { "so_snd", &lock_class_mtx_sleep }, { NULL, NULL }, /* * TCP/IP */ { "tcp", &lock_class_rw }, { "tcpinp", &lock_class_rw }, { "so_snd", &lock_class_mtx_sleep }, { NULL, NULL }, /* * BPF */ - { "bpf global lock", &lock_class_mtx_sleep }, + { "bpf global lock", &lock_class_sx }, { "bpf interface lock", &lock_class_rw }, { "bpf cdev lock", &lock_class_mtx_sleep }, { NULL, NULL }, /* * NFS server */ { "nfsd_mtx", &lock_class_mtx_sleep }, { "so_snd", &lock_class_mtx_sleep }, { NULL, NULL }, /* * IEEE 802.11 */ { "802.11 com lock", &lock_class_mtx_sleep}, { NULL, NULL }, /* * Network drivers */ { "network driver", &lock_class_mtx_sleep}, { NULL, NULL }, /* * Netgraph */ { "ng_node", &lock_class_mtx_sleep }, { "ng_worklist", &lock_class_mtx_sleep }, { NULL, NULL }, /* * CDEV */ { "vm map (system)", &lock_class_mtx_sleep }, { "vm pagequeue", &lock_class_mtx_sleep }, { "vnode interlock", &lock_class_mtx_sleep }, { "cdev", &lock_class_mtx_sleep }, { NULL, NULL }, /* * VM */ { "vm map (user)", &lock_class_sx }, { "vm object", &lock_class_rw }, { "vm page", &lock_class_mtx_sleep }, { "vm pagequeue", &lock_class_mtx_sleep }, { "pmap pv global", &lock_class_rw }, { "pmap", &lock_class_mtx_sleep }, { "pmap pv list", &lock_class_rw }, { "vm page free queue", &lock_class_mtx_sleep }, { NULL, NULL }, /* * kqueue/VFS interaction */ { "kqueue", &lock_class_mtx_sleep }, { "struct mount mtx", &lock_class_mtx_sleep }, { "vnode interlock", &lock_class_mtx_sleep }, { NULL, NULL }, /* * VFS namecache */ { "ncvn", &lock_class_mtx_sleep }, { "ncbuc", &lock_class_rw }, { "vnode interlock", &lock_class_mtx_sleep }, { "ncneg", &lock_class_mtx_sleep }, { NULL, NULL }, /* * ZFS locking */ { "dn->dn_mtx", &lock_class_sx }, { "dr->dt.di.dr_mtx", &lock_class_sx }, { "db->db_mtx", &lock_class_sx }, { NULL, NULL }, /* * TCP log locks */ { "TCP ID tree", &lock_class_rw }, { "tcp log id bucket", &lock_class_mtx_sleep }, { "tcpinp", &lock_class_rw }, { "TCP log expireq", &lock_class_mtx_sleep }, { NULL, NULL }, /* * spin locks */ #ifdef SMP { "ap boot", &lock_class_mtx_spin }, #endif { "rm.mutex_mtx", &lock_class_mtx_spin }, { "sio", &lock_class_mtx_spin }, #ifdef __i386__ { "cy", &lock_class_mtx_spin }, #endif #ifdef __sparc64__ { "pcib_mtx", &lock_class_mtx_spin }, { "rtc_mtx", &lock_class_mtx_spin }, #endif { "scc_hwmtx", &lock_class_mtx_spin }, { "uart_hwmtx", &lock_class_mtx_spin }, { "fast_taskqueue", &lock_class_mtx_spin }, { "intr table", &lock_class_mtx_spin }, #ifdef HWPMC_HOOKS { "pmc-per-proc", &lock_class_mtx_spin }, #endif { "process slock", &lock_class_mtx_spin }, { "syscons video lock", &lock_class_mtx_spin }, { "sleepq chain", &lock_class_mtx_spin }, { "rm_spinlock", &lock_class_mtx_spin }, { "turnstile chain", &lock_class_mtx_spin }, { "turnstile lock", &lock_class_mtx_spin }, { "sched lock", &lock_class_mtx_spin }, { "td_contested", &lock_class_mtx_spin }, { "callout", &lock_class_mtx_spin }, { "entropy harvest mutex", &lock_class_mtx_spin }, #ifdef SMP { "smp rendezvous", &lock_class_mtx_spin }, #endif #ifdef __powerpc__ { "tlb0", &lock_class_mtx_spin }, #endif /* * leaf locks */ 
{ "intrcnt", &lock_class_mtx_spin }, { "icu", &lock_class_mtx_spin }, #if defined(SMP) && defined(__sparc64__) { "ipi", &lock_class_mtx_spin }, #endif #ifdef __i386__ { "allpmaps", &lock_class_mtx_spin }, { "descriptor tables", &lock_class_mtx_spin }, #endif { "clk", &lock_class_mtx_spin }, { "cpuset", &lock_class_mtx_spin }, { "mprof lock", &lock_class_mtx_spin }, { "zombie lock", &lock_class_mtx_spin }, { "ALD Queue", &lock_class_mtx_spin }, #if defined(__i386__) || defined(__amd64__) { "pcicfg", &lock_class_mtx_spin }, { "NDIS thread lock", &lock_class_mtx_spin }, #endif { "tw_osl_io_lock", &lock_class_mtx_spin }, { "tw_osl_q_lock", &lock_class_mtx_spin }, { "tw_cl_io_lock", &lock_class_mtx_spin }, { "tw_cl_intr_lock", &lock_class_mtx_spin }, { "tw_cl_gen_lock", &lock_class_mtx_spin }, #ifdef HWPMC_HOOKS { "pmc-leaf", &lock_class_mtx_spin }, #endif { "blocked lock", &lock_class_mtx_spin }, { NULL, NULL }, { NULL, NULL } }; #ifdef BLESSING /* * Pairs of locks which have been blessed * Don't complain about order problems with blessed locks */ static struct witness_blessed blessed_list[] = { }; #endif /* * This global is set to 0 once it becomes safe to use the witness code. */ static int witness_cold = 1; /* * This global is set to 1 once the static lock orders have been enrolled * so that a warning can be issued for any spin locks enrolled later. */ static int witness_spin_warn = 0; /* Trim useless garbage from filenames. */ static const char * fixup_filename(const char *file) { if (file == NULL) return (NULL); while (strncmp(file, "../", 3) == 0) file += 3; return (file); } /* * Calculate the size of early witness structures. */ int witness_startup_count(void) { int sz; sz = sizeof(struct witness) * witness_count; sz += sizeof(*w_rmatrix) * (witness_count + 1); sz += sizeof(*w_rmatrix[0]) * (witness_count + 1) * (witness_count + 1); return (sz); } /* * The WITNESS-enabled diagnostic code. Note that the witness code does * assume that the early boot is single-threaded at least until after this * routine is completed. */ void witness_startup(void *mem) { struct lock_object *lock; struct witness_order_list_entry *order; struct witness *w, *w1; uintptr_t p; int i; p = (uintptr_t)mem; w_data = (void *)p; p += sizeof(struct witness) * witness_count; w_rmatrix = (void *)p; p += sizeof(*w_rmatrix) * (witness_count + 1); for (i = 0; i < witness_count + 1; i++) { w_rmatrix[i] = (void *)p; p += sizeof(*w_rmatrix[i]) * (witness_count + 1); } badstack_sbuf_size = witness_count * 256; /* * We have to release Giant before initializing its witness * structure so that WITNESS doesn't get confused. */ mtx_unlock(&Giant); mtx_assert(&Giant, MA_NOTOWNED); CTR1(KTR_WITNESS, "%s: initializing witness", __func__); mtx_init(&w_mtx, "witness lock", NULL, MTX_SPIN | MTX_QUIET | MTX_NOWITNESS | MTX_NOPROFILE); for (i = witness_count - 1; i >= 0; i--) { w = &w_data[i]; memset(w, 0, sizeof(*w)); w_data[i].w_index = i; /* Witness index never changes. */ witness_free(w); } KASSERT(STAILQ_FIRST(&w_free)->w_index == 0, ("%s: Invalid list of free witness objects", __func__)); /* Witness with index 0 is not used to aid in debugging. */ STAILQ_REMOVE_HEAD(&w_free, w_list); w_free_cnt--; for (i = 0; i < witness_count; i++) { memset(w_rmatrix[i], 0, sizeof(*w_rmatrix[i]) * (witness_count + 1)); } for (i = 0; i < LOCK_CHILDCOUNT; i++) witness_lock_list_free(&w_locklistdata[i]); witness_init_hash_tables(); /* First add in all the specified order lists. 
*/ for (order = order_lists; order->w_name != NULL; order++) { w = enroll(order->w_name, order->w_class); if (w == NULL) continue; w->w_file = "order list"; for (order++; order->w_name != NULL; order++) { w1 = enroll(order->w_name, order->w_class); if (w1 == NULL) continue; w1->w_file = "order list"; itismychild(w, w1); w = w1; } } witness_spin_warn = 1; /* Iterate through all locks and add them to witness. */ for (i = 0; pending_locks[i].wh_lock != NULL; i++) { lock = pending_locks[i].wh_lock; KASSERT(lock->lo_flags & LO_WITNESS, ("%s: lock %s is on pending list but not LO_WITNESS", __func__, lock->lo_name)); lock->lo_witness = enroll(pending_locks[i].wh_type, LOCK_CLASS(lock)); } /* Mark the witness code as being ready for use. */ witness_cold = 0; mtx_lock(&Giant); } void witness_init(struct lock_object *lock, const char *type) { struct lock_class *class; /* Various sanity checks. */ class = LOCK_CLASS(lock); if ((lock->lo_flags & LO_RECURSABLE) != 0 && (class->lc_flags & LC_RECURSABLE) == 0) kassert_panic("%s: lock (%s) %s can not be recursable", __func__, class->lc_name, lock->lo_name); if ((lock->lo_flags & LO_SLEEPABLE) != 0 && (class->lc_flags & LC_SLEEPABLE) == 0) kassert_panic("%s: lock (%s) %s can not be sleepable", __func__, class->lc_name, lock->lo_name); if ((lock->lo_flags & LO_UPGRADABLE) != 0 && (class->lc_flags & LC_UPGRADABLE) == 0) kassert_panic("%s: lock (%s) %s can not be upgradable", __func__, class->lc_name, lock->lo_name); /* * If we shouldn't watch this lock, then just clear lo_witness. * Otherwise, if witness_cold is set, then it is too early to * enroll this lock, so defer it to witness_initialize() by adding * it to the pending_locks list. If it is not too early, then enroll * the lock now. */ if (witness_watch < 1 || panicstr != NULL || (lock->lo_flags & LO_WITNESS) == 0) lock->lo_witness = NULL; else if (witness_cold) { pending_locks[pending_cnt].wh_lock = lock; pending_locks[pending_cnt++].wh_type = type; if (pending_cnt > WITNESS_PENDLIST) panic("%s: pending locks list is too small, " "increase WITNESS_PENDLIST\n", __func__); } else lock->lo_witness = enroll(type, class); } void witness_destroy(struct lock_object *lock) { struct lock_class *class; struct witness *w; class = LOCK_CLASS(lock); if (witness_cold) panic("lock (%s) %s destroyed while witness_cold", class->lc_name, lock->lo_name); /* XXX: need to verify that no one holds the lock */ if ((lock->lo_flags & LO_WITNESS) == 0 || lock->lo_witness == NULL) return; w = lock->lo_witness; mtx_lock_spin(&w_mtx); MPASS(w->w_refcount > 0); w->w_refcount--; if (w->w_refcount == 0) depart(w); mtx_unlock_spin(&w_mtx); } #ifdef DDB static void witness_ddb_compute_levels(void) { struct witness *w; /* * First clear all levels. */ STAILQ_FOREACH(w, &w_all, w_list) w->w_ddb_level = -1; /* * Look for locks with no parents and level all their descendants. */ STAILQ_FOREACH(w, &w_all, w_list) { /* If the witness has ancestors (is not a root), skip it. 
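* Levels are then assigned top-down from each root by witness_ddb_level_descendants(), which keeps the deepest level seen when a node is reachable through more than one path.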
*/ if (w->w_num_ancestors > 0) continue; witness_ddb_level_descendants(w, 0); } } static void witness_ddb_level_descendants(struct witness *w, int l) { int i; if (w->w_ddb_level >= l) return; w->w_ddb_level = l; l++; for (i = 1; i <= w_max_used_index; i++) { if (w_rmatrix[w->w_index][i] & WITNESS_PARENT) witness_ddb_level_descendants(&w_data[i], l); } } static void witness_ddb_display_descendants(int(*prnt)(const char *fmt, ...), struct witness *w, int indent) { int i; for (i = 0; i < indent; i++) prnt(" "); prnt("%s (type: %s, depth: %d, active refs: %d)", w->w_name, w->w_class->lc_name, w->w_ddb_level, w->w_refcount); if (w->w_displayed) { prnt(" -- (already displayed)\n"); return; } w->w_displayed = 1; if (w->w_file != NULL && w->w_line != 0) prnt(" -- last acquired @ %s:%d\n", fixup_filename(w->w_file), w->w_line); else prnt(" -- never acquired\n"); indent++; WITNESS_INDEX_ASSERT(w->w_index); for (i = 1; i <= w_max_used_index; i++) { if (db_pager_quit) return; if (w_rmatrix[w->w_index][i] & WITNESS_PARENT) witness_ddb_display_descendants(prnt, &w_data[i], indent); } } static void witness_ddb_display_list(int(*prnt)(const char *fmt, ...), struct witness_list *list) { struct witness *w; STAILQ_FOREACH(w, list, w_typelist) { if (w->w_file == NULL || w->w_ddb_level > 0) continue; /* This lock has no ancestors - display its descendants. */ witness_ddb_display_descendants(prnt, w, 0); if (db_pager_quit) return; } } static void witness_ddb_display(int(*prnt)(const char *fmt, ...)) { struct witness *w; KASSERT(witness_cold == 0, ("%s: witness_cold", __func__)); witness_ddb_compute_levels(); /* Clear all the displayed flags. */ STAILQ_FOREACH(w, &w_all, w_list) w->w_displayed = 0; /* * First, handle sleep locks which have been acquired at least * once. */ prnt("Sleep locks:\n"); witness_ddb_display_list(prnt, &w_sleep); if (db_pager_quit) return; /* * Now do spin locks which have been acquired at least once. */ prnt("\nSpin locks:\n"); witness_ddb_display_list(prnt, &w_spin); if (db_pager_quit) return; /* * Finally, any locks which have not been acquired yet. */ prnt("\nLocks which were never acquired:\n"); STAILQ_FOREACH(w, &w_all, w_list) { if (w->w_file != NULL || w->w_refcount == 0) continue; prnt("%s (type: %s, depth: %d)\n", w->w_name, w->w_class->lc_name, w->w_ddb_level); if (db_pager_quit) return; } } #endif /* DDB */ int witness_defineorder(struct lock_object *lock1, struct lock_object *lock2) { if (witness_watch == -1 || panicstr != NULL) return (0); /* Require locks that witness knows about. */ if (lock1 == NULL || lock1->lo_witness == NULL || lock2 == NULL || lock2->lo_witness == NULL) return (EINVAL); mtx_assert(&w_mtx, MA_NOTOWNED); mtx_lock_spin(&w_mtx); /* * If we already have either an explicit or implied lock order that * is the other way around, then return an error. */ if (witness_watch && isitmydescendant(lock2->lo_witness, lock1->lo_witness)) { mtx_unlock_spin(&w_mtx); return (EDOOFUS); } /* Try to add the new order.
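* A usage sketch (hypothetical caller): once two mutexes m1 and m2 are initialized, witness_defineorder(&m1.lock_object, &m2.lock_object) pins m1 before m2; the call returns EDOOFUS if the reverse order is already known, and EINVAL if either lock is unknown to witness.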
*/ CTR3(KTR_WITNESS, "%s: adding %s as a child of %s", __func__, lock2->lo_witness->w_name, lock1->lo_witness->w_name); itismychild(lock1->lo_witness, lock2->lo_witness); mtx_unlock_spin(&w_mtx); return (0); } void witness_checkorder(struct lock_object *lock, int flags, const char *file, int line, struct lock_object *interlock) { struct lock_list_entry *lock_list, *lle; struct lock_instance *lock1, *lock2, *plock; struct lock_class *class, *iclass; struct witness *w, *w1; struct thread *td; int i, j; if (witness_cold || witness_watch < 1 || lock->lo_witness == NULL || panicstr != NULL) return; w = lock->lo_witness; class = LOCK_CLASS(lock); td = curthread; if (class->lc_flags & LC_SLEEPLOCK) { /* * Since spin locks include a critical section, this check * implicitly enforces a lock order of all sleep locks before * all spin locks. */ if (td->td_critnest != 0 && !kdb_active) kassert_panic("acquiring blockable sleep lock with " "spinlock or critical section held (%s) %s @ %s:%d", class->lc_name, lock->lo_name, fixup_filename(file), line); /* * If this is the first lock acquired then just return as * no order checking is needed. */ lock_list = td->td_sleeplocks; if (lock_list == NULL || lock_list->ll_count == 0) return; } else { /* * If this is the first lock, just return as no order * checking is needed. Avoid problems with thread * migration pinning the thread while checking if * spinlocks are held. If at least one spinlock is held * the thread is in a safe path and it is allowed to * unpin it. */ sched_pin(); lock_list = PCPU_GET(spinlocks); if (lock_list == NULL || lock_list->ll_count == 0) { sched_unpin(); return; } sched_unpin(); } /* * Check to see if we are recursing on a lock we already own. If * so, make sure that we don't mismatch exclusive and shared lock * acquires. */ lock1 = find_instance(lock_list, lock); if (lock1 != NULL) { if ((lock1->li_flags & LI_EXCLUSIVE) != 0 && (flags & LOP_EXCLUSIVE) == 0) { witness_output("shared lock of (%s) %s @ %s:%d\n", class->lc_name, lock->lo_name, fixup_filename(file), line); witness_output("while exclusively locked from %s:%d\n", fixup_filename(lock1->li_file), lock1->li_line); kassert_panic("excl->share"); } if ((lock1->li_flags & LI_EXCLUSIVE) == 0 && (flags & LOP_EXCLUSIVE) != 0) { witness_output("exclusive lock of (%s) %s @ %s:%d\n", class->lc_name, lock->lo_name, fixup_filename(file), line); witness_output("while share locked from %s:%d\n", fixup_filename(lock1->li_file), lock1->li_line); kassert_panic("share->excl"); } return; } /* Warn if the interlock is not locked exactly once. */ if (interlock != NULL) { iclass = LOCK_CLASS(interlock); lock1 = find_instance(lock_list, interlock); if (lock1 == NULL) kassert_panic("interlock (%s) %s not locked @ %s:%d", iclass->lc_name, interlock->lo_name, fixup_filename(file), line); else if ((lock1->li_flags & LI_RECURSEMASK) != 0) kassert_panic("interlock (%s) %s recursed @ %s:%d", iclass->lc_name, interlock->lo_name, fixup_filename(file), line); } /* * Find the previously acquired lock, but ignore interlocks. */ plock = &lock_list->ll_children[lock_list->ll_count - 1]; if (interlock != NULL && plock->li_lock == interlock) { if (lock_list->ll_count > 1) plock = &lock_list->ll_children[lock_list->ll_count - 2]; else { lle = lock_list->ll_next; /* * The interlock is the only lock we hold, so * simply return. */ if (lle == NULL) return; plock = &lle->ll_children[lle->ll_count - 1]; } } /* * Try to perform most checks without a lock. 
If this succeeds we * can skip acquiring the lock and return success. Otherwise we redo * the check with the lock held to handle races with concurrent updates. */ w1 = plock->li_lock->lo_witness; if (witness_lock_order_check(w1, w)) return; mtx_lock_spin(&w_mtx); if (witness_lock_order_check(w1, w)) { mtx_unlock_spin(&w_mtx); return; } witness_lock_order_add(w1, w); /* * Check for duplicate locks of the same type. Note that we only * have to check for this on the last lock we just acquired. Any * other cases will be caught as lock order violations. */ if (w1 == w) { i = w->w_index; if (!(lock->lo_flags & LO_DUPOK) && !(flags & LOP_DUPOK) && !(w_rmatrix[i][i] & WITNESS_REVERSAL)) { w_rmatrix[i][i] |= WITNESS_REVERSAL; w->w_reversed = 1; mtx_unlock_spin(&w_mtx); witness_output( "acquiring duplicate lock of same type: \"%s\"\n", w->w_name); witness_output(" 1st %s @ %s:%d\n", plock->li_lock->lo_name, fixup_filename(plock->li_file), plock->li_line); witness_output(" 2nd %s @ %s:%d\n", lock->lo_name, fixup_filename(file), line); witness_debugger(1, __func__); } else mtx_unlock_spin(&w_mtx); return; } mtx_assert(&w_mtx, MA_OWNED); /* * If we know that the lock we are acquiring comes after * the lock we most recently acquired in the lock order tree, * then there is no need for any further checks. */ if (isitmychild(w1, w)) goto out; for (j = 0, lle = lock_list; lle != NULL; lle = lle->ll_next) { for (i = lle->ll_count - 1; i >= 0; i--, j++) { MPASS(j < LOCK_CHILDCOUNT * LOCK_NCHILDREN); lock1 = &lle->ll_children[i]; /* * Ignore the interlock. */ if (interlock == lock1->li_lock) continue; /* * If this lock doesn't undergo witness checking, * then skip it. */ w1 = lock1->li_lock->lo_witness; if (w1 == NULL) { KASSERT((lock1->li_lock->lo_flags & LO_WITNESS) == 0, ("lock missing witness structure")); continue; } /* * If we are locking Giant and this is a sleepable * lock, then skip it. */ if ((lock1->li_lock->lo_flags & LO_SLEEPABLE) != 0 && lock == &Giant.lock_object) continue; /* * If we are locking a sleepable lock and this lock * is Giant, then skip it. */ if ((lock->lo_flags & LO_SLEEPABLE) != 0 && lock1->li_lock == &Giant.lock_object) continue; /* * If we are locking a sleepable lock and this lock * isn't sleepable, we want to treat it as a lock * order violation to enforce a general lock order of * sleepable locks before non-sleepable locks. */ if (((lock->lo_flags & LO_SLEEPABLE) != 0 && (lock1->li_lock->lo_flags & LO_SLEEPABLE) == 0)) goto reversal; /* * If we are locking Giant and this is a non-sleepable * lock, then treat it as a reversal. */ if ((lock1->li_lock->lo_flags & LO_SLEEPABLE) == 0 && lock == &Giant.lock_object) goto reversal; /* * Check the lock order hierarchy for a reversal. */ if (!isitmydescendant(w, w1)) continue; reversal: /* * We have a lock order violation, check to see if it * is allowed or has already been yelled about. */ #ifdef BLESSING /* * If the lock order is blessed, just bail. We don't * look for other lock order violations though, which * may be a bug. */ if (blessed(w, w1)) goto out; #endif /* Bail if this violation is known */ if (w_rmatrix[w1->w_index][w->w_index] & WITNESS_REVERSAL) goto out; /* Record this as a violation */ w_rmatrix[w1->w_index][w->w_index] |= WITNESS_REVERSAL; w_rmatrix[w->w_index][w1->w_index] |= WITNESS_REVERSAL; w->w_reversed = w1->w_reversed = 1; witness_increment_graph_generation(); mtx_unlock_spin(&w_mtx); #ifdef WITNESS_NO_VNODE /* * There are known LORs between VNODE locks. They are * not an indication of a bug.
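* (The order in which two vnode locks are taken legitimately depends * on the direction of the tree walk, so either order can be observed.)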
VNODE locks are flagged * as such (LO_IS_VNODE) and we don't yell if the LOR * is between 2 VNODE locks. */ if ((lock->lo_flags & LO_IS_VNODE) != 0 && (lock1->li_lock->lo_flags & LO_IS_VNODE) != 0) return; #endif /* * Ok, yell about it. */ if (((lock->lo_flags & LO_SLEEPABLE) != 0 && (lock1->li_lock->lo_flags & LO_SLEEPABLE) == 0)) witness_output( "lock order reversal: (sleepable after non-sleepable)\n"); else if ((lock1->li_lock->lo_flags & LO_SLEEPABLE) == 0 && lock == &Giant.lock_object) witness_output( "lock order reversal: (Giant after non-sleepable)\n"); else witness_output("lock order reversal:\n"); /* * Try to locate an earlier lock with * witness w in our list. */ do { lock2 = &lle->ll_children[i]; MPASS(lock2->li_lock != NULL); if (lock2->li_lock->lo_witness == w) break; if (i == 0 && lle->ll_next != NULL) { lle = lle->ll_next; i = lle->ll_count - 1; MPASS(i >= 0 && i < LOCK_NCHILDREN); } else i--; } while (i >= 0); if (i < 0) { witness_output(" 1st %p %s (%s) @ %s:%d\n", lock1->li_lock, lock1->li_lock->lo_name, w1->w_name, fixup_filename(lock1->li_file), lock1->li_line); witness_output(" 2nd %p %s (%s) @ %s:%d\n", lock, lock->lo_name, w->w_name, fixup_filename(file), line); } else { witness_output(" 1st %p %s (%s) @ %s:%d\n", lock2->li_lock, lock2->li_lock->lo_name, lock2->li_lock->lo_witness->w_name, fixup_filename(lock2->li_file), lock2->li_line); witness_output(" 2nd %p %s (%s) @ %s:%d\n", lock1->li_lock, lock1->li_lock->lo_name, w1->w_name, fixup_filename(lock1->li_file), lock1->li_line); witness_output(" 3rd %p %s (%s) @ %s:%d\n", lock, lock->lo_name, w->w_name, fixup_filename(file), line); } witness_debugger(1, __func__); return; } } /* * If requested, build a new lock order. However, don't build a new * relationship between a sleepable lock and Giant if it is in the * wrong direction. The correct lock order is that sleepable locks * always come before Giant. */ if (flags & LOP_NEWORDER && !(plock->li_lock == &Giant.lock_object && (lock->lo_flags & LO_SLEEPABLE) != 0)) { CTR3(KTR_WITNESS, "%s: adding %s as a child of %s", __func__, w->w_name, plock->li_lock->lo_witness->w_name); itismychild(plock->li_lock->lo_witness, w); } out: mtx_unlock_spin(&w_mtx); } void witness_lock(struct lock_object *lock, int flags, const char *file, int line) { struct lock_list_entry **lock_list, *lle; struct lock_instance *instance; struct witness *w; struct thread *td; if (witness_cold || witness_watch == -1 || lock->lo_witness == NULL || panicstr != NULL) return; w = lock->lo_witness; td = curthread; /* Determine lock list for this lock. */ if (LOCK_CLASS(lock)->lc_flags & LC_SLEEPLOCK) lock_list = &td->td_sleeplocks; else lock_list = PCPU_PTR(spinlocks); /* Check to see if we are recursing on a lock we already own. */ instance = find_instance(*lock_list, lock); if (instance != NULL) { instance->li_flags++; CTR4(KTR_WITNESS, "%s: pid %d recursed on %s r=%d", __func__, td->td_proc->p_pid, lock->lo_name, instance->li_flags & LI_RECURSEMASK); instance->li_file = file; instance->li_line = line; return; } /* Update per-witness last file and line acquire. */ w->w_file = file; w->w_line = line; /* Find the next open lock instance in the list and fill it. 
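* Each lock_list_entry holds up to LOCK_NCHILDREN instances; once the * head entry is full, a fresh entry from witness_lock_list_get() is * pushed onto the front of the list.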
*/ lle = *lock_list; if (lle == NULL || lle->ll_count == LOCK_NCHILDREN) { lle = witness_lock_list_get(); if (lle == NULL) return; lle->ll_next = *lock_list; CTR3(KTR_WITNESS, "%s: pid %d added lle %p", __func__, td->td_proc->p_pid, lle); *lock_list = lle; } instance = &lle->ll_children[lle->ll_count++]; instance->li_lock = lock; instance->li_line = line; instance->li_file = file; if ((flags & LOP_EXCLUSIVE) != 0) instance->li_flags = LI_EXCLUSIVE; else instance->li_flags = 0; CTR4(KTR_WITNESS, "%s: pid %d added %s as lle[%d]", __func__, td->td_proc->p_pid, lock->lo_name, lle->ll_count - 1); } void witness_upgrade(struct lock_object *lock, int flags, const char *file, int line) { struct lock_instance *instance; struct lock_class *class; KASSERT(witness_cold == 0, ("%s: witness_cold", __func__)); if (lock->lo_witness == NULL || witness_watch == -1 || panicstr != NULL) return; class = LOCK_CLASS(lock); if (witness_watch) { if ((lock->lo_flags & LO_UPGRADABLE) == 0) kassert_panic( "upgrade of non-upgradable lock (%s) %s @ %s:%d", class->lc_name, lock->lo_name, fixup_filename(file), line); if ((class->lc_flags & LC_SLEEPLOCK) == 0) kassert_panic( "upgrade of non-sleep lock (%s) %s @ %s:%d", class->lc_name, lock->lo_name, fixup_filename(file), line); } instance = find_instance(curthread->td_sleeplocks, lock); if (instance == NULL) { kassert_panic("upgrade of unlocked lock (%s) %s @ %s:%d", class->lc_name, lock->lo_name, fixup_filename(file), line); return; } if (witness_watch) { if ((instance->li_flags & LI_EXCLUSIVE) != 0) kassert_panic( "upgrade of exclusive lock (%s) %s @ %s:%d", class->lc_name, lock->lo_name, fixup_filename(file), line); if ((instance->li_flags & LI_RECURSEMASK) != 0) kassert_panic( "upgrade of recursed lock (%s) %s r=%d @ %s:%d", class->lc_name, lock->lo_name, instance->li_flags & LI_RECURSEMASK, fixup_filename(file), line); } instance->li_flags |= LI_EXCLUSIVE; } void witness_downgrade(struct lock_object *lock, int flags, const char *file, int line) { struct lock_instance *instance; struct lock_class *class; KASSERT(witness_cold == 0, ("%s: witness_cold", __func__)); if (lock->lo_witness == NULL || witness_watch == -1 || panicstr != NULL) return; class = LOCK_CLASS(lock); if (witness_watch) { if ((lock->lo_flags & LO_UPGRADABLE) == 0) kassert_panic( "downgrade of non-upgradable lock (%s) %s @ %s:%d", class->lc_name, lock->lo_name, fixup_filename(file), line); if ((class->lc_flags & LC_SLEEPLOCK) == 0) kassert_panic( "downgrade of non-sleep lock (%s) %s @ %s:%d", class->lc_name, lock->lo_name, fixup_filename(file), line); } instance = find_instance(curthread->td_sleeplocks, lock); if (instance == NULL) { kassert_panic("downgrade of unlocked lock (%s) %s @ %s:%d", class->lc_name, lock->lo_name, fixup_filename(file), line); return; } if (witness_watch) { if ((instance->li_flags & LI_EXCLUSIVE) == 0) kassert_panic( "downgrade of shared lock (%s) %s @ %s:%d", class->lc_name, lock->lo_name, fixup_filename(file), line); if ((instance->li_flags & LI_RECURSEMASK) != 0) kassert_panic( "downgrade of recursed lock (%s) %s r=%d @ %s:%d", class->lc_name, lock->lo_name, instance->li_flags & LI_RECURSEMASK, fixup_filename(file), line); } instance->li_flags &= ~LI_EXCLUSIVE; } void witness_unlock(struct lock_object *lock, int flags, const char *file, int line) { struct lock_list_entry **lock_list, *lle; struct lock_instance *instance; struct lock_class *class; struct thread *td; register_t s; int i, j; if (witness_cold || lock->lo_witness == NULL || panicstr != NULL) return; td = 
curthread; class = LOCK_CLASS(lock); /* Find lock instance associated with this lock. */ if (class->lc_flags & LC_SLEEPLOCK) lock_list = &td->td_sleeplocks; else lock_list = PCPU_PTR(spinlocks); lle = *lock_list; for (; *lock_list != NULL; lock_list = &(*lock_list)->ll_next) for (i = 0; i < (*lock_list)->ll_count; i++) { instance = &(*lock_list)->ll_children[i]; if (instance->li_lock == lock) goto found; } /* * When WITNESS is disabled at run time through witness_watch, locks * registered earlier may still be present in the td_sleeplocks queue. * Those queues must still be flushed, so keep looking for any such * leftover locks here and remove them; failing to find the lock is * only a panic while witness_watch is still enabled. */ if (witness_watch > 0) { kassert_panic("lock (%s) %s not locked @ %s:%d", class->lc_name, lock->lo_name, fixup_filename(file), line); return; } else { return; } found: /* First, check for shared/exclusive mismatches. */ if ((instance->li_flags & LI_EXCLUSIVE) != 0 && witness_watch > 0 && (flags & LOP_EXCLUSIVE) == 0) { witness_output("shared unlock of (%s) %s @ %s:%d\n", class->lc_name, lock->lo_name, fixup_filename(file), line); witness_output("while exclusively locked from %s:%d\n", fixup_filename(instance->li_file), instance->li_line); kassert_panic("excl->ushare"); } if ((instance->li_flags & LI_EXCLUSIVE) == 0 && witness_watch > 0 && (flags & LOP_EXCLUSIVE) != 0) { witness_output("exclusive unlock of (%s) %s @ %s:%d\n", class->lc_name, lock->lo_name, fixup_filename(file), line); witness_output("while share locked from %s:%d\n", fixup_filename(instance->li_file), instance->li_line); kassert_panic("share->uexcl"); } /* If we are recursed, unrecurse. */ if ((instance->li_flags & LI_RECURSEMASK) > 0) { CTR4(KTR_WITNESS, "%s: pid %d unrecursed on %s r=%d", __func__, td->td_proc->p_pid, instance->li_lock->lo_name, instance->li_flags); instance->li_flags--; return; } /* The lock is now being dropped, check for NORELEASE flag */ if ((instance->li_flags & LI_NORELEASE) != 0 && witness_watch > 0) { witness_output("forbidden unlock of (%s) %s @ %s:%d\n", class->lc_name, lock->lo_name, fixup_filename(file), line); kassert_panic("lock marked norelease"); } /* Otherwise, remove this item from the list. */ s = intr_disable(); CTR4(KTR_WITNESS, "%s: pid %d removed %s from lle[%d]", __func__, td->td_proc->p_pid, instance->li_lock->lo_name, (*lock_list)->ll_count - 1); for (j = i; j < (*lock_list)->ll_count - 1; j++) (*lock_list)->ll_children[j] = (*lock_list)->ll_children[j + 1]; (*lock_list)->ll_count--; intr_restore(s); /* * To reduce contention on w_mtx, always try to keep a head object in * the list so that frequent allocation from the free witness pool (and * the locking that goes with it) is avoided. * To keep the code simple, an empty head object also implies that * there are no further objects in the list; thus, if the current head * must be freed, ownership of the list has to be handed over to * another object first.
*/ if ((*lock_list)->ll_count == 0) { if (*lock_list == lle) { if (lle->ll_next == NULL) return; } else lle = *lock_list; *lock_list = lle->ll_next; CTR3(KTR_WITNESS, "%s: pid %d removed lle %p", __func__, td->td_proc->p_pid, lle); witness_lock_list_free(lle); } } void witness_thread_exit(struct thread *td) { struct lock_list_entry *lle; int i, n; lle = td->td_sleeplocks; if (lle == NULL || panicstr != NULL) return; if (lle->ll_count != 0) { for (n = 0; lle != NULL; lle = lle->ll_next) for (i = lle->ll_count - 1; i >= 0; i--) { if (n == 0) witness_output( "Thread %p exiting with the following locks held:\n", td); n++; witness_list_lock(&lle->ll_children[i], witness_output); } kassert_panic( "Thread %p cannot exit while holding sleeplocks\n", td); } witness_lock_list_free(lle); } /* * Warn if any locks other than 'lock' are held. Flags can be passed in to * exempt Giant and sleepable locks from the checks as well. If any * non-exempt locks are held, then a supplied message is printed to the * output channel along with a list of the offending locks. If indicated in the * flags then a failure results in a panic as well. */ int witness_warn(int flags, struct lock_object *lock, const char *fmt, ...) { struct lock_list_entry *lock_list, *lle; struct lock_instance *lock1; struct thread *td; va_list ap; int i, n; if (witness_cold || witness_watch < 1 || panicstr != NULL) return (0); n = 0; td = curthread; for (lle = td->td_sleeplocks; lle != NULL; lle = lle->ll_next) for (i = lle->ll_count - 1; i >= 0; i--) { lock1 = &lle->ll_children[i]; if (lock1->li_lock == lock) continue; if (flags & WARN_GIANTOK && lock1->li_lock == &Giant.lock_object) continue; if (flags & WARN_SLEEPOK && (lock1->li_lock->lo_flags & LO_SLEEPABLE) != 0) continue; if (n == 0) { va_start(ap, fmt); vprintf(fmt, ap); va_end(ap); printf(" with the following %slocks held:\n", (flags & WARN_SLEEPOK) != 0 ? "non-sleepable " : ""); } n++; witness_list_lock(lock1, printf); } /* * Pin the thread to avoid problems with thread migration. * Once all checks of spin lock ownership have passed, the thread is * on a safe path and can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); if (lock_list != NULL && lock_list->ll_count != 0) { sched_unpin(); /* * At most one held spin lock is acceptable here: the * exemption flags cannot apply to this lock class, so just * check whether the sole spin lock held is 'lock' itself. */ lock1 = &lock_list->ll_children[lock_list->ll_count - 1]; if (lock_list->ll_count == 1 && lock_list->ll_next == NULL && lock1->li_lock == lock && n == 0) return (0); va_start(ap, fmt); vprintf(fmt, ap); va_end(ap); printf(" with the following %slocks held:\n", (flags & WARN_SLEEPOK) != 0 ?
"non-sleepable " : ""); n += witness_list_locks(&lock_list, printf); } else sched_unpin(); if (flags & WARN_PANIC && n) kassert_panic("%s", __func__); else witness_debugger(n, __func__); return (n); } const char * witness_file(struct lock_object *lock) { struct witness *w; if (witness_cold || witness_watch < 1 || lock->lo_witness == NULL) return ("?"); w = lock->lo_witness; return (w->w_file); } int witness_line(struct lock_object *lock) { struct witness *w; if (witness_cold || witness_watch < 1 || lock->lo_witness == NULL) return (0); w = lock->lo_witness; return (w->w_line); } static struct witness * enroll(const char *description, struct lock_class *lock_class) { struct witness *w; MPASS(description != NULL); if (witness_watch == -1 || panicstr != NULL) return (NULL); if ((lock_class->lc_flags & LC_SPINLOCK)) { if (witness_skipspin) return (NULL); } else if ((lock_class->lc_flags & LC_SLEEPLOCK) == 0) { kassert_panic("lock class %s is not sleep or spin", lock_class->lc_name); return (NULL); } mtx_lock_spin(&w_mtx); w = witness_hash_get(description); if (w) goto found; if ((w = witness_get()) == NULL) return (NULL); MPASS(strlen(description) < MAX_W_NAME); strcpy(w->w_name, description); w->w_class = lock_class; w->w_refcount = 1; STAILQ_INSERT_HEAD(&w_all, w, w_list); if (lock_class->lc_flags & LC_SPINLOCK) { STAILQ_INSERT_HEAD(&w_spin, w, w_typelist); w_spin_cnt++; } else if (lock_class->lc_flags & LC_SLEEPLOCK) { STAILQ_INSERT_HEAD(&w_sleep, w, w_typelist); w_sleep_cnt++; } /* Insert new witness into the hash */ witness_hash_put(w); witness_increment_graph_generation(); mtx_unlock_spin(&w_mtx); return (w); found: w->w_refcount++; if (w->w_refcount == 1) w->w_class = lock_class; mtx_unlock_spin(&w_mtx); if (lock_class != w->w_class) kassert_panic( "lock (%s) %s does not match earlier (%s) lock", description, lock_class->lc_name, w->w_class->lc_name); return (w); } static void depart(struct witness *w) { MPASS(w->w_refcount == 0); if (w->w_class->lc_flags & LC_SLEEPLOCK) { w_sleep_cnt--; } else { w_spin_cnt--; } /* * Set file to NULL as it may point into a loadable module. */ w->w_file = NULL; w->w_line = 0; witness_increment_graph_generation(); } static void adopt(struct witness *parent, struct witness *child) { int pi, ci, i, j; if (witness_cold == 0) mtx_assert(&w_mtx, MA_OWNED); /* If the relationship is already known, there's no work to be done. */ if (isitmychild(parent, child)) return; /* When the structure of the graph changes, bump up the generation. */ witness_increment_graph_generation(); /* * The hard part ... create the direct relationship, then propagate all * indirect relationships. */ pi = parent->w_index; ci = child->w_index; WITNESS_INDEX_ASSERT(pi); WITNESS_INDEX_ASSERT(ci); MPASS(pi != ci); w_rmatrix[pi][ci] |= WITNESS_PARENT; w_rmatrix[ci][pi] |= WITNESS_CHILD; /* * If parent was not already an ancestor of child, * then we increment the descendant and ancestor counters. */ if ((w_rmatrix[pi][ci] & WITNESS_ANCESTOR) == 0) { parent->w_num_descendants++; child->w_num_ancestors++; } /* * Find each ancestor of 'pi'. Note that 'pi' itself is counted as * an ancestor of 'pi' during this loop. */ for (i = 1; i <= w_max_used_index; i++) { if ((w_rmatrix[i][pi] & WITNESS_ANCESTOR_MASK) == 0 && (i != pi)) continue; /* Find each descendant of 'i' and mark it as a descendant. */ for (j = 1; j <= w_max_used_index; j++) { /* * Skip children that are already marked as * descendants of 'i'. 
*/ if (w_rmatrix[i][j] & WITNESS_ANCESTOR_MASK) continue; /* * We are only interested in descendants of 'ci'. Note * that 'ci' itself is counted as a descendant of 'ci'. */ if ((w_rmatrix[ci][j] & WITNESS_ANCESTOR_MASK) == 0 && (j != ci)) continue; w_rmatrix[i][j] |= WITNESS_ANCESTOR; w_rmatrix[j][i] |= WITNESS_DESCENDANT; w_data[i].w_num_descendants++; w_data[j].w_num_ancestors++; /* * Make sure we aren't marking a node as both an * ancestor and descendant. We should have caught * this as a lock order reversal earlier. */ if ((w_rmatrix[i][j] & WITNESS_ANCESTOR_MASK) && (w_rmatrix[i][j] & WITNESS_DESCENDANT_MASK)) { printf("witness rmatrix paradox! [%d][%d]=%d " "both ancestor and descendant\n", i, j, w_rmatrix[i][j]); kdb_backtrace(); printf("Witness disabled.\n"); witness_watch = -1; } if ((w_rmatrix[j][i] & WITNESS_ANCESTOR_MASK) && (w_rmatrix[j][i] & WITNESS_DESCENDANT_MASK)) { printf("witness rmatrix paradox! [%d][%d]=%d " "both ancestor and descendant\n", j, i, w_rmatrix[j][i]); kdb_backtrace(); printf("Witness disabled.\n"); witness_watch = -1; } } } } static void itismychild(struct witness *parent, struct witness *child) { int unlocked; MPASS(child != NULL && parent != NULL); if (witness_cold == 0) mtx_assert(&w_mtx, MA_OWNED); if (!witness_lock_type_equal(parent, child)) { if (witness_cold == 0) { unlocked = 1; mtx_unlock_spin(&w_mtx); } else { unlocked = 0; } kassert_panic( "%s: parent \"%s\" (%s) and child \"%s\" (%s) are not " "the same lock type", __func__, parent->w_name, parent->w_class->lc_name, child->w_name, child->w_class->lc_name); if (unlocked) mtx_lock_spin(&w_mtx); } adopt(parent, child); } /* * Generic code for the isitmy*() functions. The rmask parameter is the * expected relationship of w1 to w2. */ static int _isitmyx(struct witness *w1, struct witness *w2, int rmask, const char *fname) { unsigned char r1, r2; int i1, i2; i1 = w1->w_index; i2 = w2->w_index; WITNESS_INDEX_ASSERT(i1); WITNESS_INDEX_ASSERT(i2); r1 = w_rmatrix[i1][i2] & WITNESS_RELATED_MASK; r2 = w_rmatrix[i2][i1] & WITNESS_RELATED_MASK; /* The flags on one better be the inverse of the flags on the other */ if (!((WITNESS_ATOD(r1) == r2 && WITNESS_DTOA(r2) == r1) || (WITNESS_DTOA(r1) == r2 && WITNESS_ATOD(r2) == r1))) { /* Don't squawk if we're potentially racing with an update. */ if (!mtx_owned(&w_mtx)) return (0); printf("%s: rmatrix mismatch between %s (index %d) and %s " "(index %d): w_rmatrix[%d][%d] == %hhx but " "w_rmatrix[%d][%d] == %hhx\n", fname, w1->w_name, i1, w2->w_name, i2, i1, i2, r1, i2, i1, r2); kdb_backtrace(); printf("Witness disabled.\n"); witness_watch = -1; } return (r1 & rmask); } /* * Checks if @child is a direct child of @parent. */ static int isitmychild(struct witness *parent, struct witness *child) { return (_isitmyx(parent, child, WITNESS_PARENT, __func__)); } /* * Checks if @descendant is a direct or indirect descendant of @ancestor.
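* The answer is read straight from the w_rmatrix relationship bits, so * no graph walk is required.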
*/ static int isitmydescendant(struct witness *ancestor, struct witness *descendant) { return (_isitmyx(ancestor, descendant, WITNESS_ANCESTOR_MASK, __func__)); } #ifdef BLESSING static int blessed(struct witness *w1, struct witness *w2) { int i; struct witness_blessed *b; for (i = 0; i < nitems(blessed_list); i++) { b = &blessed_list[i]; if (strcmp(w1->w_name, b->b_lock1) == 0) { if (strcmp(w2->w_name, b->b_lock2) == 0) return (1); continue; } if (strcmp(w1->w_name, b->b_lock2) == 0) if (strcmp(w2->w_name, b->b_lock1) == 0) return (1); } return (0); } #endif static struct witness * witness_get(void) { struct witness *w; int index; if (witness_cold == 0) mtx_assert(&w_mtx, MA_OWNED); if (witness_watch == -1) { mtx_unlock_spin(&w_mtx); return (NULL); } if (STAILQ_EMPTY(&w_free)) { witness_watch = -1; mtx_unlock_spin(&w_mtx); printf("WITNESS: unable to allocate a new witness object\n"); return (NULL); } w = STAILQ_FIRST(&w_free); STAILQ_REMOVE_HEAD(&w_free, w_list); w_free_cnt--; index = w->w_index; MPASS(index > 0 && index == w_max_used_index+1 && index < witness_count); bzero(w, sizeof(*w)); w->w_index = index; if (index > w_max_used_index) w_max_used_index = index; return (w); } static void witness_free(struct witness *w) { STAILQ_INSERT_HEAD(&w_free, w, w_list); w_free_cnt++; } static struct lock_list_entry * witness_lock_list_get(void) { struct lock_list_entry *lle; if (witness_watch == -1) return (NULL); mtx_lock_spin(&w_mtx); lle = w_lock_list_free; if (lle == NULL) { witness_watch = -1; mtx_unlock_spin(&w_mtx); printf("%s: witness exhausted\n", __func__); return (NULL); } w_lock_list_free = lle->ll_next; mtx_unlock_spin(&w_mtx); bzero(lle, sizeof(*lle)); return (lle); } static void witness_lock_list_free(struct lock_list_entry *lle) { mtx_lock_spin(&w_mtx); lle->ll_next = w_lock_list_free; w_lock_list_free = lle; mtx_unlock_spin(&w_mtx); } static struct lock_instance * find_instance(struct lock_list_entry *list, const struct lock_object *lock) { struct lock_list_entry *lle; struct lock_instance *instance; int i; for (lle = list; lle != NULL; lle = lle->ll_next) for (i = lle->ll_count - 1; i >= 0; i--) { instance = &lle->ll_children[i]; if (instance->li_lock == lock) return (instance); } return (NULL); } static void witness_list_lock(struct lock_instance *instance, int (*prnt)(const char *fmt, ...)) { struct lock_object *lock; lock = instance->li_lock; prnt("%s %s %s", (instance->li_flags & LI_EXCLUSIVE) != 0 ? "exclusive" : "shared", LOCK_CLASS(lock)->lc_name, lock->lo_name); if (lock->lo_witness->w_name != lock->lo_name) prnt(" (%s)", lock->lo_witness->w_name); prnt(" r = %d (%p) locked @ %s:%d\n", instance->li_flags & LI_RECURSEMASK, lock, fixup_filename(instance->li_file), instance->li_line); } static int witness_output(const char *fmt, ...) 
{ va_list ap; int ret; va_start(ap, fmt); ret = witness_voutput(fmt, ap); va_end(ap); return (ret); } static int witness_voutput(const char *fmt, va_list ap) { int ret; ret = 0; switch (witness_channel) { case WITNESS_CONSOLE: ret = vprintf(fmt, ap); break; case WITNESS_LOG: vlog(LOG_NOTICE, fmt, ap); break; case WITNESS_NONE: break; } return (ret); } #ifdef DDB static int witness_thread_has_locks(struct thread *td) { if (td->td_sleeplocks == NULL) return (0); return (td->td_sleeplocks->ll_count != 0); } static int witness_proc_has_locks(struct proc *p) { struct thread *td; FOREACH_THREAD_IN_PROC(p, td) { if (witness_thread_has_locks(td)) return (1); } return (0); } #endif int witness_list_locks(struct lock_list_entry **lock_list, int (*prnt)(const char *fmt, ...)) { struct lock_list_entry *lle; int i, nheld; nheld = 0; for (lle = *lock_list; lle != NULL; lle = lle->ll_next) for (i = lle->ll_count - 1; i >= 0; i--) { witness_list_lock(&lle->ll_children[i], prnt); nheld++; } return (nheld); } /* * This is a bit risky at best. We call this function when we have timed * out acquiring a spin lock, and we assume that the other CPU is stuck * with this lock held. So, we go groveling around in the other CPU's * per-cpu data to try to find the lock instance for this spin lock to * see when it was last acquired. */ void witness_display_spinlock(struct lock_object *lock, struct thread *owner, int (*prnt)(const char *fmt, ...)) { struct lock_instance *instance; struct pcpu *pc; if (owner->td_critnest == 0 || owner->td_oncpu == NOCPU) return; pc = pcpu_find(owner->td_oncpu); instance = find_instance(pc->pc_spinlocks, lock); if (instance != NULL) witness_list_lock(instance, prnt); } void witness_save(struct lock_object *lock, const char **filep, int *linep) { struct lock_list_entry *lock_list; struct lock_instance *instance; struct lock_class *class; /* * This function is used independently in locking code to deal with * Giant, SCHEDULER_STOPPED() check can be removed here after Giant * is gone. */ if (SCHEDULER_STOPPED()) return; KASSERT(witness_cold == 0, ("%s: witness_cold", __func__)); if (lock->lo_witness == NULL || witness_watch == -1 || panicstr != NULL) return; class = LOCK_CLASS(lock); if (class->lc_flags & LC_SLEEPLOCK) lock_list = curthread->td_sleeplocks; else { if (witness_skipspin) return; lock_list = PCPU_GET(spinlocks); } instance = find_instance(lock_list, lock); if (instance == NULL) { kassert_panic("%s: lock (%s) %s not locked", __func__, class->lc_name, lock->lo_name); return; } *filep = instance->li_file; *linep = instance->li_line; } void witness_restore(struct lock_object *lock, const char *file, int line) { struct lock_list_entry *lock_list; struct lock_instance *instance; struct lock_class *class; /* * This function is used independently in locking code to deal with * Giant, SCHEDULER_STOPPED() check can be removed here after Giant * is gone. 
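* A typical pairing (sketch): call witness_save(lock, &file, &line) * before a drop/re-acquire cycle, then witness_restore(lock, file, * line) once the lock is held again, so that the recorded acquisition * point is not clobbered by the intermediate re-acquisition.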
*/ if (SCHEDULER_STOPPED()) return; KASSERT(witness_cold == 0, ("%s: witness_cold", __func__)); if (lock->lo_witness == NULL || witness_watch == -1 || panicstr != NULL) return; class = LOCK_CLASS(lock); if (class->lc_flags & LC_SLEEPLOCK) lock_list = curthread->td_sleeplocks; else { if (witness_skipspin) return; lock_list = PCPU_GET(spinlocks); } instance = find_instance(lock_list, lock); if (instance == NULL) kassert_panic("%s: lock (%s) %s not locked", __func__, class->lc_name, lock->lo_name); lock->lo_witness->w_file = file; lock->lo_witness->w_line = line; if (instance == NULL) return; instance->li_file = file; instance->li_line = line; } void witness_assert(const struct lock_object *lock, int flags, const char *file, int line) { #ifdef INVARIANT_SUPPORT struct lock_instance *instance; struct lock_class *class; if (lock->lo_witness == NULL || witness_watch < 1 || panicstr != NULL) return; class = LOCK_CLASS(lock); if ((class->lc_flags & LC_SLEEPLOCK) != 0) instance = find_instance(curthread->td_sleeplocks, lock); else if ((class->lc_flags & LC_SPINLOCK) != 0) instance = find_instance(PCPU_GET(spinlocks), lock); else { kassert_panic("Lock (%s) %s is not sleep or spin!", class->lc_name, lock->lo_name); return; } switch (flags) { case LA_UNLOCKED: if (instance != NULL) kassert_panic("Lock (%s) %s locked @ %s:%d.", class->lc_name, lock->lo_name, fixup_filename(file), line); break; case LA_LOCKED: case LA_LOCKED | LA_RECURSED: case LA_LOCKED | LA_NOTRECURSED: case LA_SLOCKED: case LA_SLOCKED | LA_RECURSED: case LA_SLOCKED | LA_NOTRECURSED: case LA_XLOCKED: case LA_XLOCKED | LA_RECURSED: case LA_XLOCKED | LA_NOTRECURSED: if (instance == NULL) { kassert_panic("Lock (%s) %s not locked @ %s:%d.", class->lc_name, lock->lo_name, fixup_filename(file), line); break; } if ((flags & LA_XLOCKED) != 0 && (instance->li_flags & LI_EXCLUSIVE) == 0) kassert_panic( "Lock (%s) %s not exclusively locked @ %s:%d.", class->lc_name, lock->lo_name, fixup_filename(file), line); if ((flags & LA_SLOCKED) != 0 && (instance->li_flags & LI_EXCLUSIVE) != 0) kassert_panic( "Lock (%s) %s exclusively locked @ %s:%d.", class->lc_name, lock->lo_name, fixup_filename(file), line); if ((flags & LA_RECURSED) != 0 && (instance->li_flags & LI_RECURSEMASK) == 0) kassert_panic("Lock (%s) %s not recursed @ %s:%d.", class->lc_name, lock->lo_name, fixup_filename(file), line); if ((flags & LA_NOTRECURSED) != 0 && (instance->li_flags & LI_RECURSEMASK) != 0) kassert_panic("Lock (%s) %s recursed @ %s:%d.", class->lc_name, lock->lo_name, fixup_filename(file), line); break; default: kassert_panic("Invalid lock assertion at %s:%d.", fixup_filename(file), line); } #endif /* INVARIANT_SUPPORT */ } static void witness_setflag(struct lock_object *lock, int flag, int set) { struct lock_list_entry *lock_list; struct lock_instance *instance; struct lock_class *class; if (lock->lo_witness == NULL || witness_watch == -1 || panicstr != NULL) return; class = LOCK_CLASS(lock); if (class->lc_flags & LC_SLEEPLOCK) lock_list = curthread->td_sleeplocks; else { if (witness_skipspin) return; lock_list = PCPU_GET(spinlocks); } instance = find_instance(lock_list, lock); if (instance == NULL) { kassert_panic("%s: lock (%s) %s not locked", __func__, class->lc_name, lock->lo_name); return; } if (set) instance->li_flags |= flag; else instance->li_flags &= ~flag; } void witness_norelease(struct lock_object *lock) { witness_setflag(lock, LI_NORELEASE, 1); } void witness_releaseok(struct lock_object *lock) { witness_setflag(lock, LI_NORELEASE, 0); } #ifdef DDB static 
void witness_ddb_list(struct thread *td) { KASSERT(witness_cold == 0, ("%s: witness_cold", __func__)); KASSERT(kdb_active, ("%s: not in the debugger", __func__)); if (witness_watch < 1) return; witness_list_locks(&td->td_sleeplocks, db_printf); /* * We only handle spinlocks if td == curthread. This is somewhat broken * if td is currently executing on some other CPU and holds spin locks * as we won't display those locks. If we had a MI way of getting * the per-cpu data for a given cpu then we could use * td->td_oncpu to get the list of spinlocks for this thread * and "fix" this. * * That still wouldn't really fix this unless we locked the scheduler * lock or stopped the other CPU to make sure it wasn't changing the * list out from under us. It is probably best to just not try to * handle threads on other CPU's for now. */ if (td == curthread && PCPU_GET(spinlocks) != NULL) witness_list_locks(PCPU_PTR(spinlocks), db_printf); } DB_SHOW_COMMAND(locks, db_witness_list) { struct thread *td; if (have_addr) td = db_lookup_thread(addr, true); else td = kdb_thread; witness_ddb_list(td); } DB_SHOW_ALL_COMMAND(locks, db_witness_list_all) { struct thread *td; struct proc *p; /* * It would be nice to list only threads and processes that actually * held sleep locks, but that information is currently not exported * by WITNESS. */ FOREACH_PROC_IN_SYSTEM(p) { if (!witness_proc_has_locks(p)) continue; FOREACH_THREAD_IN_PROC(p, td) { if (!witness_thread_has_locks(td)) continue; db_printf("Process %d (%s) thread %p (%d)\n", p->p_pid, p->p_comm, td, td->td_tid); witness_ddb_list(td); if (db_pager_quit) return; } } } DB_SHOW_ALIAS(alllocks, db_witness_list_all) DB_SHOW_COMMAND(witness, db_witness_display) { witness_ddb_display(db_printf); } #endif static void sbuf_print_witness_badstacks(struct sbuf *sb, size_t *oldidx) { struct witness_lock_order_data *data1, *data2, *tmp_data1, *tmp_data2; struct witness *tmp_w1, *tmp_w2, *w1, *w2; int generation, i, j; tmp_data1 = NULL; tmp_data2 = NULL; tmp_w1 = NULL; tmp_w2 = NULL; /* Allocate and init temporary storage space. */ tmp_w1 = malloc(sizeof(struct witness), M_TEMP, M_WAITOK | M_ZERO); tmp_w2 = malloc(sizeof(struct witness), M_TEMP, M_WAITOK | M_ZERO); tmp_data1 = malloc(sizeof(struct witness_lock_order_data), M_TEMP, M_WAITOK | M_ZERO); tmp_data2 = malloc(sizeof(struct witness_lock_order_data), M_TEMP, M_WAITOK | M_ZERO); stack_zero(&tmp_data1->wlod_stack); stack_zero(&tmp_data2->wlod_stack); restart: mtx_lock_spin(&w_mtx); generation = w_generation; mtx_unlock_spin(&w_mtx); sbuf_printf(sb, "Number of known direct relationships is %d\n", w_lohash.wloh_count); for (i = 1; i < w_max_used_index; i++) { mtx_lock_spin(&w_mtx); if (generation != w_generation) { mtx_unlock_spin(&w_mtx); /* The graph has changed, try again. */ *oldidx = 0; sbuf_clear(sb); goto restart; } w1 = &w_data[i]; if (w1->w_reversed == 0) { mtx_unlock_spin(&w_mtx); continue; } /* Copy w1 locally so we can release the spin lock. */ *tmp_w1 = *w1; mtx_unlock_spin(&w_mtx); if (tmp_w1->w_reversed == 0) continue; for (j = 1; j < w_max_used_index; j++) { if ((w_rmatrix[i][j] & WITNESS_REVERSAL) == 0 || i > j) continue; mtx_lock_spin(&w_mtx); if (generation != w_generation) { mtx_unlock_spin(&w_mtx); /* The graph has changed, try again. */ *oldidx = 0; sbuf_clear(sb); goto restart; } w2 = &w_data[j]; data1 = witness_lock_order_get(w1, w2); data2 = witness_lock_order_get(w2, w1); /* * Copy information locally so we can release the * spin lock. 
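* Formatting the saved stacks into the sbuf below can be slow, which * is not acceptable while holding a spin mutex.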
*/ *tmp_w2 = *w2; if (data1) { stack_zero(&tmp_data1->wlod_stack); stack_copy(&data1->wlod_stack, &tmp_data1->wlod_stack); } if (data2 && data2 != data1) { stack_zero(&tmp_data2->wlod_stack); stack_copy(&data2->wlod_stack, &tmp_data2->wlod_stack); } mtx_unlock_spin(&w_mtx); sbuf_printf(sb, "\nLock order reversal between \"%s\"(%s) and \"%s\"(%s)!\n", tmp_w1->w_name, tmp_w1->w_class->lc_name, tmp_w2->w_name, tmp_w2->w_class->lc_name); if (data1) { sbuf_printf(sb, "Lock order \"%s\"(%s) -> \"%s\"(%s) first seen at:\n", tmp_w1->w_name, tmp_w1->w_class->lc_name, tmp_w2->w_name, tmp_w2->w_class->lc_name); stack_sbuf_print(sb, &tmp_data1->wlod_stack); sbuf_printf(sb, "\n"); } if (data2 && data2 != data1) { sbuf_printf(sb, "Lock order \"%s\"(%s) -> \"%s\"(%s) first seen at:\n", tmp_w2->w_name, tmp_w2->w_class->lc_name, tmp_w1->w_name, tmp_w1->w_class->lc_name); stack_sbuf_print(sb, &tmp_data2->wlod_stack); sbuf_printf(sb, "\n"); } } } mtx_lock_spin(&w_mtx); if (generation != w_generation) { mtx_unlock_spin(&w_mtx); /* * The graph changed while we were printing stack data, * try again. */ *oldidx = 0; sbuf_clear(sb); goto restart; } mtx_unlock_spin(&w_mtx); /* Free temporary storage space. */ free(tmp_data1, M_TEMP); free(tmp_data2, M_TEMP); free(tmp_w1, M_TEMP); free(tmp_w2, M_TEMP); } static int sysctl_debug_witness_badstacks(SYSCTL_HANDLER_ARGS) { struct sbuf *sb; int error; if (witness_watch < 1) { error = SYSCTL_OUT(req, w_notrunning, sizeof(w_notrunning)); return (error); } if (witness_cold) { error = SYSCTL_OUT(req, w_stillcold, sizeof(w_stillcold)); return (error); } error = 0; sb = sbuf_new(NULL, NULL, badstack_sbuf_size, SBUF_AUTOEXTEND); if (sb == NULL) return (ENOMEM); sbuf_print_witness_badstacks(sb, &req->oldidx); sbuf_finish(sb); error = SYSCTL_OUT(req, sbuf_data(sb), sbuf_len(sb) + 1); sbuf_delete(sb); return (error); } #ifdef DDB static int sbuf_db_printf_drain(void *arg __unused, const char *data, int len) { return (db_printf("%.*s", len, data)); } DB_SHOW_COMMAND(badstacks, db_witness_badstacks) { struct sbuf sb; char buffer[128]; size_t dummy; sbuf_new(&sb, buffer, sizeof(buffer), SBUF_FIXEDLEN); sbuf_set_drain(&sb, sbuf_db_printf_drain, NULL); sbuf_print_witness_badstacks(&sb, &dummy); sbuf_finish(&sb); } #endif static int sysctl_debug_witness_channel(SYSCTL_HANDLER_ARGS) { static const struct { enum witness_channel channel; const char *name; } channels[] = { { WITNESS_CONSOLE, "console" }, { WITNESS_LOG, "log" }, { WITNESS_NONE, "none" }, }; char buf[16]; u_int i; int error; buf[0] = '\0'; for (i = 0; i < nitems(channels); i++) if (witness_channel == channels[i].channel) { snprintf(buf, sizeof(buf), "%s", channels[i].name); break; } error = sysctl_handle_string(oidp, buf, sizeof(buf), req); if (error != 0 || req->newptr == NULL) return (error); error = EINVAL; for (i = 0; i < nitems(channels); i++) if (strcmp(channels[i].name, buf) == 0) { witness_channel = channels[i].channel; error = 0; break; } return (error); } static int sysctl_debug_witness_fullgraph(SYSCTL_HANDLER_ARGS) { struct witness *w; struct sbuf *sb; int error; if (witness_watch < 1) { error = SYSCTL_OUT(req, w_notrunning, sizeof(w_notrunning)); return (error); } if (witness_cold) { error = SYSCTL_OUT(req, w_stillcold, sizeof(w_stillcold)); return (error); } error = 0; error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); sb = sbuf_new_for_sysctl(NULL, NULL, FULLGRAPH_SBUF_SIZE, req); if (sb == NULL) return (ENOMEM); sbuf_printf(sb, "\n"); mtx_lock_spin(&w_mtx); STAILQ_FOREACH(w, &w_all, 
w_list) w->w_displayed = 0; STAILQ_FOREACH(w, &w_all, w_list) witness_add_fullgraph(sb, w); mtx_unlock_spin(&w_mtx); /* * Close the sbuf and return to userland. */ error = sbuf_finish(sb); sbuf_delete(sb); return (error); } static int sysctl_debug_witness_watch(SYSCTL_HANDLER_ARGS) { int error, value; value = witness_watch; error = sysctl_handle_int(oidp, &value, 0, req); if (error != 0 || req->newptr == NULL) return (error); if (value > 1 || value < -1 || (witness_watch == -1 && value != witness_watch)) return (EINVAL); witness_watch = value; return (0); } static void witness_add_fullgraph(struct sbuf *sb, struct witness *w) { int i; if (w->w_displayed != 0 || (w->w_file == NULL && w->w_line == 0)) return; w->w_displayed = 1; WITNESS_INDEX_ASSERT(w->w_index); for (i = 1; i <= w_max_used_index; i++) { if (w_rmatrix[w->w_index][i] & WITNESS_PARENT) { sbuf_printf(sb, "\"%s\",\"%s\"\n", w->w_name, w_data[i].w_name); witness_add_fullgraph(sb, &w_data[i]); } } } /* * A simple hash function. Takes a key pointer and a key size. If size == 0, * interprets the key as a string and reads until the null * terminator. Otherwise, reads the first size bytes. Returns an unsigned 32-bit * hash value computed from the key. */ static uint32_t witness_hash_djb2(const uint8_t *key, uint32_t size) { unsigned int hash = 5381; int i; /* hash = hash * 33 + key[i] */ if (size) for (i = 0; i < size; i++) hash = ((hash << 5) + hash) + (unsigned int)key[i]; else for (i = 0; key[i] != 0; i++) hash = ((hash << 5) + hash) + (unsigned int)key[i]; return (hash); } /* * Initializes the two witness hash tables. Called exactly once from * witness_initialize(). */ static void witness_init_hash_tables(void) { int i; MPASS(witness_cold); /* Initialize the hash tables. */ for (i = 0; i < WITNESS_HASH_SIZE; i++) w_hash.wh_array[i] = NULL; w_hash.wh_size = WITNESS_HASH_SIZE; w_hash.wh_count = 0; /* Initialize the lock order data hash. 
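	 * Both tables are keyed with witness_hash_djb2() above.  As a
	 * worked example (assuming the one-byte string key "a"), the hash
	 * is 5381 * 33 + 'a', i.e., 177573 + 97 = 177670, before the
	 * final modulo by the table size.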
*/ w_lofree = NULL; for (i = 0; i < WITNESS_LO_DATA_COUNT; i++) { memset(&w_lodata[i], 0, sizeof(w_lodata[i])); w_lodata[i].wlod_next = w_lofree; w_lofree = &w_lodata[i]; } w_lohash.wloh_size = WITNESS_LO_HASH_SIZE; w_lohash.wloh_count = 0; for (i = 0; i < WITNESS_LO_HASH_SIZE; i++) w_lohash.wloh_array[i] = NULL; } static struct witness * witness_hash_get(const char *key) { struct witness *w; uint32_t hash; MPASS(key != NULL); if (witness_cold == 0) mtx_assert(&w_mtx, MA_OWNED); hash = witness_hash_djb2(key, 0) % w_hash.wh_size; w = w_hash.wh_array[hash]; while (w != NULL) { if (strcmp(w->w_name, key) == 0) goto out; w = w->w_hash_next; } out: return (w); } static void witness_hash_put(struct witness *w) { uint32_t hash; MPASS(w != NULL); MPASS(w->w_name != NULL); if (witness_cold == 0) mtx_assert(&w_mtx, MA_OWNED); KASSERT(witness_hash_get(w->w_name) == NULL, ("%s: trying to add a hash entry that already exists!", __func__)); KASSERT(w->w_hash_next == NULL, ("%s: w->w_hash_next != NULL", __func__)); hash = witness_hash_djb2(w->w_name, 0) % w_hash.wh_size; w->w_hash_next = w_hash.wh_array[hash]; w_hash.wh_array[hash] = w; w_hash.wh_count++; } static struct witness_lock_order_data * witness_lock_order_get(struct witness *parent, struct witness *child) { struct witness_lock_order_data *data = NULL; struct witness_lock_order_key key; unsigned int hash; MPASS(parent != NULL && child != NULL); key.from = parent->w_index; key.to = child->w_index; WITNESS_INDEX_ASSERT(key.from); WITNESS_INDEX_ASSERT(key.to); if ((w_rmatrix[parent->w_index][child->w_index] & WITNESS_LOCK_ORDER_KNOWN) == 0) goto out; hash = witness_hash_djb2((const char*)&key, sizeof(key)) % w_lohash.wloh_size; data = w_lohash.wloh_array[hash]; while (data != NULL) { if (witness_lock_order_key_equal(&data->wlod_key, &key)) break; data = data->wlod_next; } out: return (data); } /* * Verify that parent and child have a known relationship, are not the same, * and child is actually a child of parent. This is done without w_mtx * to avoid contention in the common case. */ static int witness_lock_order_check(struct witness *parent, struct witness *child) { if (parent != child && w_rmatrix[parent->w_index][child->w_index] & WITNESS_LOCK_ORDER_KNOWN && isitmychild(parent, child)) return (1); return (0); } static int witness_lock_order_add(struct witness *parent, struct witness *child) { struct witness_lock_order_data *data = NULL; struct witness_lock_order_key key; unsigned int hash; MPASS(parent != NULL && child != NULL); key.from = parent->w_index; key.to = child->w_index; WITNESS_INDEX_ASSERT(key.from); WITNESS_INDEX_ASSERT(key.to); if (w_rmatrix[parent->w_index][child->w_index] & WITNESS_LOCK_ORDER_KNOWN) return (1); hash = witness_hash_djb2((const char*)&key, sizeof(key)) % w_lohash.wloh_size; w_rmatrix[parent->w_index][child->w_index] |= WITNESS_LOCK_ORDER_KNOWN; data = w_lofree; if (data == NULL) return (0); w_lofree = data->wlod_next; data->wlod_next = w_lohash.wloh_array[hash]; data->wlod_key = key; w_lohash.wloh_array[hash] = data; w_lohash.wloh_count++; stack_zero(&data->wlod_stack); stack_save(&data->wlod_stack); return (1); } /* Call this whenever the structure of the witness graph changes. 
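 * Readers such as sbuf_print_witness_badstacks() above snapshot
 * w_generation and restart from scratch when the value they saw no
 * longer matches.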
*/ static void witness_increment_graph_generation(void) { if (witness_cold == 0) mtx_assert(&w_mtx, MA_OWNED); w_generation++; } static int witness_output_drain(void *arg __unused, const char *data, int len) { witness_output("%.*s", len, data); return (len); } static void witness_debugger(int cond, const char *msg) { char buf[32]; struct sbuf sb; struct stack st; if (!cond) return; if (witness_trace) { sbuf_new(&sb, buf, sizeof(buf), SBUF_FIXEDLEN); sbuf_set_drain(&sb, witness_output_drain, NULL); stack_zero(&st); stack_save(&st); witness_output("stack backtrace:\n"); stack_sbuf_print_ddb(&sb, &st); sbuf_finish(&sb); } #ifdef KDB if (witness_kdb) kdb_enter(KDB_WHY_WITNESS, msg); #endif } Index: user/markj/netdump/sys/mips/ingenic/jz4780_pinctrl.c =================================================================== --- user/markj/netdump/sys/mips/ingenic/jz4780_pinctrl.c (revision 332407) +++ user/markj/netdump/sys/mips/ingenic/jz4780_pinctrl.c (revision 332408) @@ -1,260 +1,260 @@ /*- * Copyright 2015 Alexander Kabaev * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * Ingenic JZ4780 pinctrl driver. 
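 *
 * The pinctrl node is treated as a simplebus: each "gpio-controller"
 * child gets a 256-byte (CHIP_REG_STRIDE) slice of the parent's register
 * window, and "ingenic,pins" properties are parsed as tuples of four
 * cells -- apparently (chip xref, pin number, pin function, pin-config
 * xref) -- see jz4780_pinctrl_configure_pins() below.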
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "jz4780_gpio_if.h" struct jz4780_pinctrl_softc { struct simplebus_softc ssc; device_t dev; }; #define CHIP_REG_STRIDE 256 #define CHIP_REG_OFFSET(base, chip) ((base) + (chip) * CHIP_REG_STRIDE) static int jz4780_pinctrl_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (!ofw_bus_is_compatible(dev, "ingenic,jz4780-pinctrl")) return (ENXIO); device_set_desc(dev, "Ingenic JZ4780 GPIO"); return (BUS_PROBE_DEFAULT); } static int jz4780_pinctrl_attach(device_t dev) { struct jz4780_pinctrl_softc *sc; struct resource_list *rs; struct resource_list_entry *re; phandle_t dt_parent, dt_child; int i, ret; sc = device_get_softc(dev); sc->dev = dev; /* * Fetch our own resource list to dole memory between children */ rs = BUS_GET_RESOURCE_LIST(device_get_parent(dev), dev); if (rs == NULL) return (ENXIO); re = resource_list_find(rs, SYS_RES_MEMORY, 0); if (re == NULL) return (ENXIO); simplebus_init(dev, 0); /* Iterate over this node children, looking for pin controllers */ dt_parent = ofw_bus_get_node(dev); i = 0; for (dt_child = OF_child(dt_parent); dt_child != 0; dt_child = OF_peer(dt_child)) { struct simplebus_devinfo *ndi; device_t child; bus_addr_t phys; bus_size_t size; /* Add gpio controller child */ if (!OF_hasprop(dt_child, "gpio-controller")) continue; child = simplebus_add_device(dev, dt_child, 0, NULL, -1, NULL); if (child == NULL) break; /* Setup child resources */ phys = CHIP_REG_OFFSET(re->start, i); size = CHIP_REG_STRIDE; if (phys + size - 1 <= re->end) { ndi = device_get_ivars(child); resource_list_add(&ndi->rl, SYS_RES_MEMORY, 0, phys, phys + size - 1, size); } i++; } ret = bus_generic_attach(dev); if (ret == 0) { fdt_pinctrl_register(dev, "ingenic,pins"); fdt_pinctrl_configure_tree(dev); } return (ret); } static int jz4780_pinctrl_detach(device_t dev) { bus_generic_detach(dev); return (0); } struct jx4780_bias_prop { const char *name; uint32_t bias; }; static struct jx4780_bias_prop jx4780_bias_table[] = { { "bias-disable", 0 }, { "bias-pull-up", GPIO_PIN_PULLUP }, { "bias-pull-down", GPIO_PIN_PULLDOWN }, }; static int jz4780_pinctrl_parse_pincfg(phandle_t pincfgxref, uint32_t *bias_value) { phandle_t pincfg_node; int i; pincfg_node = OF_node_from_xref(pincfgxref); for (i = 0; i < nitems(jx4780_bias_table); i++) { if (OF_hasprop(pincfg_node, jx4780_bias_table[i].name)) { *bias_value = jx4780_bias_table[i].bias; return 0; } } return -1; } static device_t jz4780_pinctrl_chip_lookup(struct jz4780_pinctrl_softc *sc, phandle_t chipxref) { device_t chipdev; chipdev = OF_device_from_xref(chipxref); return chipdev; } static int jz4780_pinctrl_configure_pins(device_t dev, phandle_t cfgxref) { struct jz4780_pinctrl_softc *sc = device_get_softc(dev); device_t chip; phandle_t node; ssize_t i, len; uint32_t *value, *pconf; int result; node = OF_node_from_xref(cfgxref); - len = OF_getencprop_alloc(node, "ingenic,pins", sizeof(uint32_t) * 4, - (void **)&value); + len = OF_getencprop_alloc_multi(node, "ingenic,pins", + sizeof(uint32_t) * 4, (void **)&value); if (len < 0) { device_printf(dev, "missing ingenic,pins attribute in FDT\n"); return (ENXIO); } pconf = value; result = EINVAL; for (i = 0; i < len; i++, pconf += 4) { uint32_t bias; /* Lookup the chip that handles this configuration */ chip = jz4780_pinctrl_chip_lookup(sc, pconf[0]); if (chip == NULL) { 
device_printf(dev, "invalid gpio controller reference in FDT\n"); goto done; } if (jz4780_pinctrl_parse_pincfg(pconf[3], &bias) != 0) { device_printf(dev, "invalid pin bias for pin %u on %s in FDT\n", pconf[1], ofw_bus_get_name(chip)); goto done; } result = JZ4780_GPIO_CONFIGURE_PIN(chip, pconf[1], pconf[2], bias); if (result != 0) { device_printf(dev, "failed to configure pin %u on %s\n", pconf[1], ofw_bus_get_name(chip)); goto done; } } result = 0; done: free(value, M_OFWPROP); return (result); } static device_method_t jz4780_pinctrl_methods[] = { /* Device interface */ DEVMETHOD(device_probe, jz4780_pinctrl_probe), DEVMETHOD(device_attach, jz4780_pinctrl_attach), DEVMETHOD(device_detach, jz4780_pinctrl_detach), /* fdt_pinctrl interface */ DEVMETHOD(fdt_pinctrl_configure, jz4780_pinctrl_configure_pins), DEVMETHOD_END }; static devclass_t jz4780_pinctrl_devclass; DEFINE_CLASS_1(pinctrl, jz4780_pinctrl_driver, jz4780_pinctrl_methods, sizeof(struct jz4780_pinctrl_softc), simplebus_driver); EARLY_DRIVER_MODULE(pinctrl, simplebus, jz4780_pinctrl_driver, jz4780_pinctrl_devclass, 0, 0, BUS_PASS_INTERRUPT + BUS_PASS_ORDER_LATE); Index: user/markj/netdump/sys/mips/mediatek/fdt_reset.c =================================================================== --- user/markj/netdump/sys/mips/mediatek/fdt_reset.c (revision 332407) +++ user/markj/netdump/sys/mips/mediatek/fdt_reset.c (revision 332408) @@ -1,125 +1,125 @@ /*- * Copyright (c) 2016 Stanislav Galabov * Copyright (c) 2014 Ian Lepore * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include #include #include #include #include #include #include #include #include #include "fdt_reset_if.h" #include /* * Loop through all the tuples in the resets= property for a device, asserting * or deasserting each reset. * * Be liberal about errors for now: warn about a failure to (de)assert but keep * trying with any other resets in the list. Return ENXIO if any errors were * found, and let the caller decide whether the problem is fatal. 
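 *
 * The "resets" property is read as (xref, number) pairs; for example, a
 * node carrying
 *
 *	resets = <&rstctl 14>, <&rstctl 22>;
 *
 * (rstctl being an arbitrary provider label) yields ncells = 4, i.e.,
 * two reset lines handled by the rstctl device.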
*/ static int assert_deassert_all(device_t consumer, boolean_t assert) { phandle_t rnode; device_t resetdev; int resetnum, err, i, ncells; uint32_t *resets; boolean_t anyerrors; rnode = ofw_bus_get_node(consumer); - ncells = OF_getencprop_alloc(rnode, "resets", sizeof(*resets), + ncells = OF_getencprop_alloc_multi(rnode, "resets", sizeof(*resets), (void **)&resets); if (!assert && ncells < 2) { device_printf(consumer, "Warning: No resets specified in fdt " "data; device may not function."); return (ENXIO); } anyerrors = false; for (i = 0; i < ncells; i += 2) { resetdev = OF_device_from_xref(resets[i]); resetnum = resets[i + 1]; if (resetdev == NULL) { if (!assert) device_printf(consumer, "Warning: can not find " "driver for reset number %u; device may " "not function\n", resetnum); anyerrors = true; continue; } if (assert) err = FDT_RESET_ASSERT(resetdev, resetnum); else err = FDT_RESET_DEASSERT(resetdev, resetnum); if (err != 0) { if (!assert) device_printf(consumer, "Warning: failed to " "deassert reset number %u; device may not " "function\n", resetnum); anyerrors = true; } } OF_prop_free(resets); return (anyerrors ? ENXIO : 0); } int fdt_reset_assert_all(device_t consumer) { return (assert_deassert_all(consumer, true)); } int fdt_reset_deassert_all(device_t consumer) { return (assert_deassert_all(consumer, false)); } void fdt_reset_register_provider(device_t provider) { OF_device_register_xref( OF_xref_from_node(ofw_bus_get_node(provider)), provider); } void fdt_reset_unregister_provider(device_t provider) { OF_device_register_xref(OF_xref_from_device(provider), NULL); } Index: user/markj/netdump/sys/net/bpf.c =================================================================== --- user/markj/netdump/sys/net/bpf.c (revision 332407) +++ user/markj/netdump/sys/net/bpf.c (revision 332408) @@ -1,3063 +1,3066 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1990, 1991, 1993 * The Regents of the University of California. All rights reserved. * * This code is derived from the Stanford/CMU enet packet filter, * (net/enet.c) distributed as part of 4.3BSD, and code contributed * to Berkeley by Steven McCanne and Van Jacobson both of Lawrence * Berkeley Laboratory. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)bpf.c 8.4 (Berkeley) 1/9/95 */ #include __FBSDID("$FreeBSD$"); #include "opt_bpf.h" #include "opt_ddb.h" #include "opt_netgraph.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DDB #include #endif #include #include #include #include #include #ifdef BPF_JITTER #include #endif #include #include #include #include #include #include #include #include #include #include MALLOC_DEFINE(M_BPF, "BPF", "BPF data"); struct bpf_if { #define bif_next bif_ext.bif_next #define bif_dlist bif_ext.bif_dlist struct bpf_if_ext bif_ext; /* public members */ u_int bif_dlt; /* link layer type */ u_int bif_hdrlen; /* length of link header */ struct ifnet *bif_ifp; /* corresponding interface */ struct rwlock bif_lock; /* interface lock */ LIST_HEAD(, bpf_d) bif_wlist; /* writer-only list */ int bif_flags; /* Interface flags */ struct bpf_if **bif_bpf; /* Pointer to pointer to us */ }; CTASSERT(offsetof(struct bpf_if, bif_ext) == 0); #if defined(DEV_BPF) || defined(NETGRAPH_BPF) #define PRINET 26 /* interruptible */ #define SIZEOF_BPF_HDR(type) \ (offsetof(type, bh_hdrlen) + sizeof(((type *)0)->bh_hdrlen)) #ifdef COMPAT_FREEBSD32 #include #include #define BPF_ALIGNMENT32 sizeof(int32_t) #define BPF_WORDALIGN32(x) roundup2(x, BPF_ALIGNMENT32) #ifndef BURN_BRIDGES /* * 32-bit version of structure prepended to each packet. We use this header * instead of the standard one for 32-bit streams. We mark the a stream as * 32-bit the first time we see a 32-bit compat ioctl request. */ struct bpf_hdr32 { struct timeval32 bh_tstamp; /* time stamp */ uint32_t bh_caplen; /* length of captured portion */ uint32_t bh_datalen; /* original length of packet */ uint16_t bh_hdrlen; /* length of bpf header (this struct plus alignment padding) */ }; #endif struct bpf_program32 { u_int bf_len; uint32_t bf_insns; }; struct bpf_dltlist32 { u_int bfl_len; u_int bfl_list; }; #define BIOCSETF32 _IOW('B', 103, struct bpf_program32) #define BIOCSRTIMEOUT32 _IOW('B', 109, struct timeval32) #define BIOCGRTIMEOUT32 _IOR('B', 110, struct timeval32) #define BIOCGDLTLIST32 _IOWR('B', 121, struct bpf_dltlist32) #define BIOCSETWF32 _IOW('B', 123, struct bpf_program32) #define BIOCSETFNR32 _IOW('B', 130, struct bpf_program32) #endif +#define BPF_LOCK() sx_xlock(&bpf_sx) +#define BPF_UNLOCK() sx_xunlock(&bpf_sx) +#define BPF_LOCK_ASSERT() sx_assert(&bpf_sx, SA_XLOCKED) /* * bpf_iflist is a list of BPF interface structures, each corresponding to a * specific DLT. The same network interface might have several BPF interface * structures registered by different layers in the stack (i.e., 802.11 * frames, ethernet frames, etc). 
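 * For example, an 802.11 device can expose one bpf_if for
 * DLT_IEEE802_11 and another for DLT_EN10MB, each with its own list of
 * descriptors.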
*/ static LIST_HEAD(, bpf_if) bpf_iflist, bpf_freelist; -static struct mtx bpf_mtx; /* bpf global lock */ +static struct sx bpf_sx; /* bpf global lock */ static int bpf_bpfd_cnt; static void bpf_attachd(struct bpf_d *, struct bpf_if *); static void bpf_detachd(struct bpf_d *); static void bpf_detachd_locked(struct bpf_d *); static void bpf_freed(struct bpf_d *); static int bpf_movein(struct uio *, int, struct ifnet *, struct mbuf **, struct sockaddr *, int *, struct bpf_d *); static int bpf_setif(struct bpf_d *, struct ifreq *); static void bpf_timed_out(void *); static __inline void bpf_wakeup(struct bpf_d *); static void catchpacket(struct bpf_d *, u_char *, u_int, u_int, void (*)(struct bpf_d *, caddr_t, u_int, void *, u_int), struct bintime *); static void reset_d(struct bpf_d *); static int bpf_setf(struct bpf_d *, struct bpf_program *, u_long cmd); static int bpf_getdltlist(struct bpf_d *, struct bpf_dltlist *); static int bpf_setdlt(struct bpf_d *, u_int); static void filt_bpfdetach(struct knote *); static int filt_bpfread(struct knote *, long); static void bpf_drvinit(void *); static int bpf_stats_sysctl(SYSCTL_HANDLER_ARGS); SYSCTL_NODE(_net, OID_AUTO, bpf, CTLFLAG_RW, 0, "bpf sysctl"); int bpf_maxinsns = BPF_MAXINSNS; SYSCTL_INT(_net_bpf, OID_AUTO, maxinsns, CTLFLAG_RW, &bpf_maxinsns, 0, "Maximum bpf program instructions"); static int bpf_zerocopy_enable = 0; SYSCTL_INT(_net_bpf, OID_AUTO, zerocopy_enable, CTLFLAG_RW, &bpf_zerocopy_enable, 0, "Enable new zero-copy BPF buffer sessions"); static SYSCTL_NODE(_net_bpf, OID_AUTO, stats, CTLFLAG_MPSAFE | CTLFLAG_RW, bpf_stats_sysctl, "bpf statistics portal"); static VNET_DEFINE(int, bpf_optimize_writers) = 0; #define V_bpf_optimize_writers VNET(bpf_optimize_writers) SYSCTL_INT(_net_bpf, OID_AUTO, optimize_writers, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(bpf_optimize_writers), 0, "Do not send packets until BPF program is set"); static d_open_t bpfopen; static d_read_t bpfread; static d_write_t bpfwrite; static d_ioctl_t bpfioctl; static d_poll_t bpfpoll; static d_kqfilter_t bpfkqfilter; static struct cdevsw bpf_cdevsw = { .d_version = D_VERSION, .d_open = bpfopen, .d_read = bpfread, .d_write = bpfwrite, .d_ioctl = bpfioctl, .d_poll = bpfpoll, .d_name = "bpf", .d_kqfilter = bpfkqfilter, }; static struct filterops bpfread_filtops = { .f_isfd = 1, .f_detach = filt_bpfdetach, .f_event = filt_bpfread, }; eventhandler_tag bpf_ifdetach_cookie = NULL; /* * LOCKING MODEL USED BY BPF: * Locks: * 1) global lock (BPF_LOCK). Mutex, used to protect interface addition/removal, * some global counters and every bpf_if reference. * 2) Interface lock. Rwlock, used to protect list of BPF descriptors and their filters. * 3) Descriptor lock. Mutex, used to protect BPF buffers and various structure fields * used by bpf_mtap code. * * Lock order: * * Global lock, interface lock, descriptor lock * * We have to acquire interface lock before descriptor main lock due to BPF_MTAP[2] * working model. In many places (like bpf_detachd) we start with BPF descriptor * (and we need to at least rlock it to get reliable interface pointer). This * gives us potential LOR. As a result, we use global lock to protect from bpf_if * change in every such place. * * Changing d->bd_bif is protected by 1) global lock, 2) interface lock and * 3) descriptor main wlock. * Reading bd_bif can be protected by any of these locks, typically global lock. * * Changing read/write BPF filter is protected by the same three locks, * the same applies for reading. 
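 *
 * As an illustrative sketch (not a complete code path), an attach
 * therefore nests the locks in this order:
 *
 *	BPF_LOCK();		-- global sx lock
 *	BPFIF_WLOCK(bp);	-- per-interface rwlock
 *	BPFD_LOCK(d);		-- per-descriptor mutex
 *	d->bd_bif = bp;
 *	BPFD_UNLOCK(d);
 *	BPFIF_WUNLOCK(bp);
 *	BPF_UNLOCK();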
 *
 * Sleeping in global lock is not allowed due to bpfdetach() using it.
 */

/*
 * Wrapper functions for various buffering methods.  If the set of buffer
 * modes expands, we will probably want to introduce a switch data structure
 * similar to protosw, etc.
 */
static void
bpf_append_bytes(struct bpf_d *d, caddr_t buf, u_int offset, void *src,
    u_int len)
{

	BPFD_LOCK_ASSERT(d);

	switch (d->bd_bufmode) {
	case BPF_BUFMODE_BUFFER:
		return (bpf_buffer_append_bytes(d, buf, offset, src, len));

	case BPF_BUFMODE_ZBUF:
		counter_u64_add(d->bd_zcopy, 1);
		return (bpf_zerocopy_append_bytes(d, buf, offset, src, len));

	default:
		panic("bpf_buf_append_bytes");
	}
}

static void
bpf_append_mbuf(struct bpf_d *d, caddr_t buf, u_int offset, void *src,
    u_int len)
{

	BPFD_LOCK_ASSERT(d);

	switch (d->bd_bufmode) {
	case BPF_BUFMODE_BUFFER:
		return (bpf_buffer_append_mbuf(d, buf, offset, src, len));

	case BPF_BUFMODE_ZBUF:
		counter_u64_add(d->bd_zcopy, 1);
		return (bpf_zerocopy_append_mbuf(d, buf, offset, src, len));

	default:
		panic("bpf_buf_append_mbuf");
	}
}

/*
 * This function gets called when the free buffer is re-assigned.
 */
static void
bpf_buf_reclaimed(struct bpf_d *d)
{

	BPFD_LOCK_ASSERT(d);

	switch (d->bd_bufmode) {
	case BPF_BUFMODE_BUFFER:
		return;

	case BPF_BUFMODE_ZBUF:
		bpf_zerocopy_buf_reclaimed(d);
		return;

	default:
		panic("bpf_buf_reclaimed");
	}
}

/*
 * If the buffer mechanism has a way to decide that a held buffer can be made
 * free, then it is exposed via the bpf_canfreebuf() interface.  (1) is
 * returned if the buffer can be discarded, (0) is returned if it cannot.
 */
static int
bpf_canfreebuf(struct bpf_d *d)
{

	BPFD_LOCK_ASSERT(d);

	switch (d->bd_bufmode) {
	case BPF_BUFMODE_ZBUF:
		return (bpf_zerocopy_canfreebuf(d));
	}
	return (0);
}

/*
 * Allow the buffer model to indicate that the current store buffer is
 * immutable, regardless of the appearance of space.  Return (1) if the
 * buffer is writable, and (0) if not.
 */
static int
bpf_canwritebuf(struct bpf_d *d)
{

	BPFD_LOCK_ASSERT(d);

	switch (d->bd_bufmode) {
	case BPF_BUFMODE_ZBUF:
		return (bpf_zerocopy_canwritebuf(d));
	}
	return (1);
}

/*
 * Notify buffer model that an attempt to write to the store buffer has
 * resulted in a dropped packet, in which case the buffer may be considered
 * full.
 */
static void
bpf_buffull(struct bpf_d *d)
{

	BPFD_LOCK_ASSERT(d);

	switch (d->bd_bufmode) {
	case BPF_BUFMODE_ZBUF:
		bpf_zerocopy_buffull(d);
		break;
	}
}

/*
 * Notify the buffer model that a buffer has moved into the hold position.
*/ void bpf_bufheld(struct bpf_d *d) { BPFD_LOCK_ASSERT(d); switch (d->bd_bufmode) { case BPF_BUFMODE_ZBUF: bpf_zerocopy_bufheld(d); break; } } static void bpf_free(struct bpf_d *d) { switch (d->bd_bufmode) { case BPF_BUFMODE_BUFFER: return (bpf_buffer_free(d)); case BPF_BUFMODE_ZBUF: return (bpf_zerocopy_free(d)); default: panic("bpf_buf_free"); } } static int bpf_uiomove(struct bpf_d *d, caddr_t buf, u_int len, struct uio *uio) { if (d->bd_bufmode != BPF_BUFMODE_BUFFER) return (EOPNOTSUPP); return (bpf_buffer_uiomove(d, buf, len, uio)); } static int bpf_ioctl_sblen(struct bpf_d *d, u_int *i) { if (d->bd_bufmode != BPF_BUFMODE_BUFFER) return (EOPNOTSUPP); return (bpf_buffer_ioctl_sblen(d, i)); } static int bpf_ioctl_getzmax(struct thread *td, struct bpf_d *d, size_t *i) { if (d->bd_bufmode != BPF_BUFMODE_ZBUF) return (EOPNOTSUPP); return (bpf_zerocopy_ioctl_getzmax(td, d, i)); } static int bpf_ioctl_rotzbuf(struct thread *td, struct bpf_d *d, struct bpf_zbuf *bz) { if (d->bd_bufmode != BPF_BUFMODE_ZBUF) return (EOPNOTSUPP); return (bpf_zerocopy_ioctl_rotzbuf(td, d, bz)); } static int bpf_ioctl_setzbuf(struct thread *td, struct bpf_d *d, struct bpf_zbuf *bz) { if (d->bd_bufmode != BPF_BUFMODE_ZBUF) return (EOPNOTSUPP); return (bpf_zerocopy_ioctl_setzbuf(td, d, bz)); } /* * General BPF functions. */ static int bpf_movein(struct uio *uio, int linktype, struct ifnet *ifp, struct mbuf **mp, struct sockaddr *sockp, int *hdrlen, struct bpf_d *d) { const struct ieee80211_bpf_params *p; struct ether_header *eh; struct mbuf *m; int error; int len; int hlen; int slen; /* * Build a sockaddr based on the data link layer type. * We do this at this level because the ethernet header * is copied directly into the data field of the sockaddr. * In the case of SLIP, there is no header and the packet * is forwarded as is. * Also, we are careful to leave room at the front of the mbuf * for the link level header. */ switch (linktype) { case DLT_SLIP: sockp->sa_family = AF_INET; hlen = 0; break; case DLT_EN10MB: sockp->sa_family = AF_UNSPEC; /* XXX Would MAXLINKHDR be better? */ hlen = ETHER_HDR_LEN; break; case DLT_FDDI: sockp->sa_family = AF_IMPLINK; hlen = 0; break; case DLT_RAW: sockp->sa_family = AF_UNSPEC; hlen = 0; break; case DLT_NULL: /* * null interface types require a 4 byte pseudo header which * corresponds to the address family of the packet. */ sockp->sa_family = AF_UNSPEC; hlen = 4; break; case DLT_ATM_RFC1483: /* * en atm driver requires 4-byte atm pseudo header. * though it isn't standard, vpi:vci needs to be * specified anyway. 
*/ sockp->sa_family = AF_UNSPEC; hlen = 12; /* XXX 4(ATM_PH) + 3(LLC) + 5(SNAP) */ break; case DLT_PPP: sockp->sa_family = AF_UNSPEC; hlen = 4; /* This should match PPP_HDRLEN */ break; case DLT_IEEE802_11: /* IEEE 802.11 wireless */ sockp->sa_family = AF_IEEE80211; hlen = 0; break; case DLT_IEEE802_11_RADIO: /* IEEE 802.11 wireless w/ phy params */ sockp->sa_family = AF_IEEE80211; sockp->sa_len = 12; /* XXX != 0 */ hlen = sizeof(struct ieee80211_bpf_params); break; default: return (EIO); } len = uio->uio_resid; if (len < hlen || len - hlen > ifp->if_mtu) return (EMSGSIZE); m = m_get2(len, M_WAITOK, MT_DATA, M_PKTHDR); if (m == NULL) return (EIO); m->m_pkthdr.len = m->m_len = len; *mp = m; error = uiomove(mtod(m, u_char *), len, uio); if (error) goto bad; slen = bpf_filter(d->bd_wfilter, mtod(m, u_char *), len, len); if (slen == 0) { error = EPERM; goto bad; } /* Check for multicast destination */ switch (linktype) { case DLT_EN10MB: eh = mtod(m, struct ether_header *); if (ETHER_IS_MULTICAST(eh->ether_dhost)) { if (bcmp(ifp->if_broadcastaddr, eh->ether_dhost, ETHER_ADDR_LEN) == 0) m->m_flags |= M_BCAST; else m->m_flags |= M_MCAST; } if (d->bd_hdrcmplt == 0) { memcpy(eh->ether_shost, IF_LLADDR(ifp), sizeof(eh->ether_shost)); } break; } /* * Make room for link header, and copy it to sockaddr */ if (hlen != 0) { if (sockp->sa_family == AF_IEEE80211) { /* * Collect true length from the parameter header * NB: sockp is known to be zero'd so if we do a * short copy unspecified parameters will be * zero. * NB: packet may not be aligned after stripping * bpf params * XXX check ibp_vers */ p = mtod(m, const struct ieee80211_bpf_params *); hlen = p->ibp_len; if (hlen > sizeof(sockp->sa_data)) { error = EINVAL; goto bad; } } bcopy(mtod(m, const void *), sockp->sa_data, hlen); } *hdrlen = hlen; return (0); bad: m_freem(m); return (error); } /* * Attach file to the bpf interface, i.e. make d listen on bp. */ static void bpf_attachd(struct bpf_d *d, struct bpf_if *bp) { int op_w; BPF_LOCK_ASSERT(); /* * Save sysctl value to protect from sysctl change * between reads */ op_w = V_bpf_optimize_writers || d->bd_writer; if (d->bd_bif != NULL) bpf_detachd_locked(d); /* * Point d at bp, and add d to the interface's list. * Since there are many applications using BPF for * sending raw packets only (dhcpd, cdpd are good examples) * we can delay adding d to the list of active listeners until * some filter is configured. */ BPFIF_WLOCK(bp); BPFD_LOCK(d); d->bd_bif = bp; if (op_w != 0) { /* Add to writers-only list */ LIST_INSERT_HEAD(&bp->bif_wlist, d, bd_next); /* * We decrement bd_writer on every filter set operation. * First BIOCSETF is done by pcap_open_live() to set up * snap length. After that appliation usually sets its own filter */ d->bd_writer = 2; } else LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next); BPFD_UNLOCK(d); BPFIF_WUNLOCK(bp); bpf_bpfd_cnt++; CTR3(KTR_NET, "%s: bpf_attach called by pid %d, adding to %s list", __func__, d->bd_pid, d->bd_writer ? "writer" : "active"); if (op_w == 0) EVENTHANDLER_INVOKE(bpf_track, bp->bif_ifp, bp->bif_dlt, 1); } /* * Check if we need to upgrade our descriptor @d from write-only mode. */ static int bpf_check_upgrade(u_long cmd, struct bpf_d *d, struct bpf_insn *fcode, int flen) { int is_snap, need_upgrade; /* * Check if we've already upgraded or new filter is empty. */ if (d->bd_writer == 0 || fcode == NULL) return (0); need_upgrade = 0; /* * Check if cmd looks like snaplen setting from * pcap_bpf.c:pcap_open_live(). 
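	 *
	 * Such a program is a single "return snaplen" instruction, roughly
	 *
	 *	struct bpf_insn snap = BPF_STMT(BPF_RET | BPF_K, snaplen);
	 *
	 * which is what the flen == 1 && (BPF_RET | BPF_K) test below
	 * matches.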
* Note we're not checking .k value here: * while pcap_open_live() definitely sets to non-zero value, * we'd prefer to treat k=0 (deny ALL) case the same way: e.g. * do not consider upgrading immediately */ if (cmd == BIOCSETF && flen == 1 && fcode[0].code == (BPF_RET | BPF_K)) is_snap = 1; else is_snap = 0; if (is_snap == 0) { /* * We're setting first filter and it doesn't look like * setting snaplen. We're probably using bpf directly. * Upgrade immediately. */ need_upgrade = 1; } else { /* * Do not require upgrade by first BIOCSETF * (used to set snaplen) by pcap_open_live(). */ if (--d->bd_writer == 0) { /* * First snaplen filter has already * been set. This is probably catch-all * filter */ need_upgrade = 1; } } CTR5(KTR_NET, "%s: filter function set by pid %d, " "bd_writer counter %d, snap %d upgrade %d", __func__, d->bd_pid, d->bd_writer, is_snap, need_upgrade); return (need_upgrade); } /* * Add d to the list of active bp filters. * Requires bpf_attachd() to be called before. */ static void bpf_upgraded(struct bpf_d *d) { struct bpf_if *bp; BPF_LOCK_ASSERT(); bp = d->bd_bif; /* * Filter can be set several times without specifying interface. * Mark d as reader and exit. */ if (bp == NULL) { BPFD_LOCK(d); d->bd_writer = 0; BPFD_UNLOCK(d); return; } BPFIF_WLOCK(bp); BPFD_LOCK(d); /* Remove from writers-only list */ LIST_REMOVE(d, bd_next); LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next); /* Mark d as reader */ d->bd_writer = 0; BPFD_UNLOCK(d); BPFIF_WUNLOCK(bp); CTR2(KTR_NET, "%s: upgrade required by pid %d", __func__, d->bd_pid); EVENTHANDLER_INVOKE(bpf_track, bp->bif_ifp, bp->bif_dlt, 1); } /* * Detach a file from its interface. */ static void bpf_detachd(struct bpf_d *d) { BPF_LOCK(); bpf_detachd_locked(d); BPF_UNLOCK(); } static void bpf_detachd_locked(struct bpf_d *d) { int error; struct bpf_if *bp; struct ifnet *ifp; CTR2(KTR_NET, "%s: detach required by pid %d", __func__, d->bd_pid); BPF_LOCK_ASSERT(); /* Check if descriptor is attached */ if ((bp = d->bd_bif) == NULL) return; BPFIF_WLOCK(bp); BPFD_LOCK(d); /* Save bd_writer value */ error = d->bd_writer; /* * Remove d from the interface's descriptor list. */ LIST_REMOVE(d, bd_next); ifp = bp->bif_ifp; d->bd_bif = NULL; BPFD_UNLOCK(d); BPFIF_WUNLOCK(bp); bpf_bpfd_cnt--; /* Call event handler iff d is attached */ if (error == 0) EVENTHANDLER_INVOKE(bpf_track, ifp, bp->bif_dlt, 0); /* * Check if this descriptor had requested promiscuous mode. * If so, turn it off. */ if (d->bd_promisc) { d->bd_promisc = 0; CURVNET_SET(ifp->if_vnet); error = ifpromisc(ifp, 0); CURVNET_RESTORE(); if (error != 0 && error != ENXIO) { /* * ENXIO can happen if a pccard is unplugged * Something is really wrong if we were able to put * the driver into promiscuous mode, but can't * take it out. */ if_printf(bp->bif_ifp, "bpf_detach: ifpromisc failed (%d)\n", error); } } } /* * Close the descriptor by detaching it from its interface, * deallocating its buffers, and marking it free. */ static void bpf_dtor(void *data) { struct bpf_d *d = data; BPFD_LOCK(d); if (d->bd_state == BPF_WAITING) callout_stop(&d->bd_callout); d->bd_state = BPF_IDLE; BPFD_UNLOCK(d); funsetown(&d->bd_sigio); bpf_detachd(d); #ifdef MAC mac_bpfdesc_destroy(d); #endif /* MAC */ seldrain(&d->bd_sel); knlist_destroy(&d->bd_sel.si_note); callout_drain(&d->bd_callout); bpf_freed(d); free(d, M_BPF); } /* * Open ethernet device. Returns ENXIO for illegal minor device number, * EBUSY if file is open by another process. 
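 * (In practice each open() gets its own struct bpf_d via
 * devfs_set_cdevpriv(); bpf_dtor() tears it down on close.)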
*/ /* ARGSUSED */ static int bpfopen(struct cdev *dev, int flags, int fmt, struct thread *td) { struct bpf_d *d; int error; d = malloc(sizeof(*d), M_BPF, M_WAITOK | M_ZERO); error = devfs_set_cdevpriv(d, bpf_dtor); if (error != 0) { free(d, M_BPF); return (error); } /* Setup counters */ d->bd_rcount = counter_u64_alloc(M_WAITOK); d->bd_dcount = counter_u64_alloc(M_WAITOK); d->bd_fcount = counter_u64_alloc(M_WAITOK); d->bd_wcount = counter_u64_alloc(M_WAITOK); d->bd_wfcount = counter_u64_alloc(M_WAITOK); d->bd_wdcount = counter_u64_alloc(M_WAITOK); d->bd_zcopy = counter_u64_alloc(M_WAITOK); /* * For historical reasons, perform a one-time initialization call to * the buffer routines, even though we're not yet committed to a * particular buffer method. */ bpf_buffer_init(d); if ((flags & FREAD) == 0) d->bd_writer = 2; d->bd_hbuf_in_use = 0; d->bd_bufmode = BPF_BUFMODE_BUFFER; d->bd_sig = SIGIO; d->bd_direction = BPF_D_INOUT; BPF_PID_REFRESH(d, td); #ifdef MAC mac_bpfdesc_init(d); mac_bpfdesc_create(td->td_ucred, d); #endif mtx_init(&d->bd_lock, devtoname(dev), "bpf cdev lock", MTX_DEF); callout_init_mtx(&d->bd_callout, &d->bd_lock, 0); knlist_init_mtx(&d->bd_sel.si_note, &d->bd_lock); return (0); } /* * bpfread - read next chunk of packets from buffers */ static int bpfread(struct cdev *dev, struct uio *uio, int ioflag) { struct bpf_d *d; int error; int non_block; int timed_out; error = devfs_get_cdevpriv((void **)&d); if (error != 0) return (error); /* * Restrict application to use a buffer the same size as * as kernel buffers. */ if (uio->uio_resid != d->bd_bufsize) return (EINVAL); non_block = ((ioflag & O_NONBLOCK) != 0); BPFD_LOCK(d); BPF_PID_REFRESH_CUR(d); if (d->bd_bufmode != BPF_BUFMODE_BUFFER) { BPFD_UNLOCK(d); return (EOPNOTSUPP); } if (d->bd_state == BPF_WAITING) callout_stop(&d->bd_callout); timed_out = (d->bd_state == BPF_TIMED_OUT); d->bd_state = BPF_IDLE; while (d->bd_hbuf_in_use) { error = mtx_sleep(&d->bd_hbuf_in_use, &d->bd_lock, PRINET|PCATCH, "bd_hbuf", 0); if (error != 0) { BPFD_UNLOCK(d); return (error); } } /* * If the hold buffer is empty, then do a timed sleep, which * ends when the timeout expires or when enough packets * have arrived to fill the store buffer. */ while (d->bd_hbuf == NULL) { if (d->bd_slen != 0) { /* * A packet(s) either arrived since the previous * read or arrived while we were asleep. */ if (d->bd_immediate || non_block || timed_out) { /* * Rotate the buffers and return what's here * if we are in immediate mode, non-blocking * flag is set, or this descriptor timed out. */ ROTATE_BUFFERS(d); break; } } /* * No data is available, check to see if the bpf device * is still pointed at a real interface. If not, return * ENXIO so that the userland process knows to rebind * it before using it again. */ if (d->bd_bif == NULL) { BPFD_UNLOCK(d); return (ENXIO); } if (non_block) { BPFD_UNLOCK(d); return (EWOULDBLOCK); } error = msleep(d, &d->bd_lock, PRINET|PCATCH, "bpf", d->bd_rtout); if (error == EINTR || error == ERESTART) { BPFD_UNLOCK(d); return (error); } if (error == EWOULDBLOCK) { /* * On a timeout, return what's in the buffer, * which may be nothing. If there is something * in the store buffer, we can rotate the buffers. */ if (d->bd_hbuf) /* * We filled up the buffer in between * getting the timeout and arriving * here, so we don't need to rotate. */ break; if (d->bd_slen == 0) { BPFD_UNLOCK(d); return (0); } ROTATE_BUFFERS(d); break; } } /* * At this point, we know we have something in the hold slot. 
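	 *
	 * As a rough picture, the three buffers cycle through the roles
	 *
	 *	bd_fbuf (free) -> bd_sbuf (store, being filled) -> bd_hbuf (hold)
	 *
	 * with ROTATE_BUFFERS() advancing the cycle once the hold slot is
	 * empty.
	 */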
*/ d->bd_hbuf_in_use = 1; BPFD_UNLOCK(d); /* * Move data from hold buffer into user space. * We know the entire buffer is transferred since * we checked above that the read buffer is bpf_bufsize bytes. * * We do not have to worry about simultaneous reads because * we waited for sole access to the hold buffer above. */ error = bpf_uiomove(d, d->bd_hbuf, d->bd_hlen, uio); BPFD_LOCK(d); KASSERT(d->bd_hbuf != NULL, ("bpfread: lost bd_hbuf")); d->bd_fbuf = d->bd_hbuf; d->bd_hbuf = NULL; d->bd_hlen = 0; bpf_buf_reclaimed(d); d->bd_hbuf_in_use = 0; wakeup(&d->bd_hbuf_in_use); BPFD_UNLOCK(d); return (error); } /* * If there are processes sleeping on this descriptor, wake them up. */ static __inline void bpf_wakeup(struct bpf_d *d) { BPFD_LOCK_ASSERT(d); if (d->bd_state == BPF_WAITING) { callout_stop(&d->bd_callout); d->bd_state = BPF_IDLE; } wakeup(d); if (d->bd_async && d->bd_sig && d->bd_sigio) pgsigio(&d->bd_sigio, d->bd_sig, 0); selwakeuppri(&d->bd_sel, PRINET); KNOTE_LOCKED(&d->bd_sel.si_note, 0); } static void bpf_timed_out(void *arg) { struct bpf_d *d = (struct bpf_d *)arg; BPFD_LOCK_ASSERT(d); if (callout_pending(&d->bd_callout) || !callout_active(&d->bd_callout)) return; if (d->bd_state == BPF_WAITING) { d->bd_state = BPF_TIMED_OUT; if (d->bd_slen != 0) bpf_wakeup(d); } } static int bpf_ready(struct bpf_d *d) { BPFD_LOCK_ASSERT(d); if (!bpf_canfreebuf(d) && d->bd_hlen != 0) return (1); if ((d->bd_immediate || d->bd_state == BPF_TIMED_OUT) && d->bd_slen != 0) return (1); return (0); } static int bpfwrite(struct cdev *dev, struct uio *uio, int ioflag) { struct bpf_d *d; struct ifnet *ifp; struct mbuf *m, *mc; struct sockaddr dst; struct route ro; int error, hlen; error = devfs_get_cdevpriv((void **)&d); if (error != 0) return (error); BPF_PID_REFRESH_CUR(d); counter_u64_add(d->bd_wcount, 1); /* XXX: locking required */ if (d->bd_bif == NULL) { counter_u64_add(d->bd_wdcount, 1); return (ENXIO); } ifp = d->bd_bif->bif_ifp; if ((ifp->if_flags & IFF_UP) == 0) { counter_u64_add(d->bd_wdcount, 1); return (ENETDOWN); } if (uio->uio_resid == 0) { counter_u64_add(d->bd_wdcount, 1); return (0); } bzero(&dst, sizeof(dst)); m = NULL; hlen = 0; /* XXX: bpf_movein() can sleep */ error = bpf_movein(uio, (int)d->bd_bif->bif_dlt, ifp, &m, &dst, &hlen, d); if (error) { counter_u64_add(d->bd_wdcount, 1); return (error); } counter_u64_add(d->bd_wfcount, 1); if (d->bd_hdrcmplt) dst.sa_family = pseudo_AF_HDRCMPLT; if (d->bd_feedback) { mc = m_dup(m, M_NOWAIT); if (mc != NULL) mc->m_pkthdr.rcvif = ifp; /* Set M_PROMISC for outgoing packets to be discarded. */ if (d->bd_direction == BPF_D_INOUT) m->m_flags |= M_PROMISC; } else mc = NULL; m->m_pkthdr.len -= hlen; m->m_len -= hlen; m->m_data += hlen; /* XXX */ CURVNET_SET(ifp->if_vnet); #ifdef MAC BPFD_LOCK(d); mac_bpfdesc_create_mbuf(d, m); if (mc != NULL) mac_bpfdesc_create_mbuf(d, mc); BPFD_UNLOCK(d); #endif bzero(&ro, sizeof(ro)); if (hlen != 0) { ro.ro_prepend = (u_char *)&dst.sa_data; ro.ro_plen = hlen; ro.ro_flags = RT_HAS_HEADER; } error = (*ifp->if_output)(ifp, m, &dst, &ro); if (error) counter_u64_add(d->bd_wdcount, 1); if (mc != NULL) { if (error == 0) (*ifp->if_input)(ifp, mc); else m_freem(mc); } CURVNET_RESTORE(); return (error); } /* * Reset a descriptor by flushing its packet buffer and clearing the receive * and drop counts. This is doable for kernel-only buffers, but with * zero-copy buffers, we can't write to (or rotate) buffers that are * currently owned by userspace. 
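 * (The bpf_canfreebuf() and bpf_canwritebuf() hooks are what perform that
 * ownership check here.)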
It would be nice if we could encapsulate * this logic in the buffer code rather than here. */ static void reset_d(struct bpf_d *d) { BPFD_LOCK_ASSERT(d); while (d->bd_hbuf_in_use) mtx_sleep(&d->bd_hbuf_in_use, &d->bd_lock, PRINET, "bd_hbuf", 0); if ((d->bd_hbuf != NULL) && (d->bd_bufmode != BPF_BUFMODE_ZBUF || bpf_canfreebuf(d))) { /* Free the hold buffer. */ d->bd_fbuf = d->bd_hbuf; d->bd_hbuf = NULL; d->bd_hlen = 0; bpf_buf_reclaimed(d); } if (bpf_canwritebuf(d)) d->bd_slen = 0; counter_u64_zero(d->bd_rcount); counter_u64_zero(d->bd_dcount); counter_u64_zero(d->bd_fcount); counter_u64_zero(d->bd_wcount); counter_u64_zero(d->bd_wfcount); counter_u64_zero(d->bd_wdcount); counter_u64_zero(d->bd_zcopy); } /* * FIONREAD Check for read packet available. * BIOCGBLEN Get buffer len [for read()]. * BIOCSETF Set read filter. * BIOCSETFNR Set read filter without resetting descriptor. * BIOCSETWF Set write filter. * BIOCFLUSH Flush read packet buffer. * BIOCPROMISC Put interface into promiscuous mode. * BIOCGDLT Get link layer type. * BIOCGETIF Get interface name. * BIOCSETIF Set interface. * BIOCSRTIMEOUT Set read timeout. * BIOCGRTIMEOUT Get read timeout. * BIOCGSTATS Get packet stats. * BIOCIMMEDIATE Set immediate mode. * BIOCVERSION Get filter language version. * BIOCGHDRCMPLT Get "header already complete" flag * BIOCSHDRCMPLT Set "header already complete" flag * BIOCGDIRECTION Get packet direction flag * BIOCSDIRECTION Set packet direction flag * BIOCGTSTAMP Get time stamp format and resolution. * BIOCSTSTAMP Set time stamp format and resolution. * BIOCLOCK Set "locked" flag * BIOCFEEDBACK Set packet feedback mode. * BIOCSETZBUF Set current zero-copy buffer locations. * BIOCGETZMAX Get maximum zero-copy buffer size. * BIOCROTZBUF Force rotation of zero-copy buffer * BIOCSETBUFMODE Set buffer mode. * BIOCGETBUFMODE Get current buffer mode. */ /* ARGSUSED */ static int bpfioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flags, struct thread *td) { struct bpf_d *d; int error; error = devfs_get_cdevpriv((void **)&d); if (error != 0) return (error); /* * Refresh PID associated with this descriptor. */ BPFD_LOCK(d); BPF_PID_REFRESH(d, td); if (d->bd_state == BPF_WAITING) callout_stop(&d->bd_callout); d->bd_state = BPF_IDLE; BPFD_UNLOCK(d); if (d->bd_locked == 1) { switch (cmd) { case BIOCGBLEN: case BIOCFLUSH: case BIOCGDLT: case BIOCGDLTLIST: #ifdef COMPAT_FREEBSD32 case BIOCGDLTLIST32: #endif case BIOCGETIF: case BIOCGRTIMEOUT: #if defined(COMPAT_FREEBSD32) && defined(__amd64__) case BIOCGRTIMEOUT32: #endif case BIOCGSTATS: case BIOCVERSION: case BIOCGRSIG: case BIOCGHDRCMPLT: case BIOCSTSTAMP: case BIOCFEEDBACK: case FIONREAD: case BIOCLOCK: case BIOCSRTIMEOUT: #if defined(COMPAT_FREEBSD32) && defined(__amd64__) case BIOCSRTIMEOUT32: #endif case BIOCIMMEDIATE: case TIOCGPGRP: case BIOCROTZBUF: break; default: return (EPERM); } } #ifdef COMPAT_FREEBSD32 /* * If we see a 32-bit compat ioctl, mark the stream as 32-bit so * that it will get 32-bit packet headers. */ switch (cmd) { case BIOCSETF32: case BIOCSETFNR32: case BIOCSETWF32: case BIOCGDLTLIST32: case BIOCGRTIMEOUT32: case BIOCSRTIMEOUT32: if (SV_PROC_FLAG(td->td_proc, SV_ILP32)) { BPFD_LOCK(d); d->bd_compat32 = 1; BPFD_UNLOCK(d); } } #endif CURVNET_SET(TD_TO_VNET(td)); switch (cmd) { default: error = EINVAL; break; /* * Check for read packet available. 
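	 * (The value reported below is bd_slen, plus bd_hlen once any
	 * in-flight read of the hold buffer has finished.)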
*/ case FIONREAD: { int n; BPFD_LOCK(d); n = d->bd_slen; while (d->bd_hbuf_in_use) mtx_sleep(&d->bd_hbuf_in_use, &d->bd_lock, PRINET, "bd_hbuf", 0); if (d->bd_hbuf) n += d->bd_hlen; BPFD_UNLOCK(d); *(int *)addr = n; break; } /* * Get buffer len [for read()]. */ case BIOCGBLEN: BPFD_LOCK(d); *(u_int *)addr = d->bd_bufsize; BPFD_UNLOCK(d); break; /* * Set buffer length. */ case BIOCSBLEN: error = bpf_ioctl_sblen(d, (u_int *)addr); break; /* * Set link layer read filter. */ case BIOCSETF: case BIOCSETFNR: case BIOCSETWF: #ifdef COMPAT_FREEBSD32 case BIOCSETF32: case BIOCSETFNR32: case BIOCSETWF32: #endif error = bpf_setf(d, (struct bpf_program *)addr, cmd); break; /* * Flush read packet buffer. */ case BIOCFLUSH: BPFD_LOCK(d); reset_d(d); BPFD_UNLOCK(d); break; /* * Put interface into promiscuous mode. */ case BIOCPROMISC: if (d->bd_bif == NULL) { /* * No interface attached yet. */ error = EINVAL; break; } if (d->bd_promisc == 0) { error = ifpromisc(d->bd_bif->bif_ifp, 1); if (error == 0) d->bd_promisc = 1; } break; /* * Get current data link type. */ case BIOCGDLT: BPF_LOCK(); if (d->bd_bif == NULL) error = EINVAL; else *(u_int *)addr = d->bd_bif->bif_dlt; BPF_UNLOCK(); break; /* * Get a list of supported data link types. */ #ifdef COMPAT_FREEBSD32 case BIOCGDLTLIST32: { struct bpf_dltlist32 *list32; struct bpf_dltlist dltlist; list32 = (struct bpf_dltlist32 *)addr; dltlist.bfl_len = list32->bfl_len; dltlist.bfl_list = PTRIN(list32->bfl_list); BPF_LOCK(); if (d->bd_bif == NULL) error = EINVAL; else { error = bpf_getdltlist(d, &dltlist); if (error == 0) list32->bfl_len = dltlist.bfl_len; } BPF_UNLOCK(); break; } #endif case BIOCGDLTLIST: BPF_LOCK(); if (d->bd_bif == NULL) error = EINVAL; else error = bpf_getdltlist(d, (struct bpf_dltlist *)addr); BPF_UNLOCK(); break; /* * Set data link type. */ case BIOCSDLT: BPF_LOCK(); if (d->bd_bif == NULL) error = EINVAL; else error = bpf_setdlt(d, *(u_int *)addr); BPF_UNLOCK(); break; /* * Get interface name. */ case BIOCGETIF: BPF_LOCK(); if (d->bd_bif == NULL) error = EINVAL; else { struct ifnet *const ifp = d->bd_bif->bif_ifp; struct ifreq *const ifr = (struct ifreq *)addr; strlcpy(ifr->ifr_name, ifp->if_xname, sizeof(ifr->ifr_name)); } BPF_UNLOCK(); break; /* * Set interface. */ case BIOCSETIF: { int alloc_buf, size; /* * Behavior here depends on the buffering model. If * we're using kernel memory buffers, then we can * allocate them here. If we're using zero-copy, * then the user process must have registered buffers * by the time we get here. */ alloc_buf = 0; BPFD_LOCK(d); if (d->bd_bufmode == BPF_BUFMODE_BUFFER && d->bd_sbuf == NULL) alloc_buf = 1; BPFD_UNLOCK(d); if (alloc_buf) { size = d->bd_bufsize; error = bpf_buffer_ioctl_sblen(d, &size); if (error != 0) break; } BPF_LOCK(); error = bpf_setif(d, (struct ifreq *)addr); BPF_UNLOCK(); break; } /* * Set read timeout. */ case BIOCSRTIMEOUT: #if defined(COMPAT_FREEBSD32) && defined(__amd64__) case BIOCSRTIMEOUT32: #endif { struct timeval *tv = (struct timeval *)addr; #if defined(COMPAT_FREEBSD32) && !defined(__mips__) struct timeval32 *tv32; struct timeval tv64; if (cmd == BIOCSRTIMEOUT32) { tv32 = (struct timeval32 *)addr; tv = &tv64; tv->tv_sec = tv32->tv_sec; tv->tv_usec = tv32->tv_usec; } else #endif tv = (struct timeval *)addr; /* * Subtract 1 tick from tvtohz() since this isn't * a one-shot timer. */ if ((error = itimerfix(tv)) == 0) d->bd_rtout = tvtohz(tv) - 1; break; } /* * Get read timeout. 
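	 * (bd_rtout is kept in ticks; it is converted back with
	 * tv_sec = bd_rtout / hz and tv_usec = (bd_rtout % hz) * tick.)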
*/ case BIOCGRTIMEOUT: #if defined(COMPAT_FREEBSD32) && defined(__amd64__) case BIOCGRTIMEOUT32: #endif { struct timeval *tv; #if defined(COMPAT_FREEBSD32) && defined(__amd64__) struct timeval32 *tv32; struct timeval tv64; if (cmd == BIOCGRTIMEOUT32) tv = &tv64; else #endif tv = (struct timeval *)addr; tv->tv_sec = d->bd_rtout / hz; tv->tv_usec = (d->bd_rtout % hz) * tick; #if defined(COMPAT_FREEBSD32) && defined(__amd64__) if (cmd == BIOCGRTIMEOUT32) { tv32 = (struct timeval32 *)addr; tv32->tv_sec = tv->tv_sec; tv32->tv_usec = tv->tv_usec; } #endif break; } /* * Get packet stats. */ case BIOCGSTATS: { struct bpf_stat *bs = (struct bpf_stat *)addr; /* XXXCSJP overflow */ bs->bs_recv = (u_int)counter_u64_fetch(d->bd_rcount); bs->bs_drop = (u_int)counter_u64_fetch(d->bd_dcount); break; } /* * Set immediate mode. */ case BIOCIMMEDIATE: BPFD_LOCK(d); d->bd_immediate = *(u_int *)addr; BPFD_UNLOCK(d); break; case BIOCVERSION: { struct bpf_version *bv = (struct bpf_version *)addr; bv->bv_major = BPF_MAJOR_VERSION; bv->bv_minor = BPF_MINOR_VERSION; break; } /* * Get "header already complete" flag */ case BIOCGHDRCMPLT: BPFD_LOCK(d); *(u_int *)addr = d->bd_hdrcmplt; BPFD_UNLOCK(d); break; /* * Set "header already complete" flag */ case BIOCSHDRCMPLT: BPFD_LOCK(d); d->bd_hdrcmplt = *(u_int *)addr ? 1 : 0; BPFD_UNLOCK(d); break; /* * Get packet direction flag */ case BIOCGDIRECTION: BPFD_LOCK(d); *(u_int *)addr = d->bd_direction; BPFD_UNLOCK(d); break; /* * Set packet direction flag */ case BIOCSDIRECTION: { u_int direction; direction = *(u_int *)addr; switch (direction) { case BPF_D_IN: case BPF_D_INOUT: case BPF_D_OUT: BPFD_LOCK(d); d->bd_direction = direction; BPFD_UNLOCK(d); break; default: error = EINVAL; } } break; /* * Get packet timestamp format and resolution. */ case BIOCGTSTAMP: BPFD_LOCK(d); *(u_int *)addr = d->bd_tstamp; BPFD_UNLOCK(d); break; /* * Set packet timestamp format and resolution. */ case BIOCSTSTAMP: { u_int func; func = *(u_int *)addr; if (BPF_T_VALID(func)) d->bd_tstamp = func; else error = EINVAL; } break; case BIOCFEEDBACK: BPFD_LOCK(d); d->bd_feedback = *(u_int *)addr; BPFD_UNLOCK(d); break; case BIOCLOCK: BPFD_LOCK(d); d->bd_locked = 1; BPFD_UNLOCK(d); break; case FIONBIO: /* Non-blocking I/O */ break; case FIOASYNC: /* Send signal on receive packets */ BPFD_LOCK(d); d->bd_async = *(int *)addr; BPFD_UNLOCK(d); break; case FIOSETOWN: /* * XXX: Add some sort of locking here? * fsetown() can sleep. */ error = fsetown(*(int *)addr, &d->bd_sigio); break; case FIOGETOWN: BPFD_LOCK(d); *(int *)addr = fgetown(&d->bd_sigio); BPFD_UNLOCK(d); break; /* This is deprecated, FIOSETOWN should be used instead. */ case TIOCSPGRP: error = fsetown(-(*(int *)addr), &d->bd_sigio); break; /* This is deprecated, FIOGETOWN should be used instead. */ case TIOCGPGRP: *(int *)addr = -fgetown(&d->bd_sigio); break; case BIOCSRSIG: /* Set receive signal */ { u_int sig; sig = *(u_int *)addr; if (sig >= NSIG) error = EINVAL; else { BPFD_LOCK(d); d->bd_sig = sig; BPFD_UNLOCK(d); } break; } case BIOCGRSIG: BPFD_LOCK(d); *(u_int *)addr = d->bd_sig; BPFD_UNLOCK(d); break; case BIOCGETBUFMODE: BPFD_LOCK(d); *(u_int *)addr = d->bd_bufmode; BPFD_UNLOCK(d); break; case BIOCSETBUFMODE: /* * Allow the buffering mode to be changed as long as we * haven't yet committed to a particular mode. Our * definition of commitment, for now, is whether or not a * buffer has been allocated or an interface attached, since * that's the point where things get tricky. 
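		 * A typical zero-copy consumer would therefore issue, in
		 * order: BIOCSETBUFMODE(BPF_BUFMODE_ZBUF), then BIOCSETZBUF,
		 * then BIOCSETIF.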
*/ switch (*(u_int *)addr) { case BPF_BUFMODE_BUFFER: break; case BPF_BUFMODE_ZBUF: if (bpf_zerocopy_enable) break; /* FALLSTHROUGH */ default: CURVNET_RESTORE(); return (EINVAL); } BPFD_LOCK(d); if (d->bd_sbuf != NULL || d->bd_hbuf != NULL || d->bd_fbuf != NULL || d->bd_bif != NULL) { BPFD_UNLOCK(d); CURVNET_RESTORE(); return (EBUSY); } d->bd_bufmode = *(u_int *)addr; BPFD_UNLOCK(d); break; case BIOCGETZMAX: error = bpf_ioctl_getzmax(td, d, (size_t *)addr); break; case BIOCSETZBUF: error = bpf_ioctl_setzbuf(td, d, (struct bpf_zbuf *)addr); break; case BIOCROTZBUF: error = bpf_ioctl_rotzbuf(td, d, (struct bpf_zbuf *)addr); break; } CURVNET_RESTORE(); return (error); } /* * Set d's packet filter program to fp. If this file already has a filter, * free it and replace it. Returns EINVAL for bogus requests. * * Note we need global lock here to serialize bpf_setf() and bpf_setif() calls * since reading d->bd_bif can't be protected by d or interface lock due to * lock order. * * Additionally, we have to acquire interface write lock due to bpf_mtap() uses * interface read lock to read all filers. * */ static int bpf_setf(struct bpf_d *d, struct bpf_program *fp, u_long cmd) { #ifdef COMPAT_FREEBSD32 struct bpf_program fp_swab; struct bpf_program32 *fp32; #endif struct bpf_insn *fcode, *old; #ifdef BPF_JITTER bpf_jit_filter *jfunc, *ofunc; #endif size_t size; u_int flen; int need_upgrade; #ifdef COMPAT_FREEBSD32 switch (cmd) { case BIOCSETF32: case BIOCSETWF32: case BIOCSETFNR32: fp32 = (struct bpf_program32 *)fp; fp_swab.bf_len = fp32->bf_len; fp_swab.bf_insns = (struct bpf_insn *)(uintptr_t)fp32->bf_insns; fp = &fp_swab; switch (cmd) { case BIOCSETF32: cmd = BIOCSETF; break; case BIOCSETWF32: cmd = BIOCSETWF; break; } break; } #endif fcode = NULL; #ifdef BPF_JITTER jfunc = ofunc = NULL; #endif need_upgrade = 0; /* * Check new filter validness before acquiring any locks. * Allocate memory for new filter, if needed. */ flen = fp->bf_len; if (flen > bpf_maxinsns || (fp->bf_insns == NULL && flen != 0)) return (EINVAL); size = flen * sizeof(*fp->bf_insns); if (size > 0) { /* We're setting up new filter. Copy and check actual data. */ fcode = malloc(size, M_BPF, M_WAITOK); if (copyin(fp->bf_insns, fcode, size) != 0 || !bpf_validate(fcode, flen)) { free(fcode, M_BPF); return (EINVAL); } #ifdef BPF_JITTER /* Filter is copied inside fcode and is perfectly valid. */ jfunc = bpf_jitter(fcode, flen); #endif } BPF_LOCK(); /* * Set up new filter. * Protect filter change by interface lock. * Additionally, we are protected by global lock here. */ if (d->bd_bif != NULL) BPFIF_WLOCK(d->bd_bif); BPFD_LOCK(d); if (cmd == BIOCSETWF) { old = d->bd_wfilter; d->bd_wfilter = fcode; } else { old = d->bd_rfilter; d->bd_rfilter = fcode; #ifdef BPF_JITTER ofunc = d->bd_bfilter; d->bd_bfilter = jfunc; #endif if (cmd == BIOCSETF) reset_d(d); need_upgrade = bpf_check_upgrade(cmd, d, fcode, flen); } BPFD_UNLOCK(d); if (d->bd_bif != NULL) BPFIF_WUNLOCK(d->bd_bif); if (old != NULL) free(old, M_BPF); #ifdef BPF_JITTER if (ofunc != NULL) bpf_destroy_jit_filter(ofunc); #endif /* Move d to active readers list. */ if (need_upgrade != 0) bpf_upgraded(d); BPF_UNLOCK(); return (0); } /* * Detach a file from its current interface (if attached at all) and attach * to the interface indicated by the name stored in ifr. * Return an errno or 0. 
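 * (The BPFIF_FLAG_DYING check below refuses an interface whose BPF
 * attachment is already being torn down.)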
*/ static int bpf_setif(struct bpf_d *d, struct ifreq *ifr) { struct bpf_if *bp; struct ifnet *theywant; BPF_LOCK_ASSERT(); theywant = ifunit(ifr->ifr_name); if (theywant == NULL || theywant->if_bpf == NULL) return (ENXIO); bp = theywant->if_bpf; /* Check if interface is not being detached from BPF */ BPFIF_RLOCK(bp); if (bp->bif_flags & BPFIF_FLAG_DYING) { BPFIF_RUNLOCK(bp); return (ENXIO); } BPFIF_RUNLOCK(bp); /* * At this point, we expect the buffer is already allocated. If not, * return an error. */ switch (d->bd_bufmode) { case BPF_BUFMODE_BUFFER: case BPF_BUFMODE_ZBUF: if (d->bd_sbuf == NULL) return (EINVAL); break; default: panic("bpf_setif: bufmode %d", d->bd_bufmode); } if (bp != d->bd_bif) bpf_attachd(d, bp); BPFD_LOCK(d); reset_d(d); BPFD_UNLOCK(d); return (0); } /* * Support for select() and poll() system calls * * Return true iff the specific operation will not block indefinitely. * Otherwise, return false but make a note that a selwakeup() must be done. */ static int bpfpoll(struct cdev *dev, int events, struct thread *td) { struct bpf_d *d; int revents; if (devfs_get_cdevpriv((void **)&d) != 0 || d->bd_bif == NULL) return (events & (POLLHUP|POLLIN|POLLRDNORM|POLLOUT|POLLWRNORM)); /* * Refresh PID associated with this descriptor. */ revents = events & (POLLOUT | POLLWRNORM); BPFD_LOCK(d); BPF_PID_REFRESH(d, td); if (events & (POLLIN | POLLRDNORM)) { if (bpf_ready(d)) revents |= events & (POLLIN | POLLRDNORM); else { selrecord(td, &d->bd_sel); /* Start the read timeout if necessary. */ if (d->bd_rtout > 0 && d->bd_state == BPF_IDLE) { callout_reset(&d->bd_callout, d->bd_rtout, bpf_timed_out, d); d->bd_state = BPF_WAITING; } } } BPFD_UNLOCK(d); return (revents); } /* * Support for kevent() system call. Register EVFILT_READ filters and * reject all others. */ int bpfkqfilter(struct cdev *dev, struct knote *kn) { struct bpf_d *d; if (devfs_get_cdevpriv((void **)&d) != 0 || kn->kn_filter != EVFILT_READ) return (1); /* * Refresh PID associated with this descriptor. */ BPFD_LOCK(d); BPF_PID_REFRESH_CUR(d); kn->kn_fop = &bpfread_filtops; kn->kn_hook = d; knlist_add(&d->bd_sel.si_note, kn, 1); BPFD_UNLOCK(d); return (0); } static void filt_bpfdetach(struct knote *kn) { struct bpf_d *d = (struct bpf_d *)kn->kn_hook; knlist_remove(&d->bd_sel.si_note, kn, 0); } static int filt_bpfread(struct knote *kn, long hint) { struct bpf_d *d = (struct bpf_d *)kn->kn_hook; int ready; BPFD_LOCK_ASSERT(d); ready = bpf_ready(d); if (ready) { kn->kn_data = d->bd_slen; /* * Ignore the hold buffer if it is being copied to user space. 
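The read-timeout machinery visible in bpfpoll() (bd_rtout, bd_callout, the BPF_WAITING state) is driven from userland through BIOCSRTIMEOUT, the setter counterpart of the BIOCGRTIMEOUT conversion shown earlier. A minimal sketch under those assumptions (hypothetical helper; error handling elided; a timed-out read may legitimately return 0 bytes):

    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/ioctl.h>
    #include <net/bpf.h>
    #include <unistd.h>

    /* Bound a blocking read: return when the store buffer fills or after
     * roughly two seconds, whichever comes first. */
    static ssize_t
    bounded_read(int fd, void *buf, size_t buflen)
    {
            struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };

            /* Converted to ticks and stored in bd_rtout by the kernel. */
            if (ioctl(fd, BIOCSRTIMEOUT, &tv) < 0)
                    return (-1);
            return (read(fd, buf, buflen));  /* buflen must match BIOCGBLEN */
    }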
*/ if (!d->bd_hbuf_in_use && d->bd_hbuf) kn->kn_data += d->bd_hlen; } else if (d->bd_rtout > 0 && d->bd_state == BPF_IDLE) { callout_reset(&d->bd_callout, d->bd_rtout, bpf_timed_out, d); d->bd_state = BPF_WAITING; } return (ready); } #define BPF_TSTAMP_NONE 0 #define BPF_TSTAMP_FAST 1 #define BPF_TSTAMP_NORMAL 2 #define BPF_TSTAMP_EXTERN 3 static int bpf_ts_quality(int tstype) { if (tstype == BPF_T_NONE) return (BPF_TSTAMP_NONE); if ((tstype & BPF_T_FAST) != 0) return (BPF_TSTAMP_FAST); return (BPF_TSTAMP_NORMAL); } static int bpf_gettime(struct bintime *bt, int tstype, struct mbuf *m) { struct m_tag *tag; int quality; quality = bpf_ts_quality(tstype); if (quality == BPF_TSTAMP_NONE) return (quality); if (m != NULL) { tag = m_tag_locate(m, MTAG_BPF, MTAG_BPF_TIMESTAMP, NULL); if (tag != NULL) { *bt = *(struct bintime *)(tag + 1); return (BPF_TSTAMP_EXTERN); } } if (quality == BPF_TSTAMP_NORMAL) binuptime(bt); else getbinuptime(bt); return (quality); } /* * Incoming linkage from device drivers. Process the packet pkt, of length * pktlen, which is stored in a contiguous buffer. The packet is parsed * by each process' filter, and if accepted, stashed into the corresponding * buffer. */ void bpf_tap(struct bpf_if *bp, u_char *pkt, u_int pktlen) { struct bintime bt; struct bpf_d *d; #ifdef BPF_JITTER bpf_jit_filter *bf; #endif u_int slen; int gottime; gottime = BPF_TSTAMP_NONE; BPFIF_RLOCK(bp); LIST_FOREACH(d, &bp->bif_dlist, bd_next) { /* * We are not using any locks for d here because: * 1) any filter change is protected by interface * write lock * 2) destroying/detaching d is protected by interface * write lock, too */ counter_u64_add(d->bd_rcount, 1); /* * NB: We don't call BPF_CHECK_DIRECTION() here since there is no * way for the caller to indicate to us whether this packet * is inbound or outbound. In the bpf_mtap() routines, we use * the interface pointers on the mbuf to figure it out. */ #ifdef BPF_JITTER bf = bpf_jitter_enable != 0 ? d->bd_bfilter : NULL; if (bf != NULL) slen = (*(bf->func))(pkt, pktlen, pktlen); else #endif slen = bpf_filter(d->bd_rfilter, pkt, pktlen, pktlen); if (slen != 0) { /* * Filter matches. Let's acquire the write lock. */ BPFD_LOCK(d); counter_u64_add(d->bd_fcount, 1); if (gottime < bpf_ts_quality(d->bd_tstamp)) gottime = bpf_gettime(&bt, d->bd_tstamp, NULL); #ifdef MAC if (mac_bpfdesc_check_receive(d, bp->bif_ifp) == 0) #endif catchpacket(d, pkt, pktlen, slen, bpf_append_bytes, &bt); BPFD_UNLOCK(d); } } BPFIF_RUNLOCK(bp); } #define BPF_CHECK_DIRECTION(d, r, i) \ (((d)->bd_direction == BPF_D_IN && (r) != (i)) || \ ((d)->bd_direction == BPF_D_OUT && (r) == (i))) /* * Incoming linkage from device drivers, when packet is in an mbuf chain. * Locking model is explained in bpf_tap(). */ void bpf_mtap(struct bpf_if *bp, struct mbuf *m) { struct bintime bt; struct bpf_d *d; #ifdef BPF_JITTER bpf_jit_filter *bf; #endif u_int pktlen, slen; int gottime; /* Skip outgoing duplicate packets. */ if ((m->m_flags & M_PROMISC) != 0 && m->m_pkthdr.rcvif == NULL) { m->m_flags &= ~M_PROMISC; return; } pktlen = m_length(m, NULL); gottime = BPF_TSTAMP_NONE; BPFIF_RLOCK(bp); LIST_FOREACH(d, &bp->bif_dlist, bd_next) { if (BPF_CHECK_DIRECTION(d, m->m_pkthdr.rcvif, bp->bif_ifp)) continue; counter_u64_add(d->bd_rcount, 1); #ifdef BPF_JITTER bf = bpf_jitter_enable != 0 ? d->bd_bfilter : NULL; /* XXX We cannot handle multiple mbufs.
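On the producer side, a driver's input path is expected to call these entry points only when someone is listening; bpf_peers_present() is the standard cheap check, and Ethernet drivers usually go through the ETHER_BPF_MTAP() wrapper instead. A hypothetical driver receive path:

    /* Tap the received frame for BPF listeners before handing it up. */
    if (bpf_peers_present(ifp->if_bpf))
            bpf_mtap(ifp->if_bpf, m);  /* or bpf_tap() for a contiguous buffer */
    (*ifp->if_input)(ifp, m);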
*/ if (bf != NULL && m->m_next == NULL) slen = (*(bf->func))(mtod(m, u_char *), pktlen, pktlen); else #endif slen = bpf_filter(d->bd_rfilter, (u_char *)m, pktlen, 0); if (slen != 0) { BPFD_LOCK(d); counter_u64_add(d->bd_fcount, 1); if (gottime < bpf_ts_quality(d->bd_tstamp)) gottime = bpf_gettime(&bt, d->bd_tstamp, m); #ifdef MAC if (mac_bpfdesc_check_receive(d, bp->bif_ifp) == 0) #endif catchpacket(d, (u_char *)m, pktlen, slen, bpf_append_mbuf, &bt); BPFD_UNLOCK(d); } } BPFIF_RUNLOCK(bp); } /* * Incoming linkage from device drivers, when packet is in * an mbuf chain and to be prepended by a contiguous header. */ void bpf_mtap2(struct bpf_if *bp, void *data, u_int dlen, struct mbuf *m) { struct bintime bt; struct mbuf mb; struct bpf_d *d; u_int pktlen, slen; int gottime; /* Skip outgoing duplicate packets. */ if ((m->m_flags & M_PROMISC) != 0 && m->m_pkthdr.rcvif == NULL) { m->m_flags &= ~M_PROMISC; return; } pktlen = m_length(m, NULL); /* * Craft on-stack mbuf suitable for passing to bpf_filter. * Note that we cut corners here; we only setup what's * absolutely needed--this mbuf should never go anywhere else. */ mb.m_next = m; mb.m_data = data; mb.m_len = dlen; pktlen += dlen; gottime = BPF_TSTAMP_NONE; BPFIF_RLOCK(bp); LIST_FOREACH(d, &bp->bif_dlist, bd_next) { if (BPF_CHECK_DIRECTION(d, m->m_pkthdr.rcvif, bp->bif_ifp)) continue; counter_u64_add(d->bd_rcount, 1); slen = bpf_filter(d->bd_rfilter, (u_char *)&mb, pktlen, 0); if (slen != 0) { BPFD_LOCK(d); counter_u64_add(d->bd_fcount, 1); if (gottime < bpf_ts_quality(d->bd_tstamp)) gottime = bpf_gettime(&bt, d->bd_tstamp, m); #ifdef MAC if (mac_bpfdesc_check_receive(d, bp->bif_ifp) == 0) #endif catchpacket(d, (u_char *)&mb, pktlen, slen, bpf_append_mbuf, &bt); BPFD_UNLOCK(d); } } BPFIF_RUNLOCK(bp); } #undef BPF_CHECK_DIRECTION #undef BPF_TSTAMP_NONE #undef BPF_TSTAMP_FAST #undef BPF_TSTAMP_NORMAL #undef BPF_TSTAMP_EXTERN static int bpf_hdrlen(struct bpf_d *d) { int hdrlen; hdrlen = d->bd_bif->bif_hdrlen; #ifndef BURN_BRIDGES if (d->bd_tstamp == BPF_T_NONE || BPF_T_FORMAT(d->bd_tstamp) == BPF_T_MICROTIME) #ifdef COMPAT_FREEBSD32 if (d->bd_compat32) hdrlen += SIZEOF_BPF_HDR(struct bpf_hdr32); else #endif hdrlen += SIZEOF_BPF_HDR(struct bpf_hdr); else #endif hdrlen += SIZEOF_BPF_HDR(struct bpf_xhdr); #ifdef COMPAT_FREEBSD32 if (d->bd_compat32) hdrlen = BPF_WORDALIGN32(hdrlen); else #endif hdrlen = BPF_WORDALIGN(hdrlen); return (hdrlen - d->bd_bif->bif_hdrlen); } static void bpf_bintime2ts(struct bintime *bt, struct bpf_ts *ts, int tstype) { struct bintime bt2, boottimebin; struct timeval tsm; struct timespec tsn; if ((tstype & BPF_T_MONOTONIC) == 0) { bt2 = *bt; getboottimebin(&boottimebin); bintime_add(&bt2, &boottimebin); bt = &bt2; } switch (BPF_T_FORMAT(tstype)) { case BPF_T_MICROTIME: bintime2timeval(bt, &tsm); ts->bt_sec = tsm.tv_sec; ts->bt_frac = tsm.tv_usec; break; case BPF_T_NANOTIME: bintime2timespec(bt, &tsn); ts->bt_sec = tsn.tv_sec; ts->bt_frac = tsn.tv_nsec; break; case BPF_T_BINTIME: ts->bt_sec = bt->sec; ts->bt_frac = bt->frac; break; } } /* * Move the packet data from interface memory (pkt) into the * store buffer. "cpfn" is the routine called to do the actual data * transfer. bcopy is passed in to copy contiguous chunks, while * bpf_append_mbuf is passed in to copy mbuf chains. In the latter case, * pkt is really an mbuf. 
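The on-stack mbuf crafted above is what lets header-less tunnel interfaces present a synthetic link header to listeners without modifying the real chain. A typical caller, sketched after the DLT_NULL convention used by tunnel-style drivers:

    /* Prepend a 4-byte address-family pseudo-header for DLT_NULL listeners. */
    uint32_t af = AF_INET;

    if (bpf_peers_present(ifp->if_bpf))
            bpf_mtap2(ifp->if_bpf, &af, sizeof(af), m);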
*/ static void catchpacket(struct bpf_d *d, u_char *pkt, u_int pktlen, u_int snaplen, void (*cpfn)(struct bpf_d *, caddr_t, u_int, void *, u_int), struct bintime *bt) { struct bpf_xhdr hdr; #ifndef BURN_BRIDGES struct bpf_hdr hdr_old; #ifdef COMPAT_FREEBSD32 struct bpf_hdr32 hdr32_old; #endif #endif int caplen, curlen, hdrlen, totlen; int do_wakeup = 0; int do_timestamp; int tstype; BPFD_LOCK_ASSERT(d); /* * Detect whether user space has released a buffer back to us, and if * so, move it from being a hold buffer to a free buffer. This may * not be the best place to do it (for example, we might only want to * run this check if we need the space), but for now it's a reliable * spot to do it. */ if (d->bd_fbuf == NULL && bpf_canfreebuf(d)) { d->bd_fbuf = d->bd_hbuf; d->bd_hbuf = NULL; d->bd_hlen = 0; bpf_buf_reclaimed(d); } /* * Figure out how many bytes to move. If the packet is * greater than or equal to the snapshot length, transfer that * much. Otherwise, transfer the whole packet (unless * we hit the buffer size limit). */ hdrlen = bpf_hdrlen(d); totlen = hdrlen + min(snaplen, pktlen); if (totlen > d->bd_bufsize) totlen = d->bd_bufsize; /* * Round up the end of the previous packet to the next longword. * * Drop the packet if there's no room and no hope of room. * If the packet would overflow the storage buffer or the storage * buffer is considered immutable by the buffer model, try to rotate * the buffer and wake up pending processes. */ #ifdef COMPAT_FREEBSD32 if (d->bd_compat32) curlen = BPF_WORDALIGN32(d->bd_slen); else #endif curlen = BPF_WORDALIGN(d->bd_slen); if (curlen + totlen > d->bd_bufsize || !bpf_canwritebuf(d)) { if (d->bd_fbuf == NULL) { /* * There's no room in the store buffer, and no * prospect of room, so drop the packet. Notify the * buffer model. */ bpf_buffull(d); counter_u64_add(d->bd_dcount, 1); return; } KASSERT(!d->bd_hbuf_in_use, ("hold buffer is in use")); ROTATE_BUFFERS(d); do_wakeup = 1; curlen = 0; } else if (d->bd_immediate || d->bd_state == BPF_TIMED_OUT) /* * Immediate mode is set, or the read timeout has already * expired during a select call. A packet arrived, so the * reader should be woken up. */ do_wakeup = 1; caplen = totlen - hdrlen; tstype = d->bd_tstamp; do_timestamp = tstype != BPF_T_NONE; #ifndef BURN_BRIDGES if (tstype == BPF_T_NONE || BPF_T_FORMAT(tstype) == BPF_T_MICROTIME) { struct bpf_ts ts; if (do_timestamp) bpf_bintime2ts(bt, &ts, tstype); #ifdef COMPAT_FREEBSD32 if (d->bd_compat32) { bzero(&hdr32_old, sizeof(hdr32_old)); if (do_timestamp) { hdr32_old.bh_tstamp.tv_sec = ts.bt_sec; hdr32_old.bh_tstamp.tv_usec = ts.bt_frac; } hdr32_old.bh_datalen = pktlen; hdr32_old.bh_hdrlen = hdrlen; hdr32_old.bh_caplen = caplen; bpf_append_bytes(d, d->bd_sbuf, curlen, &hdr32_old, sizeof(hdr32_old)); goto copy; } #endif bzero(&hdr_old, sizeof(hdr_old)); if (do_timestamp) { hdr_old.bh_tstamp.tv_sec = ts.bt_sec; hdr_old.bh_tstamp.tv_usec = ts.bt_frac; } hdr_old.bh_datalen = pktlen; hdr_old.bh_hdrlen = hdrlen; hdr_old.bh_caplen = caplen; bpf_append_bytes(d, d->bd_sbuf, curlen, &hdr_old, sizeof(hdr_old)); goto copy; } #endif /* * Append the bpf header. Note we append the actual header size, but * move forward the length of the header plus padding. */ bzero(&hdr, sizeof(hdr)); if (do_timestamp) bpf_bintime2ts(bt, &hdr.bh_tstamp, tstype); hdr.bh_datalen = pktlen; hdr.bh_hdrlen = hdrlen; hdr.bh_caplen = caplen; bpf_append_bytes(d, d->bd_sbuf, curlen, &hdr, sizeof(hdr)); /* * Copy the packet data into the store buffer and update its length.
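To make the sizing arithmetic concrete: with a snap length of 96 and a 1514-byte frame, totlen = hdrlen + min(96, 1514) = hdrlen + 96, clamped to bd_bufsize. The record is laid down at curlen = BPF_WORDALIGN(bd_slen); when curlen + totlen would overrun bd_bufsize, the buffers rotate if a free buffer exists, otherwise bd_dcount is bumped and the packet is dropped.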
*/ #ifndef BURN_BRIDGES copy: #endif (*cpfn)(d, d->bd_sbuf, curlen + hdrlen, pkt, caplen); d->bd_slen = curlen + totlen; if (do_wakeup) bpf_wakeup(d); } /* * Free buffers currently in use by a descriptor. * Called on close. */ static void bpf_freed(struct bpf_d *d) { /* * We don't need to lock out interrupts since this descriptor has * been detached from its interface and it has not yet been marked * free. */ bpf_free(d); if (d->bd_rfilter != NULL) { free((caddr_t)d->bd_rfilter, M_BPF); #ifdef BPF_JITTER if (d->bd_bfilter != NULL) bpf_destroy_jit_filter(d->bd_bfilter); #endif } if (d->bd_wfilter != NULL) free((caddr_t)d->bd_wfilter, M_BPF); mtx_destroy(&d->bd_lock); counter_u64_free(d->bd_rcount); counter_u64_free(d->bd_dcount); counter_u64_free(d->bd_fcount); counter_u64_free(d->bd_wcount); counter_u64_free(d->bd_wfcount); counter_u64_free(d->bd_wdcount); counter_u64_free(d->bd_zcopy); } /* * Attach an interface to bpf. dlt is the link layer type; hdrlen is the * fixed size of the link header (variable length headers not yet supported). */ void bpfattach(struct ifnet *ifp, u_int dlt, u_int hdrlen) { bpfattach2(ifp, dlt, hdrlen, &ifp->if_bpf); } /* * Attach an interface to bpf. ifp is a pointer to the structure * defining the interface to be attached, dlt is the link layer type, * and hdrlen is the fixed size of the link header (variable length * headers are not yet supported). */ void bpfattach2(struct ifnet *ifp, u_int dlt, u_int hdrlen, struct bpf_if **driverp) { struct bpf_if *bp; bp = malloc(sizeof(*bp), M_BPF, M_NOWAIT | M_ZERO); if (bp == NULL) panic("bpfattach"); LIST_INIT(&bp->bif_dlist); LIST_INIT(&bp->bif_wlist); bp->bif_ifp = ifp; bp->bif_dlt = dlt; rw_init(&bp->bif_lock, "bpf interface lock"); KASSERT(*driverp == NULL, ("bpfattach2: driverp already initialized")); bp->bif_bpf = driverp; *driverp = bp; BPF_LOCK(); LIST_INSERT_HEAD(&bpf_iflist, bp, bif_next); BPF_UNLOCK(); bp->bif_hdrlen = hdrlen; if (bootverbose && IS_DEFAULT_VNET(curvnet)) if_printf(ifp, "bpf attached\n"); } #ifdef VIMAGE /* * When moving interfaces between vnet instances we need a way to * query the dlt and hdrlen before detach so we can re-attach the if_bpf * after the vmove. We unfortunately have no device driver infrastructure * to query the interface for these values after creation/attach, thus * add this as a workaround. */ int bpf_get_bp_params(struct bpf_if *bp, u_int *bif_dlt, u_int *bif_hdrlen) { if (bp == NULL) return (ENXIO); if (bif_dlt == NULL && bif_hdrlen == NULL) return (0); if (bif_dlt != NULL) *bif_dlt = bp->bif_dlt; if (bif_hdrlen != NULL) *bif_hdrlen = bp->bif_hdrlen; return (0); } #endif /* * Detach bpf from an interface. This involves detaching each descriptor * associated with the interface. Notify each descriptor as it's detached * so that any sleepers wake up and get ENXIO. */ void bpfdetach(struct ifnet *ifp) { struct bpf_if *bp, *bp_temp; struct bpf_d *d; int ndetached; ndetached = 0; BPF_LOCK(); /* Find all bpf_if structs that reference ifp and detach them. */ LIST_FOREACH_SAFE(bp, &bpf_iflist, bif_next, bp_temp) { if (ifp != bp->bif_ifp) continue; LIST_REMOVE(bp, bif_next); /* Add to to-be-freed list */ LIST_INSERT_HEAD(&bpf_freelist, bp, bif_next); ndetached++; /* * Delay freeing bp until the interface is detached * and all routes through this interface are removed. * Mark bp as detached to restrict new consumers.
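For reference, the producer-side attach call is made once per interface after the ifnet is set up. Ethernet drivers get it implicitly from ether_ifattach(); pseudo interfaces call it directly, for example (hypothetical attach routine for a tunnel-style ifnet with a 4-byte address-family pseudo-header):

    bpfattach(ifp, DLT_NULL, sizeof(uint32_t));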
*/ BPFIF_WLOCK(bp); bp->bif_flags |= BPFIF_FLAG_DYING; *bp->bif_bpf = NULL; BPFIF_WUNLOCK(bp); CTR4(KTR_NET, "%s: scheduling free for encap %d (%p) for if %p", __func__, bp->bif_dlt, bp, ifp); /* Free common descriptors */ while ((d = LIST_FIRST(&bp->bif_dlist)) != NULL) { bpf_detachd_locked(d); BPFD_LOCK(d); bpf_wakeup(d); BPFD_UNLOCK(d); } /* Free writer-only descriptors */ while ((d = LIST_FIRST(&bp->bif_wlist)) != NULL) { bpf_detachd_locked(d); BPFD_LOCK(d); bpf_wakeup(d); BPFD_UNLOCK(d); } } BPF_UNLOCK(); #ifdef INVARIANTS if (ndetached == 0) printf("bpfdetach: %s was not attached\n", ifp->if_xname); #endif } /* * Interface departure handler. * Note that a departure event does not guarantee that the interface is going down. * Interface renaming is currently done via departure/arrival event set. * * The departure handler is called after all routes pointing to * the given interface are removed and the interface is in the down state, * preventing any packets from being sent or received. We assume it is now safe * to free data allocated by BPF. */ static void bpf_ifdetach(void *arg __unused, struct ifnet *ifp) { struct bpf_if *bp, *bp_temp; int nmatched = 0; /* Ignore ifnet renaming. */ if (ifp->if_flags & IFF_RENAMING) return; BPF_LOCK(); /* * Find matching entries in free list. * Nothing should be found if bpfdetach() was not called. */ LIST_FOREACH_SAFE(bp, &bpf_freelist, bif_next, bp_temp) { if (ifp != bp->bif_ifp) continue; CTR3(KTR_NET, "%s: freeing BPF instance %p for interface %p", __func__, bp, ifp); LIST_REMOVE(bp, bif_next); rw_destroy(&bp->bif_lock); free(bp, M_BPF); nmatched++; } BPF_UNLOCK(); } /* * Get a list of the available data link types of the interface. */ static int bpf_getdltlist(struct bpf_d *d, struct bpf_dltlist *bfl) { struct ifnet *ifp; struct bpf_if *bp; u_int *lst; int error, n, n1; BPF_LOCK_ASSERT(); ifp = d->bd_bif->bif_ifp; again: n1 = 0; LIST_FOREACH(bp, &bpf_iflist, bif_next) { if (bp->bif_ifp == ifp) n1++; } if (bfl->bfl_list == NULL) { bfl->bfl_len = n1; return (0); } if (n1 > bfl->bfl_len) return (ENOMEM); BPF_UNLOCK(); lst = malloc(n1 * sizeof(u_int), M_TEMP, M_WAITOK); n = 0; BPF_LOCK(); LIST_FOREACH(bp, &bpf_iflist, bif_next) { if (bp->bif_ifp != ifp) continue; if (n >= n1) { free(lst, M_TEMP); goto again; } lst[n] = bp->bif_dlt; n++; } BPF_UNLOCK(); error = copyout(lst, bfl->bfl_list, sizeof(u_int) * n); free(lst, M_TEMP); BPF_LOCK(); bfl->bfl_len = n; return (error); } /* * Set the data link type of a BPF instance. */ static int bpf_setdlt(struct bpf_d *d, u_int dlt) { int error, opromisc; struct ifnet *ifp; struct bpf_if *bp; BPF_LOCK_ASSERT(); if (d->bd_bif->bif_dlt == dlt) return (0); ifp = d->bd_bif->bif_ifp; LIST_FOREACH(bp, &bpf_iflist, bif_next) { if (bp->bif_ifp == ifp && bp->bif_dlt == dlt) break; } if (bp != NULL) { opromisc = d->bd_promisc; bpf_attachd(d, bp); BPFD_LOCK(d); reset_d(d); BPFD_UNLOCK(d); if (opromisc) { error = ifpromisc(bp->bif_ifp, 1); if (error) if_printf(bp->bif_ifp, "bpf_setdlt: ifpromisc failed (%d)\n", error); else d->bd_promisc = 1; } } return (bp == NULL ?
EINVAL : 0); } static void bpf_drvinit(void *unused) { struct cdev *dev; - mtx_init(&bpf_mtx, "bpf global lock", NULL, MTX_DEF); + sx_init(&bpf_sx, "bpf global lock"); LIST_INIT(&bpf_iflist); LIST_INIT(&bpf_freelist); dev = make_dev(&bpf_cdevsw, 0, UID_ROOT, GID_WHEEL, 0600, "bpf"); /* For compatibility */ make_dev_alias(dev, "bpf0"); /* Register interface departure handler */ bpf_ifdetach_cookie = EVENTHANDLER_REGISTER( ifnet_departure_event, bpf_ifdetach, NULL, EVENTHANDLER_PRI_ANY); } /* * Zero out the various packet counters associated with all of the bpf * descriptors. At some point, we will probably want to get a bit more * granular and allow the user to specify descriptors to be zeroed. */ static void bpf_zero_counters(void) { struct bpf_if *bp; struct bpf_d *bd; BPF_LOCK(); LIST_FOREACH(bp, &bpf_iflist, bif_next) { BPFIF_RLOCK(bp); LIST_FOREACH(bd, &bp->bif_dlist, bd_next) { BPFD_LOCK(bd); counter_u64_zero(bd->bd_rcount); counter_u64_zero(bd->bd_dcount); counter_u64_zero(bd->bd_fcount); counter_u64_zero(bd->bd_wcount); counter_u64_zero(bd->bd_wfcount); counter_u64_zero(bd->bd_zcopy); BPFD_UNLOCK(bd); } BPFIF_RUNLOCK(bp); } BPF_UNLOCK(); } /* * Fill filter statistics */ static void bpfstats_fill_xbpf(struct xbpf_d *d, struct bpf_d *bd) { bzero(d, sizeof(*d)); BPFD_LOCK_ASSERT(bd); d->bd_structsize = sizeof(*d); /* XXX: reading should be protected by global lock */ d->bd_immediate = bd->bd_immediate; d->bd_promisc = bd->bd_promisc; d->bd_hdrcmplt = bd->bd_hdrcmplt; d->bd_direction = bd->bd_direction; d->bd_feedback = bd->bd_feedback; d->bd_async = bd->bd_async; d->bd_rcount = counter_u64_fetch(bd->bd_rcount); d->bd_dcount = counter_u64_fetch(bd->bd_dcount); d->bd_fcount = counter_u64_fetch(bd->bd_fcount); d->bd_sig = bd->bd_sig; d->bd_slen = bd->bd_slen; d->bd_hlen = bd->bd_hlen; d->bd_bufsize = bd->bd_bufsize; d->bd_pid = bd->bd_pid; strlcpy(d->bd_ifname, bd->bd_bif->bif_ifp->if_xname, IFNAMSIZ); d->bd_locked = bd->bd_locked; d->bd_wcount = counter_u64_fetch(bd->bd_wcount); d->bd_wdcount = counter_u64_fetch(bd->bd_wdcount); d->bd_wfcount = counter_u64_fetch(bd->bd_wfcount); d->bd_zcopy = counter_u64_fetch(bd->bd_zcopy); d->bd_bufmode = bd->bd_bufmode; } /* * Handle `netstat -B' stats request */ static int bpf_stats_sysctl(SYSCTL_HANDLER_ARGS) { static const struct xbpf_d zerostats; struct xbpf_d *xbdbuf, *xbd, tempstats; int index, error; struct bpf_if *bp; struct bpf_d *bd; /* * XXX This is not technically correct. It is possible for non * privileged users to open bpf devices. It would make sense * if the users who opened the devices were able to retrieve * the statistics for them, too. */ error = priv_check(req->td, PRIV_NET_BPF); if (error) return (error); /* * Check to see if the user is requesting that the counters be * zeroed out. Explicitly check that the supplied data is zeroed, * as we aren't allowing the user to set the counters currently. 
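The xbpf_d records filled in above are what netstat -B consumes. A hedged userland sketch of the usual two-call sysctl pattern (the MIB name net.bpf.stats is an assumption here, taken from netstat usage rather than from this diff, and the header list may need adjusting):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <net/bpfdesc.h>  /* struct xbpf_d */
    #include <stdlib.h>

    /* Fetch the per-descriptor stats array; the caller frees the result. */
    static struct xbpf_d *
    fetch_bpf_stats(size_t *count)
    {
            struct xbpf_d *xbd;
            size_t len;

            if (sysctlbyname("net.bpf.stats", NULL, &len, NULL, 0) < 0)
                    return (NULL);
            if ((xbd = malloc(len)) == NULL)
                    return (NULL);
            if (sysctlbyname("net.bpf.stats", xbd, &len, NULL, 0) < 0) {
                    free(xbd);
                    return (NULL);
            }
            *count = len / sizeof(*xbd);
            return (xbd);
    }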
*/ if (req->newptr != NULL) { if (req->newlen != sizeof(tempstats)) return (EINVAL); memset(&tempstats, 0, sizeof(tempstats)); error = SYSCTL_IN(req, &tempstats, sizeof(tempstats)); if (error) return (error); if (bcmp(&tempstats, &zerostats, sizeof(tempstats)) != 0) return (EINVAL); bpf_zero_counters(); return (0); } if (req->oldptr == NULL) return (SYSCTL_OUT(req, 0, bpf_bpfd_cnt * sizeof(*xbd))); if (bpf_bpfd_cnt == 0) return (SYSCTL_OUT(req, 0, 0)); xbdbuf = malloc(req->oldlen, M_BPF, M_WAITOK); BPF_LOCK(); if (req->oldlen < (bpf_bpfd_cnt * sizeof(*xbd))) { BPF_UNLOCK(); free(xbdbuf, M_BPF); return (ENOMEM); } index = 0; LIST_FOREACH(bp, &bpf_iflist, bif_next) { BPFIF_RLOCK(bp); /* Send writers-only first */ LIST_FOREACH(bd, &bp->bif_wlist, bd_next) { xbd = &xbdbuf[index++]; BPFD_LOCK(bd); bpfstats_fill_xbpf(xbd, bd); BPFD_UNLOCK(bd); } LIST_FOREACH(bd, &bp->bif_dlist, bd_next) { xbd = &xbdbuf[index++]; BPFD_LOCK(bd); bpfstats_fill_xbpf(xbd, bd); BPFD_UNLOCK(bd); } BPFIF_RUNLOCK(bp); } BPF_UNLOCK(); error = SYSCTL_OUT(req, xbdbuf, index * sizeof(*xbd)); free(xbdbuf, M_BPF); return (error); } SYSINIT(bpfdev,SI_SUB_DRIVERS,SI_ORDER_MIDDLE,bpf_drvinit,NULL); #else /* !DEV_BPF && !NETGRAPH_BPF */ /* * NOP stubs to allow bpf-using drivers to load and function. * * A 'better' implementation would allow the core bpf functionality * to be loaded at runtime. */ static struct bpf_if bp_null; void bpf_tap(struct bpf_if *bp, u_char *pkt, u_int pktlen) { } void bpf_mtap(struct bpf_if *bp, struct mbuf *m) { } void bpf_mtap2(struct bpf_if *bp, void *d, u_int l, struct mbuf *m) { } void bpfattach(struct ifnet *ifp, u_int dlt, u_int hdrlen) { bpfattach2(ifp, dlt, hdrlen, &ifp->if_bpf); } void bpfattach2(struct ifnet *ifp, u_int dlt, u_int hdrlen, struct bpf_if **driverp) { *driverp = &bp_null; } void bpfdetach(struct ifnet *ifp) { } u_int bpf_filter(const struct bpf_insn *pc, u_char *p, u_int wirelen, u_int buflen) { return -1; /* "no filter" behaviour */ } int bpf_validate(const struct bpf_insn *f, int len) { return 0; /* false */ } #endif /* !DEV_BPF && !NETGRAPH_BPF */ #ifdef DDB static void bpf_show_bpf_if(struct bpf_if *bpf_if) { if (bpf_if == NULL) return; db_printf("%p:\n", bpf_if); #define BPF_DB_PRINTF(f, e) db_printf(" %s = " f "\n", #e, bpf_if->e); /* bif_ext.bif_next */ /* bif_ext.bif_dlist */ BPF_DB_PRINTF("%#x", bif_dlt); BPF_DB_PRINTF("%u", bif_hdrlen); BPF_DB_PRINTF("%p", bif_ifp); /* bif_lock */ /* bif_wlist */ BPF_DB_PRINTF("%#x", bif_flags); } DB_SHOW_COMMAND(bpf_if, db_show_bpf_if) { if (!have_addr) { db_printf("usage: show bpf_if \n"); return; } bpf_show_bpf_if((struct bpf_if *)addr); } #endif Index: user/markj/netdump/sys/net/bpfdesc.h =================================================================== --- user/markj/netdump/sys/net/bpfdesc.h (revision 332407) +++ user/markj/netdump/sys/net/bpfdesc.h (revision 332408) @@ -1,165 +1,162 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1990, 1991, 1993 * The Regents of the University of California. All rights reserved. * * This code is derived from the Stanford/CMU enet packet filter, * (net/enet.c) distributed as part of 4.3BSD, and code contributed * to Berkeley by Steven McCanne and Van Jacobson both of Lawrence * Berkeley Laboratory. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. 
* 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)bpfdesc.h 8.1 (Berkeley) 6/10/93 * * $FreeBSD$ */ #ifndef _NET_BPFDESC_H_ #define _NET_BPFDESC_H_ #include #include #include #include #include #include /* * Descriptor associated with each open bpf file. */ struct zbuf; struct bpf_d { LIST_ENTRY(bpf_d) bd_next; /* Linked list of descriptors */ /* * Buffer slots: two memory buffers store the incoming packets. * The model has three slots. Sbuf is always occupied. * sbuf (store) - Receive interrupt puts packets here. * hbuf (hold) - When sbuf is full, put buffer here and * wakeup read (replace sbuf with fbuf). * fbuf (free) - When read is done, put buffer here. * On receiving, if sbuf is full and fbuf is 0, packet is dropped. 
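The rotation implied by this three-slot scheme is the ROTATE_BUFFERS() step used by catchpacket() earlier in this diff; conceptually it amounts to the following sketch (mirroring the macro of that name in net/bpf.h):

    /* The full store buffer becomes the hold buffer handed to the reader;
     * the free buffer, if any, becomes the new store buffer. */
    d->bd_hbuf = d->bd_sbuf;
    d->bd_hlen = d->bd_slen;
    d->bd_sbuf = d->bd_fbuf;
    d->bd_slen = 0;
    d->bd_fbuf = NULL;  /* no free slot until the reader releases hbuf */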
*/ caddr_t bd_sbuf; /* store slot */ caddr_t bd_hbuf; /* hold slot */ caddr_t bd_fbuf; /* free slot */ int bd_hbuf_in_use; /* don't rotate buffers */ int bd_slen; /* current length of store buffer */ int bd_hlen; /* current length of hold buffer */ int bd_bufsize; /* absolute length of buffers */ struct bpf_if * bd_bif; /* interface descriptor */ u_long bd_rtout; /* Read timeout in 'ticks' */ struct bpf_insn *bd_rfilter; /* read filter code */ struct bpf_insn *bd_wfilter; /* write filter code */ void *bd_bfilter; /* binary filter code */ counter_u64_t bd_rcount; /* number of packets received */ counter_u64_t bd_dcount; /* number of packets dropped */ u_char bd_promisc; /* true if listening promiscuously */ u_char bd_state; /* idle, waiting, or timed out */ u_char bd_immediate; /* true to return on packet arrival */ u_char bd_writer; /* non-zero if d is writer-only */ int bd_hdrcmplt; /* false to fill in src lladdr automatically */ int bd_direction; /* select packet direction */ int bd_tstamp; /* select time stamping function */ int bd_feedback; /* true to feed back sent packets */ int bd_async; /* non-zero if packet reception should generate signal */ int bd_sig; /* signal to send upon packet reception */ struct sigio * bd_sigio; /* information for async I/O */ struct selinfo bd_sel; /* bsd select info */ struct mtx bd_lock; /* per-descriptor lock */ struct callout bd_callout; /* for BPF timeouts with select */ struct label *bd_label; /* MAC label for descriptor */ counter_u64_t bd_fcount; /* number of packets which matched filter */ pid_t bd_pid; /* PID which created descriptor */ int bd_locked; /* true if descriptor is locked */ u_int bd_bufmode; /* Current buffer mode. */ counter_u64_t bd_wcount; /* number of packets written */ counter_u64_t bd_wfcount; /* number of packets that matched write filter */ counter_u64_t bd_wdcount; /* number of packets dropped during a write */ counter_u64_t bd_zcopy; /* number of zero copy operations */ u_char bd_compat32; /* 32-bit stream on LP64 system */ }; /* Values for bd_state */ #define BPF_IDLE 0 /* no select in progress */ #define BPF_WAITING 1 /* waiting for read timeout in select */ #define BPF_TIMED_OUT 2 /* read timeout has expired in select */ #define BPFD_LOCK(bd) mtx_lock(&(bd)->bd_lock) #define BPFD_UNLOCK(bd) mtx_unlock(&(bd)->bd_lock) #define BPFD_LOCK_ASSERT(bd) mtx_assert(&(bd)->bd_lock, MA_OWNED) #define BPF_PID_REFRESH(bd, td) (bd)->bd_pid = (td)->td_proc->p_pid #define BPF_PID_REFRESH_CUR(bd) (bd)->bd_pid = curthread->td_proc->p_pid -#define BPF_LOCK() mtx_lock(&bpf_mtx) -#define BPF_UNLOCK() mtx_unlock(&bpf_mtx) -#define BPF_LOCK_ASSERT() mtx_assert(&bpf_mtx, MA_OWNED) /* * External representation of the bpf descriptor */ struct xbpf_d { u_int bd_structsize; /* Size of this structure. */ u_char bd_promisc; u_char bd_immediate; u_char __bd_pad[6]; int bd_hdrcmplt; int bd_direction; int bd_feedback; int bd_async; u_int64_t bd_rcount; u_int64_t bd_dcount; u_int64_t bd_fcount; int bd_sig; int bd_slen; int bd_hlen; int bd_bufsize; pid_t bd_pid; char bd_ifname[IFNAMSIZ]; int bd_locked; u_int64_t bd_wcount; u_int64_t bd_wfcount; u_int64_t bd_wdcount; u_int64_t bd_zcopy; int bd_bufmode; /* * Allocate 4 64 bit unsigned integers for future expansion so we do * not have to worry about breaking the ABI. 
*/ u_int64_t bd_spare[4]; }; #define BPFIF_RLOCK(bif) rw_rlock(&(bif)->bif_lock) #define BPFIF_RUNLOCK(bif) rw_runlock(&(bif)->bif_lock) #define BPFIF_WLOCK(bif) rw_wlock(&(bif)->bif_lock) #define BPFIF_WUNLOCK(bif) rw_wunlock(&(bif)->bif_lock) #define BPFIF_FLAG_DYING 1 /* Reject new bpf consumers */ #endif Index: user/markj/netdump/sys/net/iflib.c =================================================================== --- user/markj/netdump/sys/net/iflib.c (revision 332407) +++ user/markj/netdump/sys/net/iflib.c (revision 332408) @@ -1,6059 +1,6082 @@ /*- - * Copyright (c) 2014-2017, Matthew Macy + * Copyright (c) 2014-2018, Matthew Macy * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * 2. Neither the name of Matthew Macy nor the names of its * contributors may be used to endorse or promote products derived from * this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_acpi.h" #include "opt_sched.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "ifdi_if.h" #if defined(__i386__) || defined(__amd64__) #include #include #include #include #include #include #endif #include /* * enable accounting of every mbuf as it comes in to and goes out of * iflib's software descriptor references */ #define MEMORY_LOGGING 0 /* * Enable mbuf vectors for compressing long mbuf chains */ /* * NB: * - Prefetching in tx cleaning should perhaps be a tunable. The distance ahead * we prefetch needs to be determined by the time spent in m_free vis a vis * the cost of a prefetch. This will of course vary based on the workload: * - NFLX's m_free path is dominated by vm-based M_EXT manipulation which * is quite expensive, thus suggesting very little prefetch. * - small packet forwarding which is just returning a single mbuf to * UMA will typically be very fast vis a vis the cost of a memory * access. 
*/ /* * File organization: * - private structures * - iflib private utility functions * - ifnet functions * - vlan registry and other exported functions * - iflib public core functions * * */ static MALLOC_DEFINE(M_IFLIB, "iflib", "ifnet library"); struct iflib_txq; typedef struct iflib_txq *iflib_txq_t; struct iflib_rxq; typedef struct iflib_rxq *iflib_rxq_t; struct iflib_fl; typedef struct iflib_fl *iflib_fl_t; struct iflib_ctx; static void iru_init(if_rxd_update_t iru, iflib_rxq_t rxq, uint8_t flid); typedef struct iflib_filter_info { driver_filter_t *ifi_filter; void *ifi_filter_arg; struct grouptask *ifi_task; void *ifi_ctx; } *iflib_filter_info_t; struct iflib_ctx { KOBJ_FIELDS; /* * Pointer to hardware driver's softc */ void *ifc_softc; device_t ifc_dev; if_t ifc_ifp; cpuset_t ifc_cpus; if_shared_ctx_t ifc_sctx; struct if_softc_ctx ifc_softc_ctx; - struct mtx ifc_mtx; + struct mtx ifc_ctx_mtx; + struct mtx ifc_state_mtx; uint16_t ifc_nhwtxqs; uint16_t ifc_nhwrxqs; iflib_txq_t ifc_txqs; iflib_rxq_t ifc_rxqs; uint32_t ifc_if_flags; uint32_t ifc_flags; uint32_t ifc_max_fl_buf_size; int ifc_in_detach; int ifc_link_state; int ifc_link_irq; int ifc_watchdog_events; struct cdev *ifc_led_dev; struct resource *ifc_msix_mem; struct if_irq ifc_legacy_irq; struct grouptask ifc_admin_task; struct grouptask ifc_vflr_task; struct iflib_filter_info ifc_filter_info; struct ifmedia ifc_media; struct sysctl_oid *ifc_sysctl_node; uint16_t ifc_sysctl_ntxqs; uint16_t ifc_sysctl_nrxqs; uint16_t ifc_sysctl_qs_eq_override; uint16_t ifc_sysctl_rx_budget; qidx_t ifc_sysctl_ntxds[8]; qidx_t ifc_sysctl_nrxds[8]; struct if_txrx ifc_txrx; #define isc_txd_encap ifc_txrx.ift_txd_encap #define isc_txd_flush ifc_txrx.ift_txd_flush #define isc_txd_credits_update ifc_txrx.ift_txd_credits_update #define isc_rxd_available ifc_txrx.ift_rxd_available #define isc_rxd_pkt_get ifc_txrx.ift_rxd_pkt_get #define isc_rxd_refill ifc_txrx.ift_rxd_refill #define isc_rxd_flush ifc_txrx.ift_rxd_flush #define isc_rxd_refill ifc_txrx.ift_rxd_refill #define isc_rxd_refill ifc_txrx.ift_rxd_refill #define isc_legacy_intr ifc_txrx.ift_legacy_intr eventhandler_tag ifc_vlan_attach_event; eventhandler_tag ifc_vlan_detach_event; uint8_t ifc_mac[ETHER_ADDR_LEN]; char ifc_mtx_name[16]; }; void * iflib_get_softc(if_ctx_t ctx) { return (ctx->ifc_softc); } device_t iflib_get_dev(if_ctx_t ctx) { return (ctx->ifc_dev); } if_t iflib_get_ifp(if_ctx_t ctx) { return (ctx->ifc_ifp); } struct ifmedia * iflib_get_media(if_ctx_t ctx) { return (&ctx->ifc_media); } void iflib_set_mac(if_ctx_t ctx, uint8_t mac[ETHER_ADDR_LEN]) { bcopy(mac, ctx->ifc_mac, ETHER_ADDR_LEN); } if_softc_ctx_t iflib_get_softc_ctx(if_ctx_t ctx) { return (&ctx->ifc_softc_ctx); } if_shared_ctx_t iflib_get_sctx(if_ctx_t ctx) { return (ctx->ifc_sctx); } #define IP_ALIGNED(m) ((((uintptr_t)(m)->m_data) & 0x3) == 0x2) #define CACHE_PTR_INCREMENT (CACHE_LINE_SIZE/sizeof(void*)) #define CACHE_PTR_NEXT(ptr) ((void *)(((uintptr_t)(ptr)+CACHE_LINE_SIZE-1) & (CACHE_LINE_SIZE-1))) #define LINK_ACTIVE(ctx) ((ctx)->ifc_link_state == LINK_STATE_UP) #define CTX_IS_VF(ctx) ((ctx)->ifc_sctx->isc_flags & IFLIB_IS_VF) #define RX_SW_DESC_MAP_CREATED (1 << 0) #define TX_SW_DESC_MAP_CREATED (1 << 1) #define RX_SW_DESC_INUSE (1 << 3) #define TX_SW_DESC_MAPPED (1 << 4) #define M_TOOBIG M_PROTO1 typedef struct iflib_sw_rx_desc_array { bus_dmamap_t *ifsd_map; /* bus_dma maps for packet */ struct mbuf **ifsd_m; /* pkthdr mbufs */ caddr_t *ifsd_cl; /* direct cluster pointer for rx */ uint8_t *ifsd_flags; } 
iflib_rxsd_array_t; typedef struct iflib_sw_tx_desc_array { bus_dmamap_t *ifsd_map; /* bus_dma maps for packet */ struct mbuf **ifsd_m; /* pkthdr mbufs */ uint8_t *ifsd_flags; } if_txsd_vec_t; /* magic number that should be high enough for any hardware */ #define IFLIB_MAX_TX_SEGS 128 /* bnxt supports 64 with hardware LRO enabled */ #define IFLIB_MAX_RX_SEGS 64 #define IFLIB_RX_COPY_THRESH 128 #define IFLIB_MAX_RX_REFRESH 32 /* The minimum descriptors per second before we start coalescing */ #define IFLIB_MIN_DESC_SEC 16384 #define IFLIB_DEFAULT_TX_UPDATE_FREQ 16 #define IFLIB_QUEUE_IDLE 0 #define IFLIB_QUEUE_HUNG 1 #define IFLIB_QUEUE_WORKING 2 /* maximum number of txqs that can share an rx interrupt */ #define IFLIB_MAX_TX_SHARED_INTR 4 /* this should really scale with ring size - this is a fairly arbitrary value */ #define TX_BATCH_SIZE 32 #define IFLIB_RESTART_BUDGET 8 #define IFC_LEGACY 0x001 #define IFC_QFLUSH 0x002 #define IFC_MULTISEG 0x004 #define IFC_DMAR 0x008 #define IFC_SC_ALLOCATED 0x010 #define IFC_INIT_DONE 0x020 #define IFC_PREFETCH 0x040 #define IFC_DO_RESET 0x080 -#define IFC_CHECK_HUNG 0x100 +#define IFC_DO_WATCHDOG 0x100 +#define IFC_CHECK_HUNG 0x200 + #define CSUM_OFFLOAD (CSUM_IP_TSO|CSUM_IP6_TSO|CSUM_IP| \ CSUM_IP_UDP|CSUM_IP_TCP|CSUM_IP_SCTP| \ CSUM_IP6_UDP|CSUM_IP6_TCP|CSUM_IP6_SCTP) struct iflib_txq { qidx_t ift_in_use; qidx_t ift_cidx; qidx_t ift_cidx_processed; qidx_t ift_pidx; uint8_t ift_gen; uint8_t ift_br_offset; uint16_t ift_npending; uint16_t ift_db_pending; uint16_t ift_rs_pending; /* implicit pad */ uint8_t ift_txd_size[8]; uint64_t ift_processed; uint64_t ift_cleaned; uint64_t ift_cleaned_prev; #if MEMORY_LOGGING uint64_t ift_enqueued; uint64_t ift_dequeued; #endif uint64_t ift_no_tx_dma_setup; uint64_t ift_no_desc_avail; uint64_t ift_mbuf_defrag_failed; uint64_t ift_mbuf_defrag; uint64_t ift_map_failed; uint64_t ift_txd_encap_efbig; uint64_t ift_pullups; struct mtx ift_mtx; struct mtx ift_db_mtx; /* constant values */ if_ctx_t ift_ctx; struct ifmp_ring *ift_br; struct grouptask ift_task; qidx_t ift_size; uint16_t ift_id; struct callout ift_timer; if_txsd_vec_t ift_sds; uint8_t ift_qstatus; uint8_t ift_closed; uint8_t ift_update_freq; struct iflib_filter_info ift_filter_info; bus_dma_tag_t ift_desc_tag; bus_dma_tag_t ift_tso_desc_tag; iflib_dma_info_t ift_ifdi; #define MTX_NAME_LEN 16 char ift_mtx_name[MTX_NAME_LEN]; char ift_db_mtx_name[MTX_NAME_LEN]; bus_dma_segment_t ift_segs[IFLIB_MAX_TX_SEGS] __aligned(CACHE_LINE_SIZE); #ifdef IFLIB_DIAGNOSTICS uint64_t ift_cpu_exec_count[256]; #endif } __aligned(CACHE_LINE_SIZE); struct iflib_fl { qidx_t ifl_cidx; qidx_t ifl_pidx; qidx_t ifl_credits; uint8_t ifl_gen; uint8_t ifl_rxd_size; #if MEMORY_LOGGING uint64_t ifl_m_enqueued; uint64_t ifl_m_dequeued; uint64_t ifl_cl_enqueued; uint64_t ifl_cl_dequeued; #endif /* implicit pad */ bitstr_t *ifl_rx_bitmap; qidx_t ifl_fragidx; /* constant */ qidx_t ifl_size; uint16_t ifl_buf_size; uint16_t ifl_cltype; uma_zone_t ifl_zone; iflib_rxsd_array_t ifl_sds; iflib_rxq_t ifl_rxq; uint8_t ifl_id; bus_dma_tag_t ifl_desc_tag; iflib_dma_info_t ifl_ifdi; uint64_t ifl_bus_addrs[IFLIB_MAX_RX_REFRESH] __aligned(CACHE_LINE_SIZE); caddr_t ifl_vm_addrs[IFLIB_MAX_RX_REFRESH]; qidx_t ifl_rxd_idxs[IFLIB_MAX_RX_REFRESH]; } __aligned(CACHE_LINE_SIZE); static inline qidx_t get_inuse(int size, qidx_t cidx, qidx_t pidx, uint8_t gen) { qidx_t used; if (pidx > cidx) used = pidx - cidx; else if (pidx < cidx) used = size - cidx + pidx; else if (gen == 0 && pidx == cidx) used = 0; else if (gen == 
1 && pidx == cidx) used = size; else panic("bad state"); return (used); } #define TXQ_AVAIL(txq) (txq->ift_size - get_inuse(txq->ift_size, txq->ift_cidx, txq->ift_pidx, txq->ift_gen)) #define IDXDIFF(head, tail, wrap) \ ((head) >= (tail) ? (head) - (tail) : (wrap) - (tail) + (head)) struct iflib_rxq { /* If there is a separate completion queue - * these are the cq cidx and pidx. Otherwise * these are unused. */ qidx_t ifr_size; qidx_t ifr_cq_cidx; qidx_t ifr_cq_pidx; uint8_t ifr_cq_gen; uint8_t ifr_fl_offset; if_ctx_t ifr_ctx; iflib_fl_t ifr_fl; uint64_t ifr_rx_irq; uint16_t ifr_id; uint8_t ifr_lro_enabled; uint8_t ifr_nfl; uint8_t ifr_ntxqirq; uint8_t ifr_txqid[IFLIB_MAX_TX_SHARED_INTR]; struct lro_ctrl ifr_lc; struct grouptask ifr_task; struct iflib_filter_info ifr_filter_info; iflib_dma_info_t ifr_ifdi; /* dynamically allocate if any drivers need a value substantially larger than this */ struct if_rxd_frag ifr_frags[IFLIB_MAX_RX_SEGS] __aligned(CACHE_LINE_SIZE); #ifdef IFLIB_DIAGNOSTICS uint64_t ifr_cpu_exec_count[256]; #endif } __aligned(CACHE_LINE_SIZE); typedef struct if_rxsd { caddr_t *ifsd_cl; struct mbuf **ifsd_m; iflib_fl_t ifsd_fl; qidx_t ifsd_cidx; } *if_rxsd_t; /* multiple of word size */ #ifdef __LP64__ #define PKT_INFO_SIZE 6 #define RXD_INFO_SIZE 5 #define PKT_TYPE uint64_t #else #define PKT_INFO_SIZE 11 #define RXD_INFO_SIZE 8 #define PKT_TYPE uint32_t #endif #define PKT_LOOP_BOUND ((PKT_INFO_SIZE/3)*3) #define RXD_LOOP_BOUND ((RXD_INFO_SIZE/4)*4) typedef struct if_pkt_info_pad { PKT_TYPE pkt_val[PKT_INFO_SIZE]; } *if_pkt_info_pad_t; typedef struct if_rxd_info_pad { PKT_TYPE rxd_val[RXD_INFO_SIZE]; } *if_rxd_info_pad_t; CTASSERT(sizeof(struct if_pkt_info_pad) == sizeof(struct if_pkt_info)); CTASSERT(sizeof(struct if_rxd_info_pad) == sizeof(struct if_rxd_info)); static inline void pkt_info_zero(if_pkt_info_t pi) { if_pkt_info_pad_t pi_pad; pi_pad = (if_pkt_info_pad_t)pi; pi_pad->pkt_val[0] = 0; pi_pad->pkt_val[1] = 0; pi_pad->pkt_val[2] = 0; pi_pad->pkt_val[3] = 0; pi_pad->pkt_val[4] = 0; pi_pad->pkt_val[5] = 0; #ifndef __LP64__ pi_pad->pkt_val[6] = 0; pi_pad->pkt_val[7] = 0; pi_pad->pkt_val[8] = 0; pi_pad->pkt_val[9] = 0; pi_pad->pkt_val[10] = 0; #endif } static inline void rxd_info_zero(if_rxd_info_t ri) { if_rxd_info_pad_t ri_pad; int i; ri_pad = (if_rxd_info_pad_t)ri; for (i = 0; i < RXD_LOOP_BOUND; i += 4) { ri_pad->rxd_val[i] = 0; ri_pad->rxd_val[i+1] = 0; ri_pad->rxd_val[i+2] = 0; ri_pad->rxd_val[i+3] = 0; } #ifdef __LP64__ ri_pad->rxd_val[RXD_INFO_SIZE-1] = 0; #endif } /* * Only allow a single packet to take up most 1/nth of the tx ring */ #define MAX_SINGLE_PACKET_FRACTION 12 #define IF_BAD_DMA (bus_addr_t)-1 #define CTX_ACTIVE(ctx) ((if_getdrvflags((ctx)->ifc_ifp) & IFF_DRV_RUNNING)) -#define CTX_LOCK_INIT(_sc, _name) mtx_init(&(_sc)->ifc_mtx, _name, "iflib ctx lock", MTX_DEF) +#define CTX_LOCK_INIT(_sc, _name) mtx_init(&(_sc)->ifc_ctx_mtx, _name, "iflib ctx lock", MTX_DEF) +#define CTX_LOCK(ctx) mtx_lock(&(ctx)->ifc_ctx_mtx) +#define CTX_UNLOCK(ctx) mtx_unlock(&(ctx)->ifc_ctx_mtx) +#define CTX_LOCK_DESTROY(ctx) mtx_destroy(&(ctx)->ifc_ctx_mtx) -#define CTX_LOCK(ctx) mtx_lock(&(ctx)->ifc_mtx) -#define CTX_UNLOCK(ctx) mtx_unlock(&(ctx)->ifc_mtx) -#define CTX_LOCK_DESTROY(ctx) mtx_destroy(&(ctx)->ifc_mtx) +#define STATE_LOCK_INIT(_sc, _name) mtx_init(&(_sc)->ifc_state_mtx, _name, "iflib state lock", MTX_DEF) +#define STATE_LOCK(ctx) mtx_lock(&(ctx)->ifc_state_mtx) +#define STATE_UNLOCK(ctx) mtx_unlock(&(ctx)->ifc_state_mtx) +#define STATE_LOCK_DESTROY(ctx) 
mtx_destroy(&(ctx)->ifc_state_mtx) + + #define CALLOUT_LOCK(txq) mtx_lock(&txq->ift_mtx) #define CALLOUT_UNLOCK(txq) mtx_unlock(&txq->ift_mtx) /* Our boot-time initialization hook */ static int iflib_module_event_handler(module_t, int, void *); static moduledata_t iflib_moduledata = { "iflib", iflib_module_event_handler, NULL }; DECLARE_MODULE(iflib, iflib_moduledata, SI_SUB_INIT_IF, SI_ORDER_ANY); MODULE_VERSION(iflib, 1); MODULE_DEPEND(iflib, pci, 1, 1, 1); MODULE_DEPEND(iflib, ether, 1, 1, 1); TASKQGROUP_DEFINE(if_io_tqg, mp_ncpus, 1); TASKQGROUP_DEFINE(if_config_tqg, 1, 1); #ifndef IFLIB_DEBUG_COUNTERS #ifdef INVARIANTS #define IFLIB_DEBUG_COUNTERS 1 #else #define IFLIB_DEBUG_COUNTERS 0 #endif /* !INVARIANTS */ #endif static SYSCTL_NODE(_net, OID_AUTO, iflib, CTLFLAG_RD, 0, "iflib driver parameters"); /* * XXX need to ensure that this can't accidentally cause the head to be moved backwards */ static int iflib_min_tx_latency = 0; SYSCTL_INT(_net_iflib, OID_AUTO, min_tx_latency, CTLFLAG_RW, &iflib_min_tx_latency, 0, "minimize transmit latency at the possible expense of throughput"); static int iflib_no_tx_batch = 0; SYSCTL_INT(_net_iflib, OID_AUTO, no_tx_batch, CTLFLAG_RW, &iflib_no_tx_batch, 0, "minimize transmit latency at the possible expense of throughput"); #if IFLIB_DEBUG_COUNTERS static int iflib_tx_seen; static int iflib_tx_sent; static int iflib_tx_encap; static int iflib_rx_allocs; static int iflib_fl_refills; static int iflib_fl_refills_large; static int iflib_tx_frees; SYSCTL_INT(_net_iflib, OID_AUTO, tx_seen, CTLFLAG_RD, &iflib_tx_seen, 0, "# tx mbufs seen"); SYSCTL_INT(_net_iflib, OID_AUTO, tx_sent, CTLFLAG_RD, &iflib_tx_sent, 0, "# tx mbufs sent"); SYSCTL_INT(_net_iflib, OID_AUTO, tx_encap, CTLFLAG_RD, &iflib_tx_encap, 0, "# tx mbufs encapped"); SYSCTL_INT(_net_iflib, OID_AUTO, tx_frees, CTLFLAG_RD, &iflib_tx_frees, 0, "# tx frees"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_allocs, CTLFLAG_RD, &iflib_rx_allocs, 0, "# rx allocations"); SYSCTL_INT(_net_iflib, OID_AUTO, fl_refills, CTLFLAG_RD, &iflib_fl_refills, 0, "# refills"); SYSCTL_INT(_net_iflib, OID_AUTO, fl_refills_large, CTLFLAG_RD, &iflib_fl_refills_large, 0, "# large refills"); static int iflib_txq_drain_flushing; static int iflib_txq_drain_oactive; static int iflib_txq_drain_notready; static int iflib_txq_drain_encapfail; SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_flushing, CTLFLAG_RD, &iflib_txq_drain_flushing, 0, "# drain flushes"); SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_oactive, CTLFLAG_RD, &iflib_txq_drain_oactive, 0, "# drain oactives"); SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_notready, CTLFLAG_RD, &iflib_txq_drain_notready, 0, "# drain notready"); SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_encapfail, CTLFLAG_RD, &iflib_txq_drain_encapfail, 0, "# drain encap fails"); static int iflib_encap_load_mbuf_fail; static int iflib_encap_pad_mbuf_fail; static int iflib_encap_txq_avail_fail; static int iflib_encap_txd_encap_fail; SYSCTL_INT(_net_iflib, OID_AUTO, encap_load_mbuf_fail, CTLFLAG_RD, &iflib_encap_load_mbuf_fail, 0, "# busdma load failures"); SYSCTL_INT(_net_iflib, OID_AUTO, encap_pad_mbuf_fail, CTLFLAG_RD, &iflib_encap_pad_mbuf_fail, 0, "# runt frame pad failures"); SYSCTL_INT(_net_iflib, OID_AUTO, encap_txq_avail_fail, CTLFLAG_RD, &iflib_encap_txq_avail_fail, 0, "# txq avail failures"); SYSCTL_INT(_net_iflib, OID_AUTO, encap_txd_encap_fail, CTLFLAG_RD, &iflib_encap_txd_encap_fail, 0, "# driver encap failures"); static int iflib_task_fn_rxs; static int iflib_rx_intr_enables; static int iflib_fast_intrs; 
static int iflib_intr_link; static int iflib_intr_msix; static int iflib_rx_unavail; static int iflib_rx_ctx_inactive; static int iflib_rx_zero_len; static int iflib_rx_if_input; static int iflib_rx_mbuf_null; static int iflib_rxd_flush; static int iflib_verbose_debug; SYSCTL_INT(_net_iflib, OID_AUTO, intr_link, CTLFLAG_RD, &iflib_intr_link, 0, "# intr link calls"); SYSCTL_INT(_net_iflib, OID_AUTO, intr_msix, CTLFLAG_RD, &iflib_intr_msix, 0, "# intr msix calls"); SYSCTL_INT(_net_iflib, OID_AUTO, task_fn_rx, CTLFLAG_RD, &iflib_task_fn_rxs, 0, "# task_fn_rx calls"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_intr_enables, CTLFLAG_RD, &iflib_rx_intr_enables, 0, "# rx intr enables"); SYSCTL_INT(_net_iflib, OID_AUTO, fast_intrs, CTLFLAG_RD, &iflib_fast_intrs, 0, "# fast_intr calls"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_unavail, CTLFLAG_RD, &iflib_rx_unavail, 0, "# times rxeof called with no available data"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_ctx_inactive, CTLFLAG_RD, &iflib_rx_ctx_inactive, 0, "# times rxeof called with inactive context"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_zero_len, CTLFLAG_RD, &iflib_rx_zero_len, 0, "# times rxeof saw zero len mbuf"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_if_input, CTLFLAG_RD, &iflib_rx_if_input, 0, "# times rxeof called if_input"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_mbuf_null, CTLFLAG_RD, &iflib_rx_mbuf_null, 0, "# times rxeof got null mbuf"); SYSCTL_INT(_net_iflib, OID_AUTO, rxd_flush, CTLFLAG_RD, &iflib_rxd_flush, 0, "# times rxd_flush called"); SYSCTL_INT(_net_iflib, OID_AUTO, verbose_debug, CTLFLAG_RW, &iflib_verbose_debug, 0, "enable verbose debugging"); #define DBG_COUNTER_INC(name) atomic_add_int(&(iflib_ ## name), 1) static void iflib_debug_reset(void) { iflib_tx_seen = iflib_tx_sent = iflib_tx_encap = iflib_rx_allocs = iflib_fl_refills = iflib_fl_refills_large = iflib_tx_frees = iflib_txq_drain_flushing = iflib_txq_drain_oactive = iflib_txq_drain_notready = iflib_txq_drain_encapfail = iflib_encap_load_mbuf_fail = iflib_encap_pad_mbuf_fail = iflib_encap_txq_avail_fail = iflib_encap_txd_encap_fail = iflib_task_fn_rxs = iflib_rx_intr_enables = iflib_fast_intrs = iflib_intr_link = iflib_intr_msix = iflib_rx_unavail = iflib_rx_ctx_inactive = iflib_rx_zero_len = iflib_rx_if_input = iflib_rx_mbuf_null = iflib_rxd_flush = 0; } #else #define DBG_COUNTER_INC(name) static void iflib_debug_reset(void) {} #endif #define IFLIB_DEBUG 0 static void iflib_tx_structures_free(if_ctx_t ctx); static void iflib_rx_structures_free(if_ctx_t ctx); static int iflib_queues_alloc(if_ctx_t ctx); static int iflib_tx_credits_update(if_ctx_t ctx, iflib_txq_t txq); static int iflib_rxd_avail(if_ctx_t ctx, iflib_rxq_t rxq, qidx_t cidx, qidx_t budget); static int iflib_qset_structures_setup(if_ctx_t ctx); static int iflib_msix_init(if_ctx_t ctx); static int iflib_legacy_setup(if_ctx_t ctx, driver_filter_t filter, void *filterarg, int *rid, char *str); static void iflib_txq_check_drain(iflib_txq_t txq, int budget); static uint32_t iflib_txq_can_drain(struct ifmp_ring *); static int iflib_register(if_ctx_t); static void iflib_init_locked(if_ctx_t ctx); static void iflib_add_device_sysctl_pre(if_ctx_t ctx); static void iflib_add_device_sysctl_post(if_ctx_t ctx); static void iflib_ifmp_purge(iflib_txq_t txq); static void _iflib_pre_assert(if_softc_ctx_t scctx); static void iflib_stop(if_ctx_t ctx); static void iflib_if_init_locked(if_ctx_t ctx); #ifndef __NO_STRICT_ALIGNMENT static struct mbuf * iflib_fixup_rx(struct mbuf *m); #endif NETDUMP_DEFINE(iflib); #ifdef DEV_NETMAP #include #include 
#include MODULE_DEPEND(iflib, netmap, 1, 1, 1); static int netmap_fl_refill(iflib_rxq_t rxq, struct netmap_kring *kring, uint32_t nm_i, bool init); /* * device-specific sysctl variables: * * iflib_crcstrip: 0: keep CRC in rx frames (default), 1: strip it. * During regular operations the CRC is stripped, but on some * hardware reception of frames not multiple of 64 is slower, * so using crcstrip=0 helps in benchmarks. * * iflib_rx_miss, iflib_rx_miss_bufs: * count packets that might be missed due to lost interrupts. */ SYSCTL_DECL(_dev_netmap); /* * The xl driver by default strips CRCs and we do not override it. */ int iflib_crcstrip = 1; SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_crcstrip, CTLFLAG_RW, &iflib_crcstrip, 1, "strip CRC on rx frames"); int iflib_rx_miss, iflib_rx_miss_bufs; SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_rx_miss, CTLFLAG_RW, &iflib_rx_miss, 0, "potentially missed rx intr"); SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_rx_miss_bufs, CTLFLAG_RW, &iflib_rx_miss_bufs, 0, "potentially missed rx intr bufs"); /* * Register/unregister. We are already under netmap lock. * Only called on the first register or the last unregister. */ static int iflib_netmap_register(struct netmap_adapter *na, int onoff) { struct ifnet *ifp = na->ifp; if_ctx_t ctx = ifp->if_softc; int status; CTX_LOCK(ctx); IFDI_INTR_DISABLE(ctx); /* Tell the stack that the interface is no longer active */ ifp->if_drv_flags &= ~(IFF_DRV_RUNNING | IFF_DRV_OACTIVE); if (!CTX_IS_VF(ctx)) IFDI_CRCSTRIP_SET(ctx, onoff, iflib_crcstrip); /* enable or disable flags and callbacks in na and ifp */ if (onoff) { nm_set_native_flags(na); } else { nm_clear_native_flags(na); } iflib_stop(ctx); iflib_init_locked(ctx); IFDI_CRCSTRIP_SET(ctx, onoff, iflib_crcstrip); // XXX why twice ? status = ifp->if_drv_flags & IFF_DRV_RUNNING ? 
0 : 1; if (status) nm_clear_native_flags(na); CTX_UNLOCK(ctx); return (status); } static int netmap_fl_refill(iflib_rxq_t rxq, struct netmap_kring *kring, uint32_t nm_i, bool init) { struct netmap_adapter *na = kring->na; u_int const lim = kring->nkr_num_slots - 1; u_int head = kring->rhead; struct netmap_ring *ring = kring->ring; bus_dmamap_t *map; struct if_rxd_update iru; if_ctx_t ctx = rxq->ifr_ctx; iflib_fl_t fl = &rxq->ifr_fl[0]; uint32_t refill_pidx, nic_i; if (nm_i == head && __predict_true(!init)) return 0; iru_init(&iru, rxq, 0 /* flid */); map = fl->ifl_sds.ifsd_map; refill_pidx = netmap_idx_k2n(kring, nm_i); /* * IMPORTANT: we must leave one free slot in the ring, * so move head back by one unit */ head = nm_prev(head, lim); while (nm_i != head) { for (int tmp_pidx = 0; tmp_pidx < IFLIB_MAX_RX_REFRESH && nm_i != head; tmp_pidx++) { struct netmap_slot *slot = &ring->slot[nm_i]; void *addr = PNMB(na, slot, &fl->ifl_bus_addrs[tmp_pidx]); uint32_t nic_i_dma = refill_pidx; nic_i = netmap_idx_k2n(kring, nm_i); MPASS(tmp_pidx < IFLIB_MAX_RX_REFRESH); if (addr == NETMAP_BUF_BASE(na)) /* bad buf */ return netmap_ring_reinit(kring); fl->ifl_vm_addrs[tmp_pidx] = addr; if (__predict_false(init) && map) { netmap_load_map(na, fl->ifl_ifdi->idi_tag, map[nic_i], addr); } else if (map && (slot->flags & NS_BUF_CHANGED)) { /* buffer has changed, reload map */ netmap_reload_map(na, fl->ifl_ifdi->idi_tag, map[nic_i], addr); } slot->flags &= ~NS_BUF_CHANGED; nm_i = nm_next(nm_i, lim); fl->ifl_rxd_idxs[tmp_pidx] = nic_i = nm_next(nic_i, lim); if (nm_i != head && tmp_pidx < IFLIB_MAX_RX_REFRESH-1) continue; iru.iru_pidx = refill_pidx; iru.iru_count = tmp_pidx+1; ctx->isc_rxd_refill(ctx->ifc_softc, &iru); refill_pidx = nic_i; if (map == NULL) continue; for (int n = 0; n < iru.iru_count; n++) { bus_dmamap_sync(fl->ifl_ifdi->idi_tag, map[nic_i_dma], BUS_DMASYNC_PREREAD); /* XXX - change this to not use the netmap func*/ nic_i_dma = nm_next(nic_i_dma, lim); } } } kring->nr_hwcur = head; if (map) bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); ctx->isc_rxd_flush(ctx->ifc_softc, rxq->ifr_id, fl->ifl_id, nic_i); return (0); } /* * Reconcile kernel and user view of the transmit ring. * * All information is in the kring. * Userspace wants to send packets up to the one before kring->rhead, * kernel knows kring->nr_hwcur is the first unsent packet. * * Here we push packets out (as many as possible), and possibly * reclaim buffers from previously completed transmission. * * The caller (netmap) guarantees that there is only one instance * running at any time. Any interference with other driver * methods should be handled by the individual drivers. 
*/ static int iflib_netmap_txsync(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; struct ifnet *ifp = na->ifp; struct netmap_ring *ring = kring->ring; u_int nm_i; /* index into the netmap ring */ u_int nic_i; /* index into the NIC ring */ u_int n; u_int const lim = kring->nkr_num_slots - 1; u_int const head = kring->rhead; struct if_pkt_info pi; /* * interrupts on every tx packet are expensive so request * them every half ring, or where NS_REPORT is set */ u_int report_frequency = kring->nkr_num_slots >> 1; /* device-specific */ if_ctx_t ctx = ifp->if_softc; iflib_txq_t txq = &ctx->ifc_txqs[kring->ring_id]; if (txq->ift_sds.ifsd_map) bus_dmamap_sync(txq->ift_desc_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); /* * First part: process new packets to send. * nm_i is the current index in the netmap ring, * nic_i is the corresponding index in the NIC ring. * * If we have packets to send (nm_i != head) * iterate over the netmap ring, fetch length and update * the corresponding slot in the NIC ring. Some drivers also * need to update the buffer's physical address in the NIC slot * even if NS_BUF_CHANGED is not set (PNMB computes the addresses). * * The netmap_reload_map() call is especially expensive, * even when (as in this case) the tag is 0, so do it only * when the buffer has actually changed. * * If possible do not set the report/intr bit on all slots, * but only a few times per ring or when NS_REPORT is set. * * Finally, on 10G and faster drivers, it might be useful * to prefetch the next slot and txr entry. */ nm_i = netmap_idx_n2k(kring, kring->nr_hwcur); pkt_info_zero(&pi); pi.ipi_segs = txq->ift_segs; pi.ipi_qsidx = kring->ring_id; if (nm_i != head) { /* we have new packets to send */ nic_i = netmap_idx_k2n(kring, nm_i); __builtin_prefetch(&ring->slot[nm_i]); __builtin_prefetch(&txq->ift_sds.ifsd_m[nic_i]); if (txq->ift_sds.ifsd_map) __builtin_prefetch(&txq->ift_sds.ifsd_map[nic_i]); for (n = 0; nm_i != head; n++) { struct netmap_slot *slot = &ring->slot[nm_i]; u_int len = slot->len; uint64_t paddr; void *addr = PNMB(na, slot, &paddr); int flags = (slot->flags & NS_REPORT || nic_i == 0 || nic_i == report_frequency) ? IPI_TX_INTR : 0; /* device-specific */ pi.ipi_len = len; pi.ipi_segs[0].ds_addr = paddr; pi.ipi_segs[0].ds_len = len; pi.ipi_nsegs = 1; pi.ipi_ndescs = 0; pi.ipi_pidx = nic_i; pi.ipi_flags = flags; /* Fill the slot in the NIC ring. */ ctx->isc_txd_encap(ctx->ifc_softc, &pi); /* prefetch for next round */ __builtin_prefetch(&ring->slot[nm_i + 1]); __builtin_prefetch(&txq->ift_sds.ifsd_m[nic_i + 1]); if (txq->ift_sds.ifsd_map) { __builtin_prefetch(&txq->ift_sds.ifsd_map[nic_i + 1]); NM_CHECK_ADDR_LEN(na, addr, len); if (slot->flags & NS_BUF_CHANGED) { /* buffer has changed, reload map */ netmap_reload_map(na, txq->ift_desc_tag, txq->ift_sds.ifsd_map[nic_i], addr); } /* make sure changes to the buffer are synced */ bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_sds.ifsd_map[nic_i], BUS_DMASYNC_PREWRITE); } slot->flags &= ~(NS_REPORT | NS_BUF_CHANGED); nm_i = nm_next(nm_i, lim); nic_i = nm_next(nic_i, lim); } kring->nr_hwcur = head; /* synchronize the NIC ring */ if (txq->ift_sds.ifsd_map) bus_dmamap_sync(txq->ift_desc_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* (re)start the tx unit up to slot nic_i (excluded) */ ctx->isc_txd_flush(ctx->ifc_softc, txq->ift_id, nic_i); } /* * Second part: reclaim buffers for completed transmissions.
*/ if (iflib_tx_credits_update(ctx, txq)) { /* some tx completed, increment avail */ nic_i = txq->ift_cidx_processed; kring->nr_hwtail = nm_prev(netmap_idx_n2k(kring, nic_i), lim); } return (0); } /* * Reconcile kernel and user view of the receive ring. * Same as for the txsync, this routine must be efficient. * The caller guarantees a single invocations, but races against * the rest of the driver should be handled here. * * On call, kring->rhead is the first packet that userspace wants * to keep, and kring->rcur is the wakeup point. * The kernel has previously reported packets up to kring->rtail. * * If (flags & NAF_FORCE_READ) also check for incoming packets irrespective * of whether or not we received an interrupt. */ static int iflib_netmap_rxsync(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; struct netmap_ring *ring = kring->ring; uint32_t nm_i; /* index into the netmap ring */ uint32_t nic_i; /* index into the NIC ring */ u_int i, n; u_int const lim = kring->nkr_num_slots - 1; u_int const head = netmap_idx_n2k(kring, kring->rhead); int force_update = (flags & NAF_FORCE_READ) || kring->nr_kflags & NKR_PENDINTR; struct if_rxd_info ri; struct ifnet *ifp = na->ifp; if_ctx_t ctx = ifp->if_softc; iflib_rxq_t rxq = &ctx->ifc_rxqs[kring->ring_id]; iflib_fl_t fl = rxq->ifr_fl; if (head > lim) return netmap_ring_reinit(kring); /* XXX check sync modes */ for (i = 0, fl = rxq->ifr_fl; i < rxq->ifr_nfl; i++, fl++) { if (fl->ifl_sds.ifsd_map == NULL) continue; bus_dmamap_sync(rxq->ifr_fl[i].ifl_desc_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); } /* * First part: import newly received packets. * * nm_i is the index of the next free slot in the netmap ring, * nic_i is the index of the next received packet in the NIC ring, * and they may differ in case if_init() has been called while * in netmap mode. For the receive ring we have * * nic_i = rxr->next_check; * nm_i = kring->nr_hwtail (previous) * and * nm_i == (nic_i + kring->nkr_hwofs) % ring_size * * rxr->next_check is set to 0 on a ring reinit */ if (netmap_no_pendintr || force_update) { int crclen = iflib_crcstrip ? 0 : 4; int error, avail; for (i = 0; i < rxq->ifr_nfl; i++) { fl = &rxq->ifr_fl[i]; nic_i = fl->ifl_cidx; nm_i = netmap_idx_n2k(kring, nic_i); avail = iflib_rxd_avail(ctx, rxq, nic_i, USHRT_MAX); for (n = 0; avail > 0; n++, avail--) { rxd_info_zero(&ri); ri.iri_frags = rxq->ifr_frags; ri.iri_qsidx = kring->ring_id; ri.iri_ifp = ctx->ifc_ifp; ri.iri_cidx = nic_i; error = ctx->isc_rxd_pkt_get(ctx->ifc_softc, &ri); ring->slot[nm_i].len = error ? 0 : ri.iri_len - crclen; ring->slot[nm_i].flags = 0; if (fl->ifl_sds.ifsd_map) bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_sds.ifsd_map[nic_i], BUS_DMASYNC_POSTREAD); nm_i = nm_next(nm_i, lim); nic_i = nm_next(nic_i, lim); } if (n) { /* update the state variables */ if (netmap_no_pendintr && !force_update) { /* diagnostics */ iflib_rx_miss ++; iflib_rx_miss_bufs += n; } fl->ifl_cidx = nic_i; kring->nr_hwtail = netmap_idx_k2n(kring, nm_i); } kring->nr_kflags &= ~NKR_PENDINTR; } } /* * Second part: skip past packets that userspace has released. * (kring->nr_hwcur to head excluded), * and make the buffers available for reception. 
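 * The refill itself is delegated to netmap_fl_refill() above (shared
 * with ring initialization), which deliberately stops one slot short
 * of head so that the ring can never be driven completely full.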
* As usual nm_i is the index in the netmap ring, * nic_i is the index in the NIC ring, and * nm_i == (nic_i + kring->nkr_hwofs) % ring_size */ /* XXX not sure how this will work with multiple free lists */ nm_i = netmap_idx_n2k(kring, kring->nr_hwcur); return (netmap_fl_refill(rxq, kring, nm_i, false)); } static void iflib_netmap_intr(struct netmap_adapter *na, int onoff) { struct ifnet *ifp = na->ifp; if_ctx_t ctx = ifp->if_softc; CTX_LOCK(ctx); if (onoff) { IFDI_INTR_ENABLE(ctx); } else { IFDI_INTR_DISABLE(ctx); } CTX_UNLOCK(ctx); } static int iflib_netmap_attach(if_ctx_t ctx) { struct netmap_adapter na; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; bzero(&na, sizeof(na)); na.ifp = ctx->ifc_ifp; na.na_flags = NAF_BDG_MAYSLEEP; MPASS(ctx->ifc_softc_ctx.isc_ntxqsets); MPASS(ctx->ifc_softc_ctx.isc_nrxqsets); na.num_tx_desc = scctx->isc_ntxd[0]; na.num_rx_desc = scctx->isc_nrxd[0]; na.nm_txsync = iflib_netmap_txsync; na.nm_rxsync = iflib_netmap_rxsync; na.nm_register = iflib_netmap_register; na.nm_intr = iflib_netmap_intr; na.num_tx_rings = ctx->ifc_softc_ctx.isc_ntxqsets; na.num_rx_rings = ctx->ifc_softc_ctx.isc_nrxqsets; return (netmap_attach(&na)); } static void iflib_netmap_txq_init(if_ctx_t ctx, iflib_txq_t txq) { struct netmap_adapter *na = NA(ctx->ifc_ifp); struct netmap_slot *slot; slot = netmap_reset(na, NR_TX, txq->ift_id, 0); if (slot == NULL) return; if (txq->ift_sds.ifsd_map == NULL) return; for (int i = 0; i < ctx->ifc_softc_ctx.isc_ntxd[0]; i++) { /* * In netmap mode, set the map for the packet buffer. * NOTE: Some drivers (not this one) also need to set * the physical buffer address in the NIC ring. * netmap_idx_n2k() maps a nic index, i, into the corresponding * netmap slot index, si */ int si = netmap_idx_n2k(&na->tx_rings[txq->ift_id], i); netmap_load_map(na, txq->ift_desc_tag, txq->ift_sds.ifsd_map[i], NMB(na, slot + si)); } } static void iflib_netmap_rxq_init(if_ctx_t ctx, iflib_rxq_t rxq) { struct netmap_adapter *na = NA(ctx->ifc_ifp); struct netmap_kring *kring = &na->rx_rings[rxq->ifr_id]; struct netmap_slot *slot; uint32_t nm_i; slot = netmap_reset(na, NR_RX, rxq->ifr_id, 0); if (slot == NULL) return; nm_i = netmap_idx_n2k(kring, 0); netmap_fl_refill(rxq, kring, nm_i, true); } #define iflib_netmap_detach(ifp) netmap_detach(ifp) #else #define iflib_netmap_txq_init(ctx, txq) #define iflib_netmap_rxq_init(ctx, rxq) #define iflib_netmap_detach(ifp) #define iflib_netmap_attach(ctx) (0) #define netmap_rx_irq(ifp, qid, budget) (0) #define netmap_tx_irq(ifp, qid) do {} while (0) #endif #if defined(__i386__) || defined(__amd64__) static __inline void prefetch(void *x) { __asm volatile("prefetcht0 %0" :: "m" (*(unsigned long *)x)); } static __inline void prefetch2cachelines(void *x) { __asm volatile("prefetcht0 %0" :: "m" (*(unsigned long *)x)); #if (CACHE_LINE_SIZE < 128) __asm volatile("prefetcht0 %0" :: "m" (*(((unsigned long *)x)+CACHE_LINE_SIZE/(sizeof(unsigned long))))); #endif } #else #define prefetch(x) #define prefetch2cachelines(x) #endif static void iru_init(if_rxd_update_t iru, iflib_rxq_t rxq, uint8_t flid) { iflib_fl_t fl; fl = &rxq->ifr_fl[flid]; iru->iru_paddrs = fl->ifl_bus_addrs; iru->iru_vaddrs = &fl->ifl_vm_addrs[0]; iru->iru_idxs = fl->ifl_rxd_idxs; iru->iru_qsidx = rxq->ifr_id; iru->iru_buf_size = fl->ifl_buf_size; iru->iru_flidx = fl->ifl_id; } static void _iflib_dmamap_cb(void *arg, bus_dma_segment_t *segs, int nseg, int err) { if (err) return; *(bus_addr_t *) arg = segs[0].ds_addr; } int iflib_dma_alloc(if_ctx_t ctx, int size, iflib_dma_info_t dma, int 
mapflags) { int err; if_shared_ctx_t sctx = ctx->ifc_sctx; device_t dev = ctx->ifc_dev; KASSERT(sctx->isc_q_align != 0, ("alignment value not initialized")); err = bus_dma_tag_create(bus_get_dma_tag(dev), /* parent */ sctx->isc_q_align, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ size, /* maxsize */ 1, /* nsegments */ size, /* maxsegsize */ BUS_DMA_ALLOCNOW, /* flags */ NULL, /* lockfunc */ NULL, /* lockarg */ &dma->idi_tag); if (err) { device_printf(dev, "%s: bus_dma_tag_create failed: %d\n", __func__, err); goto fail_0; } err = bus_dmamem_alloc(dma->idi_tag, (void**) &dma->idi_vaddr, BUS_DMA_NOWAIT | BUS_DMA_COHERENT | BUS_DMA_ZERO, &dma->idi_map); if (err) { device_printf(dev, "%s: bus_dmamem_alloc(%ju) failed: %d\n", __func__, (uintmax_t)size, err); goto fail_1; } dma->idi_paddr = IF_BAD_DMA; err = bus_dmamap_load(dma->idi_tag, dma->idi_map, dma->idi_vaddr, size, _iflib_dmamap_cb, &dma->idi_paddr, mapflags | BUS_DMA_NOWAIT); if (err || dma->idi_paddr == IF_BAD_DMA) { device_printf(dev, "%s: bus_dmamap_load failed: %d\n", __func__, err); goto fail_2; } dma->idi_size = size; return (0); fail_2: bus_dmamem_free(dma->idi_tag, dma->idi_vaddr, dma->idi_map); fail_1: bus_dma_tag_destroy(dma->idi_tag); fail_0: dma->idi_tag = NULL; return (err); } int iflib_dma_alloc_multi(if_ctx_t ctx, int *sizes, iflib_dma_info_t *dmalist, int mapflags, int count) { int i, err; iflib_dma_info_t *dmaiter; dmaiter = dmalist; for (i = 0; i < count; i++, dmaiter++) { if ((err = iflib_dma_alloc(ctx, sizes[i], *dmaiter, mapflags)) != 0) break; } if (err) iflib_dma_free_multi(dmalist, i); return (err); } void iflib_dma_free(iflib_dma_info_t dma) { if (dma->idi_tag == NULL) return; if (dma->idi_paddr != IF_BAD_DMA) { bus_dmamap_sync(dma->idi_tag, dma->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(dma->idi_tag, dma->idi_map); dma->idi_paddr = IF_BAD_DMA; } if (dma->idi_vaddr != NULL) { bus_dmamem_free(dma->idi_tag, dma->idi_vaddr, dma->idi_map); dma->idi_vaddr = NULL; } bus_dma_tag_destroy(dma->idi_tag); dma->idi_tag = NULL; } void iflib_dma_free_multi(iflib_dma_info_t *dmalist, int count) { int i; iflib_dma_info_t *dmaiter = dmalist; for (i = 0; i < count; i++, dmaiter++) iflib_dma_free(*dmaiter); } #ifdef EARLY_AP_STARTUP static const int iflib_started = 1; #else /* * We used to abuse the smp_started flag to decide if the queues have been * fully initialized (by late taskqgroup_adjust() calls in a SYSINIT()). * That gave bad races, since the SYSINIT() runs strictly after smp_started * is set. Run a SYSINIT() strictly after that to just set a usable * completion flag. 
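 * Registering the SYSINIT() below at SI_SUB_SMP + 1 is what guarantees
 * that the flag is only set once the late taskqgroup adjustments from
 * the SI_SUB_SMP stage have already run.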
*/ static int iflib_started; static void iflib_record_started(void *arg) { iflib_started = 1; } SYSINIT(iflib_record_started, SI_SUB_SMP + 1, SI_ORDER_FIRST, iflib_record_started, NULL); #endif static int iflib_fast_intr(void *arg) { iflib_filter_info_t info = arg; struct grouptask *gtask = info->ifi_task; if (!iflib_started) return (FILTER_HANDLED); DBG_COUNTER_INC(fast_intrs); if (info->ifi_filter != NULL && info->ifi_filter(info->ifi_filter_arg) == FILTER_HANDLED) return (FILTER_HANDLED); GROUPTASK_ENQUEUE(gtask); return (FILTER_HANDLED); } static int iflib_fast_intr_rxtx(void *arg) { iflib_filter_info_t info = arg; struct grouptask *gtask = info->ifi_task; iflib_rxq_t rxq = (iflib_rxq_t)info->ifi_ctx; if_ctx_t ctx; int i, cidx; if (!iflib_started) return (FILTER_HANDLED); DBG_COUNTER_INC(fast_intrs); if (info->ifi_filter != NULL && info->ifi_filter(info->ifi_filter_arg) == FILTER_HANDLED) return (FILTER_HANDLED); for (i = 0; i < rxq->ifr_ntxqirq; i++) { qidx_t txqid = rxq->ifr_txqid[i]; ctx = rxq->ifr_ctx; if (!ctx->isc_txd_credits_update(ctx->ifc_softc, txqid, false)) { IFDI_TX_QUEUE_INTR_ENABLE(ctx, txqid); continue; } GROUPTASK_ENQUEUE(&ctx->ifc_txqs[txqid].ift_task); } if (ctx->ifc_sctx->isc_flags & IFLIB_HAS_RXCQ) cidx = rxq->ifr_cq_cidx; else cidx = rxq->ifr_fl[0].ifl_cidx; if (iflib_rxd_avail(ctx, rxq, cidx, 1)) GROUPTASK_ENQUEUE(gtask); else IFDI_RX_QUEUE_INTR_ENABLE(ctx, rxq->ifr_id); return (FILTER_HANDLED); } static int iflib_fast_intr_ctx(void *arg) { iflib_filter_info_t info = arg; struct grouptask *gtask = info->ifi_task; if (!iflib_started) return (FILTER_HANDLED); DBG_COUNTER_INC(fast_intrs); if (info->ifi_filter != NULL && info->ifi_filter(info->ifi_filter_arg) == FILTER_HANDLED) return (FILTER_HANDLED); GROUPTASK_ENQUEUE(gtask); return (FILTER_HANDLED); } static int _iflib_irq_alloc(if_ctx_t ctx, if_irq_t irq, int rid, driver_filter_t filter, driver_intr_t handler, void *arg, char *name) { int rc, flags; struct resource *res; void *tag = NULL; device_t dev = ctx->ifc_dev; flags = RF_ACTIVE; if (ctx->ifc_flags & IFC_LEGACY) flags |= RF_SHAREABLE; MPASS(rid < 512); irq->ii_rid = rid; res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &irq->ii_rid, flags); if (res == NULL) { device_printf(dev, "failed to allocate IRQ for rid %d, name %s.\n", rid, name); return (ENOMEM); } irq->ii_res = res; KASSERT(filter == NULL || handler == NULL, ("filter and handler can't both be non-NULL")); rc = bus_setup_intr(dev, res, INTR_MPSAFE | INTR_TYPE_NET, filter, handler, arg, &tag); if (rc != 0) { device_printf(dev, "failed to setup interrupt for rid %d, name %s: %d\n", rid, name ? name : "unknown", rc); return (rc); } else if (name) bus_describe_intr(dev, res, tag, "%s", name); irq->ii_tag = tag; return (0); } /********************************************************************* * * Allocate memory for tx_buffer structures. The tx_buffer stores all * the information needed to transmit a packet on the wire. This is * called only once at attach, setup is done every reset. 
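 *
 * Each queue gets two DMA tags below: ift_desc_tag, sized from
 * isc_tx_maxsize/isc_tx_nsegments for ordinary frames, and
 * ift_tso_desc_tag, sized from isc_tx_tso_size_max and
 * isc_tx_tso_segments_max for TSO bursts. iflib_encap() later selects
 * between them per packet, roughly:
 *
 *	desc_tag = (m->m_pkthdr.csum_flags & CSUM_TSO) ?
 *	    txq->ift_tso_desc_tag : txq->ift_desc_tag;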
* **********************************************************************/ static int iflib_txsd_alloc(iflib_txq_t txq) { if_ctx_t ctx = txq->ift_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = ctx->ifc_dev; int err, nsegments, ntsosegments; nsegments = scctx->isc_tx_nsegments; ntsosegments = scctx->isc_tx_tso_segments_max; MPASS(scctx->isc_ntxd[0] > 0); MPASS(scctx->isc_ntxd[txq->ift_br_offset] > 0); MPASS(nsegments > 0); MPASS(ntsosegments > 0); /* * Setup DMA descriptor areas. */ if ((err = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ sctx->isc_tx_maxsize, /* maxsize */ nsegments, /* nsegments */ sctx->isc_tx_maxsegsize, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txq->ift_desc_tag))) { device_printf(dev,"Unable to allocate TX DMA tag: %d\n", err); device_printf(dev,"maxsize: %ju nsegments: %d maxsegsize: %ju\n", (uintmax_t)sctx->isc_tx_maxsize, nsegments, (uintmax_t)sctx->isc_tx_maxsegsize); goto fail; } if ((err = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ scctx->isc_tx_tso_size_max, /* maxsize */ ntsosegments, /* nsegments */ scctx->isc_tx_tso_segsize_max, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txq->ift_tso_desc_tag))) { device_printf(dev,"Unable to allocate TX TSO DMA tag: %d\n", err); goto fail; } if (!(txq->ift_sds.ifsd_flags = (uint8_t *) malloc(sizeof(uint8_t) * scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); err = ENOMEM; goto fail; } if (!(txq->ift_sds.ifsd_m = (struct mbuf **) malloc(sizeof(struct mbuf *) * scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); err = ENOMEM; goto fail; } /* Create the descriptor buffer dma maps */ #if defined(ACPI_DMAR) || (! 
(defined(__i386__) || defined(__amd64__))) if ((ctx->ifc_flags & IFC_DMAR) == 0) return (0); if (!(txq->ift_sds.ifsd_map = (bus_dmamap_t *) malloc(sizeof(bus_dmamap_t) * scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer map memory\n"); err = ENOMEM; goto fail; } for (int i = 0; i < scctx->isc_ntxd[txq->ift_br_offset]; i++) { err = bus_dmamap_create(txq->ift_desc_tag, 0, &txq->ift_sds.ifsd_map[i]); if (err != 0) { device_printf(dev, "Unable to create TX DMA map\n"); goto fail; } } #endif return (0); fail: /* We free all, it handles case where we are in the middle */ iflib_tx_structures_free(ctx); return (err); } static void iflib_txsd_destroy(if_ctx_t ctx, iflib_txq_t txq, int i) { bus_dmamap_t map; map = NULL; if (txq->ift_sds.ifsd_map != NULL) map = txq->ift_sds.ifsd_map[i]; if (map != NULL) { bus_dmamap_unload(txq->ift_desc_tag, map); bus_dmamap_destroy(txq->ift_desc_tag, map); txq->ift_sds.ifsd_map[i] = NULL; } } static void iflib_txq_destroy(iflib_txq_t txq) { if_ctx_t ctx = txq->ift_ctx; for (int i = 0; i < txq->ift_size; i++) iflib_txsd_destroy(ctx, txq, i); if (txq->ift_sds.ifsd_map != NULL) { free(txq->ift_sds.ifsd_map, M_IFLIB); txq->ift_sds.ifsd_map = NULL; } if (txq->ift_sds.ifsd_m != NULL) { free(txq->ift_sds.ifsd_m, M_IFLIB); txq->ift_sds.ifsd_m = NULL; } if (txq->ift_sds.ifsd_flags != NULL) { free(txq->ift_sds.ifsd_flags, M_IFLIB); txq->ift_sds.ifsd_flags = NULL; } if (txq->ift_desc_tag != NULL) { bus_dma_tag_destroy(txq->ift_desc_tag); txq->ift_desc_tag = NULL; } if (txq->ift_tso_desc_tag != NULL) { bus_dma_tag_destroy(txq->ift_tso_desc_tag); txq->ift_tso_desc_tag = NULL; } } static void iflib_txsd_free(if_ctx_t ctx, iflib_txq_t txq, int i) { struct mbuf **mp; mp = &txq->ift_sds.ifsd_m[i]; if (*mp == NULL) return; if (txq->ift_sds.ifsd_map != NULL) { bus_dmamap_sync(txq->ift_desc_tag, txq->ift_sds.ifsd_map[i], BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txq->ift_desc_tag, txq->ift_sds.ifsd_map[i]); } m_free(*mp); DBG_COUNTER_INC(tx_frees); *mp = NULL; } static int iflib_txq_setup(iflib_txq_t txq) { if_ctx_t ctx = txq->ift_ctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; iflib_dma_info_t di; int i; /* Set number of descriptors available */ txq->ift_qstatus = IFLIB_QUEUE_IDLE; /* XXX make configurable */ txq->ift_update_freq = IFLIB_DEFAULT_TX_UPDATE_FREQ; /* Reset indices */ txq->ift_cidx_processed = 0; txq->ift_pidx = txq->ift_cidx = txq->ift_npending = 0; txq->ift_size = scctx->isc_ntxd[txq->ift_br_offset]; for (i = 0, di = txq->ift_ifdi; i < ctx->ifc_nhwtxqs; i++, di++) bzero((void *)di->idi_vaddr, di->idi_size); IFDI_TXQ_SETUP(ctx, txq->ift_id); for (i = 0, di = txq->ift_ifdi; i < ctx->ifc_nhwtxqs; i++, di++) bus_dmamap_sync(di->idi_tag, di->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); return (0); } /********************************************************************* * * Allocate memory for rx_buffer structures. Since we use one * rx_buffer per received packet, the maximum number of rx_buffer's * that we'll need is equal to the number of receive descriptors * that we've allocated. 
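 *
 * Note that a receive queue may be backed by more than one free list
 * (ifr_nfl); the allocation below is repeated per free list, each
 * with its own DMA tag and software descriptor arrays.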
* **********************************************************************/ static int iflib_rxsd_alloc(iflib_rxq_t rxq) { if_ctx_t ctx = rxq->ifr_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = ctx->ifc_dev; iflib_fl_t fl; int err; MPASS(scctx->isc_nrxd[0] > 0); MPASS(scctx->isc_nrxd[rxq->ifr_fl_offset] > 0); fl = rxq->ifr_fl; for (int i = 0; i < rxq->ifr_nfl; i++, fl++) { fl->ifl_size = scctx->isc_nrxd[rxq->ifr_fl_offset]; /* this isn't necessarily the same */ err = bus_dma_tag_create(bus_get_dma_tag(dev), /* parent */ 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ sctx->isc_rx_maxsize, /* maxsize */ sctx->isc_rx_nsegments, /* nsegments */ sctx->isc_rx_maxsegsize, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockarg */ &fl->ifl_desc_tag); if (err) { device_printf(dev, "%s: bus_dma_tag_create failed %d\n", __func__, err); goto fail; } if (!(fl->ifl_sds.ifsd_flags = (uint8_t *) malloc(sizeof(uint8_t) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); err = ENOMEM; goto fail; } if (!(fl->ifl_sds.ifsd_m = (struct mbuf **) malloc(sizeof(struct mbuf *) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); err = ENOMEM; goto fail; } if (!(fl->ifl_sds.ifsd_cl = (caddr_t *) malloc(sizeof(caddr_t) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); err = ENOMEM; goto fail; } /* Create the descriptor buffer dma maps */ #if defined(ACPI_DMAR) || (! (defined(__i386__) || defined(__amd64__))) if ((ctx->ifc_flags & IFC_DMAR) == 0) continue; if (!(fl->ifl_sds.ifsd_map = (bus_dmamap_t *) malloc(sizeof(bus_dmamap_t) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer map memory\n"); err = ENOMEM; goto fail; } for (int i = 0; i < scctx->isc_nrxd[rxq->ifr_fl_offset]; i++) { err = bus_dmamap_create(fl->ifl_desc_tag, 0, &fl->ifl_sds.ifsd_map[i]); if (err != 0) { device_printf(dev, "Unable to create RX buffer DMA map\n"); goto fail; } } #endif } return (0); fail: iflib_rx_structures_free(ctx); return (err); } /* * Internal service routines */ struct rxq_refill_cb_arg { int error; bus_dma_segment_t seg; int nseg; }; static void _rxq_refill_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error) { struct rxq_refill_cb_arg *cb_arg = arg; cb_arg->error = error; cb_arg->seg = segs[0]; cb_arg->nseg = nseg; } #ifdef ACPI_DMAR #define IS_DMAR(ctx) (ctx->ifc_flags & IFC_DMAR) #else #define IS_DMAR(ctx) (0) #endif /** * rxq_refill - refill an rxq free-buffer list * @ctx: the iflib context * @rxq: the free-list to refill * @n: the number of new buffers to allocate * * (Re)populate an rxq free-buffer list with up to @n new packet buffers. * The caller must assure that @n does not exceed the queue's capacity. 
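 *
 * Callers normally go through __iflib_fl_refill_lt(), e.g.:
 *
 *	__iflib_fl_refill_lt(ctx, fl, budget + 8);
 *
 * which clamps the count to ifl_size - ifl_credits - 1 so that pidx
 * can never catch up with cidx.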
*/ static void _iflib_fl_refill(if_ctx_t ctx, iflib_fl_t fl, int count) { struct mbuf *m; int idx, frag_idx = fl->ifl_fragidx; int pidx = fl->ifl_pidx; caddr_t cl, *sd_cl; struct mbuf **sd_m; uint8_t *sd_flags; struct if_rxd_update iru; bus_dmamap_t *sd_map; int n, i = 0; uint64_t bus_addr; int err; qidx_t credits; sd_m = fl->ifl_sds.ifsd_m; sd_map = fl->ifl_sds.ifsd_map; sd_cl = fl->ifl_sds.ifsd_cl; sd_flags = fl->ifl_sds.ifsd_flags; idx = pidx; credits = fl->ifl_credits; n = count; MPASS(n > 0); MPASS(credits + n <= fl->ifl_size); if (pidx < fl->ifl_cidx) MPASS(pidx + n <= fl->ifl_cidx); if (pidx == fl->ifl_cidx && (credits < fl->ifl_size)) MPASS(fl->ifl_gen == 0); if (pidx > fl->ifl_cidx) MPASS(n <= fl->ifl_size - pidx + fl->ifl_cidx); DBG_COUNTER_INC(fl_refills); if (n > 8) DBG_COUNTER_INC(fl_refills_large); iru_init(&iru, fl->ifl_rxq, fl->ifl_id); while (n--) { /* * We allocate an uninitialized mbuf + cluster, mbuf is * initialized after rx. * * If the cluster is still set then we know a minimum sized packet was received */ bit_ffc_at(fl->ifl_rx_bitmap, frag_idx, fl->ifl_size, &frag_idx); if ((frag_idx < 0) || (frag_idx >= fl->ifl_size)) bit_ffc(fl->ifl_rx_bitmap, fl->ifl_size, &frag_idx); if ((cl = sd_cl[frag_idx]) == NULL) { if ((cl = sd_cl[frag_idx] = m_cljget(NULL, M_NOWAIT, fl->ifl_buf_size)) == NULL) break; #if MEMORY_LOGGING fl->ifl_cl_enqueued++; #endif } if ((m = m_gethdr(M_NOWAIT, MT_NOINIT)) == NULL) { break; } #if MEMORY_LOGGING fl->ifl_m_enqueued++; #endif DBG_COUNTER_INC(rx_allocs); #if defined(__i386__) || defined(__amd64__) if (!IS_DMAR(ctx)) { bus_addr = pmap_kextract((vm_offset_t)cl); } else #endif { struct rxq_refill_cb_arg cb_arg; iflib_rxq_t q; cb_arg.error = 0; q = fl->ifl_rxq; MPASS(sd_map != NULL); MPASS(sd_map[frag_idx] != NULL); err = bus_dmamap_load(fl->ifl_desc_tag, sd_map[frag_idx], cl, fl->ifl_buf_size, _rxq_refill_cb, &cb_arg, 0); bus_dmamap_sync(fl->ifl_desc_tag, sd_map[frag_idx], BUS_DMASYNC_PREREAD); if (err != 0 || cb_arg.error) { /* * !zone_pack ? 
*/ if (fl->ifl_zone == zone_pack) uma_zfree(fl->ifl_zone, cl); m_free(m); n = 0; goto done; } bus_addr = cb_arg.seg.ds_addr; } bit_set(fl->ifl_rx_bitmap, frag_idx); sd_flags[frag_idx] |= RX_SW_DESC_INUSE; MPASS(sd_m[frag_idx] == NULL); sd_cl[frag_idx] = cl; sd_m[frag_idx] = m; fl->ifl_rxd_idxs[i] = frag_idx; fl->ifl_bus_addrs[i] = bus_addr; fl->ifl_vm_addrs[i] = cl; credits++; i++; MPASS(credits <= fl->ifl_size); if (++idx == fl->ifl_size) { fl->ifl_gen = 1; idx = 0; } if (n == 0 || i == IFLIB_MAX_RX_REFRESH) { iru.iru_pidx = pidx; iru.iru_count = i; ctx->isc_rxd_refill(ctx->ifc_softc, &iru); i = 0; pidx = idx; fl->ifl_pidx = idx; fl->ifl_credits = credits; } } done: if (i) { iru.iru_pidx = pidx; iru.iru_count = i; ctx->isc_rxd_refill(ctx->ifc_softc, &iru); fl->ifl_pidx = idx; fl->ifl_credits = credits; } DBG_COUNTER_INC(rxd_flush); if (fl->ifl_pidx == 0) pidx = fl->ifl_size - 1; else pidx = fl->ifl_pidx - 1; if (sd_map) bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); ctx->isc_rxd_flush(ctx->ifc_softc, fl->ifl_rxq->ifr_id, fl->ifl_id, pidx); fl->ifl_fragidx = frag_idx; } static __inline void __iflib_fl_refill_lt(if_ctx_t ctx, iflib_fl_t fl, int max) { /* we avoid allowing pidx to catch up with cidx as it confuses ixl */ int32_t reclaimable = fl->ifl_size - fl->ifl_credits - 1; #ifdef INVARIANTS int32_t delta = fl->ifl_size - get_inuse(fl->ifl_size, fl->ifl_cidx, fl->ifl_pidx, fl->ifl_gen) - 1; #endif MPASS(fl->ifl_credits <= fl->ifl_size); MPASS(reclaimable == delta); if (reclaimable > 0) _iflib_fl_refill(ctx, fl, min(max, reclaimable)); } static void iflib_fl_bufs_free(iflib_fl_t fl) { iflib_dma_info_t idi = fl->ifl_ifdi; uint32_t i; for (i = 0; i < fl->ifl_size; i++) { struct mbuf **sd_m = &fl->ifl_sds.ifsd_m[i]; uint8_t *sd_flags = &fl->ifl_sds.ifsd_flags[i]; caddr_t *sd_cl = &fl->ifl_sds.ifsd_cl[i]; if (*sd_flags & RX_SW_DESC_INUSE) { if (fl->ifl_sds.ifsd_map != NULL) { bus_dmamap_t sd_map = fl->ifl_sds.ifsd_map[i]; bus_dmamap_unload(fl->ifl_desc_tag, sd_map); if (fl->ifl_rxq->ifr_ctx->ifc_in_detach) bus_dmamap_destroy(fl->ifl_desc_tag, sd_map); } if (*sd_m != NULL) { m_init(*sd_m, M_NOWAIT, MT_DATA, 0); uma_zfree(zone_mbuf, *sd_m); } if (*sd_cl != NULL) uma_zfree(fl->ifl_zone, *sd_cl); *sd_flags = 0; } else { MPASS(*sd_cl == NULL); MPASS(*sd_m == NULL); } #if MEMORY_LOGGING fl->ifl_m_dequeued++; fl->ifl_cl_dequeued++; #endif *sd_cl = NULL; *sd_m = NULL; } #ifdef INVARIANTS for (i = 0; i < fl->ifl_size; i++) { MPASS(fl->ifl_sds.ifsd_flags[i] == 0); MPASS(fl->ifl_sds.ifsd_cl[i] == NULL); MPASS(fl->ifl_sds.ifsd_m[i] == NULL); } #endif /* * Reset free list values */ fl->ifl_credits = fl->ifl_cidx = fl->ifl_pidx = fl->ifl_gen = fl->ifl_fragidx = 0; bzero(idi->idi_vaddr, idi->idi_size); } /********************************************************************* * * Initialize a receive ring and its buffers. 
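 *
 * The cluster size is derived from the largest frame the hardware
 * will see, roughly:
 *
 *	<= 2048 -> MCLBYTES		<= 4096 -> MJUMPAGESIZE
 *	<= 9216 -> MJUM9BYTES		else	-> MJUM16BYTES
 *
 * (subject to CONTIGMALLOC_WORKS below).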
 *
 **********************************************************************/
static int
iflib_fl_setup(iflib_fl_t fl)
{
	iflib_rxq_t rxq = fl->ifl_rxq;
	if_ctx_t ctx = rxq->ifr_ctx;
	if_softc_ctx_t sctx = &ctx->ifc_softc_ctx;

	bit_nclear(fl->ifl_rx_bitmap, 0, fl->ifl_size - 1);
	/*
	** Free current RX buffer structs and their mbufs
	*/
	iflib_fl_bufs_free(fl);
	/* Now replenish the mbufs */
	MPASS(fl->ifl_credits == 0);
	/*
	 * XXX don't set the max_frame_size to larger
	 * than the hardware can handle
	 */
	if (sctx->isc_max_frame_size <= 2048)
		fl->ifl_buf_size = MCLBYTES;
#ifndef CONTIGMALLOC_WORKS
	else
		fl->ifl_buf_size = MJUMPAGESIZE;
#else
	else if (sctx->isc_max_frame_size <= 4096)
		fl->ifl_buf_size = MJUMPAGESIZE;
	else if (sctx->isc_max_frame_size <= 9216)
		fl->ifl_buf_size = MJUM9BYTES;
	else
		fl->ifl_buf_size = MJUM16BYTES;
#endif
	if (fl->ifl_buf_size > ctx->ifc_max_fl_buf_size)
		ctx->ifc_max_fl_buf_size = fl->ifl_buf_size;
	fl->ifl_cltype = m_gettype(fl->ifl_buf_size);
	fl->ifl_zone = m_getzone(fl->ifl_buf_size);

	/*
	 * Avoid pre-allocating zillions of clusters to an idle card,
	 * potentially speeding up attach.
	 */
	_iflib_fl_refill(ctx, fl, min(128, fl->ifl_size));
	MPASS(min(128, fl->ifl_size) == fl->ifl_credits);
	if (min(128, fl->ifl_size) != fl->ifl_credits)
		return (ENOBUFS);
	/*
	 * handle failure
	 */
	MPASS(rxq != NULL);
	MPASS(fl->ifl_ifdi != NULL);
	bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map,
	    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
	return (0);
}

/*********************************************************************
 *
 *  Free receive ring data structures
 *
 **********************************************************************/
static void
iflib_rx_sds_free(iflib_rxq_t rxq)
{
	iflib_fl_t fl;
	int i;

	if (rxq->ifr_fl != NULL) {
		for (i = 0; i < rxq->ifr_nfl; i++) {
			fl = &rxq->ifr_fl[i];
			if (fl->ifl_desc_tag != NULL) {
				bus_dma_tag_destroy(fl->ifl_desc_tag);
				fl->ifl_desc_tag = NULL;
			}
			free(fl->ifl_sds.ifsd_m, M_IFLIB);
			free(fl->ifl_sds.ifsd_cl, M_IFLIB);
			/* XXX destroy maps first */
			free(fl->ifl_sds.ifsd_map, M_IFLIB);
			fl->ifl_sds.ifsd_m = NULL;
			fl->ifl_sds.ifsd_cl = NULL;
			fl->ifl_sds.ifsd_map = NULL;
		}
		free(rxq->ifr_fl, M_IFLIB);
		rxq->ifr_fl = NULL;
		rxq->ifr_cq_gen = rxq->ifr_cq_cidx = rxq->ifr_cq_pidx = 0;
	}
}

/*
 * MI independent logic
 *
 */
static void
iflib_timer(void *arg)
{
	iflib_txq_t txq = arg;
	if_ctx_t ctx = txq->ift_ctx;
	if_softc_ctx_t sctx = &ctx->ifc_softc_ctx;

	if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))
		return;
	/*
	** Check on the state of the TX queue(s), this
	** can be done without the lock because it's RO
	** and the HUNG state will be static if set.
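	**
	** A queue is marked IFLIB_QUEUE_HUNG when its mp ring stalls;
	** the watchdog below only fires if, in addition, either no
	** descriptors have been cleaned since the last tick or no
	** pause frames were seen.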
*/ IFDI_TIMER(ctx, txq->ift_id); if ((txq->ift_qstatus == IFLIB_QUEUE_HUNG) && ((txq->ift_cleaned_prev == txq->ift_cleaned) || (sctx->isc_pause_frames == 0))) goto hung; if (ifmp_ring_is_stalled(txq->ift_br)) txq->ift_qstatus = IFLIB_QUEUE_HUNG; txq->ift_cleaned_prev = txq->ift_cleaned; /* handle any laggards */ if (txq->ift_db_pending) GROUPTASK_ENQUEUE(&txq->ift_task); sctx->isc_pause_frames = 0; if (if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING) callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq, txq->ift_timer.c_cpu); return; -hung: - CTX_LOCK(ctx); - if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING); + hung: device_printf(ctx->ifc_dev, "TX(%d) desc avail = %d, pidx = %d\n", txq->ift_id, TXQ_AVAIL(txq), txq->ift_pidx); - - IFDI_WATCHDOG_RESET(ctx); - ctx->ifc_watchdog_events++; - - ctx->ifc_flags |= IFC_DO_RESET; + STATE_LOCK(ctx); + if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING); + ctx->ifc_flags |= (IFC_DO_WATCHDOG|IFC_DO_RESET); iflib_admin_intr_deferred(ctx); - CTX_UNLOCK(ctx); + STATE_UNLOCK(ctx); } static void iflib_init_locked(if_ctx_t ctx) { if_softc_ctx_t sctx = &ctx->ifc_softc_ctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; if_t ifp = ctx->ifc_ifp; iflib_fl_t fl; iflib_txq_t txq; iflib_rxq_t rxq; int i, j, tx_ip_csum_flags, tx_ip6_csum_flags; if_setdrvflagbits(ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING); IFDI_INTR_DISABLE(ctx); tx_ip_csum_flags = scctx->isc_tx_csum_flags & (CSUM_IP | CSUM_TCP | CSUM_UDP | CSUM_SCTP); tx_ip6_csum_flags = scctx->isc_tx_csum_flags & (CSUM_IP6_TCP | CSUM_IP6_UDP | CSUM_IP6_SCTP); /* Set hardware offload abilities */ if_clearhwassist(ifp); if (if_getcapenable(ifp) & IFCAP_TXCSUM) if_sethwassistbits(ifp, tx_ip_csum_flags, 0); if (if_getcapenable(ifp) & IFCAP_TXCSUM_IPV6) if_sethwassistbits(ifp, tx_ip6_csum_flags, 0); if (if_getcapenable(ifp) & IFCAP_TSO4) if_sethwassistbits(ifp, CSUM_IP_TSO, 0); if (if_getcapenable(ifp) & IFCAP_TSO6) if_sethwassistbits(ifp, CSUM_IP6_TSO, 0); for (i = 0, txq = ctx->ifc_txqs; i < sctx->isc_ntxqsets; i++, txq++) { CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); CALLOUT_UNLOCK(txq); iflib_netmap_txq_init(ctx, txq); } #ifdef INVARIANTS i = if_getdrvflags(ifp); #endif IFDI_INIT(ctx); MPASS(if_getdrvflags(ifp) == i); for (i = 0, rxq = ctx->ifc_rxqs; i < sctx->isc_nrxqsets; i++, rxq++) { /* XXX this should really be done on a per-queue basis */ if (if_getcapenable(ifp) & IFCAP_NETMAP) { MPASS(rxq->ifr_id == i); iflib_netmap_rxq_init(ctx, rxq); continue; } for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) { if (iflib_fl_setup(fl)) { device_printf(ctx->ifc_dev, "freelist setup failed - check cluster settings\n"); goto done; } } } done: if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_RUNNING, IFF_DRV_OACTIVE); IFDI_INTR_ENABLE(ctx); txq = ctx->ifc_txqs; for (i = 0; i < sctx->isc_ntxqsets; i++, txq++) callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq, txq->ift_timer.c_cpu); } static int iflib_media_change(if_t ifp) { if_ctx_t ctx = if_getsoftc(ifp); int err; CTX_LOCK(ctx); if ((err = IFDI_MEDIA_CHANGE(ctx)) == 0) iflib_init_locked(ctx); CTX_UNLOCK(ctx); return (err); } static void iflib_media_status(if_t ifp, struct ifmediareq *ifmr) { if_ctx_t ctx = if_getsoftc(ifp); CTX_LOCK(ctx); IFDI_UPDATE_ADMIN_STATUS(ctx); IFDI_MEDIA_STATUS(ctx, ifmr); CTX_UNLOCK(ctx); } static void iflib_stop(if_ctx_t ctx) { iflib_txq_t txq = ctx->ifc_txqs; iflib_rxq_t rxq = ctx->ifc_rxqs; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; iflib_dma_info_t di; iflib_fl_t fl; int i, j; /* Tell the stack that the 
interface is no longer active */ if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING); IFDI_INTR_DISABLE(ctx); DELAY(1000); IFDI_STOP(ctx); DELAY(1000); iflib_debug_reset(); /* Wait for current tx queue users to exit to disarm watchdog timer. */ for (i = 0; i < scctx->isc_ntxqsets; i++, txq++) { /* make sure all transmitters have completed before proceeding XXX */ CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); CALLOUT_UNLOCK(txq); /* clean any enqueued buffers */ iflib_ifmp_purge(txq); /* Free any existing tx buffers. */ for (j = 0; j < txq->ift_size; j++) { iflib_txsd_free(ctx, txq, j); } txq->ift_processed = txq->ift_cleaned = txq->ift_cidx_processed = 0; txq->ift_in_use = txq->ift_gen = txq->ift_cidx = txq->ift_pidx = txq->ift_no_desc_avail = 0; txq->ift_closed = txq->ift_mbuf_defrag = txq->ift_mbuf_defrag_failed = 0; txq->ift_no_tx_dma_setup = txq->ift_txd_encap_efbig = txq->ift_map_failed = 0; txq->ift_pullups = 0; ifmp_ring_reset_stats(txq->ift_br); for (j = 0, di = txq->ift_ifdi; j < ctx->ifc_nhwtxqs; j++, di++) bzero((void *)di->idi_vaddr, di->idi_size); } for (i = 0; i < scctx->isc_nrxqsets; i++, rxq++) { /* make sure all transmitters have completed before proceeding XXX */ for (j = 0, di = txq->ift_ifdi; j < ctx->ifc_nhwrxqs; j++, di++) bzero((void *)di->idi_vaddr, di->idi_size); /* also resets the free lists pidx/cidx */ for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) iflib_fl_bufs_free(fl); } } static inline caddr_t calc_next_rxd(iflib_fl_t fl, int cidx) { qidx_t size; int nrxd; caddr_t start, end, cur, next; nrxd = fl->ifl_size; size = fl->ifl_rxd_size; start = fl->ifl_ifdi->idi_vaddr; if (__predict_false(size == 0)) return (start); cur = start + size*cidx; end = start + size*nrxd; next = CACHE_PTR_NEXT(cur); return (next < end ? 
next : start); } static inline void prefetch_pkts(iflib_fl_t fl, int cidx) { int nextptr; int nrxd = fl->ifl_size; caddr_t next_rxd; nextptr = (cidx + CACHE_PTR_INCREMENT) & (nrxd-1); prefetch(&fl->ifl_sds.ifsd_m[nextptr]); prefetch(&fl->ifl_sds.ifsd_cl[nextptr]); next_rxd = calc_next_rxd(fl, cidx); prefetch(next_rxd); prefetch(fl->ifl_sds.ifsd_m[(cidx + 1) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_m[(cidx + 2) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_m[(cidx + 3) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_m[(cidx + 4) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 1) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 2) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 3) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 4) & (nrxd-1)]); } static void rxd_frag_to_sd(iflib_rxq_t rxq, if_rxd_frag_t irf, int unload, if_rxsd_t sd) { int flid, cidx; bus_dmamap_t map; iflib_fl_t fl; iflib_dma_info_t di; int next; map = NULL; flid = irf->irf_flid; cidx = irf->irf_idx; fl = &rxq->ifr_fl[flid]; sd->ifsd_fl = fl; sd->ifsd_cidx = cidx; sd->ifsd_m = &fl->ifl_sds.ifsd_m[cidx]; sd->ifsd_cl = &fl->ifl_sds.ifsd_cl[cidx]; fl->ifl_credits--; #if MEMORY_LOGGING fl->ifl_m_dequeued++; #endif if (rxq->ifr_ctx->ifc_flags & IFC_PREFETCH) prefetch_pkts(fl, cidx); if (fl->ifl_sds.ifsd_map != NULL) { next = (cidx + CACHE_PTR_INCREMENT) & (fl->ifl_size-1); prefetch(&fl->ifl_sds.ifsd_map[next]); map = fl->ifl_sds.ifsd_map[cidx]; di = fl->ifl_ifdi; next = (cidx + CACHE_LINE_SIZE) & (fl->ifl_size-1); prefetch(&fl->ifl_sds.ifsd_flags[next]); bus_dmamap_sync(di->idi_tag, di->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); /* not valid assert if bxe really does SGE from non-contiguous elements */ MPASS(fl->ifl_cidx == cidx); if (unload) bus_dmamap_unload(fl->ifl_desc_tag, map); } fl->ifl_cidx = (fl->ifl_cidx + 1) & (fl->ifl_size-1); if (__predict_false(fl->ifl_cidx == 0)) fl->ifl_gen = 0; if (map != NULL) bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); bit_clear(fl->ifl_rx_bitmap, cidx); } static struct mbuf * assemble_segments(iflib_rxq_t rxq, if_rxd_info_t ri, if_rxsd_t sd) { int i, padlen , flags; struct mbuf *m, *mh, *mt; caddr_t cl; i = 0; mh = NULL; do { rxd_frag_to_sd(rxq, &ri->iri_frags[i], TRUE, sd); MPASS(*sd->ifsd_cl != NULL); MPASS(*sd->ifsd_m != NULL); /* Don't include zero-length frags */ if (ri->iri_frags[i].irf_len == 0) { /* XXX we can save the cluster here, but not the mbuf */ m_init(*sd->ifsd_m, M_NOWAIT, MT_DATA, 0); m_free(*sd->ifsd_m); *sd->ifsd_m = NULL; continue; } m = *sd->ifsd_m; *sd->ifsd_m = NULL; if (mh == NULL) { flags = M_PKTHDR|M_EXT; mh = mt = m; padlen = ri->iri_pad; } else { flags = M_EXT; mt->m_next = m; mt = m; /* assuming padding is only on the first fragment */ padlen = 0; } cl = *sd->ifsd_cl; *sd->ifsd_cl = NULL; /* Can these two be made one ? */ m_init(m, M_NOWAIT, MT_DATA, flags); m_cljset(m, cl, sd->ifsd_fl->ifl_cltype); /* * These must follow m_init and m_cljset */ m->m_data += padlen; ri->iri_len -= padlen; m->m_len = ri->iri_frags[i].irf_len; } while (++i < ri->iri_nfrags); return (mh); } /* * Process one software descriptor */ static struct mbuf * iflib_rxd_pkt_get(iflib_rxq_t rxq, if_rxd_info_t ri) { struct if_rxsd sd; struct mbuf *m; /* should I merge this back in now that the two paths are basically duplicated? 
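 *
 * The two paths: single-fragment packets no larger than
 * MIN(IFLIB_RX_COPY_THRESH, MHLEN) are copied straight into the mbuf
 * data area, while everything else is assembled into a cluster-backed
 * chain by assemble_segments().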
*/ if (ri->iri_nfrags == 1 && ri->iri_frags[0].irf_len <= MIN(IFLIB_RX_COPY_THRESH, MHLEN)) { rxd_frag_to_sd(rxq, &ri->iri_frags[0], FALSE, &sd); m = *sd.ifsd_m; *sd.ifsd_m = NULL; m_init(m, M_NOWAIT, MT_DATA, M_PKTHDR); #ifndef __NO_STRICT_ALIGNMENT if (!IP_ALIGNED(m)) m->m_data += 2; #endif memcpy(m->m_data, *sd.ifsd_cl, ri->iri_len); m->m_len = ri->iri_frags[0].irf_len; } else { m = assemble_segments(rxq, ri, &sd); } m->m_pkthdr.len = ri->iri_len; m->m_pkthdr.rcvif = ri->iri_ifp; m->m_flags |= ri->iri_flags; m->m_pkthdr.ether_vtag = ri->iri_vtag; m->m_pkthdr.flowid = ri->iri_flowid; M_HASHTYPE_SET(m, ri->iri_rsstype); m->m_pkthdr.csum_flags = ri->iri_csum_flags; m->m_pkthdr.csum_data = ri->iri_csum_data; return (m); } #if defined(INET6) || defined(INET) static void iflib_get_ip_forwarding(struct lro_ctrl *lc, bool *v4, bool *v6) { CURVNET_SET(lc->ifp->if_vnet); #if defined(INET6) *v6 = VNET(ip6_forwarding); #endif #if defined(INET) *v4 = VNET(ipforwarding); #endif CURVNET_RESTORE(); } /* * Returns true if it's possible this packet could be LROed. * if it returns false, it is guaranteed that tcp_lro_rx() * would not return zero. */ static bool iflib_check_lro_possible(struct mbuf *m, bool v4_forwarding, bool v6_forwarding) { struct ether_header *eh; uint16_t eh_type; eh = mtod(m, struct ether_header *); eh_type = ntohs(eh->ether_type); switch (eh_type) { #if defined(INET6) case ETHERTYPE_IPV6: return !v6_forwarding; #endif #if defined (INET) case ETHERTYPE_IP: return !v4_forwarding; #endif } return false; } #else static void iflib_get_ip_forwarding(struct lro_ctrl *lc __unused, bool *v4 __unused, bool *v6 __unused) { } #endif static bool iflib_rxeof(iflib_rxq_t rxq, qidx_t budget) { if_ctx_t ctx = rxq->ifr_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; int avail, i; qidx_t *cidxp; struct if_rxd_info ri; int err, budget_left, rx_bytes, rx_pkts; iflib_fl_t fl; struct ifnet *ifp; int lro_enabled; bool lro_possible = false; bool v4_forwarding, v6_forwarding; /* * XXX early demux data packets so that if_input processing only handles * acks in interrupt context */ struct mbuf *m, *mh, *mt, *mf; ifp = ctx->ifc_ifp; mh = mt = NULL; MPASS(budget > 0); rx_pkts = rx_bytes = 0; if (sctx->isc_flags & IFLIB_HAS_RXCQ) cidxp = &rxq->ifr_cq_cidx; else cidxp = &rxq->ifr_fl[0].ifl_cidx; if ((avail = iflib_rxd_avail(ctx, rxq, *cidxp, budget)) == 0) { for (i = 0, fl = &rxq->ifr_fl[0]; i < sctx->isc_nfl; i++, fl++) __iflib_fl_refill_lt(ctx, fl, budget + 8); DBG_COUNTER_INC(rx_unavail); return (false); } for (budget_left = budget; (budget_left > 0) && (avail > 0); budget_left--, avail--) { if (__predict_false(!CTX_ACTIVE(ctx))) { DBG_COUNTER_INC(rx_ctx_inactive); break; } /* * Reset client set fields to their default values */ rxd_info_zero(&ri); ri.iri_qsidx = rxq->ifr_id; ri.iri_cidx = *cidxp; ri.iri_ifp = ifp; ri.iri_frags = rxq->ifr_frags; err = ctx->isc_rxd_pkt_get(ctx->ifc_softc, &ri); if (err) goto err; if (sctx->isc_flags & IFLIB_HAS_RXCQ) { *cidxp = ri.iri_cidx; /* Update our consumer index */ /* XXX NB: shurd - check if this is still safe */ while (rxq->ifr_cq_cidx >= scctx->isc_nrxd[0]) { rxq->ifr_cq_cidx -= scctx->isc_nrxd[0]; rxq->ifr_cq_gen = 0; } /* was this only a completion queue message? 
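	 * (i.e. a completion entry with iri_nfrags == 0 carries no
	 * data, so there is no packet to assemble and we simply move
	 * on)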
*/ if (__predict_false(ri.iri_nfrags == 0)) continue; } MPASS(ri.iri_nfrags != 0); MPASS(ri.iri_len != 0); /* will advance the cidx on the corresponding free lists */ m = iflib_rxd_pkt_get(rxq, &ri); if (avail == 0 && budget_left) avail = iflib_rxd_avail(ctx, rxq, *cidxp, budget_left); if (__predict_false(m == NULL)) { DBG_COUNTER_INC(rx_mbuf_null); continue; } /* imm_pkt: -- cxgb */ if (mh == NULL) mh = mt = m; else { mt->m_nextpkt = m; mt = m; } } /* make sure that we can refill faster than drain */ for (i = 0, fl = &rxq->ifr_fl[0]; i < sctx->isc_nfl; i++, fl++) __iflib_fl_refill_lt(ctx, fl, budget + 8); lro_enabled = (if_getcapenable(ifp) & IFCAP_LRO); if (lro_enabled) iflib_get_ip_forwarding(&rxq->ifr_lc, &v4_forwarding, &v6_forwarding); mt = mf = NULL; while (mh != NULL) { m = mh; mh = mh->m_nextpkt; m->m_nextpkt = NULL; #ifndef __NO_STRICT_ALIGNMENT if (!IP_ALIGNED(m) && (m = iflib_fixup_rx(m)) == NULL) continue; #endif rx_bytes += m->m_pkthdr.len; rx_pkts++; #if defined(INET6) || defined(INET) if (lro_enabled) { if (!lro_possible) { lro_possible = iflib_check_lro_possible(m, v4_forwarding, v6_forwarding); if (lro_possible && mf != NULL) { ifp->if_input(ifp, mf); DBG_COUNTER_INC(rx_if_input); mt = mf = NULL; } } if ((m->m_pkthdr.csum_flags & (CSUM_L4_CALC|CSUM_L4_VALID)) == (CSUM_L4_CALC|CSUM_L4_VALID)) { if (lro_possible && tcp_lro_rx(&rxq->ifr_lc, m, 0) == 0) continue; } } #endif if (lro_possible) { ifp->if_input(ifp, m); DBG_COUNTER_INC(rx_if_input); continue; } if (mf == NULL) mf = m; if (mt != NULL) mt->m_nextpkt = m; mt = m; } if (mf != NULL) { ifp->if_input(ifp, mf); DBG_COUNTER_INC(rx_if_input); } if_inc_counter(ifp, IFCOUNTER_IBYTES, rx_bytes); if_inc_counter(ifp, IFCOUNTER_IPACKETS, rx_pkts); /* * Flush any outstanding LRO work */ #if defined(INET6) || defined(INET) tcp_lro_flush_all(&rxq->ifr_lc); #endif if (avail) return true; return (iflib_rxd_avail(ctx, rxq, *cidxp, 1)); err: - CTX_LOCK(ctx); + STATE_LOCK(ctx); ctx->ifc_flags |= IFC_DO_RESET; iflib_admin_intr_deferred(ctx); - CTX_UNLOCK(ctx); + STATE_UNLOCK(ctx); return (false); } #define TXD_NOTIFY_COUNT(txq) (((txq)->ift_size / (txq)->ift_update_freq)-1) static inline qidx_t txq_max_db_deferred(iflib_txq_t txq, qidx_t in_use) { qidx_t notify_count = TXD_NOTIFY_COUNT(txq); qidx_t minthresh = txq->ift_size / 8; if (in_use > 4*minthresh) return (notify_count); if (in_use > 2*minthresh) return (notify_count >> 1); if (in_use > minthresh) return (notify_count >> 3); return (0); } static inline qidx_t txq_max_rs_deferred(iflib_txq_t txq) { qidx_t notify_count = TXD_NOTIFY_COUNT(txq); qidx_t minthresh = txq->ift_size / 8; if (txq->ift_in_use > 4*minthresh) return (notify_count); if (txq->ift_in_use > 2*minthresh) return (notify_count >> 1); if (txq->ift_in_use > minthresh) return (notify_count >> 2); return (2); } #define M_CSUM_FLAGS(m) ((m)->m_pkthdr.csum_flags) #define M_HAS_VLANTAG(m) (m->m_flags & M_VLANTAG) #define TXQ_MAX_DB_DEFERRED(txq, in_use) txq_max_db_deferred((txq), (in_use)) #define TXQ_MAX_RS_DEFERRED(txq) txq_max_rs_deferred(txq) #define TXQ_MAX_DB_CONSUMED(size) (size >> 4) /* forward compatibility for cxgb */ #define FIRST_QSET(ctx) 0 #define NTXQSETS(ctx) ((ctx)->ifc_softc_ctx.isc_ntxqsets) #define NRXQSETS(ctx) ((ctx)->ifc_softc_ctx.isc_nrxqsets) #define QIDX(ctx, m) ((((m)->m_pkthdr.flowid & ctx->ifc_softc_ctx.isc_rss_table_mask) % NTXQSETS(ctx)) + FIRST_QSET(ctx)) #define DESC_RECLAIMABLE(q) ((int)((q)->ift_processed - (q)->ift_cleaned - (q)->ift_ctx->ifc_softc_ctx.isc_tx_nsegments)) /* XXX we should 
be setting this to something other than zero */ #define RECLAIM_THRESH(ctx) ((ctx)->ifc_sctx->isc_tx_reclaim_thresh) #define MAX_TX_DESC(ctx) ((ctx)->ifc_softc_ctx.isc_tx_tso_segments_max) static inline bool iflib_txd_db_check(if_ctx_t ctx, iflib_txq_t txq, int ring, qidx_t in_use) { qidx_t dbval, max; bool rang; rang = false; max = TXQ_MAX_DB_DEFERRED(txq, in_use); if (ring || txq->ift_db_pending >= max) { dbval = txq->ift_npending ? txq->ift_npending : txq->ift_pidx; ctx->isc_txd_flush(ctx->ifc_softc, txq->ift_id, dbval); txq->ift_db_pending = txq->ift_npending = 0; rang = true; } return (rang); } #ifdef PKT_DEBUG static void print_pkt(if_pkt_info_t pi) { printf("pi len: %d qsidx: %d nsegs: %d ndescs: %d flags: %x pidx: %d\n", pi->ipi_len, pi->ipi_qsidx, pi->ipi_nsegs, pi->ipi_ndescs, pi->ipi_flags, pi->ipi_pidx); printf("pi new_pidx: %d csum_flags: %lx tso_segsz: %d mflags: %x vtag: %d\n", pi->ipi_new_pidx, pi->ipi_csum_flags, pi->ipi_tso_segsz, pi->ipi_mflags, pi->ipi_vtag); printf("pi etype: %d ehdrlen: %d ip_hlen: %d ipproto: %d\n", pi->ipi_etype, pi->ipi_ehdrlen, pi->ipi_ip_hlen, pi->ipi_ipproto); } #endif #define IS_TSO4(pi) ((pi)->ipi_csum_flags & CSUM_IP_TSO) #define IS_TSO6(pi) ((pi)->ipi_csum_flags & CSUM_IP6_TSO) static int iflib_parse_header(iflib_txq_t txq, if_pkt_info_t pi, struct mbuf **mp) { if_shared_ctx_t sctx = txq->ift_ctx->ifc_sctx; struct ether_vlan_header *eh; struct mbuf *m, *n; n = m = *mp; if ((sctx->isc_flags & IFLIB_NEED_SCRATCH) && M_WRITABLE(m) == 0) { if ((m = m_dup(m, M_NOWAIT)) == NULL) { return (ENOMEM); } else { m_freem(*mp); n = *mp = m; } } /* * Determine where frame payload starts. * Jump over vlan headers if already present, * helpful for QinQ too. */ if (__predict_false(m->m_len < sizeof(*eh))) { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, sizeof(*eh))) == NULL)) return (ENOMEM); } eh = mtod(m, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { pi->ipi_etype = ntohs(eh->evl_proto); pi->ipi_ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; } else { pi->ipi_etype = ntohs(eh->evl_encap_proto); pi->ipi_ehdrlen = ETHER_HDR_LEN; } switch (pi->ipi_etype) { #ifdef INET case ETHERTYPE_IP: { struct ip *ip = NULL; struct tcphdr *th = NULL; int minthlen; minthlen = min(m->m_pkthdr.len, pi->ipi_ehdrlen + sizeof(*ip) + sizeof(*th)); if (__predict_false(m->m_len < minthlen)) { /* * if this code bloat is causing too much of a hit * move it to a separate function and mark it noinline */ if (m->m_len == pi->ipi_ehdrlen) { n = m->m_next; MPASS(n); if (n->m_len >= sizeof(*ip)) { ip = (struct ip *)n->m_data; if (n->m_len >= (ip->ip_hl << 2) + sizeof(*th)) th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2)); } else { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, minthlen)) == NULL)) return (ENOMEM); ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen); } } else { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, minthlen)) == NULL)) return (ENOMEM); ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen); if (m->m_len >= (ip->ip_hl << 2) + sizeof(*th)) th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2)); } } else { ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen); if (m->m_len >= (ip->ip_hl << 2) + sizeof(*th)) th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2)); } pi->ipi_ip_hlen = ip->ip_hl << 2; pi->ipi_ipproto = ip->ip_p; pi->ipi_flags |= IPI_TX_IPV4; if ((sctx->isc_flags & IFLIB_NEED_ZERO_CSUM) && (pi->ipi_csum_flags & CSUM_IP)) ip->ip_sum = 0; if (IS_TSO4(pi)) { if (pi->ipi_ipproto == IPPROTO_TCP) { if 
(__predict_false(th == NULL)) { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, (ip->ip_hl << 2) + sizeof(*th))) == NULL)) return (ENOMEM); th = (struct tcphdr *)((caddr_t)ip + pi->ipi_ip_hlen); } pi->ipi_tcp_hflags = th->th_flags; pi->ipi_tcp_hlen = th->th_off << 2; pi->ipi_tcp_seq = th->th_seq; } if (__predict_false(ip->ip_p != IPPROTO_TCP)) return (ENXIO); th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(IPPROTO_TCP)); pi->ipi_tso_segsz = m->m_pkthdr.tso_segsz; if (sctx->isc_flags & IFLIB_TSO_INIT_IP) { ip->ip_sum = 0; ip->ip_len = htons(pi->ipi_ip_hlen + pi->ipi_tcp_hlen + pi->ipi_tso_segsz); } } break; } #endif #ifdef INET6 case ETHERTYPE_IPV6: { struct ip6_hdr *ip6 = (struct ip6_hdr *)(m->m_data + pi->ipi_ehdrlen); struct tcphdr *th; pi->ipi_ip_hlen = sizeof(struct ip6_hdr); if (__predict_false(m->m_len < pi->ipi_ehdrlen + sizeof(struct ip6_hdr))) { if (__predict_false((m = m_pullup(m, pi->ipi_ehdrlen + sizeof(struct ip6_hdr))) == NULL)) return (ENOMEM); } th = (struct tcphdr *)((caddr_t)ip6 + pi->ipi_ip_hlen); /* XXX-BZ this will go badly in case of ext hdrs. */ pi->ipi_ipproto = ip6->ip6_nxt; pi->ipi_flags |= IPI_TX_IPV6; if (IS_TSO6(pi)) { if (pi->ipi_ipproto == IPPROTO_TCP) { if (__predict_false(m->m_len < pi->ipi_ehdrlen + sizeof(struct ip6_hdr) + sizeof(struct tcphdr))) { if (__predict_false((m = m_pullup(m, pi->ipi_ehdrlen + sizeof(struct ip6_hdr) + sizeof(struct tcphdr))) == NULL)) return (ENOMEM); } pi->ipi_tcp_hflags = th->th_flags; pi->ipi_tcp_hlen = th->th_off << 2; } if (__predict_false(ip6->ip6_nxt != IPPROTO_TCP)) return (ENXIO); /* * The corresponding flag is set by the stack in the IPv4 * TSO case, but not in IPv6 (at least in FreeBSD 10.2). * So, set it here because the rest of the flow requires it. 
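		 *
		 * As with the IPv4 path above, the TCP checksum field
		 * is also seeded with the pseudo-header checksum
		 * (in6_cksum_pseudo()) so the hardware only has to
		 * fold in the payload.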
*/ pi->ipi_csum_flags |= CSUM_TCP_IPV6; th->th_sum = in6_cksum_pseudo(ip6, 0, IPPROTO_TCP, 0); pi->ipi_tso_segsz = m->m_pkthdr.tso_segsz; } break; } #endif default: pi->ipi_csum_flags &= ~CSUM_OFFLOAD; pi->ipi_ip_hlen = 0; break; } *mp = m; return (0); } static __noinline struct mbuf * collapse_pkthdr(struct mbuf *m0) { struct mbuf *m, *m_next, *tmp; m = m0; m_next = m->m_next; while (m_next != NULL && m_next->m_len == 0) { m = m_next; m->m_next = NULL; m_free(m); m_next = m_next->m_next; } m = m0; m->m_next = m_next; if ((m_next->m_flags & M_EXT) == 0) { m = m_defrag(m, M_NOWAIT); } else { tmp = m_next->m_next; memcpy(m_next, m, MPKTHSIZE); m = m_next; m->m_next = tmp; } return (m); } /* * If dodgy hardware rejects the scatter gather chain we've handed it * we'll need to remove the mbuf chain from ifsg_m[] before we can add the * m_defrag'd mbufs */ static __noinline struct mbuf * iflib_remove_mbuf(iflib_txq_t txq) { int ntxd, i, pidx; struct mbuf *m, *mh, **ifsd_m; pidx = txq->ift_pidx; ifsd_m = txq->ift_sds.ifsd_m; ntxd = txq->ift_size; mh = m = ifsd_m[pidx]; ifsd_m[pidx] = NULL; #if MEMORY_LOGGING txq->ift_dequeued++; #endif i = 1; while (m) { ifsd_m[(pidx + i) & (ntxd -1)] = NULL; #if MEMORY_LOGGING txq->ift_dequeued++; #endif m = m->m_next; i++; } return (mh); } static int iflib_busdma_load_mbuf_sg(iflib_txq_t txq, bus_dma_tag_t tag, bus_dmamap_t map, struct mbuf **m0, bus_dma_segment_t *segs, int *nsegs, int max_segs, int flags) { if_ctx_t ctx; if_shared_ctx_t sctx; if_softc_ctx_t scctx; int i, next, pidx, err, ntxd, count; struct mbuf *m, *tmp, **ifsd_m; m = *m0; /* * Please don't ever do this */ if (__predict_false(m->m_len == 0)) *m0 = m = collapse_pkthdr(m); ctx = txq->ift_ctx; sctx = ctx->ifc_sctx; scctx = &ctx->ifc_softc_ctx; ifsd_m = txq->ift_sds.ifsd_m; ntxd = txq->ift_size; pidx = txq->ift_pidx; if (map != NULL) { uint8_t *ifsd_flags = txq->ift_sds.ifsd_flags; err = bus_dmamap_load_mbuf_sg(tag, map, *m0, segs, nsegs, BUS_DMA_NOWAIT); if (err) return (err); ifsd_flags[pidx] |= TX_SW_DESC_MAPPED; count = 0; m = *m0; do { if (__predict_false(m->m_len <= 0)) { tmp = m; m = m->m_next; tmp->m_next = NULL; m_free(tmp); continue; } m = m->m_next; count++; } while (m != NULL); if (count > *nsegs) { ifsd_m[pidx] = *m0; ifsd_m[pidx]->m_flags |= M_TOOBIG; return (0); } m = *m0; count = 0; do { next = (pidx + count) & (ntxd-1); MPASS(ifsd_m[next] == NULL); ifsd_m[next] = m; count++; tmp = m; m = m->m_next; } while (m != NULL); } else { int buflen, sgsize, maxsegsz, max_sgsize; vm_offset_t vaddr; vm_paddr_t curaddr; count = i = 0; m = *m0; if (m->m_pkthdr.csum_flags & CSUM_TSO) maxsegsz = scctx->isc_tx_tso_segsize_max; else maxsegsz = sctx->isc_tx_maxsegsize; do { if (__predict_false(m->m_len <= 0)) { tmp = m; m = m->m_next; tmp->m_next = NULL; m_free(tmp); continue; } buflen = m->m_len; vaddr = (vm_offset_t)m->m_data; /* * see if we can't be smarter about physically * contiguous mappings */ next = (pidx + count) & (ntxd-1); MPASS(ifsd_m[next] == NULL); #if MEMORY_LOGGING txq->ift_enqueued++; #endif ifsd_m[next] = m; while (buflen > 0) { if (i >= max_segs) goto err; max_sgsize = MIN(buflen, maxsegsz); curaddr = pmap_kextract(vaddr); sgsize = PAGE_SIZE - (curaddr & PAGE_MASK); sgsize = MIN(sgsize, max_sgsize); segs[i].ds_addr = curaddr; segs[i].ds_len = sgsize; vaddr += sgsize; buflen -= sgsize; i++; } count++; tmp = m; m = m->m_next; } while (m != NULL); *nsegs = i; } return (0); err: *m0 = iflib_remove_mbuf(txq); return (EFBIG); } static inline caddr_t calc_next_txd(iflib_txq_t txq, int 
    cidx, uint8_t qid)
{
	qidx_t size;
	int ntxd;
	caddr_t start, end, cur, next;

	ntxd = txq->ift_size;
	size = txq->ift_txd_size[qid];
	start = txq->ift_ifdi[qid].idi_vaddr;

	if (__predict_false(size == 0))
		return (start);
	cur = start + size*cidx;
	end = start + size*ntxd;
	next = CACHE_PTR_NEXT(cur);
	return (next < end ? next : start);
}

/*
 * Pad an mbuf to ensure a minimum ethernet frame size.
 * min_frame_size is the frame size (less CRC) to pad the mbuf to
 */
static __noinline int
iflib_ether_pad(device_t dev, struct mbuf **m_head, uint16_t min_frame_size)
{
	/*
	 * 18 is enough bytes to pad an ARP packet to 46 bytes, and
	 * an ARP message is the smallest common payload I can think of
	 */
	static char pad[18];	/* just zeros */
	int n;
	struct mbuf *new_head;

	if (!M_WRITABLE(*m_head)) {
		new_head = m_dup(*m_head, M_NOWAIT);
		if (new_head == NULL) {
			m_freem(*m_head);
			device_printf(dev, "cannot pad short frame, m_dup() failed\n");
			DBG_COUNTER_INC(encap_pad_mbuf_fail);
			return ENOMEM;
		}
		m_freem(*m_head);
		*m_head = new_head;
	}

	for (n = min_frame_size - (*m_head)->m_pkthdr.len;
	     n > 0; n -= sizeof(pad))
		if (!m_append(*m_head, min(n, sizeof(pad)), pad))
			break;

	if (n > 0) {
		m_freem(*m_head);
		device_printf(dev, "cannot pad short frame\n");
		DBG_COUNTER_INC(encap_pad_mbuf_fail);
		return (ENOBUFS);
	}

	return 0;
}

static int
iflib_encap(iflib_txq_t txq, struct mbuf **m_headp)
{
	if_ctx_t		ctx;
	if_shared_ctx_t		sctx;
	if_softc_ctx_t		scctx;
	bus_dma_segment_t	*segs;
	struct mbuf		*m_head;
	void			*next_txd;
	bus_dmamap_t		map;
	struct if_pkt_info	pi;
	int remap = 0;
	int err, nsegs, ndesc, max_segs, pidx, cidx, next, ntxd;
	bus_dma_tag_t desc_tag;

	segs = txq->ift_segs;
	ctx = txq->ift_ctx;
	sctx = ctx->ifc_sctx;
	scctx = &ctx->ifc_softc_ctx;
	ntxd = txq->ift_size;
	m_head = *m_headp;
	map = NULL;

	/*
	 * If we're doing TSO the next descriptor to clean may be quite far ahead
	 */
	cidx = txq->ift_cidx;
	pidx = txq->ift_pidx;
	if (ctx->ifc_flags & IFC_PREFETCH) {
		next = (cidx + CACHE_PTR_INCREMENT) & (ntxd-1);
		if (!(ctx->ifc_flags & IFLIB_HAS_TXCQ)) {
			next_txd = calc_next_txd(txq, cidx, 0);
			prefetch(next_txd);
		}

		/* prefetch the next cache line of mbuf pointers and flags */
		prefetch(&txq->ift_sds.ifsd_m[next]);
		if (txq->ift_sds.ifsd_map != NULL) {
			prefetch(&txq->ift_sds.ifsd_map[next]);
			next = (cidx + CACHE_LINE_SIZE) & (ntxd-1);
			prefetch(&txq->ift_sds.ifsd_flags[next]);
		}
	} else if (txq->ift_sds.ifsd_map != NULL)
		map = txq->ift_sds.ifsd_map[pidx];

	if (m_head->m_pkthdr.csum_flags & CSUM_TSO) {
		desc_tag = txq->ift_tso_desc_tag;
		max_segs = scctx->isc_tx_tso_segments_max;
	} else {
		desc_tag = txq->ift_desc_tag;
		max_segs = scctx->isc_tx_nsegments;
	}
	if ((sctx->isc_flags & IFLIB_NEED_ETHER_PAD) &&
	    __predict_false(m_head->m_pkthdr.len < scctx->isc_min_frame_size)) {
		err = iflib_ether_pad(ctx->ifc_dev, m_headp, scctx->isc_min_frame_size);
		if (err)
			return err;
	}
	m_head = *m_headp;

	pkt_info_zero(&pi);
	pi.ipi_mflags = (m_head->m_flags & (M_VLANTAG|M_BCAST|M_MCAST));
	pi.ipi_pidx = pidx;
	pi.ipi_qsidx = txq->ift_id;
	pi.ipi_len = m_head->m_pkthdr.len;
	pi.ipi_csum_flags = m_head->m_pkthdr.csum_flags;
	pi.ipi_vtag = (m_head->m_flags & M_VLANTAG) ?
m_head->m_pkthdr.ether_vtag : 0; /* deliberate bitwise OR to make one condition */ if (__predict_true((pi.ipi_csum_flags | pi.ipi_vtag))) { if (__predict_false((err = iflib_parse_header(txq, &pi, m_headp)) != 0)) return (err); m_head = *m_headp; } retry: err = iflib_busdma_load_mbuf_sg(txq, desc_tag, map, m_headp, segs, &nsegs, max_segs, BUS_DMA_NOWAIT); defrag: if (__predict_false(err)) { switch (err) { case EFBIG: /* try collapse once and defrag once */ if (remap == 0) m_head = m_collapse(*m_headp, M_NOWAIT, max_segs); if (remap == 1) m_head = m_defrag(*m_headp, M_NOWAIT); remap++; if (__predict_false(m_head == NULL)) goto defrag_failed; txq->ift_mbuf_defrag++; *m_headp = m_head; goto retry; break; case ENOMEM: txq->ift_no_tx_dma_setup++; break; default: txq->ift_no_tx_dma_setup++; m_freem(*m_headp); DBG_COUNTER_INC(tx_frees); *m_headp = NULL; break; } txq->ift_map_failed++; DBG_COUNTER_INC(encap_load_mbuf_fail); return (err); } /* * XXX assumes a 1 to 1 relationship between segments and * descriptors - this does not hold true on all drivers, e.g. * cxgb */ if (__predict_false(nsegs + 2 > TXQ_AVAIL(txq))) { txq->ift_no_desc_avail++; if (map != NULL) bus_dmamap_unload(desc_tag, map); DBG_COUNTER_INC(encap_txq_avail_fail); if ((txq->ift_task.gt_task.ta_flags & TASK_ENQUEUED) == 0) GROUPTASK_ENQUEUE(&txq->ift_task); return (ENOBUFS); } /* * On Intel cards we can greatly reduce the number of TX interrupts * we see by only setting report status on every Nth descriptor. * However, this also means that the driver will need to keep track * of the descriptors that RS was set on to check them for the DD bit. */ txq->ift_rs_pending += nsegs + 1; if (txq->ift_rs_pending > TXQ_MAX_RS_DEFERRED(txq) || iflib_no_tx_batch || (TXQ_AVAIL(txq) - nsegs - 1) <= MAX_TX_DESC(ctx)) { pi.ipi_flags |= IPI_TX_INTR; txq->ift_rs_pending = 0; } pi.ipi_segs = segs; pi.ipi_nsegs = nsegs; MPASS(pidx >= 0 && pidx < txq->ift_size); #ifdef PKT_DEBUG print_pkt(&pi); #endif if (map != NULL) bus_dmamap_sync(desc_tag, map, BUS_DMASYNC_PREWRITE); if ((err = ctx->isc_txd_encap(ctx->ifc_softc, &pi)) == 0) { if (map != NULL) bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); DBG_COUNTER_INC(tx_encap); MPASS(pi.ipi_new_pidx < txq->ift_size); ndesc = pi.ipi_new_pidx - pi.ipi_pidx; if (pi.ipi_new_pidx < pi.ipi_pidx) { ndesc += txq->ift_size; txq->ift_gen = 1; } /* * drivers can need as many as * two sentinels */ MPASS(ndesc <= pi.ipi_nsegs + 2); MPASS(pi.ipi_new_pidx != pidx); MPASS(ndesc > 0); txq->ift_in_use += ndesc; /* * We update the last software descriptor again here because there may * be a sentinel and/or there may be more mbufs than segments */ txq->ift_pidx = pi.ipi_new_pidx; txq->ift_npending += pi.ipi_ndescs; } else if (__predict_false(err == EFBIG && remap < 2)) { *m_headp = m_head = iflib_remove_mbuf(txq); remap = 1; txq->ift_txd_encap_efbig++; goto defrag; } else DBG_COUNTER_INC(encap_txd_encap_fail); return (err); defrag_failed: txq->ift_mbuf_defrag_failed++; txq->ift_map_failed++; m_freem(*m_headp); DBG_COUNTER_INC(tx_frees); *m_headp = NULL; return (ENOMEM); } static void iflib_tx_desc_free(iflib_txq_t txq, int n) { int hasmap; uint32_t qsize, cidx, mask, gen; struct mbuf *m, **ifsd_m; uint8_t *ifsd_flags; bus_dmamap_t *ifsd_map; bool do_prefetch; cidx = txq->ift_cidx; gen = txq->ift_gen; qsize = txq->ift_size; mask = qsize-1; hasmap = txq->ift_sds.ifsd_map != NULL; ifsd_flags = txq->ift_sds.ifsd_flags; ifsd_m = txq->ift_sds.ifsd_m; ifsd_map = 
txq->ift_sds.ifsd_map; do_prefetch = (txq->ift_ctx->ifc_flags & IFC_PREFETCH); while (n-- > 0) { if (do_prefetch) { prefetch(ifsd_m[(cidx + 3) & mask]); prefetch(ifsd_m[(cidx + 4) & mask]); } if (ifsd_m[cidx] != NULL) { prefetch(&ifsd_m[(cidx + CACHE_PTR_INCREMENT) & mask]); prefetch(&ifsd_flags[(cidx + CACHE_PTR_INCREMENT) & mask]); if (hasmap && (ifsd_flags[cidx] & TX_SW_DESC_MAPPED)) { /* * does it matter if it's not the TSO tag? If so we'll * have to add the type to flags */ bus_dmamap_unload(txq->ift_desc_tag, ifsd_map[cidx]); ifsd_flags[cidx] &= ~TX_SW_DESC_MAPPED; } if ((m = ifsd_m[cidx]) != NULL) { /* XXX we don't support any drivers that batch packets yet */ MPASS(m->m_nextpkt == NULL); /* if the number of clusters exceeds the number of segments * there won't be space on the ring to save a pointer to each * cluster so we simply free the list here */ if (m->m_flags & M_TOOBIG) { m_freem(m); } else { m_free(m); } ifsd_m[cidx] = NULL; #if MEMORY_LOGGING txq->ift_dequeued++; #endif DBG_COUNTER_INC(tx_frees); } } if (__predict_false(++cidx == qsize)) { cidx = 0; gen = 0; } } txq->ift_cidx = cidx; txq->ift_gen = gen; } static __inline int iflib_completed_tx_reclaim(iflib_txq_t txq, int thresh) { int reclaim; if_ctx_t ctx = txq->ift_ctx; KASSERT(thresh >= 0, ("invalid threshold to reclaim")); MPASS(thresh /*+ MAX_TX_DESC(txq->ift_ctx) */ < txq->ift_size); /* * Need a rate-limiting check so that this isn't called every time */ iflib_tx_credits_update(ctx, txq); reclaim = DESC_RECLAIMABLE(txq); if (reclaim <= thresh /* + MAX_TX_DESC(txq->ift_ctx) */) { #ifdef INVARIANTS if (iflib_verbose_debug) { printf("%s processed=%ju cleaned=%ju tx_nsegments=%d reclaim=%d thresh=%d\n", __FUNCTION__, txq->ift_processed, txq->ift_cleaned, txq->ift_ctx->ifc_softc_ctx.isc_tx_nsegments, reclaim, thresh); } #endif return (0); } iflib_tx_desc_free(txq, reclaim); txq->ift_cleaned += reclaim; txq->ift_in_use -= reclaim; return (reclaim); } static struct mbuf ** _ring_peek_one(struct ifmp_ring *r, int cidx, int offset, int remaining) { int next, size; struct mbuf **items; size = r->size; next = (cidx + CACHE_PTR_INCREMENT) & (size-1); items = __DEVOLATILE(struct mbuf **, &r->items[0]); prefetch(items[(cidx + offset) & (size-1)]); if (remaining > 1) { prefetch2cachelines(&items[next]); prefetch2cachelines(items[(cidx + offset + 1) & (size-1)]); prefetch2cachelines(items[(cidx + offset + 2) & (size-1)]); prefetch2cachelines(items[(cidx + offset + 3) & (size-1)]); } return (__DEVOLATILE(struct mbuf **, &r->items[(cidx + offset) & (size-1)])); } static void iflib_txq_check_drain(iflib_txq_t txq, int budget) { ifmp_ring_check_drainage(txq->ift_br, budget); } static uint32_t iflib_txq_can_drain(struct ifmp_ring *r) { iflib_txq_t txq = r->cookie; if_ctx_t ctx = txq->ift_ctx; return ((TXQ_AVAIL(txq) > MAX_TX_DESC(ctx) + 2) || ctx->isc_txd_credits_update(ctx->ifc_softc, txq->ift_id, false)); } static uint32_t iflib_txq_drain(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx) { iflib_txq_t txq = r->cookie; if_ctx_t ctx = txq->ift_ctx; struct ifnet *ifp = ctx->ifc_ifp; struct mbuf **mp, *m; int i, count, consumed, pkt_sent, bytes_sent, mcast_sent, avail; int reclaimed, err, in_use_prev, desc_used; bool do_prefetch, ring, rang; if (__predict_false(!(if_getdrvflags(ifp) & IFF_DRV_RUNNING) || !LINK_ACTIVE(ctx))) { DBG_COUNTER_INC(txq_drain_notready); return (0); } reclaimed = iflib_completed_tx_reclaim(txq, RECLAIM_THRESH(ctx)); rang = iflib_txd_db_check(ctx, txq, reclaimed, txq->ift_in_use); avail = IDXDIFF(pidx, cidx, 
r->size);

	if (__predict_false(ctx->ifc_flags & IFC_QFLUSH)) {
		DBG_COUNTER_INC(txq_drain_flushing);
		for (i = 0; i < avail; i++) {
			m_free(r->items[(cidx + i) & (r->size-1)]);
			r->items[(cidx + i) & (r->size-1)] = NULL;
		}
		return (avail);
	}

	if (__predict_false(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_OACTIVE)) {
		txq->ift_qstatus = IFLIB_QUEUE_IDLE;
		CALLOUT_LOCK(txq);
		callout_stop(&txq->ift_timer);
		CALLOUT_UNLOCK(txq);
		DBG_COUNTER_INC(txq_drain_oactive);
		return (0);
	}
	if (reclaimed)
		txq->ift_qstatus = IFLIB_QUEUE_IDLE;
	consumed = mcast_sent = bytes_sent = pkt_sent = 0;
	err = 0;	/* err is read after the loop even if nothing was encapped */
	count = MIN(avail, TX_BATCH_SIZE);
#ifdef INVARIANTS
	if (iflib_verbose_debug)
		printf("%s avail=%d ifc_flags=%x txq_avail=%d ", __FUNCTION__,
		    avail, ctx->ifc_flags, TXQ_AVAIL(txq));
#endif
	do_prefetch = (ctx->ifc_flags & IFC_PREFETCH);
	avail = TXQ_AVAIL(txq);
	for (desc_used = i = 0; i < count && avail > MAX_TX_DESC(ctx) + 2; i++) {
		int pidx_prev, rem = do_prefetch ? count - i : 0;

		mp = _ring_peek_one(r, cidx, i, rem);
		MPASS(mp != NULL && *mp != NULL);
		if (__predict_false(*mp == (struct mbuf *)txq)) {
			consumed++;
			reclaimed++;
			continue;
		}
		in_use_prev = txq->ift_in_use;
		pidx_prev = txq->ift_pidx;
		err = iflib_encap(txq, mp);
		if (__predict_false(err)) {
			DBG_COUNTER_INC(txq_drain_encapfail);
			/* no room - bail out */
			if (err == ENOBUFS)
				break;
			consumed++;
			/* we can't send this packet - skip it */
			continue;
		}
		consumed++;
		pkt_sent++;
		m = *mp;
		DBG_COUNTER_INC(tx_sent);
		bytes_sent += m->m_pkthdr.len;
		mcast_sent += !!(m->m_flags & M_MCAST);
		avail = TXQ_AVAIL(txq);

		txq->ift_db_pending += (txq->ift_in_use - in_use_prev);
		desc_used += (txq->ift_in_use - in_use_prev);
		ETHER_BPF_MTAP(ifp, m);
		if (__predict_false(!(ifp->if_drv_flags & IFF_DRV_RUNNING)))
			break;
		rang = iflib_txd_db_check(ctx, txq, false, in_use_prev);
	}

	/* deliberate use of bitwise or to avoid gratuitous short-circuit */
	ring = rang ?
false : (iflib_min_tx_latency | err) || (TXQ_AVAIL(txq) < MAX_TX_DESC(ctx)); iflib_txd_db_check(ctx, txq, ring, txq->ift_in_use); if_inc_counter(ifp, IFCOUNTER_OBYTES, bytes_sent); if_inc_counter(ifp, IFCOUNTER_OPACKETS, pkt_sent); if (mcast_sent) if_inc_counter(ifp, IFCOUNTER_OMCASTS, mcast_sent); #ifdef INVARIANTS if (iflib_verbose_debug) printf("consumed=%d\n", consumed); #endif return (consumed); } static uint32_t iflib_txq_drain_always(struct ifmp_ring *r) { return (1); } static uint32_t iflib_txq_drain_free(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx) { int i, avail; struct mbuf **mp; iflib_txq_t txq; txq = r->cookie; txq->ift_qstatus = IFLIB_QUEUE_IDLE; CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); CALLOUT_UNLOCK(txq); avail = IDXDIFF(pidx, cidx, r->size); for (i = 0; i < avail; i++) { mp = _ring_peek_one(r, cidx, i, avail - i); if (__predict_false(*mp == (struct mbuf *)txq)) continue; m_freem(*mp); } MPASS(ifmp_ring_is_stalled(r) == 0); return (avail); } static void iflib_ifmp_purge(iflib_txq_t txq) { struct ifmp_ring *r; r = txq->ift_br; r->drain = iflib_txq_drain_free; r->can_drain = iflib_txq_drain_always; ifmp_ring_check_drainage(r, r->size); r->drain = iflib_txq_drain; r->can_drain = iflib_txq_can_drain; } static void _task_fn_tx(void *context) { iflib_txq_t txq = context; if_ctx_t ctx = txq->ift_ctx; struct ifnet *ifp = ctx->ifc_ifp; int rc; #ifdef IFLIB_DIAGNOSTICS txq->ift_cpu_exec_count[curcpu]++; #endif if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)) return; if (if_getcapenable(ifp) & IFCAP_NETMAP) { if (ctx->isc_txd_credits_update(ctx->ifc_softc, txq->ift_id, false)) netmap_tx_irq(ifp, txq->ift_id); IFDI_TX_QUEUE_INTR_ENABLE(ctx, txq->ift_id); return; } if (txq->ift_db_pending) ifmp_ring_enqueue(txq->ift_br, (void **)&txq, 1, TX_BATCH_SIZE); ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE); if (ctx->ifc_flags & IFC_LEGACY) IFDI_INTR_ENABLE(ctx); else { rc = IFDI_TX_QUEUE_INTR_ENABLE(ctx, txq->ift_id); KASSERT(rc != ENOTSUP, ("MSI-X support requires queue_intr_enable, but not implemented in driver")); } } static void _task_fn_rx(void *context) { iflib_rxq_t rxq = context; if_ctx_t ctx = rxq->ifr_ctx; bool more; int rc; uint16_t budget; #ifdef IFLIB_DIAGNOSTICS rxq->ifr_cpu_exec_count[curcpu]++; #endif DBG_COUNTER_INC(task_fn_rxs); if (__predict_false(!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))) return; more = true; #ifdef DEV_NETMAP if (if_getcapenable(ctx->ifc_ifp) & IFCAP_NETMAP) { u_int work = 0; if (netmap_rx_irq(ctx->ifc_ifp, rxq->ifr_id, &work)) { more = false; } } #endif budget = ctx->ifc_sysctl_rx_budget; if (budget == 0) budget = 16; /* XXX */ if (more == false || (more = iflib_rxeof(rxq, budget)) == false) { if (ctx->ifc_flags & IFC_LEGACY) IFDI_INTR_ENABLE(ctx); else { DBG_COUNTER_INC(rx_intr_enables); rc = IFDI_RX_QUEUE_INTR_ENABLE(ctx, rxq->ifr_id); KASSERT(rc != ENOTSUP, ("MSI-X support requires queue_intr_enable, but not implemented in driver")); } } if (__predict_false(!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))) return; if (more) GROUPTASK_ENQUEUE(&rxq->ifr_task); } static void _task_fn_admin(void *context) { if_ctx_t ctx = context; if_softc_ctx_t sctx = &ctx->ifc_softc_ctx; iflib_txq_t txq; int i; + bool oactive, running, do_reset, do_watchdog; - if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)) { - if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_OACTIVE)) { - return; - } - } + STATE_LOCK(ctx); + running = (if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING); + oactive = (if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_OACTIVE); + 
do_reset = (ctx->ifc_flags & IFC_DO_RESET);
+	do_watchdog = (ctx->ifc_flags & IFC_DO_WATCHDOG);
+	ctx->ifc_flags &= ~(IFC_DO_RESET|IFC_DO_WATCHDOG);
+	STATE_UNLOCK(ctx);
+	if (!running && !oactive)
+		return;
+	CTX_LOCK(ctx);
	for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++) {
		CALLOUT_LOCK(txq);
		callout_stop(&txq->ift_timer);
		CALLOUT_UNLOCK(txq);
	}
+	if (do_watchdog) {
+		ctx->ifc_watchdog_events++;
+		IFDI_WATCHDOG_RESET(ctx);
+	}
	IFDI_UPDATE_ADMIN_STATUS(ctx);
	for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++)
		callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq, txq->ift_timer.c_cpu);
	IFDI_LINK_INTR_ENABLE(ctx);
-	if (ctx->ifc_flags & IFC_DO_RESET) {
-		ctx->ifc_flags &= ~IFC_DO_RESET;
+	if (do_reset)
		iflib_if_init_locked(ctx);
-	}
	CTX_UNLOCK(ctx);

	if (LINK_ACTIVE(ctx) == 0)
		return;
	for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++)
		iflib_txq_check_drain(txq, IFLIB_RESTART_BUDGET);
}

static void
_task_fn_iov(void *context)
{
	if_ctx_t ctx = context;

	if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))
		return;

	CTX_LOCK(ctx);
	IFDI_VFLR_HANDLE(ctx);
	CTX_UNLOCK(ctx);
}

static int
iflib_sysctl_int_delay(SYSCTL_HANDLER_ARGS)
{
	int err;
	if_int_delay_info_t info;
	if_ctx_t ctx;

	info = (if_int_delay_info_t)arg1;
	ctx = info->iidi_ctx;
	info->iidi_req = req;
	info->iidi_oidp = oidp;
	CTX_LOCK(ctx);
	err = IFDI_SYSCTL_INT_DELAY(ctx, info);
	CTX_UNLOCK(ctx);
	return (err);
}

/*********************************************************************
 *
 *  IFNET FUNCTIONS
 *
 **********************************************************************/

static void
iflib_if_init_locked(if_ctx_t ctx)
{
	iflib_stop(ctx);
	iflib_init_locked(ctx);
}

static void
iflib_if_init(void *arg)
{
	if_ctx_t ctx = arg;

	CTX_LOCK(ctx);
	iflib_if_init_locked(ctx);
	CTX_UNLOCK(ctx);
}

static int
iflib_if_transmit(if_t ifp, struct mbuf *m)
{
	if_ctx_t ctx = if_getsoftc(ifp);
	iflib_txq_t txq;
	int err, qidx;

	if (__predict_false((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0 ||
	    !LINK_ACTIVE(ctx))) {
		DBG_COUNTER_INC(tx_frees);
		m_freem(m);
		return (ENOBUFS);
	}

	MPASS(m->m_nextpkt == NULL);
	qidx = 0;
	if ((NTXQSETS(ctx) > 1) && M_HASHTYPE_GET(m))
		qidx = QIDX(ctx, m);
	/*
	 * XXX calculate buf_ring based on flowid (divvy up bits?)
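	 * One sketch (an assumption, not what QIDX() is guaranteed to do):
	 * reserve qset 0 for unhashed traffic and spread hashed flows with
	 *   qidx = 1 + (m->m_pkthdr.flowid % (NTXQSETS(ctx) - 1));
	 * splitting low flowid bits across multiple buf_rings per qset
	 * would then take a second modulus over the ring count.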
*/ txq = &ctx->ifc_txqs[qidx]; #ifdef DRIVER_BACKPRESSURE if (txq->ift_closed) { while (m != NULL) { next = m->m_nextpkt; m->m_nextpkt = NULL; m_freem(m); m = next; } return (ENOBUFS); } #endif #ifdef notyet qidx = count = 0; mp = marr; next = m; do { count++; next = next->m_nextpkt; } while (next != NULL); if (count > nitems(marr)) if ((mp = malloc(count*sizeof(struct mbuf *), M_IFLIB, M_NOWAIT)) == NULL) { /* XXX check nextpkt */ m_freem(m); /* XXX simplify for now */ DBG_COUNTER_INC(tx_frees); return (ENOBUFS); } for (next = m, i = 0; next != NULL; i++) { mp[i] = next; next = next->m_nextpkt; mp[i]->m_nextpkt = NULL; } #endif DBG_COUNTER_INC(tx_seen); err = ifmp_ring_enqueue(txq->ift_br, (void **)&m, 1, TX_BATCH_SIZE); GROUPTASK_ENQUEUE(&txq->ift_task); if (err) { /* support forthcoming later */ #ifdef DRIVER_BACKPRESSURE txq->ift_closed = TRUE; #endif ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE); m_freem(m); } return (err); } static void iflib_if_qflush(if_t ifp) { if_ctx_t ctx = if_getsoftc(ifp); iflib_txq_t txq = ctx->ifc_txqs; int i; - CTX_LOCK(ctx); + STATE_LOCK(ctx); ctx->ifc_flags |= IFC_QFLUSH; - CTX_UNLOCK(ctx); + STATE_UNLOCK(ctx); for (i = 0; i < NTXQSETS(ctx); i++, txq++) while (!(ifmp_ring_is_idle(txq->ift_br) || ifmp_ring_is_stalled(txq->ift_br))) iflib_txq_check_drain(txq, 0); - CTX_LOCK(ctx); + STATE_LOCK(ctx); ctx->ifc_flags &= ~IFC_QFLUSH; - CTX_UNLOCK(ctx); + STATE_UNLOCK(ctx); if_qflush(ifp); } #define IFCAP_FLAGS (IFCAP_TXCSUM_IPV6 | IFCAP_RXCSUM_IPV6 | IFCAP_HWCSUM | IFCAP_LRO | \ IFCAP_TSO4 | IFCAP_TSO6 | IFCAP_VLAN_HWTAGGING | IFCAP_HWSTATS | \ IFCAP_VLAN_MTU | IFCAP_VLAN_HWFILTER | IFCAP_VLAN_HWTSO) static int iflib_if_ioctl(if_t ifp, u_long command, caddr_t data) { if_ctx_t ctx = if_getsoftc(ifp); struct ifreq *ifr = (struct ifreq *)data; #if defined(INET) || defined(INET6) struct ifaddr *ifa = (struct ifaddr *)data; #endif bool avoid_reset = FALSE; int err = 0, reinit = 0, bits; switch (command) { case SIOCSIFADDR: #ifdef INET if (ifa->ifa_addr->sa_family == AF_INET) avoid_reset = TRUE; #endif #ifdef INET6 if (ifa->ifa_addr->sa_family == AF_INET6) avoid_reset = TRUE; #endif /* ** Calling init results in link renegotiation, ** so we avoid doing it when possible. 
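	** When avoid_reset is set we only mark the interface up here and
	** rely on the deferred reinit at the end of this function for the
	** case where the driver was not already running.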
*/ if (avoid_reset) { if_setflagbits(ifp, IFF_UP,0); if (!(if_getdrvflags(ifp)& IFF_DRV_RUNNING)) reinit = 1; #ifdef INET if (!(if_getflags(ifp) & IFF_NOARP)) arp_ifinit(ifp, ifa); #endif } else err = ether_ioctl(ifp, command, data); break; case SIOCSIFMTU: CTX_LOCK(ctx); if (ifr->ifr_mtu == if_getmtu(ifp)) { CTX_UNLOCK(ctx); break; } bits = if_getdrvflags(ifp); /* stop the driver and free any clusters before proceeding */ iflib_stop(ctx); if ((err = IFDI_MTU_SET(ctx, ifr->ifr_mtu)) == 0) { + STATE_LOCK(ctx); if (ifr->ifr_mtu > ctx->ifc_max_fl_buf_size) ctx->ifc_flags |= IFC_MULTISEG; else ctx->ifc_flags &= ~IFC_MULTISEG; + STATE_UNLOCK(ctx); err = if_setmtu(ifp, ifr->ifr_mtu); } iflib_init_locked(ctx); + STATE_LOCK(ctx); if_setdrvflags(ifp, bits); + STATE_UNLOCK(ctx); CTX_UNLOCK(ctx); break; case SIOCSIFFLAGS: CTX_LOCK(ctx); if (if_getflags(ifp) & IFF_UP) { if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { if ((if_getflags(ifp) ^ ctx->ifc_if_flags) & (IFF_PROMISC | IFF_ALLMULTI)) { err = IFDI_PROMISC_SET(ctx, if_getflags(ifp)); } } else reinit = 1; } else if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { iflib_stop(ctx); } ctx->ifc_if_flags = if_getflags(ifp); CTX_UNLOCK(ctx); break; case SIOCADDMULTI: case SIOCDELMULTI: if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { CTX_LOCK(ctx); IFDI_INTR_DISABLE(ctx); IFDI_MULTI_SET(ctx); IFDI_INTR_ENABLE(ctx); CTX_UNLOCK(ctx); } break; case SIOCSIFMEDIA: CTX_LOCK(ctx); IFDI_MEDIA_SET(ctx); CTX_UNLOCK(ctx); /* falls thru */ case SIOCGIFMEDIA: case SIOCGIFXMEDIA: err = ifmedia_ioctl(ifp, ifr, &ctx->ifc_media, command); break; case SIOCGI2C: { struct ifi2creq i2c; err = copyin(ifr_data_get_ptr(ifr), &i2c, sizeof(i2c)); if (err != 0) break; if (i2c.dev_addr != 0xA0 && i2c.dev_addr != 0xA2) { err = EINVAL; break; } if (i2c.len > sizeof(i2c.data)) { err = EINVAL; break; } if ((err = IFDI_I2C_REQ(ctx, &i2c)) == 0) err = copyout(&i2c, ifr_data_get_ptr(ifr), sizeof(i2c)); break; } case SIOCSIFCAP: { int mask, setmask; mask = ifr->ifr_reqcap ^ if_getcapenable(ifp); setmask = 0; #ifdef TCP_OFFLOAD setmask |= mask & (IFCAP_TOE4|IFCAP_TOE6); #endif setmask |= (mask & IFCAP_FLAGS); if (setmask & (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6)) setmask |= (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6); if ((mask & IFCAP_WOL) && (if_getcapabilities(ifp) & IFCAP_WOL) != 0) setmask |= (mask & (IFCAP_WOL_MCAST|IFCAP_WOL_MAGIC)); if_vlancap(ifp); /* * want to ensure that traffic has stopped before we change any of the flags */ if (setmask) { CTX_LOCK(ctx); bits = if_getdrvflags(ifp); if (bits & IFF_DRV_RUNNING) iflib_stop(ctx); + STATE_LOCK(ctx); if_togglecapenable(ifp, setmask); + STATE_UNLOCK(ctx); if (bits & IFF_DRV_RUNNING) iflib_init_locked(ctx); + STATE_LOCK(ctx); if_setdrvflags(ifp, bits); + STATE_UNLOCK(ctx); CTX_UNLOCK(ctx); } break; } case SIOCGPRIVATE_0: case SIOCSDRVSPEC: case SIOCGDRVSPEC: CTX_LOCK(ctx); err = IFDI_PRIV_IOCTL(ctx, command, data); CTX_UNLOCK(ctx); break; default: err = ether_ioctl(ifp, command, data); break; } if (reinit) iflib_if_init(ctx); return (err); } static uint64_t iflib_if_get_counter(if_t ifp, ift_counter cnt) { if_ctx_t ctx = if_getsoftc(ifp); return (IFDI_GET_COUNTER(ctx, cnt)); } /********************************************************************* * * OTHER FUNCTIONS EXPORTED TO THE STACK * **********************************************************************/ static void iflib_vlan_register(void *arg, if_t ifp, uint16_t vtag) { if_ctx_t ctx = if_getsoftc(ifp); if ((void *)ctx != arg) return; if ((vtag == 0) || (vtag > 4095)) return; CTX_LOCK(ctx); 
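	/*
	 * Hand the tag down to the driver; when hardware VLAN filtering
	 * is enabled, the reinit below reloads the updated filter table.
	 */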
IFDI_VLAN_REGISTER(ctx, vtag); /* Re-init to load the changes */ if (if_getcapenable(ifp) & IFCAP_VLAN_HWFILTER) iflib_if_init_locked(ctx); CTX_UNLOCK(ctx); } static void iflib_vlan_unregister(void *arg, if_t ifp, uint16_t vtag) { if_ctx_t ctx = if_getsoftc(ifp); if ((void *)ctx != arg) return; if ((vtag == 0) || (vtag > 4095)) return; CTX_LOCK(ctx); IFDI_VLAN_UNREGISTER(ctx, vtag); /* Re-init to load the changes */ if (if_getcapenable(ifp) & IFCAP_VLAN_HWFILTER) iflib_if_init_locked(ctx); CTX_UNLOCK(ctx); } static void iflib_led_func(void *arg, int onoff) { if_ctx_t ctx = arg; CTX_LOCK(ctx); IFDI_LED_FUNC(ctx, onoff); CTX_UNLOCK(ctx); } /********************************************************************* * * BUS FUNCTION DEFINITIONS * **********************************************************************/ int iflib_device_probe(device_t dev) { pci_vendor_info_t *ent; uint16_t pci_vendor_id, pci_device_id; uint16_t pci_subvendor_id, pci_subdevice_id; uint16_t pci_rev_id; if_shared_ctx_t sctx; if ((sctx = DEVICE_REGISTER(dev)) == NULL || sctx->isc_magic != IFLIB_MAGIC) return (ENOTSUP); pci_vendor_id = pci_get_vendor(dev); pci_device_id = pci_get_device(dev); pci_subvendor_id = pci_get_subvendor(dev); pci_subdevice_id = pci_get_subdevice(dev); pci_rev_id = pci_get_revid(dev); if (sctx->isc_parse_devinfo != NULL) sctx->isc_parse_devinfo(&pci_device_id, &pci_subvendor_id, &pci_subdevice_id, &pci_rev_id); ent = sctx->isc_vendor_info; while (ent->pvi_vendor_id != 0) { if (pci_vendor_id != ent->pvi_vendor_id) { ent++; continue; } if ((pci_device_id == ent->pvi_device_id) && ((pci_subvendor_id == ent->pvi_subvendor_id) || (ent->pvi_subvendor_id == 0)) && ((pci_subdevice_id == ent->pvi_subdevice_id) || (ent->pvi_subdevice_id == 0)) && ((pci_rev_id == ent->pvi_rev_id) || (ent->pvi_rev_id == 0))) { device_set_desc_copy(dev, ent->pvi_name); /* this needs to be changed to zero if the bus probing code * ever stops re-probing on best match because the sctx * may have its values over written by register calls * in subsequent probes */ return (BUS_PROBE_DEFAULT); } ent++; } return (ENXIO); } int iflib_device_register(device_t dev, void *sc, if_shared_ctx_t sctx, if_ctx_t *ctxp) { int err, rid, msix, msix_bar; if_ctx_t ctx; if_t ifp; if_softc_ctx_t scctx; int i; uint16_t main_txq; uint16_t main_rxq; ctx = malloc(sizeof(* ctx), M_IFLIB, M_WAITOK|M_ZERO); if (sc == NULL) { sc = malloc(sctx->isc_driver->size, M_IFLIB, M_WAITOK|M_ZERO); device_set_softc(dev, ctx); ctx->ifc_flags |= IFC_SC_ALLOCATED; } ctx->ifc_sctx = sctx; ctx->ifc_dev = dev; ctx->ifc_softc = sc; if ((err = iflib_register(ctx)) != 0) { device_printf(dev, "iflib_register failed %d\n", err); return (err); } iflib_add_device_sysctl_pre(ctx); scctx = &ctx->ifc_softc_ctx; ifp = ctx->ifc_ifp; /* * XXX sanity check that ntxd & nrxd are a power of 2 */ if (ctx->ifc_sysctl_ntxqs != 0) scctx->isc_ntxqsets = ctx->ifc_sysctl_ntxqs; if (ctx->ifc_sysctl_nrxqs != 0) scctx->isc_nrxqsets = ctx->ifc_sysctl_nrxqs; for (i = 0; i < sctx->isc_ntxqs; i++) { if (ctx->ifc_sysctl_ntxds[i] != 0) scctx->isc_ntxd[i] = ctx->ifc_sysctl_ntxds[i]; else scctx->isc_ntxd[i] = sctx->isc_ntxd_default[i]; } for (i = 0; i < sctx->isc_nrxqs; i++) { if (ctx->ifc_sysctl_nrxds[i] != 0) scctx->isc_nrxd[i] = ctx->ifc_sysctl_nrxds[i]; else scctx->isc_nrxd[i] = sctx->isc_nrxd_default[i]; } for (i = 0; i < sctx->isc_nrxqs; i++) { if (scctx->isc_nrxd[i] < sctx->isc_nrxd_min[i]) { device_printf(dev, "nrxd%d: %d less than nrxd_min %d - resetting to min\n", i, scctx->isc_nrxd[i], 
sctx->isc_nrxd_min[i]); scctx->isc_nrxd[i] = sctx->isc_nrxd_min[i]; } if (scctx->isc_nrxd[i] > sctx->isc_nrxd_max[i]) { device_printf(dev, "nrxd%d: %d greater than nrxd_max %d - resetting to max\n", i, scctx->isc_nrxd[i], sctx->isc_nrxd_max[i]); scctx->isc_nrxd[i] = sctx->isc_nrxd_max[i]; } } for (i = 0; i < sctx->isc_ntxqs; i++) { if (scctx->isc_ntxd[i] < sctx->isc_ntxd_min[i]) { device_printf(dev, "ntxd%d: %d less than ntxd_min %d - resetting to min\n", i, scctx->isc_ntxd[i], sctx->isc_ntxd_min[i]); scctx->isc_ntxd[i] = sctx->isc_ntxd_min[i]; } if (scctx->isc_ntxd[i] > sctx->isc_ntxd_max[i]) { device_printf(dev, "ntxd%d: %d greater than ntxd_max %d - resetting to max\n", i, scctx->isc_ntxd[i], sctx->isc_ntxd_max[i]); scctx->isc_ntxd[i] = sctx->isc_ntxd_max[i]; } } if ((err = IFDI_ATTACH_PRE(ctx)) != 0) { device_printf(dev, "IFDI_ATTACH_PRE failed %d\n", err); return (err); } _iflib_pre_assert(scctx); ctx->ifc_txrx = *scctx->isc_txrx; #ifdef INVARIANTS MPASS(scctx->isc_capenable); if (scctx->isc_capenable & IFCAP_TXCSUM) MPASS(scctx->isc_tx_csum_flags); #endif if_setcapabilities(ifp, scctx->isc_capenable | IFCAP_HWSTATS); if_setcapenable(ifp, scctx->isc_capenable | IFCAP_HWSTATS); if (scctx->isc_ntxqsets == 0 || (scctx->isc_ntxqsets_max && scctx->isc_ntxqsets_max < scctx->isc_ntxqsets)) scctx->isc_ntxqsets = scctx->isc_ntxqsets_max; if (scctx->isc_nrxqsets == 0 || (scctx->isc_nrxqsets_max && scctx->isc_nrxqsets_max < scctx->isc_nrxqsets)) scctx->isc_nrxqsets = scctx->isc_nrxqsets_max; #ifdef ACPI_DMAR if (dmar_get_dma_tag(device_get_parent(dev), dev) != NULL) ctx->ifc_flags |= IFC_DMAR; #elif !(defined(__i386__) || defined(__amd64__)) /* set unconditionally for !x86 */ ctx->ifc_flags |= IFC_DMAR; #endif msix_bar = scctx->isc_msix_bar; main_txq = (sctx->isc_flags & IFLIB_HAS_TXCQ) ? 1 : 0; main_rxq = (sctx->isc_flags & IFLIB_HAS_RXCQ) ? 1 : 0; /* XXX change for per-queue sizes */ device_printf(dev, "using %d tx descriptors and %d rx descriptors\n", scctx->isc_ntxd[main_txq], scctx->isc_nrxd[main_rxq]); for (i = 0; i < sctx->isc_nrxqs; i++) { if (!powerof2(scctx->isc_nrxd[i])) { /* round down instead? */ device_printf(dev, "# rx descriptors must be a power of 2\n"); err = EINVAL; goto fail; } } for (i = 0; i < sctx->isc_ntxqs; i++) { if (!powerof2(scctx->isc_ntxd[i])) { device_printf(dev, "# tx descriptors must be a power of 2"); err = EINVAL; goto fail; } } if (scctx->isc_tx_nsegments > scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION) scctx->isc_tx_nsegments = max(1, scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION); if (scctx->isc_tx_tso_segments_max > scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION) scctx->isc_tx_tso_segments_max = max(1, scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION); /* * Protect the stack against modern hardware */ if (scctx->isc_tx_tso_size_max > FREEBSD_TSO_SIZE_MAX) scctx->isc_tx_tso_size_max = FREEBSD_TSO_SIZE_MAX; /* TSO parameters - dig these out of the data sheet - simply correspond to tag setup */ ifp->if_hw_tsomaxsegcount = scctx->isc_tx_tso_segments_max; ifp->if_hw_tsomax = scctx->isc_tx_tso_size_max; ifp->if_hw_tsomaxsegsize = scctx->isc_tx_tso_segsize_max; if (scctx->isc_rss_table_size == 0) scctx->isc_rss_table_size = 64; scctx->isc_rss_table_mask = scctx->isc_rss_table_size-1; GROUPTASK_INIT(&ctx->ifc_admin_task, 0, _task_fn_admin, ctx); /* XXX format name */ taskqgroup_attach(qgroup_if_config_tqg, &ctx->ifc_admin_task, ctx, -1, "admin"); /* Set up cpu set. If it fails, use the set of all CPUs. 
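	 * The set fetched here is what the interrupt and taskqueue
	 * affinity code later spreads the queues across; all_cpus is a
	 * functional, if unoptimized, fallback.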
*/ if (bus_get_cpus(dev, INTR_CPUS, sizeof(ctx->ifc_cpus), &ctx->ifc_cpus) != 0) { device_printf(dev, "Unable to fetch CPU list\n"); CPU_COPY(&all_cpus, &ctx->ifc_cpus); } MPASS(CPU_COUNT(&ctx->ifc_cpus) > 0); /* ** Now setup MSI or MSI/X, should ** return us the number of supported ** vectors. (Will be 1 for MSI) */ if (sctx->isc_flags & IFLIB_SKIP_MSIX) { msix = scctx->isc_vectors; } else if (scctx->isc_msix_bar != 0) /* * The simple fact that isc_msix_bar is not 0 does not mean we * we have a good value there that is known to work. */ msix = iflib_msix_init(ctx); else { scctx->isc_vectors = 1; scctx->isc_ntxqsets = 1; scctx->isc_nrxqsets = 1; scctx->isc_intr = IFLIB_INTR_LEGACY; msix = 0; } /* Get memory for the station queues */ if ((err = iflib_queues_alloc(ctx))) { device_printf(dev, "Unable to allocate queue memory\n"); goto fail; } if ((err = iflib_qset_structures_setup(ctx))) { device_printf(dev, "qset structure setup failed %d\n", err); goto fail_queues; } /* * Group taskqueues aren't properly set up until SMP is started, * so we disable interrupts until we can handle them post * SI_SUB_SMP. * * XXX: disabling interrupts doesn't actually work, at least for * the non-MSI case. When they occur before SI_SUB_SMP completes, * we do null handling and depend on this not causing too large an * interrupt storm. */ IFDI_INTR_DISABLE(ctx); if (msix > 1 && (err = IFDI_MSIX_INTR_ASSIGN(ctx, msix)) != 0) { device_printf(dev, "IFDI_MSIX_INTR_ASSIGN failed %d\n", err); goto fail_intr_free; } if (msix <= 1) { rid = 0; if (scctx->isc_intr == IFLIB_INTR_MSI) { MPASS(msix == 1); rid = 1; } if ((err = iflib_legacy_setup(ctx, ctx->isc_legacy_intr, ctx->ifc_softc, &rid, "irq0")) != 0) { device_printf(dev, "iflib_legacy_setup failed %d\n", err); goto fail_intr_free; } } ether_ifattach(ctx->ifc_ifp, ctx->ifc_mac); if ((err = IFDI_ATTACH_POST(ctx)) != 0) { device_printf(dev, "IFDI_ATTACH_POST failed %d\n", err); goto fail_detach; } if ((err = iflib_netmap_attach(ctx))) { device_printf(ctx->ifc_dev, "netmap attach failed: %d\n", err); goto fail_detach; } *ctxp = ctx; NETDUMP_SET(ctx->ifc_ifp, iflib); if_setgetcounterfn(ctx->ifc_ifp, iflib_if_get_counter); iflib_add_device_sysctl_post(ctx); ctx->ifc_flags |= IFC_INIT_DONE; return (0); fail_detach: ether_ifdetach(ctx->ifc_ifp); fail_intr_free: if (scctx->isc_intr == IFLIB_INTR_MSIX || scctx->isc_intr == IFLIB_INTR_MSI) pci_release_msi(ctx->ifc_dev); fail_queues: /* XXX free queues */ fail: IFDI_DETACH(ctx); return (err); } int iflib_device_attach(device_t dev) { if_ctx_t ctx; if_shared_ctx_t sctx; if ((sctx = DEVICE_REGISTER(dev)) == NULL || sctx->isc_magic != IFLIB_MAGIC) return (ENOTSUP); pci_enable_busmaster(dev); return (iflib_device_register(dev, NULL, sctx, &ctx)); } int iflib_device_deregister(if_ctx_t ctx) { if_t ifp = ctx->ifc_ifp; iflib_txq_t txq; iflib_rxq_t rxq; device_t dev = ctx->ifc_dev; int i, j; struct taskqgroup *tqg; iflib_fl_t fl; /* Make sure VLANS are not using driver */ if (if_vlantrunkinuse(ifp)) { device_printf(dev,"Vlan in use, detach first\n"); return (EBUSY); } CTX_LOCK(ctx); ctx->ifc_in_detach = 1; iflib_stop(ctx); CTX_UNLOCK(ctx); /* Unregister VLAN events */ if (ctx->ifc_vlan_attach_event != NULL) EVENTHANDLER_DEREGISTER(vlan_config, ctx->ifc_vlan_attach_event); if (ctx->ifc_vlan_detach_event != NULL) EVENTHANDLER_DEREGISTER(vlan_unconfig, ctx->ifc_vlan_detach_event); iflib_netmap_detach(ifp); ether_ifdetach(ifp); /* ether_ifdetach calls if_qflush - lock must be destroy afterwards*/ CTX_LOCK_DESTROY(ctx); if (ctx->ifc_led_dev 
!= NULL) led_destroy(ctx->ifc_led_dev); /* XXX drain any dependent tasks */ tqg = qgroup_if_io_tqg; for (txq = ctx->ifc_txqs, i = 0; i < NTXQSETS(ctx); i++, txq++) { callout_drain(&txq->ift_timer); if (txq->ift_task.gt_uniq != NULL) taskqgroup_detach(tqg, &txq->ift_task); } for (i = 0, rxq = ctx->ifc_rxqs; i < NRXQSETS(ctx); i++, rxq++) { if (rxq->ifr_task.gt_uniq != NULL) taskqgroup_detach(tqg, &rxq->ifr_task); for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) free(fl->ifl_rx_bitmap, M_IFLIB); } tqg = qgroup_if_config_tqg; if (ctx->ifc_admin_task.gt_uniq != NULL) taskqgroup_detach(tqg, &ctx->ifc_admin_task); if (ctx->ifc_vflr_task.gt_uniq != NULL) taskqgroup_detach(tqg, &ctx->ifc_vflr_task); IFDI_DETACH(ctx); device_set_softc(ctx->ifc_dev, NULL); if (ctx->ifc_softc_ctx.isc_intr != IFLIB_INTR_LEGACY) { pci_release_msi(dev); } if (ctx->ifc_softc_ctx.isc_intr != IFLIB_INTR_MSIX) { iflib_irq_free(ctx, &ctx->ifc_legacy_irq); } if (ctx->ifc_msix_mem != NULL) { bus_release_resource(ctx->ifc_dev, SYS_RES_MEMORY, ctx->ifc_softc_ctx.isc_msix_bar, ctx->ifc_msix_mem); ctx->ifc_msix_mem = NULL; } bus_generic_detach(dev); if_free(ifp); iflib_tx_structures_free(ctx); iflib_rx_structures_free(ctx); if (ctx->ifc_flags & IFC_SC_ALLOCATED) free(ctx->ifc_softc, M_IFLIB); free(ctx, M_IFLIB); return (0); } int iflib_device_detach(device_t dev) { if_ctx_t ctx = device_get_softc(dev); return (iflib_device_deregister(ctx)); } int iflib_device_suspend(device_t dev) { if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); IFDI_SUSPEND(ctx); CTX_UNLOCK(ctx); return bus_generic_suspend(dev); } int iflib_device_shutdown(device_t dev) { if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); IFDI_SHUTDOWN(ctx); CTX_UNLOCK(ctx); return bus_generic_suspend(dev); } int iflib_device_resume(device_t dev) { if_ctx_t ctx = device_get_softc(dev); iflib_txq_t txq = ctx->ifc_txqs; CTX_LOCK(ctx); IFDI_RESUME(ctx); iflib_init_locked(ctx); CTX_UNLOCK(ctx); for (int i = 0; i < NTXQSETS(ctx); i++, txq++) iflib_txq_check_drain(txq, IFLIB_RESTART_BUDGET); return (bus_generic_resume(dev)); } int iflib_device_iov_init(device_t dev, uint16_t num_vfs, const nvlist_t *params) { int error; if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); error = IFDI_IOV_INIT(ctx, num_vfs, params); CTX_UNLOCK(ctx); return (error); } void iflib_device_iov_uninit(device_t dev) { if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); IFDI_IOV_UNINIT(ctx); CTX_UNLOCK(ctx); } int iflib_device_iov_add_vf(device_t dev, uint16_t vfnum, const nvlist_t *params) { int error; if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); error = IFDI_IOV_VF_ADD(ctx, vfnum, params); CTX_UNLOCK(ctx); return (error); } /********************************************************************* * * MODULE FUNCTION DEFINITIONS * **********************************************************************/ /* * - Start a fast taskqueue thread for each core * - Start a taskqueue for control operations */ static int iflib_module_init(void) { return (0); } static int iflib_module_event_handler(module_t mod, int what, void *arg) { int err; switch (what) { case MOD_LOAD: if ((err = iflib_module_init()) != 0) return (err); break; case MOD_UNLOAD: return (EBUSY); default: return (EOPNOTSUPP); } return (0); } /********************************************************************* * * PUBLIC FUNCTION DEFINITIONS * ordered as in iflib.h * **********************************************************************/ static void _iflib_assert(if_shared_ctx_t sctx) { MPASS(sctx->isc_tx_maxsize); 
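	/*
	 * The checks here and below cover the DMA sizing and per-queue
	 * descriptor limits a driver must provide before registering;
	 * MPASS only fires under INVARIANTS, so misconfiguration fails
	 * fast in debug kernels.
	 */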
MPASS(sctx->isc_tx_maxsegsize); MPASS(sctx->isc_rx_maxsize); MPASS(sctx->isc_rx_nsegments); MPASS(sctx->isc_rx_maxsegsize); MPASS(sctx->isc_nrxd_min[0]); MPASS(sctx->isc_nrxd_max[0]); MPASS(sctx->isc_nrxd_default[0]); MPASS(sctx->isc_ntxd_min[0]); MPASS(sctx->isc_ntxd_max[0]); MPASS(sctx->isc_ntxd_default[0]); } static void _iflib_pre_assert(if_softc_ctx_t scctx) { MPASS(scctx->isc_txrx->ift_txd_encap); MPASS(scctx->isc_txrx->ift_txd_flush); MPASS(scctx->isc_txrx->ift_txd_credits_update); MPASS(scctx->isc_txrx->ift_rxd_available); MPASS(scctx->isc_txrx->ift_rxd_pkt_get); MPASS(scctx->isc_txrx->ift_rxd_refill); MPASS(scctx->isc_txrx->ift_rxd_flush); } static int iflib_register(if_ctx_t ctx) { if_shared_ctx_t sctx = ctx->ifc_sctx; driver_t *driver = sctx->isc_driver; device_t dev = ctx->ifc_dev; if_t ifp; _iflib_assert(sctx); CTX_LOCK_INIT(ctx, device_get_nameunit(ctx->ifc_dev)); ifp = ctx->ifc_ifp = if_gethandle(IFT_ETHER); if (ifp == NULL) { device_printf(dev, "can not allocate ifnet structure\n"); return (ENOMEM); } /* * Initialize our context's device specific methods */ kobj_init((kobj_t) ctx, (kobj_class_t) driver); kobj_class_compile((kobj_class_t) driver); driver->refs++; if_initname(ifp, device_get_name(dev), device_get_unit(dev)); if_setsoftc(ifp, ctx); if_setdev(ifp, dev); if_setinitfn(ifp, iflib_if_init); if_setioctlfn(ifp, iflib_if_ioctl); if_settransmitfn(ifp, iflib_if_transmit); if_setqflushfn(ifp, iflib_if_qflush); if_setflags(ifp, IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST); ctx->ifc_vlan_attach_event = EVENTHANDLER_REGISTER(vlan_config, iflib_vlan_register, ctx, EVENTHANDLER_PRI_FIRST); ctx->ifc_vlan_detach_event = EVENTHANDLER_REGISTER(vlan_unconfig, iflib_vlan_unregister, ctx, EVENTHANDLER_PRI_FIRST); ifmedia_init(&ctx->ifc_media, IFM_IMASK, iflib_media_change, iflib_media_status); return (0); } static int iflib_queues_alloc(if_ctx_t ctx) { if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = ctx->ifc_dev; int nrxqsets = scctx->isc_nrxqsets; int ntxqsets = scctx->isc_ntxqsets; iflib_txq_t txq; iflib_rxq_t rxq; iflib_fl_t fl = NULL; int i, j, cpu, err, txconf, rxconf; iflib_dma_info_t ifdip; uint32_t *rxqsizes = scctx->isc_rxqsizes; uint32_t *txqsizes = scctx->isc_txqsizes; uint8_t nrxqs = sctx->isc_nrxqs; uint8_t ntxqs = sctx->isc_ntxqs; int nfree_lists = sctx->isc_nfl ? 
sctx->isc_nfl : 1; caddr_t *vaddrs; uint64_t *paddrs; struct ifmp_ring **brscp; KASSERT(ntxqs > 0, ("number of queues per qset must be at least 1")); KASSERT(nrxqs > 0, ("number of queues per qset must be at least 1")); brscp = NULL; txq = NULL; rxq = NULL; /* Allocate the TX ring struct memory */ if (!(txq = (iflib_txq_t) malloc(sizeof(struct iflib_txq) * ntxqsets, M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate TX ring memory\n"); err = ENOMEM; goto fail; } /* Now allocate the RX */ if (!(rxq = (iflib_rxq_t) malloc(sizeof(struct iflib_rxq) * nrxqsets, M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate RX ring memory\n"); err = ENOMEM; goto rx_fail; } ctx->ifc_txqs = txq; ctx->ifc_rxqs = rxq; /* * XXX handle allocation failure */ for (txconf = i = 0, cpu = CPU_FIRST(); i < ntxqsets; i++, txconf++, txq++, cpu = CPU_NEXT(cpu)) { /* Set up some basics */ if ((ifdip = malloc(sizeof(struct iflib_dma_info) * ntxqs, M_IFLIB, M_WAITOK|M_ZERO)) == NULL) { device_printf(dev, "failed to allocate iflib_dma_info\n"); err = ENOMEM; goto err_tx_desc; } txq->ift_ifdi = ifdip; for (j = 0; j < ntxqs; j++, ifdip++) { if (iflib_dma_alloc(ctx, txqsizes[j], ifdip, BUS_DMA_NOWAIT)) { device_printf(dev, "Unable to allocate Descriptor memory\n"); err = ENOMEM; goto err_tx_desc; } txq->ift_txd_size[j] = scctx->isc_txd_size[j]; bzero((void *)ifdip->idi_vaddr, txqsizes[j]); } txq->ift_ctx = ctx; txq->ift_id = i; if (sctx->isc_flags & IFLIB_HAS_TXCQ) { txq->ift_br_offset = 1; } else { txq->ift_br_offset = 0; } /* XXX fix this */ txq->ift_timer.c_cpu = cpu; if (iflib_txsd_alloc(txq)) { device_printf(dev, "Critical Failure setting up TX buffers\n"); err = ENOMEM; goto err_tx_desc; } /* Initialize the TX lock */ snprintf(txq->ift_mtx_name, MTX_NAME_LEN, "%s:tx(%d):callout", device_get_nameunit(dev), txq->ift_id); mtx_init(&txq->ift_mtx, txq->ift_mtx_name, NULL, MTX_DEF); callout_init_mtx(&txq->ift_timer, &txq->ift_mtx, 0); snprintf(txq->ift_db_mtx_name, MTX_NAME_LEN, "%s:tx(%d):db", device_get_nameunit(dev), txq->ift_id); err = ifmp_ring_alloc(&txq->ift_br, 2048, txq, iflib_txq_drain, iflib_txq_can_drain, M_IFLIB, M_WAITOK); if (err) { /* XXX free any allocated rings */ device_printf(dev, "Unable to allocate buf_ring\n"); goto err_tx_desc; } } for (rxconf = i = 0; i < nrxqsets; i++, rxconf++, rxq++) { /* Set up some basics */ if ((ifdip = malloc(sizeof(struct iflib_dma_info) * nrxqs, M_IFLIB, M_WAITOK|M_ZERO)) == NULL) { device_printf(dev, "failed to allocate iflib_dma_info\n"); err = ENOMEM; goto err_tx_desc; } rxq->ifr_ifdi = ifdip; /* XXX this needs to be changed if #rx queues != #tx queues */ rxq->ifr_ntxqirq = 1; rxq->ifr_txqid[0] = i; for (j = 0; j < nrxqs; j++, ifdip++) { if (iflib_dma_alloc(ctx, rxqsizes[j], ifdip, BUS_DMA_NOWAIT)) { device_printf(dev, "Unable to allocate Descriptor memory\n"); err = ENOMEM; goto err_tx_desc; } bzero((void *)ifdip->idi_vaddr, rxqsizes[j]); } rxq->ifr_ctx = ctx; rxq->ifr_id = i; if (sctx->isc_flags & IFLIB_HAS_RXCQ) { rxq->ifr_fl_offset = 1; } else { rxq->ifr_fl_offset = 0; } rxq->ifr_nfl = nfree_lists; if (!(fl = (iflib_fl_t) malloc(sizeof(struct iflib_fl) * nfree_lists, M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate free list memory\n"); err = ENOMEM; goto err_tx_desc; } rxq->ifr_fl = fl; for (j = 0; j < nfree_lists; j++) { fl[j].ifl_rxq = rxq; fl[j].ifl_id = j; fl[j].ifl_ifdi = &rxq->ifr_ifdi[j + rxq->ifr_fl_offset]; fl[j].ifl_rxd_size = scctx->isc_rxd_size[j]; } /* Allocate receive buffers for the ring*/ if 
(iflib_rxsd_alloc(rxq)) { device_printf(dev, "Critical Failure setting up receive buffers\n"); err = ENOMEM; goto err_rx_desc; } for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) fl->ifl_rx_bitmap = bit_alloc(fl->ifl_size, M_IFLIB, M_WAITOK|M_ZERO); } /* TXQs */ vaddrs = malloc(sizeof(caddr_t)*ntxqsets*ntxqs, M_IFLIB, M_WAITOK); paddrs = malloc(sizeof(uint64_t)*ntxqsets*ntxqs, M_IFLIB, M_WAITOK); for (i = 0; i < ntxqsets; i++) { iflib_dma_info_t di = ctx->ifc_txqs[i].ift_ifdi; for (j = 0; j < ntxqs; j++, di++) { vaddrs[i*ntxqs + j] = di->idi_vaddr; paddrs[i*ntxqs + j] = di->idi_paddr; } } if ((err = IFDI_TX_QUEUES_ALLOC(ctx, vaddrs, paddrs, ntxqs, ntxqsets)) != 0) { device_printf(ctx->ifc_dev, "device queue allocation failed\n"); iflib_tx_structures_free(ctx); free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); goto err_rx_desc; } free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); /* RXQs */ vaddrs = malloc(sizeof(caddr_t)*nrxqsets*nrxqs, M_IFLIB, M_WAITOK); paddrs = malloc(sizeof(uint64_t)*nrxqsets*nrxqs, M_IFLIB, M_WAITOK); for (i = 0; i < nrxqsets; i++) { iflib_dma_info_t di = ctx->ifc_rxqs[i].ifr_ifdi; for (j = 0; j < nrxqs; j++, di++) { vaddrs[i*nrxqs + j] = di->idi_vaddr; paddrs[i*nrxqs + j] = di->idi_paddr; } } if ((err = IFDI_RX_QUEUES_ALLOC(ctx, vaddrs, paddrs, nrxqs, nrxqsets)) != 0) { device_printf(ctx->ifc_dev, "device queue allocation failed\n"); iflib_tx_structures_free(ctx); free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); goto err_rx_desc; } free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); return (0); /* XXX handle allocation failure changes */ err_rx_desc: err_tx_desc: if (ctx->ifc_rxqs != NULL) free(ctx->ifc_rxqs, M_IFLIB); ctx->ifc_rxqs = NULL; if (ctx->ifc_txqs != NULL) free(ctx->ifc_txqs, M_IFLIB); ctx->ifc_txqs = NULL; rx_fail: if (brscp != NULL) free(brscp, M_IFLIB); if (rxq != NULL) free(rxq, M_IFLIB); if (txq != NULL) free(txq, M_IFLIB); fail: return (err); } static int iflib_tx_structures_setup(if_ctx_t ctx) { iflib_txq_t txq = ctx->ifc_txqs; int i; for (i = 0; i < NTXQSETS(ctx); i++, txq++) iflib_txq_setup(txq); return (0); } static void iflib_tx_structures_free(if_ctx_t ctx) { iflib_txq_t txq = ctx->ifc_txqs; int i, j; for (i = 0; i < NTXQSETS(ctx); i++, txq++) { iflib_txq_destroy(txq); for (j = 0; j < ctx->ifc_nhwtxqs; j++) iflib_dma_free(&txq->ift_ifdi[j]); } free(ctx->ifc_txqs, M_IFLIB); ctx->ifc_txqs = NULL; IFDI_QUEUES_FREE(ctx); } /********************************************************************* * * Initialize all receive rings. * **********************************************************************/ static int iflib_rx_structures_setup(if_ctx_t ctx) { iflib_rxq_t rxq = ctx->ifc_rxqs; int q; #if defined(INET6) || defined(INET) int i, err; #endif for (q = 0; q < ctx->ifc_softc_ctx.isc_nrxqsets; q++, rxq++) { #if defined(INET6) || defined(INET) tcp_lro_free(&rxq->ifr_lc); if ((err = tcp_lro_init_args(&rxq->ifr_lc, ctx->ifc_ifp, TCP_LRO_ENTRIES, min(1024, ctx->ifc_softc_ctx.isc_nrxd[rxq->ifr_fl_offset]))) != 0) { device_printf(ctx->ifc_dev, "LRO Initialization failed!\n"); goto fail; } rxq->ifr_lro_enabled = TRUE; #endif IFDI_RXQ_SETUP(ctx, rxq->ifr_id); } return (0); #if defined(INET6) || defined(INET) fail: /* * Free RX software descriptors allocated so far, we will only handle * the rings that completed, the failing case will have * cleaned up for itself. 'q' failed, so its the terminus. 
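	 * Only qsets 0 .. q-1 own initialized LRO state and software
	 * descriptors at this point, which is why the loop below stops
	 * at 'q'.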
*/ rxq = ctx->ifc_rxqs; for (i = 0; i < q; ++i, rxq++) { iflib_rx_sds_free(rxq); rxq->ifr_cq_gen = rxq->ifr_cq_cidx = rxq->ifr_cq_pidx = 0; } return (err); #endif } /********************************************************************* * * Free all receive rings. * **********************************************************************/ static void iflib_rx_structures_free(if_ctx_t ctx) { iflib_rxq_t rxq = ctx->ifc_rxqs; for (int i = 0; i < ctx->ifc_softc_ctx.isc_nrxqsets; i++, rxq++) { iflib_rx_sds_free(rxq); } } static int iflib_qset_structures_setup(if_ctx_t ctx) { int err; if ((err = iflib_tx_structures_setup(ctx)) != 0) return (err); if ((err = iflib_rx_structures_setup(ctx)) != 0) { device_printf(ctx->ifc_dev, "iflib_rx_structures_setup failed: %d\n", err); iflib_tx_structures_free(ctx); iflib_rx_structures_free(ctx); } return (err); } int iflib_irq_alloc(if_ctx_t ctx, if_irq_t irq, int rid, driver_filter_t filter, void *filter_arg, driver_intr_t handler, void *arg, char *name) { return (_iflib_irq_alloc(ctx, irq, rid, filter, handler, arg, name)); } #ifdef SMP static int find_nth(if_ctx_t ctx, int qid) { cpuset_t cpus; int i, cpuid, eqid, count; CPU_COPY(&ctx->ifc_cpus, &cpus); count = CPU_COUNT(&cpus); eqid = qid % count; /* clear up to the qid'th bit */ for (i = 0; i < eqid; i++) { cpuid = CPU_FFS(&cpus); MPASS(cpuid != 0); CPU_CLR(cpuid-1, &cpus); } cpuid = CPU_FFS(&cpus); MPASS(cpuid != 0); return (cpuid-1); } #ifdef SCHED_ULE extern struct cpu_group *cpu_top; /* CPU topology */ static int find_child_with_core(int cpu, struct cpu_group *grp) { int i; if (grp->cg_children == 0) return -1; MPASS(grp->cg_child); for (i = 0; i < grp->cg_children; i++) { if (CPU_ISSET(cpu, &grp->cg_child[i].cg_mask)) return i; } return -1; } /* * Find the nth thread on the specified core */ static int find_thread(int cpu, int thread_num) { struct cpu_group *grp; int i; cpuset_t cs; grp = cpu_top; if (grp == NULL) return cpu; i = 0; while ((i = find_child_with_core(cpu, grp)) != -1) { /* If the child only has one cpu, don't descend */ if (grp->cg_child[i].cg_count <= 1) break; grp = &grp->cg_child[i]; } /* If they don't share at least an L2 cache, use the same CPU */ if (grp->cg_level > CG_SHARE_L2 || grp->cg_level == CG_SHARE_NONE) return cpu; /* Now pick one */ CPU_COPY(&grp->cg_mask, &cs); for (i = thread_num % grp->cg_count; i > 0; i--) { MPASS(CPU_FFS(&cs)); CPU_CLR(CPU_FFS(&cs) - 1, &cs); } MPASS(CPU_FFS(&cs)); return CPU_FFS(&cs) - 1; } #else static int find_thread(int cpu, int thread_num __unused) { return cpu; } #endif static int get_thread_num(if_ctx_t ctx, iflib_intr_type_t type, int qid) { switch (type) { case IFLIB_INTR_TX: /* TX queues get threads on the same core as the corresponding RX queue */ /* XXX handle multiple RX threads per core and more than two threads per core */ return qid / CPU_COUNT(&ctx->ifc_cpus) + 1; case IFLIB_INTR_RX: case IFLIB_INTR_RXTX: /* RX queues get the first thread on their core */ return qid / CPU_COUNT(&ctx->ifc_cpus); default: return -1; } } #else #define get_thread_num(ctx, type, qid) CPU_FIRST() #define find_thread(cpuid, tid) CPU_FIRST() #define find_nth(ctx, gid) CPU_FIRST() #endif /* Just to avoid copy/paste */ static inline int iflib_irq_set_affinity(if_ctx_t ctx, int irq, iflib_intr_type_t type, int qid, struct grouptask *gtask, struct taskqgroup *tqg, void *uniq, char *name) { int cpuid; int err, tid; cpuid = find_nth(ctx, qid); tid = get_thread_num(ctx, type, qid); MPASS(tid >= 0); cpuid = find_thread(cpuid, tid); err = taskqgroup_attach_cpu(tqg, 
gtask, uniq, cpuid, irq, name); if (err) { device_printf(ctx->ifc_dev, "taskqgroup_attach_cpu failed %d\n", err); return (err); } #ifdef notyet if (cpuid > ctx->ifc_cpuid_highest) ctx->ifc_cpuid_highest = cpuid; #endif return 0; } int iflib_irq_alloc_generic(if_ctx_t ctx, if_irq_t irq, int rid, iflib_intr_type_t type, driver_filter_t *filter, void *filter_arg, int qid, char *name) { struct grouptask *gtask; struct taskqgroup *tqg; iflib_filter_info_t info; gtask_fn_t *fn; int tqrid, err; driver_filter_t *intr_fast; void *q; info = &ctx->ifc_filter_info; tqrid = rid; switch (type) { /* XXX merge tx/rx for netmap? */ case IFLIB_INTR_TX: q = &ctx->ifc_txqs[qid]; info = &ctx->ifc_txqs[qid].ift_filter_info; gtask = &ctx->ifc_txqs[qid].ift_task; tqg = qgroup_if_io_tqg; fn = _task_fn_tx; intr_fast = iflib_fast_intr; GROUPTASK_INIT(gtask, 0, fn, q); break; case IFLIB_INTR_RX: q = &ctx->ifc_rxqs[qid]; info = &ctx->ifc_rxqs[qid].ifr_filter_info; gtask = &ctx->ifc_rxqs[qid].ifr_task; tqg = qgroup_if_io_tqg; fn = _task_fn_rx; intr_fast = iflib_fast_intr; GROUPTASK_INIT(gtask, 0, fn, q); break; case IFLIB_INTR_RXTX: q = &ctx->ifc_rxqs[qid]; info = &ctx->ifc_rxqs[qid].ifr_filter_info; gtask = &ctx->ifc_rxqs[qid].ifr_task; tqg = qgroup_if_io_tqg; fn = _task_fn_rx; intr_fast = iflib_fast_intr_rxtx; GROUPTASK_INIT(gtask, 0, fn, q); break; case IFLIB_INTR_ADMIN: q = ctx; tqrid = -1; info = &ctx->ifc_filter_info; gtask = &ctx->ifc_admin_task; tqg = qgroup_if_config_tqg; fn = _task_fn_admin; intr_fast = iflib_fast_intr_ctx; break; default: panic("unknown net intr type"); } info->ifi_filter = filter; info->ifi_filter_arg = filter_arg; info->ifi_task = gtask; info->ifi_ctx = q; err = _iflib_irq_alloc(ctx, irq, rid, intr_fast, NULL, info, name); if (err != 0) { device_printf(ctx->ifc_dev, "_iflib_irq_alloc failed %d\n", err); return (err); } if (type == IFLIB_INTR_ADMIN) return (0); if (tqrid != -1) { err = iflib_irq_set_affinity(ctx, rman_get_start(irq->ii_res), type, qid, gtask, tqg, q, name); if (err) return (err); } else { taskqgroup_attach(tqg, gtask, q, rman_get_start(irq->ii_res), name); } return (0); } void iflib_softirq_alloc_generic(if_ctx_t ctx, if_irq_t irq, iflib_intr_type_t type, void *arg, int qid, char *name) { struct grouptask *gtask; struct taskqgroup *tqg; gtask_fn_t *fn; void *q; int irq_num = -1; int err; switch (type) { case IFLIB_INTR_TX: q = &ctx->ifc_txqs[qid]; gtask = &ctx->ifc_txqs[qid].ift_task; tqg = qgroup_if_io_tqg; fn = _task_fn_tx; if (irq != NULL) irq_num = rman_get_start(irq->ii_res); break; case IFLIB_INTR_RX: q = &ctx->ifc_rxqs[qid]; gtask = &ctx->ifc_rxqs[qid].ifr_task; tqg = qgroup_if_io_tqg; fn = _task_fn_rx; if (irq != NULL) irq_num = rman_get_start(irq->ii_res); break; case IFLIB_INTR_IOV: q = ctx; gtask = &ctx->ifc_vflr_task; tqg = qgroup_if_config_tqg; fn = _task_fn_iov; break; default: panic("unknown net intr type"); } GROUPTASK_INIT(gtask, 0, fn, q); if (irq_num != -1) { err = iflib_irq_set_affinity(ctx, irq_num, type, qid, gtask, tqg, q, name); if (err) taskqgroup_attach(tqg, gtask, q, irq_num, name); } else { taskqgroup_attach(tqg, gtask, q, irq_num, name); } } void iflib_irq_free(if_ctx_t ctx, if_irq_t irq) { if (irq->ii_tag) bus_teardown_intr(ctx->ifc_dev, irq->ii_res, irq->ii_tag); if (irq->ii_res) bus_release_resource(ctx->ifc_dev, SYS_RES_IRQ, irq->ii_rid, irq->ii_res); } static int iflib_legacy_setup(if_ctx_t ctx, driver_filter_t filter, void *filter_arg, int *rid, char *name) { iflib_txq_t txq = ctx->ifc_txqs; iflib_rxq_t rxq = ctx->ifc_rxqs; if_irq_t irq = 
&ctx->ifc_legacy_irq; iflib_filter_info_t info; struct grouptask *gtask; struct taskqgroup *tqg; gtask_fn_t *fn; int tqrid; void *q; int err; q = &ctx->ifc_rxqs[0]; info = &rxq[0].ifr_filter_info; gtask = &rxq[0].ifr_task; tqg = qgroup_if_io_tqg; tqrid = irq->ii_rid = *rid; fn = _task_fn_rx; ctx->ifc_flags |= IFC_LEGACY; info->ifi_filter = filter; info->ifi_filter_arg = filter_arg; info->ifi_task = gtask; info->ifi_ctx = ctx; /* We allocate a single interrupt resource */ if ((err = _iflib_irq_alloc(ctx, irq, tqrid, iflib_fast_intr_ctx, NULL, info, name)) != 0) return (err); GROUPTASK_INIT(gtask, 0, fn, q); taskqgroup_attach(tqg, gtask, q, rman_get_start(irq->ii_res), name); GROUPTASK_INIT(&txq->ift_task, 0, _task_fn_tx, txq); taskqgroup_attach(qgroup_if_io_tqg, &txq->ift_task, txq, rman_get_start(irq->ii_res), "tx"); return (0); } void iflib_led_create(if_ctx_t ctx) { ctx->ifc_led_dev = led_create(iflib_led_func, ctx, device_get_nameunit(ctx->ifc_dev)); } void iflib_tx_intr_deferred(if_ctx_t ctx, int txqid) { GROUPTASK_ENQUEUE(&ctx->ifc_txqs[txqid].ift_task); } void iflib_rx_intr_deferred(if_ctx_t ctx, int rxqid) { GROUPTASK_ENQUEUE(&ctx->ifc_rxqs[rxqid].ifr_task); } void iflib_admin_intr_deferred(if_ctx_t ctx) { #ifdef INVARIANTS struct grouptask *gtask; gtask = &ctx->ifc_admin_task; MPASS(gtask != NULL && gtask->gt_taskqueue != NULL); #endif GROUPTASK_ENQUEUE(&ctx->ifc_admin_task); } void iflib_iov_intr_deferred(if_ctx_t ctx) { GROUPTASK_ENQUEUE(&ctx->ifc_vflr_task); } void iflib_io_tqg_attach(struct grouptask *gt, void *uniq, int cpu, char *name) { taskqgroup_attach_cpu(qgroup_if_io_tqg, gt, uniq, cpu, -1, name); } void iflib_config_gtask_init(if_ctx_t ctx, struct grouptask *gtask, gtask_fn_t *fn, char *name) { GROUPTASK_INIT(gtask, 0, fn, ctx); taskqgroup_attach(qgroup_if_config_tqg, gtask, gtask, -1, name); } void iflib_config_gtask_deinit(struct grouptask *gtask) { taskqgroup_detach(qgroup_if_config_tqg, gtask); } void iflib_link_state_change(if_ctx_t ctx, int link_state, uint64_t baudrate) { if_t ifp = ctx->ifc_ifp; iflib_txq_t txq = ctx->ifc_txqs; if_setbaudrate(ifp, baudrate); - if (baudrate >= IF_Gbps(10)) + if (baudrate >= IF_Gbps(10)) { + STATE_LOCK(ctx); ctx->ifc_flags |= IFC_PREFETCH; - + STATE_UNLOCK(ctx); + } /* If link down, disable watchdog */ if ((ctx->ifc_link_state == LINK_STATE_UP) && (link_state == LINK_STATE_DOWN)) { for (int i = 0; i < ctx->ifc_softc_ctx.isc_ntxqsets; i++, txq++) txq->ift_qstatus = IFLIB_QUEUE_IDLE; } ctx->ifc_link_state = link_state; if_link_state_change(ifp, link_state); } static int iflib_tx_credits_update(if_ctx_t ctx, iflib_txq_t txq) { int credits; #ifdef INVARIANTS int credits_pre = txq->ift_cidx_processed; #endif if (ctx->isc_txd_credits_update == NULL) return (0); if ((credits = ctx->isc_txd_credits_update(ctx->ifc_softc, txq->ift_id, true)) == 0) return (0); txq->ift_processed += credits; txq->ift_cidx_processed += credits; MPASS(credits_pre + credits == txq->ift_cidx_processed); if (txq->ift_cidx_processed >= txq->ift_size) txq->ift_cidx_processed -= txq->ift_size; return (credits); } static int iflib_rxd_avail(if_ctx_t ctx, iflib_rxq_t rxq, qidx_t cidx, qidx_t budget) { return (ctx->isc_rxd_available(ctx->ifc_softc, rxq->ifr_id, cidx, budget)); } void iflib_add_int_delay_sysctl(if_ctx_t ctx, const char *name, const char *description, if_int_delay_info_t info, int offset, int value) { info->iidi_ctx = ctx; info->iidi_offset = offset; info->iidi_value = value; SYSCTL_ADD_PROC(device_get_sysctl_ctx(ctx->ifc_dev), 
SYSCTL_CHILDREN(device_get_sysctl_tree(ctx->ifc_dev)), OID_AUTO,
	    name, CTLTYPE_INT|CTLFLAG_RW,
	    info, 0, iflib_sysctl_int_delay, "I", description);
}

struct mtx *
iflib_ctx_lock_get(if_ctx_t ctx)
{
-	return (&ctx->ifc_mtx);
+	return (&ctx->ifc_ctx_mtx);
}

static int
iflib_msix_init(if_ctx_t ctx)
{
	device_t dev = ctx->ifc_dev;
	if_shared_ctx_t sctx = ctx->ifc_sctx;
	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
	int vectors, queues, rx_queues, tx_queues, queuemsgs, msgs;
	int iflib_num_tx_queues, iflib_num_rx_queues;
	int err, admincnt, bar;

	iflib_num_tx_queues = ctx->ifc_sysctl_ntxqs;
	iflib_num_rx_queues = ctx->ifc_sysctl_nrxqs;
	device_printf(dev, "msix_init qsets capped at %d\n",
	    imax(scctx->isc_ntxqsets, scctx->isc_nrxqsets));
	bar = ctx->ifc_softc_ctx.isc_msix_bar;
	admincnt = sctx->isc_admin_intrcnt;
	/* Override by global tunable */
	{
		int i;
		size_t len = sizeof(i);

		err = kernel_sysctlbyname(curthread, "hw.pci.enable_msix", &i, &len,
		    NULL, 0, NULL, 0);
		if (err == 0) {
			if (i == 0)
				goto msi;
		} else {
			device_printf(dev, "unable to read hw.pci.enable_msix.\n");
		}
	}
	/* Override by tunable */
	if (scctx->isc_disable_msix)
		goto msi;

	/*
	** When used in a virtualized environment
	** PCI BUSMASTER capability may not be set
	** so explicitly set it here and rewrite
	** the ENABLE in the MSIX control register
	** at this point to cause the host to
	** successfully initialize us.
	*/
	{
		int msix_ctrl, rid;

		pci_enable_busmaster(dev);
		rid = 0;
		if (pci_find_cap(dev, PCIY_MSIX, &rid) == 0 && rid != 0) {
			rid += PCIR_MSIX_CTRL;
			msix_ctrl = pci_read_config(dev, rid, 2);
			msix_ctrl |= PCIM_MSIXCTRL_MSIX_ENABLE;
			pci_write_config(dev, rid, msix_ctrl, 2);
		} else {
			device_printf(dev, "PCIY_MSIX capability not found; "
			    "or rid %d == 0.\n", rid);
			goto msi;
		}
	}

	/*
	 * bar == -1 => "trust me I know what I'm doing"
	 * Some drivers are for hardware that is so shoddily
	 * documented that no one knows which bars are which
	 * so the developer has to map all bars.  This hack
	 * allows shoddy garbage to use msix in this framework.
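	 * Concretely: a driver that maps its own BARs sets isc_msix_bar
	 * to -1, and the bus_alloc_resource_any() call below is skipped.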
*/ if (bar != -1) { ctx->ifc_msix_mem = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &bar, RF_ACTIVE); if (ctx->ifc_msix_mem == NULL) { /* May not be enabled */ device_printf(dev, "Unable to map MSIX table \n"); goto msi; } } /* First try MSI/X */ if ((msgs = pci_msix_count(dev)) == 0) { /* system has msix disabled */ device_printf(dev, "System has MSIX disabled \n"); bus_release_resource(dev, SYS_RES_MEMORY, bar, ctx->ifc_msix_mem); ctx->ifc_msix_mem = NULL; goto msi; } #if IFLIB_DEBUG /* use only 1 qset in debug mode */ queuemsgs = min(msgs - admincnt, 1); #else queuemsgs = msgs - admincnt; #endif #ifdef RSS queues = imin(queuemsgs, rss_getnumbuckets()); #else queues = queuemsgs; #endif queues = imin(CPU_COUNT(&ctx->ifc_cpus), queues); device_printf(dev, "pxm cpus: %d queue msgs: %d admincnt: %d\n", CPU_COUNT(&ctx->ifc_cpus), queuemsgs, admincnt); #ifdef RSS /* If we're doing RSS, clamp at the number of RSS buckets */ if (queues > rss_getnumbuckets()) queues = rss_getnumbuckets(); #endif if (iflib_num_rx_queues > 0 && iflib_num_rx_queues < queuemsgs - admincnt) rx_queues = iflib_num_rx_queues; else rx_queues = queues; if (rx_queues > scctx->isc_nrxqsets) rx_queues = scctx->isc_nrxqsets; /* * We want this to be all logical CPUs by default */ if (iflib_num_tx_queues > 0 && iflib_num_tx_queues < queues) tx_queues = iflib_num_tx_queues; else tx_queues = mp_ncpus; if (tx_queues > scctx->isc_ntxqsets) tx_queues = scctx->isc_ntxqsets; if (ctx->ifc_sysctl_qs_eq_override == 0) { #ifdef INVARIANTS if (tx_queues != rx_queues) device_printf(dev, "queue equality override not set, capping rx_queues at %d and tx_queues at %d\n", min(rx_queues, tx_queues), min(rx_queues, tx_queues)); #endif tx_queues = min(rx_queues, tx_queues); rx_queues = min(rx_queues, tx_queues); } device_printf(dev, "using %d rx queues %d tx queues \n", rx_queues, tx_queues); vectors = rx_queues + admincnt; if ((err = pci_alloc_msix(dev, &vectors)) == 0) { device_printf(dev, "Using MSIX interrupts with %d vectors\n", vectors); scctx->isc_vectors = vectors; scctx->isc_nrxqsets = rx_queues; scctx->isc_ntxqsets = tx_queues; scctx->isc_intr = IFLIB_INTR_MSIX; return (vectors); } else { device_printf(dev, "failed to allocate %d msix vectors, err: %d - using MSI\n", vectors, err); } msi: vectors = pci_msi_count(dev); scctx->isc_nrxqsets = 1; scctx->isc_ntxqsets = 1; scctx->isc_vectors = vectors; if (vectors == 1 && pci_alloc_msi(dev, &vectors) == 0) { device_printf(dev,"Using an MSI interrupt\n"); scctx->isc_intr = IFLIB_INTR_MSI; } else { device_printf(dev,"Using a Legacy interrupt\n"); scctx->isc_intr = IFLIB_INTR_LEGACY; } return (vectors); } char * ring_states[] = { "IDLE", "BUSY", "STALLED", "ABDICATED" }; static int mp_ring_state_handler(SYSCTL_HANDLER_ARGS) { int rc; uint16_t *state = ((uint16_t *)oidp->oid_arg1); struct sbuf *sb; char *ring_state = "UNKNOWN"; /* XXX needed ? 
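	 * sysctl_wire_old_buffer() wires the request's output buffer so
	 * copying out the sbuf cannot fault mid-handler; likely cheap
	 * enough to keep even though no lock is held here.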
*/ rc = sysctl_wire_old_buffer(req, 0); MPASS(rc == 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 80, req); MPASS(sb != NULL); if (sb == NULL) return (ENOMEM); if (state[3] <= 3) ring_state = ring_states[state[3]]; sbuf_printf(sb, "pidx_head: %04hd pidx_tail: %04hd cidx: %04hd state: %s", state[0], state[1], state[2], ring_state); rc = sbuf_finish(sb); sbuf_delete(sb); return(rc); } enum iflib_ndesc_handler { IFLIB_NTXD_HANDLER, IFLIB_NRXD_HANDLER, }; static int mp_ndesc_handler(SYSCTL_HANDLER_ARGS) { if_ctx_t ctx = (void *)arg1; enum iflib_ndesc_handler type = arg2; char buf[256] = {0}; qidx_t *ndesc; char *p, *next; int nqs, rc, i; MPASS(type == IFLIB_NTXD_HANDLER || type == IFLIB_NRXD_HANDLER); nqs = 8; switch(type) { case IFLIB_NTXD_HANDLER: ndesc = ctx->ifc_sysctl_ntxds; if (ctx->ifc_sctx) nqs = ctx->ifc_sctx->isc_ntxqs; break; case IFLIB_NRXD_HANDLER: ndesc = ctx->ifc_sysctl_nrxds; if (ctx->ifc_sctx) nqs = ctx->ifc_sctx->isc_nrxqs; break; } if (nqs == 0) nqs = 8; for (i=0; i<8; i++) { if (i >= nqs) break; if (i) strcat(buf, ","); sprintf(strchr(buf, 0), "%d", ndesc[i]); } rc = sysctl_handle_string(oidp, buf, sizeof(buf), req); if (rc || req->newptr == NULL) return rc; for (i = 0, next = buf, p = strsep(&next, " ,"); i < 8 && p; i++, p = strsep(&next, " ,")) { ndesc[i] = strtoul(p, NULL, 10); } return(rc); } #define NAME_BUFLEN 32 static void iflib_add_device_sysctl_pre(if_ctx_t ctx) { device_t dev = iflib_get_dev(ctx); struct sysctl_oid_list *child, *oid_list; struct sysctl_ctx_list *ctx_list; struct sysctl_oid *node; ctx_list = device_get_sysctl_ctx(dev); child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev)); ctx->ifc_sysctl_node = node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, "iflib", CTLFLAG_RD, NULL, "IFLIB fields"); oid_list = SYSCTL_CHILDREN(node); SYSCTL_ADD_STRING(ctx_list, oid_list, OID_AUTO, "driver_version", CTLFLAG_RD, ctx->ifc_sctx->isc_driver_version, 0, "driver version"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_ntxqs", CTLFLAG_RWTUN, &ctx->ifc_sysctl_ntxqs, 0, "# of txqs to use, 0 => use default #"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_nrxqs", CTLFLAG_RWTUN, &ctx->ifc_sysctl_nrxqs, 0, "# of rxqs to use, 0 => use default #"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_qs_enable", CTLFLAG_RWTUN, &ctx->ifc_sysctl_qs_eq_override, 0, "permit #txq != #rxq"); SYSCTL_ADD_INT(ctx_list, oid_list, OID_AUTO, "disable_msix", CTLFLAG_RWTUN, &ctx->ifc_softc_ctx.isc_disable_msix, 0, "disable MSIX (default 0)"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "rx_budget", CTLFLAG_RWTUN, &ctx->ifc_sysctl_rx_budget, 0, "set the rx budget"); /* XXX change for per-queue sizes */ SYSCTL_ADD_PROC(ctx_list, oid_list, OID_AUTO, "override_ntxds", CTLTYPE_STRING|CTLFLAG_RWTUN, ctx, IFLIB_NTXD_HANDLER, mp_ndesc_handler, "A", "list of # of tx descriptors to use, 0 = use default #"); SYSCTL_ADD_PROC(ctx_list, oid_list, OID_AUTO, "override_nrxds", CTLTYPE_STRING|CTLFLAG_RWTUN, ctx, IFLIB_NRXD_HANDLER, mp_ndesc_handler, "A", "list of # of rx descriptors to use, 0 = use default #"); } static void iflib_add_device_sysctl_post(if_ctx_t ctx) { if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = iflib_get_dev(ctx); struct sysctl_oid_list *child; struct sysctl_ctx_list *ctx_list; iflib_fl_t fl; iflib_txq_t txq; iflib_rxq_t rxq; int i, j; char namebuf[NAME_BUFLEN]; char *qfmt; struct sysctl_oid *queue_node, *fl_node, *node; struct sysctl_oid_list *queue_list, *fl_list; ctx_list = 
device_get_sysctl_ctx(dev);
	node = ctx->ifc_sysctl_node;
	child = SYSCTL_CHILDREN(node);

	if (scctx->isc_ntxqsets > 100)
		qfmt = "txq%03d";
	else if (scctx->isc_ntxqsets > 10)
		qfmt = "txq%02d";
	else
		qfmt = "txq%d";
	for (i = 0, txq = ctx->ifc_txqs; i < scctx->isc_ntxqsets; i++, txq++) {
		snprintf(namebuf, NAME_BUFLEN, qfmt, i);
		queue_node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, namebuf,
		    CTLFLAG_RD, NULL, "Queue Name");
		queue_list = SYSCTL_CHILDREN(queue_node);
#if MEMORY_LOGGING
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_dequeued",
		    CTLFLAG_RD, &txq->ift_dequeued, "total mbufs freed");
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_enqueued",
		    CTLFLAG_RD, &txq->ift_enqueued, "total mbufs enqueued");
#endif
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "mbuf_defrag",
		    CTLFLAG_RD, &txq->ift_mbuf_defrag, "# of times m_defrag was called");
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "m_pullups",
		    CTLFLAG_RD, &txq->ift_pullups, "# of times m_pullup was called");
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "mbuf_defrag_failed",
		    CTLFLAG_RD, &txq->ift_mbuf_defrag_failed, "# of times m_defrag failed");
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "no_desc_avail",
		    CTLFLAG_RD, &txq->ift_no_desc_avail, "# of times no descriptors were available");
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "tx_map_failed",
		    CTLFLAG_RD, &txq->ift_map_failed, "# of times dma map failed");
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txd_encap_efbig",
		    CTLFLAG_RD, &txq->ift_txd_encap_efbig, "# of times txd_encap returned EFBIG");
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "no_tx_dma_setup",
		    CTLFLAG_RD, &txq->ift_no_tx_dma_setup, "# of times map failed for other than EFBIG");
		SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_pidx",
		    CTLFLAG_RD, &txq->ift_pidx, 1, "Producer Index");
		SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_cidx",
		    CTLFLAG_RD, &txq->ift_cidx, 1, "Consumer Index");
		SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_cidx_processed",
		    CTLFLAG_RD, &txq->ift_cidx_processed, 1, "Consumer Index seen by credit update");
		SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_in_use",
		    CTLFLAG_RD, &txq->ift_in_use, 1, "descriptors in use");
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_processed",
		    CTLFLAG_RD, &txq->ift_processed, "descriptors processed for clean");
		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_cleaned",
		    CTLFLAG_RD, &txq->ift_cleaned, "total cleaned");
		SYSCTL_ADD_PROC(ctx_list, queue_list, OID_AUTO, "ring_state",
		    CTLTYPE_STRING | CTLFLAG_RD, __DEVOLATILE(uint64_t *, &txq->ift_br->state),
		    0, mp_ring_state_handler, "A", "soft ring state");
		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_enqueues",
		    CTLFLAG_RD, &txq->ift_br->enqueues,
		    "# of enqueues to the mp_ring for this queue");
		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_drops",
		    CTLFLAG_RD, &txq->ift_br->drops,
		    "# of drops in the mp_ring for this queue");
		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_starts",
		    CTLFLAG_RD, &txq->ift_br->starts,
		    "# of normal consumer starts in the mp_ring for this queue");
		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_stalls",
		    CTLFLAG_RD, &txq->ift_br->stalls,
		    "# of consumer stalls in the mp_ring for this queue");
		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_restarts",
		    CTLFLAG_RD, &txq->ift_br->restarts,
		    "# of consumer restarts in the mp_ring for this queue");
		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_abdications",
		    CTLFLAG_RD, &txq->ift_br->abdications,
		    "# of consumer abdications in the
mp_ring for this queue"); } if (scctx->isc_nrxqsets > 100) qfmt = "rxq%03d"; else if (scctx->isc_nrxqsets > 10) qfmt = "rxq%02d"; else qfmt = "rxq%d"; for (i = 0, rxq = ctx->ifc_rxqs; i < scctx->isc_nrxqsets; i++, rxq++) { snprintf(namebuf, NAME_BUFLEN, qfmt, i); queue_node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, namebuf, CTLFLAG_RD, NULL, "Queue Name"); queue_list = SYSCTL_CHILDREN(queue_node); if (sctx->isc_flags & IFLIB_HAS_RXCQ) { SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "rxq_cq_pidx", CTLFLAG_RD, &rxq->ifr_cq_pidx, 1, "Producer Index"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "rxq_cq_cidx", CTLFLAG_RD, &rxq->ifr_cq_cidx, 1, "Consumer Index"); } for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) { snprintf(namebuf, NAME_BUFLEN, "rxq_fl%d", j); fl_node = SYSCTL_ADD_NODE(ctx_list, queue_list, OID_AUTO, namebuf, CTLFLAG_RD, NULL, "freelist Name"); fl_list = SYSCTL_CHILDREN(fl_node); SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "pidx", CTLFLAG_RD, &fl->ifl_pidx, 1, "Producer Index"); SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "cidx", CTLFLAG_RD, &fl->ifl_cidx, 1, "Consumer Index"); SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "credits", CTLFLAG_RD, &fl->ifl_credits, 1, "credits available"); #if MEMORY_LOGGING SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_m_enqueued", CTLFLAG_RD, &fl->ifl_m_enqueued, "mbufs allocated"); SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_m_dequeued", CTLFLAG_RD, &fl->ifl_m_dequeued, "mbufs freed"); SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_cl_enqueued", CTLFLAG_RD, &fl->ifl_cl_enqueued, "clusters allocated"); SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_cl_dequeued", CTLFLAG_RD, &fl->ifl_cl_dequeued, "clusters freed"); #endif } } } #ifndef __NO_STRICT_ALIGNMENT static struct mbuf * iflib_fixup_rx(struct mbuf *m) { struct mbuf *n; if (m->m_len <= (MCLBYTES - ETHER_HDR_LEN)) { bcopy(m->m_data, m->m_data + ETHER_HDR_LEN, m->m_len); m->m_data += ETHER_HDR_LEN; n = m; } else { MGETHDR(n, M_NOWAIT, MT_DATA); if (n == NULL) { m_freem(m); return (NULL); } bcopy(m->m_data, n->m_data, ETHER_HDR_LEN); m->m_data += ETHER_HDR_LEN; m->m_len -= ETHER_HDR_LEN; n->m_len = ETHER_HDR_LEN; M_MOVE_PKTHDR(n, m); n->m_next = m; } return (n); } #endif #ifdef NETDUMP static void iflib_netdump_init(struct ifnet *ifp, int *nrxr, int *ncl, int *clsize) { if_ctx_t ctx; ctx = if_getsoftc(ifp); CTX_LOCK(ctx); *nrxr = NRXQSETS(ctx); *ncl = ctx->ifc_rxqs[0].ifr_fl->ifl_size; *clsize = ctx->ifc_rxqs[0].ifr_fl->ifl_buf_size; CTX_UNLOCK(ctx); } static void iflib_netdump_event(struct ifnet *ifp, enum netdump_ev event) { if_ctx_t ctx; if_softc_ctx_t scctx; iflib_fl_t fl; iflib_rxq_t rxq; int i, j; ctx = if_getsoftc(ifp); scctx = &ctx->ifc_softc_ctx; switch (event) { case NETDUMP_START: for (i = 0; i < scctx->isc_nrxqsets; i++) { rxq = &ctx->ifc_rxqs[i]; for (j = 0; j < rxq->ifr_nfl; j++) { fl = rxq->ifr_fl; fl->ifl_zone = m_getzone(fl->ifl_buf_size); } } iflib_no_tx_batch = 1; break; default: break; } } static int iflib_netdump_transmit(struct ifnet *ifp, struct mbuf *m) { if_ctx_t ctx; iflib_txq_t txq; int error; ctx = if_getsoftc(ifp); if ((if_getdrvflags(ifp) & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) != IFF_DRV_RUNNING) return (EBUSY); txq = &ctx->ifc_txqs[0]; error = iflib_encap(txq, &m); if (error == 0) (void)iflib_txd_db_check(ctx, txq, true, txq->ift_in_use); return (error); } static int iflib_netdump_poll(struct ifnet *ifp, int count) { if_ctx_t ctx; if_softc_ctx_t scctx; iflib_txq_t txq; int i; ctx = if_getsoftc(ifp); scctx = &ctx->ifc_softc_ctx; if 
((if_getdrvflags(ifp) & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) != IFF_DRV_RUNNING) return (EBUSY); txq = &ctx->ifc_txqs[0]; (void)iflib_tx_credits_update(ctx, txq); (void)iflib_completed_tx_reclaim(txq, RECLAIM_THRESH(ctx)); for (i = 0; i < scctx->isc_nrxqsets; i++) (void)iflib_rxeof(&ctx->ifc_rxqs[i], 16 /* XXX */); return (0); } #endif /* NETDUMP */ Index: user/markj/netdump/sys/netinet/tcp_log_buf.c =================================================================== --- user/markj/netdump/sys/netinet/tcp_log_buf.c (revision 332407) +++ user/markj/netdump/sys/netinet/tcp_log_buf.c (revision 332408) @@ -1,2480 +1,2435 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2016-2018 * Netflix Inc. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* Default expiry time */ #define TCP_LOG_EXPIRE_TIME ((sbintime_t)60 * SBT_1S) /* Max interval at which to run the expiry timer */ #define TCP_LOG_EXPIRE_INTVL ((sbintime_t)5 * SBT_1S) bool tcp_log_verbose; static uma_zone_t tcp_log_bucket_zone, tcp_log_node_zone, tcp_log_zone; static int tcp_log_session_limit = TCP_LOG_BUF_DEFAULT_SESSION_LIMIT; static uint32_t tcp_log_version = TCP_LOG_BUF_VER; RB_HEAD(tcp_log_id_tree, tcp_log_id_bucket); static struct tcp_log_id_tree tcp_log_id_head; static STAILQ_HEAD(, tcp_log_id_node) tcp_log_expireq_head = STAILQ_HEAD_INITIALIZER(tcp_log_expireq_head); static struct mtx tcp_log_expireq_mtx; static struct callout tcp_log_expireq_callout; static u_long tcp_log_auto_ratio = 0; static volatile u_long tcp_log_auto_ratio_cur = 0; static uint32_t tcp_log_auto_mode = TCP_LOG_STATE_TAIL; static bool tcp_log_auto_all = false; RB_PROTOTYPE_STATIC(tcp_log_id_tree, tcp_log_id_bucket, tlb_rb, tcp_log_id_cmp) SYSCTL_NODE(_net_inet_tcp, OID_AUTO, bb, CTLFLAG_RW, 0, "TCP Black Box controls"); SYSCTL_BOOL(_net_inet_tcp_bb, OID_AUTO, log_verbose, CTLFLAG_RW, &tcp_log_verbose, 0, "Force verbose logging for TCP traces"); SYSCTL_INT(_net_inet_tcp_bb, OID_AUTO, log_session_limit, CTLFLAG_RW, &tcp_log_session_limit, 0, "Maximum number of events maintained for each TCP session"); SYSCTL_UMA_MAX(_net_inet_tcp_bb, OID_AUTO, log_global_limit, CTLFLAG_RW, &tcp_log_zone, "Maximum number of events maintained for all TCP sessions"); SYSCTL_UMA_CUR(_net_inet_tcp_bb, OID_AUTO, log_global_entries, CTLFLAG_RD, &tcp_log_zone, "Current number of events maintained for all TCP sessions"); SYSCTL_UMA_MAX(_net_inet_tcp_bb, OID_AUTO, log_id_limit, CTLFLAG_RW, &tcp_log_bucket_zone, "Maximum number of log IDs"); SYSCTL_UMA_CUR(_net_inet_tcp_bb, OID_AUTO, log_id_entries, CTLFLAG_RD, &tcp_log_bucket_zone, "Current number of log IDs"); SYSCTL_UMA_MAX(_net_inet_tcp_bb, OID_AUTO, log_id_tcpcb_limit, CTLFLAG_RW, &tcp_log_node_zone, "Maximum number of tcpcbs with log IDs"); SYSCTL_UMA_CUR(_net_inet_tcp_bb, OID_AUTO, log_id_tcpcb_entries, CTLFLAG_RD, &tcp_log_node_zone, "Current number of tcpcbs with log IDs"); SYSCTL_U32(_net_inet_tcp_bb, OID_AUTO, log_version, CTLFLAG_RD, &tcp_log_version, 0, "Version of log formats exported"); SYSCTL_ULONG(_net_inet_tcp_bb, OID_AUTO, log_auto_ratio, CTLFLAG_RW, &tcp_log_auto_ratio, 0, "Do auto capturing for 1 out of N sessions"); SYSCTL_U32(_net_inet_tcp_bb, OID_AUTO, log_auto_mode, CTLFLAG_RW, &tcp_log_auto_mode, TCP_LOG_STATE_HEAD_AUTO, "Logging mode for auto-selected sessions (default is TCP_LOG_STATE_HEAD_AUTO)"); SYSCTL_BOOL(_net_inet_tcp_bb, OID_AUTO, log_auto_all, CTLFLAG_RW, &tcp_log_auto_all, false, "Auto-select from all sessions (rather than just those with IDs)"); #ifdef TCPLOG_DEBUG_COUNTERS counter_u64_t tcp_log_queued; counter_u64_t tcp_log_que_fail1; counter_u64_t tcp_log_que_fail2; counter_u64_t tcp_log_que_fail3; counter_u64_t tcp_log_que_fail4; counter_u64_t tcp_log_que_fail5; counter_u64_t tcp_log_que_copyout; counter_u64_t tcp_log_que_read; counter_u64_t tcp_log_que_freed; SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, queued, CTLFLAG_RD, &tcp_log_queued, "Number of entries queued"); SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail1, CTLFLAG_RD, &tcp_log_que_fail1, "Number of entries queued but fail 1"); 
SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail2, CTLFLAG_RD,
    &tcp_log_que_fail2, "Number of entries queued but fail 2");
SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail3, CTLFLAG_RD,
    &tcp_log_que_fail3, "Number of entries queued but fail 3");
SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail4, CTLFLAG_RD,
    &tcp_log_que_fail4, "Number of entries queued but fail 4");
SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail5, CTLFLAG_RD,
    &tcp_log_que_fail5, "Number of entries queued but fail 5");
SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, copyout, CTLFLAG_RD,
    &tcp_log_que_copyout, "Number of entries copied out");
SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, read, CTLFLAG_RD,
    &tcp_log_que_read, "Number of entries read from the queue");
SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, freed, CTLFLAG_RD,
    &tcp_log_que_freed, "Number of entries freed after reading");
#endif

#ifdef INVARIANTS
#define TCPLOG_DEBUG_RINGBUF
#endif

struct tcp_log_mem {
	STAILQ_ENTRY(tcp_log_mem) tlm_queue;
	struct tcp_log_buffer	tlm_buf;
	struct tcp_log_verbose	tlm_v;
#ifdef TCPLOG_DEBUG_RINGBUF
	volatile int		tlm_refcnt;
#endif
};

/* 60 bytes for the header, + 16 bytes for padding */
static uint8_t	zerobuf[76];

/*
 * Lock order:
 * 1. TCPID_TREE
 * 2. TCPID_BUCKET
 * 3. INP
 *
 * Rules:
 * A. You need a lock on the Tree to add/remove buckets.
 * B. You need a lock on the bucket to add/remove nodes from the bucket.
 * C. To change information in a node, you need the INP lock if the tln_closed
 *    field is false. Otherwise, you need the bucket lock. (Note that the
 *    tln_closed field can change at any point, so you need to recheck the
 *    entry after acquiring the INP lock.)
 * D. To remove a node from the bucket, you must have that entry locked,
 *    according to the criteria of Rule C. Also, the node must not be on
 *    the expiry queue.
 * E. The exception to C is the expiry queue fields, which are locked by
 *    the TCPLOG_EXPIREQ lock.
 *
 * Buckets have a reference count. Each node is a reference. Further,
 * other callers may add reference counts to keep a bucket from disappearing.
 * You can add a reference as long as you own a lock sufficient to keep the
 * bucket from disappearing. For example, a common use is:
 *   a. Have a locked INP, but need to lock the TCPID_BUCKET.
 *   b. Add a refcount on the bucket. (Safe because the INP lock prevents
 *      the TCPID_BUCKET from going away.)
 *   c. Drop the INP lock.
 *   d. Acquire a lock on the TCPID_BUCKET.
 *   e. Acquire a lock on the INP.
 *   f. Drop the refcount on the bucket.
 *      (At this point, the bucket may disappear.)
 *
 * Expire queue lock:
 * You can acquire this with either the bucket or INP lock. Don't reverse it.
 * When the expire code has committed to freeing a node, it resets the expiry
 * time to SBT_MAX. That is the signal to everyone else that they should
 * leave that node alone.
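 *
 * In code, steps a-f above look roughly like this (an illustrative
 * sketch only; the real logic, including all of the re-validation it
 * requires, lives in tcp_log_unref_bucket() and its callers below):
 *
 *	TCPID_BUCKET_REF(tlb);		(b: INP lock pins the bucket)
 *	INP_WUNLOCK(inp);		(c)
 *	TCPID_BUCKET_LOCK(tlb);		(d)
 *	INP_WLOCK(inp);			(e)
 *	if (!tcp_log_unref_bucket(tlb, &tree_locked, inp))	(f)
 *		TCPID_BUCKET_UNLOCK(tlb);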
*/ static struct rwlock tcp_id_tree_lock; #define TCPID_TREE_WLOCK() rw_wlock(&tcp_id_tree_lock) #define TCPID_TREE_RLOCK() rw_rlock(&tcp_id_tree_lock) #define TCPID_TREE_UPGRADE() rw_try_upgrade(&tcp_id_tree_lock) #define TCPID_TREE_WUNLOCK() rw_wunlock(&tcp_id_tree_lock) #define TCPID_TREE_RUNLOCK() rw_runlock(&tcp_id_tree_lock) #define TCPID_TREE_WLOCK_ASSERT() rw_assert(&tcp_id_tree_lock, RA_WLOCKED) #define TCPID_TREE_RLOCK_ASSERT() rw_assert(&tcp_id_tree_lock, RA_RLOCKED) #define TCPID_TREE_UNLOCK_ASSERT() rw_assert(&tcp_id_tree_lock, RA_UNLOCKED) #define TCPID_BUCKET_LOCK_INIT(tlb) mtx_init(&((tlb)->tlb_mtx), "tcp log id bucket", NULL, MTX_DEF) #define TCPID_BUCKET_LOCK_DESTROY(tlb) mtx_destroy(&((tlb)->tlb_mtx)) #define TCPID_BUCKET_LOCK(tlb) mtx_lock(&((tlb)->tlb_mtx)) #define TCPID_BUCKET_UNLOCK(tlb) mtx_unlock(&((tlb)->tlb_mtx)) #define TCPID_BUCKET_LOCK_ASSERT(tlb) mtx_assert(&((tlb)->tlb_mtx), MA_OWNED) #define TCPID_BUCKET_UNLOCK_ASSERT(tlb) mtx_assert(&((tlb)->tlb_mtx), MA_NOTOWNED) #define TCPID_BUCKET_REF(tlb) refcount_acquire(&((tlb)->tlb_refcnt)) #define TCPID_BUCKET_UNREF(tlb) refcount_release(&((tlb)->tlb_refcnt)) #define TCPLOG_EXPIREQ_LOCK() mtx_lock(&tcp_log_expireq_mtx) #define TCPLOG_EXPIREQ_UNLOCK() mtx_unlock(&tcp_log_expireq_mtx) SLIST_HEAD(tcp_log_id_head, tcp_log_id_node); struct tcp_log_id_bucket { /* * tlb_id must be first. This lets us use strcmp on * (struct tcp_log_id_bucket *) and (char *) interchangeably. */ char tlb_id[TCP_LOG_ID_LEN]; RB_ENTRY(tcp_log_id_bucket) tlb_rb; struct tcp_log_id_head tlb_head; struct mtx tlb_mtx; volatile u_int tlb_refcnt; }; struct tcp_log_id_node { SLIST_ENTRY(tcp_log_id_node) tln_list; STAILQ_ENTRY(tcp_log_id_node) tln_expireq; /* Locked by the expireq lock */ sbintime_t tln_expiretime; /* Locked by the expireq lock */ /* * If INP is NULL, that means the connection has closed. We've * saved the connection endpoint information and the log entries * in the tln_ie and tln_entries members. We've also saved a pointer * to the enclosing bucket here. If INP is not NULL, the information is * in the PCB and not here. */ struct inpcb *tln_inp; struct tcpcb *tln_tp; struct tcp_log_id_bucket *tln_bucket; struct in_endpoints tln_ie; struct tcp_log_stailq tln_entries; int tln_count; volatile int tln_closed; uint8_t tln_af; }; enum tree_lock_state { TREE_UNLOCKED = 0, TREE_RLOCKED, TREE_WLOCKED, }; /* Do we want to select this session for auto-logging? */ static __inline bool tcp_log_selectauto(void) { /* * If we are doing auto-capturing, figure out whether we will capture * this session. 
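 * The selection is a simple modulus on a global counter: with
 * net.inet.tcp.bb.log_auto_ratio set to N, every Nth session created
 * system-wide is captured. For example (illustrative numbers), with
 * N = 100 the sessions that observe counter values 0, 100, 200, ...
 * are selected, i.e. roughly one session in a hundred.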
*/ if (tcp_log_auto_ratio && (atomic_fetchadd_long(&tcp_log_auto_ratio_cur, 1) % tcp_log_auto_ratio) == 0) return (true); return (false); } static __inline int tcp_log_id_cmp(struct tcp_log_id_bucket *a, struct tcp_log_id_bucket *b) { KASSERT(a != NULL, ("tcp_log_id_cmp: argument a is unexpectedly NULL")); KASSERT(b != NULL, ("tcp_log_id_cmp: argument b is unexpectedly NULL")); return strncmp(a->tlb_id, b->tlb_id, TCP_LOG_ID_LEN); } RB_GENERATE_STATIC(tcp_log_id_tree, tcp_log_id_bucket, tlb_rb, tcp_log_id_cmp) static __inline void tcp_log_id_validate_tree_lock(int tree_locked) { #ifdef INVARIANTS switch (tree_locked) { case TREE_WLOCKED: TCPID_TREE_WLOCK_ASSERT(); break; case TREE_RLOCKED: TCPID_TREE_RLOCK_ASSERT(); break; case TREE_UNLOCKED: TCPID_TREE_UNLOCK_ASSERT(); break; default: kassert_panic("%s:%d: unknown tree lock state", __func__, __LINE__); } #endif } static __inline void tcp_log_remove_bucket(struct tcp_log_id_bucket *tlb) { TCPID_TREE_WLOCK_ASSERT(); KASSERT(SLIST_EMPTY(&tlb->tlb_head), ("%s: Attempt to remove non-empty bucket", __func__)); if (RB_REMOVE(tcp_log_id_tree, &tcp_log_id_head, tlb) == NULL) { #ifdef INVARIANTS kassert_panic("%s:%d: error removing element from tree", __func__, __LINE__); #endif } TCPID_BUCKET_LOCK_DESTROY(tlb); uma_zfree(tcp_log_bucket_zone, tlb); } /* * Call with a referenced and locked bucket. * Will return true if the bucket was freed; otherwise, false. * tlb: The bucket to unreference. * tree_locked: A pointer to the state of the tree lock. If the tree lock * state changes, the function will update it. * inp: If not NULL and the function needs to drop the inp lock to relock the * tree, it will do so. (The caller must ensure inp will not become invalid, * probably by holding a reference to it.) */ static bool tcp_log_unref_bucket(struct tcp_log_id_bucket *tlb, int *tree_locked, struct inpcb *inp) { KASSERT(tlb != NULL, ("%s: called with NULL tlb", __func__)); KASSERT(tree_locked != NULL, ("%s: called with NULL tree_locked", __func__)); tcp_log_id_validate_tree_lock(*tree_locked); /* * Did we hold the last reference on the tlb? If so, we may need * to free it. (Note that we can realistically only execute the * loop twice: once without a write lock and once with a write * lock.) */ while (TCPID_BUCKET_UNREF(tlb)) { /* * We need a write lock on the tree to free this. * If we can upgrade the tree lock, this is "easy". If we * can't upgrade the tree lock, we need to do this the * "hard" way: unwind all our locks and relock everything. * In the meantime, anything could have changed. We even * need to validate that we still need to free the bucket. */ if (*tree_locked == TREE_RLOCKED && TCPID_TREE_UPGRADE()) *tree_locked = TREE_WLOCKED; else if (*tree_locked != TREE_WLOCKED) { TCPID_BUCKET_REF(tlb); if (inp != NULL) INP_WUNLOCK(inp); TCPID_BUCKET_UNLOCK(tlb); if (*tree_locked == TREE_RLOCKED) TCPID_TREE_RUNLOCK(); TCPID_TREE_WLOCK(); *tree_locked = TREE_WLOCKED; TCPID_BUCKET_LOCK(tlb); if (inp != NULL) INP_WLOCK(inp); continue; } /* * We have an empty bucket and a write lock on the tree. * Remove the empty bucket. */ tcp_log_remove_bucket(tlb); return (true); } return (false); } /* * Call with a locked bucket. This function will release the lock on the * bucket before returning. * * The caller is responsible for freeing the tp->t_lin/tln node! * * Note: one of tp or both tlb and tln must be supplied. * * inp: A pointer to the inp. If the function needs to drop the inp lock to * acquire the tree write lock, it will do so. 
(The caller must ensure inp * will not become invalid, probably by holding a reference to it.) * tp: A pointer to the tcpcb. (optional; if specified, tlb and tln are ignored) * tlb: A pointer to the bucket. (optional; ignored if tp is specified) * tln: A pointer to the node. (optional; ignored if tp is specified) * tree_locked: A pointer to the state of the tree lock. If the tree lock * state changes, the function will update it. * * Will return true if the INP lock was reacquired; otherwise, false. */ static bool tcp_log_remove_id_node(struct inpcb *inp, struct tcpcb *tp, struct tcp_log_id_bucket *tlb, struct tcp_log_id_node *tln, int *tree_locked) { int orig_tree_locked; KASSERT(tp != NULL || (tlb != NULL && tln != NULL), ("%s: called with tp=%p, tlb=%p, tln=%p", __func__, tp, tlb, tln)); KASSERT(tree_locked != NULL, ("%s: called with NULL tree_locked", __func__)); if (tp != NULL) { tlb = tp->t_lib; tln = tp->t_lin; KASSERT(tlb != NULL, ("%s: unexpectedly NULL tlb", __func__)); KASSERT(tln != NULL, ("%s: unexpectedly NULL tln", __func__)); } tcp_log_id_validate_tree_lock(*tree_locked); TCPID_BUCKET_LOCK_ASSERT(tlb); /* * Remove the node, clear the log bucket and node from the TCPCB, and * decrement the bucket refcount. In the process, if this is the * last reference, the bucket will be freed. */ SLIST_REMOVE(&tlb->tlb_head, tln, tcp_log_id_node, tln_list); if (tp != NULL) { tp->t_lib = NULL; tp->t_lin = NULL; } orig_tree_locked = *tree_locked; if (!tcp_log_unref_bucket(tlb, tree_locked, inp)) TCPID_BUCKET_UNLOCK(tlb); return (*tree_locked != orig_tree_locked); } #define RECHECK_INP_CLEAN(cleanup) do { \ if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) { \ rv = ECONNRESET; \ cleanup; \ goto done; \ } \ tp = intotcpcb(inp); \ } while (0) #define RECHECK_INP() RECHECK_INP_CLEAN(/* noop */) static void tcp_log_grow_tlb(char *tlb_id, struct tcpcb *tp) { INP_WLOCK_ASSERT(tp->t_inpcb); #ifdef NETFLIX if (V_tcp_perconn_stats_enable == 2 && tp->t_stats == NULL) (void)tcp_stats_sample_rollthedice(tp, tlb_id, strlen(tlb_id)); #endif } /* * Set the TCP log ID for a TCPCB. * Called with INPCB locked. Returns with it unlocked. */ int tcp_log_set_id(struct tcpcb *tp, char *id) { struct tcp_log_id_bucket *tlb, *tmp_tlb; struct tcp_log_id_node *tln; struct inpcb *inp; int tree_locked, rv; bool bucket_locked; tlb = NULL; tln = NULL; inp = tp->t_inpcb; tree_locked = TREE_UNLOCKED; bucket_locked = false; restart: INP_WLOCK_ASSERT(inp); /* See if the ID is unchanged. */ if ((tp->t_lib != NULL && !strcmp(tp->t_lib->tlb_id, id)) || (tp->t_lib == NULL && *id == 0)) { rv = 0; goto done; } /* * If the TCPCB had a previous ID, we need to extricate it from * the previous list. * * Drop the TCPCB lock and lock the tree and the bucket. * Because this is called in the socket context, we (theoretically) * don't need to worry about the INPCB completely going away * while we are gone. */ if (tp->t_lib != NULL) { tlb = tp->t_lib; TCPID_BUCKET_REF(tlb); INP_WUNLOCK(inp); if (tree_locked == TREE_UNLOCKED) { TCPID_TREE_RLOCK(); tree_locked = TREE_RLOCKED; } TCPID_BUCKET_LOCK(tlb); bucket_locked = true; INP_WLOCK(inp); /* * Unreference the bucket. If our bucket went away, it is no * longer locked or valid. */ if (tcp_log_unref_bucket(tlb, &tree_locked, inp)) { bucket_locked = false; tlb = NULL; } /* Validate the INP. */ RECHECK_INP(); /* * Evaluate whether the bucket changed while we were unlocked. * * Possible scenarios here: * 1. Bucket is unchanged and the same one we started with. * 2. 
The TCPCB no longer has a bucket and our bucket was * freed. * 3. The TCPCB has a new bucket, whether ours was freed. * 4. The TCPCB no longer has a bucket and our bucket was * not freed. * * In cases 2-4, we will start over. In case 1, we will * proceed here to remove the bucket. */ if (tlb == NULL || tp->t_lib != tlb) { KASSERT(bucket_locked || tlb == NULL, ("%s: bucket_locked (%d) and tlb (%p) are " "inconsistent", __func__, bucket_locked, tlb)); if (bucket_locked) { TCPID_BUCKET_UNLOCK(tlb); bucket_locked = false; tlb = NULL; } goto restart; } /* * Store the (struct tcp_log_id_node) for reuse. Then, remove * it from the bucket. In the process, we may end up relocking. * If so, we need to validate that the INP is still valid, and * the TCPCB entries match we expect. * * We will clear tlb and change the bucket_locked state just * before calling tcp_log_remove_id_node(), since that function * will unlock the bucket. */ if (tln != NULL) uma_zfree(tcp_log_node_zone, tln); tln = tp->t_lin; tlb = NULL; bucket_locked = false; if (tcp_log_remove_id_node(inp, tp, NULL, NULL, &tree_locked)) { RECHECK_INP(); /* * If the TCPCB moved to a new bucket while we had * dropped the lock, restart. */ if (tp->t_lib != NULL || tp->t_lin != NULL) goto restart; } /* * Yay! We successfully removed the TCPCB from its old * bucket. Phew! * * On to bigger and better things... */ } /* At this point, the TCPCB should not be in any bucket. */ KASSERT(tp->t_lib == NULL, ("%s: tp->t_lib is not NULL", __func__)); /* * If the new ID is not empty, we need to now assign this TCPCB to a * new bucket. */ if (*id) { /* Get a new tln, if we don't already have one to reuse. */ if (tln == NULL) { tln = uma_zalloc(tcp_log_node_zone, M_NOWAIT | M_ZERO); if (tln == NULL) { rv = ENOBUFS; goto done; } tln->tln_inp = inp; tln->tln_tp = tp; } /* * Drop the INP lock for a bit. We don't need it, and dropping * it prevents lock order reversals. */ INP_WUNLOCK(inp); /* Make sure we have at least a read lock on the tree. */ tcp_log_id_validate_tree_lock(tree_locked); if (tree_locked == TREE_UNLOCKED) { TCPID_TREE_RLOCK(); tree_locked = TREE_RLOCKED; } refind: /* * Remember that we constructed (struct tcp_log_id_node) so * we can safely cast the id to it for the purposes of finding. */ KASSERT(tlb == NULL, ("%s:%d tlb unexpectedly non-NULL", __func__, __LINE__)); tmp_tlb = RB_FIND(tcp_log_id_tree, &tcp_log_id_head, (struct tcp_log_id_bucket *) id); /* * If we didn't find a matching bucket, we need to add a new * one. This requires a write lock. But, of course, we will * need to recheck some things when we re-acquire the lock. */ if (tmp_tlb == NULL && tree_locked != TREE_WLOCKED) { tree_locked = TREE_WLOCKED; if (!TCPID_TREE_UPGRADE()) { TCPID_TREE_RUNLOCK(); TCPID_TREE_WLOCK(); /* * The tree may have changed while we were * unlocked. */ goto refind; } } /* If we need to add a new bucket, do it now. */ if (tmp_tlb == NULL) { /* Allocate new bucket. */ tlb = uma_zalloc(tcp_log_bucket_zone, M_NOWAIT); if (tlb == NULL) { rv = ENOBUFS; goto done_noinp; } /* * Copy the ID to the bucket. * NB: Don't use strlcpy() unless you are sure * we've always validated NULL termination. * * TODO: When I'm done writing this, see if we * we have correctly validated NULL termination and * can use strlcpy(). :-) */ strncpy(tlb->tlb_id, id, TCP_LOG_ID_LEN - 1); tlb->tlb_id[TCP_LOG_ID_LEN - 1] = '\0'; /* * Take the refcount for the first node and go ahead * and lock this. 
Note that we zero the tlb_mtx * structure, since 0xdeadc0de flips the right bits * for the code to think that this mutex has already * been initialized. :-( */ SLIST_INIT(&tlb->tlb_head); refcount_init(&tlb->tlb_refcnt, 1); memset(&tlb->tlb_mtx, 0, sizeof(struct mtx)); TCPID_BUCKET_LOCK_INIT(tlb); TCPID_BUCKET_LOCK(tlb); bucket_locked = true; #define FREE_NEW_TLB() do { \ TCPID_BUCKET_LOCK_DESTROY(tlb); \ uma_zfree(tcp_log_bucket_zone, tlb); \ bucket_locked = false; \ tlb = NULL; \ } while (0) /* * Relock the INP and make sure we are still * unassigned. */ INP_WLOCK(inp); RECHECK_INP_CLEAN(FREE_NEW_TLB()); if (tp->t_lib != NULL) { FREE_NEW_TLB(); goto restart; } /* Add the new bucket to the tree. */ tmp_tlb = RB_INSERT(tcp_log_id_tree, &tcp_log_id_head, tlb); KASSERT(tmp_tlb == NULL, ("%s: Unexpected conflicting bucket (%p) while " "adding new bucket (%p)", __func__, tmp_tlb, tlb)); /* * If we found a conflicting bucket, free the new * one we made and fall through to use the existing * bucket. */ if (tmp_tlb != NULL) { FREE_NEW_TLB(); INP_WUNLOCK(inp); } #undef FREE_NEW_TLB } /* If we found an existing bucket, use it. */ if (tmp_tlb != NULL) { tlb = tmp_tlb; TCPID_BUCKET_LOCK(tlb); bucket_locked = true; /* * Relock the INP and make sure we are still * unassigned. */ INP_UNLOCK_ASSERT(inp); INP_WLOCK(inp); RECHECK_INP(); if (tp->t_lib != NULL) { TCPID_BUCKET_UNLOCK(tlb); tlb = NULL; goto restart; } /* Take a reference on the bucket. */ TCPID_BUCKET_REF(tlb); } tcp_log_grow_tlb(tlb->tlb_id, tp); /* Add the new node to the list. */ SLIST_INSERT_HEAD(&tlb->tlb_head, tln, tln_list); tp->t_lib = tlb; tp->t_lin = tln; tln = NULL; } rv = 0; done: /* Unlock things, as needed, and return. */ INP_WUNLOCK(inp); done_noinp: INP_UNLOCK_ASSERT(inp); if (bucket_locked) { TCPID_BUCKET_LOCK_ASSERT(tlb); TCPID_BUCKET_UNLOCK(tlb); } else if (tlb != NULL) TCPID_BUCKET_UNLOCK_ASSERT(tlb); if (tree_locked == TREE_WLOCKED) { TCPID_TREE_WLOCK_ASSERT(); TCPID_TREE_WUNLOCK(); } else if (tree_locked == TREE_RLOCKED) { TCPID_TREE_RLOCK_ASSERT(); TCPID_TREE_RUNLOCK(); } else TCPID_TREE_UNLOCK_ASSERT(); if (tln != NULL) uma_zfree(tcp_log_node_zone, tln); return (rv); } /* * Get the TCP log ID for a TCPCB. * Called with INPCB locked. * 'buf' must point to a buffer that is at least TCP_LOG_ID_LEN bytes long. * Returns number of bytes copied. */ size_t tcp_log_get_id(struct tcpcb *tp, char *buf) { size_t len; INP_LOCK_ASSERT(tp->t_inpcb); if (tp->t_lib != NULL) { len = strlcpy(buf, tp->t_lib->tlb_id, TCP_LOG_ID_LEN); KASSERT(len < TCP_LOG_ID_LEN, ("%s:%d: tp->t_lib->tlb_id too long (%zu)", __func__, __LINE__, len)); } else { *buf = '\0'; len = 0; } return (len); } /* * Get number of connections with the same log ID. * Log ID is taken from given TCPCB. * Called with INPCB locked. */ u_int tcp_log_get_id_cnt(struct tcpcb *tp) { INP_WLOCK_ASSERT(tp->t_inpcb); return ((tp->t_lib == NULL) ? 0 : tp->t_lib->tlb_refcnt); } #ifdef TCPLOG_DEBUG_RINGBUF /* * Functions/macros to increment/decrement reference count for a log * entry. This should catch when we do a double-free/double-remove or * a double-add. 
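 *
 * For example (sketch), adding the same entry to a list twice without
 * removing it in between trips the check at the second add:
 *
 *	tcp_log_entry_refcnt_add(log_entry);	refcount goes 0 -> 1
 *	tcp_log_entry_refcnt_add(log_entry);	refcount is 1 -> panic
 *
 * Removing an entry that is not on any list fails symmetrically.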
*/ static inline void _tcp_log_entry_refcnt_add(struct tcp_log_mem *log_entry, const char *func, int line) { int refcnt; refcnt = atomic_fetchadd_int(&log_entry->tlm_refcnt, 1); if (refcnt != 0) panic("%s:%d: log_entry(%p)->tlm_refcnt is %d (expected 0)", func, line, log_entry, refcnt); } #define tcp_log_entry_refcnt_add(l) \ _tcp_log_entry_refcnt_add((l), __func__, __LINE__) static inline void _tcp_log_entry_refcnt_rem(struct tcp_log_mem *log_entry, const char *func, int line) { int refcnt; refcnt = atomic_fetchadd_int(&log_entry->tlm_refcnt, -1); if (refcnt != 1) panic("%s:%d: log_entry(%p)->tlm_refcnt is %d (expected 1)", func, line, log_entry, refcnt); } #define tcp_log_entry_refcnt_rem(l) \ _tcp_log_entry_refcnt_rem((l), __func__, __LINE__) #else /* !TCPLOG_DEBUG_RINGBUF */ #define tcp_log_entry_refcnt_add(l) #define tcp_log_entry_refcnt_rem(l) #endif /* * Cleanup after removing a log entry, but only decrement the count if we * are running INVARIANTS. */ static inline void tcp_log_free_log_common(struct tcp_log_mem *log_entry, int *count __unused) { uma_zfree(tcp_log_zone, log_entry); #ifdef INVARIANTS (*count)--; KASSERT(*count >= 0, ("%s: count unexpectedly negative", __func__)); #endif } static void tcp_log_free_entries(struct tcp_log_stailq *head, int *count) { struct tcp_log_mem *log_entry; /* Free the entries. */ while ((log_entry = STAILQ_FIRST(head)) != NULL) { STAILQ_REMOVE_HEAD(head, tlm_queue); tcp_log_entry_refcnt_rem(log_entry); tcp_log_free_log_common(log_entry, count); } } /* Cleanup after removing a log entry. */ static inline void tcp_log_remove_log_cleanup(struct tcpcb *tp, struct tcp_log_mem *log_entry) { uma_zfree(tcp_log_zone, log_entry); tp->t_lognum--; KASSERT(tp->t_lognum >= 0, ("%s: tp->t_lognum unexpectedly negative", __func__)); } /* Remove a log entry from the head of a list. */ static inline void tcp_log_remove_log_head(struct tcpcb *tp, struct tcp_log_mem *log_entry) { KASSERT(log_entry == STAILQ_FIRST(&tp->t_logs), ("%s: attempt to remove non-HEAD log entry", __func__)); STAILQ_REMOVE_HEAD(&tp->t_logs, tlm_queue); tcp_log_entry_refcnt_rem(log_entry); tcp_log_remove_log_cleanup(tp, log_entry); } #ifdef TCPLOG_DEBUG_RINGBUF /* * Initialize the log entry's reference count, which we want to * survive allocations. */ static int tcp_log_zone_init(void *mem, int size, int flags __unused) { struct tcp_log_mem *tlm; KASSERT(size >= sizeof(struct tcp_log_mem), ("%s: unexpectedly short (%d) allocation", __func__, size)); tlm = (struct tcp_log_mem *)mem; tlm->tlm_refcnt = 0; return (0); } /* * Double check that the refcnt is zero on allocation and return. */ static int tcp_log_zone_ctor(void *mem, int size, void *args __unused, int flags __unused) { struct tcp_log_mem *tlm; KASSERT(size >= sizeof(struct tcp_log_mem), ("%s: unexpectedly short (%d) allocation", __func__, size)); tlm = (struct tcp_log_mem *)mem; if (tlm->tlm_refcnt != 0) panic("%s:%d: tlm(%p)->tlm_refcnt is %d (expected 0)", __func__, __LINE__, tlm, tlm->tlm_refcnt); return (0); } static void tcp_log_zone_dtor(void *mem, int size, void *args __unused) { struct tcp_log_mem *tlm; KASSERT(size >= sizeof(struct tcp_log_mem), ("%s: unexpectedly short (%d) allocation", __func__, size)); tlm = (struct tcp_log_mem *)mem; if (tlm->tlm_refcnt != 0) panic("%s:%d: tlm(%p)->tlm_refcnt is %d (expected 0)", __func__, __LINE__, tlm, tlm->tlm_refcnt); } #endif /* TCPLOG_DEBUG_RINGBUF */ /* Do global initialization. 
*/ void tcp_log_init(void) { tcp_log_zone = uma_zcreate("tcp_log", sizeof(struct tcp_log_mem), #ifdef TCPLOG_DEBUG_RINGBUF tcp_log_zone_ctor, tcp_log_zone_dtor, tcp_log_zone_init, #else NULL, NULL, NULL, #endif NULL, UMA_ALIGN_PTR, 0); (void)uma_zone_set_max(tcp_log_zone, TCP_LOG_BUF_DEFAULT_GLOBAL_LIMIT); tcp_log_bucket_zone = uma_zcreate("tcp_log_bucket", sizeof(struct tcp_log_id_bucket), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); tcp_log_node_zone = uma_zcreate("tcp_log_node", sizeof(struct tcp_log_id_node), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); #ifdef TCPLOG_DEBUG_COUNTERS tcp_log_queued = counter_u64_alloc(M_WAITOK); tcp_log_que_fail1 = counter_u64_alloc(M_WAITOK); tcp_log_que_fail2 = counter_u64_alloc(M_WAITOK); tcp_log_que_fail3 = counter_u64_alloc(M_WAITOK); tcp_log_que_fail4 = counter_u64_alloc(M_WAITOK); tcp_log_que_fail5 = counter_u64_alloc(M_WAITOK); tcp_log_que_copyout = counter_u64_alloc(M_WAITOK); tcp_log_que_read = counter_u64_alloc(M_WAITOK); tcp_log_que_freed = counter_u64_alloc(M_WAITOK); #endif rw_init_flags(&tcp_id_tree_lock, "TCP ID tree", RW_NEW); mtx_init(&tcp_log_expireq_mtx, "TCP log expireq", NULL, MTX_DEF); callout_init(&tcp_log_expireq_callout, 1); } /* Do per-TCPCB initialization. */ void tcp_log_tcpcbinit(struct tcpcb *tp) { /* A new TCPCB should start out zero-initialized. */ STAILQ_INIT(&tp->t_logs); /* * If we are doing auto-capturing, figure out whether we will capture * this session. */ if (tcp_log_selectauto()) { tp->t_logstate = tcp_log_auto_mode; tp->t_flags2 |= TF2_LOG_AUTO; } } /* Remove entries */ static void tcp_log_expire(void *unused __unused) { struct tcp_log_id_bucket *tlb; struct tcp_log_id_node *tln; sbintime_t expiry_limit; int tree_locked; TCPLOG_EXPIREQ_LOCK(); if (callout_pending(&tcp_log_expireq_callout)) { /* Callout was reset. */ TCPLOG_EXPIREQ_UNLOCK(); return; } /* * Process entries until we reach one that expires too far in the * future. Look one second in the future. */ expiry_limit = getsbinuptime() + SBT_1S; tree_locked = TREE_UNLOCKED; while ((tln = STAILQ_FIRST(&tcp_log_expireq_head)) != NULL && tln->tln_expiretime <= expiry_limit) { if (!callout_active(&tcp_log_expireq_callout)) { /* * Callout was stopped. I guess we should * just quit at this point. */ TCPLOG_EXPIREQ_UNLOCK(); return; } /* * Remove the node from the head of the list and unlock * the list. Change the expiry time to SBT_MAX as a signal * to other threads that we now own this. */ STAILQ_REMOVE_HEAD(&tcp_log_expireq_head, tln_expireq); tln->tln_expiretime = SBT_MAX; TCPLOG_EXPIREQ_UNLOCK(); /* * Remove the node from the bucket. */ tlb = tln->tln_bucket; TCPID_BUCKET_LOCK(tlb); if (tcp_log_remove_id_node(NULL, NULL, tlb, tln, &tree_locked)) { tcp_log_id_validate_tree_lock(tree_locked); if (tree_locked == TREE_WLOCKED) TCPID_TREE_WUNLOCK(); else TCPID_TREE_RUNLOCK(); tree_locked = TREE_UNLOCKED; } /* Drop the INP reference. */ INP_WLOCK(tln->tln_inp); if (!in_pcbrele_wlocked(tln->tln_inp)) INP_WUNLOCK(tln->tln_inp); /* Free the log records. */ tcp_log_free_entries(&tln->tln_entries, &tln->tln_count); /* Free the node. */ uma_zfree(tcp_log_node_zone, tln); /* Relock the expiry queue. */ TCPLOG_EXPIREQ_LOCK(); } /* * We've expired all the entries we can. Do we need to reschedule * ourselves? */ callout_deactivate(&tcp_log_expireq_callout); if (tln != NULL) { /* * Get max(now + TCP_LOG_EXPIRE_INTVL, tln->tln_expiretime) and * set the next callout to that. (This helps ensure we generally * run the callout no more often than desired.) 
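 *
 * Worked example (illustrative numbers): with TCP_LOG_EXPIRE_INTVL at
 * 5 seconds, if "now" is t=100s and the next entry expires at t=102s,
 * we arm the callout for t=105s; if the next entry instead expires at
 * t=110s, we arm it for t=110s. Either way, the callout never fires
 * more often than once per TCP_LOG_EXPIRE_INTVL.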
*/ expiry_limit = getsbinuptime() + TCP_LOG_EXPIRE_INTVL; if (expiry_limit < tln->tln_expiretime) expiry_limit = tln->tln_expiretime; callout_reset_sbt(&tcp_log_expireq_callout, expiry_limit, SBT_1S, tcp_log_expire, NULL, C_ABSOLUTE); } /* We're done. */ TCPLOG_EXPIREQ_UNLOCK(); return; } /* * Move log data from the TCPCB to a new node. This will reset the TCPCB log * entries and log count; however, it will not touch other things from the * TCPCB (e.g. t_lin, t_lib). * * NOTE: Must hold a lock on the INP. */ static void tcp_log_move_tp_to_node(struct tcpcb *tp, struct tcp_log_id_node *tln) { INP_WLOCK_ASSERT(tp->t_inpcb); tln->tln_ie = tp->t_inpcb->inp_inc.inc_ie; if (tp->t_inpcb->inp_inc.inc_flags & INC_ISIPV6) tln->tln_af = AF_INET6; else tln->tln_af = AF_INET; tln->tln_entries = tp->t_logs; tln->tln_count = tp->t_lognum; tln->tln_bucket = tp->t_lib; /* Clear information from the PCB. */ STAILQ_INIT(&tp->t_logs); tp->t_lognum = 0; } /* Do per-TCPCB cleanup */ void tcp_log_tcpcbfini(struct tcpcb *tp) { struct tcp_log_id_node *tln, *tln_first; struct tcp_log_mem *log_entry; sbintime_t callouttime; INP_WLOCK_ASSERT(tp->t_inpcb); /* * If we were gathering packets to be automatically dumped, try to do * it now. If this succeeds, the log information in the TCPCB will be * cleared. Otherwise, we'll handle the log information as we do * for other states. */ switch(tp->t_logstate) { case TCP_LOG_STATE_HEAD_AUTO: (void)tcp_log_dump_tp_logbuf(tp, "auto-dumped from head", M_NOWAIT, false); break; case TCP_LOG_STATE_TAIL_AUTO: (void)tcp_log_dump_tp_logbuf(tp, "auto-dumped from tail", M_NOWAIT, false); break; case TCP_LOG_STATE_CONTINUAL: (void)tcp_log_dump_tp_logbuf(tp, "auto-dumped from continual", M_NOWAIT, false); break; } /* * There are two ways we could keep logs: per-socket or per-ID. If * we are tracking logs with an ID, then the logs survive the * destruction of the TCPCB. * * If the TCPCB is associated with an ID node, move the logs from the * TCPCB to the ID node. In theory, this is safe, for reasons which I * will now explain for my own benefit when I next need to figure out * this code. :-) * * We own the INP lock. Therefore, no one else can change the contents * of this node (Rule C). Further, no one can remove this node from * the bucket while we hold the lock (Rule D). Basically, no one can * mess with this node. That leaves two states in which we could be: * * 1. Another thread is currently waiting to acquire the INP lock, with * plans to do something with this node. When we drop the INP lock, * they will have a chance to do that. They will recheck the * tln_closed field (see note to Rule C) and then acquire the * bucket lock before proceeding further. * * 2. Another thread will try to acquire a lock at some point in the * future. If they try to acquire a lock before we set the * tln_closed field, they will follow state #1. If they try to * acquire a lock after we set the tln_closed field, they will be * able to make changes to the node, at will, following Rule C. * * Therefore, we currently own this node and can make any changes * we want. But, as soon as we set the tln_closed field to true, we * have effectively dropped our lock on the node. (For this reason, we * also need to make sure our writes are ordered correctly. An atomic * operation with "release" semantics should be sufficient.) */ if (tp->t_lin != NULL) { /* Copy the relevant information to the log entry. 
*/ tln = tp->t_lin; KASSERT(tln->tln_inp == tp->t_inpcb, ("%s: Mismatched inp (tln->tln_inp=%p, tp->t_inpcb=%p)", __func__, tln->tln_inp, tp->t_inpcb)); tcp_log_move_tp_to_node(tp, tln); /* Clear information from the PCB. */ tp->t_lin = NULL; tp->t_lib = NULL; /* * Take a reference on the INP. This ensures that the INP * remains valid while the node is on the expiry queue. This * ensures the INP is valid for other threads that may be * racing to lock this node when we move it to the expire * queue. */ in_pcbref(tp->t_inpcb); /* * Store the entry on the expiry list. The exact behavior * depends on whether we have entries to keep. If so, we * put the entry at the tail of the list and expire in * TCP_LOG_EXPIRE_TIME. Otherwise, we expire "now" and put * the entry at the head of the list. (Handling the cleanup * via the expiry timer lets us avoid locking messy-ness here.) */ tln->tln_expiretime = getsbinuptime(); TCPLOG_EXPIREQ_LOCK(); if (tln->tln_count) { tln->tln_expiretime += TCP_LOG_EXPIRE_TIME; if (STAILQ_EMPTY(&tcp_log_expireq_head) && !callout_active(&tcp_log_expireq_callout)) { /* * We are adding the first entry and a callout * is not currently scheduled; therefore, we * need to schedule one. */ callout_reset_sbt(&tcp_log_expireq_callout, tln->tln_expiretime, SBT_1S, tcp_log_expire, NULL, C_ABSOLUTE); } STAILQ_INSERT_TAIL(&tcp_log_expireq_head, tln, tln_expireq); } else { callouttime = tln->tln_expiretime + TCP_LOG_EXPIRE_INTVL; tln_first = STAILQ_FIRST(&tcp_log_expireq_head); if ((tln_first == NULL || callouttime < tln_first->tln_expiretime) && (callout_pending(&tcp_log_expireq_callout) || !callout_active(&tcp_log_expireq_callout))) { /* * The list is empty, or we want to run the * expire code before the first entry's timer * fires. Also, we are in a case where a callout * is not actively running. We want to reset * the callout to occur sooner. */ callout_reset_sbt(&tcp_log_expireq_callout, callouttime, SBT_1S, tcp_log_expire, NULL, C_ABSOLUTE); } /* * Insert to the head, or just after the head, as * appropriate. (This might result in small * mis-orderings as a bunch of "expire now" entries * gather at the start of the list, but that should * not produce big problems, since the expire timer * will walk through all of them.) */ if (tln_first == NULL || tln->tln_expiretime < tln_first->tln_expiretime) STAILQ_INSERT_HEAD(&tcp_log_expireq_head, tln, tln_expireq); else STAILQ_INSERT_AFTER(&tcp_log_expireq_head, tln_first, tln, tln_expireq); } TCPLOG_EXPIREQ_UNLOCK(); /* * We are done messing with the tln. After this point, we * can't touch it. (Note that the "release" semantics should * be included with the TCPLOG_EXPIREQ_UNLOCK() call above. * Therefore, they should be unnecessary here. However, it * seems like a good idea to include them anyway, since we * really are releasing a lock here.) */ atomic_store_rel_int(&tln->tln_closed, 1); } else { /* Remove log entries. */ while ((log_entry = STAILQ_FIRST(&tp->t_logs)) != NULL) tcp_log_remove_log_head(tp, log_entry); KASSERT(tp->t_lognum == 0, ("%s: After freeing entries, tp->t_lognum=%d (expected 0)", __func__, tp->t_lognum)); } /* * Change the log state to off (just in case anything tries to sneak * in a last-minute log). */ tp->t_logstate = TCP_LOG_STATE_OFF; } /* * This logs an event for a TCP socket. Normally, this is called via * TCP_LOG_EVENT or TCP_LOG_EVENT_VERBOSE. See the documentation for * TCP_LOG_EVENT(). 
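 *
 * A typical call site looks roughly like the following sketch (the
 * event ID, length, and socket-buffer arguments here are illustrative
 * rather than taken from any one caller):
 *
 *	TCP_LOG_EVENT(tp, th, &so->so_rcv, &so->so_snd,
 *	    TCP_LOG_IN, 0, tlen, NULL, true);
 *
 * When verbose logging is enabled, the macro also records the calling
 * function and line number via the extra arguments to this function.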
*/ struct tcp_log_buffer * tcp_log_event_(struct tcpcb *tp, struct tcphdr *th, struct sockbuf *rxbuf, struct sockbuf *txbuf, uint8_t eventid, int errornum, uint32_t len, union tcp_log_stackspecific *stackinfo, int th_hostorder, const char *output_caller, const char *func, int line, const struct timeval *itv) { struct tcp_log_mem *log_entry; struct tcp_log_buffer *log_buf; int attempt_count = 0; struct tcp_log_verbose *log_verbose; uint32_t logsn; KASSERT((func == NULL && line == 0) || (func != NULL && line > 0), ("%s called with inconsistent func (%p) and line (%d) arguments", __func__, func, line)); INP_WLOCK_ASSERT(tp->t_inpcb); KASSERT(tp->t_logstate == TCP_LOG_STATE_HEAD || tp->t_logstate == TCP_LOG_STATE_TAIL || tp->t_logstate == TCP_LOG_STATE_CONTINUAL || tp->t_logstate == TCP_LOG_STATE_HEAD_AUTO || tp->t_logstate == TCP_LOG_STATE_TAIL_AUTO, ("%s called with unexpected tp->t_logstate (%d)", __func__, tp->t_logstate)); /* * Get the serial number. We do this early so it will * increment even if we end up skipping the log entry for some * reason. */ logsn = tp->t_logsn++; /* * Can we get a new log entry? If so, increment the lognum counter * here. */ retry: if (tp->t_lognum < tcp_log_session_limit) { if ((log_entry = uma_zalloc(tcp_log_zone, M_NOWAIT)) != NULL) tp->t_lognum++; } else log_entry = NULL; /* Do we need to try to reuse? */ if (log_entry == NULL) { /* * Sacrifice auto-logged sessions without a log ID if * tcp_log_auto_all is false. (If they don't have a log * ID by now, it is probable that either they won't get one * or we are resource-constrained.) */ if (tp->t_lib == NULL && (tp->t_flags2 & TF2_LOG_AUTO) && !tcp_log_auto_all) { if (tcp_log_state_change(tp, TCP_LOG_STATE_CLEAR)) { #ifdef INVARIANTS panic("%s:%d: tcp_log_state_change() failed " "to set tp %p to TCP_LOG_STATE_CLEAR", __func__, __LINE__, tp); #endif tp->t_logstate = TCP_LOG_STATE_OFF; } return (NULL); } /* * If we are in TCP_LOG_STATE_HEAD_AUTO state, try to dump * the buffers. If successful, deactivate tracing. Otherwise, * leave it active so we will retry. */ if (tp->t_logstate == TCP_LOG_STATE_HEAD_AUTO && !tcp_log_dump_tp_logbuf(tp, "auto-dumped from head", M_NOWAIT, false)) { tp->t_logstate = TCP_LOG_STATE_OFF; return(NULL); } else if ((tp->t_logstate == TCP_LOG_STATE_CONTINUAL) && !tcp_log_dump_tp_logbuf(tp, "auto-dumped from continual", M_NOWAIT, false)) { if (attempt_count == 0) { attempt_count++; goto retry; } #ifdef TCPLOG_DEBUG_COUNTERS counter_u64_add(tcp_log_que_fail4, 1); #endif return(NULL); } else if (tp->t_logstate == TCP_LOG_STATE_HEAD_AUTO) return(NULL); /* If in HEAD state, just deactivate the tracing and return. */ if (tp->t_logstate == TCP_LOG_STATE_HEAD) { tp->t_logstate = TCP_LOG_STATE_OFF; return(NULL); } /* * Get a buffer to reuse. If that fails, just give up. * (We can't log anything without a buffer in which to * put it.) * * Note that we don't change the t_lognum counter * here. Because we are re-using the buffer, the total * number won't change. */ if ((log_entry = STAILQ_FIRST(&tp->t_logs)) == NULL) return(NULL); STAILQ_REMOVE_HEAD(&tp->t_logs, tlm_queue); tcp_log_entry_refcnt_rem(log_entry); } KASSERT(log_entry != NULL, ("%s: log_entry unexpectedly NULL", __func__)); /* Extract the log buffer and verbose buffer pointers. */ log_buf = &log_entry->tlm_buf; log_verbose = &log_entry->tlm_v; /* Basic entries. 
*/ if (itv == NULL) getmicrouptime(&log_buf->tlb_tv); else memcpy(&log_buf->tlb_tv, itv, sizeof(struct timeval)); log_buf->tlb_ticks = ticks; log_buf->tlb_sn = logsn; log_buf->tlb_stackid = tp->t_fb->tfb_id; log_buf->tlb_eventid = eventid; log_buf->tlb_eventflags = 0; log_buf->tlb_errno = errornum; /* Socket buffers */ if (rxbuf != NULL) { log_buf->tlb_eventflags |= TLB_FLAG_RXBUF; log_buf->tlb_rxbuf.tls_sb_acc = rxbuf->sb_acc; log_buf->tlb_rxbuf.tls_sb_ccc = rxbuf->sb_ccc; log_buf->tlb_rxbuf.tls_sb_spare = 0; } if (txbuf != NULL) { log_buf->tlb_eventflags |= TLB_FLAG_TXBUF; log_buf->tlb_txbuf.tls_sb_acc = txbuf->sb_acc; log_buf->tlb_txbuf.tls_sb_ccc = txbuf->sb_ccc; log_buf->tlb_txbuf.tls_sb_spare = 0; } /* Copy values from tp to the log entry. */ #define COPY_STAT(f) log_buf->tlb_ ## f = tp->f #define COPY_STAT_T(f) log_buf->tlb_ ## f = tp->t_ ## f COPY_STAT_T(state); COPY_STAT_T(starttime); COPY_STAT(iss); COPY_STAT_T(flags); COPY_STAT(snd_una); COPY_STAT(snd_max); COPY_STAT(snd_cwnd); COPY_STAT(snd_nxt); COPY_STAT(snd_recover); COPY_STAT(snd_wnd); COPY_STAT(snd_ssthresh); COPY_STAT_T(srtt); COPY_STAT_T(rttvar); COPY_STAT(rcv_up); COPY_STAT(rcv_adv); COPY_STAT(rcv_nxt); COPY_STAT(sack_newdata); COPY_STAT(rcv_wnd); COPY_STAT_T(dupacks); COPY_STAT_T(segqlen); COPY_STAT(snd_numholes); COPY_STAT(snd_scale); COPY_STAT(rcv_scale); #undef COPY_STAT #undef COPY_STAT_T log_buf->tlb_flex1 = 0; log_buf->tlb_flex2 = 0; /* Copy stack-specific info. */ if (stackinfo != NULL) { memcpy(&log_buf->tlb_stackinfo, stackinfo, sizeof(log_buf->tlb_stackinfo)); log_buf->tlb_eventflags |= TLB_FLAG_STACKINFO; } /* The packet */ log_buf->tlb_len = len; if (th) { int optlen; log_buf->tlb_eventflags |= TLB_FLAG_HDR; log_buf->tlb_th = *th; if (th_hostorder) tcp_fields_to_net(&log_buf->tlb_th); optlen = (th->th_off << 2) - sizeof (struct tcphdr); if (optlen > 0) memcpy(log_buf->tlb_opts, th + 1, optlen); } /* Verbose information */ if (func != NULL) { log_buf->tlb_eventflags |= TLB_FLAG_VERBOSE; if (output_caller != NULL) strlcpy(log_verbose->tlv_snd_frm, output_caller, TCP_FUNC_LEN); else *log_verbose->tlv_snd_frm = 0; strlcpy(log_verbose->tlv_trace_func, func, TCP_FUNC_LEN); log_verbose->tlv_trace_line = line; } /* Insert the new log at the tail. */ STAILQ_INSERT_TAIL(&tp->t_logs, log_entry, tlm_queue); tcp_log_entry_refcnt_add(log_entry); return (log_buf); } /* * Change the logging state for a TCPCB. Returns 0 on success or an * error code on failure. */ int tcp_log_state_change(struct tcpcb *tp, int state) { struct tcp_log_mem *log_entry; INP_WLOCK_ASSERT(tp->t_inpcb); switch(state) { case TCP_LOG_STATE_CLEAR: while ((log_entry = STAILQ_FIRST(&tp->t_logs)) != NULL) tcp_log_remove_log_head(tp, log_entry); /* Fall through */ case TCP_LOG_STATE_OFF: tp->t_logstate = TCP_LOG_STATE_OFF; break; case TCP_LOG_STATE_TAIL: case TCP_LOG_STATE_HEAD: case TCP_LOG_STATE_CONTINUAL: case TCP_LOG_STATE_HEAD_AUTO: case TCP_LOG_STATE_TAIL_AUTO: tp->t_logstate = state; break; default: return (EINVAL); } tp->t_flags2 &= ~(TF2_LOG_AUTO); return (0); } /* If tcp_drain() is called, flush half the log entries. */ void tcp_log_drain(struct tcpcb *tp) { struct tcp_log_mem *log_entry, *next; int target, skip; INP_WLOCK_ASSERT(tp->t_inpcb); if ((target = tp->t_lognum / 2) == 0) return; /* * If we are logging the "head" packets, we want to discard * from the tail of the queue. Otherwise, we want to discard * from the head. 
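 *
 * Worked example (illustrative): with t_lognum = 10, the target is 5.
 * In HEAD/HEAD_AUTO mode we skip past the first 5 entries and free the
 * remaining 5 from the tail, preserving the oldest records; otherwise
 * we free the first 5 from the head, preserving the newest. In
 * CONTINUAL mode the entries are instead dumped to the log device.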
*/ if (tp->t_logstate == TCP_LOG_STATE_HEAD || tp->t_logstate == TCP_LOG_STATE_HEAD_AUTO) { skip = tp->t_lognum - target; STAILQ_FOREACH(log_entry, &tp->t_logs, tlm_queue) if (!--skip) break; KASSERT(log_entry != NULL, ("%s: skipped through all entries!", __func__)); if (log_entry == NULL) return; while ((next = STAILQ_NEXT(log_entry, tlm_queue)) != NULL) { STAILQ_REMOVE_AFTER(&tp->t_logs, log_entry, tlm_queue); tcp_log_entry_refcnt_rem(next); tcp_log_remove_log_cleanup(tp, next); #ifdef INVARIANTS target--; #endif } KASSERT(target == 0, ("%s: After removing from tail, target was %d", __func__, target)); } else if (tp->t_logstate == TCP_LOG_STATE_CONTINUAL) { (void)tcp_log_dump_tp_logbuf(tp, "auto-dumped from continual", M_NOWAIT, false); } else { while ((log_entry = STAILQ_FIRST(&tp->t_logs)) != NULL && target--) tcp_log_remove_log_head(tp, log_entry); KASSERT(target <= 0, ("%s: After removing from head, target was %d", __func__, target)); KASSERT(tp->t_lognum > 0, ("%s: After removing from head, tp->t_lognum was %d", __func__, target)); KASSERT(log_entry != NULL, ("%s: After removing from head, the tailq was empty", __func__)); } } static inline int tcp_log_copyout(struct sockopt *sopt, void *src, void *dst, size_t len) { if (sopt->sopt_td != NULL) return (copyout(src, dst, len)); bcopy(src, dst, len); return (0); } static int tcp_log_logs_to_buf(struct sockopt *sopt, struct tcp_log_stailq *log_tailqp, struct tcp_log_buffer **end, int count) { struct tcp_log_buffer *out_entry; struct tcp_log_mem *log_entry; size_t entrysize; int error; #ifdef INVARIANTS int orig_count = count; #endif /* Copy the data out. */ error = 0; out_entry = (struct tcp_log_buffer *) sopt->sopt_val; STAILQ_FOREACH(log_entry, log_tailqp, tlm_queue) { count--; KASSERT(count >= 0, ("%s:%d: Exceeded expected count (%d) processing list %p", __func__, __LINE__, orig_count, log_tailqp)); #ifdef TCPLOG_DEBUG_COUNTERS counter_u64_add(tcp_log_que_copyout, 1); #endif -#if 0 - struct tcp_log_buffer *lb = &log_entry->tlm_buf; - int i; - printf("lb = %p:\n", lb); -#define PRINT(f) printf(#f " = %u\n", (unsigned int)lb->f) - printf("tlb_tv = {%lu, %lu}\n", lb->tlb_tv.tv_sec, lb->tlb_tv.tv_usec); - PRINT(tlb_ticks); - PRINT(tlb_sn); - PRINT(tlb_stackid); - PRINT(tlb_eventid); - PRINT(tlb_eventflags); - PRINT(tlb_errno); - PRINT(tlb_rxbuf.tls_sb_acc); - PRINT(tlb_rxbuf.tls_sb_ccc); - PRINT(tlb_rxbuf.tls_sb_spare); - PRINT(tlb_txbuf.tls_sb_acc); - PRINT(tlb_txbuf.tls_sb_ccc); - PRINT(tlb_txbuf.tls_sb_spare); - PRINT(tlb_state); - PRINT(tlb_flags); - PRINT(tlb_snd_una); - PRINT(tlb_snd_max); - PRINT(tlb_snd_cwnd); - PRINT(tlb_snd_nxt); - PRINT(tlb_snd_recover); - PRINT(tlb_snd_wnd); - PRINT(tlb_snd_ssthresh); - PRINT(tlb_srtt); - PRINT(tlb_rttvar); - PRINT(tlb_rcv_up); - PRINT(tlb_rcv_adv); - PRINT(tlb_rcv_nxt); - PRINT(tlb_sack_newdata); - PRINT(tlb_rcv_wnd); - PRINT(tlb_dupacks); - PRINT(tlb_segqlen); - PRINT(tlb_snd_numholes); - PRINT(tlb_snd_scale); - PRINT(tlb_rcv_scale); - PRINT(tlb_len); - printf("hex dump: "); - for (i = 0; i < sizeof(struct tcp_log_buffer); i++) - printf("%02x", *(((uint8_t *)lb) + i)); -#undef PRINT -#endif /* * Skip copying out the header if it isn't present. * Instead, copy out zeros (to ensure we don't leak info). * TODO: Make sure we truly do zero everything we don't * explicitly set. 
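 *
 * Concretely (sketch): entries without TLB_FLAG_HDR are copied out
 * only up to offsetof(struct tcp_log_buffer, tlb_th); the rest of the
 * record is filled from the static zerobuf, so userspace always
 * receives fixed-size records with no stale kernel memory where the
 * header would be.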
*/ if (log_entry->tlm_buf.tlb_eventflags & TLB_FLAG_HDR) entrysize = sizeof(struct tcp_log_buffer); else entrysize = offsetof(struct tcp_log_buffer, tlb_th); error = tcp_log_copyout(sopt, &log_entry->tlm_buf, out_entry, entrysize); if (error) break; if (!(log_entry->tlm_buf.tlb_eventflags & TLB_FLAG_HDR)) { error = tcp_log_copyout(sopt, zerobuf, ((uint8_t *)out_entry) + entrysize, sizeof(struct tcp_log_buffer) - entrysize); } /* * Copy out the verbose bit, if needed. Either way, * increment the output pointer the correct amount. */ if (log_entry->tlm_buf.tlb_eventflags & TLB_FLAG_VERBOSE) { error = tcp_log_copyout(sopt, &log_entry->tlm_v, out_entry->tlb_verbose, sizeof(struct tcp_log_verbose)); if (error) break; out_entry = (struct tcp_log_buffer *) (((uint8_t *) (out_entry + 1)) + sizeof(struct tcp_log_verbose)); } else out_entry++; } *end = out_entry; KASSERT(error || count == 0, ("%s:%d: Less than expected count (%d) processing list %p" " (%d remain)", __func__, __LINE__, orig_count, log_tailqp, count)); return (error); } /* * Copy out the buffer. Note that we do incremental copying, so * sooptcopyout() won't work. However, the goal is to produce the same * end result as if we copied in the entire user buffer, updated it, * and then used sooptcopyout() to copy it out. * * NOTE: This should be called with a write lock on the PCB; however, * the function will drop it after it extracts the data from the TCPCB. */ int tcp_log_getlogbuf(struct sockopt *sopt, struct tcpcb *tp) { struct tcp_log_stailq log_tailq; struct tcp_log_mem *log_entry, *log_next; struct tcp_log_buffer *out_entry; struct inpcb *inp; size_t outsize, entrysize; int error, outnum; INP_WLOCK_ASSERT(tp->t_inpcb); inp = tp->t_inpcb; /* * Determine which log entries will fit in the buffer. As an * optimization, skip this if all the entries will clearly fit * in the buffer. (However, get an exact size if we are using * INVARIANTS.) */ #ifndef INVARIANTS if (sopt->sopt_valsize / (sizeof(struct tcp_log_buffer) + sizeof(struct tcp_log_verbose)) >= tp->t_lognum) { log_entry = STAILQ_LAST(&tp->t_logs, tcp_log_mem, tlm_queue); log_next = NULL; outsize = 0; outnum = tp->t_lognum; } else { #endif outsize = outnum = 0; log_entry = NULL; STAILQ_FOREACH(log_next, &tp->t_logs, tlm_queue) { entrysize = sizeof(struct tcp_log_buffer); if (log_next->tlm_buf.tlb_eventflags & TLB_FLAG_VERBOSE) entrysize += sizeof(struct tcp_log_verbose); if ((sopt->sopt_valsize - outsize) < entrysize) break; outsize += entrysize; outnum++; log_entry = log_next; } KASSERT(outsize <= sopt->sopt_valsize, ("%s: calculated output size (%zu) greater than available" "space (%zu)", __func__, outsize, sopt->sopt_valsize)); #ifndef INVARIANTS } #endif /* * Copy traditional sooptcopyout() behavior: if sopt->sopt_val * is NULL, silently skip the copy. However, in this case, we * will leave the list alone and return. Functionally, this * gives userspace a way to poll for an approximate buffer * size they will need to get the log entries. */ if (sopt->sopt_val == NULL) { INP_WUNLOCK(inp); if (outsize == 0) { outsize = outnum * (sizeof(struct tcp_log_buffer) + sizeof(struct tcp_log_verbose)); } if (sopt->sopt_valsize > outsize) sopt->sopt_valsize = outsize; return (0); } /* * Break apart the list. We'll save the ones we want to copy * out locally and remove them from the TCPCB list. We can * then drop the INPCB lock while we do the copyout. * * There are roughly three cases: * 1. There was nothing to copy out. That's easy: drop the * lock and return. * 2. 
We are copying out the entire list. Again, that's easy: * move the whole list. * 3. We are copying out a partial list. That's harder. We * need to update the list book-keeping entries. */ if (log_entry != NULL && log_next == NULL) { /* Move entire list. */ KASSERT(outnum == tp->t_lognum, ("%s:%d: outnum (%d) should match tp->t_lognum (%d)", __func__, __LINE__, outnum, tp->t_lognum)); log_tailq = tp->t_logs; tp->t_lognum = 0; STAILQ_INIT(&tp->t_logs); } else if (log_entry != NULL) { /* Move partial list. */ KASSERT(outnum < tp->t_lognum, ("%s:%d: outnum (%d) not less than tp->t_lognum (%d)", __func__, __LINE__, outnum, tp->t_lognum)); STAILQ_FIRST(&log_tailq) = STAILQ_FIRST(&tp->t_logs); STAILQ_FIRST(&tp->t_logs) = STAILQ_NEXT(log_entry, tlm_queue); KASSERT(STAILQ_NEXT(log_entry, tlm_queue) != NULL, ("%s:%d: tp->t_logs is unexpectedly shorter than expected" "(tp: %p, log_tailq: %p, outnum: %d, tp->t_lognum: %d)", __func__, __LINE__, tp, &log_tailq, outnum, tp->t_lognum)); STAILQ_NEXT(log_entry, tlm_queue) = NULL; log_tailq.stqh_last = &STAILQ_NEXT(log_entry, tlm_queue); tp->t_lognum -= outnum; } else STAILQ_INIT(&log_tailq); /* Drop the PCB lock. */ INP_WUNLOCK(inp); /* Copy the data out. */ error = tcp_log_logs_to_buf(sopt, &log_tailq, &out_entry, outnum); if (error) { /* Restore list */ INP_WLOCK(inp); if ((inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) == 0) { tp = intotcpcb(inp); /* Merge the two lists. */ STAILQ_CONCAT(&log_tailq, &tp->t_logs); tp->t_logs = log_tailq; tp->t_lognum += outnum; } INP_WUNLOCK(inp); } else { /* Sanity check entries */ KASSERT(((caddr_t)out_entry - (caddr_t)sopt->sopt_val) == outsize, ("%s: Actual output size (%zu) != " "calculated output size (%zu)", __func__, (size_t)((caddr_t)out_entry - (caddr_t)sopt->sopt_val), outsize)); /* Free the entries we just copied out. */ STAILQ_FOREACH_SAFE(log_entry, &log_tailq, tlm_queue, log_next) { tcp_log_entry_refcnt_rem(log_entry); uma_zfree(tcp_log_zone, log_entry); } } sopt->sopt_valsize = (size_t)((caddr_t)out_entry - (caddr_t)sopt->sopt_val); return (error); } static void tcp_log_free_queue(struct tcp_log_dev_queue *param) { struct tcp_log_dev_log_queue *entry; KASSERT(param != NULL, ("%s: called with NULL param", __func__)); if (param == NULL) return; entry = (struct tcp_log_dev_log_queue *)param; /* Free the entries. */ tcp_log_free_entries(&entry->tldl_entries, &entry->tldl_count); /* Free the buffer, if it is allocated. */ if (entry->tldl_common.tldq_buf != NULL) free(entry->tldl_common.tldq_buf, M_TCPLOGDEV); /* Free the queue entry. */ free(entry, M_TCPLOGDEV); } static struct tcp_log_common_header * tcp_log_expandlogbuf(struct tcp_log_dev_queue *param) { struct tcp_log_dev_log_queue *entry; struct tcp_log_header *hdr; uint8_t *end; struct sockopt sopt; int error; entry = (struct tcp_log_dev_log_queue *)param; /* Take a worst-case guess at space needs. */ sopt.sopt_valsize = sizeof(struct tcp_log_header) + entry->tldl_count * (sizeof(struct tcp_log_buffer) + sizeof(struct tcp_log_verbose)); hdr = malloc(sopt.sopt_valsize, M_TCPLOGDEV, M_NOWAIT); if (hdr == NULL) { #ifdef TCPLOG_DEBUG_COUNTERS counter_u64_add(tcp_log_que_fail5, entry->tldl_count); #endif return (NULL); } sopt.sopt_val = hdr + 1; sopt.sopt_valsize -= sizeof(struct tcp_log_header); sopt.sopt_td = NULL; error = tcp_log_logs_to_buf(&sopt, &entry->tldl_entries, (struct tcp_log_buffer **)&end, entry->tldl_count); if (error) { free(hdr, M_TCPLOGDEV); return (NULL); } /* Free the entries. 
*/ tcp_log_free_entries(&entry->tldl_entries, &entry->tldl_count); entry->tldl_count = 0; memset(hdr, 0, sizeof(struct tcp_log_header)); hdr->tlh_version = TCP_LOG_BUF_VER; hdr->tlh_type = TCP_LOG_DEV_TYPE_BBR; hdr->tlh_length = end - (uint8_t *)hdr; hdr->tlh_ie = entry->tldl_ie; hdr->tlh_af = entry->tldl_af; getboottime(&hdr->tlh_offset); strlcpy(hdr->tlh_id, entry->tldl_id, TCP_LOG_ID_LEN); strlcpy(hdr->tlh_reason, entry->tldl_reason, TCP_LOG_REASON_LEN); return ((struct tcp_log_common_header *)hdr); } /* * Queue the tcpcb's log buffer for transmission via the log buffer facility. * * NOTE: This should be called with a write lock on the PCB. * * how should be M_WAITOK or M_NOWAIT. If M_WAITOK, the function will drop * and reacquire the INP lock if it needs to do so. * * If force is false, this will only dump auto-logged sessions if * tcp_log_auto_all is true or if there is a log ID defined for the session. */ int tcp_log_dump_tp_logbuf(struct tcpcb *tp, char *reason, int how, bool force) { struct tcp_log_dev_log_queue *entry; struct inpcb *inp; #ifdef TCPLOG_DEBUG_COUNTERS int num_entries; #endif inp = tp->t_inpcb; INP_WLOCK_ASSERT(inp); /* If there are no log entries, there is nothing to do. */ if (tp->t_lognum == 0) return (0); /* Check for a log ID. */ if (tp->t_lib == NULL && (tp->t_flags2 & TF2_LOG_AUTO) && !tcp_log_auto_all && !force) { struct tcp_log_mem *log_entry; /* * We needed a log ID and none was found. Free the log entries * and return success. Also, cancel further logging. If the * session doesn't have a log ID by now, we'll assume it isn't * going to get one. */ while ((log_entry = STAILQ_FIRST(&tp->t_logs)) != NULL) tcp_log_remove_log_head(tp, log_entry); KASSERT(tp->t_lognum == 0, ("%s: After freeing entries, tp->t_lognum=%d (expected 0)", __func__, tp->t_lognum)); tp->t_logstate = TCP_LOG_STATE_OFF; return (0); } /* * Allocate memory. If we must wait, we'll need to drop the locks * and reacquire them (and do all the related business that goes * along with that). */ entry = malloc(sizeof(struct tcp_log_dev_log_queue), M_TCPLOGDEV, M_NOWAIT); if (entry == NULL && (how & M_NOWAIT)) { #ifdef TCPLOG_DEBUG_COUNTERS counter_u64_add(tcp_log_que_fail3, 1); #endif return (ENOBUFS); } if (entry == NULL) { INP_WUNLOCK(inp); entry = malloc(sizeof(struct tcp_log_dev_log_queue), M_TCPLOGDEV, M_WAITOK); INP_WLOCK(inp); /* * Note that this check is slightly overly-restrictive in * that the TCB can survive either of these events. * However, there is currently not a good way to ensure * that is the case. So, if we hit this M_WAIT path, we * may end up dropping some entries. That seems like a * small price to pay for safety. */ if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) { free(entry, M_TCPLOGDEV); #ifdef TCPLOG_DEBUG_COUNTERS counter_u64_add(tcp_log_que_fail2, 1); #endif return (ECONNRESET); } tp = intotcpcb(inp); if (tp->t_lognum == 0) { free(entry, M_TCPLOGDEV); return (0); } } /* Fill in the unique parts of the queue entry. */ if (tp->t_lib != NULL) strlcpy(entry->tldl_id, tp->t_lib->tlb_id, TCP_LOG_ID_LEN); else strlcpy(entry->tldl_id, "UNKNOWN", TCP_LOG_ID_LEN); if (reason != NULL) strlcpy(entry->tldl_reason, reason, TCP_LOG_REASON_LEN); else strlcpy(entry->tldl_reason, "UNKNOWN", TCP_LOG_REASON_LEN); entry->tldl_ie = inp->inp_inc.inc_ie; if (inp->inp_inc.inc_flags & INC_ISIPV6) entry->tldl_af = AF_INET6; else entry->tldl_af = AF_INET; entry->tldl_entries = tp->t_logs; entry->tldl_count = tp->t_lognum; /* Fill in the common parts of the queue entry.
*/ entry->tldl_common.tldq_buf = NULL; entry->tldl_common.tldq_xform = tcp_log_expandlogbuf; entry->tldl_common.tldq_dtor = tcp_log_free_queue; /* Clear the log data from the TCPCB. */ #ifdef TCPLOG_DEBUG_COUNTERS num_entries = tp->t_lognum; #endif tp->t_lognum = 0; STAILQ_INIT(&tp->t_logs); /* Add the entry. If no one is listening, free the entry. */ if (tcp_log_dev_add_log((struct tcp_log_dev_queue *)entry)) { tcp_log_free_queue((struct tcp_log_dev_queue *)entry); #ifdef TCPLOG_DEBUG_COUNTERS counter_u64_add(tcp_log_que_fail1, num_entries); } else { counter_u64_add(tcp_log_queued, num_entries); #endif } return (0); } /* * Queue the log_id_node's log buffers for transmission via the log buffer * facility. * * NOTE: This should be called with the bucket locked and referenced. * * how should be M_WAITOK or M_NOWAIT. If M_WAITOK, the function will drop * and reacquire the bucket lock if it needs to do so. (The caller must * ensure that the tln is no longer on any lists so no one else will mess * with this while the lock is dropped!) */ static int tcp_log_dump_node_logbuf(struct tcp_log_id_node *tln, char *reason, int how) { struct tcp_log_dev_log_queue *entry; struct tcp_log_id_bucket *tlb; tlb = tln->tln_bucket; TCPID_BUCKET_LOCK_ASSERT(tlb); KASSERT(tlb->tlb_refcnt > 0, ("%s:%d: Called with unreferenced bucket (tln=%p, tlb=%p)", __func__, __LINE__, tln, tlb)); KASSERT(tln->tln_closed, ("%s:%d: Called for node with tln_closed==false (tln=%p)", __func__, __LINE__, tln)); /* If there are no log entries, there is nothing to do. */ if (tln->tln_count == 0) return (0); /* * Allocate memory. If we must wait, we'll need to drop the locks * and reacquire them (and do all the related business that goes * along with that). */ entry = malloc(sizeof(struct tcp_log_dev_log_queue), M_TCPLOGDEV, M_NOWAIT); if (entry == NULL && (how & M_NOWAIT)) return (ENOBUFS); if (entry == NULL) { TCPID_BUCKET_UNLOCK(tlb); entry = malloc(sizeof(struct tcp_log_dev_log_queue), M_TCPLOGDEV, M_WAITOK); TCPID_BUCKET_LOCK(tlb); } /* Fill in the common parts of the queue entry. */ entry->tldl_common.tldq_buf = NULL; entry->tldl_common.tldq_xform = tcp_log_expandlogbuf; entry->tldl_common.tldq_dtor = tcp_log_free_queue; /* Fill in the unique parts of the queue entry. */ strlcpy(entry->tldl_id, tlb->tlb_id, TCP_LOG_ID_LEN); if (reason != NULL) strlcpy(entry->tldl_reason, reason, TCP_LOG_REASON_LEN); else strlcpy(entry->tldl_reason, "UNKNOWN", TCP_LOG_REASON_LEN); entry->tldl_ie = tln->tln_ie; entry->tldl_entries = tln->tln_entries; entry->tldl_count = tln->tln_count; entry->tldl_af = tln->tln_af; /* Add the entry. If no one is listening, free the entry. */ if (tcp_log_dev_add_log((struct tcp_log_dev_queue *)entry)) tcp_log_free_queue((struct tcp_log_dev_queue *)entry); return (0); } /* * Queue the log buffers for all sessions in a bucket for transmission via * the log buffer facility. * * NOTE: This should be called with a locked bucket; however, the function * will drop the lock. */ #define LOCAL_SAVE 10 static void tcp_log_dumpbucketlogs(struct tcp_log_id_bucket *tlb, char *reason) { struct tcp_log_id_node local_entries[LOCAL_SAVE]; struct inpcb *inp; struct tcpcb *tp; struct tcp_log_id_node *cur_tln, *prev_tln, *tmp_tln; int i, num_local_entries, tree_locked; bool expireq_locked; TCPID_BUCKET_LOCK_ASSERT(tlb); /* * Take a reference on the bucket to keep it from disappearing until * we are done. */ TCPID_BUCKET_REF(tlb); /* * We'll try to create these without dropping locks.
However, we * might very well need to drop locks to get memory. If that's the * case, we'll save up to 10 on the stack, and sacrifice the rest. * (Otherwise, we need to worry about finding our place again in a * potentially changed list. It just doesn't seem worth the trouble * to do that.) */ expireq_locked = false; num_local_entries = 0; prev_tln = NULL; tree_locked = TREE_UNLOCKED; SLIST_FOREACH_SAFE(cur_tln, &tlb->tlb_head, tln_list, tmp_tln) { /* * If this isn't associated with a TCPCB, we can pull it off * the list now. We need to be careful that the expire timer * hasn't already taken ownership (tln_expiretime == SBT_MAX). * If so, we let the expire timer code free the data. */ if (cur_tln->tln_closed) { no_inp: /* * Get the expireq lock so we can get a consistent * read of tln_expiretime and so we can remove this * from the expireq. */ if (!expireq_locked) { TCPLOG_EXPIREQ_LOCK(); expireq_locked = true; } /* * We ignore entries with tln_expiretime == SBT_MAX. * The expire timer code already owns those. */ KASSERT(cur_tln->tln_expiretime > (sbintime_t) 0, ("%s:%d: node on the expire queue without positive " "expire time", __func__, __LINE__)); if (cur_tln->tln_expiretime == SBT_MAX) { prev_tln = cur_tln; continue; } /* Remove the entry from the expireq. */ STAILQ_REMOVE(&tcp_log_expireq_head, cur_tln, tcp_log_id_node, tln_expireq); /* Remove the entry from the bucket. */ if (prev_tln != NULL) SLIST_REMOVE_AFTER(prev_tln, tln_list); else SLIST_REMOVE_HEAD(&tlb->tlb_head, tln_list); /* * Drop the INP and bucket reference counts. Due to * lock-ordering rules, we need to drop the expire * queue lock. */ TCPLOG_EXPIREQ_UNLOCK(); expireq_locked = false; /* Drop the INP reference. */ INP_WLOCK(cur_tln->tln_inp); if (!in_pcbrele_wlocked(cur_tln->tln_inp)) INP_WUNLOCK(cur_tln->tln_inp); if (tcp_log_unref_bucket(tlb, &tree_locked, NULL)) { #ifdef INVARIANTS panic("%s: Bucket refcount unexpectedly 0.", __func__); #endif /* * Recover as best we can: free the entry we * own. */ tcp_log_free_entries(&cur_tln->tln_entries, &cur_tln->tln_count); uma_zfree(tcp_log_node_zone, cur_tln); goto done; } if (tcp_log_dump_node_logbuf(cur_tln, reason, M_NOWAIT)) { /* * If we have space, save the entries locally. * Otherwise, free them. */ if (num_local_entries < LOCAL_SAVE) { local_entries[num_local_entries] = *cur_tln; num_local_entries++; } else { tcp_log_free_entries( &cur_tln->tln_entries, &cur_tln->tln_count); } } /* No matter what, we are done with the node now. */ uma_zfree(tcp_log_node_zone, cur_tln); /* * Because we removed this entry from the list, prev_tln * (which tracks the previous entry still on the tlb * list) remains unchanged. */ continue; } /* * If we get to this point, the session data is still held in * the TCPCB. So, we need to pull the data out of that. * * We will need to drop the expireq lock so we can lock the INP. * We can then try to extract the data the "easy" way. If that * fails, we'll save the log entries for later. */ if (expireq_locked) { TCPLOG_EXPIREQ_UNLOCK(); expireq_locked = false; } /* Lock the INP and then re-check the state. */ inp = cur_tln->tln_inp; INP_WLOCK(inp); /* * If we caught this while it was transitioning, the data * might have moved from the TCPCB to the tln (signified by * setting tln_closed to true). If so, treat this like an * inactive connection. */ if (cur_tln->tln_closed) { /* * It looks like we may have caught this connection * while it was transitioning from active to inactive. * Treat this like an inactive connection.
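 *
 * (The tln_closed test at the top of the loop ran without the
 * INP lock held, so the connection may have finished moving its
 * data from the TCPCB to the tln in the meantime; hence this
 * locked re-check.)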
*/ INP_WUNLOCK(inp); goto no_inp; } /* * Try to dump the data from the tp without dropping the lock. * If this fails, try to save off the data locally. */ tp = cur_tln->tln_tp; if (tcp_log_dump_tp_logbuf(tp, reason, M_NOWAIT, true) && num_local_entries < LOCAL_SAVE) { tcp_log_move_tp_to_node(tp, &local_entries[num_local_entries]); local_entries[num_local_entries].tln_closed = 1; KASSERT(local_entries[num_local_entries].tln_bucket == tlb, ("%s: %d: bucket mismatch for node %p", __func__, __LINE__, cur_tln)); num_local_entries++; } INP_WUNLOCK(inp); /* * We are going to leave the current tln on the list. It will * become the previous tln. */ prev_tln = cur_tln; } /* Drop our locks, if any. */ KASSERT(tree_locked == TREE_UNLOCKED, ("%s: %d: tree unexpectedly locked", __func__, __LINE__)); switch (tree_locked) { case TREE_WLOCKED: TCPID_TREE_WUNLOCK(); tree_locked = TREE_UNLOCKED; break; case TREE_RLOCKED: TCPID_TREE_RUNLOCK(); tree_locked = TREE_UNLOCKED; break; } if (expireq_locked) { TCPLOG_EXPIREQ_UNLOCK(); expireq_locked = false; } /* * Try again for any saved entries. tcp_log_dump_node_logbuf() is * guaranteed to free the log entries within the node. And, since * the node itself is on our stack, we don't need to free it. */ for (i = 0; i < num_local_entries; i++) tcp_log_dump_node_logbuf(&local_entries[i], reason, M_WAITOK); /* Drop our reference. */ if (!tcp_log_unref_bucket(tlb, &tree_locked, NULL)) TCPID_BUCKET_UNLOCK(tlb); done: /* Drop our locks, if any. */ switch (tree_locked) { case TREE_WLOCKED: TCPID_TREE_WUNLOCK(); break; case TREE_RLOCKED: TCPID_TREE_RUNLOCK(); break; } if (expireq_locked) TCPLOG_EXPIREQ_UNLOCK(); } #undef LOCAL_SAVE /* * Queue the log buffers for all sessions in a bucket for transmission via * the log buffer facility. * * NOTE: This should be called with a locked INP; however, the function * will drop the lock. */ void tcp_log_dump_tp_bucket_logbufs(struct tcpcb *tp, char *reason) { struct tcp_log_id_bucket *tlb; int tree_locked; /* Figure out our bucket and lock it. */ INP_WLOCK_ASSERT(tp->t_inpcb); tlb = tp->t_lib; if (tlb == NULL) { /* * No bucket; treat this like a request to dump a single * session's traces. */ (void)tcp_log_dump_tp_logbuf(tp, reason, M_WAITOK, true); INP_WUNLOCK(tp->t_inpcb); return; } TCPID_BUCKET_REF(tlb); INP_WUNLOCK(tp->t_inpcb); TCPID_BUCKET_LOCK(tlb); /* If we are the last reference, we have nothing more to do here. */ tree_locked = TREE_UNLOCKED; if (tcp_log_unref_bucket(tlb, &tree_locked, NULL)) { switch (tree_locked) { case TREE_WLOCKED: TCPID_TREE_WUNLOCK(); break; case TREE_RLOCKED: TCPID_TREE_RUNLOCK(); break; } return; } /* Turn this over to tcp_log_dumpbucketlogs() to finish the work. */ tcp_log_dumpbucketlogs(tlb, reason); } /* * Mark the end of a flow with the current stack. A stack can add * stack-specific info to this trace event by overriding this * function (see bbr_log_flowend() for example).
*/ void tcp_log_flowend(struct tcpcb *tp) { if (tp->t_logstate != TCP_LOG_STATE_OFF) { struct socket *so = tp->t_inpcb->inp_socket; TCP_LOG_EVENT(tp, NULL, &so->so_rcv, &so->so_snd, TCP_LOG_FLOWEND, 0, 0, NULL, false); } } Index: user/markj/netdump/sys/netinet/tcp_output.c =================================================================== --- user/markj/netdump/sys/netinet/tcp_output.c (revision 332407) +++ user/markj/netdump/sys/netinet/tcp_output.c (revision 332408) @@ -1,1910 +1,1910 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1982, 1986, 1988, 1990, 1993, 1995 * The Regents of the University of California. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * @(#)tcp_output.c 8.4 (Berkeley) 5/24/95 */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_ipsec.h" #include "opt_tcpdebug.h" #include #include #include #ifdef TCP_HHOOK #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #include #include #endif #include #define TCPOUTFLAGS #include #include #include #include #include #include #include #include #ifdef TCPPCAP #include #endif #ifdef TCPDEBUG #include #endif #ifdef TCP_OFFLOAD #include #endif #include #include #include VNET_DEFINE(int, path_mtu_discovery) = 1; SYSCTL_INT(_net_inet_tcp, OID_AUTO, path_mtu_discovery, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(path_mtu_discovery), 1, "Enable Path MTU Discovery"); VNET_DEFINE(int, tcp_do_tso) = 1; SYSCTL_INT(_net_inet_tcp, OID_AUTO, tso, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_do_tso), 0, "Enable TCP Segmentation Offload"); VNET_DEFINE(int, tcp_sendspace) = 1024*32; #define V_tcp_sendspace VNET(tcp_sendspace) SYSCTL_INT(_net_inet_tcp, TCPCTL_SENDSPACE, sendspace, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_sendspace), 0, "Initial send socket buffer size"); VNET_DEFINE(int, tcp_do_autosndbuf) = 1; SYSCTL_INT(_net_inet_tcp, OID_AUTO, sendbuf_auto, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_do_autosndbuf), 0, "Enable automatic send buffer sizing"); VNET_DEFINE(int, tcp_autosndbuf_inc) = 8*1024; SYSCTL_INT(_net_inet_tcp, OID_AUTO, sendbuf_inc, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_autosndbuf_inc), 0, "Incrementor step size of automatic send buffer"); VNET_DEFINE(int, tcp_autosndbuf_max) = 2*1024*1024; SYSCTL_INT(_net_inet_tcp, OID_AUTO, sendbuf_max, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_autosndbuf_max), 0, "Max size of automatic send buffer"); VNET_DEFINE(int, tcp_sendbuf_auto_lowat) = 0; #define V_tcp_sendbuf_auto_lowat VNET(tcp_sendbuf_auto_lowat) SYSCTL_INT(_net_inet_tcp, OID_AUTO, sendbuf_auto_lowat, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_sendbuf_auto_lowat), 0, "Modify threshold for auto send buffer growth to account for SO_SNDLOWAT"); /* * Make sure that either retransmit or persist timer is set for SYN, FIN and * non-ACK. */ #define TCP_XMIT_TIMER_ASSERT(tp, len, th_flags) \ KASSERT(((len) == 0 && ((th_flags) & (TH_SYN | TH_FIN)) == 0) ||\ tcp_timer_active((tp), TT_REXMT) || \ tcp_timer_active((tp), TT_PERSIST), \ ("neither rexmt nor persist timer is set")) #ifdef TCP_HHOOK static void inline hhook_run_tcp_est_out(struct tcpcb *tp, struct tcphdr *th, struct tcpopt *to, uint32_t len, int tso); #endif static void inline cc_after_idle(struct tcpcb *tp); #ifdef TCP_HHOOK /* * Wrapper for the TCP established output helper hook. */ static void inline hhook_run_tcp_est_out(struct tcpcb *tp, struct tcphdr *th, struct tcpopt *to, uint32_t len, int tso) { struct tcp_hhook_data hhook_data; if (V_tcp_hhh[HHOOK_TCP_EST_OUT]->hhh_nhooks > 0) { hhook_data.tp = tp; hhook_data.th = th; hhook_data.to = to; hhook_data.len = len; hhook_data.tso = tso; hhook_run_hooks(V_tcp_hhh[HHOOK_TCP_EST_OUT], &hhook_data, tp->osd); } } #endif /* * CC wrapper hook functions */ static void inline cc_after_idle(struct tcpcb *tp) { INP_WLOCK_ASSERT(tp->t_inpcb); if (CC_ALGO(tp)->after_idle != NULL) CC_ALGO(tp)->after_idle(tp->ccv); } /* * Tcp output routine: figure out what should be sent and send it. 
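 *
 * In rough outline (a sketch of the structure below, not new logic):
 *
 *	again:
 *		compute len, off and flags (SACK retransmits,
 *		    persist probes, TFO restrictions);
 *		decide whether anything must be sent now (full
 *		    segment, forced data, window update, owed
 *		    ACK/SYN/FIN/RST); if not, arm timers as needed
 *		    and just_return;
 *	send:
 *		build the TCP options, allocate the mbuf chain,
 *		    fill in the TCP/IP header and checksum, and
 *		    pass the result to ip_output()/ip6_output();
 *	out:
 *		advance snd_nxt/snd_max, arm the retransmit or
 *		    persist timer, handle any error;
 *		if (sendalot) goto again;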
*/ int tcp_output(struct tcpcb *tp) { struct socket *so = tp->t_inpcb->inp_socket; int32_t len; uint32_t recwin, sendwin; int off, flags, error = 0; /* Keep compiler happy */ struct mbuf *m; struct ip *ip = NULL; #ifdef TCPDEBUG struct ipovly *ipov = NULL; #endif struct tcphdr *th; u_char opt[TCP_MAXOLEN]; unsigned ipoptlen, optlen, hdrlen; #if defined(IPSEC) || defined(IPSEC_SUPPORT) unsigned ipsec_optlen = 0; #endif int idle, sendalot; int sack_rxmit, sack_bytes_rxmt; struct sackhole *p; int tso, mtu; struct tcpopt to; unsigned int wanted_cookie = 0; unsigned int dont_sendalot = 0; #if 0 int maxburst = TCP_MAXBURST; #endif #ifdef INET6 struct ip6_hdr *ip6 = NULL; int isipv6; isipv6 = (tp->t_inpcb->inp_vflag & INP_IPV6) != 0; #endif INP_WLOCK_ASSERT(tp->t_inpcb); #ifdef TCP_OFFLOAD if (tp->t_flags & TF_TOE) return (tcp_offload_output(tp)); #endif /* * For TFO connections in SYN_RECEIVED, only allow the initial * SYN|ACK and those sent by the retransmit timer. */ if (IS_FASTOPEN(tp->t_flags) && (tp->t_state == TCPS_SYN_RECEIVED) && SEQ_GT(tp->snd_max, tp->snd_una) && /* initial SYN|ACK sent */ (tp->snd_nxt != tp->snd_una)) /* not a retransmit */ return (0); /* * Determine length of data that should be transmitted, * and flags that will be used. * If there is some data or critical controls (SYN, RST) * to send, then transmit; otherwise, investigate further. */ idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una); if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur) cc_after_idle(tp); tp->t_flags &= ~TF_LASTIDLE; if (idle) { if (tp->t_flags & TF_MORETOCOME) { tp->t_flags |= TF_LASTIDLE; idle = 0; } } again: /* * If we've recently taken a timeout, snd_max will be greater than * snd_nxt. There may be SACK information that allows us to avoid * resending already delivered data. Adjust snd_nxt accordingly. */ if ((tp->t_flags & TF_SACK_PERMIT) && SEQ_LT(tp->snd_nxt, tp->snd_max)) tcp_sack_adjust(tp); sendalot = 0; tso = 0; mtu = 0; off = tp->snd_nxt - tp->snd_una; sendwin = min(tp->snd_wnd, tp->snd_cwnd); flags = tcp_outflags[tp->t_state]; /* * Send any SACK-generated retransmissions. If we're explicitly trying * to send out new data (when sendalot is 1), bypass this function. * If we retransmit in fast recovery mode, decrement snd_cwnd, since * we're replacing a (future) new transmission with a retransmission * now, and we previously incremented snd_cwnd in tcp_input(). */ /* * Still in sack recovery , reset rxmit flag to zero. */ sack_rxmit = 0; sack_bytes_rxmt = 0; len = 0; p = NULL; if ((tp->t_flags & TF_SACK_PERMIT) && IN_FASTRECOVERY(tp->t_flags) && (p = tcp_sack_output(tp, &sack_bytes_rxmt))) { uint32_t cwin; cwin = imax(min(tp->snd_wnd, tp->snd_cwnd) - sack_bytes_rxmt, 0); /* Do not retransmit SACK segments beyond snd_recover */ if (SEQ_GT(p->end, tp->snd_recover)) { /* * (At least) part of sack hole extends beyond * snd_recover. Check to see if we can rexmit data * for this hole. */ if (SEQ_GEQ(p->rxmit, tp->snd_recover)) { /* * Can't rexmit any more data for this hole. * That data will be rexmitted in the next * sack recovery episode, when snd_recover * moves past p->rxmit. 
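 *
 * For example (sequence numbers purely illustrative): with
 * snd_recover = 3000 and a hole whose rxmit = 2000 and end = 4000,
 * at most bytes 2000-2999 may be retransmitted now; bytes
 * 3000-3999 must wait for a later recovery episode in which
 * snd_recover has moved past them.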
*/ p = NULL; goto after_sack_rexmit; } else /* Can rexmit part of the current hole */ len = ((int32_t)ulmin(cwin, tp->snd_recover - p->rxmit)); } else len = ((int32_t)ulmin(cwin, p->end - p->rxmit)); off = p->rxmit - tp->snd_una; KASSERT(off >= 0,("%s: sack block to the left of una : %d", __func__, off)); if (len > 0) { sack_rxmit = 1; sendalot = 1; TCPSTAT_INC(tcps_sack_rexmits); TCPSTAT_ADD(tcps_sack_rexmit_bytes, min(len, tp->t_maxseg)); } } after_sack_rexmit: /* * Get standard flags, and add SYN or FIN if requested by 'hidden' * state flags. */ if (tp->t_flags & TF_NEEDFIN) flags |= TH_FIN; if (tp->t_flags & TF_NEEDSYN) flags |= TH_SYN; SOCKBUF_LOCK(&so->so_snd); /* * If in persist timeout with window of 0, send 1 byte. * Otherwise, if window is small but nonzero * and timer expired, we will send what we can * and go to transmit state. */ if (tp->t_flags & TF_FORCEDATA) { if (sendwin == 0) { /* * If we still have some data to send, then * clear the FIN bit. Usually this would * happen below when it realizes that we * aren't sending all the data. However, * if we have exactly 1 byte of unsent data, * then it won't clear the FIN bit below, * and if we are in persist state, we wind * up sending the packet without recording * that we sent the FIN bit. * * We can't just blindly clear the FIN bit, * because if we don't have any more data * to send then the probe will be the FIN * itself. */ if (off < sbused(&so->so_snd)) flags &= ~TH_FIN; sendwin = 1; } else { tcp_timer_activate(tp, TT_PERSIST, 0); tp->t_rxtshift = 0; } } /* * If snd_nxt == snd_max and we have transmitted a FIN, the * offset will be > 0 even if so_snd.sb_cc is 0, resulting in * a negative length. This can also occur when TCP opens up * its congestion window while receiving additional duplicate * acks after fast-retransmit because TCP will reset snd_nxt * to snd_max after the fast-retransmit. * * In the normal retransmit-FIN-only case, however, snd_nxt will * be set to snd_una, the offset will be 0, and the length may * wind up 0. * * If sack_rxmit is true we are retransmitting from the scoreboard * in which case len is already set. */ if (sack_rxmit == 0) { if (sack_bytes_rxmt == 0) len = ((int32_t)min(sbavail(&so->so_snd), sendwin) - off); else { int32_t cwin; /* * We are inside of a SACK recovery episode and are * sending new data, having retransmitted all the * data possible in the scoreboard. */ len = ((int32_t)min(sbavail(&so->so_snd), tp->snd_wnd) - off); /* * Don't remove this (len > 0) check ! * We explicitly check for len > 0 here (although it * isn't really necessary), to work around a gcc * optimization issue - to force gcc to compute * len above. Without this check, the computation * of len is bungled by the optimizer. */ if (len > 0) { cwin = tp->snd_cwnd - (tp->snd_nxt - tp->sack_newdata) - sack_bytes_rxmt; if (cwin < 0) cwin = 0; len = imin(len, cwin); } } } /* * Lop off SYN bit if it has already been sent. However, if this * is SYN-SENT state and if segment contains data and if we don't * know that foreign host supports TAO, suppress sending segment. */ if ((flags & TH_SYN) && SEQ_GT(tp->snd_nxt, tp->snd_una)) { if (tp->t_state != TCPS_SYN_RECEIVED) flags &= ~TH_SYN; /* * When sending additional segments following a TFO SYN|ACK, * do not include the SYN bit. */ if (IS_FASTOPEN(tp->t_flags) && (tp->t_state == TCPS_SYN_RECEIVED)) flags &= ~TH_SYN; off--, len++; } /* * Be careful not to send data and/or FIN on SYN segments. 
* This measure is needed to prevent interoperability problems * with not fully conformant TCP implementations. */ if ((flags & TH_SYN) && (tp->t_flags & TF_NOOPT)) { len = 0; flags &= ~TH_FIN; } /* * On TFO sockets, ensure no data is sent in the following cases: * * - When retransmitting SYN|ACK on a passively-created socket * * - When retransmitting SYN on an actively created socket * * - When sending a zero-length cookie (cookie request) on an * actively created socket * * - When the socket is in the CLOSED state (RST is being sent) */ if (IS_FASTOPEN(tp->t_flags) && (((flags & TH_SYN) && (tp->t_rxtshift > 0)) || ((tp->t_state == TCPS_SYN_SENT) && (tp->t_tfo_client_cookie_len == 0)) || (flags & TH_RST))) len = 0; if (len <= 0) { /* * If FIN has been sent but not acked, * but we haven't been called to retransmit, * len will be < 0. Otherwise, window shrank * after we sent into it. If window shrank to 0, * cancel pending retransmit, pull snd_nxt back * to (closed) window, and set the persist timer * if it isn't already going. If the window didn't * close completely, just wait for an ACK. * * We also do a general check here to ensure that * we will set the persist timer when we have data * to send, but a 0-byte window. This makes sure * the persist timer is set even if the packet * hits one of the "goto send" lines below. */ len = 0; if ((sendwin == 0) && (TCPS_HAVEESTABLISHED(tp->t_state)) && (off < (int) sbavail(&so->so_snd))) { tcp_timer_activate(tp, TT_REXMT, 0); tp->t_rxtshift = 0; tp->snd_nxt = tp->snd_una; if (!tcp_timer_active(tp, TT_PERSIST)) tcp_setpersist(tp); } } /* len will be >= 0 after this point. */ KASSERT(len >= 0, ("[%s:%d]: len < 0", __func__, __LINE__)); tcp_sndbuf_autoscale(tp, so, sendwin); /* * Decide if we can use TCP Segmentation Offloading (if supported by * hardware). * * TSO may only be used if we are in a pure bulk sending state. The * presence of TCP-MD5, SACK retransmits, SACK advertizements and * IP options prevent using TSO. With TSO the TCP header is the same * (except for the sequence number) for all generated packets. This * makes it impossible to transmit any options which vary per generated * segment or packet. * * IPv4 handling has a clear separation of ip options and ip header * flags while IPv6 combines both in in6p_outputopts. ip6_optlen() does * the right thing below to provide length of just ip options and thus * checking for ipoptlen is enough to decide if ip options are present. */ #if defined(IPSEC) || defined(IPSEC_SUPPORT) /* * Pre-calculate here as we save another lookup into the darknesses * of IPsec that way and can actually decide if TSO is ok. 
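 *
 * (ipsec_optlen is folded into ipoptlen below, and the TSO check
 * requires ipoptlen == 0, so expected IPsec overhead on the
 * connection also rules out TSO.)
 */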
#ifdef INET6 if (isipv6 && IPSEC_ENABLED(ipv6)) ipsec_optlen = IPSEC_HDRSIZE(ipv6, tp->t_inpcb); #ifdef INET else #endif #endif /* INET6 */ #ifdef INET if (IPSEC_ENABLED(ipv4)) ipsec_optlen = IPSEC_HDRSIZE(ipv4, tp->t_inpcb); #endif /* INET */ #endif /* IPSEC */ #ifdef INET6 if (isipv6) ipoptlen = ip6_optlen(tp->t_inpcb); else #endif if (tp->t_inpcb->inp_options) ipoptlen = tp->t_inpcb->inp_options->m_len - offsetof(struct ipoption, ipopt_list); else ipoptlen = 0; #if defined(IPSEC) || defined(IPSEC_SUPPORT) ipoptlen += ipsec_optlen; #endif if ((tp->t_flags & TF_TSO) && V_tcp_do_tso && len > tp->t_maxseg && ((tp->t_flags & TF_SIGNATURE) == 0) && tp->rcv_numsacks == 0 && sack_rxmit == 0 && ipoptlen == 0 && !(flags & TH_SYN)) tso = 1; if (sack_rxmit) { if (SEQ_LT(p->rxmit + len, tp->snd_una + sbused(&so->so_snd))) flags &= ~TH_FIN; } else { if (SEQ_LT(tp->snd_nxt + len, tp->snd_una + sbused(&so->so_snd))) flags &= ~TH_FIN; } recwin = lmin(lmax(sbspace(&so->so_rcv), 0), (long)TCP_MAXWIN << tp->rcv_scale); /* * Sender silly window avoidance. We transmit under the following * conditions when len is non-zero: * * - We have a full segment (or more with TSO) * - This is the last buffer in a write()/send() and we are * either idle or running NODELAY * - we've timed out (e.g. persist timer) * - we have more than 1/2 the maximum send window's worth of * data (the receiver may be limited by the window size) * - we need to retransmit */ if (len) { if (len >= tp->t_maxseg) goto send; /* * NOTE! on localhost connections an 'ack' from the remote * end may occur synchronously with the output and cause * us to flush a buffer queued with moretocome. XXX * * note: the len + off check is almost certainly unnecessary. */ if (!(tp->t_flags & TF_MORETOCOME) && /* normal case */ (idle || (tp->t_flags & TF_NODELAY)) && (uint32_t)len + (uint32_t)off >= sbavail(&so->so_snd) && (tp->t_flags & TF_NOPUSH) == 0) { goto send; } if (tp->t_flags & TF_FORCEDATA) /* typ. timeout case */ goto send; if (len >= tp->max_sndwnd / 2 && tp->max_sndwnd > 0) goto send; if (SEQ_LT(tp->snd_nxt, tp->snd_max)) /* retransmit case */ goto send; if (sack_rxmit) goto send; } /* * Sending of standalone window updates. * * Window updates are important when we close our window due to a * full socket buffer and are opening it again after the application * reads data from it. Once the window has opened again and the * remote end starts to send again the ACK clock takes over and * provides the most current window information. * * We must avoid the silly window syndrome whereby every read * from the receive buffer, no matter how small, causes a window * update to be sent. We also should avoid sending a flurry of * window updates when the socket buffer had queued a lot of data * and the application is doing small reads. * * Prevent a flurry of pointless window updates by only sending * an update when we can increase the advertised window by more * than 1/4th of the socket buffer capacity. When the buffer is * getting full or is very small be more aggressive and send an * update whenever we can increase by two mss sized segments. * In all other situations the ACK's to new incoming data will * carry further window increases. * * Don't send an independent window update if a delayed * ACK is pending (it will get piggy-backed on it) or the * remote side already has done a half-close and won't send * more data. Skip this if the connection is in T/TCP * half-open state.
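 *
 * As a worked example (numbers purely illustrative): with a 64 kB
 * receive buffer, an update goes out once the advertised window
 * could grow by 16 kB (a quarter of the buffer), or by two maximum
 * sized segments while the window currently advertised is 8 kB
 * (an eighth of the buffer) or less.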
*/ if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) && !(tp->t_flags & TF_DELACK) && !TCPS_HAVERCVDFIN(tp->t_state)) { /* * "adv" is the amount we could increase the window, * taking into account that we are limited by * TCP_MAXWIN << tp->rcv_scale. */ int32_t adv; int oldwin; adv = recwin; if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt)) { oldwin = (tp->rcv_adv - tp->rcv_nxt); adv -= oldwin; } else oldwin = 0; /* * If the new window size ends up being the same as or less * than the old size when it is scaled, then don't force * a window update. */ if (oldwin >> tp->rcv_scale >= (adv + oldwin) >> tp->rcv_scale) goto dontupdate; if (adv >= (int32_t)(2 * tp->t_maxseg) && (adv >= (int32_t)(so->so_rcv.sb_hiwat / 4) || recwin <= (so->so_rcv.sb_hiwat / 8) || so->so_rcv.sb_hiwat <= 8 * tp->t_maxseg)) goto send; if (2 * adv >= (int32_t)so->so_rcv.sb_hiwat) goto send; } dontupdate: /* * Send if we owe the peer an ACK, RST, SYN, or urgent data. ACKNOW * is also a catch-all for the retransmit timer timeout case. */ if (tp->t_flags & TF_ACKNOW) goto send; if ((flags & TH_RST) || ((flags & TH_SYN) && (tp->t_flags & TF_NEEDSYN) == 0)) goto send; if (SEQ_GT(tp->snd_up, tp->snd_una)) goto send; /* * If our state indicates that FIN should be sent * and we have not yet done so, then we need to send. */ if (flags & TH_FIN && ((tp->t_flags & TF_SENTFIN) == 0 || tp->snd_nxt == tp->snd_una)) goto send; /* * In SACK, it is possible for tcp_output to fail to send a segment * after the retransmission timer has been turned off. Make sure * that the retransmission timer is set. */ if ((tp->t_flags & TF_SACK_PERMIT) && SEQ_GT(tp->snd_max, tp->snd_una) && !tcp_timer_active(tp, TT_REXMT) && !tcp_timer_active(tp, TT_PERSIST)) { tcp_timer_activate(tp, TT_REXMT, tp->t_rxtcur); goto just_return; } /* * TCP window updates are not reliable, rather a polling protocol * using ``persist'' packets is used to insure receipt of window * updates. The three ``states'' for the output side are: * idle not doing retransmits or persists * persisting to move a small or zero window * (re)transmitting and thereby not persisting * * tcp_timer_active(tp, TT_PERSIST) * is true when we are in persist state. * (tp->t_flags & TF_FORCEDATA) * is set when we are called to send a persist packet. * tcp_timer_active(tp, TT_REXMT) * is set when we are retransmitting * The output side is idle when both timers are zero. * * If send window is too small, there is data to transmit, and no * retransmit or persist is pending, then go to persist state. * If nothing happens soon, send when timer expires: * if window is nonzero, transmit what we can, * otherwise force out a byte. */ if (sbavail(&so->so_snd) && !tcp_timer_active(tp, TT_REXMT) && !tcp_timer_active(tp, TT_PERSIST)) { tp->t_rxtshift = 0; tcp_setpersist(tp); } /* * No reason to send a segment, just return. */ just_return: SOCKBUF_UNLOCK(&so->so_snd); return (0); send: SOCKBUF_LOCK_ASSERT(&so->so_snd); if (len > 0) { if (len >= tp->t_maxseg) tp->t_flags2 |= TF2_PLPMTU_MAXSEGSNT; else tp->t_flags2 &= ~TF2_PLPMTU_MAXSEGSNT; } /* * Before ESTABLISHED, force sending of initial options * unless TCP set not to do any options. * NOTE: we assume that the IP/TCP header plus TCP options * always fit in a single mbuf, leaving room for a maximum * link header, i.e. * max_linkhdr + sizeof (struct tcpiphdr) + optlen <= MCLBYTES */ optlen = 0; #ifdef INET6 if (isipv6) hdrlen = sizeof (struct ip6_hdr) + sizeof (struct tcphdr); else #endif hdrlen = sizeof (struct tcpiphdr); /* * Compute options for segment. 
* We only have to care about SYN and established connection * segments. Options for SYN-ACK segments are handled in TCP * syncache. */ to.to_flags = 0; if ((tp->t_flags & TF_NOOPT) == 0) { /* Maximum segment size. */ if (flags & TH_SYN) { tp->snd_nxt = tp->iss; to.to_mss = tcp_mssopt(&tp->t_inpcb->inp_inc); to.to_flags |= TOF_MSS; /* * On SYN or SYN|ACK transmits on TFO connections, * only include the TFO option if it is not a * retransmit, as the presence of the TFO option may * have caused the original SYN or SYN|ACK to have * been dropped by a middlebox. */ if (IS_FASTOPEN(tp->t_flags) && (tp->t_rxtshift == 0)) { if (tp->t_state == TCPS_SYN_RECEIVED) { to.to_tfo_len = TCP_FASTOPEN_COOKIE_LEN; to.to_tfo_cookie = (u_int8_t *)&tp->t_tfo_cookie.server; to.to_flags |= TOF_FASTOPEN; wanted_cookie = 1; } else if (tp->t_state == TCPS_SYN_SENT) { to.to_tfo_len = tp->t_tfo_client_cookie_len; to.to_tfo_cookie = tp->t_tfo_cookie.client; to.to_flags |= TOF_FASTOPEN; wanted_cookie = 1; /* * If we wind up having more data to * send with the SYN than can fit in * one segment, don't send any more * until the SYN|ACK comes back from * the other end. */ dont_sendalot = 1; } } } /* Window scaling. */ if ((flags & TH_SYN) && (tp->t_flags & TF_REQ_SCALE)) { to.to_wscale = tp->request_r_scale; to.to_flags |= TOF_SCALE; } /* Timestamps. */ if ((tp->t_flags & TF_RCVD_TSTMP) || ((flags & TH_SYN) && (tp->t_flags & TF_REQ_TSTMP))) { to.to_tsval = tcp_ts_getticks() + tp->ts_offset; to.to_tsecr = tp->ts_recent; to.to_flags |= TOF_TS; } /* Set receive buffer autosizing timestamp. */ if (tp->rfbuf_ts == 0 && (so->so_rcv.sb_flags & SB_AUTOSIZE)) tp->rfbuf_ts = tcp_ts_getticks(); /* Selective ACK's. */ if (tp->t_flags & TF_SACK_PERMIT) { if (flags & TH_SYN) to.to_flags |= TOF_SACKPERM; else if (TCPS_HAVEESTABLISHED(tp->t_state) && (tp->t_flags & TF_SACK_PERMIT) && tp->rcv_numsacks > 0) { to.to_flags |= TOF_SACK; to.to_nsacks = tp->rcv_numsacks; to.to_sacks = (u_char *)tp->sackblks; } } #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) /* TCP-MD5 (RFC2385). */ /* * Check that TCP_MD5SIG is enabled in tcpcb to * account the size needed to set this TCP option. */ if (tp->t_flags & TF_SIGNATURE) to.to_flags |= TOF_SIGNATURE; #endif /* TCP_SIGNATURE */ /* Processing the options. */ hdrlen += optlen = tcp_addoptions(&to, opt); /* * If we wanted a TFO option to be added, but it was unable * to fit, ensure no data is sent. */ if (IS_FASTOPEN(tp->t_flags) && wanted_cookie && !(to.to_flags & TOF_FASTOPEN)) len = 0; } /* * Adjust data length if insertion of options will * bump the packet length beyond the t_maxseg length. * Clear the FIN bit because we cut off the tail of * the segment. 
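 *
 * For example (illustrative values): with t_maxseg = 1460 and a
 * 12 byte timestamp option present, a non-TSO segment carries at
 * most 1448 payload bytes here, and sendalot forces another pass
 * around the loop for the remainder.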
*/ if (len + optlen + ipoptlen > tp->t_maxseg) { flags &= ~TH_FIN; if (tso) { u_int if_hw_tsomax; u_int if_hw_tsomaxsegcount; u_int if_hw_tsomaxsegsize; struct mbuf *mb; u_int moff; int max_len; /* extract TSO information */ if_hw_tsomax = tp->t_tsomax; if_hw_tsomaxsegcount = tp->t_tsomaxsegcount; if_hw_tsomaxsegsize = tp->t_tsomaxsegsize; /* * Limit a TSO burst to prevent it from * overflowing or exceeding the maximum length * allowed by the network interface: */ KASSERT(ipoptlen == 0, ("%s: TSO can't do IP options", __func__)); /* * Check if we should limit by maximum payload * length: */ if (if_hw_tsomax != 0) { /* compute maximum TSO length */ max_len = (if_hw_tsomax - hdrlen - max_linkhdr); if (max_len <= 0) { len = 0; } else if (len > max_len) { sendalot = 1; len = max_len; } } /* * Check if we should limit by maximum segment * size and count: */ if (if_hw_tsomaxsegcount != 0 && if_hw_tsomaxsegsize != 0) { /* * Subtract one segment for the LINK * and TCP/IP headers mbuf that will * be prepended to this mbuf chain * after the code in this section * limits the number of mbufs in the * chain to if_hw_tsomaxsegcount. */ if_hw_tsomaxsegcount -= 1; max_len = 0; mb = sbsndmbuf(&so->so_snd, off, &moff); while (mb != NULL && max_len < len) { u_int mlen; u_int frags; /* * Get length of mbuf fragment * and how many hardware frags, * rounded up, it would use: */ mlen = (mb->m_len - moff); frags = howmany(mlen, if_hw_tsomaxsegsize); /* Handle special case: Zero Length Mbuf */ if (frags == 0) frags = 1; /* * Check if the fragment limit * will be reached or exceeded: */ if (frags >= if_hw_tsomaxsegcount) { max_len += min(mlen, if_hw_tsomaxsegcount * if_hw_tsomaxsegsize); break; } max_len += mlen; if_hw_tsomaxsegcount -= frags; moff = 0; mb = mb->m_next; } if (max_len <= 0) { len = 0; } else if (len > max_len) { sendalot = 1; len = max_len; } } /* * Prevent the last segment from being * fractional unless the send sockbuf can be * emptied: */ max_len = (tp->t_maxseg - optlen); if (((uint32_t)off + (uint32_t)len) < sbavail(&so->so_snd)) { moff = len % max_len; if (moff != 0) { len -= moff; sendalot = 1; } } /* * In case there are too many small fragments * don't use TSO: */ if (len <= max_len) { len = max_len; sendalot = 1; tso = 0; } /* * Send the FIN in a separate segment * after the bulk sending is done. * We don't trust the TSO implementations * to clear the FIN flag on all but the * last segment. */ if (tp->t_flags & TF_NEEDFIN) sendalot = 1; } else { len = tp->t_maxseg - optlen - ipoptlen; sendalot = 1; if (dont_sendalot) sendalot = 0; } } else tso = 0; KASSERT(len + hdrlen + ipoptlen <= IP_MAXPACKET, ("%s: len > IP_MAXPACKET", __func__)); /*#ifdef DIAGNOSTIC*/ #ifdef INET6 if (max_linkhdr + hdrlen > MCLBYTES) #else if (max_linkhdr + hdrlen > MHLEN) #endif panic("tcphdr too big"); /*#endif*/ /* * This KASSERT is here to catch edge cases at a well defined place. * Before, those had triggered (random) panic conditions further down. */ KASSERT(len >= 0, ("[%s:%d]: len < 0", __func__, __LINE__)); /* * Grab a header mbuf, attaching a copy of data to * be transmitted, and initialize the header from * the template for sends on this connection. 
*/ if (len) { struct mbuf *mb; u_int moff; if ((tp->t_flags & TF_FORCEDATA) && len == 1) TCPSTAT_INC(tcps_sndprobe); else if (SEQ_LT(tp->snd_nxt, tp->snd_max) || sack_rxmit) { tp->t_sndrexmitpack++; TCPSTAT_INC(tcps_sndrexmitpack); TCPSTAT_ADD(tcps_sndrexmitbyte, len); } else { TCPSTAT_INC(tcps_sndpack); TCPSTAT_ADD(tcps_sndbyte, len); } #ifdef INET6 if (MHLEN < hdrlen + max_linkhdr) m = m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR); else #endif m = m_gethdr(M_NOWAIT, MT_DATA); if (m == NULL) { SOCKBUF_UNLOCK(&so->so_snd); error = ENOBUFS; sack_rxmit = 0; goto out; } m->m_data += max_linkhdr; m->m_len = hdrlen; /* * Start the m_copy functions from the closest mbuf * to the offset in the socket buffer chain. */ mb = sbsndptr(&so->so_snd, off, len, &moff); if (len <= MHLEN - hdrlen - max_linkhdr) { m_copydata(mb, moff, len, mtod(m, caddr_t) + hdrlen); m->m_len += len; } else { m->m_next = m_copym(mb, moff, len, M_NOWAIT); if (m->m_next == NULL) { SOCKBUF_UNLOCK(&so->so_snd); (void) m_free(m); error = ENOBUFS; sack_rxmit = 0; goto out; } } /* * If we're sending everything we've got, set PUSH. * (This will keep happy those implementations which only * give data to the user when a buffer fills or * a PUSH comes in.) */ if (((uint32_t)off + (uint32_t)len == sbused(&so->so_snd)) && !(flags & TH_SYN)) flags |= TH_PUSH; SOCKBUF_UNLOCK(&so->so_snd); } else { SOCKBUF_UNLOCK(&so->so_snd); if (tp->t_flags & TF_ACKNOW) TCPSTAT_INC(tcps_sndacks); else if (flags & (TH_SYN|TH_FIN|TH_RST)) TCPSTAT_INC(tcps_sndctrl); else if (SEQ_GT(tp->snd_up, tp->snd_una)) TCPSTAT_INC(tcps_sndurg); else TCPSTAT_INC(tcps_sndwinup); m = m_gethdr(M_NOWAIT, MT_DATA); if (m == NULL) { error = ENOBUFS; sack_rxmit = 0; goto out; } #ifdef INET6 if (isipv6 && (MHLEN < hdrlen + max_linkhdr) && MHLEN >= hdrlen) { M_ALIGN(m, hdrlen); } else #endif m->m_data += max_linkhdr; m->m_len = hdrlen; } SOCKBUF_UNLOCK_ASSERT(&so->so_snd); m->m_pkthdr.rcvif = (struct ifnet *)0; #ifdef MAC mac_inpcb_create_mbuf(tp->t_inpcb, m); #endif #ifdef INET6 if (isipv6) { ip6 = mtod(m, struct ip6_hdr *); th = (struct tcphdr *)(ip6 + 1); tcpip_fillheaders(tp->t_inpcb, ip6, th); } else #endif /* INET6 */ { ip = mtod(m, struct ip *); #ifdef TCPDEBUG ipov = (struct ipovly *)ip; #endif th = (struct tcphdr *)(ip + 1); tcpip_fillheaders(tp->t_inpcb, ip, th); } /* * Fill in fields, remembering maximum advertised * window for use in delaying messages about window sizes. * If resending a FIN, be sure not to use a new sequence number. */ if (flags & TH_FIN && tp->t_flags & TF_SENTFIN && tp->snd_nxt == tp->snd_max) tp->snd_nxt--; /* * If we are starting a connection, send ECN setup * SYN packet. If we are on a retransmit, we may * resend those bits a number of times as per * RFC 3168. */ if (tp->t_state == TCPS_SYN_SENT && V_tcp_do_ecn == 1) { if (tp->t_rxtshift >= 1) { if (tp->t_rxtshift <= V_tcp_ecn_maxretries) flags |= TH_ECE|TH_CWR; } else flags |= TH_ECE|TH_CWR; } if (tp->t_state == TCPS_ESTABLISHED && (tp->t_flags & TF_ECN_PERMIT)) { /* * If the peer has ECN, mark data packets with * ECN capable transmission (ECT). * Ignore pure ack packets, retransmissions and window probes. */ if (len > 0 && SEQ_GEQ(tp->snd_nxt, tp->snd_max) && !((tp->t_flags & TF_FORCEDATA) && len == 1)) { #ifdef INET6 if (isipv6) ip6->ip6_flow |= htonl(IPTOS_ECN_ECT0 << 20); else #endif ip->ip_tos |= IPTOS_ECN_ECT0; TCPSTAT_INC(tcps_ecn_ect0); } /* * Reply with proper ECN notifications. 
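 *
 * That is: TH_CWR is echoed exactly once per reduction
 * (TF_ECN_SND_CWR is cleared after one segment), while TH_ECE is
 * set on every segment for as long as TF_ECN_SND_ECE remains set,
 * i.e., until the peer responds with CWR.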
*/ if (tp->t_flags & TF_ECN_SND_CWR) { flags |= TH_CWR; tp->t_flags &= ~TF_ECN_SND_CWR; } if (tp->t_flags & TF_ECN_SND_ECE) flags |= TH_ECE; } /* * If we are doing retransmissions, then snd_nxt will * not reflect the first unsent octet. For ACK only * packets, we do not want the sequence number of the * retransmitted packet, we want the sequence number * of the next unsent octet. So, if there is no data * (and no SYN or FIN), use snd_max instead of snd_nxt * when filling in ti_seq. But if we are in persist * state, snd_max might reflect one byte beyond the * right edge of the window, so use snd_nxt in that * case, since we know we aren't doing a retransmission. * (retransmit and persist are mutually exclusive...) */ if (sack_rxmit == 0) { if (len || (flags & (TH_SYN|TH_FIN)) || tcp_timer_active(tp, TT_PERSIST)) th->th_seq = htonl(tp->snd_nxt); else th->th_seq = htonl(tp->snd_max); } else { th->th_seq = htonl(p->rxmit); p->rxmit += len; tp->sackhint.sack_bytes_rexmit += len; } th->th_ack = htonl(tp->rcv_nxt); if (optlen) { bcopy(opt, th + 1, optlen); th->th_off = (sizeof (struct tcphdr) + optlen) >> 2; } th->th_flags = flags; /* * Calculate receive window. Don't shrink window, * but avoid silly window syndrome. */ if (recwin < (so->so_rcv.sb_hiwat / 4) && recwin < tp->t_maxseg) recwin = 0; if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) && recwin < (tp->rcv_adv - tp->rcv_nxt)) recwin = (tp->rcv_adv - tp->rcv_nxt); /* * According to RFC1323 the window field in a SYN (i.e., a <SYN> * or <SYN,ACK>) segment itself is never scaled. The <SYN,ACK> * case is handled in syncache. */ if (flags & TH_SYN) th->th_win = htons((u_short) (min(sbspace(&so->so_rcv), TCP_MAXWIN))); else th->th_win = htons((u_short)(recwin >> tp->rcv_scale)); /* * Adjust the RXWIN0SENT flag - indicate that we have advertised * a 0 window. This may cause the remote transmitter to stall. This * flag tells soreceive() to disable delayed acknowledgements when * draining the buffer. This can occur if the receiver is attempting * to read more data than can be buffered prior to transmitting on * the connection. */ if (th->th_win == 0) { tp->t_sndzerowin++; tp->t_flags |= TF_RXWIN0SENT; } else tp->t_flags &= ~TF_RXWIN0SENT; if (SEQ_GT(tp->snd_up, tp->snd_nxt)) { th->th_urp = htons((u_short)(tp->snd_up - tp->snd_nxt)); th->th_flags |= TH_URG; } else /* * If no urgent pointer to send, then we pull * the urgent pointer to the left edge of the send window * so that it doesn't drift into the send window on sequence * number wraparound. */ tp->snd_up = tp->snd_una; /* drag it along */ /* * Put TCP length in extended header, and then * checksum extended header and data. */ m->m_pkthdr.len = hdrlen + len; /* in6_cksum() needs this */ m->m_pkthdr.csum_data = offsetof(struct tcphdr, th_sum); #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) if (to.to_flags & TOF_SIGNATURE) { /* * Calculate MD5 signature and put it into the place * determined before. * NOTE: since TCP options buffer doesn't point into * mbuf's data, calculate offset and use it. */ if (!TCPMD5_ENABLED() || (error = TCPMD5_OUTPUT(m, th, (u_char *)(th + 1) + (to.to_signature - opt))) != 0) { /* * Do not send segment if the calculation of MD5 * digest has failed. */ m_freem(m); goto out; } } #endif #ifdef INET6 if (isipv6) { /* * There is no need to fill in ip6_plen right now. * It will be filled later by ip6_output.
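 *
 * (A note on the checksum code that follows: in6_cksum_pseudo()
 * only seeds th_sum with the IPv6 pseudo-header, i.e. the
 * addresses, the TCP length of header + options + payload, and
 * the protocol number; CSUM_TCP_IPV6 then asks the driver, or the
 * software fallback path, to finish the sum over the segment
 * itself.)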
*/ m->m_pkthdr.csum_flags = CSUM_TCP_IPV6; th->th_sum = in6_cksum_pseudo(ip6, sizeof(struct tcphdr) + optlen + len, IPPROTO_TCP, 0); } #endif #if defined(INET6) && defined(INET) else #endif #ifdef INET { m->m_pkthdr.csum_flags = CSUM_TCP; th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(sizeof(struct tcphdr) + IPPROTO_TCP + len + optlen)); /* IP version must be set here for ipv4/ipv6 checking later */ KASSERT(ip->ip_v == IPVERSION, ("%s: IP version incorrect: %d", __func__, ip->ip_v)); } #endif - /* We're getting ready to send; log now. */ - TCP_LOG_EVENT(tp, th, &so->so_rcv, &so->so_snd, TCP_LOG_OUT, ERRNO_UNK, - len, NULL, false); - /* * Enable TSO and specify the size of the segments. * The TCP pseudo header checksum is always provided. */ if (tso) { KASSERT(len > tp->t_maxseg - optlen, ("%s: len <= tso_segsz", __func__)); m->m_pkthdr.csum_flags |= CSUM_TSO; m->m_pkthdr.tso_segsz = tp->t_maxseg - optlen; } #if defined(IPSEC) || defined(IPSEC_SUPPORT) KASSERT(len + hdrlen + ipoptlen - ipsec_optlen == m_length(m, NULL), ("%s: mbuf chain shorter than expected: %d + %u + %u - %u != %u", __func__, len, hdrlen, ipoptlen, ipsec_optlen, m_length(m, NULL))); #else KASSERT(len + hdrlen + ipoptlen == m_length(m, NULL), ("%s: mbuf chain shorter than expected: %d + %u + %u != %u", __func__, len, hdrlen, ipoptlen, m_length(m, NULL))); #endif #ifdef TCP_HHOOK /* Run HHOOK_TCP_ESTABLISHED_OUT helper hooks. */ hhook_run_tcp_est_out(tp, th, &to, len, tso); #endif #ifdef TCPDEBUG /* * Trace. */ if (so->so_options & SO_DEBUG) { u_short save = 0; #ifdef INET6 if (!isipv6) #endif { save = ipov->ih_len; ipov->ih_len = htons(m->m_pkthdr.len /* - hdrlen + (th->th_off << 2) */); } tcp_trace(TA_OUTPUT, tp->t_state, tp, mtod(m, void *), th, 0); #ifdef INET6 if (!isipv6) #endif ipov->ih_len = save; } #endif /* TCPDEBUG */ TCP_PROBE3(debug__output, tp, th, m); + + /* We're getting ready to send; log now. */ + TCP_LOG_EVENT(tp, th, &so->so_rcv, &so->so_snd, TCP_LOG_OUT, ERRNO_UNK, + len, NULL, false); /* * Fill in IP length and desired time to live and * send to IP level. There should be a better way * to handle ttl and tos; we could keep them in * the template, but need a way to checksum without them. */ /* * m->m_pkthdr.len should have been set before checksum calculation, * because in6_cksum() need it. */ #ifdef INET6 if (isipv6) { /* * we separately set hoplimit for every segment, since the * user might want to change the value via setsockopt. * Also, desired default hop limit might be changed via * Neighbor Discovery. */ ip6->ip6_hlim = in6_selecthlim(tp->t_inpcb, NULL); /* * Set the packet size here for the benefit of DTrace probes. * ip6_output() will set it properly; it's supposed to include * the option header lengths as well. */ ip6->ip6_plen = htons(m->m_pkthdr.len - sizeof(*ip6)); if (V_path_mtu_discovery && tp->t_maxseg > V_tcp_minmss) tp->t_flags2 |= TF2_PLPMTU_PMTUD; else tp->t_flags2 &= ~TF2_PLPMTU_PMTUD; if (tp->t_state == TCPS_SYN_SENT) TCP_PROBE5(connect__request, NULL, tp, ip6, tp, th); TCP_PROBE5(send, NULL, tp, ip6, tp, th); #ifdef TCPPCAP /* Save packet, if requested. */ tcp_pcap_add(th, m, &(tp->t_outpkts)); #endif /* TODO: IPv6 IP6TOS_ECT bit on */ error = ip6_output(m, tp->t_inpcb->in6p_outputopts, &tp->t_inpcb->inp_route6, ((so->so_options & SO_DONTROUTE) ? 
IP_ROUTETOIF : 0), NULL, NULL, tp->t_inpcb); if (error == EMSGSIZE && tp->t_inpcb->inp_route6.ro_rt != NULL) mtu = tp->t_inpcb->inp_route6.ro_rt->rt_mtu; } #endif /* INET6 */ #if defined(INET) && defined(INET6) else #endif #ifdef INET { ip->ip_len = htons(m->m_pkthdr.len); #ifdef INET6 if (tp->t_inpcb->inp_vflag & INP_IPV6PROTO) ip->ip_ttl = in6_selecthlim(tp->t_inpcb, NULL); #endif /* INET6 */ /* * If we do path MTU discovery, then we set DF on every packet. * This might not be the best thing to do according to RFC3390 * Section 2. However the tcp hostcache mitigates the problem * so it affects only the first tcp connection with a host. * * NB: Don't set DF on small MTU/MSS to have a safe fallback. */ if (V_path_mtu_discovery && tp->t_maxseg > V_tcp_minmss) { ip->ip_off |= htons(IP_DF); tp->t_flags2 |= TF2_PLPMTU_PMTUD; } else { tp->t_flags2 &= ~TF2_PLPMTU_PMTUD; } if (tp->t_state == TCPS_SYN_SENT) TCP_PROBE5(connect__request, NULL, tp, ip, tp, th); TCP_PROBE5(send, NULL, tp, ip, tp, th); #ifdef TCPPCAP /* Save packet, if requested. */ tcp_pcap_add(th, m, &(tp->t_outpkts)); #endif error = ip_output(m, tp->t_inpcb->inp_options, &tp->t_inpcb->inp_route, ((so->so_options & SO_DONTROUTE) ? IP_ROUTETOIF : 0), 0, tp->t_inpcb); if (error == EMSGSIZE && tp->t_inpcb->inp_route.ro_rt != NULL) mtu = tp->t_inpcb->inp_route.ro_rt->rt_mtu; } #endif /* INET */ out: /* * In transmit state, time the transmission and arrange for * the retransmit. In persist state, just set snd_max. */ if ((tp->t_flags & TF_FORCEDATA) == 0 || !tcp_timer_active(tp, TT_PERSIST)) { tcp_seq startseq = tp->snd_nxt; /* * Advance snd_nxt over sequence space of this segment. */ if (flags & (TH_SYN|TH_FIN)) { if (flags & TH_SYN) tp->snd_nxt++; if (flags & TH_FIN) { tp->snd_nxt++; tp->t_flags |= TF_SENTFIN; } } if (sack_rxmit) goto timer; tp->snd_nxt += len; if (SEQ_GT(tp->snd_nxt, tp->snd_max)) { tp->snd_max = tp->snd_nxt; /* * Time this transmission if not a retransmission and * not currently timing anything. */ if (tp->t_rtttime == 0) { tp->t_rtttime = ticks; tp->t_rtseq = startseq; TCPSTAT_INC(tcps_segstimed); } } /* * Set retransmit timer if not currently set, * and not doing a pure ack or a keep-alive probe. * Initial value for retransmit timer is smoothed * round-trip time + 2 * round-trip time variance. * Initialize shift counter which is used for backoff * of retransmit time. */ timer: if (!tcp_timer_active(tp, TT_REXMT) && ((sack_rxmit && tp->snd_nxt != tp->snd_max) || (tp->snd_nxt != tp->snd_una))) { if (tcp_timer_active(tp, TT_PERSIST)) { tcp_timer_activate(tp, TT_PERSIST, 0); tp->t_rxtshift = 0; } tcp_timer_activate(tp, TT_REXMT, tp->t_rxtcur); } else if (len == 0 && sbavail(&so->so_snd) && !tcp_timer_active(tp, TT_REXMT) && !tcp_timer_active(tp, TT_PERSIST)) { /* * Avoid a situation where we do not set persist timer * after a zero window condition. For example: * 1) A -> B: packet with enough data to fill the window * 2) B -> A: ACK for #1 + new data (0 window * advertisement) * 3) A -> B: ACK for #2, 0 len packet * * In this case, A will not activate the persist timer, * because it chose to send a packet. Unless tcp_output * is called for some other reason (delayed ack timer, * another input packet from B, socket syscall), A will * not send zero window probes. * * So, if you send a 0-length packet, but there is data * in the socket buffer, and neither the rexmt nor * persist timer is already set, then activate the * persist timer.
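 *
 * tcp_setpersist() below derives the probe interval from the
 * smoothed RTT and its variance (both kept internally scaled),
 * multiplies it by tcp_backoff[t_rxtshift] and clamps the result
 * between tcp_persmin and tcp_persmax, so repeated zero-window
 * probes back off exponentially.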
tp->t_rxtshift = 0; tcp_setpersist(tp); } } else { /* * Persist case, update snd_max but since we are in * persist mode (no window) we do not update snd_nxt. */ int xlen = len; if (flags & TH_SYN) ++xlen; if (flags & TH_FIN) { ++xlen; tp->t_flags |= TF_SENTFIN; } if (SEQ_GT(tp->snd_nxt + xlen, tp->snd_max)) tp->snd_max = tp->snd_nxt + xlen; } if (error) { /* Record the error. */ TCP_LOG_EVENT(tp, NULL, &so->so_rcv, &so->so_snd, TCP_LOG_OUT, error, 0, NULL, false); /* * We know that the packet was lost, so back out the * sequence number advance, if any. * * If the error is EPERM the packet got blocked by the * local firewall. Normally we should terminate the * connection but the blocking may have been spurious * due to a firewall reconfiguration cycle. So we treat * it like a packet loss and let the retransmit timer and * timeouts do their work over time. * XXX: It is a POLA question whether calling tcp_drop right * away would be the really correct behavior instead. */ if (((tp->t_flags & TF_FORCEDATA) == 0 || !tcp_timer_active(tp, TT_PERSIST)) && ((flags & TH_SYN) == 0) && (error != EPERM)) { if (sack_rxmit) { p->rxmit -= len; tp->sackhint.sack_bytes_rexmit -= len; KASSERT(tp->sackhint.sack_bytes_rexmit >= 0, ("sackhint bytes rtx >= 0")); } else tp->snd_nxt -= len; } SOCKBUF_UNLOCK_ASSERT(&so->so_snd); /* Check gotos. */ switch (error) { case EACCES: tp->t_softerror = error; return (0); case EPERM: tp->t_softerror = error; return (error); case ENOBUFS: TCP_XMIT_TIMER_ASSERT(tp, len, flags); tp->snd_cwnd = tp->t_maxseg; return (0); case EMSGSIZE: /* * For some reason the interface we used initially * to send segments changed to another or lowered * its MTU. * If TSO was active we either got an interface * without TSO capabilities or TSO was turned off. * If we obtained mtu from ip_output() then update * it and try again. */ if (tso) tp->t_flags &= ~TF_TSO; if (mtu != 0) { tcp_mss_update(tp, -1, mtu, NULL, NULL); goto again; } return (error); case EHOSTDOWN: case EHOSTUNREACH: case ENETDOWN: case ENETUNREACH: if (TCPS_HAVERCVDSYN(tp->t_state)) { tp->t_softerror = error; return (0); } /* FALLTHROUGH */ default: return (error); } } TCPSTAT_INC(tcps_sndtotal); /* * Data sent (as far as we can tell). * If this advertises a larger window than any other segment, * then remember the size of the advertised window. * Any pending ACK has now been sent. */ if (SEQ_GT(tp->rcv_nxt + recwin, tp->rcv_adv)) tp->rcv_adv = tp->rcv_nxt + recwin; tp->last_ack_sent = tp->rcv_nxt; tp->t_flags &= ~(TF_ACKNOW | TF_DELACK); if (tcp_timer_active(tp, TT_DELACK)) tcp_timer_activate(tp, TT_DELACK, 0); #if 0 /* * This completely breaks TCP if newreno is turned on. What happens * is that if delayed-acks are turned on on the receiver, this code * on the transmitter effectively destroys the TCP window, forcing * it to four packets (1.5Kx4 = 6K window). */ if (sendalot && --maxburst) goto again; #endif if (sendalot) goto again; return (0); } void tcp_setpersist(struct tcpcb *tp) { int t = ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1; int tt; tp->t_flags &= ~TF_PREVVALID; if (tcp_timer_active(tp, TT_REXMT)) panic("tcp_setpersist: retransmit pending"); /* * Start/restart persistence timer. */ TCPT_RANGESET(tt, t * tcp_backoff[tp->t_rxtshift], tcp_persmin, tcp_persmax); tcp_timer_activate(tp, TT_PERSIST, tt); if (tp->t_rxtshift < TCP_MAXRXTSHIFT) tp->t_rxtshift++; } /* * Insert TCP options according to the supplied parameters to the place * optp in a consistent way. Can handle unaligned destinations.
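 *
 * As a worked example of the NOP-padding rule used here: the
 * timestamp option pads until (optlen % 4) == 2 so that its two
 * 32-bit fields land on 4-byte boundaries. The loop, lifted out as a
 * standalone helper (TCPOPT_NOP and TCPOLEN_NOP are from
 * netinet/tcp.h):
 *
 *	static u_int
 *	pad_for_timestamp(u_int optlen, u_char **optp)
 *	{
 *		while (optlen == 0 || optlen % 4 != 2) {
 *			*(*optp)++ = TCPOPT_NOP;
 *			optlen += TCPOLEN_NOP;
 *		}
 *		return (optlen);
 *	}
 *
 * On a pure ACK this yields NOP + NOP + timestamp = 12 bytes; SACK
 * then pads to 14, adds its 2-byte header, and (40 - 16) / 8 leaves
 * room for exactly three 8-byte SACK blocks, filling all 40 bytes.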
* * The order of the option processing is crucial for optimal packing and * alignment for the scarce option space. * * The optimal order for a SYN/SYN-ACK segment is: * MSS (4) + NOP (1) + Window scale (3) + SACK permitted (2) + * Timestamp (10) + Signature (18) = 38 bytes out of a maximum of 40. * * The SACK options should be last. SACK blocks consume 8*n+2 bytes. * So a full size SACK blocks option is 34 bytes (with 4 SACK blocks). * At minimum we need 10 bytes (to generate 1 SACK block). If both * TCP Timestamps (12 bytes) and TCP Signatures (18 bytes) are present, * we only have 10 bytes for SACK options (40 - (12 + 18)). */ int tcp_addoptions(struct tcpopt *to, u_char *optp) { u_int32_t mask, optlen = 0; for (mask = 1; mask < TOF_MAXOPT; mask <<= 1) { if ((to->to_flags & mask) != mask) continue; if (optlen == TCP_MAXOLEN) break; switch (to->to_flags & mask) { case TOF_MSS: while (optlen % 4) { optlen += TCPOLEN_NOP; *optp++ = TCPOPT_NOP; } if (TCP_MAXOLEN - optlen < TCPOLEN_MAXSEG) continue; optlen += TCPOLEN_MAXSEG; *optp++ = TCPOPT_MAXSEG; *optp++ = TCPOLEN_MAXSEG; to->to_mss = htons(to->to_mss); bcopy((u_char *)&to->to_mss, optp, sizeof(to->to_mss)); optp += sizeof(to->to_mss); break; case TOF_SCALE: while (!optlen || optlen % 2 != 1) { optlen += TCPOLEN_NOP; *optp++ = TCPOPT_NOP; } if (TCP_MAXOLEN - optlen < TCPOLEN_WINDOW) continue; optlen += TCPOLEN_WINDOW; *optp++ = TCPOPT_WINDOW; *optp++ = TCPOLEN_WINDOW; *optp++ = to->to_wscale; break; case TOF_SACKPERM: while (optlen % 2) { optlen += TCPOLEN_NOP; *optp++ = TCPOPT_NOP; } if (TCP_MAXOLEN - optlen < TCPOLEN_SACK_PERMITTED) continue; optlen += TCPOLEN_SACK_PERMITTED; *optp++ = TCPOPT_SACK_PERMITTED; *optp++ = TCPOLEN_SACK_PERMITTED; break; case TOF_TS: while (!optlen || optlen % 4 != 2) { optlen += TCPOLEN_NOP; *optp++ = TCPOPT_NOP; } if (TCP_MAXOLEN - optlen < TCPOLEN_TIMESTAMP) continue; optlen += TCPOLEN_TIMESTAMP; *optp++ = TCPOPT_TIMESTAMP; *optp++ = TCPOLEN_TIMESTAMP; to->to_tsval = htonl(to->to_tsval); to->to_tsecr = htonl(to->to_tsecr); bcopy((u_char *)&to->to_tsval, optp, sizeof(to->to_tsval)); optp += sizeof(to->to_tsval); bcopy((u_char *)&to->to_tsecr, optp, sizeof(to->to_tsecr)); optp += sizeof(to->to_tsecr); break; case TOF_SIGNATURE: { int siglen = TCPOLEN_SIGNATURE - 2; while (!optlen || optlen % 4 != 2) { optlen += TCPOLEN_NOP; *optp++ = TCPOPT_NOP; } if (TCP_MAXOLEN - optlen < TCPOLEN_SIGNATURE) { to->to_flags &= ~TOF_SIGNATURE; continue; } optlen += TCPOLEN_SIGNATURE; *optp++ = TCPOPT_SIGNATURE; *optp++ = TCPOLEN_SIGNATURE; to->to_signature = optp; while (siglen--) *optp++ = 0; break; } case TOF_SACK: { int sackblks = 0; struct sackblk *sack = (struct sackblk *)to->to_sacks; tcp_seq sack_seq; while (!optlen || optlen % 4 != 2) { optlen += TCPOLEN_NOP; *optp++ = TCPOPT_NOP; } if (TCP_MAXOLEN - optlen < TCPOLEN_SACKHDR + TCPOLEN_SACK) continue; optlen += TCPOLEN_SACKHDR; *optp++ = TCPOPT_SACK; sackblks = min(to->to_nsacks, (TCP_MAXOLEN - optlen) / TCPOLEN_SACK); *optp++ = TCPOLEN_SACKHDR + sackblks * TCPOLEN_SACK; while (sackblks--) { sack_seq = htonl(sack->start); bcopy((u_char *)&sack_seq, optp, sizeof(sack_seq)); optp += sizeof(sack_seq); sack_seq = htonl(sack->end); bcopy((u_char *)&sack_seq, optp, sizeof(sack_seq)); optp += sizeof(sack_seq); optlen += TCPOLEN_SACK; sack++; } TCPSTAT_INC(tcps_sack_send_blocks); break; } case TOF_FASTOPEN: { int total_len; /* XXX is there any point to aligning this option? 
*/ total_len = TCPOLEN_FAST_OPEN_EMPTY + to->to_tfo_len; if (TCP_MAXOLEN - optlen < total_len) { to->to_flags &= ~TOF_FASTOPEN; continue; } *optp++ = TCPOPT_FAST_OPEN; *optp++ = total_len; if (to->to_tfo_len > 0) { bcopy(to->to_tfo_cookie, optp, to->to_tfo_len); optp += to->to_tfo_len; } optlen += total_len; break; } default: panic("%s: unknown TCP option type", __func__); break; } } /* Terminate and pad TCP options to a 4 byte boundary. */ if (optlen % 4) { optlen += TCPOLEN_EOL; *optp++ = TCPOPT_EOL; } /* * According to RFC 793 (STD0007): * "The content of the header beyond the End-of-Option option * must be header padding (i.e., zero)." * and later: "The padding is composed of zeros." */ while (optlen % 4) { optlen += TCPOLEN_PAD; *optp++ = TCPOPT_PAD; } KASSERT(optlen <= TCP_MAXOLEN, ("%s: TCP options too long", __func__)); return (optlen); } void tcp_sndbuf_autoscale(struct tcpcb *tp, struct socket *so, uint32_t sendwin) { /* * Automatic sizing of send socket buffer. Often the send buffer * size is not optimally adjusted to the actual network conditions * at hand (delay bandwidth product). Setting the buffer size too * small limits throughput on links with high bandwidth and high * delay (e.g., trans-continental/oceanic links). Setting the * buffer size too big consumes too much real kernel memory, * especially with many connections on busy servers. * * The criteria to step up the send buffer one notch are: * 1. receive window of remote host is larger than send buffer * (with a fudge factor of 5/4th); * 2. send buffer is filled to 7/8th with data (so we actually * have data to make use of it); * 3. send buffer fill has not hit maximal automatic size; * 4. our send window (slow start and congestion controlled) is * larger than sent but unacknowledged data in send buffer. * * The remote host receive window scaling factor may limit the * growing of the send buffer before it reaches its allowed * maximum. * * It scales directly with slow start or congestion window * and does at most one step per received ACK. This fast * scaling has the drawback of growing the send buffer beyond * what is strictly necessary to make full use of a given * delay*bandwidth product. However testing has shown this not * to be much of a problem. At worst we are trading wasting * of available bandwidth (the non-use of it) for wasting some * socket buffer memory. * * TODO: Shrink send buffer during idle periods together * with congestion window. Requires another timer. Has to * wait for upcoming tcp timer rewrite. * * XXXGL: should sbused() or sbavail() be used here? */ if (V_tcp_do_autosndbuf && so->so_snd.sb_flags & SB_AUTOSIZE) { int lowat; lowat = V_tcp_sendbuf_auto_lowat ? so->so_snd.sb_lowat : 0; if ((tp->snd_wnd / 4 * 5) >= so->so_snd.sb_hiwat - lowat && sbused(&so->so_snd) >= (so->so_snd.sb_hiwat / 8 * 7) - lowat && sbused(&so->so_snd) < V_tcp_autosndbuf_max && sendwin >= (sbused(&so->so_snd) - (tp->snd_nxt - tp->snd_una))) { if (!sbreserve_locked(&so->so_snd, min(so->so_snd.sb_hiwat + V_tcp_autosndbuf_inc, V_tcp_autosndbuf_max), so, curthread)) so->so_snd.sb_flags &= ~SB_AUTOSIZE; } } } Index: user/markj/netdump/sys/netinet/tcp_subr.c =================================================================== --- user/markj/netdump/sys/netinet/tcp_subr.c (revision 332407) +++ user/markj/netdump/sys/netinet/tcp_subr.c (revision 332408) @@ -1,2978 +1,2997 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1982, 1986, 1988, 1990, 1993, 1995 * The Regents of the University of California.
All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)tcp_subr.c 8.2 (Berkeley) 5/24/95 */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_ipsec.h" #include "opt_tcpdebug.h" #include #include #include #include #ifdef TCP_HHOOK #include #endif #include #ifdef TCP_HHOOK #include #endif #include #include #include #include #include #ifdef INET6 #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #include #include #include #include #include #include #endif #include #include #include #include #include #include #include #include #ifdef INET6 #include #endif #include #include #ifdef TCPPCAP #include #endif #ifdef TCPDEBUG #include #endif #ifdef INET6 #include #endif #ifdef TCP_OFFLOAD #include #endif #include #include #include #include VNET_DEFINE(int, tcp_mssdflt) = TCP_MSS; #ifdef INET6 VNET_DEFINE(int, tcp_v6mssdflt) = TCP6_MSS; #endif struct rwlock tcp_function_lock; static int sysctl_net_inet_tcp_mss_check(SYSCTL_HANDLER_ARGS) { int error, new; new = V_tcp_mssdflt; error = sysctl_handle_int(oidp, &new, 0, req); if (error == 0 && req->newptr) { if (new < TCP_MINMSS) error = EINVAL; else V_tcp_mssdflt = new; } return (error); } SYSCTL_PROC(_net_inet_tcp, TCPCTL_MSSDFLT, mssdflt, CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW, &VNET_NAME(tcp_mssdflt), 0, &sysctl_net_inet_tcp_mss_check, "I", "Default TCP Maximum Segment Size"); #ifdef INET6 static int sysctl_net_inet_tcp_mss_v6_check(SYSCTL_HANDLER_ARGS) { int error, new; new = V_tcp_v6mssdflt; error = sysctl_handle_int(oidp, &new, 0, req); if (error == 0 && req->newptr) { if (new < TCP_MINMSS) error = EINVAL; else V_tcp_v6mssdflt = new; } return (error); } SYSCTL_PROC(_net_inet_tcp, TCPCTL_V6MSSDFLT, v6mssdflt, CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW, &VNET_NAME(tcp_v6mssdflt), 0, &sysctl_net_inet_tcp_mss_v6_check, "I", "Default TCP Maximum Segment Size for IPv6"); #endif /* INET6 */ /* * Minimum MSS we accept and use. 
This prevents DoS attacks where * we are forced to a ridiculously low MSS like 20 and send hundreds * of packets instead of one. The effect scales with the available * bandwidth and quickly saturates the CPU and network interface * with packet generation and sending. Set to zero to disable MINMSS * checking. This setting prevents us from sending too small packets. */ VNET_DEFINE(int, tcp_minmss) = TCP_MINMSS; SYSCTL_INT(_net_inet_tcp, OID_AUTO, minmss, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_minmss), 0, "Minimum TCP Maximum Segment Size"); VNET_DEFINE(int, tcp_do_rfc1323) = 1; SYSCTL_INT(_net_inet_tcp, TCPCTL_DO_RFC1323, rfc1323, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_do_rfc1323), 0, "Enable rfc1323 (high performance TCP) extensions"); static int tcp_log_debug = 0; SYSCTL_INT(_net_inet_tcp, OID_AUTO, log_debug, CTLFLAG_RW, &tcp_log_debug, 0, "Log errors caused by incoming TCP segments"); static int tcp_tcbhashsize; SYSCTL_INT(_net_inet_tcp, OID_AUTO, tcbhashsize, CTLFLAG_RDTUN | CTLFLAG_NOFETCH, &tcp_tcbhashsize, 0, "Size of TCP control-block hashtable"); static int do_tcpdrain = 1; SYSCTL_INT(_net_inet_tcp, OID_AUTO, do_tcpdrain, CTLFLAG_RW, &do_tcpdrain, 0, "Enable tcp_drain routine for extra help when low on mbufs"); SYSCTL_UINT(_net_inet_tcp, OID_AUTO, pcbcount, CTLFLAG_VNET | CTLFLAG_RD, &VNET_NAME(tcbinfo.ipi_count), 0, "Number of active PCBs"); static VNET_DEFINE(int, icmp_may_rst) = 1; #define V_icmp_may_rst VNET(icmp_may_rst) SYSCTL_INT(_net_inet_tcp, OID_AUTO, icmp_may_rst, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(icmp_may_rst), 0, "Certain ICMP unreachable messages may abort connections in SYN_SENT"); static VNET_DEFINE(int, tcp_isn_reseed_interval) = 0; #define V_tcp_isn_reseed_interval VNET(tcp_isn_reseed_interval) SYSCTL_INT(_net_inet_tcp, OID_AUTO, isn_reseed_interval, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_isn_reseed_interval), 0, "Seconds between reseeding of ISN secret"); static int tcp_soreceive_stream; SYSCTL_INT(_net_inet_tcp, OID_AUTO, soreceive_stream, CTLFLAG_RDTUN, &tcp_soreceive_stream, 0, "Using soreceive_stream for TCP sockets"); VNET_DEFINE(uma_zone_t, sack_hole_zone); #define V_sack_hole_zone VNET(sack_hole_zone) #ifdef TCP_HHOOK VNET_DEFINE(struct hhook_head *, tcp_hhh[HHOOK_TCP_LAST+1]); #endif static struct inpcb *tcp_notify(struct inpcb *, int); static struct inpcb *tcp_mtudisc_notify(struct inpcb *, int); static void tcp_mtudisc(struct inpcb *, int); static char * tcp_log_addr(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr, const void *ip6hdr); static struct tcp_function_block tcp_def_funcblk = { "default", tcp_output, tcp_do_segment, tcp_default_ctloutput, NULL, NULL, NULL, NULL, NULL, NULL, 0, 0 }; int t_functions_inited = 0; +static int tcp_fb_cnt = 0; struct tcp_funchead t_functions; static struct tcp_function_block *tcp_func_set_ptr = &tcp_def_funcblk; static void init_tcp_functions(void) { if (t_functions_inited == 0) { TAILQ_INIT(&t_functions); rw_init_flags(&tcp_function_lock, "tcp_func_lock" , 0); t_functions_inited = 1; } } static struct tcp_function_block * find_tcp_functions_locked(struct tcp_function_set *fs) { struct tcp_function *f; struct tcp_function_block *blk=NULL; TAILQ_FOREACH(f, &t_functions, tf_next) { if (strcmp(f->tf_name, fs->function_set_name) == 0) { blk = f->tf_fb; break; } } return(blk); } static struct tcp_function_block * find_tcp_fb_locked(struct tcp_function_block *blk, struct tcp_function **s) { struct tcp_function_block *rblk=NULL; struct tcp_function *f; TAILQ_FOREACH(f, &t_functions, tf_next) { if
(f->tf_fb == blk) { rblk = blk; if (s) { *s = f; } break; } } return (rblk); } struct tcp_function_block * find_and_ref_tcp_functions(struct tcp_function_set *fs) { struct tcp_function_block *blk; rw_rlock(&tcp_function_lock); blk = find_tcp_functions_locked(fs); if (blk) refcount_acquire(&blk->tfb_refcnt); rw_runlock(&tcp_function_lock); return(blk); } struct tcp_function_block * find_and_ref_tcp_fb(struct tcp_function_block *blk) { struct tcp_function_block *rblk; rw_rlock(&tcp_function_lock); rblk = find_tcp_fb_locked(blk, NULL); if (rblk) refcount_acquire(&rblk->tfb_refcnt); rw_runlock(&tcp_function_lock); return(rblk); } static int sysctl_net_inet_default_tcp_functions(SYSCTL_HANDLER_ARGS) { int error=ENOENT; struct tcp_function_set fs; struct tcp_function_block *blk; memset(&fs, 0, sizeof(fs)); rw_rlock(&tcp_function_lock); blk = find_tcp_fb_locked(tcp_func_set_ptr, NULL); if (blk) { /* Found him */ strcpy(fs.function_set_name, blk->tfb_tcp_block_name); fs.pcbcnt = blk->tfb_refcnt; } rw_runlock(&tcp_function_lock); error = sysctl_handle_string(oidp, fs.function_set_name, sizeof(fs.function_set_name), req); /* Check for error or no change */ if (error != 0 || req->newptr == NULL) return(error); rw_wlock(&tcp_function_lock); blk = find_tcp_functions_locked(&fs); if ((blk == NULL) || (blk->tfb_flags & TCP_FUNC_BEING_REMOVED)) { error = ENOENT; goto done; } tcp_func_set_ptr = blk; done: rw_wunlock(&tcp_function_lock); return (error); } SYSCTL_PROC(_net_inet_tcp, OID_AUTO, functions_default, CTLTYPE_STRING | CTLFLAG_RW, NULL, 0, sysctl_net_inet_default_tcp_functions, "A", "Set/get the default TCP functions"); static int sysctl_net_inet_list_available(SYSCTL_HANDLER_ARGS) { int error, cnt, linesz; struct tcp_function *f; char *buffer, *cp; size_t bufsz, outsz; bool alias; cnt = 0; rw_rlock(&tcp_function_lock); TAILQ_FOREACH(f, &t_functions, tf_next) { cnt++; } rw_runlock(&tcp_function_lock); bufsz = (cnt+2) * ((TCP_FUNCTION_NAME_LEN_MAX * 2) + 13) + 1; buffer = malloc(bufsz, M_TEMP, M_WAITOK); error = 0; cp = buffer; linesz = snprintf(cp, bufsz, "\n%-32s%c %-32s %s\n", "Stack", 'D', "Alias", "PCB count"); cp += linesz; bufsz -= linesz; outsz = linesz; rw_rlock(&tcp_function_lock); TAILQ_FOREACH(f, &t_functions, tf_next) { alias = (f->tf_name != f->tf_fb->tfb_tcp_block_name); linesz = snprintf(cp, bufsz, "%-32s%c %-32s %u\n", f->tf_fb->tfb_tcp_block_name, (f->tf_fb == tcp_func_set_ptr) ? '*' : ' ', alias ? f->tf_name : "-", f->tf_fb->tfb_refcnt); if (linesz >= bufsz) { error = EOVERFLOW; break; } cp += linesz; bufsz -= linesz; outsz += linesz; } rw_runlock(&tcp_function_lock); if (error == 0) error = sysctl_handle_string(oidp, buffer, outsz + 1, req); free(buffer, M_TEMP); return (error); } SYSCTL_PROC(_net_inet_tcp, OID_AUTO, functions_available, CTLTYPE_STRING|CTLFLAG_RD, NULL, 0, sysctl_net_inet_list_available, "A", "list available TCP Function sets"); /* - * Exports one (struct tcp_function_id) for each non-alias. + * Exports one (struct tcp_function_info) for each alias/name. */ static int -sysctl_net_inet_list_func_ids(SYSCTL_HANDLER_ARGS) +sysctl_net_inet_list_func_info(SYSCTL_HANDLER_ARGS) { - int error, cnt; + int cnt, error; struct tcp_function *f; - struct tcp_function_id tfi; + struct tcp_function_info tfi; /* * We don't allow writes. */ if (req->newptr != NULL) return (EINVAL); /* * Wire the old buffer so we can directly copy the functions to * user space without dropping the lock. 
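 *
 * From userland, an opaque node like net.inet.tcp.function_info
 * (CTLFLAG_SKIP, so it is invisible to a plain sysctl walk) is
 * normally consumed with the usual size-probe-then-fetch sequence.
 * A hedged sketch, assuming struct tcp_function_info is visible via
 * netinet/tcp_var.h after this change:
 *
 *	size_t i, len = 0;
 *	struct tcp_function_info *tfi;
 *
 *	if (sysctlbyname("net.inet.tcp.function_info", NULL, &len,
 *	    NULL, 0) == -1)
 *		err(1, "size probe");
 *	if ((tfi = malloc(len)) == NULL)
 *		err(1, "malloc");
 *	if (sysctlbyname("net.inet.tcp.function_info", tfi, &len,
 *	    NULL, 0) == -1)
 *		err(1, "fetch");
 *	for (i = 0; i < len / sizeof(*tfi); i++)
 *		printf("%s (alias %s): id %u, refs %u\n",
 *		    tfi[i].tfi_name, tfi[i].tfi_alias,
 *		    (u_int)tfi[i].tfi_id, (u_int)tfi[i].tfi_refcnt);
 *
 * The size reported for a NULL oldptr includes one spare record (the
 * "cnt + 1" below), so a stack registered between the two calls does
 * not make the second call fail outright.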
*/ if (req->oldptr != NULL) { error = sysctl_wire_old_buffer(req, 0); if (error) return (error); } /* - * Walk the list, comparing the name of the function entry and - * function block to determine which is an alias. - * If exporting the list, copy out matching entries. Otherwise, - * just record the total length. + * Walk the list and copy out matching entries. If INVARIANTS + * is compiled in, also walk the list to verify the length of + * the list matches what we have recorded. */ - cnt = 0; rw_rlock(&tcp_function_lock); +#ifdef INVARIANTS + cnt = 0; +#else + if (req->oldptr == NULL) { + cnt = tcp_fb_cnt; + goto skip_loop; + } +#endif TAILQ_FOREACH(f, &t_functions, tf_next) { - if (strncmp(f->tf_name, f->tf_fb->tfb_tcp_block_name, - TCP_FUNCTION_NAME_LEN_MAX)) - continue; +#ifdef INVARIANTS + cnt++; +#endif if (req->oldptr != NULL) { + tfi.tfi_refcnt = f->tf_fb->tfb_refcnt; tfi.tfi_id = f->tf_fb->tfb_id; - (void)strncpy(tfi.tfi_name, f->tf_name, + (void)strncpy(tfi.tfi_alias, f->tf_name, TCP_FUNCTION_NAME_LEN_MAX); + tfi.tfi_alias[TCP_FUNCTION_NAME_LEN_MAX - 1] = '\0'; + (void)strncpy(tfi.tfi_name, + f->tf_fb->tfb_tcp_block_name, + TCP_FUNCTION_NAME_LEN_MAX); tfi.tfi_name[TCP_FUNCTION_NAME_LEN_MAX - 1] = '\0'; error = SYSCTL_OUT(req, &tfi, sizeof(tfi)); /* * Don't stop on error, as that is the * mechanism we use to accumulate length * information if the buffer was too short. */ - } else - cnt++; + } } + KASSERT(cnt == tcp_fb_cnt, + ("%s: cnt (%d) != tcp_fb_cnt (%d)", __func__, cnt, tcp_fb_cnt)); +#ifndef INVARIANTS +skip_loop: +#endif rw_runlock(&tcp_function_lock); if (req->oldptr == NULL) error = SYSCTL_OUT(req, NULL, - (cnt + 1) * sizeof(struct tcp_function_id)); + (cnt + 1) * sizeof(struct tcp_function_info)); return (error); } -SYSCTL_PROC(_net_inet_tcp, OID_AUTO, function_ids, +SYSCTL_PROC(_net_inet_tcp, OID_AUTO, function_info, CTLTYPE_OPAQUE | CTLFLAG_SKIP | CTLFLAG_RD | CTLFLAG_MPSAFE, - NULL, 0, sysctl_net_inet_list_func_ids, "S,tcp_function_id", + NULL, 0, sysctl_net_inet_list_func_info, "S,tcp_function_info", "List TCP function block name-to-ID mappings"); /* * Target size of TCP PCB hash tables. Must be a power of two. * * Note that this can be overridden by the kernel environment * variable net.inet.tcp.tcbhashsize */ #ifndef TCBHASHSIZE #define TCBHASHSIZE 0 #endif /* * XXX * Callouts should be moved into struct tcp directly. They are currently * separate because the tcpcb structure is exported to userland for sysctl * parsing purposes, which do not know about callouts. */ struct tcpcb_mem { struct tcpcb tcb; struct tcp_timer tt; struct cc_var ccv; #ifdef TCP_HHOOK struct osd osd; #endif }; static VNET_DEFINE(uma_zone_t, tcpcb_zone); #define V_tcpcb_zone VNET(tcpcb_zone) MALLOC_DEFINE(M_TCPLOG, "tcplog", "TCP address and flags print buffers"); MALLOC_DEFINE(M_TCPFUNCTIONS, "tcpfunc", "TCP function set memory"); static struct mtx isn_mtx; #define ISN_LOCK_INIT() mtx_init(&isn_mtx, "isn_mtx", NULL, MTX_DEF) #define ISN_LOCK() mtx_lock(&isn_mtx) #define ISN_UNLOCK() mtx_unlock(&isn_mtx) /* * TCP initialization. */ static void tcp_zone_change(void *tag) { uma_zone_set_max(V_tcbinfo.ipi_zone, maxsockets); uma_zone_set_max(V_tcpcb_zone, maxsockets); tcp_tw_zone_change(); } static int tcp_inpcb_init(void *mem, int size, int flags) { struct inpcb *inp = mem; INP_LOCK_INIT(inp, "inp", "tcpinp"); return (0); } /* * Take a value and get the next power of 2 that doesn't overflow. * Used to size the tcp_inpcb hash buckets. 
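 *
 * Worked example of the rounding below, using the BSD fls(3)
 * primitive (index of the highest set bit, 1-based): fls(600) == 10,
 * so 1 << 10 == 1024 is returned; fls(1024) == 11 gives 2048, so the
 * result is strictly higher than the input. For inputs large enough
 * that the shift wraps negative on two's-complement targets, the
 * "hashsize < size" test fires and one power of two smaller is used
 * instead. Lifted out as a standalone helper:
 *
 *	#include <strings.h>
 *
 *	static int
 *	next_pow2(int size)
 *	{
 *		int h = 1 << fls(size);
 *
 *		if (h < size)
 *			h = 1 << (fls(size) - 1);
 *		return (h);
 *	}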
*/ static int maketcp_hashsize(int size) { int hashsize; /* * auto tune. * get the next power of 2 higher than maxsockets. */ hashsize = 1 << fls(size); /* catch overflow, and just go one power of 2 smaller */ if (hashsize < size) { hashsize = 1 << (fls(size) - 1); } return (hashsize); } static volatile int next_tcp_stack_id = 1; /* * Register a TCP function block with the name provided in the names * array. (Note that this function does NOT automatically register * blk->tfb_tcp_block_name as a stack name. Therefore, you should * explicitly include blk->tfb_tcp_block_name in the list of names if * you wish to register the stack with that name.) * * Either all name registrations will succeed or all will fail. If * a name registration fails, the function will update the num_names * argument to point to the array index of the name that encountered * the failure. * * Returns 0 on success, or an error code on failure. */ int register_tcp_functions_as_names(struct tcp_function_block *blk, int wait, const char *names[], int *num_names) { struct tcp_function *n; struct tcp_function_set fs; int error, i; KASSERT(names != NULL && *num_names > 0, ("%s: Called with 0-length name list", __func__)); KASSERT(names != NULL, ("%s: Called with NULL name list", __func__)); if (t_functions_inited == 0) { init_tcp_functions(); } if ((blk->tfb_tcp_output == NULL) || (blk->tfb_tcp_do_segment == NULL) || (blk->tfb_tcp_ctloutput == NULL) || (strlen(blk->tfb_tcp_block_name) == 0)) { /* * These functions are required and you * need a name. */ *num_names = 0; return (EINVAL); } if (blk->tfb_tcp_timer_stop_all || blk->tfb_tcp_timer_activate || blk->tfb_tcp_timer_active || blk->tfb_tcp_timer_stop) { /* * If you define one timer function you * must have them all. */ if ((blk->tfb_tcp_timer_stop_all == NULL) || (blk->tfb_tcp_timer_activate == NULL) || (blk->tfb_tcp_timer_active == NULL) || (blk->tfb_tcp_timer_stop == NULL)) { *num_names = 0; return (EINVAL); } } refcount_init(&blk->tfb_refcnt, 0); blk->tfb_flags = 0; blk->tfb_id = atomic_fetchadd_int(&next_tcp_stack_id, 1); for (i = 0; i < *num_names; i++) { n = malloc(sizeof(struct tcp_function), M_TCPFUNCTIONS, wait); if (n == NULL) { error = ENOMEM; goto cleanup; } n->tf_fb = blk; (void)strncpy(fs.function_set_name, names[i], TCP_FUNCTION_NAME_LEN_MAX); fs.function_set_name[TCP_FUNCTION_NAME_LEN_MAX - 1] = '\0'; rw_wlock(&tcp_function_lock); if (find_tcp_functions_locked(&fs) != NULL) { /* Duplicate name space not allowed */ rw_wunlock(&tcp_function_lock); free(n, M_TCPFUNCTIONS); error = EALREADY; goto cleanup; } (void)strncpy(n->tf_name, names[i], TCP_FUNCTION_NAME_LEN_MAX); n->tf_name[TCP_FUNCTION_NAME_LEN_MAX - 1] = '\0'; TAILQ_INSERT_TAIL(&t_functions, n, tf_next); + tcp_fb_cnt++; rw_wunlock(&tcp_function_lock); } return(0); cleanup: /* * Deregister the names we just added. Because registration failed * for names[i], we don't need to deregister that name. */ *num_names = i; rw_wlock(&tcp_function_lock); while (--i >= 0) { TAILQ_FOREACH(n, &t_functions, tf_next) { if (!strncmp(n->tf_name, names[i], TCP_FUNCTION_NAME_LEN_MAX)) { TAILQ_REMOVE(&t_functions, n, tf_next); + tcp_fb_cnt--; n->tf_fb = NULL; free(n, M_TCPFUNCTIONS); break; } } } rw_wunlock(&tcp_function_lock); return (error); } /* * Register a TCP function block using the name provided in the name * argument. * * Returns 0 on success, or an error code on failure. 
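 *
 * A hypothetical stack module would drive the plural form above
 * roughly as follows; the function pointers and names are
 * placeholders, and only the fields the registration code actually
 * checks are shown:
 *
 *	static struct tcp_function_block example_fb = {
 *		.tfb_tcp_block_name = "example",
 *		.tfb_tcp_output = example_output,
 *		.tfb_tcp_do_segment = example_do_segment,
 *		.tfb_tcp_ctloutput = example_ctloutput,
 *	};
 *
 *	static int
 *	example_load(void)
 *	{
 *		const char *names[] = { "example", "example_v1" };
 *		int num_names = nitems(names);
 *
 *		return (register_tcp_functions_as_names(&example_fb,
 *		    M_WAITOK, names, &num_names));
 *	}
 *
 * On failure, num_names is rewritten to the index of the name that
 * could not be registered, and any earlier names have already been
 * backed out by the cleanup path above.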
*/ int register_tcp_functions_as_name(struct tcp_function_block *blk, const char *name, int wait) { const char *name_list[1]; int num_names, rv; num_names = 1; if (name != NULL) name_list[0] = name; else name_list[0] = blk->tfb_tcp_block_name; rv = register_tcp_functions_as_names(blk, wait, name_list, &num_names); return (rv); } /* * Register a TCP function block using the name defined in * blk->tfb_tcp_block_name. * * Returns 0 on success, or an error code on failure. */ int register_tcp_functions(struct tcp_function_block *blk, int wait) { return (register_tcp_functions_as_name(blk, NULL, wait)); } int deregister_tcp_functions(struct tcp_function_block *blk) { struct tcp_function *f; int error=ENOENT; if (strcmp(blk->tfb_tcp_block_name, "default") == 0) { /* You can't un-register the default */ return (EPERM); } rw_wlock(&tcp_function_lock); if (blk == tcp_func_set_ptr) { /* You can't free the current default */ rw_wunlock(&tcp_function_lock); return (EBUSY); } if (blk->tfb_refcnt) { /* Still tcb attached, mark it. */ blk->tfb_flags |= TCP_FUNC_BEING_REMOVED; rw_wunlock(&tcp_function_lock); return (EBUSY); } while (find_tcp_fb_locked(blk, &f) != NULL) { /* Found */ TAILQ_REMOVE(&t_functions, f, tf_next); + tcp_fb_cnt--; f->tf_fb = NULL; free(f, M_TCPFUNCTIONS); error = 0; } rw_wunlock(&tcp_function_lock); return (error); } void tcp_init(void) { const char *tcbhash_tuneable; int hashsize; tcbhash_tuneable = "net.inet.tcp.tcbhashsize"; #ifdef TCP_HHOOK if (hhook_head_register(HHOOK_TYPE_TCP, HHOOK_TCP_EST_IN, &V_tcp_hhh[HHOOK_TCP_EST_IN], HHOOK_NOWAIT|HHOOK_HEADISINVNET) != 0) printf("%s: WARNING: unable to register helper hook\n", __func__); if (hhook_head_register(HHOOK_TYPE_TCP, HHOOK_TCP_EST_OUT, &V_tcp_hhh[HHOOK_TCP_EST_OUT], HHOOK_NOWAIT|HHOOK_HEADISINVNET) != 0) printf("%s: WARNING: unable to register helper hook\n", __func__); #endif hashsize = TCBHASHSIZE; TUNABLE_INT_FETCH(tcbhash_tuneable, &hashsize); if (hashsize == 0) { /* * Auto tune the hash size based on maxsockets. * A perfect hash would have a 1:1 mapping * (hashsize = maxsockets) however it's been * suggested that O(2) average is better. */ hashsize = maketcp_hashsize(maxsockets / 4); /* * Our historical default is 512, * do not autotune lower than this. */ if (hashsize < 512) hashsize = 512; if (bootverbose && IS_DEFAULT_VNET(curvnet)) printf("%s: %s auto tuned to %d\n", __func__, tcbhash_tuneable, hashsize); } /* * We require a hashsize to be a power of two. * Previously if it was not a power of two we would just reset it * back to 512, which could be a nasty surprise if you did not notice * the error message. * Instead what we do is clip it to the closest power of two lower * than the specified hash value. */ if (!powerof2(hashsize)) { int oldhashsize = hashsize; hashsize = maketcp_hashsize(hashsize); /* prevent absurdly low value */ if (hashsize < 16) hashsize = 16; printf("%s: WARNING: TCB hash size not a power of 2, " "clipped from %d to %d.\n", __func__, oldhashsize, hashsize); } in_pcbinfo_init(&V_tcbinfo, "tcp", &V_tcb, hashsize, hashsize, "tcp_inpcb", tcp_inpcb_init, IPI_HASHFIELDS_4TUPLE); /* * These have to be type stable for the benefit of the timers. 
*/ V_tcpcb_zone = uma_zcreate("tcpcb", sizeof(struct tcpcb_mem), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); uma_zone_set_max(V_tcpcb_zone, maxsockets); uma_zone_set_warning(V_tcpcb_zone, "kern.ipc.maxsockets limit reached"); tcp_tw_init(); syncache_init(); tcp_hc_init(); TUNABLE_INT_FETCH("net.inet.tcp.sack.enable", &V_tcp_do_sack); V_sack_hole_zone = uma_zcreate("sackhole", sizeof(struct sackhole), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); tcp_fastopen_init(); /* Skip initialization of globals for non-default instances. */ if (!IS_DEFAULT_VNET(curvnet)) return; tcp_reass_global_init(); /* XXX virtualize those below? */ tcp_delacktime = TCPTV_DELACK; tcp_keepinit = TCPTV_KEEP_INIT; tcp_keepidle = TCPTV_KEEP_IDLE; tcp_keepintvl = TCPTV_KEEPINTVL; tcp_maxpersistidle = TCPTV_KEEP_IDLE; tcp_msl = TCPTV_MSL; tcp_rexmit_min = TCPTV_MIN; if (tcp_rexmit_min < 1) tcp_rexmit_min = 1; tcp_persmin = TCPTV_PERSMIN; tcp_persmax = TCPTV_PERSMAX; tcp_rexmit_slop = TCPTV_CPU_VAR; tcp_finwait2_timeout = TCPTV_FINWAIT2_TIMEOUT; tcp_tcbhashsize = hashsize; /* Set up the tcp function block list */ init_tcp_functions(); register_tcp_functions(&tcp_def_funcblk, M_WAITOK); #ifdef TCP_BLACKBOX /* Initialize the TCP logging data. */ tcp_log_init(); #endif if (tcp_soreceive_stream) { #ifdef INET tcp_usrreqs.pru_soreceive = soreceive_stream; #endif #ifdef INET6 tcp6_usrreqs.pru_soreceive = soreceive_stream; #endif /* INET6 */ } #ifdef INET6 #define TCP_MINPROTOHDR (sizeof(struct ip6_hdr) + sizeof(struct tcphdr)) #else /* INET6 */ #define TCP_MINPROTOHDR (sizeof(struct tcpiphdr)) #endif /* INET6 */ if (max_protohdr < TCP_MINPROTOHDR) max_protohdr = TCP_MINPROTOHDR; if (max_linkhdr + TCP_MINPROTOHDR > MHLEN) panic("tcp_init"); #undef TCP_MINPROTOHDR ISN_LOCK_INIT(); EVENTHANDLER_REGISTER(shutdown_pre_sync, tcp_fini, NULL, SHUTDOWN_PRI_DEFAULT); EVENTHANDLER_REGISTER(maxsockets_change, tcp_zone_change, NULL, EVENTHANDLER_PRI_ANY); #ifdef TCPPCAP tcp_pcap_init(); #endif } #ifdef VIMAGE static void tcp_destroy(void *unused __unused) { int n; #ifdef TCP_HHOOK int error; #endif /* * All our processes are gone, all our sockets should be cleaned * up, which means, we should be past the tcp_discardcb() calls. * Sleep to let all tcpcb timers really disappear and clean up. */ for (;;) { INP_LIST_RLOCK(&V_tcbinfo); n = V_tcbinfo.ipi_count; INP_LIST_RUNLOCK(&V_tcbinfo); if (n == 0) break; pause("tcpdes", hz / 10); } tcp_hc_destroy(); syncache_destroy(); tcp_tw_destroy(); in_pcbinfo_destroy(&V_tcbinfo); /* tcp_discardcb() clears the sack_holes up. */ uma_zdestroy(V_sack_hole_zone); uma_zdestroy(V_tcpcb_zone); /* * Cannot free the zone until all tcpcbs are released as we attach * the allocations to them. */ tcp_fastopen_destroy(); #ifdef TCP_HHOOK error = hhook_head_deregister(V_tcp_hhh[HHOOK_TCP_EST_IN]); if (error != 0) { printf("%s: WARNING: unable to deregister helper hook " "type=%d, id=%d: error %d returned\n", __func__, HHOOK_TYPE_TCP, HHOOK_TCP_EST_IN, error); } error = hhook_head_deregister(V_tcp_hhh[HHOOK_TCP_EST_OUT]); if (error != 0) { printf("%s: WARNING: unable to deregister helper hook " "type=%d, id=%d: error %d returned\n", __func__, HHOOK_TYPE_TCP, HHOOK_TCP_EST_OUT, error); } #endif } VNET_SYSUNINIT(tcp, SI_SUB_PROTO_DOMAIN, SI_ORDER_FOURTH, tcp_destroy, NULL); #endif void tcp_fini(void *xtp) { } /* * Fill in the IP and TCP headers for an outgoing packet, given the tcpcb. * tcp_template used to store this data in mbufs, but we now recopy it out * of the tcpcb each time to conserve mbufs.
*/ void tcpip_fillheaders(struct inpcb *inp, void *ip_ptr, void *tcp_ptr) { struct tcphdr *th = (struct tcphdr *)tcp_ptr; INP_WLOCK_ASSERT(inp); #ifdef INET6 if ((inp->inp_vflag & INP_IPV6) != 0) { struct ip6_hdr *ip6; ip6 = (struct ip6_hdr *)ip_ptr; ip6->ip6_flow = (ip6->ip6_flow & ~IPV6_FLOWINFO_MASK) | (inp->inp_flow & IPV6_FLOWINFO_MASK); ip6->ip6_vfc = (ip6->ip6_vfc & ~IPV6_VERSION_MASK) | (IPV6_VERSION & IPV6_VERSION_MASK); ip6->ip6_nxt = IPPROTO_TCP; ip6->ip6_plen = htons(sizeof(struct tcphdr)); ip6->ip6_src = inp->in6p_laddr; ip6->ip6_dst = inp->in6p_faddr; } #endif /* INET6 */ #if defined(INET6) && defined(INET) else #endif #ifdef INET { struct ip *ip; ip = (struct ip *)ip_ptr; ip->ip_v = IPVERSION; ip->ip_hl = 5; ip->ip_tos = inp->inp_ip_tos; ip->ip_len = 0; ip->ip_id = 0; ip->ip_off = 0; ip->ip_ttl = inp->inp_ip_ttl; ip->ip_sum = 0; ip->ip_p = IPPROTO_TCP; ip->ip_src = inp->inp_laddr; ip->ip_dst = inp->inp_faddr; } #endif /* INET */ th->th_sport = inp->inp_lport; th->th_dport = inp->inp_fport; th->th_seq = 0; th->th_ack = 0; th->th_x2 = 0; th->th_off = 5; th->th_flags = 0; th->th_win = 0; th->th_urp = 0; th->th_sum = 0; /* in_pseudo() is called later for ipv4 */ } /* * Create template to be used to send tcp packets on a connection. * Allocates an mbuf and fills in a skeletal tcp/ip header. The only * use for this function is in keepalives, which use tcp_respond. */ struct tcptemp * tcpip_maketemplate(struct inpcb *inp) { struct tcptemp *t; t = malloc(sizeof(*t), M_TEMP, M_NOWAIT); if (t == NULL) return (NULL); tcpip_fillheaders(inp, (void *)&t->tt_ipgen, (void *)&t->tt_t); return (t); } /* * Send a single message to the TCP at address specified by * the given TCP/IP header. If m == NULL, then we make a copy * of the tcpiphdr at th and send directly to the addressed host. * This is used to force keep alive messages out using the TCP * template for a connection. If flags are given then we send * a message back to the TCP which originated the segment th, * and discard the mbuf containing it and any other attached mbufs. * * In any case the ack and sequence number of the transmitted * segment are as specified by the parameters. * * NOTE: If m != NULL, then th must point to *inside* the mbuf. 
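 *
 * The keepalive path is the canonical m == NULL caller: it builds a
 * throwaway template and asks for a probe that repeats an
 * already-ACKed sequence number, forcing the peer to answer with an
 * ACK. A sketch of that pattern (the real call site lives in the
 * keepalive timer code):
 *
 *	struct tcptemp *t;
 *
 *	t = tcpip_maketemplate(inp);
 *	if (t != NULL) {
 *		tcp_respond(tp, t->tt_ipgen, &t->tt_t,
 *		    (struct mbuf *)NULL, tp->rcv_nxt,
 *		    tp->snd_una - 1, 0);
 *		free(t, M_TEMP);
 *	}
 *
 * snd_una - 1 sits just outside the window, so a healthy peer
 * replies with a bare ACK while a dead one stays silent and the
 * keepalive timer eventually drops the connection.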
*/ void tcp_respond(struct tcpcb *tp, void *ipgen, struct tcphdr *th, struct mbuf *m, tcp_seq ack, tcp_seq seq, int flags) { struct tcpopt to; struct inpcb *inp; struct ip *ip; struct mbuf *optm; struct tcphdr *nth; u_char *optp; #ifdef INET6 struct ip6_hdr *ip6; int isipv6; #endif /* INET6 */ int optlen, tlen, win; bool incl_opts; KASSERT(tp != NULL || m != NULL, ("tcp_respond: tp and m both NULL")); #ifdef INET6 isipv6 = ((struct ip *)ipgen)->ip_v == (IPV6_VERSION >> 4); ip6 = ipgen; #endif /* INET6 */ ip = ipgen; if (tp != NULL) { inp = tp->t_inpcb; KASSERT(inp != NULL, ("tcp control block w/o inpcb")); INP_WLOCK_ASSERT(inp); } else inp = NULL; incl_opts = false; win = 0; if (tp != NULL) { if (!(flags & TH_RST)) { win = sbspace(&inp->inp_socket->so_rcv); if (win > TCP_MAXWIN << tp->rcv_scale) win = TCP_MAXWIN << tp->rcv_scale; } if ((tp->t_flags & TF_NOOPT) == 0) incl_opts = true; } if (m == NULL) { m = m_gethdr(M_NOWAIT, MT_DATA); if (m == NULL) return; m->m_data += max_linkhdr; #ifdef INET6 if (isipv6) { bcopy((caddr_t)ip6, mtod(m, caddr_t), sizeof(struct ip6_hdr)); ip6 = mtod(m, struct ip6_hdr *); nth = (struct tcphdr *)(ip6 + 1); } else #endif /* INET6 */ { bcopy((caddr_t)ip, mtod(m, caddr_t), sizeof(struct ip)); ip = mtod(m, struct ip *); nth = (struct tcphdr *)(ip + 1); } bcopy((caddr_t)th, (caddr_t)nth, sizeof(struct tcphdr)); flags = TH_ACK; } else if (!M_WRITABLE(m)) { struct mbuf *n; /* Can't reuse 'm', allocate a new mbuf. */ n = m_gethdr(M_NOWAIT, MT_DATA); if (n == NULL) { m_freem(m); return; } if (!m_dup_pkthdr(n, m, M_NOWAIT)) { m_freem(m); m_freem(n); return; } n->m_data += max_linkhdr; /* m_len is set later */ #define xchg(a,b,type) { type t; t=a; a=b; b=t; } #ifdef INET6 if (isipv6) { bcopy((caddr_t)ip6, mtod(n, caddr_t), sizeof(struct ip6_hdr)); ip6 = mtod(n, struct ip6_hdr *); xchg(ip6->ip6_dst, ip6->ip6_src, struct in6_addr); nth = (struct tcphdr *)(ip6 + 1); } else #endif /* INET6 */ { bcopy((caddr_t)ip, mtod(n, caddr_t), sizeof(struct ip)); ip = mtod(n, struct ip *); xchg(ip->ip_dst.s_addr, ip->ip_src.s_addr, uint32_t); nth = (struct tcphdr *)(ip + 1); } bcopy((caddr_t)th, (caddr_t)nth, sizeof(struct tcphdr)); xchg(nth->th_dport, nth->th_sport, uint16_t); th = nth; m_freem(m); m = n; } else { /* * reuse the mbuf. * XXX MRT We inherit the FIB, which is lucky. */ m_freem(m->m_next); m->m_next = NULL; m->m_data = (caddr_t)ipgen; /* m_len is set later */ #ifdef INET6 if (isipv6) { xchg(ip6->ip6_dst, ip6->ip6_src, struct in6_addr); nth = (struct tcphdr *)(ip6 + 1); } else #endif /* INET6 */ { xchg(ip->ip_dst.s_addr, ip->ip_src.s_addr, uint32_t); nth = (struct tcphdr *)(ip + 1); } if (th != nth) { /* * this is usually a case when an extension header * exists between the IPv6 header and the * TCP header. */ nth->th_sport = th->th_sport; nth->th_dport = th->th_dport; } xchg(nth->th_dport, nth->th_sport, uint16_t); #undef xchg } tlen = 0; #ifdef INET6 if (isipv6) tlen = sizeof (struct ip6_hdr) + sizeof (struct tcphdr); #endif #if defined(INET) && defined(INET6) else #endif #ifdef INET tlen = sizeof (struct tcpiphdr); #endif #ifdef INVARIANTS m->m_len = 0; KASSERT(M_TRAILINGSPACE(m) >= tlen, ("Not enough trailing space for message (m=%p, need=%d, have=%ld)", m, tlen, (long)M_TRAILINGSPACE(m))); #endif m->m_len = tlen; to.to_flags = 0; if (incl_opts) { /* Make sure we have room. 
*/ if (M_TRAILINGSPACE(m) < TCP_MAXOLEN) { m->m_next = m_get(M_NOWAIT, MT_DATA); if (m->m_next) { optp = mtod(m->m_next, u_char *); optm = m->m_next; } else incl_opts = false; } else { optp = (u_char *) (nth + 1); optm = m; } } if (incl_opts) { /* Timestamps. */ if (tp->t_flags & TF_RCVD_TSTMP) { to.to_tsval = tcp_ts_getticks() + tp->ts_offset; to.to_tsecr = tp->ts_recent; to.to_flags |= TOF_TS; } #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) /* TCP-MD5 (RFC2385). */ if (tp->t_flags & TF_SIGNATURE) to.to_flags |= TOF_SIGNATURE; #endif /* Add the options. */ tlen += optlen = tcp_addoptions(&to, optp); /* Update m_len in the correct mbuf. */ optm->m_len += optlen; } else optlen = 0; #ifdef INET6 if (isipv6) { ip6->ip6_flow = 0; ip6->ip6_vfc = IPV6_VERSION; ip6->ip6_nxt = IPPROTO_TCP; ip6->ip6_plen = htons(tlen - sizeof(*ip6)); } #endif #if defined(INET) && defined(INET6) else #endif #ifdef INET { ip->ip_len = htons(tlen); ip->ip_ttl = V_ip_defttl; if (V_path_mtu_discovery) ip->ip_off |= htons(IP_DF); } #endif m->m_pkthdr.len = tlen; m->m_pkthdr.rcvif = NULL; #ifdef MAC if (inp != NULL) { /* * Packet is associated with a socket, so allow the * label of the response to reflect the socket label. */ INP_WLOCK_ASSERT(inp); mac_inpcb_create_mbuf(inp, m); } else { /* * Packet is not associated with a socket, so possibly * update the label in place. */ mac_netinet_tcp_reply(m); } #endif nth->th_seq = htonl(seq); nth->th_ack = htonl(ack); nth->th_x2 = 0; nth->th_off = (sizeof (struct tcphdr) + optlen) >> 2; nth->th_flags = flags; if (tp != NULL) nth->th_win = htons((u_short) (win >> tp->rcv_scale)); else nth->th_win = htons((u_short)win); nth->th_urp = 0; #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) if (to.to_flags & TOF_SIGNATURE) { if (!TCPMD5_ENABLED() || TCPMD5_OUTPUT(m, nth, to.to_signature) != 0) { m_freem(m); return; } } #endif m->m_pkthdr.csum_data = offsetof(struct tcphdr, th_sum); #ifdef INET6 if (isipv6) { m->m_pkthdr.csum_flags = CSUM_TCP_IPV6; nth->th_sum = in6_cksum_pseudo(ip6, tlen - sizeof(struct ip6_hdr), IPPROTO_TCP, 0); ip6->ip6_hlim = in6_selecthlim(tp != NULL ? tp->t_inpcb : NULL, NULL); } #endif /* INET6 */ #if defined(INET6) && defined(INET) else #endif #ifdef INET { m->m_pkthdr.csum_flags = CSUM_TCP; nth->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons((u_short)(tlen - sizeof(struct ip) + ip->ip_p))); } #endif /* INET */ #ifdef TCPDEBUG if (tp == NULL || (inp->inp_socket->so_options & SO_DEBUG)) tcp_trace(TA_OUTPUT, 0, tp, mtod(m, void *), th, 0); #endif TCP_PROBE3(debug__output, tp, th, m); if (flags & TH_RST) TCP_PROBE5(accept__refused, NULL, NULL, m, tp, nth); #ifdef INET6 if (isipv6) { TCP_PROBE5(send, NULL, tp, ip6, tp, nth); (void)ip6_output(m, NULL, NULL, 0, NULL, NULL, inp); } #endif /* INET6 */ #if defined(INET) && defined(INET6) else #endif #ifdef INET { TCP_PROBE5(send, NULL, tp, ip, tp, nth); (void)ip_output(m, NULL, NULL, 0, NULL, inp); } #endif } /* * Create a new TCP control block, making an * empty reassembly queue and hooking it to the argument * protocol control block. The `inp' parameter must have * come from the zone allocator set up in tcp_init(). */ struct tcpcb * tcp_newtcpcb(struct inpcb *inp) { struct tcpcb_mem *tm; struct tcpcb *tp; #ifdef INET6 int isipv6 = (inp->inp_vflag & INP_IPV6) != 0; #endif /* INET6 */ tm = uma_zalloc(V_tcpcb_zone, M_NOWAIT | M_ZERO); if (tm == NULL) return (NULL); tp = &tm->tcb; /* Initialise cc_var struct for this tcpcb. 
*/ tp->ccv = &tm->ccv; tp->ccv->type = IPPROTO_TCP; tp->ccv->ccvc.tcp = tp; rw_rlock(&tcp_function_lock); tp->t_fb = tcp_func_set_ptr; refcount_acquire(&tp->t_fb->tfb_refcnt); rw_runlock(&tcp_function_lock); /* * Use the current system default CC algorithm. */ CC_LIST_RLOCK(); KASSERT(!STAILQ_EMPTY(&cc_list), ("cc_list is empty!")); CC_ALGO(tp) = CC_DEFAULT(); CC_LIST_RUNLOCK(); if (CC_ALGO(tp)->cb_init != NULL) if (CC_ALGO(tp)->cb_init(tp->ccv) > 0) { if (tp->t_fb->tfb_tcp_fb_fini) (*tp->t_fb->tfb_tcp_fb_fini)(tp, 1); refcount_release(&tp->t_fb->tfb_refcnt); uma_zfree(V_tcpcb_zone, tm); return (NULL); } #ifdef TCP_HHOOK tp->osd = &tm->osd; if (khelp_init_osd(HELPER_CLASS_TCP, tp->osd)) { if (tp->t_fb->tfb_tcp_fb_fini) (*tp->t_fb->tfb_tcp_fb_fini)(tp, 1); refcount_release(&tp->t_fb->tfb_refcnt); uma_zfree(V_tcpcb_zone, tm); return (NULL); } #endif #ifdef VIMAGE tp->t_vnet = inp->inp_vnet; #endif tp->t_timers = &tm->tt; /* LIST_INIT(&tp->t_segq); */ /* XXX covered by M_ZERO */ tp->t_maxseg = #ifdef INET6 isipv6 ? V_tcp_v6mssdflt : #endif /* INET6 */ V_tcp_mssdflt; /* Set up our timeouts. */ callout_init(&tp->t_timers->tt_rexmt, 1); callout_init(&tp->t_timers->tt_persist, 1); callout_init(&tp->t_timers->tt_keep, 1); callout_init(&tp->t_timers->tt_2msl, 1); callout_init(&tp->t_timers->tt_delack, 1); if (V_tcp_do_rfc1323) tp->t_flags = (TF_REQ_SCALE|TF_REQ_TSTMP); if (V_tcp_do_sack) tp->t_flags |= TF_SACK_PERMIT; TAILQ_INIT(&tp->snd_holes); /* * The tcpcb will hold a reference on its inpcb until tcp_discardcb() * is called. */ in_pcbref(inp); /* Reference for tcpcb */ tp->t_inpcb = inp; /* * Init srtt to TCPTV_SRTTBASE (0), so we can tell that we have no * rtt estimate. Set rttvar so that srtt + 4 * rttvar gives * reasonable initial retransmit time. */ tp->t_srtt = TCPTV_SRTTBASE; tp->t_rttvar = ((TCPTV_RTOBASE - TCPTV_SRTTBASE) << TCP_RTTVAR_SHIFT) / 4; tp->t_rttmin = tcp_rexmit_min; tp->t_rxtcur = TCPTV_RTOBASE; tp->snd_cwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT; tp->snd_ssthresh = TCP_MAXWIN << TCP_MAX_WINSHIFT; tp->t_rcvtime = ticks; /* * IPv4 TTL initialization is necessary for an IPv6 socket as well, * because the socket may be bound to an IPv6 wildcard address, * which may match an IPv4-mapped IPv6 address. */ inp->inp_ip_ttl = V_ip_defttl; inp->inp_ppcb = tp; #ifdef TCPPCAP /* * Init the TCP PCAP queues. */ tcp_pcap_tcpcb_init(tp); #endif #ifdef TCP_BLACKBOX /* Initialize the per-TCPCB log data. */ tcp_log_tcpcbinit(tp); #endif if (tp->t_fb->tfb_tcp_fb_init) { (*tp->t_fb->tfb_tcp_fb_init)(tp); } return (tp); /* XXX */ } /* * Switch the congestion control algorithm back to NewReno for any active * control blocks using an algorithm which is about to go away. * This ensures the CC framework can allow the unload to proceed without leaving * any dangling pointers which would trigger a panic. * Returning non-zero would inform the CC framework that something went wrong * and it would be unsafe to allow the unload to proceed. However, there is no * way for this to occur with this implementation so we always return zero. */ int tcp_ccalgounload(struct cc_algo *unload_algo) { struct cc_algo *tmpalgo; struct inpcb *inp; struct tcpcb *tp; VNET_ITERATOR_DECL(vnet_iter); /* * Check all active control blocks across all network stacks and change * any that are using "unload_algo" back to NewReno. If "unload_algo" * requires cleanup code to be run, call it. 
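 *
 * The module being unloaded looks roughly like this on the other
 * side; the registration macro, the malloc type, and any cc_algo
 * fields beyond cb_init/cb_destroy are assumptions here, but those
 * two hooks are the ones this function depends on:
 *
 *	static int
 *	example_cb_init(struct cc_var *ccv)
 *	{
 *		ccv->cc_data = malloc(sizeof(struct example_state),
 *		    M_EXAMPLE, M_NOWAIT | M_ZERO);
 *		return (ccv->cc_data == NULL ? ENOMEM : 0);
 *	}
 *
 *	static void
 *	example_cb_destroy(struct cc_var *ccv)
 *	{
 *		free(ccv->cc_data, M_EXAMPLE);
 *	}
 *
 *	static struct cc_algo example_cc_algo = {
 *		.name = "example",
 *		.cb_init = example_cb_init,
 *		.cb_destroy = example_cb_destroy,
 *	};
 *	DECLARE_CC_MODULE(example, &example_cc_algo);
 *
 * Because the swap below happens under INP_WLOCK, cb_destroy can
 * free per-connection state without racing the connection's own use
 * of it.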
*/ VNET_LIST_RLOCK(); VNET_FOREACH(vnet_iter) { CURVNET_SET(vnet_iter); INP_INFO_WLOCK(&V_tcbinfo); /* * New connections already part way through being initialised * with the CC algo we're removing will not race with this code * because the INP_INFO_WLOCK is held during initialisation. We * therefore don't enter the loop below until the connection * list has stabilised. */ LIST_FOREACH(inp, &V_tcb, inp_list) { INP_WLOCK(inp); /* Important to skip tcptw structs. */ if (!(inp->inp_flags & INP_TIMEWAIT) && (tp = intotcpcb(inp)) != NULL) { /* * By holding INP_WLOCK here, we are assured * that the connection is not currently * executing inside the CC module's functions * i.e. it is safe to make the switch back to * NewReno. */ if (CC_ALGO(tp) == unload_algo) { tmpalgo = CC_ALGO(tp); /* NewReno does not require any init. */ CC_ALGO(tp) = &newreno_cc_algo; if (tmpalgo->cb_destroy != NULL) tmpalgo->cb_destroy(tp->ccv); } } INP_WUNLOCK(inp); } INP_INFO_WUNLOCK(&V_tcbinfo); CURVNET_RESTORE(); } VNET_LIST_RUNLOCK(); return (0); } /* * Drop a TCP connection, reporting * the specified error. If connection is synchronized, * then send a RST to peer. */ struct tcpcb * tcp_drop(struct tcpcb *tp, int errno) { struct socket *so = tp->t_inpcb->inp_socket; INP_INFO_LOCK_ASSERT(&V_tcbinfo); INP_WLOCK_ASSERT(tp->t_inpcb); if (TCPS_HAVERCVDSYN(tp->t_state)) { tcp_state_change(tp, TCPS_CLOSED); (void) tp->t_fb->tfb_tcp_output(tp); TCPSTAT_INC(tcps_drops); } else TCPSTAT_INC(tcps_conndrops); if (errno == ETIMEDOUT && tp->t_softerror) errno = tp->t_softerror; so->so_error = errno; return (tcp_close(tp)); } void tcp_discardcb(struct tcpcb *tp) { struct inpcb *inp = tp->t_inpcb; struct socket *so = inp->inp_socket; #ifdef INET6 int isipv6 = (inp->inp_vflag & INP_IPV6) != 0; #endif /* INET6 */ int released; INP_WLOCK_ASSERT(inp); /* * Make sure that all of our timers are stopped before we delete the * PCB. * * If stopping a timer fails, we schedule a discard function in same * callout, and the last discard function called will take care of * deleting the tcpcb. */ tp->t_timers->tt_draincnt = 0; tcp_timer_stop(tp, TT_REXMT); tcp_timer_stop(tp, TT_PERSIST); tcp_timer_stop(tp, TT_KEEP); tcp_timer_stop(tp, TT_2MSL); tcp_timer_stop(tp, TT_DELACK); if (tp->t_fb->tfb_tcp_timer_stop_all) { /* * Call the stop-all function of the methods, * this function should call the tcp_timer_stop() * method with each of the function specific timeouts. * That stop will be called via the tfb_tcp_timer_stop() * which should use the async drain function of the * callout system (see tcp_var.h). */ tp->t_fb->tfb_tcp_timer_stop_all(tp); } /* * If we got enough samples through the srtt filter, * save the rtt and rttvar in the routing entry. * 'Enough' is arbitrarily defined as 4 rtt samples. * 4 samples is enough for the srtt filter to converge * to within enough % of the correct value; fewer samples * and we could save a bogus rtt. The danger is not high * as tcp quickly recovers from everything. * XXX: Works very well but needs some more statistics! */ if (tp->t_rttupdated >= 4) { struct hc_metrics_lite metrics; uint32_t ssthresh; bzero(&metrics, sizeof(metrics)); /* * Update the ssthresh always when the conditions below * are satisfied. This gives us better new start value * for the congestion avoidance for new connections. * ssthresh is only set if packet loss occurred on a session. * * XXXRW: 'so' may be NULL here, and/or socket buffer may be * being torn down. Ideally this code would not use 'so'. 
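 *
 * Worked example of the conversion just below, assuming an IPv4
 * connection with t_maxseg = 1460 (so 40 bytes of tcpiphdr
 * overhead):
 *
 *	ssthresh = 43800;                  user data bytes
 *	ssthresh = (43800 + 730) / 1460;   30 whole segments, rounded
 *	ssthresh = 30 * (1460 + 40);       45000 packet data bytes
 *
 * The hostcache stores packet data bytes, so the byte limit is first
 * rounded to whole segments and then re-expressed with header
 * overhead included.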
*/ ssthresh = tp->snd_ssthresh; if (ssthresh != 0 && ssthresh < so->so_snd.sb_hiwat / 2) { /* * convert the limit from user data bytes to * packets then to packet data bytes. */ ssthresh = (ssthresh + tp->t_maxseg / 2) / tp->t_maxseg; if (ssthresh < 2) ssthresh = 2; ssthresh *= (tp->t_maxseg + #ifdef INET6 (isipv6 ? sizeof (struct ip6_hdr) + sizeof (struct tcphdr) : #endif sizeof (struct tcpiphdr) #ifdef INET6 ) #endif ); } else ssthresh = 0; metrics.rmx_ssthresh = ssthresh; metrics.rmx_rtt = tp->t_srtt; metrics.rmx_rttvar = tp->t_rttvar; metrics.rmx_cwnd = tp->snd_cwnd; metrics.rmx_sendpipe = 0; metrics.rmx_recvpipe = 0; tcp_hc_update(&inp->inp_inc, &metrics); } /* free the reassembly queue, if any */ tcp_reass_flush(tp); #ifdef TCP_OFFLOAD /* Disconnect offload device, if any. */ if (tp->t_flags & TF_TOE) tcp_offload_detach(tp); #endif tcp_free_sackholes(tp); #ifdef TCPPCAP /* Free the TCP PCAP queues. */ tcp_pcap_drain(&(tp->t_inpkts)); tcp_pcap_drain(&(tp->t_outpkts)); #endif /* Allow the CC algorithm to clean up after itself. */ if (CC_ALGO(tp)->cb_destroy != NULL) CC_ALGO(tp)->cb_destroy(tp->ccv); #ifdef TCP_HHOOK khelp_destroy_osd(tp->osd); #endif CC_ALGO(tp) = NULL; inp->inp_ppcb = NULL; if (tp->t_timers->tt_draincnt == 0) { /* We own the last reference on tcpcb, let's free it. */ #ifdef TCP_BLACKBOX tcp_log_tcpcbfini(tp); #endif TCPSTATES_DEC(tp->t_state); if (tp->t_fb->tfb_tcp_fb_fini) (*tp->t_fb->tfb_tcp_fb_fini)(tp, 1); refcount_release(&tp->t_fb->tfb_refcnt); tp->t_inpcb = NULL; uma_zfree(V_tcpcb_zone, tp); released = in_pcbrele_wlocked(inp); KASSERT(!released, ("%s: inp %p should not have been released " "here", __func__, inp)); } } void tcp_timer_discard(void *ptp) { struct inpcb *inp; struct tcpcb *tp; tp = (struct tcpcb *)ptp; CURVNET_SET(tp->t_vnet); INP_INFO_RLOCK(&V_tcbinfo); inp = tp->t_inpcb; KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, tp)); INP_WLOCK(inp); KASSERT((tp->t_timers->tt_flags & TT_STOPPED) != 0, ("%s: tcpcb has to be stopped here", __func__)); tp->t_timers->tt_draincnt--; if (tp->t_timers->tt_draincnt == 0) { /* We own the last reference on this tcpcb, let's free it. */ #ifdef TCP_BLACKBOX tcp_log_tcpcbfini(tp); #endif TCPSTATES_DEC(tp->t_state); if (tp->t_fb->tfb_tcp_fb_fini) (*tp->t_fb->tfb_tcp_fb_fini)(tp, 1); refcount_release(&tp->t_fb->tfb_refcnt); tp->t_inpcb = NULL; uma_zfree(V_tcpcb_zone, tp); if (in_pcbrele_wlocked(inp)) { INP_INFO_RUNLOCK(&V_tcbinfo); CURVNET_RESTORE(); return; } } INP_WUNLOCK(inp); INP_INFO_RUNLOCK(&V_tcbinfo); CURVNET_RESTORE(); } /* * Attempt to close a TCP control block, marking it as dropped, and freeing * the socket if we hold the only reference. */ struct tcpcb * tcp_close(struct tcpcb *tp) { struct inpcb *inp = tp->t_inpcb; struct socket *so; INP_INFO_LOCK_ASSERT(&V_tcbinfo); INP_WLOCK_ASSERT(inp); #ifdef TCP_OFFLOAD if (tp->t_state == TCPS_LISTEN) tcp_offload_listen_stop(tp); #endif /* * This releases the TFO pending counter resource for TFO listen * sockets as well as passively-created TFO sockets that transition * from SYN_RECEIVED to CLOSED. 
*/ if (tp->t_tfo_pending) { tcp_fastopen_decrement_counter(tp->t_tfo_pending); tp->t_tfo_pending = NULL; } in_pcbdrop(inp); TCPSTAT_INC(tcps_closed); if (tp->t_state != TCPS_CLOSED) tcp_state_change(tp, TCPS_CLOSED); KASSERT(inp->inp_socket != NULL, ("tcp_close: inp_socket NULL")); so = inp->inp_socket; soisdisconnected(so); if (inp->inp_flags & INP_SOCKREF) { KASSERT(so->so_state & SS_PROTOREF, ("tcp_close: !SS_PROTOREF")); inp->inp_flags &= ~INP_SOCKREF; INP_WUNLOCK(inp); SOCK_LOCK(so); so->so_state &= ~SS_PROTOREF; sofree(so); return (NULL); } return (tp); } void tcp_drain(void) { VNET_ITERATOR_DECL(vnet_iter); if (!do_tcpdrain) return; VNET_LIST_RLOCK_NOSLEEP(); VNET_FOREACH(vnet_iter) { CURVNET_SET(vnet_iter); struct inpcb *inpb; struct tcpcb *tcpb; /* * Walk the tcpbs, if existing, and flush the reassembly queue, * if there is one... * XXX: The "Net/3" implementation doesn't imply that the TCP * reassembly queue should be flushed, but in a situation * where we're really low on mbufs, this is potentially * useful. */ INP_INFO_WLOCK(&V_tcbinfo); LIST_FOREACH(inpb, V_tcbinfo.ipi_listhead, inp_list) { if (inpb->inp_flags & INP_TIMEWAIT) continue; INP_WLOCK(inpb); if ((tcpb = intotcpcb(inpb)) != NULL) { tcp_reass_flush(tcpb); tcp_clean_sackreport(tcpb); #ifdef TCP_BLACKBOX tcp_log_drain(tcpb); #endif #ifdef TCPPCAP if (tcp_pcap_aggressive_free) { /* Free the TCP PCAP queues. */ tcp_pcap_drain(&(tcpb->t_inpkts)); tcp_pcap_drain(&(tcpb->t_outpkts)); } #endif } INP_WUNLOCK(inpb); } INP_INFO_WUNLOCK(&V_tcbinfo); CURVNET_RESTORE(); } VNET_LIST_RUNLOCK_NOSLEEP(); } /* * Notify a tcp user of an asynchronous error; * store error as soft error, but wake up user * (for now, won't do anything until can select for soft error). * * Do not wake up user since there currently is no mechanism for * reporting soft errors (yet - a kqueue filter may be added). */ static struct inpcb * tcp_notify(struct inpcb *inp, int error) { struct tcpcb *tp; INP_INFO_LOCK_ASSERT(&V_tcbinfo); INP_WLOCK_ASSERT(inp); if ((inp->inp_flags & INP_TIMEWAIT) || (inp->inp_flags & INP_DROPPED)) return (inp); tp = intotcpcb(inp); KASSERT(tp != NULL, ("tcp_notify: tp == NULL")); /* * Ignore some errors if we are hooked up. * If connection hasn't completed, has retransmitted several times, * and receives a second error, give up now. This is better * than waiting a long time to establish a connection that * can never complete. */ if (tp->t_state == TCPS_ESTABLISHED && (error == EHOSTUNREACH || error == ENETUNREACH || error == EHOSTDOWN)) { if (inp->inp_route.ro_rt) { RTFREE(inp->inp_route.ro_rt); inp->inp_route.ro_rt = (struct rtentry *)NULL; } return (inp); } else if (tp->t_state < TCPS_ESTABLISHED && tp->t_rxtshift > 3 && tp->t_softerror) { tp = tcp_drop(tp, error); if (tp != NULL) return (inp); else return (NULL); } else { tp->t_softerror = error; return (inp); } #if 0 wakeup( &so->so_timeo); sorwakeup(so); sowwakeup(so); #endif } static int tcp_pcblist(SYSCTL_HANDLER_ARGS) { int error, i, m, n, pcb_count; struct inpcb *inp, **inp_list; inp_gen_t gencnt; struct xinpgen xig; /* * The process of preparing the TCB list is too time-consuming and * resource-intensive to repeat twice on every request. */ if (req->oldptr == NULL) { n = V_tcbinfo.ipi_count + counter_u64_fetch(V_tcps_states[TCPS_SYN_RECEIVED]); n += imax(n / 8, 10); req->oldidx = 2 * (sizeof xig) + n * sizeof(struct xtcpcb); return (0); } if (req->newptr != NULL) return (EPERM); /* * OK, now we're committed to doing something. 
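 *
 * Worked example of the size estimate in the oldptr == NULL branch
 * above: with 1000 connections and 40 embryonic (SYN_RECEIVED)
 * entries,
 *
 *	n = 1000 + 40;          ipi_count plus the state counter
 *	n += imax(n / 8, 10);   1040 + 130 = 1170 records
 *
 * and oldidx covers two xinpgen headers plus 1170 xtcpcb slots, an
 * overshoot of about 12% that absorbs connections created between
 * the size probe and the copy below.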
*/ INP_LIST_RLOCK(&V_tcbinfo); gencnt = V_tcbinfo.ipi_gencnt; n = V_tcbinfo.ipi_count; INP_LIST_RUNLOCK(&V_tcbinfo); m = counter_u64_fetch(V_tcps_states[TCPS_SYN_RECEIVED]); error = sysctl_wire_old_buffer(req, 2 * (sizeof xig) + (n + m) * sizeof(struct xtcpcb)); if (error != 0) return (error); xig.xig_len = sizeof xig; xig.xig_count = n + m; xig.xig_gen = gencnt; xig.xig_sogen = so_gencnt; error = SYSCTL_OUT(req, &xig, sizeof xig); if (error) return (error); error = syncache_pcblist(req, m, &pcb_count); if (error) return (error); inp_list = malloc(n * sizeof *inp_list, M_TEMP, M_WAITOK); INP_INFO_WLOCK(&V_tcbinfo); for (inp = LIST_FIRST(V_tcbinfo.ipi_listhead), i = 0; inp != NULL && i < n; inp = LIST_NEXT(inp, inp_list)) { INP_WLOCK(inp); if (inp->inp_gencnt <= gencnt) { /* * XXX: This use of cr_cansee(), introduced with * TCP state changes, is not quite right, but for * now, better than nothing. */ if (inp->inp_flags & INP_TIMEWAIT) { if (intotw(inp) != NULL) error = cr_cansee(req->td->td_ucred, intotw(inp)->tw_cred); else error = EINVAL; /* Skip this inp. */ } else error = cr_canseeinpcb(req->td->td_ucred, inp); if (error == 0) { in_pcbref(inp); inp_list[i++] = inp; } } INP_WUNLOCK(inp); } INP_INFO_WUNLOCK(&V_tcbinfo); n = i; error = 0; for (i = 0; i < n; i++) { inp = inp_list[i]; INP_RLOCK(inp); if (inp->inp_gencnt <= gencnt) { struct xtcpcb xt; tcp_inptoxtp(inp, &xt); INP_RUNLOCK(inp); error = SYSCTL_OUT(req, &xt, sizeof xt); } else INP_RUNLOCK(inp); } INP_INFO_RLOCK(&V_tcbinfo); for (i = 0; i < n; i++) { inp = inp_list[i]; INP_RLOCK(inp); if (!in_pcbrele_rlocked(inp)) INP_RUNLOCK(inp); } INP_INFO_RUNLOCK(&V_tcbinfo); if (!error) { /* * Give the user an updated idea of our state. * If the generation differs from what we told * her before, she knows that something happened * while we were processing this request, and it * might be necessary to retry. 
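 *
 * (Illustrative sketch of that consumer-side check, assuming the
 * buf/len pair returned by the fetch; this handler emits one
 * struct xinpgen before the xtcpcb records and one after, so:
 *
 *	struct xinpgen *head, *tail;
 *
 *	head = (struct xinpgen *)buf;
 *	tail = (struct xinpgen *)((char *)buf + len -
 *	    sizeof(struct xinpgen));
 *	if (head->xig_gen != tail->xig_gen)
 *		... retry the sysctl, the list changed mid-copy ...
 *
 * is enough to detect a concurrent change.)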
*/ INP_LIST_RLOCK(&V_tcbinfo); xig.xig_gen = V_tcbinfo.ipi_gencnt; xig.xig_sogen = so_gencnt; xig.xig_count = V_tcbinfo.ipi_count + pcb_count; INP_LIST_RUNLOCK(&V_tcbinfo); error = SYSCTL_OUT(req, &xig, sizeof xig); } free(inp_list, M_TEMP); return (error); } SYSCTL_PROC(_net_inet_tcp, TCPCTL_PCBLIST, pcblist, CTLTYPE_OPAQUE | CTLFLAG_RD, NULL, 0, tcp_pcblist, "S,xtcpcb", "List of active TCP connections"); #ifdef INET static int tcp_getcred(SYSCTL_HANDLER_ARGS) { struct xucred xuc; struct sockaddr_in addrs[2]; struct inpcb *inp; int error; error = priv_check(req->td, PRIV_NETINET_GETCRED); if (error) return (error); error = SYSCTL_IN(req, addrs, sizeof(addrs)); if (error) return (error); inp = in_pcblookup(&V_tcbinfo, addrs[1].sin_addr, addrs[1].sin_port, addrs[0].sin_addr, addrs[0].sin_port, INPLOOKUP_RLOCKPCB, NULL); if (inp != NULL) { if (inp->inp_socket == NULL) error = ENOENT; if (error == 0) error = cr_canseeinpcb(req->td->td_ucred, inp); if (error == 0) cru2x(inp->inp_cred, &xuc); INP_RUNLOCK(inp); } else error = ENOENT; if (error == 0) error = SYSCTL_OUT(req, &xuc, sizeof(struct xucred)); return (error); } SYSCTL_PROC(_net_inet_tcp, OID_AUTO, getcred, CTLTYPE_OPAQUE|CTLFLAG_RW|CTLFLAG_PRISON, 0, 0, tcp_getcred, "S,xucred", "Get the xucred of a TCP connection"); #endif /* INET */ #ifdef INET6 static int tcp6_getcred(SYSCTL_HANDLER_ARGS) { struct xucred xuc; struct sockaddr_in6 addrs[2]; struct inpcb *inp; int error; #ifdef INET int mapped = 0; #endif error = priv_check(req->td, PRIV_NETINET_GETCRED); if (error) return (error); error = SYSCTL_IN(req, addrs, sizeof(addrs)); if (error) return (error); if ((error = sa6_embedscope(&addrs[0], V_ip6_use_defzone)) != 0 || (error = sa6_embedscope(&addrs[1], V_ip6_use_defzone)) != 0) { return (error); } if (IN6_IS_ADDR_V4MAPPED(&addrs[0].sin6_addr)) { #ifdef INET if (IN6_IS_ADDR_V4MAPPED(&addrs[1].sin6_addr)) mapped = 1; else #endif return (EINVAL); } #ifdef INET if (mapped == 1) inp = in_pcblookup(&V_tcbinfo, *(struct in_addr *)&addrs[1].sin6_addr.s6_addr[12], addrs[1].sin6_port, *(struct in_addr *)&addrs[0].sin6_addr.s6_addr[12], addrs[0].sin6_port, INPLOOKUP_RLOCKPCB, NULL); else #endif inp = in6_pcblookup(&V_tcbinfo, &addrs[1].sin6_addr, addrs[1].sin6_port, &addrs[0].sin6_addr, addrs[0].sin6_port, INPLOOKUP_RLOCKPCB, NULL); if (inp != NULL) { if (inp->inp_socket == NULL) error = ENOENT; if (error == 0) error = cr_canseeinpcb(req->td->td_ucred, inp); if (error == 0) cru2x(inp->inp_cred, &xuc); INP_RUNLOCK(inp); } else error = ENOENT; if (error == 0) error = SYSCTL_OUT(req, &xuc, sizeof(struct xucred)); return (error); } SYSCTL_PROC(_net_inet6_tcp6, OID_AUTO, getcred, CTLTYPE_OPAQUE|CTLFLAG_RW|CTLFLAG_PRISON, 0, 0, tcp6_getcred, "S,xucred", "Get the xucred of a TCP6 connection"); #endif /* INET6 */ #ifdef INET void tcp_ctlinput(int cmd, struct sockaddr *sa, void *vip) { struct ip *ip = vip; struct tcphdr *th; struct in_addr faddr; struct inpcb *inp; struct tcpcb *tp; struct inpcb *(*notify)(struct inpcb *, int) = tcp_notify; struct icmp *icp; struct in_conninfo inc; tcp_seq icmp_tcp_seq; int mtu; faddr = ((struct sockaddr_in *)sa)->sin_addr; if (sa->sa_family != AF_INET || faddr.s_addr == INADDR_ANY) return; if (cmd == PRC_MSGSIZE) notify = tcp_mtudisc_notify; else if (V_icmp_may_rst && (cmd == PRC_UNREACH_ADMIN_PROHIB || cmd == PRC_UNREACH_PORT || cmd == PRC_UNREACH_PROTOCOL || cmd == PRC_TIMXCEED_INTRANS) && ip) notify = tcp_drop_syn_sent; /* * Hostdead is ugly because it goes linearly through all PCBs. 
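 * (PRC_HOSTDEAD carries no inner IP header, so there is no port pair
 * to look up; "ip" is cleared below and in_pcbnotifyall() visits
 * every PCB matching the foreign address instead.)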
* XXX: We never get this from ICMP, otherwise it makes an * excellent DoS attack on machines with many connections. */ else if (cmd == PRC_HOSTDEAD) ip = NULL; else if ((unsigned)cmd >= PRC_NCMDS || inetctlerrmap[cmd] == 0) return; if (ip == NULL) { in_pcbnotifyall(&V_tcbinfo, faddr, inetctlerrmap[cmd], notify); return; } icp = (struct icmp *)((caddr_t)ip - offsetof(struct icmp, icmp_ip)); th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2)); INP_INFO_RLOCK(&V_tcbinfo); inp = in_pcblookup(&V_tcbinfo, faddr, th->th_dport, ip->ip_src, th->th_sport, INPLOOKUP_WLOCKPCB, NULL); if (inp != NULL && PRC_IS_REDIRECT(cmd)) { /* signal EHOSTDOWN, as it flushes the cached route */ inp = (*notify)(inp, EHOSTDOWN); goto out; } icmp_tcp_seq = th->th_seq; if (inp != NULL) { if (!(inp->inp_flags & INP_TIMEWAIT) && !(inp->inp_flags & INP_DROPPED) && !(inp->inp_socket == NULL)) { tp = intotcpcb(inp); if (SEQ_GEQ(ntohl(icmp_tcp_seq), tp->snd_una) && SEQ_LT(ntohl(icmp_tcp_seq), tp->snd_max)) { if (cmd == PRC_MSGSIZE) { /* * MTU discovery: * If we got a needfrag set the MTU * in the route to the suggested new * value (if given) and then notify. */ mtu = ntohs(icp->icmp_nextmtu); /* * If no alternative MTU was * proposed, try the next smaller * one. */ if (!mtu) mtu = ip_next_mtu( ntohs(ip->ip_len), 1); if (mtu < V_tcp_minmss + sizeof(struct tcpiphdr)) mtu = V_tcp_minmss + sizeof(struct tcpiphdr); /* * Only process the offered MTU if it * is smaller than the current one. */ if (mtu < tp->t_maxseg + sizeof(struct tcpiphdr)) { bzero(&inc, sizeof(inc)); inc.inc_faddr = faddr; inc.inc_fibnum = inp->inp_inc.inc_fibnum; tcp_hc_updatemtu(&inc, mtu); tcp_mtudisc(inp, mtu); } } else inp = (*notify)(inp, inetctlerrmap[cmd]); } } } else { bzero(&inc, sizeof(inc)); inc.inc_fport = th->th_dport; inc.inc_lport = th->th_sport; inc.inc_faddr = faddr; inc.inc_laddr = ip->ip_src; syncache_unreach(&inc, icmp_tcp_seq); } out: if (inp != NULL) INP_WUNLOCK(inp); INP_INFO_RUNLOCK(&V_tcbinfo); } #endif /* INET */ #ifdef INET6 void tcp6_ctlinput(int cmd, struct sockaddr *sa, void *d) { struct in6_addr *dst; struct inpcb *(*notify)(struct inpcb *, int) = tcp_notify; struct ip6_hdr *ip6; struct mbuf *m; struct inpcb *inp; struct tcpcb *tp; struct icmp6_hdr *icmp6; struct ip6ctlparam *ip6cp = NULL; const struct sockaddr_in6 *sa6_src = NULL; struct in_conninfo inc; struct tcp_ports { uint16_t th_sport; uint16_t th_dport; } t_ports; tcp_seq icmp_tcp_seq; unsigned int mtu; unsigned int off; if (sa->sa_family != AF_INET6 || sa->sa_len != sizeof(struct sockaddr_in6)) return; /* if the parameter is from icmp6, decode it. */ if (d != NULL) { ip6cp = (struct ip6ctlparam *)d; icmp6 = ip6cp->ip6c_icmp6; m = ip6cp->ip6c_m; ip6 = ip6cp->ip6c_ip6; off = ip6cp->ip6c_off; sa6_src = ip6cp->ip6c_src; dst = ip6cp->ip6c_finaldst; } else { m = NULL; ip6 = NULL; off = 0; /* fool gcc */ sa6_src = &sa6_any; dst = NULL; } if (cmd == PRC_MSGSIZE) notify = tcp_mtudisc_notify; else if (V_icmp_may_rst && (cmd == PRC_UNREACH_ADMIN_PROHIB || cmd == PRC_UNREACH_PORT || cmd == PRC_UNREACH_PROTOCOL || cmd == PRC_TIMXCEED_INTRANS) && ip6 != NULL) notify = tcp_drop_syn_sent; /* * Hostdead is ugly because it goes linearly through all PCBs. * XXX: We never get this from ICMP, otherwise it makes an * excellent DoS attack on machines with many connections. 
*/ else if (cmd == PRC_HOSTDEAD) ip6 = NULL; else if ((unsigned)cmd >= PRC_NCMDS || inet6ctlerrmap[cmd] == 0) return; if (ip6 == NULL) { in6_pcbnotify(&V_tcbinfo, sa, 0, (const struct sockaddr *)sa6_src, 0, cmd, NULL, notify); return; } /* Check if we can safely get the ports from the tcp hdr */ if (m == NULL || (m->m_pkthdr.len < (int32_t) (off + sizeof(struct tcp_ports)))) { return; } bzero(&t_ports, sizeof(struct tcp_ports)); m_copydata(m, off, sizeof(struct tcp_ports), (caddr_t)&t_ports); INP_INFO_RLOCK(&V_tcbinfo); inp = in6_pcblookup(&V_tcbinfo, &ip6->ip6_dst, t_ports.th_dport, &ip6->ip6_src, t_ports.th_sport, INPLOOKUP_WLOCKPCB, NULL); if (inp != NULL && PRC_IS_REDIRECT(cmd)) { /* signal EHOSTDOWN, as it flushes the cached route */ inp = (*notify)(inp, EHOSTDOWN); goto out; } off += sizeof(struct tcp_ports); if (m->m_pkthdr.len < (int32_t) (off + sizeof(tcp_seq))) { goto out; } m_copydata(m, off, sizeof(tcp_seq), (caddr_t)&icmp_tcp_seq); if (inp != NULL) { if (!(inp->inp_flags & INP_TIMEWAIT) && !(inp->inp_flags & INP_DROPPED) && !(inp->inp_socket == NULL)) { tp = intotcpcb(inp); if (SEQ_GEQ(ntohl(icmp_tcp_seq), tp->snd_una) && SEQ_LT(ntohl(icmp_tcp_seq), tp->snd_max)) { if (cmd == PRC_MSGSIZE) { /* * MTU discovery: * If we got a needfrag set the MTU * in the route to the suggested new * value (if given) and then notify. */ mtu = ntohl(icmp6->icmp6_mtu); /* * If no alternative MTU was * proposed, or the proposed * MTU was too small, set to * the min. */ if (mtu < IPV6_MMTU) mtu = IPV6_MMTU - 8; bzero(&inc, sizeof(inc)); inc.inc_fibnum = M_GETFIB(m); inc.inc_flags |= INC_ISIPV6; inc.inc6_faddr = *dst; if (in6_setscope(&inc.inc6_faddr, m->m_pkthdr.rcvif, NULL)) goto out; /* * Only process the offered MTU if it * is smaller than the current one. */ if (mtu < tp->t_maxseg + sizeof (struct tcphdr) + sizeof (struct ip6_hdr)) { tcp_hc_updatemtu(&inc, mtu); tcp_mtudisc(inp, mtu); ICMP6STAT_INC(icp6s_pmtuchg); } } else inp = (*notify)(inp, inet6ctlerrmap[cmd]); } } } else { bzero(&inc, sizeof(inc)); inc.inc_fibnum = M_GETFIB(m); inc.inc_flags |= INC_ISIPV6; inc.inc_fport = t_ports.th_dport; inc.inc_lport = t_ports.th_sport; inc.inc6_faddr = *dst; inc.inc6_laddr = ip6->ip6_src; syncache_unreach(&inc, icmp_tcp_seq); } out: if (inp != NULL) INP_WUNLOCK(inp); INP_INFO_RUNLOCK(&V_tcbinfo); } #endif /* INET6 */ /* * Following is where TCP initial sequence number generation occurs. * * There are two places where we must use initial sequence numbers: * 1. In SYN-ACK packets. * 2. In SYN packets. * * All ISNs for SYN-ACK packets are generated by the syncache. See * tcp_syncache.c for details. * * The ISNs in SYN packets must be monotonic; TIME_WAIT recycling * depends on this property. In addition, these ISNs should be * unguessable so as to prevent connection hijacking. To satisfy * the requirements of this situation, the algorithm outlined in * RFC 1948 is used, with only small modifications. * * Implementation details: * * Time is based off the system timer, and is corrected so that it * increases by one megabyte per second. This allows for proper * recycling on high speed LANs while still leaving over an hour * before rollover. * * As reading the *exact* system time is too expensive to be done * whenever setting up a TCP connection, we increment the time * offset in two ways. First, a small random positive increment * is added to isn_offset for each connection that is set up. 
* Second, the function tcp_isn_tick fires once per clock tick * and increments isn_offset as necessary so that sequence numbers * are incremented at approximately ISN_BYTES_PER_SECOND. The * random positive increments serve only to ensure that the same * exact sequence number is never sent out twice (as could otherwise * happen when a port is recycled in less than the system tick * interval.) * * net.inet.tcp.isn_reseed_interval controls the number of seconds * between seeding of isn_secret. This is normally set to zero, * as reseeding should not be necessary. * * Locking of the global variables isn_secret, isn_last_reseed, isn_offset, * isn_offset_old, and isn_ctx is performed using the TCP pcbinfo lock. In * general, this means holding an exclusive (write) lock. */ #define ISN_BYTES_PER_SECOND 1048576 #define ISN_STATIC_INCREMENT 4096 #define ISN_RANDOM_INCREMENT (4096 - 1) static VNET_DEFINE(u_char, isn_secret[32]); static VNET_DEFINE(int, isn_last); static VNET_DEFINE(int, isn_last_reseed); static VNET_DEFINE(u_int32_t, isn_offset); static VNET_DEFINE(u_int32_t, isn_offset_old); #define V_isn_secret VNET(isn_secret) #define V_isn_last VNET(isn_last) #define V_isn_last_reseed VNET(isn_last_reseed) #define V_isn_offset VNET(isn_offset) #define V_isn_offset_old VNET(isn_offset_old) tcp_seq tcp_new_isn(struct tcpcb *tp) { MD5_CTX isn_ctx; u_int32_t md5_buffer[4]; tcp_seq new_isn; u_int32_t projected_offset; INP_WLOCK_ASSERT(tp->t_inpcb); ISN_LOCK(); /* Seed if this is the first use, reseed if requested. */ if ((V_isn_last_reseed == 0) || ((V_tcp_isn_reseed_interval > 0) && (((u_int)V_isn_last_reseed + (u_int)V_tcp_isn_reseed_interval*hz) < (u_int)ticks))) { read_random(&V_isn_secret, sizeof(V_isn_secret)); V_isn_last_reseed = ticks; } /* Compute the md5 hash and return the ISN. */ MD5Init(&isn_ctx); MD5Update(&isn_ctx, (u_char *) &tp->t_inpcb->inp_fport, sizeof(u_short)); MD5Update(&isn_ctx, (u_char *) &tp->t_inpcb->inp_lport, sizeof(u_short)); #ifdef INET6 if ((tp->t_inpcb->inp_vflag & INP_IPV6) != 0) { MD5Update(&isn_ctx, (u_char *) &tp->t_inpcb->in6p_faddr, sizeof(struct in6_addr)); MD5Update(&isn_ctx, (u_char *) &tp->t_inpcb->in6p_laddr, sizeof(struct in6_addr)); } else #endif { MD5Update(&isn_ctx, (u_char *) &tp->t_inpcb->inp_faddr, sizeof(struct in_addr)); MD5Update(&isn_ctx, (u_char *) &tp->t_inpcb->inp_laddr, sizeof(struct in_addr)); } MD5Update(&isn_ctx, (u_char *) &V_isn_secret, sizeof(V_isn_secret)); MD5Final((u_char *) &md5_buffer, &isn_ctx); new_isn = (tcp_seq) md5_buffer[0]; V_isn_offset += ISN_STATIC_INCREMENT + (arc4random() & ISN_RANDOM_INCREMENT); if (ticks != V_isn_last) { projected_offset = V_isn_offset_old + ISN_BYTES_PER_SECOND / hz * (ticks - V_isn_last); if (SEQ_GT(projected_offset, V_isn_offset)) V_isn_offset = projected_offset; V_isn_offset_old = V_isn_offset; V_isn_last = ticks; } new_isn += V_isn_offset; ISN_UNLOCK(); return (new_isn); } /* * When a specific ICMP unreachable message is received and the * connection state is SYN-SENT, drop the connection. This behavior * is controlled by the icmp_may_rst sysctl. 
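 *
 * (The knob can be flipped at runtime; assuming the conventional
 * sysctl name derived from V_icmp_may_rst:
 *
 *	sysctl net.inet.tcp.icmp_may_rst=0
 *
 * disables the drop and leaves SYN-SENT connections to keep
 * retransmitting.)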
*/ struct inpcb * tcp_drop_syn_sent(struct inpcb *inp, int errno) { struct tcpcb *tp; INP_INFO_RLOCK_ASSERT(&V_tcbinfo); INP_WLOCK_ASSERT(inp); if ((inp->inp_flags & INP_TIMEWAIT) || (inp->inp_flags & INP_DROPPED)) return (inp); tp = intotcpcb(inp); if (tp->t_state != TCPS_SYN_SENT) return (inp); if (IS_FASTOPEN(tp->t_flags)) tcp_fastopen_disable_path(tp); tp = tcp_drop(tp, errno); if (tp != NULL) return (inp); else return (NULL); } /* * When `need fragmentation' ICMP is received, update our idea of the MSS * based on the new value. Also nudge TCP to send something, since we * know the packet we just sent was dropped. * This duplicates some code in the tcp_mss() function in tcp_input.c. */ static struct inpcb * tcp_mtudisc_notify(struct inpcb *inp, int error) { tcp_mtudisc(inp, -1); return (inp); } static void tcp_mtudisc(struct inpcb *inp, int mtuoffer) { struct tcpcb *tp; struct socket *so; INP_WLOCK_ASSERT(inp); if ((inp->inp_flags & INP_TIMEWAIT) || (inp->inp_flags & INP_DROPPED)) return; tp = intotcpcb(inp); KASSERT(tp != NULL, ("tcp_mtudisc: tp == NULL")); tcp_mss_update(tp, -1, mtuoffer, NULL, NULL); so = inp->inp_socket; SOCKBUF_LOCK(&so->so_snd); /* If the mss is larger than the socket buffer, decrease the mss. */ if (so->so_snd.sb_hiwat < tp->t_maxseg) tp->t_maxseg = so->so_snd.sb_hiwat; SOCKBUF_UNLOCK(&so->so_snd); TCPSTAT_INC(tcps_mturesent); tp->t_rtttime = 0; tp->snd_nxt = tp->snd_una; tcp_free_sackholes(tp); tp->snd_recover = tp->snd_max; if (tp->t_flags & TF_SACK_PERMIT) EXIT_FASTRECOVERY(tp->t_flags); tp->t_fb->tfb_tcp_output(tp); } #ifdef INET /* * Look-up the routing entry to the peer of this inpcb. If no route * is found and it cannot be allocated, then return 0. This routine * is called by TCP routines that access the rmx structure and by * tcp_mss_update to get the peer/interface MTU. */ uint32_t tcp_maxmtu(struct in_conninfo *inc, struct tcp_ifcap *cap) { struct nhop4_extended nh4; struct ifnet *ifp; uint32_t maxmtu = 0; KASSERT(inc != NULL, ("tcp_maxmtu with NULL in_conninfo pointer")); if (inc->inc_faddr.s_addr != INADDR_ANY) { if (fib4_lookup_nh_ext(inc->inc_fibnum, inc->inc_faddr, NHR_REF, 0, &nh4) != 0) return (0); ifp = nh4.nh_ifp; maxmtu = nh4.nh_mtu; /* Report additional interface capabilities. */ if (cap != NULL) { if (ifp->if_capenable & IFCAP_TSO4 && ifp->if_hwassist & CSUM_TSO) { cap->ifcap |= CSUM_TSO; cap->tsomax = ifp->if_hw_tsomax; cap->tsomaxsegcount = ifp->if_hw_tsomaxsegcount; cap->tsomaxsegsize = ifp->if_hw_tsomaxsegsize; } } fib4_free_nh_ext(inc->inc_fibnum, &nh4); } return (maxmtu); } #endif /* INET */ #ifdef INET6 uint32_t tcp_maxmtu6(struct in_conninfo *inc, struct tcp_ifcap *cap) { struct nhop6_extended nh6; struct in6_addr dst6; uint32_t scopeid; struct ifnet *ifp; uint32_t maxmtu = 0; KASSERT(inc != NULL, ("tcp_maxmtu6 with NULL in_conninfo pointer")); if (!IN6_IS_ADDR_UNSPECIFIED(&inc->inc6_faddr)) { in6_splitscope(&inc->inc6_faddr, &dst6, &scopeid); if (fib6_lookup_nh_ext(inc->inc_fibnum, &dst6, scopeid, 0, 0, &nh6) != 0) return (0); ifp = nh6.nh_ifp; maxmtu = nh6.nh_mtu; /* Report additional interface capabilities. 
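 * The TSO limits captured in struct tcp_ifcap here are handed to
 * tcp_mss_update(), so the per-connection send path can stay within
 * what the hardware is able to segment.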
*/ if (cap != NULL) { if (ifp->if_capenable & IFCAP_TSO6 && ifp->if_hwassist & CSUM_TSO) { cap->ifcap |= CSUM_TSO; cap->tsomax = ifp->if_hw_tsomax; cap->tsomaxsegcount = ifp->if_hw_tsomaxsegcount; cap->tsomaxsegsize = ifp->if_hw_tsomaxsegsize; } } fib6_free_nh_ext(inc->inc_fibnum, &nh6); } return (maxmtu); } #endif /* INET6 */ /* * Calculate effective SMSS per RFC5681 definition for a given TCP * connection at its current state, taking SACK and other options * into account. */ u_int tcp_maxseg(const struct tcpcb *tp) { u_int optlen; if (tp->t_flags & TF_NOOPT) return (tp->t_maxseg); /* * This is a simplified version of the code in tcp_addoptions(), * without a proper loop, and with most of the padding hardcoded. * We might make mistakes with padding here in some edge cases, * but this is harmless, since the result of tcp_maxseg() is used * only in cwnd and ssthresh estimations. */ #define PAD(len) ((((len) / 4) + !!((len) % 4)) * 4) if (TCPS_HAVEESTABLISHED(tp->t_state)) { if (tp->t_flags & TF_RCVD_TSTMP) optlen = TCPOLEN_TSTAMP_APPA; else optlen = 0; #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) if (tp->t_flags & TF_SIGNATURE) optlen += PAD(TCPOLEN_SIGNATURE); #endif if ((tp->t_flags & TF_SACK_PERMIT) && tp->rcv_numsacks > 0) { optlen += TCPOLEN_SACKHDR; optlen += tp->rcv_numsacks * TCPOLEN_SACK; optlen = PAD(optlen); } } else { if (tp->t_flags & TF_REQ_TSTMP) optlen = TCPOLEN_TSTAMP_APPA; else optlen = PAD(TCPOLEN_MAXSEG); if (tp->t_flags & TF_REQ_SCALE) optlen += PAD(TCPOLEN_WINDOW); #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) if (tp->t_flags & TF_SIGNATURE) optlen += PAD(TCPOLEN_SIGNATURE); #endif if (tp->t_flags & TF_SACK_PERMIT) optlen += PAD(TCPOLEN_SACK_PERMITTED); } #undef PAD optlen = min(optlen, TCP_MAXOLEN); return (tp->t_maxseg - optlen); } static int sysctl_drop(SYSCTL_HANDLER_ARGS) { /* addrs[0] is a foreign socket, addrs[1] is a local one.
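 *
 * (Illustrative userland sketch: this is the write-only
 * "net.inet.tcp.drop" node declared below, which tcpdrop(8) drives
 * roughly as follows, with foreign_sa/local_sa as filled-in
 * sockaddr_storage values:
 *
 *	struct sockaddr_storage addrs[2];
 *
 *	memset(addrs, 0, sizeof(addrs));
 *	memcpy(&addrs[0], &foreign_sa, foreign_sa.ss_len);
 *	memcpy(&addrs[1], &local_sa, local_sa.ss_len);
 *	if (sysctlbyname("net.inet.tcp.drop", NULL, NULL, addrs,
 *	    sizeof(addrs)) == -1)
 *		warn("sysctl net.inet.tcp.drop");
 *
 * matching the size and direction checks below.)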
*/ struct sockaddr_storage addrs[2]; struct inpcb *inp; struct tcpcb *tp; struct tcptw *tw; struct sockaddr_in *fin, *lin; #ifdef INET6 struct sockaddr_in6 *fin6, *lin6; #endif int error; inp = NULL; fin = lin = NULL; #ifdef INET6 fin6 = lin6 = NULL; #endif error = 0; if (req->oldptr != NULL || req->oldlen != 0) return (EINVAL); if (req->newptr == NULL) return (EPERM); if (req->newlen < sizeof(addrs)) return (ENOMEM); error = SYSCTL_IN(req, &addrs, sizeof(addrs)); if (error) return (error); switch (addrs[0].ss_family) { #ifdef INET6 case AF_INET6: fin6 = (struct sockaddr_in6 *)&addrs[0]; lin6 = (struct sockaddr_in6 *)&addrs[1]; if (fin6->sin6_len != sizeof(struct sockaddr_in6) || lin6->sin6_len != sizeof(struct sockaddr_in6)) return (EINVAL); if (IN6_IS_ADDR_V4MAPPED(&fin6->sin6_addr)) { if (!IN6_IS_ADDR_V4MAPPED(&lin6->sin6_addr)) return (EINVAL); in6_sin6_2_sin_in_sock((struct sockaddr *)&addrs[0]); in6_sin6_2_sin_in_sock((struct sockaddr *)&addrs[1]); fin = (struct sockaddr_in *)&addrs[0]; lin = (struct sockaddr_in *)&addrs[1]; break; } error = sa6_embedscope(fin6, V_ip6_use_defzone); if (error) return (error); error = sa6_embedscope(lin6, V_ip6_use_defzone); if (error) return (error); break; #endif #ifdef INET case AF_INET: fin = (struct sockaddr_in *)&addrs[0]; lin = (struct sockaddr_in *)&addrs[1]; if (fin->sin_len != sizeof(struct sockaddr_in) || lin->sin_len != sizeof(struct sockaddr_in)) return (EINVAL); break; #endif default: return (EINVAL); } INP_INFO_RLOCK(&V_tcbinfo); switch (addrs[0].ss_family) { #ifdef INET6 case AF_INET6: inp = in6_pcblookup(&V_tcbinfo, &fin6->sin6_addr, fin6->sin6_port, &lin6->sin6_addr, lin6->sin6_port, INPLOOKUP_WLOCKPCB, NULL); break; #endif #ifdef INET case AF_INET: inp = in_pcblookup(&V_tcbinfo, fin->sin_addr, fin->sin_port, lin->sin_addr, lin->sin_port, INPLOOKUP_WLOCKPCB, NULL); break; #endif } if (inp != NULL) { if (inp->inp_flags & INP_TIMEWAIT) { /* * XXXRW: There currently exists a state where an * inpcb is present, but its timewait state has been * discarded. For now, don't allow dropping of this * type of inpcb. */ tw = intotw(inp); if (tw != NULL) tcp_twclose(tw, 0); else INP_WUNLOCK(inp); } else if (!(inp->inp_flags & INP_DROPPED) && !(inp->inp_socket->so_options & SO_ACCEPTCONN)) { tp = intotcpcb(inp); tp = tcp_drop(tp, ECONNABORTED); if (tp != NULL) INP_WUNLOCK(inp); } else INP_WUNLOCK(inp); } else error = ESRCH; INP_INFO_RUNLOCK(&V_tcbinfo); return (error); } SYSCTL_PROC(_net_inet_tcp, TCPCTL_DROP, drop, CTLFLAG_VNET | CTLTYPE_STRUCT | CTLFLAG_WR | CTLFLAG_SKIP, NULL, 0, sysctl_drop, "", "Drop TCP connection"); /* * Generate a standardized TCP log line for use throughout the * tcp subsystem. Memory allocation is done with M_NOWAIT to * allow use in the interrupt context. * * NB: The caller MUST free(s, M_TCPLOG) the returned string. * NB: The function may return NULL if memory allocation failed. * * Due to header inclusion and ordering limitations the struct ip * and ip6_hdr pointers have to be passed as void pointers. */ char * tcp_log_vain(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr, const void *ip6hdr) { /* Is logging enabled? */ if (tcp_log_in_vain == 0) return (NULL); return (tcp_log_addr(inc, th, ip4hdr, ip6hdr)); } char * tcp_log_addrs(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr, const void *ip6hdr) { /* Is logging enabled? 
*/ if (tcp_log_debug == 0) return (NULL); return (tcp_log_addr(inc, th, ip4hdr, ip6hdr)); } static char * tcp_log_addr(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr, const void *ip6hdr) { char *s, *sp; size_t size; struct ip *ip; #ifdef INET6 const struct ip6_hdr *ip6; ip6 = (const struct ip6_hdr *)ip6hdr; #endif /* INET6 */ ip = (struct ip *)ip4hdr; /* * The log line looks like this: * "TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags 0x2" */ size = sizeof("TCP: []:12345 to []:12345 tcpflags 0x2<>") + sizeof(PRINT_TH_FLAGS) + 1 + #ifdef INET6 2 * INET6_ADDRSTRLEN; #else 2 * INET_ADDRSTRLEN; #endif /* INET6 */ s = malloc(size, M_TCPLOG, M_ZERO|M_NOWAIT); if (s == NULL) return (NULL); strcat(s, "TCP: ["); sp = s + strlen(s); if (inc && ((inc->inc_flags & INC_ISIPV6) == 0)) { inet_ntoa_r(inc->inc_faddr, sp); sp = s + strlen(s); sprintf(sp, "]:%i to [", ntohs(inc->inc_fport)); sp = s + strlen(s); inet_ntoa_r(inc->inc_laddr, sp); sp = s + strlen(s); sprintf(sp, "]:%i", ntohs(inc->inc_lport)); #ifdef INET6 } else if (inc) { ip6_sprintf(sp, &inc->inc6_faddr); sp = s + strlen(s); sprintf(sp, "]:%i to [", ntohs(inc->inc_fport)); sp = s + strlen(s); ip6_sprintf(sp, &inc->inc6_laddr); sp = s + strlen(s); sprintf(sp, "]:%i", ntohs(inc->inc_lport)); } else if (ip6 && th) { ip6_sprintf(sp, &ip6->ip6_src); sp = s + strlen(s); sprintf(sp, "]:%i to [", ntohs(th->th_sport)); sp = s + strlen(s); ip6_sprintf(sp, &ip6->ip6_dst); sp = s + strlen(s); sprintf(sp, "]:%i", ntohs(th->th_dport)); #endif /* INET6 */ #ifdef INET } else if (ip && th) { inet_ntoa_r(ip->ip_src, sp); sp = s + strlen(s); sprintf(sp, "]:%i to [", ntohs(th->th_sport)); sp = s + strlen(s); inet_ntoa_r(ip->ip_dst, sp); sp = s + strlen(s); sprintf(sp, "]:%i", ntohs(th->th_dport)); #endif /* INET */ } else { free(s, M_TCPLOG); return (NULL); } sp = s + strlen(s); if (th) sprintf(sp, " tcpflags 0x%b", th->th_flags, PRINT_TH_FLAGS); if (*(s + size - 1) != '\0') panic("%s: string too long", __func__); return (s); } /* * A subroutine which makes it easy to track TCP state changes with DTrace. * This function shouldn't be called for t_state initializations that don't * correspond to actual TCP state transitions. */ void tcp_state_change(struct tcpcb *tp, int newstate) { #if defined(KDTRACE_HOOKS) int pstate = tp->t_state; #endif TCPSTATES_DEC(tp->t_state); TCPSTATES_INC(newstate); tp->t_state = newstate; TCP_PROBE6(state__change, NULL, tp, NULL, tp, NULL, pstate); } /* * Create an external-format (``xtcpcb'') structure using the information in * the kernel-format tcpcb structure pointed to by tp. This is done to * reduce the spew of irrelevant information over this interface, to isolate * user code from changes in the kernel structure, and potentially to provide * information-hiding if we decide that some of this information should be * hidden from users. 
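 * Userland consumers should honor xt_len rather than their
 * compile-time sizeof(struct xtcpcb), so old binaries keep working
 * as the spare fields are consumed; note also that the tt_* timer
 * fields are exported as milliseconds remaining (see the COPYTIMER()
 * macro below).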
*/ void tcp_inptoxtp(const struct inpcb *inp, struct xtcpcb *xt) { struct tcpcb *tp = intotcpcb(inp); sbintime_t now; if (inp->inp_flags & INP_TIMEWAIT) { bzero(xt, sizeof(struct xtcpcb)); xt->t_state = TCPS_TIME_WAIT; } else { xt->t_state = tp->t_state; xt->t_logstate = tp->t_logstate; xt->t_flags = tp->t_flags; xt->t_sndzerowin = tp->t_sndzerowin; xt->t_sndrexmitpack = tp->t_sndrexmitpack; xt->t_rcvoopack = tp->t_rcvoopack; now = getsbinuptime(); #define COPYTIMER(ttt) do { \ if (callout_active(&tp->t_timers->ttt)) \ xt->ttt = (tp->t_timers->ttt.c_time - now) / \ SBT_1MS; \ else \ xt->ttt = 0; \ } while (0) COPYTIMER(tt_delack); COPYTIMER(tt_rexmt); COPYTIMER(tt_persist); COPYTIMER(tt_keep); COPYTIMER(tt_2msl); #undef COPYTIMER xt->t_rcvtime = 1000 * (ticks - tp->t_rcvtime) / hz; bcopy(tp->t_fb->tfb_tcp_block_name, xt->xt_stack, TCP_FUNCTION_NAME_LEN_MAX); bzero(xt->xt_logid, TCP_LOG_ID_LEN); #ifdef TCP_BLACKBOX (void)tcp_log_get_id(tp, xt->xt_logid); #endif } xt->xt_len = sizeof(struct xtcpcb); in_pcbtoxinpcb(inp, &xt->xt_inp); if (inp->inp_socket == NULL) xt->xt_inp.xi_socket.xso_protocol = IPPROTO_TCP; } Index: user/markj/netdump/sys/netinet/tcp_var.h =================================================================== --- user/markj/netdump/sys/netinet/tcp_var.h (revision 332407) +++ user/markj/netdump/sys/netinet/tcp_var.h (revision 332408) @@ -1,934 +1,937 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1982, 1986, 1993, 1994, 1995 * The Regents of the University of California. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)tcp_var.h 8.4 (Berkeley) 5/24/95 * $FreeBSD$ */ #ifndef _NETINET_TCP_VAR_H_ #define _NETINET_TCP_VAR_H_ #include #include #ifdef _KERNEL #include #include #endif #if defined(_KERNEL) || defined(_WANT_TCPCB) /* TCP segment queue entry */ struct tseg_qent { LIST_ENTRY(tseg_qent) tqe_q; int tqe_len; /* TCP segment data length */ struct tcphdr *tqe_th; /* a pointer to tcp header */ struct mbuf *tqe_m; /* mbuf contains packet */ }; LIST_HEAD(tsegqe_head, tseg_qent); struct sackblk { tcp_seq start; /* start seq no. 
of sack block */ tcp_seq end; /* end seq no. */ }; struct sackhole { tcp_seq start; /* start seq no. of hole */ tcp_seq end; /* end seq no. */ tcp_seq rxmit; /* next seq. no in hole to be retransmitted */ TAILQ_ENTRY(sackhole) scblink; /* scoreboard linkage */ }; struct sackhint { struct sackhole *nexthole; int sack_bytes_rexmit; tcp_seq last_sack_ack; /* Most recent/largest sacked ack */ int ispare; /* explicit pad for 64bit alignment */ int sacked_bytes; /* * Total sacked bytes reported by the * receiver via sack option */ uint32_t _pad1[1]; /* TBD */ uint64_t _pad[1]; /* TBD */ }; STAILQ_HEAD(tcp_log_stailq, tcp_log_mem); /* * Tcp control block, one per tcp; fields: * Organized for 16 byte cacheline efficiency. */ struct tcpcb { struct tsegqe_head t_segq; /* segment reassembly queue */ int t_segqlen; /* segment reassembly queue length */ int t_dupacks; /* consecutive dup acks recd */ struct tcp_timer *t_timers; /* All the TCP timers in one struct */ struct inpcb *t_inpcb; /* back pointer to internet pcb */ int t_state; /* state of this connection */ u_int t_flags; struct vnet *t_vnet; /* back pointer to parent vnet */ tcp_seq snd_una; /* sent but unacknowledged */ tcp_seq snd_max; /* highest sequence number sent; * used to recognize retransmits */ tcp_seq snd_nxt; /* send next */ tcp_seq snd_up; /* send urgent pointer */ tcp_seq snd_wl1; /* window update seg seq number */ tcp_seq snd_wl2; /* window update seg ack number */ tcp_seq iss; /* initial send sequence number */ tcp_seq irs; /* initial receive sequence number */ tcp_seq rcv_nxt; /* receive next */ tcp_seq rcv_adv; /* advertised window */ uint32_t rcv_wnd; /* receive window */ tcp_seq rcv_up; /* receive urgent pointer */ uint32_t snd_wnd; /* send window */ uint32_t snd_cwnd; /* congestion-controlled window */ uint32_t snd_ssthresh; /* snd_cwnd size threshold * for slow start exponential to * linear switch */ tcp_seq snd_recover; /* for use in NewReno Fast Recovery */ u_int t_rcvtime; /* inactivity time */ u_int t_starttime; /* time connection was established */ u_int t_rtttime; /* RTT measurement start time */ tcp_seq t_rtseq; /* sequence number being timed */ int t_rxtcur; /* current retransmit value (ticks) */ u_int t_maxseg; /* maximum segment size */ u_int t_pmtud_saved_maxseg; /* pre-blackhole MSS */ int t_srtt; /* smoothed round-trip time */ int t_rttvar; /* variance in round-trip time */ int t_rxtshift; /* log(2) of rexmt exp.
backoff */ u_int t_rttmin; /* minimum rtt allowed */ u_int t_rttbest; /* best rtt we've seen */ u_long t_rttupdated; /* number of times rtt sampled */ uint32_t max_sndwnd; /* largest window peer has offered */ int t_softerror; /* possible error not yet reported */ /* out-of-band data */ char t_oobflags; /* have some */ char t_iobc; /* input character */ /* RFC 1323 variables */ u_char snd_scale; /* window scaling for send window */ u_char rcv_scale; /* window scaling for recv window */ u_char request_r_scale; /* pending window scaling */ u_int32_t ts_recent; /* timestamp echo data */ u_int ts_recent_age; /* when last updated */ u_int32_t ts_offset; /* our timestamp offset */ tcp_seq last_ack_sent; /* experimental */ uint32_t snd_cwnd_prev; /* cwnd prior to retransmit */ uint32_t snd_ssthresh_prev; /* ssthresh prior to retransmit */ tcp_seq snd_recover_prev; /* snd_recover prior to retransmit */ int t_sndzerowin; /* zero-window updates sent */ u_int t_badrxtwin; /* window for retransmit recovery */ u_char snd_limited; /* segments limited transmitted */ /* SACK related state */ int snd_numholes; /* number of holes seen by sender */ TAILQ_HEAD(sackhole_head, sackhole) snd_holes; /* SACK scoreboard (sorted) */ tcp_seq snd_fack; /* last seq number(+1) sack'd by rcv'r*/ int rcv_numsacks; /* # distinct sack blks present */ struct sackblk sackblks[MAX_SACK_BLKS]; /* seq nos. of sack blocks */ tcp_seq sack_newdata; /* New data xmitted in this recovery episode starts at this seq number */ struct sackhint sackhint; /* SACK scoreboard hint */ int t_rttlow; /* smallest observed RTT */ u_int32_t rfbuf_ts; /* recv buffer autoscaling timestamp */ int rfbuf_cnt; /* recv buffer autoscaling byte count */ struct toedev *tod; /* toedev handling this connection */ int t_sndrexmitpack; /* retransmit packets sent */ int t_rcvoopack; /* out-of-order packets received */ void *t_toe; /* TOE pcb pointer */ int t_bytes_acked; /* # bytes acked during current RTT */ struct cc_algo *cc_algo; /* congestion control algorithm */ struct cc_var *ccv; /* congestion control specific vars */ struct osd *osd; /* storage for Khelp module data */ u_int t_keepinit; /* time to establish connection */ u_int t_keepidle; /* time before keepalive probes begin */ u_int t_keepintvl; /* interval between keepalives */ u_int t_keepcnt; /* number of keepalives before close */ u_int t_tsomax; /* TSO total burst length limit in bytes */ u_int t_tsomaxsegcount; /* TSO maximum segment count */ u_int t_tsomaxsegsize; /* TSO maximum segment size in bytes */ u_int t_flags2; /* More tcpcb flags storage */ int t_logstate; /* State of "black box" logging */ struct tcp_log_stailq t_logs; /* Log buffer */ int t_lognum; /* Number of log entries */ uint32_t t_logsn; /* Log "serial number" */ struct tcp_log_id_node *t_lin; struct tcp_log_id_bucket *t_lib; const char *t_output_caller; /* Function that called tcp_output */ struct tcp_function_block *t_fb;/* TCP function call block */ void *t_fb_ptr; /* Pointer to t_fb specific data */ uint8_t t_tfo_client_cookie_len; /* TCP Fast Open client cookie length */ unsigned int *t_tfo_pending; /* TCP Fast Open server pending counter */ union { uint8_t client[TCP_FASTOPEN_MAX_COOKIE_LEN]; uint64_t server; } t_tfo_cookie; /* TCP Fast Open cookie to send */ #ifdef TCPPCAP struct mbufq t_inpkts; /* List of saved input packets. */ struct mbufq t_outpkts; /* List of saved output packets.
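 * Both queues are drained in tcp_discardcb(), and tcp_drain() also
 * empties them under mbuf pressure when tcp_pcap_aggressive_free is
 * set (see tcp_subr.c).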
*/ #endif }; #endif /* _KERNEL || _WANT_TCPCB */ #ifdef _KERNEL struct tcptemp { u_char tt_ipgen[40]; /* the size must be of max ip header, now IPv6 */ struct tcphdr tt_t; }; /* * TODO: We yet need to brave plowing in * to tcp_input() and the pru_usrreq() block. * Right now these go to the old standards which * are somewhat ok, but in the long term may * need to be changed. If we do tackle tcp_input() * then we need to get rid of the tcp_do_segment() * function below. */ /* Flags for tcp functions */ #define TCP_FUNC_BEING_REMOVED 0x01 /* Can no longer be referenced */ /* * If defining the optional tcp_timers, in the * tfb_tcp_timer_stop call you must use the * callout_async_drain() function with the * tcp_timer_discard callback. You should check * the return of callout_async_drain() and if 0 * increment tt_draincnt. Since the timer sub-system * does not know your callbacks you must provide a * stop_all function that loops through and calls * tcp_timer_stop() with each of your defined timers. * Adding a tfb_tcp_handoff_ok function allows the socket * option to change stacks to query you even if the * connection is in a later stage. You return 0 to * say you can take over and run your stack, you return * non-zero (an error number) to say no you can't. * If the function is undefined you can only change * in the early states (before connect or listen). * tfb_tcp_fb_fini is changed to add a flag to tell * the old stack if the tcb is being destroyed or * not. A one in the flag means the TCB is being * destroyed, a zero indicates its transitioning to * another stack (via socket option). */ struct tcp_function_block { char tfb_tcp_block_name[TCP_FUNCTION_NAME_LEN_MAX]; int (*tfb_tcp_output)(struct tcpcb *); void (*tfb_tcp_do_segment)(struct mbuf *, struct tcphdr *, struct socket *, struct tcpcb *, int, int, uint8_t, int); int (*tfb_tcp_ctloutput)(struct socket *so, struct sockopt *sopt, struct inpcb *inp, struct tcpcb *tp); /* Optional memory allocation/free routine */ void (*tfb_tcp_fb_init)(struct tcpcb *); void (*tfb_tcp_fb_fini)(struct tcpcb *, int); /* Optional timers, must define all if you define one */ int (*tfb_tcp_timer_stop_all)(struct tcpcb *); void (*tfb_tcp_timer_activate)(struct tcpcb *, uint32_t, u_int); int (*tfb_tcp_timer_active)(struct tcpcb *, uint32_t); void (*tfb_tcp_timer_stop)(struct tcpcb *, uint32_t); void (*tfb_tcp_rexmit_tmr)(struct tcpcb *); int (*tfb_tcp_handoff_ok)(struct tcpcb *); volatile uint32_t tfb_refcnt; uint32_t tfb_flags; uint8_t tfb_id; }; struct tcp_function { TAILQ_ENTRY(tcp_function) tf_next; char tf_name[TCP_FUNCTION_NAME_LEN_MAX]; struct tcp_function_block *tf_fb; }; TAILQ_HEAD(tcp_funchead, tcp_function); #endif /* _KERNEL */ /* * Flags and utility macros for the t_flags field. 
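 * The flags are plain bits in t_flags: test with
 * (tp->t_flags & TF_ACKNOW), set with "|=", clear with "&= ~"; the
 * IN_/ENTER_/EXIT_*RECOVERY() macros below wrap the two recovery
 * bits so callers cannot get them out of sync.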
*/ #define TF_ACKNOW 0x000001 /* ack peer immediately */ #define TF_DELACK 0x000002 /* ack, but try to delay it */ #define TF_NODELAY 0x000004 /* don't delay packets to coalesce */ #define TF_NOOPT 0x000008 /* don't use tcp options */ #define TF_SENTFIN 0x000010 /* have sent FIN */ #define TF_REQ_SCALE 0x000020 /* have/will request window scaling */ #define TF_RCVD_SCALE 0x000040 /* other side has requested scaling */ #define TF_REQ_TSTMP 0x000080 /* have/will request timestamps */ #define TF_RCVD_TSTMP 0x000100 /* a timestamp was received in SYN */ #define TF_SACK_PERMIT 0x000200 /* other side said I could SACK */ #define TF_NEEDSYN 0x000400 /* send SYN (implicit state) */ #define TF_NEEDFIN 0x000800 /* send FIN (implicit state) */ #define TF_NOPUSH 0x001000 /* don't push */ #define TF_PREVVALID 0x002000 /* saved values for bad rxmit valid */ #define TF_MORETOCOME 0x010000 /* More data to be appended to sock */ #define TF_LQ_OVERFLOW 0x020000 /* listen queue overflow */ #define TF_LASTIDLE 0x040000 /* connection was previously idle */ #define TF_RXWIN0SENT 0x080000 /* sent a receiver win 0 in response */ #define TF_FASTRECOVERY 0x100000 /* in NewReno Fast Recovery */ #define TF_WASFRECOVERY 0x200000 /* was in NewReno Fast Recovery */ #define TF_SIGNATURE 0x400000 /* require MD5 digests (RFC2385) */ #define TF_FORCEDATA 0x800000 /* force out a byte */ #define TF_TSO 0x1000000 /* TSO enabled on this connection */ #define TF_TOE 0x2000000 /* this connection is offloaded */ #define TF_ECN_PERMIT 0x4000000 /* connection ECN-ready */ #define TF_ECN_SND_CWR 0x8000000 /* ECN CWR in queue */ #define TF_ECN_SND_ECE 0x10000000 /* ECN ECE in queue */ #define TF_CONGRECOVERY 0x20000000 /* congestion recovery mode */ #define TF_WASCRECOVERY 0x40000000 /* was in congestion recovery */ #define TF_FASTOPEN 0x80000000 /* TCP Fast Open indication */ #define IN_FASTRECOVERY(t_flags) (t_flags & TF_FASTRECOVERY) #define ENTER_FASTRECOVERY(t_flags) t_flags |= TF_FASTRECOVERY #define EXIT_FASTRECOVERY(t_flags) t_flags &= ~TF_FASTRECOVERY #define IN_CONGRECOVERY(t_flags) (t_flags & TF_CONGRECOVERY) #define ENTER_CONGRECOVERY(t_flags) t_flags |= TF_CONGRECOVERY #define EXIT_CONGRECOVERY(t_flags) t_flags &= ~TF_CONGRECOVERY #define IN_RECOVERY(t_flags) (t_flags & (TF_CONGRECOVERY | TF_FASTRECOVERY)) #define ENTER_RECOVERY(t_flags) t_flags |= (TF_CONGRECOVERY | TF_FASTRECOVERY) #define EXIT_RECOVERY(t_flags) t_flags &= ~(TF_CONGRECOVERY | TF_FASTRECOVERY) #if defined(_KERNEL) && !defined(TCP_RFC7413) #define IS_FASTOPEN(t_flags) (false) #else #define IS_FASTOPEN(t_flags) (t_flags & TF_FASTOPEN) #endif #define BYTES_THIS_ACK(tp, th) (th->th_ack - tp->snd_una) /* * Flags for the t_oobflags field. */ #define TCPOOB_HAVEDATA 0x01 #define TCPOOB_HADDATA 0x02 /* * Flags for the extended TCP flags field, t_flags2 */ #define TF2_PLPMTU_BLACKHOLE 0x00000001 /* Possible PLPMTUD Black Hole. */ #define TF2_PLPMTU_PMTUD 0x00000002 /* Allowed to attempt PLPMTUD. */ #define TF2_PLPMTU_MAXSEGSNT 0x00000004 /* Last seg sent was full seg. */ #define TF2_LOG_AUTO 0x00000008 /* Session is auto-logging. */ /* * Structure to hold TCP options that are only used during segment * processing (in tcp_input), but not held in the tcpcb. * It's basically used to reduce the number of parameters * to tcp_dooptions and tcp_addoptions. * The binary order of the to_flags is relevant for packing of the * options in tcp_addoptions. 
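 * (Concretely: tcp_addoptions() walks to_flags from the lowest bit
 * upward, so with the values below an MSS option is always emitted
 * before window scale, SACK-permitted, timestamps, and so on;
 * renumbering the TOF_* bits would reorder options on the wire.)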
*/ struct tcpopt { u_int32_t to_flags; /* which options are present */ #define TOF_MSS 0x0001 /* maximum segment size */ #define TOF_SCALE 0x0002 /* window scaling */ #define TOF_SACKPERM 0x0004 /* SACK permitted */ #define TOF_TS 0x0010 /* timestamp */ #define TOF_SIGNATURE 0x0040 /* TCP-MD5 signature option (RFC2385) */ #define TOF_SACK 0x0080 /* Peer sent SACK option */ #define TOF_FASTOPEN 0x0100 /* TCP Fast Open (TFO) cookie */ #define TOF_MAXOPT 0x0200 u_int32_t to_tsval; /* new timestamp */ u_int32_t to_tsecr; /* reflected timestamp */ u_char *to_sacks; /* pointer to the first SACK blocks */ u_char *to_signature; /* pointer to the TCP-MD5 signature */ u_int8_t *to_tfo_cookie; /* pointer to the TFO cookie */ u_int16_t to_mss; /* maximum segment size */ u_int8_t to_wscale; /* window scaling */ u_int8_t to_nsacks; /* number of SACK blocks */ u_int8_t to_tfo_len; /* TFO cookie length */ u_int32_t to_spare; /* UTO */ }; /* * Flags for tcp_dooptions. */ #define TO_SYN 0x01 /* parse SYN-only options */ struct hc_metrics_lite { /* must stay in sync with hc_metrics */ uint32_t rmx_mtu; /* MTU for this path */ uint32_t rmx_ssthresh; /* outbound gateway buffer limit */ uint32_t rmx_rtt; /* estimated round trip time */ uint32_t rmx_rttvar; /* estimated rtt variance */ uint32_t rmx_cwnd; /* congestion window */ uint32_t rmx_sendpipe; /* outbound delay-bandwidth product */ uint32_t rmx_recvpipe; /* inbound delay-bandwidth product */ }; /* * Used by tcp_maxmtu() to communicate interface specific features * and limits at the time of connection setup. */ struct tcp_ifcap { int ifcap; u_int tsomax; u_int tsomaxsegcount; u_int tsomaxsegsize; }; #ifndef _NETINET_IN_PCB_H_ struct in_conninfo; #endif /* _NETINET_IN_PCB_H_ */ struct tcptw { struct inpcb *tw_inpcb; /* XXX back pointer to internet pcb */ tcp_seq snd_nxt; tcp_seq rcv_nxt; tcp_seq iss; tcp_seq irs; u_short last_win; /* cached window value */ short tw_so_options; /* copy of so_options */ struct ucred *tw_cred; /* user credentials */ u_int32_t t_recent; u_int32_t ts_offset; /* our timestamp offset */ u_int t_starttime; int tw_time; TAILQ_ENTRY(tcptw) tw_2msl; void *tw_pspare; /* TCP_SIGNATURE */ u_int *tw_spare; /* TCP_SIGNATURE */ }; #define intotcpcb(ip) ((struct tcpcb *)(ip)->inp_ppcb) #define intotw(ip) ((struct tcptw *)(ip)->inp_ppcb) #define sototcpcb(so) (intotcpcb(sotoinpcb(so))) /* * The smoothed round-trip time and estimated variance * are stored as fixed point numbers scaled by the values below. * For convenience, these scales are also used in smoothing the average * (smoothed = (1/scale)sample + ((scale-1)/scale)smoothed). * With these scales, srtt has 5 bits to the right of the binary point * and is smoothed with a gain of 1/8 (an "ALPHA" of 0.875); rttvar has * 4 bits to the right of the binary point and is smoothed with a gain * of 1/4 (an ALPHA of 0.75). */ #define TCP_RTT_SCALE 32 /* multiplier for srtt; 5 bits frac. */ #define TCP_RTT_SHIFT 5 /* shift for srtt; 5 bits frac. */ #define TCP_RTTVAR_SCALE 16 /* multiplier for rttvar; 4 bits */ #define TCP_RTTVAR_SHIFT 4 /* shift for rttvar; 4 bits */ #define TCP_DELTA_SHIFT 2 /* see tcp_input.c */ /* * The initial retransmission should happen at rtt + 4 * rttvar. * Because of the way we do the smoothing, srtt and rttvar * will each average +1/2 tick of bias. When we compute * the retransmit timer, we want 1/2 tick of rounding and * 1 extra tick because of +-1/2 tick uncertainty in the * firing of the timer. The bias will give us exactly the * 1.5 tick we need.
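 * (Worked example of TCP_REXMTVAL() below: with t_srtt kept as
 * ticks << TCP_RTT_SHIFT and t_rttvar as ticks << TCP_RTTVAR_SHIFT,
 *
 *	((t_srtt >> (5 - 2)) + t_rttvar) >> 2
 *	    = ((rtt << 2) + (rttvar << 4)) >> 2
 *	    = rtt + 4 * rttvar		(in ticks)
 *
 * which is exactly the "rtt + 4 * rttvar" above.)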
But, because the bias is * statistical, we have to test that we don't drop below * the minimum feasible timer (which is 2 ticks). * This version of the macro is adapted from a paper by Lawrence * Brakmo and Larry Peterson which outlines a problem caused * by insufficient precision in the original implementation, * which results in inappropriately large RTO values for very * fast networks. */ #define TCP_REXMTVAL(tp) \ max((tp)->t_rttmin, (((tp)->t_srtt >> (TCP_RTT_SHIFT - TCP_DELTA_SHIFT)) \ + (tp)->t_rttvar) >> TCP_DELTA_SHIFT) /* * TCP statistics. * Many of these should be kept per connection, * but that's inconvenient at the moment. */ struct tcpstat { uint64_t tcps_connattempt; /* connections initiated */ uint64_t tcps_accepts; /* connections accepted */ uint64_t tcps_connects; /* connections established */ uint64_t tcps_drops; /* connections dropped */ uint64_t tcps_conndrops; /* embryonic connections dropped */ uint64_t tcps_minmssdrops; /* average minmss too low drops */ uint64_t tcps_closed; /* conn. closed (includes drops) */ uint64_t tcps_segstimed; /* segs where we tried to get rtt */ uint64_t tcps_rttupdated; /* times we succeeded */ uint64_t tcps_delack; /* delayed acks sent */ uint64_t tcps_timeoutdrop; /* conn. dropped in rxmt timeout */ uint64_t tcps_rexmttimeo; /* retransmit timeouts */ uint64_t tcps_persisttimeo; /* persist timeouts */ uint64_t tcps_keeptimeo; /* keepalive timeouts */ uint64_t tcps_keepprobe; /* keepalive probes sent */ uint64_t tcps_keepdrops; /* connections dropped in keepalive */ uint64_t tcps_sndtotal; /* total packets sent */ uint64_t tcps_sndpack; /* data packets sent */ uint64_t tcps_sndbyte; /* data bytes sent */ uint64_t tcps_sndrexmitpack; /* data packets retransmitted */ uint64_t tcps_sndrexmitbyte; /* data bytes retransmitted */ uint64_t tcps_sndrexmitbad; /* unnecessary packet retransmissions */ uint64_t tcps_sndacks; /* ack-only packets sent */ uint64_t tcps_sndprobe; /* window probes sent */ uint64_t tcps_sndurg; /* packets sent with URG only */ uint64_t tcps_sndwinup; /* window update-only packets sent */ uint64_t tcps_sndctrl; /* control (SYN|FIN|RST) packets sent */ uint64_t tcps_rcvtotal; /* total packets received */ uint64_t tcps_rcvpack; /* packets received in sequence */ uint64_t tcps_rcvbyte; /* bytes received in sequence */ uint64_t tcps_rcvbadsum; /* packets received with cksum errs */ uint64_t tcps_rcvbadoff; /* packets received with bad offset */ uint64_t tcps_rcvreassfull; /* packets dropped for no reass space */ uint64_t tcps_rcvshort; /* packets received too short */ uint64_t tcps_rcvduppack; /* duplicate-only packets received */ uint64_t tcps_rcvdupbyte; /* duplicate-only bytes received */ uint64_t tcps_rcvpartduppack; /* packets with some duplicate data */ uint64_t tcps_rcvpartdupbyte; /* dup. bytes in part-dup.
packets */ uint64_t tcps_rcvoopack; /* out-of-order packets received */ uint64_t tcps_rcvoobyte; /* out-of-order bytes received */ uint64_t tcps_rcvpackafterwin; /* packets with data after window */ uint64_t tcps_rcvbyteafterwin; /* bytes rcvd after window */ uint64_t tcps_rcvafterclose; /* packets rcvd after "close" */ uint64_t tcps_rcvwinprobe; /* rcvd window probe packets */ uint64_t tcps_rcvdupack; /* rcvd duplicate acks */ uint64_t tcps_rcvacktoomuch; /* rcvd acks for unsent data */ uint64_t tcps_rcvackpack; /* rcvd ack packets */ uint64_t tcps_rcvackbyte; /* bytes acked by rcvd acks */ uint64_t tcps_rcvwinupd; /* rcvd window update packets */ uint64_t tcps_pawsdrop; /* segments dropped due to PAWS */ uint64_t tcps_predack; /* times hdr predict ok for acks */ uint64_t tcps_preddat; /* times hdr predict ok for data pkts */ uint64_t tcps_pcbcachemiss; uint64_t tcps_cachedrtt; /* times cached RTT in route updated */ uint64_t tcps_cachedrttvar; /* times cached rttvar updated */ uint64_t tcps_cachedssthresh; /* times cached ssthresh updated */ uint64_t tcps_usedrtt; /* times RTT initialized from route */ uint64_t tcps_usedrttvar; /* times RTTVAR initialized from rt */ uint64_t tcps_usedssthresh; /* times ssthresh initialized from rt*/ uint64_t tcps_persistdrop; /* timeout in persist state */ uint64_t tcps_badsyn; /* bogus SYN, e.g. premature ACK */ uint64_t tcps_mturesent; /* resends due to MTU discovery */ uint64_t tcps_listendrop; /* listen queue overflows */ uint64_t tcps_badrst; /* ignored RSTs in the window */ uint64_t tcps_sc_added; /* entry added to syncache */ uint64_t tcps_sc_retransmitted; /* syncache entry was retransmitted */ uint64_t tcps_sc_dupsyn; /* duplicate SYN packet */ uint64_t tcps_sc_dropped; /* could not reply to packet */ uint64_t tcps_sc_completed; /* successful extraction of entry */ uint64_t tcps_sc_bucketoverflow;/* syncache per-bucket limit hit */ uint64_t tcps_sc_cacheoverflow; /* syncache cache limit hit */ uint64_t tcps_sc_reset; /* RST removed entry from syncache */ uint64_t tcps_sc_stale; /* timed out or listen socket gone */ uint64_t tcps_sc_aborted; /* syncache entry aborted */ uint64_t tcps_sc_badack; /* removed due to bad ACK */ uint64_t tcps_sc_unreach; /* ICMP unreachable received */ uint64_t tcps_sc_zonefail; /* zalloc() failed */ uint64_t tcps_sc_sendcookie; /* SYN cookie sent */ uint64_t tcps_sc_recvcookie; /* SYN cookie received */ uint64_t tcps_hc_added; /* entry added to hostcache */ uint64_t tcps_hc_bucketoverflow;/* hostcache per bucket limit hit */ uint64_t tcps_finwait2_drops; /* Drop FIN_WAIT_2 connection after time limit */ /* SACK related stats */ uint64_t tcps_sack_recovery_episode; /* SACK recovery episodes */ uint64_t tcps_sack_rexmits; /* SACK rexmit segments */ uint64_t tcps_sack_rexmit_bytes; /* SACK rexmit bytes */ uint64_t tcps_sack_rcv_blocks; /* SACK blocks (options) received */ uint64_t tcps_sack_send_blocks; /* SACK blocks (options) sent */ uint64_t tcps_sack_sboverflow; /* times scoreboard overflowed */ /* ECN related stats */ uint64_t tcps_ecn_ce; /* ECN Congestion Experienced */ uint64_t tcps_ecn_ect0; /* ECN Capable Transport */ uint64_t tcps_ecn_ect1; /* ECN Capable Transport */ uint64_t tcps_ecn_shs; /* ECN successful handshakes */ uint64_t tcps_ecn_rcwnd; /* # times ECN reduced the cwnd */ /* TCP_SIGNATURE related stats */ uint64_t tcps_sig_rcvgoodsig; /* Total matching signature received */ uint64_t tcps_sig_rcvbadsig; /* Total bad signature received */ uint64_t tcps_sig_err_buildsig; /* Failed to make signature */ 
uint64_t tcps_sig_err_sigopt; /* No signature expected by socket */ uint64_t tcps_sig_err_nosigopt; /* No signature provided by segment */ /* Path MTU Discovery Black Hole Detection related stats */ uint64_t tcps_pmtud_blackhole_activated; /* Black Hole Count */ uint64_t tcps_pmtud_blackhole_activated_min_mss; /* BH at min MSS Count */ uint64_t tcps_pmtud_blackhole_failed; /* Black Hole Failure Count */ uint64_t _pad[12]; /* 6 UTO, 6 TBD */ }; #define tcps_rcvmemdrop tcps_rcvreassfull /* compat */ #ifdef _KERNEL #define TI_UNLOCKED 1 #define TI_RLOCKED 2 #include VNET_PCPUSTAT_DECLARE(struct tcpstat, tcpstat); /* tcp statistics */ /* * In-kernel consumers can use these accessor macros directly to update * stats. */ #define TCPSTAT_ADD(name, val) \ VNET_PCPUSTAT_ADD(struct tcpstat, tcpstat, name, (val)) #define TCPSTAT_INC(name) TCPSTAT_ADD(name, 1) /* * Kernel module consumers must use this accessor macro. */ void kmod_tcpstat_inc(int statnum); #define KMOD_TCPSTAT_INC(name) \ kmod_tcpstat_inc(offsetof(struct tcpstat, name) / sizeof(uint64_t)) /* * Running TCP connection count by state. */ VNET_DECLARE(counter_u64_t, tcps_states[TCP_NSTATES]); #define V_tcps_states VNET(tcps_states) #define TCPSTATES_INC(state) counter_u64_add(V_tcps_states[state], 1) #define TCPSTATES_DEC(state) counter_u64_add(V_tcps_states[state], -1) /* * TCP specific helper hook point identifiers. */ #define HHOOK_TCP_EST_IN 0 #define HHOOK_TCP_EST_OUT 1 #define HHOOK_TCP_LAST HHOOK_TCP_EST_OUT struct tcp_hhook_data { struct tcpcb *tp; struct tcphdr *th; struct tcpopt *to; uint32_t len; int tso; tcp_seq curack; }; #endif /* * TCB structure exported to user-land via sysctl(3). * * Fields prefixed with "xt_" are unique to the export structure, and fields * with "t_" or other prefixes match corresponding fields of 'struct tcpcb'. * * Legend: * (s) - used by userland utilities in src * (p) - used by utilities in ports * (3) - is known to be used by third party software not in ports * (n) - no known usage * * Evil hack: declare only if in_pcb.h and sys/socketvar.h have been * included. Not all of our clients do. */ #if defined(_NETINET_IN_PCB_H_) && defined(_SYS_SOCKETVAR_H_) struct xtcpcb { size_t xt_len; /* length of this structure */ struct xinpcb xt_inp; char xt_stack[TCP_FUNCTION_NAME_LEN_MAX]; /* (s) */ char xt_logid[TCP_LOG_ID_LEN]; /* (s) */ int64_t spare64[8]; int32_t t_state; /* (s,p) */ uint32_t t_flags; /* (s,p) */ int32_t t_sndzerowin; /* (s) */ int32_t t_sndrexmitpack; /* (s) */ int32_t t_rcvoopack; /* (s) */ int32_t t_rcvtime; /* (s) */ int32_t tt_rexmt; /* (s) */ int32_t tt_persist; /* (s) */ int32_t tt_keep; /* (s) */ int32_t tt_2msl; /* (s) */ int32_t tt_delack; /* (s) */ int32_t t_logstate; /* (3) */ int32_t spare32[32]; } __aligned(8); #ifdef _KERNEL void tcp_inptoxtp(const struct inpcb *, struct xtcpcb *); #endif #endif /* - * TCP function name-to-id mapping exported to user-land via sysctl(3). + * TCP function information (name-to-id mapping, aliases, and refcnt) + * exported to user-land via sysctl(3). 
*/ -struct tcp_function_id { +struct tcp_function_info { + uint32_t tfi_refcnt; uint8_t tfi_id; char tfi_name[TCP_FUNCTION_NAME_LEN_MAX]; + char tfi_alias[TCP_FUNCTION_NAME_LEN_MAX]; }; /* * Identifiers for TCP sysctl nodes */ #define TCPCTL_DO_RFC1323 1 /* use RFC-1323 extensions */ #define TCPCTL_MSSDFLT 3 /* MSS default */ #define TCPCTL_STATS 4 /* statistics */ #define TCPCTL_RTTDFLT 5 /* default RTT estimate */ #define TCPCTL_KEEPIDLE 6 /* keepalive idle timer */ #define TCPCTL_KEEPINTVL 7 /* interval to send keepalives */ #define TCPCTL_SENDSPACE 8 /* send buffer space */ #define TCPCTL_RECVSPACE 9 /* receive buffer space */ #define TCPCTL_KEEPINIT 10 /* timeout for establishing syn */ #define TCPCTL_PCBLIST 11 /* list of all outstanding PCBs */ #define TCPCTL_DELACKTIME 12 /* time before sending delayed ACK */ #define TCPCTL_V6MSSDFLT 13 /* MSS default for IPv6 */ #define TCPCTL_SACK 14 /* Selective Acknowledgement,rfc 2018 */ #define TCPCTL_DROP 15 /* drop tcp connection */ #define TCPCTL_STATES 16 /* connection counts by TCP state */ #ifdef _KERNEL #ifdef SYSCTL_DECL SYSCTL_DECL(_net_inet_tcp); SYSCTL_DECL(_net_inet_tcp_sack); MALLOC_DECLARE(M_TCPLOG); #endif extern int tcp_log_in_vain; /* * Global TCP tunables shared between different stacks. * Please keep the list sorted. */ VNET_DECLARE(int, drop_synfin); VNET_DECLARE(int, path_mtu_discovery); VNET_DECLARE(int, tcp_abc_l_var); VNET_DECLARE(int, tcp_autorcvbuf_inc); VNET_DECLARE(int, tcp_autorcvbuf_max); VNET_DECLARE(int, tcp_autosndbuf_inc); VNET_DECLARE(int, tcp_autosndbuf_max); VNET_DECLARE(int, tcp_delack_enabled); VNET_DECLARE(int, tcp_do_autorcvbuf); VNET_DECLARE(int, tcp_do_autosndbuf); VNET_DECLARE(int, tcp_do_ecn); VNET_DECLARE(int, tcp_do_rfc1323); VNET_DECLARE(int, tcp_do_rfc3042); VNET_DECLARE(int, tcp_do_rfc3390); VNET_DECLARE(int, tcp_do_rfc3465); VNET_DECLARE(int, tcp_do_rfc6675_pipe); VNET_DECLARE(int, tcp_do_sack); VNET_DECLARE(int, tcp_do_tso); VNET_DECLARE(int, tcp_ecn_maxretries); VNET_DECLARE(int, tcp_initcwnd_segments); VNET_DECLARE(int, tcp_insecure_rst); VNET_DECLARE(int, tcp_insecure_syn); VNET_DECLARE(int, tcp_minmss); VNET_DECLARE(int, tcp_mssdflt); VNET_DECLARE(int, tcp_recvspace); VNET_DECLARE(int, tcp_sack_globalholes); VNET_DECLARE(int, tcp_sack_globalmaxholes); VNET_DECLARE(int, tcp_sack_maxholes); VNET_DECLARE(int, tcp_sc_rst_sock_fail); VNET_DECLARE(int, tcp_sendspace); VNET_DECLARE(struct inpcbhead, tcb); VNET_DECLARE(struct inpcbinfo, tcbinfo); #define V_drop_synfin VNET(drop_synfin) #define V_path_mtu_discovery VNET(path_mtu_discovery) #define V_tcb VNET(tcb) #define V_tcbinfo VNET(tcbinfo) #define V_tcp_abc_l_var VNET(tcp_abc_l_var) #define V_tcp_autorcvbuf_inc VNET(tcp_autorcvbuf_inc) #define V_tcp_autorcvbuf_max VNET(tcp_autorcvbuf_max) #define V_tcp_autosndbuf_inc VNET(tcp_autosndbuf_inc) #define V_tcp_autosndbuf_max VNET(tcp_autosndbuf_max) #define V_tcp_delack_enabled VNET(tcp_delack_enabled) #define V_tcp_do_autorcvbuf VNET(tcp_do_autorcvbuf) #define V_tcp_do_autosndbuf VNET(tcp_do_autosndbuf) #define V_tcp_do_ecn VNET(tcp_do_ecn) #define V_tcp_do_rfc1323 VNET(tcp_do_rfc1323) #define V_tcp_do_rfc3042 VNET(tcp_do_rfc3042) #define V_tcp_do_rfc3390 VNET(tcp_do_rfc3390) #define V_tcp_do_rfc3465 VNET(tcp_do_rfc3465) #define V_tcp_do_rfc6675_pipe VNET(tcp_do_rfc6675_pipe) #define V_tcp_do_sack VNET(tcp_do_sack) #define V_tcp_do_tso VNET(tcp_do_tso) #define V_tcp_ecn_maxretries VNET(tcp_ecn_maxretries) #define V_tcp_initcwnd_segments VNET(tcp_initcwnd_segments) #define V_tcp_insecure_rst 
VNET(tcp_insecure_rst) #define V_tcp_insecure_syn VNET(tcp_insecure_syn) #define V_tcp_minmss VNET(tcp_minmss) #define V_tcp_mssdflt VNET(tcp_mssdflt) #define V_tcp_recvspace VNET(tcp_recvspace) #define V_tcp_sack_globalholes VNET(tcp_sack_globalholes) #define V_tcp_sack_globalmaxholes VNET(tcp_sack_globalmaxholes) #define V_tcp_sack_maxholes VNET(tcp_sack_maxholes) #define V_tcp_sc_rst_sock_fail VNET(tcp_sc_rst_sock_fail) #define V_tcp_sendspace VNET(tcp_sendspace) #ifdef TCP_HHOOK VNET_DECLARE(struct hhook_head *, tcp_hhh[HHOOK_TCP_LAST + 1]); #define V_tcp_hhh VNET(tcp_hhh) #endif int tcp_addoptions(struct tcpopt *, u_char *); int tcp_ccalgounload(struct cc_algo *unload_algo); struct tcpcb * tcp_close(struct tcpcb *); void tcp_discardcb(struct tcpcb *); void tcp_twstart(struct tcpcb *); void tcp_twclose(struct tcptw *, int); void tcp_ctlinput(int, struct sockaddr *, void *); int tcp_ctloutput(struct socket *, struct sockopt *); struct tcpcb * tcp_drop(struct tcpcb *, int); void tcp_drain(void); void tcp_init(void); void tcp_fini(void *); char *tcp_log_addrs(struct in_conninfo *, struct tcphdr *, void *, const void *); char *tcp_log_vain(struct in_conninfo *, struct tcphdr *, void *, const void *); int tcp_reass(struct tcpcb *, struct tcphdr *, int *, struct mbuf *); void tcp_reass_global_init(void); void tcp_reass_flush(struct tcpcb *); void tcp_dooptions(struct tcpopt *, u_char *, int, int); void tcp_dropwithreset(struct mbuf *, struct tcphdr *, struct tcpcb *, int, int); void tcp_pulloutofband(struct socket *, struct tcphdr *, struct mbuf *, int); void tcp_xmit_timer(struct tcpcb *, int); void tcp_newreno_partial_ack(struct tcpcb *, struct tcphdr *); void cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t nsegs, uint16_t type); void cc_conn_init(struct tcpcb *tp); void cc_post_recovery(struct tcpcb *tp, struct tcphdr *th); void cc_cong_signal(struct tcpcb *tp, struct tcphdr *th, uint32_t type); #ifdef TCP_HHOOK void hhook_run_tcp_est_in(struct tcpcb *tp, struct tcphdr *th, struct tcpopt *to); #endif int tcp_input(struct mbuf **, int *, int); int tcp_autorcvbuf(struct mbuf *, struct tcphdr *, struct socket *, struct tcpcb *, int); void tcp_do_segment(struct mbuf *, struct tcphdr *, struct socket *, struct tcpcb *, int, int, uint8_t, int); int register_tcp_functions(struct tcp_function_block *blk, int wait); int register_tcp_functions_as_names(struct tcp_function_block *blk, int wait, const char *names[], int *num_names); int register_tcp_functions_as_name(struct tcp_function_block *blk, const char *name, int wait); int deregister_tcp_functions(struct tcp_function_block *blk); struct tcp_function_block *find_and_ref_tcp_functions(struct tcp_function_set *fs); struct tcp_function_block *find_and_ref_tcp_fb(struct tcp_function_block *blk); int tcp_default_ctloutput(struct socket *so, struct sockopt *sopt, struct inpcb *inp, struct tcpcb *tp); uint32_t tcp_maxmtu(struct in_conninfo *, struct tcp_ifcap *); uint32_t tcp_maxmtu6(struct in_conninfo *, struct tcp_ifcap *); u_int tcp_maxseg(const struct tcpcb *); void tcp_mss_update(struct tcpcb *, int, int, struct hc_metrics_lite *, struct tcp_ifcap *); void tcp_mss(struct tcpcb *, int); int tcp_mssopt(struct in_conninfo *); struct inpcb * tcp_drop_syn_sent(struct inpcb *, int); struct tcpcb * tcp_newtcpcb(struct inpcb *); int tcp_output(struct tcpcb *); void tcp_state_change(struct tcpcb *, int); void tcp_respond(struct tcpcb *, void *, struct tcphdr *, struct mbuf *, tcp_seq, tcp_seq, int); void tcp_tw_init(void); #ifdef VIMAGE 
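/* Destroy per-vnet TIME_WAIT state on vnet teardown (VIMAGE kernels only). */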
void tcp_tw_destroy(void); #endif void tcp_tw_zone_change(void); int tcp_twcheck(struct inpcb *, struct tcpopt *, struct tcphdr *, struct mbuf *, int); void tcp_setpersist(struct tcpcb *); void tcp_slowtimo(void); struct tcptemp * tcpip_maketemplate(struct inpcb *); void tcpip_fillheaders(struct inpcb *, void *, void *); void tcp_timer_activate(struct tcpcb *, uint32_t, u_int); int tcp_timer_active(struct tcpcb *, uint32_t); void tcp_timer_stop(struct tcpcb *, uint32_t); void tcp_trace(short, short, struct tcpcb *, void *, struct tcphdr *, int); /* * All tcp_hc_* functions are IPv4 and IPv6 (via in_conninfo) */ void tcp_hc_init(void); #ifdef VIMAGE void tcp_hc_destroy(void); #endif void tcp_hc_get(struct in_conninfo *, struct hc_metrics_lite *); uint32_t tcp_hc_getmtu(struct in_conninfo *); void tcp_hc_updatemtu(struct in_conninfo *, uint32_t); void tcp_hc_update(struct in_conninfo *, struct hc_metrics_lite *); extern struct pr_usrreqs tcp_usrreqs; tcp_seq tcp_new_isn(struct tcpcb *); int tcp_sack_doack(struct tcpcb *, struct tcpopt *, tcp_seq); void tcp_update_sack_list(struct tcpcb *tp, tcp_seq rcv_laststart, tcp_seq rcv_lastend); void tcp_clean_sackreport(struct tcpcb *tp); void tcp_sack_adjust(struct tcpcb *tp); struct sackhole *tcp_sack_output(struct tcpcb *tp, int *sack_bytes_rexmt); void tcp_sack_partialack(struct tcpcb *, struct tcphdr *); void tcp_free_sackholes(struct tcpcb *tp); int tcp_newreno(struct tcpcb *, struct tcphdr *); int tcp_compute_pipe(struct tcpcb *); void tcp_sndbuf_autoscale(struct tcpcb *, struct socket *, uint32_t); static inline void tcp_fields_to_host(struct tcphdr *th) { th->th_seq = ntohl(th->th_seq); th->th_ack = ntohl(th->th_ack); th->th_win = ntohs(th->th_win); th->th_urp = ntohs(th->th_urp); } static inline void tcp_fields_to_net(struct tcphdr *th) { th->th_seq = htonl(th->th_seq); th->th_ack = htonl(th->th_ack); th->th_win = htons(th->th_win); th->th_urp = htons(th->th_urp); } #endif /* _KERNEL */ #endif /* _NETINET_TCP_VAR_H_ */ Index: user/markj/netdump/sys/netpfil/ipfw/ip_fw_table.c =================================================================== --- user/markj/netdump/sys/netpfil/ipfw/ip_fw_table.c (revision 332407) +++ user/markj/netdump/sys/netpfil/ipfw/ip_fw_table.c (revision 332408) @@ -1,3362 +1,3362 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004 Ruslan Ermilov and Vsevolod Lobko. * Copyright (c) 2014 Yandex LLC * Copyright (c) 2014 Alexander V. Chernikov * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); /* * Lookup table support for ipfw. * * This file contains handlers for all generic tables' operations: * add/del/flush entries, list/dump tables etc.. * * Table data modification is protected by both UH and runtime lock * while reading configuration/data is protected by UH lock. * * Lookup algorithms for all table types are located in ip_fw_table_algo.c */ #include "opt_ipfw.h" #include #include #include #include #include #include #include #include #include #include #include /* ip_fw.h requires IFNAMSIZ */ #include #include #include /* struct ipfw_rule_ref */ #include #include #include /* * Table has the following `type` concepts: * * `no.type` represents lookup key type (addr, ifp, uid, etc..) * vmask represents bitmask of table values which are present at the moment. * Special IPFW_VTYPE_LEGACY ( (uint32_t)-1 ) represents old * single-value-for-all approach. */ struct table_config { struct named_object no; uint8_t tflags; /* type flags */ uint8_t locked; /* 1 if locked from changes */ uint8_t linked; /* 1 if already linked */ uint8_t ochanged; /* used by set swapping */ uint8_t vshared; /* 1 if using shared value array */ uint8_t spare[3]; uint32_t count; /* Number of records */ uint32_t limit; /* Max number of records */ uint32_t vmask; /* bitmask with supported values */ uint32_t ocount; /* used by set swapping */ uint64_t gencnt; /* generation count */ char tablename[64]; /* table name */ struct table_algo *ta; /* Callbacks for given algo */ void *astate; /* algorithm state */ struct table_info ti_copy; /* data to put to table_info */ struct namedobj_instance *vi; }; static int find_table_err(struct namedobj_instance *ni, struct tid_info *ti, struct table_config **tc); static struct table_config *find_table(struct namedobj_instance *ni, struct tid_info *ti); static struct table_config *alloc_table_config(struct ip_fw_chain *ch, struct tid_info *ti, struct table_algo *ta, char *adata, uint8_t tflags); static void free_table_config(struct namedobj_instance *ni, struct table_config *tc); static int create_table_internal(struct ip_fw_chain *ch, struct tid_info *ti, char *aname, ipfw_xtable_info *i, uint16_t *pkidx, int ref); static void link_table(struct ip_fw_chain *ch, struct table_config *tc); static void unlink_table(struct ip_fw_chain *ch, struct table_config *tc); static int find_ref_table(struct ip_fw_chain *ch, struct tid_info *ti, struct tentry_info *tei, uint32_t count, int op, struct table_config **ptc); #define OP_ADD 1 #define OP_DEL 0 static int export_tables(struct ip_fw_chain *ch, ipfw_obj_lheader *olh, struct sockopt_data *sd); static void export_table_info(struct ip_fw_chain *ch, struct table_config *tc, ipfw_xtable_info *i); static int dump_table_tentry(void *e, void *arg); static int dump_table_xentry(void *e, void *arg); static int swap_tables(struct ip_fw_chain *ch, struct tid_info *a, struct tid_info *b); static int check_table_name(const char *name); static int check_table_space(struct ip_fw_chain *ch, struct 
tableop_state *ts, struct table_config *tc, struct table_info *ti, uint32_t count); static int destroy_table(struct ip_fw_chain *ch, struct tid_info *ti); static struct table_algo *find_table_algo(struct tables_config *tableconf, struct tid_info *ti, char *name); static void objheader_to_ti(struct _ipfw_obj_header *oh, struct tid_info *ti); static void ntlv_to_ti(struct _ipfw_obj_ntlv *ntlv, struct tid_info *ti); #define CHAIN_TO_NI(chain) (CHAIN_TO_TCFG(chain)->namehash) #define KIDX_TO_TI(ch, k) (&(((struct table_info *)(ch)->tablestate)[k])) #define TA_BUF_SZ 128 /* On-stack buffer for add/delete state */ void rollback_toperation_state(struct ip_fw_chain *ch, void *object) { struct tables_config *tcfg; struct op_state *os; tcfg = CHAIN_TO_TCFG(ch); TAILQ_FOREACH(os, &tcfg->state_list, next) os->func(object, os); } void add_toperation_state(struct ip_fw_chain *ch, struct tableop_state *ts) { struct tables_config *tcfg; tcfg = CHAIN_TO_TCFG(ch); TAILQ_INSERT_HEAD(&tcfg->state_list, &ts->opstate, next); } void del_toperation_state(struct ip_fw_chain *ch, struct tableop_state *ts) { struct tables_config *tcfg; tcfg = CHAIN_TO_TCFG(ch); TAILQ_REMOVE(&tcfg->state_list, &ts->opstate, next); } void tc_ref(struct table_config *tc) { tc->no.refcnt++; } void tc_unref(struct table_config *tc) { tc->no.refcnt--; } static struct table_value * get_table_value(struct ip_fw_chain *ch, struct table_config *tc, uint32_t kidx) { struct table_value *pval; pval = (struct table_value *)ch->valuestate; return (&pval[kidx]); } /* * Checks if we're able to insert/update entry @tei into table * w.r.t. @tc limits. * May alter @tei to indicate insertion error / insert * options. * * Returns 0 if operation can be performed. */ static int check_table_limit(struct table_config *tc, struct tentry_info *tei) { if (tc->limit == 0 || tc->count < tc->limit) return (0); if ((tei->flags & TEI_FLAGS_UPDATE) == 0) { /* Notify userland on error cause */ tei->flags |= TEI_FLAGS_LIMIT; return (EFBIG); } /* * We have UPDATE flag set. * Permit updating record (if found), * but restrict adding new one since we've * already hit the limit. */ tei->flags |= TEI_FLAGS_DONTADD; return (0); } /* * Convert algorithm callback return code into * one of pre-defined states known by userland. */ static void store_tei_result(struct tentry_info *tei, int op, int error, uint32_t num) { int flag; flag = 0; switch (error) { case 0: if (op == OP_ADD && num != 0) flag = TEI_FLAGS_ADDED; if (op == OP_DEL) flag = TEI_FLAGS_DELETED; break; case ENOENT: flag = TEI_FLAGS_NOTFOUND; break; case EEXIST: flag = TEI_FLAGS_EXISTS; break; default: flag = TEI_FLAGS_ERROR; } tei->flags |= flag; } /* * Creates and references table with default parameters. * Saves table config, algo and allocated kidx into @ptc, @pta and * @pkidx if non-zero. * Used for table auto-creation to support old binaries. * * Returns 0 on success. */ static int create_table_compat(struct ip_fw_chain *ch, struct tid_info *ti, uint16_t *pkidx) { ipfw_xtable_info xi; int error; memset(&xi, 0, sizeof(xi)); /* Set default value mask for legacy clients */ xi.vmask = IPFW_VTYPE_LEGACY; error = create_table_internal(ch, ti, NULL, &xi, pkidx, 1); if (error != 0) return (error); return (0); } /* * Find and reference an existing table, optionally * creating a new one. * * Saves found table config into @ptc. * Note function may drop/acquire UH_WLOCK. * Returns 0 if table was found/created and referenced, * or a non-zero error code otherwise.
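 *
 * Caller pattern sketch (mirrors add_table_entry() below): the caller
 * holds UH_WLOCK across the call and later drops the reference itself:
 *
 *	IPFW_UH_WLOCK(ch);
 *	error = find_ref_table(ch, ti, tei, count, OP_ADD, &tc);
 *	if (error != 0) {
 *		IPFW_UH_WUNLOCK(ch);
 *		return (error);
 *	}
 *	...
 *	tc->no.refcnt--;	/* drop reference taken above */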
*/ static int find_ref_table(struct ip_fw_chain *ch, struct tid_info *ti, struct tentry_info *tei, uint32_t count, int op, struct table_config **ptc) { struct namedobj_instance *ni; struct table_config *tc; uint16_t kidx; int error; IPFW_UH_WLOCK_ASSERT(ch); ni = CHAIN_TO_NI(ch); tc = NULL; if ((tc = find_table(ni, ti)) != NULL) { /* check table type */ if (tc->no.subtype != ti->type) return (EINVAL); if (tc->locked != 0) return (EACCES); /* Try to exit early on limit hit */ if (op == OP_ADD && count == 1 && check_table_limit(tc, tei) != 0) return (EFBIG); /* Reference and return */ tc->no.refcnt++; *ptc = tc; return (0); } if (op == OP_DEL) return (ESRCH); /* Compatibility mode: create new table for old clients */ if ((tei->flags & TEI_FLAGS_COMPAT) == 0) return (ESRCH); IPFW_UH_WUNLOCK(ch); error = create_table_compat(ch, ti, &kidx); IPFW_UH_WLOCK(ch); if (error != 0) return (error); tc = (struct table_config *)ipfw_objhash_lookup_kidx(ni, kidx); KASSERT(tc != NULL, ("create_table_compat returned bad idx %d", kidx)); /* OK, now we've got referenced table. */ *ptc = tc; return (0); } /* * Rolls back the @added entries already inserted into @tc, using state array @ta_buf_m. * Assume the following layout: * 1) ADD state (ta_buf_m[0] ... ta_buf_m[added - 1]) for handling update cases * 2) DEL state (ta_buf_m[count] ... ta_buf_m[count + added - 1]) * for storing deleted state */ static void rollback_added_entries(struct ip_fw_chain *ch, struct table_config *tc, struct table_info *tinfo, struct tentry_info *tei, caddr_t ta_buf_m, uint32_t count, uint32_t added) { struct table_algo *ta; struct tentry_info *ptei; caddr_t v, vv; size_t ta_buf_sz; int error, i; uint32_t num; IPFW_UH_WLOCK_ASSERT(ch); ta = tc->ta; ta_buf_sz = ta->ta_buf_size; v = ta_buf_m; vv = v + count * ta_buf_sz; for (i = 0; i < added; i++, v += ta_buf_sz, vv += ta_buf_sz) { ptei = &tei[i]; if ((ptei->flags & TEI_FLAGS_UPDATED) != 0) { /* * We have old value stored by previous * call in @ptei->value. Do add once again * to restore it. */ error = ta->add(tc->astate, tinfo, ptei, v, &num); KASSERT(error == 0, ("rollback UPDATE fail")); KASSERT(num == 0, ("rollback UPDATE fail2")); continue; } error = ta->prepare_del(ch, ptei, vv); KASSERT(error == 0, ("pre-rollback INSERT failed")); error = ta->del(tc->astate, tinfo, ptei, vv, &num); KASSERT(error == 0, ("rollback INSERT failed")); tc->count -= num; } } /* * Prepares add/del state for all @count entries in @tei. * Uses either stack buffer (@ta_buf) or allocates a new one. * Stores pointer to allocated buffer back to @ta_buf. * * Returns 0 on success. */ static int prepare_batch_buffer(struct ip_fw_chain *ch, struct table_algo *ta, struct tentry_info *tei, uint32_t count, int op, caddr_t *ta_buf) { caddr_t ta_buf_m, v; size_t ta_buf_sz, sz; struct tentry_info *ptei; int error, i; error = 0; ta_buf_sz = ta->ta_buf_size; if (count == 1) { /* Single add/delete, use on-stack buffer */ memset(*ta_buf, 0, TA_BUF_SZ); ta_buf_m = *ta_buf; } else { /* * Multiple adds/deletes, allocate larger buffer * * Note we need 2xcount buffer for add case: * we have to hold both ADD state * and DELETE state (this may be needed * if we need to roll back all changes) */ sz = count * ta_buf_sz; ta_buf_m = malloc((op == OP_ADD) ? sz * 2 : sz, M_TEMP, M_WAITOK | M_ZERO); } v = ta_buf_m; for (i = 0; i < count; i++, v += ta_buf_sz) { ptei = &tei[i]; error = (op == OP_ADD) ? ta->prepare_add(ch, ptei, v) : ta->prepare_del(ch, ptei, v); /* * Some syntax error (incorrect mask, or address, or * anything).
Return error regardless of atomicity * settings. */ if (error != 0) break; } *ta_buf = ta_buf_m; return (error); } /* * Flushes allocated state for each of the @count entries in @tei. * Frees @ta_buf_m if it differs from the stack buffer @ta_buf. */ static void flush_batch_buffer(struct ip_fw_chain *ch, struct table_algo *ta, struct tentry_info *tei, uint32_t count, int rollback, caddr_t ta_buf_m, caddr_t ta_buf) { caddr_t v; struct tentry_info *ptei; size_t ta_buf_sz; int i; ta_buf_sz = ta->ta_buf_size; /* Run cleaning callback anyway */ v = ta_buf_m; for (i = 0; i < count; i++, v += ta_buf_sz) { ptei = &tei[i]; ta->flush_entry(ch, ptei, v); if (ptei->ptv != NULL) { free(ptei->ptv, M_IPFW); ptei->ptv = NULL; } } /* Clean up "deleted" state in case of rollback */ if (rollback != 0) { v = ta_buf_m + count * ta_buf_sz; for (i = 0; i < count; i++, v += ta_buf_sz) ta->flush_entry(ch, &tei[i], v); } if (ta_buf_m != ta_buf) free(ta_buf_m, M_TEMP); } static void rollback_add_entry(void *object, struct op_state *_state) { struct ip_fw_chain *ch; struct tableop_state *ts; ts = (struct tableop_state *)_state; if (ts->tc != object && ts->ch != object) return; ch = ts->ch; IPFW_UH_WLOCK_ASSERT(ch); /* Call specified unlockers */ rollback_table_values(ts); /* Indicate we've called */ ts->modified = 1; } /* * Adds/updates one or more entries in table @ti. * * Function may drop/reacquire UH wlock multiple times due to * items alloc, algorithm callbacks (check_space), value linkage * (new values, value storage realloc), etc. * Other processes like other adds (which may involve storage resize), * table swaps (which change table data and may change algo type), * table modify (which may change value mask) may be executed * simultaneously so we need to deal with it. * * The following approach was implemented: * we have a per-chain linked list, protected with UH lock. * add_table_entry prepares a special on-stack structure which is passed * to its descendants. Users add this structure to this list before unlock. * After performing needed operations and acquiring UH lock back, each user * checks if structure has changed. If true, it rolls local state back and * returns without error to the caller. * add_table_entry() on its own checks if structure has changed and restarts * its operation from the beginning (goto restart). * * Functions which modify fields of interest (currently * resize_shared_value_storage() and swap_tables()) * traverse the given list while holding UH lock immediately before * performing their operations, calling the function provided by the list entry * (currently rollback_add_entry) which performs rollback for all necessary * state and sets appropriate values in the structure indicating rollback * has happened. * * Algo interaction: * Function references @ti first to ensure table won't * disappear or change its type. * After that, prepare_add callback is called for each @tei entry. * Next, we try to add each entry under UH+WLOCK * using add() callback. * Finally, we free all state by calling flush_entry callback * for each @tei. * * Returns 0 on success. */ int add_table_entry(struct ip_fw_chain *ch, struct tid_info *ti, struct tentry_info *tei, uint8_t flags, uint32_t count) { struct table_config *tc; struct table_algo *ta; uint16_t kidx; int error, first_error, i, rollback; uint32_t num, numadd; struct tentry_info *ptei; struct tableop_state ts; char ta_buf[TA_BUF_SZ]; caddr_t ta_buf_m, v; memset(&ts, 0, sizeof(ts)); ta = NULL; IPFW_UH_WLOCK(ch); /* * Find and reference existing table.
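 * If a concurrent swap or resize ran rollback_add_entry() while we
 * slept, ts.modified is set and the "goto restart" paths below land
 * back at the restart: label that follows, rebuilding all local state.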
*/ restart: if (ts.modified != 0) { IPFW_UH_WUNLOCK(ch); flush_batch_buffer(ch, ta, tei, count, rollback, ta_buf_m, ta_buf); memset(&ts, 0, sizeof(ts)); ta = NULL; IPFW_UH_WLOCK(ch); } error = find_ref_table(ch, ti, tei, count, OP_ADD, &tc); if (error != 0) { IPFW_UH_WUNLOCK(ch); return (error); } ta = tc->ta; /* Fill in tablestate */ ts.ch = ch; ts.opstate.func = rollback_add_entry; ts.tc = tc; ts.vshared = tc->vshared; ts.vmask = tc->vmask; ts.ta = ta; ts.tei = tei; ts.count = count; rollback = 0; add_toperation_state(ch, &ts); IPFW_UH_WUNLOCK(ch); /* Allocate memory and prepare record(s) */ /* Pass stack buffer by default */ ta_buf_m = ta_buf; error = prepare_batch_buffer(ch, ta, tei, count, OP_ADD, &ta_buf_m); IPFW_UH_WLOCK(ch); del_toperation_state(ch, &ts); /* Drop reference we've used in first search */ tc->no.refcnt--; /* Check prepare_batch_buffer() error */ if (error != 0) goto cleanup; /* * Check if table swap has happened. * (so table algo might be changed). * Restart operation to achieve consistent behavior. */ if (ts.modified != 0) goto restart; /* * Link all values to the shared/per-table value array. * * May release/reacquire UH_WLOCK. */ error = ipfw_link_table_values(ch, &ts); if (error != 0) goto cleanup; if (ts.modified != 0) goto restart; /* * Ensure we are able to add all entries without additional * memory allocations. May release/reacquire UH_WLOCK. */ kidx = tc->no.kidx; error = check_table_space(ch, &ts, tc, KIDX_TO_TI(ch, kidx), count); if (error != 0) goto cleanup; if (ts.modified != 0) goto restart; /* We've got valid table in @tc. Let's try to add data */ kidx = tc->no.kidx; ta = tc->ta; numadd = 0; first_error = 0; IPFW_WLOCK(ch); v = ta_buf_m; for (i = 0; i < count; i++, v += ta->ta_buf_size) { ptei = &tei[i]; num = 0; /* check limit before adding */ if ((error = check_table_limit(tc, ptei)) == 0) { error = ta->add(tc->astate, KIDX_TO_TI(ch, kidx), ptei, v, &num); /* Set status flag to inform userland */ store_tei_result(ptei, OP_ADD, error, num); } if (error == 0) { /* Update number of records to ease limit checking */ tc->count += num; numadd += num; continue; } if (first_error == 0) first_error = error; /* * Some error has happened. Check our atomicity * settings: continue if atomicity is not required, * roll back changes otherwise. */ if ((flags & IPFW_CTF_ATOMIC) == 0) continue; rollback_added_entries(ch, tc, KIDX_TO_TI(ch, kidx), tei, ta_buf_m, count, i); rollback = 1; break; } IPFW_WUNLOCK(ch); ipfw_garbage_table_values(ch, tc, tei, count, rollback); /* Permit post-add algorithm grow/rehash. */ if (numadd != 0) check_table_space(ch, NULL, tc, KIDX_TO_TI(ch, kidx), 0); /* Return first error to user, if any */ error = first_error; cleanup: IPFW_UH_WUNLOCK(ch); flush_batch_buffer(ch, ta, tei, count, rollback, ta_buf_m, ta_buf); return (error); } /* * Deletes one or more entries in table @ti. * * Returns 0 on success. */ int del_table_entry(struct ip_fw_chain *ch, struct tid_info *ti, struct tentry_info *tei, uint8_t flags, uint32_t count) { struct table_config *tc; struct table_algo *ta; struct tentry_info *ptei; uint16_t kidx; int error, first_error, i; uint32_t num, numdel; char ta_buf[TA_BUF_SZ]; caddr_t ta_buf_m, v; /* * Find and reference existing table.
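 * (The delete path needs no restart loop: a concurrent table swap is
 * instead detected below by comparing the saved @ta with tc->ta.)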
*/ IPFW_UH_WLOCK(ch); error = find_ref_table(ch, ti, tei, count, OP_DEL, &tc); if (error != 0) { IPFW_UH_WUNLOCK(ch); return (error); } ta = tc->ta; IPFW_UH_WUNLOCK(ch); /* Allocate memory and prepare record(s) */ /* Pass stack buffer by default */ ta_buf_m = ta_buf; error = prepare_batch_buffer(ch, ta, tei, count, OP_DEL, &ta_buf_m); if (error != 0) goto cleanup; IPFW_UH_WLOCK(ch); /* Drop reference we've used in first search */ tc->no.refcnt--; /* * Check if table algo is still the same. * (a changed ta may be the result of a table swap). */ if (ta != tc->ta) { IPFW_UH_WUNLOCK(ch); error = EINVAL; goto cleanup; } kidx = tc->no.kidx; numdel = 0; first_error = 0; IPFW_WLOCK(ch); v = ta_buf_m; for (i = 0; i < count; i++, v += ta->ta_buf_size) { ptei = &tei[i]; num = 0; error = ta->del(tc->astate, KIDX_TO_TI(ch, kidx), ptei, v, &num); /* Save state for userland */ store_tei_result(ptei, OP_DEL, error, num); if (error != 0 && first_error == 0) first_error = error; tc->count -= num; numdel += num; } IPFW_WUNLOCK(ch); /* Unlink unused values */ ipfw_garbage_table_values(ch, tc, tei, count, 0); if (numdel != 0) { /* Run post-del hook to permit shrinking */ check_table_space(ch, NULL, tc, KIDX_TO_TI(ch, kidx), 0); } IPFW_UH_WUNLOCK(ch); /* Return first error to user, if any */ error = first_error; cleanup: flush_batch_buffer(ch, ta, tei, count, 0, ta_buf_m, ta_buf); return (error); } /* * Ensure that table @tc has enough space to add @count entries without * need for reallocation. * * Callbacks order: * 0) need_modify() (UH_WLOCK) - checks if @count items can be added w/o resize. * * 1) alloc_modify (no locks, M_WAITOK) - alloc new state based on @pflags. * 2) prepare_modify (UH_WLOCK) - copy old data into new storage * 3) modify (UH_WLOCK + WLOCK) - switch pointers * 4) flush_modify (UH_WLOCK) - free state, if needed * * Returns 0 on success. */ static int check_table_space(struct ip_fw_chain *ch, struct tableop_state *ts, struct table_config *tc, struct table_info *ti, uint32_t count) { struct table_algo *ta; uint64_t pflags; char ta_buf[TA_BUF_SZ]; int error; IPFW_UH_WLOCK_ASSERT(ch); error = 0; ta = tc->ta; if (ta->need_modify == NULL) return (0); /* Acquire reference not to lose @tc between locks/unlocks */ tc->no.refcnt++; /* * TODO: think about avoiding race between large add/large delete * operation on algorithm which implements shrinking along with * growing. */ while (true) { pflags = 0; if (ta->need_modify(tc->astate, ti, count, &pflags) == 0) { error = 0; break; } /* We have to shrink/grow table */ if (ts != NULL) add_toperation_state(ch, ts); IPFW_UH_WUNLOCK(ch); memset(&ta_buf, 0, sizeof(ta_buf)); error = ta->prepare_mod(ta_buf, &pflags); IPFW_UH_WLOCK(ch); if (ts != NULL) del_toperation_state(ch, ts); if (error != 0) break; if (ts != NULL && ts->modified != 0) { /* * Swap operation has happened * so we're currently operating on other * table data. Stop doing this. */ ta->flush_mod(ta_buf); break; } /* Check if we still need to alter table */ ti = KIDX_TO_TI(ch, tc->no.kidx); if (ta->need_modify(tc->astate, ti, count, &pflags) == 0) { IPFW_UH_WUNLOCK(ch); /* * Other thread has already performed resize. * Flush our state and return. */ ta->flush_mod(ta_buf); break; } error = ta->fill_mod(tc->astate, ti, ta_buf, &pflags); if (error == 0) { /* Do actual modification */ IPFW_WLOCK(ch); ta->modify(tc->astate, ti, ta_buf, pflags); IPFW_WUNLOCK(ch); } /* Anyway, flush data and retry */ ta->flush_mod(ta_buf); } tc->no.refcnt--; return (error); } /* * Adds or deletes record in table.
* Data layout (v0): * Request: [ ip_fw3_opheader ipfw_table_xentry ] * * Returns 0 on success */ static int manage_table_ent_v0(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { ipfw_table_xentry *xent; struct tentry_info tei; struct tid_info ti; struct table_value v; int error, hdrlen, read; hdrlen = offsetof(ipfw_table_xentry, k); /* Check minimum header size */ if (sd->valsize < (sizeof(*op3) + hdrlen)) return (EINVAL); read = sizeof(ip_fw3_opheader); /* Check if xentry len field is valid */ xent = (ipfw_table_xentry *)(op3 + 1); if (xent->len < hdrlen || xent->len + read > sd->valsize) return (EINVAL); memset(&tei, 0, sizeof(tei)); tei.paddr = &xent->k; tei.masklen = xent->masklen; ipfw_import_table_value_legacy(xent->value, &v); tei.pvalue = &v; /* Old requests compatibility */ tei.flags = TEI_FLAGS_COMPAT; if (xent->type == IPFW_TABLE_ADDR) { if (xent->len - hdrlen == sizeof(in_addr_t)) tei.subtype = AF_INET; else tei.subtype = AF_INET6; } memset(&ti, 0, sizeof(ti)); ti.uidx = xent->tbl; ti.type = xent->type; error = (op3->opcode == IP_FW_TABLE_XADD) ? add_table_entry(ch, &ti, &tei, 0, 1) : del_table_entry(ch, &ti, &tei, 0, 1); return (error); } /* * Adds or deletes record in table. * Data layout (v1)(current): * Request: [ ipfw_obj_header * ipfw_obj_ctlv(IPFW_TLV_TBLENT_LIST) [ ipfw_obj_tentry x N ] * ] * * Returns 0 on success */ static int manage_table_ent_v1(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { ipfw_obj_tentry *tent, *ptent; ipfw_obj_ctlv *ctlv; ipfw_obj_header *oh; struct tentry_info *ptei, tei, *tei_buf; struct tid_info ti; int error, i, kidx, read; /* Check minimum header size */ if (sd->valsize < (sizeof(*oh) + sizeof(*ctlv))) return (EINVAL); /* Check if passed data is too long */ if (sd->valsize != sd->kavail) return (EINVAL); oh = (ipfw_obj_header *)sd->kbuf; /* Basic length checks for TLVs */ if (oh->ntlv.head.length != sizeof(oh->ntlv)) return (EINVAL); read = sizeof(*oh); ctlv = (ipfw_obj_ctlv *)(oh + 1); if (ctlv->head.length + read != sd->valsize) return (EINVAL); read += sizeof(*ctlv); tent = (ipfw_obj_tentry *)(ctlv + 1); if (ctlv->count * sizeof(*tent) + read != sd->valsize) return (EINVAL); if (ctlv->count == 0) return (0); /* * Mark entire buffer as "read". * This instructs the sopt API to write it back * after the function returns. */ ipfw_get_sopt_header(sd, sd->valsize); /* Perform basic checks for each entry */ ptent = tent; kidx = tent->idx; for (i = 0; i < ctlv->count; i++, ptent++) { if (ptent->head.length != sizeof(*ptent)) return (EINVAL); if (ptent->idx != kidx) return (ENOTSUP); } /* Convert data into kernel request objects */ objheader_to_ti(oh, &ti); ti.type = oh->ntlv.type; ti.uidx = kidx; /* Use on-stack buffer for single add/del */ if (ctlv->count == 1) { memset(&tei, 0, sizeof(tei)); tei_buf = &tei; } else tei_buf = malloc(ctlv->count * sizeof(tei), M_TEMP, M_WAITOK | M_ZERO); ptei = tei_buf; ptent = tent; for (i = 0; i < ctlv->count; i++, ptent++, ptei++) { ptei->paddr = &ptent->k; ptei->subtype = ptent->subtype; ptei->masklen = ptent->masklen; if (ptent->head.flags & IPFW_TF_UPDATE) ptei->flags |= TEI_FLAGS_UPDATE; ipfw_import_table_value_v1(&ptent->v.value); ptei->pvalue = (struct table_value *)&ptent->v.value; } error = (oh->opheader.opcode == IP_FW_TABLE_XADD) ?
add_table_entry(ch, &ti, tei_buf, ctlv->flags, ctlv->count) : del_table_entry(ch, &ti, tei_buf, ctlv->flags, ctlv->count); /* Translate result back to userland */ ptei = tei_buf; ptent = tent; for (i = 0; i < ctlv->count; i++, ptent++, ptei++) { if (ptei->flags & TEI_FLAGS_ADDED) ptent->result = IPFW_TR_ADDED; else if (ptei->flags & TEI_FLAGS_DELETED) ptent->result = IPFW_TR_DELETED; else if (ptei->flags & TEI_FLAGS_UPDATED) ptent->result = IPFW_TR_UPDATED; else if (ptei->flags & TEI_FLAGS_LIMIT) ptent->result = IPFW_TR_LIMIT; else if (ptei->flags & TEI_FLAGS_ERROR) ptent->result = IPFW_TR_ERROR; else if (ptei->flags & TEI_FLAGS_NOTFOUND) ptent->result = IPFW_TR_NOTFOUND; else if (ptei->flags & TEI_FLAGS_EXISTS) ptent->result = IPFW_TR_EXISTS; ipfw_export_table_value_v1(ptei->pvalue, &ptent->v.value); } if (tei_buf != &tei) free(tei_buf, M_TEMP); return (error); } /* * Looks up an entry in given table. * Data layout (v0)(current): * Request: [ ipfw_obj_header ipfw_obj_tentry ] * Reply: [ ipfw_obj_header ipfw_obj_tentry ] * * Returns 0 on success */ static int find_table_entry(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { ipfw_obj_tentry *tent; ipfw_obj_header *oh; struct tid_info ti; struct table_config *tc; struct table_algo *ta; struct table_info *kti; struct table_value *pval; struct namedobj_instance *ni; int error; size_t sz; /* Check minimum header size */ sz = sizeof(*oh) + sizeof(*tent); if (sd->valsize != sz) return (EINVAL); oh = (struct _ipfw_obj_header *)ipfw_get_sopt_header(sd, sz); tent = (ipfw_obj_tentry *)(oh + 1); /* Basic length checks for TLVs */ if (oh->ntlv.head.length != sizeof(oh->ntlv)) return (EINVAL); objheader_to_ti(oh, &ti); ti.type = oh->ntlv.type; ti.uidx = tent->idx; IPFW_UH_RLOCK(ch); ni = CHAIN_TO_NI(ch); /* * Find existing table and check its type. */ ta = NULL; if ((tc = find_table(ni, &ti)) == NULL) { IPFW_UH_RUNLOCK(ch); return (ESRCH); } /* check table type */ if (tc->no.subtype != ti.type) { IPFW_UH_RUNLOCK(ch); return (EINVAL); } kti = KIDX_TO_TI(ch, tc->no.kidx); ta = tc->ta; if (ta->find_tentry == NULL) return (ENOTSUP); error = ta->find_tentry(tc->astate, kti, tent); if (error == 0) { pval = get_table_value(ch, tc, tent->v.kidx); ipfw_export_table_value_v1(pval, &tent->v.value); } IPFW_UH_RUNLOCK(ch); return (error); } /* * Flushes all entries or destroys given table. * Data layout (v0)(current): * Request: [ ipfw_obj_header ] * * Returns 0 on success */ static int flush_table_v0(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { int error; struct _ipfw_obj_header *oh; struct tid_info ti; if (sd->valsize != sizeof(*oh)) return (EINVAL); oh = (struct _ipfw_obj_header *)op3; objheader_to_ti(oh, &ti); if (op3->opcode == IP_FW_TABLE_XDESTROY) error = destroy_table(ch, &ti); else if (op3->opcode == IP_FW_TABLE_XFLUSH) error = flush_table(ch, &ti); else return (ENOTSUP); return (error); } static void restart_flush(void *object, struct op_state *_state) { struct tableop_state *ts; ts = (struct tableop_state *)_state; if (ts->tc != object) return; /* Indicate we've called */ ts->modified = 1; } /* * Flushes given table. * * Function creates a new table instance with the same * parameters, swaps it with the old one and * flushes state without holding runtime WLOCK. * * Returns 0 on success.
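 *
 * Invocation sketch, as done by flush_table_v0() above:
 *
 *	objheader_to_ti(oh, &ti);
 *	error = flush_table(ch, &ti);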
*/ int flush_table(struct ip_fw_chain *ch, struct tid_info *ti) { struct namedobj_instance *ni; struct table_config *tc; struct table_algo *ta; struct table_info ti_old, ti_new, *tablestate; void *astate_old, *astate_new; char algostate[64], *pstate; struct tableop_state ts; int error, need_gc; uint16_t kidx; uint8_t tflags; /* * Stage 1: save table algorithm. * Reference found table to ensure it won't disappear. */ IPFW_UH_WLOCK(ch); ni = CHAIN_TO_NI(ch); if ((tc = find_table(ni, ti)) == NULL) { IPFW_UH_WUNLOCK(ch); return (ESRCH); } need_gc = 0; astate_new = NULL; memset(&ti_new, 0, sizeof(ti_new)); restart: /* Set up swap handler */ memset(&ts, 0, sizeof(ts)); ts.opstate.func = restart_flush; ts.tc = tc; ta = tc->ta; /* Do not flush readonly tables */ if ((ta->flags & TA_FLAG_READONLY) != 0) { IPFW_UH_WUNLOCK(ch); return (EACCES); } /* Save startup algo parameters */ if (ta->print_config != NULL) { ta->print_config(tc->astate, KIDX_TO_TI(ch, tc->no.kidx), algostate, sizeof(algostate)); pstate = algostate; } else pstate = NULL; tflags = tc->tflags; tc->no.refcnt++; add_toperation_state(ch, &ts); IPFW_UH_WUNLOCK(ch); /* * Stage 1.5: if this is not the first attempt, destroy previous state */ if (need_gc != 0) { ta->destroy(astate_new, &ti_new); need_gc = 0; } /* * Stage 2: allocate new table instance using same algo. */ memset(&ti_new, 0, sizeof(struct table_info)); error = ta->init(ch, &astate_new, &ti_new, pstate, tflags); /* * Stage 3: swap old state pointers with newly-allocated ones. * Decrease refcount. */ IPFW_UH_WLOCK(ch); tc->no.refcnt--; del_toperation_state(ch, &ts); if (error != 0) { IPFW_UH_WUNLOCK(ch); return (error); } /* * Restart operation if table swap has happened: * even if algo may be the same, algo init parameters * may change. Restart operation instead of doing * complex checks. */ if (ts.modified != 0) { /* Delay destroying data since we're holding UH lock */ need_gc = 1; goto restart; } ni = CHAIN_TO_NI(ch); kidx = tc->no.kidx; tablestate = (struct table_info *)ch->tablestate; IPFW_WLOCK(ch); ti_old = tablestate[kidx]; tablestate[kidx] = ti_new; IPFW_WUNLOCK(ch); astate_old = tc->astate; tc->astate = astate_new; tc->ti_copy = ti_new; tc->count = 0; /* Notify algo on real @ti address */ if (ta->change_ti != NULL) ta->change_ti(tc->astate, &tablestate[kidx]); /* * Stage 4: unref values. */ ipfw_unref_table_values(ch, tc, ta, astate_old, &ti_old); IPFW_UH_WUNLOCK(ch); /* * Stage 5: perform real flush/destroy. */ ta->destroy(astate_old, &ti_old); return (0); } /* * Swaps two tables. * Data layout (v0)(current): * Request: [ ipfw_obj_header ipfw_obj_ntlv ] * * Returns 0 on success */ static int swap_table(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { int error; struct _ipfw_obj_header *oh; struct tid_info ti_a, ti_b; if (sd->valsize != sizeof(*oh) + sizeof(ipfw_obj_ntlv)) return (EINVAL); oh = (struct _ipfw_obj_header *)op3; ntlv_to_ti(&oh->ntlv, &ti_a); ntlv_to_ti((ipfw_obj_ntlv *)(oh + 1), &ti_b); error = swap_tables(ch, &ti_a, &ti_b); return (error); } /* * Swaps two tables of the same type/valtype. * * Checks if tables are compatible and limits * permit the swap, then actually performs the swap. * * Each table consists of 2 different parts: * config: * @tc (with name, set, kidx) and rule bindings, which is "stable".
* number of items * table algo * runtime: * runtime data @ti (ch->tablestate) * runtime cache in @tc * algo-specific data (@tc->astate) * * So we switch: * all runtime data * number of items * table algo * * After that we call @ti change handler for each table. * * Note that referencing @tc won't protect tc->ta from change. * XXX: Do we need to restrict swap between locked tables? * XXX: Do we need to exchange ftype? * * Returns 0 on success. */ static int swap_tables(struct ip_fw_chain *ch, struct tid_info *a, struct tid_info *b) { struct namedobj_instance *ni; struct table_config *tc_a, *tc_b; struct table_algo *ta; struct table_info ti, *tablestate; void *astate; uint32_t count; /* * Stage 1: find both tables and ensure they are of * the same type. */ IPFW_UH_WLOCK(ch); ni = CHAIN_TO_NI(ch); if ((tc_a = find_table(ni, a)) == NULL) { IPFW_UH_WUNLOCK(ch); return (ESRCH); } if ((tc_b = find_table(ni, b)) == NULL) { IPFW_UH_WUNLOCK(ch); return (ESRCH); } /* It is very easy to swap between the same table */ if (tc_a == tc_b) { IPFW_UH_WUNLOCK(ch); return (0); } /* Check type and value are the same */ if (tc_a->no.subtype!=tc_b->no.subtype || tc_a->tflags!=tc_b->tflags) { IPFW_UH_WUNLOCK(ch); return (EINVAL); } /* Check limits before swap */ if ((tc_a->limit != 0 && tc_b->count > tc_a->limit) || (tc_b->limit != 0 && tc_a->count > tc_b->limit)) { IPFW_UH_WUNLOCK(ch); return (EFBIG); } /* Check if one of the tables is readonly */ if (((tc_a->ta->flags | tc_b->ta->flags) & TA_FLAG_READONLY) != 0) { IPFW_UH_WUNLOCK(ch); return (EACCES); } /* Notify we're going to swap */ rollback_toperation_state(ch, tc_a); rollback_toperation_state(ch, tc_b); /* Everything is fine, prepare to swap */ tablestate = (struct table_info *)ch->tablestate; ti = tablestate[tc_a->no.kidx]; ta = tc_a->ta; astate = tc_a->astate; count = tc_a->count; IPFW_WLOCK(ch); /* a <- b */ tablestate[tc_a->no.kidx] = tablestate[tc_b->no.kidx]; tc_a->ta = tc_b->ta; tc_a->astate = tc_b->astate; tc_a->count = tc_b->count; /* b <- a */ tablestate[tc_b->no.kidx] = ti; tc_b->ta = ta; tc_b->astate = astate; tc_b->count = count; IPFW_WUNLOCK(ch); /* Ensure tc.ti copies are in sync */ tc_a->ti_copy = tablestate[tc_a->no.kidx]; tc_b->ti_copy = tablestate[tc_b->no.kidx]; /* Notify both tables on @ti change */ if (tc_a->ta->change_ti != NULL) tc_a->ta->change_ti(tc_a->astate, &tablestate[tc_a->no.kidx]); if (tc_b->ta->change_ti != NULL) tc_b->ta->change_ti(tc_b->astate, &tablestate[tc_b->no.kidx]); IPFW_UH_WUNLOCK(ch); return (0); } /* * Destroys table specified by @ti. * Data layout (v0)(current): * Request: [ ip_fw3_opheader ] * * Returns 0 on success */ static int destroy_table(struct ip_fw_chain *ch, struct tid_info *ti) { struct namedobj_instance *ni; struct table_config *tc; IPFW_UH_WLOCK(ch); ni = CHAIN_TO_NI(ch); if ((tc = find_table(ni, ti)) == NULL) { IPFW_UH_WUNLOCK(ch); return (ESRCH); } /* Do not permit destroying referenced tables */ if (tc->no.refcnt > 0) { IPFW_UH_WUNLOCK(ch); return (EBUSY); } IPFW_WLOCK(ch); unlink_table(ch, tc); IPFW_WUNLOCK(ch); /* Free obj index */ if (ipfw_objhash_free_idx(ni, tc->no.kidx) != 0) printf("Error unlinking kidx %d from table %s\n", tc->no.kidx, tc->tablename); /* Unref values used in tables while holding UH lock */ ipfw_unref_table_values(ch, tc, tc->ta, tc->astate, &tc->ti_copy); IPFW_UH_WUNLOCK(ch); free_table_config(ni, tc); return (0); } static uint32_t roundup2p(uint32_t v) { v--; v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; v++; return (v); } /* * Grow tables index. 
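 *
 * Worked example: a request for 1000 tables is first clamped to
 * IPFW_TABLES_MAX, then rounded up to the next power of two by
 * roundup2p() above (1000 -> 1024) before the new table_info array
 * is allocated.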
* * Returns 0 on success. */ int ipfw_resize_tables(struct ip_fw_chain *ch, unsigned int ntables) { unsigned int ntables_old, tbl; struct namedobj_instance *ni; void *new_idx, *old_tablestate, *tablestate; struct table_info *ti; struct table_config *tc; int i, new_blocks; /* Check new value for validity */ if (ntables == 0) return (EINVAL); if (ntables > IPFW_TABLES_MAX) ntables = IPFW_TABLES_MAX; /* Align to nearest power of 2 */ ntables = (unsigned int)roundup2p(ntables); /* Allocate new pointers */ tablestate = malloc(ntables * sizeof(struct table_info), M_IPFW, M_WAITOK | M_ZERO); ipfw_objhash_bitmap_alloc(ntables, (void *)&new_idx, &new_blocks); IPFW_UH_WLOCK(ch); tbl = (ntables >= V_fw_tables_max) ? V_fw_tables_max : ntables; ni = CHAIN_TO_NI(ch); /* Temporarily restrict decreasing max_tables */ if (ntables < V_fw_tables_max) { /* * FIXME: Check if we really can shrink */ IPFW_UH_WUNLOCK(ch); return (EINVAL); } /* Copy table info/indices */ memcpy(tablestate, ch->tablestate, sizeof(struct table_info) * tbl); ipfw_objhash_bitmap_merge(ni, &new_idx, &new_blocks); IPFW_WLOCK(ch); /* Change pointers */ old_tablestate = ch->tablestate; ch->tablestate = tablestate; ipfw_objhash_bitmap_swap(ni, &new_idx, &new_blocks); ntables_old = V_fw_tables_max; V_fw_tables_max = ntables; IPFW_WUNLOCK(ch); /* Notify all consumers that their @ti pointer has changed */ ti = (struct table_info *)ch->tablestate; for (i = 0; i < tbl; i++, ti++) { if (ti->lookup == NULL) continue; tc = (struct table_config *)ipfw_objhash_lookup_kidx(ni, i); if (tc == NULL || tc->ta->change_ti == NULL) continue; tc->ta->change_ti(tc->astate, ti); } IPFW_UH_WUNLOCK(ch); /* Free old pointers */ free(old_tablestate, M_IPFW); ipfw_objhash_bitmap_free(new_idx, new_blocks); return (0); } /* * Lookup table's named object by its @kidx. */ struct named_object * ipfw_objhash_lookup_table_kidx(struct ip_fw_chain *ch, uint16_t kidx) { return (ipfw_objhash_lookup_kidx(CHAIN_TO_NI(ch), kidx)); } /* * Take reference to table specified in @ntlv. * On success return its @kidx. */ int ipfw_ref_table(struct ip_fw_chain *ch, ipfw_obj_ntlv *ntlv, uint16_t *kidx) { struct tid_info ti; struct table_config *tc; int error; IPFW_UH_WLOCK_ASSERT(ch); ntlv_to_ti(ntlv, &ti); error = find_table_err(CHAIN_TO_NI(ch), &ti, &tc); if (error != 0) return (error); if (tc == NULL) return (ESRCH); tc_ref(tc); *kidx = tc->no.kidx; return (0); } void ipfw_unref_table(struct ip_fw_chain *ch, uint16_t kidx) { struct namedobj_instance *ni; struct named_object *no; IPFW_UH_WLOCK_ASSERT(ch); ni = CHAIN_TO_NI(ch); no = ipfw_objhash_lookup_kidx(ni, kidx); KASSERT(no != NULL, ("Table with index %d not found", kidx)); no->refcnt--; } /* * Lookup an arbitrary key @paddr of length @plen in table @tbl. * Stores found value in @val. * * Returns 1 if key was found. */ int ipfw_lookup_table(struct ip_fw_chain *ch, uint16_t tbl, uint16_t plen, void *paddr, uint32_t *val) { struct table_info *ti; ti = KIDX_TO_TI(ch, tbl); return (ti->lookup(ti, paddr, plen, val)); } /* * Info/List/dump support for tables. * */ /* * High-level 'get' cmds sysctl handlers */ /* * Lists all tables currently available in kernel.
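 * Userland normally issues this request twice: if the supplied buffer
 * is too small, export_tables() below fills the required size into the
 * header and returns ENOMEM, so the first pass sizes the second.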
* Data layout (v0)(current): * Request: [ ipfw_obj_lheader ], size = ipfw_obj_lheader.size * Reply: [ ipfw_obj_lheader ipfw_xtable_info x N ] * * Returns 0 on success */ static int list_tables(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { struct _ipfw_obj_lheader *olh; int error; olh = (struct _ipfw_obj_lheader *)ipfw_get_sopt_header(sd,sizeof(*olh)); if (olh == NULL) return (EINVAL); if (sd->valsize < olh->size) return (EINVAL); IPFW_UH_RLOCK(ch); error = export_tables(ch, olh, sd); IPFW_UH_RUNLOCK(ch); return (error); } /* * Store table info to buffer provided by @sd. * Data layout (v0)(current): * Request: [ ipfw_obj_header ipfw_xtable_info(empty)] * Reply: [ ipfw_obj_header ipfw_xtable_info ] * * Returns 0 on success. */ static int describe_table(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { struct _ipfw_obj_header *oh; struct table_config *tc; struct tid_info ti; size_t sz; sz = sizeof(*oh) + sizeof(ipfw_xtable_info); oh = (struct _ipfw_obj_header *)ipfw_get_sopt_header(sd, sz); if (oh == NULL) return (EINVAL); objheader_to_ti(oh, &ti); IPFW_UH_RLOCK(ch); if ((tc = find_table(CHAIN_TO_NI(ch), &ti)) == NULL) { IPFW_UH_RUNLOCK(ch); return (ESRCH); } export_table_info(ch, tc, (ipfw_xtable_info *)(oh + 1)); IPFW_UH_RUNLOCK(ch); return (0); } /* * Modifies existing table. * Data layout (v0)(current): * Request: [ ipfw_obj_header ipfw_xtable_info ] * * Returns 0 on success */ static int modify_table(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { struct _ipfw_obj_header *oh; ipfw_xtable_info *i; char *tname; struct tid_info ti; struct namedobj_instance *ni; struct table_config *tc; if (sd->valsize != sizeof(*oh) + sizeof(ipfw_xtable_info)) return (EINVAL); oh = (struct _ipfw_obj_header *)sd->kbuf; i = (ipfw_xtable_info *)(oh + 1); /* * Verify user-supplied strings. * Check for null-terminated/zero-length strings. */ tname = oh->ntlv.name; if (check_table_name(tname) != 0) return (EINVAL); objheader_to_ti(oh, &ti); ti.type = i->type; IPFW_UH_WLOCK(ch); ni = CHAIN_TO_NI(ch); if ((tc = find_table(ni, &ti)) == NULL) { IPFW_UH_WUNLOCK(ch); return (ESRCH); } /* Do not support any modifications for readonly tables */ if ((tc->ta->flags & TA_FLAG_READONLY) != 0) { IPFW_UH_WUNLOCK(ch); return (EACCES); } if ((i->mflags & IPFW_TMFLAGS_LIMIT) != 0) tc->limit = i->limit; if ((i->mflags & IPFW_TMFLAGS_LOCK) != 0) tc->locked = ((i->flags & IPFW_TGFLAGS_LOCKED) != 0); IPFW_UH_WUNLOCK(ch); return (0); } /* * Creates new table. * Data layout (v0)(current): * Request: [ ipfw_obj_header ipfw_xtable_info ] * * Returns 0 on success */ static int create_table(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { struct _ipfw_obj_header *oh; ipfw_xtable_info *i; char *tname, *aname; struct tid_info ti; struct namedobj_instance *ni; if (sd->valsize != sizeof(*oh) + sizeof(ipfw_xtable_info)) return (EINVAL); oh = (struct _ipfw_obj_header *)sd->kbuf; i = (ipfw_xtable_info *)(oh + 1); /* * Verify user-supplied strings.
* Check for null-terminated/zero-length strings. */ tname = oh->ntlv.name; aname = i->algoname; if (check_table_name(tname) != 0 || strnlen(aname, sizeof(i->algoname)) == sizeof(i->algoname)) return (EINVAL); if (aname[0] == '\0') { /* Use default algorithm */ aname = NULL; } objheader_to_ti(oh, &ti); ti.type = i->type; ni = CHAIN_TO_NI(ch); IPFW_UH_RLOCK(ch); if (find_table(ni, &ti) != NULL) { IPFW_UH_RUNLOCK(ch); return (EEXIST); } IPFW_UH_RUNLOCK(ch); return (create_table_internal(ch, &ti, aname, i, NULL, 0)); } /* * Creates new table based on @ti and @aname. * * Assume @aname to be checked and valid. * Stores allocated table kidx inside @pkidx (if non-NULL). * Reference created table if @compat is non-zero. * * Returns 0 on success. */ static int create_table_internal(struct ip_fw_chain *ch, struct tid_info *ti, char *aname, ipfw_xtable_info *i, uint16_t *pkidx, int compat) { struct namedobj_instance *ni; struct table_config *tc, *tc_new, *tmp; struct table_algo *ta; uint16_t kidx; ni = CHAIN_TO_NI(ch); ta = find_table_algo(CHAIN_TO_TCFG(ch), ti, aname); if (ta == NULL) return (ENOTSUP); tc = alloc_table_config(ch, ti, ta, aname, i->tflags); if (tc == NULL) return (ENOMEM); tc->vmask = i->vmask; tc->limit = i->limit; if (ta->flags & TA_FLAG_READONLY) tc->locked = 1; else tc->locked = (i->flags & IPFW_TGFLAGS_LOCKED) != 0; IPFW_UH_WLOCK(ch); /* Check if table has already been created */ tc_new = find_table(ni, ti); if (tc_new != NULL) { /* * Compat: do not fail if we're * requesting to create existing table * which has the same type */ if (compat == 0 || tc_new->no.subtype != tc->no.subtype) { IPFW_UH_WUNLOCK(ch); free_table_config(ni, tc); return (EEXIST); } /* Exchange tc and tc_new for proper refcounting & freeing */ tmp = tc; tc = tc_new; tc_new = tmp; } else { /* New table */ if (ipfw_objhash_alloc_idx(ni, &kidx) != 0) { IPFW_UH_WUNLOCK(ch); printf("Unable to allocate table index." " Consider increasing net.inet.ip.fw.tables_max"); free_table_config(ni, tc); return (EBUSY); } tc->no.kidx = kidx; tc->no.etlv = IPFW_TLV_TBL_NAME; link_table(ch, tc); } if (compat != 0) tc->no.refcnt++; if (pkidx != NULL) *pkidx = tc->no.kidx; IPFW_UH_WUNLOCK(ch); if (tc_new != NULL) free_table_config(ni, tc_new); return (0); } static void ntlv_to_ti(ipfw_obj_ntlv *ntlv, struct tid_info *ti) { memset(ti, 0, sizeof(struct tid_info)); ti->set = ntlv->set; ti->uidx = ntlv->idx; ti->tlvs = ntlv; ti->tlen = ntlv->head.length; } static void objheader_to_ti(struct _ipfw_obj_header *oh, struct tid_info *ti) { ntlv_to_ti(&oh->ntlv, ti); } struct namedobj_instance * ipfw_get_table_objhash(struct ip_fw_chain *ch) { return (CHAIN_TO_NI(ch)); } /* * Exports basic table info as name TLV. * Used inside dump_static_rules() to provide info * about all tables referenced by current ruleset. * * Returns 0 on success.
*/ int ipfw_export_table_ntlv(struct ip_fw_chain *ch, uint16_t kidx, struct sockopt_data *sd) { struct namedobj_instance *ni; struct named_object *no; ipfw_obj_ntlv *ntlv; ni = CHAIN_TO_NI(ch); no = ipfw_objhash_lookup_kidx(ni, kidx); KASSERT(no != NULL, ("invalid table kidx passed")); ntlv = (ipfw_obj_ntlv *)ipfw_get_sopt_space(sd, sizeof(*ntlv)); if (ntlv == NULL) return (ENOMEM); ntlv->head.type = IPFW_TLV_TBL_NAME; ntlv->head.length = sizeof(*ntlv); ntlv->idx = no->kidx; strlcpy(ntlv->name, no->name, sizeof(ntlv->name)); return (0); } struct dump_args { struct ip_fw_chain *ch; struct table_info *ti; struct table_config *tc; struct sockopt_data *sd; uint32_t cnt; uint16_t uidx; int error; uint32_t size; ipfw_table_entry *ent; ta_foreach_f *f; void *farg; ipfw_obj_tentry tent; }; static int count_ext_entries(void *e, void *arg) { struct dump_args *da; da = (struct dump_args *)arg; da->cnt++; return (0); } /* * Gets number of items from table either using * internal counter or calling algo callback for * externally-managed tables. * * Returns number of records. */ static uint32_t table_get_count(struct ip_fw_chain *ch, struct table_config *tc) { struct table_info *ti; struct table_algo *ta; struct dump_args da; ti = KIDX_TO_TI(ch, tc->no.kidx); ta = tc->ta; /* Use internal counter for self-managed tables */ if ((ta->flags & TA_FLAG_READONLY) == 0) return (tc->count); /* Use callback to quickly get number of items */ if ((ta->flags & TA_FLAG_EXTCOUNTER) != 0) return (ta->get_count(tc->astate, ti)); /* Count number of items ourselves */ memset(&da, 0, sizeof(da)); ta->foreach(tc->astate, ti, count_ext_entries, &da); return (da.cnt); } /* * Exports table @tc info into standard ipfw_xtable_info format. */ static void export_table_info(struct ip_fw_chain *ch, struct table_config *tc, ipfw_xtable_info *i) { struct table_info *ti; struct table_algo *ta; i->type = tc->no.subtype; i->tflags = tc->tflags; i->vmask = tc->vmask; i->set = tc->no.set; i->kidx = tc->no.kidx; i->refcnt = tc->no.refcnt; i->count = table_get_count(ch, tc); i->limit = tc->limit; i->flags |= (tc->locked != 0) ? IPFW_TGFLAGS_LOCKED : 0; i->size = i->count * sizeof(ipfw_obj_tentry); i->size += sizeof(ipfw_obj_header) + sizeof(ipfw_xtable_info); strlcpy(i->tablename, tc->tablename, sizeof(i->tablename)); ti = KIDX_TO_TI(ch, tc->no.kidx); ta = tc->ta; if (ta->print_config != NULL) { /* Use algo function to print table config to string */ ta->print_config(tc->astate, ti, i->algoname, sizeof(i->algoname)); } else strlcpy(i->algoname, ta->name, sizeof(i->algoname)); /* Dump algo-specific data, if possible */ if (ta->dump_tinfo != NULL) { ta->dump_tinfo(tc->astate, ti, &i->ta_info); i->ta_info.flags |= IPFW_TATFLAGS_DATA; } } struct dump_table_args { struct ip_fw_chain *ch; struct sockopt_data *sd; }; static int export_table_internal(struct namedobj_instance *ni, struct named_object *no, void *arg) { ipfw_xtable_info *i; struct dump_table_args *dta; dta = (struct dump_table_args *)arg; i = (ipfw_xtable_info *)ipfw_get_sopt_space(dta->sd, sizeof(*i)); KASSERT(i != NULL, ("previously checked buffer is not enough")); export_table_info(dta->ch, (struct table_config *)no, i); return (0); } /* * Export all tables as ipfw_xtable_info structures to * storage provided by @sd. * * If supplied buffer is too small, fills in required size * and returns ENOMEM. * Returns 0 on success.
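 *
 * Sizing sketch (matches the computation below):
 *
 *	size = count * sizeof(ipfw_xtable_info) + sizeof(ipfw_obj_lheader);
 *
 * and this value is written back into @olh->size when ENOMEM is returned.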
*/ static int export_tables(struct ip_fw_chain *ch, ipfw_obj_lheader *olh, struct sockopt_data *sd) { uint32_t size; uint32_t count; struct dump_table_args dta; count = ipfw_objhash_count(CHAIN_TO_NI(ch)); size = count * sizeof(ipfw_xtable_info) + sizeof(ipfw_obj_lheader); /* Fill in header regardless of buffer size */ olh->count = count; olh->objsize = sizeof(ipfw_xtable_info); if (size > olh->size) { olh->size = size; return (ENOMEM); } olh->size = size; dta.ch = ch; dta.sd = sd; ipfw_objhash_foreach(CHAIN_TO_NI(ch), export_table_internal, &dta); return (0); } /* * Dumps all table data * Data layout (v1)(current): * Request: [ ipfw_obj_header ], size = ipfw_xtable_info.size * Reply: [ ipfw_obj_header ipfw_xtable_info ipfw_obj_tentry x N ] * * Returns 0 on success */ static int dump_table_v1(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { struct _ipfw_obj_header *oh; ipfw_xtable_info *i; struct tid_info ti; struct table_config *tc; struct table_algo *ta; struct dump_args da; uint32_t sz; sz = sizeof(ipfw_obj_header) + sizeof(ipfw_xtable_info); oh = (struct _ipfw_obj_header *)ipfw_get_sopt_header(sd, sz); if (oh == NULL) return (EINVAL); i = (ipfw_xtable_info *)(oh + 1); objheader_to_ti(oh, &ti); IPFW_UH_RLOCK(ch); if ((tc = find_table(CHAIN_TO_NI(ch), &ti)) == NULL) { IPFW_UH_RUNLOCK(ch); return (ESRCH); } export_table_info(ch, tc, i); if (sd->valsize < i->size) { /* * Submitted buffer size is not enough. * We've already filled in the @i structure with * relevant table info including size, so we * can return. Buffer will be flushed automatically. */ IPFW_UH_RUNLOCK(ch); return (ENOMEM); } /* * Do the actual dump in eXtended format */ memset(&da, 0, sizeof(da)); da.ch = ch; da.ti = KIDX_TO_TI(ch, tc->no.kidx); da.tc = tc; da.sd = sd; ta = tc->ta; ta->foreach(tc->astate, da.ti, dump_table_tentry, &da); IPFW_UH_RUNLOCK(ch); return (da.error); } /* * Dumps all table data * Data layout (version 0)(legacy): * Request: [ ipfw_xtable ], size = IP_FW_TABLE_XGETSIZE() * Reply: [ ipfw_xtable ipfw_table_xentry x N ] * * Returns 0 on success */ static int dump_table_v0(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { ipfw_xtable *xtbl; struct tid_info ti; struct table_config *tc; struct table_algo *ta; struct dump_args da; size_t sz, count; xtbl = (ipfw_xtable *)ipfw_get_sopt_header(sd, sizeof(ipfw_xtable)); if (xtbl == NULL) return (EINVAL); memset(&ti, 0, sizeof(ti)); ti.uidx = xtbl->tbl; IPFW_UH_RLOCK(ch); if ((tc = find_table(CHAIN_TO_NI(ch), &ti)) == NULL) { IPFW_UH_RUNLOCK(ch); return (0); } count = table_get_count(ch, tc); sz = count * sizeof(ipfw_table_xentry) + sizeof(ipfw_xtable); xtbl->cnt = count; xtbl->size = sz; xtbl->type = tc->no.subtype; xtbl->tbl = ti.uidx; if (sd->valsize < sz) { /* * Submitted buffer size is not enough. * We've already filled in the @xtbl structure with * relevant table info including size, so we * can return. Buffer will be flushed automatically. */ IPFW_UH_RUNLOCK(ch); return (ENOMEM); } /* Do the actual dump in eXtended format */ memset(&da, 0, sizeof(da)); da.ch = ch; da.ti = KIDX_TO_TI(ch, tc->no.kidx); da.tc = tc; da.sd = sd; ta = tc->ta; ta->foreach(tc->astate, da.ti, dump_table_xentry, &da); IPFW_UH_RUNLOCK(ch); return (0); } /* * Legacy function to retrieve number of items in table.
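 * Data layout sketch (per the handler below):
 * Request: [ ip_fw3_opheader uint32_t(table uidx) ]
 * Reply: [ ip_fw3_opheader uint32_t(required dump size, in bytes) ]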
*/ static int get_table_size(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { uint32_t *tbl; struct tid_info ti; size_t sz; int error; sz = sizeof(*op3) + sizeof(uint32_t); op3 = (ip_fw3_opheader *)ipfw_get_sopt_header(sd, sz); if (op3 == NULL) return (EINVAL); tbl = (uint32_t *)(op3 + 1); memset(&ti, 0, sizeof(ti)); ti.uidx = *tbl; IPFW_UH_RLOCK(ch); error = ipfw_count_xtable(ch, &ti, tbl); IPFW_UH_RUNLOCK(ch); return (error); } /* * Legacy IP_FW_TABLE_GETSIZE handler */ int ipfw_count_table(struct ip_fw_chain *ch, struct tid_info *ti, uint32_t *cnt) { struct table_config *tc; if ((tc = find_table(CHAIN_TO_NI(ch), ti)) == NULL) return (ESRCH); *cnt = table_get_count(ch, tc); return (0); } /* * Legacy IP_FW_TABLE_XGETSIZE handler */ int ipfw_count_xtable(struct ip_fw_chain *ch, struct tid_info *ti, uint32_t *cnt) { struct table_config *tc; uint32_t count; if ((tc = find_table(CHAIN_TO_NI(ch), ti)) == NULL) { *cnt = 0; return (0); /* 'table all list' requires success */ } count = table_get_count(ch, tc); *cnt = count * sizeof(ipfw_table_xentry); if (count > 0) *cnt += sizeof(ipfw_xtable); return (0); } static int dump_table_entry(void *e, void *arg) { struct dump_args *da; struct table_config *tc; struct table_algo *ta; ipfw_table_entry *ent; struct table_value *pval; int error; da = (struct dump_args *)arg; tc = da->tc; ta = tc->ta; /* Out of memory, returning */ if (da->cnt == da->size) return (1); ent = da->ent++; ent->tbl = da->uidx; da->cnt++; error = ta->dump_tentry(tc->astate, da->ti, e, &da->tent); if (error != 0) return (error); ent->addr = da->tent.k.addr.s_addr; ent->masklen = da->tent.masklen; pval = get_table_value(da->ch, da->tc, da->tent.v.kidx); ent->value = ipfw_export_table_value_legacy(pval); return (0); } /* * Dumps table in pre-8.1 legacy format. */ int ipfw_dump_table_legacy(struct ip_fw_chain *ch, struct tid_info *ti, ipfw_table *tbl) { struct table_config *tc; struct table_algo *ta; struct dump_args da; tbl->cnt = 0; if ((tc = find_table(CHAIN_TO_NI(ch), ti)) == NULL) return (0); /* XXX: We should return ESRCH */ ta = tc->ta; /* This dump format supports IPv4 only */ if (tc->no.subtype != IPFW_TABLE_ADDR) return (0); memset(&da, 0, sizeof(da)); da.ch = ch; da.ti = KIDX_TO_TI(ch, tc->no.kidx); da.tc = tc; da.ent = &tbl->ent[0]; da.size = tbl->size; tbl->cnt = 0; ta->foreach(tc->astate, da.ti, dump_table_entry, &da); tbl->cnt = da.cnt; return (0); } /* * Dumps table entry in eXtended format (v1)(current). */ static int dump_table_tentry(void *e, void *arg) { struct dump_args *da; struct table_config *tc; struct table_algo *ta; struct table_value *pval; ipfw_obj_tentry *tent; int error; da = (struct dump_args *)arg; tc = da->tc; ta = tc->ta; tent = (ipfw_obj_tentry *)ipfw_get_sopt_space(da->sd, sizeof(*tent)); /* Out of memory, returning */ if (tent == NULL) { da->error = ENOMEM; return (1); } tent->head.length = sizeof(ipfw_obj_tentry); tent->idx = da->uidx; error = ta->dump_tentry(tc->astate, da->ti, e, tent); if (error != 0) return (error); pval = get_table_value(da->ch, da->tc, tent->v.kidx); ipfw_export_table_value_v1(pval, &tent->v.value); return (0); } /* * Dumps table entry in eXtended format (v0). 
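 *
 * For IPv4 tables the v0 format stores the address in the last 32-bit
 * word of the in6_addr key and sets IPFW_TCF_INET (see the conversion
 * below). A hedged consumer-side decode of one reply entry:
 *
 *	struct in_addr a;
 *
 *	if (xent->flags & IPFW_TCF_INET)
 *		a.s_addr = xent->k.addr6.s6_addr32[3];	// IPv4 in last word
 *	// otherwise xent->k holds the full key as-is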
*/ static int dump_table_xentry(void *e, void *arg) { struct dump_args *da; struct table_config *tc; struct table_algo *ta; ipfw_table_xentry *xent; ipfw_obj_tentry *tent; struct table_value *pval; int error; da = (struct dump_args *)arg; tc = da->tc; ta = tc->ta; xent = (ipfw_table_xentry *)ipfw_get_sopt_space(da->sd, sizeof(*xent)); /* Out of memory, returning */ if (xent == NULL) return (1); xent->len = sizeof(ipfw_table_xentry); xent->tbl = da->uidx; memset(&da->tent, 0, sizeof(da->tent)); tent = &da->tent; error = ta->dump_tentry(tc->astate, da->ti, e, tent); if (error != 0) return (error); /* Convert current format to previous one */ xent->masklen = tent->masklen; pval = get_table_value(da->ch, da->tc, da->tent.v.kidx); xent->value = ipfw_export_table_value_legacy(pval); /* Apply some hacks */ if (tc->no.subtype == IPFW_TABLE_ADDR && tent->subtype == AF_INET) { xent->k.addr6.s6_addr32[3] = tent->k.addr.s_addr; xent->flags = IPFW_TCF_INET; } else memcpy(&xent->k, &tent->k, sizeof(xent->k)); return (0); } /* * Helper function to export table algo data * to tentry format before calling user function. * * Returns 0 on success. */ static int prepare_table_tentry(void *e, void *arg) { struct dump_args *da; struct table_config *tc; struct table_algo *ta; int error; da = (struct dump_args *)arg; tc = da->tc; ta = tc->ta; error = ta->dump_tentry(tc->astate, da->ti, e, &da->tent); if (error != 0) return (error); da->f(&da->tent, da->farg); return (0); } /* * Allow external consumers to read table entries in standard format. */ int ipfw_foreach_table_tentry(struct ip_fw_chain *ch, uint16_t kidx, ta_foreach_f *f, void *arg) { struct namedobj_instance *ni; struct table_config *tc; struct table_algo *ta; struct dump_args da; ni = CHAIN_TO_NI(ch); tc = (struct table_config *)ipfw_objhash_lookup_kidx(ni, kidx); if (tc == NULL) return (ESRCH); ta = tc->ta; memset(&da, 0, sizeof(da)); da.ch = ch; da.ti = KIDX_TO_TI(ch, tc->no.kidx); da.tc = tc; da.f = f; da.farg = arg; ta->foreach(tc->astate, da.ti, prepare_table_tentry, &da); return (0); } /* * Table algorithms */ /* * Finds algorithm by index, table type or supplied name. * * Returns pointer to algo or NULL. */ static struct table_algo * find_table_algo(struct tables_config *tcfg, struct tid_info *ti, char *name) { int i, l; struct table_algo *ta; if (ti->type > IPFW_TABLE_MAXTYPE) return (NULL); /* Search by index */ if (ti->atype != 0) { if (ti->atype > tcfg->algo_count) return (NULL); return (tcfg->algo[ti->atype]); } if (name == NULL) { /* Return default algorithm for given type if set */ return (tcfg->def_algo[ti->type]); } /* Search by name */ /* TODO: better search */ for (i = 1; i <= tcfg->algo_count; i++) { ta = tcfg->algo[i]; /* * One can supply additional algorithm * parameters so we compare only the first word * of supplied name: * 'addr:chash hsize=32' * '^^^^^^^^^' * */ l = strlen(ta->name); if (strncmp(name, ta->name, l) != 0) continue; if (name[l] != '\0' && name[l] != ' ') continue; /* Check if we're requesting proper table type */ if (ti->type != 0 && ti->type != ta->type) return (NULL); return (ta); } return (NULL); } /* * Register new table algo @ta. * Stores algo id inside @idx. * * Returns 0 on success. 
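 *
 * A hedged registration sketch for an algorithm module; the "demo"
 * names and the demo_ta_buf type are hypothetical, while the field
 * names match struct table_algo as used in this file:
 *
 *	static struct table_algo addr_demo = {
 *		.name		= "addr:demo",
 *		.type		= IPFW_TABLE_ADDR,
 *		.ta_buf_size	= sizeof(struct demo_ta_buf),
 *		.init		= demo_init,
 *		.destroy	= demo_destroy,
 *		.foreach	= demo_foreach,
 *		.dump_tentry	= demo_dump_tentry,
 *	};
 *	int demo_idx;
 *
 *	ipfw_add_table_algo(ch, &addr_demo, sizeof(addr_demo), &demo_idx);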
*/
int
ipfw_add_table_algo(struct ip_fw_chain *ch, struct table_algo *ta, size_t size,
    int *idx)
{
	struct tables_config *tcfg;
	struct table_algo *ta_new;
	size_t sz;

	if (size > sizeof(struct table_algo))
		return (EINVAL);

	/* Check for the required on-stack size for add/del */
	sz = roundup2(ta->ta_buf_size, sizeof(void *));
	if (sz > TA_BUF_SZ)
		return (EINVAL);

	KASSERT(ta->type <= IPFW_TABLE_MAXTYPE,
	    ("Increase IPFW_TABLE_MAXTYPE"));

	/* Copy algorithm data to stable storage. */
	ta_new = malloc(sizeof(struct table_algo), M_IPFW, M_WAITOK | M_ZERO);
	memcpy(ta_new, ta, size);

	tcfg = CHAIN_TO_TCFG(ch);

	KASSERT(tcfg->algo_count < 255, ("Increase algo array size"));

	tcfg->algo[++tcfg->algo_count] = ta_new;
	ta_new->idx = tcfg->algo_count;

	/* Set algorithm as default one for given type */
	if ((ta_new->flags & TA_FLAG_DEFAULT) != 0 &&
	    tcfg->def_algo[ta_new->type] == NULL)
		tcfg->def_algo[ta_new->type] = ta_new;

	*idx = ta_new->idx;

	return (0);
}

/*
 * Unregisters table algo using @idx as id.
 * XXX: It is NOT safe to call this function in any place
 * other than ipfw instance destroy handler.
 */
void
ipfw_del_table_algo(struct ip_fw_chain *ch, int idx)
{
	struct tables_config *tcfg;
	struct table_algo *ta;

	tcfg = CHAIN_TO_TCFG(ch);

	KASSERT(idx <= tcfg->algo_count, ("algo idx %d out of range 1..%d",
	    idx, tcfg->algo_count));

	ta = tcfg->algo[idx];
	KASSERT(ta != NULL, ("algo idx %d is NULL", idx));

	if (tcfg->def_algo[ta->type] == ta)
		tcfg->def_algo[ta->type] = NULL;

	free(ta, M_IPFW);
}

/*
 * Lists all table algorithms currently available.
 * Data layout (v0)(current):
 * Request: [ ipfw_obj_lheader ], size = ipfw_obj_lheader.size
 * Reply: [ ipfw_obj_lheader ipfw_ta_info x N ]
 *
 * Returns 0 on success
 */
static int
list_table_algo(struct ip_fw_chain *ch, ip_fw3_opheader *op3,
    struct sockopt_data *sd)
{
	struct _ipfw_obj_lheader *olh;
	struct tables_config *tcfg;
	ipfw_ta_info *i;
	struct table_algo *ta;
	uint32_t count, n, size;

	olh = (struct _ipfw_obj_lheader *)ipfw_get_sopt_header(sd,
	    sizeof(*olh));
	if (olh == NULL)
		return (EINVAL);
	if (sd->valsize < olh->size)
		return (EINVAL);

	IPFW_UH_RLOCK(ch);
	tcfg = CHAIN_TO_TCFG(ch);
	count = tcfg->algo_count;
	size = count * sizeof(ipfw_ta_info) + sizeof(ipfw_obj_lheader);

	/* Fill in header regardless of buffer size */
	olh->count = count;
	olh->objsize = sizeof(ipfw_ta_info);

	if (size > olh->size) {
		olh->size = size;
		IPFW_UH_RUNLOCK(ch);
		return (ENOMEM);
	}
	olh->size = size;

	for (n = 1; n <= count; n++) {
		i = (ipfw_ta_info *)ipfw_get_sopt_space(sd, sizeof(*i));
		KASSERT(i != NULL, ("previously checked buffer is not enough"));
		ta = tcfg->algo[n];
		strlcpy(i->algoname, ta->name, sizeof(i->algoname));
		i->type = ta->type;
		i->refcnt = ta->refcnt;
	}

	IPFW_UH_RUNLOCK(ch);

	return (0);
}
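/*
 * Illustrative sketch: walking the IP_FW_TABLES_ALIST reply produced
 * above. Hedged userland fragment; assumes the buffer was fetched with
 * the probe-and-retry loop shown earlier:
 *
 *	ipfw_obj_lheader *olh = ...;		// filled by the kernel
 *	ipfw_ta_info *info = (ipfw_ta_info *)(olh + 1);
 *	uint32_t n;
 *
 *	for (n = 0; n < olh->count; n++, info++)
 *		printf("%-16s type %u refcnt %u\n", info->algoname,
 *		    info->type, info->refcnt);
 */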
static int
classify_srcdst(ipfw_insn *cmd, uint16_t *puidx, uint8_t *ptype)
{
	/* Basic IPv4/IPv6 or u32 lookups */
	*puidx = cmd->arg1;
	/* Assume ADDR by default */
	*ptype = IPFW_TABLE_ADDR;
	int v;

	if (F_LEN(cmd) > F_INSN_SIZE(ipfw_insn_u32)) {
		/*
		 * generic lookup. The key must be
		 * in 32bit big-endian format.
		 */
		v = ((ipfw_insn_u32 *)cmd)->d[1];
		switch (v) {
		case 0:
		case 1:
			/* IPv4 src/dst */
			break;
		case 2:
		case 3:
			/* src/dst port */
			*ptype = IPFW_TABLE_NUMBER;
			break;
		case 4:
			/* uid/gid */
			*ptype = IPFW_TABLE_NUMBER;
			break;
		case 5:
			/* jid */
			*ptype = IPFW_TABLE_NUMBER;
			break;
		case 6:
			/* dscp */
			*ptype = IPFW_TABLE_NUMBER;
			break;
		}
	}

	return (0);
}

static int
classify_via(ipfw_insn *cmd, uint16_t *puidx, uint8_t *ptype)
{
	ipfw_insn_if *cmdif;

	/* Interface table, possibly */
	cmdif = (ipfw_insn_if *)cmd;

	if (cmdif->name[0] != '\1')
		return (1);

	*ptype = IPFW_TABLE_INTERFACE;
	*puidx = cmdif->p.kidx;

	return (0);
}

static int
classify_flow(ipfw_insn *cmd, uint16_t *puidx, uint8_t *ptype)
{

	*puidx = cmd->arg1;
	*ptype = IPFW_TABLE_FLOW;

	return (0);
}

static void
update_arg1(ipfw_insn *cmd, uint16_t idx)
{

	cmd->arg1 = idx;
}

static void
update_via(ipfw_insn *cmd, uint16_t idx)
{
	ipfw_insn_if *cmdif;

	cmdif = (ipfw_insn_if *)cmd;
	cmdif->p.kidx = idx;
}

static int
table_findbyname(struct ip_fw_chain *ch, struct tid_info *ti,
    struct named_object **pno)
{
	struct table_config *tc;
	int error;

	IPFW_UH_WLOCK_ASSERT(ch);

	error = find_table_err(CHAIN_TO_NI(ch), ti, &tc);
	if (error != 0)
		return (error);

	*pno = &tc->no;
	return (0);
}

/* XXX: sets-sets! */
static struct named_object *
table_findbykidx(struct ip_fw_chain *ch, uint16_t idx)
{
	struct namedobj_instance *ni;
	struct table_config *tc;

	IPFW_UH_WLOCK_ASSERT(ch);

	ni = CHAIN_TO_NI(ch);
	tc = (struct table_config *)ipfw_objhash_lookup_kidx(ni, idx);
	KASSERT(tc != NULL, ("Table with index %d not found", idx));

	return (&tc->no);
}

static int
table_manage_sets(struct ip_fw_chain *ch, uint16_t set, uint8_t new_set,
    enum ipfw_sets_cmd cmd)
{

	switch (cmd) {
	case SWAP_ALL:
	case TEST_ALL:
	case MOVE_ALL:
		/*
		 * Always return success; the real action and decision
		 * are made by table_manage_sets_all().
		 */
		return (0);
	case TEST_ONE:
	case MOVE_ONE:
		/*
		 * NOTE: we need to use ipfw_objhash_del/ipfw_objhash_add
		 * if set number will be used in hash function. Currently
		 * we can just use generic handler that replaces set value.
		 */
		if (V_fw_tables_sets == 0)
			return (0);
		break;
	case COUNT_ONE:
		/*
		 * Return EOPNOTSUPP for COUNT_ONE when per-set sysctl is
		 * disabled. This allows the table opcodes to be skipped
		 * in the additional checks performed when specific rules
		 * are moved to another set.
		 */
		if (V_fw_tables_sets == 0)
			return (EOPNOTSUPP);
	}
	/* Use generic sets handler when per-set sysctl is enabled. */
	return (ipfw_obj_manage_sets(CHAIN_TO_NI(ch), IPFW_TLV_TBL_NAME,
	    set, new_set, cmd));
}

/*
 * We register several opcode rewriters for lookup tables.
 * All tables opcodes have the same ETLV type, but different subtype.
 * To avoid invoking sets handler several times for XXX_ALL commands,
 * we use separate manage_sets handler. O_RECV has the lowest value,
 * so it should be called first.
 */
static int
table_manage_sets_all(struct ip_fw_chain *ch, uint16_t set, uint8_t new_set,
    enum ipfw_sets_cmd cmd)
{

	switch (cmd) {
	case SWAP_ALL:
	case TEST_ALL:
		/*
		 * Return success for TEST_ALL, since nothing prevents
		 * moving rules from one set to another. All tables are
		 * accessible from all sets when per-set tables sysctl
		 * is disabled.
		 */
	case MOVE_ALL:
		if (V_fw_tables_sets == 0)
			return (0);
		break;
	default:
		return (table_manage_sets(ch, set, new_set, cmd));
	}
	/* Use generic sets handler when per-set sysctl is enabled. */
	return (ipfw_obj_manage_sets(CHAIN_TO_NI(ch), IPFW_TLV_TBL_NAME,
	    set, new_set, cmd));
}
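/*
 * Illustrative sketch of the rewriter flow that the opcodes[] table
 * below wires up, using O_IP_SRC_LOOKUP as the example.
 * resolve_or_create() is a hypothetical stand-in for the generic
 * object-rewrite machinery, not a function in this file:
 *
 *	uint16_t uidx, kidx;
 *	uint8_t type;
 *
 *	classify_srcdst(cmd, &uidx, &type);	// uidx = cmd->arg1 (user index)
 *	kidx = resolve_or_create(ch, uidx, type); // bind to named object
 *	update_arg1(cmd, kidx);			// patch kernel index into rule
 */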
static struct opcode_obj_rewrite opcodes[] = {
	{
		.opcode = O_IP_SRC_LOOKUP,
		.etlv = IPFW_TLV_TBL_NAME,
		.classifier = classify_srcdst,
		.update = update_arg1,
		.find_byname = table_findbyname,
		.find_bykidx = table_findbykidx,
		.create_object = create_table_compat,
		.manage_sets = table_manage_sets,
	},
	{
		.opcode = O_IP_DST_LOOKUP,
		.etlv = IPFW_TLV_TBL_NAME,
		.classifier = classify_srcdst,
		.update = update_arg1,
		.find_byname = table_findbyname,
		.find_bykidx = table_findbykidx,
		.create_object = create_table_compat,
		.manage_sets = table_manage_sets,
	},
	{
		.opcode = O_IP_FLOW_LOOKUP,
		.etlv = IPFW_TLV_TBL_NAME,
		.classifier = classify_flow,
		.update = update_arg1,
		.find_byname = table_findbyname,
		.find_bykidx = table_findbykidx,
		.create_object = create_table_compat,
		.manage_sets = table_manage_sets,
	},
	{
		.opcode = O_XMIT,
		.etlv = IPFW_TLV_TBL_NAME,
		.classifier = classify_via,
		.update = update_via,
		.find_byname = table_findbyname,
		.find_bykidx = table_findbykidx,
		.create_object = create_table_compat,
		.manage_sets = table_manage_sets,
	},
	{
		.opcode = O_RECV,
		.etlv = IPFW_TLV_TBL_NAME,
		.classifier = classify_via,
		.update = update_via,
		.find_byname = table_findbyname,
		.find_bykidx = table_findbykidx,
		.create_object = create_table_compat,
		.manage_sets = table_manage_sets_all,
	},
	{
		.opcode = O_VIA,
		.etlv = IPFW_TLV_TBL_NAME,
		.classifier = classify_via,
		.update = update_via,
		.find_byname = table_findbyname,
		.find_bykidx = table_findbykidx,
		.create_object = create_table_compat,
		.manage_sets = table_manage_sets,
	},
};

static int
test_sets_cb(struct namedobj_instance *ni __unused, struct named_object *no,
    void *arg __unused)
{

	/* Check that there aren't any tables in a non-default set */
	if (no->set != 0)
		return (EBUSY);
	return (0);
}

/*
 * Switch between "set 0" and "rule's set" table binding.
 * Checks all ruleset bindings and permits changing
 * IFF each binding has both rule AND table in default set (set 0).
 *
 * Returns 0 on success.
 */
int
ipfw_switch_tables_namespace(struct ip_fw_chain *ch, unsigned int sets)
{
	struct opcode_obj_rewrite *rw;
	struct namedobj_instance *ni;
	struct named_object *no;
	struct ip_fw *rule;
	ipfw_insn *cmd;
	int cmdlen, i, l;
	uint16_t kidx;
	uint8_t subtype;

	IPFW_UH_WLOCK(ch);

	if (V_fw_tables_sets == sets) {
		IPFW_UH_WUNLOCK(ch);
		return (0);
	}

	ni = CHAIN_TO_NI(ch);

	if (sets == 0) {
		/*
		 * Prevent disabling sets support if we have some tables
		 * in non-default sets.
		 */
		if (ipfw_objhash_foreach_type(ni, test_sets_cb,
		    NULL, IPFW_TLV_TBL_NAME) != 0) {
			IPFW_UH_WUNLOCK(ch);
			return (EBUSY);
		}
	}

	/*
	 * Scan all rules and examine tables opcodes.
	 */
	for (i = 0; i < ch->n_rules; i++) {
		rule = ch->map[i];

		l = rule->cmd_len;
		cmd = rule->cmd;
		cmdlen = 0;
		for ( ;	l > 0 ; l -= cmdlen, cmd += cmdlen) {
			cmdlen = F_LEN(cmd);
			/* Check only tables opcodes */
			for (kidx = 0, rw = opcodes;
			    rw < opcodes + nitems(opcodes); rw++) {
				if (rw->opcode != cmd->opcode)
					continue;
				if (rw->classifier(cmd, &kidx, &subtype) == 0)
					break;
			}
			if (kidx == 0)
				continue;
			no = ipfw_objhash_lookup_kidx(ni, kidx);
			/* Check if both table object and rule are in set 0 */
			if (no->set != 0 || rule->set != 0) {
				IPFW_UH_WUNLOCK(ch);
				return (EBUSY);
			}
		}
	}
	V_fw_tables_sets = sets;

	IPFW_UH_WUNLOCK(ch);

	return (0);
}
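/*
 * Note on the scan above: it uses the standard ipfw opcode walk, where
 * F_LEN() yields the instruction length in 32-bit words. As a
 * stand-alone sketch of the iteration pattern:
 *
 *	ipfw_insn *cmd = rule->cmd;
 *	int l = rule->cmd_len, cmdlen;
 *
 *	for (; l > 0; l -= cmdlen, cmd += cmdlen) {
 *		cmdlen = F_LEN(cmd);
 *		// inspect cmd->opcode here
 *	}
 */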
/*
 * Checks table name for validity.
 * Enforce basic length checks, the rest
 * should be done in userland.
 *
 * Returns 0 if name is considered valid.
 */
static int
check_table_name(const char *name)
{

	/*
	 * TODO: do some more complicated checks
	 */
	return (ipfw_check_object_name_generic(name));
}

/*
 * Finds table config based on either legacy index
 * or name in ntlv.
 * Note @ti structure contains unchecked data from userland.
 *
 * Returns 0 on success and fills in @tc with found config
 */
static int
find_table_err(struct namedobj_instance *ni, struct tid_info *ti,
    struct table_config **tc)
{
	char *name, bname[16];
	struct named_object *no;
	ipfw_obj_ntlv *ntlv;
	uint32_t set;

	if (ti->tlvs != NULL) {
		ntlv = ipfw_find_name_tlv_type(ti->tlvs, ti->tlen, ti->uidx,
		    IPFW_TLV_TBL_NAME);
		if (ntlv == NULL)
			return (EINVAL);
		name = ntlv->name;

		/*
		 * Use set provided by @ti instead of @ntlv one.
		 * This is needed due to different sets behavior
		 * controlled by V_fw_tables_sets.
		 */
		set = (V_fw_tables_sets != 0) ? ti->set : 0;
	} else {
		snprintf(bname, sizeof(bname), "%d", ti->uidx);
		name = bname;
		set = 0;
	}

	no = ipfw_objhash_lookup_name(ni, set, name);
	*tc = (struct table_config *)no;

	return (0);
}

/*
 * Finds table config based on either legacy index
 * or name in ntlv.
 * Note @ti structure contains unchecked data from userland.
 *
 * Returns pointer to table_config or NULL.
 */
static struct table_config *
find_table(struct namedobj_instance *ni, struct tid_info *ti)
{
	struct table_config *tc;

	if (find_table_err(ni, ti, &tc) != 0)
		return (NULL);

	return (tc);
}

/*
 * Allocate new table config structure using
 * specified @algo and @aname.
 *
 * Returns pointer to config or NULL.
 */
static struct table_config *
alloc_table_config(struct ip_fw_chain *ch, struct tid_info *ti,
    struct table_algo *ta, char *aname, uint8_t tflags)
{
	char *name, bname[16];
	struct table_config *tc;
	int error;
	ipfw_obj_ntlv *ntlv;
	uint32_t set;

	if (ti->tlvs != NULL) {
		ntlv = ipfw_find_name_tlv_type(ti->tlvs, ti->tlen, ti->uidx,
		    IPFW_TLV_TBL_NAME);
		if (ntlv == NULL)
			return (NULL);
		name = ntlv->name;
-		set = ntlv->set;
+		set = (V_fw_tables_sets == 0) ? 0 : ntlv->set;
	} else {
		/* Compat part: convert number to string representation */
		snprintf(bname, sizeof(bname), "%d", ti->uidx);
		name = bname;
		set = 0;
	}

	tc = malloc(sizeof(struct table_config), M_IPFW, M_WAITOK | M_ZERO);
	tc->no.name = tc->tablename;
	tc->no.subtype = ta->type;
	tc->no.set = set;
	tc->tflags = tflags;
	tc->ta = ta;
	strlcpy(tc->tablename, name, sizeof(tc->tablename));
	/* Set "shared" value type by default */
	tc->vshared = 1;

	/* Preallocate data structures for new tables */
	error = ta->init(ch, &tc->astate, &tc->ti_copy, aname, tflags);
	if (error != 0) {
		free(tc, M_IPFW);
		return (NULL);
	}

	return (tc);
}

/*
 * Destroys table state and config.
 */
static void
free_table_config(struct namedobj_instance *ni, struct table_config *tc)
{

	KASSERT(tc->linked == 0, ("free() on linked config"));
	/* UH lock MUST NOT be held */

	/*
	 * We're using ta without any locking/referencing.
	 * TODO: fix this if we're going to use unloadable algos.
	 */
	tc->ta->destroy(tc->astate, &tc->ti_copy);
	free(tc, M_IPFW);
}
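/*
 * Note on the compat naming used by find_table_err() and
 * alloc_table_config() above: a legacy numeric table index N is simply
 * the table whose name is the decimal string "N" in set 0, e.g.:
 *
 *	char bname[16];
 *
 *	snprintf(bname, sizeof(bname), "%d", 12);
 *	// legacy "table 12" resolves through
 *	// ipfw_objhash_lookup_name(ni, 0, "12")
 */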
/*
 * Links @tc to @chain table named instance.
 * Sets appropriate type/states in @chain table info.
 */
static void
link_table(struct ip_fw_chain *ch, struct table_config *tc)
{
	struct namedobj_instance *ni;
	struct table_info *ti;
	uint16_t kidx;

	IPFW_UH_WLOCK_ASSERT(ch);

	ni = CHAIN_TO_NI(ch);
	kidx = tc->no.kidx;

	ipfw_objhash_add(ni, &tc->no);

	ti = KIDX_TO_TI(ch, kidx);
	*ti = tc->ti_copy;

	/* Notify algo on real @ti address */
	if (tc->ta->change_ti != NULL)
		tc->ta->change_ti(tc->astate, ti);

	tc->linked = 1;
	tc->ta->refcnt++;
}

/*
 * Unlinks @tc from @chain table named instance.
 * Zeroes states in @chain and stores them in @tc.
 */
static void
unlink_table(struct ip_fw_chain *ch, struct table_config *tc)
{
	struct namedobj_instance *ni;
	struct table_info *ti;
	uint16_t kidx;

	IPFW_UH_WLOCK_ASSERT(ch);
	IPFW_WLOCK_ASSERT(ch);

	ni = CHAIN_TO_NI(ch);
	kidx = tc->no.kidx;

	/* Clear state. @ti copy is already saved inside @tc */
	ipfw_objhash_del(ni, &tc->no);

	ti = KIDX_TO_TI(ch, kidx);
	memset(ti, 0, sizeof(struct table_info));
	tc->linked = 0;
	tc->ta->refcnt--;

	/* Notify algo on real @ti address */
	if (tc->ta->change_ti != NULL)
		tc->ta->change_ti(tc->astate, NULL);
}

static struct ipfw_sopt_handler	scodes[] = {
	{ IP_FW_TABLE_XCREATE,	0,	HDIR_SET,	create_table },
	{ IP_FW_TABLE_XDESTROY,	0,	HDIR_SET,	flush_table_v0 },
	{ IP_FW_TABLE_XFLUSH,	0,	HDIR_SET,	flush_table_v0 },
	{ IP_FW_TABLE_XMODIFY,	0,	HDIR_BOTH,	modify_table },
	{ IP_FW_TABLE_XINFO,	0,	HDIR_GET,	describe_table },
	{ IP_FW_TABLES_XLIST,	0,	HDIR_GET,	list_tables },
	{ IP_FW_TABLE_XLIST,	0,	HDIR_GET,	dump_table_v0 },
	{ IP_FW_TABLE_XLIST,	1,	HDIR_GET,	dump_table_v1 },
	{ IP_FW_TABLE_XADD,	0,	HDIR_BOTH,	manage_table_ent_v0 },
	{ IP_FW_TABLE_XADD,	1,	HDIR_BOTH,	manage_table_ent_v1 },
	{ IP_FW_TABLE_XDEL,	0,	HDIR_BOTH,	manage_table_ent_v0 },
	{ IP_FW_TABLE_XDEL,	1,	HDIR_BOTH,	manage_table_ent_v1 },
	{ IP_FW_TABLE_XFIND,	0,	HDIR_GET,	find_table_entry },
	{ IP_FW_TABLE_XSWAP,	0,	HDIR_SET,	swap_table },
	{ IP_FW_TABLES_ALIST,	0,	HDIR_GET,	list_table_algo },
	{ IP_FW_TABLE_XGETSIZE,	0,	HDIR_GET,	get_table_size },
};

static int
destroy_table_locked(struct namedobj_instance *ni, struct named_object *no,
    void *arg)
{

	unlink_table((struct ip_fw_chain *)arg, (struct table_config *)no);
	if (ipfw_objhash_free_idx(ni, no->kidx) != 0)
		printf("Error unlinking kidx %d from table %s\n",
		    no->kidx, no->name);
	free_table_config(ni, (struct table_config *)no);
	return (0);
}

/*
 * Shuts tables module down.
 */
void
ipfw_destroy_tables(struct ip_fw_chain *ch, int last)
{

	IPFW_DEL_SOPT_HANDLER(last, scodes);
	IPFW_DEL_OBJ_REWRITER(last, opcodes);

	/* Remove all tables from working set */
	IPFW_UH_WLOCK(ch);
	IPFW_WLOCK(ch);
	ipfw_objhash_foreach(CHAIN_TO_NI(ch), destroy_table_locked, ch);
	IPFW_WUNLOCK(ch);
	IPFW_UH_WUNLOCK(ch);

	/* Free the pointers themselves */
	free(ch->tablestate, M_IPFW);

	ipfw_table_value_destroy(ch, last);
	ipfw_table_algo_destroy(ch);

	ipfw_objhash_destroy(CHAIN_TO_NI(ch));
	free(CHAIN_TO_TCFG(ch), M_IPFW);
}

/*
 * Starts tables module.
*/ int ipfw_init_tables(struct ip_fw_chain *ch, int first) { struct tables_config *tcfg; /* Allocate pointers */ ch->tablestate = malloc(V_fw_tables_max * sizeof(struct table_info), M_IPFW, M_WAITOK | M_ZERO); tcfg = malloc(sizeof(struct tables_config), M_IPFW, M_WAITOK | M_ZERO); tcfg->namehash = ipfw_objhash_create(V_fw_tables_max); ch->tblcfg = tcfg; ipfw_table_value_init(ch, first); ipfw_table_algo_init(ch); IPFW_ADD_OBJ_REWRITER(first, opcodes); IPFW_ADD_SOPT_HANDLER(first, scodes); return (0); } Index: user/markj/netdump/sys/netpfil/pf/pf.c =================================================================== --- user/markj/netdump/sys/netpfil/pf/pf.c (revision 332407) +++ user/markj/netdump/sys/netpfil/pf/pf.c (revision 332408) @@ -1,6647 +1,6650 @@ /*- * SPDX-License-Identifier: BSD-2-Clause * * Copyright (c) 2001 Daniel Hartmeier * Copyright (c) 2002 - 2008 Henning Brauer * Copyright (c) 2012 Gleb Smirnoff * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * - Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials provided * with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN * ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. * * Effort sponsored in part by the Defense Advanced Research Projects * Agency (DARPA) and Air Force Research Laboratory, Air Force * Materiel Command, USAF, under agreement number F30602-01-2-0537. 
* * $OpenBSD: pf.c,v 1.634 2009/02/27 12:37:45 henning Exp $ */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_bpf.h" #include "opt_pf.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* XXX: only for DIR_IN/DIR_OUT */ #ifdef INET6 #include #include #include #include #include #include #include #endif /* INET6 */ #include #include #define DPFPRINTF(n, x) if (V_pf_status.debug >= (n)) printf x /* * Global variables */ /* state tables */ VNET_DEFINE(struct pf_altqqueue, pf_altqs[2]); VNET_DEFINE(struct pf_palist, pf_pabuf); VNET_DEFINE(struct pf_altqqueue *, pf_altqs_active); VNET_DEFINE(struct pf_altqqueue *, pf_altqs_inactive); VNET_DEFINE(struct pf_kstatus, pf_status); VNET_DEFINE(u_int32_t, ticket_altqs_active); VNET_DEFINE(u_int32_t, ticket_altqs_inactive); VNET_DEFINE(int, altqs_inactive_open); VNET_DEFINE(u_int32_t, ticket_pabuf); VNET_DEFINE(MD5_CTX, pf_tcp_secret_ctx); #define V_pf_tcp_secret_ctx VNET(pf_tcp_secret_ctx) VNET_DEFINE(u_char, pf_tcp_secret[16]); #define V_pf_tcp_secret VNET(pf_tcp_secret) VNET_DEFINE(int, pf_tcp_secret_init); #define V_pf_tcp_secret_init VNET(pf_tcp_secret_init) VNET_DEFINE(int, pf_tcp_iss_off); #define V_pf_tcp_iss_off VNET(pf_tcp_iss_off) VNET_DECLARE(int, pf_vnet_active); #define V_pf_vnet_active VNET(pf_vnet_active) static VNET_DEFINE(uint32_t, pf_purge_idx); #define V_pf_purge_idx VNET(pf_purge_idx) /* * Queue for pf_intr() sends. */ static MALLOC_DEFINE(M_PFTEMP, "pf_temp", "pf(4) temporary allocations"); struct pf_send_entry { STAILQ_ENTRY(pf_send_entry) pfse_next; struct mbuf *pfse_m; enum { PFSE_IP, PFSE_IP6, PFSE_ICMP, PFSE_ICMP6, } pfse_type; struct { int type; int code; int mtu; } icmpopts; }; STAILQ_HEAD(pf_send_head, pf_send_entry); static VNET_DEFINE(struct pf_send_head, pf_sendqueue); #define V_pf_sendqueue VNET(pf_sendqueue) static struct mtx pf_sendqueue_mtx; MTX_SYSINIT(pf_sendqueue_mtx, &pf_sendqueue_mtx, "pf send queue", MTX_DEF); #define PF_SENDQ_LOCK() mtx_lock(&pf_sendqueue_mtx) #define PF_SENDQ_UNLOCK() mtx_unlock(&pf_sendqueue_mtx) /* * Queue for pf_overload_task() tasks. 
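 *
 * Producers append entries under PF_OVERLOADQ_LOCK(); the task later
 * steals the whole list in one short critical section and works on it
 * unlocked, as pf_overload_task() does below:
 *
 *	PF_OVERLOADQ_LOCK();
 *	queue = V_pf_overloadqueue;		// steal the list head
 *	SLIST_INIT(&V_pf_overloadqueue);	// queue is empty again
 *	PF_OVERLOADQ_UNLOCK();
 *	// iterate 'queue' without the mutex held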
*/ struct pf_overload_entry { SLIST_ENTRY(pf_overload_entry) next; struct pf_addr addr; sa_family_t af; uint8_t dir; struct pf_rule *rule; }; SLIST_HEAD(pf_overload_head, pf_overload_entry); static VNET_DEFINE(struct pf_overload_head, pf_overloadqueue); #define V_pf_overloadqueue VNET(pf_overloadqueue) static VNET_DEFINE(struct task, pf_overloadtask); #define V_pf_overloadtask VNET(pf_overloadtask) static struct mtx pf_overloadqueue_mtx; MTX_SYSINIT(pf_overloadqueue_mtx, &pf_overloadqueue_mtx, "pf overload/flush queue", MTX_DEF); #define PF_OVERLOADQ_LOCK() mtx_lock(&pf_overloadqueue_mtx) #define PF_OVERLOADQ_UNLOCK() mtx_unlock(&pf_overloadqueue_mtx) VNET_DEFINE(struct pf_rulequeue, pf_unlinked_rules); struct mtx pf_unlnkdrules_mtx; MTX_SYSINIT(pf_unlnkdrules_mtx, &pf_unlnkdrules_mtx, "pf unlinked rules", MTX_DEF); static VNET_DEFINE(uma_zone_t, pf_sources_z); #define V_pf_sources_z VNET(pf_sources_z) uma_zone_t pf_mtag_z; VNET_DEFINE(uma_zone_t, pf_state_z); VNET_DEFINE(uma_zone_t, pf_state_key_z); VNET_DEFINE(uint64_t, pf_stateid[MAXCPU]); #define PFID_CPUBITS 8 #define PFID_CPUSHIFT (sizeof(uint64_t) * NBBY - PFID_CPUBITS) #define PFID_CPUMASK ((uint64_t)((1 << PFID_CPUBITS) - 1) << PFID_CPUSHIFT) #define PFID_MAXID (~PFID_CPUMASK) CTASSERT((1 << PFID_CPUBITS) >= MAXCPU); static void pf_src_tree_remove_state(struct pf_state *); static void pf_init_threshold(struct pf_threshold *, u_int32_t, u_int32_t); static void pf_add_threshold(struct pf_threshold *); static int pf_check_threshold(struct pf_threshold *); static void pf_change_ap(struct mbuf *, struct pf_addr *, u_int16_t *, u_int16_t *, u_int16_t *, struct pf_addr *, u_int16_t, u_int8_t, sa_family_t); static int pf_modulate_sack(struct mbuf *, int, struct pf_pdesc *, struct tcphdr *, struct pf_state_peer *); static void pf_change_icmp(struct pf_addr *, u_int16_t *, struct pf_addr *, struct pf_addr *, u_int16_t, u_int16_t *, u_int16_t *, u_int16_t *, u_int16_t *, u_int8_t, sa_family_t); static void pf_send_tcp(struct mbuf *, const struct pf_rule *, sa_family_t, const struct pf_addr *, const struct pf_addr *, u_int16_t, u_int16_t, u_int32_t, u_int32_t, u_int8_t, u_int16_t, u_int16_t, u_int8_t, int, u_int16_t, struct ifnet *); static void pf_send_icmp(struct mbuf *, u_int8_t, u_int8_t, sa_family_t, struct pf_rule *); static void pf_detach_state(struct pf_state *); static int pf_state_key_attach(struct pf_state_key *, struct pf_state_key *, struct pf_state *); static void pf_state_key_detach(struct pf_state *, int); static int pf_state_key_ctor(void *, int, void *, int); static u_int32_t pf_tcp_iss(struct pf_pdesc *); static int pf_test_rule(struct pf_rule **, struct pf_state **, int, struct pfi_kif *, struct mbuf *, int, struct pf_pdesc *, struct pf_rule **, struct pf_ruleset **, struct inpcb *); static int pf_create_state(struct pf_rule *, struct pf_rule *, struct pf_rule *, struct pf_pdesc *, struct pf_src_node *, struct pf_state_key *, struct pf_state_key *, struct mbuf *, int, u_int16_t, u_int16_t, int *, struct pfi_kif *, struct pf_state **, int, u_int16_t, u_int16_t, int); static int pf_test_fragment(struct pf_rule **, int, struct pfi_kif *, struct mbuf *, void *, struct pf_pdesc *, struct pf_rule **, struct pf_ruleset **); static int pf_tcp_track_full(struct pf_state_peer *, struct pf_state_peer *, struct pf_state **, struct pfi_kif *, struct mbuf *, int, struct pf_pdesc *, u_short *, int *); static int pf_tcp_track_sloppy(struct pf_state_peer *, struct pf_state_peer *, struct pf_state **, struct pf_pdesc *, u_short *); static 
int pf_test_state_tcp(struct pf_state **, int, struct pfi_kif *, struct mbuf *, int, void *, struct pf_pdesc *, u_short *); static int pf_test_state_udp(struct pf_state **, int, struct pfi_kif *, struct mbuf *, int, void *, struct pf_pdesc *); static int pf_test_state_icmp(struct pf_state **, int, struct pfi_kif *, struct mbuf *, int, void *, struct pf_pdesc *, u_short *); static int pf_test_state_other(struct pf_state **, int, struct pfi_kif *, struct mbuf *, struct pf_pdesc *); static u_int8_t pf_get_wscale(struct mbuf *, int, u_int16_t, sa_family_t); static u_int16_t pf_get_mss(struct mbuf *, int, u_int16_t, sa_family_t); static u_int16_t pf_calc_mss(struct pf_addr *, sa_family_t, int, u_int16_t); static int pf_check_proto_cksum(struct mbuf *, int, int, u_int8_t, sa_family_t); static void pf_print_state_parts(struct pf_state *, struct pf_state_key *, struct pf_state_key *); static int pf_addr_wrap_neq(struct pf_addr_wrap *, struct pf_addr_wrap *); static struct pf_state *pf_find_state(struct pfi_kif *, struct pf_state_key_cmp *, u_int); static int pf_src_connlimit(struct pf_state **); static void pf_overload_task(void *v, int pending); static int pf_insert_src_node(struct pf_src_node **, struct pf_rule *, struct pf_addr *, sa_family_t); static u_int pf_purge_expired_states(u_int, int); static void pf_purge_unlinked_rules(void); static int pf_mtag_uminit(void *, int, int); static void pf_mtag_free(struct m_tag *); #ifdef INET static void pf_route(struct mbuf **, struct pf_rule *, int, struct ifnet *, struct pf_state *, struct pf_pdesc *); #endif /* INET */ #ifdef INET6 static void pf_change_a6(struct pf_addr *, u_int16_t *, struct pf_addr *, u_int8_t); static void pf_route6(struct mbuf **, struct pf_rule *, int, struct ifnet *, struct pf_state *, struct pf_pdesc *); #endif /* INET6 */ int in4_cksum(struct mbuf *m, u_int8_t nxt, int off, int len); extern int pf_end_threads; extern struct proc *pf_purge_proc; VNET_DEFINE(struct pf_limit, pf_limits[PF_LIMIT_MAX]); #define PACKET_LOOPED(pd) ((pd)->pf_mtag && \ (pd)->pf_mtag->flags & PF_PACKET_LOOPED) #define STATE_LOOKUP(i, k, d, s, pd) \ do { \ (s) = pf_find_state((i), (k), (d)); \ if ((s) == NULL) \ return (PF_DROP); \ if (PACKET_LOOPED(pd)) \ return (PF_PASS); \ if ((d) == PF_OUT && \ (((s)->rule.ptr->rt == PF_ROUTETO && \ (s)->rule.ptr->direction == PF_OUT) || \ ((s)->rule.ptr->rt == PF_REPLYTO && \ (s)->rule.ptr->direction == PF_IN)) && \ (s)->rt_kif != NULL && \ (s)->rt_kif != (i)) \ return (PF_PASS); \ } while (0) #define BOUND_IFACE(r, k) \ ((r)->rule_flag & PFRULE_IFBOUND) ? 
(k) : V_pfi_all #define STATE_INC_COUNTERS(s) \ do { \ counter_u64_add(s->rule.ptr->states_cur, 1); \ counter_u64_add(s->rule.ptr->states_tot, 1); \ if (s->anchor.ptr != NULL) { \ counter_u64_add(s->anchor.ptr->states_cur, 1); \ counter_u64_add(s->anchor.ptr->states_tot, 1); \ } \ if (s->nat_rule.ptr != NULL) { \ counter_u64_add(s->nat_rule.ptr->states_cur, 1);\ counter_u64_add(s->nat_rule.ptr->states_tot, 1);\ } \ } while (0) #define STATE_DEC_COUNTERS(s) \ do { \ if (s->nat_rule.ptr != NULL) \ counter_u64_add(s->nat_rule.ptr->states_cur, -1);\ if (s->anchor.ptr != NULL) \ counter_u64_add(s->anchor.ptr->states_cur, -1); \ counter_u64_add(s->rule.ptr->states_cur, -1); \ } while (0) static MALLOC_DEFINE(M_PFHASH, "pf_hash", "pf(4) hash header structures"); VNET_DEFINE(struct pf_keyhash *, pf_keyhash); VNET_DEFINE(struct pf_idhash *, pf_idhash); VNET_DEFINE(struct pf_srchash *, pf_srchash); SYSCTL_NODE(_net, OID_AUTO, pf, CTLFLAG_RW, 0, "pf(4)"); u_long pf_hashmask; u_long pf_srchashmask; static u_long pf_hashsize; static u_long pf_srchashsize; +u_long pf_ioctl_maxcount = 65535; SYSCTL_ULONG(_net_pf, OID_AUTO, states_hashsize, CTLFLAG_RDTUN, &pf_hashsize, 0, "Size of pf(4) states hashtable"); SYSCTL_ULONG(_net_pf, OID_AUTO, source_nodes_hashsize, CTLFLAG_RDTUN, &pf_srchashsize, 0, "Size of pf(4) source nodes hashtable"); +SYSCTL_ULONG(_net_pf, OID_AUTO, request_maxcount, CTLFLAG_RDTUN, + &pf_ioctl_maxcount, 0, "Maximum number of tables, addresses, ... in a single ioctl() call"); VNET_DEFINE(void *, pf_swi_cookie); VNET_DEFINE(uint32_t, pf_hashseed); #define V_pf_hashseed VNET(pf_hashseed) int pf_addr_cmp(struct pf_addr *a, struct pf_addr *b, sa_family_t af) { switch (af) { #ifdef INET case AF_INET: if (a->addr32[0] > b->addr32[0]) return (1); if (a->addr32[0] < b->addr32[0]) return (-1); break; #endif /* INET */ #ifdef INET6 case AF_INET6: if (a->addr32[3] > b->addr32[3]) return (1); if (a->addr32[3] < b->addr32[3]) return (-1); if (a->addr32[2] > b->addr32[2]) return (1); if (a->addr32[2] < b->addr32[2]) return (-1); if (a->addr32[1] > b->addr32[1]) return (1); if (a->addr32[1] < b->addr32[1]) return (-1); if (a->addr32[0] > b->addr32[0]) return (1); if (a->addr32[0] < b->addr32[0]) return (-1); break; #endif /* INET6 */ default: panic("%s: unknown address family %u", __func__, af); } return (0); } static __inline uint32_t pf_hashkey(struct pf_state_key *sk) { uint32_t h; h = murmur3_32_hash32((uint32_t *)sk, sizeof(struct pf_state_key_cmp)/sizeof(uint32_t), V_pf_hashseed); return (h & pf_hashmask); } static __inline uint32_t pf_hashsrc(struct pf_addr *addr, sa_family_t af) { uint32_t h; switch (af) { case AF_INET: h = murmur3_32_hash32((uint32_t *)&addr->v4, sizeof(addr->v4)/sizeof(uint32_t), V_pf_hashseed); break; case AF_INET6: h = murmur3_32_hash32((uint32_t *)&addr->v6, sizeof(addr->v6)/sizeof(uint32_t), V_pf_hashseed); break; default: panic("%s: unknown address family %u", __func__, af); } return (h & pf_srchashmask); } #ifdef ALTQ static int pf_state_hash(struct pf_state *s) { u_int32_t hv = (intptr_t)s / sizeof(*s); hv ^= crc32(&s->src, sizeof(s->src)); hv ^= crc32(&s->dst, sizeof(s->dst)); if (hv == 0) hv = 1; return (hv); } #endif #ifdef INET6 void pf_addrcpy(struct pf_addr *dst, struct pf_addr *src, sa_family_t af) { switch (af) { #ifdef INET case AF_INET: dst->addr32[0] = src->addr32[0]; break; #endif /* INET */ case AF_INET6: dst->addr32[0] = src->addr32[0]; dst->addr32[1] = src->addr32[1]; dst->addr32[2] = src->addr32[2]; dst->addr32[3] = src->addr32[3]; break; } } #endif /* 
INET6 */ static void pf_init_threshold(struct pf_threshold *threshold, u_int32_t limit, u_int32_t seconds) { threshold->limit = limit * PF_THRESHOLD_MULT; threshold->seconds = seconds; threshold->count = 0; threshold->last = time_uptime; } static void pf_add_threshold(struct pf_threshold *threshold) { u_int32_t t = time_uptime, diff = t - threshold->last; if (diff >= threshold->seconds) threshold->count = 0; else threshold->count -= threshold->count * diff / threshold->seconds; threshold->count += PF_THRESHOLD_MULT; threshold->last = t; } static int pf_check_threshold(struct pf_threshold *threshold) { return (threshold->count > threshold->limit); } static int pf_src_connlimit(struct pf_state **state) { struct pf_overload_entry *pfoe; int bad = 0; PF_STATE_LOCK_ASSERT(*state); (*state)->src_node->conn++; (*state)->src.tcp_est = 1; pf_add_threshold(&(*state)->src_node->conn_rate); if ((*state)->rule.ptr->max_src_conn && (*state)->rule.ptr->max_src_conn < (*state)->src_node->conn) { counter_u64_add(V_pf_status.lcounters[LCNT_SRCCONN], 1); bad++; } if ((*state)->rule.ptr->max_src_conn_rate.limit && pf_check_threshold(&(*state)->src_node->conn_rate)) { counter_u64_add(V_pf_status.lcounters[LCNT_SRCCONNRATE], 1); bad++; } if (!bad) return (0); /* Kill this state. */ (*state)->timeout = PFTM_PURGE; (*state)->src.state = (*state)->dst.state = TCPS_CLOSED; if ((*state)->rule.ptr->overload_tbl == NULL) return (1); /* Schedule overloading and flushing task. */ pfoe = malloc(sizeof(*pfoe), M_PFTEMP, M_NOWAIT); if (pfoe == NULL) return (1); /* too bad :( */ bcopy(&(*state)->src_node->addr, &pfoe->addr, sizeof(pfoe->addr)); pfoe->af = (*state)->key[PF_SK_WIRE]->af; pfoe->rule = (*state)->rule.ptr; pfoe->dir = (*state)->direction; PF_OVERLOADQ_LOCK(); SLIST_INSERT_HEAD(&V_pf_overloadqueue, pfoe, next); PF_OVERLOADQ_UNLOCK(); taskqueue_enqueue(taskqueue_swi, &V_pf_overloadtask); return (1); } static void pf_overload_task(void *v, int pending) { struct pf_overload_head queue; struct pfr_addr p; struct pf_overload_entry *pfoe, *pfoe1; uint32_t killed = 0; CURVNET_SET((struct vnet *)v); PF_OVERLOADQ_LOCK(); queue = V_pf_overloadqueue; SLIST_INIT(&V_pf_overloadqueue); PF_OVERLOADQ_UNLOCK(); bzero(&p, sizeof(p)); SLIST_FOREACH(pfoe, &queue, next) { counter_u64_add(V_pf_status.lcounters[LCNT_OVERLOAD_TABLE], 1); if (V_pf_status.debug >= PF_DEBUG_MISC) { printf("%s: blocking address ", __func__); pf_print_host(&pfoe->addr, 0, pfoe->af); printf("\n"); } p.pfra_af = pfoe->af; switch (pfoe->af) { #ifdef INET case AF_INET: p.pfra_net = 32; p.pfra_ip4addr = pfoe->addr.v4; break; #endif #ifdef INET6 case AF_INET6: p.pfra_net = 128; p.pfra_ip6addr = pfoe->addr.v6; break; #endif } PF_RULES_WLOCK(); pfr_insert_kentry(pfoe->rule->overload_tbl, &p, time_second); PF_RULES_WUNLOCK(); } /* * Remove those entries, that don't need flushing. */ SLIST_FOREACH_SAFE(pfoe, &queue, next, pfoe1) if (pfoe->rule->flush == 0) { SLIST_REMOVE(&queue, pfoe, pf_overload_entry, next); free(pfoe, M_PFTEMP); } else counter_u64_add( V_pf_status.lcounters[LCNT_OVERLOAD_FLUSH], 1); /* If nothing to flush, return. 
*/ if (SLIST_EMPTY(&queue)) { CURVNET_RESTORE(); return; } for (int i = 0; i <= pf_hashmask; i++) { struct pf_idhash *ih = &V_pf_idhash[i]; struct pf_state_key *sk; struct pf_state *s; PF_HASHROW_LOCK(ih); LIST_FOREACH(s, &ih->states, entry) { sk = s->key[PF_SK_WIRE]; SLIST_FOREACH(pfoe, &queue, next) if (sk->af == pfoe->af && ((pfoe->rule->flush & PF_FLUSH_GLOBAL) || pfoe->rule == s->rule.ptr) && ((pfoe->dir == PF_OUT && PF_AEQ(&pfoe->addr, &sk->addr[1], sk->af)) || (pfoe->dir == PF_IN && PF_AEQ(&pfoe->addr, &sk->addr[0], sk->af)))) { s->timeout = PFTM_PURGE; s->src.state = s->dst.state = TCPS_CLOSED; killed++; } } PF_HASHROW_UNLOCK(ih); } SLIST_FOREACH_SAFE(pfoe, &queue, next, pfoe1) free(pfoe, M_PFTEMP); if (V_pf_status.debug >= PF_DEBUG_MISC) printf("%s: %u states killed", __func__, killed); CURVNET_RESTORE(); } /* * Can return locked on failure, so that we can consistently * allocate and insert a new one. */ struct pf_src_node * pf_find_src_node(struct pf_addr *src, struct pf_rule *rule, sa_family_t af, int returnlocked) { struct pf_srchash *sh; struct pf_src_node *n; counter_u64_add(V_pf_status.scounters[SCNT_SRC_NODE_SEARCH], 1); sh = &V_pf_srchash[pf_hashsrc(src, af)]; PF_HASHROW_LOCK(sh); LIST_FOREACH(n, &sh->nodes, entry) if (n->rule.ptr == rule && n->af == af && ((af == AF_INET && n->addr.v4.s_addr == src->v4.s_addr) || (af == AF_INET6 && bcmp(&n->addr, src, sizeof(*src)) == 0))) break; if (n != NULL) { n->states++; PF_HASHROW_UNLOCK(sh); } else if (returnlocked == 0) PF_HASHROW_UNLOCK(sh); return (n); } static int pf_insert_src_node(struct pf_src_node **sn, struct pf_rule *rule, struct pf_addr *src, sa_family_t af) { KASSERT((rule->rule_flag & PFRULE_RULESRCTRACK || rule->rpool.opts & PF_POOL_STICKYADDR), ("%s for non-tracking rule %p", __func__, rule)); if (*sn == NULL) *sn = pf_find_src_node(src, rule, af, 1); if (*sn == NULL) { struct pf_srchash *sh = &V_pf_srchash[pf_hashsrc(src, af)]; PF_HASHROW_ASSERT(sh); if (!rule->max_src_nodes || counter_u64_fetch(rule->src_nodes) < rule->max_src_nodes) (*sn) = uma_zalloc(V_pf_sources_z, M_NOWAIT | M_ZERO); else counter_u64_add(V_pf_status.lcounters[LCNT_SRCNODES], 1); if ((*sn) == NULL) { PF_HASHROW_UNLOCK(sh); return (-1); } pf_init_threshold(&(*sn)->conn_rate, rule->max_src_conn_rate.limit, rule->max_src_conn_rate.seconds); (*sn)->af = af; (*sn)->rule.ptr = rule; PF_ACPY(&(*sn)->addr, src, af); LIST_INSERT_HEAD(&sh->nodes, *sn, entry); (*sn)->creation = time_uptime; (*sn)->ruletype = rule->action; (*sn)->states = 1; if ((*sn)->rule.ptr != NULL) counter_u64_add((*sn)->rule.ptr->src_nodes, 1); PF_HASHROW_UNLOCK(sh); counter_u64_add(V_pf_status.scounters[SCNT_SRC_NODE_INSERT], 1); } else { if (rule->max_src_states && (*sn)->states >= rule->max_src_states) { counter_u64_add(V_pf_status.lcounters[LCNT_SRCSTATES], 1); return (-1); } } return (0); } void pf_unlink_src_node(struct pf_src_node *src) { PF_HASHROW_ASSERT(&V_pf_srchash[pf_hashsrc(&src->addr, src->af)]); LIST_REMOVE(src, entry); if (src->rule.ptr) counter_u64_add(src->rule.ptr->src_nodes, -1); } u_int pf_free_src_nodes(struct pf_src_node_list *head) { struct pf_src_node *sn, *tmp; u_int count = 0; LIST_FOREACH_SAFE(sn, head, entry, tmp) { uma_zfree(V_pf_sources_z, sn); count++; } counter_u64_add(V_pf_status.scounters[SCNT_SRC_NODE_REMOVALS], count); return (count); } void pf_mtag_initialize() { pf_mtag_z = uma_zcreate("pf mtags", sizeof(struct m_tag) + sizeof(struct pf_mtag), NULL, NULL, pf_mtag_uminit, NULL, UMA_ALIGN_PTR, 0); } /* Per-vnet data storage structures 
initialization. */ void pf_initialize() { struct pf_keyhash *kh; struct pf_idhash *ih; struct pf_srchash *sh; u_int i; if (pf_hashsize == 0 || !powerof2(pf_hashsize)) pf_hashsize = PF_HASHSIZ; if (pf_srchashsize == 0 || !powerof2(pf_srchashsize)) pf_srchashsize = PF_SRCHASHSIZ; V_pf_hashseed = arc4random(); /* States and state keys storage. */ V_pf_state_z = uma_zcreate("pf states", sizeof(struct pf_state), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); V_pf_limits[PF_LIMIT_STATES].zone = V_pf_state_z; uma_zone_set_max(V_pf_state_z, PFSTATE_HIWAT); uma_zone_set_warning(V_pf_state_z, "PF states limit reached"); V_pf_state_key_z = uma_zcreate("pf state keys", sizeof(struct pf_state_key), pf_state_key_ctor, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); V_pf_keyhash = mallocarray(pf_hashsize, sizeof(struct pf_keyhash), M_PFHASH, M_NOWAIT | M_ZERO); V_pf_idhash = mallocarray(pf_hashsize, sizeof(struct pf_idhash), M_PFHASH, M_NOWAIT | M_ZERO); if (V_pf_keyhash == NULL || V_pf_idhash == NULL) { printf("pf: Unable to allocate memory for " "state_hashsize %lu.\n", pf_hashsize); free(V_pf_keyhash, M_PFHASH); free(V_pf_idhash, M_PFHASH); pf_hashsize = PF_HASHSIZ; V_pf_keyhash = mallocarray(pf_hashsize, sizeof(struct pf_keyhash), M_PFHASH, M_WAITOK | M_ZERO); V_pf_idhash = mallocarray(pf_hashsize, sizeof(struct pf_idhash), M_PFHASH, M_WAITOK | M_ZERO); } pf_hashmask = pf_hashsize - 1; for (i = 0, kh = V_pf_keyhash, ih = V_pf_idhash; i <= pf_hashmask; i++, kh++, ih++) { mtx_init(&kh->lock, "pf_keyhash", NULL, MTX_DEF | MTX_DUPOK); mtx_init(&ih->lock, "pf_idhash", NULL, MTX_DEF); } /* Source nodes. */ V_pf_sources_z = uma_zcreate("pf source nodes", sizeof(struct pf_src_node), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); V_pf_limits[PF_LIMIT_SRC_NODES].zone = V_pf_sources_z; uma_zone_set_max(V_pf_sources_z, PFSNODE_HIWAT); uma_zone_set_warning(V_pf_sources_z, "PF source nodes limit reached"); V_pf_srchash = mallocarray(pf_srchashsize, sizeof(struct pf_srchash), M_PFHASH, M_NOWAIT | M_ZERO); if (V_pf_srchash == NULL) { printf("pf: Unable to allocate memory for " "source_hashsize %lu.\n", pf_srchashsize); pf_srchashsize = PF_SRCHASHSIZ; V_pf_srchash = mallocarray(pf_srchashsize, sizeof(struct pf_srchash), M_PFHASH, M_WAITOK | M_ZERO); } pf_srchashmask = pf_srchashsize - 1; for (i = 0, sh = V_pf_srchash; i <= pf_srchashmask; i++, sh++) mtx_init(&sh->lock, "pf_srchash", NULL, MTX_DEF); /* ALTQ */ TAILQ_INIT(&V_pf_altqs[0]); TAILQ_INIT(&V_pf_altqs[1]); TAILQ_INIT(&V_pf_pabuf); V_pf_altqs_active = &V_pf_altqs[0]; V_pf_altqs_inactive = &V_pf_altqs[1]; /* Send & overload+flush queues. */ STAILQ_INIT(&V_pf_sendqueue); SLIST_INIT(&V_pf_overloadqueue); TASK_INIT(&V_pf_overloadtask, 0, pf_overload_task, curvnet); /* Unlinked, but may be referenced rules. 
*/ TAILQ_INIT(&V_pf_unlinked_rules); } void pf_mtag_cleanup() { uma_zdestroy(pf_mtag_z); } void pf_cleanup() { struct pf_keyhash *kh; struct pf_idhash *ih; struct pf_srchash *sh; struct pf_send_entry *pfse, *next; u_int i; for (i = 0, kh = V_pf_keyhash, ih = V_pf_idhash; i <= pf_hashmask; i++, kh++, ih++) { KASSERT(LIST_EMPTY(&kh->keys), ("%s: key hash not empty", __func__)); KASSERT(LIST_EMPTY(&ih->states), ("%s: id hash not empty", __func__)); mtx_destroy(&kh->lock); mtx_destroy(&ih->lock); } free(V_pf_keyhash, M_PFHASH); free(V_pf_idhash, M_PFHASH); for (i = 0, sh = V_pf_srchash; i <= pf_srchashmask; i++, sh++) { KASSERT(LIST_EMPTY(&sh->nodes), ("%s: source node hash not empty", __func__)); mtx_destroy(&sh->lock); } free(V_pf_srchash, M_PFHASH); STAILQ_FOREACH_SAFE(pfse, &V_pf_sendqueue, pfse_next, next) { m_freem(pfse->pfse_m); free(pfse, M_PFTEMP); } uma_zdestroy(V_pf_sources_z); uma_zdestroy(V_pf_state_z); uma_zdestroy(V_pf_state_key_z); } static int pf_mtag_uminit(void *mem, int size, int how) { struct m_tag *t; t = (struct m_tag *)mem; t->m_tag_cookie = MTAG_ABI_COMPAT; t->m_tag_id = PACKET_TAG_PF; t->m_tag_len = sizeof(struct pf_mtag); t->m_tag_free = pf_mtag_free; return (0); } static void pf_mtag_free(struct m_tag *t) { uma_zfree(pf_mtag_z, t); } struct pf_mtag * pf_get_mtag(struct mbuf *m) { struct m_tag *mtag; if ((mtag = m_tag_find(m, PACKET_TAG_PF, NULL)) != NULL) return ((struct pf_mtag *)(mtag + 1)); mtag = uma_zalloc(pf_mtag_z, M_NOWAIT); if (mtag == NULL) return (NULL); bzero(mtag + 1, sizeof(struct pf_mtag)); m_tag_prepend(m, mtag); return ((struct pf_mtag *)(mtag + 1)); } static int pf_state_key_attach(struct pf_state_key *skw, struct pf_state_key *sks, struct pf_state *s) { struct pf_keyhash *khs, *khw, *kh; struct pf_state_key *sk, *cur; struct pf_state *si, *olds = NULL; int idx; KASSERT(s->refs == 0, ("%s: state not pristine", __func__)); KASSERT(s->key[PF_SK_WIRE] == NULL, ("%s: state has key", __func__)); KASSERT(s->key[PF_SK_STACK] == NULL, ("%s: state has key", __func__)); /* * We need to lock hash slots of both keys. To avoid deadlock * we always lock the slot with lower address first. Unlock order * isn't important. * * We also need to lock ID hash slot before dropping key * locks. On success we return with ID hash slot locked. */ if (skw == sks) { khs = khw = &V_pf_keyhash[pf_hashkey(skw)]; PF_HASHROW_LOCK(khs); } else { khs = &V_pf_keyhash[pf_hashkey(sks)]; khw = &V_pf_keyhash[pf_hashkey(skw)]; if (khs == khw) { PF_HASHROW_LOCK(khs); } else if (khs < khw) { PF_HASHROW_LOCK(khs); PF_HASHROW_LOCK(khw); } else { PF_HASHROW_LOCK(khw); PF_HASHROW_LOCK(khs); } } #define KEYS_UNLOCK() do { \ if (khs != khw) { \ PF_HASHROW_UNLOCK(khs); \ PF_HASHROW_UNLOCK(khw); \ } else \ PF_HASHROW_UNLOCK(khs); \ } while (0) /* * First run: start with wire key. */ sk = skw; kh = khw; idx = PF_SK_WIRE; keyattach: LIST_FOREACH(cur, &kh->keys, entry) if (bcmp(cur, sk, sizeof(struct pf_state_key_cmp)) == 0) break; if (cur != NULL) { /* Key exists. Check for same kif, if none, add to key. */ TAILQ_FOREACH(si, &cur->states[idx], key_list[idx]) { struct pf_idhash *ih = &V_pf_idhash[PF_IDHASH(si)]; PF_HASHROW_LOCK(ih); if (si->kif == s->kif && si->direction == s->direction) { if (sk->proto == IPPROTO_TCP && si->src.state >= TCPS_FIN_WAIT_2 && si->dst.state >= TCPS_FIN_WAIT_2) { /* * New state matches an old >FIN_WAIT_2 * state. We can't drop key hash locks, * thus we can't unlink it properly. 
* * As a workaround we drop it into * TCPS_CLOSED state, schedule purge * ASAP and push it into the very end * of the slot TAILQ, so that it won't * conflict with our new state. */ si->src.state = si->dst.state = TCPS_CLOSED; si->timeout = PFTM_PURGE; olds = si; } else { if (V_pf_status.debug >= PF_DEBUG_MISC) { printf("pf: %s key attach " "failed on %s: ", (idx == PF_SK_WIRE) ? "wire" : "stack", s->kif->pfik_name); pf_print_state_parts(s, (idx == PF_SK_WIRE) ? sk : NULL, (idx == PF_SK_STACK) ? sk : NULL); printf(", existing: "); pf_print_state_parts(si, (idx == PF_SK_WIRE) ? sk : NULL, (idx == PF_SK_STACK) ? sk : NULL); printf("\n"); } PF_HASHROW_UNLOCK(ih); KEYS_UNLOCK(); uma_zfree(V_pf_state_key_z, sk); if (idx == PF_SK_STACK) pf_detach_state(s); return (EEXIST); /* collision! */ } } PF_HASHROW_UNLOCK(ih); } uma_zfree(V_pf_state_key_z, sk); s->key[idx] = cur; } else { LIST_INSERT_HEAD(&kh->keys, sk, entry); s->key[idx] = sk; } stateattach: /* List is sorted, if-bound states before floating. */ if (s->kif == V_pfi_all) TAILQ_INSERT_TAIL(&s->key[idx]->states[idx], s, key_list[idx]); else TAILQ_INSERT_HEAD(&s->key[idx]->states[idx], s, key_list[idx]); if (olds) { TAILQ_REMOVE(&s->key[idx]->states[idx], olds, key_list[idx]); TAILQ_INSERT_TAIL(&s->key[idx]->states[idx], olds, key_list[idx]); olds = NULL; } /* * Attach done. See how should we (or should not?) * attach a second key. */ if (sks == skw) { s->key[PF_SK_STACK] = s->key[PF_SK_WIRE]; idx = PF_SK_STACK; sks = NULL; goto stateattach; } else if (sks != NULL) { /* * Continue attaching with stack key. */ sk = sks; kh = khs; idx = PF_SK_STACK; sks = NULL; goto keyattach; } PF_STATE_LOCK(s); KEYS_UNLOCK(); KASSERT(s->key[PF_SK_WIRE] != NULL && s->key[PF_SK_STACK] != NULL, ("%s failure", __func__)); return (0); #undef KEYS_UNLOCK } static void pf_detach_state(struct pf_state *s) { struct pf_state_key *sks = s->key[PF_SK_STACK]; struct pf_keyhash *kh; if (sks != NULL) { kh = &V_pf_keyhash[pf_hashkey(sks)]; PF_HASHROW_LOCK(kh); if (s->key[PF_SK_STACK] != NULL) pf_state_key_detach(s, PF_SK_STACK); /* * If both point to same key, then we are done. 
*/ if (sks == s->key[PF_SK_WIRE]) { pf_state_key_detach(s, PF_SK_WIRE); PF_HASHROW_UNLOCK(kh); return; } PF_HASHROW_UNLOCK(kh); } if (s->key[PF_SK_WIRE] != NULL) { kh = &V_pf_keyhash[pf_hashkey(s->key[PF_SK_WIRE])]; PF_HASHROW_LOCK(kh); if (s->key[PF_SK_WIRE] != NULL) pf_state_key_detach(s, PF_SK_WIRE); PF_HASHROW_UNLOCK(kh); } } static void pf_state_key_detach(struct pf_state *s, int idx) { struct pf_state_key *sk = s->key[idx]; #ifdef INVARIANTS struct pf_keyhash *kh = &V_pf_keyhash[pf_hashkey(sk)]; PF_HASHROW_ASSERT(kh); #endif TAILQ_REMOVE(&sk->states[idx], s, key_list[idx]); s->key[idx] = NULL; if (TAILQ_EMPTY(&sk->states[0]) && TAILQ_EMPTY(&sk->states[1])) { LIST_REMOVE(sk, entry); uma_zfree(V_pf_state_key_z, sk); } } static int pf_state_key_ctor(void *mem, int size, void *arg, int flags) { struct pf_state_key *sk = mem; bzero(sk, sizeof(struct pf_state_key_cmp)); TAILQ_INIT(&sk->states[PF_SK_WIRE]); TAILQ_INIT(&sk->states[PF_SK_STACK]); return (0); } struct pf_state_key * pf_state_key_setup(struct pf_pdesc *pd, struct pf_addr *saddr, struct pf_addr *daddr, u_int16_t sport, u_int16_t dport) { struct pf_state_key *sk; sk = uma_zalloc(V_pf_state_key_z, M_NOWAIT); if (sk == NULL) return (NULL); PF_ACPY(&sk->addr[pd->sidx], saddr, pd->af); PF_ACPY(&sk->addr[pd->didx], daddr, pd->af); sk->port[pd->sidx] = sport; sk->port[pd->didx] = dport; sk->proto = pd->proto; sk->af = pd->af; return (sk); } struct pf_state_key * pf_state_key_clone(struct pf_state_key *orig) { struct pf_state_key *sk; sk = uma_zalloc(V_pf_state_key_z, M_NOWAIT); if (sk == NULL) return (NULL); bcopy(orig, sk, sizeof(struct pf_state_key_cmp)); return (sk); } int pf_state_insert(struct pfi_kif *kif, struct pf_state_key *skw, struct pf_state_key *sks, struct pf_state *s) { struct pf_idhash *ih; struct pf_state *cur; int error; KASSERT(TAILQ_EMPTY(&sks->states[0]) && TAILQ_EMPTY(&sks->states[1]), ("%s: sks not pristine", __func__)); KASSERT(TAILQ_EMPTY(&skw->states[0]) && TAILQ_EMPTY(&skw->states[1]), ("%s: skw not pristine", __func__)); KASSERT(s->refs == 0, ("%s: state not pristine", __func__)); s->kif = kif; if (s->id == 0 && s->creatorid == 0) { /* XXX: should be atomic, but probability of collision low */ if ((s->id = V_pf_stateid[curcpu]++) == PFID_MAXID) V_pf_stateid[curcpu] = 1; s->id |= (uint64_t )curcpu << PFID_CPUSHIFT; s->id = htobe64(s->id); s->creatorid = V_pf_status.hostid; } /* Returns with ID locked on success. */ if ((error = pf_state_key_attach(skw, sks, s)) != 0) return (error); ih = &V_pf_idhash[PF_IDHASH(s)]; PF_HASHROW_ASSERT(ih); LIST_FOREACH(cur, &ih->states, entry) if (cur->id == s->id && cur->creatorid == s->creatorid) break; if (cur != NULL) { PF_HASHROW_UNLOCK(ih); if (V_pf_status.debug >= PF_DEBUG_MISC) { printf("pf: state ID collision: " "id: %016llx creatorid: %08x\n", (unsigned long long)be64toh(s->id), ntohl(s->creatorid)); } pf_detach_state(s); return (EEXIST); } LIST_INSERT_HEAD(&ih->states, s, entry); /* One for keys, one for ID hash. */ refcount_init(&s->refs, 2); counter_u64_add(V_pf_status.fcounters[FCNT_STATE_INSERT], 1); if (pfsync_insert_state_ptr != NULL) pfsync_insert_state_ptr(s); /* Returns locked. */ return (0); } /* * Find state by ID: returns with locked row on success. 
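 *
 * The row is derived directly from the ID: be64toh(id) % (pf_hashmask + 1).
 * IDs are minted per-CPU in pf_state_insert() with the CPU number in the
 * top PFID_CPUBITS bits, so a hedged decode of an ID looks like:
 *
 *	uint64_t hid = be64toh(s->id);
 *	u_int cpu = (hid & PFID_CPUMASK) >> PFID_CPUSHIFT;
 *	uint64_t seq = hid & PFID_MAXID;	// per-CPU counter value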
*/ struct pf_state * pf_find_state_byid(uint64_t id, uint32_t creatorid) { struct pf_idhash *ih; struct pf_state *s; counter_u64_add(V_pf_status.fcounters[FCNT_STATE_SEARCH], 1); ih = &V_pf_idhash[(be64toh(id) % (pf_hashmask + 1))]; PF_HASHROW_LOCK(ih); LIST_FOREACH(s, &ih->states, entry) if (s->id == id && s->creatorid == creatorid) break; if (s == NULL) PF_HASHROW_UNLOCK(ih); return (s); } /* * Find state by key. * Returns with ID hash slot locked on success. */ static struct pf_state * pf_find_state(struct pfi_kif *kif, struct pf_state_key_cmp *key, u_int dir) { struct pf_keyhash *kh; struct pf_state_key *sk; struct pf_state *s; int idx; counter_u64_add(V_pf_status.fcounters[FCNT_STATE_SEARCH], 1); kh = &V_pf_keyhash[pf_hashkey((struct pf_state_key *)key)]; PF_HASHROW_LOCK(kh); LIST_FOREACH(sk, &kh->keys, entry) if (bcmp(sk, key, sizeof(struct pf_state_key_cmp)) == 0) break; if (sk == NULL) { PF_HASHROW_UNLOCK(kh); return (NULL); } idx = (dir == PF_IN ? PF_SK_WIRE : PF_SK_STACK); /* List is sorted, if-bound states before floating ones. */ TAILQ_FOREACH(s, &sk->states[idx], key_list[idx]) if (s->kif == V_pfi_all || s->kif == kif) { PF_STATE_LOCK(s); PF_HASHROW_UNLOCK(kh); if (s->timeout >= PFTM_MAX) { /* * State is either being processed by * pf_unlink_state() in an other thread, or * is scheduled for immediate expiry. */ PF_STATE_UNLOCK(s); return (NULL); } return (s); } PF_HASHROW_UNLOCK(kh); return (NULL); } struct pf_state * pf_find_state_all(struct pf_state_key_cmp *key, u_int dir, int *more) { struct pf_keyhash *kh; struct pf_state_key *sk; struct pf_state *s, *ret = NULL; int idx, inout = 0; counter_u64_add(V_pf_status.fcounters[FCNT_STATE_SEARCH], 1); kh = &V_pf_keyhash[pf_hashkey((struct pf_state_key *)key)]; PF_HASHROW_LOCK(kh); LIST_FOREACH(sk, &kh->keys, entry) if (bcmp(sk, key, sizeof(struct pf_state_key_cmp)) == 0) break; if (sk == NULL) { PF_HASHROW_UNLOCK(kh); return (NULL); } switch (dir) { case PF_IN: idx = PF_SK_WIRE; break; case PF_OUT: idx = PF_SK_STACK; break; case PF_INOUT: idx = PF_SK_WIRE; inout = 1; break; default: panic("%s: dir %u", __func__, dir); } second_run: TAILQ_FOREACH(s, &sk->states[idx], key_list[idx]) { if (more == NULL) { PF_HASHROW_UNLOCK(kh); return (s); } if (ret) (*more)++; else ret = s; } if (inout == 1) { inout = 0; idx = PF_SK_STACK; goto second_run; } PF_HASHROW_UNLOCK(kh); return (ret); } /* END state table stuff */ static void pf_send(struct pf_send_entry *pfse) { PF_SENDQ_LOCK(); STAILQ_INSERT_TAIL(&V_pf_sendqueue, pfse, pfse_next); PF_SENDQ_UNLOCK(); swi_sched(V_pf_swi_cookie, 0); } void pf_intr(void *v) { struct pf_send_head queue; struct pf_send_entry *pfse, *next; CURVNET_SET((struct vnet *)v); PF_SENDQ_LOCK(); queue = V_pf_sendqueue; STAILQ_INIT(&V_pf_sendqueue); PF_SENDQ_UNLOCK(); STAILQ_FOREACH_SAFE(pfse, &queue, pfse_next, next) { switch (pfse->pfse_type) { #ifdef INET case PFSE_IP: ip_output(pfse->pfse_m, NULL, NULL, 0, NULL, NULL); break; case PFSE_ICMP: icmp_error(pfse->pfse_m, pfse->icmpopts.type, pfse->icmpopts.code, 0, pfse->icmpopts.mtu); break; #endif /* INET */ #ifdef INET6 case PFSE_IP6: ip6_output(pfse->pfse_m, NULL, NULL, 0, NULL, NULL, NULL); break; case PFSE_ICMP6: icmp6_error(pfse->pfse_m, pfse->icmpopts.type, pfse->icmpopts.code, pfse->icmpopts.mtu); break; #endif /* INET6 */ default: panic("%s: unknown type", __func__); } free(pfse, M_PFTEMP); } CURVNET_RESTORE(); } void pf_purge_thread(void *unused __unused) { VNET_ITERATOR_DECL(vnet_iter); sx_xlock(&pf_end_lock); while (pf_end_threads == 0) { 
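		/*
		 * Pacing note (hedged arithmetic): the loop body below runs
		 * about ten times a second (hz / 10 sleep) and each VNET
		 * pass checks pf_hashmask /
		 * (timeout[PFTM_INTERVAL] * 10) rows, so a full sweep of
		 * the ID hash completes roughly once per PFTM_INTERVAL
		 * seconds. E.g., assuming 32768 rows and a 10 s interval,
		 * each wakeup scans ~327 rows.
		 */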
sx_sleep(pf_purge_thread, &pf_end_lock, 0, "pftm", hz / 10); VNET_LIST_RLOCK(); VNET_FOREACH(vnet_iter) { CURVNET_SET(vnet_iter); /* Wait until V_pf_default_rule is initialized. */ if (V_pf_vnet_active == 0) { CURVNET_RESTORE(); continue; } /* * Process 1/interval fraction of the state * table every run. */ V_pf_purge_idx = pf_purge_expired_states(V_pf_purge_idx, pf_hashmask / (V_pf_default_rule.timeout[PFTM_INTERVAL] * 10)); /* * Purge other expired types every * PFTM_INTERVAL seconds. */ if (V_pf_purge_idx == 0) { /* * Order is important: * - states and src nodes reference rules * - states and rules reference kifs */ pf_purge_expired_fragments(); pf_purge_expired_src_nodes(); pf_purge_unlinked_rules(); pfi_kif_purge(); } CURVNET_RESTORE(); } VNET_LIST_RUNLOCK(); } pf_end_threads++; sx_xunlock(&pf_end_lock); kproc_exit(0); } void pf_unload_vnet_purge(void) { /* * To cleanse up all kifs and rules we need * two runs: first one clears reference flags, * then pf_purge_expired_states() doesn't * raise them, and then second run frees. */ pf_purge_unlinked_rules(); pfi_kif_purge(); /* * Now purge everything. */ pf_purge_expired_states(0, pf_hashmask); pf_purge_fragments(UINT_MAX); pf_purge_expired_src_nodes(); /* * Now all kifs & rules should be unreferenced, * thus should be successfully freed. */ pf_purge_unlinked_rules(); pfi_kif_purge(); } u_int32_t pf_state_expires(const struct pf_state *state) { u_int32_t timeout; u_int32_t start; u_int32_t end; u_int32_t states; /* handle all PFTM_* > PFTM_MAX here */ if (state->timeout == PFTM_PURGE) return (time_uptime); KASSERT(state->timeout != PFTM_UNLINKED, ("pf_state_expires: timeout == PFTM_UNLINKED")); KASSERT((state->timeout < PFTM_MAX), ("pf_state_expires: timeout > PFTM_MAX")); timeout = state->rule.ptr->timeout[state->timeout]; if (!timeout) timeout = V_pf_default_rule.timeout[state->timeout]; start = state->rule.ptr->timeout[PFTM_ADAPTIVE_START]; if (start) { end = state->rule.ptr->timeout[PFTM_ADAPTIVE_END]; states = counter_u64_fetch(state->rule.ptr->states_cur); } else { start = V_pf_default_rule.timeout[PFTM_ADAPTIVE_START]; end = V_pf_default_rule.timeout[PFTM_ADAPTIVE_END]; states = V_pf_status.states; } if (end && states > start && start < end) { if (states < end) return (state->expire + timeout * (end - states) / (end - start)); else return (time_uptime); } return (state->expire + timeout); } void pf_purge_expired_src_nodes() { struct pf_src_node_list freelist; struct pf_srchash *sh; struct pf_src_node *cur, *next; int i; LIST_INIT(&freelist); for (i = 0, sh = V_pf_srchash; i <= pf_srchashmask; i++, sh++) { PF_HASHROW_LOCK(sh); LIST_FOREACH_SAFE(cur, &sh->nodes, entry, next) if (cur->states == 0 && cur->expire <= time_uptime) { pf_unlink_src_node(cur); LIST_INSERT_HEAD(&freelist, cur, entry); } else if (cur->rule.ptr != NULL) cur->rule.ptr->rule_flag |= PFRULE_REFS; PF_HASHROW_UNLOCK(sh); } pf_free_src_nodes(&freelist); V_pf_status.src_nodes = uma_zone_get_cur(V_pf_sources_z); } static void pf_src_tree_remove_state(struct pf_state *s) { struct pf_src_node *sn; struct pf_srchash *sh; uint32_t timeout; timeout = s->rule.ptr->timeout[PFTM_SRC_NODE] ? 
s->rule.ptr->timeout[PFTM_SRC_NODE] : V_pf_default_rule.timeout[PFTM_SRC_NODE]; if (s->src_node != NULL) { sn = s->src_node; sh = &V_pf_srchash[pf_hashsrc(&sn->addr, sn->af)]; PF_HASHROW_LOCK(sh); if (s->src.tcp_est) --sn->conn; if (--sn->states == 0) sn->expire = time_uptime + timeout; PF_HASHROW_UNLOCK(sh); } if (s->nat_src_node != s->src_node && s->nat_src_node != NULL) { sn = s->nat_src_node; sh = &V_pf_srchash[pf_hashsrc(&sn->addr, sn->af)]; PF_HASHROW_LOCK(sh); if (--sn->states == 0) sn->expire = time_uptime + timeout; PF_HASHROW_UNLOCK(sh); } s->src_node = s->nat_src_node = NULL; } /* * Unlink and potentilly free a state. Function may be * called with ID hash row locked, but always returns * unlocked, since it needs to go through key hash locking. */ int pf_unlink_state(struct pf_state *s, u_int flags) { struct pf_idhash *ih = &V_pf_idhash[PF_IDHASH(s)]; if ((flags & PF_ENTER_LOCKED) == 0) PF_HASHROW_LOCK(ih); else PF_HASHROW_ASSERT(ih); if (s->timeout == PFTM_UNLINKED) { /* * State is being processed * by pf_unlink_state() in * an other thread. */ PF_HASHROW_UNLOCK(ih); return (0); /* XXXGL: undefined actually */ } if (s->src.state == PF_TCPS_PROXY_DST) { /* XXX wire key the right one? */ pf_send_tcp(NULL, s->rule.ptr, s->key[PF_SK_WIRE]->af, &s->key[PF_SK_WIRE]->addr[1], &s->key[PF_SK_WIRE]->addr[0], s->key[PF_SK_WIRE]->port[1], s->key[PF_SK_WIRE]->port[0], s->src.seqhi, s->src.seqlo + 1, TH_RST|TH_ACK, 0, 0, 0, 1, s->tag, NULL); } LIST_REMOVE(s, entry); pf_src_tree_remove_state(s); if (pfsync_delete_state_ptr != NULL) pfsync_delete_state_ptr(s); STATE_DEC_COUNTERS(s); s->timeout = PFTM_UNLINKED; PF_HASHROW_UNLOCK(ih); pf_detach_state(s); /* pf_state_insert() initialises refs to 2, so we can never release the * last reference here, only in pf_release_state(). */ (void)refcount_release(&s->refs); return (pf_release_state(s)); } void pf_free_state(struct pf_state *cur) { KASSERT(cur->refs == 0, ("%s: %p has refs", __func__, cur)); KASSERT(cur->timeout == PFTM_UNLINKED, ("%s: timeout %u", __func__, cur->timeout)); pf_normalize_tcp_cleanup(cur); uma_zfree(V_pf_state_z, cur); counter_u64_add(V_pf_status.fcounters[FCNT_STATE_REMOVALS], 1); } /* * Called only from pf_purge_thread(), thus serialized. */ static u_int pf_purge_expired_states(u_int i, int maxcheck) { struct pf_idhash *ih; struct pf_state *s; V_pf_status.states = uma_zone_get_cur(V_pf_state_z); /* * Go through hash and unlink states that expire now. */ while (maxcheck > 0) { ih = &V_pf_idhash[i]; relock: PF_HASHROW_LOCK(ih); LIST_FOREACH(s, &ih->states, entry) { if (pf_state_expires(s) <= time_uptime) { V_pf_status.states -= pf_unlink_state(s, PF_ENTER_LOCKED); goto relock; } s->rule.ptr->rule_flag |= PFRULE_REFS; if (s->nat_rule.ptr != NULL) s->nat_rule.ptr->rule_flag |= PFRULE_REFS; if (s->anchor.ptr != NULL) s->anchor.ptr->rule_flag |= PFRULE_REFS; s->kif->pfik_flags |= PFI_IFLAG_REFS; if (s->rt_kif) s->rt_kif->pfik_flags |= PFI_IFLAG_REFS; } PF_HASHROW_UNLOCK(ih); /* Return when we hit end of hash. */ if (++i > pf_hashmask) { V_pf_status.states = uma_zone_get_cur(V_pf_state_z); return (0); } maxcheck--; } V_pf_status.states = uma_zone_get_cur(V_pf_state_z); return (i); } static void pf_purge_unlinked_rules() { struct pf_rulequeue tmpq; struct pf_rule *r, *r1; /* * If we have overloading task pending, then we'd * better skip purging this time. There is a tiny * probability that overloading task references * an already unlinked rule. 
*/ PF_OVERLOADQ_LOCK(); if (!SLIST_EMPTY(&V_pf_overloadqueue)) { PF_OVERLOADQ_UNLOCK(); return; } PF_OVERLOADQ_UNLOCK(); /* * Do naive mark-and-sweep garbage collecting of old rules. * Reference flag is raised by pf_purge_expired_states() * and pf_purge_expired_src_nodes(). * * To avoid LOR between PF_UNLNKDRULES_LOCK/PF_RULES_WLOCK, * use a temporary queue. */ TAILQ_INIT(&tmpq); PF_UNLNKDRULES_LOCK(); TAILQ_FOREACH_SAFE(r, &V_pf_unlinked_rules, entries, r1) { if (!(r->rule_flag & PFRULE_REFS)) { TAILQ_REMOVE(&V_pf_unlinked_rules, r, entries); TAILQ_INSERT_TAIL(&tmpq, r, entries); } else r->rule_flag &= ~PFRULE_REFS; } PF_UNLNKDRULES_UNLOCK(); if (!TAILQ_EMPTY(&tmpq)) { PF_RULES_WLOCK(); TAILQ_FOREACH_SAFE(r, &tmpq, entries, r1) { TAILQ_REMOVE(&tmpq, r, entries); pf_free_rule(r); } PF_RULES_WUNLOCK(); } } void pf_print_host(struct pf_addr *addr, u_int16_t p, sa_family_t af) { switch (af) { #ifdef INET case AF_INET: { u_int32_t a = ntohl(addr->addr32[0]); printf("%u.%u.%u.%u", (a>>24)&255, (a>>16)&255, (a>>8)&255, a&255); if (p) { p = ntohs(p); printf(":%u", p); } break; } #endif /* INET */ #ifdef INET6 case AF_INET6: { u_int16_t b; u_int8_t i, curstart, curend, maxstart, maxend; curstart = curend = maxstart = maxend = 255; for (i = 0; i < 8; i++) { if (!addr->addr16[i]) { if (curstart == 255) curstart = i; curend = i; } else { if ((curend - curstart) > (maxend - maxstart)) { maxstart = curstart; maxend = curend; } curstart = curend = 255; } } if ((curend - curstart) > (maxend - maxstart)) { maxstart = curstart; maxend = curend; } for (i = 0; i < 8; i++) { if (i >= maxstart && i <= maxend) { if (i == 0) printf(":"); if (i == maxend) printf(":"); } else { b = ntohs(addr->addr16[i]); printf("%x", b); if (i < 7) printf(":"); } } if (p) { p = ntohs(p); printf("[%u]", p); } break; } #endif /* INET6 */ } } void pf_print_state(struct pf_state *s) { pf_print_state_parts(s, NULL, NULL); } static void pf_print_state_parts(struct pf_state *s, struct pf_state_key *skwp, struct pf_state_key *sksp) { struct pf_state_key *skw, *sks; u_int8_t proto, dir; /* Do our best to fill these, but they're skipped if NULL */ skw = skwp ? skwp : (s ? s->key[PF_SK_WIRE] : NULL); sks = sksp ? sksp : (s ? s->key[PF_SK_STACK] : NULL); proto = skw ? skw->proto : (sks ? sks->proto : 0); dir = s ? 
s->direction : 0; switch (proto) { case IPPROTO_IPV4: printf("IPv4"); break; case IPPROTO_IPV6: printf("IPv6"); break; case IPPROTO_TCP: printf("TCP"); break; case IPPROTO_UDP: printf("UDP"); break; case IPPROTO_ICMP: printf("ICMP"); break; case IPPROTO_ICMPV6: printf("ICMPv6"); break; default: printf("%u", proto); break; } switch (dir) { case PF_IN: printf(" in"); break; case PF_OUT: printf(" out"); break; } if (skw) { printf(" wire: "); pf_print_host(&skw->addr[0], skw->port[0], skw->af); printf(" "); pf_print_host(&skw->addr[1], skw->port[1], skw->af); } if (sks) { printf(" stack: "); if (sks != skw) { pf_print_host(&sks->addr[0], sks->port[0], sks->af); printf(" "); pf_print_host(&sks->addr[1], sks->port[1], sks->af); } else printf("-"); } if (s) { if (proto == IPPROTO_TCP) { printf(" [lo=%u high=%u win=%u modulator=%u", s->src.seqlo, s->src.seqhi, s->src.max_win, s->src.seqdiff); if (s->src.wscale && s->dst.wscale) printf(" wscale=%u", s->src.wscale & PF_WSCALE_MASK); printf("]"); printf(" [lo=%u high=%u win=%u modulator=%u", s->dst.seqlo, s->dst.seqhi, s->dst.max_win, s->dst.seqdiff); if (s->src.wscale && s->dst.wscale) printf(" wscale=%u", s->dst.wscale & PF_WSCALE_MASK); printf("]"); } printf(" %u:%u", s->src.state, s->dst.state); } } void pf_print_flags(u_int8_t f) { if (f) printf(" "); if (f & TH_FIN) printf("F"); if (f & TH_SYN) printf("S"); if (f & TH_RST) printf("R"); if (f & TH_PUSH) printf("P"); if (f & TH_ACK) printf("A"); if (f & TH_URG) printf("U"); if (f & TH_ECE) printf("E"); if (f & TH_CWR) printf("W"); } #define PF_SET_SKIP_STEPS(i) \ do { \ while (head[i] != cur) { \ head[i]->skip[i].ptr = cur; \ head[i] = TAILQ_NEXT(head[i], entries); \ } \ } while (0) void pf_calc_skip_steps(struct pf_rulequeue *rules) { struct pf_rule *cur, *prev, *head[PF_SKIP_COUNT]; int i; cur = TAILQ_FIRST(rules); prev = cur; for (i = 0; i < PF_SKIP_COUNT; ++i) head[i] = cur; while (cur != NULL) { if (cur->kif != prev->kif || cur->ifnot != prev->ifnot) PF_SET_SKIP_STEPS(PF_SKIP_IFP); if (cur->direction != prev->direction) PF_SET_SKIP_STEPS(PF_SKIP_DIR); if (cur->af != prev->af) PF_SET_SKIP_STEPS(PF_SKIP_AF); if (cur->proto != prev->proto) PF_SET_SKIP_STEPS(PF_SKIP_PROTO); if (cur->src.neg != prev->src.neg || pf_addr_wrap_neq(&cur->src.addr, &prev->src.addr)) PF_SET_SKIP_STEPS(PF_SKIP_SRC_ADDR); if (cur->src.port[0] != prev->src.port[0] || cur->src.port[1] != prev->src.port[1] || cur->src.port_op != prev->src.port_op) PF_SET_SKIP_STEPS(PF_SKIP_SRC_PORT); if (cur->dst.neg != prev->dst.neg || pf_addr_wrap_neq(&cur->dst.addr, &prev->dst.addr)) PF_SET_SKIP_STEPS(PF_SKIP_DST_ADDR); if (cur->dst.port[0] != prev->dst.port[0] || cur->dst.port[1] != prev->dst.port[1] || cur->dst.port_op != prev->dst.port_op) PF_SET_SKIP_STEPS(PF_SKIP_DST_PORT); prev = cur; cur = TAILQ_NEXT(cur, entries); } for (i = 0; i < PF_SKIP_COUNT; ++i) PF_SET_SKIP_STEPS(i); } static int pf_addr_wrap_neq(struct pf_addr_wrap *aw1, struct pf_addr_wrap *aw2) { if (aw1->type != aw2->type) return (1); switch (aw1->type) { case PF_ADDR_ADDRMASK: case PF_ADDR_RANGE: if (PF_ANEQ(&aw1->v.a.addr, &aw2->v.a.addr, AF_INET6)) return (1); if (PF_ANEQ(&aw1->v.a.mask, &aw2->v.a.mask, AF_INET6)) return (1); return (0); case PF_ADDR_DYNIFTL: return (aw1->p.dyn->pfid_kt != aw2->p.dyn->pfid_kt); case PF_ADDR_NOROUTE: case PF_ADDR_URPFFAILED: return (0); case PF_ADDR_TABLE: return (aw1->p.tbl != aw2->p.tbl); default: printf("invalid address type: %d\n", aw1->type); return (1); } } /** * Checksum updates are a little complicated because the checksum in 
the TCP/UDP * header isn't always a full checksum. In some cases (i.e. output) it's a * pseudo-header checksum, which is a partial checksum over src/dst IP * addresses, protocol number and length. * * That means we have the following cases: * * Input or forwarding: we don't have TSO, the checksum fields are full * checksums, we need to update the checksum whenever we change anything. * * Output (i.e. the checksum is a pseudo-header checksum): * x The field being updated is src/dst address or affects the length of * the packet. We need to update the pseudo-header checksum (note that this * checksum is not ones' complement). * x Some other field is being modified (e.g. src/dst port numbers): We * don't have to update anything. **/ u_int16_t pf_cksum_fixup(u_int16_t cksum, u_int16_t old, u_int16_t new, u_int8_t udp) { u_int32_t l; if (udp && !cksum) return (0x0000); l = cksum + old - new; l = (l >> 16) + (l & 65535); l = l & 65535; if (udp && !l) return (0xFFFF); return (l); } u_int16_t pf_proto_cksum_fixup(struct mbuf *m, u_int16_t cksum, u_int16_t old, u_int16_t new, u_int8_t udp) { if (m->m_pkthdr.csum_flags & (CSUM_DELAY_DATA | CSUM_DELAY_DATA_IPV6)) return (cksum); return (pf_cksum_fixup(cksum, old, new, udp)); } static void pf_change_ap(struct mbuf *m, struct pf_addr *a, u_int16_t *p, u_int16_t *ic, u_int16_t *pc, struct pf_addr *an, u_int16_t pn, u_int8_t u, sa_family_t af) { struct pf_addr ao; u_int16_t po = *p; PF_ACPY(&ao, a, af); PF_ACPY(a, an, af); if (m->m_pkthdr.csum_flags & (CSUM_DELAY_DATA | CSUM_DELAY_DATA_IPV6)) *pc = ~*pc; *p = pn; switch (af) { #ifdef INET case AF_INET: *ic = pf_cksum_fixup(pf_cksum_fixup(*ic, ao.addr16[0], an->addr16[0], 0), ao.addr16[1], an->addr16[1], 0); *p = pn; *pc = pf_cksum_fixup(pf_cksum_fixup(*pc, ao.addr16[0], an->addr16[0], u), ao.addr16[1], an->addr16[1], u); *pc = pf_proto_cksum_fixup(m, *pc, po, pn, u); break; #endif /* INET */ #ifdef INET6 case AF_INET6: *pc = pf_cksum_fixup(pf_cksum_fixup(pf_cksum_fixup( pf_cksum_fixup(pf_cksum_fixup(pf_cksum_fixup( pf_cksum_fixup(pf_cksum_fixup(*pc, ao.addr16[0], an->addr16[0], u), ao.addr16[1], an->addr16[1], u), ao.addr16[2], an->addr16[2], u), ao.addr16[3], an->addr16[3], u), ao.addr16[4], an->addr16[4], u), ao.addr16[5], an->addr16[5], u), ao.addr16[6], an->addr16[6], u), ao.addr16[7], an->addr16[7], u); *pc = pf_proto_cksum_fixup(m, *pc, po, pn, u); break; #endif /* INET6 */ } if (m->m_pkthdr.csum_flags & (CSUM_DELAY_DATA | CSUM_DELAY_DATA_IPV6)) { *pc = ~*pc; if (! *pc) *pc = 0xffff; } } /* Changes a u_int32_t. 
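 * A worked example of the incremental update below (a sketch, udp == 0,
 * no carry to fold): rewriting 192.168.1.1 to 192.168.1.100 leaves the
 * high halves equal (0xc0a8) so they cancel, and for the low halves a
 * running checksum of 0x1234 becomes 0x1234 + 0x0101 - 0x0164 = 0x11d1.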
Uses a void * so there are no align restrictions */ void pf_change_a(void *a, u_int16_t *c, u_int32_t an, u_int8_t u) { u_int32_t ao; memcpy(&ao, a, sizeof(ao)); memcpy(a, &an, sizeof(u_int32_t)); *c = pf_cksum_fixup(pf_cksum_fixup(*c, ao / 65536, an / 65536, u), ao % 65536, an % 65536, u); } void pf_change_proto_a(struct mbuf *m, void *a, u_int16_t *c, u_int32_t an, u_int8_t udp) { u_int32_t ao; memcpy(&ao, a, sizeof(ao)); memcpy(a, &an, sizeof(u_int32_t)); *c = pf_proto_cksum_fixup(m, pf_proto_cksum_fixup(m, *c, ao / 65536, an / 65536, udp), ao % 65536, an % 65536, udp); } #ifdef INET6 static void pf_change_a6(struct pf_addr *a, u_int16_t *c, struct pf_addr *an, u_int8_t u) { struct pf_addr ao; PF_ACPY(&ao, a, AF_INET6); PF_ACPY(a, an, AF_INET6); *c = pf_cksum_fixup(pf_cksum_fixup(pf_cksum_fixup( pf_cksum_fixup(pf_cksum_fixup(pf_cksum_fixup( pf_cksum_fixup(pf_cksum_fixup(*c, ao.addr16[0], an->addr16[0], u), ao.addr16[1], an->addr16[1], u), ao.addr16[2], an->addr16[2], u), ao.addr16[3], an->addr16[3], u), ao.addr16[4], an->addr16[4], u), ao.addr16[5], an->addr16[5], u), ao.addr16[6], an->addr16[6], u), ao.addr16[7], an->addr16[7], u); } #endif /* INET6 */ static void pf_change_icmp(struct pf_addr *ia, u_int16_t *ip, struct pf_addr *oa, struct pf_addr *na, u_int16_t np, u_int16_t *pc, u_int16_t *h2c, u_int16_t *ic, u_int16_t *hc, u_int8_t u, sa_family_t af) { struct pf_addr oia, ooa; PF_ACPY(&oia, ia, af); if (oa) PF_ACPY(&ooa, oa, af); /* Change inner protocol port, fix inner protocol checksum. */ if (ip != NULL) { u_int16_t oip = *ip; u_int32_t opc; if (pc != NULL) opc = *pc; *ip = np; if (pc != NULL) *pc = pf_cksum_fixup(*pc, oip, *ip, u); *ic = pf_cksum_fixup(*ic, oip, *ip, 0); if (pc != NULL) *ic = pf_cksum_fixup(*ic, opc, *pc, 0); } /* Change inner ip address, fix inner ip and icmp checksums. */ PF_ACPY(ia, na, af); switch (af) { #ifdef INET case AF_INET: { u_int32_t oh2c = *h2c; *h2c = pf_cksum_fixup(pf_cksum_fixup(*h2c, oia.addr16[0], ia->addr16[0], 0), oia.addr16[1], ia->addr16[1], 0); *ic = pf_cksum_fixup(pf_cksum_fixup(*ic, oia.addr16[0], ia->addr16[0], 0), oia.addr16[1], ia->addr16[1], 0); *ic = pf_cksum_fixup(*ic, oh2c, *h2c, 0); break; } #endif /* INET */ #ifdef INET6 case AF_INET6: *ic = pf_cksum_fixup(pf_cksum_fixup(pf_cksum_fixup( pf_cksum_fixup(pf_cksum_fixup(pf_cksum_fixup( pf_cksum_fixup(pf_cksum_fixup(*ic, oia.addr16[0], ia->addr16[0], u), oia.addr16[1], ia->addr16[1], u), oia.addr16[2], ia->addr16[2], u), oia.addr16[3], ia->addr16[3], u), oia.addr16[4], ia->addr16[4], u), oia.addr16[5], ia->addr16[5], u), oia.addr16[6], ia->addr16[6], u), oia.addr16[7], ia->addr16[7], u); break; #endif /* INET6 */ } /* Outer ip address, fix outer ip or icmpv6 checksum, if necessary. 
*/ if (oa) { PF_ACPY(oa, na, af); switch (af) { #ifdef INET case AF_INET: *hc = pf_cksum_fixup(pf_cksum_fixup(*hc, ooa.addr16[0], oa->addr16[0], 0), ooa.addr16[1], oa->addr16[1], 0); break; #endif /* INET */ #ifdef INET6 case AF_INET6: *ic = pf_cksum_fixup(pf_cksum_fixup(pf_cksum_fixup( pf_cksum_fixup(pf_cksum_fixup(pf_cksum_fixup( pf_cksum_fixup(pf_cksum_fixup(*ic, ooa.addr16[0], oa->addr16[0], u), ooa.addr16[1], oa->addr16[1], u), ooa.addr16[2], oa->addr16[2], u), ooa.addr16[3], oa->addr16[3], u), ooa.addr16[4], oa->addr16[4], u), ooa.addr16[5], oa->addr16[5], u), ooa.addr16[6], oa->addr16[6], u), ooa.addr16[7], oa->addr16[7], u); break; #endif /* INET6 */ } } } /* * Need to modulate the sequence numbers in the TCP SACK option * (credits to Krzysztof Pfaff for report and patch) */ static int pf_modulate_sack(struct mbuf *m, int off, struct pf_pdesc *pd, struct tcphdr *th, struct pf_state_peer *dst) { int hlen = (th->th_off << 2) - sizeof(*th), thoptlen = hlen; u_int8_t opts[TCP_MAXOLEN], *opt = opts; int copyback = 0, i, olen; struct sackblk sack; #define TCPOLEN_SACKLEN (TCPOLEN_SACK + 2) if (hlen < TCPOLEN_SACKLEN || !pf_pull_hdr(m, off + sizeof(*th), opts, hlen, NULL, NULL, pd->af)) return 0; while (hlen >= TCPOLEN_SACKLEN) { olen = opt[1]; switch (*opt) { case TCPOPT_EOL: /* FALLTHROUGH */ case TCPOPT_NOP: opt++; hlen--; break; case TCPOPT_SACK: if (olen > hlen) olen = hlen; if (olen >= TCPOLEN_SACKLEN) { for (i = 2; i + TCPOLEN_SACK <= olen; i += TCPOLEN_SACK) { memcpy(&sack, &opt[i], sizeof(sack)); pf_change_proto_a(m, &sack.start, &th->th_sum, htonl(ntohl(sack.start) - dst->seqdiff), 0); pf_change_proto_a(m, &sack.end, &th->th_sum, htonl(ntohl(sack.end) - dst->seqdiff), 0); memcpy(&opt[i], &sack, sizeof(sack)); } copyback = 1; } /* FALLTHROUGH */ default: if (olen < 2) olen = 2; hlen -= olen; opt += olen; } } if (copyback) m_copyback(m, off + sizeof(*th), thoptlen, (caddr_t)opts); return (copyback); } static void pf_send_tcp(struct mbuf *replyto, const struct pf_rule *r, sa_family_t af, const struct pf_addr *saddr, const struct pf_addr *daddr, u_int16_t sport, u_int16_t dport, u_int32_t seq, u_int32_t ack, u_int8_t flags, u_int16_t win, u_int16_t mss, u_int8_t ttl, int tag, u_int16_t rtag, struct ifnet *ifp) { struct pf_send_entry *pfse; struct mbuf *m; int len, tlen; #ifdef INET struct ip *h = NULL; #endif /* INET */ #ifdef INET6 struct ip6_hdr *h6 = NULL; #endif /* INET6 */ struct tcphdr *th; char *opt; struct pf_mtag *pf_mtag; len = 0; th = NULL; /* maximum segment size tcp option */ tlen = sizeof(struct tcphdr); if (mss) tlen += 4; switch (af) { #ifdef INET case AF_INET: len = sizeof(struct ip) + tlen; break; #endif /* INET */ #ifdef INET6 case AF_INET6: len = sizeof(struct ip6_hdr) + tlen; break; #endif /* INET6 */ default: panic("%s: unsupported af %d", __func__, af); } /* Allocate outgoing queue entry, mbuf and mbuf tag. 
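 * Note that the reply is not transmitted from this context: pf_send()
 * only queues the entry on V_pf_sendqueue and schedules the software
 * interrupt, and pf_intr() later hands the mbuf to ip_output() or
 * ip6_output(), presumably to keep the output path out of whatever pf
 * locks the caller still holds.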
*/ pfse = malloc(sizeof(*pfse), M_PFTEMP, M_NOWAIT); if (pfse == NULL) return; m = m_gethdr(M_NOWAIT, MT_DATA); if (m == NULL) { free(pfse, M_PFTEMP); return; } #ifdef MAC mac_netinet_firewall_send(m); #endif if ((pf_mtag = pf_get_mtag(m)) == NULL) { free(pfse, M_PFTEMP); m_freem(m); return; } if (tag) m->m_flags |= M_SKIP_FIREWALL; pf_mtag->tag = rtag; if (r != NULL && r->rtableid >= 0) M_SETFIB(m, r->rtableid); #ifdef ALTQ if (r != NULL && r->qid) { pf_mtag->qid = r->qid; /* add hints for ecn */ pf_mtag->hdr = mtod(m, struct ip *); } #endif /* ALTQ */ m->m_data += max_linkhdr; m->m_pkthdr.len = m->m_len = len; m->m_pkthdr.rcvif = NULL; bzero(m->m_data, len); switch (af) { #ifdef INET case AF_INET: h = mtod(m, struct ip *); /* IP header fields included in the TCP checksum */ h->ip_p = IPPROTO_TCP; h->ip_len = htons(tlen); h->ip_src.s_addr = saddr->v4.s_addr; h->ip_dst.s_addr = daddr->v4.s_addr; th = (struct tcphdr *)((caddr_t)h + sizeof(struct ip)); break; #endif /* INET */ #ifdef INET6 case AF_INET6: h6 = mtod(m, struct ip6_hdr *); /* IP header fields included in the TCP checksum */ h6->ip6_nxt = IPPROTO_TCP; h6->ip6_plen = htons(tlen); memcpy(&h6->ip6_src, &saddr->v6, sizeof(struct in6_addr)); memcpy(&h6->ip6_dst, &daddr->v6, sizeof(struct in6_addr)); th = (struct tcphdr *)((caddr_t)h6 + sizeof(struct ip6_hdr)); break; #endif /* INET6 */ } /* TCP header */ th->th_sport = sport; th->th_dport = dport; th->th_seq = htonl(seq); th->th_ack = htonl(ack); th->th_off = tlen >> 2; th->th_flags = flags; th->th_win = htons(win); if (mss) { opt = (char *)(th + 1); opt[0] = TCPOPT_MAXSEG; opt[1] = 4; HTONS(mss); bcopy((caddr_t)&mss, (caddr_t)(opt + 2), 2); } switch (af) { #ifdef INET case AF_INET: /* TCP checksum */ th->th_sum = in_cksum(m, len); /* Finish the IP header */ h->ip_v = 4; h->ip_hl = sizeof(*h) >> 2; h->ip_tos = IPTOS_LOWDELAY; h->ip_off = htons(V_path_mtu_discovery ? IP_DF : 0); h->ip_len = htons(len); h->ip_ttl = ttl ? ttl : V_ip_defttl; h->ip_sum = 0; pfse->pfse_type = PFSE_IP; break; #endif /* INET */ #ifdef INET6 case AF_INET6: /* TCP checksum */ th->th_sum = in6_cksum(m, IPPROTO_TCP, sizeof(struct ip6_hdr), tlen); h6->ip6_vfc |= IPV6_VERSION; h6->ip6_hlim = IPV6_DEFHLIM; pfse->pfse_type = PFSE_IP6; break; #endif /* INET6 */ } pfse->pfse_m = m; pf_send(pfse); } static int pf_ieee8021q_setpcp(struct mbuf *m, u_int8_t prio) { struct m_tag *mtag; KASSERT(prio <= PF_PRIO_MAX, ("%s with invalid pcp", __func__)); mtag = m_tag_locate(m, MTAG_8021Q, MTAG_8021Q_PCP_OUT, NULL); if (mtag == NULL) { mtag = m_tag_alloc(MTAG_8021Q, MTAG_8021Q_PCP_OUT, sizeof(uint8_t), M_NOWAIT); if (mtag == NULL) return (ENOMEM); m_tag_prepend(m, mtag); } *(uint8_t *)(mtag + 1) = prio; return (0); } static int pf_match_ieee8021q_pcp(u_int8_t prio, struct mbuf *m) { struct m_tag *mtag; u_int8_t mpcp; mtag = m_tag_locate(m, MTAG_8021Q, MTAG_8021Q_PCP_IN, NULL); if (mtag == NULL) return (0); if (prio == PF_PRIO_ZERO) prio = 0; mpcp = *(uint8_t *)(mtag + 1); return (mpcp == prio); } static void pf_send_icmp(struct mbuf *m, u_int8_t type, u_int8_t code, sa_family_t af, struct pf_rule *r) { struct pf_send_entry *pfse; struct mbuf *m0; struct pf_mtag *pf_mtag; /* Allocate outgoing queue entry, mbuf and mbuf tag. 
*/ pfse = malloc(sizeof(*pfse), M_PFTEMP, M_NOWAIT); if (pfse == NULL) return; if ((m0 = m_copypacket(m, M_NOWAIT)) == NULL) { free(pfse, M_PFTEMP); return; } if ((pf_mtag = pf_get_mtag(m0)) == NULL) { free(pfse, M_PFTEMP); return; } /* XXX: revisit */ m0->m_flags |= M_SKIP_FIREWALL; if (r->rtableid >= 0) M_SETFIB(m0, r->rtableid); #ifdef ALTQ if (r->qid) { pf_mtag->qid = r->qid; /* add hints for ecn */ pf_mtag->hdr = mtod(m0, struct ip *); } #endif /* ALTQ */ switch (af) { #ifdef INET case AF_INET: pfse->pfse_type = PFSE_ICMP; break; #endif /* INET */ #ifdef INET6 case AF_INET6: pfse->pfse_type = PFSE_ICMP6; break; #endif /* INET6 */ } pfse->pfse_m = m0; pfse->icmpopts.type = type; pfse->icmpopts.code = code; pf_send(pfse); } /* * Return 1 if the addresses a and b match (with mask m), otherwise return 0. * If n is 0, they match if they are equal. If n is != 0, they match if they * are different. */ int pf_match_addr(u_int8_t n, struct pf_addr *a, struct pf_addr *m, struct pf_addr *b, sa_family_t af) { int match = 0; switch (af) { #ifdef INET case AF_INET: if ((a->addr32[0] & m->addr32[0]) == (b->addr32[0] & m->addr32[0])) match++; break; #endif /* INET */ #ifdef INET6 case AF_INET6: if (((a->addr32[0] & m->addr32[0]) == (b->addr32[0] & m->addr32[0])) && ((a->addr32[1] & m->addr32[1]) == (b->addr32[1] & m->addr32[1])) && ((a->addr32[2] & m->addr32[2]) == (b->addr32[2] & m->addr32[2])) && ((a->addr32[3] & m->addr32[3]) == (b->addr32[3] & m->addr32[3]))) match++; break; #endif /* INET6 */ } if (match) { if (n) return (0); else return (1); } else { if (n) return (1); else return (0); } } /* * Return 1 if b <= a <= e, otherwise return 0. */ int pf_match_addr_range(struct pf_addr *b, struct pf_addr *e, struct pf_addr *a, sa_family_t af) { switch (af) { #ifdef INET case AF_INET: if ((ntohl(a->addr32[0]) < ntohl(b->addr32[0])) || (ntohl(a->addr32[0]) > ntohl(e->addr32[0]))) return (0); break; #endif /* INET */ #ifdef INET6 case AF_INET6: { int i; /* check a >= b */ for (i = 0; i < 4; ++i) if (ntohl(a->addr32[i]) > ntohl(b->addr32[i])) break; else if (ntohl(a->addr32[i]) < ntohl(b->addr32[i])) return (0); /* check a <= e */ for (i = 0; i < 4; ++i) if (ntohl(a->addr32[i]) < ntohl(e->addr32[i])) break; else if (ntohl(a->addr32[i]) > ntohl(e->addr32[i])) return (0); break; } #endif /* INET6 */ } return (1); } static int pf_match(u_int8_t op, u_int32_t a1, u_int32_t a2, u_int32_t p) { switch (op) { case PF_OP_IRG: return ((p > a1) && (p < a2)); case PF_OP_XRG: return ((p < a1) || (p > a2)); case PF_OP_RRG: return ((p >= a1) && (p <= a2)); case PF_OP_EQ: return (p == a1); case PF_OP_NE: return (p != a1); case PF_OP_LT: return (p < a1); case PF_OP_LE: return (p <= a1); case PF_OP_GT: return (p > a1); case PF_OP_GE: return (p >= a1); } return (0); /* never reached */ } int pf_match_port(u_int8_t op, u_int16_t a1, u_int16_t a2, u_int16_t p) { NTOHS(a1); NTOHS(a2); NTOHS(p); return (pf_match(op, a1, a2, p)); } static int pf_match_uid(u_int8_t op, uid_t a1, uid_t a2, uid_t u) { if (u == UID_MAX && op != PF_OP_EQ && op != PF_OP_NE) return (0); return (pf_match(op, a1, a2, u)); } static int pf_match_gid(u_int8_t op, gid_t a1, gid_t a2, gid_t g) { if (g == GID_MAX && op != PF_OP_EQ && op != PF_OP_NE) return (0); return (pf_match(op, a1, a2, g)); } int pf_match_tag(struct mbuf *m, struct pf_rule *r, int *tag, int mtag) { if (*tag == -1) *tag = mtag; return ((!r->match_tag_not && r->match_tag == *tag) || (r->match_tag_not && r->match_tag != *tag)); } int pf_tag_packet(struct mbuf *m, struct pf_pdesc *pd, int 
tag) { KASSERT(tag > 0, ("%s: tag %d", __func__, tag)); if (pd->pf_mtag == NULL && ((pd->pf_mtag = pf_get_mtag(m)) == NULL)) return (ENOMEM); pd->pf_mtag->tag = tag; return (0); } #define PF_ANCHOR_STACKSIZE 32 struct pf_anchor_stackframe { struct pf_ruleset *rs; struct pf_rule *r; /* XXX: + match bit */ struct pf_anchor *child; }; /* * XXX: We rely on malloc(9) returning pointer aligned addresses. */ #define PF_ANCHORSTACK_MATCH 0x00000001 #define PF_ANCHORSTACK_MASK (PF_ANCHORSTACK_MATCH) #define PF_ANCHOR_MATCH(f) ((uintptr_t)(f)->r & PF_ANCHORSTACK_MATCH) #define PF_ANCHOR_RULE(f) (struct pf_rule *) \ ((uintptr_t)(f)->r & ~PF_ANCHORSTACK_MASK) #define PF_ANCHOR_SET_MATCH(f) do { (f)->r = (void *) \ ((uintptr_t)(f)->r | PF_ANCHORSTACK_MATCH); \ } while (0) void pf_step_into_anchor(struct pf_anchor_stackframe *stack, int *depth, struct pf_ruleset **rs, int n, struct pf_rule **r, struct pf_rule **a, int *match) { struct pf_anchor_stackframe *f; PF_RULES_RASSERT(); if (match) *match = 0; if (*depth >= PF_ANCHOR_STACKSIZE) { printf("%s: anchor stack overflow on %s\n", __func__, (*r)->anchor->name); *r = TAILQ_NEXT(*r, entries); return; } else if (*depth == 0 && a != NULL) *a = *r; f = stack + (*depth)++; f->rs = *rs; f->r = *r; if ((*r)->anchor_wildcard) { struct pf_anchor_node *parent = &(*r)->anchor->children; if ((f->child = RB_MIN(pf_anchor_node, parent)) == NULL) { *r = NULL; return; } *rs = &f->child->ruleset; } else { f->child = NULL; *rs = &(*r)->anchor->ruleset; } *r = TAILQ_FIRST((*rs)->rules[n].active.ptr); } int pf_step_out_of_anchor(struct pf_anchor_stackframe *stack, int *depth, struct pf_ruleset **rs, int n, struct pf_rule **r, struct pf_rule **a, int *match) { struct pf_anchor_stackframe *f; struct pf_rule *fr; int quick = 0; PF_RULES_RASSERT(); do { if (*depth <= 0) break; f = stack + *depth - 1; fr = PF_ANCHOR_RULE(f); if (f->child != NULL) { struct pf_anchor_node *parent; /* * This block traverses through * a wildcard anchor. */ parent = &fr->anchor->children; if (match != NULL && *match) { /* * If any of "*" matched, then * "foo/ *" matched, mark frame * appropriately. 
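 * (Hypothetical example: for a wildcard anchor "foo/ *" with children
 * "foo/bar" and "foo/baz", a rule matching while "foo/bar" is being
 * traversed marks the parent frame here, so that the code below can
 * honor the parent rule's quick flag on the way out.)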
*/ PF_ANCHOR_SET_MATCH(f); *match = 0; } f->child = RB_NEXT(pf_anchor_node, parent, f->child); if (f->child != NULL) { *rs = &f->child->ruleset; *r = TAILQ_FIRST((*rs)->rules[n].active.ptr); if (*r == NULL) continue; else break; } } (*depth)--; if (*depth == 0 && a != NULL) *a = NULL; *rs = f->rs; if (PF_ANCHOR_MATCH(f) || (match != NULL && *match)) quick = fr->quick; *r = TAILQ_NEXT(fr, entries); } while (*r == NULL); return (quick); } #ifdef INET6 void pf_poolmask(struct pf_addr *naddr, struct pf_addr *raddr, struct pf_addr *rmask, struct pf_addr *saddr, sa_family_t af) { switch (af) { #ifdef INET case AF_INET: naddr->addr32[0] = (raddr->addr32[0] & rmask->addr32[0]) | ((rmask->addr32[0] ^ 0xffffffff ) & saddr->addr32[0]); break; #endif /* INET */ case AF_INET6: naddr->addr32[0] = (raddr->addr32[0] & rmask->addr32[0]) | ((rmask->addr32[0] ^ 0xffffffff ) & saddr->addr32[0]); naddr->addr32[1] = (raddr->addr32[1] & rmask->addr32[1]) | ((rmask->addr32[1] ^ 0xffffffff ) & saddr->addr32[1]); naddr->addr32[2] = (raddr->addr32[2] & rmask->addr32[2]) | ((rmask->addr32[2] ^ 0xffffffff ) & saddr->addr32[2]); naddr->addr32[3] = (raddr->addr32[3] & rmask->addr32[3]) | ((rmask->addr32[3] ^ 0xffffffff ) & saddr->addr32[3]); break; } } void pf_addr_inc(struct pf_addr *addr, sa_family_t af) { switch (af) { #ifdef INET case AF_INET: addr->addr32[0] = htonl(ntohl(addr->addr32[0]) + 1); break; #endif /* INET */ case AF_INET6: if (addr->addr32[3] == 0xffffffff) { addr->addr32[3] = 0; if (addr->addr32[2] == 0xffffffff) { addr->addr32[2] = 0; if (addr->addr32[1] == 0xffffffff) { addr->addr32[1] = 0; addr->addr32[0] = htonl(ntohl(addr->addr32[0]) + 1); } else addr->addr32[1] = htonl(ntohl(addr->addr32[1]) + 1); } else addr->addr32[2] = htonl(ntohl(addr->addr32[2]) + 1); } else addr->addr32[3] = htonl(ntohl(addr->addr32[3]) + 1); break; } } #endif /* INET6 */ int pf_socket_lookup(int direction, struct pf_pdesc *pd, struct mbuf *m) { struct pf_addr *saddr, *daddr; u_int16_t sport, dport; struct inpcbinfo *pi; struct inpcb *inp; pd->lookup.uid = UID_MAX; pd->lookup.gid = GID_MAX; switch (pd->proto) { case IPPROTO_TCP: if (pd->hdr.tcp == NULL) return (-1); sport = pd->hdr.tcp->th_sport; dport = pd->hdr.tcp->th_dport; pi = &V_tcbinfo; break; case IPPROTO_UDP: if (pd->hdr.udp == NULL) return (-1); sport = pd->hdr.udp->uh_sport; dport = pd->hdr.udp->uh_dport; pi = &V_udbinfo; break; default: return (-1); } if (direction == PF_IN) { saddr = pd->src; daddr = pd->dst; } else { u_int16_t p; p = sport; sport = dport; dport = p; saddr = pd->dst; daddr = pd->src; } switch (pd->af) { #ifdef INET case AF_INET: inp = in_pcblookup_mbuf(pi, saddr->v4, sport, daddr->v4, dport, INPLOOKUP_RLOCKPCB, NULL, m); if (inp == NULL) { inp = in_pcblookup_mbuf(pi, saddr->v4, sport, daddr->v4, dport, INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB, NULL, m); if (inp == NULL) return (-1); } break; #endif /* INET */ #ifdef INET6 case AF_INET6: inp = in6_pcblookup_mbuf(pi, &saddr->v6, sport, &daddr->v6, dport, INPLOOKUP_RLOCKPCB, NULL, m); if (inp == NULL) { inp = in6_pcblookup_mbuf(pi, &saddr->v6, sport, &daddr->v6, dport, INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB, NULL, m); if (inp == NULL) return (-1); } break; #endif /* INET6 */ default: return (-1); } INP_RLOCK_ASSERT(inp); pd->lookup.uid = inp->inp_cred->cr_uid; pd->lookup.gid = inp->inp_cred->cr_groups[0]; INP_RUNLOCK(inp); return (1); } static u_int8_t pf_get_wscale(struct mbuf *m, int off, u_int16_t th_off, sa_family_t af) { int hlen; u_int8_t hdr[60]; u_int8_t *opt, optlen; u_int8_t wscale = 0; 
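	/*
	 * Walk the TCP options: EOL and NOP are single bytes, everything
	 * else is length-prefixed.  The window scale shift is capped at
	 * TCP_MAX_WINSHIFT, and PF_WSCALE_FLAG is set so that callers can
	 * tell "option present with shift 0" apart from "option absent".
	 */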
hlen = th_off << 2; /* hlen <= sizeof(hdr) */ if (hlen <= sizeof(struct tcphdr)) return (0); if (!pf_pull_hdr(m, off, hdr, hlen, NULL, NULL, af)) return (0); opt = hdr + sizeof(struct tcphdr); hlen -= sizeof(struct tcphdr); while (hlen >= 3) { switch (*opt) { case TCPOPT_EOL: case TCPOPT_NOP: ++opt; --hlen; break; case TCPOPT_WINDOW: wscale = opt[2]; if (wscale > TCP_MAX_WINSHIFT) wscale = TCP_MAX_WINSHIFT; wscale |= PF_WSCALE_FLAG; /* FALLTHROUGH */ default: optlen = opt[1]; if (optlen < 2) optlen = 2; hlen -= optlen; opt += optlen; break; } } return (wscale); } static u_int16_t pf_get_mss(struct mbuf *m, int off, u_int16_t th_off, sa_family_t af) { int hlen; u_int8_t hdr[60]; u_int8_t *opt, optlen; u_int16_t mss = V_tcp_mssdflt; hlen = th_off << 2; /* hlen <= sizeof(hdr) */ if (hlen <= sizeof(struct tcphdr)) return (0); if (!pf_pull_hdr(m, off, hdr, hlen, NULL, NULL, af)) return (0); opt = hdr + sizeof(struct tcphdr); hlen -= sizeof(struct tcphdr); while (hlen >= TCPOLEN_MAXSEG) { switch (*opt) { case TCPOPT_EOL: case TCPOPT_NOP: ++opt; --hlen; break; case TCPOPT_MAXSEG: bcopy((caddr_t)(opt + 2), (caddr_t)&mss, 2); NTOHS(mss); /* FALLTHROUGH */ default: optlen = opt[1]; if (optlen < 2) optlen = 2; hlen -= optlen; opt += optlen; break; } } return (mss); } static u_int16_t pf_calc_mss(struct pf_addr *addr, sa_family_t af, int rtableid, u_int16_t offer) { #ifdef INET struct nhop4_basic nh4; #endif /* INET */ #ifdef INET6 struct nhop6_basic nh6; struct in6_addr dst6; uint32_t scopeid; #endif /* INET6 */ int hlen = 0; uint16_t mss = 0; switch (af) { #ifdef INET case AF_INET: hlen = sizeof(struct ip); if (fib4_lookup_nh_basic(rtableid, addr->v4, 0, 0, &nh4) == 0) mss = nh4.nh_mtu - hlen - sizeof(struct tcphdr); break; #endif /* INET */ #ifdef INET6 case AF_INET6: hlen = sizeof(struct ip6_hdr); in6_splitscope(&addr->v6, &dst6, &scopeid); if (fib6_lookup_nh_basic(rtableid, &dst6, scopeid, 0,0,&nh6)==0) mss = nh6.nh_mtu - hlen - sizeof(struct tcphdr); break; #endif /* INET6 */ } mss = max(V_tcp_mssdflt, mss); mss = min(mss, offer); mss = max(mss, 64); /* sanity - at least max opt space */ return (mss); } static u_int32_t pf_tcp_iss(struct pf_pdesc *pd) { MD5_CTX ctx; u_int32_t digest[4]; if (V_pf_tcp_secret_init == 0) { read_random(&V_pf_tcp_secret, sizeof(V_pf_tcp_secret)); MD5Init(&V_pf_tcp_secret_ctx); MD5Update(&V_pf_tcp_secret_ctx, V_pf_tcp_secret, sizeof(V_pf_tcp_secret)); V_pf_tcp_secret_init = 1; } ctx = V_pf_tcp_secret_ctx; MD5Update(&ctx, (char *)&pd->hdr.tcp->th_sport, sizeof(u_short)); MD5Update(&ctx, (char *)&pd->hdr.tcp->th_dport, sizeof(u_short)); if (pd->af == AF_INET6) { MD5Update(&ctx, (char *)&pd->src->v6, sizeof(struct in6_addr)); MD5Update(&ctx, (char *)&pd->dst->v6, sizeof(struct in6_addr)); } else { MD5Update(&ctx, (char *)&pd->src->v4, sizeof(struct in_addr)); MD5Update(&ctx, (char *)&pd->dst->v4, sizeof(struct in_addr)); } MD5Final((u_char *)digest, &ctx); V_pf_tcp_iss_off += 4096; #define ISN_RANDOM_INCREMENT (4096 - 1) return (digest[0] + (arc4random() & ISN_RANDOM_INCREMENT) + V_pf_tcp_iss_off); #undef ISN_RANDOM_INCREMENT } static int pf_test_rule(struct pf_rule **rm, struct pf_state **sm, int direction, struct pfi_kif *kif, struct mbuf *m, int off, struct pf_pdesc *pd, struct pf_rule **am, struct pf_ruleset **rsm, struct inpcb *inp) { struct pf_rule *nr = NULL; struct pf_addr * const saddr = pd->src; struct pf_addr * const daddr = pd->dst; sa_family_t af = pd->af; struct pf_rule *r, *a = NULL; struct pf_ruleset *ruleset = NULL; struct pf_src_node *nsn = NULL; struct 
tcphdr *th = pd->hdr.tcp; struct pf_state_key *sk = NULL, *nk = NULL; u_short reason; int rewrite = 0, hdrlen = 0; int tag = -1, rtableid = -1; int asd = 0; int match = 0; int state_icmp = 0; u_int16_t sport = 0, dport = 0; u_int16_t bproto_sum = 0, bip_sum = 0; u_int8_t icmptype = 0, icmpcode = 0; struct pf_anchor_stackframe anchor_stack[PF_ANCHOR_STACKSIZE]; PF_RULES_RASSERT(); if (inp != NULL) { INP_LOCK_ASSERT(inp); pd->lookup.uid = inp->inp_cred->cr_uid; pd->lookup.gid = inp->inp_cred->cr_groups[0]; pd->lookup.done = 1; } switch (pd->proto) { case IPPROTO_TCP: sport = th->th_sport; dport = th->th_dport; hdrlen = sizeof(*th); break; case IPPROTO_UDP: sport = pd->hdr.udp->uh_sport; dport = pd->hdr.udp->uh_dport; hdrlen = sizeof(*pd->hdr.udp); break; #ifdef INET case IPPROTO_ICMP: if (pd->af != AF_INET) break; sport = dport = pd->hdr.icmp->icmp_id; hdrlen = sizeof(*pd->hdr.icmp); icmptype = pd->hdr.icmp->icmp_type; icmpcode = pd->hdr.icmp->icmp_code; if (icmptype == ICMP_UNREACH || icmptype == ICMP_SOURCEQUENCH || icmptype == ICMP_REDIRECT || icmptype == ICMP_TIMXCEED || icmptype == ICMP_PARAMPROB) state_icmp++; break; #endif /* INET */ #ifdef INET6 case IPPROTO_ICMPV6: if (af != AF_INET6) break; sport = dport = pd->hdr.icmp6->icmp6_id; hdrlen = sizeof(*pd->hdr.icmp6); icmptype = pd->hdr.icmp6->icmp6_type; icmpcode = pd->hdr.icmp6->icmp6_code; if (icmptype == ICMP6_DST_UNREACH || icmptype == ICMP6_PACKET_TOO_BIG || icmptype == ICMP6_TIME_EXCEEDED || icmptype == ICMP6_PARAM_PROB) state_icmp++; break; #endif /* INET6 */ default: sport = dport = hdrlen = 0; break; } r = TAILQ_FIRST(pf_main_ruleset.rules[PF_RULESET_FILTER].active.ptr); /* check packet for BINAT/NAT/RDR */ if ((nr = pf_get_translation(pd, m, off, direction, kif, &nsn, &sk, &nk, saddr, daddr, sport, dport, anchor_stack)) != NULL) { KASSERT(sk != NULL, ("%s: null sk", __func__)); KASSERT(nk != NULL, ("%s: null nk", __func__)); if (pd->ip_sum) bip_sum = *pd->ip_sum; switch (pd->proto) { case IPPROTO_TCP: bproto_sum = th->th_sum; pd->proto_sum = &th->th_sum; if (PF_ANEQ(saddr, &nk->addr[pd->sidx], af) || nk->port[pd->sidx] != sport) { pf_change_ap(m, saddr, &th->th_sport, pd->ip_sum, &th->th_sum, &nk->addr[pd->sidx], nk->port[pd->sidx], 0, af); pd->sport = &th->th_sport; sport = th->th_sport; } if (PF_ANEQ(daddr, &nk->addr[pd->didx], af) || nk->port[pd->didx] != dport) { pf_change_ap(m, daddr, &th->th_dport, pd->ip_sum, &th->th_sum, &nk->addr[pd->didx], nk->port[pd->didx], 0, af); dport = th->th_dport; pd->dport = &th->th_dport; } rewrite++; break; case IPPROTO_UDP: bproto_sum = pd->hdr.udp->uh_sum; pd->proto_sum = &pd->hdr.udp->uh_sum; if (PF_ANEQ(saddr, &nk->addr[pd->sidx], af) || nk->port[pd->sidx] != sport) { pf_change_ap(m, saddr, &pd->hdr.udp->uh_sport, pd->ip_sum, &pd->hdr.udp->uh_sum, &nk->addr[pd->sidx], nk->port[pd->sidx], 1, af); sport = pd->hdr.udp->uh_sport; pd->sport = &pd->hdr.udp->uh_sport; } if (PF_ANEQ(daddr, &nk->addr[pd->didx], af) || nk->port[pd->didx] != dport) { pf_change_ap(m, daddr, &pd->hdr.udp->uh_dport, pd->ip_sum, &pd->hdr.udp->uh_sum, &nk->addr[pd->didx], nk->port[pd->didx], 1, af); dport = pd->hdr.udp->uh_dport; pd->dport = &pd->hdr.udp->uh_dport; } rewrite++; break; #ifdef INET case IPPROTO_ICMP: nk->port[0] = nk->port[1]; if (PF_ANEQ(saddr, &nk->addr[pd->sidx], AF_INET)) pf_change_a(&saddr->v4.s_addr, pd->ip_sum, nk->addr[pd->sidx].v4.s_addr, 0); if (PF_ANEQ(daddr, &nk->addr[pd->didx], AF_INET)) pf_change_a(&daddr->v4.s_addr, pd->ip_sum, nk->addr[pd->didx].v4.s_addr, 0); if (nk->port[1] != 
pd->hdr.icmp->icmp_id) { pd->hdr.icmp->icmp_cksum = pf_cksum_fixup( pd->hdr.icmp->icmp_cksum, sport, nk->port[1], 0); pd->hdr.icmp->icmp_id = nk->port[1]; pd->sport = &pd->hdr.icmp->icmp_id; } m_copyback(m, off, ICMP_MINLEN, (caddr_t)pd->hdr.icmp); break; #endif /* INET */ #ifdef INET6 case IPPROTO_ICMPV6: nk->port[0] = nk->port[1]; if (PF_ANEQ(saddr, &nk->addr[pd->sidx], AF_INET6)) pf_change_a6(saddr, &pd->hdr.icmp6->icmp6_cksum, &nk->addr[pd->sidx], 0); if (PF_ANEQ(daddr, &nk->addr[pd->didx], AF_INET6)) pf_change_a6(daddr, &pd->hdr.icmp6->icmp6_cksum, &nk->addr[pd->didx], 0); rewrite++; break; #endif /* INET */ default: switch (af) { #ifdef INET case AF_INET: if (PF_ANEQ(saddr, &nk->addr[pd->sidx], AF_INET)) pf_change_a(&saddr->v4.s_addr, pd->ip_sum, nk->addr[pd->sidx].v4.s_addr, 0); if (PF_ANEQ(daddr, &nk->addr[pd->didx], AF_INET)) pf_change_a(&daddr->v4.s_addr, pd->ip_sum, nk->addr[pd->didx].v4.s_addr, 0); break; #endif /* INET */ #ifdef INET6 case AF_INET6: if (PF_ANEQ(saddr, &nk->addr[pd->sidx], AF_INET6)) PF_ACPY(saddr, &nk->addr[pd->sidx], af); if (PF_ANEQ(daddr, &nk->addr[pd->didx], AF_INET6)) PF_ACPY(saddr, &nk->addr[pd->didx], af); break; #endif /* INET */ } break; } if (nr->natpass) r = NULL; pd->nat_rule = nr; } while (r != NULL) { r->evaluations++; if (pfi_kif_match(r->kif, kif) == r->ifnot) r = r->skip[PF_SKIP_IFP].ptr; else if (r->direction && r->direction != direction) r = r->skip[PF_SKIP_DIR].ptr; else if (r->af && r->af != af) r = r->skip[PF_SKIP_AF].ptr; else if (r->proto && r->proto != pd->proto) r = r->skip[PF_SKIP_PROTO].ptr; else if (PF_MISMATCHAW(&r->src.addr, saddr, af, r->src.neg, kif, M_GETFIB(m))) r = r->skip[PF_SKIP_SRC_ADDR].ptr; /* tcp/udp only. port_op always 0 in other cases */ else if (r->src.port_op && !pf_match_port(r->src.port_op, r->src.port[0], r->src.port[1], sport)) r = r->skip[PF_SKIP_SRC_PORT].ptr; else if (PF_MISMATCHAW(&r->dst.addr, daddr, af, r->dst.neg, NULL, M_GETFIB(m))) r = r->skip[PF_SKIP_DST_ADDR].ptr; /* tcp/udp only. port_op always 0 in other cases */ else if (r->dst.port_op && !pf_match_port(r->dst.port_op, r->dst.port[0], r->dst.port[1], dport)) r = r->skip[PF_SKIP_DST_PORT].ptr; /* icmp only. type always 0 in other cases */ else if (r->type && r->type != icmptype + 1) r = TAILQ_NEXT(r, entries); /* icmp only. type always 0 in other cases */ else if (r->code && r->code != icmpcode + 1) r = TAILQ_NEXT(r, entries); else if (r->tos && !(r->tos == pd->tos)) r = TAILQ_NEXT(r, entries); else if (r->rule_flag & PFRULE_FRAGMENT) r = TAILQ_NEXT(r, entries); else if (pd->proto == IPPROTO_TCP && (r->flagset & th->th_flags) != r->flags) r = TAILQ_NEXT(r, entries); /* tcp/udp only. uid.op always 0 in other cases */ else if (r->uid.op && (pd->lookup.done || (pd->lookup.done = pf_socket_lookup(direction, pd, m), 1)) && !pf_match_uid(r->uid.op, r->uid.uid[0], r->uid.uid[1], pd->lookup.uid)) r = TAILQ_NEXT(r, entries); /* tcp/udp only. gid.op always 0 in other cases */ else if (r->gid.op && (pd->lookup.done || (pd->lookup.done = pf_socket_lookup(direction, pd, m), 1)) && !pf_match_gid(r->gid.op, r->gid.gid[0], r->gid.gid[1], pd->lookup.gid)) r = TAILQ_NEXT(r, entries); else if (r->prio && !pf_match_ieee8021q_pcp(r->prio, m)) r = TAILQ_NEXT(r, entries); else if (r->prob && r->prob <= arc4random()) r = TAILQ_NEXT(r, entries); else if (r->match_tag && !pf_match_tag(m, r, &tag, pd->pf_mtag ? 
		    pd->pf_mtag->tag : 0))
			r = TAILQ_NEXT(r, entries);
		else if (r->os_fingerprint != PF_OSFP_ANY &&
		    (pd->proto != IPPROTO_TCP || !pf_osfp_match(
		    pf_osfp_fingerprint(pd, m, off, th),
		    r->os_fingerprint)))
			r = TAILQ_NEXT(r, entries);
		else {
			if (r->tag)
				tag = r->tag;
			if (r->rtableid >= 0)
				rtableid = r->rtableid;
			if (r->anchor == NULL) {
				match = 1;
				*rm = r;
				*am = a;
				*rsm = ruleset;
				if ((*rm)->quick)
					break;
				r = TAILQ_NEXT(r, entries);
			} else
				pf_step_into_anchor(anchor_stack, &asd,
				    &ruleset, PF_RULESET_FILTER, &r, &a,
				    &match);
		}
		if (r == NULL && pf_step_out_of_anchor(anchor_stack, &asd,
		    &ruleset, PF_RULESET_FILTER, &r, &a, &match))
			break;
	}
	r = *rm;
	a = *am;
	ruleset = *rsm;

	REASON_SET(&reason, PFRES_MATCH);

	if (r->log || (nr != NULL && nr->log)) {
		if (rewrite)
			m_copyback(m, off, hdrlen, pd->hdr.any);
		PFLOG_PACKET(kif, m, af, direction, reason, r->log ? r : nr, a,
		    ruleset, pd, 1);
	}

	if ((r->action == PF_DROP) &&
	    ((r->rule_flag & PFRULE_RETURNRST) ||
	    (r->rule_flag & PFRULE_RETURNICMP) ||
	    (r->rule_flag & PFRULE_RETURN))) {
		/* undo NAT changes, if they have taken place */
		if (nr != NULL) {
			PF_ACPY(saddr, &sk->addr[pd->sidx], af);
			PF_ACPY(daddr, &sk->addr[pd->didx], af);
			if (pd->sport)
				*pd->sport = sk->port[pd->sidx];
			if (pd->dport)
				*pd->dport = sk->port[pd->didx];
			if (pd->proto_sum)
				*pd->proto_sum = bproto_sum;
			if (pd->ip_sum)
				*pd->ip_sum = bip_sum;
			m_copyback(m, off, hdrlen, pd->hdr.any);
		}
		if (pd->proto == IPPROTO_TCP &&
		    ((r->rule_flag & PFRULE_RETURNRST) ||
		    (r->rule_flag & PFRULE_RETURN)) &&
		    !(th->th_flags & TH_RST)) {
			u_int32_t	 ack = ntohl(th->th_seq) + pd->p_len;
			int		 len = 0;
#ifdef INET
			struct ip	*h4;
#endif
#ifdef INET6
			struct ip6_hdr	*h6;
#endif

			switch (af) {
#ifdef INET
			case AF_INET:
				h4 = mtod(m, struct ip *);
				len = ntohs(h4->ip_len) - off;
				break;
#endif
#ifdef INET6
			case AF_INET6:
				h6 = mtod(m, struct ip6_hdr *);
				len = ntohs(h6->ip6_plen) - (off - sizeof(*h6));
				break;
#endif
			}

			if (pf_check_proto_cksum(m, off, len, IPPROTO_TCP, af))
				REASON_SET(&reason, PFRES_PROTCKSUM);
			else {
				if (th->th_flags & TH_SYN)
					ack++;
				if (th->th_flags & TH_FIN)
					ack++;
				pf_send_tcp(m, r, af, pd->dst,
				    pd->src, th->th_dport, th->th_sport,
				    ntohl(th->th_ack), ack, TH_RST|TH_ACK, 0, 0,
				    r->return_ttl, 1, 0, kif->pfik_ifp);
			}
		} else if (pd->proto != IPPROTO_ICMP && af == AF_INET &&
		    r->return_icmp)
			pf_send_icmp(m, r->return_icmp >> 8,
			    r->return_icmp & 255, af, r);
		else if (pd->proto != IPPROTO_ICMPV6 && af == AF_INET6 &&
		    r->return_icmp6)
			pf_send_icmp(m, r->return_icmp6 >> 8,
			    r->return_icmp6 & 255, af, r);
	}

	if (r->action == PF_DROP)
		goto cleanup;

	if (tag > 0 && pf_tag_packet(m, pd, tag)) {
		REASON_SET(&reason, PFRES_MEMORY);
		goto cleanup;
	}
	if (rtableid >= 0)
		M_SETFIB(m, rtableid);

	if (!state_icmp && (r->keep_state || nr != NULL ||
	    (pd->flags & PFDESC_TCP_NORM))) {
		int action;
		action = pf_create_state(r, nr, a, pd, nsn, nk, sk, m, off,
		    sport, dport, &rewrite, kif, sm, tag, bproto_sum, bip_sum,
		    hdrlen);
		if (action != PF_PASS)
			return (action);
	} else {
		if (sk != NULL)
			uma_zfree(V_pf_state_key_z, sk);
		if (nk != NULL)
			uma_zfree(V_pf_state_key_z, nk);
	}

	/* copy back packet headers if we performed NAT operations */
	if (rewrite)
		m_copyback(m, off, hdrlen, pd->hdr.any);

	if (*sm != NULL && !((*sm)->state_flags & PFSTATE_NOSYNC) &&
	    direction == PF_OUT &&
	    pfsync_defer_ptr != NULL && pfsync_defer_ptr(*sm, m))
		/*
		 * We want the state created, but we don't
		 * want to send this in case a partner
		 * firewall has to know about it to allow
		 * replies through it.
*/ return (PF_DEFER); return (PF_PASS); cleanup: if (sk != NULL) uma_zfree(V_pf_state_key_z, sk); if (nk != NULL) uma_zfree(V_pf_state_key_z, nk); return (PF_DROP); } static int pf_create_state(struct pf_rule *r, struct pf_rule *nr, struct pf_rule *a, struct pf_pdesc *pd, struct pf_src_node *nsn, struct pf_state_key *nk, struct pf_state_key *sk, struct mbuf *m, int off, u_int16_t sport, u_int16_t dport, int *rewrite, struct pfi_kif *kif, struct pf_state **sm, int tag, u_int16_t bproto_sum, u_int16_t bip_sum, int hdrlen) { struct pf_state *s = NULL; struct pf_src_node *sn = NULL; struct tcphdr *th = pd->hdr.tcp; u_int16_t mss = V_tcp_mssdflt; u_short reason; /* check maximums */ if (r->max_states && (counter_u64_fetch(r->states_cur) >= r->max_states)) { counter_u64_add(V_pf_status.lcounters[LCNT_STATES], 1); REASON_SET(&reason, PFRES_MAXSTATES); goto csfailed; } /* src node for filter rule */ if ((r->rule_flag & PFRULE_SRCTRACK || r->rpool.opts & PF_POOL_STICKYADDR) && pf_insert_src_node(&sn, r, pd->src, pd->af) != 0) { REASON_SET(&reason, PFRES_SRCLIMIT); goto csfailed; } /* src node for translation rule */ if (nr != NULL && (nr->rpool.opts & PF_POOL_STICKYADDR) && pf_insert_src_node(&nsn, nr, &sk->addr[pd->sidx], pd->af)) { REASON_SET(&reason, PFRES_SRCLIMIT); goto csfailed; } s = uma_zalloc(V_pf_state_z, M_NOWAIT | M_ZERO); if (s == NULL) { REASON_SET(&reason, PFRES_MEMORY); goto csfailed; } s->rule.ptr = r; s->nat_rule.ptr = nr; s->anchor.ptr = a; STATE_INC_COUNTERS(s); if (r->allow_opts) s->state_flags |= PFSTATE_ALLOWOPTS; if (r->rule_flag & PFRULE_STATESLOPPY) s->state_flags |= PFSTATE_SLOPPY; s->log = r->log & PF_LOG_ALL; s->sync_state = PFSYNC_S_NONE; if (nr != NULL) s->log |= nr->log & PF_LOG_ALL; switch (pd->proto) { case IPPROTO_TCP: s->src.seqlo = ntohl(th->th_seq); s->src.seqhi = s->src.seqlo + pd->p_len + 1; if ((th->th_flags & (TH_SYN|TH_ACK)) == TH_SYN && r->keep_state == PF_STATE_MODULATE) { /* Generate sequence number modulator */ if ((s->src.seqdiff = pf_tcp_iss(pd) - s->src.seqlo) == 0) s->src.seqdiff = 1; pf_change_proto_a(m, &th->th_seq, &th->th_sum, htonl(s->src.seqlo + s->src.seqdiff), 0); *rewrite = 1; } else s->src.seqdiff = 0; if (th->th_flags & TH_SYN) { s->src.seqhi++; s->src.wscale = pf_get_wscale(m, off, th->th_off, pd->af); } s->src.max_win = MAX(ntohs(th->th_win), 1); if (s->src.wscale & PF_WSCALE_MASK) { /* Remove scale factor from initial window */ int win = s->src.max_win; win += 1 << (s->src.wscale & PF_WSCALE_MASK); s->src.max_win = (win - 1) >> (s->src.wscale & PF_WSCALE_MASK); } if (th->th_flags & TH_FIN) s->src.seqhi++; s->dst.seqhi = 1; s->dst.max_win = 1; s->src.state = TCPS_SYN_SENT; s->dst.state = TCPS_CLOSED; s->timeout = PFTM_TCP_FIRST_PACKET; break; case IPPROTO_UDP: s->src.state = PFUDPS_SINGLE; s->dst.state = PFUDPS_NO_TRAFFIC; s->timeout = PFTM_UDP_FIRST_PACKET; break; case IPPROTO_ICMP: #ifdef INET6 case IPPROTO_ICMPV6: #endif s->timeout = PFTM_ICMP_FIRST_PACKET; break; default: s->src.state = PFOTHERS_SINGLE; s->dst.state = PFOTHERS_NO_TRAFFIC; s->timeout = PFTM_OTHER_FIRST_PACKET; } if (r->rt) { if (pf_map_addr(pd->af, r, pd->src, &s->rt_addr, NULL, &sn)) { REASON_SET(&reason, PFRES_MAPFAILED); pf_src_tree_remove_state(s); STATE_DEC_COUNTERS(s); uma_zfree(V_pf_state_z, s); goto csfailed; } s->rt_kif = r->rpool.cur->kif; } s->creation = time_uptime; s->expire = time_uptime; if (sn != NULL) s->src_node = sn; if (nsn != NULL) { /* XXX We only modify one side for now. 
 */
		PF_ACPY(&nsn->raddr, &nk->addr[1], pd->af);
		s->nat_src_node = nsn;
	}
	if (pd->proto == IPPROTO_TCP) {
		if ((pd->flags & PFDESC_TCP_NORM) && pf_normalize_tcp_init(m,
		    off, pd, th, &s->src, &s->dst)) {
			REASON_SET(&reason, PFRES_MEMORY);
			pf_src_tree_remove_state(s);
			STATE_DEC_COUNTERS(s);
			uma_zfree(V_pf_state_z, s);
			return (PF_DROP);
		}
		if ((pd->flags & PFDESC_TCP_NORM) && s->src.scrub &&
		    pf_normalize_tcp_stateful(m, off, pd, &reason, th, s,
		    &s->src, &s->dst, rewrite)) {
			/* This really shouldn't happen!!! */
			DPFPRINTF(PF_DEBUG_URGENT,
			    ("pf_normalize_tcp_stateful failed on first pkt"));
			pf_normalize_tcp_cleanup(s);
			pf_src_tree_remove_state(s);
			STATE_DEC_COUNTERS(s);
			uma_zfree(V_pf_state_z, s);
			return (PF_DROP);
		}
	}
	s->direction = pd->dir;

	/*
	 * sk/nk could have already been set up by pf_get_translation().
	 */
	if (nr == NULL) {
		KASSERT((sk == NULL && nk == NULL), ("%s: nr %p sk %p, nk %p",
		    __func__, nr, sk, nk));
		sk = pf_state_key_setup(pd, pd->src, pd->dst, sport, dport);
		if (sk == NULL)
			goto csfailed;
		nk = sk;
	} else
		KASSERT((sk != NULL && nk != NULL), ("%s: nr %p sk %p, nk %p",
		    __func__, nr, sk, nk));

	/* Swap sk/nk for PF_OUT. */
	if (pf_state_insert(BOUND_IFACE(r, kif),
	    (pd->dir == PF_IN) ? sk : nk,
	    (pd->dir == PF_IN) ? nk : sk, s)) {
		if (pd->proto == IPPROTO_TCP)
			pf_normalize_tcp_cleanup(s);
		REASON_SET(&reason, PFRES_STATEINS);
		pf_src_tree_remove_state(s);
		STATE_DEC_COUNTERS(s);
		uma_zfree(V_pf_state_z, s);
		return (PF_DROP);
	} else
		*sm = s;

	if (tag > 0)
		s->tag = tag;
	if (pd->proto == IPPROTO_TCP && (th->th_flags & (TH_SYN|TH_ACK)) ==
	    TH_SYN && r->keep_state == PF_STATE_SYNPROXY) {
		s->src.state = PF_TCPS_PROXY_SRC;
		/* undo NAT changes, if they have taken place */
		if (nr != NULL) {
			struct pf_state_key *skt = s->key[PF_SK_WIRE];
			if (pd->dir == PF_OUT)
				skt = s->key[PF_SK_STACK];
			PF_ACPY(pd->src, &skt->addr[pd->sidx], pd->af);
			PF_ACPY(pd->dst, &skt->addr[pd->didx], pd->af);
			if (pd->sport)
				*pd->sport = skt->port[pd->sidx];
			if (pd->dport)
				*pd->dport = skt->port[pd->didx];
			if (pd->proto_sum)
				*pd->proto_sum = bproto_sum;
			if (pd->ip_sum)
				*pd->ip_sum = bip_sum;
			m_copyback(m, off, hdrlen, pd->hdr.any);
		}
		s->src.seqhi = htonl(arc4random());
		/* Find mss option */
		int rtid = M_GETFIB(m);
		mss = pf_get_mss(m, off, th->th_off, pd->af);
		mss = pf_calc_mss(pd->src, pd->af, rtid, mss);
		mss = pf_calc_mss(pd->dst, pd->af, rtid, mss);
		s->src.mss = mss;
		pf_send_tcp(NULL, r, pd->af, pd->dst, pd->src, th->th_dport,
		    th->th_sport, s->src.seqhi, ntohl(th->th_seq) + 1,
		    TH_SYN|TH_ACK, 0, s->src.mss, 0, 1, 0, NULL);
		REASON_SET(&reason, PFRES_SYNPROXY);
		return (PF_SYNPROXY_DROP);
	}

	return (PF_PASS);

csfailed:
	if (sk != NULL)
		uma_zfree(V_pf_state_key_z, sk);
	if (nk != NULL)
		uma_zfree(V_pf_state_key_z, nk);

	if (sn != NULL) {
		struct pf_srchash *sh;

		sh = &V_pf_srchash[pf_hashsrc(&sn->addr, sn->af)];
		PF_HASHROW_LOCK(sh);
		if (--sn->states == 0 && sn->expire == 0) {
			pf_unlink_src_node(sn);
			uma_zfree(V_pf_sources_z, sn);
			counter_u64_add(
			    V_pf_status.scounters[SCNT_SRC_NODE_REMOVALS], 1);
		}
		PF_HASHROW_UNLOCK(sh);
	}

	if (nsn != sn && nsn != NULL) {
		struct pf_srchash *sh;

		sh = &V_pf_srchash[pf_hashsrc(&nsn->addr, nsn->af)];
		PF_HASHROW_LOCK(sh);
		if (--nsn->states == 0 && nsn->expire == 0) {
			pf_unlink_src_node(nsn);
			uma_zfree(V_pf_sources_z, nsn);
			counter_u64_add(
			    V_pf_status.scounters[SCNT_SRC_NODE_REMOVALS], 1);
		}
		PF_HASHROW_UNLOCK(sh);
	}

	return (PF_DROP);
}

static int
pf_test_fragment(struct pf_rule **rm, int direction, struct pfi_kif *kif,
    struct mbuf *m, void *h, struct pf_pdesc *pd, struct pf_rule **am,
    struct pf_ruleset **rsm)
{
	struct
pf_rule *r, *a = NULL; struct pf_ruleset *ruleset = NULL; sa_family_t af = pd->af; u_short reason; int tag = -1; int asd = 0; int match = 0; struct pf_anchor_stackframe anchor_stack[PF_ANCHOR_STACKSIZE]; PF_RULES_RASSERT(); r = TAILQ_FIRST(pf_main_ruleset.rules[PF_RULESET_FILTER].active.ptr); while (r != NULL) { r->evaluations++; if (pfi_kif_match(r->kif, kif) == r->ifnot) r = r->skip[PF_SKIP_IFP].ptr; else if (r->direction && r->direction != direction) r = r->skip[PF_SKIP_DIR].ptr; else if (r->af && r->af != af) r = r->skip[PF_SKIP_AF].ptr; else if (r->proto && r->proto != pd->proto) r = r->skip[PF_SKIP_PROTO].ptr; else if (PF_MISMATCHAW(&r->src.addr, pd->src, af, r->src.neg, kif, M_GETFIB(m))) r = r->skip[PF_SKIP_SRC_ADDR].ptr; else if (PF_MISMATCHAW(&r->dst.addr, pd->dst, af, r->dst.neg, NULL, M_GETFIB(m))) r = r->skip[PF_SKIP_DST_ADDR].ptr; else if (r->tos && !(r->tos == pd->tos)) r = TAILQ_NEXT(r, entries); else if (r->os_fingerprint != PF_OSFP_ANY) r = TAILQ_NEXT(r, entries); else if (pd->proto == IPPROTO_UDP && (r->src.port_op || r->dst.port_op)) r = TAILQ_NEXT(r, entries); else if (pd->proto == IPPROTO_TCP && (r->src.port_op || r->dst.port_op || r->flagset)) r = TAILQ_NEXT(r, entries); else if ((pd->proto == IPPROTO_ICMP || pd->proto == IPPROTO_ICMPV6) && (r->type || r->code)) r = TAILQ_NEXT(r, entries); else if (r->prio && !pf_match_ieee8021q_pcp(r->prio, m)) r = TAILQ_NEXT(r, entries); else if (r->prob && r->prob <= (arc4random() % (UINT_MAX - 1) + 1)) r = TAILQ_NEXT(r, entries); else if (r->match_tag && !pf_match_tag(m, r, &tag, pd->pf_mtag ? pd->pf_mtag->tag : 0)) r = TAILQ_NEXT(r, entries); else { if (r->anchor == NULL) { match = 1; *rm = r; *am = a; *rsm = ruleset; if ((*rm)->quick) break; r = TAILQ_NEXT(r, entries); } else pf_step_into_anchor(anchor_stack, &asd, &ruleset, PF_RULESET_FILTER, &r, &a, &match); } if (r == NULL && pf_step_out_of_anchor(anchor_stack, &asd, &ruleset, PF_RULESET_FILTER, &r, &a, &match)) break; } r = *rm; a = *am; ruleset = *rsm; REASON_SET(&reason, PFRES_MATCH); if (r->log) PFLOG_PACKET(kif, m, af, direction, reason, r, a, ruleset, pd, 1); if (r->action != PF_PASS) return (PF_DROP); if (tag > 0 && pf_tag_packet(m, pd, tag)) { REASON_SET(&reason, PFRES_MEMORY); return (PF_DROP); } return (PF_PASS); } static int pf_tcp_track_full(struct pf_state_peer *src, struct pf_state_peer *dst, struct pf_state **state, struct pfi_kif *kif, struct mbuf *m, int off, struct pf_pdesc *pd, u_short *reason, int *copyback) { struct tcphdr *th = pd->hdr.tcp; u_int16_t win = ntohs(th->th_win); u_int32_t ack, end, seq, orig_seq; u_int8_t sws, dws; int ackskew; if (src->wscale && dst->wscale && !(th->th_flags & TH_SYN)) { sws = src->wscale & PF_WSCALE_MASK; dws = dst->wscale & PF_WSCALE_MASK; } else sws = dws = 0; /* * Sequence tracking algorithm from Guido van Rooij's paper: * http://www.madison-gurkha.com/publications/tcp_filtering/ * tcp_filtering.ps */ orig_seq = seq = ntohl(th->th_seq); if (src->seqlo == 0) { /* First packet from this end. 
Set its state */ if ((pd->flags & PFDESC_TCP_NORM || dst->scrub) && src->scrub == NULL) { if (pf_normalize_tcp_init(m, off, pd, th, src, dst)) { REASON_SET(reason, PFRES_MEMORY); return (PF_DROP); } } /* Deferred generation of sequence number modulator */ if (dst->seqdiff && !src->seqdiff) { /* use random iss for the TCP server */ while ((src->seqdiff = arc4random() - seq) == 0) ; ack = ntohl(th->th_ack) - dst->seqdiff; pf_change_proto_a(m, &th->th_seq, &th->th_sum, htonl(seq + src->seqdiff), 0); pf_change_proto_a(m, &th->th_ack, &th->th_sum, htonl(ack), 0); *copyback = 1; } else { ack = ntohl(th->th_ack); } end = seq + pd->p_len; if (th->th_flags & TH_SYN) { end++; if (dst->wscale & PF_WSCALE_FLAG) { src->wscale = pf_get_wscale(m, off, th->th_off, pd->af); if (src->wscale & PF_WSCALE_FLAG) { /* Remove scale factor from initial * window */ sws = src->wscale & PF_WSCALE_MASK; win = ((u_int32_t)win + (1 << sws) - 1) >> sws; dws = dst->wscale & PF_WSCALE_MASK; } else { /* fixup other window */ dst->max_win <<= dst->wscale & PF_WSCALE_MASK; /* in case of a retrans SYN|ACK */ dst->wscale = 0; } } } if (th->th_flags & TH_FIN) end++; src->seqlo = seq; if (src->state < TCPS_SYN_SENT) src->state = TCPS_SYN_SENT; /* * May need to slide the window (seqhi may have been set by * the crappy stack check or if we picked up the connection * after establishment) */ if (src->seqhi == 1 || SEQ_GEQ(end + MAX(1, dst->max_win << dws), src->seqhi)) src->seqhi = end + MAX(1, dst->max_win << dws); if (win > src->max_win) src->max_win = win; } else { ack = ntohl(th->th_ack) - dst->seqdiff; if (src->seqdiff) { /* Modulate sequence numbers */ pf_change_proto_a(m, &th->th_seq, &th->th_sum, htonl(seq + src->seqdiff), 0); pf_change_proto_a(m, &th->th_ack, &th->th_sum, htonl(ack), 0); *copyback = 1; } end = seq + pd->p_len; if (th->th_flags & TH_SYN) end++; if (th->th_flags & TH_FIN) end++; } if ((th->th_flags & TH_ACK) == 0) { /* Let it pass through the ack skew check */ ack = dst->seqlo; } else if ((ack == 0 && (th->th_flags & (TH_ACK|TH_RST)) == (TH_ACK|TH_RST)) || /* broken tcp stacks do not set ack */ (dst->state < TCPS_SYN_SENT)) { /* * Many stacks (ours included) will set the ACK number in an * FIN|ACK if the SYN times out -- no sequence to ACK. */ ack = dst->seqlo; } if (seq == end) { /* Ease sequencing restrictions on no data packets */ seq = src->seqlo; end = seq; } ackskew = dst->seqlo - ack; /* * Need to demodulate the sequence numbers in any TCP SACK options * (Selective ACK). We could optionally validate the SACK values * against the current ACK window, either forwards or backwards, but * I'm not confident that SACK has been implemented properly * everywhere. It wouldn't surprise me if several stacks accidentally * SACK too far backwards of previously ACKed data. There really aren't * any security implications of bad SACKing unless the target stack * doesn't validate the option length correctly. Someone trying to * spoof into a TCP connection won't bother blindly sending SACK * options anyway. 
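* Note that pf_modulate_sack() below only runs when a sequence number modulator (seqdiff) is active and the header is long enough to carry options; the SACK edges it rewrites live in the same modulated sequence space as th_seq and th_ack.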
*/ if (dst->seqdiff && (th->th_off << 2) > sizeof(struct tcphdr)) { if (pf_modulate_sack(m, off, pd, th, dst)) *copyback = 1; } #define MAXACKWINDOW (0xffff + 1500) /* 1500 is an arbitrary fudge factor */ if (SEQ_GEQ(src->seqhi, end) && /* Last octet inside other's window space */ SEQ_GEQ(seq, src->seqlo - (dst->max_win << dws)) && /* Retrans: not more than one window back */ (ackskew >= -MAXACKWINDOW) && /* Acking not more than one reassembled fragment backwards */ (ackskew <= (MAXACKWINDOW << sws)) && /* Acking not more than one window forward */ ((th->th_flags & TH_RST) == 0 || orig_seq == src->seqlo || (orig_seq == src->seqlo + 1) || (orig_seq + 1 == src->seqlo) || (pd->flags & PFDESC_IP_REAS) == 0)) { /* Require an exact/+1 sequence match on resets when possible */ if (dst->scrub || src->scrub) { if (pf_normalize_tcp_stateful(m, off, pd, reason, th, *state, src, dst, copyback)) return (PF_DROP); } /* update max window */ if (src->max_win < win) src->max_win = win; /* synchronize sequencing */ if (SEQ_GT(end, src->seqlo)) src->seqlo = end; /* slide the window of what the other end can send */ if (SEQ_GEQ(ack + (win << sws), dst->seqhi)) dst->seqhi = ack + MAX((win << sws), 1); /* update states */ if (th->th_flags & TH_SYN) if (src->state < TCPS_SYN_SENT) src->state = TCPS_SYN_SENT; if (th->th_flags & TH_FIN) if (src->state < TCPS_CLOSING) src->state = TCPS_CLOSING; if (th->th_flags & TH_ACK) { if (dst->state == TCPS_SYN_SENT) { dst->state = TCPS_ESTABLISHED; if (src->state == TCPS_ESTABLISHED && (*state)->src_node != NULL && pf_src_connlimit(state)) { REASON_SET(reason, PFRES_SRCLIMIT); return (PF_DROP); } } else if (dst->state == TCPS_CLOSING) dst->state = TCPS_FIN_WAIT_2; } if (th->th_flags & TH_RST) src->state = dst->state = TCPS_TIME_WAIT; /* update expire time */ (*state)->expire = time_uptime; if (src->state >= TCPS_FIN_WAIT_2 && dst->state >= TCPS_FIN_WAIT_2) (*state)->timeout = PFTM_TCP_CLOSED; else if (src->state >= TCPS_CLOSING && dst->state >= TCPS_CLOSING) (*state)->timeout = PFTM_TCP_FIN_WAIT; else if (src->state < TCPS_ESTABLISHED || dst->state < TCPS_ESTABLISHED) (*state)->timeout = PFTM_TCP_OPENING; else if (src->state >= TCPS_CLOSING || dst->state >= TCPS_CLOSING) (*state)->timeout = PFTM_TCP_CLOSING; else (*state)->timeout = PFTM_TCP_ESTABLISHED; /* Fall through to PASS packet */ } else if ((dst->state < TCPS_SYN_SENT || dst->state >= TCPS_FIN_WAIT_2 || src->state >= TCPS_FIN_WAIT_2) && SEQ_GEQ(src->seqhi + MAXACKWINDOW, end) && /* Within a window forward of the originating packet */ SEQ_GEQ(seq, src->seqlo - MAXACKWINDOW)) { /* Within a window backward of the originating packet */ /* * This currently handles three situations: * 1) Stupid stacks will shotgun SYNs before their peer * replies. * 2) When PF catches an already established stream (the * firewall rebooted, the state table was flushed, routes * changed...) * 3) Packets get funky immediately after the connection * closes (this should catch Solaris spurious ACK|FINs * that web servers like to spew after a close) * * This must be a little more careful than the above code * since packet floods will also be caught here. We don't * update the TTL here to mitigate the damage of a packet * flood and so the same code can handle awkward establishment * and a loosened connection close. * In the establishment case, a correct peer response will * validate the connection, go through the normal state code * and keep updating the state TTL. 
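* In the flood case nothing in this branch refreshes the state's expire time, so a stream of loosely matching packets cannot keep a dead state alive indefinitely.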
*/ if (V_pf_status.debug >= PF_DEBUG_MISC) { printf("pf: loose state match: "); pf_print_state(*state); pf_print_flags(th->th_flags); printf(" seq=%u (%u) ack=%u len=%u ackskew=%d " "pkts=%llu:%llu dir=%s,%s\n", seq, orig_seq, ack, pd->p_len, ackskew, (unsigned long long)(*state)->packets[0], (unsigned long long)(*state)->packets[1], pd->dir == PF_IN ? "in" : "out", pd->dir == (*state)->direction ? "fwd" : "rev"); } if (dst->scrub || src->scrub) { if (pf_normalize_tcp_stateful(m, off, pd, reason, th, *state, src, dst, copyback)) return (PF_DROP); } /* update max window */ if (src->max_win < win) src->max_win = win; /* synchronize sequencing */ if (SEQ_GT(end, src->seqlo)) src->seqlo = end; /* slide the window of what the other end can send */ if (SEQ_GEQ(ack + (win << sws), dst->seqhi)) dst->seqhi = ack + MAX((win << sws), 1); /* * Cannot set dst->seqhi here since this could be a shotgunned * SYN and not an already established connection. */ if (th->th_flags & TH_FIN) if (src->state < TCPS_CLOSING) src->state = TCPS_CLOSING; if (th->th_flags & TH_RST) src->state = dst->state = TCPS_TIME_WAIT; /* Fall through to PASS packet */ } else { if ((*state)->dst.state == TCPS_SYN_SENT && (*state)->src.state == TCPS_SYN_SENT) { /* Send RST for state mismatches during handshake */ if (!(th->th_flags & TH_RST)) pf_send_tcp(NULL, (*state)->rule.ptr, pd->af, pd->dst, pd->src, th->th_dport, th->th_sport, ntohl(th->th_ack), 0, TH_RST, 0, 0, (*state)->rule.ptr->return_ttl, 1, 0, kif->pfik_ifp); src->seqlo = 0; src->seqhi = 1; src->max_win = 1; } else if (V_pf_status.debug >= PF_DEBUG_MISC) { printf("pf: BAD state: "); pf_print_state(*state); pf_print_flags(th->th_flags); printf(" seq=%u (%u) ack=%u len=%u ackskew=%d " "pkts=%llu:%llu dir=%s,%s\n", seq, orig_seq, ack, pd->p_len, ackskew, (unsigned long long)(*state)->packets[0], (unsigned long long)(*state)->packets[1], pd->dir == PF_IN ? "in" : "out", pd->dir == (*state)->direction ? "fwd" : "rev"); printf("pf: State failure on: %c %c %c %c | %c %c\n", SEQ_GEQ(src->seqhi, end) ? ' ' : '1', SEQ_GEQ(seq, src->seqlo - (dst->max_win << dws)) ? ' ': '2', (ackskew >= -MAXACKWINDOW) ? ' ' : '3', (ackskew <= (MAXACKWINDOW << sws)) ? ' ' : '4', SEQ_GEQ(src->seqhi + MAXACKWINDOW, end) ?' ' :'5', SEQ_GEQ(seq, src->seqlo - MAXACKWINDOW) ?' ' :'6'); } REASON_SET(reason, PFRES_BADSTATE); return (PF_DROP); } return (PF_PASS); } static int pf_tcp_track_sloppy(struct pf_state_peer *src, struct pf_state_peer *dst, struct pf_state **state, struct pf_pdesc *pd, u_short *reason) { struct tcphdr *th = pd->hdr.tcp; if (th->th_flags & TH_SYN) if (src->state < TCPS_SYN_SENT) src->state = TCPS_SYN_SENT; if (th->th_flags & TH_FIN) if (src->state < TCPS_CLOSING) src->state = TCPS_CLOSING; if (th->th_flags & TH_ACK) { if (dst->state == TCPS_SYN_SENT) { dst->state = TCPS_ESTABLISHED; if (src->state == TCPS_ESTABLISHED && (*state)->src_node != NULL && pf_src_connlimit(state)) { REASON_SET(reason, PFRES_SRCLIMIT); return (PF_DROP); } } else if (dst->state == TCPS_CLOSING) { dst->state = TCPS_FIN_WAIT_2; } else if (src->state == TCPS_SYN_SENT && dst->state < TCPS_SYN_SENT) { /* * Handle a special sloppy case where we only see one * half of the connection. If there is an ACK after * the initial SYN without ever seeing a packet from * the destination, set the connection to established.
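* This typically covers asymmetric routing setups where only one direction of the connection passes through this firewall.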
*/ dst->state = src->state = TCPS_ESTABLISHED; if ((*state)->src_node != NULL && pf_src_connlimit(state)) { REASON_SET(reason, PFRES_SRCLIMIT); return (PF_DROP); } } else if (src->state == TCPS_CLOSING && dst->state == TCPS_ESTABLISHED && dst->seqlo == 0) { /* * Handle the closing of half connections where we * don't see the full bidirectional FIN/ACK+ACK * handshake. */ dst->state = TCPS_CLOSING; } } if (th->th_flags & TH_RST) src->state = dst->state = TCPS_TIME_WAIT; /* update expire time */ (*state)->expire = time_uptime; if (src->state >= TCPS_FIN_WAIT_2 && dst->state >= TCPS_FIN_WAIT_2) (*state)->timeout = PFTM_TCP_CLOSED; else if (src->state >= TCPS_CLOSING && dst->state >= TCPS_CLOSING) (*state)->timeout = PFTM_TCP_FIN_WAIT; else if (src->state < TCPS_ESTABLISHED || dst->state < TCPS_ESTABLISHED) (*state)->timeout = PFTM_TCP_OPENING; else if (src->state >= TCPS_CLOSING || dst->state >= TCPS_CLOSING) (*state)->timeout = PFTM_TCP_CLOSING; else (*state)->timeout = PFTM_TCP_ESTABLISHED; return (PF_PASS); } static int pf_test_state_tcp(struct pf_state **state, int direction, struct pfi_kif *kif, struct mbuf *m, int off, void *h, struct pf_pdesc *pd, u_short *reason) { struct pf_state_key_cmp key; struct tcphdr *th = pd->hdr.tcp; int copyback = 0; struct pf_state_peer *src, *dst; struct pf_state_key *sk; bzero(&key, sizeof(key)); key.af = pd->af; key.proto = IPPROTO_TCP; if (direction == PF_IN) { /* wire side, straight */ PF_ACPY(&key.addr[0], pd->src, key.af); PF_ACPY(&key.addr[1], pd->dst, key.af); key.port[0] = th->th_sport; key.port[1] = th->th_dport; } else { /* stack side, reverse */ PF_ACPY(&key.addr[1], pd->src, key.af); PF_ACPY(&key.addr[0], pd->dst, key.af); key.port[1] = th->th_sport; key.port[0] = th->th_dport; } STATE_LOOKUP(kif, &key, direction, *state, pd); if (direction == (*state)->direction) { src = &(*state)->src; dst = &(*state)->dst; } else { src = &(*state)->dst; dst = &(*state)->src; } sk = (*state)->key[pd->didx]; if ((*state)->src.state == PF_TCPS_PROXY_SRC) { if (direction != (*state)->direction) { REASON_SET(reason, PFRES_SYNPROXY); return (PF_SYNPROXY_DROP); } if (th->th_flags & TH_SYN) { if (ntohl(th->th_seq) != (*state)->src.seqlo) { REASON_SET(reason, PFRES_SYNPROXY); return (PF_DROP); } pf_send_tcp(NULL, (*state)->rule.ptr, pd->af, pd->dst, pd->src, th->th_dport, th->th_sport, (*state)->src.seqhi, ntohl(th->th_seq) + 1, TH_SYN|TH_ACK, 0, (*state)->src.mss, 0, 1, 0, NULL); REASON_SET(reason, PFRES_SYNPROXY); return (PF_SYNPROXY_DROP); } else if (!(th->th_flags & TH_ACK) || (ntohl(th->th_ack) != (*state)->src.seqhi + 1) || (ntohl(th->th_seq) != (*state)->src.seqlo + 1)) { REASON_SET(reason, PFRES_SYNPROXY); return (PF_DROP); } else if ((*state)->src_node != NULL && pf_src_connlimit(state)) { REASON_SET(reason, PFRES_SRCLIMIT); return (PF_DROP); } else (*state)->src.state = PF_TCPS_PROXY_DST; } if ((*state)->src.state == PF_TCPS_PROXY_DST) { if (direction == (*state)->direction) { if (((th->th_flags & (TH_SYN|TH_ACK)) != TH_ACK) || (ntohl(th->th_ack) != (*state)->src.seqhi + 1) || (ntohl(th->th_seq) != (*state)->src.seqlo + 1)) { REASON_SET(reason, PFRES_SYNPROXY); return (PF_DROP); } (*state)->src.max_win = MAX(ntohs(th->th_win), 1); if ((*state)->dst.seqhi == 1) (*state)->dst.seqhi = htonl(arc4random()); pf_send_tcp(NULL, (*state)->rule.ptr, pd->af, &sk->addr[pd->sidx], &sk->addr[pd->didx], sk->port[pd->sidx], sk->port[pd->didx], (*state)->dst.seqhi, 0, TH_SYN, 0, (*state)->src.mss, 0, 0, (*state)->tag, NULL); REASON_SET(reason, PFRES_SYNPROXY); return 
(PF_SYNPROXY_DROP); } else if (((th->th_flags & (TH_SYN|TH_ACK)) != (TH_SYN|TH_ACK)) || (ntohl(th->th_ack) != (*state)->dst.seqhi + 1)) { REASON_SET(reason, PFRES_SYNPROXY); return (PF_DROP); } else { (*state)->dst.max_win = MAX(ntohs(th->th_win), 1); (*state)->dst.seqlo = ntohl(th->th_seq); pf_send_tcp(NULL, (*state)->rule.ptr, pd->af, pd->dst, pd->src, th->th_dport, th->th_sport, ntohl(th->th_ack), ntohl(th->th_seq) + 1, TH_ACK, (*state)->src.max_win, 0, 0, 0, (*state)->tag, NULL); pf_send_tcp(NULL, (*state)->rule.ptr, pd->af, &sk->addr[pd->sidx], &sk->addr[pd->didx], sk->port[pd->sidx], sk->port[pd->didx], (*state)->src.seqhi + 1, (*state)->src.seqlo + 1, TH_ACK, (*state)->dst.max_win, 0, 0, 1, 0, NULL); (*state)->src.seqdiff = (*state)->dst.seqhi - (*state)->src.seqlo; (*state)->dst.seqdiff = (*state)->src.seqhi - (*state)->dst.seqlo; (*state)->src.seqhi = (*state)->src.seqlo + (*state)->dst.max_win; (*state)->dst.seqhi = (*state)->dst.seqlo + (*state)->src.max_win; (*state)->src.wscale = (*state)->dst.wscale = 0; (*state)->src.state = (*state)->dst.state = TCPS_ESTABLISHED; REASON_SET(reason, PFRES_SYNPROXY); return (PF_SYNPROXY_DROP); } } if (((th->th_flags & (TH_SYN|TH_ACK)) == TH_SYN) && dst->state >= TCPS_FIN_WAIT_2 && src->state >= TCPS_FIN_WAIT_2) { if (V_pf_status.debug >= PF_DEBUG_MISC) { printf("pf: state reuse "); pf_print_state(*state); pf_print_flags(th->th_flags); printf("\n"); } /* XXX make sure it's the same direction ?? */ (*state)->src.state = (*state)->dst.state = TCPS_CLOSED; pf_unlink_state(*state, PF_ENTER_LOCKED); *state = NULL; return (PF_DROP); } if ((*state)->state_flags & PFSTATE_SLOPPY) { if (pf_tcp_track_sloppy(src, dst, state, pd, reason) == PF_DROP) return (PF_DROP); } else { if (pf_tcp_track_full(src, dst, state, kif, m, off, pd, reason, &copyback) == PF_DROP) return (PF_DROP); } /* translate source/destination address, if necessary */ if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) { struct pf_state_key *nk = (*state)->key[pd->didx]; if (PF_ANEQ(pd->src, &nk->addr[pd->sidx], pd->af) || nk->port[pd->sidx] != th->th_sport) pf_change_ap(m, pd->src, &th->th_sport, pd->ip_sum, &th->th_sum, &nk->addr[pd->sidx], nk->port[pd->sidx], 0, pd->af); if (PF_ANEQ(pd->dst, &nk->addr[pd->didx], pd->af) || nk->port[pd->didx] != th->th_dport) pf_change_ap(m, pd->dst, &th->th_dport, pd->ip_sum, &th->th_sum, &nk->addr[pd->didx], nk->port[pd->didx], 0, pd->af); copyback = 1; } /* Copyback sequence modulation or stateful scrub changes if needed */ if (copyback) m_copyback(m, off, sizeof(*th), (caddr_t)th); return (PF_PASS); } static int pf_test_state_udp(struct pf_state **state, int direction, struct pfi_kif *kif, struct mbuf *m, int off, void *h, struct pf_pdesc *pd) { struct pf_state_peer *src, *dst; struct pf_state_key_cmp key; struct udphdr *uh = pd->hdr.udp; bzero(&key, sizeof(key)); key.af = pd->af; key.proto = IPPROTO_UDP; if (direction == PF_IN) { /* wire side, straight */ PF_ACPY(&key.addr[0], pd->src, key.af); PF_ACPY(&key.addr[1], pd->dst, key.af); key.port[0] = uh->uh_sport; key.port[1] = uh->uh_dport; } else { /* stack side, reverse */ PF_ACPY(&key.addr[1], pd->src, key.af); PF_ACPY(&key.addr[0], pd->dst, key.af); key.port[1] = uh->uh_sport; key.port[0] = uh->uh_dport; } STATE_LOOKUP(kif, &key, direction, *state, pd); if (direction == (*state)->direction) { src = &(*state)->src; dst = &(*state)->dst; } else { src = &(*state)->dst; dst = &(*state)->src; } /* update states */ if (src->state < PFUDPS_SINGLE) src->state = PFUDPS_SINGLE; if (dst->state ==
PFUDPS_SINGLE) dst->state = PFUDPS_MULTIPLE; /* update expire time */ (*state)->expire = time_uptime; if (src->state == PFUDPS_MULTIPLE && dst->state == PFUDPS_MULTIPLE) (*state)->timeout = PFTM_UDP_MULTIPLE; else (*state)->timeout = PFTM_UDP_SINGLE; /* translate source/destination address, if necessary */ if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) { struct pf_state_key *nk = (*state)->key[pd->didx]; if (PF_ANEQ(pd->src, &nk->addr[pd->sidx], pd->af) || nk->port[pd->sidx] != uh->uh_sport) pf_change_ap(m, pd->src, &uh->uh_sport, pd->ip_sum, &uh->uh_sum, &nk->addr[pd->sidx], nk->port[pd->sidx], 1, pd->af); if (PF_ANEQ(pd->dst, &nk->addr[pd->didx], pd->af) || nk->port[pd->didx] != uh->uh_dport) pf_change_ap(m, pd->dst, &uh->uh_dport, pd->ip_sum, &uh->uh_sum, &nk->addr[pd->didx], nk->port[pd->didx], 1, pd->af); m_copyback(m, off, sizeof(*uh), (caddr_t)uh); } return (PF_PASS); } static int pf_test_state_icmp(struct pf_state **state, int direction, struct pfi_kif *kif, struct mbuf *m, int off, void *h, struct pf_pdesc *pd, u_short *reason) { struct pf_addr *saddr = pd->src, *daddr = pd->dst; u_int16_t icmpid = 0, *icmpsum; u_int8_t icmptype; int state_icmp = 0; struct pf_state_key_cmp key; bzero(&key, sizeof(key)); switch (pd->proto) { #ifdef INET case IPPROTO_ICMP: icmptype = pd->hdr.icmp->icmp_type; icmpid = pd->hdr.icmp->icmp_id; icmpsum = &pd->hdr.icmp->icmp_cksum; if (icmptype == ICMP_UNREACH || icmptype == ICMP_SOURCEQUENCH || icmptype == ICMP_REDIRECT || icmptype == ICMP_TIMXCEED || icmptype == ICMP_PARAMPROB) state_icmp++; break; #endif /* INET */ #ifdef INET6 case IPPROTO_ICMPV6: icmptype = pd->hdr.icmp6->icmp6_type; icmpid = pd->hdr.icmp6->icmp6_id; icmpsum = &pd->hdr.icmp6->icmp6_cksum; if (icmptype == ICMP6_DST_UNREACH || icmptype == ICMP6_PACKET_TOO_BIG || icmptype == ICMP6_TIME_EXCEEDED || icmptype == ICMP6_PARAM_PROB) state_icmp++; break; #endif /* INET6 */ } if (!state_icmp) { /* * ICMP query/reply message not related to a TCP/UDP packet. * Search for an ICMP state. 
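* Queries and replies are matched on the ICMP id, which is stored in both port slots of the state key.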
*/ key.af = pd->af; key.proto = pd->proto; key.port[0] = key.port[1] = icmpid; if (direction == PF_IN) { /* wire side, straight */ PF_ACPY(&key.addr[0], pd->src, key.af); PF_ACPY(&key.addr[1], pd->dst, key.af); } else { /* stack side, reverse */ PF_ACPY(&key.addr[1], pd->src, key.af); PF_ACPY(&key.addr[0], pd->dst, key.af); } STATE_LOOKUP(kif, &key, direction, *state, pd); (*state)->expire = time_uptime; (*state)->timeout = PFTM_ICMP_ERROR_REPLY; /* translate source/destination address, if necessary */ if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) { struct pf_state_key *nk = (*state)->key[pd->didx]; switch (pd->af) { #ifdef INET case AF_INET: if (PF_ANEQ(pd->src, &nk->addr[pd->sidx], AF_INET)) pf_change_a(&saddr->v4.s_addr, pd->ip_sum, nk->addr[pd->sidx].v4.s_addr, 0); if (PF_ANEQ(pd->dst, &nk->addr[pd->didx], AF_INET)) pf_change_a(&daddr->v4.s_addr, pd->ip_sum, nk->addr[pd->didx].v4.s_addr, 0); if (nk->port[0] != pd->hdr.icmp->icmp_id) { pd->hdr.icmp->icmp_cksum = pf_cksum_fixup( pd->hdr.icmp->icmp_cksum, icmpid, nk->port[pd->sidx], 0); pd->hdr.icmp->icmp_id = nk->port[pd->sidx]; } m_copyback(m, off, ICMP_MINLEN, (caddr_t )pd->hdr.icmp); break; #endif /* INET */ #ifdef INET6 case AF_INET6: if (PF_ANEQ(pd->src, &nk->addr[pd->sidx], AF_INET6)) pf_change_a6(saddr, &pd->hdr.icmp6->icmp6_cksum, &nk->addr[pd->sidx], 0); if (PF_ANEQ(pd->dst, &nk->addr[pd->didx], AF_INET6)) pf_change_a6(daddr, &pd->hdr.icmp6->icmp6_cksum, &nk->addr[pd->didx], 0); m_copyback(m, off, sizeof(struct icmp6_hdr), (caddr_t )pd->hdr.icmp6); break; #endif /* INET6 */ } } return (PF_PASS); } else { /* * ICMP error message in response to a TCP/UDP packet. * Extract the inner TCP/UDP header and search for that state. */ struct pf_pdesc pd2; bzero(&pd2, sizeof pd2); #ifdef INET struct ip h2; #endif /* INET */ #ifdef INET6 struct ip6_hdr h2_6; int terminal = 0; #endif /* INET6 */ int ipoff2 = 0; int off2 = 0; pd2.af = pd->af; /* Payload packet is from the opposite direction. */ pd2.sidx = (direction == PF_IN) ? 1 : 0; pd2.didx = (direction == PF_IN) ? 
0 : 1; switch (pd->af) { #ifdef INET case AF_INET: /* offset of h2 in mbuf chain */ ipoff2 = off + ICMP_MINLEN; if (!pf_pull_hdr(m, ipoff2, &h2, sizeof(h2), NULL, reason, pd2.af)) { DPFPRINTF(PF_DEBUG_MISC, ("pf: ICMP error message too short " "(ip)\n")); return (PF_DROP); } /* * ICMP error messages don't refer to non-first * fragments */ if (h2.ip_off & htons(IP_OFFMASK)) { REASON_SET(reason, PFRES_FRAG); return (PF_DROP); } /* offset of protocol header that follows h2 */ off2 = ipoff2 + (h2.ip_hl << 2); pd2.proto = h2.ip_p; pd2.src = (struct pf_addr *)&h2.ip_src; pd2.dst = (struct pf_addr *)&h2.ip_dst; pd2.ip_sum = &h2.ip_sum; break; #endif /* INET */ #ifdef INET6 case AF_INET6: ipoff2 = off + sizeof(struct icmp6_hdr); if (!pf_pull_hdr(m, ipoff2, &h2_6, sizeof(h2_6), NULL, reason, pd2.af)) { DPFPRINTF(PF_DEBUG_MISC, ("pf: ICMP error message too short " "(ip6)\n")); return (PF_DROP); } pd2.proto = h2_6.ip6_nxt; pd2.src = (struct pf_addr *)&h2_6.ip6_src; pd2.dst = (struct pf_addr *)&h2_6.ip6_dst; pd2.ip_sum = NULL; off2 = ipoff2 + sizeof(h2_6); do { switch (pd2.proto) { case IPPROTO_FRAGMENT: /* * ICMPv6 error messages for * non-first fragments */ REASON_SET(reason, PFRES_FRAG); return (PF_DROP); case IPPROTO_AH: case IPPROTO_HOPOPTS: case IPPROTO_ROUTING: case IPPROTO_DSTOPTS: { /* get next header and header length */ struct ip6_ext opt6; if (!pf_pull_hdr(m, off2, &opt6, sizeof(opt6), NULL, reason, pd2.af)) { DPFPRINTF(PF_DEBUG_MISC, ("pf: ICMPv6 short opt\n")); return (PF_DROP); } if (pd2.proto == IPPROTO_AH) off2 += (opt6.ip6e_len + 2) * 4; else off2 += (opt6.ip6e_len + 1) * 8; pd2.proto = opt6.ip6e_nxt; /* goto the next header */ break; } default: terminal++; break; } } while (!terminal); break; #endif /* INET6 */ } switch (pd2.proto) { case IPPROTO_TCP: { struct tcphdr th; u_int32_t seq; struct pf_state_peer *src, *dst; u_int8_t dws; int copyback = 0; /* * Only the first 8 bytes of the TCP header can be * expected. Don't access any TCP header fields after * th_seq, an ackskew test is not possible. 
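* (RFC 792 only requires an ICMP error message to quote the offending IP header plus the first 8 octets of its payload.)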
*/ if (!pf_pull_hdr(m, off2, &th, 8, NULL, reason, pd2.af)) { DPFPRINTF(PF_DEBUG_MISC, ("pf: ICMP error message too short " "(tcp)\n")); return (PF_DROP); } key.af = pd2.af; key.proto = IPPROTO_TCP; PF_ACPY(&key.addr[pd2.sidx], pd2.src, key.af); PF_ACPY(&key.addr[pd2.didx], pd2.dst, key.af); key.port[pd2.sidx] = th.th_sport; key.port[pd2.didx] = th.th_dport; STATE_LOOKUP(kif, &key, direction, *state, pd); if (direction == (*state)->direction) { src = &(*state)->dst; dst = &(*state)->src; } else { src = &(*state)->src; dst = &(*state)->dst; } if (src->wscale && dst->wscale) dws = dst->wscale & PF_WSCALE_MASK; else dws = 0; /* Demodulate sequence number */ seq = ntohl(th.th_seq) - src->seqdiff; if (src->seqdiff) { pf_change_a(&th.th_seq, icmpsum, htonl(seq), 0); copyback = 1; } if (!((*state)->state_flags & PFSTATE_SLOPPY) && (!SEQ_GEQ(src->seqhi, seq) || !SEQ_GEQ(seq, src->seqlo - (dst->max_win << dws)))) { if (V_pf_status.debug >= PF_DEBUG_MISC) { printf("pf: BAD ICMP %d:%d ", icmptype, pd->hdr.icmp->icmp_code); pf_print_host(pd->src, 0, pd->af); printf(" -> "); pf_print_host(pd->dst, 0, pd->af); printf(" state: "); pf_print_state(*state); printf(" seq=%u\n", seq); } REASON_SET(reason, PFRES_BADSTATE); return (PF_DROP); } else { if (V_pf_status.debug >= PF_DEBUG_MISC) { printf("pf: OK ICMP %d:%d ", icmptype, pd->hdr.icmp->icmp_code); pf_print_host(pd->src, 0, pd->af); printf(" -> "); pf_print_host(pd->dst, 0, pd->af); printf(" state: "); pf_print_state(*state); printf(" seq=%u\n", seq); } } /* translate source/destination address, if necessary */ if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) { struct pf_state_key *nk = (*state)->key[pd->didx]; if (PF_ANEQ(pd2.src, &nk->addr[pd2.sidx], pd2.af) || nk->port[pd2.sidx] != th.th_sport) pf_change_icmp(pd2.src, &th.th_sport, daddr, &nk->addr[pd2.sidx], nk->port[pd2.sidx], NULL, pd2.ip_sum, icmpsum, pd->ip_sum, 0, pd2.af); if (PF_ANEQ(pd2.dst, &nk->addr[pd2.didx], pd2.af) || nk->port[pd2.didx] != th.th_dport) pf_change_icmp(pd2.dst, &th.th_dport, saddr, &nk->addr[pd2.didx], nk->port[pd2.didx], NULL, pd2.ip_sum, icmpsum, pd->ip_sum, 0, pd2.af); copyback = 1; } if (copyback) { switch (pd2.af) { #ifdef INET case AF_INET: m_copyback(m, off, ICMP_MINLEN, (caddr_t )pd->hdr.icmp); m_copyback(m, ipoff2, sizeof(h2), (caddr_t )&h2); break; #endif /* INET */ #ifdef INET6 case AF_INET6: m_copyback(m, off, sizeof(struct icmp6_hdr), (caddr_t )pd->hdr.icmp6); m_copyback(m, ipoff2, sizeof(h2_6), (caddr_t )&h2_6); break; #endif /* INET6 */ } m_copyback(m, off2, 8, (caddr_t)&th); } return (PF_PASS); break; } case IPPROTO_UDP: { struct udphdr uh; if (!pf_pull_hdr(m, off2, &uh, sizeof(uh), NULL, reason, pd2.af)) { DPFPRINTF(PF_DEBUG_MISC, ("pf: ICMP error message too short " "(udp)\n")); return (PF_DROP); } key.af = pd2.af; key.proto = IPPROTO_UDP; PF_ACPY(&key.addr[pd2.sidx], pd2.src, key.af); PF_ACPY(&key.addr[pd2.didx], pd2.dst, key.af); key.port[pd2.sidx] = uh.uh_sport; key.port[pd2.didx] = uh.uh_dport; STATE_LOOKUP(kif, &key, direction, *state, pd); /* translate source/destination address, if necessary */ if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) { struct pf_state_key *nk = (*state)->key[pd->didx]; if (PF_ANEQ(pd2.src, &nk->addr[pd2.sidx], pd2.af) || nk->port[pd2.sidx] != uh.uh_sport) pf_change_icmp(pd2.src, &uh.uh_sport, daddr, &nk->addr[pd2.sidx], nk->port[pd2.sidx], &uh.uh_sum, pd2.ip_sum, icmpsum, pd->ip_sum, 1, pd2.af); if (PF_ANEQ(pd2.dst, &nk->addr[pd2.didx], pd2.af) || nk->port[pd2.didx] != uh.uh_dport) 
pf_change_icmp(pd2.dst, &uh.uh_dport, saddr, &nk->addr[pd2.didx], nk->port[pd2.didx], &uh.uh_sum, pd2.ip_sum, icmpsum, pd->ip_sum, 1, pd2.af); switch (pd2.af) { #ifdef INET case AF_INET: m_copyback(m, off, ICMP_MINLEN, (caddr_t )pd->hdr.icmp); m_copyback(m, ipoff2, sizeof(h2), (caddr_t)&h2); break; #endif /* INET */ #ifdef INET6 case AF_INET6: m_copyback(m, off, sizeof(struct icmp6_hdr), (caddr_t )pd->hdr.icmp6); m_copyback(m, ipoff2, sizeof(h2_6), (caddr_t )&h2_6); break; #endif /* INET6 */ } m_copyback(m, off2, sizeof(uh), (caddr_t)&uh); } return (PF_PASS); break; } #ifdef INET case IPPROTO_ICMP: { struct icmp iih; if (!pf_pull_hdr(m, off2, &iih, ICMP_MINLEN, NULL, reason, pd2.af)) { DPFPRINTF(PF_DEBUG_MISC, ("pf: ICMP error message too short " "(icmp)\n")); return (PF_DROP); } key.af = pd2.af; key.proto = IPPROTO_ICMP; PF_ACPY(&key.addr[pd2.sidx], pd2.src, key.af); PF_ACPY(&key.addr[pd2.didx], pd2.dst, key.af); key.port[0] = key.port[1] = iih.icmp_id; STATE_LOOKUP(kif, &key, direction, *state, pd); /* translate source/destination address, if necessary */ if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) { struct pf_state_key *nk = (*state)->key[pd->didx]; if (PF_ANEQ(pd2.src, &nk->addr[pd2.sidx], pd2.af) || nk->port[pd2.sidx] != iih.icmp_id) pf_change_icmp(pd2.src, &iih.icmp_id, daddr, &nk->addr[pd2.sidx], nk->port[pd2.sidx], NULL, pd2.ip_sum, icmpsum, pd->ip_sum, 0, AF_INET); if (PF_ANEQ(pd2.dst, &nk->addr[pd2.didx], pd2.af) || nk->port[pd2.didx] != iih.icmp_id) pf_change_icmp(pd2.dst, &iih.icmp_id, saddr, &nk->addr[pd2.didx], nk->port[pd2.didx], NULL, pd2.ip_sum, icmpsum, pd->ip_sum, 0, AF_INET); m_copyback(m, off, ICMP_MINLEN, (caddr_t)pd->hdr.icmp); m_copyback(m, ipoff2, sizeof(h2), (caddr_t)&h2); m_copyback(m, off2, ICMP_MINLEN, (caddr_t)&iih); } return (PF_PASS); break; } #endif /* INET */ #ifdef INET6 case IPPROTO_ICMPV6: { struct icmp6_hdr iih; if (!pf_pull_hdr(m, off2, &iih, sizeof(struct icmp6_hdr), NULL, reason, pd2.af)) { DPFPRINTF(PF_DEBUG_MISC, ("pf: ICMP error message too short " "(icmp6)\n")); return (PF_DROP); } key.af = pd2.af; key.proto = IPPROTO_ICMPV6; PF_ACPY(&key.addr[pd2.sidx], pd2.src, key.af); PF_ACPY(&key.addr[pd2.didx], pd2.dst, key.af); key.port[0] = key.port[1] = iih.icmp6_id; STATE_LOOKUP(kif, &key, direction, *state, pd); /* translate source/destination address, if necessary */ if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) { struct pf_state_key *nk = (*state)->key[pd->didx]; if (PF_ANEQ(pd2.src, &nk->addr[pd2.sidx], pd2.af) || nk->port[pd2.sidx] != iih.icmp6_id) pf_change_icmp(pd2.src, &iih.icmp6_id, daddr, &nk->addr[pd2.sidx], nk->port[pd2.sidx], NULL, pd2.ip_sum, icmpsum, pd->ip_sum, 0, AF_INET6); if (PF_ANEQ(pd2.dst, &nk->addr[pd2.didx], pd2.af) || nk->port[pd2.didx] != iih.icmp6_id) pf_change_icmp(pd2.dst, &iih.icmp6_id, saddr, &nk->addr[pd2.didx], nk->port[pd2.didx], NULL, pd2.ip_sum, icmpsum, pd->ip_sum, 0, AF_INET6); m_copyback(m, off, sizeof(struct icmp6_hdr), (caddr_t)pd->hdr.icmp6); m_copyback(m, ipoff2, sizeof(h2_6), (caddr_t)&h2_6); m_copyback(m, off2, sizeof(struct icmp6_hdr), (caddr_t)&iih); } return (PF_PASS); break; } #endif /* INET6 */ default: { key.af = pd2.af; key.proto = pd2.proto; PF_ACPY(&key.addr[pd2.sidx], pd2.src, key.af); PF_ACPY(&key.addr[pd2.didx], pd2.dst, key.af); key.port[0] = key.port[1] = 0; STATE_LOOKUP(kif, &key, direction, *state, pd); /* translate source/destination address, if necessary */ if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) { struct pf_state_key *nk =
(*state)->key[pd->didx]; if (PF_ANEQ(pd2.src, &nk->addr[pd2.sidx], pd2.af)) pf_change_icmp(pd2.src, NULL, daddr, &nk->addr[pd2.sidx], 0, NULL, pd2.ip_sum, icmpsum, pd->ip_sum, 0, pd2.af); if (PF_ANEQ(pd2.dst, &nk->addr[pd2.didx], pd2.af)) pf_change_icmp(pd2.dst, NULL, saddr, &nk->addr[pd2.didx], 0, NULL, pd2.ip_sum, icmpsum, pd->ip_sum, 0, pd2.af); switch (pd2.af) { #ifdef INET case AF_INET: m_copyback(m, off, ICMP_MINLEN, (caddr_t)pd->hdr.icmp); m_copyback(m, ipoff2, sizeof(h2), (caddr_t)&h2); break; #endif /* INET */ #ifdef INET6 case AF_INET6: m_copyback(m, off, sizeof(struct icmp6_hdr), (caddr_t )pd->hdr.icmp6); m_copyback(m, ipoff2, sizeof(h2_6), (caddr_t )&h2_6); break; #endif /* INET6 */ } } return (PF_PASS); break; } } } } static int pf_test_state_other(struct pf_state **state, int direction, struct pfi_kif *kif, struct mbuf *m, struct pf_pdesc *pd) { struct pf_state_peer *src, *dst; struct pf_state_key_cmp key; bzero(&key, sizeof(key)); key.af = pd->af; key.proto = pd->proto; if (direction == PF_IN) { PF_ACPY(&key.addr[0], pd->src, key.af); PF_ACPY(&key.addr[1], pd->dst, key.af); key.port[0] = key.port[1] = 0; } else { PF_ACPY(&key.addr[1], pd->src, key.af); PF_ACPY(&key.addr[0], pd->dst, key.af); key.port[1] = key.port[0] = 0; } STATE_LOOKUP(kif, &key, direction, *state, pd); if (direction == (*state)->direction) { src = &(*state)->src; dst = &(*state)->dst; } else { src = &(*state)->dst; dst = &(*state)->src; } /* update states */ if (src->state < PFOTHERS_SINGLE) src->state = PFOTHERS_SINGLE; if (dst->state == PFOTHERS_SINGLE) dst->state = PFOTHERS_MULTIPLE; /* update expire time */ (*state)->expire = time_uptime; if (src->state == PFOTHERS_MULTIPLE && dst->state == PFOTHERS_MULTIPLE) (*state)->timeout = PFTM_OTHER_MULTIPLE; else (*state)->timeout = PFTM_OTHER_SINGLE; /* translate source/destination address, if necessary */ if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) { struct pf_state_key *nk = (*state)->key[pd->didx]; KASSERT(nk, ("%s: nk is null", __func__)); KASSERT(pd, ("%s: pd is null", __func__)); KASSERT(pd->src, ("%s: pd->src is null", __func__)); KASSERT(pd->dst, ("%s: pd->dst is null", __func__)); switch (pd->af) { #ifdef INET case AF_INET: if (PF_ANEQ(pd->src, &nk->addr[pd->sidx], AF_INET)) pf_change_a(&pd->src->v4.s_addr, pd->ip_sum, nk->addr[pd->sidx].v4.s_addr, 0); if (PF_ANEQ(pd->dst, &nk->addr[pd->didx], AF_INET)) pf_change_a(&pd->dst->v4.s_addr, pd->ip_sum, nk->addr[pd->didx].v4.s_addr, 0); break; #endif /* INET */ #ifdef INET6 case AF_INET6: if (PF_ANEQ(pd->src, &nk->addr[pd->sidx], AF_INET)) PF_ACPY(pd->src, &nk->addr[pd->sidx], pd->af); if (PF_ANEQ(pd->dst, &nk->addr[pd->didx], AF_INET)) PF_ACPY(pd->dst, &nk->addr[pd->didx], pd->af); #endif /* INET6 */ } } return (PF_PASS); } /* * ipoff and off are measured from the start of the mbuf chain. * h must be at "ipoff" on the mbuf chain. 
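* Returns p with the requested header copied into it, or NULL on failure, in which case *actionp and *reasonp (when non-NULL) are updated accordingly.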
*/ void * pf_pull_hdr(struct mbuf *m, int off, void *p, int len, u_short *actionp, u_short *reasonp, sa_family_t af) { switch (af) { #ifdef INET case AF_INET: { struct ip *h = mtod(m, struct ip *); u_int16_t fragoff = (ntohs(h->ip_off) & IP_OFFMASK) << 3; if (fragoff) { if (fragoff >= len) ACTION_SET(actionp, PF_PASS); else { ACTION_SET(actionp, PF_DROP); REASON_SET(reasonp, PFRES_FRAG); } return (NULL); } if (m->m_pkthdr.len < off + len || ntohs(h->ip_len) < off + len) { ACTION_SET(actionp, PF_DROP); REASON_SET(reasonp, PFRES_SHORT); return (NULL); } break; } #endif /* INET */ #ifdef INET6 case AF_INET6: { struct ip6_hdr *h = mtod(m, struct ip6_hdr *); if (m->m_pkthdr.len < off + len || (ntohs(h->ip6_plen) + sizeof(struct ip6_hdr)) < (unsigned)(off + len)) { ACTION_SET(actionp, PF_DROP); REASON_SET(reasonp, PFRES_SHORT); return (NULL); } break; } #endif /* INET6 */ } m_copydata(m, off, len, p); return (p); } #ifdef RADIX_MPATH static int pf_routable_oldmpath(struct pf_addr *addr, sa_family_t af, struct pfi_kif *kif, int rtableid) { struct radix_node_head *rnh; struct sockaddr_in *dst; int ret = 1; int check_mpath; #ifdef INET6 struct sockaddr_in6 *dst6; struct route_in6 ro; #else struct route ro; #endif struct radix_node *rn; struct rtentry *rt; struct ifnet *ifp; check_mpath = 0; /* XXX: stick to table 0 for now */ rnh = rt_tables_get_rnh(0, af); if (rnh != NULL && rn_mpath_capable(rnh)) check_mpath = 1; bzero(&ro, sizeof(ro)); switch (af) { case AF_INET: dst = satosin(&ro.ro_dst); dst->sin_family = AF_INET; dst->sin_len = sizeof(*dst); dst->sin_addr = addr->v4; break; #ifdef INET6 case AF_INET6: /* * Skip check for addresses with embedded interface scope, * as they would always match anyway. */ if (IN6_IS_SCOPE_EMBED(&addr->v6)) goto out; dst6 = (struct sockaddr_in6 *)&ro.ro_dst; dst6->sin6_family = AF_INET6; dst6->sin6_len = sizeof(*dst6); dst6->sin6_addr = addr->v6; break; #endif /* INET6 */ default: return (0); } /* Skip checks for ipsec interfaces */ if (kif != NULL && kif->pfik_ifp->if_type == IFT_ENC) goto out; switch (af) { #ifdef INET6 case AF_INET6: in6_rtalloc_ign(&ro, 0, rtableid); break; #endif #ifdef INET case AF_INET: in_rtalloc_ign((struct route *)&ro, 0, rtableid); break; #endif } if (ro.ro_rt != NULL) { /* No interface given, this is a no-route check */ if (kif == NULL) goto out; if (kif->pfik_ifp == NULL) { ret = 0; goto out; } /* Perform uRPF check if passed input interface */ ret = 0; rn = (struct radix_node *)ro.ro_rt; do { rt = (struct rtentry *)rn; ifp = rt->rt_ifp; if (kif->pfik_ifp == ifp) ret = 1; rn = rn_mpath_next(rn); } while (check_mpath == 1 && rn != NULL && ret == 0); } else ret = 0; out: if (ro.ro_rt != NULL) RTFREE(ro.ro_rt); return (ret); } #endif int pf_routable(struct pf_addr *addr, sa_family_t af, struct pfi_kif *kif, int rtableid) { #ifdef INET struct nhop4_basic nh4; #endif #ifdef INET6 struct nhop6_basic nh6; #endif struct ifnet *ifp; #ifdef RADIX_MPATH struct radix_node_head *rnh; /* XXX: stick to table 0 for now */ rnh = rt_tables_get_rnh(0, af); if (rnh != NULL && rn_mpath_capable(rnh)) return (pf_routable_oldmpath(addr, af, kif, rtableid)); #endif /* * Skip check for addresses with embedded interface scope, * as they would always match anyway. 
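* (KAME-derived stacks embed the interface index in the second 16-bit word of such scoped addresses.)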
*/ if (af == AF_INET6 && IN6_IS_SCOPE_EMBED(&addr->v6)) return (1); if (af != AF_INET && af != AF_INET6) return (0); /* Skip checks for ipsec interfaces */ if (kif != NULL && kif->pfik_ifp->if_type == IFT_ENC) return (1); ifp = NULL; switch (af) { #ifdef INET6 case AF_INET6: if (fib6_lookup_nh_basic(rtableid, &addr->v6, 0, 0, 0, &nh6)!=0) return (0); ifp = nh6.nh_ifp; break; #endif #ifdef INET case AF_INET: if (fib4_lookup_nh_basic(rtableid, addr->v4, 0, 0, &nh4) != 0) return (0); ifp = nh4.nh_ifp; break; #endif } /* No interface given, this is a no-route check */ if (kif == NULL) return (1); if (kif->pfik_ifp == NULL) return (0); /* Perform uRPF check if passed input interface */ if (kif->pfik_ifp == ifp) return (1); return (0); } #ifdef INET static void pf_route(struct mbuf **m, struct pf_rule *r, int dir, struct ifnet *oifp, struct pf_state *s, struct pf_pdesc *pd) { struct mbuf *m0, *m1; struct sockaddr_in dst; struct ip *ip; struct ifnet *ifp = NULL; struct pf_addr naddr; struct pf_src_node *sn = NULL; int error = 0; uint16_t ip_len, ip_off; KASSERT(m && *m && r && oifp, ("%s: invalid parameters", __func__)); KASSERT(dir == PF_IN || dir == PF_OUT, ("%s: invalid direction", __func__)); if ((pd->pf_mtag == NULL && ((pd->pf_mtag = pf_get_mtag(*m)) == NULL)) || pd->pf_mtag->routed++ > 3) { m0 = *m; *m = NULL; goto bad_locked; } if (r->rt == PF_DUPTO) { if ((m0 = m_dup(*m, M_NOWAIT)) == NULL) { if (s) PF_STATE_UNLOCK(s); return; } } else { if ((r->rt == PF_REPLYTO) == (r->direction == dir)) { if (s) PF_STATE_UNLOCK(s); return; } m0 = *m; } ip = mtod(m0, struct ip *); bzero(&dst, sizeof(dst)); dst.sin_family = AF_INET; dst.sin_len = sizeof(dst); dst.sin_addr = ip->ip_dst; if (TAILQ_EMPTY(&r->rpool.list)) { DPFPRINTF(PF_DEBUG_URGENT, ("%s: TAILQ_EMPTY(&r->rpool.list)\n", __func__)); goto bad_locked; } if (s == NULL) { pf_map_addr(AF_INET, r, (struct pf_addr *)&ip->ip_src, &naddr, NULL, &sn); if (!PF_AZERO(&naddr, AF_INET)) dst.sin_addr.s_addr = naddr.v4.s_addr; ifp = r->rpool.cur->kif ? r->rpool.cur->kif->pfik_ifp : NULL; } else { if (!PF_AZERO(&s->rt_addr, AF_INET)) dst.sin_addr.s_addr = s->rt_addr.v4.s_addr; ifp = s->rt_kif ? s->rt_kif->pfik_ifp : NULL; PF_STATE_UNLOCK(s); } if (ifp == NULL) goto bad; if (oifp != ifp) { if (pf_test(PF_OUT, 0, ifp, &m0, NULL) != PF_PASS) goto bad; else if (m0 == NULL) goto done; if (m0->m_len < sizeof(struct ip)) { DPFPRINTF(PF_DEBUG_URGENT, ("%s: m0->m_len < sizeof(struct ip)\n", __func__)); goto bad; } ip = mtod(m0, struct ip *); } if (ifp->if_flags & IFF_LOOPBACK) m0->m_flags |= M_SKIP_FIREWALL; ip_len = ntohs(ip->ip_len); ip_off = ntohs(ip->ip_off); /* Copied from FreeBSD 10.0-CURRENT ip_output. */ m0->m_pkthdr.csum_flags |= CSUM_IP; if (m0->m_pkthdr.csum_flags & CSUM_DELAY_DATA & ~ifp->if_hwassist) { in_delayed_cksum(m0); m0->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA; } #ifdef SCTP if (m0->m_pkthdr.csum_flags & CSUM_SCTP & ~ifp->if_hwassist) { sctp_delayed_cksum(m, (uint32_t)(ip->ip_hl << 2)); m0->m_pkthdr.csum_flags &= ~CSUM_SCTP; } #endif /* * If small enough for interface, or the interface will take * care of the fragmentation for us, we can just send directly. */ if (ip_len <= ifp->if_mtu || (m0->m_pkthdr.csum_flags & ifp->if_hwassist & CSUM_TSO) != 0) { ip->ip_sum = 0; if (m0->m_pkthdr.csum_flags & CSUM_IP & ~ifp->if_hwassist) { ip->ip_sum = in_cksum(m0, ip->ip_hl << 2); m0->m_pkthdr.csum_flags &= ~CSUM_IP; } m_clrprotoflags(m0); /* Avoid confusing lower layers. 
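* Protocol-layer mbuf flags are only meaningful on the inbound path and must not leak into if_output().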
*/ error = (*ifp->if_output)(ifp, m0, sintosa(&dst), NULL); goto done; } /* Balk when DF bit is set or the interface didn't support TSO. */ if ((ip_off & IP_DF) || (m0->m_pkthdr.csum_flags & CSUM_TSO)) { error = EMSGSIZE; KMOD_IPSTAT_INC(ips_cantfrag); if (r->rt != PF_DUPTO) { icmp_error(m0, ICMP_UNREACH, ICMP_UNREACH_NEEDFRAG, 0, ifp->if_mtu); goto done; } else goto bad; } error = ip_fragment(ip, &m0, ifp->if_mtu, ifp->if_hwassist); if (error) goto bad; for (; m0; m0 = m1) { m1 = m0->m_nextpkt; m0->m_nextpkt = NULL; if (error == 0) { m_clrprotoflags(m0); error = (*ifp->if_output)(ifp, m0, sintosa(&dst), NULL); } else m_freem(m0); } if (error == 0) KMOD_IPSTAT_INC(ips_fragmented); done: if (r->rt != PF_DUPTO) *m = NULL; return; bad_locked: if (s) PF_STATE_UNLOCK(s); bad: m_freem(m0); goto done; } #endif /* INET */ #ifdef INET6 static void pf_route6(struct mbuf **m, struct pf_rule *r, int dir, struct ifnet *oifp, struct pf_state *s, struct pf_pdesc *pd) { struct mbuf *m0; struct sockaddr_in6 dst; struct ip6_hdr *ip6; struct ifnet *ifp = NULL; struct pf_addr naddr; struct pf_src_node *sn = NULL; KASSERT(m && *m && r && oifp, ("%s: invalid parameters", __func__)); KASSERT(dir == PF_IN || dir == PF_OUT, ("%s: invalid direction", __func__)); if ((pd->pf_mtag == NULL && ((pd->pf_mtag = pf_get_mtag(*m)) == NULL)) || pd->pf_mtag->routed++ > 3) { m0 = *m; *m = NULL; goto bad_locked; } if (r->rt == PF_DUPTO) { if ((m0 = m_dup(*m, M_NOWAIT)) == NULL) { if (s) PF_STATE_UNLOCK(s); return; } } else { if ((r->rt == PF_REPLYTO) == (r->direction == dir)) { if (s) PF_STATE_UNLOCK(s); return; } m0 = *m; } ip6 = mtod(m0, struct ip6_hdr *); bzero(&dst, sizeof(dst)); dst.sin6_family = AF_INET6; dst.sin6_len = sizeof(dst); dst.sin6_addr = ip6->ip6_dst; if (TAILQ_EMPTY(&r->rpool.list)) { DPFPRINTF(PF_DEBUG_URGENT, ("%s: TAILQ_EMPTY(&r->rpool.list)\n", __func__)); goto bad_locked; } if (s == NULL) { pf_map_addr(AF_INET6, r, (struct pf_addr *)&ip6->ip6_src, &naddr, NULL, &sn); if (!PF_AZERO(&naddr, AF_INET6)) PF_ACPY((struct pf_addr *)&dst.sin6_addr, &naddr, AF_INET6); ifp = r->rpool.cur->kif ? r->rpool.cur->kif->pfik_ifp : NULL; } else { if (!PF_AZERO(&s->rt_addr, AF_INET6)) PF_ACPY((struct pf_addr *)&dst.sin6_addr, &s->rt_addr, AF_INET6); ifp = s->rt_kif ? s->rt_kif->pfik_ifp : NULL; } if (s) PF_STATE_UNLOCK(s); if (ifp == NULL) goto bad; if (oifp != ifp) { if (pf_test6(PF_OUT, PFIL_FWD, ifp, &m0, NULL) != PF_PASS) goto bad; else if (m0 == NULL) goto done; if (m0->m_len < sizeof(struct ip6_hdr)) { DPFPRINTF(PF_DEBUG_URGENT, ("%s: m0->m_len < sizeof(struct ip6_hdr)\n", __func__)); goto bad; } ip6 = mtod(m0, struct ip6_hdr *); } if (ifp->if_flags & IFF_LOOPBACK) m0->m_flags |= M_SKIP_FIREWALL; if (m0->m_pkthdr.csum_flags & CSUM_DELAY_DATA_IPV6 & ~ifp->if_hwassist) { uint32_t plen = m0->m_pkthdr.len - sizeof(*ip6); in6_delayed_cksum(m0, plen, sizeof(struct ip6_hdr)); m0->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA_IPV6; } /* * If the packet is too large for the outgoing interface, * send back an icmp6 error. */ if (IN6_IS_SCOPE_EMBED(&dst.sin6_addr)) dst.sin6_addr.s6_addr16[1] = htons(ifp->if_index); if ((u_long)m0->m_pkthdr.len <= ifp->if_mtu) nd6_output_ifp(ifp, ifp, m0, &dst, NULL); else { in6_ifstat_inc(ifp, ifs6_in_toobig); if (r->rt != PF_DUPTO) icmp6_error(m0, ICMP6_PACKET_TOO_BIG, 0, ifp->if_mtu); else goto bad; } done: if (r->rt != PF_DUPTO) *m = NULL; return; bad_locked: if (s) PF_STATE_UNLOCK(s); bad: m_freem(m0); goto done; } #endif /* INET6 */ /* * FreeBSD supports cksum offloads for the following drivers. 
* em(4), fxp(4), ixgb(4), lge(4), ndis(4), nge(4), re(4), * ti(4), txp(4), xl(4) * * CSUM_DATA_VALID | CSUM_PSEUDO_HDR : * network driver performed the cksum including the pseudo header; we need * to verify csum_data * CSUM_DATA_VALID : * network driver performed the cksum; additional pseudo header cksum * computation with the partial csum_data is needed (i.e., lack of H/W * support for the pseudo header, for instance hme(4), sk(4) and possibly * gem(4)) * * After validating the cksum of the packet, set both the CSUM_DATA_VALID * and CSUM_PSEUDO_HDR flags in order to avoid recomputation of the cksum * in the upper TCP/UDP layer. * Also, set csum_data to 0xffff to force cksum validation. */ static int pf_check_proto_cksum(struct mbuf *m, int off, int len, u_int8_t p, sa_family_t af) { u_int16_t sum = 0; int hw_assist = 0; struct ip *ip; if (off < sizeof(struct ip) || len < sizeof(struct udphdr)) return (1); if (m->m_pkthdr.len < off + len) return (1); switch (p) { case IPPROTO_TCP: if (m->m_pkthdr.csum_flags & CSUM_DATA_VALID) { if (m->m_pkthdr.csum_flags & CSUM_PSEUDO_HDR) { sum = m->m_pkthdr.csum_data; } else { ip = mtod(m, struct ip *); sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htonl((u_short)len + m->m_pkthdr.csum_data + IPPROTO_TCP)); } sum ^= 0xffff; ++hw_assist; } break; case IPPROTO_UDP: if (m->m_pkthdr.csum_flags & CSUM_DATA_VALID) { if (m->m_pkthdr.csum_flags & CSUM_PSEUDO_HDR) { sum = m->m_pkthdr.csum_data; } else { ip = mtod(m, struct ip *); sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htonl((u_short)len + m->m_pkthdr.csum_data + IPPROTO_UDP)); } sum ^= 0xffff; ++hw_assist; } break; case IPPROTO_ICMP: #ifdef INET6 case IPPROTO_ICMPV6: #endif /* INET6 */ break; default: return (1); } if (!hw_assist) { switch (af) { case AF_INET: if (p == IPPROTO_ICMP) { if (m->m_len < off) return (1); m->m_data += off; m->m_len -= off; sum = in_cksum(m, len); m->m_data -= off; m->m_len += off; } else { if (m->m_len < sizeof(struct ip)) return (1); sum = in4_cksum(m, p, off, len); } break; #ifdef INET6 case AF_INET6: if (m->m_len < sizeof(struct ip6_hdr)) return (1); sum = in6_cksum(m, p, off, len); break; #endif /* INET6 */ default: return (1); } } if (sum) { switch (p) { case IPPROTO_TCP: { KMOD_TCPSTAT_INC(tcps_rcvbadsum); break; } case IPPROTO_UDP: { KMOD_UDPSTAT_INC(udps_badsum); break; } #ifdef INET case IPPROTO_ICMP: { KMOD_ICMPSTAT_INC(icps_checksum); break; } #endif #ifdef INET6 case IPPROTO_ICMPV6: { KMOD_ICMP6STAT_INC(icp6s_checksum); break; } #endif /* INET6 */ } return (1); } else { if (p == IPPROTO_TCP || p == IPPROTO_UDP) { m->m_pkthdr.csum_flags |= (CSUM_DATA_VALID | CSUM_PSEUDO_HDR); m->m_pkthdr.csum_data = 0xffff; } } return (0); } #ifdef INET int pf_test(int dir, int pflags, struct ifnet *ifp, struct mbuf **m0, struct inpcb *inp) { struct pfi_kif *kif; u_short action, reason = 0, log = 0; struct mbuf *m = *m0; struct ip *h = NULL; struct m_tag *ipfwtag; struct pf_rule *a = NULL, *r = &V_pf_default_rule, *tr, *nr; struct pf_state *s = NULL; struct pf_ruleset *ruleset = NULL; struct pf_pdesc pd; int off, dirndx, pqid = 0; M_ASSERTPKTHDR(m); if (!V_pf_status.running) return (PF_PASS); memset(&pd, 0, sizeof(pd)); kif = (struct pfi_kif *)ifp->if_pf_kif; if (kif == NULL) { DPFPRINTF(PF_DEBUG_URGENT, ("pf_test: kif == NULL, if_xname %s\n", ifp->if_xname)); return (PF_DROP); } if (kif->pfik_flags & PFI_IFLAG_SKIP) return (PF_PASS); if (m->m_flags & M_SKIP_FIREWALL) return (PF_PASS); pd.pf_mtag = pf_find_mtag(m); PF_RULES_RLOCK(); if (ip_divert_ptr != NULL && ((ipfwtag = m_tag_locate(m, MTAG_IPFW_RULE, 0,
NULL)) != NULL)) { struct ipfw_rule_ref *rr = (struct ipfw_rule_ref *)(ipfwtag+1); if (rr->info & IPFW_IS_DIVERT && rr->rulenum == 0) { if (pd.pf_mtag == NULL && ((pd.pf_mtag = pf_get_mtag(m)) == NULL)) { action = PF_DROP; goto done; } pd.pf_mtag->flags |= PF_PACKET_LOOPED; m_tag_delete(m, ipfwtag); } if (pd.pf_mtag && pd.pf_mtag->flags & PF_FASTFWD_OURS_PRESENT) { m->m_flags |= M_FASTFWD_OURS; pd.pf_mtag->flags &= ~PF_FASTFWD_OURS_PRESENT; } } else if (pf_normalize_ip(m0, dir, kif, &reason, &pd) != PF_PASS) { /* We do IP header normalization and packet reassembly here */ action = PF_DROP; goto done; } m = *m0; /* pf_normalize messes with m0 */ h = mtod(m, struct ip *); off = h->ip_hl << 2; if (off < (int)sizeof(struct ip)) { action = PF_DROP; REASON_SET(&reason, PFRES_SHORT); log = 1; goto done; } pd.src = (struct pf_addr *)&h->ip_src; pd.dst = (struct pf_addr *)&h->ip_dst; pd.sport = pd.dport = NULL; pd.ip_sum = &h->ip_sum; pd.proto_sum = NULL; pd.proto = h->ip_p; pd.dir = dir; pd.sidx = (dir == PF_IN) ? 0 : 1; pd.didx = (dir == PF_IN) ? 1 : 0; pd.af = AF_INET; pd.tos = h->ip_tos & ~IPTOS_ECN_MASK; pd.tot_len = ntohs(h->ip_len); /* handle fragments that didn't get reassembled by normalization */ if (h->ip_off & htons(IP_MF | IP_OFFMASK)) { action = pf_test_fragment(&r, dir, kif, m, h, &pd, &a, &ruleset); goto done; } switch (h->ip_p) { case IPPROTO_TCP: { struct tcphdr th; pd.hdr.tcp = &th; if (!pf_pull_hdr(m, off, &th, sizeof(th), &action, &reason, AF_INET)) { log = action != PF_PASS; goto done; } pd.p_len = pd.tot_len - off - (th.th_off << 2); if ((th.th_flags & TH_ACK) && pd.p_len == 0) pqid = 1; action = pf_normalize_tcp(dir, kif, m, 0, off, h, &pd); if (action == PF_DROP) goto done; action = pf_test_state_tcp(&s, dir, kif, m, off, h, &pd, &reason); if (action == PF_PASS) { if (pfsync_update_state_ptr != NULL) pfsync_update_state_ptr(s); r = s->rule.ptr; a = s->anchor.ptr; log = s->log; } else if (s == NULL) action = pf_test_rule(&r, &s, dir, kif, m, off, &pd, &a, &ruleset, inp); break; } case IPPROTO_UDP: { struct udphdr uh; pd.hdr.udp = &uh; if (!pf_pull_hdr(m, off, &uh, sizeof(uh), &action, &reason, AF_INET)) { log = action != PF_PASS; goto done; } if (uh.uh_dport == 0 || ntohs(uh.uh_ulen) > m->m_pkthdr.len - off || ntohs(uh.uh_ulen) < sizeof(struct udphdr)) { action = PF_DROP; REASON_SET(&reason, PFRES_SHORT); goto done; } action = pf_test_state_udp(&s, dir, kif, m, off, h, &pd); if (action == PF_PASS) { if (pfsync_update_state_ptr != NULL) pfsync_update_state_ptr(s); r = s->rule.ptr; a = s->anchor.ptr; log = s->log; } else if (s == NULL) action = pf_test_rule(&r, &s, dir, kif, m, off, &pd, &a, &ruleset, inp); break; } case IPPROTO_ICMP: { struct icmp ih; pd.hdr.icmp = &ih; if (!pf_pull_hdr(m, off, &ih, ICMP_MINLEN, &action, &reason, AF_INET)) { log = action != PF_PASS; goto done; } action = pf_test_state_icmp(&s, dir, kif, m, off, h, &pd, &reason); if (action == PF_PASS) { if (pfsync_update_state_ptr != NULL) pfsync_update_state_ptr(s); r = s->rule.ptr; a = s->anchor.ptr; log = s->log; } else if (s == NULL) action = pf_test_rule(&r, &s, dir, kif, m, off, &pd, &a, &ruleset, inp); break; } #ifdef INET6 case IPPROTO_ICMPV6: { action = PF_DROP; DPFPRINTF(PF_DEBUG_MISC, ("pf: dropping IPv4 packet with ICMPv6 payload\n")); goto done; } #endif default: action = pf_test_state_other(&s, dir, kif, m, &pd); if (action == PF_PASS) { if (pfsync_update_state_ptr != NULL) pfsync_update_state_ptr(s); r = s->rule.ptr; a = s->anchor.ptr; log = s->log; } else if (s == NULL) action = 
pf_test_rule(&r, &s, dir, kif, m, off, &pd, &a, &ruleset, inp); break; } done: PF_RULES_RUNLOCK(); if (action == PF_PASS && h->ip_hl > 5 && !((s && s->state_flags & PFSTATE_ALLOWOPTS) || r->allow_opts)) { action = PF_DROP; REASON_SET(&reason, PFRES_IPOPTIONS); log = r->log; DPFPRINTF(PF_DEBUG_MISC, ("pf: dropping packet with ip options\n")); } if (s && s->tag > 0 && pf_tag_packet(m, &pd, s->tag)) { action = PF_DROP; REASON_SET(&reason, PFRES_MEMORY); } if (r->rtableid >= 0) M_SETFIB(m, r->rtableid); if (r->scrub_flags & PFSTATE_SETPRIO) { if (pd.tos & IPTOS_LOWDELAY) pqid = 1; if (pf_ieee8021q_setpcp(m, r->set_prio[pqid])) { action = PF_DROP; REASON_SET(&reason, PFRES_MEMORY); log = 1; DPFPRINTF(PF_DEBUG_MISC, ("pf: failed to allocate 802.1q mtag\n")); } } #ifdef ALTQ if (action == PF_PASS && r->qid) { if (pd.pf_mtag == NULL && ((pd.pf_mtag = pf_get_mtag(m)) == NULL)) { action = PF_DROP; REASON_SET(&reason, PFRES_MEMORY); } else { if (s != NULL) pd.pf_mtag->qid_hash = pf_state_hash(s); if (pqid || (pd.tos & IPTOS_LOWDELAY)) pd.pf_mtag->qid = r->pqid; else pd.pf_mtag->qid = r->qid; /* Add hints for ecn. */ pd.pf_mtag->hdr = h; } } #endif /* ALTQ */ /* * connections redirected to loopback should not match sockets * bound specifically to loopback due to security implications, * see tcp_input() and in_pcblookup_listen(). */ if (dir == PF_IN && action == PF_PASS && (pd.proto == IPPROTO_TCP || pd.proto == IPPROTO_UDP) && s != NULL && s->nat_rule.ptr != NULL && (s->nat_rule.ptr->action == PF_RDR || s->nat_rule.ptr->action == PF_BINAT) && (ntohl(pd.dst->v4.s_addr) >> IN_CLASSA_NSHIFT) == IN_LOOPBACKNET) m->m_flags |= M_SKIP_FIREWALL; if (action == PF_PASS && r->divert.port && ip_divert_ptr != NULL && !PACKET_LOOPED(&pd)) { ipfwtag = m_tag_alloc(MTAG_IPFW_RULE, 0, sizeof(struct ipfw_rule_ref), M_NOWAIT | M_ZERO); if (ipfwtag != NULL) { ((struct ipfw_rule_ref *)(ipfwtag+1))->info = ntohs(r->divert.port); ((struct ipfw_rule_ref *)(ipfwtag+1))->rulenum = dir; if (s) PF_STATE_UNLOCK(s); m_tag_prepend(m, ipfwtag); if (m->m_flags & M_FASTFWD_OURS) { if (pd.pf_mtag == NULL && ((pd.pf_mtag = pf_get_mtag(m)) == NULL)) { action = PF_DROP; REASON_SET(&reason, PFRES_MEMORY); log = 1; DPFPRINTF(PF_DEBUG_MISC, ("pf: failed to allocate tag\n")); } else { pd.pf_mtag->flags |= PF_FASTFWD_OURS_PRESENT; m->m_flags &= ~M_FASTFWD_OURS; } } ip_divert_ptr(*m0, dir == PF_IN ? DIR_IN : DIR_OUT); *m0 = NULL; return (action); } else { /* XXX: ipfw has the same behaviour! */ action = PF_DROP; REASON_SET(&reason, PFRES_MEMORY); log = 1; DPFPRINTF(PF_DEBUG_MISC, ("pf: failed to allocate divert tag\n")); } } if (log) { struct pf_rule *lr; if (s != NULL && s->nat_rule.ptr != NULL && s->nat_rule.ptr->log & PF_LOG_ALL) lr = s->nat_rule.ptr; else lr = r; PFLOG_PACKET(kif, m, AF_INET, dir, reason, lr, a, ruleset, &pd, (s == NULL)); } kif->pfik_bytes[0][dir == PF_OUT][action != PF_PASS] += pd.tot_len; kif->pfik_packets[0][dir == PF_OUT][action != PF_PASS]++; if (action == PF_PASS || r->action == PF_DROP) { dirndx = (dir == PF_OUT); r->packets[dirndx]++; r->bytes[dirndx] += pd.tot_len; if (a != NULL) { a->packets[dirndx]++; a->bytes[dirndx] += pd.tot_len; } if (s != NULL) { if (s->nat_rule.ptr != NULL) { s->nat_rule.ptr->packets[dirndx]++; s->nat_rule.ptr->bytes[dirndx] += pd.tot_len; } if (s->src_node != NULL) { s->src_node->packets[dirndx]++; s->src_node->bytes[dirndx] += pd.tot_len; } if (s->nat_src_node != NULL) { s->nat_src_node->packets[dirndx]++; s->nat_src_node->bytes[dirndx] += pd.tot_len; } dirndx = (dir == s->direction) ? 
0 : 1; s->packets[dirndx]++; s->bytes[dirndx] += pd.tot_len; } tr = r; nr = (s != NULL) ? s->nat_rule.ptr : pd.nat_rule; if (nr != NULL && r == &V_pf_default_rule) tr = nr; if (tr->src.addr.type == PF_ADDR_TABLE) pfr_update_stats(tr->src.addr.p.tbl, (s == NULL) ? pd.src : &s->key[(s->direction == PF_IN)]-> addr[(s->direction == PF_OUT)], pd.af, pd.tot_len, dir == PF_OUT, r->action == PF_PASS, tr->src.neg); if (tr->dst.addr.type == PF_ADDR_TABLE) pfr_update_stats(tr->dst.addr.p.tbl, (s == NULL) ? pd.dst : &s->key[(s->direction == PF_IN)]-> addr[(s->direction == PF_IN)], pd.af, pd.tot_len, dir == PF_OUT, r->action == PF_PASS, tr->dst.neg); } switch (action) { case PF_SYNPROXY_DROP: m_freem(*m0); case PF_DEFER: *m0 = NULL; action = PF_PASS; break; case PF_DROP: m_freem(*m0); *m0 = NULL; break; default: /* pf_route() returns unlocked. */ if (r->rt) { pf_route(m0, r, dir, kif->pfik_ifp, s, &pd); return (action); } break; } if (s) PF_STATE_UNLOCK(s); return (action); } #endif /* INET */ #ifdef INET6 int pf_test6(int dir, int pflags, struct ifnet *ifp, struct mbuf **m0, struct inpcb *inp) { struct pfi_kif *kif; u_short action, reason = 0, log = 0; struct mbuf *m = *m0, *n = NULL; struct m_tag *mtag; struct ip6_hdr *h = NULL; struct pf_rule *a = NULL, *r = &V_pf_default_rule, *tr, *nr; struct pf_state *s = NULL; struct pf_ruleset *ruleset = NULL; struct pf_pdesc pd; int off, terminal = 0, dirndx, rh_cnt = 0, pqid = 0; M_ASSERTPKTHDR(m); if (!V_pf_status.running) return (PF_PASS); memset(&pd, 0, sizeof(pd)); pd.pf_mtag = pf_find_mtag(m); if (pd.pf_mtag && pd.pf_mtag->flags & PF_TAG_GENERATED) return (PF_PASS); kif = (struct pfi_kif *)ifp->if_pf_kif; if (kif == NULL) { DPFPRINTF(PF_DEBUG_URGENT, ("pf_test6: kif == NULL, if_xname %s\n", ifp->if_xname)); return (PF_DROP); } if (kif->pfik_flags & PFI_IFLAG_SKIP) return (PF_PASS); if (m->m_flags & M_SKIP_FIREWALL) return (PF_PASS); PF_RULES_RLOCK(); /* We do IP header normalization and packet reassembly here */ if (pf_normalize_ip6(m0, dir, kif, &reason, &pd) != PF_PASS) { action = PF_DROP; goto done; } m = *m0; /* pf_normalize messes with m0 */ h = mtod(m, struct ip6_hdr *); #if 1 /* * we do not support jumbogram yet. if we keep going, zero ip6_plen * will do something bad, so drop the packet for now. */ if (htons(h->ip6_plen) == 0) { action = PF_DROP; REASON_SET(&reason, PFRES_NORM); /*XXX*/ goto done; } #endif pd.src = (struct pf_addr *)&h->ip6_src; pd.dst = (struct pf_addr *)&h->ip6_dst; pd.sport = pd.dport = NULL; pd.ip_sum = NULL; pd.proto_sum = NULL; pd.dir = dir; pd.sidx = (dir == PF_IN) ? 0 : 1; pd.didx = (dir == PF_IN) ? 
1 : 0; pd.af = AF_INET6; pd.tos = 0; pd.tot_len = ntohs(h->ip6_plen) + sizeof(struct ip6_hdr); off = ((caddr_t)h - m->m_data) + sizeof(struct ip6_hdr); pd.proto = h->ip6_nxt; do { switch (pd.proto) { case IPPROTO_FRAGMENT: action = pf_test_fragment(&r, dir, kif, m, h, &pd, &a, &ruleset); if (action == PF_DROP) REASON_SET(&reason, PFRES_FRAG); goto done; case IPPROTO_ROUTING: { struct ip6_rthdr rthdr; if (rh_cnt++) { DPFPRINTF(PF_DEBUG_MISC, ("pf: IPv6 more than one rthdr\n")); action = PF_DROP; REASON_SET(&reason, PFRES_IPOPTIONS); log = 1; goto done; } if (!pf_pull_hdr(m, off, &rthdr, sizeof(rthdr), NULL, &reason, pd.af)) { DPFPRINTF(PF_DEBUG_MISC, ("pf: IPv6 short rthdr\n")); action = PF_DROP; REASON_SET(&reason, PFRES_SHORT); log = 1; goto done; } if (rthdr.ip6r_type == IPV6_RTHDR_TYPE_0) { DPFPRINTF(PF_DEBUG_MISC, ("pf: IPv6 rthdr0\n")); action = PF_DROP; REASON_SET(&reason, PFRES_IPOPTIONS); log = 1; goto done; } /* FALLTHROUGH */ } case IPPROTO_AH: case IPPROTO_HOPOPTS: case IPPROTO_DSTOPTS: { /* get next header and header length */ struct ip6_ext opt6; if (!pf_pull_hdr(m, off, &opt6, sizeof(opt6), NULL, &reason, pd.af)) { DPFPRINTF(PF_DEBUG_MISC, ("pf: IPv6 short opt\n")); action = PF_DROP; log = 1; goto done; } if (pd.proto == IPPROTO_AH) off += (opt6.ip6e_len + 2) * 4; else off += (opt6.ip6e_len + 1) * 8; pd.proto = opt6.ip6e_nxt; /* goto the next header */ break; } default: terminal++; break; } } while (!terminal); /* if there's no routing header, use unmodified mbuf for checksumming */ if (!n) n = m; switch (pd.proto) { case IPPROTO_TCP: { struct tcphdr th; pd.hdr.tcp = &th; if (!pf_pull_hdr(m, off, &th, sizeof(th), &action, &reason, AF_INET6)) { log = action != PF_PASS; goto done; } pd.p_len = pd.tot_len - off - (th.th_off << 2); action = pf_normalize_tcp(dir, kif, m, 0, off, h, &pd); if (action == PF_DROP) goto done; action = pf_test_state_tcp(&s, dir, kif, m, off, h, &pd, &reason); if (action == PF_PASS) { if (pfsync_update_state_ptr != NULL) pfsync_update_state_ptr(s); r = s->rule.ptr; a = s->anchor.ptr; log = s->log; } else if (s == NULL) action = pf_test_rule(&r, &s, dir, kif, m, off, &pd, &a, &ruleset, inp); break; } case IPPROTO_UDP: { struct udphdr uh; pd.hdr.udp = &uh; if (!pf_pull_hdr(m, off, &uh, sizeof(uh), &action, &reason, AF_INET6)) { log = action != PF_PASS; goto done; } if (uh.uh_dport == 0 || ntohs(uh.uh_ulen) > m->m_pkthdr.len - off || ntohs(uh.uh_ulen) < sizeof(struct udphdr)) { action = PF_DROP; REASON_SET(&reason, PFRES_SHORT); goto done; } action = pf_test_state_udp(&s, dir, kif, m, off, h, &pd); if (action == PF_PASS) { if (pfsync_update_state_ptr != NULL) pfsync_update_state_ptr(s); r = s->rule.ptr; a = s->anchor.ptr; log = s->log; } else if (s == NULL) action = pf_test_rule(&r, &s, dir, kif, m, off, &pd, &a, &ruleset, inp); break; } case IPPROTO_ICMP: { action = PF_DROP; DPFPRINTF(PF_DEBUG_MISC, ("pf: dropping IPv6 packet with ICMPv4 payload\n")); goto done; } case IPPROTO_ICMPV6: { struct icmp6_hdr ih; pd.hdr.icmp6 = &ih; if (!pf_pull_hdr(m, off, &ih, sizeof(ih), &action, &reason, AF_INET6)) { log = action != PF_PASS; goto done; } action = pf_test_state_icmp(&s, dir, kif, m, off, h, &pd, &reason); if (action == PF_PASS) { if (pfsync_update_state_ptr != NULL) pfsync_update_state_ptr(s); r = s->rule.ptr; a = s->anchor.ptr; log = s->log; } else if (s == NULL) action = pf_test_rule(&r, &s, dir, kif, m, off, &pd, &a, &ruleset, inp); break; } default: action = pf_test_state_other(&s, dir, kif, m, &pd); if (action == PF_PASS) { if (pfsync_update_state_ptr != 
NULL) pfsync_update_state_ptr(s); r = s->rule.ptr; a = s->anchor.ptr; log = s->log; } else if (s == NULL) action = pf_test_rule(&r, &s, dir, kif, m, off, &pd, &a, &ruleset, inp); break; } done: PF_RULES_RUNLOCK(); if (n != m) { m_freem(n); n = NULL; } /* handle dangerous IPv6 extension headers. */ if (action == PF_PASS && rh_cnt && !((s && s->state_flags & PFSTATE_ALLOWOPTS) || r->allow_opts)) { action = PF_DROP; REASON_SET(&reason, PFRES_IPOPTIONS); log = r->log; DPFPRINTF(PF_DEBUG_MISC, ("pf: dropping packet with dangerous v6 headers\n")); } if (s && s->tag > 0 && pf_tag_packet(m, &pd, s->tag)) { action = PF_DROP; REASON_SET(&reason, PFRES_MEMORY); } if (r->rtableid >= 0) M_SETFIB(m, r->rtableid); if (r->scrub_flags & PFSTATE_SETPRIO) { if (pd.tos & IPTOS_LOWDELAY) pqid = 1; if (pf_ieee8021q_setpcp(m, r->set_prio[pqid])) { action = PF_DROP; REASON_SET(&reason, PFRES_MEMORY); log = 1; DPFPRINTF(PF_DEBUG_MISC, ("pf: failed to allocate 802.1q mtag\n")); } } #ifdef ALTQ if (action == PF_PASS && r->qid) { if (pd.pf_mtag == NULL && ((pd.pf_mtag = pf_get_mtag(m)) == NULL)) { action = PF_DROP; REASON_SET(&reason, PFRES_MEMORY); } else { if (s != NULL) pd.pf_mtag->qid_hash = pf_state_hash(s); if (pd.tos & IPTOS_LOWDELAY) pd.pf_mtag->qid = r->pqid; else pd.pf_mtag->qid = r->qid; /* Add hints for ecn. */ pd.pf_mtag->hdr = h; } } #endif /* ALTQ */ if (dir == PF_IN && action == PF_PASS && (pd.proto == IPPROTO_TCP || pd.proto == IPPROTO_UDP) && s != NULL && s->nat_rule.ptr != NULL && (s->nat_rule.ptr->action == PF_RDR || s->nat_rule.ptr->action == PF_BINAT) && IN6_IS_ADDR_LOOPBACK(&pd.dst->v6)) m->m_flags |= M_SKIP_FIREWALL; /* XXX: Anybody working on it?! */ if (r->divert.port) printf("pf: divert(9) is not supported for IPv6\n"); if (log) { struct pf_rule *lr; if (s != NULL && s->nat_rule.ptr != NULL && s->nat_rule.ptr->log & PF_LOG_ALL) lr = s->nat_rule.ptr; else lr = r; PFLOG_PACKET(kif, m, AF_INET6, dir, reason, lr, a, ruleset, &pd, (s == NULL)); } kif->pfik_bytes[1][dir == PF_OUT][action != PF_PASS] += pd.tot_len; kif->pfik_packets[1][dir == PF_OUT][action != PF_PASS]++; if (action == PF_PASS || r->action == PF_DROP) { dirndx = (dir == PF_OUT); r->packets[dirndx]++; r->bytes[dirndx] += pd.tot_len; if (a != NULL) { a->packets[dirndx]++; a->bytes[dirndx] += pd.tot_len; } if (s != NULL) { if (s->nat_rule.ptr != NULL) { s->nat_rule.ptr->packets[dirndx]++; s->nat_rule.ptr->bytes[dirndx] += pd.tot_len; } if (s->src_node != NULL) { s->src_node->packets[dirndx]++; s->src_node->bytes[dirndx] += pd.tot_len; } if (s->nat_src_node != NULL) { s->nat_src_node->packets[dirndx]++; s->nat_src_node->bytes[dirndx] += pd.tot_len; } dirndx = (dir == s->direction) ? 0 : 1; s->packets[dirndx]++; s->bytes[dirndx] += pd.tot_len; } tr = r; nr = (s != NULL) ? s->nat_rule.ptr : pd.nat_rule; if (nr != NULL && r == &V_pf_default_rule) tr = nr; if (tr->src.addr.type == PF_ADDR_TABLE) pfr_update_stats(tr->src.addr.p.tbl, (s == NULL) ? pd.src : &s->key[(s->direction == PF_IN)]->addr[0], pd.af, pd.tot_len, dir == PF_OUT, r->action == PF_PASS, tr->src.neg); if (tr->dst.addr.type == PF_ADDR_TABLE) pfr_update_stats(tr->dst.addr.p.tbl, (s == NULL) ? pd.dst : &s->key[(s->direction == PF_IN)]->addr[1], pd.af, pd.tot_len, dir == PF_OUT, r->action == PF_PASS, tr->dst.neg); } switch (action) { case PF_SYNPROXY_DROP: m_freem(*m0); case PF_DEFER: *m0 = NULL; action = PF_PASS; break; case PF_DROP: m_freem(*m0); *m0 = NULL; break; default: /* pf_route6() returns unlocked. 
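 * When a route-to rule applies (r->rt), pf_route6() consumes the mbuf
 * and releases the state lock itself, so we return here rather than
 * falling through to the PF_STATE_UNLOCK() below.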
*/ if (r->rt) { pf_route6(m0, r, dir, kif->pfik_ifp, s, &pd); return (action); } break; } if (s) PF_STATE_UNLOCK(s); /* If reassembled packet passed, create new fragments. */ if (action == PF_PASS && *m0 && (pflags & PFIL_FWD) && (mtag = m_tag_find(m, PF_REASSEMBLED, NULL)) != NULL) action = pf_refragment6(ifp, m0, mtag); return (action); } #endif /* INET6 */ Index: user/markj/netdump/sys/netpfil/pf/pf_ioctl.c =================================================================== --- user/markj/netdump/sys/netpfil/pf/pf_ioctl.c (revision 332407) +++ user/markj/netdump/sys/netpfil/pf/pf_ioctl.c (revision 332408) @@ -1,4039 +1,4053 @@ /*- * SPDX-License-Identifier: BSD-2-Clause * * Copyright (c) 2001 Daniel Hartmeier * Copyright (c) 2002,2003 Henning Brauer * Copyright (c) 2012 Gleb Smirnoff * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * - Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials provided * with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN * ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. * * Effort sponsored in part by the Defense Advanced Research Projects * Agency (DARPA) and Air Force Research Laboratory, Air Force * Materiel Command, USAF, under agreement number F30602-01-2-0537. * * $OpenBSD: pf_ioctl.c,v 1.213 2009/02/15 21:46:12 mbalmer Exp $ */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_bpf.h" #include "opt_pf.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #endif /* INET6 */ #ifdef ALTQ #include #endif -#define PF_TABLES_MAX_REQUEST 65535 /* Maximum tables per request. 
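 * This fixed cap is dropped in favor of the pf_ioctl_maxcount tunable
 * (declared extern below), which the table ioctls check together with
 * an arithmetic-overflow test before allocating request buffers.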
*/
-
static struct pf_pool	*pf_get_pool(char *, u_int32_t, u_int8_t, u_int32_t,
			    u_int8_t, u_int8_t, u_int8_t);

static void		 pf_mv_pool(struct pf_palist *, struct pf_palist *);
static void		 pf_empty_pool(struct pf_palist *);
static int		 pfioctl(struct cdev *, u_long, caddr_t, int,
			    struct thread *);
#ifdef ALTQ
static int		 pf_begin_altq(u_int32_t *);
static int		 pf_rollback_altq(u_int32_t);
static int		 pf_commit_altq(u_int32_t);
static int		 pf_enable_altq(struct pf_altq *);
static int		 pf_disable_altq(struct pf_altq *);
static u_int32_t	 pf_qname2qid(char *);
static void		 pf_qid_unref(u_int32_t);
#endif /* ALTQ */
static int		 pf_begin_rules(u_int32_t *, int, const char *);
static int		 pf_rollback_rules(u_int32_t, int, char *);
static int		 pf_setup_pfsync_matching(struct pf_ruleset *);
static void		 pf_hash_rule(MD5_CTX *, struct pf_rule *);
static void		 pf_hash_rule_addr(MD5_CTX *, struct pf_rule_addr *);
static int		 pf_commit_rules(u_int32_t, int, char *);
static int		 pf_addr_setup(struct pf_ruleset *,
			    struct pf_addr_wrap *, sa_family_t);
static void		 pf_addr_copyout(struct pf_addr_wrap *);

VNET_DEFINE(struct pf_rule,	pf_default_rule);

#ifdef ALTQ
static VNET_DEFINE(int,		pf_altq_running);
#define	V_pf_altq_running	VNET(pf_altq_running)
#endif

#define	TAGID_MAX	50000
struct pf_tagname {
	TAILQ_ENTRY(pf_tagname)	entries;
	char			name[PF_TAG_NAME_SIZE];
	uint16_t		tag;
	int			ref;
};

TAILQ_HEAD(pf_tags, pf_tagname);
#define	V_pf_tags	VNET(pf_tags)
VNET_DEFINE(struct pf_tags, pf_tags);
#define	V_pf_qids	VNET(pf_qids)
VNET_DEFINE(struct pf_tags, pf_qids);
static MALLOC_DEFINE(M_PFTAG, "pf_tag", "pf(4) tag names");
static MALLOC_DEFINE(M_PFALTQ, "pf_altq", "pf(4) altq configuration db");
static MALLOC_DEFINE(M_PFRULE, "pf_rule", "pf(4) rules");

#if (PF_QNAME_SIZE != PF_TAG_NAME_SIZE)
#error PF_QNAME_SIZE must be equal to PF_TAG_NAME_SIZE
#endif

static u_int16_t	 tagname2tag(struct pf_tags *, char *);
static u_int16_t	 pf_tagname2tag(char *);
static void		 tag_unref(struct pf_tags *, u_int16_t);

#define	DPFPRINTF(n, x)	if (V_pf_status.debug >= (n)) printf x

struct cdev *pf_dev;

/*
 * XXX - These are new and need to be checked when moving to a new version
 */
static void		 pf_clear_states(void);
static int		 pf_clear_tables(void);
static void		 pf_clear_srcnodes(struct pf_src_node *);
static void		 pf_kill_srcnodes(struct pfioc_src_node_kill *);
static void		 pf_tbladdr_copyout(struct pf_addr_wrap *);

/*
 * Wrapper functions for pfil(9) hooks
 */
#ifdef INET
static int pf_check_in(void *arg, struct mbuf **m, struct ifnet *ifp,
    int dir, int flags, struct inpcb *inp);
static int pf_check_out(void *arg, struct mbuf **m, struct ifnet *ifp,
    int dir, int flags, struct inpcb *inp);
#endif
#ifdef INET6
static int pf_check6_in(void *arg, struct mbuf **m, struct ifnet *ifp,
    int dir, int flags, struct inpcb *inp);
static int pf_check6_out(void *arg, struct mbuf **m, struct ifnet *ifp,
    int dir, int flags, struct inpcb *inp);
#endif

static int		hook_pf(void);
static int		dehook_pf(void);
static int		shutdown_pf(void);
static int		pf_load(void);
static void		pf_unload(void);

static struct cdevsw pf_cdevsw = {
	.d_ioctl =	pfioctl,
	.d_name =	PF_NAME,
	.d_version =	D_VERSION,
};

static volatile VNET_DEFINE(int, pf_pfil_hooked);
#define V_pf_pfil_hooked	VNET(pf_pfil_hooked)

/*
 * We need a flag that is neither hooked nor running to know when
 * the VNET is "valid".  We primarily need this to control (global)
 * external events, e.g., eventhandlers.
*/ VNET_DEFINE(int, pf_vnet_active); #define V_pf_vnet_active VNET(pf_vnet_active) int pf_end_threads; struct proc *pf_purge_proc; struct rwlock pf_rules_lock; struct sx pf_ioctl_lock; struct sx pf_end_lock; /* pfsync */ pfsync_state_import_t *pfsync_state_import_ptr = NULL; pfsync_insert_state_t *pfsync_insert_state_ptr = NULL; pfsync_update_state_t *pfsync_update_state_ptr = NULL; pfsync_delete_state_t *pfsync_delete_state_ptr = NULL; pfsync_clear_states_t *pfsync_clear_states_ptr = NULL; pfsync_defer_t *pfsync_defer_ptr = NULL; /* pflog */ pflog_packet_t *pflog_packet_ptr = NULL; +extern u_long pf_ioctl_maxcount; + static void pfattach_vnet(void) { u_int32_t *my_timeout = V_pf_default_rule.timeout; pf_initialize(); pfr_initialize(); pfi_initialize_vnet(); pf_normalize_init(); V_pf_limits[PF_LIMIT_STATES].limit = PFSTATE_HIWAT; V_pf_limits[PF_LIMIT_SRC_NODES].limit = PFSNODE_HIWAT; RB_INIT(&V_pf_anchors); pf_init_ruleset(&pf_main_ruleset); /* default rule should never be garbage collected */ V_pf_default_rule.entries.tqe_prev = &V_pf_default_rule.entries.tqe_next; #ifdef PF_DEFAULT_TO_DROP V_pf_default_rule.action = PF_DROP; #else V_pf_default_rule.action = PF_PASS; #endif V_pf_default_rule.nr = -1; V_pf_default_rule.rtableid = -1; V_pf_default_rule.states_cur = counter_u64_alloc(M_WAITOK); V_pf_default_rule.states_tot = counter_u64_alloc(M_WAITOK); V_pf_default_rule.src_nodes = counter_u64_alloc(M_WAITOK); /* initialize default timeouts */ my_timeout[PFTM_TCP_FIRST_PACKET] = PFTM_TCP_FIRST_PACKET_VAL; my_timeout[PFTM_TCP_OPENING] = PFTM_TCP_OPENING_VAL; my_timeout[PFTM_TCP_ESTABLISHED] = PFTM_TCP_ESTABLISHED_VAL; my_timeout[PFTM_TCP_CLOSING] = PFTM_TCP_CLOSING_VAL; my_timeout[PFTM_TCP_FIN_WAIT] = PFTM_TCP_FIN_WAIT_VAL; my_timeout[PFTM_TCP_CLOSED] = PFTM_TCP_CLOSED_VAL; my_timeout[PFTM_UDP_FIRST_PACKET] = PFTM_UDP_FIRST_PACKET_VAL; my_timeout[PFTM_UDP_SINGLE] = PFTM_UDP_SINGLE_VAL; my_timeout[PFTM_UDP_MULTIPLE] = PFTM_UDP_MULTIPLE_VAL; my_timeout[PFTM_ICMP_FIRST_PACKET] = PFTM_ICMP_FIRST_PACKET_VAL; my_timeout[PFTM_ICMP_ERROR_REPLY] = PFTM_ICMP_ERROR_REPLY_VAL; my_timeout[PFTM_OTHER_FIRST_PACKET] = PFTM_OTHER_FIRST_PACKET_VAL; my_timeout[PFTM_OTHER_SINGLE] = PFTM_OTHER_SINGLE_VAL; my_timeout[PFTM_OTHER_MULTIPLE] = PFTM_OTHER_MULTIPLE_VAL; my_timeout[PFTM_FRAG] = PFTM_FRAG_VAL; my_timeout[PFTM_INTERVAL] = PFTM_INTERVAL_VAL; my_timeout[PFTM_SRC_NODE] = PFTM_SRC_NODE_VAL; my_timeout[PFTM_TS_DIFF] = PFTM_TS_DIFF_VAL; my_timeout[PFTM_ADAPTIVE_START] = PFSTATE_ADAPT_START; my_timeout[PFTM_ADAPTIVE_END] = PFSTATE_ADAPT_END; bzero(&V_pf_status, sizeof(V_pf_status)); V_pf_status.debug = PF_DEBUG_URGENT; V_pf_pfil_hooked = 0; /* XXX do our best to avoid a conflict */ V_pf_status.hostid = arc4random(); for (int i = 0; i < PFRES_MAX; i++) V_pf_status.counters[i] = counter_u64_alloc(M_WAITOK); for (int i = 0; i < LCNT_MAX; i++) V_pf_status.lcounters[i] = counter_u64_alloc(M_WAITOK); for (int i = 0; i < FCNT_MAX; i++) V_pf_status.fcounters[i] = counter_u64_alloc(M_WAITOK); for (int i = 0; i < SCNT_MAX; i++) V_pf_status.scounters[i] = counter_u64_alloc(M_WAITOK); if (swi_add(NULL, "pf send", pf_intr, curvnet, SWI_NET, INTR_MPSAFE, &V_pf_swi_cookie) != 0) /* XXXGL: leaked all above. 
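 * On swi_add() failure, the counters and subsystem state initialized
 * earlier in this function are not rolled back; the VNET is left
 * partially constructed.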
*/ return; } static struct pf_pool * pf_get_pool(char *anchor, u_int32_t ticket, u_int8_t rule_action, u_int32_t rule_number, u_int8_t r_last, u_int8_t active, u_int8_t check_ticket) { struct pf_ruleset *ruleset; struct pf_rule *rule; int rs_num; ruleset = pf_find_ruleset(anchor); if (ruleset == NULL) return (NULL); rs_num = pf_get_ruleset_number(rule_action); if (rs_num >= PF_RULESET_MAX) return (NULL); if (active) { if (check_ticket && ticket != ruleset->rules[rs_num].active.ticket) return (NULL); if (r_last) rule = TAILQ_LAST(ruleset->rules[rs_num].active.ptr, pf_rulequeue); else rule = TAILQ_FIRST(ruleset->rules[rs_num].active.ptr); } else { if (check_ticket && ticket != ruleset->rules[rs_num].inactive.ticket) return (NULL); if (r_last) rule = TAILQ_LAST(ruleset->rules[rs_num].inactive.ptr, pf_rulequeue); else rule = TAILQ_FIRST(ruleset->rules[rs_num].inactive.ptr); } if (!r_last) { while ((rule != NULL) && (rule->nr != rule_number)) rule = TAILQ_NEXT(rule, entries); } if (rule == NULL) return (NULL); return (&rule->rpool); } static void pf_mv_pool(struct pf_palist *poola, struct pf_palist *poolb) { struct pf_pooladdr *mv_pool_pa; while ((mv_pool_pa = TAILQ_FIRST(poola)) != NULL) { TAILQ_REMOVE(poola, mv_pool_pa, entries); TAILQ_INSERT_TAIL(poolb, mv_pool_pa, entries); } } static void pf_empty_pool(struct pf_palist *poola) { struct pf_pooladdr *pa; while ((pa = TAILQ_FIRST(poola)) != NULL) { switch (pa->addr.type) { case PF_ADDR_DYNIFTL: pfi_dynaddr_remove(pa->addr.p.dyn); break; case PF_ADDR_TABLE: /* XXX: this could be unfinished pooladdr on pabuf */ if (pa->addr.p.tbl != NULL) pfr_detach_table(pa->addr.p.tbl); break; } if (pa->kif) pfi_kif_unref(pa->kif); TAILQ_REMOVE(poola, pa, entries); free(pa, M_PFRULE); } } static void pf_unlink_rule(struct pf_rulequeue *rulequeue, struct pf_rule *rule) { PF_RULES_WASSERT(); TAILQ_REMOVE(rulequeue, rule, entries); PF_UNLNKDRULES_LOCK(); rule->rule_flag |= PFRULE_REFS; TAILQ_INSERT_TAIL(&V_pf_unlinked_rules, rule, entries); PF_UNLNKDRULES_UNLOCK(); } void pf_free_rule(struct pf_rule *rule) { PF_RULES_WASSERT(); if (rule->tag) tag_unref(&V_pf_tags, rule->tag); if (rule->match_tag) tag_unref(&V_pf_tags, rule->match_tag); #ifdef ALTQ if (rule->pqid != rule->qid) pf_qid_unref(rule->pqid); pf_qid_unref(rule->qid); #endif switch (rule->src.addr.type) { case PF_ADDR_DYNIFTL: pfi_dynaddr_remove(rule->src.addr.p.dyn); break; case PF_ADDR_TABLE: pfr_detach_table(rule->src.addr.p.tbl); break; } switch (rule->dst.addr.type) { case PF_ADDR_DYNIFTL: pfi_dynaddr_remove(rule->dst.addr.p.dyn); break; case PF_ADDR_TABLE: pfr_detach_table(rule->dst.addr.p.tbl); break; } if (rule->overload_tbl) pfr_detach_table(rule->overload_tbl); if (rule->kif) pfi_kif_unref(rule->kif); pf_anchor_remove(rule); pf_empty_pool(&rule->rpool.list); counter_u64_free(rule->states_cur); counter_u64_free(rule->states_tot); counter_u64_free(rule->src_nodes); free(rule, M_PFRULE); } static u_int16_t tagname2tag(struct pf_tags *head, char *tagname) { struct pf_tagname *tag, *p = NULL; u_int16_t new_tagid = 1; PF_RULES_WASSERT(); TAILQ_FOREACH(tag, head, entries) if (strcmp(tagname, tag->name) == 0) { tag->ref++; return (tag->tag); } /* * to avoid fragmentation, we do a linear search from the beginning * and take the first free slot we find. if there is none or the list * is empty, append a new entry at the end. 
*/ /* new entry */ if (!TAILQ_EMPTY(head)) for (p = TAILQ_FIRST(head); p != NULL && p->tag == new_tagid; p = TAILQ_NEXT(p, entries)) new_tagid = p->tag + 1; if (new_tagid > TAGID_MAX) return (0); /* allocate and fill new struct pf_tagname */ tag = malloc(sizeof(*tag), M_PFTAG, M_NOWAIT|M_ZERO); if (tag == NULL) return (0); strlcpy(tag->name, tagname, sizeof(tag->name)); tag->tag = new_tagid; tag->ref++; if (p != NULL) /* insert new entry before p */ TAILQ_INSERT_BEFORE(p, tag, entries); else /* either list empty or no free slot in between */ TAILQ_INSERT_TAIL(head, tag, entries); return (tag->tag); } static void tag_unref(struct pf_tags *head, u_int16_t tag) { struct pf_tagname *p, *next; PF_RULES_WASSERT(); for (p = TAILQ_FIRST(head); p != NULL; p = next) { next = TAILQ_NEXT(p, entries); if (tag == p->tag) { if (--p->ref == 0) { TAILQ_REMOVE(head, p, entries); free(p, M_PFTAG); } break; } } } static u_int16_t pf_tagname2tag(char *tagname) { return (tagname2tag(&V_pf_tags, tagname)); } #ifdef ALTQ static u_int32_t pf_qname2qid(char *qname) { return ((u_int32_t)tagname2tag(&V_pf_qids, qname)); } static void pf_qid_unref(u_int32_t qid) { tag_unref(&V_pf_qids, (u_int16_t)qid); } static int pf_begin_altq(u_int32_t *ticket) { struct pf_altq *altq; int error = 0; PF_RULES_WASSERT(); /* Purge the old altq list */ while ((altq = TAILQ_FIRST(V_pf_altqs_inactive)) != NULL) { TAILQ_REMOVE(V_pf_altqs_inactive, altq, entries); if (altq->qname[0] == 0 && (altq->local_flags & PFALTQ_FLAG_IF_REMOVED) == 0) { /* detach and destroy the discipline */ error = altq_remove(altq); } else pf_qid_unref(altq->qid); free(altq, M_PFALTQ); } if (error) return (error); *ticket = ++V_ticket_altqs_inactive; V_altqs_inactive_open = 1; return (0); } static int pf_rollback_altq(u_int32_t ticket) { struct pf_altq *altq; int error = 0; PF_RULES_WASSERT(); if (!V_altqs_inactive_open || ticket != V_ticket_altqs_inactive) return (0); /* Purge the old altq list */ while ((altq = TAILQ_FIRST(V_pf_altqs_inactive)) != NULL) { TAILQ_REMOVE(V_pf_altqs_inactive, altq, entries); if (altq->qname[0] == 0 && (altq->local_flags & PFALTQ_FLAG_IF_REMOVED) == 0) { /* detach and destroy the discipline */ error = altq_remove(altq); } else pf_qid_unref(altq->qid); free(altq, M_PFALTQ); } V_altqs_inactive_open = 0; return (error); } static int pf_commit_altq(u_int32_t ticket) { struct pf_altqqueue *old_altqs; struct pf_altq *altq; int err, error = 0; PF_RULES_WASSERT(); if (!V_altqs_inactive_open || ticket != V_ticket_altqs_inactive) return (EBUSY); /* swap altqs, keep the old. 
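 * The staged inactive list becomes active under the rules write lock;
 * new disciplines are then attached, and the now-inactive old list is
 * detached, destroyed, and freed below.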
*/ old_altqs = V_pf_altqs_active; V_pf_altqs_active = V_pf_altqs_inactive; V_pf_altqs_inactive = old_altqs; V_ticket_altqs_active = V_ticket_altqs_inactive; /* Attach new disciplines */ TAILQ_FOREACH(altq, V_pf_altqs_active, entries) { if (altq->qname[0] == 0 && (altq->local_flags & PFALTQ_FLAG_IF_REMOVED) == 0) { /* attach the discipline */ error = altq_pfattach(altq); if (error == 0 && V_pf_altq_running) error = pf_enable_altq(altq); if (error != 0) return (error); } } /* Purge the old altq list */ while ((altq = TAILQ_FIRST(V_pf_altqs_inactive)) != NULL) { TAILQ_REMOVE(V_pf_altqs_inactive, altq, entries); if (altq->qname[0] == 0 && (altq->local_flags & PFALTQ_FLAG_IF_REMOVED) == 0) { /* detach and destroy the discipline */ if (V_pf_altq_running) error = pf_disable_altq(altq); err = altq_pfdetach(altq); if (err != 0 && error == 0) error = err; err = altq_remove(altq); if (err != 0 && error == 0) error = err; } else pf_qid_unref(altq->qid); free(altq, M_PFALTQ); } V_altqs_inactive_open = 0; return (error); } static int pf_enable_altq(struct pf_altq *altq) { struct ifnet *ifp; struct tb_profile tb; int error = 0; if ((ifp = ifunit(altq->ifname)) == NULL) return (EINVAL); if (ifp->if_snd.altq_type != ALTQT_NONE) error = altq_enable(&ifp->if_snd); /* set tokenbucket regulator */ if (error == 0 && ifp != NULL && ALTQ_IS_ENABLED(&ifp->if_snd)) { tb.rate = altq->ifbandwidth; tb.depth = altq->tbrsize; error = tbr_set(&ifp->if_snd, &tb); } return (error); } static int pf_disable_altq(struct pf_altq *altq) { struct ifnet *ifp; struct tb_profile tb; int error; if ((ifp = ifunit(altq->ifname)) == NULL) return (EINVAL); /* * when the discipline is no longer referenced, it was overridden * by a new one. if so, just return. */ if (altq->altq_disc != ifp->if_snd.altq_disc) return (0); error = altq_disable(&ifp->if_snd); if (error == 0) { /* clear tokenbucket regulator */ tb.rate = 0; error = tbr_set(&ifp->if_snd, &tb); } return (error); } void pf_altq_ifnet_event(struct ifnet *ifp, int remove) { struct ifnet *ifp1; struct pf_altq *a1, *a2, *a3; u_int32_t ticket; int error = 0; /* Interrupt userland queue modifications */ if (V_altqs_inactive_open) pf_rollback_altq(V_ticket_altqs_inactive); /* Start new altq ruleset */ if (pf_begin_altq(&ticket)) return; /* Copy the current active set */ TAILQ_FOREACH(a1, V_pf_altqs_active, entries) { a2 = malloc(sizeof(*a2), M_PFALTQ, M_NOWAIT); if (a2 == NULL) { error = ENOMEM; break; } bcopy(a1, a2, sizeof(struct pf_altq)); if (a2->qname[0] != 0) { if ((a2->qid = pf_qname2qid(a2->qname)) == 0) { error = EBUSY; free(a2, M_PFALTQ); break; } a2->altq_disc = NULL; TAILQ_FOREACH(a3, V_pf_altqs_inactive, entries) { if (strncmp(a3->ifname, a2->ifname, IFNAMSIZ) == 0 && a3->qname[0] == 0) { a2->altq_disc = a3->altq_disc; break; } } } /* Deactivate the interface in question */ a2->local_flags &= ~PFALTQ_FLAG_IF_REMOVED; if ((ifp1 = ifunit(a2->ifname)) == NULL || (remove && ifp1 == ifp)) { a2->local_flags |= PFALTQ_FLAG_IF_REMOVED; } else { error = altq_add(a2); if (ticket != V_ticket_altqs_inactive) error = EBUSY; if (error) { free(a2, M_PFALTQ); break; } } TAILQ_INSERT_TAIL(V_pf_altqs_inactive, a2, entries); } if (error != 0) pf_rollback_altq(ticket); else pf_commit_altq(ticket); } #endif /* ALTQ */ static int pf_begin_rules(u_int32_t *ticket, int rs_num, const char *anchor) { struct pf_ruleset *rs; struct pf_rule *rule; PF_RULES_WASSERT(); if (rs_num < 0 || rs_num >= PF_RULESET_MAX) return (EINVAL); rs = pf_find_or_create_ruleset(anchor); if (rs == NULL) return (EINVAL); 
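	/*
	 * Begin a ruleset transaction: flush whatever is left on the
	 * inactive queue, then hand back a fresh ticket that subsequent
	 * DIOCADDRULE calls and the final commit must present.
	 */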
while ((rule = TAILQ_FIRST(rs->rules[rs_num].inactive.ptr)) != NULL) { pf_unlink_rule(rs->rules[rs_num].inactive.ptr, rule); rs->rules[rs_num].inactive.rcount--; } *ticket = ++rs->rules[rs_num].inactive.ticket; rs->rules[rs_num].inactive.open = 1; return (0); } static int pf_rollback_rules(u_int32_t ticket, int rs_num, char *anchor) { struct pf_ruleset *rs; struct pf_rule *rule; PF_RULES_WASSERT(); if (rs_num < 0 || rs_num >= PF_RULESET_MAX) return (EINVAL); rs = pf_find_ruleset(anchor); if (rs == NULL || !rs->rules[rs_num].inactive.open || rs->rules[rs_num].inactive.ticket != ticket) return (0); while ((rule = TAILQ_FIRST(rs->rules[rs_num].inactive.ptr)) != NULL) { pf_unlink_rule(rs->rules[rs_num].inactive.ptr, rule); rs->rules[rs_num].inactive.rcount--; } rs->rules[rs_num].inactive.open = 0; return (0); } #define PF_MD5_UPD(st, elm) \ MD5Update(ctx, (u_int8_t *) &(st)->elm, sizeof((st)->elm)) #define PF_MD5_UPD_STR(st, elm) \ MD5Update(ctx, (u_int8_t *) (st)->elm, strlen((st)->elm)) #define PF_MD5_UPD_HTONL(st, elm, stor) do { \ (stor) = htonl((st)->elm); \ MD5Update(ctx, (u_int8_t *) &(stor), sizeof(u_int32_t));\ } while (0) #define PF_MD5_UPD_HTONS(st, elm, stor) do { \ (stor) = htons((st)->elm); \ MD5Update(ctx, (u_int8_t *) &(stor), sizeof(u_int16_t));\ } while (0) static void pf_hash_rule_addr(MD5_CTX *ctx, struct pf_rule_addr *pfr) { PF_MD5_UPD(pfr, addr.type); switch (pfr->addr.type) { case PF_ADDR_DYNIFTL: PF_MD5_UPD(pfr, addr.v.ifname); PF_MD5_UPD(pfr, addr.iflags); break; case PF_ADDR_TABLE: PF_MD5_UPD(pfr, addr.v.tblname); break; case PF_ADDR_ADDRMASK: /* XXX ignore af? */ PF_MD5_UPD(pfr, addr.v.a.addr.addr32); PF_MD5_UPD(pfr, addr.v.a.mask.addr32); break; } PF_MD5_UPD(pfr, port[0]); PF_MD5_UPD(pfr, port[1]); PF_MD5_UPD(pfr, neg); PF_MD5_UPD(pfr, port_op); } static void pf_hash_rule(MD5_CTX *ctx, struct pf_rule *rule) { u_int16_t x; u_int32_t y; pf_hash_rule_addr(ctx, &rule->src); pf_hash_rule_addr(ctx, &rule->dst); PF_MD5_UPD_STR(rule, label); PF_MD5_UPD_STR(rule, ifname); PF_MD5_UPD_STR(rule, match_tagname); PF_MD5_UPD_HTONS(rule, match_tag, x); /* dup? */ PF_MD5_UPD_HTONL(rule, os_fingerprint, y); PF_MD5_UPD_HTONL(rule, prob, y); PF_MD5_UPD_HTONL(rule, uid.uid[0], y); PF_MD5_UPD_HTONL(rule, uid.uid[1], y); PF_MD5_UPD(rule, uid.op); PF_MD5_UPD_HTONL(rule, gid.gid[0], y); PF_MD5_UPD_HTONL(rule, gid.gid[1], y); PF_MD5_UPD(rule, gid.op); PF_MD5_UPD_HTONL(rule, rule_flag, y); PF_MD5_UPD(rule, action); PF_MD5_UPD(rule, direction); PF_MD5_UPD(rule, af); PF_MD5_UPD(rule, quick); PF_MD5_UPD(rule, ifnot); PF_MD5_UPD(rule, match_tag_not); PF_MD5_UPD(rule, natpass); PF_MD5_UPD(rule, keep_state); PF_MD5_UPD(rule, proto); PF_MD5_UPD(rule, type); PF_MD5_UPD(rule, code); PF_MD5_UPD(rule, flags); PF_MD5_UPD(rule, flagset); PF_MD5_UPD(rule, allow_opts); PF_MD5_UPD(rule, rt); PF_MD5_UPD(rule, tos); } static int pf_commit_rules(u_int32_t ticket, int rs_num, char *anchor) { struct pf_ruleset *rs; struct pf_rule *rule, **old_array; struct pf_rulequeue *old_rules; int error; u_int32_t old_rcount; PF_RULES_WASSERT(); if (rs_num < 0 || rs_num >= PF_RULESET_MAX) return (EINVAL); rs = pf_find_ruleset(anchor); if (rs == NULL || !rs->rules[rs_num].inactive.open || ticket != rs->rules[rs_num].inactive.ticket) return (EBUSY); /* Calculate checksum for the main ruleset */ if (rs == &pf_main_ruleset) { error = pf_setup_pfsync_matching(rs); if (error != 0) return (error); } /* Swap rules, keep the old. 
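 * The inactive (staged) list is promoted with a pointer swap under the
 * rules write lock, so packet processing never observes a partially
 * loaded ruleset; the superseded rules are purged afterwards.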
*/ old_rules = rs->rules[rs_num].active.ptr; old_rcount = rs->rules[rs_num].active.rcount; old_array = rs->rules[rs_num].active.ptr_array; rs->rules[rs_num].active.ptr = rs->rules[rs_num].inactive.ptr; rs->rules[rs_num].active.ptr_array = rs->rules[rs_num].inactive.ptr_array; rs->rules[rs_num].active.rcount = rs->rules[rs_num].inactive.rcount; rs->rules[rs_num].inactive.ptr = old_rules; rs->rules[rs_num].inactive.ptr_array = old_array; rs->rules[rs_num].inactive.rcount = old_rcount; rs->rules[rs_num].active.ticket = rs->rules[rs_num].inactive.ticket; pf_calc_skip_steps(rs->rules[rs_num].active.ptr); /* Purge the old rule list. */ while ((rule = TAILQ_FIRST(old_rules)) != NULL) pf_unlink_rule(old_rules, rule); if (rs->rules[rs_num].inactive.ptr_array) free(rs->rules[rs_num].inactive.ptr_array, M_TEMP); rs->rules[rs_num].inactive.ptr_array = NULL; rs->rules[rs_num].inactive.rcount = 0; rs->rules[rs_num].inactive.open = 0; pf_remove_if_empty_ruleset(rs); return (0); } static int pf_setup_pfsync_matching(struct pf_ruleset *rs) { MD5_CTX ctx; struct pf_rule *rule; int rs_cnt; u_int8_t digest[PF_MD5_DIGEST_LENGTH]; MD5Init(&ctx); for (rs_cnt = 0; rs_cnt < PF_RULESET_MAX; rs_cnt++) { /* XXX PF_RULESET_SCRUB as well? */ if (rs_cnt == PF_RULESET_SCRUB) continue; if (rs->rules[rs_cnt].inactive.ptr_array) free(rs->rules[rs_cnt].inactive.ptr_array, M_TEMP); rs->rules[rs_cnt].inactive.ptr_array = NULL; if (rs->rules[rs_cnt].inactive.rcount) { rs->rules[rs_cnt].inactive.ptr_array = malloc(sizeof(caddr_t) * rs->rules[rs_cnt].inactive.rcount, M_TEMP, M_NOWAIT); if (!rs->rules[rs_cnt].inactive.ptr_array) return (ENOMEM); } TAILQ_FOREACH(rule, rs->rules[rs_cnt].inactive.ptr, entries) { pf_hash_rule(&ctx, rule); (rs->rules[rs_cnt].inactive.ptr_array)[rule->nr] = rule; } } MD5Final(digest, &ctx); memcpy(V_pf_status.pf_chksum, digest, sizeof(V_pf_status.pf_chksum)); return (0); } static int pf_addr_setup(struct pf_ruleset *ruleset, struct pf_addr_wrap *addr, sa_family_t af) { int error = 0; switch (addr->type) { case PF_ADDR_TABLE: addr->p.tbl = pfr_attach_table(ruleset, addr->v.tblname); if (addr->p.tbl == NULL) error = ENOMEM; break; case PF_ADDR_DYNIFTL: error = pfi_dynaddr_setup(addr, af); break; } return (error); } static void pf_addr_copyout(struct pf_addr_wrap *addr) { switch (addr->type) { case PF_ADDR_DYNIFTL: pfi_dynaddr_copyout(addr); break; case PF_ADDR_TABLE: pf_tbladdr_copyout(addr); break; } } static int pfioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flags, struct thread *td) { int error = 0; /* XXX keep in sync with switch() below */ if (securelevel_gt(td->td_ucred, 2)) switch (cmd) { case DIOCGETRULES: case DIOCGETRULE: case DIOCGETADDRS: case DIOCGETADDR: case DIOCGETSTATE: case DIOCSETSTATUSIF: case DIOCGETSTATUS: case DIOCCLRSTATUS: case DIOCNATLOOK: case DIOCSETDEBUG: case DIOCGETSTATES: case DIOCGETTIMEOUT: case DIOCCLRRULECTRS: case DIOCGETLIMIT: case DIOCGETALTQS: case DIOCGETALTQ: case DIOCGETQSTATS: case DIOCGETRULESETS: case DIOCGETRULESET: case DIOCRGETTABLES: case DIOCRGETTSTATS: case DIOCRCLRTSTATS: case DIOCRCLRADDRS: case DIOCRADDADDRS: case DIOCRDELADDRS: case DIOCRSETADDRS: case DIOCRGETADDRS: case DIOCRGETASTATS: case DIOCRCLRASTATS: case DIOCRTSTADDRS: case DIOCOSFPGET: case DIOCGETSRCNODES: case DIOCCLRSRCNODES: case DIOCIGETIFACES: case DIOCGIFSPEED: case DIOCSETIFFLAG: case DIOCCLRIFFLAG: break; case DIOCRCLRTABLES: case DIOCRADDTABLES: case DIOCRDELTABLES: case DIOCRSETTFLAGS: if (((struct pfioc_table *)addr)->pfrio_flags & PFR_FLAG_DUMMY) break; /* dummy 
operation ok */ return (EPERM); default: return (EPERM); } if (!(flags & FWRITE)) switch (cmd) { case DIOCGETRULES: case DIOCGETADDRS: case DIOCGETADDR: case DIOCGETSTATE: case DIOCGETSTATUS: case DIOCGETSTATES: case DIOCGETTIMEOUT: case DIOCGETLIMIT: case DIOCGETALTQS: case DIOCGETALTQ: case DIOCGETQSTATS: case DIOCGETRULESETS: case DIOCGETRULESET: case DIOCNATLOOK: case DIOCRGETTABLES: case DIOCRGETTSTATS: case DIOCRGETADDRS: case DIOCRGETASTATS: case DIOCRTSTADDRS: case DIOCOSFPGET: case DIOCGETSRCNODES: case DIOCIGETIFACES: case DIOCGIFSPEED: break; case DIOCRCLRTABLES: case DIOCRADDTABLES: case DIOCRDELTABLES: case DIOCRCLRTSTATS: case DIOCRCLRADDRS: case DIOCRADDADDRS: case DIOCRDELADDRS: case DIOCRSETADDRS: case DIOCRSETTFLAGS: if (((struct pfioc_table *)addr)->pfrio_flags & PFR_FLAG_DUMMY) { flags |= FWRITE; /* need write lock for dummy */ break; /* dummy operation ok */ } return (EACCES); case DIOCGETRULE: if (((struct pfioc_rule *)addr)->action == PF_GET_CLR_CNTR) return (EACCES); break; default: return (EACCES); } CURVNET_SET(TD_TO_VNET(td)); switch (cmd) { case DIOCSTART: sx_xlock(&pf_ioctl_lock); if (V_pf_status.running) error = EEXIST; else { int cpu; error = hook_pf(); if (error) { DPFPRINTF(PF_DEBUG_MISC, ("pf: pfil registration failed\n")); break; } V_pf_status.running = 1; V_pf_status.since = time_second; CPU_FOREACH(cpu) V_pf_stateid[cpu] = time_second; DPFPRINTF(PF_DEBUG_MISC, ("pf: started\n")); } break; case DIOCSTOP: sx_xlock(&pf_ioctl_lock); if (!V_pf_status.running) error = ENOENT; else { V_pf_status.running = 0; error = dehook_pf(); if (error) { V_pf_status.running = 1; DPFPRINTF(PF_DEBUG_MISC, ("pf: pfil unregistration failed\n")); } V_pf_status.since = time_second; DPFPRINTF(PF_DEBUG_MISC, ("pf: stopped\n")); } break; case DIOCADDRULE: { struct pfioc_rule *pr = (struct pfioc_rule *)addr; struct pf_ruleset *ruleset; struct pf_rule *rule, *tail; struct pf_pooladdr *pa; struct pfi_kif *kif = NULL; int rs_num; if (pr->rule.return_icmp >> 8 > ICMP_MAXTYPE) { error = EINVAL; break; } #ifndef INET if (pr->rule.af == AF_INET) { error = EAFNOSUPPORT; break; } #endif /* INET */ #ifndef INET6 if (pr->rule.af == AF_INET6) { error = EAFNOSUPPORT; break; } #endif /* INET6 */ rule = malloc(sizeof(*rule), M_PFRULE, M_WAITOK); bcopy(&pr->rule, rule, sizeof(struct pf_rule)); if (rule->ifname[0]) kif = malloc(sizeof(*kif), PFI_MTYPE, M_WAITOK); rule->states_cur = counter_u64_alloc(M_WAITOK); rule->states_tot = counter_u64_alloc(M_WAITOK); rule->src_nodes = counter_u64_alloc(M_WAITOK); rule->cuid = td->td_ucred->cr_ruid; rule->cpid = td->td_proc ? 
td->td_proc->p_pid : 0; TAILQ_INIT(&rule->rpool.list); #define ERROUT(x) { error = (x); goto DIOCADDRULE_error; } PF_RULES_WLOCK(); pr->anchor[sizeof(pr->anchor) - 1] = 0; ruleset = pf_find_ruleset(pr->anchor); if (ruleset == NULL) ERROUT(EINVAL); rs_num = pf_get_ruleset_number(pr->rule.action); if (rs_num >= PF_RULESET_MAX) ERROUT(EINVAL); if (pr->ticket != ruleset->rules[rs_num].inactive.ticket) { DPFPRINTF(PF_DEBUG_MISC, ("ticket: %d != [%d]%d\n", pr->ticket, rs_num, ruleset->rules[rs_num].inactive.ticket)); ERROUT(EBUSY); } if (pr->pool_ticket != V_ticket_pabuf) { DPFPRINTF(PF_DEBUG_MISC, ("pool_ticket: %d != %d\n", pr->pool_ticket, V_ticket_pabuf)); ERROUT(EBUSY); } tail = TAILQ_LAST(ruleset->rules[rs_num].inactive.ptr, pf_rulequeue); if (tail) rule->nr = tail->nr + 1; else rule->nr = 0; if (rule->ifname[0]) { rule->kif = pfi_kif_attach(kif, rule->ifname); pfi_kif_ref(rule->kif); } else rule->kif = NULL; if (rule->rtableid > 0 && rule->rtableid >= rt_numfibs) error = EBUSY; #ifdef ALTQ /* set queue IDs */ if (rule->qname[0] != 0) { if ((rule->qid = pf_qname2qid(rule->qname)) == 0) error = EBUSY; else if (rule->pqname[0] != 0) { if ((rule->pqid = pf_qname2qid(rule->pqname)) == 0) error = EBUSY; } else rule->pqid = rule->qid; } #endif if (rule->tagname[0]) if ((rule->tag = pf_tagname2tag(rule->tagname)) == 0) error = EBUSY; if (rule->match_tagname[0]) if ((rule->match_tag = pf_tagname2tag(rule->match_tagname)) == 0) error = EBUSY; if (rule->rt && !rule->direction) error = EINVAL; if (!rule->log) rule->logif = 0; if (rule->logif >= PFLOGIFS_MAX) error = EINVAL; if (pf_addr_setup(ruleset, &rule->src.addr, rule->af)) error = ENOMEM; if (pf_addr_setup(ruleset, &rule->dst.addr, rule->af)) error = ENOMEM; if (pf_anchor_setup(rule, ruleset, pr->anchor_call)) error = EINVAL; if (rule->scrub_flags & PFSTATE_SETPRIO && (rule->set_prio[0] > PF_PRIO_MAX || rule->set_prio[1] > PF_PRIO_MAX)) error = EINVAL; TAILQ_FOREACH(pa, &V_pf_pabuf, entries) if (pa->addr.type == PF_ADDR_TABLE) { pa->addr.p.tbl = pfr_attach_table(ruleset, pa->addr.v.tblname); if (pa->addr.p.tbl == NULL) error = ENOMEM; } rule->overload_tbl = NULL; if (rule->overload_tblname[0]) { if ((rule->overload_tbl = pfr_attach_table(ruleset, rule->overload_tblname)) == NULL) error = EINVAL; else rule->overload_tbl->pfrkt_flags |= PFR_TFLAG_ACTIVE; } pf_mv_pool(&V_pf_pabuf, &rule->rpool.list); if (((((rule->action == PF_NAT) || (rule->action == PF_RDR) || (rule->action == PF_BINAT)) && rule->anchor == NULL) || (rule->rt > PF_NOPFROUTE)) && (TAILQ_FIRST(&rule->rpool.list) == NULL)) error = EINVAL; if (error) { pf_free_rule(rule); PF_RULES_WUNLOCK(); break; } rule->rpool.cur = TAILQ_FIRST(&rule->rpool.list); rule->evaluations = rule->packets[0] = rule->packets[1] = rule->bytes[0] = rule->bytes[1] = 0; TAILQ_INSERT_TAIL(ruleset->rules[rs_num].inactive.ptr, rule, entries); ruleset->rules[rs_num].inactive.rcount++; PF_RULES_WUNLOCK(); break; #undef ERROUT DIOCADDRULE_error: PF_RULES_WUNLOCK(); counter_u64_free(rule->states_cur); counter_u64_free(rule->states_tot); counter_u64_free(rule->src_nodes); free(rule, M_PFRULE); if (kif) free(kif, PFI_MTYPE); break; } case DIOCGETRULES: { struct pfioc_rule *pr = (struct pfioc_rule *)addr; struct pf_ruleset *ruleset; struct pf_rule *tail; int rs_num; PF_RULES_WLOCK(); pr->anchor[sizeof(pr->anchor) - 1] = 0; ruleset = pf_find_ruleset(pr->anchor); if (ruleset == NULL) { PF_RULES_WUNLOCK(); error = EINVAL; break; } rs_num = pf_get_ruleset_number(pr->rule.action); if (rs_num >= PF_RULESET_MAX) { 
PF_RULES_WUNLOCK(); error = EINVAL; break; } tail = TAILQ_LAST(ruleset->rules[rs_num].active.ptr, pf_rulequeue); if (tail) pr->nr = tail->nr + 1; else pr->nr = 0; pr->ticket = ruleset->rules[rs_num].active.ticket; PF_RULES_WUNLOCK(); break; } case DIOCGETRULE: { struct pfioc_rule *pr = (struct pfioc_rule *)addr; struct pf_ruleset *ruleset; struct pf_rule *rule; int rs_num, i; PF_RULES_WLOCK(); pr->anchor[sizeof(pr->anchor) - 1] = 0; ruleset = pf_find_ruleset(pr->anchor); if (ruleset == NULL) { PF_RULES_WUNLOCK(); error = EINVAL; break; } rs_num = pf_get_ruleset_number(pr->rule.action); if (rs_num >= PF_RULESET_MAX) { PF_RULES_WUNLOCK(); error = EINVAL; break; } if (pr->ticket != ruleset->rules[rs_num].active.ticket) { PF_RULES_WUNLOCK(); error = EBUSY; break; } rule = TAILQ_FIRST(ruleset->rules[rs_num].active.ptr); while ((rule != NULL) && (rule->nr != pr->nr)) rule = TAILQ_NEXT(rule, entries); if (rule == NULL) { PF_RULES_WUNLOCK(); error = EBUSY; break; } bcopy(rule, &pr->rule, sizeof(struct pf_rule)); pr->rule.u_states_cur = counter_u64_fetch(rule->states_cur); pr->rule.u_states_tot = counter_u64_fetch(rule->states_tot); pr->rule.u_src_nodes = counter_u64_fetch(rule->src_nodes); if (pf_anchor_copyout(ruleset, rule, pr)) { PF_RULES_WUNLOCK(); error = EBUSY; break; } pf_addr_copyout(&pr->rule.src.addr); pf_addr_copyout(&pr->rule.dst.addr); for (i = 0; i < PF_SKIP_COUNT; ++i) if (rule->skip[i].ptr == NULL) pr->rule.skip[i].nr = -1; else pr->rule.skip[i].nr = rule->skip[i].ptr->nr; if (pr->action == PF_GET_CLR_CNTR) { rule->evaluations = 0; rule->packets[0] = rule->packets[1] = 0; rule->bytes[0] = rule->bytes[1] = 0; counter_u64_zero(rule->states_tot); } PF_RULES_WUNLOCK(); break; } case DIOCCHANGERULE: { struct pfioc_rule *pcr = (struct pfioc_rule *)addr; struct pf_ruleset *ruleset; struct pf_rule *oldrule = NULL, *newrule = NULL; struct pfi_kif *kif = NULL; struct pf_pooladdr *pa; u_int32_t nr = 0; int rs_num; if (pcr->action < PF_CHANGE_ADD_HEAD || pcr->action > PF_CHANGE_GET_TICKET) { error = EINVAL; break; } if (pcr->rule.return_icmp >> 8 > ICMP_MAXTYPE) { error = EINVAL; break; } if (pcr->action != PF_CHANGE_REMOVE) { #ifndef INET if (pcr->rule.af == AF_INET) { error = EAFNOSUPPORT; break; } #endif /* INET */ #ifndef INET6 if (pcr->rule.af == AF_INET6) { error = EAFNOSUPPORT; break; } #endif /* INET6 */ newrule = malloc(sizeof(*newrule), M_PFRULE, M_WAITOK); bcopy(&pcr->rule, newrule, sizeof(struct pf_rule)); if (newrule->ifname[0]) kif = malloc(sizeof(*kif), PFI_MTYPE, M_WAITOK); newrule->states_cur = counter_u64_alloc(M_WAITOK); newrule->states_tot = counter_u64_alloc(M_WAITOK); newrule->src_nodes = counter_u64_alloc(M_WAITOK); newrule->cuid = td->td_ucred->cr_ruid; newrule->cpid = td->td_proc ? 
td->td_proc->p_pid : 0; TAILQ_INIT(&newrule->rpool.list); } #define ERROUT(x) { error = (x); goto DIOCCHANGERULE_error; } PF_RULES_WLOCK(); if (!(pcr->action == PF_CHANGE_REMOVE || pcr->action == PF_CHANGE_GET_TICKET) && pcr->pool_ticket != V_ticket_pabuf) ERROUT(EBUSY); ruleset = pf_find_ruleset(pcr->anchor); if (ruleset == NULL) ERROUT(EINVAL); rs_num = pf_get_ruleset_number(pcr->rule.action); if (rs_num >= PF_RULESET_MAX) ERROUT(EINVAL); if (pcr->action == PF_CHANGE_GET_TICKET) { pcr->ticket = ++ruleset->rules[rs_num].active.ticket; ERROUT(0); } else if (pcr->ticket != ruleset->rules[rs_num].active.ticket) ERROUT(EINVAL); if (pcr->action != PF_CHANGE_REMOVE) { if (newrule->ifname[0]) { newrule->kif = pfi_kif_attach(kif, newrule->ifname); pfi_kif_ref(newrule->kif); } else newrule->kif = NULL; if (newrule->rtableid > 0 && newrule->rtableid >= rt_numfibs) error = EBUSY; #ifdef ALTQ /* set queue IDs */ if (newrule->qname[0] != 0) { if ((newrule->qid = pf_qname2qid(newrule->qname)) == 0) error = EBUSY; else if (newrule->pqname[0] != 0) { if ((newrule->pqid = pf_qname2qid(newrule->pqname)) == 0) error = EBUSY; } else newrule->pqid = newrule->qid; } #endif /* ALTQ */ if (newrule->tagname[0]) if ((newrule->tag = pf_tagname2tag(newrule->tagname)) == 0) error = EBUSY; if (newrule->match_tagname[0]) if ((newrule->match_tag = pf_tagname2tag( newrule->match_tagname)) == 0) error = EBUSY; if (newrule->rt && !newrule->direction) error = EINVAL; if (!newrule->log) newrule->logif = 0; if (newrule->logif >= PFLOGIFS_MAX) error = EINVAL; if (pf_addr_setup(ruleset, &newrule->src.addr, newrule->af)) error = ENOMEM; if (pf_addr_setup(ruleset, &newrule->dst.addr, newrule->af)) error = ENOMEM; if (pf_anchor_setup(newrule, ruleset, pcr->anchor_call)) error = EINVAL; TAILQ_FOREACH(pa, &V_pf_pabuf, entries) if (pa->addr.type == PF_ADDR_TABLE) { pa->addr.p.tbl = pfr_attach_table(ruleset, pa->addr.v.tblname); if (pa->addr.p.tbl == NULL) error = ENOMEM; } newrule->overload_tbl = NULL; if (newrule->overload_tblname[0]) { if ((newrule->overload_tbl = pfr_attach_table( ruleset, newrule->overload_tblname)) == NULL) error = EINVAL; else newrule->overload_tbl->pfrkt_flags |= PFR_TFLAG_ACTIVE; } pf_mv_pool(&V_pf_pabuf, &newrule->rpool.list); if (((((newrule->action == PF_NAT) || (newrule->action == PF_RDR) || (newrule->action == PF_BINAT) || (newrule->rt > PF_NOPFROUTE)) && !newrule->anchor)) && (TAILQ_FIRST(&newrule->rpool.list) == NULL)) error = EINVAL; if (error) { pf_free_rule(newrule); PF_RULES_WUNLOCK(); break; } newrule->rpool.cur = TAILQ_FIRST(&newrule->rpool.list); newrule->evaluations = 0; newrule->packets[0] = newrule->packets[1] = 0; newrule->bytes[0] = newrule->bytes[1] = 0; } pf_empty_pool(&V_pf_pabuf); if (pcr->action == PF_CHANGE_ADD_HEAD) oldrule = TAILQ_FIRST( ruleset->rules[rs_num].active.ptr); else if (pcr->action == PF_CHANGE_ADD_TAIL) oldrule = TAILQ_LAST( ruleset->rules[rs_num].active.ptr, pf_rulequeue); else { oldrule = TAILQ_FIRST( ruleset->rules[rs_num].active.ptr); while ((oldrule != NULL) && (oldrule->nr != pcr->nr)) oldrule = TAILQ_NEXT(oldrule, entries); if (oldrule == NULL) { if (newrule != NULL) pf_free_rule(newrule); PF_RULES_WUNLOCK(); error = EINVAL; break; } } if (pcr->action == PF_CHANGE_REMOVE) { pf_unlink_rule(ruleset->rules[rs_num].active.ptr, oldrule); ruleset->rules[rs_num].active.rcount--; } else { if (oldrule == NULL) TAILQ_INSERT_TAIL( ruleset->rules[rs_num].active.ptr, newrule, entries); else if (pcr->action == PF_CHANGE_ADD_HEAD || pcr->action == PF_CHANGE_ADD_BEFORE) 
TAILQ_INSERT_BEFORE(oldrule, newrule, entries); else TAILQ_INSERT_AFTER( ruleset->rules[rs_num].active.ptr, oldrule, newrule, entries); ruleset->rules[rs_num].active.rcount++; } nr = 0; TAILQ_FOREACH(oldrule, ruleset->rules[rs_num].active.ptr, entries) oldrule->nr = nr++; ruleset->rules[rs_num].active.ticket++; pf_calc_skip_steps(ruleset->rules[rs_num].active.ptr); pf_remove_if_empty_ruleset(ruleset); PF_RULES_WUNLOCK(); break; #undef ERROUT DIOCCHANGERULE_error: PF_RULES_WUNLOCK(); if (newrule != NULL) { counter_u64_free(newrule->states_cur); counter_u64_free(newrule->states_tot); counter_u64_free(newrule->src_nodes); free(newrule, M_PFRULE); } if (kif != NULL) free(kif, PFI_MTYPE); break; } case DIOCCLRSTATES: { struct pf_state *s; struct pfioc_state_kill *psk = (struct pfioc_state_kill *)addr; u_int i, killed = 0; for (i = 0; i <= pf_hashmask; i++) { struct pf_idhash *ih = &V_pf_idhash[i]; relock_DIOCCLRSTATES: PF_HASHROW_LOCK(ih); LIST_FOREACH(s, &ih->states, entry) if (!psk->psk_ifname[0] || !strcmp(psk->psk_ifname, s->kif->pfik_name)) { /* * Don't send out individual * delete messages. */ s->state_flags |= PFSTATE_NOSYNC; pf_unlink_state(s, PF_ENTER_LOCKED); killed++; goto relock_DIOCCLRSTATES; } PF_HASHROW_UNLOCK(ih); } psk->psk_killed = killed; if (pfsync_clear_states_ptr != NULL) pfsync_clear_states_ptr(V_pf_status.hostid, psk->psk_ifname); break; } case DIOCKILLSTATES: { struct pf_state *s; struct pf_state_key *sk; struct pf_addr *srcaddr, *dstaddr; u_int16_t srcport, dstport; struct pfioc_state_kill *psk = (struct pfioc_state_kill *)addr; u_int i, killed = 0; if (psk->psk_pfcmp.id) { if (psk->psk_pfcmp.creatorid == 0) psk->psk_pfcmp.creatorid = V_pf_status.hostid; if ((s = pf_find_state_byid(psk->psk_pfcmp.id, psk->psk_pfcmp.creatorid))) { pf_unlink_state(s, PF_ENTER_LOCKED); psk->psk_killed = 1; } break; } for (i = 0; i <= pf_hashmask; i++) { struct pf_idhash *ih = &V_pf_idhash[i]; relock_DIOCKILLSTATES: PF_HASHROW_LOCK(ih); LIST_FOREACH(s, &ih->states, entry) { sk = s->key[PF_SK_WIRE]; if (s->direction == PF_OUT) { srcaddr = &sk->addr[1]; dstaddr = &sk->addr[0]; srcport = sk->port[1]; dstport = sk->port[0]; } else { srcaddr = &sk->addr[0]; dstaddr = &sk->addr[1]; srcport = sk->port[0]; dstport = sk->port[1]; } if ((!psk->psk_af || sk->af == psk->psk_af) && (!psk->psk_proto || psk->psk_proto == sk->proto) && PF_MATCHA(psk->psk_src.neg, &psk->psk_src.addr.v.a.addr, &psk->psk_src.addr.v.a.mask, srcaddr, sk->af) && PF_MATCHA(psk->psk_dst.neg, &psk->psk_dst.addr.v.a.addr, &psk->psk_dst.addr.v.a.mask, dstaddr, sk->af) && (psk->psk_src.port_op == 0 || pf_match_port(psk->psk_src.port_op, psk->psk_src.port[0], psk->psk_src.port[1], srcport)) && (psk->psk_dst.port_op == 0 || pf_match_port(psk->psk_dst.port_op, psk->psk_dst.port[0], psk->psk_dst.port[1], dstport)) && (!psk->psk_label[0] || (s->rule.ptr->label[0] && !strcmp(psk->psk_label, s->rule.ptr->label))) && (!psk->psk_ifname[0] || !strcmp(psk->psk_ifname, s->kif->pfik_name))) { pf_unlink_state(s, PF_ENTER_LOCKED); killed++; goto relock_DIOCKILLSTATES; } } PF_HASHROW_UNLOCK(ih); } psk->psk_killed = killed; break; } case DIOCADDSTATE: { struct pfioc_state *ps = (struct pfioc_state *)addr; struct pfsync_state *sp = &ps->state; if (sp->timeout >= PFTM_MAX) { error = EINVAL; break; } if (pfsync_state_import_ptr != NULL) { PF_RULES_RLOCK(); error = pfsync_state_import_ptr(sp, PFSYNC_SI_IOCTL); PF_RULES_RUNLOCK(); } else error = EOPNOTSUPP; break; } case DIOCGETSTATE: { struct pfioc_state *ps = (struct pfioc_state *)addr; struct pf_state 
*s; s = pf_find_state_byid(ps->state.id, ps->state.creatorid); if (s == NULL) { error = ENOENT; break; } pfsync_state_export(&ps->state, s); PF_STATE_UNLOCK(s); break; } case DIOCGETSTATES: { struct pfioc_states *ps = (struct pfioc_states *)addr; struct pf_state *s; struct pfsync_state *pstore, *p; int i, nr; if (ps->ps_len == 0) { nr = uma_zone_get_cur(V_pf_state_z); ps->ps_len = sizeof(struct pfsync_state) * nr; break; } p = pstore = malloc(ps->ps_len, M_TEMP, M_WAITOK); nr = 0; for (i = 0; i <= pf_hashmask; i++) { struct pf_idhash *ih = &V_pf_idhash[i]; PF_HASHROW_LOCK(ih); LIST_FOREACH(s, &ih->states, entry) { if (s->timeout == PFTM_UNLINKED) continue; if ((nr+1) * sizeof(*p) > ps->ps_len) { PF_HASHROW_UNLOCK(ih); goto DIOCGETSTATES_full; } pfsync_state_export(p, s); p++; nr++; } PF_HASHROW_UNLOCK(ih); } DIOCGETSTATES_full: error = copyout(pstore, ps->ps_states, sizeof(struct pfsync_state) * nr); if (error) { free(pstore, M_TEMP); break; } ps->ps_len = sizeof(struct pfsync_state) * nr; free(pstore, M_TEMP); break; } case DIOCGETSTATUS: { struct pf_status *s = (struct pf_status *)addr; PF_RULES_RLOCK(); s->running = V_pf_status.running; s->since = V_pf_status.since; s->debug = V_pf_status.debug; s->hostid = V_pf_status.hostid; s->states = V_pf_status.states; s->src_nodes = V_pf_status.src_nodes; for (int i = 0; i < PFRES_MAX; i++) s->counters[i] = counter_u64_fetch(V_pf_status.counters[i]); for (int i = 0; i < LCNT_MAX; i++) s->lcounters[i] = counter_u64_fetch(V_pf_status.lcounters[i]); for (int i = 0; i < FCNT_MAX; i++) s->fcounters[i] = counter_u64_fetch(V_pf_status.fcounters[i]); for (int i = 0; i < SCNT_MAX; i++) s->scounters[i] = counter_u64_fetch(V_pf_status.scounters[i]); bcopy(V_pf_status.ifname, s->ifname, IFNAMSIZ); bcopy(V_pf_status.pf_chksum, s->pf_chksum, PF_MD5_DIGEST_LENGTH); pfi_update_status(s->ifname, s); PF_RULES_RUNLOCK(); break; } case DIOCSETSTATUSIF: { struct pfioc_if *pi = (struct pfioc_if *)addr; if (pi->ifname[0] == 0) { bzero(V_pf_status.ifname, IFNAMSIZ); break; } PF_RULES_WLOCK(); strlcpy(V_pf_status.ifname, pi->ifname, IFNAMSIZ); PF_RULES_WUNLOCK(); break; } case DIOCCLRSTATUS: { PF_RULES_WLOCK(); for (int i = 0; i < PFRES_MAX; i++) counter_u64_zero(V_pf_status.counters[i]); for (int i = 0; i < FCNT_MAX; i++) counter_u64_zero(V_pf_status.fcounters[i]); for (int i = 0; i < SCNT_MAX; i++) counter_u64_zero(V_pf_status.scounters[i]); for (int i = 0; i < LCNT_MAX; i++) counter_u64_zero(V_pf_status.lcounters[i]); V_pf_status.since = time_second; if (*V_pf_status.ifname) pfi_update_status(V_pf_status.ifname, NULL); PF_RULES_WUNLOCK(); break; } case DIOCNATLOOK: { struct pfioc_natlook *pnl = (struct pfioc_natlook *)addr; struct pf_state_key *sk; struct pf_state *state; struct pf_state_key_cmp key; int m = 0, direction = pnl->direction; int sidx, didx; /* NATLOOK src and dst are reversed, so reverse sidx/didx */ sidx = (direction == PF_IN) ? 1 : 0; didx = (direction == PF_IN) ? 
0 : 1; if (!pnl->proto || PF_AZERO(&pnl->saddr, pnl->af) || PF_AZERO(&pnl->daddr, pnl->af) || ((pnl->proto == IPPROTO_TCP || pnl->proto == IPPROTO_UDP) && (!pnl->dport || !pnl->sport))) error = EINVAL; else { bzero(&key, sizeof(key)); key.af = pnl->af; key.proto = pnl->proto; PF_ACPY(&key.addr[sidx], &pnl->saddr, pnl->af); key.port[sidx] = pnl->sport; PF_ACPY(&key.addr[didx], &pnl->daddr, pnl->af); key.port[didx] = pnl->dport; state = pf_find_state_all(&key, direction, &m); if (m > 1) error = E2BIG; /* more than one state */ else if (state != NULL) { /* XXXGL: not locked read */ sk = state->key[sidx]; PF_ACPY(&pnl->rsaddr, &sk->addr[sidx], sk->af); pnl->rsport = sk->port[sidx]; PF_ACPY(&pnl->rdaddr, &sk->addr[didx], sk->af); pnl->rdport = sk->port[didx]; } else error = ENOENT; } break; } case DIOCSETTIMEOUT: { struct pfioc_tm *pt = (struct pfioc_tm *)addr; int old; if (pt->timeout < 0 || pt->timeout >= PFTM_MAX || pt->seconds < 0) { error = EINVAL; break; } PF_RULES_WLOCK(); old = V_pf_default_rule.timeout[pt->timeout]; if (pt->timeout == PFTM_INTERVAL && pt->seconds == 0) pt->seconds = 1; V_pf_default_rule.timeout[pt->timeout] = pt->seconds; if (pt->timeout == PFTM_INTERVAL && pt->seconds < old) wakeup(pf_purge_thread); pt->seconds = old; PF_RULES_WUNLOCK(); break; } case DIOCGETTIMEOUT: { struct pfioc_tm *pt = (struct pfioc_tm *)addr; if (pt->timeout < 0 || pt->timeout >= PFTM_MAX) { error = EINVAL; break; } PF_RULES_RLOCK(); pt->seconds = V_pf_default_rule.timeout[pt->timeout]; PF_RULES_RUNLOCK(); break; } case DIOCGETLIMIT: { struct pfioc_limit *pl = (struct pfioc_limit *)addr; if (pl->index < 0 || pl->index >= PF_LIMIT_MAX) { error = EINVAL; break; } PF_RULES_RLOCK(); pl->limit = V_pf_limits[pl->index].limit; PF_RULES_RUNLOCK(); break; } case DIOCSETLIMIT: { struct pfioc_limit *pl = (struct pfioc_limit *)addr; int old_limit; PF_RULES_WLOCK(); if (pl->index < 0 || pl->index >= PF_LIMIT_MAX || V_pf_limits[pl->index].zone == NULL) { PF_RULES_WUNLOCK(); error = EINVAL; break; } uma_zone_set_max(V_pf_limits[pl->index].zone, pl->limit); old_limit = V_pf_limits[pl->index].limit; V_pf_limits[pl->index].limit = pl->limit; pl->limit = old_limit; PF_RULES_WUNLOCK(); break; } case DIOCSETDEBUG: { u_int32_t *level = (u_int32_t *)addr; PF_RULES_WLOCK(); V_pf_status.debug = *level; PF_RULES_WUNLOCK(); break; } case DIOCCLRRULECTRS: { /* obsoleted by DIOCGETRULE with action=PF_GET_CLR_CNTR */ struct pf_ruleset *ruleset = &pf_main_ruleset; struct pf_rule *rule; PF_RULES_WLOCK(); TAILQ_FOREACH(rule, ruleset->rules[PF_RULESET_FILTER].active.ptr, entries) { rule->evaluations = 0; rule->packets[0] = rule->packets[1] = 0; rule->bytes[0] = rule->bytes[1] = 0; } PF_RULES_WUNLOCK(); break; } case DIOCGIFSPEED: { struct pf_ifspeed *psp = (struct pf_ifspeed *)addr; struct pf_ifspeed ps; struct ifnet *ifp; if (psp->ifname[0] != 0) { /* Can we completely trust user-land? 
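 * The name is at least copied into a kernel-owned buffer with
 * strlcpy(), so the string handed to ifunit() is NUL-terminated and
 * of bounded size.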
*/ strlcpy(ps.ifname, psp->ifname, IFNAMSIZ); ifp = ifunit(ps.ifname); if (ifp != NULL) psp->baudrate = ifp->if_baudrate; else error = EINVAL; } else error = EINVAL; break; } #ifdef ALTQ case DIOCSTARTALTQ: { struct pf_altq *altq; PF_RULES_WLOCK(); /* enable all altq interfaces on active list */ TAILQ_FOREACH(altq, V_pf_altqs_active, entries) { if (altq->qname[0] == 0 && (altq->local_flags & PFALTQ_FLAG_IF_REMOVED) == 0) { error = pf_enable_altq(altq); if (error != 0) break; } } if (error == 0) V_pf_altq_running = 1; PF_RULES_WUNLOCK(); DPFPRINTF(PF_DEBUG_MISC, ("altq: started\n")); break; } case DIOCSTOPALTQ: { struct pf_altq *altq; PF_RULES_WLOCK(); /* disable all altq interfaces on active list */ TAILQ_FOREACH(altq, V_pf_altqs_active, entries) { if (altq->qname[0] == 0 && (altq->local_flags & PFALTQ_FLAG_IF_REMOVED) == 0) { error = pf_disable_altq(altq); if (error != 0) break; } } if (error == 0) V_pf_altq_running = 0; PF_RULES_WUNLOCK(); DPFPRINTF(PF_DEBUG_MISC, ("altq: stopped\n")); break; } case DIOCADDALTQ: { struct pfioc_altq *pa = (struct pfioc_altq *)addr; struct pf_altq *altq, *a; struct ifnet *ifp; altq = malloc(sizeof(*altq), M_PFALTQ, M_WAITOK); bcopy(&pa->altq, altq, sizeof(struct pf_altq)); altq->local_flags = 0; PF_RULES_WLOCK(); if (pa->ticket != V_ticket_altqs_inactive) { PF_RULES_WUNLOCK(); free(altq, M_PFALTQ); error = EBUSY; break; } /* * if this is for a queue, find the discipline and * copy the necessary fields */ if (altq->qname[0] != 0) { if ((altq->qid = pf_qname2qid(altq->qname)) == 0) { PF_RULES_WUNLOCK(); error = EBUSY; free(altq, M_PFALTQ); break; } altq->altq_disc = NULL; TAILQ_FOREACH(a, V_pf_altqs_inactive, entries) { if (strncmp(a->ifname, altq->ifname, IFNAMSIZ) == 0 && a->qname[0] == 0) { altq->altq_disc = a->altq_disc; break; } } } if ((ifp = ifunit(altq->ifname)) == NULL) altq->local_flags |= PFALTQ_FLAG_IF_REMOVED; else error = altq_add(altq); if (error) { PF_RULES_WUNLOCK(); free(altq, M_PFALTQ); break; } TAILQ_INSERT_TAIL(V_pf_altqs_inactive, altq, entries); bcopy(altq, &pa->altq, sizeof(struct pf_altq)); PF_RULES_WUNLOCK(); break; } case DIOCGETALTQS: { struct pfioc_altq *pa = (struct pfioc_altq *)addr; struct pf_altq *altq; PF_RULES_RLOCK(); pa->nr = 0; TAILQ_FOREACH(altq, V_pf_altqs_active, entries) pa->nr++; pa->ticket = V_ticket_altqs_active; PF_RULES_RUNLOCK(); break; } case DIOCGETALTQ: { struct pfioc_altq *pa = (struct pfioc_altq *)addr; struct pf_altq *altq; u_int32_t nr; PF_RULES_RLOCK(); if (pa->ticket != V_ticket_altqs_active) { PF_RULES_RUNLOCK(); error = EBUSY; break; } nr = 0; altq = TAILQ_FIRST(V_pf_altqs_active); while ((altq != NULL) && (nr < pa->nr)) { altq = TAILQ_NEXT(altq, entries); nr++; } if (altq == NULL) { PF_RULES_RUNLOCK(); error = EBUSY; break; } bcopy(altq, &pa->altq, sizeof(struct pf_altq)); PF_RULES_RUNLOCK(); break; } case DIOCCHANGEALTQ: /* CHANGEALTQ not supported yet! 
*/ error = ENODEV; break; case DIOCGETQSTATS: { struct pfioc_qstats *pq = (struct pfioc_qstats *)addr; struct pf_altq *altq; u_int32_t nr; int nbytes; PF_RULES_RLOCK(); if (pq->ticket != V_ticket_altqs_active) { PF_RULES_RUNLOCK(); error = EBUSY; break; } nbytes = pq->nbytes; nr = 0; altq = TAILQ_FIRST(V_pf_altqs_active); while ((altq != NULL) && (nr < pq->nr)) { altq = TAILQ_NEXT(altq, entries); nr++; } if (altq == NULL) { PF_RULES_RUNLOCK(); error = EBUSY; break; } if ((altq->local_flags & PFALTQ_FLAG_IF_REMOVED) != 0) { PF_RULES_RUNLOCK(); error = ENXIO; break; } PF_RULES_RUNLOCK(); error = altq_getqstats(altq, pq->buf, &nbytes); if (error == 0) { pq->scheduler = altq->scheduler; pq->nbytes = nbytes; } break; } #endif /* ALTQ */ case DIOCBEGINADDRS: { struct pfioc_pooladdr *pp = (struct pfioc_pooladdr *)addr; PF_RULES_WLOCK(); pf_empty_pool(&V_pf_pabuf); pp->ticket = ++V_ticket_pabuf; PF_RULES_WUNLOCK(); break; } case DIOCADDADDR: { struct pfioc_pooladdr *pp = (struct pfioc_pooladdr *)addr; struct pf_pooladdr *pa; struct pfi_kif *kif = NULL; #ifndef INET if (pp->af == AF_INET) { error = EAFNOSUPPORT; break; } #endif /* INET */ #ifndef INET6 if (pp->af == AF_INET6) { error = EAFNOSUPPORT; break; } #endif /* INET6 */ if (pp->addr.addr.type != PF_ADDR_ADDRMASK && pp->addr.addr.type != PF_ADDR_DYNIFTL && pp->addr.addr.type != PF_ADDR_TABLE) { error = EINVAL; break; } pa = malloc(sizeof(*pa), M_PFRULE, M_WAITOK); bcopy(&pp->addr, pa, sizeof(struct pf_pooladdr)); if (pa->ifname[0]) kif = malloc(sizeof(*kif), PFI_MTYPE, M_WAITOK); PF_RULES_WLOCK(); if (pp->ticket != V_ticket_pabuf) { PF_RULES_WUNLOCK(); if (pa->ifname[0]) free(kif, PFI_MTYPE); free(pa, M_PFRULE); error = EBUSY; break; } if (pa->ifname[0]) { pa->kif = pfi_kif_attach(kif, pa->ifname); pfi_kif_ref(pa->kif); } else pa->kif = NULL; if (pa->addr.type == PF_ADDR_DYNIFTL && ((error = pfi_dynaddr_setup(&pa->addr, pp->af)) != 0)) { if (pa->ifname[0]) pfi_kif_unref(pa->kif); PF_RULES_WUNLOCK(); free(pa, M_PFRULE); break; } TAILQ_INSERT_TAIL(&V_pf_pabuf, pa, entries); PF_RULES_WUNLOCK(); break; } case DIOCGETADDRS: { struct pfioc_pooladdr *pp = (struct pfioc_pooladdr *)addr; struct pf_pool *pool; struct pf_pooladdr *pa; PF_RULES_RLOCK(); pp->nr = 0; pool = pf_get_pool(pp->anchor, pp->ticket, pp->r_action, pp->r_num, 0, 1, 0); if (pool == NULL) { PF_RULES_RUNLOCK(); error = EBUSY; break; } TAILQ_FOREACH(pa, &pool->list, entries) pp->nr++; PF_RULES_RUNLOCK(); break; } case DIOCGETADDR: { struct pfioc_pooladdr *pp = (struct pfioc_pooladdr *)addr; struct pf_pool *pool; struct pf_pooladdr *pa; u_int32_t nr = 0; PF_RULES_RLOCK(); pool = pf_get_pool(pp->anchor, pp->ticket, pp->r_action, pp->r_num, 0, 1, 1); if (pool == NULL) { PF_RULES_RUNLOCK(); error = EBUSY; break; } pa = TAILQ_FIRST(&pool->list); while ((pa != NULL) && (nr < pp->nr)) { pa = TAILQ_NEXT(pa, entries); nr++; } if (pa == NULL) { PF_RULES_RUNLOCK(); error = EBUSY; break; } bcopy(pa, &pp->addr, sizeof(struct pf_pooladdr)); pf_addr_copyout(&pp->addr.addr); PF_RULES_RUNLOCK(); break; } case DIOCCHANGEADDR: { struct pfioc_pooladdr *pca = (struct pfioc_pooladdr *)addr; struct pf_pool *pool; struct pf_pooladdr *oldpa = NULL, *newpa = NULL; struct pf_ruleset *ruleset; struct pfi_kif *kif = NULL; if (pca->action < PF_CHANGE_ADD_HEAD || pca->action > PF_CHANGE_REMOVE) { error = EINVAL; break; } if (pca->addr.addr.type != PF_ADDR_ADDRMASK && pca->addr.addr.type != PF_ADDR_DYNIFTL && pca->addr.addr.type != PF_ADDR_TABLE) { error = EINVAL; break; } if (pca->action != PF_CHANGE_REMOVE) { 
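		/*
		 * For add/replace actions, allocate the new pool address
		 * (and its pfi_kif when an interface name is given) with
		 * M_WAITOK before taking the rules write lock, as malloc()
		 * may sleep.
		 */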
#ifndef INET if (pca->af == AF_INET) { error = EAFNOSUPPORT; break; } #endif /* INET */ #ifndef INET6 if (pca->af == AF_INET6) { error = EAFNOSUPPORT; break; } #endif /* INET6 */ newpa = malloc(sizeof(*newpa), M_PFRULE, M_WAITOK); bcopy(&pca->addr, newpa, sizeof(struct pf_pooladdr)); if (newpa->ifname[0]) kif = malloc(sizeof(*kif), PFI_MTYPE, M_WAITOK); newpa->kif = NULL; } #define ERROUT(x) { error = (x); goto DIOCCHANGEADDR_error; } PF_RULES_WLOCK(); ruleset = pf_find_ruleset(pca->anchor); if (ruleset == NULL) ERROUT(EBUSY); pool = pf_get_pool(pca->anchor, pca->ticket, pca->r_action, pca->r_num, pca->r_last, 1, 1); if (pool == NULL) ERROUT(EBUSY); if (pca->action != PF_CHANGE_REMOVE) { if (newpa->ifname[0]) { newpa->kif = pfi_kif_attach(kif, newpa->ifname); pfi_kif_ref(newpa->kif); kif = NULL; } switch (newpa->addr.type) { case PF_ADDR_DYNIFTL: error = pfi_dynaddr_setup(&newpa->addr, pca->af); break; case PF_ADDR_TABLE: newpa->addr.p.tbl = pfr_attach_table(ruleset, newpa->addr.v.tblname); if (newpa->addr.p.tbl == NULL) error = ENOMEM; break; } if (error) goto DIOCCHANGEADDR_error; } switch (pca->action) { case PF_CHANGE_ADD_HEAD: oldpa = TAILQ_FIRST(&pool->list); break; case PF_CHANGE_ADD_TAIL: oldpa = TAILQ_LAST(&pool->list, pf_palist); break; default: oldpa = TAILQ_FIRST(&pool->list); for (int i = 0; oldpa && i < pca->nr; i++) oldpa = TAILQ_NEXT(oldpa, entries); if (oldpa == NULL) ERROUT(EINVAL); } if (pca->action == PF_CHANGE_REMOVE) { TAILQ_REMOVE(&pool->list, oldpa, entries); switch (oldpa->addr.type) { case PF_ADDR_DYNIFTL: pfi_dynaddr_remove(oldpa->addr.p.dyn); break; case PF_ADDR_TABLE: pfr_detach_table(oldpa->addr.p.tbl); break; } if (oldpa->kif) pfi_kif_unref(oldpa->kif); free(oldpa, M_PFRULE); } else { if (oldpa == NULL) TAILQ_INSERT_TAIL(&pool->list, newpa, entries); else if (pca->action == PF_CHANGE_ADD_HEAD || pca->action == PF_CHANGE_ADD_BEFORE) TAILQ_INSERT_BEFORE(oldpa, newpa, entries); else TAILQ_INSERT_AFTER(&pool->list, oldpa, newpa, entries); } pool->cur = TAILQ_FIRST(&pool->list); PF_ACPY(&pool->counter, &pool->cur->addr.v.a.addr, pca->af); PF_RULES_WUNLOCK(); break; #undef ERROUT DIOCCHANGEADDR_error: if (newpa != NULL) { if (newpa->kif) pfi_kif_unref(newpa->kif); free(newpa, M_PFRULE); } PF_RULES_WUNLOCK(); if (kif != NULL) free(kif, PFI_MTYPE); break; } case DIOCGETRULESETS: { struct pfioc_ruleset *pr = (struct pfioc_ruleset *)addr; struct pf_ruleset *ruleset; struct pf_anchor *anchor; PF_RULES_RLOCK(); pr->path[sizeof(pr->path) - 1] = 0; if ((ruleset = pf_find_ruleset(pr->path)) == NULL) { PF_RULES_RUNLOCK(); error = ENOENT; break; } pr->nr = 0; if (ruleset->anchor == NULL) { /* XXX kludge for pf_main_ruleset */ RB_FOREACH(anchor, pf_anchor_global, &V_pf_anchors) if (anchor->parent == NULL) pr->nr++; } else { RB_FOREACH(anchor, pf_anchor_node, &ruleset->anchor->children) pr->nr++; } PF_RULES_RUNLOCK(); break; } case DIOCGETRULESET: { struct pfioc_ruleset *pr = (struct pfioc_ruleset *)addr; struct pf_ruleset *ruleset; struct pf_anchor *anchor; u_int32_t nr = 0; PF_RULES_RLOCK(); pr->path[sizeof(pr->path) - 1] = 0; if ((ruleset = pf_find_ruleset(pr->path)) == NULL) { PF_RULES_RUNLOCK(); error = ENOENT; break; } pr->name[0] = 0; if (ruleset->anchor == NULL) { /* XXX kludge for pf_main_ruleset */ RB_FOREACH(anchor, pf_anchor_global, &V_pf_anchors) if (anchor->parent == NULL && nr++ == pr->nr) { strlcpy(pr->name, anchor->name, sizeof(pr->name)); break; } } else { RB_FOREACH(anchor, pf_anchor_node, &ruleset->anchor->children) if (nr++ == pr->nr) { strlcpy(pr->name, 
anchor->name, sizeof(pr->name)); break; } } if (!pr->name[0]) error = EBUSY; PF_RULES_RUNLOCK(); break; } case DIOCRCLRTABLES: { struct pfioc_table *io = (struct pfioc_table *)addr; if (io->pfrio_esize != 0) { error = ENODEV; break; } PF_RULES_WLOCK(); error = pfr_clr_tables(&io->pfrio_table, &io->pfrio_ndel, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); break; } case DIOCRADDTABLES: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_table *pfrts; size_t totlen; if (io->pfrio_esize != sizeof(struct pfr_table)) { error = ENODEV; break; } - if (io->pfrio_size < 0 || io->pfrio_size > PF_TABLES_MAX_REQUEST) { + if (io->pfrio_size < 0 || io->pfrio_size > pf_ioctl_maxcount || + WOULD_OVERFLOW(io->pfrio_size, sizeof(struct pfr_table))) { error = ENOMEM; break; } totlen = io->pfrio_size * sizeof(struct pfr_table); pfrts = mallocarray(io->pfrio_size, sizeof(struct pfr_table), M_TEMP, M_WAITOK); error = copyin(io->pfrio_buffer, pfrts, totlen); if (error) { free(pfrts, M_TEMP); break; } PF_RULES_WLOCK(); error = pfr_add_tables(pfrts, io->pfrio_size, &io->pfrio_nadd, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); free(pfrts, M_TEMP); break; } case DIOCRDELTABLES: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_table *pfrts; size_t totlen; if (io->pfrio_esize != sizeof(struct pfr_table)) { error = ENODEV; break; } - if (io->pfrio_size < 0 || io->pfrio_size > PF_TABLES_MAX_REQUEST) { + if (io->pfrio_size < 0 || io->pfrio_size > pf_ioctl_maxcount || + WOULD_OVERFLOW(io->pfrio_size, sizeof(struct pfr_table))) { error = ENOMEM; break; } totlen = io->pfrio_size * sizeof(struct pfr_table); pfrts = mallocarray(io->pfrio_size, sizeof(struct pfr_table), M_TEMP, M_WAITOK); error = copyin(io->pfrio_buffer, pfrts, totlen); if (error) { free(pfrts, M_TEMP); break; } PF_RULES_WLOCK(); error = pfr_del_tables(pfrts, io->pfrio_size, &io->pfrio_ndel, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); free(pfrts, M_TEMP); break; } case DIOCRGETTABLES: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_table *pfrts; size_t totlen, n; if (io->pfrio_esize != sizeof(struct pfr_table)) { error = ENODEV; break; } PF_RULES_RLOCK(); n = pfr_table_count(&io->pfrio_table, io->pfrio_flags); io->pfrio_size = min(io->pfrio_size, n); totlen = io->pfrio_size * sizeof(struct pfr_table); pfrts = mallocarray(io->pfrio_size, sizeof(struct pfr_table), M_TEMP, M_NOWAIT); if (pfrts == NULL) { error = ENOMEM; PF_RULES_RUNLOCK(); break; } error = pfr_get_tables(&io->pfrio_table, pfrts, &io->pfrio_size, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_RUNLOCK(); if (error == 0) error = copyout(pfrts, io->pfrio_buffer, totlen); free(pfrts, M_TEMP); break; } case DIOCRGETTSTATS: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_tstats *pfrtstats; size_t totlen, n; if (io->pfrio_esize != sizeof(struct pfr_tstats)) { error = ENODEV; break; } PF_RULES_WLOCK(); n = pfr_table_count(&io->pfrio_table, io->pfrio_flags); io->pfrio_size = min(io->pfrio_size, n); totlen = io->pfrio_size * sizeof(struct pfr_tstats); pfrtstats = mallocarray(io->pfrio_size, sizeof(struct pfr_tstats), M_TEMP, M_NOWAIT); if (pfrtstats == NULL) { error = ENOMEM; PF_RULES_WUNLOCK(); break; } error = pfr_get_tstats(&io->pfrio_table, pfrtstats, &io->pfrio_size, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); if (error == 0) error = copyout(pfrtstats, io->pfrio_buffer, totlen); free(pfrtstats, M_TEMP); break; } case DIOCRCLRTSTATS: { struct pfioc_table *io = (struct pfioc_table 
*)addr; struct pfr_table *pfrts; size_t totlen, n; if (io->pfrio_esize != sizeof(struct pfr_table)) { error = ENODEV; break; } PF_RULES_WLOCK(); n = pfr_table_count(&io->pfrio_table, io->pfrio_flags); io->pfrio_size = min(io->pfrio_size, n); totlen = io->pfrio_size * sizeof(struct pfr_table); pfrts = mallocarray(io->pfrio_size, sizeof(struct pfr_table), M_TEMP, M_NOWAIT); if (pfrts == NULL) { error = ENOMEM; PF_RULES_WUNLOCK(); break; } error = copyin(io->pfrio_buffer, pfrts, totlen); if (error) { free(pfrts, M_TEMP); PF_RULES_WUNLOCK(); break; } error = pfr_clr_tstats(pfrts, io->pfrio_size, &io->pfrio_nzero, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); free(pfrts, M_TEMP); break; } case DIOCRSETTFLAGS: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_table *pfrts; size_t totlen, n; if (io->pfrio_esize != sizeof(struct pfr_table)) { error = ENODEV; break; } PF_RULES_WLOCK(); n = pfr_table_count(&io->pfrio_table, io->pfrio_flags); io->pfrio_size = min(io->pfrio_size, n); totlen = io->pfrio_size * sizeof(struct pfr_table); pfrts = mallocarray(io->pfrio_size, sizeof(struct pfr_table), M_TEMP, M_NOWAIT); if (pfrts == NULL) { error = ENOMEM; PF_RULES_WUNLOCK(); break; } error = copyin(io->pfrio_buffer, pfrts, totlen); if (error) { free(pfrts, M_TEMP); PF_RULES_WUNLOCK(); break; } error = pfr_set_tflags(pfrts, io->pfrio_size, io->pfrio_setflag, io->pfrio_clrflag, &io->pfrio_nchange, &io->pfrio_ndel, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); free(pfrts, M_TEMP); break; } case DIOCRCLRADDRS: { struct pfioc_table *io = (struct pfioc_table *)addr; if (io->pfrio_esize != 0) { error = ENODEV; break; } PF_RULES_WLOCK(); error = pfr_clr_addrs(&io->pfrio_table, &io->pfrio_ndel, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); break; } case DIOCRADDADDRS: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_addr *pfras; size_t totlen; if (io->pfrio_esize != sizeof(struct pfr_addr)) { error = ENODEV; break; } if (io->pfrio_size < 0 || + io->pfrio_size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->pfrio_size, sizeof(struct pfr_addr))) { error = EINVAL; break; } totlen = io->pfrio_size * sizeof(struct pfr_addr); pfras = mallocarray(io->pfrio_size, sizeof(struct pfr_addr), M_TEMP, M_NOWAIT); if (! pfras) { error = ENOMEM; break; } error = copyin(io->pfrio_buffer, pfras, totlen); if (error) { free(pfras, M_TEMP); break; } PF_RULES_WLOCK(); error = pfr_add_addrs(&io->pfrio_table, pfras, io->pfrio_size, &io->pfrio_nadd, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); if (error == 0 && io->pfrio_flags & PFR_FLAG_FEEDBACK) error = copyout(pfras, io->pfrio_buffer, totlen); free(pfras, M_TEMP); break; } case DIOCRDELADDRS: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_addr *pfras; size_t totlen; if (io->pfrio_esize != sizeof(struct pfr_addr)) { error = ENODEV; break; } if (io->pfrio_size < 0 || + io->pfrio_size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->pfrio_size, sizeof(struct pfr_addr))) { error = EINVAL; break; } totlen = io->pfrio_size * sizeof(struct pfr_addr); pfras = mallocarray(io->pfrio_size, sizeof(struct pfr_addr), M_TEMP, M_NOWAIT); if (! 
pfras) { error = ENOMEM; break; } error = copyin(io->pfrio_buffer, pfras, totlen); if (error) { free(pfras, M_TEMP); break; } PF_RULES_WLOCK(); error = pfr_del_addrs(&io->pfrio_table, pfras, io->pfrio_size, &io->pfrio_ndel, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); if (error == 0 && io->pfrio_flags & PFR_FLAG_FEEDBACK) error = copyout(pfras, io->pfrio_buffer, totlen); free(pfras, M_TEMP); break; } case DIOCRSETADDRS: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_addr *pfras; size_t totlen, count; if (io->pfrio_esize != sizeof(struct pfr_addr)) { error = ENODEV; break; } if (io->pfrio_size < 0 || io->pfrio_size2 < 0) { error = EINVAL; break; } count = max(io->pfrio_size, io->pfrio_size2); - if (WOULD_OVERFLOW(count, sizeof(struct pfr_addr))) { + if (count > pf_ioctl_maxcount || + WOULD_OVERFLOW(count, sizeof(struct pfr_addr))) { error = EINVAL; break; } totlen = count * sizeof(struct pfr_addr); pfras = mallocarray(count, sizeof(struct pfr_addr), M_TEMP, M_NOWAIT); if (! pfras) { error = ENOMEM; break; } error = copyin(io->pfrio_buffer, pfras, totlen); if (error) { free(pfras, M_TEMP); break; } PF_RULES_WLOCK(); error = pfr_set_addrs(&io->pfrio_table, pfras, io->pfrio_size, &io->pfrio_size2, &io->pfrio_nadd, &io->pfrio_ndel, &io->pfrio_nchange, io->pfrio_flags | PFR_FLAG_USERIOCTL, 0); PF_RULES_WUNLOCK(); if (error == 0 && io->pfrio_flags & PFR_FLAG_FEEDBACK) error = copyout(pfras, io->pfrio_buffer, totlen); free(pfras, M_TEMP); break; } case DIOCRGETADDRS: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_addr *pfras; size_t totlen; if (io->pfrio_esize != sizeof(struct pfr_addr)) { error = ENODEV; break; } if (io->pfrio_size < 0 || + io->pfrio_size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->pfrio_size, sizeof(struct pfr_addr))) { error = EINVAL; break; } totlen = io->pfrio_size * sizeof(struct pfr_addr); pfras = mallocarray(io->pfrio_size, sizeof(struct pfr_addr), M_TEMP, M_NOWAIT); if (! pfras) { error = ENOMEM; break; } PF_RULES_RLOCK(); error = pfr_get_addrs(&io->pfrio_table, pfras, &io->pfrio_size, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_RUNLOCK(); if (error == 0) error = copyout(pfras, io->pfrio_buffer, totlen); free(pfras, M_TEMP); break; } case DIOCRGETASTATS: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_astats *pfrastats; size_t totlen; if (io->pfrio_esize != sizeof(struct pfr_astats)) { error = ENODEV; break; } if (io->pfrio_size < 0 || + io->pfrio_size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->pfrio_size, sizeof(struct pfr_astats))) { error = EINVAL; break; } totlen = io->pfrio_size * sizeof(struct pfr_astats); pfrastats = mallocarray(io->pfrio_size, sizeof(struct pfr_astats), M_TEMP, M_NOWAIT); if (! pfrastats) { error = ENOMEM; break; } PF_RULES_RLOCK(); error = pfr_get_astats(&io->pfrio_table, pfrastats, &io->pfrio_size, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_RUNLOCK(); if (error == 0) error = copyout(pfrastats, io->pfrio_buffer, totlen); free(pfrastats, M_TEMP); break; } case DIOCRCLRASTATS: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_addr *pfras; size_t totlen; if (io->pfrio_esize != sizeof(struct pfr_addr)) { error = ENODEV; break; } if (io->pfrio_size < 0 || + io->pfrio_size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->pfrio_size, sizeof(struct pfr_addr))) { error = EINVAL; break; } totlen = io->pfrio_size * sizeof(struct pfr_addr); pfras = mallocarray(io->pfrio_size, sizeof(struct pfr_addr), M_TEMP, M_NOWAIT); if (! 
pfras) { error = ENOMEM; break; } error = copyin(io->pfrio_buffer, pfras, totlen); if (error) { free(pfras, M_TEMP); break; } PF_RULES_WLOCK(); error = pfr_clr_astats(&io->pfrio_table, pfras, io->pfrio_size, &io->pfrio_nzero, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); if (error == 0 && io->pfrio_flags & PFR_FLAG_FEEDBACK) error = copyout(pfras, io->pfrio_buffer, totlen); free(pfras, M_TEMP); break; } case DIOCRTSTADDRS: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_addr *pfras; size_t totlen; if (io->pfrio_esize != sizeof(struct pfr_addr)) { error = ENODEV; break; } if (io->pfrio_size < 0 || + io->pfrio_size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->pfrio_size, sizeof(struct pfr_addr))) { error = EINVAL; break; } totlen = io->pfrio_size * sizeof(struct pfr_addr); pfras = mallocarray(io->pfrio_size, sizeof(struct pfr_addr), M_TEMP, M_NOWAIT); if (! pfras) { error = ENOMEM; break; } error = copyin(io->pfrio_buffer, pfras, totlen); if (error) { free(pfras, M_TEMP); break; } PF_RULES_RLOCK(); error = pfr_tst_addrs(&io->pfrio_table, pfras, io->pfrio_size, &io->pfrio_nmatch, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_RUNLOCK(); if (error == 0) error = copyout(pfras, io->pfrio_buffer, totlen); free(pfras, M_TEMP); break; } case DIOCRINADEFINE: { struct pfioc_table *io = (struct pfioc_table *)addr; struct pfr_addr *pfras; size_t totlen; if (io->pfrio_esize != sizeof(struct pfr_addr)) { error = ENODEV; break; } if (io->pfrio_size < 0 || + io->pfrio_size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->pfrio_size, sizeof(struct pfr_addr))) { error = EINVAL; break; } totlen = io->pfrio_size * sizeof(struct pfr_addr); pfras = mallocarray(io->pfrio_size, sizeof(struct pfr_addr), M_TEMP, M_NOWAIT); if (! pfras) { error = ENOMEM; break; } error = copyin(io->pfrio_buffer, pfras, totlen); if (error) { free(pfras, M_TEMP); break; } PF_RULES_WLOCK(); error = pfr_ina_define(&io->pfrio_table, pfras, io->pfrio_size, &io->pfrio_nadd, &io->pfrio_naddr, io->pfrio_ticket, io->pfrio_flags | PFR_FLAG_USERIOCTL); PF_RULES_WUNLOCK(); free(pfras, M_TEMP); break; } case DIOCOSFPADD: { struct pf_osfp_ioctl *io = (struct pf_osfp_ioctl *)addr; PF_RULES_WLOCK(); error = pf_osfp_add(io); PF_RULES_WUNLOCK(); break; } case DIOCOSFPGET: { struct pf_osfp_ioctl *io = (struct pf_osfp_ioctl *)addr; PF_RULES_RLOCK(); error = pf_osfp_get(io); PF_RULES_RUNLOCK(); break; } case DIOCXBEGIN: { struct pfioc_trans *io = (struct pfioc_trans *)addr; struct pfioc_trans_e *ioes, *ioe; size_t totlen; int i; if (io->esize != sizeof(*ioe)) { error = ENODEV; break; } if (io->size < 0 || + io->size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->size, sizeof(struct pfioc_trans_e))) { error = EINVAL; break; } totlen = sizeof(struct pfioc_trans_e) * io->size; ioes = mallocarray(io->size, sizeof(struct pfioc_trans_e), M_TEMP, M_NOWAIT); if (! 
ioes) { error = ENOMEM; break; } error = copyin(io->array, ioes, totlen); if (error) { free(ioes, M_TEMP); break; } PF_RULES_WLOCK(); for (i = 0, ioe = ioes; i < io->size; i++, ioe++) { switch (ioe->rs_num) { #ifdef ALTQ case PF_RULESET_ALTQ: if (ioe->anchor[0]) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); error = EINVAL; goto fail; } if ((error = pf_begin_altq(&ioe->ticket))) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); goto fail; } break; #endif /* ALTQ */ case PF_RULESET_TABLE: { struct pfr_table table; bzero(&table, sizeof(table)); strlcpy(table.pfrt_anchor, ioe->anchor, sizeof(table.pfrt_anchor)); if ((error = pfr_ina_begin(&table, &ioe->ticket, NULL, 0))) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); goto fail; } break; } default: if ((error = pf_begin_rules(&ioe->ticket, ioe->rs_num, ioe->anchor))) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); goto fail; } break; } } PF_RULES_WUNLOCK(); error = copyout(ioes, io->array, totlen); free(ioes, M_TEMP); break; } case DIOCXROLLBACK: { struct pfioc_trans *io = (struct pfioc_trans *)addr; struct pfioc_trans_e *ioe, *ioes; size_t totlen; int i; if (io->esize != sizeof(*ioe)) { error = ENODEV; break; } if (io->size < 0 || + io->size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->size, sizeof(struct pfioc_trans_e))) { error = EINVAL; break; } totlen = sizeof(struct pfioc_trans_e) * io->size; ioes = mallocarray(io->size, sizeof(struct pfioc_trans_e), M_TEMP, M_NOWAIT); if (! ioes) { error = ENOMEM; break; } error = copyin(io->array, ioes, totlen); if (error) { free(ioes, M_TEMP); break; } PF_RULES_WLOCK(); for (i = 0, ioe = ioes; i < io->size; i++, ioe++) { switch (ioe->rs_num) { #ifdef ALTQ case PF_RULESET_ALTQ: if (ioe->anchor[0]) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); error = EINVAL; goto fail; } if ((error = pf_rollback_altq(ioe->ticket))) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); goto fail; /* really bad */ } break; #endif /* ALTQ */ case PF_RULESET_TABLE: { struct pfr_table table; bzero(&table, sizeof(table)); strlcpy(table.pfrt_anchor, ioe->anchor, sizeof(table.pfrt_anchor)); if ((error = pfr_ina_rollback(&table, ioe->ticket, NULL, 0))) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); goto fail; /* really bad */ } break; } default: if ((error = pf_rollback_rules(ioe->ticket, ioe->rs_num, ioe->anchor))) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); goto fail; /* really bad */ } break; } } PF_RULES_WUNLOCK(); free(ioes, M_TEMP); break; } case DIOCXCOMMIT: { struct pfioc_trans *io = (struct pfioc_trans *)addr; struct pfioc_trans_e *ioe, *ioes; struct pf_ruleset *rs; size_t totlen; int i; if (io->esize != sizeof(*ioe)) { error = ENODEV; break; } if (io->size < 0 || + io->size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->size, sizeof(struct pfioc_trans_e))) { error = EINVAL; break; } totlen = sizeof(struct pfioc_trans_e) * io->size; ioes = mallocarray(io->size, sizeof(struct pfioc_trans_e), M_TEMP, M_NOWAIT); if (ioes == NULL) { error = ENOMEM; break; } error = copyin(io->array, ioes, totlen); if (error) { free(ioes, M_TEMP); break; } PF_RULES_WLOCK(); /* First makes sure everything will succeed. 
*/ for (i = 0, ioe = ioes; i < io->size; i++, ioe++) { switch (ioe->rs_num) { #ifdef ALTQ case PF_RULESET_ALTQ: if (ioe->anchor[0]) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); error = EINVAL; goto fail; } if (!V_altqs_inactive_open || ioe->ticket != V_ticket_altqs_inactive) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); error = EBUSY; goto fail; } break; #endif /* ALTQ */ case PF_RULESET_TABLE: rs = pf_find_ruleset(ioe->anchor); if (rs == NULL || !rs->topen || ioe->ticket != rs->tticket) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); error = EBUSY; goto fail; } break; default: if (ioe->rs_num < 0 || ioe->rs_num >= PF_RULESET_MAX) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); error = EINVAL; goto fail; } rs = pf_find_ruleset(ioe->anchor); if (rs == NULL || !rs->rules[ioe->rs_num].inactive.open || rs->rules[ioe->rs_num].inactive.ticket != ioe->ticket) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); error = EBUSY; goto fail; } break; } } /* Now do the commit - no errors should happen here. */ for (i = 0, ioe = ioes; i < io->size; i++, ioe++) { switch (ioe->rs_num) { #ifdef ALTQ case PF_RULESET_ALTQ: if ((error = pf_commit_altq(ioe->ticket))) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); goto fail; /* really bad */ } break; #endif /* ALTQ */ case PF_RULESET_TABLE: { struct pfr_table table; bzero(&table, sizeof(table)); strlcpy(table.pfrt_anchor, ioe->anchor, sizeof(table.pfrt_anchor)); if ((error = pfr_ina_commit(&table, ioe->ticket, NULL, NULL, 0))) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); goto fail; /* really bad */ } break; } default: if ((error = pf_commit_rules(ioe->ticket, ioe->rs_num, ioe->anchor))) { PF_RULES_WUNLOCK(); free(ioes, M_TEMP); goto fail; /* really bad */ } break; } } PF_RULES_WUNLOCK(); free(ioes, M_TEMP); break; } case DIOCGETSRCNODES: { struct pfioc_src_nodes *psn = (struct pfioc_src_nodes *)addr; struct pf_srchash *sh; struct pf_src_node *n, *p, *pstore; uint32_t i, nr = 0; if (psn->psn_len == 0) { for (i = 0, sh = V_pf_srchash; i <= pf_srchashmask; i++, sh++) { PF_HASHROW_LOCK(sh); LIST_FOREACH(n, &sh->nodes, entry) nr++; PF_HASHROW_UNLOCK(sh); } psn->psn_len = sizeof(struct pf_src_node) * nr; break; } p = pstore = malloc(psn->psn_len, M_TEMP, M_WAITOK); for (i = 0, sh = V_pf_srchash; i <= pf_srchashmask; i++, sh++) { PF_HASHROW_LOCK(sh); LIST_FOREACH(n, &sh->nodes, entry) { int secs = time_uptime, diff; if ((nr + 1) * sizeof(*p) > (unsigned)psn->psn_len) break; bcopy(n, p, sizeof(struct pf_src_node)); if (n->rule.ptr != NULL) p->rule.nr = n->rule.ptr->nr; p->creation = secs - p->creation; if (p->expire > secs) p->expire -= secs; else p->expire = 0; /* Adjust the connection rate estimate. 
*/ diff = secs - n->conn_rate.last; if (diff >= n->conn_rate.seconds) p->conn_rate.count = 0; else p->conn_rate.count -= n->conn_rate.count * diff / n->conn_rate.seconds; p++; nr++; } PF_HASHROW_UNLOCK(sh); } error = copyout(pstore, psn->psn_src_nodes, sizeof(struct pf_src_node) * nr); if (error) { free(pstore, M_TEMP); break; } psn->psn_len = sizeof(struct pf_src_node) * nr; free(pstore, M_TEMP); break; } case DIOCCLRSRCNODES: { pf_clear_srcnodes(NULL); pf_purge_expired_src_nodes(); break; } case DIOCKILLSRCNODES: pf_kill_srcnodes((struct pfioc_src_node_kill *)addr); break; case DIOCSETHOSTID: { u_int32_t *hostid = (u_int32_t *)addr; PF_RULES_WLOCK(); if (*hostid == 0) V_pf_status.hostid = arc4random(); else V_pf_status.hostid = *hostid; PF_RULES_WUNLOCK(); break; } case DIOCOSFPFLUSH: PF_RULES_WLOCK(); pf_osfp_flush(); PF_RULES_WUNLOCK(); break; case DIOCIGETIFACES: { struct pfioc_iface *io = (struct pfioc_iface *)addr; struct pfi_kif *ifstore; size_t bufsiz; if (io->pfiio_esize != sizeof(struct pfi_kif)) { error = ENODEV; break; } if (io->pfiio_size < 0 || + io->pfiio_size > pf_ioctl_maxcount || WOULD_OVERFLOW(io->pfiio_size, sizeof(struct pfi_kif))) { error = EINVAL; break; } bufsiz = io->pfiio_size * sizeof(struct pfi_kif); ifstore = mallocarray(io->pfiio_size, sizeof(struct pfi_kif), M_TEMP, M_NOWAIT); if (ifstore == NULL) { error = ENOMEM; break; } PF_RULES_RLOCK(); pfi_get_ifaces(io->pfiio_name, ifstore, &io->pfiio_size); PF_RULES_RUNLOCK(); error = copyout(ifstore, io->pfiio_buffer, bufsiz); free(ifstore, M_TEMP); break; } case DIOCSETIFFLAG: { struct pfioc_iface *io = (struct pfioc_iface *)addr; PF_RULES_WLOCK(); error = pfi_set_flags(io->pfiio_name, io->pfiio_flags); PF_RULES_WUNLOCK(); break; } case DIOCCLRIFFLAG: { struct pfioc_iface *io = (struct pfioc_iface *)addr; PF_RULES_WLOCK(); error = pfi_clear_flags(io->pfiio_name, io->pfiio_flags); PF_RULES_WUNLOCK(); break; } default: error = ENODEV; break; } fail: if (sx_xlocked(&pf_ioctl_lock)) sx_xunlock(&pf_ioctl_lock); CURVNET_RESTORE(); return (error); } void pfsync_state_export(struct pfsync_state *sp, struct pf_state *st) { bzero(sp, sizeof(struct pfsync_state)); /* copy from state key */ sp->key[PF_SK_WIRE].addr[0] = st->key[PF_SK_WIRE]->addr[0]; sp->key[PF_SK_WIRE].addr[1] = st->key[PF_SK_WIRE]->addr[1]; sp->key[PF_SK_WIRE].port[0] = st->key[PF_SK_WIRE]->port[0]; sp->key[PF_SK_WIRE].port[1] = st->key[PF_SK_WIRE]->port[1]; sp->key[PF_SK_STACK].addr[0] = st->key[PF_SK_STACK]->addr[0]; sp->key[PF_SK_STACK].addr[1] = st->key[PF_SK_STACK]->addr[1]; sp->key[PF_SK_STACK].port[0] = st->key[PF_SK_STACK]->port[0]; sp->key[PF_SK_STACK].port[1] = st->key[PF_SK_STACK]->port[1]; sp->proto = st->key[PF_SK_WIRE]->proto; sp->af = st->key[PF_SK_WIRE]->af; /* copy from state */ strlcpy(sp->ifname, st->kif->pfik_name, sizeof(sp->ifname)); bcopy(&st->rt_addr, &sp->rt_addr, sizeof(sp->rt_addr)); sp->creation = htonl(time_uptime - st->creation); sp->expire = pf_state_expires(st); if (sp->expire <= time_uptime) sp->expire = htonl(0); else sp->expire = htonl(sp->expire - time_uptime); sp->direction = st->direction; sp->log = st->log; sp->timeout = st->timeout; sp->state_flags = st->state_flags; if (st->src_node) sp->sync_flags |= PFSYNC_FLAG_SRCNODE; if (st->nat_src_node) sp->sync_flags |= PFSYNC_FLAG_NATSRCNODE; sp->id = st->id; sp->creatorid = st->creatorid; pf_state_peer_hton(&st->src, &sp->src); pf_state_peer_hton(&st->dst, &sp->dst); if (st->rule.ptr == NULL) sp->rule = htonl(-1); else sp->rule = htonl(st->rule.ptr->nr); if (st->anchor.ptr == 
NULL) sp->anchor = htonl(-1); else sp->anchor = htonl(st->anchor.ptr->nr); if (st->nat_rule.ptr == NULL) sp->nat_rule = htonl(-1); else sp->nat_rule = htonl(st->nat_rule.ptr->nr); pf_state_counter_hton(st->packets[0], sp->packets[0]); pf_state_counter_hton(st->packets[1], sp->packets[1]); pf_state_counter_hton(st->bytes[0], sp->bytes[0]); pf_state_counter_hton(st->bytes[1], sp->bytes[1]); } static void pf_tbladdr_copyout(struct pf_addr_wrap *aw) { struct pfr_ktable *kt; KASSERT(aw->type == PF_ADDR_TABLE, ("%s: type %u", __func__, aw->type)); kt = aw->p.tbl; if (!(kt->pfrkt_flags & PFR_TFLAG_ACTIVE) && kt->pfrkt_root != NULL) kt = kt->pfrkt_root; aw->p.tbl = NULL; aw->p.tblcnt = (kt->pfrkt_flags & PFR_TFLAG_ACTIVE) ? kt->pfrkt_cnt : -1; } /* * XXX - Check for version mismatch!!! */ static void pf_clear_states(void) { struct pf_state *s; u_int i; for (i = 0; i <= pf_hashmask; i++) { struct pf_idhash *ih = &V_pf_idhash[i]; relock: PF_HASHROW_LOCK(ih); LIST_FOREACH(s, &ih->states, entry) { s->timeout = PFTM_PURGE; /* Don't send out individual delete messages. */ s->state_flags |= PFSTATE_NOSYNC; pf_unlink_state(s, PF_ENTER_LOCKED); goto relock; } PF_HASHROW_UNLOCK(ih); } } static int pf_clear_tables(void) { struct pfioc_table io; int error; bzero(&io, sizeof(io)); error = pfr_clr_tables(&io.pfrio_table, &io.pfrio_ndel, io.pfrio_flags); return (error); } static void pf_clear_srcnodes(struct pf_src_node *n) { struct pf_state *s; int i; for (i = 0; i <= pf_hashmask; i++) { struct pf_idhash *ih = &V_pf_idhash[i]; PF_HASHROW_LOCK(ih); LIST_FOREACH(s, &ih->states, entry) { if (n == NULL || n == s->src_node) s->src_node = NULL; if (n == NULL || n == s->nat_src_node) s->nat_src_node = NULL; } PF_HASHROW_UNLOCK(ih); } if (n == NULL) { struct pf_srchash *sh; for (i = 0, sh = V_pf_srchash; i <= pf_srchashmask; i++, sh++) { PF_HASHROW_LOCK(sh); LIST_FOREACH(n, &sh->nodes, entry) { n->expire = 1; n->states = 0; } PF_HASHROW_UNLOCK(sh); } } else { /* XXX: hash slot should already be locked here. */ n->expire = 1; n->states = 0; } } static void pf_kill_srcnodes(struct pfioc_src_node_kill *psnk) { struct pf_src_node_list kill; LIST_INIT(&kill); for (int i = 0; i <= pf_srchashmask; i++) { struct pf_srchash *sh = &V_pf_srchash[i]; struct pf_src_node *sn, *tmp; PF_HASHROW_LOCK(sh); LIST_FOREACH_SAFE(sn, &sh->nodes, entry, tmp) if (PF_MATCHA(psnk->psnk_src.neg, &psnk->psnk_src.addr.v.a.addr, &psnk->psnk_src.addr.v.a.mask, &sn->addr, sn->af) && PF_MATCHA(psnk->psnk_dst.neg, &psnk->psnk_dst.addr.v.a.addr, &psnk->psnk_dst.addr.v.a.mask, &sn->raddr, sn->af)) { pf_unlink_src_node(sn); LIST_INSERT_HEAD(&kill, sn, entry); sn->expire = 1; } PF_HASHROW_UNLOCK(sh); } for (int i = 0; i <= pf_hashmask; i++) { struct pf_idhash *ih = &V_pf_idhash[i]; struct pf_state *s; PF_HASHROW_LOCK(ih); LIST_FOREACH(s, &ih->states, entry) { if (s->src_node && s->src_node->expire == 1) s->src_node = NULL; if (s->nat_src_node && s->nat_src_node->expire == 1) s->nat_src_node = NULL; } PF_HASHROW_UNLOCK(ih); } psnk->psnk_killed = pf_free_src_nodes(&kill); } /* * XXX - Check for version mismatch!!! */ /* * Duplicate pfctl -Fa operation to get rid of as much as we can. */ static int shutdown_pf(void) { int error = 0; u_int32_t t[5]; char nn = '\0'; do { if ((error = pf_begin_rules(&t[0], PF_RULESET_SCRUB, &nn)) != 0) { DPFPRINTF(PF_DEBUG_MISC, ("shutdown_pf: SCRUB\n")); break; } if ((error = pf_begin_rules(&t[1], PF_RULESET_FILTER, &nn)) != 0) { DPFPRINTF(PF_DEBUG_MISC, ("shutdown_pf: FILTER\n")); break; /* XXX: rollback? */ } if ((error = pf_begin_rules(&t[2], PF_RULESET_NAT, &nn)) != 0) { DPFPRINTF(PF_DEBUG_MISC, ("shutdown_pf: NAT\n")); break; /* XXX: rollback? */ } if ((error = pf_begin_rules(&t[3], PF_RULESET_BINAT, &nn)) != 0) { DPFPRINTF(PF_DEBUG_MISC, ("shutdown_pf: BINAT\n")); break; /* XXX: rollback? */ } if ((error = pf_begin_rules(&t[4], PF_RULESET_RDR, &nn)) != 0) { DPFPRINTF(PF_DEBUG_MISC, ("shutdown_pf: RDR\n")); break; /* XXX: rollback? */ } /* XXX: these should always succeed here */ pf_commit_rules(t[0], PF_RULESET_SCRUB, &nn); pf_commit_rules(t[1], PF_RULESET_FILTER, &nn); pf_commit_rules(t[2], PF_RULESET_NAT, &nn); pf_commit_rules(t[3], PF_RULESET_BINAT, &nn); pf_commit_rules(t[4], PF_RULESET_RDR, &nn); if ((error = pf_clear_tables()) != 0) break; #ifdef ALTQ if ((error = pf_begin_altq(&t[0])) != 0) { DPFPRINTF(PF_DEBUG_MISC, ("shutdown_pf: ALTQ\n")); break; } pf_commit_altq(t[0]); #endif pf_clear_states(); pf_clear_srcnodes(NULL); /* status does not use malloced mem so no need to cleanup */ /* fingerprints and interfaces have their own cleanup code */ /* Free counters last as we updated them during shutdown. */ counter_u64_free(V_pf_default_rule.states_cur); counter_u64_free(V_pf_default_rule.states_tot); counter_u64_free(V_pf_default_rule.src_nodes); for (int i = 0; i < PFRES_MAX; i++) counter_u64_free(V_pf_status.counters[i]); for (int i = 0; i < LCNT_MAX; i++) counter_u64_free(V_pf_status.lcounters[i]); for (int i = 0; i < FCNT_MAX; i++) counter_u64_free(V_pf_status.fcounters[i]); for (int i = 0; i < SCNT_MAX; i++) counter_u64_free(V_pf_status.scounters[i]); } while(0); return (error); }
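
shutdown_pf() above leans on the do { ... } while (0) construct so that every teardown step can abort to a single exit point with break. A minimal standalone sketch of the idiom follows; the step functions are hypothetical stand-ins for pf_begin_rules() and friends, not pf code:

    #include <errno.h>

    static int step_one(void) { return (0); }      /* hypothetical steps, */
    static int step_two(void) { return (EBUSY); }  /* not part of pf */

    static int
    teardown_sketch(void)
    {
            int error = 0;

            do {
                    if ((error = step_one()) != 0)
                            break;  /* jump straight to the common exit */
                    if ((error = step_two()) != 0)
                            break;
                    /* further steps ... */
            } while (0);            /* the body runs exactly once */

            return (error);
    }
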
#ifdef INET static int pf_check_in(void *arg, struct mbuf **m, struct ifnet *ifp, int dir, int flags, struct inpcb *inp) { int chk; chk = pf_test(PF_IN, flags, ifp, m, inp); if (chk && *m) { m_freem(*m); *m = NULL; } if (chk != PF_PASS) return (EACCES); return (0); } static int pf_check_out(void *arg, struct mbuf **m, struct ifnet *ifp, int dir, int flags, struct inpcb *inp) { int chk; chk = pf_test(PF_OUT, flags, ifp, m, inp); if (chk && *m) { m_freem(*m); *m = NULL; } if (chk != PF_PASS) return (EACCES); return (0); } #endif #ifdef INET6 static int pf_check6_in(void *arg, struct mbuf **m, struct ifnet *ifp, int dir, int flags, struct inpcb *inp) { int chk; /* * In case of loopback traffic IPv6 uses the real interface in * order to support scoped addresses. In order to support stateful * filtering we have to change this to lo0 as it is the case in IPv4. */ CURVNET_SET(ifp->if_vnet); chk = pf_test6(PF_IN, flags, (*m)->m_flags & M_LOOP ?
V_loif : ifp, m, inp); CURVNET_RESTORE(); if (chk && *m) { m_freem(*m); *m = NULL; } if (chk != PF_PASS) return (EACCES); return (0); } static int pf_check6_out(void *arg, struct mbuf **m, struct ifnet *ifp, int dir, int flags, struct inpcb *inp) { int chk; CURVNET_SET(ifp->if_vnet); chk = pf_test6(PF_OUT, flags, ifp, m, inp); CURVNET_RESTORE(); if (chk && *m) { m_freem(*m); *m = NULL; } if (chk != PF_PASS) return (EACCES); return (0); } #endif /* INET6 */ static int hook_pf(void) { #ifdef INET struct pfil_head *pfh_inet; #endif #ifdef INET6 struct pfil_head *pfh_inet6; #endif if (V_pf_pfil_hooked) return (0); #ifdef INET pfh_inet = pfil_head_get(PFIL_TYPE_AF, AF_INET); if (pfh_inet == NULL) return (ESRCH); /* XXX */ pfil_add_hook_flags(pf_check_in, NULL, PFIL_IN | PFIL_WAITOK, pfh_inet); pfil_add_hook_flags(pf_check_out, NULL, PFIL_OUT | PFIL_WAITOK, pfh_inet); #endif #ifdef INET6 pfh_inet6 = pfil_head_get(PFIL_TYPE_AF, AF_INET6); if (pfh_inet6 == NULL) { #ifdef INET pfil_remove_hook_flags(pf_check_in, NULL, PFIL_IN | PFIL_WAITOK, pfh_inet); pfil_remove_hook_flags(pf_check_out, NULL, PFIL_OUT | PFIL_WAITOK, pfh_inet); #endif return (ESRCH); /* XXX */ } pfil_add_hook_flags(pf_check6_in, NULL, PFIL_IN | PFIL_WAITOK, pfh_inet6); pfil_add_hook_flags(pf_check6_out, NULL, PFIL_OUT | PFIL_WAITOK, pfh_inet6); #endif V_pf_pfil_hooked = 1; return (0); } static int dehook_pf(void) { #ifdef INET struct pfil_head *pfh_inet; #endif #ifdef INET6 struct pfil_head *pfh_inet6; #endif if (V_pf_pfil_hooked == 0) return (0); #ifdef INET pfh_inet = pfil_head_get(PFIL_TYPE_AF, AF_INET); if (pfh_inet == NULL) return (ESRCH); /* XXX */ pfil_remove_hook_flags(pf_check_in, NULL, PFIL_IN | PFIL_WAITOK, pfh_inet); pfil_remove_hook_flags(pf_check_out, NULL, PFIL_OUT | PFIL_WAITOK, pfh_inet); #endif #ifdef INET6 pfh_inet6 = pfil_head_get(PFIL_TYPE_AF, AF_INET6); if (pfh_inet6 == NULL) return (ESRCH); /* XXX */ pfil_remove_hook_flags(pf_check6_in, NULL, PFIL_IN | PFIL_WAITOK, pfh_inet6); pfil_remove_hook_flags(pf_check6_out, NULL, PFIL_OUT | PFIL_WAITOK, pfh_inet6); #endif V_pf_pfil_hooked = 0; return (0); } static void pf_load_vnet(void) { TAILQ_INIT(&V_pf_tags); TAILQ_INIT(&V_pf_qids); pfattach_vnet(); V_pf_vnet_active = 1; } static int pf_load(void) { int error; rw_init(&pf_rules_lock, "pf rulesets"); sx_init(&pf_ioctl_lock, "pf ioctl"); sx_init(&pf_end_lock, "pf end thread"); pf_mtag_initialize(); pf_dev = make_dev(&pf_cdevsw, 0, 0, 0, 0600, PF_NAME); if (pf_dev == NULL) return (ENOMEM); pf_end_threads = 0; error = kproc_create(pf_purge_thread, NULL, &pf_purge_proc, 0, 0, "pf purge"); if (error != 0) return (error); pfi_initialize(); return (0); } static void pf_unload_vnet(void) { int error; V_pf_vnet_active = 0; V_pf_status.running = 0; swi_remove(V_pf_swi_cookie); error = dehook_pf(); if (error) { /* * Should not happen! * XXX Due to error code ESRCH, kldunload will show * a message like 'No such process'. 
*/ printf("%s: pfil unregistration failed\n", __FUNCTION__); return; } PF_RULES_WLOCK(); shutdown_pf(); PF_RULES_WUNLOCK(); pf_unload_vnet_purge(); pf_normalize_cleanup(); PF_RULES_WLOCK(); pfi_cleanup_vnet(); PF_RULES_WUNLOCK(); pfr_cleanup(); pf_osfp_flush(); pf_cleanup(); if (IS_DEFAULT_VNET(curvnet)) pf_mtag_cleanup(); } static void pf_unload(void) { sx_xlock(&pf_end_lock); pf_end_threads = 1; while (pf_end_threads < 2) { wakeup_one(pf_purge_thread); sx_sleep(pf_purge_proc, &pf_end_lock, 0, "pftmo", 0); } sx_xunlock(&pf_end_lock); if (pf_dev != NULL) destroy_dev(pf_dev); pfi_cleanup(); rw_destroy(&pf_rules_lock); sx_destroy(&pf_ioctl_lock); sx_destroy(&pf_end_lock); } static void vnet_pf_init(void *unused __unused) { pf_load_vnet(); } VNET_SYSINIT(vnet_pf_init, SI_SUB_PROTO_FIREWALL, SI_ORDER_THIRD, vnet_pf_init, NULL); static void vnet_pf_uninit(const void *unused __unused) { pf_unload_vnet(); } SYSUNINIT(pf_unload, SI_SUB_PROTO_FIREWALL, SI_ORDER_SECOND, pf_unload, NULL); VNET_SYSUNINIT(vnet_pf_uninit, SI_SUB_PROTO_FIREWALL, SI_ORDER_THIRD, vnet_pf_uninit, NULL); static int pf_modevent(module_t mod, int type, void *data) { int error = 0; switch(type) { case MOD_LOAD: error = pf_load(); break; case MOD_UNLOAD: /* Handled in SYSUNINIT(pf_unload) to ensure it's done after * the vnet_pf_uninit()s */ break; default: error = EINVAL; break; } return (error); } static moduledata_t pf_mod = { "pf", pf_modevent, 0 }; DECLARE_MODULE(pf, pf_mod, SI_SUB_PROTO_FIREWALL, SI_ORDER_SECOND); MODULE_VERSION(pf, PF_MODVER);
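
The common thread in the pf_ioctl.c hunks above is the guard added in front of every user-sized allocation: reject a request whose element count is negative, exceeds pf_ioctl_maxcount, or would overflow when multiplied by the element size. A standalone userspace sketch of the same pattern; the cap value and the overflow macro here are illustrative stand-ins (the kernel pairs the check with WOULD_OVERFLOW(), mallocarray(9) and copyin(9)):

    #include <errno.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define IOCTL_MAXCOUNT  65535   /* illustrative cap, not the kernel's value */

    /* Would count * elem_size overflow a size_t?  (cf. WOULD_OVERFLOW()) */
    #define SIZE_WOULD_OVERFLOW(count, elem_size) \
            ((size_t)(count) > SIZE_MAX / (elem_size))

    static int
    copyin_array_sketch(void **out, const void *ubuf, long count, size_t elem_size)
    {
            void *buf;

            if (count < 0 || count > IOCTL_MAXCOUNT ||
                SIZE_WOULD_OVERFLOW(count, elem_size))
                    return (EINVAL);        /* reject before allocating */

            if ((buf = calloc(count, elem_size)) == NULL)   /* cf. mallocarray(9) */
                    return (ENOMEM);

            memcpy(buf, ubuf, (size_t)count * elem_size);   /* cf. copyin(9) */
            *out = buf;
            return (0);
    }

Checking the count before computing the product is the point: totlen = count * size computed first would silently wrap and undersize the buffer.
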
Index: user/markj/netdump/sys/powerpc/booke/trap_subr.S =================================================================== --- user/markj/netdump/sys/powerpc/booke/trap_subr.S (revision 332407) +++ user/markj/netdump/sys/powerpc/booke/trap_subr.S (revision 332408) @@ -1,1126 +1,1114 @@ /*- * Copyright (C) 2006-2009 Semihalf, Rafal Jaworowski * Copyright (C) 2006 Semihalf, Marian Balakowicz * Copyright (C) 2006 Juniper Networks, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN * NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ */ /*- * Copyright (C) 1995, 1996 Wolfgang Solfrank. * Copyright (C) 1995, 1996 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * from: $NetBSD: trap_subr.S,v 1.20 2002/04/22 23:20:08 kleink Exp $ */ /* * NOTICE: This is not a standalone file. To use it, #include it in * your port's locore.S, like so: * * #include <powerpc/booke/trap_subr.S> */ /* * SPRG usage notes * * SPRG0 - pcpu pointer * SPRG1 - all interrupts except TLB miss, critical, machine check * SPRG2 - critical * SPRG3 - machine check * SPRG4-6 - scratch * */ /* Get the per-CPU data structure */ #define GET_CPUINFO(r) mfsprg0 r #define RES_GRANULE 64 #define RES_LOCK 0 /* offset to the 'lock' word */ #ifdef __powerpc64__ #define RES_RECURSE 8 /* offset to the 'recurse' word */ #else #define RES_RECURSE 4 /* offset to the 'recurse' word */ #endif /* * Standard interrupt prolog * * sprg_sp - SPRG{1-3} reg used to temporarily store the SP * savearea - temp save area (pc_{tempsave, disisave, critsave, mchksave}) * isrr0-1 - save restore registers with CPU state at interrupt time (may be * SRR0-1, CSRR0-1, MCSRR0-1) * * 1. saves in the given savearea: * - R30-31 * - DEAR, ESR * - xSRR0-1 * * 2. saves CR -> R30 * * 3. switches to kstack if needed * * 4. notes: * - R31 can be used as scratch register until a new frame is laid on * the stack with FRAME_SETUP * * - potential TLB miss: NO. Saveareas are always accessible via TLB1 * permanent entries, and within this prolog we do not dereference any * locations potentially not in the TLB */
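
The stack-switch decision in the macro below (the "bf 17, 1f" branch after moving the saved MSR into CR) is easier to read in C: bit 17 in big-endian numbering of a 32-bit word is MSR[PR], value 0x4000, which is set when the trap was taken in user mode. A loose sketch with descriptive names that are not in the source:

    #include <stdint.h>

    #define PSL_PR  0x4000UL  /* MSR 'problem state' bit: set in user mode */

    /*
     * Traps taken in user mode must move onto the per-thread kernel stack
     * (loaded from PC_CURPCB); traps taken in kernel mode keep the
     * interrupted stack.
     */
    static uintptr_t
    pick_trap_stack(uintptr_t interrupted_sp, unsigned long saved_msr,
        uintptr_t curpcb_kstack)
    {
            if (saved_msr & PSL_PR)         /* the 'bf 17, 1f' test */
                    return (curpcb_kstack);
            return (interrupted_sp);
    }
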
#define STANDARD_PROLOG(sprg_sp, savearea, isrr0, isrr1) \ mtspr sprg_sp, %r1; /* Save SP */ \ GET_CPUINFO(%r1); /* Per-cpu structure */ \ STORE %r30, (savearea+CPUSAVE_R30)(%r1); \ STORE %r31, (savearea+CPUSAVE_R31)(%r1); \ mfdear %r30; \ mfesr %r31; \ STORE %r30, (savearea+CPUSAVE_BOOKE_DEAR)(%r1); \ STORE %r31, (savearea+CPUSAVE_BOOKE_ESR)(%r1); \ mfspr %r30, isrr0; \ mfspr %r31, isrr1; /* MSR at interrupt time */ \ STORE %r30, (savearea+CPUSAVE_SRR0)(%r1); \ STORE %r31, (savearea+CPUSAVE_SRR1)(%r1); \ isync; \ mfspr %r1, sprg_sp; /* Restore SP */ \ mfcr %r30; /* Save CR */ \ /* switch to per-thread kstack if intr taken in user mode */ \ mtcr %r31; /* MSR at interrupt time */ \ bf 17, 1f; \ GET_CPUINFO(%r1); /* Per-cpu structure */ \ LOAD %r1, PC_CURPCB(%r1); /* Per-thread kernel stack */ \ 1: #define STANDARD_CRIT_PROLOG(sprg_sp, savearea, isrr0, isrr1) \ mtspr sprg_sp, %r1; /* Save SP */ \ GET_CPUINFO(%r1); /* Per-cpu structure */ \ STORE %r30, (savearea+CPUSAVE_R30)(%r1); \ STORE %r31, (savearea+CPUSAVE_R31)(%r1); \ mfdear %r30; \ mfesr %r31; \ STORE %r30, (savearea+CPUSAVE_BOOKE_DEAR)(%r1); \ STORE %r31, (savearea+CPUSAVE_BOOKE_ESR)(%r1); \ mfspr %r30, isrr0; \ mfspr %r31, isrr1; /* MSR at interrupt time */ \ STORE %r30, (savearea+CPUSAVE_SRR0)(%r1); \ STORE %r31, (savearea+CPUSAVE_SRR1)(%r1); \ mfspr %r30, SPR_SRR0; \ mfspr %r31, SPR_SRR1; /* MSR at interrupt time */ \ STORE %r30, (savearea+BOOKE_CRITSAVE_SRR0)(%r1); \ STORE %r31, (savearea+BOOKE_CRITSAVE_SRR1)(%r1); \ isync; \ mfspr %r1, sprg_sp; /* Restore SP */ \ mfcr %r30; /* Save CR */ \ /* switch to per-thread kstack if intr taken in user mode */ \ mtcr %r31; /* MSR at interrupt time */ \ bf 17, 1f; \ GET_CPUINFO(%r1); /* Per-cpu structure */ \ LOAD %r1, PC_CURPCB(%r1); /* Per-thread kernel stack */ \ 1: /* * FRAME_SETUP assumes: * SPRG{1-3} SP at the time interrupt occurred * savearea r30-r31, DEAR, ESR, xSRR0-1 * r30 CR * r31 scratch * r1 kernel stack * * sprg_sp - SPRG reg containing SP at the time interrupt occurred * savearea - temp save * exc - exception number (EXC_xxx) * * 1. sets a new frame * 2. saves in the frame: * - R0, R1 (SP at the time of interrupt), R2, LR, CR * - R3-31 (R30-31 first restored from savearea) * - XER, CTR, DEAR, ESR (from savearea), xSRR0-1 * * Notes: * - potential TLB miss: YES, since we make dereferences to kstack, which * can happen not covered (we can have up to two DTLB misses if fortunate * enough, i.e.
when kstack crosses page boundary and both pages are * untranslated) */ #ifdef __powerpc64__ #define SAVE_REGS(r) \ std %r3, FRAME_3+CALLSIZE(r); \ std %r4, FRAME_4+CALLSIZE(r); \ std %r5, FRAME_5+CALLSIZE(r); \ std %r6, FRAME_6+CALLSIZE(r); \ std %r7, FRAME_7+CALLSIZE(r); \ std %r8, FRAME_8+CALLSIZE(r); \ std %r9, FRAME_9+CALLSIZE(r); \ std %r10, FRAME_10+CALLSIZE(r); \ std %r11, FRAME_11+CALLSIZE(r); \ std %r12, FRAME_12+CALLSIZE(r); \ std %r13, FRAME_13+CALLSIZE(r); \ std %r14, FRAME_14+CALLSIZE(r); \ std %r15, FRAME_15+CALLSIZE(r); \ std %r16, FRAME_16+CALLSIZE(r); \ std %r17, FRAME_17+CALLSIZE(r); \ std %r18, FRAME_18+CALLSIZE(r); \ std %r19, FRAME_19+CALLSIZE(r); \ std %r20, FRAME_20+CALLSIZE(r); \ std %r21, FRAME_21+CALLSIZE(r); \ std %r22, FRAME_22+CALLSIZE(r); \ std %r23, FRAME_23+CALLSIZE(r); \ std %r24, FRAME_24+CALLSIZE(r); \ std %r25, FRAME_25+CALLSIZE(r); \ std %r26, FRAME_26+CALLSIZE(r); \ std %r27, FRAME_27+CALLSIZE(r); \ std %r28, FRAME_28+CALLSIZE(r); \ std %r29, FRAME_29+CALLSIZE(r); \ std %r30, FRAME_30+CALLSIZE(r); \ std %r31, FRAME_31+CALLSIZE(r) #define LD_REGS(r) \ ld %r3, FRAME_3+CALLSIZE(r); \ ld %r4, FRAME_4+CALLSIZE(r); \ ld %r5, FRAME_5+CALLSIZE(r); \ ld %r6, FRAME_6+CALLSIZE(r); \ ld %r7, FRAME_7+CALLSIZE(r); \ ld %r8, FRAME_8+CALLSIZE(r); \ ld %r9, FRAME_9+CALLSIZE(r); \ ld %r10, FRAME_10+CALLSIZE(r); \ ld %r11, FRAME_11+CALLSIZE(r); \ ld %r12, FRAME_12+CALLSIZE(r); \ ld %r13, FRAME_13+CALLSIZE(r); \ ld %r14, FRAME_14+CALLSIZE(r); \ ld %r15, FRAME_15+CALLSIZE(r); \ ld %r16, FRAME_16+CALLSIZE(r); \ ld %r17, FRAME_17+CALLSIZE(r); \ ld %r18, FRAME_18+CALLSIZE(r); \ ld %r19, FRAME_19+CALLSIZE(r); \ ld %r20, FRAME_20+CALLSIZE(r); \ ld %r21, FRAME_21+CALLSIZE(r); \ ld %r22, FRAME_22+CALLSIZE(r); \ ld %r23, FRAME_23+CALLSIZE(r); \ ld %r24, FRAME_24+CALLSIZE(r); \ ld %r25, FRAME_25+CALLSIZE(r); \ ld %r26, FRAME_26+CALLSIZE(r); \ ld %r27, FRAME_27+CALLSIZE(r); \ ld %r28, FRAME_28+CALLSIZE(r); \ ld %r29, FRAME_29+CALLSIZE(r); \ ld %r30, FRAME_30+CALLSIZE(r); \ ld %r31, FRAME_31+CALLSIZE(r) #else #define SAVE_REGS(r) \ stmw %r3, FRAME_3+CALLSIZE(r) #define LD_REGS(r) \ lmw %r3, FRAME_3+CALLSIZE(r) #endif #define FRAME_SETUP(sprg_sp, savearea, exc) \ mfspr %r31, sprg_sp; /* get saved SP */ \ /* establish a new stack frame and put everything on it */ \ STU %r31, -(FRAMELEN+REDZONE)(%r1); \ STORE %r0, FRAME_0+CALLSIZE(%r1); /* save r0 in the trapframe */ \ STORE %r31, FRAME_1+CALLSIZE(%r1); /* save SP " " */ \ STORE %r2, FRAME_2+CALLSIZE(%r1); /* save r2 " " */ \ mflr %r31; \ STORE %r31, FRAME_LR+CALLSIZE(%r1); /* save LR " " */ \ STORE %r30, FRAME_CR+CALLSIZE(%r1); /* save CR " " */ \ GET_CPUINFO(%r2); \ LOAD %r30, (savearea+CPUSAVE_R30)(%r2); /* get saved r30 */ \ LOAD %r31, (savearea+CPUSAVE_R31)(%r2); /* get saved r31 */ \ /* save R3-31 */ \ SAVE_REGS(%r1); \ /* save DEAR, ESR */ \ LOAD %r28, (savearea+CPUSAVE_BOOKE_DEAR)(%r2); \ LOAD %r29, (savearea+CPUSAVE_BOOKE_ESR)(%r2); \ STORE %r28, FRAME_BOOKE_DEAR+CALLSIZE(%r1); \ STORE %r29, FRAME_BOOKE_ESR+CALLSIZE(%r1); \ /* save XER, CTR, exc number */ \ mfxer %r3; \ mfctr %r4; \ STORE %r3, FRAME_XER+CALLSIZE(%r1); \ STORE %r4, FRAME_CTR+CALLSIZE(%r1); \ li %r5, exc; \ STORE %r5, FRAME_EXC+CALLSIZE(%r1); \ /* save DBCR0 */ \ mfspr %r3, SPR_DBCR0; \ STORE %r3, FRAME_BOOKE_DBCR0+CALLSIZE(%r1); \ /* save xSSR0-1 */ \ LOAD %r30, (savearea+CPUSAVE_SRR0)(%r2); \ LOAD %r31, (savearea+CPUSAVE_SRR1)(%r2); \ STORE %r30, FRAME_SRR0+CALLSIZE(%r1); \ STORE %r31, FRAME_SRR1+CALLSIZE(%r1); \ LOAD THREAD_REG, PC_CURTHREAD(%r2); \ /* * * 
isrr0-1 - save restore registers to restore CPU state to (may be * SRR0-1, CSRR0-1, MCSRR0-1) * * Notes: * - potential TLB miss: YES. The deref'd kstack may not be covered */ #define FRAME_LEAVE(isrr0, isrr1) \ wrteei 0; \ /* restore CTR, XER, LR, CR */ \ LOAD %r4, FRAME_CTR+CALLSIZE(%r1); \ LOAD %r5, FRAME_XER+CALLSIZE(%r1); \ LOAD %r6, FRAME_LR+CALLSIZE(%r1); \ LOAD %r7, FRAME_CR+CALLSIZE(%r1); \ mtctr %r4; \ mtxer %r5; \ mtlr %r6; \ mtcr %r7; \ /* restore DBCR0 */ \ LOAD %r4, FRAME_BOOKE_DBCR0+CALLSIZE(%r1); \ mtspr SPR_DBCR0, %r4; \ /* restore xSRR0-1 */ \ LOAD %r30, FRAME_SRR0+CALLSIZE(%r1); \ LOAD %r31, FRAME_SRR1+CALLSIZE(%r1); \ mtspr isrr0, %r30; \ mtspr isrr1, %r31; \ /* restore R2-31, SP */ \ LD_REGS(%r1); \ LOAD %r2, FRAME_2+CALLSIZE(%r1); \ LOAD %r0, FRAME_0+CALLSIZE(%r1); \ LOAD %r1, FRAME_1+CALLSIZE(%r1); \ isync /* * TLB miss prolog * * saves LR, CR, SRR0-1, R20-31 in the TLBSAVE area * * Notes: * - potential TLB miss: NO. It is crucial that we do not generate a TLB * miss within the TLB prolog itself! * - TLBSAVE is always translated */ #ifdef __powerpc64__ #define TLB_SAVE_REGS(br) \ std %r20, (TLBSAVE_BOOKE_R20)(br); \ std %r21, (TLBSAVE_BOOKE_R21)(br); \ std %r22, (TLBSAVE_BOOKE_R22)(br); \ std %r23, (TLBSAVE_BOOKE_R23)(br); \ std %r24, (TLBSAVE_BOOKE_R24)(br); \ std %r25, (TLBSAVE_BOOKE_R25)(br); \ std %r26, (TLBSAVE_BOOKE_R26)(br); \ std %r27, (TLBSAVE_BOOKE_R27)(br); \ std %r28, (TLBSAVE_BOOKE_R28)(br); \ std %r29, (TLBSAVE_BOOKE_R29)(br); \ std %r30, (TLBSAVE_BOOKE_R30)(br); \ std %r31, (TLBSAVE_BOOKE_R31)(br); #define TLB_RESTORE_REGS(br) \ ld %r20, (TLBSAVE_BOOKE_R20)(br); \ ld %r21, (TLBSAVE_BOOKE_R21)(br); \ ld %r22, (TLBSAVE_BOOKE_R22)(br); \ ld %r23, (TLBSAVE_BOOKE_R23)(br); \ ld %r24, (TLBSAVE_BOOKE_R24)(br); \ ld %r25, (TLBSAVE_BOOKE_R25)(br); \ ld %r26, (TLBSAVE_BOOKE_R26)(br); \ ld %r27, (TLBSAVE_BOOKE_R27)(br); \ ld %r28, (TLBSAVE_BOOKE_R28)(br); \ ld %r29, (TLBSAVE_BOOKE_R29)(br); \ ld %r30, (TLBSAVE_BOOKE_R30)(br); \ ld %r31, (TLBSAVE_BOOKE_R31)(br); #define TLB_NEST(outr,inr) \ rlwinm outr, inr, 7, 22, 24; /* 8 x TLBSAVE_LEN */ #else #define TLB_SAVE_REGS(br) \ stmw %r20, TLBSAVE_BOOKE_R20(br) #define TLB_RESTORE_REGS(br) \ lmw %r20, TLBSAVE_BOOKE_R20(br) #define TLB_NEST(outr,inr) \ rlwinm outr, inr, 6, 23, 25; /* 4 x TLBSAVE_LEN */ #endif #define TLB_PROLOG \ mtsprg4 %r1; /* Save SP */ \ mtsprg5 %r28; \ mtsprg6 %r29; \ /* calculate TLB nesting level and TLBSAVE instance address */ \ GET_CPUINFO(%r1); /* Per-cpu structure */ \ LOAD %r28, PC_BOOKE_TLB_LEVEL(%r1); \ TLB_NEST(%r29,%r28); \ addi %r28, %r28, 1; \ STORE %r28, PC_BOOKE_TLB_LEVEL(%r1); \ addi %r29, %r29, PC_BOOKE_TLBSAVE@l; \ add %r1, %r1, %r29; /* current TLBSAVE ptr */ \ \ /* save R20-31 */ \ mfsprg5 %r28; \ mfsprg6 %r29; \ TLB_SAVE_REGS(%r1); \ /* save LR, CR */ \ mflr %r30; \ mfcr %r31; \ STORE %r30, (TLBSAVE_BOOKE_LR)(%r1); \ STORE %r31, (TLBSAVE_BOOKE_CR)(%r1); \ /* save SRR0-1 */ \ mfsrr0 %r30; /* execution addr at interrupt time */ \ mfsrr1 %r31; /* MSR at interrupt time*/ \ STORE %r30, (TLBSAVE_BOOKE_SRR0)(%r1); /* save SRR0 */ \ STORE %r31, (TLBSAVE_BOOKE_SRR1)(%r1); /* save SRR1 */ \ isync; \ mfsprg4 %r1 /* * restores LR, CR, SRR0-1, R20-31 from the TLBSAVE area * * same notes as for the TLB_PROLOG */ #define TLB_RESTORE \ mtsprg4 %r1; /* Save SP */ \ GET_CPUINFO(%r1); /* Per-cpu structure */ \ /* calculate TLB nesting level and TLBSAVE instance addr */ \ LOAD %r28, PC_BOOKE_TLB_LEVEL(%r1); \ subi %r28, %r28, 1; \ STORE %r28, PC_BOOKE_TLB_LEVEL(%r1); \ TLB_NEST(%r29,%r28); \ addi
%r29, %r29, PC_BOOKE_TLBSAVE@l; \ add %r1, %r1, %r29; \ \ /* restore LR, CR */ \ LOAD %r30, (TLBSAVE_BOOKE_LR)(%r1); \ LOAD %r31, (TLBSAVE_BOOKE_CR)(%r1); \ mtlr %r30; \ mtcr %r31; \ /* restore SRR0-1 */ \ LOAD %r30, (TLBSAVE_BOOKE_SRR0)(%r1); \ LOAD %r31, (TLBSAVE_BOOKE_SRR1)(%r1); \ mtsrr0 %r30; \ mtsrr1 %r31; \ /* restore R20-31 */ \ TLB_RESTORE_REGS(%r1); \ mfsprg4 %r1 #ifdef SMP #define TLB_LOCK \ GET_CPUINFO(%r20); \ LOAD %r21, PC_CURTHREAD(%r20); \ LOAD %r22, PC_BOOKE_TLB_LOCK(%r20); \ \ 1: LOADX %r23, 0, %r22; \ CMPI %r23, TLB_UNLOCKED; \ beq 2f; \ \ /* check if this is recursion */ \ CMPL cr0, %r21, %r23; \ bne- 1b; \ \ 2: /* try to acquire lock */ \ STOREX %r21, 0, %r22; \ bne- 1b; \ \ /* got it, update recursion counter */ \ lwz %r21, RES_RECURSE(%r22); \ addi %r21, %r21, 1; \ stw %r21, RES_RECURSE(%r22); \ isync; \ msync #define TLB_UNLOCK \ GET_CPUINFO(%r20); \ LOAD %r21, PC_CURTHREAD(%r20); \ LOAD %r22, PC_BOOKE_TLB_LOCK(%r20); \ \ /* update recursion counter */ \ lwz %r23, RES_RECURSE(%r22); \ subi %r23, %r23, 1; \ stw %r23, RES_RECURSE(%r22); \ \ cmplwi %r23, 0; \ bne 1f; \ isync; \ msync; \ \ /* release the lock */ \ li %r23, TLB_UNLOCKED; \ STORE %r23, 0(%r22); \ 1: isync; \ msync #else #define TLB_LOCK #define TLB_UNLOCK #endif /* SMP */ #define INTERRUPT(label) \ .globl label; \ .align 5; \ CNAME(label): /* * Interrupt handling routines in BookE can be flexibly placed and do not have * to live in pre-defined vectors location. Note they need to be TLB-mapped at * all times in order to be able to handle exceptions. We thus arrange for * them to be part of kernel text which is always TLB-accessible. * * The interrupt handling routines have to be 16 bytes aligned: we align them * to 32 bytes (cache line length) which supposedly performs better. 
* */ .text .globl CNAME(interrupt_vector_base) .align 5 interrupt_vector_base: /***************************************************************************** * Catch-all handler to handle uninstalled IVORs ****************************************************************************/ INTERRUPT(int_unknown) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_RSVD) b trap_common /***************************************************************************** * Critical input interrupt ****************************************************************************/ INTERRUPT(int_critical_input) STANDARD_CRIT_PROLOG(SPR_SPRG2, PC_BOOKE_CRITSAVE, SPR_CSRR0, SPR_CSRR1) FRAME_SETUP(SPR_SPRG2, PC_BOOKE_CRITSAVE, EXC_CRIT) GET_TOCBASE(%r2) addi %r3, %r1, CALLSIZE bl CNAME(powerpc_interrupt) TOC_RESTORE FRAME_LEAVE(SPR_CSRR0, SPR_CSRR1) rfci /***************************************************************************** * Machine check interrupt ****************************************************************************/ INTERRUPT(int_machine_check) STANDARD_PROLOG(SPR_SPRG3, PC_BOOKE_MCHKSAVE, SPR_MCSRR0, SPR_MCSRR1) FRAME_SETUP(SPR_SPRG3, PC_BOOKE_MCHKSAVE, EXC_MCHK) GET_TOCBASE(%r2) addi %r3, %r1, CALLSIZE bl CNAME(powerpc_interrupt) TOC_RESTORE FRAME_LEAVE(SPR_MCSRR0, SPR_MCSRR1) rfmci /***************************************************************************** * Data storage interrupt ****************************************************************************/ INTERRUPT(int_data_storage) STANDARD_PROLOG(SPR_SPRG1, PC_DISISAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_DISISAVE, EXC_DSI) b trap_common /***************************************************************************** * Instruction storage interrupt ****************************************************************************/ INTERRUPT(int_instr_storage) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_ISI) b trap_common /***************************************************************************** * External input interrupt ****************************************************************************/ INTERRUPT(int_external_input) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_EXI) - GET_TOCBASE(%r2) - addi %r3, %r1, CALLSIZE - bl CNAME(powerpc_interrupt) - TOC_RESTORE - b trapexit + b trap_common INTERRUPT(int_alignment) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_ALI) b trap_common INTERRUPT(int_program) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_PGM) b trap_common INTERRUPT(int_fpu) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_FPU) b trap_common /***************************************************************************** * System call ****************************************************************************/ INTERRUPT(int_syscall) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_SC) b trap_common /***************************************************************************** * Decrementer interrupt ****************************************************************************/ INTERRUPT(int_decrementer) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_DECR) - GET_TOCBASE(%r2) - addi %r3, %r1, CALLSIZE - bl 
CNAME(powerpc_interrupt) - TOC_RESTORE - b trapexit + b trap_common /***************************************************************************** * Fixed interval timer ****************************************************************************/ INTERRUPT(int_fixed_interval_timer) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_FIT) b trap_common /***************************************************************************** * Watchdog interrupt ****************************************************************************/ INTERRUPT(int_watchdog) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_WDOG) b trap_common /***************************************************************************** * Altivec Unavailable interrupt ****************************************************************************/ INTERRUPT(int_vec) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_VEC) b trap_common /***************************************************************************** * Altivec Assist interrupt ****************************************************************************/ INTERRUPT(int_vecast) STANDARD_PROLOG(SPR_SPRG1, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG1, PC_TEMPSAVE, EXC_VECAST_E) b trap_common #ifdef HWPMC_HOOKS /***************************************************************************** * PMC Interrupt ****************************************************************************/ INTERRUPT(int_performance_counter) STANDARD_PROLOG(SPR_SPRG3, PC_TEMPSAVE, SPR_SRR0, SPR_SRR1) FRAME_SETUP(SPR_SPRG3, PC_TEMPSAVE, EXC_PERF) - GET_TOCBASE(%r2) - addi %r3, %r1, CALLSIZE - bl CNAME(powerpc_interrupt) - TOC_RESTORE - b trapexit + b trap_common #endif /***************************************************************************** * Data TLB miss interrupt * * There can be nested TLB misses - while handling a TLB miss we reference * data structures that may not be covered by translations. We support up to * TLB_NESTED_MAX-1 nested misses. * * Registers use: * r31 - dear * r30 - unused * r29 - saved mas0 * r28 - saved mas1 * r27 - saved mas2 * r26 - pmap address * r25 - pte address * * r20:r23 - scratch registers ****************************************************************************/ INTERRUPT(int_data_tlb_error) TLB_PROLOG TLB_LOCK mfdear %r31 /* * Save MAS0-MAS2 registers. There might be another tlb miss during * pte lookup overwriting current contents (which was hw filled). */ mfspr %r29, SPR_MAS0 mfspr %r28, SPR_MAS1 mfspr %r27, SPR_MAS2 /* Check faulting address. */ LOAD_ADDR(%r21, VM_MAXUSER_ADDRESS) CMPL cr0, %r31, %r21 blt search_user_pmap /* If it's kernel address, allow only supervisor mode misses. */ mfsrr1 %r21 mtcr %r21 bt 17, search_failed /* check MSR[PR] */ search_kernel_pmap: /* Load r26 with kernel_pmap address */ bl 1f #ifdef __powerpc64__ .llong kernel_pmap_store-. #else .long kernel_pmap_store-. #endif 1: mflr %r21 LOAD %r26, 0(%r21) add %r26, %r21, %r26 /* kernel_pmap_store in r26 */ /* Force kernel tid, set TID to 0 in MAS1. */ li %r21, 0 rlwimi %r28, %r21, 0, 8, 15 /* clear TID bits */ tlb_miss_handle: /* This may result in a nested tlb miss. */ bl pte_lookup /* returns PTE address in R25 */ CMPI %r25, 0 /* pte found? */ beq search_failed /* Finish up, write TLB entry.
*/ bl tlb_fill_entry tlb_miss_return: TLB_UNLOCK TLB_RESTORE rfi search_user_pmap: /* Load r26 with current user space process pmap */ GET_CPUINFO(%r26) LOAD %r26, PC_CURPMAP(%r26) b tlb_miss_handle search_failed: /* * Whenever we don't find a TLB mapping in PT, set a TLB0 entry with * the faulting virtual address anyway, but put a fake RPN and no * access rights. This should cause a following {D,I}SI exception. */ lis %r23, 0xffff0000@h /* revoke all permissions */ /* Load MAS registers. */ mtspr SPR_MAS0, %r29 isync mtspr SPR_MAS1, %r28 isync mtspr SPR_MAS2, %r27 isync mtspr SPR_MAS3, %r23 isync bl zero_mas7 bl zero_mas8 tlbwe msync isync b tlb_miss_return /***************************************************************************** * * Return pte address that corresponds to given pmap/va. If there is no valid * entry return 0. * * input: r26 - pmap * input: r31 - dear * output: r25 - pte address * * scratch regs used: r21 * ****************************************************************************/ pte_lookup: CMPI %r26, 0 beq 1f /* fail quickly if pmap is invalid */ #ifdef __powerpc64__ rldicl %r21, %r31, (64 - PP2D_L_L), (64 - PP2D_L_NUM) /* pp2d offset */ rldicl %r25, %r31, (64 - PP2D_H_L), (64 - PP2D_H_NUM) rldimi %r21, %r25, PP2D_L_NUM, (64 - (PP2D_L_NUM + PP2D_H_NUM)) slwi %r21, %r21, PP2D_ENTRY_SHIFT /* multiply by pp2d entry size */ addi %r25, %r26, PM_PP2D /* pmap pm_pp2d[] address */ add %r25, %r25, %r21 /* offset within pm_pp2d[] table */ ld %r25, 0(%r25) /* get pdir address, i.e. pmap->pm_pp2d[pp2d_idx] * */ cmpdi %r25, 0 beq 1f #if PAGE_SIZE < 65536 rldicl %r21, %r31, (64 - PDIR_L), (64 - PDIR_NUM) /* pdir offset */ slwi %r21, %r21, PDIR_ENTRY_SHIFT /* multiply by pdir entry size */ add %r25, %r25, %r21 /* offset within pdir table */ ld %r25, 0(%r25) /* get ptbl address, i.e. pmap->pm_pp2d[pp2d_idx][pdir_idx] */ cmpdi %r25, 0 beq 1f #endif rldicl %r21, %r31, (64 - PTBL_L), (64 - PTBL_NUM) /* ptbl offset */ slwi %r21, %r21, PTBL_ENTRY_SHIFT /* multiply by pte entry size */ #else srwi %r21, %r31, PDIR_SHIFT /* pdir offset */ slwi %r21, %r21, PDIR_ENTRY_SHIFT /* multiply by pdir entry size */ addi %r25, %r26, PM_PDIR /* pmap pm_dir[] address */ add %r25, %r25, %r21 /* offset within pm_pdir[] table */ /* * Get ptbl address, i.e. pmap->pm_pdir[pdir_idx] * This load may cause a Data TLB miss for non-kernel pmap! */ LOAD %r25, 0(%r25) CMPI %r25, 0 beq 2f lis %r21, PTBL_MASK@h ori %r21, %r21, PTBL_MASK@l and %r21, %r21, %r31 /* ptbl offset, multiply by ptbl entry size */ srwi %r21, %r21, (PTBL_SHIFT - PTBL_ENTRY_SHIFT) #endif add %r25, %r25, %r21 /* address of pte entry */ /* * Get pte->flags * This load may cause a Data TLB miss for non-kernel pmap! */ lwz %r21, PTE_FLAGS(%r25) andi. %r21, %r21, PTE_VALID@l bne 2f 1: li %r25, 0 2: blr /***************************************************************************** * * Load MAS1-MAS3 registers with data, write TLB entry * * input: * r29 - mas0 * r28 - mas1 * r27 - mas2 * r25 - pte * * output: none * * scratch regs: r21-r23 * ****************************************************************************/ tlb_fill_entry: /* * Update PTE flags: we have to do it atomically, as pmap_protect() * running on other CPUs could attempt to update the flags at the same * time. */ li %r23, PTE_FLAGS 1: lwarx %r21, %r23, %r25 /* get pte->flags */ oris %r21, %r21, PTE_REFERENCED@h /* set referenced bit */ andi. %r22, %r21, (PTE_SW | PTE_UW)@l /* check if writable */ beq 2f ori %r21, %r21, PTE_MODIFIED@l /* set modified bit */ 2: stwcx. 
%r21, %r23, %r25 /* write it back */ bne- 1b /* Update MAS2. */ rlwimi %r27, %r21, 13, 27, 30 /* insert WIMG bits from pte */ /* Setup MAS3 value in r23. */ LOAD %r23, PTE_RPN(%r25) /* get pte->rpn */ #ifdef __powerpc64__ rldicr %r22, %r23, 52, 51 /* extract MAS3 portion of RPN */ rldicl %r23, %r23, 20, 54 /* extract MAS7 portion of RPN */ rlwimi %r22, %r21, 30, 26, 31 /* insert protection bits from pte */ #else rlwinm %r22, %r23, 20, 0, 11 /* extract MAS3 portion of RPN */ rlwimi %r22, %r21, 30, 26, 31 /* insert protection bits from pte */ rlwimi %r22, %r21, 20, 12, 19 /* insert lower 8 RPN bits to MAS3 */ rlwinm %r23, %r23, 20, 24, 31 /* MAS7 portion of RPN */ #endif /* Load MAS registers. */ mtspr SPR_MAS0, %r29 isync mtspr SPR_MAS1, %r28 isync mtspr SPR_MAS2, %r27 isync mtspr SPR_MAS3, %r22 isync mtspr SPR_MAS7, %r23 isync mflr %r21 bl zero_mas8 mtlr %r21 tlbwe isync msync blr /***************************************************************************** * Instruction TLB miss interrupt * * Same notes as for the Data TLB miss ****************************************************************************/ INTERRUPT(int_inst_tlb_error) TLB_PROLOG TLB_LOCK mfsrr0 %r31 /* faulting address */ /* * Save MAS0-MAS2 registers. There might be another tlb miss during pte * lookup overwriting current contents (which was hw filled). */ mfspr %r29, SPR_MAS0 mfspr %r28, SPR_MAS1 mfspr %r27, SPR_MAS2 mfsrr1 %r21 mtcr %r21 /* check MSR[PR] */ bt 17, search_user_pmap b search_kernel_pmap .globl interrupt_vector_top interrupt_vector_top: /***************************************************************************** * Debug interrupt ****************************************************************************/ INTERRUPT(int_debug) STANDARD_CRIT_PROLOG(SPR_SPRG2, PC_BOOKE_CRITSAVE, SPR_CSRR0, SPR_CSRR1) FRAME_SETUP(SPR_SPRG2, PC_BOOKE_CRITSAVE, EXC_DEBUG) bl int_debug_int FRAME_LEAVE(SPR_CSRR0, SPR_CSRR1) rfci INTERRUPT(int_debug_ed) STANDARD_CRIT_PROLOG(SPR_SPRG2, PC_BOOKE_CRITSAVE, SPR_DSRR0, SPR_DSRR1) FRAME_SETUP(SPR_SPRG2, PC_BOOKE_CRITSAVE, EXC_DEBUG) bl int_debug_int FRAME_LEAVE(SPR_DSRR0, SPR_DSRR1) rfdi /* .long 0x4c00004e */ /* Internal helper for debug interrupt handling. */ /* Common code between e500v1/v2 and e500mc-based cores. */ int_debug_int: mflr %r14 GET_CPUINFO(%r3) LOAD %r3, (PC_BOOKE_CRITSAVE+CPUSAVE_SRR0)(%r3) bl 0f ADDR(interrupt_vector_base-.) ADDR(interrupt_vector_top-.) 0: mflr %r5 LOAD %r4,0(%r5) /* interrupt_vector_base in r4 */ add %r4,%r4,%r5 CMPL cr0, %r3, %r4 blt trap_common LOAD %r4,WORD_SIZE(%r5) /* interrupt_vector_top in r4 */ add %r4,%r4,%r5 addi %r4,%r4,4 CMPL cr0, %r3, %r4 bge trap_common /* Disable single-stepping for the interrupt handlers. */ LOAD %r3, FRAME_SRR1+CALLSIZE(%r1); rlwinm %r3, %r3, 0, 23, 21 STORE %r3, FRAME_SRR1+CALLSIZE(%r1); /* Restore srr0 and srr1 as they could have been clobbered. 
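 *
 * The range check above amounts to the following (illustrative sketch;
 * trap_pc is the CSRR0/DSRR0 value saved in the critical-save area,
 * and the rlwinm above clears the MSR debug-enable bit):
 *
 *	if (trap_pc < interrupt_vector_base || trap_pc > interrupt_vector_top)
 *		goto trap_common;	/* a genuine debug trap */
 *	saved_srr1 &= ~PSL_DE;		/* mask debug events in the vectors */
 *
 * so that single-stepping in KDB does not recurse into the interrupt
 * handlers themselves.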
*/ GET_CPUINFO(%r4) LOAD %r3, (PC_BOOKE_CRITSAVE+BOOKE_CRITSAVE_SRR0)(%r4); mtspr SPR_SRR0, %r3 LOAD %r4, (PC_BOOKE_CRITSAVE+BOOKE_CRITSAVE_SRR1)(%r4); mtspr SPR_SRR1, %r4 mtlr %r14 blr /***************************************************************************** * Common trap code ****************************************************************************/ trap_common: /* Call C trap dispatcher */ GET_TOCBASE(%r2) addi %r3, %r1, CALLSIZE - bl CNAME(trap) + bl CNAME(powerpc_interrupt) TOC_RESTORE .globl CNAME(trapexit) /* exported for db_backtrace use */ CNAME(trapexit): /* disable interrupts */ wrteei 0 /* Test AST pending - makes sense for user process only */ LOAD %r5, FRAME_SRR1+CALLSIZE(%r1) mtcr %r5 bf 17, 1f GET_CPUINFO(%r3) LOAD %r4, PC_CURTHREAD(%r3) lwz %r4, TD_FLAGS(%r4) lis %r5, (TDF_ASTPENDING | TDF_NEEDRESCHED)@h ori %r5, %r5, (TDF_ASTPENDING | TDF_NEEDRESCHED)@l and. %r4, %r4, %r5 beq 1f /* re-enable interrupts before calling ast() */ wrteei 1 addi %r3, %r1, CALLSIZE bl CNAME(ast) TOC_RESTORE .globl CNAME(asttrapexit) /* db_backtrace code sentinel #2 */ CNAME(asttrapexit): b trapexit /* test ast ret value ? */ 1: FRAME_LEAVE(SPR_SRR0, SPR_SRR1) rfi #if defined(KDB) /* * Deliberate entry to dbtrap */ /* .globl CNAME(breakpoint)*/ ASENTRY_NOPROF(breakpoint) mtsprg1 %r1 mfmsr %r3 mtsrr1 %r3 li %r4, ~(PSL_EE | PSL_ME)@l oris %r4, %r4, ~(PSL_EE | PSL_ME)@h and %r3, %r3, %r4 mtmsr %r3 /* disable interrupts */ isync GET_CPUINFO(%r3) STORE %r30, (PC_DBSAVE+CPUSAVE_R30)(%r3) STORE %r31, (PC_DBSAVE+CPUSAVE_R31)(%r3) mflr %r31 mtsrr0 %r31 mfdear %r30 mfesr %r31 STORE %r30, (PC_DBSAVE+CPUSAVE_BOOKE_DEAR)(%r3) STORE %r31, (PC_DBSAVE+CPUSAVE_BOOKE_ESR)(%r3) mfsrr0 %r30 mfsrr1 %r31 STORE %r30, (PC_DBSAVE+CPUSAVE_SRR0)(%r3) STORE %r31, (PC_DBSAVE+CPUSAVE_SRR1)(%r3) isync mfcr %r30 /* * Now the kdb trap catching code. */ dbtrap: FRAME_SETUP(SPR_SPRG1, PC_DBSAVE, EXC_DEBUG) /* Call C trap code: */ GET_TOCBASE(%r2) addi %r3, %r1, CALLSIZE bl CNAME(db_trap_glue) TOC_RESTORE or. %r3, %r3, %r3 bne dbleave /* This wasn't for KDB, so switch to real trap: */ b trap_common dbleave: FRAME_LEAVE(SPR_SRR0, SPR_SRR1) rfi #endif /* KDB */ #ifdef SMP ENTRY(tlb_lock) GET_CPUINFO(%r5) LOAD %r5, PC_CURTHREAD(%r5) 1: LOADX %r4, 0, %r3 CMPI %r4, TLB_UNLOCKED bne 1b STOREX %r5, 0, %r3 bne- 1b isync msync blr ENTRY(tlb_unlock) isync msync li %r4, TLB_UNLOCKED STORE %r4, 0(%r3) isync msync blr /* * TLB miss spin locks. For each CPU we have a reservation granule (32 bytes); * only a single word from this granule will actually be used as a spin lock * for mutual exclusion between TLB miss handler and pmap layer that * manipulates page table contents. */ .data .align 5 GLOBAL(tlb0_miss_locks) .space RES_GRANULE * MAXCPU #endif Index: user/markj/netdump/sys/powerpc/mpc85xx/lbc.c =================================================================== --- user/markj/netdump/sys/powerpc/mpc85xx/lbc.c (revision 332407) +++ user/markj/netdump/sys/powerpc/mpc85xx/lbc.c (revision 332408) @@ -1,861 +1,862 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 2006-2008, Juniper Networks, Inc. * Copyright (c) 2008 Semihalf, Rafal Czubak * Copyright (c) 2009 The FreeBSD Foundation * All rights reserved. * * Portions of this software were developed by Semihalf * under sponsorship from the FreeBSD Foundation. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include "opt_platform.h" #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "ofw_bus_if.h" #include "lbc.h" #ifdef DEBUG #define debugf(fmt, args...) do { printf("%s(): ", __func__); \ printf(fmt,##args); } while (0) #else #define debugf(fmt, args...) #endif static MALLOC_DEFINE(M_LBC, "localbus", "localbus devices information"); static int lbc_probe(device_t); static int lbc_attach(device_t); static int lbc_shutdown(device_t); static int lbc_activate_resource(device_t bus __unused, device_t child __unused, int type, int rid __unused, struct resource *r); static int lbc_deactivate_resource(device_t bus __unused, device_t child __unused, int type __unused, int rid __unused, struct resource *r); static struct resource *lbc_alloc_resource(device_t, device_t, int, int *, rman_res_t, rman_res_t, rman_res_t, u_int); static int lbc_print_child(device_t, device_t); static int lbc_release_resource(device_t, device_t, int, int, struct resource *); static const struct ofw_bus_devinfo *lbc_get_devinfo(device_t, device_t); /* * Bus interface definition */ static device_method_t lbc_methods[] = { /* Device interface */ DEVMETHOD(device_probe, lbc_probe), DEVMETHOD(device_attach, lbc_attach), DEVMETHOD(device_shutdown, lbc_shutdown), /* Bus interface */ DEVMETHOD(bus_print_child, lbc_print_child), DEVMETHOD(bus_setup_intr, bus_generic_setup_intr), DEVMETHOD(bus_teardown_intr, NULL), DEVMETHOD(bus_alloc_resource, lbc_alloc_resource), DEVMETHOD(bus_release_resource, lbc_release_resource), DEVMETHOD(bus_activate_resource, lbc_activate_resource), DEVMETHOD(bus_deactivate_resource, lbc_deactivate_resource), /* OFW bus interface */ DEVMETHOD(ofw_bus_get_devinfo, lbc_get_devinfo), DEVMETHOD(ofw_bus_get_compat, ofw_bus_gen_get_compat), DEVMETHOD(ofw_bus_get_model, ofw_bus_gen_get_model), DEVMETHOD(ofw_bus_get_name, ofw_bus_gen_get_name), DEVMETHOD(ofw_bus_get_node, ofw_bus_gen_get_node), DEVMETHOD(ofw_bus_get_type, ofw_bus_gen_get_type), { 0, 0 } }; static driver_t lbc_driver = { "lbc", lbc_methods, sizeof(struct lbc_softc) }; devclass_t lbc_devclass; EARLY_DRIVER_MODULE(lbc, ofwbus, lbc_driver, lbc_devclass, 0, 0, BUS_PASS_BUS); /* * Calculate address mask used by OR(n) registers. 
Use memory region size to * determine mask value. The size must be a power of two and within the range * of 32KB - 4GB. Otherwise error code is returned. Value representing * 4GB size can be passed as 0xffffffff. */ static uint32_t lbc_address_mask(uint32_t size) { int n = 15; if (size == ~0) return (0); while (n < 32) { if (size == (1U << n)) break; n++; } if (n == 32) return (EINVAL); return (0xffff8000 << (n - 15)); } static void lbc_banks_unmap(struct lbc_softc *sc) { int r; r = 0; while (r < LBC_DEV_MAX) { if (sc->sc_range[r].size == 0) return; pmap_unmapdev(sc->sc_range[r].kva, sc->sc_range[r].size); law_disable(OCP85XX_TGTIF_LBC, sc->sc_range[r].addr, sc->sc_range[r].size); r++; } } static int lbc_banks_map(struct lbc_softc *sc) { vm_paddr_t end, start; vm_size_t size; u_int i, r, ranges, s; int error; bzero(sc->sc_range, sizeof(sc->sc_range)); /* * Determine number of discontiguous address ranges to program. */ ranges = 0; for (i = 0; i < LBC_DEV_MAX; i++) { size = sc->sc_banks[i].size; if (size == 0) continue; start = sc->sc_banks[i].addr; for (r = 0; r < ranges; r++) { /* Avoid wrap-around bugs. */ end = sc->sc_range[r].addr - 1 + sc->sc_range[r].size; if (start > 0 && end == start - 1) { sc->sc_range[r].size += size; break; } /* Avoid wrap-around bugs. */ end = start - 1 + size; if (sc->sc_range[r].addr > 0 && end == sc->sc_range[r].addr - 1) { sc->sc_range[r].addr = start; sc->sc_range[r].size += size; break; } } if (r == ranges) { /* New range; add using insertion sort */ r = 0; while (r < ranges && sc->sc_range[r].addr < start) r++; for (s = ranges; s > r; s--) sc->sc_range[s] = sc->sc_range[s-1]; sc->sc_range[r].addr = start; sc->sc_range[r].size = size; ranges++; } } /* * Ranges are sorted so quickly go over the list to merge ranges * that grew toward each other while building the ranges. */ r = 0; while (r < ranges - 1) { end = sc->sc_range[r].addr + sc->sc_range[r].size; if (end != sc->sc_range[r+1].addr) { r++; continue; } sc->sc_range[r].size += sc->sc_range[r+1].size; for (s = r + 1; s < ranges - 1; s++) sc->sc_range[s] = sc->sc_range[s+1]; bzero(&sc->sc_range[s], sizeof(sc->sc_range[s])); ranges--; } /* * Configure LAW for the LBC ranges and map the physical memory * range into KVA. */ for (r = 0; r < ranges; r++) { start = sc->sc_range[r].addr; size = sc->sc_range[r].size; error = law_enable(OCP85XX_TGTIF_LBC, start, size); if (error) return (error); sc->sc_range[r].kva = (vm_offset_t)pmap_mapdev(start, size); } /* XXX: need something better here? */ if (ranges == 0) return (EINVAL); /* Assign KVA to banks based on the enclosing range. */ for (i = 0; i < LBC_DEV_MAX; i++) { size = sc->sc_banks[i].size; if (size == 0) continue; start = sc->sc_banks[i].addr; for (r = 0; r < ranges; r++) { end = sc->sc_range[r].addr - 1 + sc->sc_range[r].size; if (start >= sc->sc_range[r].addr && start - 1 + size <= end) break; } if (r < ranges) { sc->sc_banks[i].kva = sc->sc_range[r].kva + (start - sc->sc_range[r].addr); } } return (0); } static int lbc_banks_enable(struct lbc_softc *sc) { uint32_t size; uint32_t regval; int error, i; for (i = 0; i < LBC_DEV_MAX; i++) { size = sc->sc_banks[i].size; if (size == 0) continue; /* * Compute and program BR value. 
*/ regval = sc->sc_banks[i].addr; switch (sc->sc_banks[i].width) { case 8: regval |= (1 << 11); break; case 16: regval |= (2 << 11); break; case 32: regval |= (3 << 11); break; default: error = EINVAL; goto fail; } regval |= (sc->sc_banks[i].decc << 9); regval |= (sc->sc_banks[i].wp << 8); regval |= (sc->sc_banks[i].msel << 5); regval |= (sc->sc_banks[i].atom << 2); regval |= 1; bus_space_write_4(sc->sc_bst, sc->sc_bsh, LBC85XX_BR(i), regval); /* * Compute and program OR value. */ regval = lbc_address_mask(size); switch (sc->sc_banks[i].msel) { case LBCRES_MSEL_GPCM: /* TODO Add flag support for option registers */ regval |= 0x0ff7; break; case LBCRES_MSEL_FCM: /* TODO Add flag support for options register */ regval |= 0x0796; break; case LBCRES_MSEL_UPMA: case LBCRES_MSEL_UPMB: case LBCRES_MSEL_UPMC: printf("UPM mode not supported yet!"); error = ENOSYS; goto fail; } bus_space_write_4(sc->sc_bst, sc->sc_bsh, LBC85XX_OR(i), regval); } return (0); fail: lbc_banks_unmap(sc); return (error); } static void fdt_lbc_fixup(phandle_t node, struct lbc_softc *sc, struct lbc_devinfo *di) { pcell_t width; int bank; if (OF_getprop(node, "bank-width", (void *)&width, sizeof(width)) <= 0) return; bank = di->di_bank; if (sc->sc_banks[bank].size == 0) return; /* Express width in bits. */ sc->sc_banks[bank].width = width * 8; } static int fdt_lbc_reg_decode(phandle_t node, struct lbc_softc *sc, struct lbc_devinfo *di) { rman_res_t start, end, count; pcell_t *reg, *regptr; pcell_t addr_cells, size_cells; int tuple_size, tuples; int i, j, rv, bank; if (fdt_addrsize_cells(OF_parent(node), &addr_cells, &size_cells) != 0) return (ENXIO); tuple_size = sizeof(pcell_t) * (addr_cells + size_cells); - tuples = OF_getencprop_alloc(node, "reg", tuple_size, (void **)&reg); + tuples = OF_getencprop_alloc_multi(node, "reg", tuple_size, + (void **)&reg); debugf("addr_cells = %d, size_cells = %d\n", addr_cells, size_cells); debugf("tuples = %d, tuple size = %d\n", tuples, tuple_size); if (tuples <= 0) /* No 'reg' property in this node. */ return (0); regptr = reg; for (i = 0; i < tuples; i++) { bank = fdt_data_get((void *)reg, 1); di->di_bank = bank; reg += 1; /* Get address/size. */ start = count = 0; for (j = 0; j < addr_cells - 1; j++) { start <<= 32; start |= reg[j]; } for (j = 0; j < size_cells; j++) { count <<= 32; count |= reg[addr_cells + j - 1]; } reg += addr_cells - 1 + size_cells; /* Calculate address range relative to VA base. */ start = sc->sc_banks[bank].kva + start; end = start + count - 1; debugf("reg addr bank = %d, start = %jx, end = %jx, " "count = %jx\n", bank, start, end, count); /* Use bank (CS) cell as rid. 
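 *
 * For example, a hypothetical child node with reg = <1 0x0 0x10000>
 * decodes to bank (CS) 1, offset 0x0 and size 0x10000, so the child is
 * given a SYS_RES_MEMORY entry with rid 1 covering 64KB starting at
 * the KVA that lbc_banks_map() assigned to bank 1.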
*/ resource_list_add(&di->di_res, SYS_RES_MEMORY, bank, start, end, count); } rv = 0; OF_prop_free(regptr); return (rv); } static void lbc_intr(void *arg) { struct lbc_softc *sc = arg; uint32_t ltesr; ltesr = bus_space_read_4(sc->sc_bst, sc->sc_bsh, LBC85XX_LTESR); sc->sc_ltesr = ltesr; bus_space_write_4(sc->sc_bst, sc->sc_bsh, LBC85XX_LTESR, ltesr); wakeup(sc->sc_dev); } static int lbc_probe(device_t dev) { if (!(ofw_bus_is_compatible(dev, "fsl,lbc") || ofw_bus_is_compatible(dev, "fsl,elbc"))) return (ENXIO); device_set_desc(dev, "Freescale Local Bus Controller"); return (BUS_PROBE_DEFAULT); } static int lbc_attach(device_t dev) { struct lbc_softc *sc; struct lbc_devinfo *di; struct rman *rm; uintmax_t offset, size; vm_paddr_t start; device_t cdev; phandle_t node, child; pcell_t *ranges, *rangesptr; int tuple_size, tuples; int par_addr_cells; int bank, error, i, j; sc = device_get_softc(dev); sc->sc_dev = dev; sc->sc_mrid = 0; sc->sc_mres = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &sc->sc_mrid, RF_ACTIVE); if (sc->sc_mres == NULL) return (ENXIO); sc->sc_bst = rman_get_bustag(sc->sc_mres); sc->sc_bsh = rman_get_bushandle(sc->sc_mres); for (bank = 0; bank < LBC_DEV_MAX; bank++) { bus_space_write_4(sc->sc_bst, sc->sc_bsh, LBC85XX_BR(bank), 0); bus_space_write_4(sc->sc_bst, sc->sc_bsh, LBC85XX_OR(bank), 0); } /* * Initialize configuration register: * - enable Local Bus * - set data buffer control signal function * - disable parity byte select * - set ECC parity type * - set bus monitor timing and timer prescale */ bus_space_write_4(sc->sc_bst, sc->sc_bsh, LBC85XX_LBCR, 0); /* * Initialize clock ratio register: * - disable PLL bypass mode * - configure LCLK delay cycles for the assertion of LALE * - set system clock divider */ bus_space_write_4(sc->sc_bst, sc->sc_bsh, LBC85XX_LCRR, 0x00030008); bus_space_write_4(sc->sc_bst, sc->sc_bsh, LBC85XX_LTEDR, 0); bus_space_write_4(sc->sc_bst, sc->sc_bsh, LBC85XX_LTESR, ~0); bus_space_write_4(sc->sc_bst, sc->sc_bsh, LBC85XX_LTEIR, 0x64080001); sc->sc_irid = 0; sc->sc_ires = bus_alloc_resource_any(dev, SYS_RES_IRQ, &sc->sc_irid, RF_ACTIVE | RF_SHAREABLE); if (sc->sc_ires != NULL) { error = bus_setup_intr(dev, sc->sc_ires, INTR_TYPE_MISC | INTR_MPSAFE, NULL, lbc_intr, sc, &sc->sc_icookie); if (error) { device_printf(dev, "could not activate interrupt\n"); bus_release_resource(dev, SYS_RES_IRQ, sc->sc_irid, sc->sc_ires); sc->sc_ires = NULL; } } sc->sc_ltesr = ~0; rangesptr = NULL; rm = &sc->sc_rman; rm->rm_type = RMAN_ARRAY; rm->rm_descr = "Local Bus Space"; error = rman_init(rm); if (error) goto fail; error = rman_manage_region(rm, rm->rm_start, rm->rm_end); if (error) { rman_fini(rm); goto fail; } /* * Process 'ranges' property. 
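 *
 * A typical entry (hypothetical, with #address-cells = <2>,
 * #size-cells = <1> and one parent address cell) is
 *
 *	ranges = <0x0 0x0 0xfc000000 0x04000000>;
 *
 * i.e. bank 0, offset 0x0, mapped at parent bus address 0xfc000000 and
 * 64MB in size.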
*/ node = ofw_bus_get_node(dev); if ((fdt_addrsize_cells(node, &sc->sc_addr_cells, &sc->sc_size_cells)) != 0) { error = ENXIO; goto fail; } par_addr_cells = fdt_parent_addr_cells(node); if (par_addr_cells > 2) { device_printf(dev, "unsupported parent #addr-cells\n"); error = ERANGE; goto fail; } tuple_size = sizeof(pcell_t) * (sc->sc_addr_cells + par_addr_cells + sc->sc_size_cells); - tuples = OF_getencprop_alloc(node, "ranges", tuple_size, + tuples = OF_getencprop_alloc_multi(node, "ranges", tuple_size, (void **)&ranges); if (tuples < 0) { device_printf(dev, "could not retrieve 'ranges' property\n"); error = ENXIO; goto fail; } rangesptr = ranges; debugf("par addr_cells = %d, addr_cells = %d, size_cells = %d, " "tuple_size = %d, tuples = %d\n", par_addr_cells, sc->sc_addr_cells, sc->sc_size_cells, tuple_size, tuples); start = 0; size = 0; for (i = 0; i < tuples; i++) { /* The first cell is the bank (chip select) number. */ bank = fdt_data_get(ranges, 1); if (bank < 0 || bank > LBC_DEV_MAX) { device_printf(dev, "bank out of range: %d\n", bank); error = ERANGE; goto fail; } ranges += 1; /* * Remaining cells of the child address define offset into * this CS. */ offset = 0; for (j = 0; j < sc->sc_addr_cells - 1; j++) { offset <<= sizeof(pcell_t) * 8; offset |= *ranges; ranges++; } /* Parent bus start address of this bank. */ start = 0; for (j = 0; j < par_addr_cells; j++) { start <<= sizeof(pcell_t) * 8; start |= *ranges; ranges++; } size = fdt_data_get((void *)ranges, sc->sc_size_cells); ranges += sc->sc_size_cells; debugf("bank = %d, start = %jx, size = %jx\n", bank, (uintmax_t)start, size); sc->sc_banks[bank].addr = start + offset; sc->sc_banks[bank].size = size; /* * Attributes for the bank. * * XXX Note there are no DT bindings defined for them at the * moment, so we need to provide some defaults. */ sc->sc_banks[bank].width = 16; sc->sc_banks[bank].msel = LBCRES_MSEL_GPCM; sc->sc_banks[bank].decc = LBCRES_DECC_DISABLED; sc->sc_banks[bank].atom = LBCRES_ATOM_DISABLED; sc->sc_banks[bank].wp = 0; } /* * Initialize mem-mappings for the LBC banks (i.e. chip selects). */ error = lbc_banks_map(sc); if (error) goto fail; /* * Walk the localbus and add direct subordinates as our children. */ for (child = OF_child(node); child != 0; child = OF_peer(child)) { di = malloc(sizeof(*di), M_LBC, M_WAITOK | M_ZERO); if (ofw_bus_gen_setup_devinfo(&di->di_ofw, child) != 0) { free(di, M_LBC); device_printf(dev, "could not set up devinfo\n"); continue; } resource_list_init(&di->di_res); if (fdt_lbc_reg_decode(child, sc, di)) { device_printf(dev, "could not process 'reg' " "property\n"); ofw_bus_gen_destroy_devinfo(&di->di_ofw); free(di, M_LBC); continue; } fdt_lbc_fixup(child, sc, di); /* Add newbus device for this FDT node */ cdev = device_add_child(dev, NULL, -1); if (cdev == NULL) { device_printf(dev, "could not add child: %s\n", di->di_ofw.obd_name); resource_list_free(&di->di_res); ofw_bus_gen_destroy_devinfo(&di->di_ofw); free(di, M_LBC); continue; } debugf("added child name='%s', node=%x\n", di->di_ofw.obd_name, child); device_set_ivars(cdev, di); } /* * Enable the LBC. 
*/ lbc_banks_enable(sc); OF_prop_free(rangesptr); return (bus_generic_attach(dev)); fail: OF_prop_free(rangesptr); bus_release_resource(dev, SYS_RES_MEMORY, sc->sc_mrid, sc->sc_mres); return (error); } static int lbc_shutdown(device_t dev) { /* TODO */ return(0); } static struct resource * lbc_alloc_resource(device_t bus, device_t child, int type, int *rid, rman_res_t start, rman_res_t end, rman_res_t count, u_int flags) { struct lbc_softc *sc; struct lbc_devinfo *di; struct resource_list_entry *rle; struct resource *res; struct rman *rm; int needactivate; /* We only support default allocations. */ if (!RMAN_IS_DEFAULT_RANGE(start, end)) return (NULL); sc = device_get_softc(bus); if (type == SYS_RES_IRQ) return (bus_alloc_resource(bus, type, rid, start, end, count, flags)); /* * Request for the default allocation with a given rid: use resource * list stored in the local device info. */ if ((di = device_get_ivars(child)) == NULL) return (NULL); if (type == SYS_RES_IOPORT) type = SYS_RES_MEMORY; rid = &di->di_bank; rle = resource_list_find(&di->di_res, type, *rid); if (rle == NULL) { device_printf(bus, "no default resources for " "rid = %d, type = %d\n", *rid, type); return (NULL); } start = rle->start; count = rle->count; end = start + count - 1; sc = device_get_softc(bus); needactivate = flags & RF_ACTIVE; flags &= ~RF_ACTIVE; rm = &sc->sc_rman; res = rman_reserve_resource(rm, start, end, count, flags, child); if (res == NULL) { device_printf(bus, "failed to reserve resource %#jx - %#jx " "(%#jx)\n", start, end, count); return (NULL); } rman_set_rid(res, *rid); rman_set_bustag(res, &bs_be_tag); rman_set_bushandle(res, rman_get_start(res)); if (needactivate) if (bus_activate_resource(child, type, *rid, res)) { device_printf(child, "resource activation failed\n"); rman_release_resource(res); return (NULL); } return (res); } static int lbc_print_child(device_t dev, device_t child) { struct lbc_devinfo *di; struct resource_list *rl; int rv; di = device_get_ivars(child); rl = &di->di_res; rv = 0; rv += bus_print_child_header(dev, child); rv += resource_list_print_type(rl, "mem", SYS_RES_MEMORY, "%#jx"); rv += resource_list_print_type(rl, "irq", SYS_RES_IRQ, "%jd"); rv += bus_print_child_footer(dev, child); return (rv); } static int lbc_release_resource(device_t dev, device_t child, int type, int rid, struct resource *res) { int err; if (rman_get_flags(res) & RF_ACTIVE) { err = bus_deactivate_resource(child, type, rid, res); if (err) return (err); } return (rman_release_resource(res)); } static int lbc_activate_resource(device_t bus __unused, device_t child __unused, int type __unused, int rid __unused, struct resource *r) { /* Child resources were already mapped, just activate. 
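 *
 * From a child driver's point of view this is the standard newbus
 * idiom; a minimal sketch (hypothetical child attach code):
 *
 *	int rid = 0;	/* overridden: lbc_alloc_resource() uses the bank */
 *	struct resource *res;
 *
 *	res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE);
 *	if (res == NULL)
 *		return (ENXIO);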
*/ return (rman_activate_resource(r)); } static int lbc_deactivate_resource(device_t bus __unused, device_t child __unused, int type __unused, int rid __unused, struct resource *r) { return (rman_deactivate_resource(r)); } static const struct ofw_bus_devinfo * lbc_get_devinfo(device_t bus, device_t child) { struct lbc_devinfo *di; di = device_get_ivars(child); return (&di->di_ofw); } void lbc_write_reg(device_t child, u_int off, uint32_t val) { device_t dev; struct lbc_softc *sc; dev = device_get_parent(child); if (off >= 0x1000) { device_printf(dev, "%s(%s): invalid offset %#x\n", __func__, device_get_nameunit(child), off); return; } sc = device_get_softc(dev); if (off == LBC85XX_LTESR && sc->sc_ltesr != ~0u) { sc->sc_ltesr ^= (val & sc->sc_ltesr); return; } if (off == LBC85XX_LTEATR && (val & 1) == 0) sc->sc_ltesr = ~0u; bus_space_write_4(sc->sc_bst, sc->sc_bsh, off, val); } uint32_t lbc_read_reg(device_t child, u_int off) { device_t dev; struct lbc_softc *sc; uint32_t val; dev = device_get_parent(child); if (off >= 0x1000) { device_printf(dev, "%s(%s): invalid offset %#x\n", __func__, device_get_nameunit(child), off); return (~0U); } sc = device_get_softc(dev); if (off == LBC85XX_LTESR && sc->sc_ltesr != ~0U) val = sc->sc_ltesr; else val = bus_space_read_4(sc->sc_bst, sc->sc_bsh, off); return (val); } Index: user/markj/netdump/sys/powerpc/powerpc/trap.c =================================================================== --- user/markj/netdump/sys/powerpc/powerpc/trap.c (revision 332407) +++ user/markj/netdump/sys/powerpc/powerpc/trap.c (revision 332408) @@ -1,922 +1,929 @@ /*- * Copyright (C) 1995, 1996 Wolfgang Solfrank. * Copyright (C) 1995, 1996 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
* * $NetBSD: trap.c,v 1.58 2002/03/04 04:07:35 dbj Exp $ */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* Below matches setjmp.S */ #define FAULTBUF_LR 21 #define FAULTBUF_R1 1 #define FAULTBUF_R2 2 #define FAULTBUF_CR 22 #define FAULTBUF_R14 3 #define MOREARGS(sp) ((caddr_t)((uintptr_t)(sp) + \ sizeof(struct callframe) - 3*sizeof(register_t))) /* more args go here */ static void trap_fatal(struct trapframe *frame); static void printtrap(u_int vector, struct trapframe *frame, int isfatal, int user); static int trap_pfault(struct trapframe *frame, int user); static int fix_unaligned(struct thread *td, struct trapframe *frame); static int handle_onfault(struct trapframe *frame); static void syscall(struct trapframe *frame); #if defined(__powerpc64__) && defined(AIM) void handle_kernel_slb_spill(int, register_t, register_t); static int handle_user_slb_spill(pmap_t pm, vm_offset_t addr); extern int n_slbs; #endif extern vm_offset_t __startkernel; #ifdef KDB int db_trap_glue(struct trapframe *); /* Called from trap_subr.S */ #endif struct powerpc_exception { u_int vector; char *name; }; #ifdef KDTRACE_HOOKS #include int (*dtrace_invop_jump_addr)(struct trapframe *); #endif static struct powerpc_exception powerpc_exceptions[] = { { EXC_CRIT, "critical input" }, { EXC_RST, "system reset" }, { EXC_MCHK, "machine check" }, { EXC_DSI, "data storage interrupt" }, { EXC_DSE, "data segment exception" }, { EXC_ISI, "instruction storage interrupt" }, { EXC_ISE, "instruction segment exception" }, { EXC_EXI, "external interrupt" }, { EXC_ALI, "alignment" }, { EXC_PGM, "program" }, { EXC_HEA, "hypervisor emulation assistance" }, { EXC_FPU, "floating-point unavailable" }, { EXC_APU, "auxiliary proc unavailable" }, { EXC_DECR, "decrementer" }, { EXC_FIT, "fixed-interval timer" }, { EXC_WDOG, "watchdog timer" }, { EXC_SC, "system call" }, { EXC_TRC, "trace" }, { EXC_FPA, "floating-point assist" }, { EXC_DEBUG, "debug" }, { EXC_PERF, "performance monitoring" }, { EXC_VEC, "altivec unavailable" }, { EXC_VSX, "vsx unavailable" }, { EXC_FAC, "facility unavailable" }, { EXC_ITMISS, "instruction tlb miss" }, { EXC_DLMISS, "data load tlb miss" }, { EXC_DSMISS, "data store tlb miss" }, { EXC_BPT, "instruction breakpoint" }, { EXC_SMI, "system management" }, { EXC_VECAST_G4, "altivec assist" }, { EXC_THRM, "thermal management" }, { EXC_RUNMODETRC, "run mode/trace" }, { EXC_LAST, NULL } }; #define ESR_BITMASK \ "\20" \ "\040b0\037b1\036b2\035b3\034PIL\033PRR\032PTR\031FP" \ "\030ST\027b9\026DLK\025ILK\024b12\023b13\022BO\021PIE" \ "\020b16\017b17\016b18\015b19\014b20\013b21\012b22\011b23" \ "\010SPE\007EPID\006b26\005b27\004b28\003b29\002b30\001b31" #define MCSR_BITMASK \ "\20" \ "\040MCP\037ICERR\036DCERR\035TLBPERR\034L2MMU_MHIT\033b5\032b6\031b7" \ "\030b8\027b9\026b10\025NMI\024MAV\023MEA\022b14\021IF" \ "\020LD\017ST\016LDG\015b19\014b20\013b21\012b22\011b23" \ "\010b24\007b25\006b26\005b27\004b28\003b29\002TLBSYNC\001BSL2_ERR" #define MSSSR_BITMASK \ "\20" \ "\040b0\037b1\036b2\035b3\034b4\033b5\032b6\031b7" \ "\030b8\027b9\026b10\025b11\024b12\023L2TAG\022L2DAT\021L3TAG" \ "\020L3DAT\017APE\016DPE\015TEA\014b20\013b21\012b22\011b23" \ "\010b24\007b25\006b26\005b27\004b28\003b29\002b30\001b31" static const char * 
trapname(u_int vector) { struct powerpc_exception *pe; for (pe = powerpc_exceptions; pe->vector != EXC_LAST; pe++) { if (pe->vector == vector) return (pe->name); } return ("unknown"); } static inline bool frame_is_trap_inst(struct trapframe *frame) { #ifdef AIM return (frame->exc == EXC_PGM && frame->srr1 & EXC_PGM_TRAP); #else return ((frame->cpu.booke.esr & ESR_PTR) != 0); #endif } void trap(struct trapframe *frame) { struct thread *td; struct proc *p; #ifdef KDTRACE_HOOKS uint32_t inst; #endif int sig, type, user; u_int ucode; ksiginfo_t ksi; register_t fscr; VM_CNT_INC(v_trap); +#ifdef KDB + if (kdb_active) { + kdb_reenter(); + return; + } +#endif + td = curthread; p = td->td_proc; type = ucode = frame->exc; sig = 0; user = frame->srr1 & PSL_PR; CTR3(KTR_TRAP, "trap: %s type=%s (%s)", td->td_name, trapname(type), user ? "user" : "kernel"); #ifdef KDTRACE_HOOKS /* * A trap can occur while DTrace executes a probe. Before * executing the probe, DTrace blocks re-scheduling and sets * a flag in its per-cpu flags to indicate that it doesn't * want to fault. On returning from the probe, the no-fault * flag is cleared and finally re-scheduling is enabled. * * If the DTrace kernel module has registered a trap handler, * call it and if it returns non-zero, assume that it has * handled the trap and modified the trap frame so that this * function can return normally. */ if (dtrace_trap_func != NULL && (*dtrace_trap_func)(frame, type) != 0) return; #endif if (user) { td->td_pticks = 0; td->td_frame = frame; if (td->td_cowgen != p->p_cowgen) thread_cow_update(td); /* User Mode Traps */ switch (type) { case EXC_RUNMODETRC: case EXC_TRC: frame->srr1 &= ~PSL_SE; sig = SIGTRAP; ucode = TRAP_TRACE; break; #if defined(__powerpc64__) && defined(AIM) case EXC_ISE: case EXC_DSE: if (handle_user_slb_spill(&p->p_vmspace->vm_pmap, (type == EXC_ISE) ? frame->srr0 : frame->dar) != 0){ sig = SIGSEGV; ucode = SEGV_MAPERR; } break; #endif case EXC_DSI: case EXC_ISI: sig = trap_pfault(frame, 1); if (sig == SIGSEGV) ucode = SEGV_MAPERR; break; case EXC_SC: syscall(frame); break; case EXC_FPU: KASSERT((td->td_pcb->pcb_flags & PCB_FPU) != PCB_FPU, ("FPU already enabled for thread")); enable_fpu(td); break; case EXC_VEC: KASSERT((td->td_pcb->pcb_flags & PCB_VEC) != PCB_VEC, ("Altivec already enabled for thread")); enable_vec(td); break; case EXC_VSX: KASSERT((td->td_pcb->pcb_flags & PCB_VSX) != PCB_VSX, ("VSX already enabled for thread")); if (!(td->td_pcb->pcb_flags & PCB_VEC)) enable_vec(td); if (!(td->td_pcb->pcb_flags & PCB_FPU)) save_fpu(td); td->td_pcb->pcb_flags |= PCB_VSX; enable_fpu(td); break; case EXC_FAC: fscr = mfspr(SPR_FSCR); if ((fscr & FSCR_IC_MASK) == FSCR_IC_HTM) { CTR0(KTR_TRAP, "Hardware Transactional Memory subsystem disabled"); } sig = SIGILL; ucode = ILL_ILLOPC; break; case EXC_HEA: sig = SIGILL; ucode = ILL_ILLOPC; break; case EXC_VECAST_E: case EXC_VECAST_G4: case EXC_VECAST_G5: /* * We get a VPU assist exception for IEEE mode * vector operations on denormalized floats. * Emulating this is a giant pain, so for now, * just switch off IEEE mode and treat them as * zero. 
*/ save_vec(td); td->td_pcb->pcb_vec.vscr |= ALTIVEC_VSCR_NJ; enable_vec(td); break; case EXC_ALI: if (fix_unaligned(td, frame) != 0) { sig = SIGBUS; ucode = BUS_ADRALN; } else frame->srr0 += 4; break; case EXC_DEBUG: /* Single stepping */ mtspr(SPR_DBSR, mfspr(SPR_DBSR)); frame->srr1 &= ~PSL_DE; frame->cpu.booke.dbcr0 &= ~(DBCR0_IDM | DBCR0_IC); sig = SIGTRAP; ucode = TRAP_TRACE; break; case EXC_PGM: /* Identify the trap reason */ if (frame_is_trap_inst(frame)) { #ifdef KDTRACE_HOOKS inst = fuword32((const void *)frame->srr0); if (inst == 0x0FFFDDDD && dtrace_pid_probe_ptr != NULL) { (*dtrace_pid_probe_ptr)(frame); break; } #endif sig = SIGTRAP; ucode = TRAP_BRKPT; } else { sig = ppc_instr_emulate(frame, td->td_pcb); if (sig == SIGILL) { if (frame->srr1 & EXC_PGM_PRIV) ucode = ILL_PRVOPC; else if (frame->srr1 & EXC_PGM_ILLEGAL) ucode = ILL_ILLOPC; } else if (sig == SIGFPE) ucode = FPE_FLTINV; /* Punt for now, invalid operation. */ } break; case EXC_MCHK: /* * Note that this may not be recoverable for the user * process, depending on the type of machine check, * but it at least prevents the kernel from dying. */ sig = SIGBUS; ucode = BUS_OBJERR; break; default: trap_fatal(frame); } } else { /* Kernel Mode Traps */ KASSERT(cold || td->td_ucred != NULL, ("kernel trap doesn't have ucred")); switch (type) { case EXC_PGM: #ifdef KDTRACE_HOOKS if (frame_is_trap_inst(frame)) { if (*(uint32_t *)frame->srr0 == EXC_DTRACE) { if (dtrace_invop_jump_addr != NULL) { dtrace_invop_jump_addr(frame); return; } } } #endif #ifdef KDB if (db_trap_glue(frame)) return; #endif break; #if defined(__powerpc64__) && defined(AIM) case EXC_DSE: if (td->td_pcb->pcb_cpu.aim.usr_vsid != 0 && (frame->dar & SEGMENT_MASK) == USER_ADDR) { __asm __volatile ("slbmte %0, %1" :: "r"(td->td_pcb->pcb_cpu.aim.usr_vsid), "r"(USER_SLB_SLBE)); return; } break; #endif case EXC_DSI: if (trap_pfault(frame, 0) == 0) return; break; case EXC_MCHK: if (handle_onfault(frame)) return; break; default: break; } trap_fatal(frame); } if (sig != 0) { if (p->p_sysent->sv_transtrap != NULL) sig = (p->p_sysent->sv_transtrap)(sig, type); ksiginfo_init_trap(&ksi); ksi.ksi_signo = sig; ksi.ksi_code = (int) ucode; /* XXX, not POSIX */ /* ksi.ksi_addr = ? */ ksi.ksi_trapno = type; trapsignal(td, &ksi); } userret(td, frame); } static void trap_fatal(struct trapframe *frame) { printtrap(frame->exc, frame, 1, (frame->srr1 & PSL_PR)); #ifdef KDB if ((debugger_on_panic || kdb_active) && kdb_trap(frame->exc, 0, frame)) return; #endif panic("%s trap", trapname(frame->exc)); } static void cpu_printtrap(u_int vector, struct trapframe *frame, int isfatal, int user) { #ifdef AIM uint16_t ver; switch (vector) { case EXC_DSE: case EXC_DSI: case EXC_DTMISS: printf(" dsisr = 0x%lx\n", (u_long)frame->cpu.aim.dsisr); break; case EXC_MCHK: ver = mfpvr() >> 16; if (MPC745X_P(ver)) printf(" msssr0 = 0x%b\n", (int)mfspr(SPR_MSSSR0), MSSSR_BITMASK); break; } #elif defined(BOOKE) vm_paddr_t pa; switch (vector) { case EXC_MCHK: pa = mfspr(SPR_MCARU); pa = (pa << 32) | (u_register_t)mfspr(SPR_MCAR); printf(" mcsr = 0x%b\n", (int)mfspr(SPR_MCSR), MCSR_BITMASK); printf(" mcar = 0x%jx\n", (uintmax_t)pa); } printf(" esr = 0x%b\n", (int)frame->cpu.booke.esr, ESR_BITMASK); #endif } static void printtrap(u_int vector, struct trapframe *frame, int isfatal, int user) { printf("\n"); printf("%s %s trap:\n", isfatal ? "fatal" : "handled", user ? 
"user" : "kernel"); printf("\n"); printf(" exception = 0x%x (%s)\n", vector, trapname(vector)); switch (vector) { case EXC_DSE: case EXC_DSI: case EXC_DTMISS: printf(" virtual address = 0x%" PRIxPTR "\n", frame->dar); break; case EXC_ISE: case EXC_ISI: case EXC_ITMISS: printf(" virtual address = 0x%" PRIxPTR "\n", frame->srr0); break; case EXC_MCHK: break; } cpu_printtrap(vector, frame, isfatal, user); printf(" srr0 = 0x%" PRIxPTR " (0x%" PRIxPTR ")\n", frame->srr0, frame->srr0 - (register_t)(__startkernel - KERNBASE)); printf(" srr1 = 0x%lx\n", (u_long)frame->srr1); printf(" current msr = 0x%" PRIxPTR "\n", mfmsr()); printf(" lr = 0x%" PRIxPTR " (0x%" PRIxPTR ")\n", frame->lr, frame->lr - (register_t)(__startkernel - KERNBASE)); printf(" curthread = %p\n", curthread); if (curthread != NULL) printf(" pid = %d, comm = %s\n", curthread->td_proc->p_pid, curthread->td_name); printf("\n"); } /* * Handles a fatal fault when we have onfault state to recover. Returns * non-zero if there was onfault recovery state available. */ static int handle_onfault(struct trapframe *frame) { struct thread *td; jmp_buf *fb; td = curthread; fb = td->td_pcb->pcb_onfault; if (fb != NULL) { frame->srr0 = (*fb)->_jb[FAULTBUF_LR]; frame->fixreg[1] = (*fb)->_jb[FAULTBUF_R1]; frame->fixreg[2] = (*fb)->_jb[FAULTBUF_R2]; frame->fixreg[3] = 1; frame->cr = (*fb)->_jb[FAULTBUF_CR]; bcopy(&(*fb)->_jb[FAULTBUF_R14], &frame->fixreg[14], 18 * sizeof(register_t)); td->td_pcb->pcb_onfault = NULL; /* Returns twice, not thrice */ return (1); } return (0); } int cpu_fetch_syscall_args(struct thread *td) { struct proc *p; struct trapframe *frame; struct syscall_args *sa; caddr_t params; size_t argsz; int error, n, i; p = td->td_proc; frame = td->td_frame; sa = &td->td_sa; sa->code = frame->fixreg[0]; params = (caddr_t)(frame->fixreg + FIRSTARG); n = NARGREG; if (sa->code == SYS_syscall) { /* * code is first argument, * followed by actual args. */ sa->code = *(register_t *) params; params += sizeof(register_t); n -= 1; } else if (sa->code == SYS___syscall) { /* * Like syscall, but code is a quad, * so as to maintain quad alignment * for the rest of the args. 
*/ if (SV_PROC_FLAG(p, SV_ILP32)) { params += sizeof(register_t); sa->code = *(register_t *) params; params += sizeof(register_t); n -= 2; } else { sa->code = *(register_t *) params; params += sizeof(register_t); n -= 1; } } if (p->p_sysent->sv_mask) sa->code &= p->p_sysent->sv_mask; if (sa->code >= p->p_sysent->sv_size) sa->callp = &p->p_sysent->sv_table[0]; else sa->callp = &p->p_sysent->sv_table[sa->code]; sa->narg = sa->callp->sy_narg; if (SV_PROC_FLAG(p, SV_ILP32)) { argsz = sizeof(uint32_t); for (i = 0; i < n; i++) sa->args[i] = ((u_register_t *)(params))[i] & 0xffffffff; } else { argsz = sizeof(uint64_t); for (i = 0; i < n; i++) sa->args[i] = ((u_register_t *)(params))[i]; } if (sa->narg > n) error = copyin(MOREARGS(frame->fixreg[1]), sa->args + n, (sa->narg - n) * argsz); else error = 0; #ifdef __powerpc64__ if (SV_PROC_FLAG(p, SV_ILP32) && sa->narg > n) { /* Expand the size of arguments copied from the stack */ for (i = sa->narg; i >= n; i--) sa->args[i] = ((uint32_t *)(&sa->args[n]))[i-n]; } #endif if (error == 0) { td->td_retval[0] = 0; td->td_retval[1] = frame->fixreg[FIRSTARG + 1]; } return (error); } #include "../../kern/subr_syscall.c" void syscall(struct trapframe *frame) { struct thread *td; int error; td = curthread; td->td_frame = frame; #if defined(__powerpc64__) && defined(AIM) /* * Speculatively restore last user SLB segment, which we know is * invalid already, since we are likely to do copyin()/copyout(). */ if (td->td_pcb->pcb_cpu.aim.usr_vsid != 0) __asm __volatile ("slbmte %0, %1; isync" :: "r"(td->td_pcb->pcb_cpu.aim.usr_vsid), "r"(USER_SLB_SLBE)); #endif error = syscallenter(td); syscallret(td, error); } #if defined(__powerpc64__) && defined(AIM) /* Handle kernel SLB faults -- runs in real mode, all seat belts off */ void handle_kernel_slb_spill(int type, register_t dar, register_t srr0) { struct slb *slbcache; uint64_t slbe, slbv; uint64_t esid, addr; int i; addr = (type == EXC_ISE) ? srr0 : dar; slbcache = PCPU_GET(aim.slb); esid = (uintptr_t)addr >> ADDR_SR_SHFT; slbe = (esid << SLBE_ESID_SHIFT) | SLBE_VALID; /* See if the hardware flushed this somehow (can happen in LPARs) */ for (i = 0; i < n_slbs; i++) if (slbcache[i].slbe == (slbe | (uint64_t)i)) return; /* Not in the map, needs to actually be added */ slbv = kernel_va_to_slbv(addr); if (slbcache[USER_SLB_SLOT].slbe == 0) { for (i = 0; i < n_slbs; i++) { if (i == USER_SLB_SLOT) continue; if (!(slbcache[i].slbe & SLBE_VALID)) goto fillkernslb; } if (i == n_slbs) slbcache[USER_SLB_SLOT].slbe = 1; } /* Sacrifice a random SLB entry that is not the user entry */ i = mftb() % n_slbs; if (i == USER_SLB_SLOT) i = (i+1) % n_slbs; fillkernslb: /* Write new entry */ slbcache[i].slbv = slbv; slbcache[i].slbe = slbe | (uint64_t)i; /* Trap handler will restore from cache on exit */ } static int handle_user_slb_spill(pmap_t pm, vm_offset_t addr) { struct slb *user_entry; uint64_t esid; int i; if (pm->pm_slb == NULL) return (-1); esid = (uintptr_t)addr >> ADDR_SR_SHFT; PMAP_LOCK(pm); user_entry = user_va_to_slb_entry(pm, addr); if (user_entry == NULL) { /* allocate_vsid auto-spills it */ (void)allocate_user_vsid(pm, esid, 0); } else { /* * Check that another CPU has not already mapped this. * XXX: Per-thread SLB caches would be better. 
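 *
 * For reference, with 256MB segments (ADDR_SR_SHFT == 28) a fault at
 * address 0x10012345 yields esid 0x1; the entry eventually installed
 * is slbe = (esid << SLBE_ESID_SHIFT) | SLBE_VALID plus the chosen
 * slot number, exactly as handle_kernel_slb_spill() builds it above.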
*/ for (i = 0; i < pm->pm_slb_len; i++) if (pm->pm_slb[i] == user_entry) break; if (i == pm->pm_slb_len) slb_insert_user(pm, user_entry); } PMAP_UNLOCK(pm); return (0); } #endif static int trap_pfault(struct trapframe *frame, int user) { vm_offset_t eva, va; struct thread *td; struct proc *p; vm_map_t map; vm_prot_t ftype; int rv, is_user; td = curthread; p = td->td_proc; if (frame->exc == EXC_ISI) { eva = frame->srr0; ftype = VM_PROT_EXECUTE; if (frame->srr1 & SRR1_ISI_PFAULT) ftype |= VM_PROT_READ; } else { eva = frame->dar; #ifdef BOOKE if (frame->cpu.booke.esr & ESR_ST) #else if (frame->cpu.aim.dsisr & DSISR_STORE) #endif ftype = VM_PROT_WRITE; else ftype = VM_PROT_READ; } if (user) { KASSERT(p->p_vmspace != NULL, ("trap_pfault: vmspace NULL")); map = &p->p_vmspace->vm_map; } else { rv = pmap_decode_kernel_ptr(eva, &is_user, &eva); if (rv != 0) return (SIGSEGV); if (is_user) map = &p->p_vmspace->vm_map; else map = kernel_map; } va = trunc_page(eva); /* Fault in the page. */ rv = vm_fault(map, va, ftype, VM_FAULT_NORMAL); /* * XXXDTRACE: add dtrace_doubletrap_func here? */ if (rv == KERN_SUCCESS) return (0); if (!user && handle_onfault(frame)) return (0); return (SIGSEGV); } /* * For now, this only deals with the particular unaligned access case * that gcc tends to generate. Eventually it should handle all of the * possibilities that can happen on a 32-bit PowerPC in big-endian mode. */ static int fix_unaligned(struct thread *td, struct trapframe *frame) { struct thread *fputhread; #ifdef __SPE__ uint32_t inst; #endif int indicator, reg; double *fpr; #ifdef __SPE__ indicator = (frame->cpu.booke.esr & (ESR_ST|ESR_SPE)); if (indicator & ESR_SPE) { if (copyin((void *)frame->srr0, &inst, sizeof(inst)) != 0) return (-1); reg = EXC_ALI_SPE_REG(inst); fpr = (double *)td->td_pcb->pcb_vec.vr[reg]; fputhread = PCPU_GET(vecthread); /* Juggle the SPE to ensure that we've initialized * the registers, and that their current state is in * the PCB. */ if (fputhread != td) { if (fputhread) save_vec(fputhread); enable_vec(td); } save_vec(td); if (!(indicator & ESR_ST)) { if (copyin((void *)frame->dar, fpr, sizeof(double)) != 0) return (-1); frame->fixreg[reg] = td->td_pcb->pcb_vec.vr[reg][1]; enable_vec(td); } else { td->td_pcb->pcb_vec.vr[reg][1] = frame->fixreg[reg]; if (copyout(fpr, (void *)frame->dar, sizeof(double)) != 0) return (-1); } return (0); } #else indicator = EXC_ALI_OPCODE_INDICATOR(frame->cpu.aim.dsisr); switch (indicator) { case EXC_ALI_LFD: case EXC_ALI_STFD: reg = EXC_ALI_RST(frame->cpu.aim.dsisr); fpr = &td->td_pcb->pcb_fpu.fpr[reg].fpr; fputhread = PCPU_GET(fputhread); /* Juggle the FPU to ensure that we've initialized * the FPRs, and that their current state is in * the PCB. */ if (fputhread != td) { if (fputhread) save_fpu(fputhread); enable_fpu(td); } save_fpu(td); if (indicator == EXC_ALI_LFD) { if (copyin((void *)frame->dar, fpr, sizeof(double)) != 0) return (-1); enable_fpu(td); } else { if (copyout(fpr, (void *)frame->dar, sizeof(double)) != 0) return (-1); } return (0); break; } #endif return (-1); } #ifdef KDB int db_trap_glue(struct trapframe *frame) { if (!(frame->srr1 & PSL_PR) && (frame->exc == EXC_TRC || frame->exc == EXC_RUNMODETRC || frame_is_trap_inst(frame) || frame->exc == EXC_BPT || frame->exc == EXC_DEBUG || frame->exc == EXC_DSI)) { int type = frame->exc; /* Ignore DTrace traps. 
*/ if (*(uint32_t *)frame->srr0 == EXC_DTRACE) return (0); if (frame_is_trap_inst(frame)) { type = T_BREAKPOINT; } return (kdb_trap(type, 0, frame)); } return (0); } #endif Index: user/markj/netdump/tests/sys/acl/run =================================================================== --- user/markj/netdump/tests/sys/acl/run (revision 332407) +++ user/markj/netdump/tests/sys/acl/run (revision 332408) @@ -1,329 +1,329 @@ #!/usr/bin/perl -w -U # Copyright (c) 2007, 2008 Andreas Gruenbacher. # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions, and the following disclaimer, # without modification, immediately at the beginning of the file. # 2. The name of the author may not be used to endorse or promote products # derived from this software without specific prior written permission. # # Alternatively, this software may be distributed under the terms of the # GNU Public License ("GPL"). # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE # ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR # ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL # DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS # OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) # HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY # OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. # # $FreeBSD$ # # # Possible improvements: # # - distinguish stdout and stderr output # - add environment variable like assignments # - run up to a specific line # - resume at a specific line # use strict; use FileHandle; use Getopt::Std; use POSIX qw(isatty setuid getcwd); use vars qw($opt_l $opt_v); no warnings qw(taint); $opt_l = ~0; # a really huge number getopts('l:v'); my ($OK, $FAILED) = ("ok", "failed"); if (isatty(fileno(STDOUT))) { $OK = "\033[32m" . $OK . "\033[m"; $FAILED = "\033[31m\033[1m" . $FAILED . "\033[m"; } sub exec_test($$); sub process_test($$$$); my ($prog, $in, $out) = ([], [], []); my $prog_line = 0; my ($tests, $failed) = (0,0); my $lineno; my $width = ($ENV{COLUMNS} || 80) >> 1; for (;;) { my $line = <>; $lineno++; if (defined $line) { # Substitute %VAR and %{VAR} with environment variables. $line =~ s[%(\w+)][$ENV{$1}]eg; - $line =~ s[%{(\w+)}][$ENV{$1}]eg; + $line =~ s[%\{(\w+)\}][$ENV{$1}]eg; } if (defined $line) { if ($line =~ s/^\s*< ?//) { push @$in, $line; } elsif ($line =~ s/^\s*> ?//) { push @$out, $line; } else { process_test($prog, $prog_line, $in, $out); last if $prog_line >= $opt_l; $prog = []; $prog_line = 0; } if ($line =~ s/^\s*\$ ?//) { $prog = [ map { s/\\(.)/$1/g; $_ } split /(? @$result) ? @$out : @$result; for (my $n=0; $n < $nmax; $n++) { my $use_re; if (defined $out->[$n] && $out->[$n] =~ /^~ /) { $use_re = 1; $out->[$n] =~ s/^~ //g; } if (!defined($out->[$n]) || !defined($result->[$n]) || (!$use_re && $result->[$n] ne $out->[$n]) || ( $use_re && $result->[$n] !~ /^$out->[$n]/)) { push @good, ($use_re ? '!~' : '!='); } else { push @good, ($use_re ? 
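# The test scripts driven by this comparison look like the following
# (an illustrative stanza, not taken from a real test):
#
#	$ touch f
#	$ ls f
#	> f
#	$ ls -dl f
#	> ~ -rw-r--r-- .*
#
# "$ " lines run a command, "< " lines feed its standard input, "> "
# lines give the expected output, and a leading "~ " on an expected
# line turns the literal comparison into a regex prefix match.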
'=~' : '=='); } } my $good = !(grep /!/, @good); $tests++; $failed++ unless $good; print $good ? $OK : $FAILED, "\n"; if (!$good || $opt_v) { for (my $n=0; $n < $nmax; $n++) { my $l = defined($out->[$n]) ? $out->[$n] : "~"; chomp $l; my $r = defined($result->[$n]) ? $result->[$n] : "~"; chomp $r; print sprintf("%-" . ($width-3) . "s %s %s\n", $r, $good[$n], $l); } } } sub su($) { my ($user) = @_; $user ||= "root"; my ($login, $pass, $uid, $gid) = getpwnam($user) or return [ "su: user $user does not exist\n" ]; my @groups = (); my $fh = new FileHandle("/etc/group") or return [ "opening /etc/group: $!\n" ]; while (<$fh>) { chomp; my ($group, $passwd, $gid, $users) = split /:/; foreach my $u (split /,/, $users) { push @groups, $gid if ($user eq $u); } } $fh->close; my $groups = join(" ", ($gid, $gid, @groups)); #print STDERR "[[$groups]]\n"; $! = 0; # reset errno $> = 0; $( = $gid; $) = $groups; if ($!) { return [ "su: $!\n" ]; } if ($uid != 0) { $> = $uid; #$< = $uid; if ($!) { return [ "su: $prog->[1]: $!\n" ]; } } #print STDERR "[($>,$<)($(,$))]"; return []; } sub sg($) { my ($group) = @_; my $gid = getgrnam($group) or return [ "sg: group $group does not exist\n" ]; my %groups = map { $_ eq $gid ? () : ($_ => 1) } (split /\s/, $)); #print STDERR "<<", join("/", keys %groups), ">>\n"; my $groups = join(" ", ($gid, $gid, keys %groups)); #print STDERR "[[$groups]]\n"; $! = 0; # reset errno if ($> != 0) { my $uid = $>; $> = 0; $( = $gid; $) = $groups; $> = $uid; } else { $( = $gid; $) = $groups; } if ($!) { return [ "sg: $!\n" ]; } print STDERR "[($>,$<)($(,$))]"; return []; } sub exec_test($$) { my ($prog, $in) = @_; local (*IN, *IN_DUP, *IN2, *OUT_DUP, *OUT, *OUT2); my $needs_shell = (join('', @$prog) =~ /[][|<>"'`\$\*\?]/); if ($prog->[0] eq "umask") { umask oct $prog->[1]; return []; } elsif ($prog->[0] eq "cd") { if (!chdir $prog->[1]) { return [ "chdir: $prog->[1]: $!\n" ]; } $ENV{PWD} = getcwd; return []; } elsif ($prog->[0] eq "su") { return su($prog->[1]); } elsif ($prog->[0] eq "sg") { return sg($prog->[1]); } elsif ($prog->[0] eq "export") { my ($name, $value) = split /=/, $prog->[1]; # FIXME: need to evaluate $value, so that things like this will work: # export dir=$PWD/dir $ENV{$name} = $value; return []; } elsif ($prog->[0] eq "unset") { delete $ENV{$prog->[1]}; return []; } pipe *IN2, *OUT or die "Can't create pipe for reading: $!"; open *IN_DUP, "<&STDIN" or *IN_DUP = undef; open *STDIN, "<&IN2" or die "Can't duplicate pipe for reading: $!"; close *IN2; open *OUT_DUP, ">&STDOUT" or die "Can't duplicate STDOUT: $!"; pipe *IN, *OUT2 or die "Can't create pipe for writing: $!"; open *STDOUT, ">&OUT2" or die "Can't duplicate pipe for writing: $!"; close *OUT2; *STDOUT->autoflush(); *OUT->autoflush(); $SIG{CHLD} = 'IGNORE'; if (fork()) { # Server if (*IN_DUP) { open *STDIN, "<&IN_DUP" or die "Can't duplicate STDIN: $!"; close *IN_DUP or die "Can't close STDIN duplicate: $!"; } open *STDOUT, ">&OUT_DUP" or die "Can't duplicate STDOUT: $!"; close *OUT_DUP or die "Can't close STDOUT duplicate: $!"; foreach my $line (@$in) { #print "> $line"; print OUT $line; } close *OUT or die "Can't close pipe for writing: $!"; my $result = []; while (<IN>) { #print "< $_"; if ($needs_shell) { s#^/bin/sh: line \d+: ##; } push @$result, $_; } return $result; } else { # Client $< = $>; close IN or die "Can't close read end for input pipe: $!"; close OUT or die "Can't close write end for output pipe: $!"; close OUT_DUP or die "Can't close STDOUT duplicate: $!"; local *ERR_DUP; open ERR_DUP, ">&STDERR" or die 
"Can't duplicate STDERR: $!"; open STDERR, ">&STDOUT" or die "Can't join STDOUT and STDERR: $!"; if ($needs_shell) { exec ('/bin/sh', '-c', join(" ", @$prog)); } else { exec @$prog; } print STDERR $prog->[0], ": $!\n"; exit; } } Index: user/markj/netdump/tests/sys/netpfil/Makefile =================================================================== --- user/markj/netdump/tests/sys/netpfil/Makefile (revision 332407) +++ user/markj/netdump/tests/sys/netpfil/Makefile (revision 332408) @@ -1,7 +1,11 @@ # $FreeBSD$ +.include + TESTSDIR= ${TESTSBASE}/sys/netpfil +.if ${MK_PF} != "no" TESTS_SUBDIRS+= pf +.endif .include Index: user/markj/netdump/usr.bin/head/head.1 =================================================================== --- user/markj/netdump/usr.bin/head/head.1 (revision 332407) +++ user/markj/netdump/usr.bin/head/head.1 (revision 332408) @@ -1,78 +1,90 @@ .\" Copyright (c) 1980, 1990, 1993 .\" The Regents of the University of California. All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" 3. Neither the name of the University nor the names of its contributors .\" may be used to endorse or promote products derived from this software .\" without specific prior written permission. .\" .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" @(#)head.1 8.1 (Berkeley) 6/6/93 .\" $FreeBSD$ .\" -.Dd March 16, 2013 +.Dd April 10, 2018 .Dt HEAD 1 .Os .Sh NAME .Nm head .Nd display first lines of a file .Sh SYNOPSIS .Nm .Op Fl n Ar count | Fl c Ar bytes .Op Ar .Sh DESCRIPTION This filter displays the first .Ar count lines or .Ar bytes of each of the specified files, or of the standard input if no files are specified. If .Ar count is omitted it defaults to 10. +.Pp +The following options are available: +.Bl -tag -width indent +.It Fl c Ar bytes , Fl -bytes Ns = Ns Ar bytes +Print +.Ar bytes +of each of the specified files. +.It Fl n Ar count , Fl -lines Ns = Ns Ar count +Print +.Ar count +lines of each of the specified files. +.El .Pp If more than a single file is specified, each file is preceded by a header consisting of the string .Dq ==> XXX <== where .Dq XXX is the name of the file. 
.Sh EXIT STATUS .Ex -std .Sh EXAMPLES To display the first 500 lines of the file .Ar foo : .Pp .Dl $ head -n 500 foo .Pp .Nm can be used in conjunction with .Xr tail 1 in the following way to, for example, display only line 500 from the file .Ar foo : .Pp .Dl $ head -n 500 foo | tail -n 1 .Sh SEE ALSO .Xr tail 1 .Sh HISTORY The .Nm command appeared in PWB UNIX. Index: user/markj/netdump/usr.bin/head/head.c =================================================================== --- user/markj/netdump/usr.bin/head/head.c (revision 332407) +++ user/markj/netdump/usr.bin/head/head.c (revision 332408) @@ -1,184 +1,192 @@ /* * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1980, 1987, 1992, 1993 * The Regents of the University of California. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #ifndef lint static const char copyright[] = "@(#) Copyright (c) 1980, 1987, 1992, 1993\n\ The Regents of the University of California. 
All rights reserved.\n"; #endif /* not lint */ #ifndef lint #if 0 static char sccsid[] = "@(#)head.c 8.2 (Berkeley) 5/4/95"; #endif #endif /* not lint */ #include __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include #include #include #include /* * head - give the first few lines of a stream or of each of a set of files * * Bill Joy UCB August 24, 1977 */ static void head(FILE *, int); static void head_bytes(FILE *, off_t); static void obsolete(char *[]); static void usage(void); +static const struct option long_opts[] = +{ + {"bytes", required_argument, NULL, 'c'}, + {"lines", required_argument, NULL, 'n'}, + {NULL, no_argument, NULL, 0} +}; + int main(int argc, char *argv[]) { int ch; FILE *fp; int first, linecnt = -1, eval = 0; off_t bytecnt = -1; char *ep; obsolete(argv); - while ((ch = getopt(argc, argv, "n:c:")) != -1) + while ((ch = getopt_long(argc, argv, "+n:c:", long_opts, NULL)) != -1) switch(ch) { case 'c': bytecnt = strtoimax(optarg, &ep, 10); if (*ep || bytecnt <= 0) errx(1, "illegal byte count -- %s", optarg); break; case 'n': linecnt = strtol(optarg, &ep, 10); if (*ep || linecnt <= 0) errx(1, "illegal line count -- %s", optarg); break; case '?': default: usage(); } argc -= optind; argv += optind; if (linecnt != -1 && bytecnt != -1) errx(1, "can't combine line and byte counts"); if (linecnt == -1 ) linecnt = 10; if (*argv) { for (first = 1; *argv; ++argv) { if ((fp = fopen(*argv, "r")) == NULL) { warn("%s", *argv); eval = 1; continue; } if (argc > 1) { (void)printf("%s==> %s <==\n", first ? "" : "\n", *argv); first = 0; } if (bytecnt == -1) head(fp, linecnt); else head_bytes(fp, bytecnt); (void)fclose(fp); } } else if (bytecnt == -1) head(stdin, linecnt); else head_bytes(stdin, bytecnt); exit(eval); } static void head(FILE *fp, int cnt) { char *cp; size_t error, readlen; while (cnt && (cp = fgetln(fp, &readlen)) != NULL) { error = fwrite(cp, sizeof(char), readlen, stdout); if (error != readlen) err(1, "stdout"); cnt--; } } static void head_bytes(FILE *fp, off_t cnt) { char buf[4096]; size_t readlen; while (cnt) { if ((uintmax_t)cnt < sizeof(buf)) readlen = cnt; else readlen = sizeof(buf); readlen = fread(buf, sizeof(char), readlen, fp); if (readlen == 0) break; if (fwrite(buf, sizeof(char), readlen, stdout) != readlen) err(1, "stdout"); cnt -= readlen; } } static void obsolete(char *argv[]) { char *ap; while ((ap = *++argv)) { /* Return if "--" or not "-[0-9]*". */ if (ap[0] != '-' || ap[1] == '-' || !isdigit(ap[1])) return; if ((ap = malloc(strlen(*argv) + 2)) == NULL) err(1, NULL); ap[0] = '-'; ap[1] = 'n'; (void)strcpy(ap + 2, *argv + 1); *argv = ap; } } static void usage(void) { (void)fprintf(stderr, "usage: head [-n lines | -c bytes] [file ...]\n"); exit(1); } Index: user/markj/netdump/usr.bin/systat/sctp.c =================================================================== --- user/markj/netdump/usr.bin/systat/sctp.c (revision 332407) +++ user/markj/netdump/usr.bin/systat/sctp.c (revision 332408) @@ -1,360 +1,357 @@ /*- * Copyright (c) 2015 * The Regents of the University of California. All rights reserved. * Michael Tuexen. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include "systat.h" #include "extern.h" #include "mode.h" static struct sctpstat curstat, initstat, oldstat; /*- --0 1 2 3 4 5 6 7 --0123456789012345678901234567890123456789012345678901234567890123456789012345 00 SCTP Associations SCTP Packets 01999999999999 associations initiated 999999999999 packets sent 02999999999999 associations accepted 999999999999 packets received -03999999999999 associations established 999999999999 - out of the blue -04999999999999 associations restarted 999999999999 - bad vtag -05999999999999 associations terminated 999999999999 - bad crc32c -06999999999999 associations aborted -07 -08 SCTP Timers SCTP Chunks -09999999999999 init timeouts 999999999999 control chunks sent -10999999999999 cookie timeouts 999999999999 data chunks sent -11999999999999 data timeouts 999999999999 - ordered -12999999999999 delayed sack timeouts 999999999999 - unordered -13999999999999 shutdown timeouts 999999999999 control chunks received -14999999999999 shutdown-ack timeouts 999999999999 data chunks received -15999999999999 shutdown guard timeouts 999999999999 - ordered -16999999999999 heartbeat timeouts 999999999999 - unordered -17999999999999 path MTU timeouts -18999999999999 autoclose timeouts SCTP user messages -19999999999999 asconf timeouts 999999999999 fragmented -20999999999999 stream reset timeouts 999999999999 reassembled +03999999999999 associations restarted 999999999999 - out of the blue +04999999999999 associations terminated 999999999999 - bad vtag +05999999999999 associations aborted 999999999999 - bad crc32c +06 +07 SCTP Timers SCTP Chunks +08999999999999 init timeouts 999999999999 control chunks sent +09999999999999 cookie timeouts 999999999999 data chunks sent +10999999999999 data timeouts 999999999999 - ordered +11999999999999 delayed sack timeouts 999999999999 - unordered +12999999999999 shutdown timeouts 999999999999 control chunks received +13999999999999 shutdown-ack timeouts 999999999999 data chunks received +14999999999999 shutdown guard timeouts 999999999999 - ordered +15999999999999 heartbeat timeouts 999999999999 - unordered +16999999999999 path MTU timeouts +17999999999999 autoclose timeouts SCTP user messages +18999999999999 asconf timeouts 999999999999 fragmented +19999999999999 stream reset timeouts 999999999999 
reassembled --0123456789012345678901234567890123456789012345678901234567890123456789012345 --0 1 2 3 4 5 6 7 */ WINDOW * opensctp(void) { return (subwin(stdscr, LINES-3-1, 0, MAINWIN_ROW, 0)); } void closesctp(WINDOW *w) { if (w != NULL) { wclear(w); wrefresh(w); delwin(w); } } void labelsctp(void) { wmove(wnd, 0, 0); wclrtoeol(wnd); #define L(row, str) mvwprintw(wnd, row, 13, str) #define R(row, str) mvwprintw(wnd, row, 51, str); L(0, "SCTP Associations"); R(0, "SCTP Packets"); L(1, "associations initiated"); R(1, "packets sent"); L(2, "associations accepted"); R(2, "packets received"); - L(3, "associations established"); R(3, "- out of the blue"); - L(4, "associations restarted"); R(4, "- bad vtag"); - L(5, "associations terminated"); R(5, "- bad crc32c"); - L(6, "associations aborted"); + L(3, "associations restarted"); R(3, "- out of the blue"); + L(4, "associations terminated"); R(4, "- bad vtag"); + L(5, "associations aborted"); R(5, "- bad crc32c"); - L(8, "SCTP Timers"); R(8, "SCTP Chunks"); - L(9, "init timeouts"); R(9, "control chunks sent"); - L(10, "cookie timeouts"); R(10, "data chunks sent"); - L(11, "data timeouts"); R(11, "- ordered"); - L(12, "delayed sack timeouts"); R(12, "- unordered"); - L(13, "shutdown timeouts"); R(13, "control chunks received"); - L(14, "shutdown-ack timeouts"); R(14, "data chunks received"); - L(15, "shutdown guard timeouts"); R(15, "- ordered"); - L(16, "heartbeat timeouts"); R(16, "- unordered"); - L(17, "path MTU timeouts"); - L(18, "autoclose timeouts"); R(18, "SCTP User Messages"); - L(19, "asconf timeouts"); R(19, "fragmented"); - L(20, "stream reset timeouts"); R(20, "reassembled"); + L(7, "SCTP Timers"); R(7, "SCTP Chunks"); + L(8, "init timeouts"); R(8, "control chunks sent"); + L(9, "cookie timeouts"); R(9, "data chunks sent"); + L(10, "data timeouts"); R(10, "- ordered"); + L(11, "delayed sack timeouts"); R(11, "- unordered"); + L(12, "shutdown timeouts"); R(12, "control chunks received"); + L(13, "shutdown-ack timeouts"); R(13, "data chunks received"); + L(14, "shutdown guard timeouts"); R(14, "- ordered"); + L(15, "heartbeat timeouts"); R(15, "- unordered"); + L(16, "path MTU timeouts"); + L(17, "autoclose timeouts"); R(17, "SCTP User Messages"); + L(18, "asconf timeouts"); R(18, "fragmented"); + L(19, "stream reset timeouts"); R(19, "reassembled"); #undef L #undef R } static void domode(struct sctpstat *ret) { const struct sctpstat *sub; int divisor = 1; switch(currentmode) { case display_RATE: sub = &oldstat; divisor = (delay > 1000000) ? 
delay / 1000000 : 1; break; case display_DELTA: sub = &oldstat; break; case display_SINCE: sub = &initstat; break; default: *ret = curstat; return; } #define DO(stat) ret->stat = (curstat.stat - sub->stat) / divisor DO(sctps_currestab); DO(sctps_activeestab); DO(sctps_restartestab); DO(sctps_collisionestab); DO(sctps_passiveestab); DO(sctps_aborted); DO(sctps_shutdown); DO(sctps_outoftheblue); DO(sctps_checksumerrors); DO(sctps_outcontrolchunks); DO(sctps_outorderchunks); DO(sctps_outunorderchunks); DO(sctps_incontrolchunks); DO(sctps_inorderchunks); DO(sctps_inunorderchunks); DO(sctps_fragusrmsgs); DO(sctps_reasmusrmsgs); DO(sctps_outpackets); DO(sctps_inpackets); DO(sctps_recvpackets); DO(sctps_recvdatagrams); DO(sctps_recvpktwithdata); DO(sctps_recvsacks); DO(sctps_recvdata); DO(sctps_recvdupdata); DO(sctps_recvheartbeat); DO(sctps_recvheartbeatack); DO(sctps_recvecne); DO(sctps_recvauth); DO(sctps_recvauthmissing); DO(sctps_recvivalhmacid); DO(sctps_recvivalkeyid); DO(sctps_recvauthfailed); DO(sctps_recvexpress); DO(sctps_recvexpressm); DO(sctps_recvswcrc); DO(sctps_recvhwcrc); DO(sctps_sendpackets); DO(sctps_sendsacks); DO(sctps_senddata); DO(sctps_sendretransdata); DO(sctps_sendfastretrans); DO(sctps_sendmultfastretrans); DO(sctps_sendheartbeat); DO(sctps_sendecne); DO(sctps_sendauth); DO(sctps_senderrors); DO(sctps_sendswcrc); DO(sctps_sendhwcrc); DO(sctps_pdrpfmbox); DO(sctps_pdrpfehos); DO(sctps_pdrpmbda); DO(sctps_pdrpmbct); DO(sctps_pdrpbwrpt); DO(sctps_pdrpcrupt); DO(sctps_pdrpnedat); DO(sctps_pdrppdbrk); DO(sctps_pdrptsnnf); DO(sctps_pdrpdnfnd); DO(sctps_pdrpdiwnp); DO(sctps_pdrpdizrw); DO(sctps_pdrpbadd); DO(sctps_pdrpmark); DO(sctps_timoiterator); DO(sctps_timodata); DO(sctps_timowindowprobe); DO(sctps_timoinit); DO(sctps_timosack); DO(sctps_timoshutdown); DO(sctps_timoheartbeat); DO(sctps_timocookie); DO(sctps_timosecret); DO(sctps_timopathmtu); DO(sctps_timoshutdownack); DO(sctps_timoshutdownguard); DO(sctps_timostrmrst); DO(sctps_timoearlyfr); DO(sctps_timoasconf); DO(sctps_timodelprim); DO(sctps_timoautoclose); DO(sctps_timoassockill); DO(sctps_timoinpkill); DO(sctps_hdrops); DO(sctps_badsum); DO(sctps_noport); DO(sctps_badvtag); DO(sctps_badsid); DO(sctps_nomem); DO(sctps_fastretransinrtt); DO(sctps_markedretrans); DO(sctps_naglesent); DO(sctps_naglequeued); DO(sctps_maxburstqueued); DO(sctps_ifnomemqueued); DO(sctps_windowprobed); DO(sctps_lowlevelerr); DO(sctps_lowlevelerrusr); DO(sctps_datadropchklmt); DO(sctps_datadroprwnd); DO(sctps_ecnereducedcwnd); DO(sctps_vtagexpress); DO(sctps_vtagbogus); DO(sctps_primary_randry); DO(sctps_cmt_randry); DO(sctps_slowpath_sack); DO(sctps_wu_sacks_sent); DO(sctps_sends_with_flags); DO(sctps_sends_with_unord); DO(sctps_sends_with_eof); DO(sctps_sends_with_abort); DO(sctps_protocol_drain_calls); DO(sctps_protocol_drains_done); DO(sctps_read_peeks); DO(sctps_cached_chk); DO(sctps_cached_strmoq); DO(sctps_left_abandon); DO(sctps_send_burst_avoid); DO(sctps_send_cwnd_avoid); DO(sctps_fwdtsn_map_over); DO(sctps_queue_upd_ecne); #undef DO } void showsctp(void) { struct sctpstat stats; memset(&stats, 0, sizeof stats); domode(&stats); #define DO(stat, row, col) \ mvwprintw(wnd, row, col, "%12lu", stats.stat) #define L(row, stat) DO(stat, row, 0) #define R(row, stat) DO(stat, row, 38) L(1, sctps_activeestab); R(1, sctps_outpackets); L(2, sctps_passiveestab); R(2, sctps_inpackets); - L(3, sctps_currestab); R(3, sctps_outoftheblue); - L(4, sctps_restartestab); R(4, sctps_badvtag); - L(5, sctps_shutdown); R(5, sctps_checksumerrors); - L(6, 
sctps_aborted); + L(3, sctps_restartestab); R(3, sctps_outoftheblue); + L(4, sctps_shutdown); R(4, sctps_badvtag); + L(5, sctps_aborted); R(5, sctps_checksumerrors); - L(9, sctps_timoinit); R(9, sctps_outcontrolchunks); - L(10, sctps_timocookie); R(10, sctps_senddata); - L(11, sctps_timodata); R(11, sctps_outorderchunks); - L(12, sctps_timosack); R(12, sctps_outunorderchunks); - L(13, sctps_timoshutdown); R(13, sctps_incontrolchunks); - L(14, sctps_timoshutdownack); R(14, sctps_recvdata); - L(15, sctps_timoshutdownguard); R(15, sctps_inorderchunks); - L(16, sctps_timoheartbeat); R(16, sctps_inunorderchunks); - L(17, sctps_timopathmtu); - L(18, sctps_timoautoclose); - L(19, sctps_timoasconf); R(19, sctps_fragusrmsgs); - L(20, sctps_timostrmrst); R(20, sctps_reasmusrmsgs); + L(8, sctps_timoinit); R(8, sctps_outcontrolchunks); + L(9, sctps_timocookie); R(9, sctps_senddata); + L(10, sctps_timodata); R(10, sctps_outorderchunks); + L(11, sctps_timosack); R(11, sctps_outunorderchunks); + L(12, sctps_timoshutdown); R(12, sctps_incontrolchunks); + L(13, sctps_timoshutdownack); R(13, sctps_recvdata); + L(14, sctps_timoshutdownguard); R(14, sctps_inorderchunks); + L(15, sctps_timoheartbeat); R(15, sctps_inunorderchunks); + L(16, sctps_timopathmtu); + L(17, sctps_timoautoclose); + L(18, sctps_timoasconf); R(18, sctps_fragusrmsgs); + L(19, sctps_timostrmrst); R(19, sctps_reasmusrmsgs); #undef DO #undef L #undef R } int initsctp(void) { size_t len; const char *name = "net.inet.sctp.stats"; len = 0; if (sysctlbyname(name, NULL, &len, NULL, 0) < 0) { error("sysctl getting sctpstat size failed"); return 0; } if (len > sizeof curstat) { error("sctpstat structure has grown--recompile systat!"); return 0; } if (sysctlbyname(name, &initstat, &len, NULL, 0) < 0) { error("sysctl getting sctpstat failed"); return 0; } oldstat = initstat; return 1; } void resetsctp(void) { size_t len; const char *name = "net.inet.sctp.stats"; len = sizeof initstat; if (sysctlbyname(name, &initstat, &len, NULL, 0) < 0) { error("sysctl getting sctpstat failed"); } oldstat = initstat; } void fetchsctp(void) { size_t len; const char *name = "net.inet.sctp.stats"; oldstat = curstat; len = sizeof curstat; if (sysctlbyname(name, &curstat, &len, NULL, 0) < 0) { error("sysctl getting sctpstat failed"); } return; } Index: user/markj/netdump/usr.bin/tail/tail.1 =================================================================== --- user/markj/netdump/usr.bin/tail/tail.1 (revision 332407) +++ user/markj/netdump/usr.bin/tail/tail.1 (revision 332408) @@ -1,202 +1,202 @@ .\" Copyright (c) 1980, 1990, 1991, 1993 .\" The Regents of the University of California. All rights reserved. .\" .\" This code is derived from software contributed to Berkeley by .\" the Institute of Electrical and Electronics Engineers, Inc. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" 3. Neither the name of the University nor the names of its contributors .\" may be used to endorse or promote products derived from this software .\" without specific prior written permission. 
.\" .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" @(#)tail.1 8.1 (Berkeley) 6/6/93 .\" $FreeBSD$ .\" -.Dd March 16, 2013 +.Dd April 10, 2018 .Dt TAIL 1 .Os .Sh NAME .Nm tail .Nd display the last part of a file .Sh SYNOPSIS .Nm .Op Fl F | f | r .Op Fl q .Oo .Fl b Ar number | Fl c Ar number | Fl n Ar number .Oc .Op Ar .Sh DESCRIPTION The .Nm utility displays the contents of .Ar file or, by default, its standard input, to the standard output. .Pp The display begins at a byte, line or 512-byte block location in the input. Numbers having a leading plus .Pq Ql + sign are relative to the beginning of the input, for example, .Dq Li "-c +2" starts the display at the second byte of the input. Numbers having a leading minus .Pq Ql - sign or no explicit sign are relative to the end of the input, for example, .Dq Li "-n 2" displays the last two lines of the input. The default starting location is .Dq Li "-n 10" , or the last 10 lines of the input. .Pp The options are as follows: .Bl -tag -width indent -.It Fl b Ar number +.It Fl b Ar number , Fl -blocks Ns = Ns Ar number The location is .Ar number 512-byte blocks. -.It Fl c Ar number +.It Fl c Ar number , Fl -bytes Ns = Ns Ar number The location is .Ar number bytes. .It Fl f The .Fl f option causes .Nm to not stop when end of file is reached, but rather to wait for additional data to be appended to the input. The .Fl f option is ignored if the standard input is a pipe, but not if it is a FIFO. .It Fl F The .Fl F option implies the .Fl f option, but .Nm will also check to see if the file being followed has been renamed or rotated. The file is closed and reopened when .Nm detects that the filename being read from has a new inode number. .Pp If the file being followed does not (yet) exist or if it is removed, tail will keep looking and will display the file from the beginning if and when it is created. .Pp The .Fl F option is the same as the .Fl f option if reading from standard input rather than a file. -.It Fl n Ar number +.It Fl n Ar number , Fl -lines Ns = Ns Ar number The location is .Ar number lines. .It Fl q Suppresses printing of headers when multiple files are being examined. .It Fl r The .Fl r option causes the input to be displayed in reverse order, by line. Additionally, this option changes the meaning of the .Fl b , c and .Fl n options. When the .Fl r option is specified, these options specify the number of bytes, lines or 512-byte blocks to display, instead of the bytes, lines or blocks from the beginning or end of the input from which to begin the display. The default for the .Fl r option is to display all of the input. 
.El .Pp If more than a single file is specified, each file is preceded by a header consisting of the string .Dq Li "==> " Ns Ar XXX Ns Li " <==" where .Ar XXX is the name of the file unless .Fl q flag is specified. .Sh EXIT STATUS .Ex -std .Sh EXAMPLES To display the last 500 lines of the file .Ar foo : .Pp .Dl $ tail -n 500 foo .Pp Keep .Pa /var/log/messages open, displaying to the standard output anything appended to the file: .Pp .Dl $ tail -f /var/log/messages .Sh SEE ALSO .Xr cat 1 , .Xr head 1 , .Xr sed 1 .Sh STANDARDS The .Nm utility is expected to be a superset of the .St -p1003.2-92 specification. In particular, the .Fl F , .Fl b and .Fl r options are extensions to that standard. .Pp The historic command line syntax of .Nm is supported by this implementation. The only difference between this implementation and historic versions of .Nm , once the command line syntax translation has been done, is that the .Fl b , .Fl c and .Fl n options modify the .Fl r option, i.e., .Dq Li "-r -c 4" displays the last 4 characters of the last line of the input, while the historic tail (using the historic syntax .Dq Li -4cr ) would ignore the .Fl c option and display the last 4 lines of the input. .Sh HISTORY A .Nm command appeared in PWB UNIX. Index: user/markj/netdump/usr.bin/tail/tail.c =================================================================== --- user/markj/netdump/usr.bin/tail/tail.c (revision 332407) +++ user/markj/netdump/usr.bin/tail/tail.c (revision 332408) @@ -1,338 +1,348 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1991, 1993 * The Regents of the University of California. All rights reserved. * * This code is derived from software contributed to Berkeley by * Edward Sze-Tyan Wang. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #ifndef lint static const char copyright[] = "@(#) Copyright (c) 1991, 1993\n\ The Regents of the University of California. 
All rights reserved.\n"; #endif #ifndef lint static const char sccsid[] = "@(#)tail.c 8.1 (Berkeley) 6/6/93"; #endif #include #include #include #include +#include #include #include #include #include #include "extern.h" int Fflag, fflag, qflag, rflag, rval, no_files; static file_info_t *files; static void obsolete(char **); static void usage(void); +static const struct option long_opts[] = +{ + {"blocks", required_argument, NULL, 'b'}, + {"bytes", required_argument, NULL, 'c'}, + {"lines", required_argument, NULL, 'n'}, + {NULL, no_argument, NULL, 0} +}; + int main(int argc, char *argv[]) { struct stat sb; const char *fn; FILE *fp; off_t off; enum STYLE style; int i, ch, first; file_info_t *file; char *p; /* * Tail's options are weird. First, -n10 is the same as -n-10, not * -n+10. Second, the number options are 1 based and not offsets, * so -n+1 is the first line, and -c-1 is the last byte. Third, the * number options for the -r option specify the number of things that * get displayed, not the starting point in the file. The one major * incompatibility in this version as compared to historical versions * is that the 'r' option couldn't be modified by the -lbc options, * i.e. it was always done in lines. This version treats -rc as a * number of characters in reverse order. Finally, the default for * -r is the entire file, not 10 lines. */ #define ARG(units, forward, backward) { \ if (style) \ usage(); \ off = strtoll(optarg, &p, 10) * (units); \ if (*p) \ errx(1, "illegal offset -- %s", optarg); \ switch(optarg[0]) { \ case '+': \ if (off) \ off -= (units); \ style = (forward); \ break; \ case '-': \ off = -off; \ /* FALLTHROUGH */ \ default: \ style = (backward); \ break; \ } \ } obsolete(argv); style = NOTSET; off = 0; - while ((ch = getopt(argc, argv, "Fb:c:fn:qr")) != -1) + while ((ch = getopt_long(argc, argv, "+Fb:c:fn:qr", long_opts, NULL)) != + -1) switch(ch) { case 'F': /* -F is superset of (and implies) -f */ Fflag = fflag = 1; break; case 'b': ARG(512, FBYTES, RBYTES); break; case 'c': ARG(1, FBYTES, RBYTES); break; case 'f': fflag = 1; break; case 'n': ARG(1, FLINES, RLINES); break; case 'q': qflag = 1; break; case 'r': rflag = 1; break; case '?': default: usage(); } argc -= optind; argv += optind; no_files = argc ? argc : 1; /* * If displaying in reverse, don't permit follow option, and convert * style values. */ if (rflag) { if (fflag) usage(); if (style == FBYTES) style = RBYTES; else if (style == FLINES) style = RLINES; } /* * If style not specified, the default is the whole file for -r, and * the last 10 lines if not -r. */ if (style == NOTSET) { if (rflag) { off = 0; style = REVERSE; } else { off = 10; style = RLINES; } } if (*argv && fflag) { files = (struct file_info *) malloc(no_files * sizeof(struct file_info)); if (!files) err(1, "Couldn't malloc space for file descriptors."); for (file = files; (fn = *argv++); file++) { file->file_name = strdup(fn); if (! 
file->file_name) errx(1, "Couldn't malloc space for file name."); if ((file->fp = fopen(file->file_name, "r")) == NULL || fstat(fileno(file->fp), &file->st)) { if (file->fp != NULL) { fclose(file->fp); file->fp = NULL; } if (!Fflag || errno != ENOENT) ierr(file->file_name); } } follow(files, style, off); for (i = 0, file = files; i < no_files; i++, file++) { free(file->file_name); } free(files); } else if (*argv) { for (first = 1; (fn = *argv++);) { if ((fp = fopen(fn, "r")) == NULL || fstat(fileno(fp), &sb)) { ierr(fn); continue; } if (argc > 1 && !qflag) { printfn(fn, !first); first = 0; } if (rflag) reverse(fp, fn, style, off, &sb); else forward(fp, fn, style, off, &sb); } } else { fn = "stdin"; if (fstat(fileno(stdin), &sb)) { ierr(fn); exit(1); } /* * Determine if input is a pipe. 4.4BSD will set the SOCKET * bit in the st_mode field for pipes. Fix this then. */ if (lseek(fileno(stdin), (off_t)0, SEEK_CUR) == -1 && errno == ESPIPE) { errno = 0; fflag = 0; /* POSIX.2 requires this. */ } if (rflag) reverse(stdin, fn, style, off, &sb); else forward(stdin, fn, style, off, &sb); } exit(rval); } /* * Convert the obsolete argument form into something that getopt can handle. * This means that anything of the form [+-][0-9][0-9]*[lbc][Ffr] that isn't * the option argument for a -b, -c or -n option gets converted. */ static void obsolete(char *argv[]) { char *ap, *p, *t; size_t len; char *start; while ((ap = *++argv)) { /* Return if "--" or not an option of any form. */ if (ap[0] != '-') { if (ap[0] != '+') return; } else if (ap[1] == '-') return; switch(*++ap) { /* Old-style option. */ case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': /* Malloc space for dash, new option and argument. */ len = strlen(*argv); if ((start = p = malloc(len + 3)) == NULL) err(1, "malloc"); *p++ = '-'; /* * Go to the end of the option argument. Save off any * trailing options (-3lf) and translate any trailing * output style characters. */ t = *argv + len - 1; if (*t == 'F' || *t == 'f' || *t == 'r') { *p++ = *t; *t-- = '\0'; } switch(*t) { case 'b': *p++ = 'b'; *t = '\0'; break; case 'c': *p++ = 'c'; *t = '\0'; break; case 'l': *t = '\0'; /* FALLTHROUGH */ case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': *p++ = 'n'; break; default: errx(1, "illegal option -- %s", *argv); } *p++ = *argv[0]; (void)strcpy(p, ap); *argv = start; continue; /* * Options w/ arguments, skip the argument and continue * with the next option. */ case 'b': case 'c': case 'n': if (!ap[1]) ++argv; /* FALLTHROUGH */ /* Options w/o arguments, continue with the next option. */ case 'F': case 'f': case 'r': continue; /* Illegal option, return and let getopt handle it. */ default: return; } } } static void usage(void) { (void)fprintf(stderr, "usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #]" " [file ...]\n"); exit(1); } Index: user/markj/netdump/usr.sbin/ctld/ctl.conf.5 =================================================================== --- user/markj/netdump/usr.sbin/ctld/ctl.conf.5 (revision 332407) +++ user/markj/netdump/usr.sbin/ctld/ctl.conf.5 (revision 332408) @@ -1,587 +1,587 @@ .\" Copyright (c) 2012 The FreeBSD Foundation .\" Copyright (c) 2015 Alexander Motin .\" All rights reserved. .\" .\" This software was developed by Edward Tomasz Napierala under sponsorship .\" from the FreeBSD Foundation. 
.\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" .Dd July 21, 2016 .Dt CTL.CONF 5 .Os .Sh NAME .Nm ctl.conf .Nd CAM Target Layer / iSCSI target daemon configuration file .Sh DESCRIPTION The .Nm configuration file is used by the .Xr ctld 8 daemon. Lines starting with .Ql # are interpreted as comments. The general syntax of the .Nm file is: .Bd -literal -offset indent .No pidfile Ar path .No auth-group Ar name No { .Dl chap Ar user Ar secret .Dl ... } .No portal-group Ar name No { .Dl listen Ar address .\".Dl listen-iser Ar address .Dl discovery-auth-group Ar name .Dl ... } .No target Ar name { .Dl auth-group Ar name .Dl portal-group Ar name .Dl lun Ar number No { .Dl path Ar path .Dl } .Dl ... } .Ed .Ss Global Context .Bl -tag -width indent .It Ic auth-group Ar name Create an .Sy auth-group configuration context, defining a new auth-group, which can then be assigned to any number of targets. .It Ic debug Ar level The debug verbosity level. The default is 0. .It Ic maxproc Ar number The limit for concurrently running child processes handling incoming connections. The default is 30. A setting of 0 disables the limit. .It Ic pidfile Ar path The path to the pidfile. The default is .Pa /var/run/ctld.pid . .It Ic portal-group Ar name Create a .Sy portal-group configuration context, defining a new portal-group, which can then be assigned to any number of targets. .It Ic lun Ar name Create a .Sy lun configuration context, defining a LUN to be exported by any number of targets. .It Ic target Ar name Create a .Sy target configuration context, which can optionally contain one or more .Sy lun contexts. .It Ic timeout Ar seconds The timeout for login sessions, after which the connection will be forcibly terminated. The default is 60. A setting of 0 disables the timeout. .It Ic isns-server Ar address An IPv4 or IPv6 address and optionally port of iSNS server to register on. .It Ic isns-period Ar seconds iSNS registration period. Registered Network Entity not updated during this period will be unregistered. The default is 900. .It Ic isns-timeout Ar seconds Timeout for iSNS requests. The default is 5. .El .Ss auth-group Context .Bl -tag -width indent .It Ic auth-type Ar type Sets the authentication type. 
Type can be either .Qq Ar none , .Qq Ar deny , .Qq Ar chap , or .Qq Ar chap-mutual . In most cases it is not necessary to set the type using this clause; it is usually used to disable authentication for a given .Sy auth-group . .It Ic chap Ar user Ar secret A set of CHAP authentication credentials. Note that for any .Sy auth-group , the configuration may only contain either .Sy chap or .Sy chap-mutual entries; it is an error to mix them. .It Ic chap-mutual Ar user Ar secret Ar mutualuser Ar mutualsecret A set of mutual CHAP authentication credentials. Note that for any .Sy auth-group , the configuration may only contain either .Sy chap or .Sy chap-mutual entries; it is an error to mix them. .It Ic initiator-name Ar initiator-name An iSCSI initiator name. Only initiators with a name matching one of the defined names will be allowed to connect. If not defined, there will be no restrictions based on initiator name. .It Ic initiator-portal Ar address Ns Op / Ns Ar prefixlen An iSCSI initiator portal: an IPv4 or IPv6 address, optionally followed by a literal slash and a prefix length. Only initiators with an address matching one of the defined addresses will be allowed to connect. If not defined, there will be no restrictions based on initiator address. .El .Ss portal-group Context .Bl -tag -width indent .It Ic discovery-auth-group Ar name Assign a previously defined authentication group to the portal group, to be used for target discovery. By default, portal groups are assigned predefined .Sy auth-group .Qq Ar default , which denies discovery. Another predefined .Sy auth-group , .Qq Ar no-authentication , may be used to permit discovery without authentication. .It Ic discovery-filter Ar filter Determines which targets are returned during discovery. Filter can be either .Qq Ar none , .Qq Ar portal , .Qq Ar portal-name , or .Qq Ar portal-name-auth . When set to .Qq Ar none , discovery will return all targets assigned to that portal group. When set to .Qq Ar portal , discovery will not return targets that cannot be accessed by the initiator because of their .Sy initiator-portal . When set to .Qq Ar portal-name , the check will include both .Sy initiator-portal and .Sy initiator-name . When set to .Qq Ar portal-name-auth , the check will include .Sy initiator-portal , .Sy initiator-name , and authentication credentials. The target is returned if it does not require CHAP authentication, or if the CHAP user and secret used during discovery match those used by the target. Note that when using .Qq Ar portal-name-auth , targets that require CHAP authentication will only be returned if .Sy discovery-auth-group requires CHAP. The default is .Qq Ar none . .It Ic listen Ar address An IPv4 or IPv6 address and port to listen on for incoming connections. .\".It Ic listen-iser Ar address .\"An IPv4 or IPv6 address and port to listen on for incoming connections .\"using iSER (iSCSI over RDMA) protocol. .It Ic offload Ar driver Define iSCSI hardware offload driver to use for this .Sy portal-group . The default is .Qq Ar none . .It Ic option Ar name Ar value The CTL-specific port options passed to the kernel. .It Ic redirect Ar address IPv4 or IPv6 address to redirect initiators to. When configured, all initiators attempting to connect to portal belonging to this .Sy portal-group will get redirected using "Target moved temporarily" login response. Redirection happens before authentication and any .Sy initiator-name or .Sy initiator-portal checks are skipped. 
.It Ic tag Ar value Unique 16-bit tag value of this .Sy portal-group . If not specified, the value is generated automatically. .It Ic foreign Specifies that this .Sy portal-group is listened by some other host. This host will announce it on discovery stage, but won't listen. .El .Ss target Context .Bl -tag -width indent .It Ic alias Ar text Assign a human-readable description to the target. There is no default. .It Ic auth-group Ar name Assign a previously defined authentication group to the target. By default, targets that do not specify their own auth settings, using clauses such as .Sy chap or .Sy initiator-name , are assigned predefined .Sy auth-group .Qq Ar default , which denies all access. Another predefined .Sy auth-group , .Qq Ar no-authentication , may be used to permit access without authentication. Note that this clause can be overridden using the second argument to a .Sy portal-group clause. .It Ic auth-type Ar type Sets the authentication type. Type can be either .Qq Ar none , .Qq Ar deny , .Qq Ar chap , or .Qq Ar chap-mutual . In most cases it is not necessary to set the type using this clause; it is usually used to disable authentication for a given .Sy target . This clause is mutually exclusive with .Sy auth-group ; one cannot use both in a single target. .It Ic chap Ar user Ar secret A set of CHAP authentication credentials. Note that targets must only use one of .Sy auth-group , chap , No or Sy chap-mutual ; it is a configuration error to mix multiple types in one target. .It Ic chap-mutual Ar user Ar secret Ar mutualuser Ar mutualsecret A set of mutual CHAP authentication credentials. Note that targets must only use one of .Sy auth-group , chap , No or Sy chap-mutual ; it is a configuration error to mix multiple types in one target. .It Ic initiator-name Ar initiator-name An iSCSI initiator name. Only initiators with a name matching one of the defined names will be allowed to connect. If not defined, there will be no restrictions based on initiator name. This clause is mutually exclusive with .Sy auth-group ; one cannot use both in a single target. .It Ic initiator-portal Ar address Ns Op / Ns Ar prefixlen An iSCSI initiator portal: an IPv4 or IPv6 address, optionally followed by a literal slash and a prefix length. Only initiators with an address matching one of the defined addresses will be allowed to connect. If not defined, there will be no restrictions based on initiator address. This clause is mutually exclusive with .Sy auth-group ; one cannot use both in a single target. .Pp The .Sy auth-type , .Sy chap , .Sy chap-mutual , .Sy initiator-name , and .Sy initiator-portal clauses in the target context provide an alternative to assigning an .Sy auth-group defined separately, useful in the common case of authentication settings specific to a single target. .It Ic portal-group Ar name Op Ar ag-name Assign a previously defined portal group to the target. The default portal group is .Qq Ar default , which makes the target available on TCP port 3260 on all configured IPv4 and IPv6 addresses. Optional second argument specifies .Sy auth-group for connections to this specific portal group. If second argument is not specified, target .Sy auth-group is used. .It Ic port Ar name .It Ic port Ar name/pp .It Ic port Ar name/pp/vp Assign specified CTL port (such as "isp0" or "isp2/1") to the target. This is used to export the target through a specific physical - eg Fibre Channel - port, in addition to portal-groups configured for the target. 
Use .Cm "ctladm portlist" command to retrieve the list of available ports. On startup .Xr ctld 8 configures LUN mapping and enables all assigned ports. Each port can be assigned to only one target. .It Ic redirect Ar address IPv4 or IPv6 address to redirect initiators to. When configured, all initiators attempting to connect to this target will get redirected using "Target moved temporarily" login response. Redirection happens after successful authentication. .It Ic lun Ar number Ar name Export previously defined .Sy lun by the parent target. .It Ic lun Ar number Create a .Sy lun configuration context, defining a LUN exported by the parent target. .Pp This is an alternative to defining the LUN separately, useful in the common case of a LUN being exported by a single target. .El .Ss lun Context .Bl -tag -width indent .It Ic backend Ar block No | Ar ramdisk The CTL backend to use for a given LUN. Valid choices are .Qq Ar block and .Qq Ar ramdisk ; block is used for LUNs backed by files or disk device nodes; ramdisk is a bitsink device, used mostly for testing. The default backend is block. .It Ic blocksize Ar size The blocksize visible to the initiator. The default blocksize is 512 for disks, and 2048 for CD/DVDs. .It Ic ctl-lun Ar lun_id Global numeric identifier to use for a given LUN inside CTL. By default CTL allocates those IDs dynamically, but explicit specification may be needed for consistency in HA configurations. .It Ic device-id Ar string The SCSI Device Identification string presented to the initiator. .It Ic device-type Ar type Specify the SCSI device type to use when creating the LUN. Currently CTL supports Direct Access (type 0), Processor (type 3) and CD/DVD (type 5) LUNs. .It Ic option Ar name Ar value The CTL-specific options passed to the kernel. All CTL-specific options are documented in the .Sx OPTIONS section of .Xr ctladm 8 . .It Ic path Ar path The path to the file, device node, or .Xr zfs 8 volume used to back the LUN. For optimal performance, create the volume with the .Qq Ar volmode=dev property set. .It Ic serial Ar string The SCSI serial number presented to the initiator. .It Ic size Ar size The LUN size, in bytes. .El .Sh FILES .Bl -tag -width ".Pa /etc/ctl.conf" -compact .It Pa /etc/ctl.conf The default location of the .Xr ctld 8 configuration file. 
.El .Sh EXAMPLES .Bd -literal auth-group ag0 { chap-mutual "user" "secret" "mutualuser" "mutualsecret" chap-mutual "user2" "secret2" "mutualuser" "mutualsecret" initiator-portal 192.168.1.1/16 } auth-group ag1 { auth-type none initiator-name "iqn.2012-06.com.example:initiatorhost1" initiator-name "iqn.2012-06.com.example:initiatorhost2" initiator-portal 192.168.1.1/24 initiator-portal [2001:db8::de:ef] } portal-group pg0 { discovery-auth-group no-authentication listen 0.0.0.0:3260 listen [::]:3260 listen [fe80::be:ef]:3261 } target iqn.2012-06.com.example:target0 { alias "Example target" auth-group no-authentication lun 0 { path /dev/zvol/tank/example_0 blocksize 4096 size 4G } } lun example_1 { path /dev/zvol/tank/example_1 option naa 0x50015178f369f093 } target iqn.2012-06.com.example:target1 { auth-group ag0 portal-group pg0 lun 0 example_1 lun 1 { path /dev/zvol/tank/example_2 option vendor "FreeBSD" } } target naa.50015178f369f092 { port isp0 port isp1 lun 0 example_1 } .Ed .Pp An equivalent configuration in UCL format, for use with -.Fl u : +.Fl u : .Bd -literal auth-group { ag0 { chap-mutual = [ { user = "user" secret = "secretsecret" mutual-user = "mutualuser" mutual-secret = "mutualsecret" }, { user = "user2" secret = "secret2secret2" mutual-user = "mutualuser" mutual-secret = "mutualsecret" } ] } ag1 { auth-type = none initiator-name = [ "iqn.2012-06.com.example:initiatorhost1", "iqn.2012-06.com.example:initiatorhost2" ] initiator-portal = [192.168.1.1/24, "[2001:db8::de:ef]"] } } portal-group { pg0 { discovery-auth-group = no-authentication listen = [ 0.0.0.0:3260, "[::]:3260", "[fe80::be:ef]:3261" ] } } lun { example_0 { path = /dev/zvol/tank/example_0 blocksize = 4096 size = "4G" } example_1 { path = /dev/zvol/tank/example_1 options { naa = "0x50015178f369f093" } } example_2 { path = /dev/zvol/tank/example_2 options { vendor = "FreeBSD" } } } target { "iqn.2012-06.com.example:target0" { alias = "Example target" auth-group = no-authentication lun = [ { number = 0, name = example_0 }, ] } "iqn.2012-06.com.example:target1" { auth-group = ag0 portal-group { name = pg0 } lun = [ { number = 0, name = example_1 }, { number = 1, name = example_2 } ] } naa.50015178f369f092 { port = isp0 lun = [ { number = 0, name = example_1 } ] } } .Ed .Sh SEE ALSO .Xr ctl 4 , .Xr ctladm 8 , .Xr ctld 8 , .Xr zfs 8 .Sh AUTHORS The .Nm configuration file functionality for .Xr ctld 8 was developed by .An Edward Tomasz Napierala Aq Mt trasz@FreeBSD.org under sponsorship from the FreeBSD Foundation. Index: user/markj/netdump =================================================================== --- user/markj/netdump (revision 332407) +++ user/markj/netdump (revision 332408) Property changes on: user/markj/netdump ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head:r332339-332407
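A closing note on the systat sctp.c module patched above: initsctp() and
fetchsctp() rely on the common two-step sysctlbyname(3) idiom, sizing the
buffer before fetching into it. A minimal sketch of that idiom; it assumes
a kernel built with SCTP support, since net.inet.sctp.stats does not exist
otherwise:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	const char *name = "net.inet.sctp.stats";
	size_t len = 0;
	void *buf;

	/* First call: a NULL buffer makes the kernel report the size. */
	if (sysctlbyname(name, NULL, &len, NULL, 0) < 0)
		err(1, "sizing %s", name);
	if ((buf = malloc(len)) == NULL)
		err(1, "malloc");
	/* Second call: fetch the statistics into the sized buffer. */
	if (sysctlbyname(name, buf, &len, NULL, 0) < 0)
		err(1, "fetching %s", name);
	printf("%s: %zu bytes of statistics\n", name, len);
	free(buf);
	return (0);
}

Comparing the reported size against sizeof(struct sctpstat) at startup is
what lets initsctp() print "recompile systat!" instead of overrunning its
static buffer when the kernel structure grows.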