Index: projects/clang400-import/README.md
===================================================================
--- projects/clang400-import/README.md	(revision 314522)
+++ projects/clang400-import/README.md	(revision 314523)
@@ -1,86 +1,86 @@
 FreeBSD Source:
 ---------------
-This is the top level of the FreeBSD source directory.  This file  
-was last revised on:  
+This is the top level of the FreeBSD source directory.  This file
+was last revised on:
 $FreeBSD$
 
-For copyright information, please see the file COPYRIGHT in this  
-directory (additional copyright information also exists for some    
-sources in this tree - please see the specific source directories for  
+For copyright information, please see the file COPYRIGHT in this
+directory (additional copyright information also exists for some
+sources in this tree - please see the specific source directories for
 more information).
 
-The Makefile in this directory supports a number of targets for  
-building components (or all) of the FreeBSD source tree.  See build(7)  
-and http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/makeworld.html  
-for more information, including setting make(1) variables.  
+The Makefile in this directory supports a number of targets for
+building components (or all) of the FreeBSD source tree.  See build(7)
+and http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/makeworld.html
+for more information, including setting make(1) variables.
 
-The `buildkernel` and `installkernel` targets build and install  
-the kernel and the modules (see below).  Please see the top of  
-the Makefile in this directory for more information on the  
+The `buildkernel` and `installkernel` targets build and install
+the kernel and the modules (see below).  Please see the top of
+the Makefile in this directory for more information on the
 standard build targets and compile-time flags.
 
-Building a kernel is a somewhat more involved process.  See build(7), config(8),  
-and http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/kernelconfig.html  
+Building a kernel is a somewhat more involved process.  See build(7), config(8),
+and http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/kernelconfig.html
 for more information.
 
-Note: If you want to build and install the kernel with the  
-`buildkernel` and `installkernel` targets, you might need to build  
+Note: If you want to build and install the kernel with the
+`buildkernel` and `installkernel` targets, you might need to build
 world before.  More information is available in the handbook.
 
-The kernel configuration files reside in the `sys/<arch>/conf`  
-sub-directory.  GENERIC is the default configuration used in release builds.  
-NOTES contains entries and documentation for all possible  
+The kernel configuration files reside in the `sys/<arch>/conf`
+sub-directory.  GENERIC is the default configuration used in release builds.
+NOTES contains entries and documentation for all possible
 devices, not just those commonly used.
 
 
 Source Roadmap:
 ---------------
 ```
 bin				System/user commands.
 
 cddl			Various commands and libraries under the Common Development  
 				and Distribution License.
 
 contrib			Packages contributed by 3rd parties.
 
 crypto			Cryptography stuff (see crypto/README).
 
 etc				Template files for /etc.
 
 gnu				Various commands and libraries under the GNU Public License.  
 				Please see gnu/COPYING* for more information.
 
 include			System include files.
 
 kerberos5		Kerberos5 (Heimdal) package.
 
 lib				System libraries.
 
 libexec			System daemons.
 
 release			Release building Makefile & associated tools.
 
 rescue			Build system for statically linked /rescue utilities.
 
 sbin			System commands.
 
 secure			Cryptographic libraries and commands.
 
 share			Shared resources.
 
 sys				Kernel sources.
 
 tests			Regression tests which can be run by Kyua.  See tests/README
 				for additional information.
 
 tools			Utilities for regression testing and miscellaneous tasks.
 
 usr.bin			User commands.
 
 usr.sbin		System administration commands.
 ```
 
-For information on synchronizing your source tree with one or more of  
+For information on synchronizing your source tree with one or more of
 the FreeBSD Project's development branches, please see:
 
    http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/synching.html
Index: projects/clang400-import/contrib/dma/dma.c
===================================================================
--- projects/clang400-import/contrib/dma/dma.c	(revision 314522)
+++ projects/clang400-import/contrib/dma/dma.c	(revision 314523)
@@ -1,632 +1,633 @@
 /*
  * Copyright (c) 2008-2014, Simon Schubert <2@0x2c.org>.
  * Copyright (c) 2008 The DragonFly Project.  All rights reserved.
  *
  * This code is derived from software contributed to The DragonFly Project
  * by Simon Schubert <2@0x2c.org>.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  *
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in
  *    the documentation and/or other materials provided with the
  *    distribution.
  * 3. Neither the name of The DragonFly Project nor the names of its
  *    contributors may be used to endorse or promote products derived
  *    from this software without specific, prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE
  * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
  * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
  * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
  * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
  * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include "dfcompat.h"
 
 #include <sys/param.h>
 #include <sys/types.h>
 #include <sys/queue.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <sys/wait.h>
 
 #include <dirent.h>
 #include <err.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <inttypes.h>
+#include <libgen.h>
 #include <paths.h>
 #include <pwd.h>
 #include <signal.h>
 #include <stdarg.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <syslog.h>
 #include <unistd.h>
 
 #include "dma.h"
 
 
 static void deliver(struct qitem *);
 
 struct aliases aliases = LIST_HEAD_INITIALIZER(aliases);
 struct strlist tmpfs = SLIST_HEAD_INITIALIZER(tmpfs);
 struct authusers authusers = LIST_HEAD_INITIALIZER(authusers);
 char username[USERNAME_SIZE];
 uid_t useruid;
 const char *logident_base;
 char errmsg[ERRMSG_SIZE];
 
 static int daemonize = 1;
 static int doqueue = 0;
 
 struct config config = {
 	.smarthost	= NULL,
 	.port		= 25,
 	.aliases	= "/etc/aliases",
 	.spooldir	= "/var/spool/dma",
 	.authpath	= NULL,
 	.certfile	= NULL,
 	.features	= 0,
 	.mailname	= NULL,
 	.masquerade_host = NULL,
 	.masquerade_user = NULL,
 };
 
 
 static void
 sighup_handler(int signo)
 {
 	(void)signo;	/* so that gcc doesn't complain */
 }
 
 static char *
 set_from(struct queue *queue, const char *osender)
 {
 	const char *addr;
 	char *sender;
 
 	if (osender) {
 		addr = osender;
 	} else if (getenv("EMAIL") != NULL) {
 		addr = getenv("EMAIL");
 	} else {
 		if (config.masquerade_user)
 			addr = config.masquerade_user;
 		else
 			addr = username;
 	}
 
 	if (!strchr(addr, '@')) {
 		const char *from_host = hostname();
 
 		if (config.masquerade_host)
 			from_host = config.masquerade_host;
 
 		if (asprintf(&sender, "%s@%s", addr, from_host) <= 0)
 			return (NULL);
 	} else {
 		sender = strdup(addr);
 		if (sender == NULL)
 			return (NULL);
 	}
 
 	if (strchr(sender, '\n') != NULL) {
 		errno = EINVAL;
 		return (NULL);
 	}
 
 	queue->sender = sender;
 	return (sender);
 }
 
 static int
 read_aliases(void)
 {
 	yyin = fopen(config.aliases, "r");
 	if (yyin == NULL) {
 		/*
 		 * Non-existing aliases file is not a fatal error
 		 */
 		if (errno == ENOENT)
 			return (0);
 		/* Other problems are. */
 		return (-1);
 	}
 	if (yyparse())
 		return (-1);	/* fatal error, probably malloc() */
 	fclose(yyin);
 	return (0);
 }
 
 static int
 do_alias(struct queue *queue, const char *addr)
 {
 	struct alias *al;
         struct stritem *sit;
 	int aliased = 0;
 
         LIST_FOREACH(al, &aliases, next) {
                 if (strcmp(al->alias, addr) != 0)
                         continue;
 		SLIST_FOREACH(sit, &al->dests, next) {
 			if (add_recp(queue, sit->str, EXPAND_ADDR) != 0)
 				return (-1);
 		}
 		aliased = 1;
         }
 
         return (aliased);
 }
 
 int
 add_recp(struct queue *queue, const char *str, int expand)
 {
 	struct qitem *it, *tit;
 	struct passwd *pw;
 	char *host;
 	int aliased = 0;
 
 	it = calloc(1, sizeof(*it));
 	if (it == NULL)
 		return (-1);
 	it->addr = strdup(str);
 	if (it->addr == NULL)
 		return (-1);
 
 	it->sender = queue->sender;
 	host = strrchr(it->addr, '@');
 	if (host != NULL &&
 	    (strcmp(host + 1, hostname()) == 0 ||
 	     strcmp(host + 1, "localhost") == 0)) {
 		*host = 0;
 	}
 	LIST_FOREACH(tit, &queue->queue, next) {
 		/* weed out duplicate dests */
 		if (strcmp(tit->addr, it->addr) == 0) {
 			free(it->addr);
 			free(it);
 			return (0);
 		}
 	}
 	LIST_INSERT_HEAD(&queue->queue, it, next);
 
 	/**
 	 * Do local delivery if there is no @.
 	 * Do not do local delivery when NULLCLIENT is set.
 	 */
 	if (strrchr(it->addr, '@') == NULL && (config.features & NULLCLIENT) == 0) {
 		it->remote = 0;
 		if (expand) {
 			aliased = do_alias(queue, it->addr);
 			if (!aliased && expand == EXPAND_WILDCARD)
 				aliased = do_alias(queue, "*");
 			if (aliased < 0)
 				return (-1);
 			if (aliased) {
 				LIST_REMOVE(it, next);
 			} else {
 				/* Local destination, check */
 				pw = getpwnam(it->addr);
 				if (pw == NULL)
 					goto out;
 				/* XXX read .forward */
 				endpwent();
 			}
 		}
 	} else {
 		it->remote = 1;
 	}
 
 	return (0);
 
 out:
 	free(it->addr);
 	free(it);
 	return (-1);
 }
 
 static struct qitem *
 go_background(struct queue *queue)
 {
 	struct sigaction sa;
 	struct qitem *it;
 	pid_t pid;
 
 	if (daemonize && daemon(0, 0) != 0) {
 		syslog(LOG_ERR, "can not daemonize: %m");
 		exit(EX_OSERR);
 	}
 	daemonize = 0;
 
 	bzero(&sa, sizeof(sa));
 	sa.sa_handler = SIG_IGN;
 	sigaction(SIGCHLD, &sa, NULL);
 
 	LIST_FOREACH(it, &queue->queue, next) {
 		/* No need to fork for the last dest */
 		if (LIST_NEXT(it, next) == NULL)
 			goto retit;
 
 		pid = fork();
 		switch (pid) {
 		case -1:
 			syslog(LOG_ERR, "can not fork: %m");
 			exit(EX_OSERR);
 			break;
 
 		case 0:
 			/*
 			 * Child:
 			 *
 			 * return and deliver mail
 			 */
 retit:
 			/*
 			 * If necessary, acquire the queue and * mail files.
 			 * If this fails, we probably were raced by another
 			 * process.  It is okay to be raced if we're supposed
 			 * to flush the queue.
 			 */
 			setlogident("%s", it->queueid);
 			switch (acquirespool(it)) {
 			case 0:
 				break;
 			case 1:
 				if (doqueue)
 					exit(EX_OK);
 				syslog(LOG_WARNING, "could not lock queue file");
 				exit(EX_SOFTWARE);
 			default:
 				exit(EX_SOFTWARE);
 			}
 			dropspool(queue, it);
 			return (it);
 
 		default:
 			/*
 			 * Parent:
 			 *
 			 * fork next child
 			 */
 			break;
 		}
 	}
 
 	syslog(LOG_CRIT, "reached dead code");
 	exit(EX_SOFTWARE);
 }
 
 static void
 deliver(struct qitem *it)
 {
 	int error;
 	unsigned int backoff = MIN_RETRY, slept;
 	struct timeval now;
 	struct stat st;
 
 	snprintf(errmsg, sizeof(errmsg), "unknown bounce reason");
 
 retry:
 	syslog(LOG_INFO, "<%s> trying delivery", it->addr);
 
 	if (it->remote)
 		error = deliver_remote(it);
 	else
 		error = deliver_local(it);
 
 	switch (error) {
 	case 0:
 		delqueue(it);
 		syslog(LOG_INFO, "<%s> delivery successful", it->addr);
 		exit(EX_OK);
 
 	case 1:
 		if (stat(it->queuefn, &st) != 0) {
 			syslog(LOG_ERR, "lost queue file `%s'", it->queuefn);
 			exit(EX_SOFTWARE);
 		}
 		if (gettimeofday(&now, NULL) == 0 &&
 		    (now.tv_sec - st.st_mtim.tv_sec > MAX_TIMEOUT)) {
 			snprintf(errmsg, sizeof(errmsg),
 				 "Could not deliver for the last %d seconds. Giving up.",
 				 MAX_TIMEOUT);
 			goto bounce;
 		}
 		for (slept = 0; slept < backoff;) {
 			slept += SLEEP_TIMEOUT - sleep(SLEEP_TIMEOUT);
 			if (flushqueue_since(slept)) {
 				backoff = MIN_RETRY;
 				goto retry;
 			}
 		}
 		if (slept >= backoff) {
 			/* pick the next backoff between [1.5, 2.5) times backoff */
 			backoff = backoff + backoff / 2 + random() % backoff;
 			if (backoff > MAX_RETRY)
 				backoff = MAX_RETRY;
 		}
 		goto retry;
 
 	case -1:
 	default:
 		break;
 	}
 
 bounce:
 	bounce(it, errmsg);
 	/* NOTREACHED */
 }
 
 void
 run_queue(struct queue *queue)
 {
 	struct qitem *it;
 
 	if (LIST_EMPTY(&queue->queue))
 		return;
 
 	it = go_background(queue);
 	deliver(it);
 	/* NOTREACHED */
 }
 
 static void
 show_queue(struct queue *queue)
 {
 	struct qitem *it;
 	int locked = 0;	/* XXX */
 
 	if (LIST_EMPTY(&queue->queue)) {
 		printf("Mail queue is empty\n");
 		return;
 	}
 
 	LIST_FOREACH(it, &queue->queue, next) {
 		printf("ID\t: %s%s\n"
 		       "From\t: %s\n"
 		       "To\t: %s\n",
 		       it->queueid,
 		       locked ? "*" : "",
 		       it->sender, it->addr);
 
 		if (LIST_NEXT(it, next) != NULL)
 			printf("--\n");
 	}
 }
 
 /*
  * TODO:
  *
  * - alias processing
  * - use group permissions
  * - proper sysexit codes
  */
 
 int
 main(int argc, char **argv)
 {
 	struct sigaction act;
 	char *sender = NULL;
 	struct queue queue;
 	int i, ch;
 	int nodot = 0, showq = 0, queue_only = 0;
 	int recp_from_header = 0;
 
 	set_username();
 
 	/*
 	 * We never run as root.  If called by root, drop permissions
 	 * to the mail user.
 	 */
 	if (geteuid() == 0 || getuid() == 0) {
 		struct passwd *pw;
 
 		errno = 0;
 		pw = getpwnam(DMA_ROOT_USER);
 		if (pw == NULL) {
 			if (errno == 0)
 				errx(EX_CONFIG, "user '%s' not found", DMA_ROOT_USER);
 			else
 				err(EX_OSERR, "cannot drop root privileges");
 		}
 
 		if (setuid(pw->pw_uid) != 0)
 			err(EX_OSERR, "cannot drop root privileges");
 
 		if (geteuid() == 0 || getuid() == 0)
 			errx(EX_OSERR, "cannot drop root privileges");
 	}
 
 	atexit(deltmp);
 	init_random();
 
 	bzero(&queue, sizeof(queue));
 	LIST_INIT(&queue.queue);
 
-	if (strcmp(argv[0], "mailq") == 0) {
+	if (strcmp(basename(argv[0]), "mailq") == 0) {
 		argv++; argc--;
 		showq = 1;
 		if (argc != 0)
 			errx(EX_USAGE, "invalid arguments");
 		goto skipopts;
 	} else if (strcmp(argv[0], "newaliases") == 0) {
 		logident_base = "dma";
 		setlogident("%s", logident_base);
 
 		if (read_aliases() != 0)
 			errx(EX_SOFTWARE, "could not parse aliases file `%s'", config.aliases);
 		exit(EX_OK);
 	}
 
 	opterr = 0;
 	while ((ch = getopt(argc, argv, ":A:b:B:C:d:Df:F:h:iL:N:no:O:q:r:R:tUV:vX:")) != -1) {
 		switch (ch) {
 		case 'A':
 			/* -AX is being ignored, except for -A{c,m} */
 			if (optarg[0] == 'c' || optarg[0] == 'm') {
 				break;
 			}
 			/* else FALLTRHOUGH */
 		case 'b':
 			/* -bX is being ignored, except for -bp */
 			if (optarg[0] == 'p') {
 				showq = 1;
 				break;
 			} else if (optarg[0] == 'q') {
 				queue_only = 1;
 				break;
 			}
 			/* else FALLTRHOUGH */
 		case 'D':
 			daemonize = 0;
 			break;
 		case 'L':
 			logident_base = optarg;
 			break;
 		case 'f':
 		case 'r':
 			sender = optarg;
 			break;
 
 		case 't':
 			recp_from_header = 1;
 			break;
 
 		case 'o':
 			/* -oX is being ignored, except for -oi */
 			if (optarg[0] != 'i')
 				break;
 			/* else FALLTRHOUGH */
 		case 'O':
 			break;
 		case 'i':
 			nodot = 1;
 			break;
 
 		case 'q':
 			/* Don't let getopt slup up other arguments */
 			if (optarg && *optarg == '-')
 				optind--;
 			doqueue = 1;
 			break;
 
 		/* Ignored options */
 		case 'B':
 		case 'C':
 		case 'd':
 		case 'F':
 		case 'h':
 		case 'N':
 		case 'n':
 		case 'R':
 		case 'U':
 		case 'V':
 		case 'v':
 		case 'X':
 			break;
 
 		case ':':
 			if (optopt == 'q') {
 				doqueue = 1;
 				break;
 			}
 			/* FALLTHROUGH */
 
 		default:
 			fprintf(stderr, "invalid argument: `-%c'\n", optopt);
 			exit(EX_USAGE);
 		}
 	}
 	argc -= optind;
 	argv += optind;
 	opterr = 1;
 
 	if (argc != 0 && (showq || doqueue))
 		errx(EX_USAGE, "sending mail and queue operations are mutually exclusive");
 
 	if (showq + doqueue > 1)
 		errx(EX_USAGE, "conflicting queue operations");
 
 skipopts:
 	if (logident_base == NULL)
 		logident_base = "dma";
 	setlogident("%s", logident_base);
 
 	act.sa_handler = sighup_handler;
 	act.sa_flags = 0;
 	sigemptyset(&act.sa_mask);
 	if (sigaction(SIGHUP, &act, NULL) != 0)
 		syslog(LOG_WARNING, "can not set signal handler: %m");
 
 	parse_conf(CONF_PATH "/dma.conf");
 
 	if (config.authpath != NULL)
 		parse_authfile(config.authpath);
 
 	if (showq) {
 		if (load_queue(&queue) < 0)
 			errlog(EX_NOINPUT, "can not load queue");
 		show_queue(&queue);
 		return (0);
 	}
 
 	if (doqueue) {
 		flushqueue_signal();
 		if (load_queue(&queue) < 0)
 			errlog(EX_NOINPUT, "can not load queue");
 		run_queue(&queue);
 		return (0);
 	}
 
 	if (read_aliases() != 0)
 		errlog(EX_SOFTWARE, "could not parse aliases file `%s'", config.aliases);
 
 	if ((sender = set_from(&queue, sender)) == NULL)
 		errlog(EX_SOFTWARE, "set_from()");
 
 	if (newspoolf(&queue) != 0)
 		errlog(EX_CANTCREAT, "can not create temp file in `%s'", config.spooldir);
 
 	setlogident("%s", queue.id);
 
 	for (i = 0; i < argc; i++) {
 		if (add_recp(&queue, argv[i], EXPAND_WILDCARD) != 0)
 			errlogx(EX_DATAERR, "invalid recipient `%s'", argv[i]);
 	}
 
 	if (LIST_EMPTY(&queue.queue) && !recp_from_header)
 		errlogx(EX_NOINPUT, "no recipients");
 
 	if (readmail(&queue, nodot, recp_from_header) != 0)
 		errlog(EX_NOINPUT, "can not read mail");
 
 	if (LIST_EMPTY(&queue.queue))
 		errlogx(EX_NOINPUT, "no recipients");
 
 	if (linkspool(&queue) != 0)
 		errlog(EX_CANTCREAT, "can not create spools");
 
 	/* From here on the mail is safe. */
 
 	if (config.features & DEFER || queue_only)
 		return (0);
 
 	run_queue(&queue);
 
 	/* NOTREACHED */
 	return (0);
 }
Index: projects/clang400-import/contrib/dma
===================================================================
--- projects/clang400-import/contrib/dma	(revision 314522)
+++ projects/clang400-import/contrib/dma	(revision 314523)

Property changes on: projects/clang400-import/contrib/dma
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,2 ##
   Merged /head/contrib/dma:r311132-312925,312928-314522
   Merged /vendor/dma/dist:r306541-314519
Index: projects/clang400-import/sys/boot/arm/uboot/Makefile
===================================================================
--- projects/clang400-import/sys/boot/arm/uboot/Makefile	(revision 314522)
+++ projects/clang400-import/sys/boot/arm/uboot/Makefile	(revision 314523)
@@ -1,156 +1,157 @@
 # $FreeBSD$
 
 .include <src.opts.mk>
 
 FILES=		ubldr ubldr.bin
 
 NEWVERSWHAT=	"U-Boot loader" ${MACHINE_ARCH}
 BINDIR?=	/boot
 INSTALLFLAGS=	-b
 WARNS?=		1
 # Address at which ubldr will be loaded.
 # This varies for different boards and SOCs.
 UBLDR_LOADADDR?=	0x1000000
 
 # Architecture-specific loader code
 SRCS=		start.S conf.c self_reloc.c vers.c
 
 .if !defined(LOADER_NO_DISK_SUPPORT)
 LOADER_DISK_SUPPORT?=	yes
 .else
 LOADER_DISK_SUPPORT=	no
 .endif
 LOADER_UFS_SUPPORT?=	yes
 LOADER_CD9660_SUPPORT?=	no
 LOADER_EXT2FS_SUPPORT?=	no
 .if ${MK_NAND} != "no"
 LOADER_NANDFS_SUPPORT?= yes
 .else
 LOADER_NANDFS_SUPPORT?= no
 .endif
 LOADER_NET_SUPPORT?=	yes
 LOADER_NFS_SUPPORT?=	yes
 LOADER_TFTP_SUPPORT?=	no
 LOADER_GZIP_SUPPORT?=	no
 LOADER_BZIP2_SUPPORT?=	no
 .if ${MK_FDT} != "no"
 LOADER_FDT_SUPPORT=	yes
 .else
 LOADER_FDT_SUPPORT=	no
 .endif
 
 .if ${LOADER_DISK_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_DISK_SUPPORT
 .endif
 .if ${LOADER_UFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_UFS_SUPPORT
 .endif
 .if ${LOADER_CD9660_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_CD9660_SUPPORT
 .endif
 .if ${LOADER_EXT2FS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_EXT2FS_SUPPORT
 .endif
 .if ${LOADER_NANDFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NANDFS_SUPPORT
 .endif
 .if ${LOADER_GZIP_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_GZIP_SUPPORT
 .endif
 .if ${LOADER_BZIP2_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_BZIP2_SUPPORT
 .endif
 .if ${LOADER_NET_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NET_SUPPORT
 .endif
 .if ${LOADER_NFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NFS_SUPPORT
 .endif
 .if ${LOADER_TFTP_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_TFTP_SUPPORT
 .endif
 .if ${LOADER_FDT_SUPPORT} == "yes"
 CFLAGS+=	-I${.CURDIR}/../../fdt
 CFLAGS+=	-I${.OBJDIR}/../../fdt
 CFLAGS+=	-DLOADER_FDT_SUPPORT
 LIBUBOOT_FDT=	${.OBJDIR}/../../uboot/fdt/libuboot_fdt.a
 LIBFDT=		${.OBJDIR}/../../fdt/libfdt.a
 .endif
 
 .if ${MK_FORTH} != "no"
 # Enable BootForth
 BOOT_FORTH=	yes
-CFLAGS+=	-DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/arm
+CFLAGS+=	-DBOOT_FORTH -I${.CURDIR}/../../ficl
+CFLAGS+=	-I${.CURDIR}/../../ficl/arm
 LIBFICL=	${.OBJDIR}/../../ficl/libficl.a
 .endif
 
 # Always add MI sources
 .PATH:		${.CURDIR}/../../common
 .include	"${.CURDIR}/../../common/Makefile.inc"
 CFLAGS+=	-I${.CURDIR}/../../common
 CFLAGS+=	-I.
 
 CLEANFILES+=	loader.help
 
 CFLAGS+=	-ffreestanding -msoft-float
 
 LDFLAGS=	-nostdlib -static -T ${.CURDIR}/ldscript.${MACHINE_CPUARCH}
 
 # Pull in common loader code
 .PATH:		${.CURDIR}/../../uboot/common
 .include	"${.CURDIR}/../../uboot/common/Makefile.inc"
 CFLAGS+=	-I${.CURDIR}/../../uboot/common
 
 # U-Boot standalone support library
 LIBUBOOT=	${.OBJDIR}/../../uboot/lib/libuboot.a
 CFLAGS+=	-I${.CURDIR}/../../uboot/lib
 CFLAGS+=	-I${.OBJDIR}/../../uboot/lib
 
 # where to get libstand from
 CFLAGS+=	-I${.CURDIR}/../../../../lib/libstand/
 
 CFLAGS+=	-fPIC
 
 # clang doesn't understand %D as a specifier to printf
 NO_WERROR.clang=
 
 DPADD=		${LIBFICL} ${LIBUBOOT} ${LIBFDT} ${LIBUBOOT_FDT} ${LIBSTAND}
 LDADD=		${LIBFICL} ${LIBUBOOT} ${LIBFDT} ${LIBUBOOT_FDT} -lstand
 
 OBJS+=  ${SRCS:N*.h:R:S/$/.o/g}
 
 loader.help: help.common help.uboot ${.CURDIR}/../../fdt/help.fdt
 	cat ${.ALLSRC} | \
 	    awk -f ${.CURDIR}/../../common/merge_help.awk > ${.TARGET}
 
 ldscript.abs:
 	echo "UBLDR_LOADADDR = ${UBLDR_LOADADDR};" >${.TARGET}
 
 ldscript.pie:
 	echo "UBLDR_LOADADDR = 0;" >${.TARGET}
 
 ubldr: ${OBJS} ldscript.abs ${.CURDIR}/ldscript.${MACHINE_CPUARCH} ${DPADD}
 	${CC} ${CFLAGS} -T ldscript.abs ${LDFLAGS} \
 	    -o ${.TARGET} ${OBJS} ${LDADD}
 
 ubldr.pie: ${OBJS} ldscript.pie ${.CURDIR}/ldscript.${MACHINE_CPUARCH} ${DPADD}
 	${CC} ${CFLAGS} -T ldscript.pie ${LDFLAGS} -pie -Wl,-Bsymbolic \
 	    -o ${.TARGET} ${OBJS} ${LDADD}
 
 ubldr.bin: ubldr.pie
 	${OBJCOPY} -S -O binary ubldr.pie ${.TARGET}
 
 CLEANFILES+=	ldscript.abs ldscript.pie ubldr ubldr.pie ubldr.bin
 
 .if !defined(LOADER_ONLY)
 .PATH: ${.CURDIR}/../../forth
 .include	"${.CURDIR}/../../forth/Makefile.inc"
 
 # Install loader.rc.
 FILES+=	loader.rc
 # Put sample menu.rc on disk but don't enable it by default.
 FILES+=	menu.rc
 FILESNAME_menu.rc=	menu.rc.sample
 .endif
 
 .include <bsd.prog.mk>
Index: projects/clang400-import/sys/boot/fdt/dts/arm/socfpga_arria10_socdk_sdmmc.dts
===================================================================
--- projects/clang400-import/sys/boot/fdt/dts/arm/socfpga_arria10_socdk_sdmmc.dts	(revision 314522)
+++ projects/clang400-import/sys/boot/fdt/dts/arm/socfpga_arria10_socdk_sdmmc.dts	(revision 314523)
@@ -1,82 +1,86 @@
 /*-
  * Copyright (c) 2017 Ruslan Bukin <br@bsdpad.com>
  * All rights reserved.
  *
  * This software was developed by SRI International and the University of
  * Cambridge Computer Laboratory under DARPA/AFRL contract FA8750-10-C-0237
  * ("CTSRD"), as part of the DARPA CRASH research programme.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 /dts-v1/;
 #include "socfpga_arria10_socdk.dtsi"
 
 / {
 	model = "Altera SOCFPGA Arria 10";
 	compatible = "altr,socfpga-arria10", "altr,socfpga";
 
 	/* Reserve first page for secondary CPU trampoline code */
 	memreserve = < 0x00000000 0x1000 >;
 
 	soc {
 		/* Local timer */
 		timer@ffffc600 {
 			clock-frequency = <200000000>;
 		};
 
 		/* Global timer */
 		global_timer: timer@ffffc200 {
 			compatible = "arm,cortex-a9-global-timer";
 			reg = <0xffffc200 0x20>;
 			interrupts = <1 11 0x301>;
 			clock-frequency = <200000000>;
 		};
 	};
 
 	chosen {
 		stdin = "serial1";
 		stdout = "serial1";
 	};
 };
 
 &uart1 {
 	clock-frequency = < 50000000 >;
 };
 
 &mmc {
 	status = "okay";
 	num-slots = <1>;
 	cap-sd-highspeed;
 	broken-cd;
 	bus-width = <4>;
 	bus-frequency = <200000000>;
 };
 
 &i2c1 {
 	lcd@28 {
 		compatible = "newhaven,nhd-0216k3z-nsw-bbw";
 		reg = <0x28>;
 	};
 };
+
+&usb0 {
+	dr_mode = "host";
+};
Index: projects/clang400-import/sys/boot/powerpc/kboot/Makefile
===================================================================
--- projects/clang400-import/sys/boot/powerpc/kboot/Makefile	(revision 314522)
+++ projects/clang400-import/sys/boot/powerpc/kboot/Makefile	(revision 314523)
@@ -1,111 +1,112 @@
 # $FreeBSD$
 
 .include <src.opts.mk>
 MK_SSP=		no
 MAN=
 
 PROG=		loader.kboot
 NEWVERSWHAT=	"kboot loader" ${MACHINE_ARCH}
 BINDIR?=	/boot
 INSTALLFLAGS=	-b
 
 # Architecture-specific loader code
 SRCS=		conf.c metadata.c vers.c main.c ppc64_elf_freebsd.c
 SRCS+=		host_syscall.S hostcons.c hostdisk.c kerneltramp.S kbootfdt.c
 SRCS+=		ucmpdi2.c
 
 LOADER_DISK_SUPPORT?=	yes
 LOADER_UFS_SUPPORT?=	yes
 LOADER_CD9660_SUPPORT?=	yes
 LOADER_EXT2FS_SUPPORT?=	yes
 LOADER_NET_SUPPORT?=	yes
 LOADER_NFS_SUPPORT?=	yes
 LOADER_TFTP_SUPPORT?=	no
 LOADER_GZIP_SUPPORT?=	yes
 LOADER_FDT_SUPPORT=	yes
 LOADER_BZIP2_SUPPORT?=	no
 
 .if ${LOADER_DISK_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_DISK_SUPPORT
 .endif
 .if ${LOADER_UFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_UFS_SUPPORT
 .endif
 .if ${LOADER_CD9660_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_CD9660_SUPPORT
 .endif
 .if ${LOADER_EXT2FS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_EXT2FS_SUPPORT
 .endif
 .if ${LOADER_GZIP_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_GZIP_SUPPORT
 .endif
 .if ${LOADER_BZIP2_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_BZIP2_SUPPORT
 .endif
 .if ${LOADER_NET_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NET_SUPPORT
 .endif
 .if ${LOADER_NFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NFS_SUPPORT
 .endif
 .if ${LOADER_TFTP_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_TFTP_SUPPORT
 .endif
 .if ${LOADER_FDT_SUPPORT} == "yes"
 CFLAGS+=	-I${.CURDIR}/../../fdt
 CFLAGS+=	-I${.OBJDIR}/../../fdt
 CFLAGS+=	-I${.CURDIR}/../../../contrib/libfdt
 CFLAGS+=	-DLOADER_FDT_SUPPORT
 LIBFDT=		${.OBJDIR}/../../fdt/libfdt.a
 .endif
 
 
 .if ${MK_FORTH} != "no"
 # Enable BootForth
 BOOT_FORTH=	yes
-CFLAGS+=	-DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/powerpc
+CFLAGS+=	-DBOOT_FORTH -I${.CURDIR}/../../ficl
+CFLAGS+=	-I${.CURDIR}/../../ficl/powerpc
 LIBFICL=	${.OBJDIR}/../../ficl/libficl.a
 .endif
 
 CFLAGS+=	-mcpu=powerpc64
 
 # Always add MI sources
 .PATH:		${.CURDIR}/../../common ${.CURDIR}/../../../libkern
 .include	"${.CURDIR}/../../common/Makefile.inc"
 CFLAGS+=	-I${.CURDIR}/../../common -I${.CURDIR}/../../..
 CFLAGS+=	-I.
 
 CLEANFILES+=	loader.help
 
 CFLAGS+=	-Wall -ffreestanding -msoft-float -DAIM
 # load address. set in linker script
 RELOC?=		0x0
 CFLAGS+=	-DRELOC=${RELOC}
 
 LDFLAGS=	-nostdlib -static -T ${.CURDIR}/ldscript.powerpc
 
 # 64-bit bridge extensions
 CFLAGS+= -Wa,-mppc64bridge
 
 # Pull in common loader code
 #.PATH:		${.CURDIR}/../../ofw/common
 #.include	"${.CURDIR}/../../ofw/common/Makefile.inc"
 
 # where to get libstand from
 LIBSTAND=	${.OBJDIR}/../../libstand32/libstand.a
 CFLAGS+=	-I${.CURDIR}/../../../../lib/libstand/
 
 DPADD=		${LIBFICL} ${LIBOFW} ${LIBFDT} ${LIBSTAND}
 LDADD=		${LIBFICL} ${LIBOFW} ${LIBFDT} ${LIBSTAND}
 
 loader.help: help.common help.kboot ${.CURDIR}/../../fdt/help.fdt
 	cat ${.ALLSRC} | \
 	    awk -f ${.CURDIR}/../../common/merge_help.awk > ${.TARGET}
 
 .PATH: ${.CURDIR}/../../forth
 .include	"${.CURDIR}/../../forth/Makefile.inc"
 
 FILES+= loader.rc menu.rc
 
 .include <bsd.prog.mk>
Index: projects/clang400-import/sys/boot/powerpc/ofw/Makefile
===================================================================
--- projects/clang400-import/sys/boot/powerpc/ofw/Makefile	(revision 314522)
+++ projects/clang400-import/sys/boot/powerpc/ofw/Makefile	(revision 314523)
@@ -1,109 +1,110 @@
 # $FreeBSD$
 
 .include <src.opts.mk>
 MK_SSP=		no
 MAN=
 
 PROG=		loader
 NEWVERSWHAT=	"Open Firmware loader" ${MACHINE_ARCH}
 BINDIR?=	/boot
 INSTALLFLAGS=	-b
 
 # Architecture-specific loader code
 SRCS=		conf.c metadata.c vers.c start.c
 SRCS+=		ucmpdi2.c
 
 LOADER_DISK_SUPPORT?=	yes
 LOADER_UFS_SUPPORT?=	yes
 LOADER_CD9660_SUPPORT?=	yes
 LOADER_EXT2FS_SUPPORT?=	no
 LOADER_NET_SUPPORT?=	yes
 LOADER_NFS_SUPPORT?=	yes
 LOADER_TFTP_SUPPORT?=	no
 LOADER_GZIP_SUPPORT?=	yes
 LOADER_BZIP2_SUPPORT?=	no
 LOADER_FDT_SUPPORT?=	yes
 
 .if ${LOADER_DISK_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_DISK_SUPPORT
 .endif
 .if ${LOADER_UFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_UFS_SUPPORT
 .endif
 .if ${LOADER_CD9660_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_CD9660_SUPPORT
 .endif
 .if ${LOADER_EXT2FS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_EXT2FS_SUPPORT
 .endif
 .if ${LOADER_GZIP_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_GZIP_SUPPORT
 .endif
 .if ${LOADER_BZIP2_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_BZIP2_SUPPORT
 .endif
 .if ${LOADER_NET_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NET_SUPPORT
 .endif
 .if ${LOADER_NFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NFS_SUPPORT
 .endif
 .if ${LOADER_TFTP_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_TFTP_SUPPORT
 .endif
 .if ${LOADER_FDT_SUPPORT} == "yes"
 SRCS+=		ofwfdt.c
 CFLAGS+=	-I${.CURDIR}/../../fdt
 CFLAGS+=	-I${.OBJDIR}/../../fdt
 CFLAGS+=	-I${.CURDIR}/../../../contrib/libfdt
 CFLAGS+=	-DLOADER_FDT_SUPPORT
 LIBFDT=		${.OBJDIR}/../../fdt/libfdt.a
 .endif
 
 .if ${MK_FORTH} != "no"
 # Enable BootForth
 BOOT_FORTH=	yes
-CFLAGS+=	-DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/powerpc
+CFLAGS+=	-DBOOT_FORTH -I${.CURDIR}/../../ficl
+CFLAGS+=	-I${.CURDIR}/../../ficl/powerpc
 LIBFICL=	${.OBJDIR}/../../ficl/libficl.a
 .endif
 
 # Always add MI sources
 .PATH:		${.CURDIR}/../../common ${.CURDIR}/../../../libkern
 .include	"${.CURDIR}/../../common/Makefile.inc"
 CFLAGS+=	-I${.CURDIR}/../../common -I${.CURDIR}/../../..
 CFLAGS+=	-I.
 
 CLEANFILES+=	loader.help
 
 CFLAGS+=	-ffreestanding -msoft-float
 # load address. set in linker script
 RELOC?=		0x1C00000
 CFLAGS+=	-DRELOC=${RELOC}
 
 LDFLAGS=	-nostdlib -static -T ${.CURDIR}/ldscript.powerpc
 
 # Pull in common loader code
 .PATH:		${.CURDIR}/../../ofw/common
 .include	"${.CURDIR}/../../ofw/common/Makefile.inc"
 
 # Open Firmware standalone support library
 LIBOFW=		${.OBJDIR}/../../ofw/libofw/libofw.a
 CFLAGS+=	-I${.CURDIR}/../../ofw/libofw
 
 # where to get libstand from
 LIBSTAND=	${.OBJDIR}/../../libstand32/libstand.a
 CFLAGS+=	-I${.CURDIR}/../../../../lib/libstand/
 
 DPADD=		${LIBFICL} ${LIBOFW} ${LIBFDT} ${LIBSTAND}
 LDADD=		${LIBFICL} ${LIBOFW} ${LIBFDT} ${LIBSTAND}
 
 loader.help: help.common help.ofw ${.CURDIR}/../../fdt/help.fdt
 	cat ${.ALLSRC} | \
 	    awk -f ${.CURDIR}/../../common/merge_help.awk > ${.TARGET}
 
 .PATH: ${.CURDIR}/../../forth
 .include	"${.CURDIR}/../../forth/Makefile.inc"
 
 FILES+= loader.rc menu.rc
 
 .include <bsd.prog.mk>
Index: projects/clang400-import/sys/boot/powerpc/ps3/Makefile
===================================================================
--- projects/clang400-import/sys/boot/powerpc/ps3/Makefile	(revision 314522)
+++ projects/clang400-import/sys/boot/powerpc/ps3/Makefile	(revision 314523)
@@ -1,113 +1,114 @@
 # $FreeBSD$
 
 .include <src.opts.mk>
 MK_SSP=		no
 MAN=
 
 PROG=		loader.ps3
 NEWVERSWHAT=	"Playstation 3 loader" ${MACHINE_ARCH}
 BINDIR?=	/boot
 INSTALLFLAGS=	-b
 
 # Architecture-specific loader code
 SRCS=		start.S conf.c metadata.c vers.c main.c devicename.c ppc64_elf_freebsd.c
 SRCS+=		lv1call.S ps3cons.c font.h ps3mmu.c ps3net.c ps3repo.c \
 		ps3stor.c ps3disk.c ps3cdrom.c
 SRCS+=		ucmpdi2.c
 
 LOADER_DISK_SUPPORT?=	yes
 LOADER_UFS_SUPPORT?=	yes
 LOADER_CD9660_SUPPORT?=	yes
 LOADER_EXT2FS_SUPPORT?=	yes
 LOADER_NET_SUPPORT?=	yes
 LOADER_NFS_SUPPORT?=	yes
 LOADER_TFTP_SUPPORT?=	no
 LOADER_GZIP_SUPPORT?=	yes
 LOADER_FDT_SUPPORT?=	no
 LOADER_BZIP2_SUPPORT?=	no
 
 .if ${LOADER_DISK_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_DISK_SUPPORT
 .endif
 .if ${LOADER_UFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_UFS_SUPPORT
 .endif
 .if ${LOADER_CD9660_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_CD9660_SUPPORT
 .endif
 .if ${LOADER_EXT2FS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_EXT2FS_SUPPORT
 .endif
 .if ${LOADER_GZIP_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_GZIP_SUPPORT
 .endif
 .if ${LOADER_BZIP2_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_BZIP2_SUPPORT
 .endif
 .if ${LOADER_NET_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NET_SUPPORT
 .endif
 .if ${LOADER_NFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NFS_SUPPORT
 .endif
 .if ${LOADER_TFTP_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_TFTP_SUPPORT
 .endif
 .if ${LOADER_FDT_SUPPORT} == "yes"
 CFLAGS+=	-I${.CURDIR}/../../fdt
 CFLAGS+=	-I${.OBJDIR}/../../fdt
 CFLAGS+=	-DLOADER_FDT_SUPPORT
 LIBFDT=		${.OBJDIR}/../../fdt/libfdt.a
 .endif
 
 
 .if ${MK_FORTH} != "no"
 # Enable BootForth
 BOOT_FORTH=	yes
-CFLAGS+=	-DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/powerpc
+CFLAGS+=	-DBOOT_FORTH -I${.CURDIR}/../../ficl
+CFLAGS+=	-I${.CURDIR}/../../ficl/powerpc
 LIBFICL=	${.OBJDIR}/../../ficl/libficl.a
 .endif
 
 CFLAGS+=	-mcpu=powerpc64
 
 # Always add MI sources
 .PATH:		${.CURDIR}/../../common ${.CURDIR}/../../../libkern
 .include	"${.CURDIR}/../../common/Makefile.inc"
 CFLAGS+=	-I${.CURDIR}/../../common -I${.CURDIR}/../../..
 CFLAGS+=	-I.
 
 CLEANFILES+=	loader.help
 
 CFLAGS+=	-Wall -ffreestanding -msoft-float -DAIM
 # load address. set in linker script
 RELOC?=		0x0
 CFLAGS+=	-DRELOC=${RELOC}
 
 LDFLAGS=	-nostdlib -static -T ${.CURDIR}/ldscript.powerpc
 
 # Pull in common loader code
 #.PATH:		${.CURDIR}/../../ofw/common
 #.include	"${.CURDIR}/../../ofw/common/Makefile.inc"
 
 # where to get libstand from
 LIBSTAND=	${.OBJDIR}/../../libstand32/libstand.a
 CFLAGS+=	-I${.CURDIR}/../../../../lib/libstand/
 
 DPADD=		${LIBFICL} ${LIBOFW} ${LIBSTAND}
 LDADD=		${LIBFICL} ${LIBOFW} ${LIBSTAND}
 
 SC_DFLT_FONT=cp437
 
 font.h:
 	uudecode < /usr/share/syscons/fonts/${SC_DFLT_FONT}-8x16.fnt && file2c 'u_char dflt_font_16[16*256] = {' '};' < ${SC_DFLT_FONT}-8x16 > font.h && uudecode < /usr/share/syscons/fonts/${SC_DFLT_FONT}-8x14.fnt && file2c 'u_char dflt_font_14[14*256] = {' '};' < ${SC_DFLT_FONT}-8x14 >> font.h && uudecode < /usr/share/syscons/fonts/${SC_DFLT_FONT}-8x8.fnt && file2c 'u_char dflt_font_8[8*256] = {' '};' < ${SC_DFLT_FONT}-8x8 >> font.h
 
 loader.help: help.common help.ps3 ${.CURDIR}/../../fdt/help.fdt
 	cat ${.ALLSRC} | \
 	    awk -f ${.CURDIR}/../../common/merge_help.awk > ${.TARGET}
 
 .PATH: ${.CURDIR}/../../forth
 .include	"${.CURDIR}/../../forth/Makefile.inc"
 
 FILES+= loader.rc menu.rc
 
 .include <bsd.prog.mk>
Index: projects/clang400-import/sys/boot/powerpc/uboot/Makefile
===================================================================
--- projects/clang400-import/sys/boot/powerpc/uboot/Makefile	(revision 314522)
+++ projects/clang400-import/sys/boot/powerpc/uboot/Makefile	(revision 314523)
@@ -1,112 +1,113 @@
 # $FreeBSD$
 
 .include <src.opts.mk>
 
 PROG=		ubldr
 NEWVERSWHAT=	"U-Boot loader" ${MACHINE_ARCH}
 BINDIR?=	/boot
 INSTALLFLAGS=	-b
 MAN=
 
 # Architecture-specific loader code
 SRCS=		start.S conf.c vers.c
 SRCS+=		ucmpdi2.c
 
 .if !defined(LOADER_NO_DISK_SUPPORT)
 LOADER_DISK_SUPPORT?=	yes
 .else
 LOADER_DISK_SUPPORT=	no
 .endif
 LOADER_UFS_SUPPORT?=	yes
 LOADER_CD9660_SUPPORT?=	no
 LOADER_EXT2FS_SUPPORT?=	no
 LOADER_NET_SUPPORT?=	yes
 LOADER_NFS_SUPPORT?=	yes
 LOADER_TFTP_SUPPORT?=	no
 LOADER_GZIP_SUPPORT?=	no
 LOADER_BZIP2_SUPPORT?=	no
 .if ${MK_FDT} != "no"
 LOADER_FDT_SUPPORT=	yes
 .else
 LOADER_FDT_SUPPORT=	no
 .endif
 
 .if ${LOADER_DISK_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_DISK_SUPPORT
 .endif
 .if ${LOADER_UFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_UFS_SUPPORT
 .endif
 .if ${LOADER_CD9660_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_CD9660_SUPPORT
 .endif
 .if ${LOADER_EXT2FS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_EXT2FS_SUPPORT
 .endif
 .if ${LOADER_GZIP_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_GZIP_SUPPORT
 .endif
 .if ${LOADER_BZIP2_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_BZIP2_SUPPORT
 .endif
 .if ${LOADER_NET_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NET_SUPPORT
 .endif
 .if ${LOADER_NFS_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_NFS_SUPPORT
 .endif
 .if ${LOADER_TFTP_SUPPORT} == "yes"
 CFLAGS+=	-DLOADER_TFTP_SUPPORT
 .endif
 .if ${LOADER_FDT_SUPPORT} == "yes"
 CFLAGS+=	-I${.CURDIR}/../../fdt
 CFLAGS+=	-I${.OBJDIR}/../../fdt
 CFLAGS+=	-DLOADER_FDT_SUPPORT
 LIBUBOOT_FDT=	${.OBJDIR}/../../uboot/fdt/libuboot_fdt.a
 LIBFDT=		${.OBJDIR}/../../fdt/libfdt.a
 .endif
 
 .if ${MK_FORTH} != "no"
 # Enable BootForth
 BOOT_FORTH=	yes
-CFLAGS+=	-DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/powerpc
+CFLAGS+=	-DBOOT_FORTH -I${.CURDIR}/../../ficl
+CFLAGS+=	-I${.CURDIR}/../../ficl/powerpc
 LIBFICL=	${.OBJDIR}/../../ficl/libficl.a
 .endif
 
 # Always add MI sources
 .PATH:		${.CURDIR}/../../common ${.CURDIR}/../../../libkern
 .include	"${.CURDIR}/../../common/Makefile.inc"
 CFLAGS+=	-I${.CURDIR}/../../common -I${.CURDIR}/../../..
 CFLAGS+=	-I.
 
 CLEANFILES+=	${PROG}.help
 
 CFLAGS+=	-ffreestanding
 
 LDFLAGS=	-nostdlib -static -T ${.CURDIR}/ldscript.powerpc
 
 # Pull in common loader code
 .PATH:		${.CURDIR}/../../uboot/common
 .include	"${.CURDIR}/../../uboot/common/Makefile.inc"
 CFLAGS+=	-I${.CURDIR}/../../uboot/common
 
 # U-Boot standalone support library
 LIBUBOOT=	${.OBJDIR}/../../uboot/lib/libuboot.a
 CFLAGS+=	-I${.CURDIR}/../../uboot/lib
 CFLAGS+=	-I${.OBJDIR}/../../uboot/lib
 
 # where to get libstand from
 LIBSTAND=	${.OBJDIR}/../../libstand32/libstand.a
 CFLAGS+=	-I${.CURDIR}/../../../../lib/libstand/
 
 DPADD=		${LIBFICL} ${LIBUBOOT} ${LIBFDT} ${LIBUBOOT_FDT} ${LIBSTAND}
 LDADD=		${LIBFICL} ${LIBUBOOT} ${LIBFDT} ${LIBUBOOT_FDT} ${LIBSTAND}
 
 loader.help: help.common help.uboot ${.CURDIR}/../../fdt/help.fdt
 	cat ${.ALLSRC} | \
 	    awk -f ${.CURDIR}/../../common/merge_help.awk > ${.TARGET}
 
 .PATH: ${.CURDIR}/../../forth
 FILES=	loader.help
 
 .include <bsd.prog.mk>
Index: projects/clang400-import/sys/boot/userboot/userboot/Makefile
===================================================================
--- projects/clang400-import/sys/boot/userboot/userboot/Makefile	(revision 314522)
+++ projects/clang400-import/sys/boot/userboot/userboot/Makefile	(revision 314523)
@@ -1,66 +1,67 @@
 # $FreeBSD$
 
 MAN=
 
 .include <src.opts.mk>
 MK_SSP=		no
 
 SHLIB_NAME=	userboot.so
 MK_CTF=		no
 STRIP=
 LIBDIR=		/boot
 
 SRCS=		autoload.c
 SRCS+=		bcache.c
 SRCS+=		biossmap.c
 SRCS+=		bootinfo.c
 SRCS+=		bootinfo32.c
 SRCS+=		bootinfo64.c
 SRCS+=		conf.c
 SRCS+=		console.c
 SRCS+=		copy.c
 SRCS+=		devicename.c
 SRCS+=		elf32_freebsd.c
 SRCS+=		elf64_freebsd.c
 SRCS+=		host.c
 SRCS+=		main.c
 SRCS+=		userboot_cons.c
 SRCS+=		userboot_disk.c
 SRCS+=		vers.c
 
 CFLAGS+=	-Wall
 CFLAGS+=	-I${.CURDIR}/..
 CFLAGS+=	-I${.CURDIR}/../../common
 CFLAGS+=	-I${.CURDIR}/../../..
 CFLAGS+=	-I${.CURDIR}/../../../../lib/libstand
 CFLAGS+=	-ffreestanding -I.
 
 CWARNFLAGS.main.c += -Wno-implicit-function-declaration
 
 LDFLAGS+=	-nostdlib -Wl,-Bsymbolic
 
 NEWVERSWHAT=	"User boot" ${MACHINE_CPUARCH}
 
 .if ${MK_FORTH} != "no"
 BOOT_FORTH=	yes
-CFLAGS+=        -DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/i386
+CFLAGS+=        -DBOOT_FORTH -I${.CURDIR}/../../ficl
+CFLAGS+=        -I${.CURDIR}/../../ficl/i386
 CFLAGS+=	-DBF_DICTSIZE=15000
 LIBFICL=	${.OBJDIR}/../ficl/libficl.a
 .endif
 
 LIBSTAND=	${.OBJDIR}/../libstand/libstand.a
 
 .if ${MK_ZFS} != "no"
 CFLAGS+=	-DUSERBOOT_ZFS_SUPPORT
 LIBZFSBOOT=	${.OBJDIR}/../zfs/libzfsboot.a
 .endif
 
 # Always add MI sources 
 .PATH:		${.CURDIR}/../../common
 .include	"${.CURDIR}/../../common/Makefile.inc"
 CFLAGS+=	-I${.CURDIR}/../../common
 CFLAGS+=	-I.
 DPADD+=		${LIBFICL} ${LIBZFSBOOT} ${LIBSTAND} 
 LDADD+=		${LIBFICL} ${LIBZFSBOOT} ${LIBSTAND}
 
 .include <bsd.lib.mk>
Index: projects/clang400-import/sys/boot/zfs/zfsimpl.c
===================================================================
--- projects/clang400-import/sys/boot/zfs/zfsimpl.c	(revision 314522)
+++ projects/clang400-import/sys/boot/zfs/zfsimpl.c	(revision 314523)
@@ -1,2488 +1,2488 @@
 /*-
  * Copyright (c) 2007 Doug Rabson
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /*
  *	Stand-alone ZFS file reader.
  */
 
 #include <sys/stat.h>
 #include <sys/stdint.h>
 
 #include "zfsimpl.h"
 #include "zfssubr.c"
 
 
 struct zfsmount {
 	const spa_t	*spa;
 	objset_phys_t	objset;
 	uint64_t	rootobj;
 };
 
 /*
  * List of all vdevs, chained through v_alllink.
  */
 static vdev_list_t zfs_vdevs;
 
  /*
  * List of ZFS features supported for read
  */
 static const char *features_for_read[] = {
 	"org.illumos:lz4_compress",
 	"com.delphix:hole_birth",
 	"com.delphix:extensible_dataset",
 	"com.delphix:embedded_data",
 	"org.open-zfs:large_blocks",
 	"org.illumos:sha512",
 	"org.illumos:skein",
 	NULL
 };
 
 /*
  * List of all pools, chained through spa_link.
  */
 static spa_list_t zfs_pools;
 
 static uint64_t zfs_crc64_table[256];
 static const dnode_phys_t *dnode_cache_obj = NULL;
 static uint64_t dnode_cache_bn;
 static char *dnode_cache_buf;
 static char *zap_scratch;
 static char *zfs_temp_buf, *zfs_temp_end, *zfs_temp_ptr;
 
 #define TEMP_SIZE	(1024 * 1024)
 
 static int zio_read(const spa_t *spa, const blkptr_t *bp, void *buf);
 static int zfs_get_root(const spa_t *spa, uint64_t *objid);
 static int zfs_rlookup(const spa_t *spa, uint64_t objnum, char *result);
 static int zap_lookup(const spa_t *spa, const dnode_phys_t *dnode,
     const char *name, uint64_t integer_size, uint64_t num_integers,
     void *value);
 
 static void
 zfs_init(void)
 {
 	STAILQ_INIT(&zfs_vdevs);
 	STAILQ_INIT(&zfs_pools);
 
 	zfs_temp_buf = malloc(TEMP_SIZE);
 	zfs_temp_end = zfs_temp_buf + TEMP_SIZE;
 	zfs_temp_ptr = zfs_temp_buf;
 	dnode_cache_buf = malloc(SPA_MAXBLOCKSIZE);
 	zap_scratch = malloc(SPA_MAXBLOCKSIZE);
 
 	zfs_init_crc();
 }
 
 static void *
 zfs_alloc(size_t size)
 {
 	char *ptr;
 
 	if (zfs_temp_ptr + size > zfs_temp_end) {
 		printf("ZFS: out of temporary buffer space\n");
 		for (;;) ;
 	}
 	ptr = zfs_temp_ptr;
 	zfs_temp_ptr += size;
 
 	return (ptr);
 }
 
 static void
 zfs_free(void *ptr, size_t size)
 {
 
 	zfs_temp_ptr -= size;
 	if (zfs_temp_ptr != ptr) {
 		printf("ZFS: zfs_alloc()/zfs_free() mismatch\n");
 		for (;;) ;
 	}
 }
 
 static int
 xdr_int(const unsigned char **xdr, int *ip)
 {
 	*ip = ((*xdr)[0] << 24)
 		| ((*xdr)[1] << 16)
 		| ((*xdr)[2] << 8)
 		| ((*xdr)[3] << 0);
 	(*xdr) += 4;
 	return (0);
 }
 
 static int
 xdr_u_int(const unsigned char **xdr, u_int *ip)
 {
 	*ip = ((*xdr)[0] << 24)
 		| ((*xdr)[1] << 16)
 		| ((*xdr)[2] << 8)
 		| ((*xdr)[3] << 0);
 	(*xdr) += 4;
 	return (0);
 }
 
 static int
 xdr_uint64_t(const unsigned char **xdr, uint64_t *lp)
 {
 	u_int hi, lo;
 
 	xdr_u_int(xdr, &hi);
 	xdr_u_int(xdr, &lo);
 	*lp = (((uint64_t) hi) << 32) | lo;
 	return (0);
 }
 
 static int
 nvlist_find(const unsigned char *nvlist, const char *name, int type,
 	    int* elementsp, void *valuep)
 {
 	const unsigned char *p, *pair;
 	int junk;
 	int encoded_size, decoded_size;
 
 	p = nvlist;
 	xdr_int(&p, &junk);
 	xdr_int(&p, &junk);
 
 	pair = p;
 	xdr_int(&p, &encoded_size);
 	xdr_int(&p, &decoded_size);
 	while (encoded_size && decoded_size) {
 		int namelen, pairtype, elements;
 		const char *pairname;
 
 		xdr_int(&p, &namelen);
 		pairname = (const char*) p;
 		p += roundup(namelen, 4);
 		xdr_int(&p, &pairtype);
 
 		if (!memcmp(name, pairname, namelen) && type == pairtype) {
 			xdr_int(&p, &elements);
 			if (elementsp)
 				*elementsp = elements;
 			if (type == DATA_TYPE_UINT64) {
 				xdr_uint64_t(&p, (uint64_t *) valuep);
 				return (0);
 			} else if (type == DATA_TYPE_STRING) {
 				int len;
 				xdr_int(&p, &len);
 				(*(const char**) valuep) = (const char*) p;
 				return (0);
 			} else if (type == DATA_TYPE_NVLIST
 				   || type == DATA_TYPE_NVLIST_ARRAY) {
 				(*(const unsigned char**) valuep) =
 					 (const unsigned char*) p;
 				return (0);
 			} else {
 				return (EIO);
 			}
 		} else {
 			/*
 			 * Not the pair we are looking for, skip to the next one.
 			 */
 			p = pair + encoded_size;
 		}
 
 		pair = p;
 		xdr_int(&p, &encoded_size);
 		xdr_int(&p, &decoded_size);
 	}
 
 	return (EIO);
 }
 
 static int
 nvlist_check_features_for_read(const unsigned char *nvlist)
 {
 	const unsigned char *p, *pair;
 	int junk;
 	int encoded_size, decoded_size;
 	int rc;
 
 	rc = 0;
 
 	p = nvlist;
 	xdr_int(&p, &junk);
 	xdr_int(&p, &junk);
 
 	pair = p;
 	xdr_int(&p, &encoded_size);
 	xdr_int(&p, &decoded_size);
 	while (encoded_size && decoded_size) {
 		int namelen, pairtype;
 		const char *pairname;
 		int i, found;
 
 		found = 0;
 
 		xdr_int(&p, &namelen);
 		pairname = (const char*) p;
 		p += roundup(namelen, 4);
 		xdr_int(&p, &pairtype);
 
 		for (i = 0; features_for_read[i] != NULL; i++) {
 			if (!memcmp(pairname, features_for_read[i], namelen)) {
 				found = 1;
 				break;
 			}
 		}
 
 		if (!found) {
 			printf("ZFS: unsupported feature: %s\n", pairname);
 			rc = EIO;
 		}
 
 		p = pair + encoded_size;
 
 		pair = p;
 		xdr_int(&p, &encoded_size);
 		xdr_int(&p, &decoded_size);
 	}
 
 	return (rc);
 }
 
 /*
  * Return the next nvlist in an nvlist array.
  */
 static const unsigned char *
 nvlist_next(const unsigned char *nvlist)
 {
 	const unsigned char *p, *pair;
 	int junk;
 	int encoded_size, decoded_size;
 
 	p = nvlist;
 	xdr_int(&p, &junk);
 	xdr_int(&p, &junk);
 
 	pair = p;
 	xdr_int(&p, &encoded_size);
 	xdr_int(&p, &decoded_size);
 	while (encoded_size && decoded_size) {
 		p = pair + encoded_size;
 
 		pair = p;
 		xdr_int(&p, &encoded_size);
 		xdr_int(&p, &decoded_size);
 	}
 
 	return p;
 }
 
 #ifdef TEST
 
 static const unsigned char *
 nvlist_print(const unsigned char *nvlist, unsigned int indent)
 {
 	static const char* typenames[] = {
 		"DATA_TYPE_UNKNOWN",
 		"DATA_TYPE_BOOLEAN",
 		"DATA_TYPE_BYTE",
 		"DATA_TYPE_INT16",
 		"DATA_TYPE_UINT16",
 		"DATA_TYPE_INT32",
 		"DATA_TYPE_UINT32",
 		"DATA_TYPE_INT64",
 		"DATA_TYPE_UINT64",
 		"DATA_TYPE_STRING",
 		"DATA_TYPE_BYTE_ARRAY",
 		"DATA_TYPE_INT16_ARRAY",
 		"DATA_TYPE_UINT16_ARRAY",
 		"DATA_TYPE_INT32_ARRAY",
 		"DATA_TYPE_UINT32_ARRAY",
 		"DATA_TYPE_INT64_ARRAY",
 		"DATA_TYPE_UINT64_ARRAY",
 		"DATA_TYPE_STRING_ARRAY",
 		"DATA_TYPE_HRTIME",
 		"DATA_TYPE_NVLIST",
 		"DATA_TYPE_NVLIST_ARRAY",
 		"DATA_TYPE_BOOLEAN_VALUE",
 		"DATA_TYPE_INT8",
 		"DATA_TYPE_UINT8",
 		"DATA_TYPE_BOOLEAN_ARRAY",
 		"DATA_TYPE_INT8_ARRAY",
 		"DATA_TYPE_UINT8_ARRAY"
 	};
 
 	unsigned int i, j;
 	const unsigned char *p, *pair;
 	int junk;
 	int encoded_size, decoded_size;
 
 	p = nvlist;
 	xdr_int(&p, &junk);
 	xdr_int(&p, &junk);
 
 	pair = p;
 	xdr_int(&p, &encoded_size);
 	xdr_int(&p, &decoded_size);
 	while (encoded_size && decoded_size) {
 		int namelen, pairtype, elements;
 		const char *pairname;
 
 		xdr_int(&p, &namelen);
 		pairname = (const char*) p;
 		p += roundup(namelen, 4);
 		xdr_int(&p, &pairtype);
 
 		for (i = 0; i < indent; i++)
 			printf(" ");
 		printf("%s %s", typenames[pairtype], pairname);
 
 		xdr_int(&p, &elements);
 		switch (pairtype) {
 		case DATA_TYPE_UINT64: {
 			uint64_t val;
 			xdr_uint64_t(&p, &val);
 			printf(" = 0x%jx\n", (uintmax_t)val);
 			break;
 		}
 
 		case DATA_TYPE_STRING: {
 			int len;
 			xdr_int(&p, &len);
 			printf(" = \"%s\"\n", p);
 			break;
 		}
 
 		case DATA_TYPE_NVLIST:
 			printf("\n");
 			nvlist_print(p, indent + 1);
 			break;
 
 		case DATA_TYPE_NVLIST_ARRAY:
 			for (j = 0; j < elements; j++) {
 				printf("[%d]\n", j);
 				p = nvlist_print(p, indent + 1);
 				if (j != elements - 1) {
 					for (i = 0; i < indent; i++)
 						printf(" ");
 					printf("%s %s", typenames[pairtype], pairname);
 				}
 			}
 			break;
 
 		default:
 			printf("\n");
 		}
 
 		p = pair + encoded_size;
 
 		pair = p;
 		xdr_int(&p, &encoded_size);
 		xdr_int(&p, &decoded_size);
 	}
 
 	return p;
 }
 
 #endif
 
 static int
 vdev_read_phys(vdev_t *vdev, const blkptr_t *bp, void *buf,
     off_t offset, size_t size)
 {
 	size_t psize;
 	int rc;
 
 	if (!vdev->v_phys_read)
 		return (EIO);
 
 	if (bp) {
 		psize = BP_GET_PSIZE(bp);
 	} else {
 		psize = size;
 	}
 
 	/*printf("ZFS: reading %d bytes at 0x%jx to %p\n", psize, (uintmax_t)offset, buf);*/
 	rc = vdev->v_phys_read(vdev, vdev->v_read_priv, offset, buf, psize);
 	if (rc)
 		return (rc);
 	if (bp && zio_checksum_verify(vdev->spa, bp, buf))
 		return (EIO);
 
 	return (0);
 }
 
 static int
 vdev_disk_read(vdev_t *vdev, const blkptr_t *bp, void *buf,
     off_t offset, size_t bytes)
 {
 
 	return (vdev_read_phys(vdev, bp, buf,
 		offset + VDEV_LABEL_START_SIZE, bytes));
 }
 
 
 static int
 vdev_mirror_read(vdev_t *vdev, const blkptr_t *bp, void *buf,
     off_t offset, size_t bytes)
 {
 	vdev_t *kid;
 	int rc;
 
 	rc = EIO;
 	STAILQ_FOREACH(kid, &vdev->v_children, v_childlink) {
 		if (kid->v_state != VDEV_STATE_HEALTHY)
 			continue;
 		rc = kid->v_read(kid, bp, buf, offset, bytes);
 		if (!rc)
 			return (0);
 	}
 
 	return (rc);
 }
 
 static int
 vdev_replacing_read(vdev_t *vdev, const blkptr_t *bp, void *buf,
     off_t offset, size_t bytes)
 {
 	vdev_t *kid;
 
 	/*
 	 * Here we should have two kids:
 	 * First one which is the one we are replacing and we can trust
 	 * only this one to have valid data, but it might not be present.
 	 * Second one is that one we are replacing with. It is most likely
 	 * healthy, but we can't trust it has needed data, so we won't use it.
 	 */
 	kid = STAILQ_FIRST(&vdev->v_children);
 	if (kid == NULL)
 		return (EIO);
 	if (kid->v_state != VDEV_STATE_HEALTHY)
 		return (EIO);
 	return (kid->v_read(kid, bp, buf, offset, bytes));
 }
 
 static vdev_t *
 vdev_find(uint64_t guid)
 {
 	vdev_t *vdev;
 
 	STAILQ_FOREACH(vdev, &zfs_vdevs, v_alllink)
 		if (vdev->v_guid == guid)
 			return (vdev);
 
 	return (0);
 }
 
 static vdev_t *
 vdev_create(uint64_t guid, vdev_read_t *read)
 {
 	vdev_t *vdev;
 
 	vdev = malloc(sizeof(vdev_t));
 	memset(vdev, 0, sizeof(vdev_t));
 	STAILQ_INIT(&vdev->v_children);
 	vdev->v_guid = guid;
 	vdev->v_state = VDEV_STATE_OFFLINE;
 	vdev->v_read = read;
 	vdev->v_phys_read = 0;
 	vdev->v_read_priv = 0;
 	STAILQ_INSERT_TAIL(&zfs_vdevs, vdev, v_alllink);
 
 	return (vdev);
 }
 
 static int
 vdev_init_from_nvlist(const unsigned char *nvlist, vdev_t *pvdev,
     vdev_t **vdevp, int is_newer)
 {
 	int rc;
 	uint64_t guid, id, ashift, nparity;
 	const char *type;
 	const char *path;
 	vdev_t *vdev, *kid;
 	const unsigned char *kids;
 	int nkids, i, is_new;
 	uint64_t is_offline, is_faulted, is_degraded, is_removed, isnt_present;
 
 	if (nvlist_find(nvlist, ZPOOL_CONFIG_GUID,
 			DATA_TYPE_UINT64, 0, &guid)
 	    || nvlist_find(nvlist, ZPOOL_CONFIG_ID,
 			   DATA_TYPE_UINT64, 0, &id)
 	    || nvlist_find(nvlist, ZPOOL_CONFIG_TYPE,
 			   DATA_TYPE_STRING, 0, &type)) {
 		printf("ZFS: can't find vdev details\n");
 		return (ENOENT);
 	}
 
 	if (strcmp(type, VDEV_TYPE_MIRROR)
 	    && strcmp(type, VDEV_TYPE_DISK)
 #ifdef ZFS_TEST
 	    && strcmp(type, VDEV_TYPE_FILE)
 #endif
 	    && strcmp(type, VDEV_TYPE_RAIDZ)
 	    && strcmp(type, VDEV_TYPE_REPLACING)) {
 		printf("ZFS: can only boot from disk, mirror, raidz1, raidz2 and raidz3 vdevs\n");
 		return (EIO);
 	}
 
 	is_offline = is_removed = is_faulted = is_degraded = isnt_present = 0;
 
 	nvlist_find(nvlist, ZPOOL_CONFIG_OFFLINE, DATA_TYPE_UINT64, 0,
 			&is_offline);
 	nvlist_find(nvlist, ZPOOL_CONFIG_REMOVED, DATA_TYPE_UINT64, 0,
 			&is_removed);
 	nvlist_find(nvlist, ZPOOL_CONFIG_FAULTED, DATA_TYPE_UINT64, 0,
 			&is_faulted);
 	nvlist_find(nvlist, ZPOOL_CONFIG_DEGRADED, DATA_TYPE_UINT64, 0,
 			&is_degraded);
 	nvlist_find(nvlist, ZPOOL_CONFIG_NOT_PRESENT, DATA_TYPE_UINT64, 0,
 			&isnt_present);
 
 	vdev = vdev_find(guid);
 	if (!vdev) {
 		is_new = 1;
 
 		if (!strcmp(type, VDEV_TYPE_MIRROR))
 			vdev = vdev_create(guid, vdev_mirror_read);
 		else if (!strcmp(type, VDEV_TYPE_RAIDZ))
 			vdev = vdev_create(guid, vdev_raidz_read);
 		else if (!strcmp(type, VDEV_TYPE_REPLACING))
 			vdev = vdev_create(guid, vdev_replacing_read);
 		else
 			vdev = vdev_create(guid, vdev_disk_read);
 
 		vdev->v_id = id;
 		vdev->v_top = pvdev != NULL ? pvdev : vdev;
 		if (nvlist_find(nvlist, ZPOOL_CONFIG_ASHIFT,
 			DATA_TYPE_UINT64, 0, &ashift) == 0)
 			vdev->v_ashift = ashift;
 		else
 			vdev->v_ashift = 0;
 		if (nvlist_find(nvlist, ZPOOL_CONFIG_NPARITY,
 			DATA_TYPE_UINT64, 0, &nparity) == 0)
 			vdev->v_nparity = nparity;
 		else
 			vdev->v_nparity = 0;
 		if (nvlist_find(nvlist, ZPOOL_CONFIG_PATH,
 				DATA_TYPE_STRING, 0, &path) == 0) {
 			if (strncmp(path, "/dev/", 5) == 0)
 				path += 5;
 			vdev->v_name = strdup(path);
 		} else {
 			if (!strcmp(type, "raidz")) {
 				if (vdev->v_nparity == 1)
 					vdev->v_name = "raidz1";
 				else if (vdev->v_nparity == 2)
 					vdev->v_name = "raidz2";
 				else if (vdev->v_nparity == 3)
 					vdev->v_name = "raidz3";
 				else {
 					printf("ZFS: can only boot from disk, mirror, raidz1, raidz2 and raidz3 vdevs\n");
 					return (EIO);
 				}
 			} else {
 				vdev->v_name = strdup(type);
 			}
 		}
 	} else {
 		is_new = 0;
 	}
 
 	if (is_new || is_newer) {
 		/*
 		 * This is either new vdev or we've already seen this vdev,
 		 * but from an older vdev label, so let's refresh its state
 		 * from the newer label.
 		 */
 		if (is_offline)
 			vdev->v_state = VDEV_STATE_OFFLINE;
 		else if (is_removed)
 			vdev->v_state = VDEV_STATE_REMOVED;
 		else if (is_faulted)
 			vdev->v_state = VDEV_STATE_FAULTED;
 		else if (is_degraded)
 			vdev->v_state = VDEV_STATE_DEGRADED;
 		else if (isnt_present)
 			vdev->v_state = VDEV_STATE_CANT_OPEN;
 	}
 
 	rc = nvlist_find(nvlist, ZPOOL_CONFIG_CHILDREN,
 			 DATA_TYPE_NVLIST_ARRAY, &nkids, &kids);
 	/*
 	 * Its ok if we don't have any kids.
 	 */
 	if (rc == 0) {
 		vdev->v_nchildren = nkids;
 		for (i = 0; i < nkids; i++) {
 			rc = vdev_init_from_nvlist(kids, vdev, &kid, is_newer);
 			if (rc)
 				return (rc);
 			if (is_new)
 				STAILQ_INSERT_TAIL(&vdev->v_children, kid,
 						   v_childlink);
 			kids = nvlist_next(kids);
 		}
 	} else {
 		vdev->v_nchildren = 0;
 	}
 
 	if (vdevp)
 		*vdevp = vdev;
 	return (0);
 }
 
 static void
 vdev_set_state(vdev_t *vdev)
 {
 	vdev_t *kid;
 	int good_kids;
 	int bad_kids;
 
 	/*
 	 * A mirror or raidz is healthy if all its kids are healthy. A
 	 * mirror is degraded if any of its kids is healthy; a raidz
 	 * is degraded if at most nparity kids are offline.
 	 */
 	if (STAILQ_FIRST(&vdev->v_children)) {
 		good_kids = 0;
 		bad_kids = 0;
 		STAILQ_FOREACH(kid, &vdev->v_children, v_childlink) {
 			if (kid->v_state == VDEV_STATE_HEALTHY)
 				good_kids++;
 			else
 				bad_kids++;
 		}
 		if (bad_kids == 0) {
 			vdev->v_state = VDEV_STATE_HEALTHY;
 		} else {
 			if (vdev->v_read == vdev_mirror_read) {
 				if (good_kids) {
 					vdev->v_state = VDEV_STATE_DEGRADED;
 				} else {
 					vdev->v_state = VDEV_STATE_OFFLINE;
 				}
 			} else if (vdev->v_read == vdev_raidz_read) {
 				if (bad_kids > vdev->v_nparity) {
 					vdev->v_state = VDEV_STATE_OFFLINE;
 				} else {
 					vdev->v_state = VDEV_STATE_DEGRADED;
 				}
 			}
 		}
 	}
 }
 
 static spa_t *
 spa_find_by_guid(uint64_t guid)
 {
 	spa_t *spa;
 
 	STAILQ_FOREACH(spa, &zfs_pools, spa_link)
 		if (spa->spa_guid == guid)
 			return (spa);
 
 	return (0);
 }
 
 static spa_t *
 spa_find_by_name(const char *name)
 {
 	spa_t *spa;
 
 	STAILQ_FOREACH(spa, &zfs_pools, spa_link)
 		if (!strcmp(spa->spa_name, name))
 			return (spa);
 
 	return (0);
 }
 
 #ifdef BOOT2
 static spa_t *
 spa_get_primary(void)
 {
 
 	return (STAILQ_FIRST(&zfs_pools));
 }
 
 static vdev_t *
 spa_get_primary_vdev(const spa_t *spa)
 {
 	vdev_t *vdev;
 	vdev_t *kid;
 
 	if (spa == NULL)
 		spa = spa_get_primary();
 	if (spa == NULL)
 		return (NULL);
 	vdev = STAILQ_FIRST(&spa->spa_vdevs);
 	if (vdev == NULL)
 		return (NULL);
 	for (kid = STAILQ_FIRST(&vdev->v_children); kid != NULL;
 	     kid = STAILQ_FIRST(&vdev->v_children))
 		vdev = kid;
 	return (vdev);
 }
 #endif
 
 static spa_t *
 spa_create(uint64_t guid)
 {
 	spa_t *spa;
 
 	spa = malloc(sizeof(spa_t));
 	memset(spa, 0, sizeof(spa_t));
 	STAILQ_INIT(&spa->spa_vdevs);
 	spa->spa_guid = guid;
 	STAILQ_INSERT_TAIL(&zfs_pools, spa, spa_link);
 
 	return (spa);
 }
 
 static const char *
 state_name(vdev_state_t state)
 {
 	static const char* names[] = {
 		"UNKNOWN",
 		"CLOSED",
 		"OFFLINE",
 		"REMOVED",
 		"CANT_OPEN",
 		"FAULTED",
 		"DEGRADED",
 		"ONLINE"
 	};
 	return names[state];
 }
 
 #ifdef BOOT2
 
 #define pager_printf printf
 
 #else
 
 static int
 pager_printf(const char *fmt, ...)
 {
 	char line[80];
 	va_list args;
 
 	va_start(args, fmt);
 	vsprintf(line, fmt, args);
 	va_end(args);
 	return (pager_output(line));
 }
 
 #endif
 
 #define STATUS_FORMAT	"        %s %s\n"
 
 static int
 print_state(int indent, const char *name, vdev_state_t state)
 {
 	int i;
 	char buf[512];
 
 	buf[0] = 0;
 	for (i = 0; i < indent; i++)
 		strcat(buf, "  ");
 	strcat(buf, name);
 	return (pager_printf(STATUS_FORMAT, buf, state_name(state)));
 	
 }
 
 static int
 vdev_status(vdev_t *vdev, int indent)
 {
 	vdev_t *kid;
 	int ret;
 	ret = print_state(indent, vdev->v_name, vdev->v_state);
 	if (ret != 0)
 		return (ret);
 
 	STAILQ_FOREACH(kid, &vdev->v_children, v_childlink) {
 		ret = vdev_status(kid, indent + 1);
 		if (ret != 0)
 			return (ret);
 	}
 	return (ret);
 }
 
 static int
 spa_status(spa_t *spa)
 {
 	static char bootfs[ZFS_MAXNAMELEN];
 	uint64_t rootid;
 	vdev_t *vdev;
 	int good_kids, bad_kids, degraded_kids, ret;
 	vdev_state_t state;
 
 	ret = pager_printf("  pool: %s\n", spa->spa_name);
 	if (ret != 0)
 		return (ret);
 
 	if (zfs_get_root(spa, &rootid) == 0 &&
 	    zfs_rlookup(spa, rootid, bootfs) == 0) {
 		if (bootfs[0] == '\0')
 			ret = pager_printf("bootfs: %s\n", spa->spa_name);
 		else
 			ret = pager_printf("bootfs: %s/%s\n", spa->spa_name,
 			    bootfs);
 		if (ret != 0)
 			return (ret);
 	}
 	ret = pager_printf("config:\n\n");
 	if (ret != 0)
 		return (ret);
 	ret = pager_printf(STATUS_FORMAT, "NAME", "STATE");
 	if (ret != 0)
 		return (ret);
 
 	good_kids = 0;
 	degraded_kids = 0;
 	bad_kids = 0;
 	STAILQ_FOREACH(vdev, &spa->spa_vdevs, v_childlink) {
 		if (vdev->v_state == VDEV_STATE_HEALTHY)
 			good_kids++;
 		else if (vdev->v_state == VDEV_STATE_DEGRADED)
 			degraded_kids++;
 		else
 			bad_kids++;
 	}
 
 	state = VDEV_STATE_CLOSED;
 	if (good_kids > 0 && (degraded_kids + bad_kids) == 0)
 		state = VDEV_STATE_HEALTHY;
 	else if ((good_kids + degraded_kids) > 0)
 		state = VDEV_STATE_DEGRADED;
 
 	ret = print_state(0, spa->spa_name, state);
 	if (ret != 0)
 		return (ret);
 	STAILQ_FOREACH(vdev, &spa->spa_vdevs, v_childlink) {
 		ret = vdev_status(vdev, 1);
 		if (ret != 0)
 			return (ret);
 	}
 	return (ret);
 }
 
 static int
 spa_all_status(void)
 {
 	spa_t *spa;
 	int first = 1, ret = 0;
 
 	STAILQ_FOREACH(spa, &zfs_pools, spa_link) {
 		if (!first) {
 			ret = pager_printf("\n");
 			if (ret != 0)
 				return (ret);
 		}
 		first = 0;
 		ret = spa_status(spa);
 		if (ret != 0)
 			return (ret);
 	}
 	return (ret);
 }
 
 static int
 vdev_probe(vdev_phys_read_t *read, void *read_priv, spa_t **spap)
 {
 	vdev_t vtmp;
 	vdev_phys_t *vdev_label = (vdev_phys_t *) zap_scratch;
 	spa_t *spa;
 	vdev_t *vdev, *top_vdev, *pool_vdev;
 	off_t off;
 	blkptr_t bp;
 	const unsigned char *nvlist;
 	uint64_t val;
 	uint64_t guid;
 	uint64_t pool_txg, pool_guid;
 	uint64_t is_log;
 	const char *pool_name;
 	const unsigned char *vdevs;
 	const unsigned char *features;
 	int i, rc, is_newer;
 	char *upbuf;
 	const struct uberblock *up;
 
 	/*
 	 * Load the vdev label and figure out which
 	 * uberblock is most current.
 	 */
 	memset(&vtmp, 0, sizeof(vtmp));
 	vtmp.v_phys_read = read;
 	vtmp.v_read_priv = read_priv;
 	off = offsetof(vdev_label_t, vl_vdev_phys);
 	BP_ZERO(&bp);
 	BP_SET_LSIZE(&bp, sizeof(vdev_phys_t));
 	BP_SET_PSIZE(&bp, sizeof(vdev_phys_t));
 	BP_SET_CHECKSUM(&bp, ZIO_CHECKSUM_LABEL);
 	BP_SET_COMPRESS(&bp, ZIO_COMPRESS_OFF);
 	DVA_SET_OFFSET(BP_IDENTITY(&bp), off);
 	ZIO_SET_CHECKSUM(&bp.blk_cksum, off, 0, 0, 0);
 	if (vdev_read_phys(&vtmp, &bp, vdev_label, off, 0))
 		return (EIO);
 
 	if (vdev_label->vp_nvlist[0] != NV_ENCODE_XDR) {
 		return (EIO);
 	}
 
 	nvlist = (const unsigned char *) vdev_label->vp_nvlist + 4;
 
 	if (nvlist_find(nvlist,
 			ZPOOL_CONFIG_VERSION,
 			DATA_TYPE_UINT64, 0, &val)) {
 		return (EIO);
 	}
 
 	if (!SPA_VERSION_IS_SUPPORTED(val)) {
 		printf("ZFS: unsupported ZFS version %u (should be %u)\n",
 		    (unsigned) val, (unsigned) SPA_VERSION);
 		return (EIO);
 	}
 
 	/* Check ZFS features for read */
 	if (nvlist_find(nvlist,
 			ZPOOL_CONFIG_FEATURES_FOR_READ,
 			DATA_TYPE_NVLIST, 0, &features) == 0
 	    && nvlist_check_features_for_read(features) != 0)
 		return (EIO);
 
 	if (nvlist_find(nvlist,
 			ZPOOL_CONFIG_POOL_STATE,
 			DATA_TYPE_UINT64, 0, &val)) {
 		return (EIO);
 	}
 
 	if (val == POOL_STATE_DESTROYED) {
 		/* We don't boot only from destroyed pools. */
 		return (EIO);
 	}
 
 	if (nvlist_find(nvlist,
 			ZPOOL_CONFIG_POOL_TXG,
 			DATA_TYPE_UINT64, 0, &pool_txg)
 	    || nvlist_find(nvlist,
 			   ZPOOL_CONFIG_POOL_GUID,
 			   DATA_TYPE_UINT64, 0, &pool_guid)
 	    || nvlist_find(nvlist,
 			   ZPOOL_CONFIG_POOL_NAME,
 			   DATA_TYPE_STRING, 0, &pool_name)) {
 		/*
 		 * Cache and spare devices end up here - just ignore
 		 * them.
 		 */
 		/*printf("ZFS: can't find pool details\n");*/
 		return (EIO);
 	}
 
 	is_log = 0;
 	(void) nvlist_find(nvlist, ZPOOL_CONFIG_IS_LOG, DATA_TYPE_UINT64, 0,
 	    &is_log);
 	if (is_log)
 		return (EIO);
 
 	/*
 	 * Create the pool if this is the first time we've seen it.
 	 */
 	spa = spa_find_by_guid(pool_guid);
 	if (!spa) {
 		spa = spa_create(pool_guid);
 		spa->spa_name = strdup(pool_name);
 	}
 	if (pool_txg > spa->spa_txg) {
 		spa->spa_txg = pool_txg;
 		is_newer = 1;
 	} else
 		is_newer = 0;
 
 	/*
 	 * Get the vdev tree and create our in-core copy of it.
 	 * If we already have a vdev with this guid, this must
 	 * be some kind of alias (overlapping slices, dangerously dedicated
 	 * disks etc).
 	 */
 	if (nvlist_find(nvlist,
 			ZPOOL_CONFIG_GUID,
 			DATA_TYPE_UINT64, 0, &guid)) {
 		return (EIO);
 	}
 	vdev = vdev_find(guid);
 	if (vdev && vdev->v_phys_read)	/* Has this vdev already been inited? */
 		return (EIO);
 
 	if (nvlist_find(nvlist,
 			ZPOOL_CONFIG_VDEV_TREE,
 			DATA_TYPE_NVLIST, 0, &vdevs)) {
 		return (EIO);
 	}
 
 	rc = vdev_init_from_nvlist(vdevs, NULL, &top_vdev, is_newer);
 	if (rc)
 		return (rc);
 
 	/*
 	 * Add the toplevel vdev to the pool if its not already there.
 	 */
 	STAILQ_FOREACH(pool_vdev, &spa->spa_vdevs, v_childlink)
 		if (top_vdev == pool_vdev)
 			break;
 	if (!pool_vdev && top_vdev) {
 		top_vdev->spa = spa;
 		STAILQ_INSERT_TAIL(&spa->spa_vdevs, top_vdev, v_childlink);
 	}
 
 	/*
 	 * We should already have created an incomplete vdev for this
 	 * vdev. Find it and initialise it with our read proc.
 	 */
 	vdev = vdev_find(guid);
 	if (vdev) {
 		vdev->v_phys_read = read;
 		vdev->v_read_priv = read_priv;
 		vdev->v_state = VDEV_STATE_HEALTHY;
 	} else {
 		printf("ZFS: inconsistent nvlist contents\n");
 		return (EIO);
 	}
 
 	/*
 	 * Re-evaluate top-level vdev state.
 	 */
 	vdev_set_state(top_vdev);
 
 	/*
 	 * Ok, we are happy with the pool so far. Lets find
 	 * the best uberblock and then we can actually access
 	 * the contents of the pool.
 	 */
 	upbuf = zfs_alloc(VDEV_UBERBLOCK_SIZE(vdev));
 	up = (const struct uberblock *)upbuf;
 	for (i = 0;
 	     i < VDEV_UBERBLOCK_COUNT(vdev);
 	     i++) {
 		off = VDEV_UBERBLOCK_OFFSET(vdev, i);
 		BP_ZERO(&bp);
 		DVA_SET_OFFSET(&bp.blk_dva[0], off);
 		BP_SET_LSIZE(&bp, VDEV_UBERBLOCK_SIZE(vdev));
 		BP_SET_PSIZE(&bp, VDEV_UBERBLOCK_SIZE(vdev));
 		BP_SET_CHECKSUM(&bp, ZIO_CHECKSUM_LABEL);
 		BP_SET_COMPRESS(&bp, ZIO_COMPRESS_OFF);
 		ZIO_SET_CHECKSUM(&bp.blk_cksum, off, 0, 0, 0);
 
 		if (vdev_read_phys(vdev, &bp, upbuf, off, 0))
 			continue;
 
 		if (up->ub_magic != UBERBLOCK_MAGIC)
 			continue;
 		if (up->ub_txg < spa->spa_txg)
 			continue;
 		if (up->ub_txg > spa->spa_uberblock.ub_txg) {
 			spa->spa_uberblock = *up;
 		} else if (up->ub_txg == spa->spa_uberblock.ub_txg) {
 			if (up->ub_timestamp > spa->spa_uberblock.ub_timestamp)
 				spa->spa_uberblock = *up;
 		}
 	}
 	zfs_free(upbuf, VDEV_UBERBLOCK_SIZE(vdev));
 
 	vdev->spa = spa;
 	if (spap)
 		*spap = spa;
 	return (0);
 }
 
 static int
 ilog2(int n)
 {
 	int v;
 
 	for (v = 0; v < 32; v++)
 		if (n == (1 << v))
 			return v;
 	return -1;
 }
 
 static int
 zio_read_gang(const spa_t *spa, const blkptr_t *bp, void *buf)
 {
 	blkptr_t gbh_bp;
 	zio_gbh_phys_t zio_gb;
 	char *pbuf;
 	int i;
 
 	/* Artificial BP for gang block header. */
 	gbh_bp = *bp;
 	BP_SET_PSIZE(&gbh_bp, SPA_GANGBLOCKSIZE);
 	BP_SET_LSIZE(&gbh_bp, SPA_GANGBLOCKSIZE);
 	BP_SET_CHECKSUM(&gbh_bp, ZIO_CHECKSUM_GANG_HEADER);
 	BP_SET_COMPRESS(&gbh_bp, ZIO_COMPRESS_OFF);
 	for (i = 0; i < SPA_DVAS_PER_BP; i++)
 		DVA_SET_GANG(&gbh_bp.blk_dva[i], 0);
 
 	/* Read gang header block using the artificial BP. */
 	if (zio_read(spa, &gbh_bp, &zio_gb))
 		return (EIO);
 
 	pbuf = buf;
 	for (i = 0; i < SPA_GBH_NBLKPTRS; i++) {
 		blkptr_t *gbp = &zio_gb.zg_blkptr[i];
 
 		if (BP_IS_HOLE(gbp))
 			continue;
 		if (zio_read(spa, gbp, pbuf))
 			return (EIO);
 		pbuf += BP_GET_PSIZE(gbp);
 	}
 
 	if (zio_checksum_verify(spa, bp, buf))
 		return (EIO);
 	return (0);
 }
 
 static int
 zio_read(const spa_t *spa, const blkptr_t *bp, void *buf)
 {
 	int cpfunc = BP_GET_COMPRESS(bp);
 	uint64_t align, size;
 	void *pbuf;
 	int i, error;
 
 	/*
 	 * Process data embedded in block pointer
 	 */
 	if (BP_IS_EMBEDDED(bp)) {
 		ASSERT(BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA);
 
 		size = BPE_GET_PSIZE(bp);
 		ASSERT(size <= BPE_PAYLOAD_SIZE);
 
 		if (cpfunc != ZIO_COMPRESS_OFF)
 			pbuf = zfs_alloc(size);
 		else
 			pbuf = buf;
 
 		decode_embedded_bp_compressed(bp, pbuf);
 		error = 0;
 
 		if (cpfunc != ZIO_COMPRESS_OFF) {
 			error = zio_decompress_data(cpfunc, pbuf,
 			    size, buf, BP_GET_LSIZE(bp));
 			zfs_free(pbuf, size);
 		}
 		if (error != 0)
 			printf("ZFS: i/o error - unable to decompress block pointer data, error %d\n",
 			    error);
 		return (error);
 	}
 
 	error = EIO;
 
 	for (i = 0; i < SPA_DVAS_PER_BP; i++) {
 		const dva_t *dva = &bp->blk_dva[i];
 		vdev_t *vdev;
 		int vdevid;
 		off_t offset;
 
 		if (!dva->dva_word[0] && !dva->dva_word[1])
 			continue;
 
 		vdevid = DVA_GET_VDEV(dva);
 		offset = DVA_GET_OFFSET(dva);
 		STAILQ_FOREACH(vdev, &spa->spa_vdevs, v_childlink) {
 			if (vdev->v_id == vdevid)
 				break;
 		}
 		if (!vdev || !vdev->v_read)
 			continue;
 
 		size = BP_GET_PSIZE(bp);
 		if (vdev->v_read == vdev_raidz_read) {
 			align = 1ULL << vdev->v_top->v_ashift;
 			if (P2PHASE(size, align) != 0)
 				size = P2ROUNDUP(size, align);
 		}
 		if (size != BP_GET_PSIZE(bp) || cpfunc != ZIO_COMPRESS_OFF)
 			pbuf = zfs_alloc(size);
 		else
 			pbuf = buf;
 
 		if (DVA_GET_GANG(dva))
 			error = zio_read_gang(spa, bp, pbuf);
 		else
 			error = vdev->v_read(vdev, bp, pbuf, offset, size);
 		if (error == 0) {
 			if (cpfunc != ZIO_COMPRESS_OFF)
 				error = zio_decompress_data(cpfunc, pbuf,
 				    BP_GET_PSIZE(bp), buf, BP_GET_LSIZE(bp));
 			else if (size != BP_GET_PSIZE(bp))
 				bcopy(pbuf, buf, BP_GET_PSIZE(bp));
 		}
 		if (buf != pbuf)
 			zfs_free(pbuf, size);
 		if (error == 0)
 			break;
 	}
 	if (error != 0)
 		printf("ZFS: i/o error - all block copies unavailable\n");
 	return (error);
 }
 
 static int
 dnode_read(const spa_t *spa, const dnode_phys_t *dnode, off_t offset, void *buf, size_t buflen)
 {
 	int ibshift = dnode->dn_indblkshift - SPA_BLKPTRSHIFT;
 	int bsize = dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT;
 	int nlevels = dnode->dn_nlevels;
 	int i, rc;
 
 	if (bsize > SPA_MAXBLOCKSIZE) {
 		printf("ZFS: I/O error - blocks larger than %llu are not "
 		    "supported\n", SPA_MAXBLOCKSIZE);
 		return (EIO);
 	}
 
 	/*
 	 * Note: bsize may not be a power of two here so we need to do an
 	 * actual divide rather than a bitshift.
 	 */
 	while (buflen > 0) {
 		uint64_t bn = offset / bsize;
 		int boff = offset % bsize;
 		int ibn;
 		const blkptr_t *indbp;
 		blkptr_t bp;
 
 		if (bn > dnode->dn_maxblkid)
 			return (EIO);
 
 		if (dnode == dnode_cache_obj && bn == dnode_cache_bn)
 			goto cached;
 
 		indbp = dnode->dn_blkptr;
 		for (i = 0; i < nlevels; i++) {
 			/*
 			 * Copy the bp from the indirect array so that
 			 * we can re-use the scratch buffer for multi-level
 			 * objects.
 			 */
 			ibn = bn >> ((nlevels - i - 1) * ibshift);
 			ibn &= ((1 << ibshift) - 1);
 			bp = indbp[ibn];
 			if (BP_IS_HOLE(&bp)) {
 				memset(dnode_cache_buf, 0, bsize);
 				break;
 			}
 			rc = zio_read(spa, &bp, dnode_cache_buf);
 			if (rc)
 				return (rc);
 			indbp = (const blkptr_t *) dnode_cache_buf;
 		}
 		dnode_cache_obj = dnode;
 		dnode_cache_bn = bn;
 	cached:
 
 		/*
 		 * The buffer contains our data block. Copy what we
 		 * need from it and loop.
 		 */ 
 		i = bsize - boff;
 		if (i > buflen) i = buflen;
 		memcpy(buf, &dnode_cache_buf[boff], i);
 		buf = ((char*) buf) + i;
 		offset += i;
 		buflen -= i;
 	}
 
 	return (0);
 }
 
 /*
  * Lookup a value in a microzap directory. Assumes that the zap
  * scratch buffer contains the directory contents.
  */
 static int
 mzap_lookup(const dnode_phys_t *dnode, const char *name, uint64_t *value)
 {
 	const mzap_phys_t *mz;
 	const mzap_ent_phys_t *mze;
 	size_t size;
 	int chunks, i;
 
 	/*
 	 * Microzap objects use exactly one block. Read the whole
 	 * thing.
 	 */
 	size = dnode->dn_datablkszsec * 512;
 
 	mz = (const mzap_phys_t *) zap_scratch;
 	chunks = size / MZAP_ENT_LEN - 1;
 
 	for (i = 0; i < chunks; i++) {
 		mze = &mz->mz_chunk[i];
 		if (!strcmp(mze->mze_name, name)) {
 			*value = mze->mze_value;
 			return (0);
 		}
 	}
 
 	return (ENOENT);
 }
 
 /*
  * Compare a name with a zap leaf entry. Return non-zero if the name
  * matches.
  */
 static int
 fzap_name_equal(const zap_leaf_t *zl, const zap_leaf_chunk_t *zc, const char *name)
 {
 	size_t namelen;
 	const zap_leaf_chunk_t *nc;
 	const char *p;
 
 	namelen = zc->l_entry.le_name_numints;
 			
 	nc = &ZAP_LEAF_CHUNK(zl, zc->l_entry.le_name_chunk);
 	p = name;
 	while (namelen > 0) {
 		size_t len;
 		len = namelen;
 		if (len > ZAP_LEAF_ARRAY_BYTES)
 			len = ZAP_LEAF_ARRAY_BYTES;
 		if (memcmp(p, nc->l_array.la_array, len))
 			return (0);
 		p += len;
 		namelen -= len;
 		nc = &ZAP_LEAF_CHUNK(zl, nc->l_array.la_next);
 	}
 
 	return 1;
 }
 
 /*
  * Extract a uint64_t value from a zap leaf entry.
  */
 static uint64_t
 fzap_leaf_value(const zap_leaf_t *zl, const zap_leaf_chunk_t *zc)
 {
 	const zap_leaf_chunk_t *vc;
 	int i;
 	uint64_t value;
 	const uint8_t *p;
 
 	vc = &ZAP_LEAF_CHUNK(zl, zc->l_entry.le_value_chunk);
 	for (i = 0, value = 0, p = vc->l_array.la_array; i < 8; i++) {
 		value = (value << 8) | p[i];
 	}
 
 	return value;
 }
 
 static void
 stv(int len, void *addr, uint64_t value)
 {
 	switch (len) {
 	case 1:
 		*(uint8_t *)addr = value;
 		return;
 	case 2:
 		*(uint16_t *)addr = value;
 		return;
 	case 4:
 		*(uint32_t *)addr = value;
 		return;
 	case 8:
 		*(uint64_t *)addr = value;
 		return;
 	}
 }
 
 /*
  * Extract a array from a zap leaf entry.
  */
 static void
 fzap_leaf_array(const zap_leaf_t *zl, const zap_leaf_chunk_t *zc,
     uint64_t integer_size, uint64_t num_integers, void *buf)
 {
 	uint64_t array_int_len = zc->l_entry.le_value_intlen;
 	uint64_t value = 0;
 	uint64_t *u64 = buf;
 	char *p = buf;
 	int len = MIN(zc->l_entry.le_value_numints, num_integers);
 	int chunk = zc->l_entry.le_value_chunk;
 	int byten = 0;
 
 	if (integer_size == 8 && len == 1) {
 		*u64 = fzap_leaf_value(zl, zc);
 		return;
 	}
 
 	while (len > 0) {
 		struct zap_leaf_array *la = &ZAP_LEAF_CHUNK(zl, chunk).l_array;
 		int i;
 
 		ASSERT3U(chunk, <, ZAP_LEAF_NUMCHUNKS(zl));
 		for (i = 0; i < ZAP_LEAF_ARRAY_BYTES && len > 0; i++) {
 			value = (value << 8) | la->la_array[i];
 			byten++;
 			if (byten == array_int_len) {
 				stv(integer_size, p, value);
 				byten = 0;
 				len--;
 				if (len == 0)
 					return;
 				p += integer_size;
 			}
 		}
 		chunk = la->la_next;
 	}
 }
 
 /*
  * Lookup a value in a fatzap directory. Assumes that the zap scratch
  * buffer contains the directory header.
  */
 static int
 fzap_lookup(const spa_t *spa, const dnode_phys_t *dnode, const char *name,
     uint64_t integer_size, uint64_t num_integers, void *value)
 {
 	int bsize = dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT;
 	zap_phys_t zh = *(zap_phys_t *) zap_scratch;
 	fat_zap_t z;
 	uint64_t *ptrtbl;
 	uint64_t hash;
 	int rc;
 
 	if (zh.zap_magic != ZAP_MAGIC)
 		return (EIO);
 
 	z.zap_block_shift = ilog2(bsize);
 	z.zap_phys = (zap_phys_t *) zap_scratch;
 
 	/*
 	 * Figure out where the pointer table is and read it in if necessary.
 	 */
 	if (zh.zap_ptrtbl.zt_blk) {
 		rc = dnode_read(spa, dnode, zh.zap_ptrtbl.zt_blk * bsize,
 			       zap_scratch, bsize);
 		if (rc)
 			return (rc);
 		ptrtbl = (uint64_t *) zap_scratch;
 	} else {
 		ptrtbl = &ZAP_EMBEDDED_PTRTBL_ENT(&z, 0);
 	}
 
 	hash = zap_hash(zh.zap_salt, name);
 
 	zap_leaf_t zl;
 	zl.l_bs = z.zap_block_shift;
 
 	off_t off = ptrtbl[hash >> (64 - zh.zap_ptrtbl.zt_shift)] << zl.l_bs;
 	zap_leaf_chunk_t *zc;
 
 	rc = dnode_read(spa, dnode, off, zap_scratch, bsize);
 	if (rc)
 		return (rc);
 
 	zl.l_phys = (zap_leaf_phys_t *) zap_scratch;
 
 	/*
 	 * Make sure this chunk matches our hash.
 	 */
 	if (zl.l_phys->l_hdr.lh_prefix_len > 0
 	    && zl.l_phys->l_hdr.lh_prefix
 	    != hash >> (64 - zl.l_phys->l_hdr.lh_prefix_len))
 		return (ENOENT);
 
 	/*
 	 * Hash within the chunk to find our entry.
 	 */
 	int shift = (64 - ZAP_LEAF_HASH_SHIFT(&zl) - zl.l_phys->l_hdr.lh_prefix_len);
 	int h = (hash >> shift) & ((1 << ZAP_LEAF_HASH_SHIFT(&zl)) - 1);
 	h = zl.l_phys->l_hash[h];
 	if (h == 0xffff)
 		return (ENOENT);
 	zc = &ZAP_LEAF_CHUNK(&zl, h);
 	while (zc->l_entry.le_hash != hash) {
 		if (zc->l_entry.le_next == 0xffff) {
 			zc = NULL;
 			break;
 		}
 		zc = &ZAP_LEAF_CHUNK(&zl, zc->l_entry.le_next);
 	}
 	if (fzap_name_equal(&zl, zc, name)) {
 		if (zc->l_entry.le_value_intlen * zc->l_entry.le_value_numints >
 		    integer_size * num_integers)
 			return (E2BIG);
 		fzap_leaf_array(&zl, zc, integer_size, num_integers, value);
 		return (0);
 	}
 
 	return (ENOENT);
 }
 
 /*
  * Lookup a name in a zap object and return its value as a uint64_t.
  */
 static int
 zap_lookup(const spa_t *spa, const dnode_phys_t *dnode, const char *name,
     uint64_t integer_size, uint64_t num_integers, void *value)
 {
 	int rc;
 	uint64_t zap_type;
 	size_t size = dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT;
 
 	rc = dnode_read(spa, dnode, 0, zap_scratch, size);
 	if (rc)
 		return (rc);
 
 	zap_type = *(uint64_t *) zap_scratch;
 	if (zap_type == ZBT_MICRO)
 		return mzap_lookup(dnode, name, value);
 	else if (zap_type == ZBT_HEADER) {
 		return fzap_lookup(spa, dnode, name, integer_size,
 		    num_integers, value);
 	}
 	printf("ZFS: invalid zap_type=%d\n", (int)zap_type);
 	return (EIO);
 }
 
 /*
  * List a microzap directory. Assumes that the zap scratch buffer contains
  * the directory contents.
  */
 static int
 mzap_list(const dnode_phys_t *dnode, int (*callback)(const char *, uint64_t))
 {
 	const mzap_phys_t *mz;
 	const mzap_ent_phys_t *mze;
 	size_t size;
 	int chunks, i, rc;
 
 	/*
 	 * Microzap objects use exactly one block. Read the whole
 	 * thing.
 	 */
 	size = dnode->dn_datablkszsec * 512;
 	mz = (const mzap_phys_t *) zap_scratch;
 	chunks = size / MZAP_ENT_LEN - 1;
 
 	for (i = 0; i < chunks; i++) {
 		mze = &mz->mz_chunk[i];
 		if (mze->mze_name[0]) {
 			rc = callback(mze->mze_name, mze->mze_value);
 			if (rc != 0)
 				return (rc);
 		}
 	}
 
 	return (0);
 }
 
 /*
  * List a fatzap directory. Assumes that the zap scratch buffer contains
  * the directory header.
  */
 static int
 fzap_list(const spa_t *spa, const dnode_phys_t *dnode, int (*callback)(const char *, uint64_t))
 {
 	int bsize = dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT;
 	zap_phys_t zh = *(zap_phys_t *) zap_scratch;
 	fat_zap_t z;
 	int i, j, rc;
 
 	if (zh.zap_magic != ZAP_MAGIC)
 		return (EIO);
 
 	z.zap_block_shift = ilog2(bsize);
 	z.zap_phys = (zap_phys_t *) zap_scratch;
 
 	/*
 	 * This assumes that the leaf blocks start at block 1. The
 	 * documentation isn't exactly clear on this.
 	 */
 	zap_leaf_t zl;
 	zl.l_bs = z.zap_block_shift;
 	for (i = 0; i < zh.zap_num_leafs; i++) {
 		off_t off = (i + 1) << zl.l_bs;
 		char name[256], *p;
 		uint64_t value;
 
 		if (dnode_read(spa, dnode, off, zap_scratch, bsize))
 			return (EIO);
 
 		zl.l_phys = (zap_leaf_phys_t *) zap_scratch;
 
 		for (j = 0; j < ZAP_LEAF_NUMCHUNKS(&zl); j++) {
 			zap_leaf_chunk_t *zc, *nc;
 			int namelen;
 
 			zc = &ZAP_LEAF_CHUNK(&zl, j);
 			if (zc->l_entry.le_type != ZAP_CHUNK_ENTRY)
 				continue;
 			namelen = zc->l_entry.le_name_numints;
 			if (namelen > sizeof(name))
 				namelen = sizeof(name);
 
 			/*
 			 * Paste the name back together.
 			 */
 			nc = &ZAP_LEAF_CHUNK(&zl, zc->l_entry.le_name_chunk);
 			p = name;
 			while (namelen > 0) {
 				int len;
 				len = namelen;
 				if (len > ZAP_LEAF_ARRAY_BYTES)
 					len = ZAP_LEAF_ARRAY_BYTES;
 				memcpy(p, nc->l_array.la_array, len);
 				p += len;
 				namelen -= len;
 				nc = &ZAP_LEAF_CHUNK(&zl, nc->l_array.la_next);
 			}
 
 			/*
 			 * Assume the first eight bytes of the value are
 			 * a uint64_t.
 			 */
 			value = fzap_leaf_value(&zl, zc);
 
 			//printf("%s 0x%jx\n", name, (uintmax_t)value);
 			rc = callback((const char *)name, value);
 			if (rc != 0)
 				return (rc);
 		}
 	}
 
 	return (0);
 }
 
 static int zfs_printf(const char *name, uint64_t value __unused)
 {
 
 	printf("%s\n", name);
 
 	return (0);
 }
 
 /*
  * List a zap directory.
  */
 static int
 zap_list(const spa_t *spa, const dnode_phys_t *dnode)
 {
 	uint64_t zap_type;
 	size_t size = dnode->dn_datablkszsec * 512;
 
 	if (dnode_read(spa, dnode, 0, zap_scratch, size))
 		return (EIO);
 
 	zap_type = *(uint64_t *) zap_scratch;
 	if (zap_type == ZBT_MICRO)
 		return mzap_list(dnode, zfs_printf);
 	else
 		return fzap_list(spa, dnode, zfs_printf);
 }
 
 static int
 objset_get_dnode(const spa_t *spa, const objset_phys_t *os, uint64_t objnum, dnode_phys_t *dnode)
 {
 	off_t offset;
 
 	offset = objnum * sizeof(dnode_phys_t);
 	return dnode_read(spa, &os->os_meta_dnode, offset,
 		dnode, sizeof(dnode_phys_t));
 }
 
 static int
 mzap_rlookup(const spa_t *spa, const dnode_phys_t *dnode, char *name, uint64_t value)
 {
 	const mzap_phys_t *mz;
 	const mzap_ent_phys_t *mze;
 	size_t size;
 	int chunks, i;
 
 	/*
 	 * Microzap objects use exactly one block. Read the whole
 	 * thing.
 	 */
 	size = dnode->dn_datablkszsec * 512;
 
 	mz = (const mzap_phys_t *) zap_scratch;
 	chunks = size / MZAP_ENT_LEN - 1;
 
 	for (i = 0; i < chunks; i++) {
 		mze = &mz->mz_chunk[i];
 		if (value == mze->mze_value) {
 			strcpy(name, mze->mze_name);
 			return (0);
 		}
 	}
 
 	return (ENOENT);
 }
 
 static void
 fzap_name_copy(const zap_leaf_t *zl, const zap_leaf_chunk_t *zc, char *name)
 {
 	size_t namelen;
 	const zap_leaf_chunk_t *nc;
 	char *p;
 
 	namelen = zc->l_entry.le_name_numints;
 
 	nc = &ZAP_LEAF_CHUNK(zl, zc->l_entry.le_name_chunk);
 	p = name;
 	while (namelen > 0) {
 		size_t len;
 		len = namelen;
 		if (len > ZAP_LEAF_ARRAY_BYTES)
 			len = ZAP_LEAF_ARRAY_BYTES;
 		memcpy(p, nc->l_array.la_array, len);
 		p += len;
 		namelen -= len;
 		nc = &ZAP_LEAF_CHUNK(zl, nc->l_array.la_next);
 	}
 
 	*p = '\0';
 }
 
 static int
 fzap_rlookup(const spa_t *spa, const dnode_phys_t *dnode, char *name, uint64_t value)
 {
 	int bsize = dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT;
 	zap_phys_t zh = *(zap_phys_t *) zap_scratch;
 	fat_zap_t z;
 	int i, j;
 
 	if (zh.zap_magic != ZAP_MAGIC)
 		return (EIO);
 
 	z.zap_block_shift = ilog2(bsize);
 	z.zap_phys = (zap_phys_t *) zap_scratch;
 
 	/*
 	 * This assumes that the leaf blocks start at block 1. The
 	 * documentation isn't exactly clear on this.
 	 */
 	zap_leaf_t zl;
 	zl.l_bs = z.zap_block_shift;
 	for (i = 0; i < zh.zap_num_leafs; i++) {
 		off_t off = (i + 1) << zl.l_bs;
 
 		if (dnode_read(spa, dnode, off, zap_scratch, bsize))
 			return (EIO);
 
 		zl.l_phys = (zap_leaf_phys_t *) zap_scratch;
 
 		for (j = 0; j < ZAP_LEAF_NUMCHUNKS(&zl); j++) {
 			zap_leaf_chunk_t *zc;
 
 			zc = &ZAP_LEAF_CHUNK(&zl, j);
 			if (zc->l_entry.le_type != ZAP_CHUNK_ENTRY)
 				continue;
 			if (zc->l_entry.le_value_intlen != 8 ||
 			    zc->l_entry.le_value_numints != 1)
 				continue;
 
 			if (fzap_leaf_value(&zl, zc) == value) {
 				fzap_name_copy(&zl, zc, name);
 				return (0);
 			}
 		}
 	}
 
 	return (ENOENT);
 }
 
 static int
 zap_rlookup(const spa_t *spa, const dnode_phys_t *dnode, char *name, uint64_t value)
 {
 	int rc;
 	uint64_t zap_type;
 	size_t size = dnode->dn_datablkszsec * 512;
 
 	rc = dnode_read(spa, dnode, 0, zap_scratch, size);
 	if (rc)
 		return (rc);
 
 	zap_type = *(uint64_t *) zap_scratch;
 	if (zap_type == ZBT_MICRO)
 		return mzap_rlookup(spa, dnode, name, value);
 	else
 		return fzap_rlookup(spa, dnode, name, value);
 }
 
 static int
 zfs_rlookup(const spa_t *spa, uint64_t objnum, char *result)
 {
 	char name[256];
 	char component[256];
 	uint64_t dir_obj, parent_obj, child_dir_zapobj;
 	dnode_phys_t child_dir_zap, dataset, dir, parent;
 	dsl_dir_phys_t *dd;
 	dsl_dataset_phys_t *ds;
 	char *p;
 	int len;
 
 	p = &name[sizeof(name) - 1];
 	*p = '\0';
 
 	if (objset_get_dnode(spa, &spa->spa_mos, objnum, &dataset)) {
 		printf("ZFS: can't find dataset %ju\n", (uintmax_t)objnum);
 		return (EIO);
 	}
 	ds = (dsl_dataset_phys_t *)&dataset.dn_bonus;
 	dir_obj = ds->ds_dir_obj;
 
 	for (;;) {
 		if (objset_get_dnode(spa, &spa->spa_mos, dir_obj, &dir) != 0)
 			return (EIO);
 		dd = (dsl_dir_phys_t *)&dir.dn_bonus;
 
 		/* Actual loop condition. */
 		parent_obj  = dd->dd_parent_obj;
 		if (parent_obj == 0)
 			break;
 
 		if (objset_get_dnode(spa, &spa->spa_mos, parent_obj, &parent) != 0)
 			return (EIO);
 		dd = (dsl_dir_phys_t *)&parent.dn_bonus;
 		child_dir_zapobj = dd->dd_child_dir_zapobj;
 		if (objset_get_dnode(spa, &spa->spa_mos, child_dir_zapobj, &child_dir_zap) != 0)
 			return (EIO);
 		if (zap_rlookup(spa, &child_dir_zap, component, dir_obj) != 0)
 			return (EIO);
 
 		len = strlen(component);
 		p -= len;
 		memcpy(p, component, len);
 		--p;
 		*p = '/';
 
 		/* Actual loop iteration. */
 		dir_obj = parent_obj;
 	}
 
 	if (*p != '\0')
 		++p;
 	strcpy(result, p);
 
 	return (0);
 }
 
 static int
 zfs_lookup_dataset(const spa_t *spa, const char *name, uint64_t *objnum)
 {
 	char element[256];
 	uint64_t dir_obj, child_dir_zapobj;
 	dnode_phys_t child_dir_zap, dir;
 	dsl_dir_phys_t *dd;
 	const char *p, *q;
 
 	if (objset_get_dnode(spa, &spa->spa_mos, DMU_POOL_DIRECTORY_OBJECT, &dir))
 		return (EIO);
 	if (zap_lookup(spa, &dir, DMU_POOL_ROOT_DATASET, sizeof (dir_obj),
 	    1, &dir_obj))
 		return (EIO);
 
 	p = name;
 	for (;;) {
 		if (objset_get_dnode(spa, &spa->spa_mos, dir_obj, &dir))
 			return (EIO);
 		dd = (dsl_dir_phys_t *)&dir.dn_bonus;
 
 		while (*p == '/')
 			p++;
 		/* Actual loop condition #1. */
 		if (*p == '\0')
 			break;
 
 		q = strchr(p, '/');
 		if (q) {
 			memcpy(element, p, q - p);
 			element[q - p] = '\0';
 			p = q + 1;
 		} else {
 			strcpy(element, p);
 			p += strlen(p);
 		}
 
 		child_dir_zapobj = dd->dd_child_dir_zapobj;
 		if (objset_get_dnode(spa, &spa->spa_mos, child_dir_zapobj, &child_dir_zap) != 0)
 			return (EIO);
 
 		/* Actual loop condition #2. */
 		if (zap_lookup(spa, &child_dir_zap, element, sizeof (dir_obj),
 		    1, &dir_obj) != 0)
 			return (ENOENT);
 	}
 
 	*objnum = dd->dd_head_dataset_obj;
 	return (0);
 }
 
 #ifndef BOOT2
 static int
 zfs_list_dataset(const spa_t *spa, uint64_t objnum/*, int pos, char *entry*/)
 {
 	uint64_t dir_obj, child_dir_zapobj;
 	dnode_phys_t child_dir_zap, dir, dataset;
 	dsl_dataset_phys_t *ds;
 	dsl_dir_phys_t *dd;
 
 	if (objset_get_dnode(spa, &spa->spa_mos, objnum, &dataset)) {
 		printf("ZFS: can't find dataset %ju\n", (uintmax_t)objnum);
 		return (EIO);
 	}
 	ds = (dsl_dataset_phys_t *) &dataset.dn_bonus;
 	dir_obj = ds->ds_dir_obj;
 
 	if (objset_get_dnode(spa, &spa->spa_mos, dir_obj, &dir)) {
 		printf("ZFS: can't find dirobj %ju\n", (uintmax_t)dir_obj);
 		return (EIO);
 	}
 	dd = (dsl_dir_phys_t *)&dir.dn_bonus;
 
 	child_dir_zapobj = dd->dd_child_dir_zapobj;
 	if (objset_get_dnode(spa, &spa->spa_mos, child_dir_zapobj, &child_dir_zap) != 0) {
 		printf("ZFS: can't find child zap %ju\n", (uintmax_t)dir_obj);
 		return (EIO);
 	}
 
 	return (zap_list(spa, &child_dir_zap) != 0);
 }
 
 int
 zfs_callback_dataset(const spa_t *spa, uint64_t objnum, int (*callback)(const char *, uint64_t))
 {
 	uint64_t dir_obj, child_dir_zapobj, zap_type;
 	dnode_phys_t child_dir_zap, dir, dataset;
 	dsl_dataset_phys_t *ds;
 	dsl_dir_phys_t *dd;
 	int err;
 
 	err = objset_get_dnode(spa, &spa->spa_mos, objnum, &dataset);
 	if (err != 0) {
 		printf("ZFS: can't find dataset %ju\n", (uintmax_t)objnum);
 		return (err);
 	}
 	ds = (dsl_dataset_phys_t *) &dataset.dn_bonus;
 	dir_obj = ds->ds_dir_obj;
 
 	err = objset_get_dnode(spa, &spa->spa_mos, dir_obj, &dir);
 	if (err != 0) {
 		printf("ZFS: can't find dirobj %ju\n", (uintmax_t)dir_obj);
 		return (err);
 	}
 	dd = (dsl_dir_phys_t *)&dir.dn_bonus;
 
 	child_dir_zapobj = dd->dd_child_dir_zapobj;
 	err = objset_get_dnode(spa, &spa->spa_mos, child_dir_zapobj, &child_dir_zap);
 	if (err != 0) {
 		printf("ZFS: can't find child zap %ju\n", (uintmax_t)dir_obj);
 		return (err);
 	}
 
 	err = dnode_read(spa, &child_dir_zap, 0, zap_scratch, child_dir_zap.dn_datablkszsec * 512);
 	if (err != 0)
 		return (err);
 
 	zap_type = *(uint64_t *) zap_scratch;
 	if (zap_type == ZBT_MICRO)
 		return mzap_list(&child_dir_zap, callback);
 	else
 		return fzap_list(spa, &child_dir_zap, callback);
 }
 #endif
 
 /*
  * Find the object set given the object number of its dataset object
  * and return its details in *objset
  */
 static int
 zfs_mount_dataset(const spa_t *spa, uint64_t objnum, objset_phys_t *objset)
 {
 	dnode_phys_t dataset;
 	dsl_dataset_phys_t *ds;
 
 	if (objset_get_dnode(spa, &spa->spa_mos, objnum, &dataset)) {
 		printf("ZFS: can't find dataset %ju\n", (uintmax_t)objnum);
 		return (EIO);
 	}
 
 	ds = (dsl_dataset_phys_t *) &dataset.dn_bonus;
 	if (zio_read(spa, &ds->ds_bp, objset)) {
 		printf("ZFS: can't read object set for dataset %ju\n",
 		    (uintmax_t)objnum);
 		return (EIO);
 	}
 
 	return (0);
 }
 
 /*
  * Find the object set pointed to by the BOOTFS property or the root
  * dataset if there is none and return its details in *objset
  */
 static int
 zfs_get_root(const spa_t *spa, uint64_t *objid)
 {
 	dnode_phys_t dir, propdir;
 	uint64_t props, bootfs, root;
 
 	*objid = 0;
 
 	/*
 	 * Start with the MOS directory object.
 	 */
 	if (objset_get_dnode(spa, &spa->spa_mos, DMU_POOL_DIRECTORY_OBJECT, &dir)) {
 		printf("ZFS: can't read MOS object directory\n");
 		return (EIO);
 	}
 
 	/*
 	 * Lookup the pool_props and see if we can find a bootfs.
 	 */
 	if (zap_lookup(spa, &dir, DMU_POOL_PROPS, sizeof (props), 1, &props) == 0
 	     && objset_get_dnode(spa, &spa->spa_mos, props, &propdir) == 0
 	     && zap_lookup(spa, &propdir, "bootfs", sizeof (bootfs), 1, &bootfs) == 0
 	     && bootfs != 0)
 	{
 		*objid = bootfs;
 		return (0);
 	}
 	/*
 	 * Lookup the root dataset directory
 	 */
 	if (zap_lookup(spa, &dir, DMU_POOL_ROOT_DATASET, sizeof (root), 1, &root)
 	    || objset_get_dnode(spa, &spa->spa_mos, root, &dir)) {
 		printf("ZFS: can't find root dsl_dir\n");
 		return (EIO);
 	}
 
 	/*
 	 * Use the information from the dataset directory's bonus buffer
 	 * to find the dataset object and from that the object set itself.
 	 */
 	dsl_dir_phys_t *dd = (dsl_dir_phys_t *) &dir.dn_bonus;
 	*objid = dd->dd_head_dataset_obj;
 	return (0);
 }
 
 static int
 zfs_mount(const spa_t *spa, uint64_t rootobj, struct zfsmount *mount)
 {
 
 	mount->spa = spa;
 
 	/*
 	 * Find the root object set if not explicitly provided
 	 */
 	if (rootobj == 0 && zfs_get_root(spa, &rootobj)) {
 		printf("ZFS: can't find root filesystem\n");
 		return (EIO);
 	}
 
 	if (zfs_mount_dataset(spa, rootobj, &mount->objset)) {
 		printf("ZFS: can't open root filesystem\n");
 		return (EIO);
 	}
 
 	mount->rootobj = rootobj;
 
 	return (0);
 }
 
 /*
  * callback function for feature name checks.
  */
 static int
 check_feature(const char *name, uint64_t value)
 {
 	int i;
 
 	if (value == 0)
 		return (0);
 	if (name[0] == '\0')
 		return (0);
 
 	for (i = 0; features_for_read[i] != NULL; i++) {
 		if (strcmp(name, features_for_read[i]) == 0)
 			return (0);
 	}
 	printf("ZFS: unsupported feature: %s\n", name);
 	return (EIO);
 }
 
 /*
  * Checks whether the MOS features that are active are supported.
  */
 static int
 check_mos_features(const spa_t *spa)
 {
 	dnode_phys_t dir;
 	uint64_t objnum, zap_type;
 	size_t size;
 	int rc;
 
 	if ((rc = objset_get_dnode(spa, &spa->spa_mos, DMU_OT_OBJECT_DIRECTORY,
 	    &dir)) != 0)
 		return (rc);
 	if ((rc = zap_lookup(spa, &dir, DMU_POOL_FEATURES_FOR_READ,
 	    sizeof (objnum), 1, &objnum)) != 0) {
 		/*
 		 * It is older pool without features. As we have already
 		 * tested the label, just return without raising the error.
 		 */
 		return (0);
 	}
 
 	if ((rc = objset_get_dnode(spa, &spa->spa_mos, objnum, &dir)) != 0)
 		return (rc);
 
 	if (dir.dn_type != DMU_OTN_ZAP_METADATA)
 		return (EIO);
 
 	size = dir.dn_datablkszsec * 512;
 	if (dnode_read(spa, &dir, 0, zap_scratch, size))
 		return (EIO);
 
 	zap_type = *(uint64_t *) zap_scratch;
 	if (zap_type == ZBT_MICRO)
 		rc = mzap_list(&dir, check_feature);
 	else
 		rc = fzap_list(spa, &dir, check_feature);
 
 	return (rc);
 }
 
 static int
 zfs_spa_init(spa_t *spa)
 {
 	dnode_phys_t dir;
 	int rc;
 
 	if (zio_read(spa, &spa->spa_uberblock.ub_rootbp, &spa->spa_mos)) {
 		printf("ZFS: can't read MOS of pool %s\n", spa->spa_name);
 		return (EIO);
 	}
 	if (spa->spa_mos.os_type != DMU_OST_META) {
 		printf("ZFS: corrupted MOS of pool %s\n", spa->spa_name);
 		return (EIO);
 	}
 
 	if (objset_get_dnode(spa, &spa->spa_mos, DMU_POOL_DIRECTORY_OBJECT,
 	    &dir)) {
 		printf("ZFS: failed to read pool %s directory object\n",
 		    spa->spa_name);
 		return (EIO);
 	}
 	/* this is allowed to fail, older pools do not have salt */
 	rc = zap_lookup(spa, &dir, DMU_POOL_CHECKSUM_SALT, 1,
 	    sizeof (spa->spa_cksum_salt.zcs_bytes),
 	    spa->spa_cksum_salt.zcs_bytes);
 
 	rc = check_mos_features(spa);
 	if (rc != 0) {
 		printf("ZFS: pool %s is not supported\n", spa->spa_name);
 	}
 
 	return (rc);
 }
 
 static int
 zfs_dnode_stat(const spa_t *spa, dnode_phys_t *dn, struct stat *sb)
 {
 
 	if (dn->dn_bonustype != DMU_OT_SA) {
 		znode_phys_t *zp = (znode_phys_t *)dn->dn_bonus;
 
 		sb->st_mode = zp->zp_mode;
 		sb->st_uid = zp->zp_uid;
 		sb->st_gid = zp->zp_gid;
 		sb->st_size = zp->zp_size;
 	} else {
 		sa_hdr_phys_t *sahdrp;
 		int hdrsize;
 		size_t size = 0;
 		void *buf = NULL;
 
 		if (dn->dn_bonuslen != 0)
 			sahdrp = (sa_hdr_phys_t *)DN_BONUS(dn);
 		else {
 			if ((dn->dn_flags & DNODE_FLAG_SPILL_BLKPTR) != 0) {
 				blkptr_t *bp = &dn->dn_spill;
 				int error;
 
 				size = BP_GET_LSIZE(bp);
 				buf = zfs_alloc(size);
 				error = zio_read(spa, bp, buf);
 				if (error != 0) {
 					zfs_free(buf, size);
 					return (error);
 				}
 				sahdrp = buf;
 			} else {
 				return (EIO);
 			}
 		}
 		hdrsize = SA_HDR_SIZE(sahdrp);
 		sb->st_mode = *(uint64_t *)((char *)sahdrp + hdrsize +
 		    SA_MODE_OFFSET);
 		sb->st_uid = *(uint64_t *)((char *)sahdrp + hdrsize +
 		    SA_UID_OFFSET);
 		sb->st_gid = *(uint64_t *)((char *)sahdrp + hdrsize +
 		    SA_GID_OFFSET);
 		sb->st_size = *(uint64_t *)((char *)sahdrp + hdrsize +
 		    SA_SIZE_OFFSET);
 		if (buf != NULL)
 			zfs_free(buf, size);
 	}
 
 	return (0);
 }
 
 static int
 zfs_dnode_readlink(const spa_t *spa, dnode_phys_t *dn, char *path, size_t psize)
 {
 	int rc = 0;
 
 	if (dn->dn_bonustype == DMU_OT_SA) {
 		sa_hdr_phys_t *sahdrp = NULL;
 		size_t size = 0;
 		void *buf = NULL;
 		int hdrsize;
 		char *p;
 
 		if (dn->dn_bonuslen != 0)
 			sahdrp = (sa_hdr_phys_t *)DN_BONUS(dn);
 		else {
 			blkptr_t *bp;
 
 			if ((dn->dn_flags & DNODE_FLAG_SPILL_BLKPTR) == 0)
 				return (EIO);
 			bp = &dn->dn_spill;
 
 			size = BP_GET_LSIZE(bp);
 			buf = zfs_alloc(size);
 			rc = zio_read(spa, bp, buf);
 			if (rc != 0) {
 				zfs_free(buf, size);
 				return (rc);
 			}
 			sahdrp = buf;
 		}
 		hdrsize = SA_HDR_SIZE(sahdrp);
 		p = (char *)((uintptr_t)sahdrp + hdrsize + SA_SYMLINK_OFFSET);
 		memcpy(path, p, psize);
 		if (buf != NULL)
 			zfs_free(buf, size);
 		return (0);
 	}
 	/*
 	 * Second test is purely to silence bogus compiler
 	 * warning about accessing past the end of dn_bonus.
 	 */
 	if (psize + sizeof(znode_phys_t) <= dn->dn_bonuslen &&
 	    sizeof(znode_phys_t) <= sizeof(dn->dn_bonus)) {
 		memcpy(path, &dn->dn_bonus[sizeof(znode_phys_t)], psize);
 	} else {
 		rc = dnode_read(spa, dn, 0, path, psize);
 	}
 	return (rc);
 }
 
 struct obj_list {
 	uint64_t		objnum;
 	STAILQ_ENTRY(obj_list)	entry;
 };
 
 /*
  * Lookup a file and return its dnode.
  */
 static int
 zfs_lookup(const struct zfsmount *mount, const char *upath, dnode_phys_t *dnode)
 {
 	int rc;
 	uint64_t objnum;
 	const spa_t *spa;
 	dnode_phys_t dn;
 	const char *p, *q;
 	char element[256];
 	char path[1024];
 	int symlinks_followed = 0;
 	struct stat sb;
-	struct obj_list *entry;
+	struct obj_list *entry, *tentry;
 	STAILQ_HEAD(, obj_list) on_cache = STAILQ_HEAD_INITIALIZER(on_cache);
 
 	spa = mount->spa;
 	if (mount->objset.os_type != DMU_OST_ZFS) {
 		printf("ZFS: unexpected object set type %ju\n",
 		    (uintmax_t)mount->objset.os_type);
 		return (EIO);
 	}
 
 	if ((entry = malloc(sizeof(struct obj_list))) == NULL)
 		return (ENOMEM);
 
 	/*
 	 * Get the root directory dnode.
 	 */
 	rc = objset_get_dnode(spa, &mount->objset, MASTER_NODE_OBJ, &dn);
 	if (rc) {
 		free(entry);
 		return (rc);
 	}
 
 	rc = zap_lookup(spa, &dn, ZFS_ROOT_OBJ, sizeof (objnum), 1, &objnum);
 	if (rc) {
 		free(entry);
 		return (rc);
 	}
 	entry->objnum = objnum;
 	STAILQ_INSERT_HEAD(&on_cache, entry, entry);
 
 	rc = objset_get_dnode(spa, &mount->objset, objnum, &dn);
 	if (rc != 0)
 		goto done;
 
 	p = upath;
 	while (p && *p) {
 		rc = objset_get_dnode(spa, &mount->objset, objnum, &dn);
 		if (rc != 0)
 			goto done;
 
 		while (*p == '/')
 			p++;
 		if (*p == '\0')
 			break;
 		q = p;
 		while (*q != '\0' && *q != '/')
 			q++;
 
 		/* skip dot */
 		if (p + 1 == q && p[0] == '.') {
 			p++;
 			continue;
 		}
 		/* double dot */
 		if (p + 2 == q && p[0] == '.' && p[1] == '.') {
 			p += 2;
 			if (STAILQ_FIRST(&on_cache) ==
 			    STAILQ_LAST(&on_cache, obj_list, entry)) {
 				rc = ENOENT;
 				goto done;
 			}
 			entry = STAILQ_FIRST(&on_cache);
 			STAILQ_REMOVE_HEAD(&on_cache, entry);
 			free(entry);
 			objnum = (STAILQ_FIRST(&on_cache))->objnum;
 			continue;
 		}
 		if (q - p + 1 > sizeof(element)) {
 			rc = ENAMETOOLONG;
 			goto done;
 		}
 		memcpy(element, p, q - p);
 		element[q - p] = 0;
 		p = q;
 
 		if ((rc = zfs_dnode_stat(spa, &dn, &sb)) != 0)
 			goto done;
 		if (!S_ISDIR(sb.st_mode)) {
 			rc = ENOTDIR;
 			goto done;
 		}
 
 		rc = zap_lookup(spa, &dn, element, sizeof (objnum), 1, &objnum);
 		if (rc)
 			goto done;
 		objnum = ZFS_DIRENT_OBJ(objnum);
 
 		if ((entry = malloc(sizeof(struct obj_list))) == NULL) {
 			rc = ENOMEM;
 			goto done;
 		}
 		entry->objnum = objnum;
 		STAILQ_INSERT_HEAD(&on_cache, entry, entry);
 		rc = objset_get_dnode(spa, &mount->objset, objnum, &dn);
 		if (rc)
 			goto done;
 
 		/*
 		 * Check for symlink.
 		 */
 		rc = zfs_dnode_stat(spa, &dn, &sb);
 		if (rc)
 			goto done;
 		if (S_ISLNK(sb.st_mode)) {
 			if (symlinks_followed > 10) {
 				rc = EMLINK;
 				goto done;
 			}
 			symlinks_followed++;
 
 			/*
 			 * Read the link value and copy the tail of our
 			 * current path onto the end.
 			 */
 			if (sb.st_size + strlen(p) + 1 > sizeof(path)) {
 				rc = ENAMETOOLONG;
 				goto done;
 			}
 			strcpy(&path[sb.st_size], p);
 
 			rc = zfs_dnode_readlink(spa, &dn, path, sb.st_size);
 			if (rc != 0)
 				goto done;
 
 			/*
 			 * Restart with the new path, starting either at
 			 * the root or at the parent depending whether or
 			 * not the link is relative.
 			 */
 			p = path;
 			if (*p == '/') {
 				while (STAILQ_FIRST(&on_cache) !=
 				    STAILQ_LAST(&on_cache, obj_list, entry)) {
 					entry = STAILQ_FIRST(&on_cache);
 					STAILQ_REMOVE_HEAD(&on_cache, entry);
 					free(entry);
 				}
 			} else {
 				entry = STAILQ_FIRST(&on_cache);
 				STAILQ_REMOVE_HEAD(&on_cache, entry);
 				free(entry);
 			}
 			objnum = (STAILQ_FIRST(&on_cache))->objnum;
 		}
 	}
 
 	*dnode = dn;
 done:
-	STAILQ_FOREACH(entry, &on_cache, entry)
+	STAILQ_FOREACH_SAFE(entry, &on_cache, entry, tentry)
 		free(entry);
 	return (rc);
 }
Index: projects/clang400-import/sys/cam/ctl/ctl.c
===================================================================
--- projects/clang400-import/sys/cam/ctl/ctl.c	(revision 314522)
+++ projects/clang400-import/sys/cam/ctl/ctl.c	(revision 314523)
@@ -1,13537 +1,13537 @@
 /*-
  * Copyright (c) 2003-2009 Silicon Graphics International Corp.
  * Copyright (c) 2012 The FreeBSD Foundation
  * Copyright (c) 2014-2017 Alexander Motin <mav@FreeBSD.org>
  * All rights reserved.
  *
  * Portions of this software were developed by Edward Tomasz Napierala
  * under sponsorship from the FreeBSD Foundation.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions, and the following disclaimer,
  *    without modification.
  * 2. Redistributions in binary form must reproduce at minimum a disclaimer
  *    substantially similar to the "NO WARRANTY" disclaimer below
  *    ("Disclaimer") and any redistribution must be conditioned upon
  *    including a substantially similar Disclaimer requirement for further
  *    binary redistribution.
  *
  * NO WARRANTY
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR
  * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
  * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
  * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  * POSSIBILITY OF SUCH DAMAGES.
  *
  * $Id$
  */
 /*
  * CAM Target Layer, a SCSI device emulation subsystem.
  *
  * Author: Ken Merry <ken@FreeBSD.org>
  */
 
 #define _CTL_C
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/ctype.h>
 #include <sys/kernel.h>
 #include <sys/types.h>
 #include <sys/kthread.h>
 #include <sys/bio.h>
 #include <sys/fcntl.h>
 #include <sys/lock.h>
 #include <sys/module.h>
 #include <sys/mutex.h>
 #include <sys/condvar.h>
 #include <sys/malloc.h>
 #include <sys/conf.h>
 #include <sys/ioccom.h>
 #include <sys/queue.h>
 #include <sys/sbuf.h>
 #include <sys/smp.h>
 #include <sys/endian.h>
 #include <sys/sysctl.h>
 #include <vm/uma.h>
 
 #include <cam/cam.h>
 #include <cam/scsi/scsi_all.h>
 #include <cam/scsi/scsi_cd.h>
 #include <cam/scsi/scsi_da.h>
 #include <cam/ctl/ctl_io.h>
 #include <cam/ctl/ctl.h>
 #include <cam/ctl/ctl_frontend.h>
 #include <cam/ctl/ctl_util.h>
 #include <cam/ctl/ctl_backend.h>
 #include <cam/ctl/ctl_ioctl.h>
 #include <cam/ctl/ctl_ha.h>
 #include <cam/ctl/ctl_private.h>
 #include <cam/ctl/ctl_debug.h>
 #include <cam/ctl/ctl_scsi_all.h>
 #include <cam/ctl/ctl_error.h>
 
 struct ctl_softc *control_softc = NULL;
 
 /*
  * Template mode pages.
  */
 
 /*
  * Note that these are default values only.  The actual values will be
  * filled in when the user does a mode sense.
  */
 const static struct scsi_da_rw_recovery_page rw_er_page_default = {
 	/*page_code*/SMS_RW_ERROR_RECOVERY_PAGE,
 	/*page_length*/sizeof(struct scsi_da_rw_recovery_page) - 2,
 	/*byte3*/SMS_RWER_AWRE|SMS_RWER_ARRE,
 	/*read_retry_count*/0,
 	/*correction_span*/0,
 	/*head_offset_count*/0,
 	/*data_strobe_offset_cnt*/0,
 	/*byte8*/SMS_RWER_LBPERE,
 	/*write_retry_count*/0,
 	/*reserved2*/0,
 	/*recovery_time_limit*/{0, 0},
 };
 
 const static struct scsi_da_rw_recovery_page rw_er_page_changeable = {
 	/*page_code*/SMS_RW_ERROR_RECOVERY_PAGE,
 	/*page_length*/sizeof(struct scsi_da_rw_recovery_page) - 2,
 	/*byte3*/SMS_RWER_PER,
 	/*read_retry_count*/0,
 	/*correction_span*/0,
 	/*head_offset_count*/0,
 	/*data_strobe_offset_cnt*/0,
 	/*byte8*/SMS_RWER_LBPERE,
 	/*write_retry_count*/0,
 	/*reserved2*/0,
 	/*recovery_time_limit*/{0, 0},
 };
 
 const static struct scsi_format_page format_page_default = {
 	/*page_code*/SMS_FORMAT_DEVICE_PAGE,
 	/*page_length*/sizeof(struct scsi_format_page) - 2,
 	/*tracks_per_zone*/ {0, 0},
 	/*alt_sectors_per_zone*/ {0, 0},
 	/*alt_tracks_per_zone*/ {0, 0},
 	/*alt_tracks_per_lun*/ {0, 0},
 	/*sectors_per_track*/ {(CTL_DEFAULT_SECTORS_PER_TRACK >> 8) & 0xff,
 			        CTL_DEFAULT_SECTORS_PER_TRACK & 0xff},
 	/*bytes_per_sector*/ {0, 0},
 	/*interleave*/ {0, 0},
 	/*track_skew*/ {0, 0},
 	/*cylinder_skew*/ {0, 0},
 	/*flags*/ SFP_HSEC,
 	/*reserved*/ {0, 0, 0}
 };
 
 const static struct scsi_format_page format_page_changeable = {
 	/*page_code*/SMS_FORMAT_DEVICE_PAGE,
 	/*page_length*/sizeof(struct scsi_format_page) - 2,
 	/*tracks_per_zone*/ {0, 0},
 	/*alt_sectors_per_zone*/ {0, 0},
 	/*alt_tracks_per_zone*/ {0, 0},
 	/*alt_tracks_per_lun*/ {0, 0},
 	/*sectors_per_track*/ {0, 0},
 	/*bytes_per_sector*/ {0, 0},
 	/*interleave*/ {0, 0},
 	/*track_skew*/ {0, 0},
 	/*cylinder_skew*/ {0, 0},
 	/*flags*/ 0,
 	/*reserved*/ {0, 0, 0}
 };
 
 const static struct scsi_rigid_disk_page rigid_disk_page_default = {
 	/*page_code*/SMS_RIGID_DISK_PAGE,
 	/*page_length*/sizeof(struct scsi_rigid_disk_page) - 2,
 	/*cylinders*/ {0, 0, 0},
 	/*heads*/ CTL_DEFAULT_HEADS,
 	/*start_write_precomp*/ {0, 0, 0},
 	/*start_reduced_current*/ {0, 0, 0},
 	/*step_rate*/ {0, 0},
 	/*landing_zone_cylinder*/ {0, 0, 0},
 	/*rpl*/ SRDP_RPL_DISABLED,
 	/*rotational_offset*/ 0,
 	/*reserved1*/ 0,
 	/*rotation_rate*/ {(CTL_DEFAULT_ROTATION_RATE >> 8) & 0xff,
 			   CTL_DEFAULT_ROTATION_RATE & 0xff},
 	/*reserved2*/ {0, 0}
 };
 
 const static struct scsi_rigid_disk_page rigid_disk_page_changeable = {
 	/*page_code*/SMS_RIGID_DISK_PAGE,
 	/*page_length*/sizeof(struct scsi_rigid_disk_page) - 2,
 	/*cylinders*/ {0, 0, 0},
 	/*heads*/ 0,
 	/*start_write_precomp*/ {0, 0, 0},
 	/*start_reduced_current*/ {0, 0, 0},
 	/*step_rate*/ {0, 0},
 	/*landing_zone_cylinder*/ {0, 0, 0},
 	/*rpl*/ 0,
 	/*rotational_offset*/ 0,
 	/*reserved1*/ 0,
 	/*rotation_rate*/ {0, 0},
 	/*reserved2*/ {0, 0}
 };
 
 const static struct scsi_da_verify_recovery_page verify_er_page_default = {
 	/*page_code*/SMS_VERIFY_ERROR_RECOVERY_PAGE,
 	/*page_length*/sizeof(struct scsi_da_verify_recovery_page) - 2,
 	/*byte3*/0,
 	/*read_retry_count*/0,
 	/*reserved*/{ 0, 0, 0, 0, 0, 0 },
 	/*recovery_time_limit*/{0, 0},
 };
 
 const static struct scsi_da_verify_recovery_page verify_er_page_changeable = {
 	/*page_code*/SMS_VERIFY_ERROR_RECOVERY_PAGE,
 	/*page_length*/sizeof(struct scsi_da_verify_recovery_page) - 2,
 	/*byte3*/SMS_VER_PER,
 	/*read_retry_count*/0,
 	/*reserved*/{ 0, 0, 0, 0, 0, 0 },
 	/*recovery_time_limit*/{0, 0},
 };
 
 const static struct scsi_caching_page caching_page_default = {
 	/*page_code*/SMS_CACHING_PAGE,
 	/*page_length*/sizeof(struct scsi_caching_page) - 2,
 	/*flags1*/ SCP_DISC | SCP_WCE,
 	/*ret_priority*/ 0,
 	/*disable_pf_transfer_len*/ {0xff, 0xff},
 	/*min_prefetch*/ {0, 0},
 	/*max_prefetch*/ {0xff, 0xff},
 	/*max_pf_ceiling*/ {0xff, 0xff},
 	/*flags2*/ 0,
 	/*cache_segments*/ 0,
 	/*cache_seg_size*/ {0, 0},
 	/*reserved*/ 0,
 	/*non_cache_seg_size*/ {0, 0, 0}
 };
 
 const static struct scsi_caching_page caching_page_changeable = {
 	/*page_code*/SMS_CACHING_PAGE,
 	/*page_length*/sizeof(struct scsi_caching_page) - 2,
 	/*flags1*/ SCP_WCE | SCP_RCD,
 	/*ret_priority*/ 0,
 	/*disable_pf_transfer_len*/ {0, 0},
 	/*min_prefetch*/ {0, 0},
 	/*max_prefetch*/ {0, 0},
 	/*max_pf_ceiling*/ {0, 0},
 	/*flags2*/ 0,
 	/*cache_segments*/ 0,
 	/*cache_seg_size*/ {0, 0},
 	/*reserved*/ 0,
 	/*non_cache_seg_size*/ {0, 0, 0}
 };
 
 const static struct scsi_control_page control_page_default = {
 	/*page_code*/SMS_CONTROL_MODE_PAGE,
 	/*page_length*/sizeof(struct scsi_control_page) - 2,
 	/*rlec*/0,
 	/*queue_flags*/SCP_QUEUE_ALG_RESTRICTED,
 	/*eca_and_aen*/0,
 	/*flags4*/SCP_TAS,
 	/*aen_holdoff_period*/{0, 0},
 	/*busy_timeout_period*/{0, 0},
 	/*extended_selftest_completion_time*/{0, 0}
 };
 
 const static struct scsi_control_page control_page_changeable = {
 	/*page_code*/SMS_CONTROL_MODE_PAGE,
 	/*page_length*/sizeof(struct scsi_control_page) - 2,
 	/*rlec*/SCP_DSENSE,
 	/*queue_flags*/SCP_QUEUE_ALG_MASK | SCP_NUAR,
 	/*eca_and_aen*/SCP_SWP,
 	/*flags4*/0,
 	/*aen_holdoff_period*/{0, 0},
 	/*busy_timeout_period*/{0, 0},
 	/*extended_selftest_completion_time*/{0, 0}
 };
 
 #define CTL_CEM_LEN	(sizeof(struct scsi_control_ext_page) - 4)
 
 const static struct scsi_control_ext_page control_ext_page_default = {
 	/*page_code*/SMS_CONTROL_MODE_PAGE | SMPH_SPF,
 	/*subpage_code*/0x01,
 	/*page_length*/{CTL_CEM_LEN >> 8, CTL_CEM_LEN},
 	/*flags*/0,
 	/*prio*/0,
 	/*max_sense*/0
 };
 
 const static struct scsi_control_ext_page control_ext_page_changeable = {
 	/*page_code*/SMS_CONTROL_MODE_PAGE | SMPH_SPF,
 	/*subpage_code*/0x01,
 	/*page_length*/{CTL_CEM_LEN >> 8, CTL_CEM_LEN},
 	/*flags*/0,
 	/*prio*/0,
 	/*max_sense*/0xff
 };
 
 const static struct scsi_info_exceptions_page ie_page_default = {
 	/*page_code*/SMS_INFO_EXCEPTIONS_PAGE,
 	/*page_length*/sizeof(struct scsi_info_exceptions_page) - 2,
 	/*info_flags*/SIEP_FLAGS_EWASC,
 	/*mrie*/SIEP_MRIE_NO,
 	/*interval_timer*/{0, 0, 0, 0},
 	/*report_count*/{0, 0, 0, 1}
 };
 
 const static struct scsi_info_exceptions_page ie_page_changeable = {
 	/*page_code*/SMS_INFO_EXCEPTIONS_PAGE,
 	/*page_length*/sizeof(struct scsi_info_exceptions_page) - 2,
 	/*info_flags*/SIEP_FLAGS_EWASC | SIEP_FLAGS_DEXCPT | SIEP_FLAGS_TEST |
 	    SIEP_FLAGS_LOGERR,
 	/*mrie*/0x0f,
 	/*interval_timer*/{0xff, 0xff, 0xff, 0xff},
 	/*report_count*/{0xff, 0xff, 0xff, 0xff}
 };
 
 #define CTL_LBPM_LEN	(sizeof(struct ctl_logical_block_provisioning_page) - 4)
 
 const static struct ctl_logical_block_provisioning_page lbp_page_default = {{
 	/*page_code*/SMS_INFO_EXCEPTIONS_PAGE | SMPH_SPF,
 	/*subpage_code*/0x02,
 	/*page_length*/{CTL_LBPM_LEN >> 8, CTL_LBPM_LEN},
 	/*flags*/0,
 	/*reserved*/{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
 	/*descr*/{}},
 	{{/*flags*/0,
 	  /*resource*/0x01,
 	  /*reserved*/{0, 0},
 	  /*count*/{0, 0, 0, 0}},
 	 {/*flags*/0,
 	  /*resource*/0x02,
 	  /*reserved*/{0, 0},
 	  /*count*/{0, 0, 0, 0}},
 	 {/*flags*/0,
 	  /*resource*/0xf1,
 	  /*reserved*/{0, 0},
 	  /*count*/{0, 0, 0, 0}},
 	 {/*flags*/0,
 	  /*resource*/0xf2,
 	  /*reserved*/{0, 0},
 	  /*count*/{0, 0, 0, 0}}
 	}
 };
 
 const static struct ctl_logical_block_provisioning_page lbp_page_changeable = {{
 	/*page_code*/SMS_INFO_EXCEPTIONS_PAGE | SMPH_SPF,
 	/*subpage_code*/0x02,
 	/*page_length*/{CTL_LBPM_LEN >> 8, CTL_LBPM_LEN},
 	/*flags*/SLBPP_SITUA,
 	/*reserved*/{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
 	/*descr*/{}},
 	{{/*flags*/0,
 	  /*resource*/0,
 	  /*reserved*/{0, 0},
 	  /*count*/{0, 0, 0, 0}},
 	 {/*flags*/0,
 	  /*resource*/0,
 	  /*reserved*/{0, 0},
 	  /*count*/{0, 0, 0, 0}},
 	 {/*flags*/0,
 	  /*resource*/0,
 	  /*reserved*/{0, 0},
 	  /*count*/{0, 0, 0, 0}},
 	 {/*flags*/0,
 	  /*resource*/0,
 	  /*reserved*/{0, 0},
 	  /*count*/{0, 0, 0, 0}}
 	}
 };
 
 const static struct scsi_cddvd_capabilities_page cddvd_page_default = {
 	/*page_code*/SMS_CDDVD_CAPS_PAGE,
 	/*page_length*/sizeof(struct scsi_cddvd_capabilities_page) - 2,
 	/*caps1*/0x3f,
 	/*caps2*/0x00,
 	/*caps3*/0xf0,
 	/*caps4*/0x00,
 	/*caps5*/0x29,
 	/*caps6*/0x00,
 	/*obsolete*/{0, 0},
 	/*nvol_levels*/{0, 0},
 	/*buffer_size*/{8, 0},
 	/*obsolete2*/{0, 0},
 	/*reserved*/0,
 	/*digital*/0,
 	/*obsolete3*/0,
 	/*copy_management*/0,
 	/*reserved2*/0,
 	/*rotation_control*/0,
 	/*cur_write_speed*/0,
 	/*num_speed_descr*/0,
 };
 
 const static struct scsi_cddvd_capabilities_page cddvd_page_changeable = {
 	/*page_code*/SMS_CDDVD_CAPS_PAGE,
 	/*page_length*/sizeof(struct scsi_cddvd_capabilities_page) - 2,
 	/*caps1*/0,
 	/*caps2*/0,
 	/*caps3*/0,
 	/*caps4*/0,
 	/*caps5*/0,
 	/*caps6*/0,
 	/*obsolete*/{0, 0},
 	/*nvol_levels*/{0, 0},
 	/*buffer_size*/{0, 0},
 	/*obsolete2*/{0, 0},
 	/*reserved*/0,
 	/*digital*/0,
 	/*obsolete3*/0,
 	/*copy_management*/0,
 	/*reserved2*/0,
 	/*rotation_control*/0,
 	/*cur_write_speed*/0,
 	/*num_speed_descr*/0,
 };
 
 SYSCTL_NODE(_kern_cam, OID_AUTO, ctl, CTLFLAG_RD, 0, "CAM Target Layer");
 static int worker_threads = -1;
 SYSCTL_INT(_kern_cam_ctl, OID_AUTO, worker_threads, CTLFLAG_RDTUN,
     &worker_threads, 1, "Number of worker threads");
 static int ctl_debug = CTL_DEBUG_NONE;
 SYSCTL_INT(_kern_cam_ctl, OID_AUTO, debug, CTLFLAG_RWTUN,
     &ctl_debug, 0, "Enabled debug flags");
 static int ctl_lun_map_size = 1024;
 SYSCTL_INT(_kern_cam_ctl, OID_AUTO, lun_map_size, CTLFLAG_RWTUN,
     &ctl_lun_map_size, 0, "Size of per-port LUN map (max LUN + 1)");
 
 /*
  * Supported pages (0x00), Serial number (0x80), Device ID (0x83),
  * Extended INQUIRY Data (0x86), Mode Page Policy (0x87),
  * SCSI Ports (0x88), Third-party Copy (0x8F), Block limits (0xB0),
  * Block Device Characteristics (0xB1) and Logical Block Provisioning (0xB2)
  */
 #define SCSI_EVPD_NUM_SUPPORTED_PAGES	10
 
 static void ctl_isc_event_handler(ctl_ha_channel chanel, ctl_ha_event event,
 				  int param);
 static void ctl_copy_sense_data(union ctl_ha_msg *src, union ctl_io *dest);
 static void ctl_copy_sense_data_back(union ctl_io *src, union ctl_ha_msg *dest);
 static int ctl_init(void);
 static int ctl_shutdown(void);
 static int ctl_open(struct cdev *dev, int flags, int fmt, struct thread *td);
 static int ctl_close(struct cdev *dev, int flags, int fmt, struct thread *td);
 static void ctl_serialize_other_sc_cmd(struct ctl_scsiio *ctsio);
 static void ctl_ioctl_fill_ooa(struct ctl_lun *lun, uint32_t *cur_fill_num,
 			      struct ctl_ooa *ooa_hdr,
 			      struct ctl_ooa_entry *kern_entries);
 static int ctl_ioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flag,
 		     struct thread *td);
 static int ctl_alloc_lun(struct ctl_softc *ctl_softc, struct ctl_lun *lun,
 			 struct ctl_be_lun *be_lun);
 static int ctl_free_lun(struct ctl_lun *lun);
 static void ctl_create_lun(struct ctl_be_lun *be_lun);
 
 static int ctl_do_mode_select(union ctl_io *io);
 static int ctl_pro_preempt(struct ctl_softc *softc, struct ctl_lun *lun,
 			   uint64_t res_key, uint64_t sa_res_key,
 			   uint8_t type, uint32_t residx,
 			   struct ctl_scsiio *ctsio,
 			   struct scsi_per_res_out *cdb,
 			   struct scsi_per_res_out_parms* param);
 static void ctl_pro_preempt_other(struct ctl_lun *lun,
 				  union ctl_ha_msg *msg);
 static void ctl_hndl_per_res_out_on_other_sc(union ctl_io *io);
 static int ctl_inquiry_evpd_supported(struct ctl_scsiio *ctsio, int alloc_len);
 static int ctl_inquiry_evpd_serial(struct ctl_scsiio *ctsio, int alloc_len);
 static int ctl_inquiry_evpd_devid(struct ctl_scsiio *ctsio, int alloc_len);
 static int ctl_inquiry_evpd_eid(struct ctl_scsiio *ctsio, int alloc_len);
 static int ctl_inquiry_evpd_mpp(struct ctl_scsiio *ctsio, int alloc_len);
 static int ctl_inquiry_evpd_scsi_ports(struct ctl_scsiio *ctsio,
 					 int alloc_len);
 static int ctl_inquiry_evpd_block_limits(struct ctl_scsiio *ctsio,
 					 int alloc_len);
 static int ctl_inquiry_evpd_bdc(struct ctl_scsiio *ctsio, int alloc_len);
 static int ctl_inquiry_evpd_lbp(struct ctl_scsiio *ctsio, int alloc_len);
 static int ctl_inquiry_evpd(struct ctl_scsiio *ctsio);
 static int ctl_inquiry_std(struct ctl_scsiio *ctsio);
 static int ctl_get_lba_len(union ctl_io *io, uint64_t *lba, uint64_t *len);
 static ctl_action ctl_extent_check(union ctl_io *io1, union ctl_io *io2,
     bool seq);
 static ctl_action ctl_extent_check_seq(union ctl_io *io1, union ctl_io *io2);
 static ctl_action ctl_check_for_blockage(struct ctl_lun *lun,
     union ctl_io *pending_io, union ctl_io *ooa_io);
 static ctl_action ctl_check_ooa(struct ctl_lun *lun, union ctl_io *pending_io,
 				union ctl_io *starting_io);
 static int ctl_check_blocked(struct ctl_lun *lun);
 static int ctl_scsiio_lun_check(struct ctl_lun *lun,
 				const struct ctl_cmd_entry *entry,
 				struct ctl_scsiio *ctsio);
 static void ctl_failover_lun(union ctl_io *io);
 static int ctl_scsiio_precheck(struct ctl_softc *ctl_softc,
 			       struct ctl_scsiio *ctsio);
 static int ctl_scsiio(struct ctl_scsiio *ctsio);
 
 static int ctl_target_reset(union ctl_io *io);
 static void ctl_do_lun_reset(struct ctl_lun *lun, uint32_t initidx,
 			 ctl_ua_type ua_type);
 static int ctl_lun_reset(union ctl_io *io);
 static int ctl_abort_task(union ctl_io *io);
 static int ctl_abort_task_set(union ctl_io *io);
 static int ctl_query_task(union ctl_io *io, int task_set);
 static void ctl_i_t_nexus_loss(struct ctl_softc *softc, uint32_t initidx,
 			      ctl_ua_type ua_type);
 static int ctl_i_t_nexus_reset(union ctl_io *io);
 static int ctl_query_async_event(union ctl_io *io);
 static void ctl_run_task(union ctl_io *io);
 #ifdef CTL_IO_DELAY
 static void ctl_datamove_timer_wakeup(void *arg);
 static void ctl_done_timer_wakeup(void *arg);
 #endif /* CTL_IO_DELAY */
 
 static void ctl_send_datamove_done(union ctl_io *io, int have_lock);
 static void ctl_datamove_remote_write_cb(struct ctl_ha_dt_req *rq);
 static int ctl_datamove_remote_dm_write_cb(union ctl_io *io);
 static void ctl_datamove_remote_write(union ctl_io *io);
 static int ctl_datamove_remote_dm_read_cb(union ctl_io *io);
 static void ctl_datamove_remote_read_cb(struct ctl_ha_dt_req *rq);
 static int ctl_datamove_remote_sgl_setup(union ctl_io *io);
 static int ctl_datamove_remote_xfer(union ctl_io *io, unsigned command,
 				    ctl_ha_dt_cb callback);
 static void ctl_datamove_remote_read(union ctl_io *io);
 static void ctl_datamove_remote(union ctl_io *io);
 static void ctl_process_done(union ctl_io *io);
 static void ctl_lun_thread(void *arg);
 static void ctl_thresh_thread(void *arg);
 static void ctl_work_thread(void *arg);
 static void ctl_enqueue_incoming(union ctl_io *io);
 static void ctl_enqueue_rtr(union ctl_io *io);
 static void ctl_enqueue_done(union ctl_io *io);
 static void ctl_enqueue_isc(union ctl_io *io);
 static const struct ctl_cmd_entry *
     ctl_get_cmd_entry(struct ctl_scsiio *ctsio, int *sa);
 static const struct ctl_cmd_entry *
     ctl_validate_command(struct ctl_scsiio *ctsio);
 static int ctl_cmd_applicable(uint8_t lun_type,
     const struct ctl_cmd_entry *entry);
 static int ctl_ha_init(void);
 static int ctl_ha_shutdown(void);
 
 static uint64_t ctl_get_prkey(struct ctl_lun *lun, uint32_t residx);
 static void ctl_clr_prkey(struct ctl_lun *lun, uint32_t residx);
 static void ctl_alloc_prkey(struct ctl_lun *lun, uint32_t residx);
 static void ctl_set_prkey(struct ctl_lun *lun, uint32_t residx, uint64_t key);
 
 /*
  * Load the serialization table.  This isn't very pretty, but is probably
  * the easiest way to do it.
  */
 #include "ctl_ser_table.c"
 
 /*
  * We only need to define open, close and ioctl routines for this driver.
  */
 static struct cdevsw ctl_cdevsw = {
 	.d_version =	D_VERSION,
 	.d_flags =	0,
 	.d_open =	ctl_open,
 	.d_close =	ctl_close,
 	.d_ioctl =	ctl_ioctl,
 	.d_name =	"ctl",
 };
 
 
 MALLOC_DEFINE(M_CTL, "ctlmem", "Memory used for CTL");
 
 static int ctl_module_event_handler(module_t, int /*modeventtype_t*/, void *);
 
 static moduledata_t ctl_moduledata = {
 	"ctl",
 	ctl_module_event_handler,
 	NULL
 };
 
 DECLARE_MODULE(ctl, ctl_moduledata, SI_SUB_CONFIGURE, SI_ORDER_THIRD);
 MODULE_VERSION(ctl, 1);
 
 static struct ctl_frontend ha_frontend =
 {
 	.name = "ha",
 	.init = ctl_ha_init,
 	.shutdown = ctl_ha_shutdown,
 };
 
 static int
 ctl_ha_init(void)
 {
 	struct ctl_softc *softc = control_softc;
 
 	if (ctl_pool_create(softc, "othersc", CTL_POOL_ENTRIES_OTHER_SC,
 	                    &softc->othersc_pool) != 0)
 		return (ENOMEM);
 	if (ctl_ha_msg_init(softc) != CTL_HA_STATUS_SUCCESS) {
 		ctl_pool_free(softc->othersc_pool);
 		return (EIO);
 	}
 	if (ctl_ha_msg_register(CTL_HA_CHAN_CTL, ctl_isc_event_handler)
 	    != CTL_HA_STATUS_SUCCESS) {
 		ctl_ha_msg_destroy(softc);
 		ctl_pool_free(softc->othersc_pool);
 		return (EIO);
 	}
 	return (0);
 };
 
 static int
 ctl_ha_shutdown(void)
 {
 	struct ctl_softc *softc = control_softc;
 	struct ctl_port *port;
 
 	ctl_ha_msg_shutdown(softc);
 	if (ctl_ha_msg_deregister(CTL_HA_CHAN_CTL) != CTL_HA_STATUS_SUCCESS)
 		return (EIO);
 	if (ctl_ha_msg_destroy(softc) != CTL_HA_STATUS_SUCCESS)
 		return (EIO);
 	ctl_pool_free(softc->othersc_pool);
 	while ((port = STAILQ_FIRST(&ha_frontend.port_list)) != NULL) {
 		ctl_port_deregister(port);
 		free(port->port_name, M_CTL);
 		free(port, M_CTL);
 	}
 	return (0);
 };
 
 static void
 ctl_ha_datamove(union ctl_io *io)
 {
 	struct ctl_lun *lun = CTL_LUN(io);
 	struct ctl_sg_entry *sgl;
 	union ctl_ha_msg msg;
 	uint32_t sg_entries_sent;
 	int do_sg_copy, i, j;
 
 	memset(&msg.dt, 0, sizeof(msg.dt));
 	msg.hdr.msg_type = CTL_MSG_DATAMOVE;
 	msg.hdr.original_sc = io->io_hdr.original_sc;
 	msg.hdr.serializing_sc = io;
 	msg.hdr.nexus = io->io_hdr.nexus;
 	msg.hdr.status = io->io_hdr.status;
 	msg.dt.flags = io->io_hdr.flags;
 
 	/*
 	 * We convert everything into a S/G list here.  We can't
 	 * pass by reference, only by value between controllers.
 	 * So we can't pass a pointer to the S/G list, only as many
 	 * S/G entries as we can fit in here.  If it's possible for
 	 * us to get more than CTL_HA_MAX_SG_ENTRIES S/G entries,
 	 * then we need to break this up into multiple transfers.
 	 */
 	if (io->scsiio.kern_sg_entries == 0) {
 		msg.dt.kern_sg_entries = 1;
 #if 0
 		if (io->io_hdr.flags & CTL_FLAG_BUS_ADDR) {
 			msg.dt.sg_list[0].addr = io->scsiio.kern_data_ptr;
 		} else {
 			/* XXX KDM use busdma here! */
 			msg.dt.sg_list[0].addr =
 			    (void *)vtophys(io->scsiio.kern_data_ptr);
 		}
 #else
 		KASSERT((io->io_hdr.flags & CTL_FLAG_BUS_ADDR) == 0,
 		    ("HA does not support BUS_ADDR"));
 		msg.dt.sg_list[0].addr = io->scsiio.kern_data_ptr;
 #endif
 		msg.dt.sg_list[0].len = io->scsiio.kern_data_len;
 		do_sg_copy = 0;
 	} else {
 		msg.dt.kern_sg_entries = io->scsiio.kern_sg_entries;
 		do_sg_copy = 1;
 	}
 
 	msg.dt.kern_data_len = io->scsiio.kern_data_len;
 	msg.dt.kern_total_len = io->scsiio.kern_total_len;
 	msg.dt.kern_data_resid = io->scsiio.kern_data_resid;
 	msg.dt.kern_rel_offset = io->scsiio.kern_rel_offset;
 	msg.dt.sg_sequence = 0;
 
 	/*
 	 * Loop until we've sent all of the S/G entries.  On the
 	 * other end, we'll recompose these S/G entries into one
 	 * contiguous list before processing.
 	 */
 	for (sg_entries_sent = 0; sg_entries_sent < msg.dt.kern_sg_entries;
 	    msg.dt.sg_sequence++) {
 		msg.dt.cur_sg_entries = MIN((sizeof(msg.dt.sg_list) /
 		    sizeof(msg.dt.sg_list[0])),
 		    msg.dt.kern_sg_entries - sg_entries_sent);
 		if (do_sg_copy != 0) {
 			sgl = (struct ctl_sg_entry *)io->scsiio.kern_data_ptr;
 			for (i = sg_entries_sent, j = 0;
 			     i < msg.dt.cur_sg_entries; i++, j++) {
 #if 0
 				if (io->io_hdr.flags & CTL_FLAG_BUS_ADDR) {
 					msg.dt.sg_list[j].addr = sgl[i].addr;
 				} else {
 					/* XXX KDM use busdma here! */
 					msg.dt.sg_list[j].addr =
 					    (void *)vtophys(sgl[i].addr);
 				}
 #else
 				KASSERT((io->io_hdr.flags &
 				    CTL_FLAG_BUS_ADDR) == 0,
 				    ("HA does not support BUS_ADDR"));
 				msg.dt.sg_list[j].addr = sgl[i].addr;
 #endif
 				msg.dt.sg_list[j].len = sgl[i].len;
 			}
 		}
 
 		sg_entries_sent += msg.dt.cur_sg_entries;
 		msg.dt.sg_last = (sg_entries_sent >= msg.dt.kern_sg_entries);
 		if (ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg,
 		    sizeof(msg.dt) - sizeof(msg.dt.sg_list) +
 		    sizeof(struct ctl_sg_entry) * msg.dt.cur_sg_entries,
 		    M_WAITOK) > CTL_HA_STATUS_SUCCESS) {
 			io->io_hdr.port_status = 31341;
 			io->scsiio.be_move_done(io);
 			return;
 		}
 		msg.dt.sent_sg_entries = sg_entries_sent;
 	}
 
 	/*
 	 * Officially handover the request from us to peer.
 	 * If failover has just happened, then we must return error.
 	 * If failover happen just after, then it is not our problem.
 	 */
 	if (lun)
 		mtx_lock(&lun->lun_lock);
 	if (io->io_hdr.flags & CTL_FLAG_FAILOVER) {
 		if (lun)
 			mtx_unlock(&lun->lun_lock);
 		io->io_hdr.port_status = 31342;
 		io->scsiio.be_move_done(io);
 		return;
 	}
 	io->io_hdr.flags &= ~CTL_FLAG_IO_ACTIVE;
 	io->io_hdr.flags |= CTL_FLAG_DMA_INPROG;
 	if (lun)
 		mtx_unlock(&lun->lun_lock);
 }
 
 static void
 ctl_ha_done(union ctl_io *io)
 {
 	union ctl_ha_msg msg;
 
 	if (io->io_hdr.io_type == CTL_IO_SCSI) {
 		memset(&msg, 0, sizeof(msg));
 		msg.hdr.msg_type = CTL_MSG_FINISH_IO;
 		msg.hdr.original_sc = io->io_hdr.original_sc;
 		msg.hdr.nexus = io->io_hdr.nexus;
 		msg.hdr.status = io->io_hdr.status;
 		msg.scsi.scsi_status = io->scsiio.scsi_status;
 		msg.scsi.tag_num = io->scsiio.tag_num;
 		msg.scsi.tag_type = io->scsiio.tag_type;
 		msg.scsi.sense_len = io->scsiio.sense_len;
 		memcpy(&msg.scsi.sense_data, &io->scsiio.sense_data,
 		    io->scsiio.sense_len);
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg,
 		    sizeof(msg.scsi) - sizeof(msg.scsi.sense_data) +
 		    msg.scsi.sense_len, M_WAITOK);
 	}
 	ctl_free_io(io);
 }
 
 static void
 ctl_isc_handler_finish_xfer(struct ctl_softc *ctl_softc,
 			    union ctl_ha_msg *msg_info)
 {
 	struct ctl_scsiio *ctsio;
 
 	if (msg_info->hdr.original_sc == NULL) {
 		printf("%s: original_sc == NULL!\n", __func__);
 		/* XXX KDM now what? */
 		return;
 	}
 
 	ctsio = &msg_info->hdr.original_sc->scsiio;
 	ctsio->io_hdr.flags |= CTL_FLAG_IO_ACTIVE;
 	ctsio->io_hdr.msg_type = CTL_MSG_FINISH_IO;
 	ctsio->io_hdr.status = msg_info->hdr.status;
 	ctsio->scsi_status = msg_info->scsi.scsi_status;
 	ctsio->sense_len = msg_info->scsi.sense_len;
 	memcpy(&ctsio->sense_data, &msg_info->scsi.sense_data,
 	       msg_info->scsi.sense_len);
 	ctl_enqueue_isc((union ctl_io *)ctsio);
 }
 
 static void
 ctl_isc_handler_finish_ser_only(struct ctl_softc *ctl_softc,
 				union ctl_ha_msg *msg_info)
 {
 	struct ctl_scsiio *ctsio;
 
 	if (msg_info->hdr.serializing_sc == NULL) {
 		printf("%s: serializing_sc == NULL!\n", __func__);
 		/* XXX KDM now what? */
 		return;
 	}
 
 	ctsio = &msg_info->hdr.serializing_sc->scsiio;
 	ctsio->io_hdr.msg_type = CTL_MSG_FINISH_IO;
 	ctl_enqueue_isc((union ctl_io *)ctsio);
 }
 
 void
 ctl_isc_announce_lun(struct ctl_lun *lun)
 {
 	struct ctl_softc *softc = lun->ctl_softc;
 	union ctl_ha_msg *msg;
 	struct ctl_ha_msg_lun_pr_key pr_key;
 	int i, k;
 
 	if (softc->ha_link != CTL_HA_LINK_ONLINE)
 		return;
 	mtx_lock(&lun->lun_lock);
 	i = sizeof(msg->lun);
 	if (lun->lun_devid)
 		i += lun->lun_devid->len;
 	i += sizeof(pr_key) * lun->pr_key_count;
 alloc:
 	mtx_unlock(&lun->lun_lock);
 	msg = malloc(i, M_CTL, M_WAITOK);
 	mtx_lock(&lun->lun_lock);
 	k = sizeof(msg->lun);
 	if (lun->lun_devid)
 		k += lun->lun_devid->len;
 	k += sizeof(pr_key) * lun->pr_key_count;
 	if (i < k) {
 		free(msg, M_CTL);
 		i = k;
 		goto alloc;
 	}
 	bzero(&msg->lun, sizeof(msg->lun));
 	msg->hdr.msg_type = CTL_MSG_LUN_SYNC;
 	msg->hdr.nexus.targ_lun = lun->lun;
 	msg->hdr.nexus.targ_mapped_lun = lun->lun;
 	msg->lun.flags = lun->flags;
 	msg->lun.pr_generation = lun->pr_generation;
 	msg->lun.pr_res_idx = lun->pr_res_idx;
 	msg->lun.pr_res_type = lun->pr_res_type;
 	msg->lun.pr_key_count = lun->pr_key_count;
 	i = 0;
 	if (lun->lun_devid) {
 		msg->lun.lun_devid_len = lun->lun_devid->len;
 		memcpy(&msg->lun.data[i], lun->lun_devid->data,
 		    msg->lun.lun_devid_len);
 		i += msg->lun.lun_devid_len;
 	}
 	for (k = 0; k < CTL_MAX_INITIATORS; k++) {
 		if ((pr_key.pr_key = ctl_get_prkey(lun, k)) == 0)
 			continue;
 		pr_key.pr_iid = k;
 		memcpy(&msg->lun.data[i], &pr_key, sizeof(pr_key));
 		i += sizeof(pr_key);
 	}
 	mtx_unlock(&lun->lun_lock);
 	ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg->port, sizeof(msg->port) + i,
 	    M_WAITOK);
 	free(msg, M_CTL);
 
 	if (lun->flags & CTL_LUN_PRIMARY_SC) {
 		for (i = 0; i < CTL_NUM_MODE_PAGES; i++) {
 			ctl_isc_announce_mode(lun, -1,
 			    lun->mode_pages.index[i].page_code & SMPH_PC_MASK,
 			    lun->mode_pages.index[i].subpage);
 		}
 	}
 }
 
 void
 ctl_isc_announce_port(struct ctl_port *port)
 {
 	struct ctl_softc *softc = port->ctl_softc;
 	union ctl_ha_msg *msg;
 	int i;
 
 	if (port->targ_port < softc->port_min ||
 	    port->targ_port >= softc->port_max ||
 	    softc->ha_link != CTL_HA_LINK_ONLINE)
 		return;
 	i = sizeof(msg->port) + strlen(port->port_name) + 1;
 	if (port->lun_map)
 		i += port->lun_map_size * sizeof(uint32_t);
 	if (port->port_devid)
 		i += port->port_devid->len;
 	if (port->target_devid)
 		i += port->target_devid->len;
 	if (port->init_devid)
 		i += port->init_devid->len;
 	msg = malloc(i, M_CTL, M_WAITOK);
 	bzero(&msg->port, sizeof(msg->port));
 	msg->hdr.msg_type = CTL_MSG_PORT_SYNC;
 	msg->hdr.nexus.targ_port = port->targ_port;
 	msg->port.port_type = port->port_type;
 	msg->port.physical_port = port->physical_port;
 	msg->port.virtual_port = port->virtual_port;
 	msg->port.status = port->status;
 	i = 0;
 	msg->port.name_len = sprintf(&msg->port.data[i],
 	    "%d:%s", softc->ha_id, port->port_name) + 1;
 	i += msg->port.name_len;
 	if (port->lun_map) {
 		msg->port.lun_map_len = port->lun_map_size * sizeof(uint32_t);
 		memcpy(&msg->port.data[i], port->lun_map,
 		    msg->port.lun_map_len);
 		i += msg->port.lun_map_len;
 	}
 	if (port->port_devid) {
 		msg->port.port_devid_len = port->port_devid->len;
 		memcpy(&msg->port.data[i], port->port_devid->data,
 		    msg->port.port_devid_len);
 		i += msg->port.port_devid_len;
 	}
 	if (port->target_devid) {
 		msg->port.target_devid_len = port->target_devid->len;
 		memcpy(&msg->port.data[i], port->target_devid->data,
 		    msg->port.target_devid_len);
 		i += msg->port.target_devid_len;
 	}
 	if (port->init_devid) {
 		msg->port.init_devid_len = port->init_devid->len;
 		memcpy(&msg->port.data[i], port->init_devid->data,
 		    msg->port.init_devid_len);
 		i += msg->port.init_devid_len;
 	}
 	ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg->port, sizeof(msg->port) + i,
 	    M_WAITOK);
 	free(msg, M_CTL);
 }
 
 void
 ctl_isc_announce_iid(struct ctl_port *port, int iid)
 {
 	struct ctl_softc *softc = port->ctl_softc;
 	union ctl_ha_msg *msg;
 	int i, l;
 
 	if (port->targ_port < softc->port_min ||
 	    port->targ_port >= softc->port_max ||
 	    softc->ha_link != CTL_HA_LINK_ONLINE)
 		return;
 	mtx_lock(&softc->ctl_lock);
 	i = sizeof(msg->iid);
 	l = 0;
 	if (port->wwpn_iid[iid].name)
 		l = strlen(port->wwpn_iid[iid].name) + 1;
 	i += l;
 	msg = malloc(i, M_CTL, M_NOWAIT);
 	if (msg == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		return;
 	}
 	bzero(&msg->iid, sizeof(msg->iid));
 	msg->hdr.msg_type = CTL_MSG_IID_SYNC;
 	msg->hdr.nexus.targ_port = port->targ_port;
 	msg->hdr.nexus.initid = iid;
 	msg->iid.in_use = port->wwpn_iid[iid].in_use;
 	msg->iid.name_len = l;
 	msg->iid.wwpn = port->wwpn_iid[iid].wwpn;
 	if (port->wwpn_iid[iid].name)
 		strlcpy(msg->iid.data, port->wwpn_iid[iid].name, l);
 	mtx_unlock(&softc->ctl_lock);
 	ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg->iid, i, M_NOWAIT);
 	free(msg, M_CTL);
 }
 
 void
 ctl_isc_announce_mode(struct ctl_lun *lun, uint32_t initidx,
     uint8_t page, uint8_t subpage)
 {
 	struct ctl_softc *softc = lun->ctl_softc;
 	union ctl_ha_msg msg;
 	u_int i;
 
 	if (softc->ha_link != CTL_HA_LINK_ONLINE)
 		return;
 	for (i = 0; i < CTL_NUM_MODE_PAGES; i++) {
 		if ((lun->mode_pages.index[i].page_code & SMPH_PC_MASK) ==
 		    page && lun->mode_pages.index[i].subpage == subpage)
 			break;
 	}
 	if (i == CTL_NUM_MODE_PAGES)
 		return;
 
 	/* Don't try to replicate pages not present on this device. */
 	if (lun->mode_pages.index[i].page_data == NULL)
 		return;
 
 	bzero(&msg.mode, sizeof(msg.mode));
 	msg.hdr.msg_type = CTL_MSG_MODE_SYNC;
 	msg.hdr.nexus.targ_port = initidx / CTL_MAX_INIT_PER_PORT;
 	msg.hdr.nexus.initid = initidx % CTL_MAX_INIT_PER_PORT;
 	msg.hdr.nexus.targ_lun = lun->lun;
 	msg.hdr.nexus.targ_mapped_lun = lun->lun;
 	msg.mode.page_code = page;
 	msg.mode.subpage = subpage;
 	msg.mode.page_len = lun->mode_pages.index[i].page_len;
 	memcpy(msg.mode.data, lun->mode_pages.index[i].page_data,
 	    msg.mode.page_len);
 	ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg.mode, sizeof(msg.mode),
 	    M_WAITOK);
 }
 
 static void
 ctl_isc_ha_link_up(struct ctl_softc *softc)
 {
 	struct ctl_port *port;
 	struct ctl_lun *lun;
 	union ctl_ha_msg msg;
 	int i;
 
 	/* Announce this node parameters to peer for validation. */
 	msg.login.msg_type = CTL_MSG_LOGIN;
 	msg.login.version = CTL_HA_VERSION;
 	msg.login.ha_mode = softc->ha_mode;
 	msg.login.ha_id = softc->ha_id;
 	msg.login.max_luns = CTL_MAX_LUNS;
 	msg.login.max_ports = CTL_MAX_PORTS;
 	msg.login.max_init_per_port = CTL_MAX_INIT_PER_PORT;
 	ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg.login, sizeof(msg.login),
 	    M_WAITOK);
 
 	STAILQ_FOREACH(port, &softc->port_list, links) {
 		ctl_isc_announce_port(port);
 		for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) {
 			if (port->wwpn_iid[i].in_use)
 				ctl_isc_announce_iid(port, i);
 		}
 	}
 	STAILQ_FOREACH(lun, &softc->lun_list, links)
 		ctl_isc_announce_lun(lun);
 }
 
 static void
 ctl_isc_ha_link_down(struct ctl_softc *softc)
 {
 	struct ctl_port *port;
 	struct ctl_lun *lun;
 	union ctl_io *io;
 	int i;
 
 	mtx_lock(&softc->ctl_lock);
 	STAILQ_FOREACH(lun, &softc->lun_list, links) {
 		mtx_lock(&lun->lun_lock);
 		if (lun->flags & CTL_LUN_PEER_SC_PRIMARY) {
 			lun->flags &= ~CTL_LUN_PEER_SC_PRIMARY;
 			ctl_est_ua_all(lun, -1, CTL_UA_ASYM_ACC_CHANGE);
 		}
 		mtx_unlock(&lun->lun_lock);
 
 		mtx_unlock(&softc->ctl_lock);
 		io = ctl_alloc_io(softc->othersc_pool);
 		mtx_lock(&softc->ctl_lock);
 		ctl_zero_io(io);
 		io->io_hdr.msg_type = CTL_MSG_FAILOVER;
 		io->io_hdr.nexus.targ_mapped_lun = lun->lun;
 		ctl_enqueue_isc(io);
 	}
 
 	STAILQ_FOREACH(port, &softc->port_list, links) {
 		if (port->targ_port >= softc->port_min &&
 		    port->targ_port < softc->port_max)
 			continue;
 		port->status &= ~CTL_PORT_STATUS_ONLINE;
 		for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) {
 			port->wwpn_iid[i].in_use = 0;
 			free(port->wwpn_iid[i].name, M_CTL);
 			port->wwpn_iid[i].name = NULL;
 		}
 	}
 	mtx_unlock(&softc->ctl_lock);
 }
 
 static void
 ctl_isc_ua(struct ctl_softc *softc, union ctl_ha_msg *msg, int len)
 {
 	struct ctl_lun *lun;
 	uint32_t iid = ctl_get_initindex(&msg->hdr.nexus);
 
 	mtx_lock(&softc->ctl_lock);
 	if (msg->hdr.nexus.targ_mapped_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[msg->hdr.nexus.targ_mapped_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		return;
 	}
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 	if (msg->ua.ua_type == CTL_UA_THIN_PROV_THRES && msg->ua.ua_set)
 		memcpy(lun->ua_tpt_info, msg->ua.ua_info, 8);
 	if (msg->ua.ua_all) {
 		if (msg->ua.ua_set)
 			ctl_est_ua_all(lun, iid, msg->ua.ua_type);
 		else
 			ctl_clr_ua_all(lun, iid, msg->ua.ua_type);
 	} else {
 		if (msg->ua.ua_set)
 			ctl_est_ua(lun, iid, msg->ua.ua_type);
 		else
 			ctl_clr_ua(lun, iid, msg->ua.ua_type);
 	}
 	mtx_unlock(&lun->lun_lock);
 }
 
 static void
 ctl_isc_lun_sync(struct ctl_softc *softc, union ctl_ha_msg *msg, int len)
 {
 	struct ctl_lun *lun;
 	struct ctl_ha_msg_lun_pr_key pr_key;
 	int i, k;
 	ctl_lun_flags oflags;
 	uint32_t targ_lun;
 
 	targ_lun = msg->hdr.nexus.targ_mapped_lun;
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		return;
 	}
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 	if (lun->flags & CTL_LUN_DISABLED) {
 		mtx_unlock(&lun->lun_lock);
 		return;
 	}
 	i = (lun->lun_devid != NULL) ? lun->lun_devid->len : 0;
 	if (msg->lun.lun_devid_len != i || (i > 0 &&
 	    memcmp(&msg->lun.data[0], lun->lun_devid->data, i) != 0)) {
 		mtx_unlock(&lun->lun_lock);
 		printf("%s: Received conflicting HA LUN %d\n",
 		    __func__, targ_lun);
 		return;
 	} else {
 		/* Record whether peer is primary. */
 		oflags = lun->flags;
 		if ((msg->lun.flags & CTL_LUN_PRIMARY_SC) &&
 		    (msg->lun.flags & CTL_LUN_DISABLED) == 0)
 			lun->flags |= CTL_LUN_PEER_SC_PRIMARY;
 		else
 			lun->flags &= ~CTL_LUN_PEER_SC_PRIMARY;
 		if (oflags != lun->flags)
 			ctl_est_ua_all(lun, -1, CTL_UA_ASYM_ACC_CHANGE);
 
 		/* If peer is primary and we are not -- use data */
 		if ((lun->flags & CTL_LUN_PRIMARY_SC) == 0 &&
 		    (lun->flags & CTL_LUN_PEER_SC_PRIMARY)) {
 			lun->pr_generation = msg->lun.pr_generation;
 			lun->pr_res_idx = msg->lun.pr_res_idx;
 			lun->pr_res_type = msg->lun.pr_res_type;
 			lun->pr_key_count = msg->lun.pr_key_count;
 			for (k = 0; k < CTL_MAX_INITIATORS; k++)
 				ctl_clr_prkey(lun, k);
 			for (k = 0; k < msg->lun.pr_key_count; k++) {
 				memcpy(&pr_key, &msg->lun.data[i],
 				    sizeof(pr_key));
 				ctl_alloc_prkey(lun, pr_key.pr_iid);
 				ctl_set_prkey(lun, pr_key.pr_iid,
 				    pr_key.pr_key);
 				i += sizeof(pr_key);
 			}
 		}
 
 		mtx_unlock(&lun->lun_lock);
 		CTL_DEBUG_PRINT(("%s: Known LUN %d, peer is %s\n",
 		    __func__, targ_lun,
 		    (msg->lun.flags & CTL_LUN_PRIMARY_SC) ?
 		    "primary" : "secondary"));
 
 		/* If we are primary but peer doesn't know -- notify */
 		if ((lun->flags & CTL_LUN_PRIMARY_SC) &&
 		    (msg->lun.flags & CTL_LUN_PEER_SC_PRIMARY) == 0)
 			ctl_isc_announce_lun(lun);
 	}
 }
 
 static void
 ctl_isc_port_sync(struct ctl_softc *softc, union ctl_ha_msg *msg, int len)
 {
 	struct ctl_port *port;
 	struct ctl_lun *lun;
 	int i, new;
 
 	port = softc->ctl_ports[msg->hdr.nexus.targ_port];
 	if (port == NULL) {
 		CTL_DEBUG_PRINT(("%s: New port %d\n", __func__,
 		    msg->hdr.nexus.targ_port));
 		new = 1;
 		port = malloc(sizeof(*port), M_CTL, M_WAITOK | M_ZERO);
 		port->frontend = &ha_frontend;
 		port->targ_port = msg->hdr.nexus.targ_port;
 		port->fe_datamove = ctl_ha_datamove;
 		port->fe_done = ctl_ha_done;
 	} else if (port->frontend == &ha_frontend) {
 		CTL_DEBUG_PRINT(("%s: Updated port %d\n", __func__,
 		    msg->hdr.nexus.targ_port));
 		new = 0;
 	} else {
 		printf("%s: Received conflicting HA port %d\n",
 		    __func__, msg->hdr.nexus.targ_port);
 		return;
 	}
 	port->port_type = msg->port.port_type;
 	port->physical_port = msg->port.physical_port;
 	port->virtual_port = msg->port.virtual_port;
 	port->status = msg->port.status;
 	i = 0;
 	free(port->port_name, M_CTL);
 	port->port_name = strndup(&msg->port.data[i], msg->port.name_len,
 	    M_CTL);
 	i += msg->port.name_len;
 	if (msg->port.lun_map_len != 0) {
 		if (port->lun_map == NULL ||
 		    port->lun_map_size * sizeof(uint32_t) <
 		    msg->port.lun_map_len) {
 			port->lun_map_size = 0;
 			free(port->lun_map, M_CTL);
 			port->lun_map = malloc(msg->port.lun_map_len,
 			    M_CTL, M_WAITOK);
 		}
 		memcpy(port->lun_map, &msg->port.data[i], msg->port.lun_map_len);
 		port->lun_map_size = msg->port.lun_map_len / sizeof(uint32_t);
 		i += msg->port.lun_map_len;
 	} else {
 		port->lun_map_size = 0;
 		free(port->lun_map, M_CTL);
 		port->lun_map = NULL;
 	}
 	if (msg->port.port_devid_len != 0) {
 		if (port->port_devid == NULL ||
 		    port->port_devid->len < msg->port.port_devid_len) {
 			free(port->port_devid, M_CTL);
 			port->port_devid = malloc(sizeof(struct ctl_devid) +
 			    msg->port.port_devid_len, M_CTL, M_WAITOK);
 		}
 		memcpy(port->port_devid->data, &msg->port.data[i],
 		    msg->port.port_devid_len);
 		port->port_devid->len = msg->port.port_devid_len;
 		i += msg->port.port_devid_len;
 	} else {
 		free(port->port_devid, M_CTL);
 		port->port_devid = NULL;
 	}
 	if (msg->port.target_devid_len != 0) {
 		if (port->target_devid == NULL ||
 		    port->target_devid->len < msg->port.target_devid_len) {
 			free(port->target_devid, M_CTL);
 			port->target_devid = malloc(sizeof(struct ctl_devid) +
 			    msg->port.target_devid_len, M_CTL, M_WAITOK);
 		}
 		memcpy(port->target_devid->data, &msg->port.data[i],
 		    msg->port.target_devid_len);
 		port->target_devid->len = msg->port.target_devid_len;
 		i += msg->port.target_devid_len;
 	} else {
 		free(port->target_devid, M_CTL);
 		port->target_devid = NULL;
 	}
 	if (msg->port.init_devid_len != 0) {
 		if (port->init_devid == NULL ||
 		    port->init_devid->len < msg->port.init_devid_len) {
 			free(port->init_devid, M_CTL);
 			port->init_devid = malloc(sizeof(struct ctl_devid) +
 			    msg->port.init_devid_len, M_CTL, M_WAITOK);
 		}
 		memcpy(port->init_devid->data, &msg->port.data[i],
 		    msg->port.init_devid_len);
 		port->init_devid->len = msg->port.init_devid_len;
 		i += msg->port.init_devid_len;
 	} else {
 		free(port->init_devid, M_CTL);
 		port->init_devid = NULL;
 	}
 	if (new) {
 		if (ctl_port_register(port) != 0) {
 			printf("%s: ctl_port_register() failed with error\n",
 			    __func__);
 		}
 	}
 	mtx_lock(&softc->ctl_lock);
 	STAILQ_FOREACH(lun, &softc->lun_list, links) {
 		if (ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX)
 			continue;
 		mtx_lock(&lun->lun_lock);
 		ctl_est_ua_all(lun, -1, CTL_UA_INQ_CHANGE);
 		mtx_unlock(&lun->lun_lock);
 	}
 	mtx_unlock(&softc->ctl_lock);
 }
 
 static void
 ctl_isc_iid_sync(struct ctl_softc *softc, union ctl_ha_msg *msg, int len)
 {
 	struct ctl_port *port;
 	int iid;
 
 	port = softc->ctl_ports[msg->hdr.nexus.targ_port];
 	if (port == NULL) {
 		printf("%s: Received IID for unknown port %d\n",
 		    __func__, msg->hdr.nexus.targ_port);
 		return;
 	}
 	iid = msg->hdr.nexus.initid;
 	if (port->wwpn_iid[iid].in_use != 0 &&
 	    msg->iid.in_use == 0)
 		ctl_i_t_nexus_loss(softc, iid, CTL_UA_POWERON);
 	port->wwpn_iid[iid].in_use = msg->iid.in_use;
 	port->wwpn_iid[iid].wwpn = msg->iid.wwpn;
 	free(port->wwpn_iid[iid].name, M_CTL);
 	if (msg->iid.name_len) {
 		port->wwpn_iid[iid].name = strndup(&msg->iid.data[0],
 		    msg->iid.name_len, M_CTL);
 	} else
 		port->wwpn_iid[iid].name = NULL;
 }
 
 static void
 ctl_isc_login(struct ctl_softc *softc, union ctl_ha_msg *msg, int len)
 {
 
 	if (msg->login.version != CTL_HA_VERSION) {
 		printf("CTL HA peers have different versions %d != %d\n",
 		    msg->login.version, CTL_HA_VERSION);
 		ctl_ha_msg_abort(CTL_HA_CHAN_CTL);
 		return;
 	}
 	if (msg->login.ha_mode != softc->ha_mode) {
 		printf("CTL HA peers have different ha_mode %d != %d\n",
 		    msg->login.ha_mode, softc->ha_mode);
 		ctl_ha_msg_abort(CTL_HA_CHAN_CTL);
 		return;
 	}
 	if (msg->login.ha_id == softc->ha_id) {
 		printf("CTL HA peers have same ha_id %d\n", msg->login.ha_id);
 		ctl_ha_msg_abort(CTL_HA_CHAN_CTL);
 		return;
 	}
 	if (msg->login.max_luns != CTL_MAX_LUNS ||
 	    msg->login.max_ports != CTL_MAX_PORTS ||
 	    msg->login.max_init_per_port != CTL_MAX_INIT_PER_PORT) {
 		printf("CTL HA peers have different limits\n");
 		ctl_ha_msg_abort(CTL_HA_CHAN_CTL);
 		return;
 	}
 }
 
 static void
 ctl_isc_mode_sync(struct ctl_softc *softc, union ctl_ha_msg *msg, int len)
 {
 	struct ctl_lun *lun;
 	u_int i;
 	uint32_t initidx, targ_lun;
 
 	targ_lun = msg->hdr.nexus.targ_mapped_lun;
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		return;
 	}
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 	if (lun->flags & CTL_LUN_DISABLED) {
 		mtx_unlock(&lun->lun_lock);
 		return;
 	}
 	for (i = 0; i < CTL_NUM_MODE_PAGES; i++) {
 		if ((lun->mode_pages.index[i].page_code & SMPH_PC_MASK) ==
 		    msg->mode.page_code &&
 		    lun->mode_pages.index[i].subpage == msg->mode.subpage)
 			break;
 	}
 	if (i == CTL_NUM_MODE_PAGES) {
 		mtx_unlock(&lun->lun_lock);
 		return;
 	}
 	memcpy(lun->mode_pages.index[i].page_data, msg->mode.data,
 	    lun->mode_pages.index[i].page_len);
 	initidx = ctl_get_initindex(&msg->hdr.nexus);
 	if (initidx != -1)
 		ctl_est_ua_all(lun, initidx, CTL_UA_MODE_CHANGE);
 	mtx_unlock(&lun->lun_lock);
 }
 
 /*
  * ISC (Inter Shelf Communication) event handler.  Events from the HA
  * subsystem come in here.
  */
 static void
 ctl_isc_event_handler(ctl_ha_channel channel, ctl_ha_event event, int param)
 {
 	struct ctl_softc *softc = control_softc;
 	union ctl_io *io;
 	struct ctl_prio *presio;
 	ctl_ha_status isc_status;
 
 	CTL_DEBUG_PRINT(("CTL: Isc Msg event %d\n", event));
 	if (event == CTL_HA_EVT_MSG_RECV) {
 		union ctl_ha_msg *msg, msgbuf;
 
 		if (param > sizeof(msgbuf))
 			msg = malloc(param, M_CTL, M_WAITOK);
 		else
 			msg = &msgbuf;
 		isc_status = ctl_ha_msg_recv(CTL_HA_CHAN_CTL, msg, param,
 		    M_WAITOK);
 		if (isc_status != CTL_HA_STATUS_SUCCESS) {
 			printf("%s: Error receiving message: %d\n",
 			    __func__, isc_status);
 			if (msg != &msgbuf)
 				free(msg, M_CTL);
 			return;
 		}
 
 		CTL_DEBUG_PRINT(("CTL: msg_type %d\n", msg->msg_type));
 		switch (msg->hdr.msg_type) {
 		case CTL_MSG_SERIALIZE:
 			io = ctl_alloc_io(softc->othersc_pool);
 			ctl_zero_io(io);
 			// populate ctsio from msg
 			io->io_hdr.io_type = CTL_IO_SCSI;
 			io->io_hdr.msg_type = CTL_MSG_SERIALIZE;
 			io->io_hdr.original_sc = msg->hdr.original_sc;
 			io->io_hdr.flags |= CTL_FLAG_FROM_OTHER_SC |
 					    CTL_FLAG_IO_ACTIVE;
 			/*
 			 * If we're in serialization-only mode, we don't
 			 * want to go through full done processing.  Thus
 			 * the COPY flag.
 			 *
 			 * XXX KDM add another flag that is more specific.
 			 */
 			if (softc->ha_mode != CTL_HA_MODE_XFER)
 				io->io_hdr.flags |= CTL_FLAG_INT_COPY;
 			io->io_hdr.nexus = msg->hdr.nexus;
 #if 0
 			printf("port %u, iid %u, lun %u\n",
 			       io->io_hdr.nexus.targ_port,
 			       io->io_hdr.nexus.initid,
 			       io->io_hdr.nexus.targ_lun);
 #endif
 			io->scsiio.tag_num = msg->scsi.tag_num;
 			io->scsiio.tag_type = msg->scsi.tag_type;
 #ifdef CTL_TIME_IO
 			io->io_hdr.start_time = time_uptime;
 			getbinuptime(&io->io_hdr.start_bt);
 #endif /* CTL_TIME_IO */
 			io->scsiio.cdb_len = msg->scsi.cdb_len;
 			memcpy(io->scsiio.cdb, msg->scsi.cdb,
 			       CTL_MAX_CDBLEN);
 			if (softc->ha_mode == CTL_HA_MODE_XFER) {
 				const struct ctl_cmd_entry *entry;
 
 				entry = ctl_get_cmd_entry(&io->scsiio, NULL);
 				io->io_hdr.flags &= ~CTL_FLAG_DATA_MASK;
 				io->io_hdr.flags |=
 					entry->flags & CTL_FLAG_DATA_MASK;
 			}
 			ctl_enqueue_isc(io);
 			break;
 
 		/* Performed on the Originating SC, XFER mode only */
 		case CTL_MSG_DATAMOVE: {
 			struct ctl_sg_entry *sgl;
 			int i, j;
 
 			io = msg->hdr.original_sc;
 			if (io == NULL) {
 				printf("%s: original_sc == NULL!\n", __func__);
 				/* XXX KDM do something here */
 				break;
 			}
 			io->io_hdr.msg_type = CTL_MSG_DATAMOVE;
 			io->io_hdr.flags |= CTL_FLAG_IO_ACTIVE;
 			/*
 			 * Keep track of this, we need to send it back over
 			 * when the datamove is complete.
 			 */
 			io->io_hdr.serializing_sc = msg->hdr.serializing_sc;
 			if (msg->hdr.status == CTL_SUCCESS)
 				io->io_hdr.status = msg->hdr.status;
 
 			if (msg->dt.sg_sequence == 0) {
 #ifdef CTL_TIME_IO
 				getbinuptime(&io->io_hdr.dma_start_bt);
 #endif
 				i = msg->dt.kern_sg_entries +
 				    msg->dt.kern_data_len /
 				    CTL_HA_DATAMOVE_SEGMENT + 1;
 				sgl = malloc(sizeof(*sgl) * i, M_CTL,
 				    M_WAITOK | M_ZERO);
 				io->io_hdr.remote_sglist = sgl;
 				io->io_hdr.local_sglist =
 				    &sgl[msg->dt.kern_sg_entries];
 
 				io->scsiio.kern_data_ptr = (uint8_t *)sgl;
 
 				io->scsiio.kern_sg_entries =
 					msg->dt.kern_sg_entries;
 				io->scsiio.rem_sg_entries =
 					msg->dt.kern_sg_entries;
 				io->scsiio.kern_data_len =
 					msg->dt.kern_data_len;
 				io->scsiio.kern_total_len =
 					msg->dt.kern_total_len;
 				io->scsiio.kern_data_resid =
 					msg->dt.kern_data_resid;
 				io->scsiio.kern_rel_offset =
 					msg->dt.kern_rel_offset;
 				io->io_hdr.flags &= ~CTL_FLAG_BUS_ADDR;
 				io->io_hdr.flags |= msg->dt.flags &
 				    CTL_FLAG_BUS_ADDR;
 			} else
 				sgl = (struct ctl_sg_entry *)
 					io->scsiio.kern_data_ptr;
 
 			for (i = msg->dt.sent_sg_entries, j = 0;
 			     i < (msg->dt.sent_sg_entries +
 			     msg->dt.cur_sg_entries); i++, j++) {
 				sgl[i].addr = msg->dt.sg_list[j].addr;
 				sgl[i].len = msg->dt.sg_list[j].len;
 
 #if 0
 				printf("%s: DATAMOVE: %p,%lu j=%d, i=%d\n",
 				    __func__, sgl[i].addr, sgl[i].len, j, i);
 #endif
 			}
 
 			/*
 			 * If this is the last piece of the I/O, we've got
 			 * the full S/G list.  Queue processing in the thread.
 			 * Otherwise wait for the next piece.
 			 */
 			if (msg->dt.sg_last != 0)
 				ctl_enqueue_isc(io);
 			break;
 		}
 		/* Performed on the Serializing (primary) SC, XFER mode only */
 		case CTL_MSG_DATAMOVE_DONE: {
 			if (msg->hdr.serializing_sc == NULL) {
 				printf("%s: serializing_sc == NULL!\n",
 				       __func__);
 				/* XXX KDM now what? */
 				break;
 			}
 			/*
 			 * We grab the sense information here in case
 			 * there was a failure, so we can return status
 			 * back to the initiator.
 			 */
 			io = msg->hdr.serializing_sc;
 			io->io_hdr.msg_type = CTL_MSG_DATAMOVE_DONE;
 			io->io_hdr.flags &= ~CTL_FLAG_DMA_INPROG;
 			io->io_hdr.flags |= CTL_FLAG_IO_ACTIVE;
 			io->io_hdr.port_status = msg->scsi.port_status;
 			io->scsiio.kern_data_resid = msg->scsi.kern_data_resid;
 			if (msg->hdr.status != CTL_STATUS_NONE) {
 				io->io_hdr.status = msg->hdr.status;
 				io->scsiio.scsi_status = msg->scsi.scsi_status;
 				io->scsiio.sense_len = msg->scsi.sense_len;
 				memcpy(&io->scsiio.sense_data,
 				    &msg->scsi.sense_data,
 				    msg->scsi.sense_len);
 				if (msg->hdr.status == CTL_SUCCESS)
 					io->io_hdr.flags |= CTL_FLAG_STATUS_SENT;
 			}
 			ctl_enqueue_isc(io);
 			break;
 		}
 
 		/* Preformed on Originating SC, SER_ONLY mode */
 		case CTL_MSG_R2R:
 			io = msg->hdr.original_sc;
 			if (io == NULL) {
 				printf("%s: original_sc == NULL!\n",
 				    __func__);
 				break;
 			}
 			io->io_hdr.flags |= CTL_FLAG_IO_ACTIVE;
 			io->io_hdr.msg_type = CTL_MSG_R2R;
 			io->io_hdr.serializing_sc = msg->hdr.serializing_sc;
 			ctl_enqueue_isc(io);
 			break;
 
 		/*
 		 * Performed on Serializing(i.e. primary SC) SC in SER_ONLY
 		 * mode.
 		 * Performed on the Originating (i.e. secondary) SC in XFER
 		 * mode
 		 */
 		case CTL_MSG_FINISH_IO:
 			if (softc->ha_mode == CTL_HA_MODE_XFER)
 				ctl_isc_handler_finish_xfer(softc, msg);
 			else
 				ctl_isc_handler_finish_ser_only(softc, msg);
 			break;
 
 		/* Preformed on Originating SC */
 		case CTL_MSG_BAD_JUJU:
 			io = msg->hdr.original_sc;
 			if (io == NULL) {
 				printf("%s: Bad JUJU!, original_sc is NULL!\n",
 				       __func__);
 				break;
 			}
 			ctl_copy_sense_data(msg, io);
 			/*
 			 * IO should have already been cleaned up on other
 			 * SC so clear this flag so we won't send a message
 			 * back to finish the IO there.
 			 */
 			io->io_hdr.flags &= ~CTL_FLAG_SENT_2OTHER_SC;
 			io->io_hdr.flags |= CTL_FLAG_IO_ACTIVE;
 
 			/* io = msg->hdr.serializing_sc; */
 			io->io_hdr.msg_type = CTL_MSG_BAD_JUJU;
 			ctl_enqueue_isc(io);
 			break;
 
 		/* Handle resets sent from the other side */
 		case CTL_MSG_MANAGE_TASKS: {
 			struct ctl_taskio *taskio;
 			taskio = (struct ctl_taskio *)ctl_alloc_io(
 			    softc->othersc_pool);
 			ctl_zero_io((union ctl_io *)taskio);
 			taskio->io_hdr.io_type = CTL_IO_TASK;
 			taskio->io_hdr.flags |= CTL_FLAG_FROM_OTHER_SC;
 			taskio->io_hdr.nexus = msg->hdr.nexus;
 			taskio->task_action = msg->task.task_action;
 			taskio->tag_num = msg->task.tag_num;
 			taskio->tag_type = msg->task.tag_type;
 #ifdef CTL_TIME_IO
 			taskio->io_hdr.start_time = time_uptime;
 			getbinuptime(&taskio->io_hdr.start_bt);
 #endif /* CTL_TIME_IO */
 			ctl_run_task((union ctl_io *)taskio);
 			break;
 		}
 		/* Persistent Reserve action which needs attention */
 		case CTL_MSG_PERS_ACTION:
 			presio = (struct ctl_prio *)ctl_alloc_io(
 			    softc->othersc_pool);
 			ctl_zero_io((union ctl_io *)presio);
 			presio->io_hdr.msg_type = CTL_MSG_PERS_ACTION;
 			presio->io_hdr.flags |= CTL_FLAG_FROM_OTHER_SC;
 			presio->io_hdr.nexus = msg->hdr.nexus;
 			presio->pr_msg = msg->pr;
 			ctl_enqueue_isc((union ctl_io *)presio);
 			break;
 		case CTL_MSG_UA:
 			ctl_isc_ua(softc, msg, param);
 			break;
 		case CTL_MSG_PORT_SYNC:
 			ctl_isc_port_sync(softc, msg, param);
 			break;
 		case CTL_MSG_LUN_SYNC:
 			ctl_isc_lun_sync(softc, msg, param);
 			break;
 		case CTL_MSG_IID_SYNC:
 			ctl_isc_iid_sync(softc, msg, param);
 			break;
 		case CTL_MSG_LOGIN:
 			ctl_isc_login(softc, msg, param);
 			break;
 		case CTL_MSG_MODE_SYNC:
 			ctl_isc_mode_sync(softc, msg, param);
 			break;
 		default:
 			printf("Received HA message of unknown type %d\n",
 			    msg->hdr.msg_type);
 			ctl_ha_msg_abort(CTL_HA_CHAN_CTL);
 			break;
 		}
 		if (msg != &msgbuf)
 			free(msg, M_CTL);
 	} else if (event == CTL_HA_EVT_LINK_CHANGE) {
 		printf("CTL: HA link status changed from %d to %d\n",
 		    softc->ha_link, param);
 		if (param == softc->ha_link)
 			return;
 		if (softc->ha_link == CTL_HA_LINK_ONLINE) {
 			softc->ha_link = param;
 			ctl_isc_ha_link_down(softc);
 		} else {
 			softc->ha_link = param;
 			if (softc->ha_link == CTL_HA_LINK_ONLINE)
 				ctl_isc_ha_link_up(softc);
 		}
 		return;
 	} else {
 		printf("ctl_isc_event_handler: Unknown event %d\n", event);
 		return;
 	}
 }
 
 static void
 ctl_copy_sense_data(union ctl_ha_msg *src, union ctl_io *dest)
 {
 
 	memcpy(&dest->scsiio.sense_data, &src->scsi.sense_data,
 	    src->scsi.sense_len);
 	dest->scsiio.scsi_status = src->scsi.scsi_status;
 	dest->scsiio.sense_len = src->scsi.sense_len;
 	dest->io_hdr.status = src->hdr.status;
 }
 
 static void
 ctl_copy_sense_data_back(union ctl_io *src, union ctl_ha_msg *dest)
 {
 
 	memcpy(&dest->scsi.sense_data, &src->scsiio.sense_data,
 	    src->scsiio.sense_len);
 	dest->scsi.scsi_status = src->scsiio.scsi_status;
 	dest->scsi.sense_len = src->scsiio.sense_len;
 	dest->hdr.status = src->io_hdr.status;
 }
 
 void
 ctl_est_ua(struct ctl_lun *lun, uint32_t initidx, ctl_ua_type ua)
 {
 	struct ctl_softc *softc = lun->ctl_softc;
 	ctl_ua_type *pu;
 
 	if (initidx < softc->init_min || initidx >= softc->init_max)
 		return;
 	mtx_assert(&lun->lun_lock, MA_OWNED);
 	pu = lun->pending_ua[initidx / CTL_MAX_INIT_PER_PORT];
 	if (pu == NULL)
 		return;
 	pu[initidx % CTL_MAX_INIT_PER_PORT] |= ua;
 }
 
 void
 ctl_est_ua_port(struct ctl_lun *lun, int port, uint32_t except, ctl_ua_type ua)
 {
 	int i;
 
 	mtx_assert(&lun->lun_lock, MA_OWNED);
 	if (lun->pending_ua[port] == NULL)
 		return;
 	for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) {
 		if (port * CTL_MAX_INIT_PER_PORT + i == except)
 			continue;
 		lun->pending_ua[port][i] |= ua;
 	}
 }
 
 void
 ctl_est_ua_all(struct ctl_lun *lun, uint32_t except, ctl_ua_type ua)
 {
 	struct ctl_softc *softc = lun->ctl_softc;
 	int i;
 
 	mtx_assert(&lun->lun_lock, MA_OWNED);
 	for (i = softc->port_min; i < softc->port_max; i++)
 		ctl_est_ua_port(lun, i, except, ua);
 }
 
 void
 ctl_clr_ua(struct ctl_lun *lun, uint32_t initidx, ctl_ua_type ua)
 {
 	struct ctl_softc *softc = lun->ctl_softc;
 	ctl_ua_type *pu;
 
 	if (initidx < softc->init_min || initidx >= softc->init_max)
 		return;
 	mtx_assert(&lun->lun_lock, MA_OWNED);
 	pu = lun->pending_ua[initidx / CTL_MAX_INIT_PER_PORT];
 	if (pu == NULL)
 		return;
 	pu[initidx % CTL_MAX_INIT_PER_PORT] &= ~ua;
 }
 
 void
 ctl_clr_ua_all(struct ctl_lun *lun, uint32_t except, ctl_ua_type ua)
 {
 	struct ctl_softc *softc = lun->ctl_softc;
 	int i, j;
 
 	mtx_assert(&lun->lun_lock, MA_OWNED);
 	for (i = softc->port_min; i < softc->port_max; i++) {
 		if (lun->pending_ua[i] == NULL)
 			continue;
 		for (j = 0; j < CTL_MAX_INIT_PER_PORT; j++) {
 			if (i * CTL_MAX_INIT_PER_PORT + j == except)
 				continue;
 			lun->pending_ua[i][j] &= ~ua;
 		}
 	}
 }
 
 void
 ctl_clr_ua_allluns(struct ctl_softc *ctl_softc, uint32_t initidx,
     ctl_ua_type ua_type)
 {
 	struct ctl_lun *lun;
 
 	mtx_assert(&ctl_softc->ctl_lock, MA_OWNED);
 	STAILQ_FOREACH(lun, &ctl_softc->lun_list, links) {
 		mtx_lock(&lun->lun_lock);
 		ctl_clr_ua(lun, initidx, ua_type);
 		mtx_unlock(&lun->lun_lock);
 	}
 }
 
 static int
 ctl_ha_role_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct ctl_softc *softc = (struct ctl_softc *)arg1;
 	struct ctl_lun *lun;
 	struct ctl_lun_req ireq;
 	int error, value;
 
 	value = (softc->flags & CTL_FLAG_ACTIVE_SHELF) ? 0 : 1;
 	error = sysctl_handle_int(oidp, &value, 0, req);
 	if ((error != 0) || (req->newptr == NULL))
 		return (error);
 
 	mtx_lock(&softc->ctl_lock);
 	if (value == 0)
 		softc->flags |= CTL_FLAG_ACTIVE_SHELF;
 	else
 		softc->flags &= ~CTL_FLAG_ACTIVE_SHELF;
 	STAILQ_FOREACH(lun, &softc->lun_list, links) {
 		mtx_unlock(&softc->ctl_lock);
 		bzero(&ireq, sizeof(ireq));
 		ireq.reqtype = CTL_LUNREQ_MODIFY;
 		ireq.reqdata.modify.lun_id = lun->lun;
 		lun->backend->ioctl(NULL, CTL_LUN_REQ, (caddr_t)&ireq, 0,
 		    curthread);
 		if (ireq.status != CTL_LUN_OK) {
 			printf("%s: CTL_LUNREQ_MODIFY returned %d '%s'\n",
 			    __func__, ireq.status, ireq.error_str);
 		}
 		mtx_lock(&softc->ctl_lock);
 	}
 	mtx_unlock(&softc->ctl_lock);
 	return (0);
 }
 
 static int
 ctl_init(void)
 {
 	struct make_dev_args args;
 	struct ctl_softc *softc;
 	int i, error;
 
 	softc = control_softc = malloc(sizeof(*control_softc), M_DEVBUF,
 			       M_WAITOK | M_ZERO);
 
 	make_dev_args_init(&args);
 	args.mda_devsw = &ctl_cdevsw;
 	args.mda_uid = UID_ROOT;
 	args.mda_gid = GID_OPERATOR;
 	args.mda_mode = 0600;
 	args.mda_si_drv1 = softc;
 	error = make_dev_s(&args, &softc->dev, "cam/ctl");
 	if (error != 0) {
 		free(softc, M_DEVBUF);
 		control_softc = NULL;
 		return (error);
 	}
 
 	sysctl_ctx_init(&softc->sysctl_ctx);
 	softc->sysctl_tree = SYSCTL_ADD_NODE(&softc->sysctl_ctx,
 		SYSCTL_STATIC_CHILDREN(_kern_cam), OID_AUTO, "ctl",
 		CTLFLAG_RD, 0, "CAM Target Layer");
 
 	if (softc->sysctl_tree == NULL) {
 		printf("%s: unable to allocate sysctl tree\n", __func__);
 		destroy_dev(softc->dev);
 		free(softc, M_DEVBUF);
 		control_softc = NULL;
 		return (ENOMEM);
 	}
 
 	mtx_init(&softc->ctl_lock, "CTL mutex", NULL, MTX_DEF);
 	softc->io_zone = uma_zcreate("CTL IO", sizeof(union ctl_io),
 	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
 	softc->flags = 0;
 
 	SYSCTL_ADD_INT(&softc->sysctl_ctx, SYSCTL_CHILDREN(softc->sysctl_tree),
 	    OID_AUTO, "ha_mode", CTLFLAG_RDTUN, (int *)&softc->ha_mode, 0,
 	    "HA mode (0 - act/stby, 1 - serialize only, 2 - xfer)");
 
 	/*
 	 * In Copan's HA scheme, the "master" and "slave" roles are
 	 * figured out through the slot the controller is in.  Although it
 	 * is an active/active system, someone has to be in charge.
 	 */
 	SYSCTL_ADD_INT(&softc->sysctl_ctx, SYSCTL_CHILDREN(softc->sysctl_tree),
 	    OID_AUTO, "ha_id", CTLFLAG_RDTUN, &softc->ha_id, 0,
 	    "HA head ID (0 - no HA)");
 	if (softc->ha_id == 0 || softc->ha_id > NUM_HA_SHELVES) {
 		softc->flags |= CTL_FLAG_ACTIVE_SHELF;
 		softc->is_single = 1;
 		softc->port_cnt = CTL_MAX_PORTS;
 		softc->port_min = 0;
 	} else {
 		softc->port_cnt = CTL_MAX_PORTS / NUM_HA_SHELVES;
 		softc->port_min = (softc->ha_id - 1) * softc->port_cnt;
 	}
 	softc->port_max = softc->port_min + softc->port_cnt;
 	softc->init_min = softc->port_min * CTL_MAX_INIT_PER_PORT;
 	softc->init_max = softc->port_max * CTL_MAX_INIT_PER_PORT;
 
 	SYSCTL_ADD_INT(&softc->sysctl_ctx, SYSCTL_CHILDREN(softc->sysctl_tree),
 	    OID_AUTO, "ha_link", CTLFLAG_RD, (int *)&softc->ha_link, 0,
 	    "HA link state (0 - offline, 1 - unknown, 2 - online)");
 
 	STAILQ_INIT(&softc->lun_list);
 	STAILQ_INIT(&softc->pending_lun_queue);
 	STAILQ_INIT(&softc->fe_list);
 	STAILQ_INIT(&softc->port_list);
 	STAILQ_INIT(&softc->be_list);
 	ctl_tpc_init(softc);
 
 	if (worker_threads <= 0)
 		worker_threads = max(1, mp_ncpus / 4);
 	if (worker_threads > CTL_MAX_THREADS)
 		worker_threads = CTL_MAX_THREADS;
 
 	for (i = 0; i < worker_threads; i++) {
 		struct ctl_thread *thr = &softc->threads[i];
 
 		mtx_init(&thr->queue_lock, "CTL queue mutex", NULL, MTX_DEF);
 		thr->ctl_softc = softc;
 		STAILQ_INIT(&thr->incoming_queue);
 		STAILQ_INIT(&thr->rtr_queue);
 		STAILQ_INIT(&thr->done_queue);
 		STAILQ_INIT(&thr->isc_queue);
 
 		error = kproc_kthread_add(ctl_work_thread, thr,
 		    &softc->ctl_proc, &thr->thread, 0, 0, "ctl", "work%d", i);
 		if (error != 0) {
 			printf("error creating CTL work thread!\n");
 			return (error);
 		}
 	}
 	error = kproc_kthread_add(ctl_lun_thread, softc,
 	    &softc->ctl_proc, &softc->lun_thread, 0, 0, "ctl", "lun");
 	if (error != 0) {
 		printf("error creating CTL lun thread!\n");
 		return (error);
 	}
 	error = kproc_kthread_add(ctl_thresh_thread, softc,
 	    &softc->ctl_proc, &softc->thresh_thread, 0, 0, "ctl", "thresh");
 	if (error != 0) {
 		printf("error creating CTL threshold thread!\n");
 		return (error);
 	}
 
 	SYSCTL_ADD_PROC(&softc->sysctl_ctx,SYSCTL_CHILDREN(softc->sysctl_tree),
 	    OID_AUTO, "ha_role", CTLTYPE_INT | CTLFLAG_RWTUN,
 	    softc, 0, ctl_ha_role_sysctl, "I", "HA role for this head");
 
 	if (softc->is_single == 0) {
 		if (ctl_frontend_register(&ha_frontend) != 0)
 			softc->is_single = 1;
 	}
 	return (0);
 }
 
 static int
 ctl_shutdown(void)
 {
 	struct ctl_softc *softc = control_softc;
 	int i;
 
 	if (softc->is_single == 0)
 		ctl_frontend_deregister(&ha_frontend);
 
 	destroy_dev(softc->dev);
 
 	/* Shutdown CTL threads. */
 	softc->shutdown = 1;
 	for (i = 0; i < worker_threads; i++) {
 		struct ctl_thread *thr = &softc->threads[i];
 		while (thr->thread != NULL) {
 			wakeup(thr);
 			if (thr->thread != NULL)
 				pause("CTL thr shutdown", 1);
 		}
 		mtx_destroy(&thr->queue_lock);
 	}
 	while (softc->lun_thread != NULL) {
 		wakeup(&softc->pending_lun_queue);
 		if (softc->lun_thread != NULL)
 			pause("CTL thr shutdown", 1);
 	}
 	while (softc->thresh_thread != NULL) {
 		wakeup(softc->thresh_thread);
 		if (softc->thresh_thread != NULL)
 			pause("CTL thr shutdown", 1);
 	}
 
 	ctl_tpc_shutdown(softc);
 	uma_zdestroy(softc->io_zone);
 	mtx_destroy(&softc->ctl_lock);
 
 	sysctl_ctx_free(&softc->sysctl_ctx);
 
 	free(softc, M_DEVBUF);
 	control_softc = NULL;
 	return (0);
 }
 
 static int
 ctl_module_event_handler(module_t mod, int what, void *arg)
 {
 
 	switch (what) {
 	case MOD_LOAD:
 		return (ctl_init());
 	case MOD_UNLOAD:
 		return (ctl_shutdown());
 	default:
 		return (EOPNOTSUPP);
 	}
 }
 
 /*
  * XXX KDM should we do some access checks here?  Bump a reference count to
  * prevent a CTL module from being unloaded while someone has it open?
  */
 static int
 ctl_open(struct cdev *dev, int flags, int fmt, struct thread *td)
 {
 	return (0);
 }
 
 static int
 ctl_close(struct cdev *dev, int flags, int fmt, struct thread *td)
 {
 	return (0);
 }
 
 /*
  * Remove an initiator by port number and initiator ID.
  * Returns 0 for success, -1 for failure.
  */
 int
 ctl_remove_initiator(struct ctl_port *port, int iid)
 {
 	struct ctl_softc *softc = port->ctl_softc;
 	int last;
 
 	mtx_assert(&softc->ctl_lock, MA_NOTOWNED);
 
 	if (iid > CTL_MAX_INIT_PER_PORT) {
 		printf("%s: initiator ID %u > maximun %u!\n",
 		       __func__, iid, CTL_MAX_INIT_PER_PORT);
 		return (-1);
 	}
 
 	mtx_lock(&softc->ctl_lock);
 	last = (--port->wwpn_iid[iid].in_use == 0);
 	port->wwpn_iid[iid].last_use = time_uptime;
 	mtx_unlock(&softc->ctl_lock);
 	if (last)
 		ctl_i_t_nexus_loss(softc, iid, CTL_UA_POWERON);
 	ctl_isc_announce_iid(port, iid);
 
 	return (0);
 }
 
 /*
  * Add an initiator to the initiator map.
  * Returns iid for success, < 0 for failure.
  */
 int
 ctl_add_initiator(struct ctl_port *port, int iid, uint64_t wwpn, char *name)
 {
 	struct ctl_softc *softc = port->ctl_softc;
 	time_t best_time;
 	int i, best;
 
 	mtx_assert(&softc->ctl_lock, MA_NOTOWNED);
 
 	if (iid >= CTL_MAX_INIT_PER_PORT) {
 		printf("%s: WWPN %#jx initiator ID %u > maximum %u!\n",
 		       __func__, wwpn, iid, CTL_MAX_INIT_PER_PORT);
 		free(name, M_CTL);
 		return (-1);
 	}
 
 	mtx_lock(&softc->ctl_lock);
 
 	if (iid < 0 && (wwpn != 0 || name != NULL)) {
 		for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) {
 			if (wwpn != 0 && wwpn == port->wwpn_iid[i].wwpn) {
 				iid = i;
 				break;
 			}
 			if (name != NULL && port->wwpn_iid[i].name != NULL &&
 			    strcmp(name, port->wwpn_iid[i].name) == 0) {
 				iid = i;
 				break;
 			}
 		}
 	}
 
 	if (iid < 0) {
 		for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) {
 			if (port->wwpn_iid[i].in_use == 0 &&
 			    port->wwpn_iid[i].wwpn == 0 &&
 			    port->wwpn_iid[i].name == NULL) {
 				iid = i;
 				break;
 			}
 		}
 	}
 
 	if (iid < 0) {
 		best = -1;
 		best_time = INT32_MAX;
 		for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) {
 			if (port->wwpn_iid[i].in_use == 0) {
 				if (port->wwpn_iid[i].last_use < best_time) {
 					best = i;
 					best_time = port->wwpn_iid[i].last_use;
 				}
 			}
 		}
 		iid = best;
 	}
 
 	if (iid < 0) {
 		mtx_unlock(&softc->ctl_lock);
 		free(name, M_CTL);
 		return (-2);
 	}
 
 	if (port->wwpn_iid[iid].in_use > 0 && (wwpn != 0 || name != NULL)) {
 		/*
 		 * This is not an error yet.
 		 */
 		if (wwpn != 0 && wwpn == port->wwpn_iid[iid].wwpn) {
 #if 0
 			printf("%s: port %d iid %u WWPN %#jx arrived"
 			    " again\n", __func__, port->targ_port,
 			    iid, (uintmax_t)wwpn);
 #endif
 			goto take;
 		}
 		if (name != NULL && port->wwpn_iid[iid].name != NULL &&
 		    strcmp(name, port->wwpn_iid[iid].name) == 0) {
 #if 0
 			printf("%s: port %d iid %u name '%s' arrived"
 			    " again\n", __func__, port->targ_port,
 			    iid, name);
 #endif
 			goto take;
 		}
 
 		/*
 		 * This is an error, but what do we do about it?  The
 		 * driver is telling us we have a new WWPN for this
 		 * initiator ID, so we pretty much need to use it.
 		 */
 		printf("%s: port %d iid %u WWPN %#jx '%s' arrived,"
 		    " but WWPN %#jx '%s' is still at that address\n",
 		    __func__, port->targ_port, iid, wwpn, name,
 		    (uintmax_t)port->wwpn_iid[iid].wwpn,
 		    port->wwpn_iid[iid].name);
 	}
 take:
 	free(port->wwpn_iid[iid].name, M_CTL);
 	port->wwpn_iid[iid].name = name;
 	port->wwpn_iid[iid].wwpn = wwpn;
 	port->wwpn_iid[iid].in_use++;
 	mtx_unlock(&softc->ctl_lock);
 	ctl_isc_announce_iid(port, iid);
 
 	return (iid);
 }
 
 static int
 ctl_create_iid(struct ctl_port *port, int iid, uint8_t *buf)
 {
 	int len;
 
 	switch (port->port_type) {
 	case CTL_PORT_FC:
 	{
 		struct scsi_transportid_fcp *id =
 		    (struct scsi_transportid_fcp *)buf;
 		if (port->wwpn_iid[iid].wwpn == 0)
 			return (0);
 		memset(id, 0, sizeof(*id));
 		id->format_protocol = SCSI_PROTO_FC;
 		scsi_u64to8b(port->wwpn_iid[iid].wwpn, id->n_port_name);
 		return (sizeof(*id));
 	}
 	case CTL_PORT_ISCSI:
 	{
 		struct scsi_transportid_iscsi_port *id =
 		    (struct scsi_transportid_iscsi_port *)buf;
 		if (port->wwpn_iid[iid].name == NULL)
 			return (0);
 		memset(id, 0, 256);
 		id->format_protocol = SCSI_TRN_ISCSI_FORMAT_PORT |
 		    SCSI_PROTO_ISCSI;
 		len = strlcpy(id->iscsi_name, port->wwpn_iid[iid].name, 252) + 1;
 		len = roundup2(min(len, 252), 4);
 		scsi_ulto2b(len, id->additional_length);
 		return (sizeof(*id) + len);
 	}
 	case CTL_PORT_SAS:
 	{
 		struct scsi_transportid_sas *id =
 		    (struct scsi_transportid_sas *)buf;
 		if (port->wwpn_iid[iid].wwpn == 0)
 			return (0);
 		memset(id, 0, sizeof(*id));
 		id->format_protocol = SCSI_PROTO_SAS;
 		scsi_u64to8b(port->wwpn_iid[iid].wwpn, id->sas_address);
 		return (sizeof(*id));
 	}
 	default:
 	{
 		struct scsi_transportid_spi *id =
 		    (struct scsi_transportid_spi *)buf;
 		memset(id, 0, sizeof(*id));
 		id->format_protocol = SCSI_PROTO_SPI;
 		scsi_ulto2b(iid, id->scsi_addr);
 		scsi_ulto2b(port->targ_port, id->rel_trgt_port_id);
 		return (sizeof(*id));
 	}
 	}
 }
 
 /*
  * Serialize a command that went down the "wrong" side, and so was sent to
  * this controller for execution.  The logic is a little different than the
  * standard case in ctl_scsiio_precheck().  Errors in this case need to get
  * sent back to the other side, but in the success case, we execute the
  * command on this side (XFER mode) or tell the other side to execute it
  * (SER_ONLY mode).
  */
 static void
 ctl_serialize_other_sc_cmd(struct ctl_scsiio *ctsio)
 {
 	struct ctl_softc *softc = CTL_SOFTC(ctsio);
 	struct ctl_port *port = CTL_PORT(ctsio);
 	union ctl_ha_msg msg_info;
 	struct ctl_lun *lun;
 	const struct ctl_cmd_entry *entry;
 	uint32_t targ_lun;
 
 	targ_lun = ctsio->io_hdr.nexus.targ_mapped_lun;
 
 	/* Make sure that we know about this port. */
 	if (port == NULL || (port->status & CTL_PORT_STATUS_ONLINE) == 0) {
 		ctl_set_internal_failure(ctsio, /*sks_valid*/ 0,
 					 /*retry_count*/ 1);
 		goto badjuju;
 	}
 
 	/* Make sure that we know about this LUN. */
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 
 		/*
 		 * The other node would not send this request to us unless
 		 * received announce that we are primary node for this LUN.
 		 * If this LUN does not exist now, it is probably result of
 		 * a race, so respond to initiator in the most opaque way.
 		 */
 		ctl_set_busy(ctsio);
 		goto badjuju;
 	}
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 
 	/*
 	 * If the LUN is invalid, pretend that it doesn't exist.
 	 * It will go away as soon as all pending I/Os completed.
 	 */
 	if (lun->flags & CTL_LUN_DISABLED) {
 		mtx_unlock(&lun->lun_lock);
 		ctl_set_busy(ctsio);
 		goto badjuju;
 	}
 
 	entry = ctl_get_cmd_entry(ctsio, NULL);
 	if (ctl_scsiio_lun_check(lun, entry, ctsio) != 0) {
 		mtx_unlock(&lun->lun_lock);
 		goto badjuju;
 	}
 
 	CTL_LUN(ctsio) = lun;
 	CTL_BACKEND_LUN(ctsio) = lun->be_lun;
 
 	/*
 	 * Every I/O goes into the OOA queue for a
 	 * particular LUN, and stays there until completion.
 	 */
 #ifdef CTL_TIME_IO
 	if (TAILQ_EMPTY(&lun->ooa_queue))
 		lun->idle_time += getsbinuptime() - lun->last_busy;
 #endif
 	TAILQ_INSERT_TAIL(&lun->ooa_queue, &ctsio->io_hdr, ooa_links);
 
 	switch (ctl_check_ooa(lun, (union ctl_io *)ctsio,
 		(union ctl_io *)TAILQ_PREV(&ctsio->io_hdr, ctl_ooaq,
 		 ooa_links))) {
 	case CTL_ACTION_BLOCK:
 		ctsio->io_hdr.flags |= CTL_FLAG_BLOCKED;
 		TAILQ_INSERT_TAIL(&lun->blocked_queue, &ctsio->io_hdr,
 				  blocked_links);
 		mtx_unlock(&lun->lun_lock);
 		break;
 	case CTL_ACTION_PASS:
 	case CTL_ACTION_SKIP:
 		if (softc->ha_mode == CTL_HA_MODE_XFER) {
 			ctsio->io_hdr.flags |= CTL_FLAG_IS_WAS_ON_RTR;
 			ctl_enqueue_rtr((union ctl_io *)ctsio);
 			mtx_unlock(&lun->lun_lock);
 		} else {
 			ctsio->io_hdr.flags &= ~CTL_FLAG_IO_ACTIVE;
 			mtx_unlock(&lun->lun_lock);
 
 			/* send msg back to other side */
 			msg_info.hdr.original_sc = ctsio->io_hdr.original_sc;
 			msg_info.hdr.serializing_sc = (union ctl_io *)ctsio;
 			msg_info.hdr.msg_type = CTL_MSG_R2R;
 			ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info,
 			    sizeof(msg_info.hdr), M_WAITOK);
 		}
 		break;
 	case CTL_ACTION_OVERLAP:
 		TAILQ_REMOVE(&lun->ooa_queue, &ctsio->io_hdr, ooa_links);
 		mtx_unlock(&lun->lun_lock);
 		ctl_set_overlapped_cmd(ctsio);
 		goto badjuju;
 	case CTL_ACTION_OVERLAP_TAG:
 		TAILQ_REMOVE(&lun->ooa_queue, &ctsio->io_hdr, ooa_links);
 		mtx_unlock(&lun->lun_lock);
 		ctl_set_overlapped_tag(ctsio, ctsio->tag_num);
 		goto badjuju;
 	case CTL_ACTION_ERROR:
 	default:
 		TAILQ_REMOVE(&lun->ooa_queue, &ctsio->io_hdr, ooa_links);
 		mtx_unlock(&lun->lun_lock);
 
 		ctl_set_internal_failure(ctsio, /*sks_valid*/ 0,
 					 /*retry_count*/ 0);
 badjuju:
 		ctl_copy_sense_data_back((union ctl_io *)ctsio, &msg_info);
 		msg_info.hdr.original_sc = ctsio->io_hdr.original_sc;
 		msg_info.hdr.serializing_sc = NULL;
 		msg_info.hdr.msg_type = CTL_MSG_BAD_JUJU;
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info,
 		    sizeof(msg_info.scsi), M_WAITOK);
 		ctl_free_io((union ctl_io *)ctsio);
 		break;
 	}
 }
 
 /*
  * Returns 0 for success, errno for failure.
  */
 static void
 ctl_ioctl_fill_ooa(struct ctl_lun *lun, uint32_t *cur_fill_num,
 		   struct ctl_ooa *ooa_hdr, struct ctl_ooa_entry *kern_entries)
 {
 	union ctl_io *io;
 
 	mtx_lock(&lun->lun_lock);
 	for (io = (union ctl_io *)TAILQ_FIRST(&lun->ooa_queue); (io != NULL);
 	     (*cur_fill_num)++, io = (union ctl_io *)TAILQ_NEXT(&io->io_hdr,
 	     ooa_links)) {
 		struct ctl_ooa_entry *entry;
 
 		/*
 		 * If we've got more than we can fit, just count the
 		 * remaining entries.
 		 */
 		if (*cur_fill_num >= ooa_hdr->alloc_num)
 			continue;
 
 		entry = &kern_entries[*cur_fill_num];
 
 		entry->tag_num = io->scsiio.tag_num;
 		entry->lun_num = lun->lun;
 #ifdef CTL_TIME_IO
 		entry->start_bt = io->io_hdr.start_bt;
 #endif
 		bcopy(io->scsiio.cdb, entry->cdb, io->scsiio.cdb_len);
 		entry->cdb_len = io->scsiio.cdb_len;
 		if (io->io_hdr.flags & CTL_FLAG_BLOCKED)
 			entry->cmd_flags |= CTL_OOACMD_FLAG_BLOCKED;
 
 		if (io->io_hdr.flags & CTL_FLAG_DMA_INPROG)
 			entry->cmd_flags |= CTL_OOACMD_FLAG_DMA;
 
 		if (io->io_hdr.flags & CTL_FLAG_ABORT)
 			entry->cmd_flags |= CTL_OOACMD_FLAG_ABORT;
 
 		if (io->io_hdr.flags & CTL_FLAG_IS_WAS_ON_RTR)
 			entry->cmd_flags |= CTL_OOACMD_FLAG_RTR;
 
 		if (io->io_hdr.flags & CTL_FLAG_DMA_QUEUED)
 			entry->cmd_flags |= CTL_OOACMD_FLAG_DMA_QUEUED;
 	}
 	mtx_unlock(&lun->lun_lock);
 }
 
 static void *
 ctl_copyin_alloc(void *user_addr, unsigned int len, char *error_str,
 		 size_t error_str_len)
 {
 	void *kptr;
 
 	kptr = malloc(len, M_CTL, M_WAITOK | M_ZERO);
 
 	if (copyin(user_addr, kptr, len) != 0) {
 		snprintf(error_str, error_str_len, "Error copying %d bytes "
 			 "from user address %p to kernel address %p", len,
 			 user_addr, kptr);
 		free(kptr, M_CTL);
 		return (NULL);
 	}
 
 	return (kptr);
 }
 
 static void
 ctl_free_args(int num_args, struct ctl_be_arg *args)
 {
 	int i;
 
 	if (args == NULL)
 		return;
 
 	for (i = 0; i < num_args; i++) {
 		free(args[i].kname, M_CTL);
 		free(args[i].kvalue, M_CTL);
 	}
 
 	free(args, M_CTL);
 }
 
 static struct ctl_be_arg *
 ctl_copyin_args(int num_args, struct ctl_be_arg *uargs,
 		char *error_str, size_t error_str_len)
 {
 	struct ctl_be_arg *args;
 	int i;
 
 	args = ctl_copyin_alloc(uargs, num_args * sizeof(*args),
 				error_str, error_str_len);
 
 	if (args == NULL)
 		goto bailout;
 
 	for (i = 0; i < num_args; i++) {
 		args[i].kname = NULL;
 		args[i].kvalue = NULL;
 	}
 
 	for (i = 0; i < num_args; i++) {
 		uint8_t *tmpptr;
 
 		if (args[i].namelen == 0) {
 			snprintf(error_str, error_str_len, "Argument %d "
 				 "name length is zero", i);
 			goto bailout;
 		}
 
 		args[i].kname = ctl_copyin_alloc(args[i].name,
 			args[i].namelen, error_str, error_str_len);
 		if (args[i].kname == NULL)
 			goto bailout;
 
 		if (args[i].kname[args[i].namelen - 1] != '\0') {
 			snprintf(error_str, error_str_len, "Argument %d "
 				 "name is not NUL-terminated", i);
 			goto bailout;
 		}
 
 		if (args[i].flags & CTL_BEARG_RD) {
 			if (args[i].vallen == 0) {
 				snprintf(error_str, error_str_len, "Argument %d "
 					 "value length is zero", i);
 				goto bailout;
 			}
 
 			tmpptr = ctl_copyin_alloc(args[i].value,
 				args[i].vallen, error_str, error_str_len);
 			if (tmpptr == NULL)
 				goto bailout;
 
 			if ((args[i].flags & CTL_BEARG_ASCII)
 			 && (tmpptr[args[i].vallen - 1] != '\0')) {
 				snprintf(error_str, error_str_len, "Argument "
 				    "%d value is not NUL-terminated", i);
 				free(tmpptr, M_CTL);
 				goto bailout;
 			}
 			args[i].kvalue = tmpptr;
 		} else {
 			args[i].kvalue = malloc(args[i].vallen,
 			    M_CTL, M_WAITOK | M_ZERO);
 		}
 	}
 
 	return (args);
 bailout:
 
 	ctl_free_args(num_args, args);
 
 	return (NULL);
 }
 
 static void
 ctl_copyout_args(int num_args, struct ctl_be_arg *args)
 {
 	int i;
 
 	for (i = 0; i < num_args; i++) {
 		if (args[i].flags & CTL_BEARG_WR)
 			copyout(args[i].kvalue, args[i].value, args[i].vallen);
 	}
 }
 
 /*
  * Escape characters that are illegal or not recommended in XML.
  */
 int
 ctl_sbuf_printf_esc(struct sbuf *sb, char *str, int size)
 {
 	char *end = str + size;
 	int retval;
 
 	retval = 0;
 
 	for (; *str && str < end; str++) {
 		switch (*str) {
 		case '&':
 			retval = sbuf_printf(sb, "&amp;");
 			break;
 		case '>':
 			retval = sbuf_printf(sb, "&gt;");
 			break;
 		case '<':
 			retval = sbuf_printf(sb, "&lt;");
 			break;
 		default:
 			retval = sbuf_putc(sb, *str);
 			break;
 		}
 
 		if (retval != 0)
 			break;
 
 	}
 
 	return (retval);
 }
 
 static void
 ctl_id_sbuf(struct ctl_devid *id, struct sbuf *sb)
 {
 	struct scsi_vpd_id_descriptor *desc;
 	int i;
 
 	if (id == NULL || id->len < 4)
 		return;
 	desc = (struct scsi_vpd_id_descriptor *)id->data;
 	switch (desc->id_type & SVPD_ID_TYPE_MASK) {
 	case SVPD_ID_TYPE_T10:
 		sbuf_printf(sb, "t10.");
 		break;
 	case SVPD_ID_TYPE_EUI64:
 		sbuf_printf(sb, "eui.");
 		break;
 	case SVPD_ID_TYPE_NAA:
 		sbuf_printf(sb, "naa.");
 		break;
 	case SVPD_ID_TYPE_SCSI_NAME:
 		break;
 	}
 	switch (desc->proto_codeset & SVPD_ID_CODESET_MASK) {
 	case SVPD_ID_CODESET_BINARY:
 		for (i = 0; i < desc->length; i++)
 			sbuf_printf(sb, "%02x", desc->identifier[i]);
 		break;
 	case SVPD_ID_CODESET_ASCII:
 		sbuf_printf(sb, "%.*s", (int)desc->length,
 		    (char *)desc->identifier);
 		break;
 	case SVPD_ID_CODESET_UTF8:
 		sbuf_printf(sb, "%s", (char *)desc->identifier);
 		break;
 	}
 }
 
 static int
 ctl_ioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flag,
 	  struct thread *td)
 {
 	struct ctl_softc *softc = dev->si_drv1;
 	struct ctl_port *port;
 	struct ctl_lun *lun;
 	int retval;
 
 	retval = 0;
 
 	switch (cmd) {
 	case CTL_IO:
 		retval = ctl_ioctl_io(dev, cmd, addr, flag, td);
 		break;
 	case CTL_ENABLE_PORT:
 	case CTL_DISABLE_PORT:
 	case CTL_SET_PORT_WWNS: {
 		struct ctl_port *port;
 		struct ctl_port_entry *entry;
 
 		entry = (struct ctl_port_entry *)addr;
 		
 		mtx_lock(&softc->ctl_lock);
 		STAILQ_FOREACH(port, &softc->port_list, links) {
 			int action, done;
 
 			if (port->targ_port < softc->port_min ||
 			    port->targ_port >= softc->port_max)
 				continue;
 
 			action = 0;
 			done = 0;
 			if ((entry->port_type == CTL_PORT_NONE)
 			 && (entry->targ_port == port->targ_port)) {
 				/*
 				 * If the user only wants to enable or
 				 * disable or set WWNs on a specific port,
 				 * do the operation and we're done.
 				 */
 				action = 1;
 				done = 1;
 			} else if (entry->port_type & port->port_type) {
 				/*
 				 * Compare the user's type mask with the
 				 * particular frontend type to see if we
 				 * have a match.
 				 */
 				action = 1;
 				done = 0;
 
 				/*
 				 * Make sure the user isn't trying to set
 				 * WWNs on multiple ports at the same time.
 				 */
 				if (cmd == CTL_SET_PORT_WWNS) {
 					printf("%s: Can't set WWNs on "
 					       "multiple ports\n", __func__);
 					retval = EINVAL;
 					break;
 				}
 			}
 			if (action == 0)
 				continue;
 
 			/*
 			 * XXX KDM we have to drop the lock here, because
 			 * the online/offline operations can potentially
 			 * block.  We need to reference count the frontends
 			 * so they can't go away,
 			 */
 			if (cmd == CTL_ENABLE_PORT) {
 				mtx_unlock(&softc->ctl_lock);
 				ctl_port_online(port);
 				mtx_lock(&softc->ctl_lock);
 			} else if (cmd == CTL_DISABLE_PORT) {
 				mtx_unlock(&softc->ctl_lock);
 				ctl_port_offline(port);
 				mtx_lock(&softc->ctl_lock);
 			} else if (cmd == CTL_SET_PORT_WWNS) {
 				ctl_port_set_wwns(port,
 				    (entry->flags & CTL_PORT_WWNN_VALID) ?
 				    1 : 0, entry->wwnn,
 				    (entry->flags & CTL_PORT_WWPN_VALID) ?
 				    1 : 0, entry->wwpn);
 			}
 			if (done != 0)
 				break;
 		}
 		mtx_unlock(&softc->ctl_lock);
 		break;
 	}
 	case CTL_GET_OOA: {
 		struct ctl_ooa *ooa_hdr;
 		struct ctl_ooa_entry *entries;
 		uint32_t cur_fill_num;
 
 		ooa_hdr = (struct ctl_ooa *)addr;
 
 		if ((ooa_hdr->alloc_len == 0)
 		 || (ooa_hdr->alloc_num == 0)) {
 			printf("%s: CTL_GET_OOA: alloc len %u and alloc num %u "
 			       "must be non-zero\n", __func__,
 			       ooa_hdr->alloc_len, ooa_hdr->alloc_num);
 			retval = EINVAL;
 			break;
 		}
 
 		if (ooa_hdr->alloc_len != (ooa_hdr->alloc_num *
 		    sizeof(struct ctl_ooa_entry))) {
 			printf("%s: CTL_GET_OOA: alloc len %u must be alloc "
 			       "num %d * sizeof(struct ctl_ooa_entry) %zd\n",
 			       __func__, ooa_hdr->alloc_len,
 			       ooa_hdr->alloc_num,sizeof(struct ctl_ooa_entry));
 			retval = EINVAL;
 			break;
 		}
 
 		entries = malloc(ooa_hdr->alloc_len, M_CTL, M_WAITOK | M_ZERO);
 		if (entries == NULL) {
 			printf("%s: could not allocate %d bytes for OOA "
 			       "dump\n", __func__, ooa_hdr->alloc_len);
 			retval = ENOMEM;
 			break;
 		}
 
 		mtx_lock(&softc->ctl_lock);
 		if ((ooa_hdr->flags & CTL_OOA_FLAG_ALL_LUNS) == 0 &&
 		    (ooa_hdr->lun_num >= CTL_MAX_LUNS ||
 		     softc->ctl_luns[ooa_hdr->lun_num] == NULL)) {
 			mtx_unlock(&softc->ctl_lock);
 			free(entries, M_CTL);
 			printf("%s: CTL_GET_OOA: invalid LUN %ju\n",
 			       __func__, (uintmax_t)ooa_hdr->lun_num);
 			retval = EINVAL;
 			break;
 		}
 
 		cur_fill_num = 0;
 
 		if (ooa_hdr->flags & CTL_OOA_FLAG_ALL_LUNS) {
 			STAILQ_FOREACH(lun, &softc->lun_list, links) {
 				ctl_ioctl_fill_ooa(lun, &cur_fill_num,
 				    ooa_hdr, entries);
 			}
 		} else {
 			lun = softc->ctl_luns[ooa_hdr->lun_num];
 			ctl_ioctl_fill_ooa(lun, &cur_fill_num, ooa_hdr,
 			    entries);
 		}
 		mtx_unlock(&softc->ctl_lock);
 
 		ooa_hdr->fill_num = min(cur_fill_num, ooa_hdr->alloc_num);
 		ooa_hdr->fill_len = ooa_hdr->fill_num *
 			sizeof(struct ctl_ooa_entry);
 		retval = copyout(entries, ooa_hdr->entries, ooa_hdr->fill_len);
 		if (retval != 0) {
 			printf("%s: error copying out %d bytes for OOA dump\n", 
 			       __func__, ooa_hdr->fill_len);
 		}
 
 		getbinuptime(&ooa_hdr->cur_bt);
 
 		if (cur_fill_num > ooa_hdr->alloc_num) {
 			ooa_hdr->dropped_num = cur_fill_num -ooa_hdr->alloc_num;
 			ooa_hdr->status = CTL_OOA_NEED_MORE_SPACE;
 		} else {
 			ooa_hdr->dropped_num = 0;
 			ooa_hdr->status = CTL_OOA_OK;
 		}
 
 		free(entries, M_CTL);
 		break;
 	}
 	case CTL_DELAY_IO: {
 		struct ctl_io_delay_info *delay_info;
 
 		delay_info = (struct ctl_io_delay_info *)addr;
 
 #ifdef CTL_IO_DELAY
 		mtx_lock(&softc->ctl_lock);
 		if (delay_info->lun_id >= CTL_MAX_LUNS ||
 		    (lun = softc->ctl_luns[delay_info->lun_id]) == NULL) {
 			mtx_unlock(&softc->ctl_lock);
 			delay_info->status = CTL_DELAY_STATUS_INVALID_LUN;
 			break;
 		}
 		mtx_lock(&lun->lun_lock);
 		mtx_unlock(&softc->ctl_lock);
 		delay_info->status = CTL_DELAY_STATUS_OK;
 		switch (delay_info->delay_type) {
 		case CTL_DELAY_TYPE_CONT:
 		case CTL_DELAY_TYPE_ONESHOT:
 			break;
 		default:
 			delay_info->status = CTL_DELAY_STATUS_INVALID_TYPE;
 			break;
 		}
 		switch (delay_info->delay_loc) {
 		case CTL_DELAY_LOC_DATAMOVE:
 			lun->delay_info.datamove_type = delay_info->delay_type;
 			lun->delay_info.datamove_delay = delay_info->delay_secs;
 			break;
 		case CTL_DELAY_LOC_DONE:
 			lun->delay_info.done_type = delay_info->delay_type;
 			lun->delay_info.done_delay = delay_info->delay_secs;
 			break;
 		default:
 			delay_info->status = CTL_DELAY_STATUS_INVALID_LOC;
 			break;
 		}
 		mtx_unlock(&lun->lun_lock);
 #else
 		delay_info->status = CTL_DELAY_STATUS_NOT_IMPLEMENTED;
 #endif /* CTL_IO_DELAY */
 		break;
 	}
 #ifdef CTL_LEGACY_STATS
 	case CTL_GETSTATS: {
 		struct ctl_stats *stats = (struct ctl_stats *)addr;
 		int i;
 
 		/*
 		 * XXX KDM no locking here.  If the LUN list changes,
 		 * things can blow up.
 		 */
 		i = 0;
 		stats->status = CTL_SS_OK;
 		stats->fill_len = 0;
 		STAILQ_FOREACH(lun, &softc->lun_list, links) {
 			if (stats->fill_len + sizeof(lun->legacy_stats) >
 			    stats->alloc_len) {
 				stats->status = CTL_SS_NEED_MORE_SPACE;
 				break;
 			}
 			retval = copyout(&lun->legacy_stats, &stats->lun_stats[i++],
 					 sizeof(lun->legacy_stats));
 			if (retval != 0)
 				break;
 			stats->fill_len += sizeof(lun->legacy_stats);
 		}
 		stats->num_luns = softc->num_luns;
 		stats->flags = CTL_STATS_FLAG_NONE;
 #ifdef CTL_TIME_IO
 		stats->flags |= CTL_STATS_FLAG_TIME_VALID;
 #endif
 		getnanouptime(&stats->timestamp);
 		break;
 	}
 #endif /* CTL_LEGACY_STATS */
 	case CTL_ERROR_INJECT: {
 		struct ctl_error_desc *err_desc, *new_err_desc;
 
 		err_desc = (struct ctl_error_desc *)addr;
 
 		new_err_desc = malloc(sizeof(*new_err_desc), M_CTL,
 				      M_WAITOK | M_ZERO);
 		bcopy(err_desc, new_err_desc, sizeof(*new_err_desc));
 
 		mtx_lock(&softc->ctl_lock);
 		if (err_desc->lun_id >= CTL_MAX_LUNS ||
 		    (lun = softc->ctl_luns[err_desc->lun_id]) == NULL) {
 			mtx_unlock(&softc->ctl_lock);
 			free(new_err_desc, M_CTL);
 			printf("%s: CTL_ERROR_INJECT: invalid LUN %ju\n",
 			       __func__, (uintmax_t)err_desc->lun_id);
 			retval = EINVAL;
 			break;
 		}
 		mtx_lock(&lun->lun_lock);
 		mtx_unlock(&softc->ctl_lock);
 
 		/*
 		 * We could do some checking here to verify the validity
 		 * of the request, but given the complexity of error
 		 * injection requests, the checking logic would be fairly
 		 * complex.
 		 *
 		 * For now, if the request is invalid, it just won't get
 		 * executed and might get deleted.
 		 */
 		STAILQ_INSERT_TAIL(&lun->error_list, new_err_desc, links);
 
 		/*
 		 * XXX KDM check to make sure the serial number is unique,
 		 * in case we somehow manage to wrap.  That shouldn't
 		 * happen for a very long time, but it's the right thing to
 		 * do.
 		 */
 		new_err_desc->serial = lun->error_serial;
 		err_desc->serial = lun->error_serial;
 		lun->error_serial++;
 
 		mtx_unlock(&lun->lun_lock);
 		break;
 	}
 	case CTL_ERROR_INJECT_DELETE: {
 		struct ctl_error_desc *delete_desc, *desc, *desc2;
 		int delete_done;
 
 		delete_desc = (struct ctl_error_desc *)addr;
 		delete_done = 0;
 
 		mtx_lock(&softc->ctl_lock);
 		if (delete_desc->lun_id >= CTL_MAX_LUNS ||
 		    (lun = softc->ctl_luns[delete_desc->lun_id]) == NULL) {
 			mtx_unlock(&softc->ctl_lock);
 			printf("%s: CTL_ERROR_INJECT_DELETE: invalid LUN %ju\n",
 			       __func__, (uintmax_t)delete_desc->lun_id);
 			retval = EINVAL;
 			break;
 		}
 		mtx_lock(&lun->lun_lock);
 		mtx_unlock(&softc->ctl_lock);
 		STAILQ_FOREACH_SAFE(desc, &lun->error_list, links, desc2) {
 			if (desc->serial != delete_desc->serial)
 				continue;
 
 			STAILQ_REMOVE(&lun->error_list, desc, ctl_error_desc,
 				      links);
 			free(desc, M_CTL);
 			delete_done = 1;
 		}
 		mtx_unlock(&lun->lun_lock);
 		if (delete_done == 0) {
 			printf("%s: CTL_ERROR_INJECT_DELETE: can't find "
 			       "error serial %ju on LUN %u\n", __func__, 
 			       delete_desc->serial, delete_desc->lun_id);
 			retval = EINVAL;
 			break;
 		}
 		break;
 	}
 	case CTL_DUMP_STRUCTS: {
 		int j, k;
 		struct ctl_port *port;
 		struct ctl_frontend *fe;
 
 		mtx_lock(&softc->ctl_lock);
 		printf("CTL Persistent Reservation information start:\n");
 		STAILQ_FOREACH(lun, &softc->lun_list, links) {
 			mtx_lock(&lun->lun_lock);
 			if ((lun->flags & CTL_LUN_DISABLED) != 0) {
 				mtx_unlock(&lun->lun_lock);
 				continue;
 			}
 
 			for (j = 0; j < CTL_MAX_PORTS; j++) {
 				if (lun->pr_keys[j] == NULL)
 					continue;
 				for (k = 0; k < CTL_MAX_INIT_PER_PORT; k++){
 					if (lun->pr_keys[j][k] == 0)
 						continue;
 					printf("  LUN %ju port %d iid %d key "
 					       "%#jx\n", lun->lun, j, k,
 					       (uintmax_t)lun->pr_keys[j][k]);
 				}
 			}
 			mtx_unlock(&lun->lun_lock);
 		}
 		printf("CTL Persistent Reservation information end\n");
 		printf("CTL Ports:\n");
 		STAILQ_FOREACH(port, &softc->port_list, links) {
 			printf("  Port %d '%s' Frontend '%s' Type %u pp %d vp %d WWNN "
 			       "%#jx WWPN %#jx\n", port->targ_port, port->port_name,
 			       port->frontend->name, port->port_type,
 			       port->physical_port, port->virtual_port,
 			       (uintmax_t)port->wwnn, (uintmax_t)port->wwpn);
 			for (j = 0; j < CTL_MAX_INIT_PER_PORT; j++) {
 				if (port->wwpn_iid[j].in_use == 0 &&
 				    port->wwpn_iid[j].wwpn == 0 &&
 				    port->wwpn_iid[j].name == NULL)
 					continue;
 
 				printf("    iid %u use %d WWPN %#jx '%s'\n",
 				    j, port->wwpn_iid[j].in_use,
 				    (uintmax_t)port->wwpn_iid[j].wwpn,
 				    port->wwpn_iid[j].name);
 			}
 		}
 		printf("CTL Port information end\n");
 		mtx_unlock(&softc->ctl_lock);
 		/*
 		 * XXX KDM calling this without a lock.  We'd likely want
 		 * to drop the lock before calling the frontend's dump
 		 * routine anyway.
 		 */
 		printf("CTL Frontends:\n");
 		STAILQ_FOREACH(fe, &softc->fe_list, links) {
 			printf("  Frontend '%s'\n", fe->name);
 			if (fe->fe_dump != NULL)
 				fe->fe_dump();
 		}
 		printf("CTL Frontend information end\n");
 		break;
 	}
 	case CTL_LUN_REQ: {
 		struct ctl_lun_req *lun_req;
 		struct ctl_backend_driver *backend;
 
 		lun_req = (struct ctl_lun_req *)addr;
 
 		backend = ctl_backend_find(lun_req->backend);
 		if (backend == NULL) {
 			lun_req->status = CTL_LUN_ERROR;
 			snprintf(lun_req->error_str,
 				 sizeof(lun_req->error_str),
 				 "Backend \"%s\" not found.",
 				 lun_req->backend);
 			break;
 		}
 		if (lun_req->num_be_args > 0) {
 			lun_req->kern_be_args = ctl_copyin_args(
 				lun_req->num_be_args,
 				lun_req->be_args,
 				lun_req->error_str,
 				sizeof(lun_req->error_str));
 			if (lun_req->kern_be_args == NULL) {
 				lun_req->status = CTL_LUN_ERROR;
 				break;
 			}
 		}
 
 		retval = backend->ioctl(dev, cmd, addr, flag, td);
 
 		if (lun_req->num_be_args > 0) {
 			ctl_copyout_args(lun_req->num_be_args,
 				      lun_req->kern_be_args);
 			ctl_free_args(lun_req->num_be_args,
 				      lun_req->kern_be_args);
 		}
 		break;
 	}
 	case CTL_LUN_LIST: {
 		struct sbuf *sb;
 		struct ctl_lun_list *list;
 		struct ctl_option *opt;
 
 		list = (struct ctl_lun_list *)addr;
 
 		/*
 		 * Allocate a fixed length sbuf here, based on the length
 		 * of the user's buffer.  We could allocate an auto-extending
 		 * buffer, and then tell the user how much larger our
 		 * amount of data is than his buffer, but that presents
 		 * some problems:
 		 *
 		 * 1.  The sbuf(9) routines use a blocking malloc, and so
 		 *     we can't hold a lock while calling them with an
 		 *     auto-extending buffer.
  		 *
 		 * 2.  There is not currently a LUN reference counting
 		 *     mechanism, outside of outstanding transactions on
 		 *     the LUN's OOA queue.  So a LUN could go away on us
 		 *     while we're getting the LUN number, backend-specific
 		 *     information, etc.  Thus, given the way things
 		 *     currently work, we need to hold the CTL lock while
 		 *     grabbing LUN information.
 		 *
 		 * So, from the user's standpoint, the best thing to do is
 		 * allocate what he thinks is a reasonable buffer length,
 		 * and then if he gets a CTL_LUN_LIST_NEED_MORE_SPACE error,
 		 * double the buffer length and try again.  (And repeat
 		 * that until he succeeds.)
 		 */
 		sb = sbuf_new(NULL, NULL, list->alloc_len, SBUF_FIXEDLEN);
 		if (sb == NULL) {
 			list->status = CTL_LUN_LIST_ERROR;
 			snprintf(list->error_str, sizeof(list->error_str),
 				 "Unable to allocate %d bytes for LUN list",
 				 list->alloc_len);
 			break;
 		}
 
 		sbuf_printf(sb, "<ctllunlist>\n");
 
 		mtx_lock(&softc->ctl_lock);
 		STAILQ_FOREACH(lun, &softc->lun_list, links) {
 			mtx_lock(&lun->lun_lock);
 			retval = sbuf_printf(sb, "<lun id=\"%ju\">\n",
 					     (uintmax_t)lun->lun);
 
 			/*
 			 * Bail out as soon as we see that we've overfilled
 			 * the buffer.
 			 */
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<backend_type>%s"
 					     "</backend_type>\n",
 					     (lun->backend == NULL) ?  "none" :
 					     lun->backend->name);
 
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<lun_type>%d</lun_type>\n",
 					     lun->be_lun->lun_type);
 
 			if (retval != 0)
 				break;
 
 			if (lun->backend == NULL) {
 				retval = sbuf_printf(sb, "</lun>\n");
 				if (retval != 0)
 					break;
 				continue;
 			}
 
 			retval = sbuf_printf(sb, "\t<size>%ju</size>\n",
 					     (lun->be_lun->maxlba > 0) ?
 					     lun->be_lun->maxlba + 1 : 0);
 
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<blocksize>%u</blocksize>\n",
 					     lun->be_lun->blocksize);
 
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<serial_number>");
 
 			if (retval != 0)
 				break;
 
 			retval = ctl_sbuf_printf_esc(sb,
 			    lun->be_lun->serial_num,
 			    sizeof(lun->be_lun->serial_num));
 
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "</serial_number>\n");
 		
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<device_id>");
 
 			if (retval != 0)
 				break;
 
 			retval = ctl_sbuf_printf_esc(sb,
 			    lun->be_lun->device_id,
 			    sizeof(lun->be_lun->device_id));
 
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "</device_id>\n");
 
 			if (retval != 0)
 				break;
 
 			if (lun->backend->lun_info != NULL) {
 				retval = lun->backend->lun_info(lun->be_lun->be_lun, sb);
 				if (retval != 0)
 					break;
 			}
 			STAILQ_FOREACH(opt, &lun->be_lun->options, links) {
 				retval = sbuf_printf(sb, "\t<%s>%s</%s>\n",
 				    opt->name, opt->value, opt->name);
 				if (retval != 0)
 					break;
 			}
 
 			retval = sbuf_printf(sb, "</lun>\n");
 
 			if (retval != 0)
 				break;
 			mtx_unlock(&lun->lun_lock);
 		}
 		if (lun != NULL)
 			mtx_unlock(&lun->lun_lock);
 		mtx_unlock(&softc->ctl_lock);
 
 		if ((retval != 0)
 		 || ((retval = sbuf_printf(sb, "</ctllunlist>\n")) != 0)) {
 			retval = 0;
 			sbuf_delete(sb);
 			list->status = CTL_LUN_LIST_NEED_MORE_SPACE;
 			snprintf(list->error_str, sizeof(list->error_str),
 				 "Out of space, %d bytes is too small",
 				 list->alloc_len);
 			break;
 		}
 
 		sbuf_finish(sb);
 
 		retval = copyout(sbuf_data(sb), list->lun_xml,
 				 sbuf_len(sb) + 1);
 
 		list->fill_len = sbuf_len(sb) + 1;
 		list->status = CTL_LUN_LIST_OK;
 		sbuf_delete(sb);
 		break;
 	}
 	case CTL_ISCSI: {
 		struct ctl_iscsi *ci;
 		struct ctl_frontend *fe;
 
 		ci = (struct ctl_iscsi *)addr;
 
 		fe = ctl_frontend_find("iscsi");
 		if (fe == NULL) {
 			ci->status = CTL_ISCSI_ERROR;
 			snprintf(ci->error_str, sizeof(ci->error_str),
 			    "Frontend \"iscsi\" not found.");
 			break;
 		}
 
 		retval = fe->ioctl(dev, cmd, addr, flag, td);
 		break;
 	}
 	case CTL_PORT_REQ: {
 		struct ctl_req *req;
 		struct ctl_frontend *fe;
 
 		req = (struct ctl_req *)addr;
 
 		fe = ctl_frontend_find(req->driver);
 		if (fe == NULL) {
 			req->status = CTL_LUN_ERROR;
 			snprintf(req->error_str, sizeof(req->error_str),
 			    "Frontend \"%s\" not found.", req->driver);
 			break;
 		}
 		if (req->num_args > 0) {
 			req->kern_args = ctl_copyin_args(req->num_args,
 			    req->args, req->error_str, sizeof(req->error_str));
 			if (req->kern_args == NULL) {
 				req->status = CTL_LUN_ERROR;
 				break;
 			}
 		}
 
 		if (fe->ioctl)
 			retval = fe->ioctl(dev, cmd, addr, flag, td);
 		else
 			retval = ENODEV;
 
 		if (req->num_args > 0) {
 			ctl_copyout_args(req->num_args, req->kern_args);
 			ctl_free_args(req->num_args, req->kern_args);
 		}
 		break;
 	}
 	case CTL_PORT_LIST: {
 		struct sbuf *sb;
 		struct ctl_port *port;
 		struct ctl_lun_list *list;
 		struct ctl_option *opt;
 		int j;
 		uint32_t plun;
 
 		list = (struct ctl_lun_list *)addr;
 
 		sb = sbuf_new(NULL, NULL, list->alloc_len, SBUF_FIXEDLEN);
 		if (sb == NULL) {
 			list->status = CTL_LUN_LIST_ERROR;
 			snprintf(list->error_str, sizeof(list->error_str),
 				 "Unable to allocate %d bytes for LUN list",
 				 list->alloc_len);
 			break;
 		}
 
 		sbuf_printf(sb, "<ctlportlist>\n");
 
 		mtx_lock(&softc->ctl_lock);
 		STAILQ_FOREACH(port, &softc->port_list, links) {
 			retval = sbuf_printf(sb, "<targ_port id=\"%ju\">\n",
 					     (uintmax_t)port->targ_port);
 
 			/*
 			 * Bail out as soon as we see that we've overfilled
 			 * the buffer.
 			 */
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<frontend_type>%s"
 			    "</frontend_type>\n", port->frontend->name);
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<port_type>%d</port_type>\n",
 					     port->port_type);
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<online>%s</online>\n",
 			    (port->status & CTL_PORT_STATUS_ONLINE) ? "YES" : "NO");
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<port_name>%s</port_name>\n",
 			    port->port_name);
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<physical_port>%d</physical_port>\n",
 			    port->physical_port);
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "\t<virtual_port>%d</virtual_port>\n",
 			    port->virtual_port);
 			if (retval != 0)
 				break;
 
 			if (port->target_devid != NULL) {
 				sbuf_printf(sb, "\t<target>");
 				ctl_id_sbuf(port->target_devid, sb);
 				sbuf_printf(sb, "</target>\n");
 			}
 
 			if (port->port_devid != NULL) {
 				sbuf_printf(sb, "\t<port>");
 				ctl_id_sbuf(port->port_devid, sb);
 				sbuf_printf(sb, "</port>\n");
 			}
 
 			if (port->port_info != NULL) {
 				retval = port->port_info(port->onoff_arg, sb);
 				if (retval != 0)
 					break;
 			}
 			STAILQ_FOREACH(opt, &port->options, links) {
 				retval = sbuf_printf(sb, "\t<%s>%s</%s>\n",
 				    opt->name, opt->value, opt->name);
 				if (retval != 0)
 					break;
 			}
 
 			if (port->lun_map != NULL) {
 				sbuf_printf(sb, "\t<lun_map>on</lun_map>\n");
 				for (j = 0; j < port->lun_map_size; j++) {
 					plun = ctl_lun_map_from_port(port, j);
 					if (plun == UINT32_MAX)
 						continue;
 					sbuf_printf(sb,
 					    "\t<lun id=\"%u\">%u</lun>\n",
 					    j, plun);
 				}
 			}
 
 			for (j = 0; j < CTL_MAX_INIT_PER_PORT; j++) {
 				if (port->wwpn_iid[j].in_use == 0 ||
 				    (port->wwpn_iid[j].wwpn == 0 &&
 				     port->wwpn_iid[j].name == NULL))
 					continue;
 
 				if (port->wwpn_iid[j].name != NULL)
 					retval = sbuf_printf(sb,
 					    "\t<initiator id=\"%u\">%s</initiator>\n",
 					    j, port->wwpn_iid[j].name);
 				else
 					retval = sbuf_printf(sb,
 					    "\t<initiator id=\"%u\">naa.%08jx</initiator>\n",
 					    j, port->wwpn_iid[j].wwpn);
 				if (retval != 0)
 					break;
 			}
 			if (retval != 0)
 				break;
 
 			retval = sbuf_printf(sb, "</targ_port>\n");
 			if (retval != 0)
 				break;
 		}
 		mtx_unlock(&softc->ctl_lock);
 
 		if ((retval != 0)
 		 || ((retval = sbuf_printf(sb, "</ctlportlist>\n")) != 0)) {
 			retval = 0;
 			sbuf_delete(sb);
 			list->status = CTL_LUN_LIST_NEED_MORE_SPACE;
 			snprintf(list->error_str, sizeof(list->error_str),
 				 "Out of space, %d bytes is too small",
 				 list->alloc_len);
 			break;
 		}
 
 		sbuf_finish(sb);
 
 		retval = copyout(sbuf_data(sb), list->lun_xml,
 				 sbuf_len(sb) + 1);
 
 		list->fill_len = sbuf_len(sb) + 1;
 		list->status = CTL_LUN_LIST_OK;
 		sbuf_delete(sb);
 		break;
 	}
 	case CTL_LUN_MAP: {
 		struct ctl_lun_map *lm  = (struct ctl_lun_map *)addr;
 		struct ctl_port *port;
 
 		mtx_lock(&softc->ctl_lock);
 		if (lm->port < softc->port_min ||
 		    lm->port >= softc->port_max ||
 		    (port = softc->ctl_ports[lm->port]) == NULL) {
 			mtx_unlock(&softc->ctl_lock);
 			return (ENXIO);
 		}
 		if (port->status & CTL_PORT_STATUS_ONLINE) {
 			STAILQ_FOREACH(lun, &softc->lun_list, links) {
 				if (ctl_lun_map_to_port(port, lun->lun) ==
 				    UINT32_MAX)
 					continue;
 				mtx_lock(&lun->lun_lock);
 				ctl_est_ua_port(lun, lm->port, -1,
 				    CTL_UA_LUN_CHANGE);
 				mtx_unlock(&lun->lun_lock);
 			}
 		}
 		mtx_unlock(&softc->ctl_lock); // XXX: port_enable sleeps
 		if (lm->plun != UINT32_MAX) {
 			if (lm->lun == UINT32_MAX)
 				retval = ctl_lun_map_unset(port, lm->plun);
 			else if (lm->lun < CTL_MAX_LUNS &&
 			    softc->ctl_luns[lm->lun] != NULL)
 				retval = ctl_lun_map_set(port, lm->plun, lm->lun);
 			else
 				return (ENXIO);
 		} else {
 			if (lm->lun == UINT32_MAX)
 				retval = ctl_lun_map_deinit(port);
 			else
 				retval = ctl_lun_map_init(port);
 		}
 		if (port->status & CTL_PORT_STATUS_ONLINE)
 			ctl_isc_announce_port(port);
 		break;
 	}
 	case CTL_GET_LUN_STATS: {
 		struct ctl_get_io_stats *stats = (struct ctl_get_io_stats *)addr;
 		int i;
 
 		/*
 		 * XXX KDM no locking here.  If the LUN list changes,
 		 * things can blow up.
 		 */
 		i = 0;
 		stats->status = CTL_SS_OK;
 		stats->fill_len = 0;
 		STAILQ_FOREACH(lun, &softc->lun_list, links) {
 			if (lun->lun < stats->first_item)
 				continue;
 			if (stats->fill_len + sizeof(lun->stats) >
 			    stats->alloc_len) {
 				stats->status = CTL_SS_NEED_MORE_SPACE;
 				break;
 			}
 			retval = copyout(&lun->stats, &stats->stats[i++],
 					 sizeof(lun->stats));
 			if (retval != 0)
 				break;
 			stats->fill_len += sizeof(lun->stats);
 		}
 		stats->num_items = softc->num_luns;
 		stats->flags = CTL_STATS_FLAG_NONE;
 #ifdef CTL_TIME_IO
 		stats->flags |= CTL_STATS_FLAG_TIME_VALID;
 #endif
 		getnanouptime(&stats->timestamp);
 		break;
 	}
 	case CTL_GET_PORT_STATS: {
 		struct ctl_get_io_stats *stats = (struct ctl_get_io_stats *)addr;
 		int i;
 
 		/*
 		 * XXX KDM no locking here.  If the LUN list changes,
 		 * things can blow up.
 		 */
 		i = 0;
 		stats->status = CTL_SS_OK;
 		stats->fill_len = 0;
 		STAILQ_FOREACH(port, &softc->port_list, links) {
 			if (port->targ_port < stats->first_item)
 				continue;
 			if (stats->fill_len + sizeof(port->stats) >
 			    stats->alloc_len) {
 				stats->status = CTL_SS_NEED_MORE_SPACE;
 				break;
 			}
 			retval = copyout(&port->stats, &stats->stats[i++],
 					 sizeof(port->stats));
 			if (retval != 0)
 				break;
 			stats->fill_len += sizeof(port->stats);
 		}
 		stats->num_items = softc->num_ports;
 		stats->flags = CTL_STATS_FLAG_NONE;
 #ifdef CTL_TIME_IO
 		stats->flags |= CTL_STATS_FLAG_TIME_VALID;
 #endif
 		getnanouptime(&stats->timestamp);
 		break;
 	}
 	default: {
 		/* XXX KDM should we fix this? */
 #if 0
 		struct ctl_backend_driver *backend;
 		unsigned int type;
 		int found;
 
 		found = 0;
 
 		/*
 		 * We encode the backend type as the ioctl type for backend
 		 * ioctls.  So parse it out here, and then search for a
 		 * backend of this type.
 		 */
 		type = _IOC_TYPE(cmd);
 
 		STAILQ_FOREACH(backend, &softc->be_list, links) {
 			if (backend->type == type) {
 				found = 1;
 				break;
 			}
 		}
 		if (found == 0) {
 			printf("ctl: unknown ioctl command %#lx or backend "
 			       "%d\n", cmd, type);
 			retval = EINVAL;
 			break;
 		}
 		retval = backend->ioctl(dev, cmd, addr, flag, td);
 #endif
 		retval = ENOTTY;
 		break;
 	}
 	}
 	return (retval);
 }
 
 uint32_t
 ctl_get_initindex(struct ctl_nexus *nexus)
 {
 	return (nexus->initid + (nexus->targ_port * CTL_MAX_INIT_PER_PORT));
 }
 
 int
 ctl_lun_map_init(struct ctl_port *port)
 {
 	struct ctl_softc *softc = port->ctl_softc;
 	struct ctl_lun *lun;
 	int size = ctl_lun_map_size;
 	uint32_t i;
 
 	if (port->lun_map == NULL || port->lun_map_size < size) {
 		port->lun_map_size = 0;
 		free(port->lun_map, M_CTL);
 		port->lun_map = malloc(size * sizeof(uint32_t),
 		    M_CTL, M_NOWAIT);
 	}
 	if (port->lun_map == NULL)
 		return (ENOMEM);
 	for (i = 0; i < size; i++)
 		port->lun_map[i] = UINT32_MAX;
 	port->lun_map_size = size;
 	if (port->status & CTL_PORT_STATUS_ONLINE) {
 		if (port->lun_disable != NULL) {
 			STAILQ_FOREACH(lun, &softc->lun_list, links)
 				port->lun_disable(port->targ_lun_arg, lun->lun);
 		}
 		ctl_isc_announce_port(port);
 	}
 	return (0);
 }
 
 int
 ctl_lun_map_deinit(struct ctl_port *port)
 {
 	struct ctl_softc *softc = port->ctl_softc;
 	struct ctl_lun *lun;
 
 	if (port->lun_map == NULL)
 		return (0);
 	port->lun_map_size = 0;
 	free(port->lun_map, M_CTL);
 	port->lun_map = NULL;
 	if (port->status & CTL_PORT_STATUS_ONLINE) {
 		if (port->lun_enable != NULL) {
 			STAILQ_FOREACH(lun, &softc->lun_list, links)
 				port->lun_enable(port->targ_lun_arg, lun->lun);
 		}
 		ctl_isc_announce_port(port);
 	}
 	return (0);
 }
 
 int
 ctl_lun_map_set(struct ctl_port *port, uint32_t plun, uint32_t glun)
 {
 	int status;
 	uint32_t old;
 
 	if (port->lun_map == NULL) {
 		status = ctl_lun_map_init(port);
 		if (status != 0)
 			return (status);
 	}
 	if (plun >= port->lun_map_size)
 		return (EINVAL);
 	old = port->lun_map[plun];
 	port->lun_map[plun] = glun;
 	if ((port->status & CTL_PORT_STATUS_ONLINE) && old == UINT32_MAX) {
 		if (port->lun_enable != NULL)
 			port->lun_enable(port->targ_lun_arg, plun);
 		ctl_isc_announce_port(port);
 	}
 	return (0);
 }
 
 int
 ctl_lun_map_unset(struct ctl_port *port, uint32_t plun)
 {
 	uint32_t old;
 
 	if (port->lun_map == NULL || plun >= port->lun_map_size)
 		return (0);
 	old = port->lun_map[plun];
 	port->lun_map[plun] = UINT32_MAX;
 	if ((port->status & CTL_PORT_STATUS_ONLINE) && old != UINT32_MAX) {
 		if (port->lun_disable != NULL)
 			port->lun_disable(port->targ_lun_arg, plun);
 		ctl_isc_announce_port(port);
 	}
 	return (0);
 }
 
 uint32_t
 ctl_lun_map_from_port(struct ctl_port *port, uint32_t lun_id)
 {
 
 	if (port == NULL)
 		return (UINT32_MAX);
 	if (port->lun_map == NULL)
 		return (lun_id);
 	if (lun_id > port->lun_map_size)
 		return (UINT32_MAX);
 	return (port->lun_map[lun_id]);
 }
 
 uint32_t
 ctl_lun_map_to_port(struct ctl_port *port, uint32_t lun_id)
 {
 	uint32_t i;
 
 	if (port == NULL)
 		return (UINT32_MAX);
 	if (port->lun_map == NULL)
 		return (lun_id);
 	for (i = 0; i < port->lun_map_size; i++) {
 		if (port->lun_map[i] == lun_id)
 			return (i);
 	}
 	return (UINT32_MAX);
 }
 
 uint32_t
 ctl_decode_lun(uint64_t encoded)
 {
 	uint8_t lun[8];
 	uint32_t result = 0xffffffff;
 
 	be64enc(lun, encoded);
 	switch (lun[0] & RPL_LUNDATA_ATYP_MASK) {
 	case RPL_LUNDATA_ATYP_PERIPH:
 		if ((lun[0] & 0x3f) == 0 && lun[2] == 0 && lun[3] == 0 &&
 		    lun[4] == 0 && lun[5] == 0 && lun[6] == 0 && lun[7] == 0)
 			result = lun[1];
 		break;
 	case RPL_LUNDATA_ATYP_FLAT:
 		if (lun[2] == 0 && lun[3] == 0 && lun[4] == 0 && lun[5] == 0 &&
 		    lun[6] == 0 && lun[7] == 0)
 			result = ((lun[0] & 0x3f) << 8) + lun[1];
 		break;
 	case RPL_LUNDATA_ATYP_EXTLUN:
 		switch (lun[0] & RPL_LUNDATA_EXT_EAM_MASK) {
 		case 0x02:
 			switch (lun[0] & RPL_LUNDATA_EXT_LEN_MASK) {
 			case 0x00:
 				result = lun[1];
 				break;
 			case 0x10:
 				result = (lun[1] << 16) + (lun[2] << 8) +
 				    lun[3];
 				break;
 			case 0x20:
 				if (lun[1] == 0 && lun[6] == 0 && lun[7] == 0)
 					result = (lun[2] << 24) +
 					    (lun[3] << 16) + (lun[4] << 8) +
 					    lun[5];
 				break;
 			}
 			break;
 		case RPL_LUNDATA_EXT_EAM_NOT_SPEC:
 			result = 0xffffffff;
 			break;
 		}
 		break;
 	}
 	return (result);
 }
 
 uint64_t
 ctl_encode_lun(uint32_t decoded)
 {
 	uint64_t l = decoded;
 
 	if (l <= 0xff)
 		return (((uint64_t)RPL_LUNDATA_ATYP_PERIPH << 56) | (l << 48));
 	if (l <= 0x3fff)
 		return (((uint64_t)RPL_LUNDATA_ATYP_FLAT << 56) | (l << 48));
 	if (l <= 0xffffff)
 		return (((uint64_t)(RPL_LUNDATA_ATYP_EXTLUN | 0x12) << 56) |
 		    (l << 32));
 	return ((((uint64_t)RPL_LUNDATA_ATYP_EXTLUN | 0x22) << 56) | (l << 16));
 }
 
 int
 ctl_ffz(uint32_t *mask, uint32_t first, uint32_t last)
 {
 	int i;
 
 	for (i = first; i < last; i++) {
 		if ((mask[i / 32] & (1 << (i % 32))) == 0)
 			return (i);
 	}
 	return (-1);
 }
 
 int
 ctl_set_mask(uint32_t *mask, uint32_t bit)
 {
 	uint32_t chunk, piece;
 
 	chunk = bit >> 5;
 	piece = bit % (sizeof(uint32_t) * 8);
 
 	if ((mask[chunk] & (1 << piece)) != 0)
 		return (-1);
 	else
 		mask[chunk] |= (1 << piece);
 
 	return (0);
 }
 
 int
 ctl_clear_mask(uint32_t *mask, uint32_t bit)
 {
 	uint32_t chunk, piece;
 
 	chunk = bit >> 5;
 	piece = bit % (sizeof(uint32_t) * 8);
 
 	if ((mask[chunk] & (1 << piece)) == 0)
 		return (-1);
 	else
 		mask[chunk] &= ~(1 << piece);
 
 	return (0);
 }
 
 int
 ctl_is_set(uint32_t *mask, uint32_t bit)
 {
 	uint32_t chunk, piece;
 
 	chunk = bit >> 5;
 	piece = bit % (sizeof(uint32_t) * 8);
 
 	if ((mask[chunk] & (1 << piece)) == 0)
 		return (0);
 	else
 		return (1);
 }
 
 static uint64_t
 ctl_get_prkey(struct ctl_lun *lun, uint32_t residx)
 {
 	uint64_t *t;
 
 	t = lun->pr_keys[residx/CTL_MAX_INIT_PER_PORT];
 	if (t == NULL)
 		return (0);
 	return (t[residx % CTL_MAX_INIT_PER_PORT]);
 }
 
 static void
 ctl_clr_prkey(struct ctl_lun *lun, uint32_t residx)
 {
 	uint64_t *t;
 
 	t = lun->pr_keys[residx/CTL_MAX_INIT_PER_PORT];
 	if (t == NULL)
 		return;
 	t[residx % CTL_MAX_INIT_PER_PORT] = 0;
 }
 
 static void
 ctl_alloc_prkey(struct ctl_lun *lun, uint32_t residx)
 {
 	uint64_t *p;
 	u_int i;
 
 	i = residx/CTL_MAX_INIT_PER_PORT;
 	if (lun->pr_keys[i] != NULL)
 		return;
 	mtx_unlock(&lun->lun_lock);
 	p = malloc(sizeof(uint64_t) * CTL_MAX_INIT_PER_PORT, M_CTL,
 	    M_WAITOK | M_ZERO);
 	mtx_lock(&lun->lun_lock);
 	if (lun->pr_keys[i] == NULL)
 		lun->pr_keys[i] = p;
 	else
 		free(p, M_CTL);
 }
 
 static void
 ctl_set_prkey(struct ctl_lun *lun, uint32_t residx, uint64_t key)
 {
 	uint64_t *t;
 
 	t = lun->pr_keys[residx/CTL_MAX_INIT_PER_PORT];
 	KASSERT(t != NULL, ("prkey %d is not allocated", residx));
 	t[residx % CTL_MAX_INIT_PER_PORT] = key;
 }
 
 /*
  * ctl_softc, pool_name, total_ctl_io are passed in.
  * npool is passed out.
  */
 int
 ctl_pool_create(struct ctl_softc *ctl_softc, const char *pool_name,
 		uint32_t total_ctl_io, void **npool)
 {
 	struct ctl_io_pool *pool;
 
 	pool = (struct ctl_io_pool *)malloc(sizeof(*pool), M_CTL,
 					    M_NOWAIT | M_ZERO);
 	if (pool == NULL)
 		return (ENOMEM);
 
 	snprintf(pool->name, sizeof(pool->name), "CTL IO %s", pool_name);
 	pool->ctl_softc = ctl_softc;
 #ifdef IO_POOLS
 	pool->zone = uma_zsecond_create(pool->name, NULL,
 	    NULL, NULL, NULL, ctl_softc->io_zone);
 	/* uma_prealloc(pool->zone, total_ctl_io); */
 #else
 	pool->zone = ctl_softc->io_zone;
 #endif
 
 	*npool = pool;
 	return (0);
 }
 
 void
 ctl_pool_free(struct ctl_io_pool *pool)
 {
 
 	if (pool == NULL)
 		return;
 
 #ifdef IO_POOLS
 	uma_zdestroy(pool->zone);
 #endif
 	free(pool, M_CTL);
 }
 
 union ctl_io *
 ctl_alloc_io(void *pool_ref)
 {
 	struct ctl_io_pool *pool = (struct ctl_io_pool *)pool_ref;
 	union ctl_io *io;
 
 	io = uma_zalloc(pool->zone, M_WAITOK);
 	if (io != NULL) {
 		io->io_hdr.pool = pool_ref;
 		CTL_SOFTC(io) = pool->ctl_softc;
 	}
 	return (io);
 }
 
 union ctl_io *
 ctl_alloc_io_nowait(void *pool_ref)
 {
 	struct ctl_io_pool *pool = (struct ctl_io_pool *)pool_ref;
 	union ctl_io *io;
 
 	io = uma_zalloc(pool->zone, M_NOWAIT);
 	if (io != NULL) {
 		io->io_hdr.pool = pool_ref;
 		CTL_SOFTC(io) = pool->ctl_softc;
 	}
 	return (io);
 }
 
 void
 ctl_free_io(union ctl_io *io)
 {
 	struct ctl_io_pool *pool;
 
 	if (io == NULL)
 		return;
 
 	pool = (struct ctl_io_pool *)io->io_hdr.pool;
 	uma_zfree(pool->zone, io);
 }
 
 void
 ctl_zero_io(union ctl_io *io)
 {
 	struct ctl_io_pool *pool;
 
 	if (io == NULL)
 		return;
 
 	/*
 	 * May need to preserve linked list pointers at some point too.
 	 */
 	pool = io->io_hdr.pool;
 	memset(io, 0, sizeof(*io));
 	io->io_hdr.pool = pool;
 	CTL_SOFTC(io) = pool->ctl_softc;
 }
 
 int
 ctl_expand_number(const char *buf, uint64_t *num)
 {
 	char *endptr;
 	uint64_t number;
 	unsigned shift;
 
 	number = strtoq(buf, &endptr, 0);
 
 	switch (tolower((unsigned char)*endptr)) {
 	case 'e':
 		shift = 60;
 		break;
 	case 'p':
 		shift = 50;
 		break;
 	case 't':
 		shift = 40;
 		break;
 	case 'g':
 		shift = 30;
 		break;
 	case 'm':
 		shift = 20;
 		break;
 	case 'k':
 		shift = 10;
 		break;
 	case 'b':
 	case '\0': /* No unit. */
 		*num = number;
 		return (0);
 	default:
 		/* Unrecognized unit. */
 		return (-1);
 	}
 
 	if ((number << shift) >> shift != number) {
 		/* Overflow */
 		return (-1);
 	}
 	*num = number << shift;
 	return (0);
 }
 
 
 /*
  * This routine could be used in the future to load default and/or saved
  * mode page parameters for a particuar lun.
  */
 static int
 ctl_init_page_index(struct ctl_lun *lun)
 {
 	int i, page_code;
 	struct ctl_page_index *page_index;
 	const char *value;
 	uint64_t ival;
 
 	memcpy(&lun->mode_pages.index, page_index_template,
 	       sizeof(page_index_template));
 
 	for (i = 0; i < CTL_NUM_MODE_PAGES; i++) {
 
 		page_index = &lun->mode_pages.index[i];
 		if (lun->be_lun->lun_type == T_DIRECT &&
 		    (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0)
 			continue;
 		if (lun->be_lun->lun_type == T_PROCESSOR &&
 		    (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0)
 			continue;
 		if (lun->be_lun->lun_type == T_CDROM &&
 		    (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0)
 			continue;
 
 		page_code = page_index->page_code & SMPH_PC_MASK;
 		switch (page_code) {
 		case SMS_RW_ERROR_RECOVERY_PAGE: {
 			KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0,
 			    ("subpage %#x for page %#x is incorrect!",
 			    page_index->subpage, page_code));
 			memcpy(&lun->mode_pages.rw_er_page[CTL_PAGE_CURRENT],
 			       &rw_er_page_default,
 			       sizeof(rw_er_page_default));
 			memcpy(&lun->mode_pages.rw_er_page[CTL_PAGE_CHANGEABLE],
 			       &rw_er_page_changeable,
 			       sizeof(rw_er_page_changeable));
 			memcpy(&lun->mode_pages.rw_er_page[CTL_PAGE_DEFAULT],
 			       &rw_er_page_default,
 			       sizeof(rw_er_page_default));
 			memcpy(&lun->mode_pages.rw_er_page[CTL_PAGE_SAVED],
 			       &rw_er_page_default,
 			       sizeof(rw_er_page_default));
 			page_index->page_data =
 				(uint8_t *)lun->mode_pages.rw_er_page;
 			break;
 		}
 		case SMS_FORMAT_DEVICE_PAGE: {
 			struct scsi_format_page *format_page;
 
 			KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0,
 			    ("subpage %#x for page %#x is incorrect!",
 			    page_index->subpage, page_code));
 
 			/*
 			 * Sectors per track are set above.  Bytes per
 			 * sector need to be set here on a per-LUN basis.
 			 */
 			memcpy(&lun->mode_pages.format_page[CTL_PAGE_CURRENT],
 			       &format_page_default,
 			       sizeof(format_page_default));
 			memcpy(&lun->mode_pages.format_page[
 			       CTL_PAGE_CHANGEABLE], &format_page_changeable,
 			       sizeof(format_page_changeable));
 			memcpy(&lun->mode_pages.format_page[CTL_PAGE_DEFAULT],
 			       &format_page_default,
 			       sizeof(format_page_default));
 			memcpy(&lun->mode_pages.format_page[CTL_PAGE_SAVED],
 			       &format_page_default,
 			       sizeof(format_page_default));
 
 			format_page = &lun->mode_pages.format_page[
 				CTL_PAGE_CURRENT];
 			scsi_ulto2b(lun->be_lun->blocksize,
 				    format_page->bytes_per_sector);
 
 			format_page = &lun->mode_pages.format_page[
 				CTL_PAGE_DEFAULT];
 			scsi_ulto2b(lun->be_lun->blocksize,
 				    format_page->bytes_per_sector);
 
 			format_page = &lun->mode_pages.format_page[
 				CTL_PAGE_SAVED];
 			scsi_ulto2b(lun->be_lun->blocksize,
 				    format_page->bytes_per_sector);
 
 			page_index->page_data =
 				(uint8_t *)lun->mode_pages.format_page;
 			break;
 		}
 		case SMS_RIGID_DISK_PAGE: {
 			struct scsi_rigid_disk_page *rigid_disk_page;
 			uint32_t sectors_per_cylinder;
 			uint64_t cylinders;
 #ifndef	__XSCALE__
 			int shift;
 #endif /* !__XSCALE__ */
 
 			KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0,
 			    ("subpage %#x for page %#x is incorrect!",
 			    page_index->subpage, page_code));
 
 			/*
 			 * Rotation rate and sectors per track are set
 			 * above.  We calculate the cylinders here based on
 			 * capacity.  Due to the number of heads and
 			 * sectors per track we're using, smaller arrays
 			 * may turn out to have 0 cylinders.  Linux and
 			 * FreeBSD don't pay attention to these mode pages
 			 * to figure out capacity, but Solaris does.  It
 			 * seems to deal with 0 cylinders just fine, and
 			 * works out a fake geometry based on the capacity.
 			 */
 			memcpy(&lun->mode_pages.rigid_disk_page[
 			       CTL_PAGE_DEFAULT], &rigid_disk_page_default,
 			       sizeof(rigid_disk_page_default));
 			memcpy(&lun->mode_pages.rigid_disk_page[
 			       CTL_PAGE_CHANGEABLE],&rigid_disk_page_changeable,
 			       sizeof(rigid_disk_page_changeable));
 
 			sectors_per_cylinder = CTL_DEFAULT_SECTORS_PER_TRACK *
 				CTL_DEFAULT_HEADS;
 
 			/*
 			 * The divide method here will be more accurate,
 			 * probably, but results in floating point being
 			 * used in the kernel on i386 (__udivdi3()).  On the
 			 * XScale, though, __udivdi3() is implemented in
 			 * software.
 			 *
 			 * The shift method for cylinder calculation is
 			 * accurate if sectors_per_cylinder is a power of
 			 * 2.  Otherwise it might be slightly off -- you
 			 * might have a bit of a truncation problem.
 			 */
 #ifdef	__XSCALE__
 			cylinders = (lun->be_lun->maxlba + 1) /
 				sectors_per_cylinder;
 #else
 			for (shift = 31; shift > 0; shift--) {
 				if (sectors_per_cylinder & (1 << shift))
 					break;
 			}
 			cylinders = (lun->be_lun->maxlba + 1) >> shift;
 #endif
 
 			/*
 			 * We've basically got 3 bytes, or 24 bits for the
 			 * cylinder size in the mode page.  If we're over,
 			 * just round down to 2^24.
 			 */
 			if (cylinders > 0xffffff)
 				cylinders = 0xffffff;
 
 			rigid_disk_page = &lun->mode_pages.rigid_disk_page[
 				CTL_PAGE_DEFAULT];
 			scsi_ulto3b(cylinders, rigid_disk_page->cylinders);
 
 			if ((value = ctl_get_opt(&lun->be_lun->options,
 			    "rpm")) != NULL) {
 				scsi_ulto2b(strtol(value, NULL, 0),
 				     rigid_disk_page->rotation_rate);
 			}
 
 			memcpy(&lun->mode_pages.rigid_disk_page[CTL_PAGE_CURRENT],
 			       &lun->mode_pages.rigid_disk_page[CTL_PAGE_DEFAULT],
 			       sizeof(rigid_disk_page_default));
 			memcpy(&lun->mode_pages.rigid_disk_page[CTL_PAGE_SAVED],
 			       &lun->mode_pages.rigid_disk_page[CTL_PAGE_DEFAULT],
 			       sizeof(rigid_disk_page_default));
 
 			page_index->page_data =
 				(uint8_t *)lun->mode_pages.rigid_disk_page;
 			break;
 		}
 		case SMS_VERIFY_ERROR_RECOVERY_PAGE: {
 			KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0,
 			    ("subpage %#x for page %#x is incorrect!",
 			    page_index->subpage, page_code));
 			memcpy(&lun->mode_pages.verify_er_page[CTL_PAGE_CURRENT],
 			       &verify_er_page_default,
 			       sizeof(verify_er_page_default));
 			memcpy(&lun->mode_pages.verify_er_page[CTL_PAGE_CHANGEABLE],
 			       &verify_er_page_changeable,
 			       sizeof(verify_er_page_changeable));
 			memcpy(&lun->mode_pages.verify_er_page[CTL_PAGE_DEFAULT],
 			       &verify_er_page_default,
 			       sizeof(verify_er_page_default));
 			memcpy(&lun->mode_pages.verify_er_page[CTL_PAGE_SAVED],
 			       &verify_er_page_default,
 			       sizeof(verify_er_page_default));
 			page_index->page_data =
 				(uint8_t *)lun->mode_pages.verify_er_page;
 			break;
 		}
 		case SMS_CACHING_PAGE: {
 			struct scsi_caching_page *caching_page;
 
 			KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0,
 			    ("subpage %#x for page %#x is incorrect!",
 			    page_index->subpage, page_code));
 			memcpy(&lun->mode_pages.caching_page[CTL_PAGE_DEFAULT],
 			       &caching_page_default,
 			       sizeof(caching_page_default));
 			memcpy(&lun->mode_pages.caching_page[
 			       CTL_PAGE_CHANGEABLE], &caching_page_changeable,
 			       sizeof(caching_page_changeable));
 			memcpy(&lun->mode_pages.caching_page[CTL_PAGE_SAVED],
 			       &caching_page_default,
 			       sizeof(caching_page_default));
 			caching_page = &lun->mode_pages.caching_page[
 			    CTL_PAGE_SAVED];
 			value = ctl_get_opt(&lun->be_lun->options, "writecache");
 			if (value != NULL && strcmp(value, "off") == 0)
 				caching_page->flags1 &= ~SCP_WCE;
 			value = ctl_get_opt(&lun->be_lun->options, "readcache");
 			if (value != NULL && strcmp(value, "off") == 0)
 				caching_page->flags1 |= SCP_RCD;
 			memcpy(&lun->mode_pages.caching_page[CTL_PAGE_CURRENT],
 			       &lun->mode_pages.caching_page[CTL_PAGE_SAVED],
 			       sizeof(caching_page_default));
 			page_index->page_data =
 				(uint8_t *)lun->mode_pages.caching_page;
 			break;
 		}
 		case SMS_CONTROL_MODE_PAGE: {
 			switch (page_index->subpage) {
 			case SMS_SUBPAGE_PAGE_0: {
 				struct scsi_control_page *control_page;
 
 				memcpy(&lun->mode_pages.control_page[
 				    CTL_PAGE_DEFAULT],
 				       &control_page_default,
 				       sizeof(control_page_default));
 				memcpy(&lun->mode_pages.control_page[
 				    CTL_PAGE_CHANGEABLE],
 				       &control_page_changeable,
 				       sizeof(control_page_changeable));
 				memcpy(&lun->mode_pages.control_page[
 				    CTL_PAGE_SAVED],
 				       &control_page_default,
 				       sizeof(control_page_default));
 				control_page = &lun->mode_pages.control_page[
 				    CTL_PAGE_SAVED];
 				value = ctl_get_opt(&lun->be_lun->options,
 				    "reordering");
 				if (value != NULL &&
 				    strcmp(value, "unrestricted") == 0) {
 					control_page->queue_flags &=
 					    ~SCP_QUEUE_ALG_MASK;
 					control_page->queue_flags |=
 					    SCP_QUEUE_ALG_UNRESTRICTED;
 				}
 				memcpy(&lun->mode_pages.control_page[
 				    CTL_PAGE_CURRENT],
 				       &lun->mode_pages.control_page[
 				    CTL_PAGE_SAVED],
 				       sizeof(control_page_default));
 				page_index->page_data =
 				    (uint8_t *)lun->mode_pages.control_page;
 				break;
 			}
 			case 0x01:
 				memcpy(&lun->mode_pages.control_ext_page[
 				    CTL_PAGE_DEFAULT],
 				       &control_ext_page_default,
 				       sizeof(control_ext_page_default));
 				memcpy(&lun->mode_pages.control_ext_page[
 				    CTL_PAGE_CHANGEABLE],
 				       &control_ext_page_changeable,
 				       sizeof(control_ext_page_changeable));
 				memcpy(&lun->mode_pages.control_ext_page[
 				    CTL_PAGE_SAVED],
 				       &control_ext_page_default,
 				       sizeof(control_ext_page_default));
 				memcpy(&lun->mode_pages.control_ext_page[
 				    CTL_PAGE_CURRENT],
 				       &lun->mode_pages.control_ext_page[
 				    CTL_PAGE_SAVED],
 				       sizeof(control_ext_page_default));
 				page_index->page_data =
 				    (uint8_t *)lun->mode_pages.control_ext_page;
 				break;
 			default:
 				panic("subpage %#x for page %#x is incorrect!",
 				      page_index->subpage, page_code);
 			}
 			break;
 		}
 		case SMS_INFO_EXCEPTIONS_PAGE: {
 			switch (page_index->subpage) {
 			case SMS_SUBPAGE_PAGE_0:
 				memcpy(&lun->mode_pages.ie_page[CTL_PAGE_CURRENT],
 				       &ie_page_default,
 				       sizeof(ie_page_default));
 				memcpy(&lun->mode_pages.ie_page[
 				       CTL_PAGE_CHANGEABLE], &ie_page_changeable,
 				       sizeof(ie_page_changeable));
 				memcpy(&lun->mode_pages.ie_page[CTL_PAGE_DEFAULT],
 				       &ie_page_default,
 				       sizeof(ie_page_default));
 				memcpy(&lun->mode_pages.ie_page[CTL_PAGE_SAVED],
 				       &ie_page_default,
 				       sizeof(ie_page_default));
 				page_index->page_data =
 					(uint8_t *)lun->mode_pages.ie_page;
 				break;
 			case 0x02: {
 				struct ctl_logical_block_provisioning_page *page;
 
 				memcpy(&lun->mode_pages.lbp_page[CTL_PAGE_DEFAULT],
 				       &lbp_page_default,
 				       sizeof(lbp_page_default));
 				memcpy(&lun->mode_pages.lbp_page[
 				       CTL_PAGE_CHANGEABLE], &lbp_page_changeable,
 				       sizeof(lbp_page_changeable));
 				memcpy(&lun->mode_pages.lbp_page[CTL_PAGE_SAVED],
 				       &lbp_page_default,
 				       sizeof(lbp_page_default));
 				page = &lun->mode_pages.lbp_page[CTL_PAGE_SAVED];
 				value = ctl_get_opt(&lun->be_lun->options,
 				    "avail-threshold");
 				if (value != NULL &&
 				    ctl_expand_number(value, &ival) == 0) {
 					page->descr[0].flags |= SLBPPD_ENABLED |
 					    SLBPPD_ARMING_DEC;
 					if (lun->be_lun->blocksize)
 						ival /= lun->be_lun->blocksize;
 					else
 						ival /= 512;
 					scsi_ulto4b(ival >> CTL_LBP_EXPONENT,
 					    page->descr[0].count);
 				}
 				value = ctl_get_opt(&lun->be_lun->options,
 				    "used-threshold");
 				if (value != NULL &&
 				    ctl_expand_number(value, &ival) == 0) {
 					page->descr[1].flags |= SLBPPD_ENABLED |
 					    SLBPPD_ARMING_INC;
 					if (lun->be_lun->blocksize)
 						ival /= lun->be_lun->blocksize;
 					else
 						ival /= 512;
 					scsi_ulto4b(ival >> CTL_LBP_EXPONENT,
 					    page->descr[1].count);
 				}
 				value = ctl_get_opt(&lun->be_lun->options,
 				    "pool-avail-threshold");
 				if (value != NULL &&
 				    ctl_expand_number(value, &ival) == 0) {
 					page->descr[2].flags |= SLBPPD_ENABLED |
 					    SLBPPD_ARMING_DEC;
 					if (lun->be_lun->blocksize)
 						ival /= lun->be_lun->blocksize;
 					else
 						ival /= 512;
 					scsi_ulto4b(ival >> CTL_LBP_EXPONENT,
 					    page->descr[2].count);
 				}
 				value = ctl_get_opt(&lun->be_lun->options,
 				    "pool-used-threshold");
 				if (value != NULL &&
 				    ctl_expand_number(value, &ival) == 0) {
 					page->descr[3].flags |= SLBPPD_ENABLED |
 					    SLBPPD_ARMING_INC;
 					if (lun->be_lun->blocksize)
 						ival /= lun->be_lun->blocksize;
 					else
 						ival /= 512;
 					scsi_ulto4b(ival >> CTL_LBP_EXPONENT,
 					    page->descr[3].count);
 				}
 				memcpy(&lun->mode_pages.lbp_page[CTL_PAGE_CURRENT],
 				       &lun->mode_pages.lbp_page[CTL_PAGE_SAVED],
 				       sizeof(lbp_page_default));
 				page_index->page_data =
 					(uint8_t *)lun->mode_pages.lbp_page;
 				break;
 			}
 			default:
 				panic("subpage %#x for page %#x is incorrect!",
 				      page_index->subpage, page_code);
 			}
 			break;
 		}
 		case SMS_CDDVD_CAPS_PAGE:{
 			KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0,
 			    ("subpage %#x for page %#x is incorrect!",
 			    page_index->subpage, page_code));
 			memcpy(&lun->mode_pages.cddvd_page[CTL_PAGE_DEFAULT],
 			       &cddvd_page_default,
 			       sizeof(cddvd_page_default));
 			memcpy(&lun->mode_pages.cddvd_page[
 			       CTL_PAGE_CHANGEABLE], &cddvd_page_changeable,
 			       sizeof(cddvd_page_changeable));
 			memcpy(&lun->mode_pages.cddvd_page[CTL_PAGE_SAVED],
 			       &cddvd_page_default,
 			       sizeof(cddvd_page_default));
 			memcpy(&lun->mode_pages.cddvd_page[CTL_PAGE_CURRENT],
 			       &lun->mode_pages.cddvd_page[CTL_PAGE_SAVED],
 			       sizeof(cddvd_page_default));
 			page_index->page_data =
 				(uint8_t *)lun->mode_pages.cddvd_page;
 			break;
 		}
 		default:
 			panic("invalid page code value %#x", page_code);
 		}
 	}
 
 	return (CTL_RETVAL_COMPLETE);
 }
 
 static int
 ctl_init_log_page_index(struct ctl_lun *lun)
 {
 	struct ctl_page_index *page_index;
 	int i, j, k, prev;
 
 	memcpy(&lun->log_pages.index, log_page_index_template,
 	       sizeof(log_page_index_template));
 
 	prev = -1;
 	for (i = 0, j = 0, k = 0; i < CTL_NUM_LOG_PAGES; i++) {
 
 		page_index = &lun->log_pages.index[i];
 		if (lun->be_lun->lun_type == T_DIRECT &&
 		    (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0)
 			continue;
 		if (lun->be_lun->lun_type == T_PROCESSOR &&
 		    (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0)
 			continue;
 		if (lun->be_lun->lun_type == T_CDROM &&
 		    (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0)
 			continue;
 
 		if (page_index->page_code == SLS_LOGICAL_BLOCK_PROVISIONING &&
 		    lun->backend->lun_attr == NULL)
 			continue;
 
 		if (page_index->page_code != prev) {
 			lun->log_pages.pages_page[j] = page_index->page_code;
 			prev = page_index->page_code;
 			j++;
 		}
 		lun->log_pages.subpages_page[k*2] = page_index->page_code;
 		lun->log_pages.subpages_page[k*2+1] = page_index->subpage;
 		k++;
 	}
 	lun->log_pages.index[0].page_data = &lun->log_pages.pages_page[0];
 	lun->log_pages.index[0].page_len = j;
 	lun->log_pages.index[1].page_data = &lun->log_pages.subpages_page[0];
 	lun->log_pages.index[1].page_len = k * 2;
 	lun->log_pages.index[2].page_data = &lun->log_pages.lbp_page[0];
 	lun->log_pages.index[2].page_len = 12*CTL_NUM_LBP_PARAMS;
 	lun->log_pages.index[3].page_data = (uint8_t *)&lun->log_pages.stat_page;
 	lun->log_pages.index[3].page_len = sizeof(lun->log_pages.stat_page);
 	lun->log_pages.index[4].page_data = (uint8_t *)&lun->log_pages.ie_page;
 	lun->log_pages.index[4].page_len = sizeof(lun->log_pages.ie_page);
 
 	return (CTL_RETVAL_COMPLETE);
 }
 
 static int
 hex2bin(const char *str, uint8_t *buf, int buf_size)
 {
 	int i;
 	u_char c;
 
 	memset(buf, 0, buf_size);
 	while (isspace(str[0]))
 		str++;
 	if (str[0] == '0' && (str[1] == 'x' || str[1] == 'X'))
 		str += 2;
 	buf_size *= 2;
 	for (i = 0; str[i] != 0 && i < buf_size; i++) {
 		while (str[i] == '-')	/* Skip dashes in UUIDs. */
 			str++;
 		c = str[i];
 		if (isdigit(c))
 			c -= '0';
 		else if (isalpha(c))
 			c -= isupper(c) ? 'A' - 10 : 'a' - 10;
 		else
 			break;
 		if (c >= 16)
 			break;
 		if ((i & 1) == 0)
 			buf[i / 2] |= (c << 4);
 		else
 			buf[i / 2] |= c;
 	}
 	return ((i + 1) / 2);
 }
 
 /*
  * LUN allocation.
  *
  * Requirements:
  * - caller allocates and zeros LUN storage, or passes in a NULL LUN if he
  *   wants us to allocate the LUN and he can block.
  * - ctl_softc is always set
  * - be_lun is set if the LUN has a backend (needed for disk LUNs)
  *
  * Returns 0 for success, non-zero (errno) for failure.
  */
 static int
 ctl_alloc_lun(struct ctl_softc *ctl_softc, struct ctl_lun *ctl_lun,
 	      struct ctl_be_lun *const be_lun)
 {
 	struct ctl_lun *nlun, *lun;
 	struct scsi_vpd_id_descriptor *desc;
 	struct scsi_vpd_id_t10 *t10id;
 	const char *eui, *naa, *scsiname, *uuid, *vendor, *value;
 	int lun_number, lun_malloced;
 	int devidlen, idlen1, idlen2 = 0, len;
 
 	if (be_lun == NULL)
 		return (EINVAL);
 
 	/*
 	 * We currently only support Direct Access or Processor LUN types.
 	 */
 	switch (be_lun->lun_type) {
 	case T_DIRECT:
 	case T_PROCESSOR:
 	case T_CDROM:
 		break;
 	case T_SEQUENTIAL:
 	case T_CHANGER:
 	default:
 		be_lun->lun_config_status(be_lun->be_lun,
 					  CTL_LUN_CONFIG_FAILURE);
 		break;
 	}
 	if (ctl_lun == NULL) {
 		lun = malloc(sizeof(*lun), M_CTL, M_WAITOK);
 		lun_malloced = 1;
 	} else {
 		lun_malloced = 0;
 		lun = ctl_lun;
 	}
 
 	memset(lun, 0, sizeof(*lun));
 	if (lun_malloced)
 		lun->flags = CTL_LUN_MALLOCED;
 
 	/* Generate LUN ID. */
 	devidlen = max(CTL_DEVID_MIN_LEN,
 	    strnlen(be_lun->device_id, CTL_DEVID_LEN));
 	idlen1 = sizeof(*t10id) + devidlen;
 	len = sizeof(struct scsi_vpd_id_descriptor) + idlen1;
 	scsiname = ctl_get_opt(&be_lun->options, "scsiname");
 	if (scsiname != NULL) {
 		idlen2 = roundup2(strlen(scsiname) + 1, 4);
 		len += sizeof(struct scsi_vpd_id_descriptor) + idlen2;
 	}
 	eui = ctl_get_opt(&be_lun->options, "eui");
 	if (eui != NULL) {
 		len += sizeof(struct scsi_vpd_id_descriptor) + 16;
 	}
 	naa = ctl_get_opt(&be_lun->options, "naa");
 	if (naa != NULL) {
 		len += sizeof(struct scsi_vpd_id_descriptor) + 16;
 	}
 	uuid = ctl_get_opt(&be_lun->options, "uuid");
 	if (uuid != NULL) {
 		len += sizeof(struct scsi_vpd_id_descriptor) + 18;
 	}
 	lun->lun_devid = malloc(sizeof(struct ctl_devid) + len,
 	    M_CTL, M_WAITOK | M_ZERO);
 	desc = (struct scsi_vpd_id_descriptor *)lun->lun_devid->data;
 	desc->proto_codeset = SVPD_ID_CODESET_ASCII;
 	desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_LUN | SVPD_ID_TYPE_T10;
 	desc->length = idlen1;
 	t10id = (struct scsi_vpd_id_t10 *)&desc->identifier[0];
 	memset(t10id->vendor, ' ', sizeof(t10id->vendor));
 	if ((vendor = ctl_get_opt(&be_lun->options, "vendor")) == NULL) {
 		strncpy((char *)t10id->vendor, CTL_VENDOR, sizeof(t10id->vendor));
 	} else {
 		strncpy(t10id->vendor, vendor,
 		    min(sizeof(t10id->vendor), strlen(vendor)));
 	}
 	strncpy((char *)t10id->vendor_spec_id,
 	    (char *)be_lun->device_id, devidlen);
 	if (scsiname != NULL) {
 		desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] +
 		    desc->length);
 		desc->proto_codeset = SVPD_ID_CODESET_UTF8;
 		desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_LUN |
 		    SVPD_ID_TYPE_SCSI_NAME;
 		desc->length = idlen2;
 		strlcpy(desc->identifier, scsiname, idlen2);
 	}
 	if (eui != NULL) {
 		desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] +
 		    desc->length);
 		desc->proto_codeset = SVPD_ID_CODESET_BINARY;
 		desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_LUN |
 		    SVPD_ID_TYPE_EUI64;
 		desc->length = hex2bin(eui, desc->identifier, 16);
 		desc->length = desc->length > 12 ? 16 :
 		    (desc->length > 8 ? 12 : 8);
 		len -= 16 - desc->length;
 	}
 	if (naa != NULL) {
 		desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] +
 		    desc->length);
 		desc->proto_codeset = SVPD_ID_CODESET_BINARY;
 		desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_LUN |
 		    SVPD_ID_TYPE_NAA;
 		desc->length = hex2bin(naa, desc->identifier, 16);
 		desc->length = desc->length > 8 ? 16 : 8;
 		len -= 16 - desc->length;
 	}
 	if (uuid != NULL) {
 		desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] +
 		    desc->length);
 		desc->proto_codeset = SVPD_ID_CODESET_BINARY;
 		desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_LUN |
 		    SVPD_ID_TYPE_UUID;
 		desc->identifier[0] = 0x10;
 		hex2bin(uuid, &desc->identifier[2], 16);
 		desc->length = 18;
 	}
 	lun->lun_devid->len = len;
 
 	mtx_lock(&ctl_softc->ctl_lock);
 	/*
 	 * See if the caller requested a particular LUN number.  If so, see
 	 * if it is available.  Otherwise, allocate the first available LUN.
 	 */
 	if (be_lun->flags & CTL_LUN_FLAG_ID_REQ) {
 		if ((be_lun->req_lun_id > (CTL_MAX_LUNS - 1))
 		 || (ctl_is_set(ctl_softc->ctl_lun_mask, be_lun->req_lun_id))) {
 			mtx_unlock(&ctl_softc->ctl_lock);
 			if (be_lun->req_lun_id > (CTL_MAX_LUNS - 1)) {
 				printf("ctl: requested LUN ID %d is higher "
 				       "than CTL_MAX_LUNS - 1 (%d)\n",
 				       be_lun->req_lun_id, CTL_MAX_LUNS - 1);
 			} else {
 				/*
 				 * XXX KDM return an error, or just assign
 				 * another LUN ID in this case??
 				 */
 				printf("ctl: requested LUN ID %d is already "
 				       "in use\n", be_lun->req_lun_id);
 			}
 fail:
 			free(lun->lun_devid, M_CTL);
 			if (lun->flags & CTL_LUN_MALLOCED)
 				free(lun, M_CTL);
 			be_lun->lun_config_status(be_lun->be_lun,
 						  CTL_LUN_CONFIG_FAILURE);
 			return (ENOSPC);
 		}
 		lun_number = be_lun->req_lun_id;
 	} else {
 		lun_number = ctl_ffz(ctl_softc->ctl_lun_mask, 0, CTL_MAX_LUNS);
 		if (lun_number == -1) {
 			mtx_unlock(&ctl_softc->ctl_lock);
 			printf("ctl: can't allocate LUN, out of LUNs\n");
 			goto fail;
 		}
 	}
 	ctl_set_mask(ctl_softc->ctl_lun_mask, lun_number);
 	mtx_unlock(&ctl_softc->ctl_lock);
 
 	mtx_init(&lun->lun_lock, "CTL LUN", NULL, MTX_DEF);
 	lun->lun = lun_number;
 	lun->be_lun = be_lun;
 	/*
 	 * The processor LUN is always enabled.  Disk LUNs come on line
 	 * disabled, and must be enabled by the backend.
 	 */
 	lun->flags |= CTL_LUN_DISABLED;
 	lun->backend = be_lun->be;
 	be_lun->ctl_lun = lun;
 	be_lun->lun_id = lun_number;
 	atomic_add_int(&be_lun->be->num_luns, 1);
 	if (be_lun->flags & CTL_LUN_FLAG_EJECTED)
 		lun->flags |= CTL_LUN_EJECTED;
 	if (be_lun->flags & CTL_LUN_FLAG_NO_MEDIA)
 		lun->flags |= CTL_LUN_NO_MEDIA;
 	if (be_lun->flags & CTL_LUN_FLAG_STOPPED)
 		lun->flags |= CTL_LUN_STOPPED;
 
 	if (be_lun->flags & CTL_LUN_FLAG_PRIMARY)
 		lun->flags |= CTL_LUN_PRIMARY_SC;
 
 	value = ctl_get_opt(&be_lun->options, "removable");
 	if (value != NULL) {
 		if (strcmp(value, "on") == 0)
 			lun->flags |= CTL_LUN_REMOVABLE;
 	} else if (be_lun->lun_type == T_CDROM)
 		lun->flags |= CTL_LUN_REMOVABLE;
 
 	lun->ctl_softc = ctl_softc;
 #ifdef CTL_TIME_IO
 	lun->last_busy = getsbinuptime();
 #endif
 	TAILQ_INIT(&lun->ooa_queue);
 	TAILQ_INIT(&lun->blocked_queue);
 	STAILQ_INIT(&lun->error_list);
 	lun->ie_reported = 1;
 	callout_init_mtx(&lun->ie_callout, &lun->lun_lock, 0);
 	ctl_tpc_lun_init(lun);
 	if (lun->flags & CTL_LUN_REMOVABLE) {
 		lun->prevent = malloc((CTL_MAX_INITIATORS + 31) / 32 * 4,
 		    M_CTL, M_WAITOK);
 	}
 
 	/*
 	 * Initialize the mode and log page index.
 	 */
 	ctl_init_page_index(lun);
 	ctl_init_log_page_index(lun);
 
 	/* Setup statistics gathering */
 #ifdef CTL_LEGACY_STATS
 	lun->legacy_stats.device_type = be_lun->lun_type;
 	lun->legacy_stats.lun_number = lun_number;
 	lun->legacy_stats.blocksize = be_lun->blocksize;
 	if (be_lun->blocksize == 0)
 		lun->legacy_stats.flags = CTL_LUN_STATS_NO_BLOCKSIZE;
 	for (len = 0; len < CTL_MAX_PORTS; len++)
 		lun->legacy_stats.ports[len].targ_port = len;
 #endif /* CTL_LEGACY_STATS */
 	lun->stats.item = lun_number;
 
 	/*
 	 * Now, before we insert this lun on the lun list, set the lun
 	 * inventory changed UA for all other luns.
 	 */
 	mtx_lock(&ctl_softc->ctl_lock);
 	STAILQ_FOREACH(nlun, &ctl_softc->lun_list, links) {
 		mtx_lock(&nlun->lun_lock);
 		ctl_est_ua_all(nlun, -1, CTL_UA_LUN_CHANGE);
 		mtx_unlock(&nlun->lun_lock);
 	}
 	STAILQ_INSERT_TAIL(&ctl_softc->lun_list, lun, links);
 	ctl_softc->ctl_luns[lun_number] = lun;
 	ctl_softc->num_luns++;
 	mtx_unlock(&ctl_softc->ctl_lock);
 
 	lun->be_lun->lun_config_status(lun->be_lun->be_lun, CTL_LUN_CONFIG_OK);
 	return (0);
 }
 
 /*
  * Delete a LUN.
  * Assumptions:
  * - LUN has already been marked invalid and any pending I/O has been taken
  *   care of.
  */
 static int
 ctl_free_lun(struct ctl_lun *lun)
 {
 	struct ctl_softc *softc = lun->ctl_softc;
 	struct ctl_lun *nlun;
 	int i;
 
 	mtx_assert(&softc->ctl_lock, MA_OWNED);
 
 	STAILQ_REMOVE(&softc->lun_list, lun, ctl_lun, links);
 
 	ctl_clear_mask(softc->ctl_lun_mask, lun->lun);
 
 	softc->ctl_luns[lun->lun] = NULL;
 
 	if (!TAILQ_EMPTY(&lun->ooa_queue))
 		panic("Freeing a LUN %p with outstanding I/O!!\n", lun);
 
 	softc->num_luns--;
 
 	/*
 	 * Tell the backend to free resources, if this LUN has a backend.
 	 */
 	atomic_subtract_int(&lun->be_lun->be->num_luns, 1);
 	lun->be_lun->lun_shutdown(lun->be_lun->be_lun);
 
 	lun->ie_reportcnt = UINT32_MAX;
 	callout_drain(&lun->ie_callout);
 
 	ctl_tpc_lun_shutdown(lun);
 	mtx_destroy(&lun->lun_lock);
 	free(lun->lun_devid, M_CTL);
 	for (i = 0; i < CTL_MAX_PORTS; i++)
 		free(lun->pending_ua[i], M_CTL);
 	for (i = 0; i < CTL_MAX_PORTS; i++)
 		free(lun->pr_keys[i], M_CTL);
 	free(lun->write_buffer, M_CTL);
 	free(lun->prevent, M_CTL);
 	if (lun->flags & CTL_LUN_MALLOCED)
 		free(lun, M_CTL);
 
 	STAILQ_FOREACH(nlun, &softc->lun_list, links) {
 		mtx_lock(&nlun->lun_lock);
 		ctl_est_ua_all(nlun, -1, CTL_UA_LUN_CHANGE);
 		mtx_unlock(&nlun->lun_lock);
 	}
 
 	return (0);
 }
 
 static void
 ctl_create_lun(struct ctl_be_lun *be_lun)
 {
 
 	/*
 	 * ctl_alloc_lun() should handle all potential failure cases.
 	 */
 	ctl_alloc_lun(control_softc, NULL, be_lun);
 }
 
 int
 ctl_add_lun(struct ctl_be_lun *be_lun)
 {
 	struct ctl_softc *softc = control_softc;
 
 	mtx_lock(&softc->ctl_lock);
 	STAILQ_INSERT_TAIL(&softc->pending_lun_queue, be_lun, links);
 	mtx_unlock(&softc->ctl_lock);
 	wakeup(&softc->pending_lun_queue);
 
 	return (0);
 }
 
 int
 ctl_enable_lun(struct ctl_be_lun *be_lun)
 {
 	struct ctl_softc *softc;
 	struct ctl_port *port, *nport;
 	struct ctl_lun *lun;
 	int retval;
 
 	lun = (struct ctl_lun *)be_lun->ctl_lun;
 	softc = lun->ctl_softc;
 
 	mtx_lock(&softc->ctl_lock);
 	mtx_lock(&lun->lun_lock);
 	if ((lun->flags & CTL_LUN_DISABLED) == 0) {
 		/*
 		 * eh?  Why did we get called if the LUN is already
 		 * enabled?
 		 */
 		mtx_unlock(&lun->lun_lock);
 		mtx_unlock(&softc->ctl_lock);
 		return (0);
 	}
 	lun->flags &= ~CTL_LUN_DISABLED;
 	mtx_unlock(&lun->lun_lock);
 
 	STAILQ_FOREACH_SAFE(port, &softc->port_list, links, nport) {
 		if ((port->status & CTL_PORT_STATUS_ONLINE) == 0 ||
 		    port->lun_map != NULL || port->lun_enable == NULL)
 			continue;
 
 		/*
 		 * Drop the lock while we call the FETD's enable routine.
 		 * This can lead to a callback into CTL (at least in the
 		 * case of the internal initiator frontend.
 		 */
 		mtx_unlock(&softc->ctl_lock);
 		retval = port->lun_enable(port->targ_lun_arg, lun->lun);
 		mtx_lock(&softc->ctl_lock);
 		if (retval != 0) {
 			printf("%s: FETD %s port %d returned error "
 			       "%d for lun_enable on lun %jd\n",
 			       __func__, port->port_name, port->targ_port,
 			       retval, (intmax_t)lun->lun);
 		}
 	}
 
 	mtx_unlock(&softc->ctl_lock);
 	ctl_isc_announce_lun(lun);
 
 	return (0);
 }
 
 int
 ctl_disable_lun(struct ctl_be_lun *be_lun)
 {
 	struct ctl_softc *softc;
 	struct ctl_port *port;
 	struct ctl_lun *lun;
 	int retval;
 
 	lun = (struct ctl_lun *)be_lun->ctl_lun;
 	softc = lun->ctl_softc;
 
 	mtx_lock(&softc->ctl_lock);
 	mtx_lock(&lun->lun_lock);
 	if (lun->flags & CTL_LUN_DISABLED) {
 		mtx_unlock(&lun->lun_lock);
 		mtx_unlock(&softc->ctl_lock);
 		return (0);
 	}
 	lun->flags |= CTL_LUN_DISABLED;
 	mtx_unlock(&lun->lun_lock);
 
 	STAILQ_FOREACH(port, &softc->port_list, links) {
 		if ((port->status & CTL_PORT_STATUS_ONLINE) == 0 ||
 		    port->lun_map != NULL || port->lun_disable == NULL)
 			continue;
 
 		/*
 		 * Drop the lock before we call the frontend's disable
 		 * routine, to avoid lock order reversals.
 		 *
 		 * XXX KDM what happens if the frontend list changes while
 		 * we're traversing it?  It's unlikely, but should be handled.
 		 */
 		mtx_unlock(&softc->ctl_lock);
 		retval = port->lun_disable(port->targ_lun_arg, lun->lun);
 		mtx_lock(&softc->ctl_lock);
 		if (retval != 0) {
 			printf("%s: FETD %s port %d returned error "
 			       "%d for lun_disable on lun %jd\n",
 			       __func__, port->port_name, port->targ_port,
 			       retval, (intmax_t)lun->lun);
 		}
 	}
 
 	mtx_unlock(&softc->ctl_lock);
 	ctl_isc_announce_lun(lun);
 
 	return (0);
 }
 
 int
 ctl_start_lun(struct ctl_be_lun *be_lun)
 {
 	struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun;
 
 	mtx_lock(&lun->lun_lock);
 	lun->flags &= ~CTL_LUN_STOPPED;
 	mtx_unlock(&lun->lun_lock);
 	return (0);
 }
 
 int
 ctl_stop_lun(struct ctl_be_lun *be_lun)
 {
 	struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun;
 
 	mtx_lock(&lun->lun_lock);
 	lun->flags |= CTL_LUN_STOPPED;
 	mtx_unlock(&lun->lun_lock);
 	return (0);
 }
 
 int
 ctl_lun_no_media(struct ctl_be_lun *be_lun)
 {
 	struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun;
 
 	mtx_lock(&lun->lun_lock);
 	lun->flags |= CTL_LUN_NO_MEDIA;
 	mtx_unlock(&lun->lun_lock);
 	return (0);
 }
 
 int
 ctl_lun_has_media(struct ctl_be_lun *be_lun)
 {
 	struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun;
 	union ctl_ha_msg msg;
 
 	mtx_lock(&lun->lun_lock);
 	lun->flags &= ~(CTL_LUN_NO_MEDIA | CTL_LUN_EJECTED);
 	if (lun->flags & CTL_LUN_REMOVABLE)
 		ctl_est_ua_all(lun, -1, CTL_UA_MEDIUM_CHANGE);
 	mtx_unlock(&lun->lun_lock);
 	if ((lun->flags & CTL_LUN_REMOVABLE) &&
 	    lun->ctl_softc->ha_mode == CTL_HA_MODE_XFER) {
 		bzero(&msg.ua, sizeof(msg.ua));
 		msg.hdr.msg_type = CTL_MSG_UA;
 		msg.hdr.nexus.initid = -1;
 		msg.hdr.nexus.targ_port = -1;
 		msg.hdr.nexus.targ_lun = lun->lun;
 		msg.hdr.nexus.targ_mapped_lun = lun->lun;
 		msg.ua.ua_all = 1;
 		msg.ua.ua_set = 1;
 		msg.ua.ua_type = CTL_UA_MEDIUM_CHANGE;
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg, sizeof(msg.ua),
 		    M_WAITOK);
 	}
 	return (0);
 }
 
 int
 ctl_lun_ejected(struct ctl_be_lun *be_lun)
 {
 	struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun;
 
 	mtx_lock(&lun->lun_lock);
 	lun->flags |= CTL_LUN_EJECTED;
 	mtx_unlock(&lun->lun_lock);
 	return (0);
 }
 
 int
 ctl_lun_primary(struct ctl_be_lun *be_lun)
 {
 	struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun;
 
 	mtx_lock(&lun->lun_lock);
 	lun->flags |= CTL_LUN_PRIMARY_SC;
 	ctl_est_ua_all(lun, -1, CTL_UA_ASYM_ACC_CHANGE);
 	mtx_unlock(&lun->lun_lock);
 	ctl_isc_announce_lun(lun);
 	return (0);
 }
 
 int
 ctl_lun_secondary(struct ctl_be_lun *be_lun)
 {
 	struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun;
 
 	mtx_lock(&lun->lun_lock);
 	lun->flags &= ~CTL_LUN_PRIMARY_SC;
 	ctl_est_ua_all(lun, -1, CTL_UA_ASYM_ACC_CHANGE);
 	mtx_unlock(&lun->lun_lock);
 	ctl_isc_announce_lun(lun);
 	return (0);
 }
 
 int
 ctl_invalidate_lun(struct ctl_be_lun *be_lun)
 {
 	struct ctl_softc *softc;
 	struct ctl_lun *lun;
 
 	lun = (struct ctl_lun *)be_lun->ctl_lun;
 	softc = lun->ctl_softc;
 
 	mtx_lock(&lun->lun_lock);
 
 	/*
 	 * The LUN needs to be disabled before it can be marked invalid.
 	 */
 	if ((lun->flags & CTL_LUN_DISABLED) == 0) {
 		mtx_unlock(&lun->lun_lock);
 		return (-1);
 	}
 	/*
 	 * Mark the LUN invalid.
 	 */
 	lun->flags |= CTL_LUN_INVALID;
 
 	/*
 	 * If there is nothing in the OOA queue, go ahead and free the LUN.
 	 * If we have something in the OOA queue, we'll free it when the
 	 * last I/O completes.
 	 */
 	if (TAILQ_EMPTY(&lun->ooa_queue)) {
 		mtx_unlock(&lun->lun_lock);
 		mtx_lock(&softc->ctl_lock);
 		ctl_free_lun(lun);
 		mtx_unlock(&softc->ctl_lock);
 	} else
 		mtx_unlock(&lun->lun_lock);
 
 	return (0);
 }
 
 void
 ctl_lun_capacity_changed(struct ctl_be_lun *be_lun)
 {
 	struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun;
 	union ctl_ha_msg msg;
 
 	mtx_lock(&lun->lun_lock);
 	ctl_est_ua_all(lun, -1, CTL_UA_CAPACITY_CHANGE);
 	mtx_unlock(&lun->lun_lock);
 	if (lun->ctl_softc->ha_mode == CTL_HA_MODE_XFER) {
 		/* Send msg to other side. */
 		bzero(&msg.ua, sizeof(msg.ua));
 		msg.hdr.msg_type = CTL_MSG_UA;
 		msg.hdr.nexus.initid = -1;
 		msg.hdr.nexus.targ_port = -1;
 		msg.hdr.nexus.targ_lun = lun->lun;
 		msg.hdr.nexus.targ_mapped_lun = lun->lun;
 		msg.ua.ua_all = 1;
 		msg.ua.ua_set = 1;
 		msg.ua.ua_type = CTL_UA_CAPACITY_CHANGE;
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg, sizeof(msg.ua),
 		    M_WAITOK);
 	}
 }
 
 /*
  * Backend "memory move is complete" callback for requests that never
  * make it down to say RAIDCore's configuration code.
  */
 int
 ctl_config_move_done(union ctl_io *io)
 {
 	int retval;
 
 	CTL_DEBUG_PRINT(("ctl_config_move_done\n"));
 	KASSERT(io->io_hdr.io_type == CTL_IO_SCSI,
 	    ("Config I/O type isn't CTL_IO_SCSI (%d)!", io->io_hdr.io_type));
 
 	if ((io->io_hdr.port_status != 0) &&
 	    ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE ||
 	     (io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS)) {
 		ctl_set_internal_failure(&io->scsiio, /*sks_valid*/ 1,
 		    /*retry_count*/ io->io_hdr.port_status);
 	} else if (io->scsiio.kern_data_resid != 0 &&
 	    (io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_OUT &&
 	    ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE ||
 	     (io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS)) {
 		ctl_set_invalid_field_ciu(&io->scsiio);
 	}
 
 	if (ctl_debug & CTL_DEBUG_CDB_DATA)
 		ctl_data_print(io);
 	if (((io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_IN) ||
 	    ((io->io_hdr.status & CTL_STATUS_MASK) != CTL_STATUS_NONE &&
 	     (io->io_hdr.status & CTL_STATUS_MASK) != CTL_SUCCESS) ||
 	    ((io->io_hdr.flags & CTL_FLAG_ABORT) != 0)) {
 		/*
 		 * XXX KDM just assuming a single pointer here, and not a
 		 * S/G list.  If we start using S/G lists for config data,
 		 * we'll need to know how to clean them up here as well.
 		 */
 		if (io->io_hdr.flags & CTL_FLAG_ALLOCATED)
 			free(io->scsiio.kern_data_ptr, M_CTL);
 		ctl_done(io);
 		retval = CTL_RETVAL_COMPLETE;
 	} else {
 		/*
 		 * XXX KDM now we need to continue data movement.  Some
 		 * options:
 		 * - call ctl_scsiio() again?  We don't do this for data
 		 *   writes, because for those at least we know ahead of
 		 *   time where the write will go and how long it is.  For
 		 *   config writes, though, that information is largely
 		 *   contained within the write itself, thus we need to
 		 *   parse out the data again.
 		 *
 		 * - Call some other function once the data is in?
 		 */
 
 		/*
 		 * XXX KDM call ctl_scsiio() again for now, and check flag
 		 * bits to see whether we're allocated or not.
 		 */
 		retval = ctl_scsiio(&io->scsiio);
 	}
 	return (retval);
 }
 
 /*
  * This gets called by a backend driver when it is done with a
  * data_submit method.
  */
 void
 ctl_data_submit_done(union ctl_io *io)
 {
 	/*
 	 * If the IO_CONT flag is set, we need to call the supplied
 	 * function to continue processing the I/O, instead of completing
 	 * the I/O just yet.
 	 *
 	 * If there is an error, though, we don't want to keep processing.
 	 * Instead, just send status back to the initiator.
 	 */
 	if ((io->io_hdr.flags & CTL_FLAG_IO_CONT) &&
 	    (io->io_hdr.flags & CTL_FLAG_ABORT) == 0 &&
 	    ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE ||
 	     (io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS)) {
 		io->scsiio.io_cont(io);
 		return;
 	}
 	ctl_done(io);
 }
 
 /*
  * This gets called by a backend driver when it is done with a
  * configuration write.
  */
 void
 ctl_config_write_done(union ctl_io *io)
 {
 	uint8_t *buf;
 
 	/*
 	 * If the IO_CONT flag is set, we need to call the supplied
 	 * function to continue processing the I/O, instead of completing
 	 * the I/O just yet.
 	 *
 	 * If there is an error, though, we don't want to keep processing.
 	 * Instead, just send status back to the initiator.
 	 */
 	if ((io->io_hdr.flags & CTL_FLAG_IO_CONT) &&
 	    (io->io_hdr.flags & CTL_FLAG_ABORT) == 0 &&
 	    ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE ||
 	     (io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS)) {
 		io->scsiio.io_cont(io);
 		return;
 	}
 	/*
 	 * Since a configuration write can be done for commands that actually
 	 * have data allocated, like write buffer, and commands that have
 	 * no data, like start/stop unit, we need to check here.
 	 */
 	if (io->io_hdr.flags & CTL_FLAG_ALLOCATED)
 		buf = io->scsiio.kern_data_ptr;
 	else
 		buf = NULL;
 	ctl_done(io);
 	if (buf)
 		free(buf, M_CTL);
 }
 
 void
 ctl_config_read_done(union ctl_io *io)
 {
 	uint8_t *buf;
 
 	/*
 	 * If there is some error -- we are done, skip data transfer.
 	 */
 	if ((io->io_hdr.flags & CTL_FLAG_ABORT) != 0 ||
 	    ((io->io_hdr.status & CTL_STATUS_MASK) != CTL_STATUS_NONE &&
 	     (io->io_hdr.status & CTL_STATUS_MASK) != CTL_SUCCESS)) {
 		if (io->io_hdr.flags & CTL_FLAG_ALLOCATED)
 			buf = io->scsiio.kern_data_ptr;
 		else
 			buf = NULL;
 		ctl_done(io);
 		if (buf)
 			free(buf, M_CTL);
 		return;
 	}
 
 	/*
 	 * If the IO_CONT flag is set, we need to call the supplied
 	 * function to continue processing the I/O, instead of completing
 	 * the I/O just yet.
 	 */
 	if (io->io_hdr.flags & CTL_FLAG_IO_CONT) {
 		io->scsiio.io_cont(io);
 		return;
 	}
 
 	ctl_datamove(io);
 }
 
 /*
  * SCSI release command.
  */
 int
 ctl_scsi_release(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	uint32_t residx;
 
 	CTL_DEBUG_PRINT(("ctl_scsi_release\n"));
 
 	residx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 
 	/*
 	 * XXX KDM right now, we only support LUN reservation.  We don't
 	 * support 3rd party reservations, or extent reservations, which
 	 * might actually need the parameter list.  If we've gotten this
 	 * far, we've got a LUN reservation.  Anything else got kicked out
 	 * above.  So, according to SPC, ignore the length.
 	 */
 
 	mtx_lock(&lun->lun_lock);
 
 	/*
 	 * According to SPC, it is not an error for an intiator to attempt
 	 * to release a reservation on a LUN that isn't reserved, or that
 	 * is reserved by another initiator.  The reservation can only be
 	 * released, though, by the initiator who made it or by one of
 	 * several reset type events.
 	 */
 	if ((lun->flags & CTL_LUN_RESERVED) && (lun->res_idx == residx))
 			lun->flags &= ~CTL_LUN_RESERVED;
 
 	mtx_unlock(&lun->lun_lock);
 
 	ctl_set_success(ctsio);
 	ctl_done((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_scsi_reserve(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	uint32_t residx;
 
 	CTL_DEBUG_PRINT(("ctl_reserve\n"));
 
 	residx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 
 	/*
 	 * XXX KDM right now, we only support LUN reservation.  We don't
 	 * support 3rd party reservations, or extent reservations, which
 	 * might actually need the parameter list.  If we've gotten this
 	 * far, we've got a LUN reservation.  Anything else got kicked out
 	 * above.  So, according to SPC, ignore the length.
 	 */
 
 	mtx_lock(&lun->lun_lock);
 	if ((lun->flags & CTL_LUN_RESERVED) && (lun->res_idx != residx)) {
 		ctl_set_reservation_conflict(ctsio);
 		goto bailout;
 	}
 
 	/* SPC-3 exceptions to SPC-2 RESERVE and RELEASE behavior. */
 	if (lun->flags & CTL_LUN_PR_RESERVED) {
 		ctl_set_success(ctsio);
 		goto bailout;
 	}
 
 	lun->flags |= CTL_LUN_RESERVED;
 	lun->res_idx = residx;
 	ctl_set_success(ctsio);
 
 bailout:
 	mtx_unlock(&lun->lun_lock);
 	ctl_done((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_start_stop(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_start_stop_unit *cdb;
 	int retval;
 
 	CTL_DEBUG_PRINT(("ctl_start_stop\n"));
 
 	cdb = (struct scsi_start_stop_unit *)ctsio->cdb;
 
 	if ((cdb->how & SSS_PC_MASK) == 0) {
 		if ((lun->flags & CTL_LUN_PR_RESERVED) &&
 		    (cdb->how & SSS_START) == 0) {
 			uint32_t residx;
 
 			residx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 			if (ctl_get_prkey(lun, residx) == 0 ||
 			    (lun->pr_res_idx != residx && lun->pr_res_type < 4)) {
 
 				ctl_set_reservation_conflict(ctsio);
 				ctl_done((union ctl_io *)ctsio);
 				return (CTL_RETVAL_COMPLETE);
 			}
 		}
 
 		if ((cdb->how & SSS_LOEJ) &&
 		    (lun->flags & CTL_LUN_REMOVABLE) == 0) {
 			ctl_set_invalid_field(ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 1,
 					      /*field*/ 4,
 					      /*bit_valid*/ 1,
 					      /*bit*/ 1);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 
 		if ((cdb->how & SSS_START) == 0 && (cdb->how & SSS_LOEJ) &&
 		    lun->prevent_count > 0) {
 			/* "Medium removal prevented" */
 			ctl_set_sense(ctsio, /*current_error*/ 1,
 			    /*sense_key*/(lun->flags & CTL_LUN_NO_MEDIA) ?
 			     SSD_KEY_NOT_READY : SSD_KEY_ILLEGAL_REQUEST,
 			    /*asc*/ 0x53, /*ascq*/ 0x02, SSD_ELEM_NONE);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 	}
 
 	retval = lun->backend->config_write((union ctl_io *)ctsio);
 	return (retval);
 }
 
 int
 ctl_prevent_allow(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_prevent *cdb;
 	int retval;
 	uint32_t initidx;
 
 	CTL_DEBUG_PRINT(("ctl_prevent_allow\n"));
 
 	cdb = (struct scsi_prevent *)ctsio->cdb;
 
 	if ((lun->flags & CTL_LUN_REMOVABLE) == 0 || lun->prevent == NULL) {
 		ctl_set_invalid_opcode(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	initidx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 	mtx_lock(&lun->lun_lock);
 	if ((cdb->how & PR_PREVENT) &&
 	    ctl_is_set(lun->prevent, initidx) == 0) {
 		ctl_set_mask(lun->prevent, initidx);
 		lun->prevent_count++;
 	} else if ((cdb->how & PR_PREVENT) == 0 &&
 	    ctl_is_set(lun->prevent, initidx)) {
 		ctl_clear_mask(lun->prevent, initidx);
 		lun->prevent_count--;
 	}
 	mtx_unlock(&lun->lun_lock);
 	retval = lun->backend->config_write((union ctl_io *)ctsio);
 	return (retval);
 }
 
 /*
  * We support the SYNCHRONIZE CACHE command (10 and 16 byte versions), but
  * we don't really do anything with the LBA and length fields if the user
  * passes them in.  Instead we'll just flush out the cache for the entire
  * LUN.
  */
 int
 ctl_sync_cache(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct ctl_lba_len_flags *lbalen;
 	uint64_t starting_lba;
 	uint32_t block_count;
 	int retval;
 	uint8_t byte2;
 
 	CTL_DEBUG_PRINT(("ctl_sync_cache\n"));
 
 	retval = 0;
 
 	switch (ctsio->cdb[0]) {
 	case SYNCHRONIZE_CACHE: {
 		struct scsi_sync_cache *cdb;
 		cdb = (struct scsi_sync_cache *)ctsio->cdb;
 
 		starting_lba = scsi_4btoul(cdb->begin_lba);
 		block_count = scsi_2btoul(cdb->lb_count);
 		byte2 = cdb->byte2;
 		break;
 	}
 	case SYNCHRONIZE_CACHE_16: {
 		struct scsi_sync_cache_16 *cdb;
 		cdb = (struct scsi_sync_cache_16 *)ctsio->cdb;
 
 		starting_lba = scsi_8btou64(cdb->begin_lba);
 		block_count = scsi_4btoul(cdb->lb_count);
 		byte2 = cdb->byte2;
 		break;
 	}
 	default:
 		ctl_set_invalid_opcode(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		goto bailout;
 		break; /* NOTREACHED */
 	}
 
 	/*
 	 * We check the LBA and length, but don't do anything with them.
 	 * A SYNCHRONIZE CACHE will cause the entire cache for this lun to
 	 * get flushed.  This check will just help satisfy anyone who wants
 	 * to see an error for an out of range LBA.
 	 */
 	if ((starting_lba + block_count) > (lun->be_lun->maxlba + 1)) {
 		ctl_set_lba_out_of_range(ctsio,
 		    MAX(starting_lba, lun->be_lun->maxlba + 1));
 		ctl_done((union ctl_io *)ctsio);
 		goto bailout;
 	}
 
 	lbalen = (struct ctl_lba_len_flags *)&ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN];
 	lbalen->lba = starting_lba;
 	lbalen->len = block_count;
 	lbalen->flags = byte2;
 	retval = lun->backend->config_write((union ctl_io *)ctsio);
 
 bailout:
 	return (retval);
 }
 
 int
 ctl_format(struct ctl_scsiio *ctsio)
 {
 	struct scsi_format *cdb;
 	int length, defect_list_len;
 
 	CTL_DEBUG_PRINT(("ctl_format\n"));
 
 	cdb = (struct scsi_format *)ctsio->cdb;
 
 	length = 0;
 	if (cdb->byte2 & SF_FMTDATA) {
 		if (cdb->byte2 & SF_LONGLIST)
 			length = sizeof(struct scsi_format_header_long);
 		else
 			length = sizeof(struct scsi_format_header_short);
 	}
 
 	if (((ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0)
 	 && (length > 0)) {
 		ctsio->kern_data_ptr = malloc(length, M_CTL, M_WAITOK);
 		ctsio->kern_data_len = length;
 		ctsio->kern_total_len = length;
 		ctsio->kern_rel_offset = 0;
 		ctsio->kern_sg_entries = 0;
 		ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 		ctsio->be_move_done = ctl_config_move_done;
 		ctl_datamove((union ctl_io *)ctsio);
 
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	defect_list_len = 0;
 
 	if (cdb->byte2 & SF_FMTDATA) {
 		if (cdb->byte2 & SF_LONGLIST) {
 			struct scsi_format_header_long *header;
 
 			header = (struct scsi_format_header_long *)
 				ctsio->kern_data_ptr;
 
 			defect_list_len = scsi_4btoul(header->defect_list_len);
 			if (defect_list_len != 0) {
 				ctl_set_invalid_field(ctsio,
 						      /*sks_valid*/ 1,
 						      /*command*/ 0,
 						      /*field*/ 2,
 						      /*bit_valid*/ 0,
 						      /*bit*/ 0);
 				goto bailout;
 			}
 		} else {
 			struct scsi_format_header_short *header;
 
 			header = (struct scsi_format_header_short *)
 				ctsio->kern_data_ptr;
 
 			defect_list_len = scsi_2btoul(header->defect_list_len);
 			if (defect_list_len != 0) {
 				ctl_set_invalid_field(ctsio,
 						      /*sks_valid*/ 1,
 						      /*command*/ 0,
 						      /*field*/ 2,
 						      /*bit_valid*/ 0,
 						      /*bit*/ 0);
 				goto bailout;
 			}
 		}
 	}
 
 	ctl_set_success(ctsio);
 bailout:
 
 	if (ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) {
 		free(ctsio->kern_data_ptr, M_CTL);
 		ctsio->io_hdr.flags &= ~CTL_FLAG_ALLOCATED;
 	}
 
 	ctl_done((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_read_buffer(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	uint64_t buffer_offset;
 	uint32_t len;
 	uint8_t byte2;
 	static uint8_t descr[4];
 	static uint8_t echo_descr[4] = { 0 };
 
 	CTL_DEBUG_PRINT(("ctl_read_buffer\n"));
 
 	switch (ctsio->cdb[0]) {
 	case READ_BUFFER: {
 		struct scsi_read_buffer *cdb;
 
 		cdb = (struct scsi_read_buffer *)ctsio->cdb;
 		buffer_offset = scsi_3btoul(cdb->offset);
 		len = scsi_3btoul(cdb->length);
 		byte2 = cdb->byte2;
 		break;
 	}
 	case READ_BUFFER_16: {
 		struct scsi_read_buffer_16 *cdb;
 
 		cdb = (struct scsi_read_buffer_16 *)ctsio->cdb;
 		buffer_offset = scsi_8btou64(cdb->offset);
 		len = scsi_4btoul(cdb->length);
 		byte2 = cdb->byte2;
 		break;
 	}
 	default: /* This shouldn't happen. */
 		ctl_set_invalid_opcode(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	if (buffer_offset > CTL_WRITE_BUFFER_SIZE ||
 	    buffer_offset + len > CTL_WRITE_BUFFER_SIZE) {
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 6,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	if ((byte2 & RWB_MODE) == RWB_MODE_DESCR) {
 		descr[0] = 0;
 		scsi_ulto3b(CTL_WRITE_BUFFER_SIZE, &descr[1]);
 		ctsio->kern_data_ptr = descr;
 		len = min(len, sizeof(descr));
 	} else if ((byte2 & RWB_MODE) == RWB_MODE_ECHO_DESCR) {
 		ctsio->kern_data_ptr = echo_descr;
 		len = min(len, sizeof(echo_descr));
 	} else {
 		if (lun->write_buffer == NULL) {
 			lun->write_buffer = malloc(CTL_WRITE_BUFFER_SIZE,
 			    M_CTL, M_WAITOK);
 		}
 		ctsio->kern_data_ptr = lun->write_buffer + buffer_offset;
 	}
 	ctsio->kern_data_len = len;
 	ctsio->kern_total_len = len;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctl_set_success(ctsio);
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_write_buffer(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_write_buffer *cdb;
 	int buffer_offset, len;
 
 	CTL_DEBUG_PRINT(("ctl_write_buffer\n"));
 
 	cdb = (struct scsi_write_buffer *)ctsio->cdb;
 
 	len = scsi_3btoul(cdb->length);
 	buffer_offset = scsi_3btoul(cdb->offset);
 
 	if (buffer_offset + len > CTL_WRITE_BUFFER_SIZE) {
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 6,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * If we've got a kernel request that hasn't been malloced yet,
 	 * malloc it and tell the caller the data buffer is here.
 	 */
 	if ((ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) {
 		if (lun->write_buffer == NULL) {
 			lun->write_buffer = malloc(CTL_WRITE_BUFFER_SIZE,
 			    M_CTL, M_WAITOK);
 		}
 		ctsio->kern_data_ptr = lun->write_buffer + buffer_offset;
 		ctsio->kern_data_len = len;
 		ctsio->kern_total_len = len;
 		ctsio->kern_rel_offset = 0;
 		ctsio->kern_sg_entries = 0;
 		ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 		ctsio->be_move_done = ctl_config_move_done;
 		ctl_datamove((union ctl_io *)ctsio);
 
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	ctl_set_success(ctsio);
 	ctl_done((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_write_same(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct ctl_lba_len_flags *lbalen;
 	uint64_t lba;
 	uint32_t num_blocks;
 	int len, retval;
 	uint8_t byte2;
 
 	CTL_DEBUG_PRINT(("ctl_write_same\n"));
 
 	switch (ctsio->cdb[0]) {
 	case WRITE_SAME_10: {
 		struct scsi_write_same_10 *cdb;
 
 		cdb = (struct scsi_write_same_10 *)ctsio->cdb;
 
 		lba = scsi_4btoul(cdb->addr);
 		num_blocks = scsi_2btoul(cdb->length);
 		byte2 = cdb->byte2;
 		break;
 	}
 	case WRITE_SAME_16: {
 		struct scsi_write_same_16 *cdb;
 
 		cdb = (struct scsi_write_same_16 *)ctsio->cdb;
 
 		lba = scsi_8btou64(cdb->addr);
 		num_blocks = scsi_4btoul(cdb->length);
 		byte2 = cdb->byte2;
 		break;
 	}
 	default:
 		/*
 		 * We got a command we don't support.  This shouldn't
 		 * happen, commands should be filtered out above us.
 		 */
 		ctl_set_invalid_opcode(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 
 		return (CTL_RETVAL_COMPLETE);
 		break; /* NOTREACHED */
 	}
 
 	/* ANCHOR flag can be used only together with UNMAP */
 	if ((byte2 & SWS_UNMAP) == 0 && (byte2 & SWS_ANCHOR) != 0) {
 		ctl_set_invalid_field(ctsio, /*sks_valid*/ 1,
 		    /*command*/ 1, /*field*/ 1, /*bit_valid*/ 1, /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * The first check is to make sure we're in bounds, the second
 	 * check is to catch wrap-around problems.  If the lba + num blocks
 	 * is less than the lba, then we've wrapped around and the block
 	 * range is invalid anyway.
 	 */
 	if (((lba + num_blocks) > (lun->be_lun->maxlba + 1))
 	 || ((lba + num_blocks) < lba)) {
 		ctl_set_lba_out_of_range(ctsio,
 		    MAX(lba, lun->be_lun->maxlba + 1));
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/* Zero number of blocks means "to the last logical block" */
 	if (num_blocks == 0) {
 		if ((lun->be_lun->maxlba + 1) - lba > UINT32_MAX) {
 			ctl_set_invalid_field(ctsio,
 					      /*sks_valid*/ 0,
 					      /*command*/ 1,
 					      /*field*/ 0,
 					      /*bit_valid*/ 0,
 					      /*bit*/ 0);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 		num_blocks = (lun->be_lun->maxlba + 1) - lba;
 	}
 
 	len = lun->be_lun->blocksize;
 
 	/*
 	 * If we've got a kernel request that hasn't been malloced yet,
 	 * malloc it and tell the caller the data buffer is here.
 	 */
 	if ((byte2 & SWS_NDOB) == 0 &&
 	    (ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) {
 		ctsio->kern_data_ptr = malloc(len, M_CTL, M_WAITOK);
 		ctsio->kern_data_len = len;
 		ctsio->kern_total_len = len;
 		ctsio->kern_rel_offset = 0;
 		ctsio->kern_sg_entries = 0;
 		ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 		ctsio->be_move_done = ctl_config_move_done;
 		ctl_datamove((union ctl_io *)ctsio);
 
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	lbalen = (struct ctl_lba_len_flags *)&ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN];
 	lbalen->lba = lba;
 	lbalen->len = num_blocks;
 	lbalen->flags = byte2;
 	retval = lun->backend->config_write((union ctl_io *)ctsio);
 
 	return (retval);
 }
 
 int
 ctl_unmap(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_unmap *cdb;
 	struct ctl_ptr_len_flags *ptrlen;
 	struct scsi_unmap_header *hdr;
 	struct scsi_unmap_desc *buf, *end, *endnz, *range;
 	uint64_t lba;
 	uint32_t num_blocks;
 	int len, retval;
 	uint8_t byte2;
 
 	CTL_DEBUG_PRINT(("ctl_unmap\n"));
 
 	cdb = (struct scsi_unmap *)ctsio->cdb;
 	len = scsi_2btoul(cdb->length);
 	byte2 = cdb->byte2;
 
 	/*
 	 * If we've got a kernel request that hasn't been malloced yet,
 	 * malloc it and tell the caller the data buffer is here.
 	 */
 	if ((ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) {
 		ctsio->kern_data_ptr = malloc(len, M_CTL, M_WAITOK);
 		ctsio->kern_data_len = len;
 		ctsio->kern_total_len = len;
 		ctsio->kern_rel_offset = 0;
 		ctsio->kern_sg_entries = 0;
 		ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 		ctsio->be_move_done = ctl_config_move_done;
 		ctl_datamove((union ctl_io *)ctsio);
 
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	len = ctsio->kern_total_len - ctsio->kern_data_resid;
 	hdr = (struct scsi_unmap_header *)ctsio->kern_data_ptr;
 	if (len < sizeof (*hdr) ||
 	    len < (scsi_2btoul(hdr->length) + sizeof(hdr->length)) ||
 	    len < (scsi_2btoul(hdr->desc_length) + sizeof (*hdr)) ||
 	    scsi_2btoul(hdr->desc_length) % sizeof(*buf) != 0) {
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 0,
 				      /*command*/ 0,
 				      /*field*/ 0,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		goto done;
 	}
 	len = scsi_2btoul(hdr->desc_length);
 	buf = (struct scsi_unmap_desc *)(hdr + 1);
 	end = buf + len / sizeof(*buf);
 
 	endnz = buf;
 	for (range = buf; range < end; range++) {
 		lba = scsi_8btou64(range->lba);
 		num_blocks = scsi_4btoul(range->length);
 		if (((lba + num_blocks) > (lun->be_lun->maxlba + 1))
 		 || ((lba + num_blocks) < lba)) {
 			ctl_set_lba_out_of_range(ctsio,
 			    MAX(lba, lun->be_lun->maxlba + 1));
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 		if (num_blocks != 0)
 			endnz = range + 1;
 	}
 
 	/*
 	 * Block backend can not handle zero last range.
 	 * Filter it out and return if there is nothing left.
 	 */
 	len = (uint8_t *)endnz - (uint8_t *)buf;
 	if (len == 0) {
 		ctl_set_success(ctsio);
 		goto done;
 	}
 
 	mtx_lock(&lun->lun_lock);
 	ptrlen = (struct ctl_ptr_len_flags *)
 	    &ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN];
 	ptrlen->ptr = (void *)buf;
 	ptrlen->len = len;
 	ptrlen->flags = byte2;
 	ctl_check_blocked(lun);
 	mtx_unlock(&lun->lun_lock);
 
 	retval = lun->backend->config_write((union ctl_io *)ctsio);
 	return (retval);
 
 done:
 	if (ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) {
 		free(ctsio->kern_data_ptr, M_CTL);
 		ctsio->io_hdr.flags &= ~CTL_FLAG_ALLOCATED;
 	}
 	ctl_done((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_default_page_handler(struct ctl_scsiio *ctsio,
 			 struct ctl_page_index *page_index, uint8_t *page_ptr)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	uint8_t *current_cp;
 	int set_ua;
 	uint32_t initidx;
 
 	initidx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 	set_ua = 0;
 
 	current_cp = (page_index->page_data + (page_index->page_len *
 	    CTL_PAGE_CURRENT));
 
 	mtx_lock(&lun->lun_lock);
 	if (memcmp(current_cp, page_ptr, page_index->page_len)) {
 		memcpy(current_cp, page_ptr, page_index->page_len);
 		set_ua = 1;
 	}
 	if (set_ua != 0)
 		ctl_est_ua_all(lun, initidx, CTL_UA_MODE_CHANGE);
 	mtx_unlock(&lun->lun_lock);
 	if (set_ua) {
 		ctl_isc_announce_mode(lun,
 		    ctl_get_initindex(&ctsio->io_hdr.nexus),
 		    page_index->page_code, page_index->subpage);
 	}
 	return (CTL_RETVAL_COMPLETE);
 }
 
 static void
 ctl_ie_timer(void *arg)
 {
 	struct ctl_lun *lun = arg;
 	uint64_t t;
 
 	if (lun->ie_asc == 0)
 		return;
 
 	if (lun->MODE_IE.mrie == SIEP_MRIE_UA)
 		ctl_est_ua_all(lun, -1, CTL_UA_IE);
 	else
 		lun->ie_reported = 0;
 
 	if (lun->ie_reportcnt < scsi_4btoul(lun->MODE_IE.report_count)) {
 		lun->ie_reportcnt++;
 		t = scsi_4btoul(lun->MODE_IE.interval_timer);
 		if (t == 0 || t == UINT32_MAX)
 			t = 3000;  /* 5 min */
 		callout_schedule(&lun->ie_callout, t * hz / 10);
 	}
 }
 
 int
 ctl_ie_page_handler(struct ctl_scsiio *ctsio,
 			 struct ctl_page_index *page_index, uint8_t *page_ptr)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_info_exceptions_page *pg;
 	uint64_t t;
 
 	(void)ctl_default_page_handler(ctsio, page_index, page_ptr);
 
 	pg = (struct scsi_info_exceptions_page *)page_ptr;
 	mtx_lock(&lun->lun_lock);
 	if (pg->info_flags & SIEP_FLAGS_TEST) {
 		lun->ie_asc = 0x5d;
 		lun->ie_ascq = 0xff;
 		if (pg->mrie == SIEP_MRIE_UA) {
 			ctl_est_ua_all(lun, -1, CTL_UA_IE);
 			lun->ie_reported = 1;
 		} else {
 			ctl_clr_ua_all(lun, -1, CTL_UA_IE);
 			lun->ie_reported = -1;
 		}
 		lun->ie_reportcnt = 1;
 		if (lun->ie_reportcnt < scsi_4btoul(pg->report_count)) {
 			lun->ie_reportcnt++;
 			t = scsi_4btoul(pg->interval_timer);
 			if (t == 0 || t == UINT32_MAX)
 				t = 3000;  /* 5 min */
 			callout_reset(&lun->ie_callout, t * hz / 10,
 			    ctl_ie_timer, lun);
 		}
 	} else {
 		lun->ie_asc = 0;
 		lun->ie_ascq = 0;
 		lun->ie_reported = 1;
 		ctl_clr_ua_all(lun, -1, CTL_UA_IE);
 		lun->ie_reportcnt = UINT32_MAX;
 		callout_stop(&lun->ie_callout);
 	}
 	mtx_unlock(&lun->lun_lock);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 static int
 ctl_do_mode_select(union ctl_io *io)
 {
 	struct ctl_lun *lun = CTL_LUN(io);
 	struct scsi_mode_page_header *page_header;
 	struct ctl_page_index *page_index;
 	struct ctl_scsiio *ctsio;
 	int page_len, page_len_offset, page_len_size;
 	union ctl_modepage_info *modepage_info;
 	uint16_t *len_left, *len_used;
 	int retval, i;
 
 	ctsio = &io->scsiio;
 	page_index = NULL;
 	page_len = 0;
 
 	modepage_info = (union ctl_modepage_info *)
 		ctsio->io_hdr.ctl_private[CTL_PRIV_MODEPAGE].bytes;
 	len_left = &modepage_info->header.len_left;
 	len_used = &modepage_info->header.len_used;
 
 do_next_page:
 
 	page_header = (struct scsi_mode_page_header *)
 		(ctsio->kern_data_ptr + *len_used);
 
 	if (*len_left == 0) {
 		free(ctsio->kern_data_ptr, M_CTL);
 		ctl_set_success(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	} else if (*len_left < sizeof(struct scsi_mode_page_header)) {
 
 		free(ctsio->kern_data_ptr, M_CTL);
 		ctl_set_param_len_error(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 
 	} else if ((page_header->page_code & SMPH_SPF)
 		&& (*len_left < sizeof(struct scsi_mode_page_header_sp))) {
 
 		free(ctsio->kern_data_ptr, M_CTL);
 		ctl_set_param_len_error(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 
 	/*
 	 * XXX KDM should we do something with the block descriptor?
 	 */
 	for (i = 0; i < CTL_NUM_MODE_PAGES; i++) {
 		page_index = &lun->mode_pages.index[i];
 		if (lun->be_lun->lun_type == T_DIRECT &&
 		    (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0)
 			continue;
 		if (lun->be_lun->lun_type == T_PROCESSOR &&
 		    (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0)
 			continue;
 		if (lun->be_lun->lun_type == T_CDROM &&
 		    (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0)
 			continue;
 
 		if ((page_index->page_code & SMPH_PC_MASK) !=
 		    (page_header->page_code & SMPH_PC_MASK))
 			continue;
 
 		/*
 		 * If neither page has a subpage code, then we've got a
 		 * match.
 		 */
 		if (((page_index->page_code & SMPH_SPF) == 0)
 		 && ((page_header->page_code & SMPH_SPF) == 0)) {
 			page_len = page_header->page_length;
 			break;
 		}
 
 		/*
 		 * If both pages have subpages, then the subpage numbers
 		 * have to match.
 		 */
 		if ((page_index->page_code & SMPH_SPF)
 		  && (page_header->page_code & SMPH_SPF)) {
 			struct scsi_mode_page_header_sp *sph;
 
 			sph = (struct scsi_mode_page_header_sp *)page_header;
 			if (page_index->subpage == sph->subpage) {
 				page_len = scsi_2btoul(sph->page_length);
 				break;
 			}
 		}
 	}
 
 	/*
 	 * If we couldn't find the page, or if we don't have a mode select
 	 * handler for it, send back an error to the user.
 	 */
 	if ((i >= CTL_NUM_MODE_PAGES)
 	 || (page_index->select_handler == NULL)) {
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 0,
 				      /*field*/ *len_used,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		free(ctsio->kern_data_ptr, M_CTL);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	if (page_index->page_code & SMPH_SPF) {
 		page_len_offset = 2;
 		page_len_size = 2;
 	} else {
 		page_len_size = 1;
 		page_len_offset = 1;
 	}
 
 	/*
 	 * If the length the initiator gives us isn't the one we specify in
 	 * the mode page header, or if they didn't specify enough data in
 	 * the CDB to avoid truncating this page, kick out the request.
 	 */
 	if (page_len != page_index->page_len - page_len_offset - page_len_size) {
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 0,
 				      /*field*/ *len_used + page_len_offset,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		free(ctsio->kern_data_ptr, M_CTL);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 	if (*len_left < page_index->page_len) {
 		free(ctsio->kern_data_ptr, M_CTL);
 		ctl_set_param_len_error(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * Run through the mode page, checking to make sure that the bits
 	 * the user changed are actually legal for him to change.
 	 */
 	for (i = 0; i < page_index->page_len; i++) {
 		uint8_t *user_byte, *change_mask, *current_byte;
 		int bad_bit;
 		int j;
 
 		user_byte = (uint8_t *)page_header + i;
 		change_mask = page_index->page_data +
 			      (page_index->page_len * CTL_PAGE_CHANGEABLE) + i;
 		current_byte = page_index->page_data +
 			       (page_index->page_len * CTL_PAGE_CURRENT) + i;
 
 		/*
 		 * Check to see whether the user set any bits in this byte
 		 * that he is not allowed to set.
 		 */
 		if ((*user_byte & ~(*change_mask)) ==
 		    (*current_byte & ~(*change_mask)))
 			continue;
 
 		/*
 		 * Go through bit by bit to determine which one is illegal.
 		 */
 		bad_bit = 0;
 		for (j = 7; j >= 0; j--) {
 			if ((((1 << i) & ~(*change_mask)) & *user_byte) !=
 			    (((1 << i) & ~(*change_mask)) & *current_byte)) {
 				bad_bit = i;
 				break;
 			}
 		}
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 0,
 				      /*field*/ *len_used + i,
 				      /*bit_valid*/ 1,
 				      /*bit*/ bad_bit);
 		free(ctsio->kern_data_ptr, M_CTL);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * Decrement these before we call the page handler, since we may
 	 * end up getting called back one way or another before the handler
 	 * returns to this context.
 	 */
 	*len_left -= page_index->page_len;
 	*len_used += page_index->page_len;
 
 	retval = page_index->select_handler(ctsio, page_index,
 					    (uint8_t *)page_header);
 
 	/*
 	 * If the page handler returns CTL_RETVAL_QUEUED, then we need to
 	 * wait until this queued command completes to finish processing
 	 * the mode page.  If it returns anything other than
 	 * CTL_RETVAL_COMPLETE (e.g. CTL_RETVAL_ERROR), then it should have
 	 * already set the sense information, freed the data pointer, and
 	 * completed the io for us.
 	 */
 	if (retval != CTL_RETVAL_COMPLETE)
 		goto bailout_no_done;
 
 	/*
 	 * If the initiator sent us more than one page, parse the next one.
 	 */
 	if (*len_left > 0)
 		goto do_next_page;
 
 	ctl_set_success(ctsio);
 	free(ctsio->kern_data_ptr, M_CTL);
 	ctl_done((union ctl_io *)ctsio);
 
 bailout_no_done:
 
 	return (CTL_RETVAL_COMPLETE);
 
 }
 
 int
 ctl_mode_select(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	union ctl_modepage_info *modepage_info;
 	int bd_len, i, header_size, param_len, pf, rtd, sp;
 	uint32_t initidx;
 
 	initidx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 	switch (ctsio->cdb[0]) {
 	case MODE_SELECT_6: {
 		struct scsi_mode_select_6 *cdb;
 
 		cdb = (struct scsi_mode_select_6 *)ctsio->cdb;
 
 		pf = (cdb->byte2 & SMS_PF) ? 1 : 0;
 		rtd = (cdb->byte2 & SMS_RTD) ? 1 : 0;
 		sp = (cdb->byte2 & SMS_SP) ? 1 : 0;
 		param_len = cdb->length;
 		header_size = sizeof(struct scsi_mode_header_6);
 		break;
 	}
 	case MODE_SELECT_10: {
 		struct scsi_mode_select_10 *cdb;
 
 		cdb = (struct scsi_mode_select_10 *)ctsio->cdb;
 
 		pf = (cdb->byte2 & SMS_PF) ? 1 : 0;
 		rtd = (cdb->byte2 & SMS_RTD) ? 1 : 0;
 		sp = (cdb->byte2 & SMS_SP) ? 1 : 0;
 		param_len = scsi_2btoul(cdb->length);
 		header_size = sizeof(struct scsi_mode_header_10);
 		break;
 	}
 	default:
 		ctl_set_invalid_opcode(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	if (rtd) {
 		if (param_len != 0) {
 			ctl_set_invalid_field(ctsio, /*sks_valid*/ 0,
 			    /*command*/ 1, /*field*/ 0,
 			    /*bit_valid*/ 0, /*bit*/ 0);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 
 		/* Revert to defaults. */
 		ctl_init_page_index(lun);
 		mtx_lock(&lun->lun_lock);
 		ctl_est_ua_all(lun, initidx, CTL_UA_MODE_CHANGE);
 		mtx_unlock(&lun->lun_lock);
 		for (i = 0; i < CTL_NUM_MODE_PAGES; i++) {
 			ctl_isc_announce_mode(lun, -1,
 			    lun->mode_pages.index[i].page_code & SMPH_PC_MASK,
 			    lun->mode_pages.index[i].subpage);
 		}
 		ctl_set_success(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * From SPC-3:
 	 * "A parameter list length of zero indicates that the Data-Out Buffer
 	 * shall be empty. This condition shall not be considered as an error."
 	 */
 	if (param_len == 0) {
 		ctl_set_success(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * Since we'll hit this the first time through, prior to
 	 * allocation, we don't need to free a data buffer here.
 	 */
 	if (param_len < header_size) {
 		ctl_set_param_len_error(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * Allocate the data buffer and grab the user's data.  In theory,
 	 * we shouldn't have to sanity check the parameter list length here
 	 * because the maximum size is 64K.  We should be able to malloc
 	 * that much without too many problems.
 	 */
 	if ((ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) {
 		ctsio->kern_data_ptr = malloc(param_len, M_CTL, M_WAITOK);
 		ctsio->kern_data_len = param_len;
 		ctsio->kern_total_len = param_len;
 		ctsio->kern_rel_offset = 0;
 		ctsio->kern_sg_entries = 0;
 		ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 		ctsio->be_move_done = ctl_config_move_done;
 		ctl_datamove((union ctl_io *)ctsio);
 
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	switch (ctsio->cdb[0]) {
 	case MODE_SELECT_6: {
 		struct scsi_mode_header_6 *mh6;
 
 		mh6 = (struct scsi_mode_header_6 *)ctsio->kern_data_ptr;
 		bd_len = mh6->blk_desc_len;
 		break;
 	}
 	case MODE_SELECT_10: {
 		struct scsi_mode_header_10 *mh10;
 
 		mh10 = (struct scsi_mode_header_10 *)ctsio->kern_data_ptr;
 		bd_len = scsi_2btoul(mh10->blk_desc_len);
 		break;
 	}
 	default:
 		panic("%s: Invalid CDB type %#x", __func__, ctsio->cdb[0]);
 	}
 
 	if (param_len < (header_size + bd_len)) {
 		free(ctsio->kern_data_ptr, M_CTL);
 		ctl_set_param_len_error(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * Set the IO_CONT flag, so that if this I/O gets passed to
 	 * ctl_config_write_done(), it'll get passed back to
 	 * ctl_do_mode_select() for further processing, or completion if
 	 * we're all done.
 	 */
 	ctsio->io_hdr.flags |= CTL_FLAG_IO_CONT;
 	ctsio->io_cont = ctl_do_mode_select;
 
 	modepage_info = (union ctl_modepage_info *)
 		ctsio->io_hdr.ctl_private[CTL_PRIV_MODEPAGE].bytes;
 	memset(modepage_info, 0, sizeof(*modepage_info));
 	modepage_info->header.len_left = param_len - header_size - bd_len;
 	modepage_info->header.len_used = header_size + bd_len;
 
 	return (ctl_do_mode_select((union ctl_io *)ctsio));
 }
 
 int
 ctl_mode_sense(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	int pc, page_code, dbd, llba, subpage;
 	int alloc_len, page_len, header_len, total_len;
 	struct scsi_mode_block_descr *block_desc;
 	struct ctl_page_index *page_index;
 
 	dbd = 0;
 	llba = 0;
 	block_desc = NULL;
 
 	CTL_DEBUG_PRINT(("ctl_mode_sense\n"));
 
 	switch (ctsio->cdb[0]) {
 	case MODE_SENSE_6: {
 		struct scsi_mode_sense_6 *cdb;
 
 		cdb = (struct scsi_mode_sense_6 *)ctsio->cdb;
 
 		header_len = sizeof(struct scsi_mode_hdr_6);
 		if (cdb->byte2 & SMS_DBD)
 			dbd = 1;
 		else
 			header_len += sizeof(struct scsi_mode_block_descr);
 
 		pc = (cdb->page & SMS_PAGE_CTRL_MASK) >> 6;
 		page_code = cdb->page & SMS_PAGE_CODE;
 		subpage = cdb->subpage;
 		alloc_len = cdb->length;
 		break;
 	}
 	case MODE_SENSE_10: {
 		struct scsi_mode_sense_10 *cdb;
 
 		cdb = (struct scsi_mode_sense_10 *)ctsio->cdb;
 
 		header_len = sizeof(struct scsi_mode_hdr_10);
 
 		if (cdb->byte2 & SMS_DBD)
 			dbd = 1;
 		else
 			header_len += sizeof(struct scsi_mode_block_descr);
 		if (cdb->byte2 & SMS10_LLBAA)
 			llba = 1;
 		pc = (cdb->page & SMS_PAGE_CTRL_MASK) >> 6;
 		page_code = cdb->page & SMS_PAGE_CODE;
 		subpage = cdb->subpage;
 		alloc_len = scsi_2btoul(cdb->length);
 		break;
 	}
 	default:
 		ctl_set_invalid_opcode(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 		break; /* NOTREACHED */
 	}
 
 	/*
 	 * We have to make a first pass through to calculate the size of
 	 * the pages that match the user's query.  Then we allocate enough
 	 * memory to hold it, and actually copy the data into the buffer.
 	 */
 	switch (page_code) {
 	case SMS_ALL_PAGES_PAGE: {
 		u_int i;
 
 		page_len = 0;
 
 		/*
 		 * At the moment, values other than 0 and 0xff here are
 		 * reserved according to SPC-3.
 		 */
 		if ((subpage != SMS_SUBPAGE_PAGE_0)
 		 && (subpage != SMS_SUBPAGE_ALL)) {
 			ctl_set_invalid_field(ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 1,
 					      /*field*/ 3,
 					      /*bit_valid*/ 0,
 					      /*bit*/ 0);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 
 		for (i = 0; i < CTL_NUM_MODE_PAGES; i++) {
 			page_index = &lun->mode_pages.index[i];
 
 			/* Make sure the page is supported for this dev type */
 			if (lun->be_lun->lun_type == T_DIRECT &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0)
 				continue;
 			if (lun->be_lun->lun_type == T_PROCESSOR &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0)
 				continue;
 			if (lun->be_lun->lun_type == T_CDROM &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0)
 				continue;
 
 			/*
 			 * We don't use this subpage if the user didn't
 			 * request all subpages.
 			 */
 			if ((page_index->subpage != 0)
 			 && (subpage == SMS_SUBPAGE_PAGE_0))
 				continue;
 
 #if 0
 			printf("found page %#x len %d\n",
 			       page_index->page_code & SMPH_PC_MASK,
 			       page_index->page_len);
 #endif
 			page_len += page_index->page_len;
 		}
 		break;
 	}
 	default: {
 		u_int i;
 
 		page_len = 0;
 
 		for (i = 0; i < CTL_NUM_MODE_PAGES; i++) {
 			page_index = &lun->mode_pages.index[i];
 
 			/* Make sure the page is supported for this dev type */
 			if (lun->be_lun->lun_type == T_DIRECT &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0)
 				continue;
 			if (lun->be_lun->lun_type == T_PROCESSOR &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0)
 				continue;
 			if (lun->be_lun->lun_type == T_CDROM &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0)
 				continue;
 
 			/* Look for the right page code */
 			if ((page_index->page_code & SMPH_PC_MASK) != page_code)
 				continue;
 
 			/* Look for the right subpage or the subpage wildcard*/
 			if ((page_index->subpage != subpage)
 			 && (subpage != SMS_SUBPAGE_ALL))
 				continue;
 
 #if 0
 			printf("found page %#x len %d\n",
 			       page_index->page_code & SMPH_PC_MASK,
 			       page_index->page_len);
 #endif
 
 			page_len += page_index->page_len;
 		}
 
 		if (page_len == 0) {
 			ctl_set_invalid_field(ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 1,
 					      /*field*/ 2,
 					      /*bit_valid*/ 1,
 					      /*bit*/ 5);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 		break;
 	}
 	}
 
 	total_len = header_len + page_len;
 #if 0
 	printf("header_len = %d, page_len = %d, total_len = %d\n",
 	       header_len, page_len, total_len);
 #endif
 
 	ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(total_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	switch (ctsio->cdb[0]) {
 	case MODE_SENSE_6: {
 		struct scsi_mode_hdr_6 *header;
 
 		header = (struct scsi_mode_hdr_6 *)ctsio->kern_data_ptr;
 
 		header->datalen = MIN(total_len - 1, 254);
 		if (lun->be_lun->lun_type == T_DIRECT) {
 			header->dev_specific = 0x10; /* DPOFUA */
 			if ((lun->be_lun->flags & CTL_LUN_FLAG_READONLY) ||
 			    (lun->MODE_CTRL.eca_and_aen & SCP_SWP) != 0)
 				header->dev_specific |= 0x80; /* WP */
 		}
 		if (dbd)
 			header->block_descr_len = 0;
 		else
 			header->block_descr_len =
 				sizeof(struct scsi_mode_block_descr);
 		block_desc = (struct scsi_mode_block_descr *)&header[1];
 		break;
 	}
 	case MODE_SENSE_10: {
 		struct scsi_mode_hdr_10 *header;
 		int datalen;
 
 		header = (struct scsi_mode_hdr_10 *)ctsio->kern_data_ptr;
 
 		datalen = MIN(total_len - 2, 65533);
 		scsi_ulto2b(datalen, header->datalen);
 		if (lun->be_lun->lun_type == T_DIRECT) {
 			header->dev_specific = 0x10; /* DPOFUA */
 			if ((lun->be_lun->flags & CTL_LUN_FLAG_READONLY) ||
 			    (lun->MODE_CTRL.eca_and_aen & SCP_SWP) != 0)
 				header->dev_specific |= 0x80; /* WP */
 		}
 		if (dbd)
 			scsi_ulto2b(0, header->block_descr_len);
 		else
 			scsi_ulto2b(sizeof(struct scsi_mode_block_descr),
 				    header->block_descr_len);
 		block_desc = (struct scsi_mode_block_descr *)&header[1];
 		break;
 	}
 	default:
 		panic("%s: Invalid CDB type %#x", __func__, ctsio->cdb[0]);
 	}
 
 	/*
 	 * If we've got a disk, use its blocksize in the block
 	 * descriptor.  Otherwise, just set it to 0.
 	 */
 	if (dbd == 0) {
 		if (lun->be_lun->lun_type == T_DIRECT)
 			scsi_ulto3b(lun->be_lun->blocksize,
 				    block_desc->block_len);
 		else
 			scsi_ulto3b(0, block_desc->block_len);
 	}
 
 	switch (page_code) {
 	case SMS_ALL_PAGES_PAGE: {
 		int i, data_used;
 
 		data_used = header_len;
 		for (i = 0; i < CTL_NUM_MODE_PAGES; i++) {
 			struct ctl_page_index *page_index;
 
 			page_index = &lun->mode_pages.index[i];
 			if (lun->be_lun->lun_type == T_DIRECT &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0)
 				continue;
 			if (lun->be_lun->lun_type == T_PROCESSOR &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0)
 				continue;
 			if (lun->be_lun->lun_type == T_CDROM &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0)
 				continue;
 
 			/*
 			 * We don't use this subpage if the user didn't
 			 * request all subpages.  We already checked (above)
 			 * to make sure the user only specified a subpage
 			 * of 0 or 0xff in the SMS_ALL_PAGES_PAGE case.
 			 */
 			if ((page_index->subpage != 0)
 			 && (subpage == SMS_SUBPAGE_PAGE_0))
 				continue;
 
 			/*
 			 * Call the handler, if it exists, to update the
 			 * page to the latest values.
 			 */
 			if (page_index->sense_handler != NULL)
 				page_index->sense_handler(ctsio, page_index,pc);
 
 			memcpy(ctsio->kern_data_ptr + data_used,
 			       page_index->page_data +
 			       (page_index->page_len * pc),
 			       page_index->page_len);
 			data_used += page_index->page_len;
 		}
 		break;
 	}
 	default: {
 		int i, data_used;
 
 		data_used = header_len;
 
 		for (i = 0; i < CTL_NUM_MODE_PAGES; i++) {
 			struct ctl_page_index *page_index;
 
 			page_index = &lun->mode_pages.index[i];
 
 			/* Look for the right page code */
 			if ((page_index->page_code & SMPH_PC_MASK) != page_code)
 				continue;
 
 			/* Look for the right subpage or the subpage wildcard*/
 			if ((page_index->subpage != subpage)
 			 && (subpage != SMS_SUBPAGE_ALL))
 				continue;
 
 			/* Make sure the page is supported for this dev type */
 			if (lun->be_lun->lun_type == T_DIRECT &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0)
 				continue;
 			if (lun->be_lun->lun_type == T_PROCESSOR &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0)
 				continue;
 			if (lun->be_lun->lun_type == T_CDROM &&
 			    (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0)
 				continue;
 
 			/*
 			 * Call the handler, if it exists, to update the
 			 * page to the latest values.
 			 */
 			if (page_index->sense_handler != NULL)
 				page_index->sense_handler(ctsio, page_index,pc);
 
 			memcpy(ctsio->kern_data_ptr + data_used,
 			       page_index->page_data +
 			       (page_index->page_len * pc),
 			       page_index->page_len);
 			data_used += page_index->page_len;
 		}
 		break;
 	}
 	}
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_lbp_log_sense_handler(struct ctl_scsiio *ctsio,
 			       struct ctl_page_index *page_index,
 			       int pc)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_log_param_header *phdr;
 	uint8_t *data;
 	uint64_t val;
 
 	data = page_index->page_data;
 
 	if (lun->backend->lun_attr != NULL &&
 	    (val = lun->backend->lun_attr(lun->be_lun->be_lun, "blocksavail"))
 	     != UINT64_MAX) {
 		phdr = (struct scsi_log_param_header *)data;
 		scsi_ulto2b(0x0001, phdr->param_code);
 		phdr->param_control = SLP_LBIN | SLP_LP;
 		phdr->param_len = 8;
 		data = (uint8_t *)(phdr + 1);
 		scsi_ulto4b(val >> CTL_LBP_EXPONENT, data);
 		data[4] = 0x02; /* per-pool */
 		data += phdr->param_len;
 	}
 
 	if (lun->backend->lun_attr != NULL &&
 	    (val = lun->backend->lun_attr(lun->be_lun->be_lun, "blocksused"))
 	     != UINT64_MAX) {
 		phdr = (struct scsi_log_param_header *)data;
 		scsi_ulto2b(0x0002, phdr->param_code);
 		phdr->param_control = SLP_LBIN | SLP_LP;
 		phdr->param_len = 8;
 		data = (uint8_t *)(phdr + 1);
 		scsi_ulto4b(val >> CTL_LBP_EXPONENT, data);
 		data[4] = 0x01; /* per-LUN */
 		data += phdr->param_len;
 	}
 
 	if (lun->backend->lun_attr != NULL &&
 	    (val = lun->backend->lun_attr(lun->be_lun->be_lun, "poolblocksavail"))
 	     != UINT64_MAX) {
 		phdr = (struct scsi_log_param_header *)data;
 		scsi_ulto2b(0x00f1, phdr->param_code);
 		phdr->param_control = SLP_LBIN | SLP_LP;
 		phdr->param_len = 8;
 		data = (uint8_t *)(phdr + 1);
 		scsi_ulto4b(val >> CTL_LBP_EXPONENT, data);
 		data[4] = 0x02; /* per-pool */
 		data += phdr->param_len;
 	}
 
 	if (lun->backend->lun_attr != NULL &&
 	    (val = lun->backend->lun_attr(lun->be_lun->be_lun, "poolblocksused"))
 	     != UINT64_MAX) {
 		phdr = (struct scsi_log_param_header *)data;
 		scsi_ulto2b(0x00f2, phdr->param_code);
 		phdr->param_control = SLP_LBIN | SLP_LP;
 		phdr->param_len = 8;
 		data = (uint8_t *)(phdr + 1);
 		scsi_ulto4b(val >> CTL_LBP_EXPONENT, data);
 		data[4] = 0x02; /* per-pool */
 		data += phdr->param_len;
 	}
 
 	page_index->page_len = data - page_index->page_data;
 	return (0);
 }
 
 int
 ctl_sap_log_sense_handler(struct ctl_scsiio *ctsio,
 			       struct ctl_page_index *page_index,
 			       int pc)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct stat_page *data;
 	struct bintime *t;
 
 	data = (struct stat_page *)page_index->page_data;
 
 	scsi_ulto2b(SLP_SAP, data->sap.hdr.param_code);
 	data->sap.hdr.param_control = SLP_LBIN;
 	data->sap.hdr.param_len = sizeof(struct scsi_log_stat_and_perf) -
 	    sizeof(struct scsi_log_param_header);
 	scsi_u64to8b(lun->stats.operations[CTL_STATS_READ],
 	    data->sap.read_num);
 	scsi_u64to8b(lun->stats.operations[CTL_STATS_WRITE],
 	    data->sap.write_num);
 	if (lun->be_lun->blocksize > 0) {
 		scsi_u64to8b(lun->stats.bytes[CTL_STATS_WRITE] /
 		    lun->be_lun->blocksize, data->sap.recvieved_lba);
 		scsi_u64to8b(lun->stats.bytes[CTL_STATS_READ] /
 		    lun->be_lun->blocksize, data->sap.transmitted_lba);
 	}
 	t = &lun->stats.time[CTL_STATS_READ];
 	scsi_u64to8b((uint64_t)t->sec * 1000 + t->frac / (UINT64_MAX / 1000),
 	    data->sap.read_int);
 	t = &lun->stats.time[CTL_STATS_WRITE];
 	scsi_u64to8b((uint64_t)t->sec * 1000 + t->frac / (UINT64_MAX / 1000),
 	    data->sap.write_int);
 	scsi_u64to8b(0, data->sap.weighted_num);
 	scsi_u64to8b(0, data->sap.weighted_int);
 	scsi_ulto2b(SLP_IT, data->it.hdr.param_code);
 	data->it.hdr.param_control = SLP_LBIN;
 	data->it.hdr.param_len = sizeof(struct scsi_log_idle_time) -
 	    sizeof(struct scsi_log_param_header);
 #ifdef CTL_TIME_IO
 	scsi_u64to8b(lun->idle_time / SBT_1MS, data->it.idle_int);
 #endif
 	scsi_ulto2b(SLP_TI, data->ti.hdr.param_code);
 	data->it.hdr.param_control = SLP_LBIN;
 	data->ti.hdr.param_len = sizeof(struct scsi_log_time_interval) -
 	    sizeof(struct scsi_log_param_header);
 	scsi_ulto4b(3, data->ti.exponent);
 	scsi_ulto4b(1, data->ti.integer);
 	return (0);
 }
 
 int
 ctl_ie_log_sense_handler(struct ctl_scsiio *ctsio,
 			       struct ctl_page_index *page_index,
 			       int pc)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_log_informational_exceptions *data;
 
 	data = (struct scsi_log_informational_exceptions *)page_index->page_data;
 
 	scsi_ulto2b(SLP_IE_GEN, data->hdr.param_code);
 	data->hdr.param_control = SLP_LBIN;
 	data->hdr.param_len = sizeof(struct scsi_log_informational_exceptions) -
 	    sizeof(struct scsi_log_param_header);
 	data->ie_asc = lun->ie_asc;
 	data->ie_ascq = lun->ie_ascq;
 	data->temperature = 0xff;
 	return (0);
 }
 
 int
 ctl_log_sense(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	int i, pc, page_code, subpage;
 	int alloc_len, total_len;
 	struct ctl_page_index *page_index;
 	struct scsi_log_sense *cdb;
 	struct scsi_log_header *header;
 
 	CTL_DEBUG_PRINT(("ctl_log_sense\n"));
 
 	cdb = (struct scsi_log_sense *)ctsio->cdb;
 	pc = (cdb->page & SLS_PAGE_CTRL_MASK) >> 6;
 	page_code = cdb->page & SLS_PAGE_CODE;
 	subpage = cdb->subpage;
 	alloc_len = scsi_2btoul(cdb->length);
 
 	page_index = NULL;
 	for (i = 0; i < CTL_NUM_LOG_PAGES; i++) {
 		page_index = &lun->log_pages.index[i];
 
 		/* Look for the right page code */
 		if ((page_index->page_code & SL_PAGE_CODE) != page_code)
 			continue;
 
 		/* Look for the right subpage or the subpage wildcard*/
 		if (page_index->subpage != subpage)
 			continue;
 
 		break;
 	}
 	if (i >= CTL_NUM_LOG_PAGES) {
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 2,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	total_len = sizeof(struct scsi_log_header) + page_index->page_len;
 
 	ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(total_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	header = (struct scsi_log_header *)ctsio->kern_data_ptr;
 	header->page = page_index->page_code;
 	if (page_index->page_code == SLS_LOGICAL_BLOCK_PROVISIONING)
 		header->page |= SL_DS;
 	if (page_index->subpage) {
 		header->page |= SL_SPF;
 		header->subpage = page_index->subpage;
 	}
 	scsi_ulto2b(page_index->page_len, header->datalen);
 
 	/*
 	 * Call the handler, if it exists, to update the
 	 * page to the latest values.
 	 */
 	if (page_index->sense_handler != NULL)
 		page_index->sense_handler(ctsio, page_index, pc);
 
 	memcpy(header + 1, page_index->page_data, page_index->page_len);
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_read_capacity(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_read_capacity *cdb;
 	struct scsi_read_capacity_data *data;
 	uint32_t lba;
 
 	CTL_DEBUG_PRINT(("ctl_read_capacity\n"));
 
 	cdb = (struct scsi_read_capacity *)ctsio->cdb;
 
 	lba = scsi_4btoul(cdb->addr);
 	if (((cdb->pmi & SRC_PMI) == 0)
 	 && (lba != 0)) {
 		ctl_set_invalid_field(/*ctsio*/ ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 2,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	ctsio->kern_data_ptr = malloc(sizeof(*data), M_CTL, M_WAITOK | M_ZERO);
 	data = (struct scsi_read_capacity_data *)ctsio->kern_data_ptr;
 	ctsio->kern_data_len = sizeof(*data);
 	ctsio->kern_total_len = sizeof(*data);
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 
 	/*
 	 * If the maximum LBA is greater than 0xfffffffe, the user must
 	 * issue a SERVICE ACTION IN (16) command, with the read capacity
 	 * serivce action set.
 	 */
 	if (lun->be_lun->maxlba > 0xfffffffe)
 		scsi_ulto4b(0xffffffff, data->addr);
 	else
 		scsi_ulto4b(lun->be_lun->maxlba, data->addr);
 
 	/*
 	 * XXX KDM this may not be 512 bytes...
 	 */
 	scsi_ulto4b(lun->be_lun->blocksize, data->length);
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_read_capacity_16(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_read_capacity_16 *cdb;
 	struct scsi_read_capacity_data_long *data;
 	uint64_t lba;
 	uint32_t alloc_len;
 
 	CTL_DEBUG_PRINT(("ctl_read_capacity_16\n"));
 
 	cdb = (struct scsi_read_capacity_16 *)ctsio->cdb;
 
 	alloc_len = scsi_4btoul(cdb->alloc_len);
 	lba = scsi_8btou64(cdb->addr);
 
 	if ((cdb->reladr & SRC16_PMI)
 	 && (lba != 0)) {
 		ctl_set_invalid_field(/*ctsio*/ ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 2,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	ctsio->kern_data_ptr = malloc(sizeof(*data), M_CTL, M_WAITOK | M_ZERO);
 	data = (struct scsi_read_capacity_data_long *)ctsio->kern_data_ptr;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(sizeof(*data), alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	scsi_u64to8b(lun->be_lun->maxlba, data->addr);
 	/* XXX KDM this may not be 512 bytes... */
 	scsi_ulto4b(lun->be_lun->blocksize, data->length);
 	data->prot_lbppbe = lun->be_lun->pblockexp & SRC16_LBPPBE;
 	scsi_ulto2b(lun->be_lun->pblockoff & SRC16_LALBA_A, data->lalba_lbp);
 	if (lun->be_lun->flags & CTL_LUN_FLAG_UNMAP)
 		data->lalba_lbp[0] |= SRC16_LBPME | SRC16_LBPRZ;
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_get_lba_status(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_get_lba_status *cdb;
 	struct scsi_get_lba_status_data *data;
 	struct ctl_lba_len_flags *lbalen;
 	uint64_t lba;
 	uint32_t alloc_len, total_len;
 	int retval;
 
 	CTL_DEBUG_PRINT(("ctl_get_lba_status\n"));
 
 	cdb = (struct scsi_get_lba_status *)ctsio->cdb;
 	lba = scsi_8btou64(cdb->addr);
 	alloc_len = scsi_4btoul(cdb->alloc_len);
 
 	if (lba > lun->be_lun->maxlba) {
 		ctl_set_lba_out_of_range(ctsio, lba);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	total_len = sizeof(*data) + sizeof(data->descr[0]);
 	ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO);
 	data = (struct scsi_get_lba_status_data *)ctsio->kern_data_ptr;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(total_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/* Fill dummy data in case backend can't tell anything. */
 	scsi_ulto4b(4 + sizeof(data->descr[0]), data->length);
 	scsi_u64to8b(lba, data->descr[0].addr);
 	scsi_ulto4b(MIN(UINT32_MAX, lun->be_lun->maxlba + 1 - lba),
 	    data->descr[0].length);
 	data->descr[0].status = 0; /* Mapped or unknown. */
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 
 	lbalen = (struct ctl_lba_len_flags *)&ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN];
 	lbalen->lba = lba;
 	lbalen->len = total_len;
 	lbalen->flags = 0;
 	retval = lun->backend->config_read((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_read_defect(struct ctl_scsiio *ctsio)
 {
 	struct scsi_read_defect_data_10 *ccb10;
 	struct scsi_read_defect_data_12 *ccb12;
 	struct scsi_read_defect_data_hdr_10 *data10;
 	struct scsi_read_defect_data_hdr_12 *data12;
 	uint32_t alloc_len, data_len;
 	uint8_t format;
 
 	CTL_DEBUG_PRINT(("ctl_read_defect\n"));
 
 	if (ctsio->cdb[0] == READ_DEFECT_DATA_10) {
 		ccb10 = (struct scsi_read_defect_data_10 *)&ctsio->cdb;
 		format = ccb10->format;
 		alloc_len = scsi_2btoul(ccb10->alloc_length);
 		data_len = sizeof(*data10);
 	} else {
 		ccb12 = (struct scsi_read_defect_data_12 *)&ctsio->cdb;
 		format = ccb12->format;
 		alloc_len = scsi_4btoul(ccb12->alloc_length);
 		data_len = sizeof(*data12);
 	}
 	if (alloc_len == 0) {
 		ctl_set_success(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	if (ctsio->cdb[0] == READ_DEFECT_DATA_10) {
 		data10 = (struct scsi_read_defect_data_hdr_10 *)
 		    ctsio->kern_data_ptr;
 		data10->format = format;
 		scsi_ulto2b(0, data10->length);
 	} else {
 		data12 = (struct scsi_read_defect_data_hdr_12 *)
 		    ctsio->kern_data_ptr;
 		data12->format = format;
 		scsi_ulto2b(0, data12->generation);
 		scsi_ulto4b(0, data12->length);
 	}
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_report_tagret_port_groups(struct ctl_scsiio *ctsio)
 {
 	struct ctl_softc *softc = CTL_SOFTC(ctsio);
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_maintenance_in *cdb;
 	int retval;
 	int alloc_len, ext, total_len = 0, g, pc, pg, ts, os;
 	int num_ha_groups, num_target_ports, shared_group;
 	struct ctl_port *port;
 	struct scsi_target_group_data *rtg_ptr;
 	struct scsi_target_group_data_extended *rtg_ext_ptr;
 	struct scsi_target_port_group_descriptor *tpg_desc;
 
 	CTL_DEBUG_PRINT(("ctl_report_tagret_port_groups\n"));
 
 	cdb = (struct scsi_maintenance_in *)ctsio->cdb;
 	retval = CTL_RETVAL_COMPLETE;
 
 	switch (cdb->byte2 & STG_PDF_MASK) {
 	case STG_PDF_LENGTH:
 		ext = 0;
 		break;
 	case STG_PDF_EXTENDED:
 		ext = 1;
 		break;
 	default:
 		ctl_set_invalid_field(/*ctsio*/ ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 2,
 				      /*bit_valid*/ 1,
 				      /*bit*/ 5);
 		ctl_done((union ctl_io *)ctsio);
 		return(retval);
 	}
 
 	num_target_ports = 0;
 	shared_group = (softc->is_single != 0);
 	mtx_lock(&softc->ctl_lock);
 	STAILQ_FOREACH(port, &softc->port_list, links) {
 		if ((port->status & CTL_PORT_STATUS_ONLINE) == 0)
 			continue;
 		if (ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX)
 			continue;
 		num_target_ports++;
 		if (port->status & CTL_PORT_STATUS_HA_SHARED)
 			shared_group = 1;
 	}
 	mtx_unlock(&softc->ctl_lock);
 	num_ha_groups = (softc->is_single) ? 0 : NUM_HA_SHELVES;
 
 	if (ext)
 		total_len = sizeof(struct scsi_target_group_data_extended);
 	else
 		total_len = sizeof(struct scsi_target_group_data);
 	total_len += sizeof(struct scsi_target_port_group_descriptor) *
 		(shared_group + num_ha_groups) +
 	    sizeof(struct scsi_target_port_descriptor) * num_target_ports;
 
 	alloc_len = scsi_4btoul(cdb->length);
 
 	ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(total_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	if (ext) {
 		rtg_ext_ptr = (struct scsi_target_group_data_extended *)
 		    ctsio->kern_data_ptr;
 		scsi_ulto4b(total_len - 4, rtg_ext_ptr->length);
 		rtg_ext_ptr->format_type = 0x10;
 		rtg_ext_ptr->implicit_transition_time = 0;
 		tpg_desc = &rtg_ext_ptr->groups[0];
 	} else {
 		rtg_ptr = (struct scsi_target_group_data *)
 		    ctsio->kern_data_ptr;
 		scsi_ulto4b(total_len - 4, rtg_ptr->length);
 		tpg_desc = &rtg_ptr->groups[0];
 	}
 
 	mtx_lock(&softc->ctl_lock);
 	pg = softc->port_min / softc->port_cnt;
 	if (lun->flags & (CTL_LUN_PRIMARY_SC | CTL_LUN_PEER_SC_PRIMARY)) {
 		/* Some shelf is known to be primary. */
 		if (softc->ha_link == CTL_HA_LINK_OFFLINE)
 			os = TPG_ASYMMETRIC_ACCESS_UNAVAILABLE;
 		else if (softc->ha_link == CTL_HA_LINK_UNKNOWN)
 			os = TPG_ASYMMETRIC_ACCESS_TRANSITIONING;
 		else if (softc->ha_mode == CTL_HA_MODE_ACT_STBY)
 			os = TPG_ASYMMETRIC_ACCESS_STANDBY;
 		else
 			os = TPG_ASYMMETRIC_ACCESS_NONOPTIMIZED;
 		if (lun->flags & CTL_LUN_PRIMARY_SC) {
 			ts = TPG_ASYMMETRIC_ACCESS_OPTIMIZED;
 		} else {
 			ts = os;
 			os = TPG_ASYMMETRIC_ACCESS_OPTIMIZED;
 		}
 	} else {
 		/* No known primary shelf. */
 		if (softc->ha_link == CTL_HA_LINK_OFFLINE) {
 			ts = TPG_ASYMMETRIC_ACCESS_UNAVAILABLE;
 			os = TPG_ASYMMETRIC_ACCESS_OPTIMIZED;
 		} else if (softc->ha_link == CTL_HA_LINK_UNKNOWN) {
 			ts = TPG_ASYMMETRIC_ACCESS_TRANSITIONING;
 			os = TPG_ASYMMETRIC_ACCESS_OPTIMIZED;
 		} else {
 			ts = os = TPG_ASYMMETRIC_ACCESS_TRANSITIONING;
 		}
 	}
 	if (shared_group) {
 		tpg_desc->pref_state = ts;
 		tpg_desc->support = TPG_AO_SUP | TPG_AN_SUP | TPG_S_SUP |
 		    TPG_U_SUP | TPG_T_SUP;
 		scsi_ulto2b(1, tpg_desc->target_port_group);
 		tpg_desc->status = TPG_IMPLICIT;
 		pc = 0;
 		STAILQ_FOREACH(port, &softc->port_list, links) {
 			if ((port->status & CTL_PORT_STATUS_ONLINE) == 0)
 				continue;
 			if (!softc->is_single &&
 			    (port->status & CTL_PORT_STATUS_HA_SHARED) == 0)
 				continue;
 			if (ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX)
 				continue;
 			scsi_ulto2b(port->targ_port, tpg_desc->descriptors[pc].
 			    relative_target_port_identifier);
 			pc++;
 		}
 		tpg_desc->target_port_count = pc;
 		tpg_desc = (struct scsi_target_port_group_descriptor *)
 		    &tpg_desc->descriptors[pc];
 	}
 	for (g = 0; g < num_ha_groups; g++) {
 		tpg_desc->pref_state = (g == pg) ? ts : os;
 		tpg_desc->support = TPG_AO_SUP | TPG_AN_SUP | TPG_S_SUP |
 		    TPG_U_SUP | TPG_T_SUP;
 		scsi_ulto2b(2 + g, tpg_desc->target_port_group);
 		tpg_desc->status = TPG_IMPLICIT;
 		pc = 0;
 		STAILQ_FOREACH(port, &softc->port_list, links) {
 			if (port->targ_port < g * softc->port_cnt ||
 			    port->targ_port >= (g + 1) * softc->port_cnt)
 				continue;
 			if ((port->status & CTL_PORT_STATUS_ONLINE) == 0)
 				continue;
 			if (port->status & CTL_PORT_STATUS_HA_SHARED)
 				continue;
 			if (ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX)
 				continue;
 			scsi_ulto2b(port->targ_port, tpg_desc->descriptors[pc].
 			    relative_target_port_identifier);
 			pc++;
 		}
 		tpg_desc->target_port_count = pc;
 		tpg_desc = (struct scsi_target_port_group_descriptor *)
 		    &tpg_desc->descriptors[pc];
 	}
 	mtx_unlock(&softc->ctl_lock);
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return(retval);
 }
 
 int
 ctl_report_supported_opcodes(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_report_supported_opcodes *cdb;
 	const struct ctl_cmd_entry *entry, *sentry;
 	struct scsi_report_supported_opcodes_all *all;
 	struct scsi_report_supported_opcodes_descr *descr;
 	struct scsi_report_supported_opcodes_one *one;
 	int retval;
 	int alloc_len, total_len;
 	int opcode, service_action, i, j, num;
 
 	CTL_DEBUG_PRINT(("ctl_report_supported_opcodes\n"));
 
 	cdb = (struct scsi_report_supported_opcodes *)ctsio->cdb;
 	retval = CTL_RETVAL_COMPLETE;
 
 	opcode = cdb->requested_opcode;
 	service_action = scsi_2btoul(cdb->requested_service_action);
 	switch (cdb->options & RSO_OPTIONS_MASK) {
 	case RSO_OPTIONS_ALL:
 		num = 0;
 		for (i = 0; i < 256; i++) {
 			entry = &ctl_cmd_table[i];
 			if (entry->flags & CTL_CMD_FLAG_SA5) {
 				for (j = 0; j < 32; j++) {
 					sentry = &((const struct ctl_cmd_entry *)
 					    entry->execute)[j];
 					if (ctl_cmd_applicable(
 					    lun->be_lun->lun_type, sentry))
 						num++;
 				}
 			} else {
 				if (ctl_cmd_applicable(lun->be_lun->lun_type,
 				    entry))
 					num++;
 			}
 		}
 		total_len = sizeof(struct scsi_report_supported_opcodes_all) +
 		    num * sizeof(struct scsi_report_supported_opcodes_descr);
 		break;
 	case RSO_OPTIONS_OC:
 		if (ctl_cmd_table[opcode].flags & CTL_CMD_FLAG_SA5) {
 			ctl_set_invalid_field(/*ctsio*/ ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 1,
 					      /*field*/ 2,
 					      /*bit_valid*/ 1,
 					      /*bit*/ 2);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 		total_len = sizeof(struct scsi_report_supported_opcodes_one) + 32;
 		break;
 	case RSO_OPTIONS_OC_SA:
 		if ((ctl_cmd_table[opcode].flags & CTL_CMD_FLAG_SA5) == 0 ||
 		    service_action >= 32) {
 			ctl_set_invalid_field(/*ctsio*/ ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 1,
 					      /*field*/ 2,
 					      /*bit_valid*/ 1,
 					      /*bit*/ 2);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 		/* FALLTHROUGH */
 	case RSO_OPTIONS_OC_ASA:
 		total_len = sizeof(struct scsi_report_supported_opcodes_one) + 32;
 		break;
 	default:
 		ctl_set_invalid_field(/*ctsio*/ ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 2,
 				      /*bit_valid*/ 1,
 				      /*bit*/ 2);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	alloc_len = scsi_4btoul(cdb->length);
 
 	ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(total_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	switch (cdb->options & RSO_OPTIONS_MASK) {
 	case RSO_OPTIONS_ALL:
 		all = (struct scsi_report_supported_opcodes_all *)
 		    ctsio->kern_data_ptr;
 		num = 0;
 		for (i = 0; i < 256; i++) {
 			entry = &ctl_cmd_table[i];
 			if (entry->flags & CTL_CMD_FLAG_SA5) {
 				for (j = 0; j < 32; j++) {
 					sentry = &((const struct ctl_cmd_entry *)
 					    entry->execute)[j];
 					if (!ctl_cmd_applicable(
 					    lun->be_lun->lun_type, sentry))
 						continue;
 					descr = &all->descr[num++];
 					descr->opcode = i;
 					scsi_ulto2b(j, descr->service_action);
 					descr->flags = RSO_SERVACTV;
 					scsi_ulto2b(sentry->length,
 					    descr->cdb_length);
 				}
 			} else {
 				if (!ctl_cmd_applicable(lun->be_lun->lun_type,
 				    entry))
 					continue;
 				descr = &all->descr[num++];
 				descr->opcode = i;
 				scsi_ulto2b(0, descr->service_action);
 				descr->flags = 0;
 				scsi_ulto2b(entry->length, descr->cdb_length);
 			}
 		}
 		scsi_ulto4b(
 		    num * sizeof(struct scsi_report_supported_opcodes_descr),
 		    all->length);
 		break;
 	case RSO_OPTIONS_OC:
 		one = (struct scsi_report_supported_opcodes_one *)
 		    ctsio->kern_data_ptr;
 		entry = &ctl_cmd_table[opcode];
 		goto fill_one;
 	case RSO_OPTIONS_OC_SA:
 		one = (struct scsi_report_supported_opcodes_one *)
 		    ctsio->kern_data_ptr;
 		entry = &ctl_cmd_table[opcode];
 		entry = &((const struct ctl_cmd_entry *)
 		    entry->execute)[service_action];
 fill_one:
 		if (ctl_cmd_applicable(lun->be_lun->lun_type, entry)) {
 			one->support = 3;
 			scsi_ulto2b(entry->length, one->cdb_length);
 			one->cdb_usage[0] = opcode;
 			memcpy(&one->cdb_usage[1], entry->usage,
 			    entry->length - 1);
 		} else
 			one->support = 1;
 		break;
 	case RSO_OPTIONS_OC_ASA:
 		one = (struct scsi_report_supported_opcodes_one *)
 		    ctsio->kern_data_ptr;
 		entry = &ctl_cmd_table[opcode];
 		if (entry->flags & CTL_CMD_FLAG_SA5) {
 			entry = &((const struct ctl_cmd_entry *)
 			    entry->execute)[service_action];
 		} else if (service_action != 0) {
 			one->support = 1;
 			break;
 		}
 		goto fill_one;
 	}
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return(retval);
 }
 
 int
 ctl_report_supported_tmf(struct ctl_scsiio *ctsio)
 {
 	struct scsi_report_supported_tmf *cdb;
 	struct scsi_report_supported_tmf_ext_data *data;
 	int retval;
 	int alloc_len, total_len;
 
 	CTL_DEBUG_PRINT(("ctl_report_supported_tmf\n"));
 
 	cdb = (struct scsi_report_supported_tmf *)ctsio->cdb;
 
 	retval = CTL_RETVAL_COMPLETE;
 
 	if (cdb->options & RST_REPD)
 		total_len = sizeof(struct scsi_report_supported_tmf_ext_data);
 	else
 		total_len = sizeof(struct scsi_report_supported_tmf_data);
 	alloc_len = scsi_4btoul(cdb->length);
 
 	ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(total_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	data = (struct scsi_report_supported_tmf_ext_data *)ctsio->kern_data_ptr;
 	data->byte1 |= RST_ATS | RST_ATSS | RST_CTSS | RST_LURS | RST_QTS |
 	    RST_TRS;
 	data->byte2 |= RST_QAES | RST_QTSS | RST_ITNRS;
 	data->length = total_len - 4;
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (retval);
 }
 
 int
 ctl_report_timestamp(struct ctl_scsiio *ctsio)
 {
 	struct scsi_report_timestamp *cdb;
 	struct scsi_report_timestamp_data *data;
 	struct timeval tv;
 	int64_t timestamp;
 	int retval;
 	int alloc_len, total_len;
 
 	CTL_DEBUG_PRINT(("ctl_report_timestamp\n"));
 
 	cdb = (struct scsi_report_timestamp *)ctsio->cdb;
 
 	retval = CTL_RETVAL_COMPLETE;
 
 	total_len = sizeof(struct scsi_report_timestamp_data);
 	alloc_len = scsi_4btoul(cdb->length);
 
 	ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(total_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	data = (struct scsi_report_timestamp_data *)ctsio->kern_data_ptr;
 	scsi_ulto2b(sizeof(*data) - 2, data->length);
 	data->origin = RTS_ORIG_OUTSIDE;
 	getmicrotime(&tv);
 	timestamp = (int64_t)tv.tv_sec * 1000 + tv.tv_usec / 1000;
 	scsi_ulto4b(timestamp >> 16, data->timestamp);
 	scsi_ulto2b(timestamp & 0xffff, &data->timestamp[4]);
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (retval);
 }
 
 int
 ctl_persistent_reserve_in(struct ctl_scsiio *ctsio)
 {
 	struct ctl_softc *softc = CTL_SOFTC(ctsio);
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_per_res_in *cdb;
 	int alloc_len, total_len = 0;
 	/* struct scsi_per_res_in_rsrv in_data; */
 	uint64_t key;
 
 	CTL_DEBUG_PRINT(("ctl_persistent_reserve_in\n"));
 
 	cdb = (struct scsi_per_res_in *)ctsio->cdb;
 
 	alloc_len = scsi_2btoul(cdb->length);
 
 retry:
 	mtx_lock(&lun->lun_lock);
 	switch (cdb->action) {
 	case SPRI_RK: /* read keys */
 		total_len = sizeof(struct scsi_per_res_in_keys) +
 			lun->pr_key_count *
 			sizeof(struct scsi_per_res_key);
 		break;
 	case SPRI_RR: /* read reservation */
 		if (lun->flags & CTL_LUN_PR_RESERVED)
 			total_len = sizeof(struct scsi_per_res_in_rsrv);
 		else
 			total_len = sizeof(struct scsi_per_res_in_header);
 		break;
 	case SPRI_RC: /* report capabilities */
 		total_len = sizeof(struct scsi_per_res_cap);
 		break;
 	case SPRI_RS: /* read full status */
 		total_len = sizeof(struct scsi_per_res_in_header) +
 		    (sizeof(struct scsi_per_res_in_full_desc) + 256) *
 		    lun->pr_key_count;
 		break;
 	default:
 		panic("%s: Invalid PR type %#x", __func__, cdb->action);
 	}
 	mtx_unlock(&lun->lun_lock);
 
 	ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(total_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	mtx_lock(&lun->lun_lock);
 	switch (cdb->action) {
 	case SPRI_RK: { // read keys
         struct scsi_per_res_in_keys *res_keys;
 		int i, key_count;
 
 		res_keys = (struct scsi_per_res_in_keys*)ctsio->kern_data_ptr;
 
 		/*
 		 * We had to drop the lock to allocate our buffer, which
 		 * leaves time for someone to come in with another
 		 * persistent reservation.  (That is unlikely, though,
 		 * since this should be the only persistent reservation
 		 * command active right now.)
 		 */
 		if (total_len != (sizeof(struct scsi_per_res_in_keys) +
 		    (lun->pr_key_count *
 		     sizeof(struct scsi_per_res_key)))){
 			mtx_unlock(&lun->lun_lock);
 			free(ctsio->kern_data_ptr, M_CTL);
 			printf("%s: reservation length changed, retrying\n",
 			       __func__);
 			goto retry;
 		}
 
 		scsi_ulto4b(lun->pr_generation, res_keys->header.generation);
 
 		scsi_ulto4b(sizeof(struct scsi_per_res_key) *
 			     lun->pr_key_count, res_keys->header.length);
 
 		for (i = 0, key_count = 0; i < CTL_MAX_INITIATORS; i++) {
 			if ((key = ctl_get_prkey(lun, i)) == 0)
 				continue;
 
 			/*
 			 * We used lun->pr_key_count to calculate the
 			 * size to allocate.  If it turns out the number of
 			 * initiators with the registered flag set is
 			 * larger than that (i.e. they haven't been kept in
 			 * sync), we've got a problem.
 			 */
 			if (key_count >= lun->pr_key_count) {
 				key_count++;
 				continue;
 			}
 			scsi_u64to8b(key, res_keys->keys[key_count].key);
 			key_count++;
 		}
 		break;
 	}
 	case SPRI_RR: { // read reservation
 		struct scsi_per_res_in_rsrv *res;
 		int tmp_len, header_only;
 
 		res = (struct scsi_per_res_in_rsrv *)ctsio->kern_data_ptr;
 
 		scsi_ulto4b(lun->pr_generation, res->header.generation);
 
 		if (lun->flags & CTL_LUN_PR_RESERVED)
 		{
 			tmp_len = sizeof(struct scsi_per_res_in_rsrv);
 			scsi_ulto4b(sizeof(struct scsi_per_res_in_rsrv_data),
 				    res->header.length);
 			header_only = 0;
 		} else {
 			tmp_len = sizeof(struct scsi_per_res_in_header);
 			scsi_ulto4b(0, res->header.length);
 			header_only = 1;
 		}
 
 		/*
 		 * We had to drop the lock to allocate our buffer, which
 		 * leaves time for someone to come in with another
 		 * persistent reservation.  (That is unlikely, though,
 		 * since this should be the only persistent reservation
 		 * command active right now.)
 		 */
 		if (tmp_len != total_len) {
 			mtx_unlock(&lun->lun_lock);
 			free(ctsio->kern_data_ptr, M_CTL);
 			printf("%s: reservation status changed, retrying\n",
 			       __func__);
 			goto retry;
 		}
 
 		/*
 		 * No reservation held, so we're done.
 		 */
 		if (header_only != 0)
 			break;
 
 		/*
 		 * If the registration is an All Registrants type, the key
 		 * is 0, since it doesn't really matter.
 		 */
 		if (lun->pr_res_idx != CTL_PR_ALL_REGISTRANTS) {
 			scsi_u64to8b(ctl_get_prkey(lun, lun->pr_res_idx),
 			    res->data.reservation);
 		}
 		res->data.scopetype = lun->pr_res_type;
 		break;
 	}
 	case SPRI_RC:     //report capabilities
 	{
 		struct scsi_per_res_cap *res_cap;
 		uint16_t type_mask;
 
 		res_cap = (struct scsi_per_res_cap *)ctsio->kern_data_ptr;
 		scsi_ulto2b(sizeof(*res_cap), res_cap->length);
 		res_cap->flags1 = SPRI_CRH;
 		res_cap->flags2 = SPRI_TMV | SPRI_ALLOW_5;
 		type_mask = SPRI_TM_WR_EX_AR |
 			    SPRI_TM_EX_AC_RO |
 			    SPRI_TM_WR_EX_RO |
 			    SPRI_TM_EX_AC |
 			    SPRI_TM_WR_EX |
 			    SPRI_TM_EX_AC_AR;
 		scsi_ulto2b(type_mask, res_cap->type_mask);
 		break;
 	}
 	case SPRI_RS: { // read full status
 		struct scsi_per_res_in_full *res_status;
 		struct scsi_per_res_in_full_desc *res_desc;
 		struct ctl_port *port;
 		int i, len;
 
 		res_status = (struct scsi_per_res_in_full*)ctsio->kern_data_ptr;
 
 		/*
 		 * We had to drop the lock to allocate our buffer, which
 		 * leaves time for someone to come in with another
 		 * persistent reservation.  (That is unlikely, though,
 		 * since this should be the only persistent reservation
 		 * command active right now.)
 		 */
 		if (total_len < (sizeof(struct scsi_per_res_in_header) +
 		    (sizeof(struct scsi_per_res_in_full_desc) + 256) *
 		     lun->pr_key_count)){
 			mtx_unlock(&lun->lun_lock);
 			free(ctsio->kern_data_ptr, M_CTL);
 			printf("%s: reservation length changed, retrying\n",
 			       __func__);
 			goto retry;
 		}
 
 		scsi_ulto4b(lun->pr_generation, res_status->header.generation);
 
 		res_desc = &res_status->desc[0];
 		for (i = 0; i < CTL_MAX_INITIATORS; i++) {
 			if ((key = ctl_get_prkey(lun, i)) == 0)
 				continue;
 
 			scsi_u64to8b(key, res_desc->res_key.key);
 			if ((lun->flags & CTL_LUN_PR_RESERVED) &&
 			    (lun->pr_res_idx == i ||
 			     lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS)) {
 				res_desc->flags = SPRI_FULL_R_HOLDER;
 				res_desc->scopetype = lun->pr_res_type;
 			}
 			scsi_ulto2b(i / CTL_MAX_INIT_PER_PORT,
 			    res_desc->rel_trgt_port_id);
 			len = 0;
 			port = softc->ctl_ports[i / CTL_MAX_INIT_PER_PORT];
 			if (port != NULL)
 				len = ctl_create_iid(port,
 				    i % CTL_MAX_INIT_PER_PORT,
 				    res_desc->transport_id);
 			scsi_ulto4b(len, res_desc->additional_length);
 			res_desc = (struct scsi_per_res_in_full_desc *)
 			    &res_desc->transport_id[len];
 		}
 		scsi_ulto4b((uint8_t *)res_desc - (uint8_t *)&res_status->desc[0],
 		    res_status->header.length);
 		break;
 	}
 	default:
 		panic("%s: Invalid PR type %#x", __func__, cdb->action);
 	}
 	mtx_unlock(&lun->lun_lock);
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 /*
  * Returns 0 if ctl_persistent_reserve_out() should continue, non-zero if
  * it should return.
  */
 static int
 ctl_pro_preempt(struct ctl_softc *softc, struct ctl_lun *lun, uint64_t res_key,
 		uint64_t sa_res_key, uint8_t type, uint32_t residx,
 		struct ctl_scsiio *ctsio, struct scsi_per_res_out *cdb,
 		struct scsi_per_res_out_parms* param)
 {
 	union ctl_ha_msg persis_io;
 	int i;
 
 	mtx_lock(&lun->lun_lock);
 	if (sa_res_key == 0) {
 		if (lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS) {
 			/* validate scope and type */
 			if ((cdb->scope_type & SPR_SCOPE_MASK) !=
 			     SPR_LU_SCOPE) {
 				mtx_unlock(&lun->lun_lock);
 				ctl_set_invalid_field(/*ctsio*/ ctsio,
 						      /*sks_valid*/ 1,
 						      /*command*/ 1,
 						      /*field*/ 2,
 						      /*bit_valid*/ 1,
 						      /*bit*/ 4);
 				ctl_done((union ctl_io *)ctsio);
 				return (1);
 			}
 
 		        if (type>8 || type==2 || type==4 || type==0) {
 				mtx_unlock(&lun->lun_lock);
 				ctl_set_invalid_field(/*ctsio*/ ctsio,
        	           				      /*sks_valid*/ 1,
 						      /*command*/ 1,
 						      /*field*/ 2,
 						      /*bit_valid*/ 1,
 						      /*bit*/ 0);
 				ctl_done((union ctl_io *)ctsio);
 				return (1);
 		        }
 
 			/*
 			 * Unregister everybody else and build UA for
 			 * them
 			 */
 			for(i = 0; i < CTL_MAX_INITIATORS; i++) {
 				if (i == residx || ctl_get_prkey(lun, i) == 0)
 					continue;
 
 				ctl_clr_prkey(lun, i);
 				ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT);
 			}
 			lun->pr_key_count = 1;
 			lun->pr_res_type = type;
 			if (lun->pr_res_type != SPR_TYPE_WR_EX_AR &&
 			    lun->pr_res_type != SPR_TYPE_EX_AC_AR)
 				lun->pr_res_idx = residx;
 			lun->pr_generation++;
 			mtx_unlock(&lun->lun_lock);
 
 			/* send msg to other side */
 			persis_io.hdr.nexus = ctsio->io_hdr.nexus;
 			persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION;
 			persis_io.pr.pr_info.action = CTL_PR_PREEMPT;
 			persis_io.pr.pr_info.residx = lun->pr_res_idx;
 			persis_io.pr.pr_info.res_type = type;
 			memcpy(persis_io.pr.pr_info.sa_res_key,
 			       param->serv_act_res_key,
 			       sizeof(param->serv_act_res_key));
 			ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io,
 			    sizeof(persis_io.pr), M_WAITOK);
 		} else {
 			/* not all registrants */
 			mtx_unlock(&lun->lun_lock);
 			free(ctsio->kern_data_ptr, M_CTL);
 			ctl_set_invalid_field(ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 0,
 					      /*field*/ 8,
 					      /*bit_valid*/ 0,
 					      /*bit*/ 0);
 			ctl_done((union ctl_io *)ctsio);
 			return (1);
 		}
 	} else if (lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS
 		|| !(lun->flags & CTL_LUN_PR_RESERVED)) {
 		int found = 0;
 
 		if (res_key == sa_res_key) {
 			/* special case */
 			/*
 			 * The spec implies this is not good but doesn't
 			 * say what to do. There are two choices either
 			 * generate a res conflict or check condition
 			 * with illegal field in parameter data. Since
 			 * that is what is done when the sa_res_key is
 			 * zero I'll take that approach since this has
 			 * to do with the sa_res_key.
 			 */
 			mtx_unlock(&lun->lun_lock);
 			free(ctsio->kern_data_ptr, M_CTL);
 			ctl_set_invalid_field(ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 0,
 					      /*field*/ 8,
 					      /*bit_valid*/ 0,
 					      /*bit*/ 0);
 			ctl_done((union ctl_io *)ctsio);
 			return (1);
 		}
 
 		for (i = 0; i < CTL_MAX_INITIATORS; i++) {
 			if (ctl_get_prkey(lun, i) != sa_res_key)
 				continue;
 
 			found = 1;
 			ctl_clr_prkey(lun, i);
 			lun->pr_key_count--;
 			ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT);
 		}
 		if (!found) {
 			mtx_unlock(&lun->lun_lock);
 			free(ctsio->kern_data_ptr, M_CTL);
 			ctl_set_reservation_conflict(ctsio);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 		lun->pr_generation++;
 		mtx_unlock(&lun->lun_lock);
 
 		/* send msg to other side */
 		persis_io.hdr.nexus = ctsio->io_hdr.nexus;
 		persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION;
 		persis_io.pr.pr_info.action = CTL_PR_PREEMPT;
 		persis_io.pr.pr_info.residx = lun->pr_res_idx;
 		persis_io.pr.pr_info.res_type = type;
 		memcpy(persis_io.pr.pr_info.sa_res_key,
 		       param->serv_act_res_key,
 		       sizeof(param->serv_act_res_key));
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io,
 		    sizeof(persis_io.pr), M_WAITOK);
 	} else {
 		/* Reserved but not all registrants */
 		/* sa_res_key is res holder */
 		if (sa_res_key == ctl_get_prkey(lun, lun->pr_res_idx)) {
 			/* validate scope and type */
 			if ((cdb->scope_type & SPR_SCOPE_MASK) !=
 			     SPR_LU_SCOPE) {
 				mtx_unlock(&lun->lun_lock);
 				ctl_set_invalid_field(/*ctsio*/ ctsio,
 						      /*sks_valid*/ 1,
 						      /*command*/ 1,
 						      /*field*/ 2,
 						      /*bit_valid*/ 1,
 						      /*bit*/ 4);
 				ctl_done((union ctl_io *)ctsio);
 				return (1);
 			}
 
 			if (type>8 || type==2 || type==4 || type==0) {
 				mtx_unlock(&lun->lun_lock);
 				ctl_set_invalid_field(/*ctsio*/ ctsio,
 						      /*sks_valid*/ 1,
 						      /*command*/ 1,
 						      /*field*/ 2,
 						      /*bit_valid*/ 1,
 						      /*bit*/ 0);
 				ctl_done((union ctl_io *)ctsio);
 				return (1);
 			}
 
 			/*
 			 * Do the following:
 			 * if sa_res_key != res_key remove all
 			 * registrants w/sa_res_key and generate UA
 			 * for these registrants(Registrations
 			 * Preempted) if it wasn't an exclusive
 			 * reservation generate UA(Reservations
 			 * Preempted) for all other registered nexuses
 			 * if the type has changed. Establish the new
 			 * reservation and holder. If res_key and
 			 * sa_res_key are the same do the above
 			 * except don't unregister the res holder.
 			 */
 
 			for(i = 0; i < CTL_MAX_INITIATORS; i++) {
 				if (i == residx || ctl_get_prkey(lun, i) == 0)
 					continue;
 
 				if (sa_res_key == ctl_get_prkey(lun, i)) {
 					ctl_clr_prkey(lun, i);
 					lun->pr_key_count--;
 					ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT);
 				} else if (type != lun->pr_res_type &&
 				    (lun->pr_res_type == SPR_TYPE_WR_EX_RO ||
 				     lun->pr_res_type == SPR_TYPE_EX_AC_RO)) {
 					ctl_est_ua(lun, i, CTL_UA_RES_RELEASE);
 				}
 			}
 			lun->pr_res_type = type;
 			if (lun->pr_res_type != SPR_TYPE_WR_EX_AR &&
 			    lun->pr_res_type != SPR_TYPE_EX_AC_AR)
 				lun->pr_res_idx = residx;
 			else
 				lun->pr_res_idx = CTL_PR_ALL_REGISTRANTS;
 			lun->pr_generation++;
 			mtx_unlock(&lun->lun_lock);
 
 			persis_io.hdr.nexus = ctsio->io_hdr.nexus;
 			persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION;
 			persis_io.pr.pr_info.action = CTL_PR_PREEMPT;
 			persis_io.pr.pr_info.residx = lun->pr_res_idx;
 			persis_io.pr.pr_info.res_type = type;
 			memcpy(persis_io.pr.pr_info.sa_res_key,
 			       param->serv_act_res_key,
 			       sizeof(param->serv_act_res_key));
 			ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io,
 			    sizeof(persis_io.pr), M_WAITOK);
 		} else {
 			/*
 			 * sa_res_key is not the res holder just
 			 * remove registrants
 			 */
 			int found=0;
 
 			for (i = 0; i < CTL_MAX_INITIATORS; i++) {
 				if (sa_res_key != ctl_get_prkey(lun, i))
 					continue;
 
 				found = 1;
 				ctl_clr_prkey(lun, i);
 				lun->pr_key_count--;
 				ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT);
 			}
 
 			if (!found) {
 				mtx_unlock(&lun->lun_lock);
 				free(ctsio->kern_data_ptr, M_CTL);
 				ctl_set_reservation_conflict(ctsio);
 				ctl_done((union ctl_io *)ctsio);
 		        	return (1);
 			}
 			lun->pr_generation++;
 			mtx_unlock(&lun->lun_lock);
 
 			persis_io.hdr.nexus = ctsio->io_hdr.nexus;
 			persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION;
 			persis_io.pr.pr_info.action = CTL_PR_PREEMPT;
 			persis_io.pr.pr_info.residx = lun->pr_res_idx;
 			persis_io.pr.pr_info.res_type = type;
 			memcpy(persis_io.pr.pr_info.sa_res_key,
 			       param->serv_act_res_key,
 			       sizeof(param->serv_act_res_key));
 			ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io,
 			    sizeof(persis_io.pr), M_WAITOK);
 		}
 	}
 	return (0);
 }
 
 static void
 ctl_pro_preempt_other(struct ctl_lun *lun, union ctl_ha_msg *msg)
 {
 	uint64_t sa_res_key;
 	int i;
 
 	sa_res_key = scsi_8btou64(msg->pr.pr_info.sa_res_key);
 
 	if (lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS
 	 || lun->pr_res_idx == CTL_PR_NO_RESERVATION
 	 || sa_res_key != ctl_get_prkey(lun, lun->pr_res_idx)) {
 		if (sa_res_key == 0) {
 			/*
 			 * Unregister everybody else and build UA for
 			 * them
 			 */
 			for(i = 0; i < CTL_MAX_INITIATORS; i++) {
 				if (i == msg->pr.pr_info.residx ||
 				    ctl_get_prkey(lun, i) == 0)
 					continue;
 
 				ctl_clr_prkey(lun, i);
 				ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT);
 			}
 
 			lun->pr_key_count = 1;
 			lun->pr_res_type = msg->pr.pr_info.res_type;
 			if (lun->pr_res_type != SPR_TYPE_WR_EX_AR &&
 			    lun->pr_res_type != SPR_TYPE_EX_AC_AR)
 				lun->pr_res_idx = msg->pr.pr_info.residx;
 		} else {
 		        for (i = 0; i < CTL_MAX_INITIATORS; i++) {
 				if (sa_res_key == ctl_get_prkey(lun, i))
 					continue;
 
 				ctl_clr_prkey(lun, i);
 				lun->pr_key_count--;
 				ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT);
 			}
 		}
 	} else {
 		for (i = 0; i < CTL_MAX_INITIATORS; i++) {
 			if (i == msg->pr.pr_info.residx ||
 			    ctl_get_prkey(lun, i) == 0)
 				continue;
 
 			if (sa_res_key == ctl_get_prkey(lun, i)) {
 				ctl_clr_prkey(lun, i);
 				lun->pr_key_count--;
 				ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT);
 			} else if (msg->pr.pr_info.res_type != lun->pr_res_type
 			    && (lun->pr_res_type == SPR_TYPE_WR_EX_RO ||
 			     lun->pr_res_type == SPR_TYPE_EX_AC_RO)) {
 				ctl_est_ua(lun, i, CTL_UA_RES_RELEASE);
 			}
 		}
 		lun->pr_res_type = msg->pr.pr_info.res_type;
 		if (lun->pr_res_type != SPR_TYPE_WR_EX_AR &&
 		    lun->pr_res_type != SPR_TYPE_EX_AC_AR)
 			lun->pr_res_idx = msg->pr.pr_info.residx;
 		else
 			lun->pr_res_idx = CTL_PR_ALL_REGISTRANTS;
 	}
 	lun->pr_generation++;
 
 }
 
 
 int
 ctl_persistent_reserve_out(struct ctl_scsiio *ctsio)
 {
 	struct ctl_softc *softc = CTL_SOFTC(ctsio);
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	int retval;
 	u_int32_t param_len;
 	struct scsi_per_res_out *cdb;
 	struct scsi_per_res_out_parms* param;
 	uint32_t residx;
 	uint64_t res_key, sa_res_key, key;
 	uint8_t type;
 	union ctl_ha_msg persis_io;
 	int    i;
 
 	CTL_DEBUG_PRINT(("ctl_persistent_reserve_out\n"));
 
 	cdb = (struct scsi_per_res_out *)ctsio->cdb;
 	retval = CTL_RETVAL_COMPLETE;
 
 	/*
 	 * We only support whole-LUN scope.  The scope & type are ignored for
 	 * register, register and ignore existing key and clear.
 	 * We sometimes ignore scope and type on preempts too!!
 	 * Verify reservation type here as well.
 	 */
 	type = cdb->scope_type & SPR_TYPE_MASK;
 	if ((cdb->action == SPRO_RESERVE)
 	 || (cdb->action == SPRO_RELEASE)) {
 		if ((cdb->scope_type & SPR_SCOPE_MASK) != SPR_LU_SCOPE) {
 			ctl_set_invalid_field(/*ctsio*/ ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 1,
 					      /*field*/ 2,
 					      /*bit_valid*/ 1,
 					      /*bit*/ 4);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 
 		if (type>8 || type==2 || type==4 || type==0) {
 			ctl_set_invalid_field(/*ctsio*/ ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 1,
 					      /*field*/ 2,
 					      /*bit_valid*/ 1,
 					      /*bit*/ 0);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 	}
 
 	param_len = scsi_4btoul(cdb->length);
 
 	if ((ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) {
 		ctsio->kern_data_ptr = malloc(param_len, M_CTL, M_WAITOK);
 		ctsio->kern_data_len = param_len;
 		ctsio->kern_total_len = param_len;
 		ctsio->kern_rel_offset = 0;
 		ctsio->kern_sg_entries = 0;
 		ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 		ctsio->be_move_done = ctl_config_move_done;
 		ctl_datamove((union ctl_io *)ctsio);
 
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	param = (struct scsi_per_res_out_parms *)ctsio->kern_data_ptr;
 
 	residx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 	res_key = scsi_8btou64(param->res_key.key);
 	sa_res_key = scsi_8btou64(param->serv_act_res_key);
 
 	/*
 	 * Validate the reservation key here except for SPRO_REG_IGNO
 	 * This must be done for all other service actions
 	 */
 	if ((cdb->action & SPRO_ACTION_MASK) != SPRO_REG_IGNO) {
 		mtx_lock(&lun->lun_lock);
 		if ((key = ctl_get_prkey(lun, residx)) != 0) {
 			if (res_key != key) {
 				/*
 				 * The current key passed in doesn't match
 				 * the one the initiator previously
 				 * registered.
 				 */
 				mtx_unlock(&lun->lun_lock);
 				free(ctsio->kern_data_ptr, M_CTL);
 				ctl_set_reservation_conflict(ctsio);
 				ctl_done((union ctl_io *)ctsio);
 				return (CTL_RETVAL_COMPLETE);
 			}
 		} else if ((cdb->action & SPRO_ACTION_MASK) != SPRO_REGISTER) {
 			/*
 			 * We are not registered
 			 */
 			mtx_unlock(&lun->lun_lock);
 			free(ctsio->kern_data_ptr, M_CTL);
 			ctl_set_reservation_conflict(ctsio);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		} else if (res_key != 0) {
 			/*
 			 * We are not registered and trying to register but
 			 * the register key isn't zero.
 			 */
 			mtx_unlock(&lun->lun_lock);
 			free(ctsio->kern_data_ptr, M_CTL);
 			ctl_set_reservation_conflict(ctsio);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 		mtx_unlock(&lun->lun_lock);
 	}
 
 	switch (cdb->action & SPRO_ACTION_MASK) {
 	case SPRO_REGISTER:
 	case SPRO_REG_IGNO: {
 
 #if 0
 		printf("Registration received\n");
 #endif
 
 		/*
 		 * We don't support any of these options, as we report in
 		 * the read capabilities request (see
 		 * ctl_persistent_reserve_in(), above).
 		 */
 		if ((param->flags & SPR_SPEC_I_PT)
 		 || (param->flags & SPR_ALL_TG_PT)
 		 || (param->flags & SPR_APTPL)) {
 			int bit_ptr;
 
 			if (param->flags & SPR_APTPL)
 				bit_ptr = 0;
 			else if (param->flags & SPR_ALL_TG_PT)
 				bit_ptr = 2;
 			else /* SPR_SPEC_I_PT */
 				bit_ptr = 3;
 
 			free(ctsio->kern_data_ptr, M_CTL);
 			ctl_set_invalid_field(ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 0,
 					      /*field*/ 20,
 					      /*bit_valid*/ 1,
 					      /*bit*/ bit_ptr);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 
 		mtx_lock(&lun->lun_lock);
 
 		/*
 		 * The initiator wants to clear the
 		 * key/unregister.
 		 */
 		if (sa_res_key == 0) {
 			if ((res_key == 0
 			  && (cdb->action & SPRO_ACTION_MASK) == SPRO_REGISTER)
 			 || ((cdb->action & SPRO_ACTION_MASK) == SPRO_REG_IGNO
 			  && ctl_get_prkey(lun, residx) == 0)) {
 				mtx_unlock(&lun->lun_lock);
 				goto done;
 			}
 
 			ctl_clr_prkey(lun, residx);
 			lun->pr_key_count--;
 
 			if (residx == lun->pr_res_idx) {
 				lun->flags &= ~CTL_LUN_PR_RESERVED;
 				lun->pr_res_idx = CTL_PR_NO_RESERVATION;
 
 				if ((lun->pr_res_type == SPR_TYPE_WR_EX_RO ||
 				     lun->pr_res_type == SPR_TYPE_EX_AC_RO) &&
 				    lun->pr_key_count) {
 					/*
 					 * If the reservation is a registrants
 					 * only type we need to generate a UA
 					 * for other registered inits.  The
 					 * sense code should be RESERVATIONS
 					 * RELEASED
 					 */
 
 					for (i = softc->init_min; i < softc->init_max; i++){
 						if (ctl_get_prkey(lun, i) == 0)
 							continue;
 						ctl_est_ua(lun, i,
 						    CTL_UA_RES_RELEASE);
 					}
 				}
 				lun->pr_res_type = 0;
 			} else if (lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS) {
 				if (lun->pr_key_count==0) {
 					lun->flags &= ~CTL_LUN_PR_RESERVED;
 					lun->pr_res_type = 0;
 					lun->pr_res_idx = CTL_PR_NO_RESERVATION;
 				}
 			}
 			lun->pr_generation++;
 			mtx_unlock(&lun->lun_lock);
 
 			persis_io.hdr.nexus = ctsio->io_hdr.nexus;
 			persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION;
 			persis_io.pr.pr_info.action = CTL_PR_UNREG_KEY;
 			persis_io.pr.pr_info.residx = residx;
 			ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io,
 			    sizeof(persis_io.pr), M_WAITOK);
 		} else /* sa_res_key != 0 */ {
 
 			/*
 			 * If we aren't registered currently then increment
 			 * the key count and set the registered flag.
 			 */
 			ctl_alloc_prkey(lun, residx);
 			if (ctl_get_prkey(lun, residx) == 0)
 				lun->pr_key_count++;
 			ctl_set_prkey(lun, residx, sa_res_key);
 			lun->pr_generation++;
 			mtx_unlock(&lun->lun_lock);
 
 			persis_io.hdr.nexus = ctsio->io_hdr.nexus;
 			persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION;
 			persis_io.pr.pr_info.action = CTL_PR_REG_KEY;
 			persis_io.pr.pr_info.residx = residx;
 			memcpy(persis_io.pr.pr_info.sa_res_key,
 			       param->serv_act_res_key,
 			       sizeof(param->serv_act_res_key));
 			ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io,
 			    sizeof(persis_io.pr), M_WAITOK);
 		}
 
 		break;
 	}
 	case SPRO_RESERVE:
 #if 0
                 printf("Reserve executed type %d\n", type);
 #endif
 		mtx_lock(&lun->lun_lock);
 		if (lun->flags & CTL_LUN_PR_RESERVED) {
 			/*
 			 * if this isn't the reservation holder and it's
 			 * not a "all registrants" type or if the type is
 			 * different then we have a conflict
 			 */
 			if ((lun->pr_res_idx != residx
 			  && lun->pr_res_idx != CTL_PR_ALL_REGISTRANTS)
 			 || lun->pr_res_type != type) {
 				mtx_unlock(&lun->lun_lock);
 				free(ctsio->kern_data_ptr, M_CTL);
 				ctl_set_reservation_conflict(ctsio);
 				ctl_done((union ctl_io *)ctsio);
 				return (CTL_RETVAL_COMPLETE);
 			}
 			mtx_unlock(&lun->lun_lock);
 		} else /* create a reservation */ {
 			/*
 			 * If it's not an "all registrants" type record
 			 * reservation holder
 			 */
 			if (type != SPR_TYPE_WR_EX_AR
 			 && type != SPR_TYPE_EX_AC_AR)
 				lun->pr_res_idx = residx; /* Res holder */
 			else
 				lun->pr_res_idx = CTL_PR_ALL_REGISTRANTS;
 
 			lun->flags |= CTL_LUN_PR_RESERVED;
 			lun->pr_res_type = type;
 
 			mtx_unlock(&lun->lun_lock);
 
 			/* send msg to other side */
 			persis_io.hdr.nexus = ctsio->io_hdr.nexus;
 			persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION;
 			persis_io.pr.pr_info.action = CTL_PR_RESERVE;
 			persis_io.pr.pr_info.residx = lun->pr_res_idx;
 			persis_io.pr.pr_info.res_type = type;
 			ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io,
 			    sizeof(persis_io.pr), M_WAITOK);
 		}
 		break;
 
 	case SPRO_RELEASE:
 		mtx_lock(&lun->lun_lock);
 		if ((lun->flags & CTL_LUN_PR_RESERVED) == 0) {
 			/* No reservation exists return good status */
 			mtx_unlock(&lun->lun_lock);
 			goto done;
 		}
 		/*
 		 * Is this nexus a reservation holder?
 		 */
 		if (lun->pr_res_idx != residx
 		 && lun->pr_res_idx != CTL_PR_ALL_REGISTRANTS) {
 			/*
 			 * not a res holder return good status but
 			 * do nothing
 			 */
 			mtx_unlock(&lun->lun_lock);
 			goto done;
 		}
 
 		if (lun->pr_res_type != type) {
 			mtx_unlock(&lun->lun_lock);
 			free(ctsio->kern_data_ptr, M_CTL);
 			ctl_set_illegal_pr_release(ctsio);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 
 		/* okay to release */
 		lun->flags &= ~CTL_LUN_PR_RESERVED;
 		lun->pr_res_idx = CTL_PR_NO_RESERVATION;
 		lun->pr_res_type = 0;
 
 		/*
 		 * If this isn't an exclusive access reservation and NUAR
 		 * is not set, generate UA for all other registrants.
 		 */
 		if (type != SPR_TYPE_EX_AC && type != SPR_TYPE_WR_EX &&
 		    (lun->MODE_CTRL.queue_flags & SCP_NUAR) == 0) {
 			for (i = softc->init_min; i < softc->init_max; i++) {
 				if (i == residx || ctl_get_prkey(lun, i) == 0)
 					continue;
 				ctl_est_ua(lun, i, CTL_UA_RES_RELEASE);
 			}
 		}
 		mtx_unlock(&lun->lun_lock);
 
 		/* Send msg to other side */
 		persis_io.hdr.nexus = ctsio->io_hdr.nexus;
 		persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION;
 		persis_io.pr.pr_info.action = CTL_PR_RELEASE;
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io,
 		     sizeof(persis_io.pr), M_WAITOK);
 		break;
 
 	case SPRO_CLEAR:
 		/* send msg to other side */
 
 		mtx_lock(&lun->lun_lock);
 		lun->flags &= ~CTL_LUN_PR_RESERVED;
 		lun->pr_res_type = 0;
 		lun->pr_key_count = 0;
 		lun->pr_res_idx = CTL_PR_NO_RESERVATION;
 
 		ctl_clr_prkey(lun, residx);
 		for (i = 0; i < CTL_MAX_INITIATORS; i++)
 			if (ctl_get_prkey(lun, i) != 0) {
 				ctl_clr_prkey(lun, i);
 				ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT);
 			}
 		lun->pr_generation++;
 		mtx_unlock(&lun->lun_lock);
 
 		persis_io.hdr.nexus = ctsio->io_hdr.nexus;
 		persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION;
 		persis_io.pr.pr_info.action = CTL_PR_CLEAR;
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io,
 		     sizeof(persis_io.pr), M_WAITOK);
 		break;
 
 	case SPRO_PREEMPT:
 	case SPRO_PRE_ABO: {
 		int nretval;
 
 		nretval = ctl_pro_preempt(softc, lun, res_key, sa_res_key, type,
 					  residx, ctsio, cdb, param);
 		if (nretval != 0)
 			return (CTL_RETVAL_COMPLETE);
 		break;
 	}
 	default:
 		panic("%s: Invalid PR type %#x", __func__, cdb->action);
 	}
 
 done:
 	free(ctsio->kern_data_ptr, M_CTL);
 	ctl_set_success(ctsio);
 	ctl_done((union ctl_io *)ctsio);
 
 	return (retval);
 }
 
 /*
  * This routine is for handling a message from the other SC pertaining to
  * persistent reserve out. All the error checking will have been done
  * so only perorming the action need be done here to keep the two
  * in sync.
  */
 static void
 ctl_hndl_per_res_out_on_other_sc(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	union ctl_ha_msg *msg = (union ctl_ha_msg *)&io->presio.pr_msg;
 	struct ctl_lun *lun;
 	int i;
 	uint32_t residx, targ_lun;
 
 	targ_lun = msg->hdr.nexus.targ_mapped_lun;
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		return;
 	}
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 	if (lun->flags & CTL_LUN_DISABLED) {
 		mtx_unlock(&lun->lun_lock);
 		return;
 	}
 	residx = ctl_get_initindex(&msg->hdr.nexus);
 	switch(msg->pr.pr_info.action) {
 	case CTL_PR_REG_KEY:
 		ctl_alloc_prkey(lun, msg->pr.pr_info.residx);
 		if (ctl_get_prkey(lun, msg->pr.pr_info.residx) == 0)
 			lun->pr_key_count++;
 		ctl_set_prkey(lun, msg->pr.pr_info.residx,
 		    scsi_8btou64(msg->pr.pr_info.sa_res_key));
 		lun->pr_generation++;
 		break;
 
 	case CTL_PR_UNREG_KEY:
 		ctl_clr_prkey(lun, msg->pr.pr_info.residx);
 		lun->pr_key_count--;
 
 		/* XXX Need to see if the reservation has been released */
 		/* if so do we need to generate UA? */
 		if (msg->pr.pr_info.residx == lun->pr_res_idx) {
 			lun->flags &= ~CTL_LUN_PR_RESERVED;
 			lun->pr_res_idx = CTL_PR_NO_RESERVATION;
 
 			if ((lun->pr_res_type == SPR_TYPE_WR_EX_RO ||
 			     lun->pr_res_type == SPR_TYPE_EX_AC_RO) &&
 			    lun->pr_key_count) {
 				/*
 				 * If the reservation is a registrants
 				 * only type we need to generate a UA
 				 * for other registered inits.  The
 				 * sense code should be RESERVATIONS
 				 * RELEASED
 				 */
 
 				for (i = softc->init_min; i < softc->init_max; i++) {
 					if (ctl_get_prkey(lun, i) == 0)
 						continue;
 
 					ctl_est_ua(lun, i, CTL_UA_RES_RELEASE);
 				}
 			}
 			lun->pr_res_type = 0;
 		} else if (lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS) {
 			if (lun->pr_key_count==0) {
 				lun->flags &= ~CTL_LUN_PR_RESERVED;
 				lun->pr_res_type = 0;
 				lun->pr_res_idx = CTL_PR_NO_RESERVATION;
 			}
 		}
 		lun->pr_generation++;
 		break;
 
 	case CTL_PR_RESERVE:
 		lun->flags |= CTL_LUN_PR_RESERVED;
 		lun->pr_res_type = msg->pr.pr_info.res_type;
 		lun->pr_res_idx = msg->pr.pr_info.residx;
 
 		break;
 
 	case CTL_PR_RELEASE:
 		/*
 		 * If this isn't an exclusive access reservation and NUAR
 		 * is not set, generate UA for all other registrants.
 		 */
 		if (lun->pr_res_type != SPR_TYPE_EX_AC &&
 		    lun->pr_res_type != SPR_TYPE_WR_EX &&
 		    (lun->MODE_CTRL.queue_flags & SCP_NUAR) == 0) {
 			for (i = softc->init_min; i < softc->init_max; i++)
 				if (i == residx || ctl_get_prkey(lun, i) == 0)
 					continue;
 				ctl_est_ua(lun, i, CTL_UA_RES_RELEASE);
 		}
 
 		lun->flags &= ~CTL_LUN_PR_RESERVED;
 		lun->pr_res_idx = CTL_PR_NO_RESERVATION;
 		lun->pr_res_type = 0;
 		break;
 
 	case CTL_PR_PREEMPT:
 		ctl_pro_preempt_other(lun, msg);
 		break;
 	case CTL_PR_CLEAR:
 		lun->flags &= ~CTL_LUN_PR_RESERVED;
 		lun->pr_res_type = 0;
 		lun->pr_key_count = 0;
 		lun->pr_res_idx = CTL_PR_NO_RESERVATION;
 
 		for (i=0; i < CTL_MAX_INITIATORS; i++) {
 			if (ctl_get_prkey(lun, i) == 0)
 				continue;
 			ctl_clr_prkey(lun, i);
 			ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT);
 		}
 		lun->pr_generation++;
 		break;
 	}
 
 	mtx_unlock(&lun->lun_lock);
 }
 
 int
 ctl_read_write(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct ctl_lba_len_flags *lbalen;
 	uint64_t lba;
 	uint32_t num_blocks;
 	int flags, retval;
 	int isread;
 
 	CTL_DEBUG_PRINT(("ctl_read_write: command: %#x\n", ctsio->cdb[0]));
 
 	flags = 0;
 	isread = ctsio->cdb[0] == READ_6  || ctsio->cdb[0] == READ_10
 	      || ctsio->cdb[0] == READ_12 || ctsio->cdb[0] == READ_16;
 	switch (ctsio->cdb[0]) {
 	case READ_6:
 	case WRITE_6: {
 		struct scsi_rw_6 *cdb;
 
 		cdb = (struct scsi_rw_6 *)ctsio->cdb;
 
 		lba = scsi_3btoul(cdb->addr);
 		/* only 5 bits are valid in the most significant address byte */
 		lba &= 0x1fffff;
 		num_blocks = cdb->length;
 		/*
 		 * This is correct according to SBC-2.
 		 */
 		if (num_blocks == 0)
 			num_blocks = 256;
 		break;
 	}
 	case READ_10:
 	case WRITE_10: {
 		struct scsi_rw_10 *cdb;
 
 		cdb = (struct scsi_rw_10 *)ctsio->cdb;
 		if (cdb->byte2 & SRW10_FUA)
 			flags |= CTL_LLF_FUA;
 		if (cdb->byte2 & SRW10_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_4btoul(cdb->addr);
 		num_blocks = scsi_2btoul(cdb->length);
 		break;
 	}
 	case WRITE_VERIFY_10: {
 		struct scsi_write_verify_10 *cdb;
 
 		cdb = (struct scsi_write_verify_10 *)ctsio->cdb;
 		flags |= CTL_LLF_FUA;
 		if (cdb->byte2 & SWV_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_4btoul(cdb->addr);
 		num_blocks = scsi_2btoul(cdb->length);
 		break;
 	}
 	case READ_12:
 	case WRITE_12: {
 		struct scsi_rw_12 *cdb;
 
 		cdb = (struct scsi_rw_12 *)ctsio->cdb;
 		if (cdb->byte2 & SRW12_FUA)
 			flags |= CTL_LLF_FUA;
 		if (cdb->byte2 & SRW12_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_4btoul(cdb->addr);
 		num_blocks = scsi_4btoul(cdb->length);
 		break;
 	}
 	case WRITE_VERIFY_12: {
 		struct scsi_write_verify_12 *cdb;
 
 		cdb = (struct scsi_write_verify_12 *)ctsio->cdb;
 		flags |= CTL_LLF_FUA;
 		if (cdb->byte2 & SWV_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_4btoul(cdb->addr);
 		num_blocks = scsi_4btoul(cdb->length);
 		break;
 	}
 	case READ_16:
 	case WRITE_16: {
 		struct scsi_rw_16 *cdb;
 
 		cdb = (struct scsi_rw_16 *)ctsio->cdb;
 		if (cdb->byte2 & SRW12_FUA)
 			flags |= CTL_LLF_FUA;
 		if (cdb->byte2 & SRW12_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_8btou64(cdb->addr);
 		num_blocks = scsi_4btoul(cdb->length);
 		break;
 	}
 	case WRITE_ATOMIC_16: {
 		struct scsi_write_atomic_16 *cdb;
 
 		if (lun->be_lun->atomicblock == 0) {
 			ctl_set_invalid_opcode(ctsio);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 
 		cdb = (struct scsi_write_atomic_16 *)ctsio->cdb;
 		if (cdb->byte2 & SRW12_FUA)
 			flags |= CTL_LLF_FUA;
 		if (cdb->byte2 & SRW12_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_8btou64(cdb->addr);
 		num_blocks = scsi_2btoul(cdb->length);
 		if (num_blocks > lun->be_lun->atomicblock) {
 			ctl_set_invalid_field(ctsio, /*sks_valid*/ 1,
 			    /*command*/ 1, /*field*/ 12, /*bit_valid*/ 0,
 			    /*bit*/ 0);
 			ctl_done((union ctl_io *)ctsio);
 			return (CTL_RETVAL_COMPLETE);
 		}
 		break;
 	}
 	case WRITE_VERIFY_16: {
 		struct scsi_write_verify_16 *cdb;
 
 		cdb = (struct scsi_write_verify_16 *)ctsio->cdb;
 		flags |= CTL_LLF_FUA;
 		if (cdb->byte2 & SWV_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_8btou64(cdb->addr);
 		num_blocks = scsi_4btoul(cdb->length);
 		break;
 	}
 	default:
 		/*
 		 * We got a command we don't support.  This shouldn't
 		 * happen, commands should be filtered out above us.
 		 */
 		ctl_set_invalid_opcode(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 
 		return (CTL_RETVAL_COMPLETE);
 		break; /* NOTREACHED */
 	}
 
 	/*
 	 * The first check is to make sure we're in bounds, the second
 	 * check is to catch wrap-around problems.  If the lba + num blocks
 	 * is less than the lba, then we've wrapped around and the block
 	 * range is invalid anyway.
 	 */
 	if (((lba + num_blocks) > (lun->be_lun->maxlba + 1))
 	 || ((lba + num_blocks) < lba)) {
 		ctl_set_lba_out_of_range(ctsio,
 		    MAX(lba, lun->be_lun->maxlba + 1));
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * According to SBC-3, a transfer length of 0 is not an error.
 	 * Note that this cannot happen with WRITE(6) or READ(6), since 0
 	 * translates to 256 blocks for those commands.
 	 */
 	if (num_blocks == 0) {
 		ctl_set_success(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/* Set FUA and/or DPO if caches are disabled. */
 	if (isread) {
 		if ((lun->MODE_CACHING.flags1 & SCP_RCD) != 0)
 			flags |= CTL_LLF_FUA | CTL_LLF_DPO;
 	} else {
 		if ((lun->MODE_CACHING.flags1 & SCP_WCE) == 0)
 			flags |= CTL_LLF_FUA;
 	}
 
 	lbalen = (struct ctl_lba_len_flags *)
 	    &ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN];
 	lbalen->lba = lba;
 	lbalen->len = num_blocks;
 	lbalen->flags = (isread ? CTL_LLF_READ : CTL_LLF_WRITE) | flags;
 
 	ctsio->kern_total_len = num_blocks * lun->be_lun->blocksize;
 	ctsio->kern_rel_offset = 0;
 
 	CTL_DEBUG_PRINT(("ctl_read_write: calling data_submit()\n"));
 
 	retval = lun->backend->data_submit((union ctl_io *)ctsio);
 	return (retval);
 }
 
 static int
 ctl_cnw_cont(union ctl_io *io)
 {
 	struct ctl_lun *lun = CTL_LUN(io);
 	struct ctl_scsiio *ctsio;
 	struct ctl_lba_len_flags *lbalen;
 	int retval;
 
 	ctsio = &io->scsiio;
 	ctsio->io_hdr.status = CTL_STATUS_NONE;
 	ctsio->io_hdr.flags &= ~CTL_FLAG_IO_CONT;
 	lbalen = (struct ctl_lba_len_flags *)
 	    &ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN];
 	lbalen->flags &= ~CTL_LLF_COMPARE;
 	lbalen->flags |= CTL_LLF_WRITE;
 
 	CTL_DEBUG_PRINT(("ctl_cnw_cont: calling data_submit()\n"));
 	retval = lun->backend->data_submit((union ctl_io *)ctsio);
 	return (retval);
 }
 
 int
 ctl_cnw(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct ctl_lba_len_flags *lbalen;
 	uint64_t lba;
 	uint32_t num_blocks;
 	int flags, retval;
 
 	CTL_DEBUG_PRINT(("ctl_cnw: command: %#x\n", ctsio->cdb[0]));
 
 	flags = 0;
 	switch (ctsio->cdb[0]) {
 	case COMPARE_AND_WRITE: {
 		struct scsi_compare_and_write *cdb;
 
 		cdb = (struct scsi_compare_and_write *)ctsio->cdb;
 		if (cdb->byte2 & SRW10_FUA)
 			flags |= CTL_LLF_FUA;
 		if (cdb->byte2 & SRW10_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_8btou64(cdb->addr);
 		num_blocks = cdb->length;
 		break;
 	}
 	default:
 		/*
 		 * We got a command we don't support.  This shouldn't
 		 * happen, commands should be filtered out above us.
 		 */
 		ctl_set_invalid_opcode(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 
 		return (CTL_RETVAL_COMPLETE);
 		break; /* NOTREACHED */
 	}
 
 	/*
 	 * The first check is to make sure we're in bounds, the second
 	 * check is to catch wrap-around problems.  If the lba + num blocks
 	 * is less than the lba, then we've wrapped around and the block
 	 * range is invalid anyway.
 	 */
 	if (((lba + num_blocks) > (lun->be_lun->maxlba + 1))
 	 || ((lba + num_blocks) < lba)) {
 		ctl_set_lba_out_of_range(ctsio,
 		    MAX(lba, lun->be_lun->maxlba + 1));
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * According to SBC-3, a transfer length of 0 is not an error.
 	 */
 	if (num_blocks == 0) {
 		ctl_set_success(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/* Set FUA if write cache is disabled. */
 	if ((lun->MODE_CACHING.flags1 & SCP_WCE) == 0)
 		flags |= CTL_LLF_FUA;
 
 	ctsio->kern_total_len = 2 * num_blocks * lun->be_lun->blocksize;
 	ctsio->kern_rel_offset = 0;
 
 	/*
 	 * Set the IO_CONT flag, so that if this I/O gets passed to
 	 * ctl_data_submit_done(), it'll get passed back to
 	 * ctl_ctl_cnw_cont() for further processing.
 	 */
 	ctsio->io_hdr.flags |= CTL_FLAG_IO_CONT;
 	ctsio->io_cont = ctl_cnw_cont;
 
 	lbalen = (struct ctl_lba_len_flags *)
 	    &ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN];
 	lbalen->lba = lba;
 	lbalen->len = num_blocks;
 	lbalen->flags = CTL_LLF_COMPARE | flags;
 
 	CTL_DEBUG_PRINT(("ctl_cnw: calling data_submit()\n"));
 	retval = lun->backend->data_submit((union ctl_io *)ctsio);
 	return (retval);
 }
 
 int
 ctl_verify(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct ctl_lba_len_flags *lbalen;
 	uint64_t lba;
 	uint32_t num_blocks;
 	int bytchk, flags;
 	int retval;
 
 	CTL_DEBUG_PRINT(("ctl_verify: command: %#x\n", ctsio->cdb[0]));
 
 	bytchk = 0;
 	flags = CTL_LLF_FUA;
 	switch (ctsio->cdb[0]) {
 	case VERIFY_10: {
 		struct scsi_verify_10 *cdb;
 
 		cdb = (struct scsi_verify_10 *)ctsio->cdb;
 		if (cdb->byte2 & SVFY_BYTCHK)
 			bytchk = 1;
 		if (cdb->byte2 & SVFY_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_4btoul(cdb->addr);
 		num_blocks = scsi_2btoul(cdb->length);
 		break;
 	}
 	case VERIFY_12: {
 		struct scsi_verify_12 *cdb;
 
 		cdb = (struct scsi_verify_12 *)ctsio->cdb;
 		if (cdb->byte2 & SVFY_BYTCHK)
 			bytchk = 1;
 		if (cdb->byte2 & SVFY_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_4btoul(cdb->addr);
 		num_blocks = scsi_4btoul(cdb->length);
 		break;
 	}
 	case VERIFY_16: {
 		struct scsi_rw_16 *cdb;
 
 		cdb = (struct scsi_rw_16 *)ctsio->cdb;
 		if (cdb->byte2 & SVFY_BYTCHK)
 			bytchk = 1;
 		if (cdb->byte2 & SVFY_DPO)
 			flags |= CTL_LLF_DPO;
 		lba = scsi_8btou64(cdb->addr);
 		num_blocks = scsi_4btoul(cdb->length);
 		break;
 	}
 	default:
 		/*
 		 * We got a command we don't support.  This shouldn't
 		 * happen, commands should be filtered out above us.
 		 */
 		ctl_set_invalid_opcode(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * The first check is to make sure we're in bounds, the second
 	 * check is to catch wrap-around problems.  If the lba + num blocks
 	 * is less than the lba, then we've wrapped around and the block
 	 * range is invalid anyway.
 	 */
 	if (((lba + num_blocks) > (lun->be_lun->maxlba + 1))
 	 || ((lba + num_blocks) < lba)) {
 		ctl_set_lba_out_of_range(ctsio,
 		    MAX(lba, lun->be_lun->maxlba + 1));
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	/*
 	 * According to SBC-3, a transfer length of 0 is not an error.
 	 */
 	if (num_blocks == 0) {
 		ctl_set_success(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	lbalen = (struct ctl_lba_len_flags *)
 	    &ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN];
 	lbalen->lba = lba;
 	lbalen->len = num_blocks;
 	if (bytchk) {
 		lbalen->flags = CTL_LLF_COMPARE | flags;
 		ctsio->kern_total_len = num_blocks * lun->be_lun->blocksize;
 	} else {
 		lbalen->flags = CTL_LLF_VERIFY | flags;
 		ctsio->kern_total_len = 0;
 	}
 	ctsio->kern_rel_offset = 0;
 
 	CTL_DEBUG_PRINT(("ctl_verify: calling data_submit()\n"));
 	retval = lun->backend->data_submit((union ctl_io *)ctsio);
 	return (retval);
 }
 
 int
 ctl_report_luns(struct ctl_scsiio *ctsio)
 {
 	struct ctl_softc *softc = CTL_SOFTC(ctsio);
 	struct ctl_port *port = CTL_PORT(ctsio);
 	struct ctl_lun *lun, *request_lun = CTL_LUN(ctsio);
 	struct scsi_report_luns *cdb;
 	struct scsi_report_luns_data *lun_data;
 	int num_filled, num_luns, num_port_luns, retval;
 	uint32_t alloc_len, lun_datalen;
 	uint32_t initidx, targ_lun_id, lun_id;
 
 	retval = CTL_RETVAL_COMPLETE;
 	cdb = (struct scsi_report_luns *)ctsio->cdb;
 
 	CTL_DEBUG_PRINT(("ctl_report_luns\n"));
 
 	num_luns = 0;
 	num_port_luns = port->lun_map ? port->lun_map_size : CTL_MAX_LUNS;
 	mtx_lock(&softc->ctl_lock);
 	for (targ_lun_id = 0; targ_lun_id < num_port_luns; targ_lun_id++) {
 		if (ctl_lun_map_from_port(port, targ_lun_id) != UINT32_MAX)
 			num_luns++;
 	}
 	mtx_unlock(&softc->ctl_lock);
 
 	switch (cdb->select_report) {
 	case RPL_REPORT_DEFAULT:
 	case RPL_REPORT_ALL:
 	case RPL_REPORT_NONSUBSID:
 		break;
 	case RPL_REPORT_WELLKNOWN:
 	case RPL_REPORT_ADMIN:
 	case RPL_REPORT_CONGLOM:
 		num_luns = 0;
 		break;
 	default:
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 2,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		return (retval);
 		break; /* NOTREACHED */
 	}
 
 	alloc_len = scsi_4btoul(cdb->length);
 	/*
 	 * The initiator has to allocate at least 16 bytes for this request,
 	 * so he can at least get the header and the first LUN.  Otherwise
 	 * we reject the request (per SPC-3 rev 14, section 6.21).
 	 */
 	if (alloc_len < (sizeof(struct scsi_report_luns_data) +
 	    sizeof(struct scsi_report_luns_lundata))) {
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 6,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		return (retval);
 	}
 
 	lun_datalen = sizeof(*lun_data) +
 		(num_luns * sizeof(struct scsi_report_luns_lundata));
 
 	ctsio->kern_data_ptr = malloc(lun_datalen, M_CTL, M_WAITOK | M_ZERO);
 	lun_data = (struct scsi_report_luns_data *)ctsio->kern_data_ptr;
 	ctsio->kern_sg_entries = 0;
 
 	initidx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 
 	mtx_lock(&softc->ctl_lock);
 	for (targ_lun_id = 0, num_filled = 0;
 	    targ_lun_id < num_port_luns && num_filled < num_luns;
 	    targ_lun_id++) {
 		lun_id = ctl_lun_map_from_port(port, targ_lun_id);
 		if (lun_id == UINT32_MAX)
 			continue;
 		lun = softc->ctl_luns[lun_id];
 		if (lun == NULL)
 			continue;
 
 		be64enc(lun_data->luns[num_filled++].lundata,
 		    ctl_encode_lun(targ_lun_id));
 
 		/*
 		 * According to SPC-3, rev 14 section 6.21:
 		 *
 		 * "The execution of a REPORT LUNS command to any valid and
 		 * installed logical unit shall clear the REPORTED LUNS DATA
 		 * HAS CHANGED unit attention condition for all logical
 		 * units of that target with respect to the requesting
 		 * initiator. A valid and installed logical unit is one
 		 * having a PERIPHERAL QUALIFIER of 000b in the standard
 		 * INQUIRY data (see 6.4.2)."
 		 *
 		 * If request_lun is NULL, the LUN this report luns command
 		 * was issued to is either disabled or doesn't exist. In that
 		 * case, we shouldn't clear any pending lun change unit
 		 * attention.
 		 */
 		if (request_lun != NULL) {
 			mtx_lock(&lun->lun_lock);
 			ctl_clr_ua(lun, initidx, CTL_UA_LUN_CHANGE);
 			mtx_unlock(&lun->lun_lock);
 		}
 	}
 	mtx_unlock(&softc->ctl_lock);
 
 	/*
 	 * It's quite possible that we've returned fewer LUNs than we allocated
 	 * space for.  Trim it.
 	 */
 	lun_datalen = sizeof(*lun_data) +
 		(num_filled * sizeof(struct scsi_report_luns_lundata));
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(lun_datalen, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/*
 	 * We set this to the actual data length, regardless of how much
 	 * space we actually have to return results.  If the user looks at
 	 * this value, he'll know whether or not he allocated enough space
 	 * and reissue the command if necessary.  We don't support well
 	 * known logical units, so if the user asks for that, return none.
 	 */
 	scsi_ulto4b(lun_datalen - 8, lun_data->length);
 
 	/*
 	 * We can only return SCSI_STATUS_CHECK_COND when we can't satisfy
 	 * this request.
 	 */
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (retval);
 }
 
 int
 ctl_request_sense(struct ctl_scsiio *ctsio)
 {
 	struct ctl_softc *softc = CTL_SOFTC(ctsio);
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_request_sense *cdb;
 	struct scsi_sense_data *sense_ptr, *ps;
 	uint32_t initidx;
 	int have_error;
 	u_int sense_len = SSD_FULL_SIZE;
 	scsi_sense_data_type sense_format;
 	ctl_ua_type ua_type;
 	uint8_t asc = 0, ascq = 0;
 
 	cdb = (struct scsi_request_sense *)ctsio->cdb;
 
 	CTL_DEBUG_PRINT(("ctl_request_sense\n"));
 
 	/*
 	 * Determine which sense format the user wants.
 	 */
 	if (cdb->byte2 & SRS_DESC)
 		sense_format = SSD_TYPE_DESC;
 	else
 		sense_format = SSD_TYPE_FIXED;
 
 	ctsio->kern_data_ptr = malloc(sizeof(*sense_ptr), M_CTL, M_WAITOK);
 	sense_ptr = (struct scsi_sense_data *)ctsio->kern_data_ptr;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 
 	/*
 	 * struct scsi_sense_data, which is currently set to 256 bytes, is
 	 * larger than the largest allowed value for the length field in the
 	 * REQUEST SENSE CDB, which is 252 bytes as of SPC-4.
 	 */
 	ctsio->kern_data_len = cdb->length;
 	ctsio->kern_total_len = cdb->length;
 
 	/*
 	 * If we don't have a LUN, we don't have any pending sense.
 	 */
 	if (lun == NULL ||
 	    ((lun->flags & CTL_LUN_PRIMARY_SC) == 0 &&
 	     softc->ha_link < CTL_HA_LINK_UNKNOWN)) {
 		/* "Logical unit not supported" */
 		ctl_set_sense_data(sense_ptr, &sense_len, NULL, sense_format,
 		    /*current_error*/ 1,
 		    /*sense_key*/ SSD_KEY_ILLEGAL_REQUEST,
 		    /*asc*/ 0x25,
 		    /*ascq*/ 0x00,
 		    SSD_ELEM_NONE);
 		goto send;
 	}
 
 	have_error = 0;
 	initidx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 	/*
 	 * Check for pending sense, and then for pending unit attentions.
 	 * Pending sense gets returned first, then pending unit attentions.
 	 */
 	mtx_lock(&lun->lun_lock);
 	ps = lun->pending_sense[initidx / CTL_MAX_INIT_PER_PORT];
 	if (ps != NULL)
 		ps += initidx % CTL_MAX_INIT_PER_PORT;
 	if (ps != NULL && ps->error_code != 0) {
 		scsi_sense_data_type stored_format;
 
 		/*
 		 * Check to see which sense format was used for the stored
 		 * sense data.
 		 */
 		stored_format = scsi_sense_type(ps);
 
 		/*
 		 * If the user requested a different sense format than the
 		 * one we stored, then we need to convert it to the other
 		 * format.  If we're going from descriptor to fixed format
 		 * sense data, we may lose things in translation, depending
 		 * on what options were used.
 		 *
 		 * If the stored format is SSD_TYPE_NONE (i.e. invalid),
 		 * for some reason we'll just copy it out as-is.
 		 */
 		if ((stored_format == SSD_TYPE_FIXED)
 		 && (sense_format == SSD_TYPE_DESC))
 			ctl_sense_to_desc((struct scsi_sense_data_fixed *)
 			    ps, (struct scsi_sense_data_desc *)sense_ptr);
 		else if ((stored_format == SSD_TYPE_DESC)
 		      && (sense_format == SSD_TYPE_FIXED))
 			ctl_sense_to_fixed((struct scsi_sense_data_desc *)
 			    ps, (struct scsi_sense_data_fixed *)sense_ptr);
 		else
 			memcpy(sense_ptr, ps, sizeof(*sense_ptr));
 
 		ps->error_code = 0;
 		have_error = 1;
 	} else {
 		ua_type = ctl_build_ua(lun, initidx, sense_ptr, &sense_len,
 		    sense_format);
 		if (ua_type != CTL_UA_NONE)
 			have_error = 1;
 	}
 	if (have_error == 0) {
 		/*
 		 * Report informational exception if have one and allowed.
 		 */
 		if (lun->MODE_IE.mrie != SIEP_MRIE_NO) {
 			asc = lun->ie_asc;
 			ascq = lun->ie_ascq;
 		}
 		ctl_set_sense_data(sense_ptr, &sense_len, lun, sense_format,
 		    /*current_error*/ 1,
 		    /*sense_key*/ SSD_KEY_NO_SENSE,
 		    /*asc*/ asc,
 		    /*ascq*/ ascq,
 		    SSD_ELEM_NONE);
 	}
 	mtx_unlock(&lun->lun_lock);
 
 send:
 	/*
 	 * We report the SCSI status as OK, since the status of the command
 	 * itself is OK.  We're reporting sense as parameter data.
 	 */
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_tur(struct ctl_scsiio *ctsio)
 {
 
 	CTL_DEBUG_PRINT(("ctl_tur\n"));
 
 	ctl_set_success(ctsio);
 	ctl_done((union ctl_io *)ctsio);
 
 	return (CTL_RETVAL_COMPLETE);
 }
 
 /*
  * SCSI VPD page 0x00, the Supported VPD Pages page.
  */
 static int
 ctl_inquiry_evpd_supported(struct ctl_scsiio *ctsio, int alloc_len)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_vpd_supported_pages *pages;
 	int sup_page_size;
 	int p;
 
 	sup_page_size = sizeof(struct scsi_vpd_supported_pages) *
 	    SCSI_EVPD_NUM_SUPPORTED_PAGES;
 	ctsio->kern_data_ptr = malloc(sup_page_size, M_CTL, M_WAITOK | M_ZERO);
 	pages = (struct scsi_vpd_supported_pages *)ctsio->kern_data_ptr;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(sup_page_size, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/*
 	 * The control device is always connected.  The disk device, on the
 	 * other hand, may not be online all the time.  Need to change this
 	 * to figure out whether the disk device is actually online or not.
 	 */
 	if (lun != NULL)
 		pages->device = (SID_QUAL_LU_CONNECTED << 5) |
 				lun->be_lun->lun_type;
 	else
 		pages->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT;
 
 	p = 0;
 	/* Supported VPD pages */
 	pages->page_list[p++] = SVPD_SUPPORTED_PAGES;
 	/* Serial Number */
 	pages->page_list[p++] = SVPD_UNIT_SERIAL_NUMBER;
 	/* Device Identification */
 	pages->page_list[p++] = SVPD_DEVICE_ID;
 	/* Extended INQUIRY Data */
 	pages->page_list[p++] = SVPD_EXTENDED_INQUIRY_DATA;
 	/* Mode Page Policy */
 	pages->page_list[p++] = SVPD_MODE_PAGE_POLICY;
 	/* SCSI Ports */
 	pages->page_list[p++] = SVPD_SCSI_PORTS;
 	/* Third-party Copy */
 	pages->page_list[p++] = SVPD_SCSI_TPC;
 	if (lun != NULL && lun->be_lun->lun_type == T_DIRECT) {
 		/* Block limits */
 		pages->page_list[p++] = SVPD_BLOCK_LIMITS;
 		/* Block Device Characteristics */
 		pages->page_list[p++] = SVPD_BDC;
 		/* Logical Block Provisioning */
 		pages->page_list[p++] = SVPD_LBP;
 	}
 	pages->length = p;
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 /*
  * SCSI VPD page 0x80, the Unit Serial Number page.
  */
 static int
 ctl_inquiry_evpd_serial(struct ctl_scsiio *ctsio, int alloc_len)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_vpd_unit_serial_number *sn_ptr;
 	int data_len;
 
 	data_len = 4 + CTL_SN_LEN;
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	sn_ptr = (struct scsi_vpd_unit_serial_number *)ctsio->kern_data_ptr;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/*
 	 * The control device is always connected.  The disk device, on the
 	 * other hand, may not be online all the time.  Need to change this
 	 * to figure out whether the disk device is actually online or not.
 	 */
 	if (lun != NULL)
 		sn_ptr->device = (SID_QUAL_LU_CONNECTED << 5) |
 				  lun->be_lun->lun_type;
 	else
 		sn_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT;
 
 	sn_ptr->page_code = SVPD_UNIT_SERIAL_NUMBER;
 	sn_ptr->length = CTL_SN_LEN;
 	/*
 	 * If we don't have a LUN, we just leave the serial number as
 	 * all spaces.
 	 */
 	if (lun != NULL) {
 		strncpy((char *)sn_ptr->serial_num,
 			(char *)lun->be_lun->serial_num, CTL_SN_LEN);
 	} else
 		memset(sn_ptr->serial_num, 0x20, CTL_SN_LEN);
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 
 /*
  * SCSI VPD page 0x86, the Extended INQUIRY Data page.
  */
 static int
 ctl_inquiry_evpd_eid(struct ctl_scsiio *ctsio, int alloc_len)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_vpd_extended_inquiry_data *eid_ptr;
 	int data_len;
 
 	data_len = sizeof(struct scsi_vpd_extended_inquiry_data);
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	eid_ptr = (struct scsi_vpd_extended_inquiry_data *)ctsio->kern_data_ptr;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/*
 	 * The control device is always connected.  The disk device, on the
 	 * other hand, may not be online all the time.
 	 */
 	if (lun != NULL)
 		eid_ptr->device = (SID_QUAL_LU_CONNECTED << 5) |
 				     lun->be_lun->lun_type;
 	else
 		eid_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT;
 	eid_ptr->page_code = SVPD_EXTENDED_INQUIRY_DATA;
 	scsi_ulto2b(data_len - 4, eid_ptr->page_length);
 	/*
 	 * We support head of queue, ordered and simple tags.
 	 */
 	eid_ptr->flags2 = SVPD_EID_HEADSUP | SVPD_EID_ORDSUP | SVPD_EID_SIMPSUP;
 	/*
 	 * Volatile cache supported.
 	 */
 	eid_ptr->flags3 = SVPD_EID_V_SUP;
 
 	/*
 	 * This means that we clear the REPORTED LUNS DATA HAS CHANGED unit
 	 * attention for a particular IT nexus on all LUNs once we report
 	 * it to that nexus once.  This bit is required as of SPC-4.
 	 */
 	eid_ptr->flags4 = SVPD_EID_LUICLR;
 
 	/*
 	 * We support revert to defaults (RTD) bit in MODE SELECT.
 	 */
 	eid_ptr->flags5 = SVPD_EID_RTD_SUP;
 
 	/*
 	 * XXX KDM in order to correctly answer this, we would need
 	 * information from the SIM to determine how much sense data it
 	 * can send.  So this would really be a path inquiry field, most
 	 * likely.  This can be set to a maximum of 252 according to SPC-4,
 	 * but the hardware may or may not be able to support that much.
 	 * 0 just means that the maximum sense data length is not reported.
 	 */
 	eid_ptr->max_sense_length = 0;
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 static int
 ctl_inquiry_evpd_mpp(struct ctl_scsiio *ctsio, int alloc_len)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_vpd_mode_page_policy *mpp_ptr;
 	int data_len;
 
 	data_len = sizeof(struct scsi_vpd_mode_page_policy) +
 	    sizeof(struct scsi_vpd_mode_page_policy_descr);
 
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	mpp_ptr = (struct scsi_vpd_mode_page_policy *)ctsio->kern_data_ptr;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/*
 	 * The control device is always connected.  The disk device, on the
 	 * other hand, may not be online all the time.
 	 */
 	if (lun != NULL)
 		mpp_ptr->device = (SID_QUAL_LU_CONNECTED << 5) |
 				     lun->be_lun->lun_type;
 	else
 		mpp_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT;
 	mpp_ptr->page_code = SVPD_MODE_PAGE_POLICY;
 	scsi_ulto2b(data_len - 4, mpp_ptr->page_length);
 	mpp_ptr->descr[0].page_code = 0x3f;
 	mpp_ptr->descr[0].subpage_code = 0xff;
 	mpp_ptr->descr[0].policy = SVPD_MPP_SHARED;
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 /*
  * SCSI VPD page 0x83, the Device Identification page.
  */
 static int
 ctl_inquiry_evpd_devid(struct ctl_scsiio *ctsio, int alloc_len)
 {
 	struct ctl_softc *softc = CTL_SOFTC(ctsio);
 	struct ctl_port *port = CTL_PORT(ctsio);
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_vpd_device_id *devid_ptr;
 	struct scsi_vpd_id_descriptor *desc;
 	int data_len, g;
 	uint8_t proto;
 
 	data_len = sizeof(struct scsi_vpd_device_id) +
 	    sizeof(struct scsi_vpd_id_descriptor) +
 		sizeof(struct scsi_vpd_id_rel_trgt_port_id) +
 	    sizeof(struct scsi_vpd_id_descriptor) +
 		sizeof(struct scsi_vpd_id_trgt_port_grp_id);
 	if (lun && lun->lun_devid)
 		data_len += lun->lun_devid->len;
 	if (port && port->port_devid)
 		data_len += port->port_devid->len;
 	if (port && port->target_devid)
 		data_len += port->target_devid->len;
 
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	devid_ptr = (struct scsi_vpd_device_id *)ctsio->kern_data_ptr;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/*
 	 * The control device is always connected.  The disk device, on the
 	 * other hand, may not be online all the time.
 	 */
 	if (lun != NULL)
 		devid_ptr->device = (SID_QUAL_LU_CONNECTED << 5) |
 				     lun->be_lun->lun_type;
 	else
 		devid_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT;
 	devid_ptr->page_code = SVPD_DEVICE_ID;
 	scsi_ulto2b(data_len - 4, devid_ptr->length);
 
 	if (port && port->port_type == CTL_PORT_FC)
 		proto = SCSI_PROTO_FC << 4;
-	else if (port->port_type == CTL_PORT_SAS)
+	else if (port && port->port_type == CTL_PORT_SAS)
 		proto = SCSI_PROTO_SAS << 4;
 	else if (port && port->port_type == CTL_PORT_ISCSI)
 		proto = SCSI_PROTO_ISCSI << 4;
 	else
 		proto = SCSI_PROTO_SPI << 4;
 	desc = (struct scsi_vpd_id_descriptor *)devid_ptr->desc_list;
 
 	/*
 	 * We're using a LUN association here.  i.e., this device ID is a
 	 * per-LUN identifier.
 	 */
 	if (lun && lun->lun_devid) {
 		memcpy(desc, lun->lun_devid->data, lun->lun_devid->len);
 		desc = (struct scsi_vpd_id_descriptor *)((uint8_t *)desc +
 		    lun->lun_devid->len);
 	}
 
 	/*
 	 * This is for the WWPN which is a port association.
 	 */
 	if (port && port->port_devid) {
 		memcpy(desc, port->port_devid->data, port->port_devid->len);
 		desc = (struct scsi_vpd_id_descriptor *)((uint8_t *)desc +
 		    port->port_devid->len);
 	}
 
 	/*
 	 * This is for the Relative Target Port(type 4h) identifier
 	 */
 	desc->proto_codeset = proto | SVPD_ID_CODESET_BINARY;
 	desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_PORT |
 	    SVPD_ID_TYPE_RELTARG;
 	desc->length = 4;
 	scsi_ulto2b(ctsio->io_hdr.nexus.targ_port, &desc->identifier[2]);
 	desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] +
 	    sizeof(struct scsi_vpd_id_rel_trgt_port_id));
 
 	/*
 	 * This is for the Target Port Group(type 5h) identifier
 	 */
 	desc->proto_codeset = proto | SVPD_ID_CODESET_BINARY;
 	desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_PORT |
 	    SVPD_ID_TYPE_TPORTGRP;
 	desc->length = 4;
 	if (softc->is_single ||
 	    (port && port->status & CTL_PORT_STATUS_HA_SHARED))
 		g = 1;
 	else
 		g = 2 + ctsio->io_hdr.nexus.targ_port / softc->port_cnt;
 	scsi_ulto2b(g, &desc->identifier[2]);
 	desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] +
 	    sizeof(struct scsi_vpd_id_trgt_port_grp_id));
 
 	/*
 	 * This is for the Target identifier
 	 */
 	if (port && port->target_devid) {
 		memcpy(desc, port->target_devid->data, port->target_devid->len);
 	}
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 static int
 ctl_inquiry_evpd_scsi_ports(struct ctl_scsiio *ctsio, int alloc_len)
 {
 	struct ctl_softc *softc = CTL_SOFTC(ctsio);
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_vpd_scsi_ports *sp;
 	struct scsi_vpd_port_designation *pd;
 	struct scsi_vpd_port_designation_cont *pdc;
 	struct ctl_port *port;
 	int data_len, num_target_ports, iid_len, id_len;
 
 	num_target_ports = 0;
 	iid_len = 0;
 	id_len = 0;
 	mtx_lock(&softc->ctl_lock);
 	STAILQ_FOREACH(port, &softc->port_list, links) {
 		if ((port->status & CTL_PORT_STATUS_ONLINE) == 0)
 			continue;
 		if (lun != NULL &&
 		    ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX)
 			continue;
 		num_target_ports++;
 		if (port->init_devid)
 			iid_len += port->init_devid->len;
 		if (port->port_devid)
 			id_len += port->port_devid->len;
 	}
 	mtx_unlock(&softc->ctl_lock);
 
 	data_len = sizeof(struct scsi_vpd_scsi_ports) +
 	    num_target_ports * (sizeof(struct scsi_vpd_port_designation) +
 	     sizeof(struct scsi_vpd_port_designation_cont)) + iid_len + id_len;
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	sp = (struct scsi_vpd_scsi_ports *)ctsio->kern_data_ptr;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/*
 	 * The control device is always connected.  The disk device, on the
 	 * other hand, may not be online all the time.  Need to change this
 	 * to figure out whether the disk device is actually online or not.
 	 */
 	if (lun != NULL)
 		sp->device = (SID_QUAL_LU_CONNECTED << 5) |
 				  lun->be_lun->lun_type;
 	else
 		sp->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT;
 
 	sp->page_code = SVPD_SCSI_PORTS;
 	scsi_ulto2b(data_len - sizeof(struct scsi_vpd_scsi_ports),
 	    sp->page_length);
 	pd = &sp->design[0];
 
 	mtx_lock(&softc->ctl_lock);
 	STAILQ_FOREACH(port, &softc->port_list, links) {
 		if ((port->status & CTL_PORT_STATUS_ONLINE) == 0)
 			continue;
 		if (lun != NULL &&
 		    ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX)
 			continue;
 		scsi_ulto2b(port->targ_port, pd->relative_port_id);
 		if (port->init_devid) {
 			iid_len = port->init_devid->len;
 			memcpy(pd->initiator_transportid,
 			    port->init_devid->data, port->init_devid->len);
 		} else
 			iid_len = 0;
 		scsi_ulto2b(iid_len, pd->initiator_transportid_length);
 		pdc = (struct scsi_vpd_port_designation_cont *)
 		    (&pd->initiator_transportid[iid_len]);
 		if (port->port_devid) {
 			id_len = port->port_devid->len;
 			memcpy(pdc->target_port_descriptors,
 			    port->port_devid->data, port->port_devid->len);
 		} else
 			id_len = 0;
 		scsi_ulto2b(id_len, pdc->target_port_descriptors_length);
 		pd = (struct scsi_vpd_port_designation *)
 		    ((uint8_t *)pdc->target_port_descriptors + id_len);
 	}
 	mtx_unlock(&softc->ctl_lock);
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 static int
 ctl_inquiry_evpd_block_limits(struct ctl_scsiio *ctsio, int alloc_len)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_vpd_block_limits *bl_ptr;
 	uint64_t ival;
 
 	ctsio->kern_data_ptr = malloc(sizeof(*bl_ptr), M_CTL, M_WAITOK | M_ZERO);
 	bl_ptr = (struct scsi_vpd_block_limits *)ctsio->kern_data_ptr;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_data_len = min(sizeof(*bl_ptr), alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/*
 	 * The control device is always connected.  The disk device, on the
 	 * other hand, may not be online all the time.  Need to change this
 	 * to figure out whether the disk device is actually online or not.
 	 */
 	if (lun != NULL)
 		bl_ptr->device = (SID_QUAL_LU_CONNECTED << 5) |
 				  lun->be_lun->lun_type;
 	else
 		bl_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT;
 
 	bl_ptr->page_code = SVPD_BLOCK_LIMITS;
 	scsi_ulto2b(sizeof(*bl_ptr) - 4, bl_ptr->page_length);
 	bl_ptr->max_cmp_write_len = 0xff;
 	scsi_ulto4b(0xffffffff, bl_ptr->max_txfer_len);
 	if (lun != NULL) {
 		scsi_ulto4b(lun->be_lun->opttxferlen, bl_ptr->opt_txfer_len);
 		if (lun->be_lun->flags & CTL_LUN_FLAG_UNMAP) {
 			ival = 0xffffffff;
 			ctl_get_opt_number(&lun->be_lun->options,
 			    "unmap_max_lba", &ival);
 			scsi_ulto4b(ival, bl_ptr->max_unmap_lba_cnt);
 			ival = 0xffffffff;
 			ctl_get_opt_number(&lun->be_lun->options,
 			    "unmap_max_descr", &ival);
 			scsi_ulto4b(ival, bl_ptr->max_unmap_blk_cnt);
 			if (lun->be_lun->ublockexp != 0) {
 				scsi_ulto4b((1 << lun->be_lun->ublockexp),
 				    bl_ptr->opt_unmap_grain);
 				scsi_ulto4b(0x80000000 | lun->be_lun->ublockoff,
 				    bl_ptr->unmap_grain_align);
 			}
 		}
 		scsi_ulto4b(lun->be_lun->atomicblock,
 		    bl_ptr->max_atomic_transfer_length);
 		scsi_ulto4b(0, bl_ptr->atomic_alignment);
 		scsi_ulto4b(0, bl_ptr->atomic_transfer_length_granularity);
 		scsi_ulto4b(0, bl_ptr->max_atomic_transfer_length_with_atomic_boundary);
 		scsi_ulto4b(0, bl_ptr->max_atomic_boundary_size);
 		ival = UINT64_MAX;
 		ctl_get_opt_number(&lun->be_lun->options, "write_same_max_lba", &ival);
 		scsi_u64to8b(ival, bl_ptr->max_write_same_length);
 	}
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 static int
 ctl_inquiry_evpd_bdc(struct ctl_scsiio *ctsio, int alloc_len)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_vpd_block_device_characteristics *bdc_ptr;
 	const char *value;
 	u_int i;
 
 	ctsio->kern_data_ptr = malloc(sizeof(*bdc_ptr), M_CTL, M_WAITOK | M_ZERO);
 	bdc_ptr = (struct scsi_vpd_block_device_characteristics *)ctsio->kern_data_ptr;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(sizeof(*bdc_ptr), alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/*
 	 * The control device is always connected.  The disk device, on the
 	 * other hand, may not be online all the time.  Need to change this
 	 * to figure out whether the disk device is actually online or not.
 	 */
 	if (lun != NULL)
 		bdc_ptr->device = (SID_QUAL_LU_CONNECTED << 5) |
 				  lun->be_lun->lun_type;
 	else
 		bdc_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT;
 	bdc_ptr->page_code = SVPD_BDC;
 	scsi_ulto2b(sizeof(*bdc_ptr) - 4, bdc_ptr->page_length);
 	if (lun != NULL &&
 	    (value = ctl_get_opt(&lun->be_lun->options, "rpm")) != NULL)
 		i = strtol(value, NULL, 0);
 	else
 		i = CTL_DEFAULT_ROTATION_RATE;
 	scsi_ulto2b(i, bdc_ptr->medium_rotation_rate);
 	if (lun != NULL &&
 	    (value = ctl_get_opt(&lun->be_lun->options, "formfactor")) != NULL)
 		i = strtol(value, NULL, 0);
 	else
 		i = 0;
 	bdc_ptr->wab_wac_ff = (i & 0x0f);
 	bdc_ptr->flags = SVPD_FUAB | SVPD_VBULS;
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 static int
 ctl_inquiry_evpd_lbp(struct ctl_scsiio *ctsio, int alloc_len)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_vpd_logical_block_prov *lbp_ptr;
 	const char *value;
 
 	ctsio->kern_data_ptr = malloc(sizeof(*lbp_ptr), M_CTL, M_WAITOK | M_ZERO);
 	lbp_ptr = (struct scsi_vpd_logical_block_prov *)ctsio->kern_data_ptr;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(sizeof(*lbp_ptr), alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	/*
 	 * The control device is always connected.  The disk device, on the
 	 * other hand, may not be online all the time.  Need to change this
 	 * to figure out whether the disk device is actually online or not.
 	 */
 	if (lun != NULL)
 		lbp_ptr->device = (SID_QUAL_LU_CONNECTED << 5) |
 				  lun->be_lun->lun_type;
 	else
 		lbp_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT;
 
 	lbp_ptr->page_code = SVPD_LBP;
 	scsi_ulto2b(sizeof(*lbp_ptr) - 4, lbp_ptr->page_length);
 	lbp_ptr->threshold_exponent = CTL_LBP_EXPONENT;
 	if (lun != NULL && lun->be_lun->flags & CTL_LUN_FLAG_UNMAP) {
 		lbp_ptr->flags = SVPD_LBP_UNMAP | SVPD_LBP_WS16 |
 		    SVPD_LBP_WS10 | SVPD_LBP_RZ | SVPD_LBP_ANC_SUP;
 		value = ctl_get_opt(&lun->be_lun->options, "provisioning_type");
 		if (value != NULL) {
 			if (strcmp(value, "resource") == 0)
 				lbp_ptr->prov_type = SVPD_LBP_RESOURCE;
 			else if (strcmp(value, "thin") == 0)
 				lbp_ptr->prov_type = SVPD_LBP_THIN;
 		} else
 			lbp_ptr->prov_type = SVPD_LBP_THIN;
 	}
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 /*
  * INQUIRY with the EVPD bit set.
  */
 static int
 ctl_inquiry_evpd(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_inquiry *cdb;
 	int alloc_len, retval;
 
 	cdb = (struct scsi_inquiry *)ctsio->cdb;
 	alloc_len = scsi_2btoul(cdb->length);
 
 	switch (cdb->page_code) {
 	case SVPD_SUPPORTED_PAGES:
 		retval = ctl_inquiry_evpd_supported(ctsio, alloc_len);
 		break;
 	case SVPD_UNIT_SERIAL_NUMBER:
 		retval = ctl_inquiry_evpd_serial(ctsio, alloc_len);
 		break;
 	case SVPD_DEVICE_ID:
 		retval = ctl_inquiry_evpd_devid(ctsio, alloc_len);
 		break;
 	case SVPD_EXTENDED_INQUIRY_DATA:
 		retval = ctl_inquiry_evpd_eid(ctsio, alloc_len);
 		break;
 	case SVPD_MODE_PAGE_POLICY:
 		retval = ctl_inquiry_evpd_mpp(ctsio, alloc_len);
 		break;
 	case SVPD_SCSI_PORTS:
 		retval = ctl_inquiry_evpd_scsi_ports(ctsio, alloc_len);
 		break;
 	case SVPD_SCSI_TPC:
 		retval = ctl_inquiry_evpd_tpc(ctsio, alloc_len);
 		break;
 	case SVPD_BLOCK_LIMITS:
 		if (lun == NULL || lun->be_lun->lun_type != T_DIRECT)
 			goto err;
 		retval = ctl_inquiry_evpd_block_limits(ctsio, alloc_len);
 		break;
 	case SVPD_BDC:
 		if (lun == NULL || lun->be_lun->lun_type != T_DIRECT)
 			goto err;
 		retval = ctl_inquiry_evpd_bdc(ctsio, alloc_len);
 		break;
 	case SVPD_LBP:
 		if (lun == NULL || lun->be_lun->lun_type != T_DIRECT)
 			goto err;
 		retval = ctl_inquiry_evpd_lbp(ctsio, alloc_len);
 		break;
 	default:
 err:
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 2,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		retval = CTL_RETVAL_COMPLETE;
 		break;
 	}
 
 	return (retval);
 }
 
 /*
  * Standard INQUIRY data.
  */
 static int
 ctl_inquiry_std(struct ctl_scsiio *ctsio)
 {
 	struct ctl_softc *softc = CTL_SOFTC(ctsio);
 	struct ctl_port *port = CTL_PORT(ctsio);
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_inquiry_data *inq_ptr;
 	struct scsi_inquiry *cdb;
 	char *val;
 	uint32_t alloc_len, data_len;
 	ctl_port_type port_type;
 
 	port_type = port->port_type;
 	if (port_type == CTL_PORT_IOCTL || port_type == CTL_PORT_INTERNAL)
 		port_type = CTL_PORT_SCSI;
 
 	cdb = (struct scsi_inquiry *)ctsio->cdb;
 	alloc_len = scsi_2btoul(cdb->length);
 
 	/*
 	 * We malloc the full inquiry data size here and fill it
 	 * in.  If the user only asks for less, we'll give him
 	 * that much.
 	 */
 	data_len = offsetof(struct scsi_inquiry_data, vendor_specific1);
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	inq_ptr = (struct scsi_inquiry_data *)ctsio->kern_data_ptr;
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	if (lun != NULL) {
 		if ((lun->flags & CTL_LUN_PRIMARY_SC) ||
 		    softc->ha_link >= CTL_HA_LINK_UNKNOWN) {
 			inq_ptr->device = (SID_QUAL_LU_CONNECTED << 5) |
 			    lun->be_lun->lun_type;
 		} else {
 			inq_ptr->device = (SID_QUAL_LU_OFFLINE << 5) |
 			    lun->be_lun->lun_type;
 		}
 		if (lun->flags & CTL_LUN_REMOVABLE)
 			inq_ptr->dev_qual2 |= SID_RMB;
 	} else
 		inq_ptr->device = (SID_QUAL_BAD_LU << 5) | T_NODEVICE;
 
 	/* RMB in byte 2 is 0 */
 	inq_ptr->version = SCSI_REV_SPC5;
 
 	/*
 	 * According to SAM-3, even if a device only supports a single
 	 * level of LUN addressing, it should still set the HISUP bit:
 	 *
 	 * 4.9.1 Logical unit numbers overview
 	 *
 	 * All logical unit number formats described in this standard are
 	 * hierarchical in structure even when only a single level in that
 	 * hierarchy is used. The HISUP bit shall be set to one in the
 	 * standard INQUIRY data (see SPC-2) when any logical unit number
 	 * format described in this standard is used.  Non-hierarchical
 	 * formats are outside the scope of this standard.
 	 *
 	 * Therefore we set the HiSup bit here.
 	 *
 	 * The response format is 2, per SPC-3.
 	 */
 	inq_ptr->response_format = SID_HiSup | 2;
 
 	inq_ptr->additional_length = data_len -
 	    (offsetof(struct scsi_inquiry_data, additional_length) + 1);
 	CTL_DEBUG_PRINT(("additional_length = %d\n",
 			 inq_ptr->additional_length));
 
 	inq_ptr->spc3_flags = SPC3_SID_3PC | SPC3_SID_TPGS_IMPLICIT;
 	if (port_type == CTL_PORT_SCSI)
 		inq_ptr->spc2_flags = SPC2_SID_ADDR16;
 	inq_ptr->spc2_flags |= SPC2_SID_MultiP;
 	inq_ptr->flags = SID_CmdQue;
 	if (port_type == CTL_PORT_SCSI)
 		inq_ptr->flags |= SID_WBus16 | SID_Sync;
 
 	/*
 	 * Per SPC-3, unused bytes in ASCII strings are filled with spaces.
 	 * We have 8 bytes for the vendor name, and 16 bytes for the device
 	 * name and 4 bytes for the revision.
 	 */
 	if (lun == NULL || (val = ctl_get_opt(&lun->be_lun->options,
 	    "vendor")) == NULL) {
 		strncpy(inq_ptr->vendor, CTL_VENDOR, sizeof(inq_ptr->vendor));
 	} else {
 		memset(inq_ptr->vendor, ' ', sizeof(inq_ptr->vendor));
 		strncpy(inq_ptr->vendor, val,
 		    min(sizeof(inq_ptr->vendor), strlen(val)));
 	}
 	if (lun == NULL) {
 		strncpy(inq_ptr->product, CTL_DIRECT_PRODUCT,
 		    sizeof(inq_ptr->product));
 	} else if ((val = ctl_get_opt(&lun->be_lun->options, "product")) == NULL) {
 		switch (lun->be_lun->lun_type) {
 		case T_DIRECT:
 			strncpy(inq_ptr->product, CTL_DIRECT_PRODUCT,
 			    sizeof(inq_ptr->product));
 			break;
 		case T_PROCESSOR:
 			strncpy(inq_ptr->product, CTL_PROCESSOR_PRODUCT,
 			    sizeof(inq_ptr->product));
 			break;
 		case T_CDROM:
 			strncpy(inq_ptr->product, CTL_CDROM_PRODUCT,
 			    sizeof(inq_ptr->product));
 			break;
 		default:
 			strncpy(inq_ptr->product, CTL_UNKNOWN_PRODUCT,
 			    sizeof(inq_ptr->product));
 			break;
 		}
 	} else {
 		memset(inq_ptr->product, ' ', sizeof(inq_ptr->product));
 		strncpy(inq_ptr->product, val,
 		    min(sizeof(inq_ptr->product), strlen(val)));
 	}
 
 	/*
 	 * XXX make this a macro somewhere so it automatically gets
 	 * incremented when we make changes.
 	 */
 	if (lun == NULL || (val = ctl_get_opt(&lun->be_lun->options,
 	    "revision")) == NULL) {
 		strncpy(inq_ptr->revision, "0001", sizeof(inq_ptr->revision));
 	} else {
 		memset(inq_ptr->revision, ' ', sizeof(inq_ptr->revision));
 		strncpy(inq_ptr->revision, val,
 		    min(sizeof(inq_ptr->revision), strlen(val)));
 	}
 
 	/*
 	 * For parallel SCSI, we support double transition and single
 	 * transition clocking.  We also support QAS (Quick Arbitration
 	 * and Selection) and Information Unit transfers on both the
 	 * control and array devices.
 	 */
 	if (port_type == CTL_PORT_SCSI)
 		inq_ptr->spi3data = SID_SPI_CLOCK_DT_ST | SID_SPI_QAS |
 				    SID_SPI_IUS;
 
 	/* SAM-6 (no version claimed) */
 	scsi_ulto2b(0x00C0, inq_ptr->version1);
 	/* SPC-5 (no version claimed) */
 	scsi_ulto2b(0x05C0, inq_ptr->version2);
 	if (port_type == CTL_PORT_FC) {
 		/* FCP-2 ANSI INCITS.350:2003 */
 		scsi_ulto2b(0x0917, inq_ptr->version3);
 	} else if (port_type == CTL_PORT_SCSI) {
 		/* SPI-4 ANSI INCITS.362:200x */
 		scsi_ulto2b(0x0B56, inq_ptr->version3);
 	} else if (port_type == CTL_PORT_ISCSI) {
 		/* iSCSI (no version claimed) */
 		scsi_ulto2b(0x0960, inq_ptr->version3);
 	} else if (port_type == CTL_PORT_SAS) {
 		/* SAS (no version claimed) */
 		scsi_ulto2b(0x0BE0, inq_ptr->version3);
 	} else if (port_type == CTL_PORT_UMASS) {
 		/* USB Mass Storage Class Bulk-Only Transport, Revision 1.0 */
 		scsi_ulto2b(0x1730, inq_ptr->version3);
 	}
 
 	if (lun == NULL) {
 		/* SBC-4 (no version claimed) */
 		scsi_ulto2b(0x0600, inq_ptr->version4);
 	} else {
 		switch (lun->be_lun->lun_type) {
 		case T_DIRECT:
 			/* SBC-4 (no version claimed) */
 			scsi_ulto2b(0x0600, inq_ptr->version4);
 			break;
 		case T_PROCESSOR:
 			break;
 		case T_CDROM:
 			/* MMC-6 (no version claimed) */
 			scsi_ulto2b(0x04E0, inq_ptr->version4);
 			break;
 		default:
 			break;
 		}
 	}
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_inquiry(struct ctl_scsiio *ctsio)
 {
 	struct scsi_inquiry *cdb;
 	int retval;
 
 	CTL_DEBUG_PRINT(("ctl_inquiry\n"));
 
 	cdb = (struct scsi_inquiry *)ctsio->cdb;
 	if (cdb->byte2 & SI_EVPD)
 		retval = ctl_inquiry_evpd(ctsio);
 	else if (cdb->page_code == 0)
 		retval = ctl_inquiry_std(ctsio);
 	else {
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ 2,
 				      /*bit_valid*/ 0,
 				      /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 
 	return (retval);
 }
 
 int
 ctl_get_config(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_get_config_header *hdr;
 	struct scsi_get_config_feature *feature;
 	struct scsi_get_config *cdb;
 	uint32_t alloc_len, data_len;
 	int rt, starting;
 
 	cdb = (struct scsi_get_config *)ctsio->cdb;
 	rt = (cdb->rt & SGC_RT_MASK);
 	starting = scsi_2btoul(cdb->starting_feature);
 	alloc_len = scsi_2btoul(cdb->length);
 
 	data_len = sizeof(struct scsi_get_config_header) +
 	    sizeof(struct scsi_get_config_feature) + 8 +
 	    sizeof(struct scsi_get_config_feature) + 8 +
 	    sizeof(struct scsi_get_config_feature) + 4 +
 	    sizeof(struct scsi_get_config_feature) + 4 +
 	    sizeof(struct scsi_get_config_feature) + 8 +
 	    sizeof(struct scsi_get_config_feature) +
 	    sizeof(struct scsi_get_config_feature) + 4 +
 	    sizeof(struct scsi_get_config_feature) + 4 +
 	    sizeof(struct scsi_get_config_feature) + 4 +
 	    sizeof(struct scsi_get_config_feature) + 4 +
 	    sizeof(struct scsi_get_config_feature) + 4 +
 	    sizeof(struct scsi_get_config_feature) + 4;
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 
 	hdr = (struct scsi_get_config_header *)ctsio->kern_data_ptr;
 	if (lun->flags & CTL_LUN_NO_MEDIA)
 		scsi_ulto2b(0x0000, hdr->current_profile);
 	else
 		scsi_ulto2b(0x0010, hdr->current_profile);
 	feature = (struct scsi_get_config_feature *)(hdr + 1);
 
 	if (starting > 0x003b)
 		goto done;
 	if (starting > 0x003a)
 		goto f3b;
 	if (starting > 0x002b)
 		goto f3a;
 	if (starting > 0x002a)
 		goto f2b;
 	if (starting > 0x001f)
 		goto f2a;
 	if (starting > 0x001e)
 		goto f1f;
 	if (starting > 0x001d)
 		goto f1e;
 	if (starting > 0x0010)
 		goto f1d;
 	if (starting > 0x0003)
 		goto f10;
 	if (starting > 0x0002)
 		goto f3;
 	if (starting > 0x0001)
 		goto f2;
 	if (starting > 0x0000)
 		goto f1;
 
 	/* Profile List */
 	scsi_ulto2b(0x0000, feature->feature_code);
 	feature->flags = SGC_F_PERSISTENT | SGC_F_CURRENT;
 	feature->add_length = 8;
 	scsi_ulto2b(0x0008, &feature->feature_data[0]);	/* CD-ROM */
 	feature->feature_data[2] = 0x00;
 	scsi_ulto2b(0x0010, &feature->feature_data[4]);	/* DVD-ROM */
 	feature->feature_data[6] = 0x01;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 f1:	/* Core */
 	scsi_ulto2b(0x0001, feature->feature_code);
 	feature->flags = 0x08 | SGC_F_PERSISTENT | SGC_F_CURRENT;
 	feature->add_length = 8;
 	scsi_ulto4b(0x00000000, &feature->feature_data[0]);
 	feature->feature_data[4] = 0x03;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 f2:	/* Morphing */
 	scsi_ulto2b(0x0002, feature->feature_code);
 	feature->flags = 0x04 | SGC_F_PERSISTENT | SGC_F_CURRENT;
 	feature->add_length = 4;
 	feature->feature_data[0] = 0x02;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 f3:	/* Removable Medium */
 	scsi_ulto2b(0x0003, feature->feature_code);
 	feature->flags = 0x04 | SGC_F_PERSISTENT | SGC_F_CURRENT;
 	feature->add_length = 4;
 	feature->feature_data[0] = 0x39;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 	if (rt == SGC_RT_CURRENT && (lun->flags & CTL_LUN_NO_MEDIA))
 		goto done;
 
 f10:	/* Random Read */
 	scsi_ulto2b(0x0010, feature->feature_code);
 	feature->flags = 0x00;
 	if ((lun->flags & CTL_LUN_NO_MEDIA) == 0)
 		feature->flags |= SGC_F_CURRENT;
 	feature->add_length = 8;
 	scsi_ulto4b(lun->be_lun->blocksize, &feature->feature_data[0]);
 	scsi_ulto2b(1, &feature->feature_data[4]);
 	feature->feature_data[6] = 0x00;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 f1d:	/* Multi-Read */
 	scsi_ulto2b(0x001D, feature->feature_code);
 	feature->flags = 0x00;
 	if ((lun->flags & CTL_LUN_NO_MEDIA) == 0)
 		feature->flags |= SGC_F_CURRENT;
 	feature->add_length = 0;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 f1e:	/* CD Read */
 	scsi_ulto2b(0x001E, feature->feature_code);
 	feature->flags = 0x00;
 	if ((lun->flags & CTL_LUN_NO_MEDIA) == 0)
 		feature->flags |= SGC_F_CURRENT;
 	feature->add_length = 4;
 	feature->feature_data[0] = 0x00;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 f1f:	/* DVD Read */
 	scsi_ulto2b(0x001F, feature->feature_code);
 	feature->flags = 0x08;
 	if ((lun->flags & CTL_LUN_NO_MEDIA) == 0)
 		feature->flags |= SGC_F_CURRENT;
 	feature->add_length = 4;
 	feature->feature_data[0] = 0x01;
 	feature->feature_data[2] = 0x03;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 f2a:	/* DVD+RW */
 	scsi_ulto2b(0x002A, feature->feature_code);
 	feature->flags = 0x04;
 	if ((lun->flags & CTL_LUN_NO_MEDIA) == 0)
 		feature->flags |= SGC_F_CURRENT;
 	feature->add_length = 4;
 	feature->feature_data[0] = 0x00;
 	feature->feature_data[1] = 0x00;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 f2b:	/* DVD+R */
 	scsi_ulto2b(0x002B, feature->feature_code);
 	feature->flags = 0x00;
 	if ((lun->flags & CTL_LUN_NO_MEDIA) == 0)
 		feature->flags |= SGC_F_CURRENT;
 	feature->add_length = 4;
 	feature->feature_data[0] = 0x00;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 f3a:	/* DVD+RW Dual Layer */
 	scsi_ulto2b(0x003A, feature->feature_code);
 	feature->flags = 0x00;
 	if ((lun->flags & CTL_LUN_NO_MEDIA) == 0)
 		feature->flags |= SGC_F_CURRENT;
 	feature->add_length = 4;
 	feature->feature_data[0] = 0x00;
 	feature->feature_data[1] = 0x00;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 f3b:	/* DVD+R Dual Layer */
 	scsi_ulto2b(0x003B, feature->feature_code);
 	feature->flags = 0x00;
 	if ((lun->flags & CTL_LUN_NO_MEDIA) == 0)
 		feature->flags |= SGC_F_CURRENT;
 	feature->add_length = 4;
 	feature->feature_data[0] = 0x00;
 	feature = (struct scsi_get_config_feature *)
 	    &feature->feature_data[feature->add_length];
 
 done:
 	data_len = (uint8_t *)feature - (uint8_t *)hdr;
 	if (rt == SGC_RT_SPECIFIC && data_len > 4) {
 		feature = (struct scsi_get_config_feature *)(hdr + 1);
 		if (scsi_2btoul(feature->feature_code) == starting)
 			feature = (struct scsi_get_config_feature *)
 			    &feature->feature_data[feature->add_length];
 		data_len = (uint8_t *)feature - (uint8_t *)hdr;
 	}
 	scsi_ulto4b(data_len - 4, hdr->data_length);
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_get_event_status(struct ctl_scsiio *ctsio)
 {
 	struct scsi_get_event_status_header *hdr;
 	struct scsi_get_event_status *cdb;
 	uint32_t alloc_len, data_len;
 	int notif_class;
 
 	cdb = (struct scsi_get_event_status *)ctsio->cdb;
 	if ((cdb->byte2 & SGESN_POLLED) == 0) {
 		ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1,
 		    /*field*/ 1, /*bit_valid*/ 1, /*bit*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		return (CTL_RETVAL_COMPLETE);
 	}
 	notif_class = cdb->notif_class;
 	alloc_len = scsi_2btoul(cdb->length);
 
 	data_len = sizeof(struct scsi_get_event_status_header);
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	hdr = (struct scsi_get_event_status_header *)ctsio->kern_data_ptr;
 	scsi_ulto2b(0, hdr->descr_length);
 	hdr->nea_class = SGESN_NEA;
 	hdr->supported_class = 0;
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 int
 ctl_mechanism_status(struct ctl_scsiio *ctsio)
 {
 	struct scsi_mechanism_status_header *hdr;
 	struct scsi_mechanism_status *cdb;
 	uint32_t alloc_len, data_len;
 
 	cdb = (struct scsi_mechanism_status *)ctsio->cdb;
 	alloc_len = scsi_2btoul(cdb->length);
 
 	data_len = sizeof(struct scsi_mechanism_status_header);
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	hdr = (struct scsi_mechanism_status_header *)ctsio->kern_data_ptr;
 	hdr->state1 = 0x00;
 	hdr->state2 = 0xe0;
 	scsi_ulto3b(0, hdr->lba);
 	hdr->slots_num = 0;
 	scsi_ulto2b(0, hdr->slots_length);
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 static void
 ctl_ultomsf(uint32_t lba, uint8_t *buf)
 {
 
 	lba += 150;
 	buf[0] = 0;
 	buf[1] = bin2bcd((lba / 75) / 60);
 	buf[2] = bin2bcd((lba / 75) % 60);
 	buf[3] = bin2bcd(lba % 75);
 }
 
 int
 ctl_read_toc(struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun = CTL_LUN(ctsio);
 	struct scsi_read_toc_hdr *hdr;
 	struct scsi_read_toc_type01_descr *descr;
 	struct scsi_read_toc *cdb;
 	uint32_t alloc_len, data_len;
 	int format, msf;
 
 	cdb = (struct scsi_read_toc *)ctsio->cdb;
 	msf = (cdb->byte2 & CD_MSF) != 0;
 	format = cdb->format;
 	alloc_len = scsi_2btoul(cdb->data_len);
 
 	data_len = sizeof(struct scsi_read_toc_hdr);
 	if (format == 0)
 		data_len += 2 * sizeof(struct scsi_read_toc_type01_descr);
 	else
 		data_len += sizeof(struct scsi_read_toc_type01_descr);
 	ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO);
 	ctsio->kern_sg_entries = 0;
 	ctsio->kern_rel_offset = 0;
 	ctsio->kern_data_len = min(data_len, alloc_len);
 	ctsio->kern_total_len = ctsio->kern_data_len;
 
 	hdr = (struct scsi_read_toc_hdr *)ctsio->kern_data_ptr;
 	if (format == 0) {
 		scsi_ulto2b(0x12, hdr->data_length);
 		hdr->first = 1;
 		hdr->last = 1;
 		descr = (struct scsi_read_toc_type01_descr *)(hdr + 1);
 		descr->addr_ctl = 0x14;
 		descr->track_number = 1;
 		if (msf)
 			ctl_ultomsf(0, descr->track_start);
 		else
 			scsi_ulto4b(0, descr->track_start);
 		descr++;
 		descr->addr_ctl = 0x14;
 		descr->track_number = 0xaa;
 		if (msf)
 			ctl_ultomsf(lun->be_lun->maxlba+1, descr->track_start);
 		else
 			scsi_ulto4b(lun->be_lun->maxlba+1, descr->track_start);
 	} else {
 		scsi_ulto2b(0x0a, hdr->data_length);
 		hdr->first = 1;
 		hdr->last = 1;
 		descr = (struct scsi_read_toc_type01_descr *)(hdr + 1);
 		descr->addr_ctl = 0x14;
 		descr->track_number = 1;
 		if (msf)
 			ctl_ultomsf(0, descr->track_start);
 		else
 			scsi_ulto4b(0, descr->track_start);
 	}
 
 	ctl_set_success(ctsio);
 	ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED;
 	ctsio->be_move_done = ctl_config_move_done;
 	ctl_datamove((union ctl_io *)ctsio);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 /*
  * For known CDB types, parse the LBA and length.
  */
 static int
 ctl_get_lba_len(union ctl_io *io, uint64_t *lba, uint64_t *len)
 {
 	if (io->io_hdr.io_type != CTL_IO_SCSI)
 		return (1);
 
 	switch (io->scsiio.cdb[0]) {
 	case COMPARE_AND_WRITE: {
 		struct scsi_compare_and_write *cdb;
 
 		cdb = (struct scsi_compare_and_write *)io->scsiio.cdb;
 
 		*lba = scsi_8btou64(cdb->addr);
 		*len = cdb->length;
 		break;
 	}
 	case READ_6:
 	case WRITE_6: {
 		struct scsi_rw_6 *cdb;
 
 		cdb = (struct scsi_rw_6 *)io->scsiio.cdb;
 
 		*lba = scsi_3btoul(cdb->addr);
 		/* only 5 bits are valid in the most significant address byte */
 		*lba &= 0x1fffff;
 		*len = cdb->length;
 		break;
 	}
 	case READ_10:
 	case WRITE_10: {
 		struct scsi_rw_10 *cdb;
 
 		cdb = (struct scsi_rw_10 *)io->scsiio.cdb;
 
 		*lba = scsi_4btoul(cdb->addr);
 		*len = scsi_2btoul(cdb->length);
 		break;
 	}
 	case WRITE_VERIFY_10: {
 		struct scsi_write_verify_10 *cdb;
 
 		cdb = (struct scsi_write_verify_10 *)io->scsiio.cdb;
 
 		*lba = scsi_4btoul(cdb->addr);
 		*len = scsi_2btoul(cdb->length);
 		break;
 	}
 	case READ_12:
 	case WRITE_12: {
 		struct scsi_rw_12 *cdb;
 
 		cdb = (struct scsi_rw_12 *)io->scsiio.cdb;
 
 		*lba = scsi_4btoul(cdb->addr);
 		*len = scsi_4btoul(cdb->length);
 		break;
 	}
 	case WRITE_VERIFY_12: {
 		struct scsi_write_verify_12 *cdb;
 
 		cdb = (struct scsi_write_verify_12 *)io->scsiio.cdb;
 
 		*lba = scsi_4btoul(cdb->addr);
 		*len = scsi_4btoul(cdb->length);
 		break;
 	}
 	case READ_16:
 	case WRITE_16: {
 		struct scsi_rw_16 *cdb;
 
 		cdb = (struct scsi_rw_16 *)io->scsiio.cdb;
 
 		*lba = scsi_8btou64(cdb->addr);
 		*len = scsi_4btoul(cdb->length);
 		break;
 	}
 	case WRITE_ATOMIC_16: {
 		struct scsi_write_atomic_16 *cdb;
 
 		cdb = (struct scsi_write_atomic_16 *)io->scsiio.cdb;
 
 		*lba = scsi_8btou64(cdb->addr);
 		*len = scsi_2btoul(cdb->length);
 		break;
 	}
 	case WRITE_VERIFY_16: {
 		struct scsi_write_verify_16 *cdb;
 
 		cdb = (struct scsi_write_verify_16 *)io->scsiio.cdb;
 
 		*lba = scsi_8btou64(cdb->addr);
 		*len = scsi_4btoul(cdb->length);
 		break;
 	}
 	case WRITE_SAME_10: {
 		struct scsi_write_same_10 *cdb;
 
 		cdb = (struct scsi_write_same_10 *)io->scsiio.cdb;
 
 		*lba = scsi_4btoul(cdb->addr);
 		*len = scsi_2btoul(cdb->length);
 		break;
 	}
 	case WRITE_SAME_16: {
 		struct scsi_write_same_16 *cdb;
 
 		cdb = (struct scsi_write_same_16 *)io->scsiio.cdb;
 
 		*lba = scsi_8btou64(cdb->addr);
 		*len = scsi_4btoul(cdb->length);
 		break;
 	}
 	case VERIFY_10: {
 		struct scsi_verify_10 *cdb;
 
 		cdb = (struct scsi_verify_10 *)io->scsiio.cdb;
 
 		*lba = scsi_4btoul(cdb->addr);
 		*len = scsi_2btoul(cdb->length);
 		break;
 	}
 	case VERIFY_12: {
 		struct scsi_verify_12 *cdb;
 
 		cdb = (struct scsi_verify_12 *)io->scsiio.cdb;
 
 		*lba = scsi_4btoul(cdb->addr);
 		*len = scsi_4btoul(cdb->length);
 		break;
 	}
 	case VERIFY_16: {
 		struct scsi_verify_16 *cdb;
 
 		cdb = (struct scsi_verify_16 *)io->scsiio.cdb;
 
 		*lba = scsi_8btou64(cdb->addr);
 		*len = scsi_4btoul(cdb->length);
 		break;
 	}
 	case UNMAP: {
 		*lba = 0;
 		*len = UINT64_MAX;
 		break;
 	}
 	case SERVICE_ACTION_IN: {	/* GET LBA STATUS */
 		struct scsi_get_lba_status *cdb;
 
 		cdb = (struct scsi_get_lba_status *)io->scsiio.cdb;
 		*lba = scsi_8btou64(cdb->addr);
 		*len = UINT32_MAX;
 		break;
 	}
 	default:
 		return (1);
 		break; /* NOTREACHED */
 	}
 
 	return (0);
 }
 
 static ctl_action
 ctl_extent_check_lba(uint64_t lba1, uint64_t len1, uint64_t lba2, uint64_t len2,
     bool seq)
 {
 	uint64_t endlba1, endlba2;
 
 	endlba1 = lba1 + len1 - (seq ? 0 : 1);
 	endlba2 = lba2 + len2 - 1;
 
 	if ((endlba1 < lba2) || (endlba2 < lba1))
 		return (CTL_ACTION_PASS);
 	else
 		return (CTL_ACTION_BLOCK);
 }
 
 static int
 ctl_extent_check_unmap(union ctl_io *io, uint64_t lba2, uint64_t len2)
 {
 	struct ctl_ptr_len_flags *ptrlen;
 	struct scsi_unmap_desc *buf, *end, *range;
 	uint64_t lba;
 	uint32_t len;
 
 	/* If not UNMAP -- go other way. */
 	if (io->io_hdr.io_type != CTL_IO_SCSI ||
 	    io->scsiio.cdb[0] != UNMAP)
 		return (CTL_ACTION_ERROR);
 
 	/* If UNMAP without data -- block and wait for data. */
 	ptrlen = (struct ctl_ptr_len_flags *)
 	    &io->io_hdr.ctl_private[CTL_PRIV_LBA_LEN];
 	if ((io->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0 ||
 	    ptrlen->ptr == NULL)
 		return (CTL_ACTION_BLOCK);
 
 	/* UNMAP with data -- check for collision. */
 	buf = (struct scsi_unmap_desc *)ptrlen->ptr;
 	end = buf + ptrlen->len / sizeof(*buf);
 	for (range = buf; range < end; range++) {
 		lba = scsi_8btou64(range->lba);
 		len = scsi_4btoul(range->length);
 		if ((lba < lba2 + len2) && (lba + len > lba2))
 			return (CTL_ACTION_BLOCK);
 	}
 	return (CTL_ACTION_PASS);
 }
 
 static ctl_action
 ctl_extent_check(union ctl_io *io1, union ctl_io *io2, bool seq)
 {
 	uint64_t lba1, lba2;
 	uint64_t len1, len2;
 	int retval;
 
 	if (ctl_get_lba_len(io2, &lba2, &len2) != 0)
 		return (CTL_ACTION_ERROR);
 
 	retval = ctl_extent_check_unmap(io1, lba2, len2);
 	if (retval != CTL_ACTION_ERROR)
 		return (retval);
 
 	if (ctl_get_lba_len(io1, &lba1, &len1) != 0)
 		return (CTL_ACTION_ERROR);
 
 	if (io1->io_hdr.flags & CTL_FLAG_SERSEQ_DONE)
 		seq = FALSE;
 	return (ctl_extent_check_lba(lba1, len1, lba2, len2, seq));
 }
 
 static ctl_action
 ctl_extent_check_seq(union ctl_io *io1, union ctl_io *io2)
 {
 	uint64_t lba1, lba2;
 	uint64_t len1, len2;
 
 	if (io1->io_hdr.flags & CTL_FLAG_SERSEQ_DONE)
 		return (CTL_ACTION_PASS);
 	if (ctl_get_lba_len(io1, &lba1, &len1) != 0)
 		return (CTL_ACTION_ERROR);
 	if (ctl_get_lba_len(io2, &lba2, &len2) != 0)
 		return (CTL_ACTION_ERROR);
 
 	if (lba1 + len1 == lba2)
 		return (CTL_ACTION_BLOCK);
 	return (CTL_ACTION_PASS);
 }
 
 static ctl_action
 ctl_check_for_blockage(struct ctl_lun *lun, union ctl_io *pending_io,
     union ctl_io *ooa_io)
 {
 	const struct ctl_cmd_entry *pending_entry, *ooa_entry;
 	const ctl_serialize_action *serialize_row;
 
 	/*
 	 * The initiator attempted multiple untagged commands at the same
 	 * time.  Can't do that.
 	 */
 	if ((pending_io->scsiio.tag_type == CTL_TAG_UNTAGGED)
 	 && (ooa_io->scsiio.tag_type == CTL_TAG_UNTAGGED)
 	 && ((pending_io->io_hdr.nexus.targ_port ==
 	      ooa_io->io_hdr.nexus.targ_port)
 	  && (pending_io->io_hdr.nexus.initid ==
 	      ooa_io->io_hdr.nexus.initid))
 	 && ((ooa_io->io_hdr.flags & (CTL_FLAG_ABORT |
 	      CTL_FLAG_STATUS_SENT)) == 0))
 		return (CTL_ACTION_OVERLAP);
 
 	/*
 	 * The initiator attempted to send multiple tagged commands with
 	 * the same ID.  (It's fine if different initiators have the same
 	 * tag ID.)
 	 *
 	 * Even if all of those conditions are true, we don't kill the I/O
 	 * if the command ahead of us has been aborted.  We won't end up
 	 * sending it to the FETD, and it's perfectly legal to resend a
 	 * command with the same tag number as long as the previous
 	 * instance of this tag number has been aborted somehow.
 	 */
 	if ((pending_io->scsiio.tag_type != CTL_TAG_UNTAGGED)
 	 && (ooa_io->scsiio.tag_type != CTL_TAG_UNTAGGED)
 	 && (pending_io->scsiio.tag_num == ooa_io->scsiio.tag_num)
 	 && ((pending_io->io_hdr.nexus.targ_port ==
 	      ooa_io->io_hdr.nexus.targ_port)
 	  && (pending_io->io_hdr.nexus.initid ==
 	      ooa_io->io_hdr.nexus.initid))
 	 && ((ooa_io->io_hdr.flags & (CTL_FLAG_ABORT |
 	      CTL_FLAG_STATUS_SENT)) == 0))
 		return (CTL_ACTION_OVERLAP_TAG);
 
 	/*
 	 * If we get a head of queue tag, SAM-3 says that we should
 	 * immediately execute it.
 	 *
 	 * What happens if this command would normally block for some other
 	 * reason?  e.g. a request sense with a head of queue tag
 	 * immediately after a write.  Normally that would block, but this
 	 * will result in its getting executed immediately...
 	 *
 	 * We currently return "pass" instead of "skip", so we'll end up
 	 * going through the rest of the queue to check for overlapped tags.
 	 *
 	 * XXX KDM check for other types of blockage first??
 	 */
 	if (pending_io->scsiio.tag_type == CTL_TAG_HEAD_OF_QUEUE)
 		return (CTL_ACTION_PASS);
 
 	/*
 	 * Ordered tags have to block until all items ahead of them
 	 * have completed.  If we get called with an ordered tag, we always
 	 * block, if something else is ahead of us in the queue.
 	 */
 	if (pending_io->scsiio.tag_type == CTL_TAG_ORDERED)
 		return (CTL_ACTION_BLOCK);
 
 	/*
 	 * Simple tags get blocked until all head of queue and ordered tags
 	 * ahead of them have completed.  I'm lumping untagged commands in
 	 * with simple tags here.  XXX KDM is that the right thing to do?
 	 */
 	if (((pending_io->scsiio.tag_type == CTL_TAG_UNTAGGED)
 	  || (pending_io->scsiio.tag_type == CTL_TAG_SIMPLE))
 	 && ((ooa_io->scsiio.tag_type == CTL_TAG_HEAD_OF_QUEUE)
 	  || (ooa_io->scsiio.tag_type == CTL_TAG_ORDERED)))
 		return (CTL_ACTION_BLOCK);
 
 	pending_entry = ctl_get_cmd_entry(&pending_io->scsiio, NULL);
 	KASSERT(pending_entry->seridx < CTL_SERIDX_COUNT,
 	    ("%s: Invalid seridx %d for pending CDB %02x %02x @ %p",
 	     __func__, pending_entry->seridx, pending_io->scsiio.cdb[0],
 	     pending_io->scsiio.cdb[1], pending_io));
 	ooa_entry = ctl_get_cmd_entry(&ooa_io->scsiio, NULL);
 	if (ooa_entry->seridx == CTL_SERIDX_INVLD)
 		return (CTL_ACTION_PASS); /* Unsupported command in OOA queue */
 	KASSERT(ooa_entry->seridx < CTL_SERIDX_COUNT,
 	    ("%s: Invalid seridx %d for ooa CDB %02x %02x @ %p",
 	     __func__, ooa_entry->seridx, ooa_io->scsiio.cdb[0],
 	     ooa_io->scsiio.cdb[1], ooa_io));
 
 	serialize_row = ctl_serialize_table[ooa_entry->seridx];
 
 	switch (serialize_row[pending_entry->seridx]) {
 	case CTL_SER_BLOCK:
 		return (CTL_ACTION_BLOCK);
 	case CTL_SER_EXTENT:
 		return (ctl_extent_check(ooa_io, pending_io,
 		    (lun->be_lun && lun->be_lun->serseq == CTL_LUN_SERSEQ_ON)));
 	case CTL_SER_EXTENTOPT:
 		if ((lun->MODE_CTRL.queue_flags & SCP_QUEUE_ALG_MASK) !=
 		    SCP_QUEUE_ALG_UNRESTRICTED)
 			return (ctl_extent_check(ooa_io, pending_io,
 			    (lun->be_lun &&
 			     lun->be_lun->serseq == CTL_LUN_SERSEQ_ON)));
 		return (CTL_ACTION_PASS);
 	case CTL_SER_EXTENTSEQ:
 		if (lun->be_lun && lun->be_lun->serseq != CTL_LUN_SERSEQ_OFF)
 			return (ctl_extent_check_seq(ooa_io, pending_io));
 		return (CTL_ACTION_PASS);
 	case CTL_SER_PASS:
 		return (CTL_ACTION_PASS);
 	case CTL_SER_BLOCKOPT:
 		if ((lun->MODE_CTRL.queue_flags & SCP_QUEUE_ALG_MASK) !=
 		    SCP_QUEUE_ALG_UNRESTRICTED)
 			return (CTL_ACTION_BLOCK);
 		return (CTL_ACTION_PASS);
 	case CTL_SER_SKIP:
 		return (CTL_ACTION_SKIP);
 	default:
 		panic("%s: Invalid serialization value %d for %d => %d",
 		    __func__, serialize_row[pending_entry->seridx],
 		    pending_entry->seridx, ooa_entry->seridx);
 	}
 
 	return (CTL_ACTION_ERROR);
 }
 
 /*
  * Check for blockage or overlaps against the OOA (Order Of Arrival) queue.
  * Assumptions:
  * - pending_io is generally either incoming, or on the blocked queue
  * - starting I/O is the I/O we want to start the check with.
  */
 static ctl_action
 ctl_check_ooa(struct ctl_lun *lun, union ctl_io *pending_io,
 	      union ctl_io *starting_io)
 {
 	union ctl_io *ooa_io;
 	ctl_action action;
 
 	mtx_assert(&lun->lun_lock, MA_OWNED);
 
 	/*
 	 * Run back along the OOA queue, starting with the current
 	 * blocked I/O and going through every I/O before it on the
 	 * queue.  If starting_io is NULL, we'll just end up returning
 	 * CTL_ACTION_PASS.
 	 */
 	for (ooa_io = starting_io; ooa_io != NULL;
 	     ooa_io = (union ctl_io *)TAILQ_PREV(&ooa_io->io_hdr, ctl_ooaq,
 	     ooa_links)){
 
 		/*
 		 * This routine just checks to see whether
 		 * cur_blocked is blocked by ooa_io, which is ahead
 		 * of it in the queue.  It doesn't queue/dequeue
 		 * cur_blocked.
 		 */
 		action = ctl_check_for_blockage(lun, pending_io, ooa_io);
 		switch (action) {
 		case CTL_ACTION_BLOCK:
 		case CTL_ACTION_OVERLAP:
 		case CTL_ACTION_OVERLAP_TAG:
 		case CTL_ACTION_SKIP:
 		case CTL_ACTION_ERROR:
 			return (action);
 			break; /* NOTREACHED */
 		case CTL_ACTION_PASS:
 			break;
 		default:
 			panic("%s: Invalid action %d\n", __func__, action);
 		}
 	}
 
 	return (CTL_ACTION_PASS);
 }
 
 /*
  * Assumptions:
  * - An I/O has just completed, and has been removed from the per-LUN OOA
  *   queue, so some items on the blocked queue may now be unblocked.
  */
 static int
 ctl_check_blocked(struct ctl_lun *lun)
 {
 	struct ctl_softc *softc = lun->ctl_softc;
 	union ctl_io *cur_blocked, *next_blocked;
 
 	mtx_assert(&lun->lun_lock, MA_OWNED);
 
 	/*
 	 * Run forward from the head of the blocked queue, checking each
 	 * entry against the I/Os prior to it on the OOA queue to see if
 	 * there is still any blockage.
 	 *
 	 * We cannot use the TAILQ_FOREACH() macro, because it can't deal
 	 * with our removing a variable on it while it is traversing the
 	 * list.
 	 */
 	for (cur_blocked = (union ctl_io *)TAILQ_FIRST(&lun->blocked_queue);
 	     cur_blocked != NULL; cur_blocked = next_blocked) {
 		union ctl_io *prev_ooa;
 		ctl_action action;
 
 		next_blocked = (union ctl_io *)TAILQ_NEXT(&cur_blocked->io_hdr,
 							  blocked_links);
 
 		prev_ooa = (union ctl_io *)TAILQ_PREV(&cur_blocked->io_hdr,
 						      ctl_ooaq, ooa_links);
 
 		/*
 		 * If cur_blocked happens to be the first item in the OOA
 		 * queue now, prev_ooa will be NULL, and the action
 		 * returned will just be CTL_ACTION_PASS.
 		 */
 		action = ctl_check_ooa(lun, cur_blocked, prev_ooa);
 
 		switch (action) {
 		case CTL_ACTION_BLOCK:
 			/* Nothing to do here, still blocked */
 			break;
 		case CTL_ACTION_OVERLAP:
 		case CTL_ACTION_OVERLAP_TAG:
 			/*
 			 * This shouldn't happen!  In theory we've already
 			 * checked this command for overlap...
 			 */
 			break;
 		case CTL_ACTION_PASS:
 		case CTL_ACTION_SKIP: {
 			const struct ctl_cmd_entry *entry;
 
 			/*
 			 * The skip case shouldn't happen, this transaction
 			 * should have never made it onto the blocked queue.
 			 */
 			/*
 			 * This I/O is no longer blocked, we can remove it
 			 * from the blocked queue.  Since this is a TAILQ
 			 * (doubly linked list), we can do O(1) removals
 			 * from any place on the list.
 			 */
 			TAILQ_REMOVE(&lun->blocked_queue, &cur_blocked->io_hdr,
 				     blocked_links);
 			cur_blocked->io_hdr.flags &= ~CTL_FLAG_BLOCKED;
 
 			if ((softc->ha_mode != CTL_HA_MODE_XFER) &&
 			    (cur_blocked->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC)){
 				/*
 				 * Need to send IO back to original side to
 				 * run
 				 */
 				union ctl_ha_msg msg_info;
 
 				cur_blocked->io_hdr.flags &= ~CTL_FLAG_IO_ACTIVE;
 				msg_info.hdr.original_sc =
 					cur_blocked->io_hdr.original_sc;
 				msg_info.hdr.serializing_sc = cur_blocked;
 				msg_info.hdr.msg_type = CTL_MSG_R2R;
 				ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info,
 				    sizeof(msg_info.hdr), M_NOWAIT);
 				break;
 			}
 			entry = ctl_get_cmd_entry(&cur_blocked->scsiio, NULL);
 
 			/*
 			 * Check this I/O for LUN state changes that may
 			 * have happened while this command was blocked.
 			 * The LUN state may have been changed by a command
 			 * ahead of us in the queue, so we need to re-check
 			 * for any states that can be caused by SCSI
 			 * commands.
 			 */
 			if (ctl_scsiio_lun_check(lun, entry,
 						 &cur_blocked->scsiio) == 0) {
 				cur_blocked->io_hdr.flags |=
 				                      CTL_FLAG_IS_WAS_ON_RTR;
 				ctl_enqueue_rtr(cur_blocked);
 			} else
 				ctl_done(cur_blocked);
 			break;
 		}
 		default:
 			/*
 			 * This probably shouldn't happen -- we shouldn't
 			 * get CTL_ACTION_ERROR, or anything else.
 			 */
 			break;
 		}
 	}
 
 	return (CTL_RETVAL_COMPLETE);
 }
 
 /*
  * This routine (with one exception) checks LUN flags that can be set by
  * commands ahead of us in the OOA queue.  These flags have to be checked
  * when a command initially comes in, and when we pull a command off the
  * blocked queue and are preparing to execute it.  The reason we have to
  * check these flags for commands on the blocked queue is that the LUN
  * state may have been changed by a command ahead of us while we're on the
  * blocked queue.
  *
  * Ordering is somewhat important with these checks, so please pay
  * careful attention to the placement of any new checks.
  */
 static int
 ctl_scsiio_lun_check(struct ctl_lun *lun,
     const struct ctl_cmd_entry *entry, struct ctl_scsiio *ctsio)
 {
 	struct ctl_softc *softc = lun->ctl_softc;
 	int retval;
 	uint32_t residx;
 
 	retval = 0;
 
 	mtx_assert(&lun->lun_lock, MA_OWNED);
 
 	/*
 	 * If this shelf is a secondary shelf controller, we may have to
 	 * reject some commands disallowed by HA mode and link state.
 	 */
 	if ((lun->flags & CTL_LUN_PRIMARY_SC) == 0) {
 		if (softc->ha_link == CTL_HA_LINK_OFFLINE &&
 		    (entry->flags & CTL_CMD_FLAG_OK_ON_UNAVAIL) == 0) {
 			ctl_set_lun_unavail(ctsio);
 			retval = 1;
 			goto bailout;
 		}
 		if ((lun->flags & CTL_LUN_PEER_SC_PRIMARY) == 0 &&
 		    (entry->flags & CTL_CMD_FLAG_OK_ON_UNAVAIL) == 0) {
 			ctl_set_lun_transit(ctsio);
 			retval = 1;
 			goto bailout;
 		}
 		if (softc->ha_mode == CTL_HA_MODE_ACT_STBY &&
 		    (entry->flags & CTL_CMD_FLAG_OK_ON_STANDBY) == 0) {
 			ctl_set_lun_standby(ctsio);
 			retval = 1;
 			goto bailout;
 		}
 
 		/* The rest of checks are only done on executing side */
 		if (softc->ha_mode == CTL_HA_MODE_XFER)
 			goto bailout;
 	}
 
 	if (entry->pattern & CTL_LUN_PAT_WRITE) {
 		if (lun->be_lun &&
 		    lun->be_lun->flags & CTL_LUN_FLAG_READONLY) {
 			ctl_set_hw_write_protected(ctsio);
 			retval = 1;
 			goto bailout;
 		}
 		if ((lun->MODE_CTRL.eca_and_aen & SCP_SWP) != 0) {
 			ctl_set_sense(ctsio, /*current_error*/ 1,
 			    /*sense_key*/ SSD_KEY_DATA_PROTECT,
 			    /*asc*/ 0x27, /*ascq*/ 0x02, SSD_ELEM_NONE);
 			retval = 1;
 			goto bailout;
 		}
 	}
 
 	/*
 	 * Check for a reservation conflict.  If this command isn't allowed
 	 * even on reserved LUNs, and if this initiator isn't the one who
 	 * reserved us, reject the command with a reservation conflict.
 	 */
 	residx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 	if ((lun->flags & CTL_LUN_RESERVED)
 	 && ((entry->flags & CTL_CMD_FLAG_ALLOW_ON_RESV) == 0)) {
 		if (lun->res_idx != residx) {
 			ctl_set_reservation_conflict(ctsio);
 			retval = 1;
 			goto bailout;
 		}
 	}
 
 	if ((lun->flags & CTL_LUN_PR_RESERVED) == 0 ||
 	    (entry->flags & CTL_CMD_FLAG_ALLOW_ON_PR_RESV)) {
 		/* No reservation or command is allowed. */;
 	} else if ((entry->flags & CTL_CMD_FLAG_ALLOW_ON_PR_WRESV) &&
 	    (lun->pr_res_type == SPR_TYPE_WR_EX ||
 	     lun->pr_res_type == SPR_TYPE_WR_EX_RO ||
 	     lun->pr_res_type == SPR_TYPE_WR_EX_AR)) {
 		/* The command is allowed for Write Exclusive resv. */;
 	} else {
 		/*
 		 * if we aren't registered or it's a res holder type
 		 * reservation and this isn't the res holder then set a
 		 * conflict.
 		 */
 		if (ctl_get_prkey(lun, residx) == 0 ||
 		    (residx != lun->pr_res_idx && lun->pr_res_type < 4)) {
 			ctl_set_reservation_conflict(ctsio);
 			retval = 1;
 			goto bailout;
 		}
 	}
 
 	if ((entry->flags & CTL_CMD_FLAG_OK_ON_NO_MEDIA) == 0) {
 		if (lun->flags & CTL_LUN_EJECTED)
 			ctl_set_lun_ejected(ctsio);
 		else if (lun->flags & CTL_LUN_NO_MEDIA) {
 			if (lun->flags & CTL_LUN_REMOVABLE)
 				ctl_set_lun_no_media(ctsio);
 			else
 				ctl_set_lun_int_reqd(ctsio);
 		} else if (lun->flags & CTL_LUN_STOPPED)
 			ctl_set_lun_stopped(ctsio);
 		else
 			goto bailout;
 		retval = 1;
 		goto bailout;
 	}
 
 bailout:
 	return (retval);
 }
 
 static void
 ctl_failover_io(union ctl_io *io, int have_lock)
 {
 	ctl_set_busy(&io->scsiio);
 	ctl_done(io);
 }
 
 static void
 ctl_failover_lun(union ctl_io *rio)
 {
 	struct ctl_softc *softc = CTL_SOFTC(rio);
 	struct ctl_lun *lun;
 	struct ctl_io_hdr *io, *next_io;
 	uint32_t targ_lun;
 
 	targ_lun = rio->io_hdr.nexus.targ_mapped_lun;
 	CTL_DEBUG_PRINT(("FAILOVER for lun %ju\n", targ_lun));
 
 	/* Find and lock the LUN. */
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun > CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		return;
 	}
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 	if (lun->flags & CTL_LUN_DISABLED) {
 		mtx_unlock(&lun->lun_lock);
 		return;
 	}
 
 	if (softc->ha_mode == CTL_HA_MODE_XFER) {
 		TAILQ_FOREACH_SAFE(io, &lun->ooa_queue, ooa_links, next_io) {
 			/* We are master */
 			if (io->flags & CTL_FLAG_FROM_OTHER_SC) {
 				if (io->flags & CTL_FLAG_IO_ACTIVE) {
 					io->flags |= CTL_FLAG_ABORT;
 					io->flags |= CTL_FLAG_FAILOVER;
 				} else { /* This can be only due to DATAMOVE */
 					io->msg_type = CTL_MSG_DATAMOVE_DONE;
 					io->flags &= ~CTL_FLAG_DMA_INPROG;
 					io->flags |= CTL_FLAG_IO_ACTIVE;
 					io->port_status = 31340;
 					ctl_enqueue_isc((union ctl_io *)io);
 				}
 			}
 			/* We are slave */
 			if (io->flags & CTL_FLAG_SENT_2OTHER_SC) {
 				io->flags &= ~CTL_FLAG_SENT_2OTHER_SC;
 				if (io->flags & CTL_FLAG_IO_ACTIVE) {
 					io->flags |= CTL_FLAG_FAILOVER;
 				} else {
 					ctl_set_busy(&((union ctl_io *)io)->
 					    scsiio);
 					ctl_done((union ctl_io *)io);
 				}
 			}
 		}
 	} else { /* SERIALIZE modes */
 		TAILQ_FOREACH_SAFE(io, &lun->blocked_queue, blocked_links,
 		    next_io) {
 			/* We are master */
 			if (io->flags & CTL_FLAG_FROM_OTHER_SC) {
 				TAILQ_REMOVE(&lun->blocked_queue, io,
 				    blocked_links);
 				io->flags &= ~CTL_FLAG_BLOCKED;
 				TAILQ_REMOVE(&lun->ooa_queue, io, ooa_links);
 				ctl_free_io((union ctl_io *)io);
 			}
 		}
 		TAILQ_FOREACH_SAFE(io, &lun->ooa_queue, ooa_links, next_io) {
 			/* We are master */
 			if (io->flags & CTL_FLAG_FROM_OTHER_SC) {
 				TAILQ_REMOVE(&lun->ooa_queue, io, ooa_links);
 				ctl_free_io((union ctl_io *)io);
 			}
 			/* We are slave */
 			if (io->flags & CTL_FLAG_SENT_2OTHER_SC) {
 				io->flags &= ~CTL_FLAG_SENT_2OTHER_SC;
 				if (!(io->flags & CTL_FLAG_IO_ACTIVE)) {
 					ctl_set_busy(&((union ctl_io *)io)->
 					    scsiio);
 					ctl_done((union ctl_io *)io);
 				}
 			}
 		}
 		ctl_check_blocked(lun);
 	}
 	mtx_unlock(&lun->lun_lock);
 }
 
 static int
 ctl_scsiio_precheck(struct ctl_softc *softc, struct ctl_scsiio *ctsio)
 {
 	struct ctl_lun *lun;
 	const struct ctl_cmd_entry *entry;
 	uint32_t initidx, targ_lun;
 	int retval = 0;
 
 	lun = NULL;
 	targ_lun = ctsio->io_hdr.nexus.targ_mapped_lun;
 	if (targ_lun < CTL_MAX_LUNS)
 		lun = softc->ctl_luns[targ_lun];
 	if (lun) {
 		/*
 		 * If the LUN is invalid, pretend that it doesn't exist.
 		 * It will go away as soon as all pending I/O has been
 		 * completed.
 		 */
 		mtx_lock(&lun->lun_lock);
 		if (lun->flags & CTL_LUN_DISABLED) {
 			mtx_unlock(&lun->lun_lock);
 			lun = NULL;
 		}
 	}
 	CTL_LUN(ctsio) = lun;
 	if (lun) {
 		CTL_BACKEND_LUN(ctsio) = lun->be_lun;
 
 		/*
 		 * Every I/O goes into the OOA queue for a particular LUN,
 		 * and stays there until completion.
 		 */
 #ifdef CTL_TIME_IO
 		if (TAILQ_EMPTY(&lun->ooa_queue))
 			lun->idle_time += getsbinuptime() - lun->last_busy;
 #endif
 		TAILQ_INSERT_TAIL(&lun->ooa_queue, &ctsio->io_hdr, ooa_links);
 	}
 
 	/* Get command entry and return error if it is unsuppotyed. */
 	entry = ctl_validate_command(ctsio);
 	if (entry == NULL) {
 		if (lun)
 			mtx_unlock(&lun->lun_lock);
 		return (retval);
 	}
 
 	ctsio->io_hdr.flags &= ~CTL_FLAG_DATA_MASK;
 	ctsio->io_hdr.flags |= entry->flags & CTL_FLAG_DATA_MASK;
 
 	/*
 	 * Check to see whether we can send this command to LUNs that don't
 	 * exist.  This should pretty much only be the case for inquiry
 	 * and request sense.  Further checks, below, really require having
 	 * a LUN, so we can't really check the command anymore.  Just put
 	 * it on the rtr queue.
 	 */
 	if (lun == NULL) {
 		if (entry->flags & CTL_CMD_FLAG_OK_ON_NO_LUN) {
 			ctsio->io_hdr.flags |= CTL_FLAG_IS_WAS_ON_RTR;
 			ctl_enqueue_rtr((union ctl_io *)ctsio);
 			return (retval);
 		}
 
 		ctl_set_unsupported_lun(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		CTL_DEBUG_PRINT(("ctl_scsiio_precheck: bailing out due to invalid LUN\n"));
 		return (retval);
 	} else {
 		/*
 		 * Make sure we support this particular command on this LUN.
 		 * e.g., we don't support writes to the control LUN.
 		 */
 		if (!ctl_cmd_applicable(lun->be_lun->lun_type, entry)) {
 			mtx_unlock(&lun->lun_lock);
 			ctl_set_invalid_opcode(ctsio);
 			ctl_done((union ctl_io *)ctsio);
 			return (retval);
 		}
 	}
 
 	initidx = ctl_get_initindex(&ctsio->io_hdr.nexus);
 
 	/*
 	 * If we've got a request sense, it'll clear the contingent
 	 * allegiance condition.  Otherwise, if we have a CA condition for
 	 * this initiator, clear it, because it sent down a command other
 	 * than request sense.
 	 */
 	if (ctsio->cdb[0] != REQUEST_SENSE) {
 		struct scsi_sense_data *ps;
 
 		ps = lun->pending_sense[initidx / CTL_MAX_INIT_PER_PORT];
 		if (ps != NULL)
 			ps[initidx % CTL_MAX_INIT_PER_PORT].error_code = 0;
 	}
 
 	/*
 	 * If the command has this flag set, it handles its own unit
 	 * attention reporting, we shouldn't do anything.  Otherwise we
 	 * check for any pending unit attentions, and send them back to the
 	 * initiator.  We only do this when a command initially comes in,
 	 * not when we pull it off the blocked queue.
 	 *
 	 * According to SAM-3, section 5.3.2, the order that things get
 	 * presented back to the host is basically unit attentions caused
 	 * by some sort of reset event, busy status, reservation conflicts
 	 * or task set full, and finally any other status.
 	 *
 	 * One issue here is that some of the unit attentions we report
 	 * don't fall into the "reset" category (e.g. "reported luns data
 	 * has changed").  So reporting it here, before the reservation
 	 * check, may be technically wrong.  I guess the only thing to do
 	 * would be to check for and report the reset events here, and then
 	 * check for the other unit attention types after we check for a
 	 * reservation conflict.
 	 *
 	 * XXX KDM need to fix this
 	 */
 	if ((entry->flags & CTL_CMD_FLAG_NO_SENSE) == 0) {
 		ctl_ua_type ua_type;
 		u_int sense_len = 0;
 
 		ua_type = ctl_build_ua(lun, initidx, &ctsio->sense_data,
 		    &sense_len, SSD_TYPE_NONE);
 		if (ua_type != CTL_UA_NONE) {
 			mtx_unlock(&lun->lun_lock);
 			ctsio->scsi_status = SCSI_STATUS_CHECK_COND;
 			ctsio->io_hdr.status = CTL_SCSI_ERROR | CTL_AUTOSENSE;
 			ctsio->sense_len = sense_len;
 			ctl_done((union ctl_io *)ctsio);
 			return (retval);
 		}
 	}
 
 
 	if (ctl_scsiio_lun_check(lun, entry, ctsio) != 0) {
 		mtx_unlock(&lun->lun_lock);
 		ctl_done((union ctl_io *)ctsio);
 		return (retval);
 	}
 
 	/*
 	 * XXX CHD this is where we want to send IO to other side if
 	 * this LUN is secondary on this SC. We will need to make a copy
 	 * of the IO and flag the IO on this side as SENT_2OTHER and the flag
 	 * the copy we send as FROM_OTHER.
 	 * We also need to stuff the address of the original IO so we can
 	 * find it easily. Something similar will need be done on the other
 	 * side so when we are done we can find the copy.
 	 */
 	if ((lun->flags & CTL_LUN_PRIMARY_SC) == 0 &&
 	    (lun->flags & CTL_LUN_PEER_SC_PRIMARY) != 0 &&
 	    (entry->flags & CTL_CMD_FLAG_RUN_HERE) == 0) {
 		union ctl_ha_msg msg_info;
 		int isc_retval;
 
 		ctsio->io_hdr.flags |= CTL_FLAG_SENT_2OTHER_SC;
 		ctsio->io_hdr.flags &= ~CTL_FLAG_IO_ACTIVE;
 		mtx_unlock(&lun->lun_lock);
 
 		msg_info.hdr.msg_type = CTL_MSG_SERIALIZE;
 		msg_info.hdr.original_sc = (union ctl_io *)ctsio;
 		msg_info.hdr.serializing_sc = NULL;
 		msg_info.hdr.nexus = ctsio->io_hdr.nexus;
 		msg_info.scsi.tag_num = ctsio->tag_num;
 		msg_info.scsi.tag_type = ctsio->tag_type;
 		msg_info.scsi.cdb_len = ctsio->cdb_len;
 		memcpy(msg_info.scsi.cdb, ctsio->cdb, CTL_MAX_CDBLEN);
 
 		if ((isc_retval = ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info,
 		    sizeof(msg_info.scsi) - sizeof(msg_info.scsi.sense_data),
 		    M_WAITOK)) > CTL_HA_STATUS_SUCCESS) {
 			ctl_set_busy(ctsio);
 			ctl_done((union ctl_io *)ctsio);
 			return (retval);
 		}
 		return (retval);
 	}
 
 	switch (ctl_check_ooa(lun, (union ctl_io *)ctsio,
 			      (union ctl_io *)TAILQ_PREV(&ctsio->io_hdr,
 			      ctl_ooaq, ooa_links))) {
 	case CTL_ACTION_BLOCK:
 		ctsio->io_hdr.flags |= CTL_FLAG_BLOCKED;
 		TAILQ_INSERT_TAIL(&lun->blocked_queue, &ctsio->io_hdr,
 				  blocked_links);
 		mtx_unlock(&lun->lun_lock);
 		return (retval);
 	case CTL_ACTION_PASS:
 	case CTL_ACTION_SKIP:
 		ctsio->io_hdr.flags |= CTL_FLAG_IS_WAS_ON_RTR;
 		mtx_unlock(&lun->lun_lock);
 		ctl_enqueue_rtr((union ctl_io *)ctsio);
 		break;
 	case CTL_ACTION_OVERLAP:
 		mtx_unlock(&lun->lun_lock);
 		ctl_set_overlapped_cmd(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		break;
 	case CTL_ACTION_OVERLAP_TAG:
 		mtx_unlock(&lun->lun_lock);
 		ctl_set_overlapped_tag(ctsio, ctsio->tag_num & 0xff);
 		ctl_done((union ctl_io *)ctsio);
 		break;
 	case CTL_ACTION_ERROR:
 	default:
 		mtx_unlock(&lun->lun_lock);
 		ctl_set_internal_failure(ctsio,
 					 /*sks_valid*/ 0,
 					 /*retry_count*/ 0);
 		ctl_done((union ctl_io *)ctsio);
 		break;
 	}
 	return (retval);
 }
 
 const struct ctl_cmd_entry *
 ctl_get_cmd_entry(struct ctl_scsiio *ctsio, int *sa)
 {
 	const struct ctl_cmd_entry *entry;
 	int service_action;
 
 	entry = &ctl_cmd_table[ctsio->cdb[0]];
 	if (sa)
 		*sa = ((entry->flags & CTL_CMD_FLAG_SA5) != 0);
 	if (entry->flags & CTL_CMD_FLAG_SA5) {
 		service_action = ctsio->cdb[1] & SERVICE_ACTION_MASK;
 		entry = &((const struct ctl_cmd_entry *)
 		    entry->execute)[service_action];
 	}
 	return (entry);
 }
 
 const struct ctl_cmd_entry *
 ctl_validate_command(struct ctl_scsiio *ctsio)
 {
 	const struct ctl_cmd_entry *entry;
 	int i, sa;
 	uint8_t diff;
 
 	entry = ctl_get_cmd_entry(ctsio, &sa);
 	if (entry->execute == NULL) {
 		if (sa)
 			ctl_set_invalid_field(ctsio,
 					      /*sks_valid*/ 1,
 					      /*command*/ 1,
 					      /*field*/ 1,
 					      /*bit_valid*/ 1,
 					      /*bit*/ 4);
 		else
 			ctl_set_invalid_opcode(ctsio);
 		ctl_done((union ctl_io *)ctsio);
 		return (NULL);
 	}
 	KASSERT(entry->length > 0,
 	    ("Not defined length for command 0x%02x/0x%02x",
 	     ctsio->cdb[0], ctsio->cdb[1]));
 	for (i = 1; i < entry->length; i++) {
 		diff = ctsio->cdb[i] & ~entry->usage[i - 1];
 		if (diff == 0)
 			continue;
 		ctl_set_invalid_field(ctsio,
 				      /*sks_valid*/ 1,
 				      /*command*/ 1,
 				      /*field*/ i,
 				      /*bit_valid*/ 1,
 				      /*bit*/ fls(diff) - 1);
 		ctl_done((union ctl_io *)ctsio);
 		return (NULL);
 	}
 	return (entry);
 }
 
 static int
 ctl_cmd_applicable(uint8_t lun_type, const struct ctl_cmd_entry *entry)
 {
 
 	switch (lun_type) {
 	case T_DIRECT:
 		if ((entry->flags & CTL_CMD_FLAG_OK_ON_DIRECT) == 0)
 			return (0);
 		break;
 	case T_PROCESSOR:
 		if ((entry->flags & CTL_CMD_FLAG_OK_ON_PROC) == 0)
 			return (0);
 		break;
 	case T_CDROM:
 		if ((entry->flags & CTL_CMD_FLAG_OK_ON_CDROM) == 0)
 			return (0);
 		break;
 	default:
 		return (0);
 	}
 	return (1);
 }
 
 static int
 ctl_scsiio(struct ctl_scsiio *ctsio)
 {
 	int retval;
 	const struct ctl_cmd_entry *entry;
 
 	retval = CTL_RETVAL_COMPLETE;
 
 	CTL_DEBUG_PRINT(("ctl_scsiio cdb[0]=%02X\n", ctsio->cdb[0]));
 
 	entry = ctl_get_cmd_entry(ctsio, NULL);
 
 	/*
 	 * If this I/O has been aborted, just send it straight to
 	 * ctl_done() without executing it.
 	 */
 	if (ctsio->io_hdr.flags & CTL_FLAG_ABORT) {
 		ctl_done((union ctl_io *)ctsio);
 		goto bailout;
 	}
 
 	/*
 	 * All the checks should have been handled by ctl_scsiio_precheck().
 	 * We should be clear now to just execute the I/O.
 	 */
 	retval = entry->execute(ctsio);
 
 bailout:
 	return (retval);
 }
 
 static int
 ctl_target_reset(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_port *port = CTL_PORT(io);
 	struct ctl_lun *lun;
 	uint32_t initidx;
 	ctl_ua_type ua_type;
 
 	if (!(io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC)) {
 		union ctl_ha_msg msg_info;
 
 		msg_info.hdr.nexus = io->io_hdr.nexus;
 		msg_info.task.task_action = io->taskio.task_action;
 		msg_info.hdr.msg_type = CTL_MSG_MANAGE_TASKS;
 		msg_info.hdr.original_sc = NULL;
 		msg_info.hdr.serializing_sc = NULL;
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info,
 		    sizeof(msg_info.task), M_WAITOK);
 	}
 
 	initidx = ctl_get_initindex(&io->io_hdr.nexus);
 	if (io->taskio.task_action == CTL_TASK_TARGET_RESET)
 		ua_type = CTL_UA_TARG_RESET;
 	else
 		ua_type = CTL_UA_BUS_RESET;
 	mtx_lock(&softc->ctl_lock);
 	STAILQ_FOREACH(lun, &softc->lun_list, links) {
 		if (port != NULL &&
 		    ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX)
 			continue;
 		ctl_do_lun_reset(lun, initidx, ua_type);
 	}
 	mtx_unlock(&softc->ctl_lock);
 	io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE;
 	return (0);
 }
 
 /*
  * The LUN should always be set.  The I/O is optional, and is used to
  * distinguish between I/Os sent by this initiator, and by other
  * initiators.  We set unit attention for initiators other than this one.
  * SAM-3 is vague on this point.  It does say that a unit attention should
  * be established for other initiators when a LUN is reset (see section
  * 5.7.3), but it doesn't specifically say that the unit attention should
  * be established for this particular initiator when a LUN is reset.  Here
  * is the relevant text, from SAM-3 rev 8:
  *
  * 5.7.2 When a SCSI initiator port aborts its own tasks
  *
  * When a SCSI initiator port causes its own task(s) to be aborted, no
  * notification that the task(s) have been aborted shall be returned to
  * the SCSI initiator port other than the completion response for the
  * command or task management function action that caused the task(s) to
  * be aborted and notification(s) associated with related effects of the
  * action (e.g., a reset unit attention condition).
  *
  * XXX KDM for now, we're setting unit attention for all initiators.
  */
 static void
 ctl_do_lun_reset(struct ctl_lun *lun, uint32_t initidx, ctl_ua_type ua_type)
 {
 	union ctl_io *xio;
 	int i;
 
 	mtx_lock(&lun->lun_lock);
 	/* Abort tasks. */
 	for (xio = (union ctl_io *)TAILQ_FIRST(&lun->ooa_queue); xio != NULL;
 	     xio = (union ctl_io *)TAILQ_NEXT(&xio->io_hdr, ooa_links)) {
 		xio->io_hdr.flags |= CTL_FLAG_ABORT | CTL_FLAG_ABORT_STATUS;
 	}
 	/* Clear CA. */
 	for (i = 0; i < CTL_MAX_PORTS; i++) {
 		free(lun->pending_sense[i], M_CTL);
 		lun->pending_sense[i] = NULL;
 	}
 	/* Clear reservation. */
 	lun->flags &= ~CTL_LUN_RESERVED;
 	/* Clear prevent media removal. */
 	if (lun->prevent) {
 		for (i = 0; i < CTL_MAX_INITIATORS; i++)
 			ctl_clear_mask(lun->prevent, i);
 		lun->prevent_count = 0;
 	}
 	/* Clear TPC status */
 	ctl_tpc_lun_clear(lun, -1);
 	/* Establish UA. */
 #if 0
 	ctl_est_ua_all(lun, initidx, ua_type);
 #else
 	ctl_est_ua_all(lun, -1, ua_type);
 #endif
 	mtx_unlock(&lun->lun_lock);
 }
 
 static int
 ctl_lun_reset(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_lun *lun;
 	uint32_t targ_lun, initidx;
 
 	targ_lun = io->io_hdr.nexus.targ_mapped_lun;
 	initidx = ctl_get_initindex(&io->io_hdr.nexus);
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		io->taskio.task_status = CTL_TASK_LUN_DOES_NOT_EXIST;
 		return (1);
 	}
 	ctl_do_lun_reset(lun, initidx, CTL_UA_LUN_RESET);
 	mtx_unlock(&softc->ctl_lock);
 	io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE;
 
 	if ((io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC) == 0) {
 		union ctl_ha_msg msg_info;
 
 		msg_info.hdr.msg_type = CTL_MSG_MANAGE_TASKS;
 		msg_info.hdr.nexus = io->io_hdr.nexus;
 		msg_info.task.task_action = CTL_TASK_LUN_RESET;
 		msg_info.hdr.original_sc = NULL;
 		msg_info.hdr.serializing_sc = NULL;
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info,
 		    sizeof(msg_info.task), M_WAITOK);
 	}
 	return (0);
 }
 
 static void
 ctl_abort_tasks_lun(struct ctl_lun *lun, uint32_t targ_port, uint32_t init_id,
     int other_sc)
 {
 	union ctl_io *xio;
 
 	mtx_assert(&lun->lun_lock, MA_OWNED);
 
 	/*
 	 * Run through the OOA queue and attempt to find the given I/O.
 	 * The target port, initiator ID, tag type and tag number have to
 	 * match the values that we got from the initiator.  If we have an
 	 * untagged command to abort, simply abort the first untagged command
 	 * we come to.  We only allow one untagged command at a time of course.
 	 */
 	for (xio = (union ctl_io *)TAILQ_FIRST(&lun->ooa_queue); xio != NULL;
 	     xio = (union ctl_io *)TAILQ_NEXT(&xio->io_hdr, ooa_links)) {
 
 		if ((targ_port == UINT32_MAX ||
 		     targ_port == xio->io_hdr.nexus.targ_port) &&
 		    (init_id == UINT32_MAX ||
 		     init_id == xio->io_hdr.nexus.initid)) {
 			if (targ_port != xio->io_hdr.nexus.targ_port ||
 			    init_id != xio->io_hdr.nexus.initid)
 				xio->io_hdr.flags |= CTL_FLAG_ABORT_STATUS;
 			xio->io_hdr.flags |= CTL_FLAG_ABORT;
 			if (!other_sc && !(lun->flags & CTL_LUN_PRIMARY_SC)) {
 				union ctl_ha_msg msg_info;
 
 				msg_info.hdr.nexus = xio->io_hdr.nexus;
 				msg_info.task.task_action = CTL_TASK_ABORT_TASK;
 				msg_info.task.tag_num = xio->scsiio.tag_num;
 				msg_info.task.tag_type = xio->scsiio.tag_type;
 				msg_info.hdr.msg_type = CTL_MSG_MANAGE_TASKS;
 				msg_info.hdr.original_sc = NULL;
 				msg_info.hdr.serializing_sc = NULL;
 				ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info,
 				    sizeof(msg_info.task), M_NOWAIT);
 			}
 		}
 	}
 }
 
 static int
 ctl_abort_task_set(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_lun *lun;
 	uint32_t targ_lun;
 
 	/*
 	 * Look up the LUN.
 	 */
 	targ_lun = io->io_hdr.nexus.targ_mapped_lun;
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		io->taskio.task_status = CTL_TASK_LUN_DOES_NOT_EXIST;
 		return (1);
 	}
 
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 	if (io->taskio.task_action == CTL_TASK_ABORT_TASK_SET) {
 		ctl_abort_tasks_lun(lun, io->io_hdr.nexus.targ_port,
 		    io->io_hdr.nexus.initid,
 		    (io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC) != 0);
 	} else { /* CTL_TASK_CLEAR_TASK_SET */
 		ctl_abort_tasks_lun(lun, UINT32_MAX, UINT32_MAX,
 		    (io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC) != 0);
 	}
 	mtx_unlock(&lun->lun_lock);
 	io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE;
 	return (0);
 }
 
 static void
 ctl_i_t_nexus_loss(struct ctl_softc *softc, uint32_t initidx,
     ctl_ua_type ua_type)
 {
 	struct ctl_lun *lun;
 	struct scsi_sense_data *ps;
 	uint32_t p, i;
 
 	p = initidx / CTL_MAX_INIT_PER_PORT;
 	i = initidx % CTL_MAX_INIT_PER_PORT;
 	mtx_lock(&softc->ctl_lock);
 	STAILQ_FOREACH(lun, &softc->lun_list, links) {
 		mtx_lock(&lun->lun_lock);
 		/* Abort tasks. */
 		ctl_abort_tasks_lun(lun, p, i, 1);
 		/* Clear CA. */
 		ps = lun->pending_sense[p];
 		if (ps != NULL)
 			ps[i].error_code = 0;
 		/* Clear reservation. */
 		if ((lun->flags & CTL_LUN_RESERVED) && (lun->res_idx == initidx))
 			lun->flags &= ~CTL_LUN_RESERVED;
 		/* Clear prevent media removal. */
 		if (lun->prevent && ctl_is_set(lun->prevent, initidx)) {
 			ctl_clear_mask(lun->prevent, initidx);
 			lun->prevent_count--;
 		}
 		/* Clear TPC status */
 		ctl_tpc_lun_clear(lun, initidx);
 		/* Establish UA. */
 		ctl_est_ua(lun, initidx, ua_type);
 		mtx_unlock(&lun->lun_lock);
 	}
 	mtx_unlock(&softc->ctl_lock);
 }
 
 static int
 ctl_i_t_nexus_reset(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	uint32_t initidx;
 
 	if (!(io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC)) {
 		union ctl_ha_msg msg_info;
 
 		msg_info.hdr.nexus = io->io_hdr.nexus;
 		msg_info.task.task_action = CTL_TASK_I_T_NEXUS_RESET;
 		msg_info.hdr.msg_type = CTL_MSG_MANAGE_TASKS;
 		msg_info.hdr.original_sc = NULL;
 		msg_info.hdr.serializing_sc = NULL;
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info,
 		    sizeof(msg_info.task), M_WAITOK);
 	}
 
 	initidx = ctl_get_initindex(&io->io_hdr.nexus);
 	ctl_i_t_nexus_loss(softc, initidx, CTL_UA_I_T_NEXUS_LOSS);
 	io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE;
 	return (0);
 }
 
 static int
 ctl_abort_task(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	union ctl_io *xio;
 	struct ctl_lun *lun;
 #if 0
 	struct sbuf sb;
 	char printbuf[128];
 #endif
 	int found;
 	uint32_t targ_lun;
 
 	found = 0;
 
 	/*
 	 * Look up the LUN.
 	 */
 	targ_lun = io->io_hdr.nexus.targ_mapped_lun;
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		io->taskio.task_status = CTL_TASK_LUN_DOES_NOT_EXIST;
 		return (1);
 	}
 
 #if 0
 	printf("ctl_abort_task: called for lun %lld, tag %d type %d\n",
 	       lun->lun, io->taskio.tag_num, io->taskio.tag_type);
 #endif
 
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 	/*
 	 * Run through the OOA queue and attempt to find the given I/O.
 	 * The target port, initiator ID, tag type and tag number have to
 	 * match the values that we got from the initiator.  If we have an
 	 * untagged command to abort, simply abort the first untagged command
 	 * we come to.  We only allow one untagged command at a time of course.
 	 */
 	for (xio = (union ctl_io *)TAILQ_FIRST(&lun->ooa_queue); xio != NULL;
 	     xio = (union ctl_io *)TAILQ_NEXT(&xio->io_hdr, ooa_links)) {
 #if 0
 		sbuf_new(&sb, printbuf, sizeof(printbuf), SBUF_FIXEDLEN);
 
 		sbuf_printf(&sb, "LUN %lld tag %d type %d%s%s%s%s: ",
 			    lun->lun, xio->scsiio.tag_num,
 			    xio->scsiio.tag_type,
 			    (xio->io_hdr.blocked_links.tqe_prev
 			    == NULL) ? "" : " BLOCKED",
 			    (xio->io_hdr.flags &
 			    CTL_FLAG_DMA_INPROG) ? " DMA" : "",
 			    (xio->io_hdr.flags &
 			    CTL_FLAG_ABORT) ? " ABORT" : "",
 			    (xio->io_hdr.flags &
 			    CTL_FLAG_IS_WAS_ON_RTR ? " RTR" : ""));
 		ctl_scsi_command_string(&xio->scsiio, NULL, &sb);
 		sbuf_finish(&sb);
 		printf("%s\n", sbuf_data(&sb));
 #endif
 
 		if ((xio->io_hdr.nexus.targ_port != io->io_hdr.nexus.targ_port)
 		 || (xio->io_hdr.nexus.initid != io->io_hdr.nexus.initid)
 		 || (xio->io_hdr.flags & CTL_FLAG_ABORT))
 			continue;
 
 		/*
 		 * If the abort says that the task is untagged, the
 		 * task in the queue must be untagged.  Otherwise,
 		 * we just check to see whether the tag numbers
 		 * match.  This is because the QLogic firmware
 		 * doesn't pass back the tag type in an abort
 		 * request.
 		 */
 #if 0
 		if (((xio->scsiio.tag_type == CTL_TAG_UNTAGGED)
 		  && (io->taskio.tag_type == CTL_TAG_UNTAGGED))
 		 || (xio->scsiio.tag_num == io->taskio.tag_num))
 #endif
 		/*
 		 * XXX KDM we've got problems with FC, because it
 		 * doesn't send down a tag type with aborts.  So we
 		 * can only really go by the tag number...
 		 * This may cause problems with parallel SCSI.
 		 * Need to figure that out!!
 		 */
 		if (xio->scsiio.tag_num == io->taskio.tag_num) {
 			xio->io_hdr.flags |= CTL_FLAG_ABORT;
 			found = 1;
 			if ((io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC) == 0 &&
 			    !(lun->flags & CTL_LUN_PRIMARY_SC)) {
 				union ctl_ha_msg msg_info;
 
 				msg_info.hdr.nexus = io->io_hdr.nexus;
 				msg_info.task.task_action = CTL_TASK_ABORT_TASK;
 				msg_info.task.tag_num = io->taskio.tag_num;
 				msg_info.task.tag_type = io->taskio.tag_type;
 				msg_info.hdr.msg_type = CTL_MSG_MANAGE_TASKS;
 				msg_info.hdr.original_sc = NULL;
 				msg_info.hdr.serializing_sc = NULL;
 #if 0
 				printf("Sent Abort to other side\n");
 #endif
 				ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info,
 				    sizeof(msg_info.task), M_NOWAIT);
 			}
 #if 0
 			printf("ctl_abort_task: found I/O to abort\n");
 #endif
 		}
 	}
 	mtx_unlock(&lun->lun_lock);
 
 	if (found == 0) {
 		/*
 		 * This isn't really an error.  It's entirely possible for
 		 * the abort and command completion to cross on the wire.
 		 * This is more of an informative/diagnostic error.
 		 */
 #if 0
 		printf("ctl_abort_task: ABORT sent for nonexistent I/O: "
 		       "%u:%u:%u tag %d type %d\n",
 		       io->io_hdr.nexus.initid,
 		       io->io_hdr.nexus.targ_port,
 		       io->io_hdr.nexus.targ_lun, io->taskio.tag_num,
 		       io->taskio.tag_type);
 #endif
 	}
 	io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE;
 	return (0);
 }
 
 static int
 ctl_query_task(union ctl_io *io, int task_set)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	union ctl_io *xio;
 	struct ctl_lun *lun;
 	int found = 0;
 	uint32_t targ_lun;
 
 	targ_lun = io->io_hdr.nexus.targ_mapped_lun;
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		io->taskio.task_status = CTL_TASK_LUN_DOES_NOT_EXIST;
 		return (1);
 	}
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 	for (xio = (union ctl_io *)TAILQ_FIRST(&lun->ooa_queue); xio != NULL;
 	     xio = (union ctl_io *)TAILQ_NEXT(&xio->io_hdr, ooa_links)) {
 
 		if ((xio->io_hdr.nexus.targ_port != io->io_hdr.nexus.targ_port)
 		 || (xio->io_hdr.nexus.initid != io->io_hdr.nexus.initid)
 		 || (xio->io_hdr.flags & CTL_FLAG_ABORT))
 			continue;
 
 		if (task_set || xio->scsiio.tag_num == io->taskio.tag_num) {
 			found = 1;
 			break;
 		}
 	}
 	mtx_unlock(&lun->lun_lock);
 	if (found)
 		io->taskio.task_status = CTL_TASK_FUNCTION_SUCCEEDED;
 	else
 		io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE;
 	return (0);
 }
 
 static int
 ctl_query_async_event(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_lun *lun;
 	ctl_ua_type ua;
 	uint32_t targ_lun, initidx;
 
 	targ_lun = io->io_hdr.nexus.targ_mapped_lun;
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		io->taskio.task_status = CTL_TASK_LUN_DOES_NOT_EXIST;
 		return (1);
 	}
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 	initidx = ctl_get_initindex(&io->io_hdr.nexus);
 	ua = ctl_build_qae(lun, initidx, io->taskio.task_resp);
 	mtx_unlock(&lun->lun_lock);
 	if (ua != CTL_UA_NONE)
 		io->taskio.task_status = CTL_TASK_FUNCTION_SUCCEEDED;
 	else
 		io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE;
 	return (0);
 }
 
 static void
 ctl_run_task(union ctl_io *io)
 {
 	int retval = 1;
 
 	CTL_DEBUG_PRINT(("ctl_run_task\n"));
 	KASSERT(io->io_hdr.io_type == CTL_IO_TASK,
 	    ("ctl_run_task: Unextected io_type %d\n", io->io_hdr.io_type));
 	io->taskio.task_status = CTL_TASK_FUNCTION_NOT_SUPPORTED;
 	bzero(io->taskio.task_resp, sizeof(io->taskio.task_resp));
 	switch (io->taskio.task_action) {
 	case CTL_TASK_ABORT_TASK:
 		retval = ctl_abort_task(io);
 		break;
 	case CTL_TASK_ABORT_TASK_SET:
 	case CTL_TASK_CLEAR_TASK_SET:
 		retval = ctl_abort_task_set(io);
 		break;
 	case CTL_TASK_CLEAR_ACA:
 		break;
 	case CTL_TASK_I_T_NEXUS_RESET:
 		retval = ctl_i_t_nexus_reset(io);
 		break;
 	case CTL_TASK_LUN_RESET:
 		retval = ctl_lun_reset(io);
 		break;
 	case CTL_TASK_TARGET_RESET:
 	case CTL_TASK_BUS_RESET:
 		retval = ctl_target_reset(io);
 		break;
 	case CTL_TASK_PORT_LOGIN:
 		break;
 	case CTL_TASK_PORT_LOGOUT:
 		break;
 	case CTL_TASK_QUERY_TASK:
 		retval = ctl_query_task(io, 0);
 		break;
 	case CTL_TASK_QUERY_TASK_SET:
 		retval = ctl_query_task(io, 1);
 		break;
 	case CTL_TASK_QUERY_ASYNC_EVENT:
 		retval = ctl_query_async_event(io);
 		break;
 	default:
 		printf("%s: got unknown task management event %d\n",
 		       __func__, io->taskio.task_action);
 		break;
 	}
 	if (retval == 0)
 		io->io_hdr.status = CTL_SUCCESS;
 	else
 		io->io_hdr.status = CTL_ERROR;
 	ctl_done(io);
 }
 
 /*
  * For HA operation.  Handle commands that come in from the other
  * controller.
  */
 static void
 ctl_handle_isc(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_lun *lun;
 	const struct ctl_cmd_entry *entry;
 	uint32_t targ_lun;
 
 	targ_lun = io->io_hdr.nexus.targ_mapped_lun;
 	switch (io->io_hdr.msg_type) {
 	case CTL_MSG_SERIALIZE:
 		ctl_serialize_other_sc_cmd(&io->scsiio);
 		break;
 	case CTL_MSG_R2R:		/* Only used in SER_ONLY mode. */
 		entry = ctl_get_cmd_entry(&io->scsiio, NULL);
 		if (targ_lun >= CTL_MAX_LUNS ||
 		    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 			ctl_done(io);
 			break;
 		}
 		mtx_lock(&lun->lun_lock);
 		if (ctl_scsiio_lun_check(lun, entry, &io->scsiio) != 0) {
 			mtx_unlock(&lun->lun_lock);
 			ctl_done(io);
 			break;
 		}
 		io->io_hdr.flags |= CTL_FLAG_IS_WAS_ON_RTR;
 		mtx_unlock(&lun->lun_lock);
 		ctl_enqueue_rtr(io);
 		break;
 	case CTL_MSG_FINISH_IO:
 		if (softc->ha_mode == CTL_HA_MODE_XFER) {
 			ctl_done(io);
 			break;
 		}
 		if (targ_lun >= CTL_MAX_LUNS ||
 		    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 			ctl_free_io(io);
 			break;
 		}
 		mtx_lock(&lun->lun_lock);
 		TAILQ_REMOVE(&lun->ooa_queue, &io->io_hdr, ooa_links);
 		ctl_check_blocked(lun);
 		mtx_unlock(&lun->lun_lock);
 		ctl_free_io(io);
 		break;
 	case CTL_MSG_PERS_ACTION:
 		ctl_hndl_per_res_out_on_other_sc(io);
 		ctl_free_io(io);
 		break;
 	case CTL_MSG_BAD_JUJU:
 		ctl_done(io);
 		break;
 	case CTL_MSG_DATAMOVE:		/* Only used in XFER mode */
 		ctl_datamove_remote(io);
 		break;
 	case CTL_MSG_DATAMOVE_DONE:	/* Only used in XFER mode */
 		io->scsiio.be_move_done(io);
 		break;
 	case CTL_MSG_FAILOVER:
 		ctl_failover_lun(io);
 		ctl_free_io(io);
 		break;
 	default:
 		printf("%s: Invalid message type %d\n",
 		       __func__, io->io_hdr.msg_type);
 		ctl_free_io(io);
 		break;
 	}
 
 }
 
 
 /*
  * Returns the match type in the case of a match, or CTL_LUN_PAT_NONE if
  * there is no match.
  */
 static ctl_lun_error_pattern
 ctl_cmd_pattern_match(struct ctl_scsiio *ctsio, struct ctl_error_desc *desc)
 {
 	const struct ctl_cmd_entry *entry;
 	ctl_lun_error_pattern filtered_pattern, pattern;
 
 	pattern = desc->error_pattern;
 
 	/*
 	 * XXX KDM we need more data passed into this function to match a
 	 * custom pattern, and we actually need to implement custom pattern
 	 * matching.
 	 */
 	if (pattern & CTL_LUN_PAT_CMD)
 		return (CTL_LUN_PAT_CMD);
 
 	if ((pattern & CTL_LUN_PAT_MASK) == CTL_LUN_PAT_ANY)
 		return (CTL_LUN_PAT_ANY);
 
 	entry = ctl_get_cmd_entry(ctsio, NULL);
 
 	filtered_pattern = entry->pattern & pattern;
 
 	/*
 	 * If the user requested specific flags in the pattern (e.g.
 	 * CTL_LUN_PAT_RANGE), make sure the command supports all of those
 	 * flags.
 	 *
 	 * If the user did not specify any flags, it doesn't matter whether
 	 * or not the command supports the flags.
 	 */
 	if ((filtered_pattern & ~CTL_LUN_PAT_MASK) !=
 	     (pattern & ~CTL_LUN_PAT_MASK))
 		return (CTL_LUN_PAT_NONE);
 
 	/*
 	 * If the user asked for a range check, see if the requested LBA
 	 * range overlaps with this command's LBA range.
 	 */
 	if (filtered_pattern & CTL_LUN_PAT_RANGE) {
 		uint64_t lba1;
 		uint64_t len1;
 		ctl_action action;
 		int retval;
 
 		retval = ctl_get_lba_len((union ctl_io *)ctsio, &lba1, &len1);
 		if (retval != 0)
 			return (CTL_LUN_PAT_NONE);
 
 		action = ctl_extent_check_lba(lba1, len1, desc->lba_range.lba,
 					      desc->lba_range.len, FALSE);
 		/*
 		 * A "pass" means that the LBA ranges don't overlap, so
 		 * this doesn't match the user's range criteria.
 		 */
 		if (action == CTL_ACTION_PASS)
 			return (CTL_LUN_PAT_NONE);
 	}
 
 	return (filtered_pattern);
 }
 
 static void
 ctl_inject_error(struct ctl_lun *lun, union ctl_io *io)
 {
 	struct ctl_error_desc *desc, *desc2;
 
 	mtx_assert(&lun->lun_lock, MA_OWNED);
 
 	STAILQ_FOREACH_SAFE(desc, &lun->error_list, links, desc2) {
 		ctl_lun_error_pattern pattern;
 		/*
 		 * Check to see whether this particular command matches
 		 * the pattern in the descriptor.
 		 */
 		pattern = ctl_cmd_pattern_match(&io->scsiio, desc);
 		if ((pattern & CTL_LUN_PAT_MASK) == CTL_LUN_PAT_NONE)
 			continue;
 
 		switch (desc->lun_error & CTL_LUN_INJ_TYPE) {
 		case CTL_LUN_INJ_ABORTED:
 			ctl_set_aborted(&io->scsiio);
 			break;
 		case CTL_LUN_INJ_MEDIUM_ERR:
 			ctl_set_medium_error(&io->scsiio,
 			    (io->io_hdr.flags & CTL_FLAG_DATA_MASK) !=
 			     CTL_FLAG_DATA_OUT);
 			break;
 		case CTL_LUN_INJ_UA:
 			/* 29h/00h  POWER ON, RESET, OR BUS DEVICE RESET
 			 * OCCURRED */
 			ctl_set_ua(&io->scsiio, 0x29, 0x00);
 			break;
 		case CTL_LUN_INJ_CUSTOM:
 			/*
 			 * We're assuming the user knows what he is doing.
 			 * Just copy the sense information without doing
 			 * checks.
 			 */
 			bcopy(&desc->custom_sense, &io->scsiio.sense_data,
 			      MIN(sizeof(desc->custom_sense),
 				  sizeof(io->scsiio.sense_data)));
 			io->scsiio.scsi_status = SCSI_STATUS_CHECK_COND;
 			io->scsiio.sense_len = SSD_FULL_SIZE;
 			io->io_hdr.status = CTL_SCSI_ERROR | CTL_AUTOSENSE;
 			break;
 		case CTL_LUN_INJ_NONE:
 		default:
 			/*
 			 * If this is an error injection type we don't know
 			 * about, clear the continuous flag (if it is set)
 			 * so it will get deleted below.
 			 */
 			desc->lun_error &= ~CTL_LUN_INJ_CONTINUOUS;
 			break;
 		}
 		/*
 		 * By default, each error injection action is a one-shot
 		 */
 		if (desc->lun_error & CTL_LUN_INJ_CONTINUOUS)
 			continue;
 
 		STAILQ_REMOVE(&lun->error_list, desc, ctl_error_desc, links);
 
 		free(desc, M_CTL);
 	}
 }
 
 #ifdef CTL_IO_DELAY
 static void
 ctl_datamove_timer_wakeup(void *arg)
 {
 	union ctl_io *io;
 
 	io = (union ctl_io *)arg;
 
 	ctl_datamove(io);
 }
 #endif /* CTL_IO_DELAY */
 
 void
 ctl_datamove(union ctl_io *io)
 {
 	void (*fe_datamove)(union ctl_io *io);
 
 	mtx_assert(&((struct ctl_softc *)CTL_SOFTC(io))->ctl_lock, MA_NOTOWNED);
 
 	CTL_DEBUG_PRINT(("ctl_datamove\n"));
 
 	/* No data transferred yet.  Frontend must update this when done. */
 	io->scsiio.kern_data_resid = io->scsiio.kern_data_len;
 
 #ifdef CTL_TIME_IO
 	if ((time_uptime - io->io_hdr.start_time) > ctl_time_io_secs) {
 		char str[256];
 		char path_str[64];
 		struct sbuf sb;
 
 		ctl_scsi_path_string(io, path_str, sizeof(path_str));
 		sbuf_new(&sb, str, sizeof(str), SBUF_FIXEDLEN);
 
 		sbuf_cat(&sb, path_str);
 		switch (io->io_hdr.io_type) {
 		case CTL_IO_SCSI:
 			ctl_scsi_command_string(&io->scsiio, NULL, &sb);
 			sbuf_printf(&sb, "\n");
 			sbuf_cat(&sb, path_str);
 			sbuf_printf(&sb, "Tag: 0x%04x, type %d\n",
 				    io->scsiio.tag_num, io->scsiio.tag_type);
 			break;
 		case CTL_IO_TASK:
 			sbuf_printf(&sb, "Task I/O type: %d, Tag: 0x%04x, "
 				    "Tag Type: %d\n", io->taskio.task_action,
 				    io->taskio.tag_num, io->taskio.tag_type);
 			break;
 		default:
 			panic("%s: Invalid CTL I/O type %d\n",
 			    __func__, io->io_hdr.io_type);
 		}
 		sbuf_cat(&sb, path_str);
 		sbuf_printf(&sb, "ctl_datamove: %jd seconds\n",
 			    (intmax_t)time_uptime - io->io_hdr.start_time);
 		sbuf_finish(&sb);
 		printf("%s", sbuf_data(&sb));
 	}
 #endif /* CTL_TIME_IO */
 
 #ifdef CTL_IO_DELAY
 	if (io->io_hdr.flags & CTL_FLAG_DELAY_DONE) {
 		io->io_hdr.flags &= ~CTL_FLAG_DELAY_DONE;
 	} else {
 		struct ctl_lun *lun;
 
 		lun = CTL_LUN(io);
 		if ((lun != NULL)
 		 && (lun->delay_info.datamove_delay > 0)) {
 
 			callout_init(&io->io_hdr.delay_callout, /*mpsafe*/ 1);
 			io->io_hdr.flags |= CTL_FLAG_DELAY_DONE;
 			callout_reset(&io->io_hdr.delay_callout,
 				      lun->delay_info.datamove_delay * hz,
 				      ctl_datamove_timer_wakeup, io);
 			if (lun->delay_info.datamove_type ==
 			    CTL_DELAY_TYPE_ONESHOT)
 				lun->delay_info.datamove_delay = 0;
 			return;
 		}
 	}
 #endif
 
 	/*
 	 * This command has been aborted.  Set the port status, so we fail
 	 * the data move.
 	 */
 	if (io->io_hdr.flags & CTL_FLAG_ABORT) {
 		printf("ctl_datamove: tag 0x%04x on (%u:%u:%u) aborted\n",
 		       io->scsiio.tag_num, io->io_hdr.nexus.initid,
 		       io->io_hdr.nexus.targ_port,
 		       io->io_hdr.nexus.targ_lun);
 		io->io_hdr.port_status = 31337;
 		/*
 		 * Note that the backend, in this case, will get the
 		 * callback in its context.  In other cases it may get
 		 * called in the frontend's interrupt thread context.
 		 */
 		io->scsiio.be_move_done(io);
 		return;
 	}
 
 	/* Don't confuse frontend with zero length data move. */
 	if (io->scsiio.kern_data_len == 0) {
 		io->scsiio.be_move_done(io);
 		return;
 	}
 
 	fe_datamove = CTL_PORT(io)->fe_datamove;
 	fe_datamove(io);
 }
 
 static void
 ctl_send_datamove_done(union ctl_io *io, int have_lock)
 {
 	union ctl_ha_msg msg;
 #ifdef CTL_TIME_IO
 	struct bintime cur_bt;
 #endif
 
 	memset(&msg, 0, sizeof(msg));
 	msg.hdr.msg_type = CTL_MSG_DATAMOVE_DONE;
 	msg.hdr.original_sc = io;
 	msg.hdr.serializing_sc = io->io_hdr.serializing_sc;
 	msg.hdr.nexus = io->io_hdr.nexus;
 	msg.hdr.status = io->io_hdr.status;
 	msg.scsi.kern_data_resid = io->scsiio.kern_data_resid;
 	msg.scsi.tag_num = io->scsiio.tag_num;
 	msg.scsi.tag_type = io->scsiio.tag_type;
 	msg.scsi.scsi_status = io->scsiio.scsi_status;
 	memcpy(&msg.scsi.sense_data, &io->scsiio.sense_data,
 	       io->scsiio.sense_len);
 	msg.scsi.sense_len = io->scsiio.sense_len;
 	msg.scsi.port_status = io->io_hdr.port_status;
 	io->io_hdr.flags &= ~CTL_FLAG_IO_ACTIVE;
 	if (io->io_hdr.flags & CTL_FLAG_FAILOVER) {
 		ctl_failover_io(io, /*have_lock*/ have_lock);
 		return;
 	}
 	ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg,
 	    sizeof(msg.scsi) - sizeof(msg.scsi.sense_data) +
 	    msg.scsi.sense_len, M_WAITOK);
 
 #ifdef CTL_TIME_IO
 	getbinuptime(&cur_bt);
 	bintime_sub(&cur_bt, &io->io_hdr.dma_start_bt);
 	bintime_add(&io->io_hdr.dma_bt, &cur_bt);
 #endif
 	io->io_hdr.num_dmas++;
 }
 
 /*
  * The DMA to the remote side is done, now we need to tell the other side
  * we're done so it can continue with its data movement.
  */
 static void
 ctl_datamove_remote_write_cb(struct ctl_ha_dt_req *rq)
 {
 	union ctl_io *io;
 	uint32_t i;
 
 	io = rq->context;
 
 	if (rq->ret != CTL_HA_STATUS_SUCCESS) {
 		printf("%s: ISC DMA write failed with error %d", __func__,
 		       rq->ret);
 		ctl_set_internal_failure(&io->scsiio,
 					 /*sks_valid*/ 1,
 					 /*retry_count*/ rq->ret);
 	}
 
 	ctl_dt_req_free(rq);
 
 	for (i = 0; i < io->scsiio.kern_sg_entries; i++)
 		free(io->io_hdr.local_sglist[i].addr, M_CTL);
 	free(io->io_hdr.remote_sglist, M_CTL);
 	io->io_hdr.remote_sglist = NULL;
 	io->io_hdr.local_sglist = NULL;
 
 	/*
 	 * The data is in local and remote memory, so now we need to send
 	 * status (good or back) back to the other side.
 	 */
 	ctl_send_datamove_done(io, /*have_lock*/ 0);
 }
 
 /*
  * We've moved the data from the host/controller into local memory.  Now we
  * need to push it over to the remote controller's memory.
  */
 static int
 ctl_datamove_remote_dm_write_cb(union ctl_io *io)
 {
 	int retval;
 
 	retval = ctl_datamove_remote_xfer(io, CTL_HA_DT_CMD_WRITE,
 					  ctl_datamove_remote_write_cb);
 	return (retval);
 }
 
 static void
 ctl_datamove_remote_write(union ctl_io *io)
 {
 	int retval;
 	void (*fe_datamove)(union ctl_io *io);
 
 	/*
 	 * - Get the data from the host/HBA into local memory.
 	 * - DMA memory from the local controller to the remote controller.
 	 * - Send status back to the remote controller.
 	 */
 
 	retval = ctl_datamove_remote_sgl_setup(io);
 	if (retval != 0)
 		return;
 
 	/* Switch the pointer over so the FETD knows what to do */
 	io->scsiio.kern_data_ptr = (uint8_t *)io->io_hdr.local_sglist;
 
 	/*
 	 * Use a custom move done callback, since we need to send completion
 	 * back to the other controller, not to the backend on this side.
 	 */
 	io->scsiio.be_move_done = ctl_datamove_remote_dm_write_cb;
 
 	fe_datamove = CTL_PORT(io)->fe_datamove;
 	fe_datamove(io);
 }
 
 static int
 ctl_datamove_remote_dm_read_cb(union ctl_io *io)
 {
 #if 0
 	char str[256];
 	char path_str[64];
 	struct sbuf sb;
 #endif
 	uint32_t i;
 
 	for (i = 0; i < io->scsiio.kern_sg_entries; i++)
 		free(io->io_hdr.local_sglist[i].addr, M_CTL);
 	free(io->io_hdr.remote_sglist, M_CTL);
 	io->io_hdr.remote_sglist = NULL;
 	io->io_hdr.local_sglist = NULL;
 
 #if 0
 	scsi_path_string(io, path_str, sizeof(path_str));
 	sbuf_new(&sb, str, sizeof(str), SBUF_FIXEDLEN);
 	sbuf_cat(&sb, path_str);
 	scsi_command_string(&io->scsiio, NULL, &sb);
 	sbuf_printf(&sb, "\n");
 	sbuf_cat(&sb, path_str);
 	sbuf_printf(&sb, "Tag: 0x%04x, type %d\n",
 		    io->scsiio.tag_num, io->scsiio.tag_type);
 	sbuf_cat(&sb, path_str);
 	sbuf_printf(&sb, "%s: flags %#x, status %#x\n", __func__,
 		    io->io_hdr.flags, io->io_hdr.status);
 	sbuf_finish(&sb);
 	printk("%s", sbuf_data(&sb));
 #endif
 
 
 	/*
 	 * The read is done, now we need to send status (good or bad) back
 	 * to the other side.
 	 */
 	ctl_send_datamove_done(io, /*have_lock*/ 0);
 
 	return (0);
 }
 
 static void
 ctl_datamove_remote_read_cb(struct ctl_ha_dt_req *rq)
 {
 	union ctl_io *io;
 	void (*fe_datamove)(union ctl_io *io);
 
 	io = rq->context;
 
 	if (rq->ret != CTL_HA_STATUS_SUCCESS) {
 		printf("%s: ISC DMA read failed with error %d\n", __func__,
 		       rq->ret);
 		ctl_set_internal_failure(&io->scsiio,
 					 /*sks_valid*/ 1,
 					 /*retry_count*/ rq->ret);
 	}
 
 	ctl_dt_req_free(rq);
 
 	/* Switch the pointer over so the FETD knows what to do */
 	io->scsiio.kern_data_ptr = (uint8_t *)io->io_hdr.local_sglist;
 
 	/*
 	 * Use a custom move done callback, since we need to send completion
 	 * back to the other controller, not to the backend on this side.
 	 */
 	io->scsiio.be_move_done = ctl_datamove_remote_dm_read_cb;
 
 	/* XXX KDM add checks like the ones in ctl_datamove? */
 
 	fe_datamove = CTL_PORT(io)->fe_datamove;
 	fe_datamove(io);
 }
 
 static int
 ctl_datamove_remote_sgl_setup(union ctl_io *io)
 {
 	struct ctl_sg_entry *local_sglist;
 	uint32_t len_to_go;
 	int retval;
 	int i;
 
 	retval = 0;
 	local_sglist = io->io_hdr.local_sglist;
 	len_to_go = io->scsiio.kern_data_len;
 
 	/*
 	 * The difficult thing here is that the size of the various
 	 * S/G segments may be different than the size from the
 	 * remote controller.  That'll make it harder when DMAing
 	 * the data back to the other side.
 	 */
 	for (i = 0; len_to_go > 0; i++) {
 		local_sglist[i].len = MIN(len_to_go, CTL_HA_DATAMOVE_SEGMENT);
 		local_sglist[i].addr =
 		    malloc(local_sglist[i].len, M_CTL, M_WAITOK);
 
 		len_to_go -= local_sglist[i].len;
 	}
 	/*
 	 * Reset the number of S/G entries accordingly.  The original
 	 * number of S/G entries is available in rem_sg_entries.
 	 */
 	io->scsiio.kern_sg_entries = i;
 
 #if 0
 	printf("%s: kern_sg_entries = %d\n", __func__,
 	       io->scsiio.kern_sg_entries);
 	for (i = 0; i < io->scsiio.kern_sg_entries; i++)
 		printf("%s: sg[%d] = %p, %lu\n", __func__, i,
 		       local_sglist[i].addr, local_sglist[i].len);
 #endif
 
 	return (retval);
 }
 
 static int
 ctl_datamove_remote_xfer(union ctl_io *io, unsigned command,
 			 ctl_ha_dt_cb callback)
 {
 	struct ctl_ha_dt_req *rq;
 	struct ctl_sg_entry *remote_sglist, *local_sglist;
 	uint32_t local_used, remote_used, total_used;
 	int i, j, isc_ret;
 
 	rq = ctl_dt_req_alloc();
 
 	/*
 	 * If we failed to allocate the request, and if the DMA didn't fail
 	 * anyway, set busy status.  This is just a resource allocation
 	 * failure.
 	 */
 	if ((rq == NULL)
 	 && ((io->io_hdr.status & CTL_STATUS_MASK) != CTL_STATUS_NONE &&
 	     (io->io_hdr.status & CTL_STATUS_MASK) != CTL_SUCCESS))
 		ctl_set_busy(&io->scsiio);
 
 	if ((io->io_hdr.status & CTL_STATUS_MASK) != CTL_STATUS_NONE &&
 	    (io->io_hdr.status & CTL_STATUS_MASK) != CTL_SUCCESS) {
 
 		if (rq != NULL)
 			ctl_dt_req_free(rq);
 
 		/*
 		 * The data move failed.  We need to return status back
 		 * to the other controller.  No point in trying to DMA
 		 * data to the remote controller.
 		 */
 
 		ctl_send_datamove_done(io, /*have_lock*/ 0);
 
 		return (1);
 	}
 
 	local_sglist = io->io_hdr.local_sglist;
 	remote_sglist = io->io_hdr.remote_sglist;
 	local_used = 0;
 	remote_used = 0;
 	total_used = 0;
 
 	/*
 	 * Pull/push the data over the wire from/to the other controller.
 	 * This takes into account the possibility that the local and
 	 * remote sglists may not be identical in terms of the size of
 	 * the elements and the number of elements.
 	 *
 	 * One fundamental assumption here is that the length allocated for
 	 * both the local and remote sglists is identical.  Otherwise, we've
 	 * essentially got a coding error of some sort.
 	 */
 	isc_ret = CTL_HA_STATUS_SUCCESS;
 	for (i = 0, j = 0; total_used < io->scsiio.kern_data_len; ) {
 		uint32_t cur_len;
 		uint8_t *tmp_ptr;
 
 		rq->command = command;
 		rq->context = io;
 
 		/*
 		 * Both pointers should be aligned.  But it is possible
 		 * that the allocation length is not.  They should both
 		 * also have enough slack left over at the end, though,
 		 * to round up to the next 8 byte boundary.
 		 */
 		cur_len = MIN(local_sglist[i].len - local_used,
 			      remote_sglist[j].len - remote_used);
 		rq->size = cur_len;
 
 		tmp_ptr = (uint8_t *)local_sglist[i].addr;
 		tmp_ptr += local_used;
 
 #if 0
 		/* Use physical addresses when talking to ISC hardware */
 		if ((io->io_hdr.flags & CTL_FLAG_BUS_ADDR) == 0) {
 			/* XXX KDM use busdma */
 			rq->local = vtophys(tmp_ptr);
 		} else
 			rq->local = tmp_ptr;
 #else
 		KASSERT((io->io_hdr.flags & CTL_FLAG_BUS_ADDR) == 0,
 		    ("HA does not support BUS_ADDR"));
 		rq->local = tmp_ptr;
 #endif
 
 		tmp_ptr = (uint8_t *)remote_sglist[j].addr;
 		tmp_ptr += remote_used;
 		rq->remote = tmp_ptr;
 
 		rq->callback = NULL;
 
 		local_used += cur_len;
 		if (local_used >= local_sglist[i].len) {
 			i++;
 			local_used = 0;
 		}
 
 		remote_used += cur_len;
 		if (remote_used >= remote_sglist[j].len) {
 			j++;
 			remote_used = 0;
 		}
 		total_used += cur_len;
 
 		if (total_used >= io->scsiio.kern_data_len)
 			rq->callback = callback;
 
 #if 0
 		printf("%s: %s: local %p remote %p size %d\n", __func__,
 		       (command == CTL_HA_DT_CMD_WRITE) ? "WRITE" : "READ",
 		       rq->local, rq->remote, rq->size);
 #endif
 
 		isc_ret = ctl_dt_single(rq);
 		if (isc_ret > CTL_HA_STATUS_SUCCESS)
 			break;
 	}
 	if (isc_ret != CTL_HA_STATUS_WAIT) {
 		rq->ret = isc_ret;
 		callback(rq);
 	}
 
 	return (0);
 }
 
 static void
 ctl_datamove_remote_read(union ctl_io *io)
 {
 	int retval;
 	uint32_t i;
 
 	/*
 	 * This will send an error to the other controller in the case of a
 	 * failure.
 	 */
 	retval = ctl_datamove_remote_sgl_setup(io);
 	if (retval != 0)
 		return;
 
 	retval = ctl_datamove_remote_xfer(io, CTL_HA_DT_CMD_READ,
 					  ctl_datamove_remote_read_cb);
 	if (retval != 0) {
 		/*
 		 * Make sure we free memory if there was an error..  The
 		 * ctl_datamove_remote_xfer() function will send the
 		 * datamove done message, or call the callback with an
 		 * error if there is a problem.
 		 */
 		for (i = 0; i < io->scsiio.kern_sg_entries; i++)
 			free(io->io_hdr.local_sglist[i].addr, M_CTL);
 		free(io->io_hdr.remote_sglist, M_CTL);
 		io->io_hdr.remote_sglist = NULL;
 		io->io_hdr.local_sglist = NULL;
 	}
 }
 
 /*
  * Process a datamove request from the other controller.  This is used for
  * XFER mode only, not SER_ONLY mode.  For writes, we DMA into local memory
  * first.  Once that is complete, the data gets DMAed into the remote
  * controller's memory.  For reads, we DMA from the remote controller's
  * memory into our memory first, and then move it out to the FETD.
  */
 static void
 ctl_datamove_remote(union ctl_io *io)
 {
 
 	mtx_assert(&((struct ctl_softc *)CTL_SOFTC(io))->ctl_lock, MA_NOTOWNED);
 
 	if (io->io_hdr.flags & CTL_FLAG_FAILOVER) {
 		ctl_failover_io(io, /*have_lock*/ 0);
 		return;
 	}
 
 	/*
 	 * Note that we look for an aborted I/O here, but don't do some of
 	 * the other checks that ctl_datamove() normally does.
 	 * We don't need to run the datamove delay code, since that should
 	 * have been done if need be on the other controller.
 	 */
 	if (io->io_hdr.flags & CTL_FLAG_ABORT) {
 		printf("%s: tag 0x%04x on (%u:%u:%u) aborted\n", __func__,
 		       io->scsiio.tag_num, io->io_hdr.nexus.initid,
 		       io->io_hdr.nexus.targ_port,
 		       io->io_hdr.nexus.targ_lun);
 		io->io_hdr.port_status = 31338;
 		ctl_send_datamove_done(io, /*have_lock*/ 0);
 		return;
 	}
 
 	if ((io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_OUT)
 		ctl_datamove_remote_write(io);
 	else if ((io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_IN)
 		ctl_datamove_remote_read(io);
 	else {
 		io->io_hdr.port_status = 31339;
 		ctl_send_datamove_done(io, /*have_lock*/ 0);
 	}
 }
 
 static void
 ctl_process_done(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_port *port = CTL_PORT(io);
 	struct ctl_lun *lun = CTL_LUN(io);
 	void (*fe_done)(union ctl_io *io);
 	union ctl_ha_msg msg;
 
 	CTL_DEBUG_PRINT(("ctl_process_done\n"));
 	fe_done = port->fe_done;
 
 #ifdef CTL_TIME_IO
 	if ((time_uptime - io->io_hdr.start_time) > ctl_time_io_secs) {
 		char str[256];
 		char path_str[64];
 		struct sbuf sb;
 
 		ctl_scsi_path_string(io, path_str, sizeof(path_str));
 		sbuf_new(&sb, str, sizeof(str), SBUF_FIXEDLEN);
 
 		sbuf_cat(&sb, path_str);
 		switch (io->io_hdr.io_type) {
 		case CTL_IO_SCSI:
 			ctl_scsi_command_string(&io->scsiio, NULL, &sb);
 			sbuf_printf(&sb, "\n");
 			sbuf_cat(&sb, path_str);
 			sbuf_printf(&sb, "Tag: 0x%04x, type %d\n",
 				    io->scsiio.tag_num, io->scsiio.tag_type);
 			break;
 		case CTL_IO_TASK:
 			sbuf_printf(&sb, "Task I/O type: %d, Tag: 0x%04x, "
 				    "Tag Type: %d\n", io->taskio.task_action,
 				    io->taskio.tag_num, io->taskio.tag_type);
 			break;
 		default:
 			panic("%s: Invalid CTL I/O type %d\n",
 			    __func__, io->io_hdr.io_type);
 		}
 		sbuf_cat(&sb, path_str);
 		sbuf_printf(&sb, "ctl_process_done: %jd seconds\n",
 			    (intmax_t)time_uptime - io->io_hdr.start_time);
 		sbuf_finish(&sb);
 		printf("%s", sbuf_data(&sb));
 	}
 #endif /* CTL_TIME_IO */
 
 	switch (io->io_hdr.io_type) {
 	case CTL_IO_SCSI:
 		break;
 	case CTL_IO_TASK:
 		if (ctl_debug & CTL_DEBUG_INFO)
 			ctl_io_error_print(io, NULL);
 		fe_done(io);
 		return;
 	default:
 		panic("%s: Invalid CTL I/O type %d\n",
 		    __func__, io->io_hdr.io_type);
 	}
 
 	if (lun == NULL) {
 		CTL_DEBUG_PRINT(("NULL LUN for lun %d\n",
 				 io->io_hdr.nexus.targ_mapped_lun));
 		goto bailout;
 	}
 
 	mtx_lock(&lun->lun_lock);
 
 	/*
 	 * Check to see if we have any informational exception and status
 	 * of this command can be modified to report it in form of either
 	 * RECOVERED ERROR or NO SENSE, depending on MRIE mode page field.
 	 */
 	if (lun->ie_reported == 0 && lun->ie_asc != 0 &&
 	    io->io_hdr.status == CTL_SUCCESS &&
 	    (io->io_hdr.flags & CTL_FLAG_STATUS_SENT) == 0) {
 		uint8_t mrie = lun->MODE_IE.mrie;
 		uint8_t per = ((lun->MODE_RWER.byte3 & SMS_RWER_PER) ||
 		    (lun->MODE_VER.byte3 & SMS_VER_PER));
 		if (((mrie == SIEP_MRIE_REC_COND && per) ||
 		     mrie == SIEP_MRIE_REC_UNCOND ||
 		     mrie == SIEP_MRIE_NO_SENSE) &&
 		    (ctl_get_cmd_entry(&io->scsiio, NULL)->flags &
 		     CTL_CMD_FLAG_NO_SENSE) == 0) {
 			ctl_set_sense(&io->scsiio,
 			      /*current_error*/ 1,
 			      /*sense_key*/ (mrie == SIEP_MRIE_NO_SENSE) ?
 			        SSD_KEY_NO_SENSE : SSD_KEY_RECOVERED_ERROR,
 			      /*asc*/ lun->ie_asc,
 			      /*ascq*/ lun->ie_ascq,
 			      SSD_ELEM_NONE);
 			lun->ie_reported = 1;
 		}
 	} else if (lun->ie_reported < 0)
 		lun->ie_reported = 0;
 
 	/*
 	 * Check to see if we have any errors to inject here.  We only
 	 * inject errors for commands that don't already have errors set.
 	 */
 	if (!STAILQ_EMPTY(&lun->error_list) &&
 	    ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS) &&
 	    ((io->io_hdr.flags & CTL_FLAG_STATUS_SENT) == 0))
 		ctl_inject_error(lun, io);
 
 	/*
 	 * XXX KDM how do we treat commands that aren't completed
 	 * successfully?
 	 *
 	 * XXX KDM should we also track I/O latency?
 	 */
 	if ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS &&
 	    io->io_hdr.io_type == CTL_IO_SCSI) {
 		int type;
 #ifdef CTL_TIME_IO
 		struct bintime bt;
 
 		getbinuptime(&bt);
 		bintime_sub(&bt, &io->io_hdr.start_bt);
 #endif
 		if ((io->io_hdr.flags & CTL_FLAG_DATA_MASK) ==
 		    CTL_FLAG_DATA_IN)
 			type = CTL_STATS_READ;
 		else if ((io->io_hdr.flags & CTL_FLAG_DATA_MASK) ==
 		    CTL_FLAG_DATA_OUT)
 			type = CTL_STATS_WRITE;
 		else
 			type = CTL_STATS_NO_IO;
 
 #ifdef CTL_LEGACY_STATS
 		uint32_t targ_port = port->targ_port;
 		lun->legacy_stats.ports[targ_port].bytes[type] +=
 		    io->scsiio.kern_total_len;
 		lun->legacy_stats.ports[targ_port].operations[type] ++;
 		lun->legacy_stats.ports[targ_port].num_dmas[type] +=
 		    io->io_hdr.num_dmas;
 #ifdef CTL_TIME_IO
 		bintime_add(&lun->legacy_stats.ports[targ_port].dma_time[type],
 		   &io->io_hdr.dma_bt);
 		bintime_add(&lun->legacy_stats.ports[targ_port].time[type],
 		    &bt);
 #endif
 #endif /* CTL_LEGACY_STATS */
 
 		lun->stats.bytes[type] += io->scsiio.kern_total_len;
 		lun->stats.operations[type] ++;
 		lun->stats.dmas[type] += io->io_hdr.num_dmas;
 #ifdef CTL_TIME_IO
 		bintime_add(&lun->stats.dma_time[type], &io->io_hdr.dma_bt);
 		bintime_add(&lun->stats.time[type], &bt);
 #endif
 
 		mtx_lock(&port->port_lock);
 		port->stats.bytes[type] += io->scsiio.kern_total_len;
 		port->stats.operations[type] ++;
 		port->stats.dmas[type] += io->io_hdr.num_dmas;
 #ifdef CTL_TIME_IO
 		bintime_add(&port->stats.dma_time[type], &io->io_hdr.dma_bt);
 		bintime_add(&port->stats.time[type], &bt);
 #endif
 		mtx_unlock(&port->port_lock);
 	}
 
 	/*
 	 * Remove this from the OOA queue.
 	 */
 	TAILQ_REMOVE(&lun->ooa_queue, &io->io_hdr, ooa_links);
 #ifdef CTL_TIME_IO
 	if (TAILQ_EMPTY(&lun->ooa_queue))
 		lun->last_busy = getsbinuptime();
 #endif
 
 	/*
 	 * Run through the blocked queue on this LUN and see if anything
 	 * has become unblocked, now that this transaction is done.
 	 */
 	ctl_check_blocked(lun);
 
 	/*
 	 * If the LUN has been invalidated, free it if there is nothing
 	 * left on its OOA queue.
 	 */
 	if ((lun->flags & CTL_LUN_INVALID)
 	 && TAILQ_EMPTY(&lun->ooa_queue)) {
 		mtx_unlock(&lun->lun_lock);
 		mtx_lock(&softc->ctl_lock);
 		ctl_free_lun(lun);
 		mtx_unlock(&softc->ctl_lock);
 	} else
 		mtx_unlock(&lun->lun_lock);
 
 bailout:
 
 	/*
 	 * If this command has been aborted, make sure we set the status
 	 * properly.  The FETD is responsible for freeing the I/O and doing
 	 * whatever it needs to do to clean up its state.
 	 */
 	if (io->io_hdr.flags & CTL_FLAG_ABORT)
 		ctl_set_task_aborted(&io->scsiio);
 
 	/*
 	 * If enabled, print command error status.
 	 */
 	if ((io->io_hdr.status & CTL_STATUS_MASK) != CTL_SUCCESS &&
 	    (ctl_debug & CTL_DEBUG_INFO) != 0)
 		ctl_io_error_print(io, NULL);
 
 	/*
 	 * Tell the FETD or the other shelf controller we're done with this
 	 * command.  Note that only SCSI commands get to this point.  Task
 	 * management commands are completed above.
 	 */
 	if ((softc->ha_mode != CTL_HA_MODE_XFER) &&
 	    (io->io_hdr.flags & CTL_FLAG_SENT_2OTHER_SC)) {
 		memset(&msg, 0, sizeof(msg));
 		msg.hdr.msg_type = CTL_MSG_FINISH_IO;
 		msg.hdr.serializing_sc = io->io_hdr.serializing_sc;
 		msg.hdr.nexus = io->io_hdr.nexus;
 		ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg,
 		    sizeof(msg.scsi) - sizeof(msg.scsi.sense_data),
 		    M_WAITOK);
 	}
 
 	fe_done(io);
 }
 
 /*
  * Front end should call this if it doesn't do autosense.  When the request
  * sense comes back in from the initiator, we'll dequeue this and send it.
  */
 int
 ctl_queue_sense(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_port *port = CTL_PORT(io);
 	struct ctl_lun *lun;
 	struct scsi_sense_data *ps;
 	uint32_t initidx, p, targ_lun;
 
 	CTL_DEBUG_PRINT(("ctl_queue_sense\n"));
 
 	targ_lun = ctl_lun_map_from_port(port, io->io_hdr.nexus.targ_lun);
 
 	/*
 	 * LUN lookup will likely move to the ctl_work_thread() once we
 	 * have our new queueing infrastructure (that doesn't put things on
 	 * a per-LUN queue initially).  That is so that we can handle
 	 * things like an INQUIRY to a LUN that we don't have enabled.  We
 	 * can't deal with that right now.
 	 * If we don't have a LUN for this, just toss the sense information.
 	 */
 	mtx_lock(&softc->ctl_lock);
 	if (targ_lun >= CTL_MAX_LUNS ||
 	    (lun = softc->ctl_luns[targ_lun]) == NULL) {
 		mtx_unlock(&softc->ctl_lock);
 		goto bailout;
 	}
 	mtx_lock(&lun->lun_lock);
 	mtx_unlock(&softc->ctl_lock);
 
 	initidx = ctl_get_initindex(&io->io_hdr.nexus);
 	p = initidx / CTL_MAX_INIT_PER_PORT;
 	if (lun->pending_sense[p] == NULL) {
 		lun->pending_sense[p] = malloc(sizeof(*ps) * CTL_MAX_INIT_PER_PORT,
 		    M_CTL, M_NOWAIT | M_ZERO);
 	}
 	if ((ps = lun->pending_sense[p]) != NULL) {
 		ps += initidx % CTL_MAX_INIT_PER_PORT;
 		memset(ps, 0, sizeof(*ps));
 		memcpy(ps, &io->scsiio.sense_data, io->scsiio.sense_len);
 	}
 	mtx_unlock(&lun->lun_lock);
 
 bailout:
 	ctl_free_io(io);
 	return (CTL_RETVAL_COMPLETE);
 }
 
 /*
  * Primary command inlet from frontend ports.  All SCSI and task I/O
  * requests must go through this function.
  */
 int
 ctl_queue(union ctl_io *io)
 {
 	struct ctl_port *port = CTL_PORT(io);
 
 	CTL_DEBUG_PRINT(("ctl_queue cdb[0]=%02X\n", io->scsiio.cdb[0]));
 
 #ifdef CTL_TIME_IO
 	io->io_hdr.start_time = time_uptime;
 	getbinuptime(&io->io_hdr.start_bt);
 #endif /* CTL_TIME_IO */
 
 	/* Map FE-specific LUN ID into global one. */
 	io->io_hdr.nexus.targ_mapped_lun =
 	    ctl_lun_map_from_port(port, io->io_hdr.nexus.targ_lun);
 
 	switch (io->io_hdr.io_type) {
 	case CTL_IO_SCSI:
 	case CTL_IO_TASK:
 		if (ctl_debug & CTL_DEBUG_CDB)
 			ctl_io_print(io);
 		ctl_enqueue_incoming(io);
 		break;
 	default:
 		printf("ctl_queue: unknown I/O type %d\n", io->io_hdr.io_type);
 		return (EINVAL);
 	}
 
 	return (CTL_RETVAL_COMPLETE);
 }
 
 #ifdef CTL_IO_DELAY
 static void
 ctl_done_timer_wakeup(void *arg)
 {
 	union ctl_io *io;
 
 	io = (union ctl_io *)arg;
 	ctl_done(io);
 }
 #endif /* CTL_IO_DELAY */
 
 void
 ctl_serseq_done(union ctl_io *io)
 {
 	struct ctl_lun *lun = CTL_LUN(io);;
 
 	if (lun->be_lun == NULL ||
 	    lun->be_lun->serseq == CTL_LUN_SERSEQ_OFF)
 		return;
 	mtx_lock(&lun->lun_lock);
 	io->io_hdr.flags |= CTL_FLAG_SERSEQ_DONE;
 	ctl_check_blocked(lun);
 	mtx_unlock(&lun->lun_lock);
 }
 
 void
 ctl_done(union ctl_io *io)
 {
 
 	/*
 	 * Enable this to catch duplicate completion issues.
 	 */
 #if 0
 	if (io->io_hdr.flags & CTL_FLAG_ALREADY_DONE) {
 		printf("%s: type %d msg %d cdb %x iptl: "
 		       "%u:%u:%u tag 0x%04x "
 		       "flag %#x status %x\n",
 			__func__,
 			io->io_hdr.io_type,
 			io->io_hdr.msg_type,
 			io->scsiio.cdb[0],
 			io->io_hdr.nexus.initid,
 			io->io_hdr.nexus.targ_port,
 			io->io_hdr.nexus.targ_lun,
 			(io->io_hdr.io_type ==
 			CTL_IO_TASK) ?
 			io->taskio.tag_num :
 			io->scsiio.tag_num,
 		        io->io_hdr.flags,
 			io->io_hdr.status);
 	} else
 		io->io_hdr.flags |= CTL_FLAG_ALREADY_DONE;
 #endif
 
 	/*
 	 * This is an internal copy of an I/O, and should not go through
 	 * the normal done processing logic.
 	 */
 	if (io->io_hdr.flags & CTL_FLAG_INT_COPY)
 		return;
 
 #ifdef CTL_IO_DELAY
 	if (io->io_hdr.flags & CTL_FLAG_DELAY_DONE) {
 		io->io_hdr.flags &= ~CTL_FLAG_DELAY_DONE;
 	} else {
 		struct ctl_lun *lun = CTL_LUN(io);
 
 		if ((lun != NULL)
 		 && (lun->delay_info.done_delay > 0)) {
 
 			callout_init(&io->io_hdr.delay_callout, /*mpsafe*/ 1);
 			io->io_hdr.flags |= CTL_FLAG_DELAY_DONE;
 			callout_reset(&io->io_hdr.delay_callout,
 				      lun->delay_info.done_delay * hz,
 				      ctl_done_timer_wakeup, io);
 			if (lun->delay_info.done_type == CTL_DELAY_TYPE_ONESHOT)
 				lun->delay_info.done_delay = 0;
 			return;
 		}
 	}
 #endif /* CTL_IO_DELAY */
 
 	ctl_enqueue_done(io);
 }
 
 static void
 ctl_work_thread(void *arg)
 {
 	struct ctl_thread *thr = (struct ctl_thread *)arg;
 	struct ctl_softc *softc = thr->ctl_softc;
 	union ctl_io *io;
 	int retval;
 
 	CTL_DEBUG_PRINT(("ctl_work_thread starting\n"));
 
 	while (!softc->shutdown) {
 		/*
 		 * We handle the queues in this order:
 		 * - ISC
 		 * - done queue (to free up resources, unblock other commands)
 		 * - RtR queue
 		 * - incoming queue
 		 *
 		 * If those queues are empty, we break out of the loop and
 		 * go to sleep.
 		 */
 		mtx_lock(&thr->queue_lock);
 		io = (union ctl_io *)STAILQ_FIRST(&thr->isc_queue);
 		if (io != NULL) {
 			STAILQ_REMOVE_HEAD(&thr->isc_queue, links);
 			mtx_unlock(&thr->queue_lock);
 			ctl_handle_isc(io);
 			continue;
 		}
 		io = (union ctl_io *)STAILQ_FIRST(&thr->done_queue);
 		if (io != NULL) {
 			STAILQ_REMOVE_HEAD(&thr->done_queue, links);
 			/* clear any blocked commands, call fe_done */
 			mtx_unlock(&thr->queue_lock);
 			ctl_process_done(io);
 			continue;
 		}
 		io = (union ctl_io *)STAILQ_FIRST(&thr->incoming_queue);
 		if (io != NULL) {
 			STAILQ_REMOVE_HEAD(&thr->incoming_queue, links);
 			mtx_unlock(&thr->queue_lock);
 			if (io->io_hdr.io_type == CTL_IO_TASK)
 				ctl_run_task(io);
 			else
 				ctl_scsiio_precheck(softc, &io->scsiio);
 			continue;
 		}
 		io = (union ctl_io *)STAILQ_FIRST(&thr->rtr_queue);
 		if (io != NULL) {
 			STAILQ_REMOVE_HEAD(&thr->rtr_queue, links);
 			mtx_unlock(&thr->queue_lock);
 			retval = ctl_scsiio(&io->scsiio);
 			if (retval != CTL_RETVAL_COMPLETE)
 				CTL_DEBUG_PRINT(("ctl_scsiio failed\n"));
 			continue;
 		}
 
 		/* Sleep until we have something to do. */
 		mtx_sleep(thr, &thr->queue_lock, PDROP | PRIBIO, "-", 0);
 	}
 	thr->thread = NULL;
 	kthread_exit();
 }
 
 static void
 ctl_lun_thread(void *arg)
 {
 	struct ctl_softc *softc = (struct ctl_softc *)arg;
 	struct ctl_be_lun *be_lun;
 
 	CTL_DEBUG_PRINT(("ctl_lun_thread starting\n"));
 
 	while (!softc->shutdown) {
 		mtx_lock(&softc->ctl_lock);
 		be_lun = STAILQ_FIRST(&softc->pending_lun_queue);
 		if (be_lun != NULL) {
 			STAILQ_REMOVE_HEAD(&softc->pending_lun_queue, links);
 			mtx_unlock(&softc->ctl_lock);
 			ctl_create_lun(be_lun);
 			continue;
 		}
 
 		/* Sleep until we have something to do. */
 		mtx_sleep(&softc->pending_lun_queue, &softc->ctl_lock,
 		    PDROP | PRIBIO, "-", 0);
 	}
 	softc->lun_thread = NULL;
 	kthread_exit();
 }
 
 static void
 ctl_thresh_thread(void *arg)
 {
 	struct ctl_softc *softc = (struct ctl_softc *)arg;
 	struct ctl_lun *lun;
 	struct ctl_logical_block_provisioning_page *page;
 	const char *attr;
 	union ctl_ha_msg msg;
 	uint64_t thres, val;
 	int i, e, set;
 
 	CTL_DEBUG_PRINT(("ctl_thresh_thread starting\n"));
 
 	while (!softc->shutdown) {
 		mtx_lock(&softc->ctl_lock);
 		STAILQ_FOREACH(lun, &softc->lun_list, links) {
 			if ((lun->flags & CTL_LUN_DISABLED) ||
 			    (lun->flags & CTL_LUN_NO_MEDIA) ||
 			    lun->backend->lun_attr == NULL)
 				continue;
 			if ((lun->flags & CTL_LUN_PRIMARY_SC) == 0 &&
 			    softc->ha_mode == CTL_HA_MODE_XFER)
 				continue;
 			if ((lun->MODE_RWER.byte8 & SMS_RWER_LBPERE) == 0)
 				continue;
 			e = 0;
 			page = &lun->MODE_LBP;
 			for (i = 0; i < CTL_NUM_LBP_THRESH; i++) {
 				if ((page->descr[i].flags & SLBPPD_ENABLED) == 0)
 					continue;
 				thres = scsi_4btoul(page->descr[i].count);
 				thres <<= CTL_LBP_EXPONENT;
 				switch (page->descr[i].resource) {
 				case 0x01:
 					attr = "blocksavail";
 					break;
 				case 0x02:
 					attr = "blocksused";
 					break;
 				case 0xf1:
 					attr = "poolblocksavail";
 					break;
 				case 0xf2:
 					attr = "poolblocksused";
 					break;
 				default:
 					continue;
 				}
 				mtx_unlock(&softc->ctl_lock); // XXX
 				val = lun->backend->lun_attr(
 				    lun->be_lun->be_lun, attr);
 				mtx_lock(&softc->ctl_lock);
 				if (val == UINT64_MAX)
 					continue;
 				if ((page->descr[i].flags & SLBPPD_ARMING_MASK)
 				    == SLBPPD_ARMING_INC)
 					e = (val >= thres);
 				else
 					e = (val <= thres);
 				if (e)
 					break;
 			}
 			mtx_lock(&lun->lun_lock);
 			if (e) {
 				scsi_u64to8b((uint8_t *)&page->descr[i] -
 				    (uint8_t *)page, lun->ua_tpt_info);
 				if (lun->lasttpt == 0 ||
 				    time_uptime - lun->lasttpt >= CTL_LBP_UA_PERIOD) {
 					lun->lasttpt = time_uptime;
 					ctl_est_ua_all(lun, -1, CTL_UA_THIN_PROV_THRES);
 					set = 1;
 				} else
 					set = 0;
 			} else {
 				lun->lasttpt = 0;
 				ctl_clr_ua_all(lun, -1, CTL_UA_THIN_PROV_THRES);
 				set = -1;
 			}
 			mtx_unlock(&lun->lun_lock);
 			if (set != 0 &&
 			    lun->ctl_softc->ha_mode == CTL_HA_MODE_XFER) {
 				/* Send msg to other side. */
 				bzero(&msg.ua, sizeof(msg.ua));
 				msg.hdr.msg_type = CTL_MSG_UA;
 				msg.hdr.nexus.initid = -1;
 				msg.hdr.nexus.targ_port = -1;
 				msg.hdr.nexus.targ_lun = lun->lun;
 				msg.hdr.nexus.targ_mapped_lun = lun->lun;
 				msg.ua.ua_all = 1;
 				msg.ua.ua_set = (set > 0);
 				msg.ua.ua_type = CTL_UA_THIN_PROV_THRES;
 				memcpy(msg.ua.ua_info, lun->ua_tpt_info, 8);
 				mtx_unlock(&softc->ctl_lock); // XXX
 				ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg,
 				    sizeof(msg.ua), M_WAITOK);
 				mtx_lock(&softc->ctl_lock);
 			}
 		}
 		mtx_sleep(&softc->thresh_thread, &softc->ctl_lock,
 		    PDROP | PRIBIO, "-", CTL_LBP_PERIOD * hz);
 	}
 	softc->thresh_thread = NULL;
 	kthread_exit();
 }
 
 static void
 ctl_enqueue_incoming(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_thread *thr;
 	u_int idx;
 
 	idx = (io->io_hdr.nexus.targ_port * 127 +
 	       io->io_hdr.nexus.initid) % worker_threads;
 	thr = &softc->threads[idx];
 	mtx_lock(&thr->queue_lock);
 	STAILQ_INSERT_TAIL(&thr->incoming_queue, &io->io_hdr, links);
 	mtx_unlock(&thr->queue_lock);
 	wakeup(thr);
 }
 
 static void
 ctl_enqueue_rtr(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_thread *thr;
 
 	thr = &softc->threads[io->io_hdr.nexus.targ_mapped_lun % worker_threads];
 	mtx_lock(&thr->queue_lock);
 	STAILQ_INSERT_TAIL(&thr->rtr_queue, &io->io_hdr, links);
 	mtx_unlock(&thr->queue_lock);
 	wakeup(thr);
 }
 
 static void
 ctl_enqueue_done(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_thread *thr;
 
 	thr = &softc->threads[io->io_hdr.nexus.targ_mapped_lun % worker_threads];
 	mtx_lock(&thr->queue_lock);
 	STAILQ_INSERT_TAIL(&thr->done_queue, &io->io_hdr, links);
 	mtx_unlock(&thr->queue_lock);
 	wakeup(thr);
 }
 
 static void
 ctl_enqueue_isc(union ctl_io *io)
 {
 	struct ctl_softc *softc = CTL_SOFTC(io);
 	struct ctl_thread *thr;
 
 	thr = &softc->threads[io->io_hdr.nexus.targ_mapped_lun % worker_threads];
 	mtx_lock(&thr->queue_lock);
 	STAILQ_INSERT_TAIL(&thr->isc_queue, &io->io_hdr, links);
 	mtx_unlock(&thr->queue_lock);
 	wakeup(thr);
 }
 
 /*
  *  vim: ts=8
  */
Index: projects/clang400-import/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
===================================================================
--- projects/clang400-import/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c	(revision 314522)
+++ projects/clang400-import/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c	(revision 314523)
@@ -1,3234 +1,3243 @@
 /*
  * CDDL HEADER START
  *
  * The contents of this file are subject to the terms of the
  * Common Development and Distribution License (the "License").
  * You may not use this file except in compliance with the License.
  *
  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
  * or http://www.opensolaris.org/os/licensing.
  * See the License for the specific language governing permissions
  * and limitations under the License.
  *
  * When distributing Covered Code, include this CDDL HEADER in each
  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  * If applicable, add the following below this CDDL HEADER, with the
  * fields enclosed by brackets "[]" replaced with your own identifying
  * information: Portions Copyright [yyyy] [name of copyright owner]
  *
  * CDDL HEADER END
  */
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  *
  * Copyright (c) 2006-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>
  * All rights reserved.
  *
  * Portions Copyright 2010 Robert Milkowski
  *
  * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
  * Copyright (c) 2012, 2016 by Delphix. All rights reserved.
  * Copyright (c) 2013, Joyent, Inc. All rights reserved.
  * Copyright (c) 2014 Integros [integros.com]
  */
 
 /* Portions Copyright 2011 Martin Matuska <mm@FreeBSD.org> */
 
 /*
  * ZFS volume emulation driver.
  *
  * Makes a DMU object look like a volume of arbitrary size, up to 2^64 bytes.
  * Volumes are accessed through the symbolic links named:
  *
  * /dev/zvol/dsk/<pool_name>/<dataset_name>
  * /dev/zvol/rdsk/<pool_name>/<dataset_name>
  *
  * These links are created by the /dev filesystem (sdev_zvolops.c).
  * Volumes are persistent through reboot.  No user command needs to be
  * run before opening and using a device.
  *
  * FreeBSD notes.
  * On FreeBSD ZVOLs are simply GEOM providers like any other storage device
  * in the system.
  */
 
 #include <sys/types.h>
 #include <sys/param.h>
 #include <sys/kernel.h>
 #include <sys/errno.h>
 #include <sys/uio.h>
 #include <sys/bio.h>
 #include <sys/buf.h>
 #include <sys/kmem.h>
 #include <sys/conf.h>
 #include <sys/cmn_err.h>
 #include <sys/stat.h>
 #include <sys/zap.h>
 #include <sys/spa.h>
 #include <sys/spa_impl.h>
 #include <sys/zio.h>
 #include <sys/disk.h>
 #include <sys/dmu_traverse.h>
 #include <sys/dnode.h>
 #include <sys/dsl_dataset.h>
 #include <sys/dsl_prop.h>
 #include <sys/dkio.h>
 #include <sys/byteorder.h>
 #include <sys/sunddi.h>
 #include <sys/dirent.h>
 #include <sys/policy.h>
 #include <sys/queue.h>
 #include <sys/fs/zfs.h>
 #include <sys/zfs_ioctl.h>
 #include <sys/zil.h>
 #include <sys/refcount.h>
 #include <sys/zfs_znode.h>
 #include <sys/zfs_rlock.h>
 #include <sys/vdev_impl.h>
 #include <sys/vdev_raidz.h>
 #include <sys/zvol.h>
 #include <sys/zil_impl.h>
 #include <sys/dbuf.h>
 #include <sys/dmu_tx.h>
 #include <sys/zfeature.h>
 #include <sys/zio_checksum.h>
 #include <sys/filio.h>
 
 #include <geom/geom.h>
 
 #include "zfs_namecheck.h"
 
 #ifndef illumos
 struct g_class zfs_zvol_class = {
 	.name = "ZFS::ZVOL",
 	.version = G_VERSION,
 };
 
 DECLARE_GEOM_CLASS(zfs_zvol_class, zfs_zvol);
 
 #endif
 void *zfsdev_state;
 static char *zvol_tag = "zvol_tag";
 
 #define	ZVOL_DUMPSIZE		"dumpsize"
 
 /*
  * This lock protects the zfsdev_state structure from being modified
  * while it's being used, e.g. an open that comes in before a create
  * finishes.  It also protects temporary opens of the dataset so that,
  * e.g., an open doesn't get a spurious EBUSY.
  */
 #ifdef illumos
 kmutex_t zfsdev_state_lock;
 #else
 /*
  * In FreeBSD we've replaced the upstream zfsdev_state_lock with the
  * spa_namespace_lock in the ZVOL code.
  */
 #define zfsdev_state_lock spa_namespace_lock
 #endif
 static uint32_t zvol_minors;
 
 #ifndef illumos
 SYSCTL_DECL(_vfs_zfs);
 SYSCTL_NODE(_vfs_zfs, OID_AUTO, vol, CTLFLAG_RW, 0, "ZFS VOLUME");
 static int	volmode = ZFS_VOLMODE_GEOM;
 SYSCTL_INT(_vfs_zfs_vol, OID_AUTO, mode, CTLFLAG_RWTUN, &volmode, 0,
     "Expose as GEOM providers (1), device files (2) or neither");
 static boolean_t zpool_on_zvol = B_FALSE;
 SYSCTL_INT(_vfs_zfs_vol, OID_AUTO, recursive, CTLFLAG_RWTUN, &zpool_on_zvol, 0,
     "Allow zpools to use zvols as vdevs (DANGEROUS)");
 
 #endif
 typedef struct zvol_extent {
 	list_node_t	ze_node;
 	dva_t		ze_dva;		/* dva associated with this extent */
 	uint64_t	ze_nblks;	/* number of blocks in extent */
 } zvol_extent_t;
 
 /*
  * The in-core state of each volume.
  */
 typedef struct zvol_state {
 #ifndef illumos
 	LIST_ENTRY(zvol_state)	zv_links;
 #endif
 	char		zv_name[MAXPATHLEN]; /* pool/dd name */
 	uint64_t	zv_volsize;	/* amount of space we advertise */
 	uint64_t	zv_volblocksize; /* volume block size */
 #ifdef illumos
 	minor_t		zv_minor;	/* minor number */
 #else
 	struct cdev	*zv_dev;	/* non-GEOM device */
 	struct g_provider *zv_provider;	/* GEOM provider */
 #endif
 	uint8_t		zv_min_bs;	/* minimum addressable block shift */
 	uint8_t		zv_flags;	/* readonly, dumpified, etc. */
 	objset_t	*zv_objset;	/* objset handle */
 #ifdef illumos
 	uint32_t	zv_open_count[OTYPCNT];	/* open counts */
 #endif
 	uint32_t	zv_total_opens;	/* total open count */
 	uint32_t	zv_sync_cnt;	/* synchronous open count */
 	zilog_t		*zv_zilog;	/* ZIL handle */
 	list_t		zv_extents;	/* List of extents for dump */
 	znode_t		zv_znode;	/* for range locking */
 	dmu_buf_t	*zv_dbuf;	/* bonus handle */
 #ifndef illumos
 	int		zv_state;
 	int		zv_volmode;	/* Provide GEOM or cdev */
 	struct bio_queue_head zv_queue;
 	struct mtx	zv_queue_mtx;	/* zv_queue mutex */
 #endif
 } zvol_state_t;
 
 #ifndef illumos
 static LIST_HEAD(, zvol_state) all_zvols;
 #endif
 /*
  * zvol specific flags
  */
 #define	ZVOL_RDONLY	0x1
 #define	ZVOL_DUMPIFIED	0x2
 #define	ZVOL_EXCL	0x4
 #define	ZVOL_WCE	0x8
 
 /*
  * zvol maximum transfer in one DMU tx.
  */
 int zvol_maxphys = DMU_MAX_ACCESS/2;
 
 /*
  * Toggle unmap functionality.
  */
 boolean_t zvol_unmap_enabled = B_TRUE;
 
 /*
  * If true, unmaps requested as synchronous are executed synchronously,
  * otherwise all unmaps are asynchronous.
  */
 boolean_t zvol_unmap_sync_enabled = B_FALSE;
 
 #ifndef illumos
 SYSCTL_INT(_vfs_zfs_vol, OID_AUTO, unmap_enabled, CTLFLAG_RWTUN,
     &zvol_unmap_enabled, 0,
     "Enable UNMAP functionality");
 
 SYSCTL_INT(_vfs_zfs_vol, OID_AUTO, unmap_sync_enabled, CTLFLAG_RWTUN,
     &zvol_unmap_sync_enabled, 0,
     "UNMAPs requested as sync are executed synchronously");
 
 static d_open_t		zvol_d_open;
 static d_close_t	zvol_d_close;
 static d_read_t		zvol_read;
 static d_write_t	zvol_write;
 static d_ioctl_t	zvol_d_ioctl;
 static d_strategy_t	zvol_strategy;
 
 static struct cdevsw zvol_cdevsw = {
 	.d_version =	D_VERSION,
 	.d_open =	zvol_d_open,
 	.d_close =	zvol_d_close,
 	.d_read =	zvol_read,
 	.d_write =	zvol_write,
 	.d_ioctl =	zvol_d_ioctl,
 	.d_strategy =	zvol_strategy,
 	.d_name =	"zvol",
 	.d_flags =	D_DISK | D_TRACKCLOSE,
 };
 
 static void zvol_geom_run(zvol_state_t *zv);
 static void zvol_geom_destroy(zvol_state_t *zv);
 static int zvol_geom_access(struct g_provider *pp, int acr, int acw, int ace);
 static void zvol_geom_start(struct bio *bp);
 static void zvol_geom_worker(void *arg);
 static void zvol_log_truncate(zvol_state_t *zv, dmu_tx_t *tx, uint64_t off,
     uint64_t len, boolean_t sync);
 #endif	/* !illumos */
 
 extern int zfs_set_prop_nvlist(const char *, zprop_source_t,
     nvlist_t *, nvlist_t *);
 static int zvol_remove_zv(zvol_state_t *);
 static int zvol_get_data(void *arg, lr_write_t *lr, char *buf, zio_t *zio);
 static int zvol_dumpify(zvol_state_t *zv);
 static int zvol_dump_fini(zvol_state_t *zv);
 static int zvol_dump_init(zvol_state_t *zv, boolean_t resize);
 
 static void
 zvol_size_changed(zvol_state_t *zv, uint64_t volsize)
 {
 #ifdef illumos
 	dev_t dev = makedevice(ddi_driver_major(zfs_dip), zv->zv_minor);
 
 	zv->zv_volsize = volsize;
 	VERIFY(ddi_prop_update_int64(dev, zfs_dip,
 	    "Size", volsize) == DDI_SUCCESS);
 	VERIFY(ddi_prop_update_int64(dev, zfs_dip,
 	    "Nblocks", lbtodb(volsize)) == DDI_SUCCESS);
 
 	/* Notify specfs to invalidate the cached size */
 	spec_size_invalidate(dev, VBLK);
 	spec_size_invalidate(dev, VCHR);
 #else	/* !illumos */
 	zv->zv_volsize = volsize;
 	if (zv->zv_volmode == ZFS_VOLMODE_GEOM) {
 		struct g_provider *pp;
 
 		pp = zv->zv_provider;
 		if (pp == NULL)
 			return;
 		g_topology_lock();
-		g_resize_provider(pp, zv->zv_volsize);
+
+		/*
+		 * Do not invoke resize event when initial size was zero.
+		 * ZVOL initializes the size on first open, this is not
+		 * real resizing.
+		 */
+		if (pp->mediasize == 0)
+			pp->mediasize = zv->zv_volsize;
+		else
+			g_resize_provider(pp, zv->zv_volsize);
 		g_topology_unlock();
 	}
 #endif	/* illumos */
 }
 
 int
 zvol_check_volsize(uint64_t volsize, uint64_t blocksize)
 {
 	if (volsize == 0)
 		return (SET_ERROR(EINVAL));
 
 	if (volsize % blocksize != 0)
 		return (SET_ERROR(EINVAL));
 
 #ifdef _ILP32
 	if (volsize - 1 > SPEC_MAXOFFSET_T)
 		return (SET_ERROR(EOVERFLOW));
 #endif
 	return (0);
 }
 
 int
 zvol_check_volblocksize(uint64_t volblocksize)
 {
 	if (volblocksize < SPA_MINBLOCKSIZE ||
 	    volblocksize > SPA_OLD_MAXBLOCKSIZE ||
 	    !ISP2(volblocksize))
 		return (SET_ERROR(EDOM));
 
 	return (0);
 }
 
 int
 zvol_get_stats(objset_t *os, nvlist_t *nv)
 {
 	int error;
 	dmu_object_info_t doi;
 	uint64_t val;
 
 	error = zap_lookup(os, ZVOL_ZAP_OBJ, "size", 8, 1, &val);
 	if (error)
 		return (error);
 
 	dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_VOLSIZE, val);
 
 	error = dmu_object_info(os, ZVOL_OBJ, &doi);
 
 	if (error == 0) {
 		dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_VOLBLOCKSIZE,
 		    doi.doi_data_block_size);
 	}
 
 	return (error);
 }
 
 static zvol_state_t *
 zvol_minor_lookup(const char *name)
 {
 #ifdef illumos
 	minor_t minor;
 #endif
 	zvol_state_t *zv;
 
 	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
 
 #ifdef illumos
 	for (minor = 1; minor <= ZFSDEV_MAX_MINOR; minor++) {
 		zv = zfsdev_get_soft_state(minor, ZSST_ZVOL);
 		if (zv == NULL)
 			continue;
 #else
 	LIST_FOREACH(zv, &all_zvols, zv_links) {
 #endif
 		if (strcmp(zv->zv_name, name) == 0)
 			return (zv);
 	}
 
 	return (NULL);
 }
 
 /* extent mapping arg */
 struct maparg {
 	zvol_state_t	*ma_zv;
 	uint64_t	ma_blks;
 };
 
 /*ARGSUSED*/
 static int
 zvol_map_block(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
     const zbookmark_phys_t *zb, const dnode_phys_t *dnp, void *arg)
 {
 	struct maparg *ma = arg;
 	zvol_extent_t *ze;
 	int bs = ma->ma_zv->zv_volblocksize;
 
 	if (bp == NULL || BP_IS_HOLE(bp) ||
 	    zb->zb_object != ZVOL_OBJ || zb->zb_level != 0)
 		return (0);
 
 	VERIFY(!BP_IS_EMBEDDED(bp));
 
 	VERIFY3U(ma->ma_blks, ==, zb->zb_blkid);
 	ma->ma_blks++;
 
 	/* Abort immediately if we have encountered gang blocks */
 	if (BP_IS_GANG(bp))
 		return (SET_ERROR(EFRAGS));
 
 	/*
 	 * See if the block is at the end of the previous extent.
 	 */
 	ze = list_tail(&ma->ma_zv->zv_extents);
 	if (ze &&
 	    DVA_GET_VDEV(BP_IDENTITY(bp)) == DVA_GET_VDEV(&ze->ze_dva) &&
 	    DVA_GET_OFFSET(BP_IDENTITY(bp)) ==
 	    DVA_GET_OFFSET(&ze->ze_dva) + ze->ze_nblks * bs) {
 		ze->ze_nblks++;
 		return (0);
 	}
 
 	dprintf_bp(bp, "%s", "next blkptr:");
 
 	/* start a new extent */
 	ze = kmem_zalloc(sizeof (zvol_extent_t), KM_SLEEP);
 	ze->ze_dva = bp->blk_dva[0];	/* structure assignment */
 	ze->ze_nblks = 1;
 	list_insert_tail(&ma->ma_zv->zv_extents, ze);
 	return (0);
 }
 
 static void
 zvol_free_extents(zvol_state_t *zv)
 {
 	zvol_extent_t *ze;
 
 	while (ze = list_head(&zv->zv_extents)) {
 		list_remove(&zv->zv_extents, ze);
 		kmem_free(ze, sizeof (zvol_extent_t));
 	}
 }
 
 static int
 zvol_get_lbas(zvol_state_t *zv)
 {
 	objset_t *os = zv->zv_objset;
 	struct maparg	ma;
 	int		err;
 
 	ma.ma_zv = zv;
 	ma.ma_blks = 0;
 	zvol_free_extents(zv);
 
 	/* commit any in-flight changes before traversing the dataset */
 	txg_wait_synced(dmu_objset_pool(os), 0);
 	err = traverse_dataset(dmu_objset_ds(os), 0,
 	    TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA, zvol_map_block, &ma);
 	if (err || ma.ma_blks != (zv->zv_volsize / zv->zv_volblocksize)) {
 		zvol_free_extents(zv);
 		return (err ? err : EIO);
 	}
 
 	return (0);
 }
 
 /* ARGSUSED */
 void
 zvol_create_cb(objset_t *os, void *arg, cred_t *cr, dmu_tx_t *tx)
 {
 	zfs_creat_t *zct = arg;
 	nvlist_t *nvprops = zct->zct_props;
 	int error;
 	uint64_t volblocksize, volsize;
 
 	VERIFY(nvlist_lookup_uint64(nvprops,
 	    zfs_prop_to_name(ZFS_PROP_VOLSIZE), &volsize) == 0);
 	if (nvlist_lookup_uint64(nvprops,
 	    zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), &volblocksize) != 0)
 		volblocksize = zfs_prop_default_numeric(ZFS_PROP_VOLBLOCKSIZE);
 
 	/*
 	 * These properties must be removed from the list so the generic
 	 * property setting step won't apply to them.
 	 */
 	VERIFY(nvlist_remove_all(nvprops,
 	    zfs_prop_to_name(ZFS_PROP_VOLSIZE)) == 0);
 	(void) nvlist_remove_all(nvprops,
 	    zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE));
 
 	error = dmu_object_claim(os, ZVOL_OBJ, DMU_OT_ZVOL, volblocksize,
 	    DMU_OT_NONE, 0, tx);
 	ASSERT(error == 0);
 
 	error = zap_create_claim(os, ZVOL_ZAP_OBJ, DMU_OT_ZVOL_PROP,
 	    DMU_OT_NONE, 0, tx);
 	ASSERT(error == 0);
 
 	error = zap_update(os, ZVOL_ZAP_OBJ, "size", 8, 1, &volsize, tx);
 	ASSERT(error == 0);
 }
 
 /*
  * Replay a TX_TRUNCATE ZIL transaction if asked.  TX_TRUNCATE is how we
  * implement DKIOCFREE/free-long-range.
  */
 static int
 zvol_replay_truncate(zvol_state_t *zv, lr_truncate_t *lr, boolean_t byteswap)
 {
 	uint64_t offset, length;
 
 	if (byteswap)
 		byteswap_uint64_array(lr, sizeof (*lr));
 
 	offset = lr->lr_offset;
 	length = lr->lr_length;
 
 	return (dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, offset, length));
 }
 
 /*
  * Replay a TX_WRITE ZIL transaction that didn't get committed
  * after a system failure
  */
 static int
 zvol_replay_write(zvol_state_t *zv, lr_write_t *lr, boolean_t byteswap)
 {
 	objset_t *os = zv->zv_objset;
 	char *data = (char *)(lr + 1);	/* data follows lr_write_t */
 	uint64_t offset, length;
 	dmu_tx_t *tx;
 	int error;
 
 	if (byteswap)
 		byteswap_uint64_array(lr, sizeof (*lr));
 
 	offset = lr->lr_offset;
 	length = lr->lr_length;
 
 	/* If it's a dmu_sync() block, write the whole block */
 	if (lr->lr_common.lrc_reclen == sizeof (lr_write_t)) {
 		uint64_t blocksize = BP_GET_LSIZE(&lr->lr_blkptr);
 		if (length < blocksize) {
 			offset -= offset % blocksize;
 			length = blocksize;
 		}
 	}
 
 	tx = dmu_tx_create(os);
 	dmu_tx_hold_write(tx, ZVOL_OBJ, offset, length);
 	error = dmu_tx_assign(tx, TXG_WAIT);
 	if (error) {
 		dmu_tx_abort(tx);
 	} else {
 		dmu_write(os, ZVOL_OBJ, offset, length, data, tx);
 		dmu_tx_commit(tx);
 	}
 
 	return (error);
 }
 
 /* ARGSUSED */
 static int
 zvol_replay_err(zvol_state_t *zv, lr_t *lr, boolean_t byteswap)
 {
 	return (SET_ERROR(ENOTSUP));
 }
 
 /*
  * Callback vectors for replaying records.
  * Only TX_WRITE and TX_TRUNCATE are needed for zvol.
  */
 zil_replay_func_t *zvol_replay_vector[TX_MAX_TYPE] = {
 	zvol_replay_err,	/* 0 no such transaction type */
 	zvol_replay_err,	/* TX_CREATE */
 	zvol_replay_err,	/* TX_MKDIR */
 	zvol_replay_err,	/* TX_MKXATTR */
 	zvol_replay_err,	/* TX_SYMLINK */
 	zvol_replay_err,	/* TX_REMOVE */
 	zvol_replay_err,	/* TX_RMDIR */
 	zvol_replay_err,	/* TX_LINK */
 	zvol_replay_err,	/* TX_RENAME */
 	zvol_replay_write,	/* TX_WRITE */
 	zvol_replay_truncate,	/* TX_TRUNCATE */
 	zvol_replay_err,	/* TX_SETATTR */
 	zvol_replay_err,	/* TX_ACL */
 	zvol_replay_err,	/* TX_CREATE_ACL */
 	zvol_replay_err,	/* TX_CREATE_ATTR */
 	zvol_replay_err,	/* TX_CREATE_ACL_ATTR */
 	zvol_replay_err,	/* TX_MKDIR_ACL */
 	zvol_replay_err,	/* TX_MKDIR_ATTR */
 	zvol_replay_err,	/* TX_MKDIR_ACL_ATTR */
 	zvol_replay_err,	/* TX_WRITE2 */
 };
 
 #ifdef illumos
 int
 zvol_name2minor(const char *name, minor_t *minor)
 {
 	zvol_state_t *zv;
 
 	mutex_enter(&zfsdev_state_lock);
 	zv = zvol_minor_lookup(name);
 	if (minor && zv)
 		*minor = zv->zv_minor;
 	mutex_exit(&zfsdev_state_lock);
 	return (zv ? 0 : -1);
 }
 #endif	/* illumos */
 
 /*
  * Create a minor node (plus a whole lot more) for the specified volume.
  */
 int
 zvol_create_minor(const char *name)
 {
 	zfs_soft_state_t *zs;
 	zvol_state_t *zv;
 	objset_t *os;
 #ifdef illumos
 	dmu_object_info_t doi;
 	minor_t minor = 0;
 	char chrbuf[30], blkbuf[30];
 #else
 	struct g_provider *pp;
 	struct g_geom *gp;
 	uint64_t mode;
 #endif
 	int error;
 
 #ifndef illumos
 	ZFS_LOG(1, "Creating ZVOL %s...", name);
 #endif
 
 	mutex_enter(&zfsdev_state_lock);
 
 	if (zvol_minor_lookup(name) != NULL) {
 		mutex_exit(&zfsdev_state_lock);
 		return (SET_ERROR(EEXIST));
 	}
 
 	/* lie and say we're read-only */
 	error = dmu_objset_own(name, DMU_OST_ZVOL, B_TRUE, FTAG, &os);
 
 	if (error) {
 		mutex_exit(&zfsdev_state_lock);
 		return (error);
 	}
 
 #ifdef illumos
 	if ((minor = zfsdev_minor_alloc()) == 0) {
 		dmu_objset_disown(os, FTAG);
 		mutex_exit(&zfsdev_state_lock);
 		return (SET_ERROR(ENXIO));
 	}
 
 	if (ddi_soft_state_zalloc(zfsdev_state, minor) != DDI_SUCCESS) {
 		dmu_objset_disown(os, FTAG);
 		mutex_exit(&zfsdev_state_lock);
 		return (SET_ERROR(EAGAIN));
 	}
 	(void) ddi_prop_update_string(minor, zfs_dip, ZVOL_PROP_NAME,
 	    (char *)name);
 
 	(void) snprintf(chrbuf, sizeof (chrbuf), "%u,raw", minor);
 
 	if (ddi_create_minor_node(zfs_dip, chrbuf, S_IFCHR,
 	    minor, DDI_PSEUDO, 0) == DDI_FAILURE) {
 		ddi_soft_state_free(zfsdev_state, minor);
 		dmu_objset_disown(os, FTAG);
 		mutex_exit(&zfsdev_state_lock);
 		return (SET_ERROR(EAGAIN));
 	}
 
 	(void) snprintf(blkbuf, sizeof (blkbuf), "%u", minor);
 
 	if (ddi_create_minor_node(zfs_dip, blkbuf, S_IFBLK,
 	    minor, DDI_PSEUDO, 0) == DDI_FAILURE) {
 		ddi_remove_minor_node(zfs_dip, chrbuf);
 		ddi_soft_state_free(zfsdev_state, minor);
 		dmu_objset_disown(os, FTAG);
 		mutex_exit(&zfsdev_state_lock);
 		return (SET_ERROR(EAGAIN));
 	}
 
 	zs = ddi_get_soft_state(zfsdev_state, minor);
 	zs->zss_type = ZSST_ZVOL;
 	zv = zs->zss_data = kmem_zalloc(sizeof (zvol_state_t), KM_SLEEP);
 #else	/* !illumos */
 
 	zv = kmem_zalloc(sizeof(*zv), KM_SLEEP);
 	zv->zv_state = 0;
 	error = dsl_prop_get_integer(name,
 	    zfs_prop_to_name(ZFS_PROP_VOLMODE), &mode, NULL);
 	if (error != 0 || mode == ZFS_VOLMODE_DEFAULT)
 		mode = volmode;
 
 	DROP_GIANT();
 	zv->zv_volmode = mode;
 	if (zv->zv_volmode == ZFS_VOLMODE_GEOM) {
 		g_topology_lock();
 		gp = g_new_geomf(&zfs_zvol_class, "zfs::zvol::%s", name);
 		gp->start = zvol_geom_start;
 		gp->access = zvol_geom_access;
 		pp = g_new_providerf(gp, "%s/%s", ZVOL_DRIVER, name);
 		pp->flags |= G_PF_DIRECT_RECEIVE | G_PF_DIRECT_SEND;
 		pp->sectorsize = DEV_BSIZE;
 		pp->mediasize = 0;
 		pp->private = zv;
 
 		zv->zv_provider = pp;
 		bioq_init(&zv->zv_queue);
 		mtx_init(&zv->zv_queue_mtx, "zvol", NULL, MTX_DEF);
 	} else if (zv->zv_volmode == ZFS_VOLMODE_DEV) {
 		struct make_dev_args args;
 
 		make_dev_args_init(&args);
 		args.mda_flags = MAKEDEV_CHECKNAME | MAKEDEV_WAITOK;
 		args.mda_devsw = &zvol_cdevsw;
 		args.mda_cr = NULL;
 		args.mda_uid = UID_ROOT;
 		args.mda_gid = GID_OPERATOR;
 		args.mda_mode = 0640;
 		args.mda_si_drv2 = zv;
 		error = make_dev_s(&args, &zv->zv_dev,
 		    "%s/%s", ZVOL_DRIVER, name);
 		if (error != 0) {
 			kmem_free(zv, sizeof(*zv));
 			dmu_objset_disown(os, FTAG);
 			mutex_exit(&zfsdev_state_lock);
 			return (error);
 		}
 		zv->zv_dev->si_iosize_max = MAXPHYS;
 	}
 	LIST_INSERT_HEAD(&all_zvols, zv, zv_links);
 #endif	/* illumos */
 
 	(void) strlcpy(zv->zv_name, name, MAXPATHLEN);
 	zv->zv_min_bs = DEV_BSHIFT;
 #ifdef illumos
 	zv->zv_minor = minor;
 #endif
 	zv->zv_objset = os;
 	if (dmu_objset_is_snapshot(os) || !spa_writeable(dmu_objset_spa(os)))
 		zv->zv_flags |= ZVOL_RDONLY;
 	mutex_init(&zv->zv_znode.z_range_lock, NULL, MUTEX_DEFAULT, NULL);
 	avl_create(&zv->zv_znode.z_range_avl, zfs_range_compare,
 	    sizeof (rl_t), offsetof(rl_t, r_node));
 	list_create(&zv->zv_extents, sizeof (zvol_extent_t),
 	    offsetof(zvol_extent_t, ze_node));
 #ifdef illumos
 	/* get and cache the blocksize */
 	error = dmu_object_info(os, ZVOL_OBJ, &doi);
 	ASSERT(error == 0);
 	zv->zv_volblocksize = doi.doi_data_block_size;
 #endif
 
 	if (spa_writeable(dmu_objset_spa(os))) {
 		if (zil_replay_disable)
 			zil_destroy(dmu_objset_zil(os), B_FALSE);
 		else
 			zil_replay(os, zv, zvol_replay_vector);
 	}
 	dmu_objset_disown(os, FTAG);
 	zv->zv_objset = NULL;
 
 	zvol_minors++;
 
 	mutex_exit(&zfsdev_state_lock);
 #ifndef illumos
 	if (zv->zv_volmode == ZFS_VOLMODE_GEOM) {
 		zvol_geom_run(zv);
 		g_topology_unlock();
 	}
 	PICKUP_GIANT();
 
 	ZFS_LOG(1, "ZVOL %s created.", name);
 #endif
 
 	return (0);
 }
 
 /*
  * Remove minor node for the specified volume.
  */
 static int
 zvol_remove_zv(zvol_state_t *zv)
 {
 #ifdef illumos
 	char nmbuf[20];
 	minor_t minor = zv->zv_minor;
 #endif
 
 	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
 	if (zv->zv_total_opens != 0)
 		return (SET_ERROR(EBUSY));
 
 #ifdef illumos
 	(void) snprintf(nmbuf, sizeof (nmbuf), "%u,raw", minor);
 	ddi_remove_minor_node(zfs_dip, nmbuf);
 
 	(void) snprintf(nmbuf, sizeof (nmbuf), "%u", minor);
 	ddi_remove_minor_node(zfs_dip, nmbuf);
 #else
 	ZFS_LOG(1, "ZVOL %s destroyed.", zv->zv_name);
 
 	LIST_REMOVE(zv, zv_links);
 	if (zv->zv_volmode == ZFS_VOLMODE_GEOM) {
 		g_topology_lock();
 		zvol_geom_destroy(zv);
 		g_topology_unlock();
 	} else if (zv->zv_volmode == ZFS_VOLMODE_DEV) {
 		if (zv->zv_dev != NULL)
 			destroy_dev(zv->zv_dev);
 	}
 #endif
 
 	avl_destroy(&zv->zv_znode.z_range_avl);
 	mutex_destroy(&zv->zv_znode.z_range_lock);
 
 	kmem_free(zv, sizeof (zvol_state_t));
 #ifdef illumos
 	ddi_soft_state_free(zfsdev_state, minor);
 #endif
 	zvol_minors--;
 	return (0);
 }
 
 int
 zvol_remove_minor(const char *name)
 {
 	zvol_state_t *zv;
 	int rc;
 
 	mutex_enter(&zfsdev_state_lock);
 	if ((zv = zvol_minor_lookup(name)) == NULL) {
 		mutex_exit(&zfsdev_state_lock);
 		return (SET_ERROR(ENXIO));
 	}
 	rc = zvol_remove_zv(zv);
 	mutex_exit(&zfsdev_state_lock);
 	return (rc);
 }
 
 int
 zvol_first_open(zvol_state_t *zv)
 {
 	dmu_object_info_t doi;
 	objset_t *os;
 	uint64_t volsize;
 	int error;
 	uint64_t readonly;
 
 	/* lie and say we're read-only */
 	error = dmu_objset_own(zv->zv_name, DMU_OST_ZVOL, B_TRUE,
 	    zvol_tag, &os);
 	if (error)
 		return (error);
 
 	zv->zv_objset = os;
 	error = zap_lookup(os, ZVOL_ZAP_OBJ, "size", 8, 1, &volsize);
 	if (error) {
 		ASSERT(error == 0);
 		dmu_objset_disown(os, zvol_tag);
 		return (error);
 	}
 
 	/* get and cache the blocksize */
 	error = dmu_object_info(os, ZVOL_OBJ, &doi);
 	if (error) {
 		ASSERT(error == 0);
 		dmu_objset_disown(os, zvol_tag);
 		return (error);
 	}
 	zv->zv_volblocksize = doi.doi_data_block_size;
 
 	error = dmu_bonus_hold(os, ZVOL_OBJ, zvol_tag, &zv->zv_dbuf);
 	if (error) {
 		dmu_objset_disown(os, zvol_tag);
 		return (error);
 	}
 
 	zvol_size_changed(zv, volsize);
 	zv->zv_zilog = zil_open(os, zvol_get_data);
 
 	VERIFY(dsl_prop_get_integer(zv->zv_name, "readonly", &readonly,
 	    NULL) == 0);
 	if (readonly || dmu_objset_is_snapshot(os) ||
 	    !spa_writeable(dmu_objset_spa(os)))
 		zv->zv_flags |= ZVOL_RDONLY;
 	else
 		zv->zv_flags &= ~ZVOL_RDONLY;
 	return (error);
 }
 
 void
 zvol_last_close(zvol_state_t *zv)
 {
 	zil_close(zv->zv_zilog);
 	zv->zv_zilog = NULL;
 
 	dmu_buf_rele(zv->zv_dbuf, zvol_tag);
 	zv->zv_dbuf = NULL;
 
 	/*
 	 * Evict cached data
 	 */
 	if (dsl_dataset_is_dirty(dmu_objset_ds(zv->zv_objset)) &&
 	    !(zv->zv_flags & ZVOL_RDONLY))
 		txg_wait_synced(dmu_objset_pool(zv->zv_objset), 0);
 	dmu_objset_evict_dbufs(zv->zv_objset);
 
 	dmu_objset_disown(zv->zv_objset, zvol_tag);
 	zv->zv_objset = NULL;
 }
 
 #ifdef illumos
 int
 zvol_prealloc(zvol_state_t *zv)
 {
 	objset_t *os = zv->zv_objset;
 	dmu_tx_t *tx;
 	uint64_t refd, avail, usedobjs, availobjs;
 	uint64_t resid = zv->zv_volsize;
 	uint64_t off = 0;
 
 	/* Check the space usage before attempting to allocate the space */
 	dmu_objset_space(os, &refd, &avail, &usedobjs, &availobjs);
 	if (avail < zv->zv_volsize)
 		return (SET_ERROR(ENOSPC));
 
 	/* Free old extents if they exist */
 	zvol_free_extents(zv);
 
 	while (resid != 0) {
 		int error;
 		uint64_t bytes = MIN(resid, SPA_OLD_MAXBLOCKSIZE);
 
 		tx = dmu_tx_create(os);
 		dmu_tx_hold_write(tx, ZVOL_OBJ, off, bytes);
 		error = dmu_tx_assign(tx, TXG_WAIT);
 		if (error) {
 			dmu_tx_abort(tx);
 			(void) dmu_free_long_range(os, ZVOL_OBJ, 0, off);
 			return (error);
 		}
 		dmu_prealloc(os, ZVOL_OBJ, off, bytes, tx);
 		dmu_tx_commit(tx);
 		off += bytes;
 		resid -= bytes;
 	}
 	txg_wait_synced(dmu_objset_pool(os), 0);
 
 	return (0);
 }
 #endif	/* illumos */
 
 static int
 zvol_update_volsize(objset_t *os, uint64_t volsize)
 {
 	dmu_tx_t *tx;
 	int error;
 
 	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
 
 	tx = dmu_tx_create(os);
 	dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL);
 	dmu_tx_mark_netfree(tx);
 	error = dmu_tx_assign(tx, TXG_WAIT);
 	if (error) {
 		dmu_tx_abort(tx);
 		return (error);
 	}
 
 	error = zap_update(os, ZVOL_ZAP_OBJ, "size", 8, 1,
 	    &volsize, tx);
 	dmu_tx_commit(tx);
 
 	if (error == 0)
 		error = dmu_free_long_range(os,
 		    ZVOL_OBJ, volsize, DMU_OBJECT_END);
 	return (error);
 }
 
 void
 zvol_remove_minors(const char *name)
 {
 #ifdef illumos
 	zvol_state_t *zv;
 	char *namebuf;
 	minor_t minor;
 
 	namebuf = kmem_zalloc(strlen(name) + 2, KM_SLEEP);
 	(void) strncpy(namebuf, name, strlen(name));
 	(void) strcat(namebuf, "/");
 	mutex_enter(&zfsdev_state_lock);
 	for (minor = 1; minor <= ZFSDEV_MAX_MINOR; minor++) {
 
 		zv = zfsdev_get_soft_state(minor, ZSST_ZVOL);
 		if (zv == NULL)
 			continue;
 		if (strncmp(namebuf, zv->zv_name, strlen(namebuf)) == 0)
 			(void) zvol_remove_zv(zv);
 	}
 	kmem_free(namebuf, strlen(name) + 2);
 
 	mutex_exit(&zfsdev_state_lock);
 #else	/* !illumos */
 	zvol_state_t *zv, *tzv;
 	size_t namelen;
 
 	namelen = strlen(name);
 
 	DROP_GIANT();
 	mutex_enter(&zfsdev_state_lock);
 
 	LIST_FOREACH_SAFE(zv, &all_zvols, zv_links, tzv) {
 		if (strcmp(zv->zv_name, name) == 0 ||
 		    (strncmp(zv->zv_name, name, namelen) == 0 &&
 		    strlen(zv->zv_name) > namelen && (zv->zv_name[namelen] == '/' ||
 		    zv->zv_name[namelen] == '@'))) {
 			(void) zvol_remove_zv(zv);
 		}
 	}
 
 	mutex_exit(&zfsdev_state_lock);
 	PICKUP_GIANT();
 #endif	/* illumos */
 }
 
 static int
 zvol_update_live_volsize(zvol_state_t *zv, uint64_t volsize)
 {
 	uint64_t old_volsize = 0ULL;
 	int error = 0;
 
 	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
 
 	/*
 	 * Reinitialize the dump area to the new size. If we
 	 * failed to resize the dump area then restore it back to
 	 * its original size.  We must set the new volsize prior
 	 * to calling dumpvp_resize() to ensure that the devices'
 	 * size(9P) is not visible by the dump subsystem.
 	 */
 	old_volsize = zv->zv_volsize;
 	zvol_size_changed(zv, volsize);
 
 #ifdef ZVOL_DUMP
 	if (zv->zv_flags & ZVOL_DUMPIFIED) {
 		if ((error = zvol_dumpify(zv)) != 0 ||
 		    (error = dumpvp_resize()) != 0) {
 			int dumpify_error;
 
 			(void) zvol_update_volsize(zv->zv_objset, old_volsize);
 			zvol_size_changed(zv, old_volsize);
 			dumpify_error = zvol_dumpify(zv);
 			error = dumpify_error ? dumpify_error : error;
 		}
 	}
 #endif	/* ZVOL_DUMP */
 
 #ifdef illumos
 	/*
 	 * Generate a LUN expansion event.
 	 */
 	if (error == 0) {
 		sysevent_id_t eid;
 		nvlist_t *attr;
 		char *physpath = kmem_zalloc(MAXPATHLEN, KM_SLEEP);
 
 		(void) snprintf(physpath, MAXPATHLEN, "%s%u", ZVOL_PSEUDO_DEV,
 		    zv->zv_minor);
 
 		VERIFY(nvlist_alloc(&attr, NV_UNIQUE_NAME, KM_SLEEP) == 0);
 		VERIFY(nvlist_add_string(attr, DEV_PHYS_PATH, physpath) == 0);
 
 		(void) ddi_log_sysevent(zfs_dip, SUNW_VENDOR, EC_DEV_STATUS,
 		    ESC_DEV_DLE, attr, &eid, DDI_SLEEP);
 
 		nvlist_free(attr);
 		kmem_free(physpath, MAXPATHLEN);
 	}
 #endif	/* illumos */
 	return (error);
 }
 
 int
 zvol_set_volsize(const char *name, uint64_t volsize)
 {
 	zvol_state_t *zv = NULL;
 	objset_t *os;
 	int error;
 	dmu_object_info_t doi;
 	uint64_t readonly;
 	boolean_t owned = B_FALSE;
 
 	error = dsl_prop_get_integer(name,
 	    zfs_prop_to_name(ZFS_PROP_READONLY), &readonly, NULL);
 	if (error != 0)
 		return (error);
 	if (readonly)
 		return (SET_ERROR(EROFS));
 
 	mutex_enter(&zfsdev_state_lock);
 	zv = zvol_minor_lookup(name);
 
 	if (zv == NULL || zv->zv_objset == NULL) {
 		if ((error = dmu_objset_own(name, DMU_OST_ZVOL, B_FALSE,
 		    FTAG, &os)) != 0) {
 			mutex_exit(&zfsdev_state_lock);
 			return (error);
 		}
 		owned = B_TRUE;
 		if (zv != NULL)
 			zv->zv_objset = os;
 	} else {
 		os = zv->zv_objset;
 	}
 
 	if ((error = dmu_object_info(os, ZVOL_OBJ, &doi)) != 0 ||
 	    (error = zvol_check_volsize(volsize, doi.doi_data_block_size)) != 0)
 		goto out;
 
 	error = zvol_update_volsize(os, volsize);
 
 	if (error == 0 && zv != NULL)
 		error = zvol_update_live_volsize(zv, volsize);
 out:
 	if (owned) {
 		dmu_objset_disown(os, FTAG);
 		if (zv != NULL)
 			zv->zv_objset = NULL;
 	}
 	mutex_exit(&zfsdev_state_lock);
 	return (error);
 }
 
 /*ARGSUSED*/
 #ifdef illumos
 int
 zvol_open(dev_t *devp, int flag, int otyp, cred_t *cr)
 #else
 static int
 zvol_open(struct g_provider *pp, int flag, int count)
 #endif
 {
 	zvol_state_t *zv;
 	int err = 0;
 #ifdef illumos
 
 	mutex_enter(&zfsdev_state_lock);
 
 	zv = zfsdev_get_soft_state(getminor(*devp), ZSST_ZVOL);
 	if (zv == NULL) {
 		mutex_exit(&zfsdev_state_lock);
 		return (SET_ERROR(ENXIO));
 	}
 
 	if (zv->zv_total_opens == 0)
 		err = zvol_first_open(zv);
 	if (err) {
 		mutex_exit(&zfsdev_state_lock);
 		return (err);
 	}
 #else	/* !illumos */
 	boolean_t locked = B_FALSE;
 
 	if (!zpool_on_zvol && tsd_get(zfs_geom_probe_vdev_key) != NULL) {
 		/*
 		 * if zfs_geom_probe_vdev_key is set, that means that zfs is
 		 * attempting to probe geom providers while looking for a
 		 * replacement for a missing VDEV.  In this case, the
 		 * spa_namespace_lock will not be held, but it is still illegal
 		 * to use a zvol as a vdev.  Deadlocks can result if another
 		 * thread has spa_namespace_lock
 		 */
 		return (EOPNOTSUPP);
 	}
 	/*
 	 * Protect against recursively entering spa_namespace_lock
 	 * when spa_open() is used for a pool on a (local) ZVOL(s).
 	 * This is needed since we replaced upstream zfsdev_state_lock
 	 * with spa_namespace_lock in the ZVOL code.
 	 * We are using the same trick as spa_open().
 	 * Note that calls in zvol_first_open which need to resolve
 	 * pool name to a spa object will enter spa_open()
 	 * recursively, but that function already has all the
 	 * necessary protection.
 	 */
 	if (!MUTEX_HELD(&zfsdev_state_lock)) {
 		mutex_enter(&zfsdev_state_lock);
 		locked = B_TRUE;
 	}
 
 	zv = pp->private;
 	if (zv == NULL) {
 		if (locked)
 			mutex_exit(&zfsdev_state_lock);
 		return (SET_ERROR(ENXIO));
 	}
 
 	if (zv->zv_total_opens == 0) {
 		err = zvol_first_open(zv);
 		if (err) {
 			if (locked)
 				mutex_exit(&zfsdev_state_lock);
 			return (err);
 		}
 		pp->mediasize = zv->zv_volsize;
 		pp->stripeoffset = 0;
 		pp->stripesize = zv->zv_volblocksize;
 	}
 #endif	/* illumos */
 	if ((flag & FWRITE) && (zv->zv_flags & ZVOL_RDONLY)) {
 		err = SET_ERROR(EROFS);
 		goto out;
 	}
 	if (zv->zv_flags & ZVOL_EXCL) {
 		err = SET_ERROR(EBUSY);
 		goto out;
 	}
 #ifdef FEXCL
 	if (flag & FEXCL) {
 		if (zv->zv_total_opens != 0) {
 			err = SET_ERROR(EBUSY);
 			goto out;
 		}
 		zv->zv_flags |= ZVOL_EXCL;
 	}
 #endif
 
 #ifdef illumos
 	if (zv->zv_open_count[otyp] == 0 || otyp == OTYP_LYR) {
 		zv->zv_open_count[otyp]++;
 		zv->zv_total_opens++;
 	}
 	mutex_exit(&zfsdev_state_lock);
 #else
 	zv->zv_total_opens += count;
 	if (locked)
 		mutex_exit(&zfsdev_state_lock);
 #endif
 
 	return (err);
 out:
 	if (zv->zv_total_opens == 0)
 		zvol_last_close(zv);
 #ifdef illumos
 	mutex_exit(&zfsdev_state_lock);
 #else
 	if (locked)
 		mutex_exit(&zfsdev_state_lock);
 #endif
 	return (err);
 }
 
 /*ARGSUSED*/
 #ifdef illumos
 int
 zvol_close(dev_t dev, int flag, int otyp, cred_t *cr)
 {
 	minor_t minor = getminor(dev);
 	zvol_state_t *zv;
 	int error = 0;
 
 	mutex_enter(&zfsdev_state_lock);
 
 	zv = zfsdev_get_soft_state(minor, ZSST_ZVOL);
 	if (zv == NULL) {
 		mutex_exit(&zfsdev_state_lock);
 #else	/* !illumos */
 static int
 zvol_close(struct g_provider *pp, int flag, int count)
 {
 	zvol_state_t *zv;
 	int error = 0;
 	boolean_t locked = B_FALSE;
 
 	/* See comment in zvol_open(). */
 	if (!MUTEX_HELD(&zfsdev_state_lock)) {
 		mutex_enter(&zfsdev_state_lock);
 		locked = B_TRUE;
 	}
 
 	zv = pp->private;
 	if (zv == NULL) {
 		if (locked)
 			mutex_exit(&zfsdev_state_lock);
 #endif	/* illumos */
 		return (SET_ERROR(ENXIO));
 	}
 
 	if (zv->zv_flags & ZVOL_EXCL) {
 		ASSERT(zv->zv_total_opens == 1);
 		zv->zv_flags &= ~ZVOL_EXCL;
 	}
 
 	/*
 	 * If the open count is zero, this is a spurious close.
 	 * That indicates a bug in the kernel / DDI framework.
 	 */
 #ifdef illumos
 	ASSERT(zv->zv_open_count[otyp] != 0);
 #endif
 	ASSERT(zv->zv_total_opens != 0);
 
 	/*
 	 * You may get multiple opens, but only one close.
 	 */
 #ifdef illumos
 	zv->zv_open_count[otyp]--;
 	zv->zv_total_opens--;
 #else
 	zv->zv_total_opens -= count;
 #endif
 
 	if (zv->zv_total_opens == 0)
 		zvol_last_close(zv);
 
 #ifdef illumos
 	mutex_exit(&zfsdev_state_lock);
 #else
 	if (locked)
 		mutex_exit(&zfsdev_state_lock);
 #endif
 	return (error);
 }
 
 static void
 zvol_get_done(zgd_t *zgd, int error)
 {
 	if (zgd->zgd_db)
 		dmu_buf_rele(zgd->zgd_db, zgd);
 
 	zfs_range_unlock(zgd->zgd_rl);
 
 	if (error == 0 && zgd->zgd_bp)
 		zil_add_block(zgd->zgd_zilog, zgd->zgd_bp);
 
 	kmem_free(zgd, sizeof (zgd_t));
 }
 
 /*
  * Get data to generate a TX_WRITE intent log record.
  */
 static int
 zvol_get_data(void *arg, lr_write_t *lr, char *buf, zio_t *zio)
 {
 	zvol_state_t *zv = arg;
 	objset_t *os = zv->zv_objset;
 	uint64_t object = ZVOL_OBJ;
 	uint64_t offset = lr->lr_offset;
 	uint64_t size = lr->lr_length;	/* length of user data */
 	blkptr_t *bp = &lr->lr_blkptr;
 	dmu_buf_t *db;
 	zgd_t *zgd;
 	int error;
 
 	ASSERT(zio != NULL);
 	ASSERT(size != 0);
 
 	zgd = kmem_zalloc(sizeof (zgd_t), KM_SLEEP);
 	zgd->zgd_zilog = zv->zv_zilog;
 	zgd->zgd_rl = zfs_range_lock(&zv->zv_znode, offset, size, RL_READER);
 
 	/*
 	 * Write records come in two flavors: immediate and indirect.
 	 * For small writes it's cheaper to store the data with the
 	 * log record (immediate); for large writes it's cheaper to
 	 * sync the data and get a pointer to it (indirect) so that
 	 * we don't have to write the data twice.
 	 */
 	if (buf != NULL) {	/* immediate write */
 		error = dmu_read(os, object, offset, size, buf,
 		    DMU_READ_NO_PREFETCH);
 	} else {
 		size = zv->zv_volblocksize;
 		offset = P2ALIGN(offset, size);
 		error = dmu_buf_hold(os, object, offset, zgd, &db,
 		    DMU_READ_NO_PREFETCH);
 		if (error == 0) {
 			blkptr_t *obp = dmu_buf_get_blkptr(db);
 			if (obp) {
 				ASSERT(BP_IS_HOLE(bp));
 				*bp = *obp;
 			}
 
 			zgd->zgd_db = db;
 			zgd->zgd_bp = bp;
 
 			ASSERT(db->db_offset == offset);
 			ASSERT(db->db_size == size);
 
 			error = dmu_sync(zio, lr->lr_common.lrc_txg,
 			    zvol_get_done, zgd);
 
 			if (error == 0)
 				return (0);
 		}
 	}
 
 	zvol_get_done(zgd, error);
 
 	return (error);
 }
 
 /*
  * zvol_log_write() handles synchronous writes using TX_WRITE ZIL transactions.
  *
  * We store data in the log buffers if it's small enough.
  * Otherwise we will later flush the data out via dmu_sync().
  */
 ssize_t zvol_immediate_write_sz = 32768;
 #ifdef _KERNEL
 SYSCTL_LONG(_vfs_zfs_vol, OID_AUTO, immediate_write_sz, CTLFLAG_RWTUN,
     &zvol_immediate_write_sz, 0, "Minimal size for indirect log write");
 #endif
 
 static void
 zvol_log_write(zvol_state_t *zv, dmu_tx_t *tx, offset_t off, ssize_t resid,
     boolean_t sync)
 {
 	uint32_t blocksize = zv->zv_volblocksize;
 	zilog_t *zilog = zv->zv_zilog;
 	itx_wr_state_t write_state;
 
 	if (zil_replaying(zilog, tx))
 		return;
 
 	if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
 		write_state = WR_INDIRECT;
 	else if (!spa_has_slogs(zilog->zl_spa) &&
 	    resid >= blocksize && blocksize > zvol_immediate_write_sz)
 		write_state = WR_INDIRECT;
 	else if (sync)
 		write_state = WR_COPIED;
 	else
 		write_state = WR_NEED_COPY;
 
 	while (resid) {
 		itx_t *itx;
 		lr_write_t *lr;
 		itx_wr_state_t wr_state = write_state;
 		ssize_t len = resid;
 
 		if (wr_state == WR_COPIED && resid > ZIL_MAX_COPIED_DATA)
 			wr_state = WR_NEED_COPY;
 		else if (wr_state == WR_INDIRECT)
 			len = MIN(blocksize - P2PHASE(off, blocksize), resid);
 
 		itx = zil_itx_create(TX_WRITE, sizeof (*lr) +
 		    (wr_state == WR_COPIED ? len : 0));
 		lr = (lr_write_t *)&itx->itx_lr;
 		if (wr_state == WR_COPIED && dmu_read(zv->zv_objset,
 		    ZVOL_OBJ, off, len, lr + 1, DMU_READ_NO_PREFETCH) != 0) {
 			zil_itx_destroy(itx);
 			itx = zil_itx_create(TX_WRITE, sizeof (*lr));
 			lr = (lr_write_t *)&itx->itx_lr;
 			wr_state = WR_NEED_COPY;
 		}
 
 		itx->itx_wr_state = wr_state;
 		lr->lr_foid = ZVOL_OBJ;
 		lr->lr_offset = off;
 		lr->lr_length = len;
 		lr->lr_blkoff = 0;
 		BP_ZERO(&lr->lr_blkptr);
 
 		itx->itx_private = zv;
 
 		if (!sync && (zv->zv_sync_cnt == 0))
 			itx->itx_sync = B_FALSE;
 
 		zil_itx_assign(zilog, itx, tx);
 
 		off += len;
 		resid -= len;
 	}
 }
 
 #ifdef illumos
 static int
 zvol_dumpio_vdev(vdev_t *vd, void *addr, uint64_t offset, uint64_t origoffset,
     uint64_t size, boolean_t doread, boolean_t isdump)
 {
 	vdev_disk_t *dvd;
 	int c;
 	int numerrors = 0;
 
 	if (vd->vdev_ops == &vdev_mirror_ops ||
 	    vd->vdev_ops == &vdev_replacing_ops ||
 	    vd->vdev_ops == &vdev_spare_ops) {
 		for (c = 0; c < vd->vdev_children; c++) {
 			int err = zvol_dumpio_vdev(vd->vdev_child[c],
 			    addr, offset, origoffset, size, doread, isdump);
 			if (err != 0) {
 				numerrors++;
 			} else if (doread) {
 				break;
 			}
 		}
 	}
 
 	if (!vd->vdev_ops->vdev_op_leaf && vd->vdev_ops != &vdev_raidz_ops)
 		return (numerrors < vd->vdev_children ? 0 : EIO);
 
 	if (doread && !vdev_readable(vd))
 		return (SET_ERROR(EIO));
 	else if (!doread && !vdev_writeable(vd))
 		return (SET_ERROR(EIO));
 
 	if (vd->vdev_ops == &vdev_raidz_ops) {
 		return (vdev_raidz_physio(vd,
 		    addr, size, offset, origoffset, doread, isdump));
 	}
 
 	offset += VDEV_LABEL_START_SIZE;
 
 	if (ddi_in_panic() || isdump) {
 		ASSERT(!doread);
 		if (doread)
 			return (SET_ERROR(EIO));
 		dvd = vd->vdev_tsd;
 		ASSERT3P(dvd, !=, NULL);
 		return (ldi_dump(dvd->vd_lh, addr, lbtodb(offset),
 		    lbtodb(size)));
 	} else {
 		dvd = vd->vdev_tsd;
 		ASSERT3P(dvd, !=, NULL);
 		return (vdev_disk_ldi_physio(dvd->vd_lh, addr, size,
 		    offset, doread ? B_READ : B_WRITE));
 	}
 }
 
 static int
 zvol_dumpio(zvol_state_t *zv, void *addr, uint64_t offset, uint64_t size,
     boolean_t doread, boolean_t isdump)
 {
 	vdev_t *vd;
 	int error;
 	zvol_extent_t *ze;
 	spa_t *spa = dmu_objset_spa(zv->zv_objset);
 
 	/* Must be sector aligned, and not stradle a block boundary. */
 	if (P2PHASE(offset, DEV_BSIZE) || P2PHASE(size, DEV_BSIZE) ||
 	    P2BOUNDARY(offset, size, zv->zv_volblocksize)) {
 		return (SET_ERROR(EINVAL));
 	}
 	ASSERT(size <= zv->zv_volblocksize);
 
 	/* Locate the extent this belongs to */
 	ze = list_head(&zv->zv_extents);
 	while (offset >= ze->ze_nblks * zv->zv_volblocksize) {
 		offset -= ze->ze_nblks * zv->zv_volblocksize;
 		ze = list_next(&zv->zv_extents, ze);
 	}
 
 	if (ze == NULL)
 		return (SET_ERROR(EINVAL));
 
 	if (!ddi_in_panic())
 		spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
 
 	vd = vdev_lookup_top(spa, DVA_GET_VDEV(&ze->ze_dva));
 	offset += DVA_GET_OFFSET(&ze->ze_dva);
 	error = zvol_dumpio_vdev(vd, addr, offset, DVA_GET_OFFSET(&ze->ze_dva),
 	    size, doread, isdump);
 
 	if (!ddi_in_panic())
 		spa_config_exit(spa, SCL_STATE, FTAG);
 
 	return (error);
 }
 
 int
 zvol_strategy(buf_t *bp)
 {
 	zfs_soft_state_t *zs = NULL;
 #else	/* !illumos */
 void
 zvol_strategy(struct bio *bp)
 {
 #endif	/* illumos */
 	zvol_state_t *zv;
 	uint64_t off, volsize;
 	size_t resid;
 	char *addr;
 	objset_t *os;
 	rl_t *rl;
 	int error = 0;
 #ifdef illumos
 	boolean_t doread = bp->b_flags & B_READ;
 #else
 	boolean_t doread = 0;
 #endif
 	boolean_t is_dumpified;
 	boolean_t sync;
 
 #ifdef illumos
 	if (getminor(bp->b_edev) == 0) {
 		error = SET_ERROR(EINVAL);
 	} else {
 		zs = ddi_get_soft_state(zfsdev_state, getminor(bp->b_edev));
 		if (zs == NULL)
 			error = SET_ERROR(ENXIO);
 		else if (zs->zss_type != ZSST_ZVOL)
 			error = SET_ERROR(EINVAL);
 	}
 
 	if (error) {
 		bioerror(bp, error);
 		biodone(bp);
 		return (0);
 	}
 
 	zv = zs->zss_data;
 
 	if (!(bp->b_flags & B_READ) && (zv->zv_flags & ZVOL_RDONLY)) {
 		bioerror(bp, EROFS);
 		biodone(bp);
 		return (0);
 	}
 
 	off = ldbtob(bp->b_blkno);
 #else	/* !illumos */
 	if (bp->bio_to)
 		zv = bp->bio_to->private;
 	else
 		zv = bp->bio_dev->si_drv2;
 
 	if (zv == NULL) {
 		error = SET_ERROR(ENXIO);
 		goto out;
 	}
 
 	if (bp->bio_cmd != BIO_READ && (zv->zv_flags & ZVOL_RDONLY)) {
 		error = SET_ERROR(EROFS);
 		goto out;
 	}
 
 	switch (bp->bio_cmd) {
 	case BIO_FLUSH:
 		goto sync;
 	case BIO_READ:
 		doread = 1;
 	case BIO_WRITE:
 	case BIO_DELETE:
 		break;
 	default:
 		error = EOPNOTSUPP;
 		goto out;
 	}
 
 	off = bp->bio_offset;
 #endif	/* illumos */
 	volsize = zv->zv_volsize;
 
 	os = zv->zv_objset;
 	ASSERT(os != NULL);
 
 #ifdef illumos
 	bp_mapin(bp);
 	addr = bp->b_un.b_addr;
 	resid = bp->b_bcount;
 
 	if (resid > 0 && (off < 0 || off >= volsize)) {
 		bioerror(bp, EIO);
 		biodone(bp);
 		return (0);
 	}
 
 	is_dumpified = zv->zv_flags & ZVOL_DUMPIFIED;
 	sync = ((!(bp->b_flags & B_ASYNC) &&
 	    !(zv->zv_flags & ZVOL_WCE)) ||
 	    (zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS)) &&
 	    !doread && !is_dumpified;
 #else	/* !illumos */
 	addr = bp->bio_data;
 	resid = bp->bio_length;
 
 	if (resid > 0 && (off < 0 || off >= volsize)) {
 		error = SET_ERROR(EIO);
 		goto out;
 	}
 
 	is_dumpified = B_FALSE;
 	sync = !doread && !is_dumpified &&
 	    zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS;
 #endif	/* illumos */
 
 	/*
 	 * There must be no buffer changes when doing a dmu_sync() because
 	 * we can't change the data whilst calculating the checksum.
 	 */
 	rl = zfs_range_lock(&zv->zv_znode, off, resid,
 	    doread ? RL_READER : RL_WRITER);
 
 #ifndef illumos
 	if (bp->bio_cmd == BIO_DELETE) {
 		dmu_tx_t *tx = dmu_tx_create(zv->zv_objset);
 		error = dmu_tx_assign(tx, TXG_WAIT);
 		if (error != 0) {
 			dmu_tx_abort(tx);
 		} else {
 			zvol_log_truncate(zv, tx, off, resid, sync);
 			dmu_tx_commit(tx);
 			error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ,
 			    off, resid);
 			resid = 0;
 		}
 		goto unlock;
 	}
 #endif
 	while (resid != 0 && off < volsize) {
 		size_t size = MIN(resid, zvol_maxphys);
 #ifdef illumos
 		if (is_dumpified) {
 			size = MIN(size, P2END(off, zv->zv_volblocksize) - off);
 			error = zvol_dumpio(zv, addr, off, size,
 			    doread, B_FALSE);
 		} else if (doread) {
 #else
 		if (doread) {
 #endif
 			error = dmu_read(os, ZVOL_OBJ, off, size, addr,
 			    DMU_READ_PREFETCH);
 		} else {
 			dmu_tx_t *tx = dmu_tx_create(os);
 			dmu_tx_hold_write(tx, ZVOL_OBJ, off, size);
 			error = dmu_tx_assign(tx, TXG_WAIT);
 			if (error) {
 				dmu_tx_abort(tx);
 			} else {
 				dmu_write(os, ZVOL_OBJ, off, size, addr, tx);
 				zvol_log_write(zv, tx, off, size, sync);
 				dmu_tx_commit(tx);
 			}
 		}
 		if (error) {
 			/* convert checksum errors into IO errors */
 			if (error == ECKSUM)
 				error = SET_ERROR(EIO);
 			break;
 		}
 		off += size;
 		addr += size;
 		resid -= size;
 	}
 #ifndef illumos
 unlock:
 #endif
 	zfs_range_unlock(rl);
 
 #ifdef illumos
 	if ((bp->b_resid = resid) == bp->b_bcount)
 		bioerror(bp, off > volsize ? EINVAL : error);
 
 	if (sync)
 		zil_commit(zv->zv_zilog, ZVOL_OBJ);
 	biodone(bp);
 
 	return (0);
 #else	/* !illumos */
 	bp->bio_completed = bp->bio_length - resid;
 	if (bp->bio_completed < bp->bio_length && off > volsize)
 		error = EINVAL;
 
 	if (sync) {
 sync:
 		zil_commit(zv->zv_zilog, ZVOL_OBJ);
 	}
 out:
 	if (bp->bio_to)
 		g_io_deliver(bp, error);
 	else
 		biofinish(bp, NULL, error);
 #endif	/* illumos */
 }
 
 #ifdef illumos
 /*
  * Set the buffer count to the zvol maximum transfer.
  * Using our own routine instead of the default minphys()
  * means that for larger writes we write bigger buffers on X86
  * (128K instead of 56K) and flush the disk write cache less often
  * (every zvol_maxphys - currently 1MB) instead of minphys (currently
  * 56K on X86 and 128K on sparc).
  */
 void
 zvol_minphys(struct buf *bp)
 {
 	if (bp->b_bcount > zvol_maxphys)
 		bp->b_bcount = zvol_maxphys;
 }
 
 int
 zvol_dump(dev_t dev, caddr_t addr, daddr_t blkno, int nblocks)
 {
 	minor_t minor = getminor(dev);
 	zvol_state_t *zv;
 	int error = 0;
 	uint64_t size;
 	uint64_t boff;
 	uint64_t resid;
 
 	zv = zfsdev_get_soft_state(minor, ZSST_ZVOL);
 	if (zv == NULL)
 		return (SET_ERROR(ENXIO));
 
 	if ((zv->zv_flags & ZVOL_DUMPIFIED) == 0)
 		return (SET_ERROR(EINVAL));
 
 	boff = ldbtob(blkno);
 	resid = ldbtob(nblocks);
 
 	VERIFY3U(boff + resid, <=, zv->zv_volsize);
 
 	while (resid) {
 		size = MIN(resid, P2END(boff, zv->zv_volblocksize) - boff);
 		error = zvol_dumpio(zv, addr, boff, size, B_FALSE, B_TRUE);
 		if (error)
 			break;
 		boff += size;
 		addr += size;
 		resid -= size;
 	}
 
 	return (error);
 }
 
 /*ARGSUSED*/
 int
 zvol_read(dev_t dev, uio_t *uio, cred_t *cr)
 {
 	minor_t minor = getminor(dev);
 #else	/* !illumos */
 int
 zvol_read(struct cdev *dev, struct uio *uio, int ioflag)
 {
 #endif	/* illumos */
 	zvol_state_t *zv;
 	uint64_t volsize;
 	rl_t *rl;
 	int error = 0;
 
 #ifdef illumos
 	zv = zfsdev_get_soft_state(minor, ZSST_ZVOL);
 	if (zv == NULL)
 		return (SET_ERROR(ENXIO));
 #else
 	zv = dev->si_drv2;
 #endif
 
 	volsize = zv->zv_volsize;
 	/* uio_loffset == volsize isn't an error as its required for EOF processing. */
 	if (uio->uio_resid > 0 &&
 	    (uio->uio_loffset < 0 || uio->uio_loffset > volsize))
 		return (SET_ERROR(EIO));
 
 #ifdef illumos
 	if (zv->zv_flags & ZVOL_DUMPIFIED) {
 		error = physio(zvol_strategy, NULL, dev, B_READ,
 		    zvol_minphys, uio);
 		return (error);
 	}
 #endif
 
 	rl = zfs_range_lock(&zv->zv_znode, uio->uio_loffset, uio->uio_resid,
 	    RL_READER);
 	while (uio->uio_resid > 0 && uio->uio_loffset < volsize) {
 		uint64_t bytes = MIN(uio->uio_resid, DMU_MAX_ACCESS >> 1);
 
 		/* don't read past the end */
 		if (bytes > volsize - uio->uio_loffset)
 			bytes = volsize - uio->uio_loffset;
 
 		error =  dmu_read_uio_dbuf(zv->zv_dbuf, uio, bytes);
 		if (error) {
 			/* convert checksum errors into IO errors */
 			if (error == ECKSUM)
 				error = SET_ERROR(EIO);
 			break;
 		}
 	}
 	zfs_range_unlock(rl);
 	return (error);
 }
 
 #ifdef illumos
 /*ARGSUSED*/
 int
 zvol_write(dev_t dev, uio_t *uio, cred_t *cr)
 {
 	minor_t minor = getminor(dev);
 #else	/* !illumos */
 int
 zvol_write(struct cdev *dev, struct uio *uio, int ioflag)
 {
 #endif	/* illumos */
 	zvol_state_t *zv;
 	uint64_t volsize;
 	rl_t *rl;
 	int error = 0;
 	boolean_t sync;
 
 #ifdef illumos
 	zv = zfsdev_get_soft_state(minor, ZSST_ZVOL);
 	if (zv == NULL)
 		return (SET_ERROR(ENXIO));
 #else
 	zv = dev->si_drv2;
 #endif
 
 	volsize = zv->zv_volsize;
 	/* uio_loffset == volsize isn't an error as its required for EOF processing. */
 	if (uio->uio_resid > 0 &&
 	    (uio->uio_loffset < 0 || uio->uio_loffset > volsize))
 		return (SET_ERROR(EIO));
 
 #ifdef illumos
 	if (zv->zv_flags & ZVOL_DUMPIFIED) {
 		error = physio(zvol_strategy, NULL, dev, B_WRITE,
 		    zvol_minphys, uio);
 		return (error);
 	}
 
 	sync = !(zv->zv_flags & ZVOL_WCE) ||
 #else
 	sync = (ioflag & IO_SYNC) ||
 #endif
 	    (zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS);
 
 	rl = zfs_range_lock(&zv->zv_znode, uio->uio_loffset, uio->uio_resid,
 	    RL_WRITER);
 	while (uio->uio_resid > 0 && uio->uio_loffset < volsize) {
 		uint64_t bytes = MIN(uio->uio_resid, DMU_MAX_ACCESS >> 1);
 		uint64_t off = uio->uio_loffset;
 		dmu_tx_t *tx = dmu_tx_create(zv->zv_objset);
 
 		if (bytes > volsize - off)	/* don't write past the end */
 			bytes = volsize - off;
 
 		dmu_tx_hold_write(tx, ZVOL_OBJ, off, bytes);
 		error = dmu_tx_assign(tx, TXG_WAIT);
 		if (error) {
 			dmu_tx_abort(tx);
 			break;
 		}
 		error = dmu_write_uio_dbuf(zv->zv_dbuf, uio, bytes, tx);
 		if (error == 0)
 			zvol_log_write(zv, tx, off, bytes, sync);
 		dmu_tx_commit(tx);
 
 		if (error)
 			break;
 	}
 	zfs_range_unlock(rl);
 	if (sync)
 		zil_commit(zv->zv_zilog, ZVOL_OBJ);
 	return (error);
 }
 
 #ifdef illumos
 int
 zvol_getefi(void *arg, int flag, uint64_t vs, uint8_t bs)
 {
 	struct uuid uuid = EFI_RESERVED;
 	efi_gpe_t gpe = { 0 };
 	uint32_t crc;
 	dk_efi_t efi;
 	int length;
 	char *ptr;
 
 	if (ddi_copyin(arg, &efi, sizeof (dk_efi_t), flag))
 		return (SET_ERROR(EFAULT));
 	ptr = (char *)(uintptr_t)efi.dki_data_64;
 	length = efi.dki_length;
 	/*
 	 * Some clients may attempt to request a PMBR for the
 	 * zvol.  Currently this interface will return EINVAL to
 	 * such requests.  These requests could be supported by
 	 * adding a check for lba == 0 and consing up an appropriate
 	 * PMBR.
 	 */
 	if (efi.dki_lba < 1 || efi.dki_lba > 2 || length <= 0)
 		return (SET_ERROR(EINVAL));
 
 	gpe.efi_gpe_StartingLBA = LE_64(34ULL);
 	gpe.efi_gpe_EndingLBA = LE_64((vs >> bs) - 1);
 	UUID_LE_CONVERT(gpe.efi_gpe_PartitionTypeGUID, uuid);
 
 	if (efi.dki_lba == 1) {
 		efi_gpt_t gpt = { 0 };
 
 		gpt.efi_gpt_Signature = LE_64(EFI_SIGNATURE);
 		gpt.efi_gpt_Revision = LE_32(EFI_VERSION_CURRENT);
 		gpt.efi_gpt_HeaderSize = LE_32(sizeof (gpt));
 		gpt.efi_gpt_MyLBA = LE_64(1ULL);
 		gpt.efi_gpt_FirstUsableLBA = LE_64(34ULL);
 		gpt.efi_gpt_LastUsableLBA = LE_64((vs >> bs) - 1);
 		gpt.efi_gpt_PartitionEntryLBA = LE_64(2ULL);
 		gpt.efi_gpt_NumberOfPartitionEntries = LE_32(1);
 		gpt.efi_gpt_SizeOfPartitionEntry =
 		    LE_32(sizeof (efi_gpe_t));
 		CRC32(crc, &gpe, sizeof (gpe), -1U, crc32_table);
 		gpt.efi_gpt_PartitionEntryArrayCRC32 = LE_32(~crc);
 		CRC32(crc, &gpt, sizeof (gpt), -1U, crc32_table);
 		gpt.efi_gpt_HeaderCRC32 = LE_32(~crc);
 		if (ddi_copyout(&gpt, ptr, MIN(sizeof (gpt), length),
 		    flag))
 			return (SET_ERROR(EFAULT));
 		ptr += sizeof (gpt);
 		length -= sizeof (gpt);
 	}
 	if (length > 0 && ddi_copyout(&gpe, ptr, MIN(sizeof (gpe),
 	    length), flag))
 		return (SET_ERROR(EFAULT));
 	return (0);
 }
 
 /*
  * BEGIN entry points to allow external callers access to the volume.
  */
 /*
  * Return the volume parameters needed for access from an external caller.
  * These values are invariant as long as the volume is held open.
  */
 int
 zvol_get_volume_params(minor_t minor, uint64_t *blksize,
     uint64_t *max_xfer_len, void **minor_hdl, void **objset_hdl, void **zil_hdl,
     void **rl_hdl, void **bonus_hdl)
 {
 	zvol_state_t *zv;
 
 	zv = zfsdev_get_soft_state(minor, ZSST_ZVOL);
 	if (zv == NULL)
 		return (SET_ERROR(ENXIO));
 	if (zv->zv_flags & ZVOL_DUMPIFIED)
 		return (SET_ERROR(ENXIO));
 
 	ASSERT(blksize && max_xfer_len && minor_hdl &&
 	    objset_hdl && zil_hdl && rl_hdl && bonus_hdl);
 
 	*blksize = zv->zv_volblocksize;
 	*max_xfer_len = (uint64_t)zvol_maxphys;
 	*minor_hdl = zv;
 	*objset_hdl = zv->zv_objset;
 	*zil_hdl = zv->zv_zilog;
 	*rl_hdl = &zv->zv_znode;
 	*bonus_hdl = zv->zv_dbuf;
 	return (0);
 }
 
 /*
  * Return the current volume size to an external caller.
  * The size can change while the volume is open.
  */
 uint64_t
 zvol_get_volume_size(void *minor_hdl)
 {
 	zvol_state_t *zv = minor_hdl;
 
 	return (zv->zv_volsize);
 }
 
 /*
  * Return the current WCE setting to an external caller.
  * The WCE setting can change while the volume is open.
  */
 int
 zvol_get_volume_wce(void *minor_hdl)
 {
 	zvol_state_t *zv = minor_hdl;
 
 	return ((zv->zv_flags & ZVOL_WCE) ? 1 : 0);
 }
 
 /*
  * Entry point for external callers to zvol_log_write
  */
 void
 zvol_log_write_minor(void *minor_hdl, dmu_tx_t *tx, offset_t off, ssize_t resid,
     boolean_t sync)
 {
 	zvol_state_t *zv = minor_hdl;
 
 	zvol_log_write(zv, tx, off, resid, sync);
 }
 /*
  * END entry points to allow external callers access to the volume.
  */
 #endif	/* illumos */
 
 /*
  * Log a DKIOCFREE/free-long-range to the ZIL with TX_TRUNCATE.
  */
 static void
 zvol_log_truncate(zvol_state_t *zv, dmu_tx_t *tx, uint64_t off, uint64_t len,
     boolean_t sync)
 {
 	itx_t *itx;
 	lr_truncate_t *lr;
 	zilog_t *zilog = zv->zv_zilog;
 
 	if (zil_replaying(zilog, tx))
 		return;
 
 	itx = zil_itx_create(TX_TRUNCATE, sizeof (*lr));
 	lr = (lr_truncate_t *)&itx->itx_lr;
 	lr->lr_foid = ZVOL_OBJ;
 	lr->lr_offset = off;
 	lr->lr_length = len;
 
 	itx->itx_sync = (sync || zv->zv_sync_cnt != 0);
 	zil_itx_assign(zilog, itx, tx);
 }
 
 #ifdef illumos
 /*
  * Dirtbag ioctls to support mkfs(1M) for UFS filesystems.  See dkio(7I).
  * Also a dirtbag dkio ioctl for unmap/free-block functionality.
  */
 /*ARGSUSED*/
 int
 zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int flag, cred_t *cr, int *rvalp)
 {
 	zvol_state_t *zv;
 	struct dk_callback *dkc;
 	int error = 0;
 	rl_t *rl;
 
 	mutex_enter(&zfsdev_state_lock);
 
 	zv = zfsdev_get_soft_state(getminor(dev), ZSST_ZVOL);
 
 	if (zv == NULL) {
 		mutex_exit(&zfsdev_state_lock);
 		return (SET_ERROR(ENXIO));
 	}
 	ASSERT(zv->zv_total_opens > 0);
 
 	switch (cmd) {
 
 	case DKIOCINFO:
 	{
 		struct dk_cinfo dki;
 
 		bzero(&dki, sizeof (dki));
 		(void) strcpy(dki.dki_cname, "zvol");
 		(void) strcpy(dki.dki_dname, "zvol");
 		dki.dki_ctype = DKC_UNKNOWN;
 		dki.dki_unit = getminor(dev);
 		dki.dki_maxtransfer =
 		    1 << (SPA_OLD_MAXBLOCKSHIFT - zv->zv_min_bs);
 		mutex_exit(&zfsdev_state_lock);
 		if (ddi_copyout(&dki, (void *)arg, sizeof (dki), flag))
 			error = SET_ERROR(EFAULT);
 		return (error);
 	}
 
 	case DKIOCGMEDIAINFO:
 	{
 		struct dk_minfo dkm;
 
 		bzero(&dkm, sizeof (dkm));
 		dkm.dki_lbsize = 1U << zv->zv_min_bs;
 		dkm.dki_capacity = zv->zv_volsize >> zv->zv_min_bs;
 		dkm.dki_media_type = DK_UNKNOWN;
 		mutex_exit(&zfsdev_state_lock);
 		if (ddi_copyout(&dkm, (void *)arg, sizeof (dkm), flag))
 			error = SET_ERROR(EFAULT);
 		return (error);
 	}
 
 	case DKIOCGMEDIAINFOEXT:
 	{
 		struct dk_minfo_ext dkmext;
 
 		bzero(&dkmext, sizeof (dkmext));
 		dkmext.dki_lbsize = 1U << zv->zv_min_bs;
 		dkmext.dki_pbsize = zv->zv_volblocksize;
 		dkmext.dki_capacity = zv->zv_volsize >> zv->zv_min_bs;
 		dkmext.dki_media_type = DK_UNKNOWN;
 		mutex_exit(&zfsdev_state_lock);
 		if (ddi_copyout(&dkmext, (void *)arg, sizeof (dkmext), flag))
 			error = SET_ERROR(EFAULT);
 		return (error);
 	}
 
 	case DKIOCGETEFI:
 	{
 		uint64_t vs = zv->zv_volsize;
 		uint8_t bs = zv->zv_min_bs;
 
 		mutex_exit(&zfsdev_state_lock);
 		error = zvol_getefi((void *)arg, flag, vs, bs);
 		return (error);
 	}
 
 	case DKIOCFLUSHWRITECACHE:
 		dkc = (struct dk_callback *)arg;
 		mutex_exit(&zfsdev_state_lock);
 		zil_commit(zv->zv_zilog, ZVOL_OBJ);
 		if ((flag & FKIOCTL) && dkc != NULL && dkc->dkc_callback) {
 			(*dkc->dkc_callback)(dkc->dkc_cookie, error);
 			error = 0;
 		}
 		return (error);
 
 	case DKIOCGETWCE:
 	{
 		int wce = (zv->zv_flags & ZVOL_WCE) ? 1 : 0;
 		if (ddi_copyout(&wce, (void *)arg, sizeof (int),
 		    flag))
 			error = SET_ERROR(EFAULT);
 		break;
 	}
 	case DKIOCSETWCE:
 	{
 		int wce;
 		if (ddi_copyin((void *)arg, &wce, sizeof (int),
 		    flag)) {
 			error = SET_ERROR(EFAULT);
 			break;
 		}
 		if (wce) {
 			zv->zv_flags |= ZVOL_WCE;
 			mutex_exit(&zfsdev_state_lock);
 		} else {
 			zv->zv_flags &= ~ZVOL_WCE;
 			mutex_exit(&zfsdev_state_lock);
 			zil_commit(zv->zv_zilog, ZVOL_OBJ);
 		}
 		return (0);
 	}
 
 	case DKIOCGGEOM:
 	case DKIOCGVTOC:
 		/*
 		 * commands using these (like prtvtoc) expect ENOTSUP
 		 * since we're emulating an EFI label
 		 */
 		error = SET_ERROR(ENOTSUP);
 		break;
 
 	case DKIOCDUMPINIT:
 		rl = zfs_range_lock(&zv->zv_znode, 0, zv->zv_volsize,
 		    RL_WRITER);
 		error = zvol_dumpify(zv);
 		zfs_range_unlock(rl);
 		break;
 
 	case DKIOCDUMPFINI:
 		if (!(zv->zv_flags & ZVOL_DUMPIFIED))
 			break;
 		rl = zfs_range_lock(&zv->zv_znode, 0, zv->zv_volsize,
 		    RL_WRITER);
 		error = zvol_dump_fini(zv);
 		zfs_range_unlock(rl);
 		break;
 
 	case DKIOCFREE:
 	{
 		dkioc_free_t df;
 		dmu_tx_t *tx;
 
 		if (!zvol_unmap_enabled)
 			break;
 
 		if (ddi_copyin((void *)arg, &df, sizeof (df), flag)) {
 			error = SET_ERROR(EFAULT);
 			break;
 		}
 
 		/*
 		 * Apply Postel's Law to length-checking.  If they overshoot,
 		 * just blank out until the end, if there's a need to blank
 		 * out anything.
 		 */
 		if (df.df_start >= zv->zv_volsize)
 			break;	/* No need to do anything... */
 
 		mutex_exit(&zfsdev_state_lock);
 
 		rl = zfs_range_lock(&zv->zv_znode, df.df_start, df.df_length,
 		    RL_WRITER);
 		tx = dmu_tx_create(zv->zv_objset);
 		dmu_tx_mark_netfree(tx);
 		error = dmu_tx_assign(tx, TXG_WAIT);
 		if (error != 0) {
 			dmu_tx_abort(tx);
 		} else {
 			zvol_log_truncate(zv, tx, df.df_start,
 			    df.df_length, B_TRUE);
 			dmu_tx_commit(tx);
 			error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ,
 			    df.df_start, df.df_length);
 		}
 
 		zfs_range_unlock(rl);
 
 		/*
 		 * If the write-cache is disabled, 'sync' property
 		 * is set to 'always', or if the caller is asking for
 		 * a synchronous free, commit this operation to the zil.
 		 * This will sync any previous uncommitted writes to the
 		 * zvol object.
 		 * Can be overridden by the zvol_unmap_sync_enabled tunable.
 		 */
 		if ((error == 0) && zvol_unmap_sync_enabled &&
 		    (!(zv->zv_flags & ZVOL_WCE) ||
 		    (zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS) ||
 		    (df.df_flags & DF_WAIT_SYNC))) {
 			zil_commit(zv->zv_zilog, ZVOL_OBJ);
 		}
 
 		return (error);
 	}
 
 	default:
 		error = SET_ERROR(ENOTTY);
 		break;
 
 	}
 	mutex_exit(&zfsdev_state_lock);
 	return (error);
 }
 #endif	/* illumos */
 
 int
 zvol_busy(void)
 {
 	return (zvol_minors != 0);
 }
 
 void
 zvol_init(void)
 {
 	VERIFY(ddi_soft_state_init(&zfsdev_state, sizeof (zfs_soft_state_t),
 	    1) == 0);
 #ifdef illumos
 	mutex_init(&zfsdev_state_lock, NULL, MUTEX_DEFAULT, NULL);
 #else
 	ZFS_LOG(1, "ZVOL Initialized.");
 #endif
 }
 
 void
 zvol_fini(void)
 {
 #ifdef illumos
 	mutex_destroy(&zfsdev_state_lock);
 #endif
 	ddi_soft_state_fini(&zfsdev_state);
 	ZFS_LOG(1, "ZVOL Deinitialized.");
 }
 
 #ifdef illumos
 /*ARGSUSED*/
 static int
 zfs_mvdev_dump_feature_check(void *arg, dmu_tx_t *tx)
 {
 	spa_t *spa = dmu_tx_pool(tx)->dp_spa;
 
 	if (spa_feature_is_active(spa, SPA_FEATURE_MULTI_VDEV_CRASH_DUMP))
 		return (1);
 	return (0);
 }
 
 /*ARGSUSED*/
 static void
 zfs_mvdev_dump_activate_feature_sync(void *arg, dmu_tx_t *tx)
 {
 	spa_t *spa = dmu_tx_pool(tx)->dp_spa;
 
 	spa_feature_incr(spa, SPA_FEATURE_MULTI_VDEV_CRASH_DUMP, tx);
 }
 
 static int
 zvol_dump_init(zvol_state_t *zv, boolean_t resize)
 {
 	dmu_tx_t *tx;
 	int error;
 	objset_t *os = zv->zv_objset;
 	spa_t *spa = dmu_objset_spa(os);
 	vdev_t *vd = spa->spa_root_vdev;
 	nvlist_t *nv = NULL;
 	uint64_t version = spa_version(spa);
 	uint64_t checksum, compress, refresrv, vbs, dedup;
 
 	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
 	ASSERT(vd->vdev_ops == &vdev_root_ops);
 
 	error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, 0,
 	    DMU_OBJECT_END);
 	if (error != 0)
 		return (error);
 	/* wait for dmu_free_long_range to actually free the blocks */
 	txg_wait_synced(dmu_objset_pool(zv->zv_objset), 0);
 
 	/*
 	 * If the pool on which the dump device is being initialized has more
 	 * than one child vdev, check that the MULTI_VDEV_CRASH_DUMP feature is
 	 * enabled.  If so, bump that feature's counter to indicate that the
 	 * feature is active. We also check the vdev type to handle the
 	 * following case:
 	 *   # zpool create test raidz disk1 disk2 disk3
 	 *   Now have spa_root_vdev->vdev_children == 1 (the raidz vdev),
 	 *   the raidz vdev itself has 3 children.
 	 */
 	if (vd->vdev_children > 1 || vd->vdev_ops == &vdev_raidz_ops) {
 		if (!spa_feature_is_enabled(spa,
 		    SPA_FEATURE_MULTI_VDEV_CRASH_DUMP))
 			return (SET_ERROR(ENOTSUP));
 		(void) dsl_sync_task(spa_name(spa),
 		    zfs_mvdev_dump_feature_check,
 		    zfs_mvdev_dump_activate_feature_sync, NULL,
 		    2, ZFS_SPACE_CHECK_RESERVED);
 	}
 
 	if (!resize) {
 		error = dsl_prop_get_integer(zv->zv_name,
 		    zfs_prop_to_name(ZFS_PROP_COMPRESSION), &compress, NULL);
 		if (error == 0) {
 			error = dsl_prop_get_integer(zv->zv_name,
 			    zfs_prop_to_name(ZFS_PROP_CHECKSUM), &checksum,
 			    NULL);
 		}
 		if (error == 0) {
 			error = dsl_prop_get_integer(zv->zv_name,
 			    zfs_prop_to_name(ZFS_PROP_REFRESERVATION),
 			    &refresrv, NULL);
 		}
 		if (error == 0) {
 			error = dsl_prop_get_integer(zv->zv_name,
 			    zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), &vbs,
 			    NULL);
 		}
 		if (version >= SPA_VERSION_DEDUP && error == 0) {
 			error = dsl_prop_get_integer(zv->zv_name,
 			    zfs_prop_to_name(ZFS_PROP_DEDUP), &dedup, NULL);
 		}
 	}
 	if (error != 0)
 		return (error);
 
 	tx = dmu_tx_create(os);
 	dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL);
 	dmu_tx_hold_bonus(tx, ZVOL_OBJ);
 	error = dmu_tx_assign(tx, TXG_WAIT);
 	if (error != 0) {
 		dmu_tx_abort(tx);
 		return (error);
 	}
 
 	/*
 	 * If we are resizing the dump device then we only need to
 	 * update the refreservation to match the newly updated
 	 * zvolsize. Otherwise, we save off the original state of the
 	 * zvol so that we can restore them if the zvol is ever undumpified.
 	 */
 	if (resize) {
 		error = zap_update(os, ZVOL_ZAP_OBJ,
 		    zfs_prop_to_name(ZFS_PROP_REFRESERVATION), 8, 1,
 		    &zv->zv_volsize, tx);
 	} else {
 		error = zap_update(os, ZVOL_ZAP_OBJ,
 		    zfs_prop_to_name(ZFS_PROP_COMPRESSION), 8, 1,
 		    &compress, tx);
 		if (error == 0) {
 			error = zap_update(os, ZVOL_ZAP_OBJ,
 			    zfs_prop_to_name(ZFS_PROP_CHECKSUM), 8, 1,
 			    &checksum, tx);
 		}
 		if (error == 0) {
 			error = zap_update(os, ZVOL_ZAP_OBJ,
 			    zfs_prop_to_name(ZFS_PROP_REFRESERVATION), 8, 1,
 			    &refresrv, tx);
 		}
 		if (error == 0) {
 			error = zap_update(os, ZVOL_ZAP_OBJ,
 			    zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), 8, 1,
 			    &vbs, tx);
 		}
 		if (error == 0) {
 			error = dmu_object_set_blocksize(
 			    os, ZVOL_OBJ, SPA_OLD_MAXBLOCKSIZE, 0, tx);
 		}
 		if (version >= SPA_VERSION_DEDUP && error == 0) {
 			error = zap_update(os, ZVOL_ZAP_OBJ,
 			    zfs_prop_to_name(ZFS_PROP_DEDUP), 8, 1,
 			    &dedup, tx);
 		}
 		if (error == 0)
 			zv->zv_volblocksize = SPA_OLD_MAXBLOCKSIZE;
 	}
 	dmu_tx_commit(tx);
 
 	/*
 	 * We only need update the zvol's property if we are initializing
 	 * the dump area for the first time.
 	 */
 	if (error == 0 && !resize) {
 		/*
 		 * If MULTI_VDEV_CRASH_DUMP is active, use the NOPARITY checksum
 		 * function.  Otherwise, use the old default -- OFF.
 		 */
 		checksum = spa_feature_is_active(spa,
 		    SPA_FEATURE_MULTI_VDEV_CRASH_DUMP) ? ZIO_CHECKSUM_NOPARITY :
 		    ZIO_CHECKSUM_OFF;
 
 		VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0);
 		VERIFY(nvlist_add_uint64(nv,
 		    zfs_prop_to_name(ZFS_PROP_REFRESERVATION), 0) == 0);
 		VERIFY(nvlist_add_uint64(nv,
 		    zfs_prop_to_name(ZFS_PROP_COMPRESSION),
 		    ZIO_COMPRESS_OFF) == 0);
 		VERIFY(nvlist_add_uint64(nv,
 		    zfs_prop_to_name(ZFS_PROP_CHECKSUM),
 		    checksum) == 0);
 		if (version >= SPA_VERSION_DEDUP) {
 			VERIFY(nvlist_add_uint64(nv,
 			    zfs_prop_to_name(ZFS_PROP_DEDUP),
 			    ZIO_CHECKSUM_OFF) == 0);
 		}
 
 		error = zfs_set_prop_nvlist(zv->zv_name, ZPROP_SRC_LOCAL,
 		    nv, NULL);
 		nvlist_free(nv);
 	}
 
 	/* Allocate the space for the dump */
 	if (error == 0)
 		error = zvol_prealloc(zv);
 	return (error);
 }
 
 static int
 zvol_dumpify(zvol_state_t *zv)
 {
 	int error = 0;
 	uint64_t dumpsize = 0;
 	dmu_tx_t *tx;
 	objset_t *os = zv->zv_objset;
 
 	if (zv->zv_flags & ZVOL_RDONLY)
 		return (SET_ERROR(EROFS));
 
 	if (zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ, ZVOL_DUMPSIZE,
 	    8, 1, &dumpsize) != 0 || dumpsize != zv->zv_volsize) {
 		boolean_t resize = (dumpsize > 0);
 
 		if ((error = zvol_dump_init(zv, resize)) != 0) {
 			(void) zvol_dump_fini(zv);
 			return (error);
 		}
 	}
 
 	/*
 	 * Build up our lba mapping.
 	 */
 	error = zvol_get_lbas(zv);
 	if (error) {
 		(void) zvol_dump_fini(zv);
 		return (error);
 	}
 
 	tx = dmu_tx_create(os);
 	dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL);
 	error = dmu_tx_assign(tx, TXG_WAIT);
 	if (error) {
 		dmu_tx_abort(tx);
 		(void) zvol_dump_fini(zv);
 		return (error);
 	}
 
 	zv->zv_flags |= ZVOL_DUMPIFIED;
 	error = zap_update(os, ZVOL_ZAP_OBJ, ZVOL_DUMPSIZE, 8, 1,
 	    &zv->zv_volsize, tx);
 	dmu_tx_commit(tx);
 
 	if (error) {
 		(void) zvol_dump_fini(zv);
 		return (error);
 	}
 
 	txg_wait_synced(dmu_objset_pool(os), 0);
 	return (0);
 }
 
 static int
 zvol_dump_fini(zvol_state_t *zv)
 {
 	dmu_tx_t *tx;
 	objset_t *os = zv->zv_objset;
 	nvlist_t *nv;
 	int error = 0;
 	uint64_t checksum, compress, refresrv, vbs, dedup;
 	uint64_t version = spa_version(dmu_objset_spa(zv->zv_objset));
 
 	/*
 	 * Attempt to restore the zvol back to its pre-dumpified state.
 	 * This is a best-effort attempt as it's possible that not all
 	 * of these properties were initialized during the dumpify process
 	 * (i.e. error during zvol_dump_init).
 	 */
 
 	tx = dmu_tx_create(os);
 	dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL);
 	error = dmu_tx_assign(tx, TXG_WAIT);
 	if (error) {
 		dmu_tx_abort(tx);
 		return (error);
 	}
 	(void) zap_remove(os, ZVOL_ZAP_OBJ, ZVOL_DUMPSIZE, tx);
 	dmu_tx_commit(tx);
 
 	(void) zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ,
 	    zfs_prop_to_name(ZFS_PROP_CHECKSUM), 8, 1, &checksum);
 	(void) zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ,
 	    zfs_prop_to_name(ZFS_PROP_COMPRESSION), 8, 1, &compress);
 	(void) zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ,
 	    zfs_prop_to_name(ZFS_PROP_REFRESERVATION), 8, 1, &refresrv);
 	(void) zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ,
 	    zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), 8, 1, &vbs);
 
 	VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0);
 	(void) nvlist_add_uint64(nv,
 	    zfs_prop_to_name(ZFS_PROP_CHECKSUM), checksum);
 	(void) nvlist_add_uint64(nv,
 	    zfs_prop_to_name(ZFS_PROP_COMPRESSION), compress);
 	(void) nvlist_add_uint64(nv,
 	    zfs_prop_to_name(ZFS_PROP_REFRESERVATION), refresrv);
 	if (version >= SPA_VERSION_DEDUP &&
 	    zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ,
 	    zfs_prop_to_name(ZFS_PROP_DEDUP), 8, 1, &dedup) == 0) {
 		(void) nvlist_add_uint64(nv,
 		    zfs_prop_to_name(ZFS_PROP_DEDUP), dedup);
 	}
 	(void) zfs_set_prop_nvlist(zv->zv_name, ZPROP_SRC_LOCAL,
 	    nv, NULL);
 	nvlist_free(nv);
 
 	zvol_free_extents(zv);
 	zv->zv_flags &= ~ZVOL_DUMPIFIED;
 	(void) dmu_free_long_range(os, ZVOL_OBJ, 0, DMU_OBJECT_END);
 	/* wait for dmu_free_long_range to actually free the blocks */
 	txg_wait_synced(dmu_objset_pool(zv->zv_objset), 0);
 	tx = dmu_tx_create(os);
 	dmu_tx_hold_bonus(tx, ZVOL_OBJ);
 	error = dmu_tx_assign(tx, TXG_WAIT);
 	if (error) {
 		dmu_tx_abort(tx);
 		return (error);
 	}
 	if (dmu_object_set_blocksize(os, ZVOL_OBJ, vbs, 0, tx) == 0)
 		zv->zv_volblocksize = vbs;
 	dmu_tx_commit(tx);
 
 	return (0);
 }
 #else	/* !illumos */
 
 static void
 zvol_geom_run(zvol_state_t *zv)
 {
 	struct g_provider *pp;
 
 	pp = zv->zv_provider;
 	g_error_provider(pp, 0);
 
 	kproc_kthread_add(zvol_geom_worker, zv, &zfsproc, NULL, 0, 0,
 	    "zfskern", "zvol %s", pp->name + sizeof(ZVOL_DRIVER));
 }
 
 static void
 zvol_geom_destroy(zvol_state_t *zv)
 {
 	struct g_provider *pp;
 
 	g_topology_assert();
 
 	mtx_lock(&zv->zv_queue_mtx);
 	zv->zv_state = 1;
 	wakeup_one(&zv->zv_queue);
 	while (zv->zv_state != 2)
 		msleep(&zv->zv_state, &zv->zv_queue_mtx, 0, "zvol:w", 0);
 	mtx_destroy(&zv->zv_queue_mtx);
 
 	pp = zv->zv_provider;
 	zv->zv_provider = NULL;
 	pp->private = NULL;
 	g_wither_geom(pp->geom, ENXIO);
 }
 
 static int
 zvol_geom_access(struct g_provider *pp, int acr, int acw, int ace)
 {
 	int count, error, flags;
 
 	g_topology_assert();
 
 	/*
 	 * To make it easier we expect either open or close, but not both
 	 * at the same time.
 	 */
 	KASSERT((acr >= 0 && acw >= 0 && ace >= 0) ||
 	    (acr <= 0 && acw <= 0 && ace <= 0),
 	    ("Unsupported access request to %s (acr=%d, acw=%d, ace=%d).",
 	    pp->name, acr, acw, ace));
 
 	if (pp->private == NULL) {
 		if (acr <= 0 && acw <= 0 && ace <= 0)
 			return (0);
 		return (pp->error);
 	}
 
 	/*
 	 * We don't pass FEXCL flag to zvol_open()/zvol_close() if ace != 0,
 	 * because GEOM already handles that and handles it a bit differently.
 	 * GEOM allows for multiple read/exclusive consumers and ZFS allows
 	 * only one exclusive consumer, no matter if it is reader or writer.
 	 * I like better the way GEOM works so I'll leave it for GEOM to
 	 * decide what to do.
 	 */
 
 	count = acr + acw + ace;
 	if (count == 0)
 		return (0);
 
 	flags = 0;
 	if (acr != 0 || ace != 0)
 		flags |= FREAD;
 	if (acw != 0)
 		flags |= FWRITE;
 
 	g_topology_unlock();
 	if (count > 0)
 		error = zvol_open(pp, flags, count);
 	else
 		error = zvol_close(pp, flags, -count);
 	g_topology_lock();
 	return (error);
 }
 
 static void
 zvol_geom_start(struct bio *bp)
 {
 	zvol_state_t *zv;
 	boolean_t first;
 
 	zv = bp->bio_to->private;
 	ASSERT(zv != NULL);
 	switch (bp->bio_cmd) {
 	case BIO_FLUSH:
 		if (!THREAD_CAN_SLEEP())
 			goto enqueue;
 		zil_commit(zv->zv_zilog, ZVOL_OBJ);
 		g_io_deliver(bp, 0);
 		break;
 	case BIO_READ:
 	case BIO_WRITE:
 	case BIO_DELETE:
 		if (!THREAD_CAN_SLEEP())
 			goto enqueue;
 		zvol_strategy(bp);
 		break;
 	case BIO_GETATTR: {
 		spa_t *spa = dmu_objset_spa(zv->zv_objset);
 		uint64_t refd, avail, usedobjs, availobjs, val;
 
 		if (g_handleattr_int(bp, "GEOM::candelete", 1))
 			return;
 		if (strcmp(bp->bio_attribute, "blocksavail") == 0) {
 			dmu_objset_space(zv->zv_objset, &refd, &avail,
 			    &usedobjs, &availobjs);
 			if (g_handleattr_off_t(bp, "blocksavail",
 			    avail / DEV_BSIZE))
 				return;
 		} else if (strcmp(bp->bio_attribute, "blocksused") == 0) {
 			dmu_objset_space(zv->zv_objset, &refd, &avail,
 			    &usedobjs, &availobjs);
 			if (g_handleattr_off_t(bp, "blocksused",
 			    refd / DEV_BSIZE))
 				return;
 		} else if (strcmp(bp->bio_attribute, "poolblocksavail") == 0) {
 			avail = metaslab_class_get_space(spa_normal_class(spa));
 			avail -= metaslab_class_get_alloc(spa_normal_class(spa));
 			if (g_handleattr_off_t(bp, "poolblocksavail",
 			    avail / DEV_BSIZE))
 				return;
 		} else if (strcmp(bp->bio_attribute, "poolblocksused") == 0) {
 			refd = metaslab_class_get_alloc(spa_normal_class(spa));
 			if (g_handleattr_off_t(bp, "poolblocksused",
 			    refd / DEV_BSIZE))
 				return;
 		}
 		/* FALLTHROUGH */
 	}
 	default:
 		g_io_deliver(bp, EOPNOTSUPP);
 		break;
 	}
 	return;
 
 enqueue:
 	mtx_lock(&zv->zv_queue_mtx);
 	first = (bioq_first(&zv->zv_queue) == NULL);
 	bioq_insert_tail(&zv->zv_queue, bp);
 	mtx_unlock(&zv->zv_queue_mtx);
 	if (first)
 		wakeup_one(&zv->zv_queue);
 }
 
 static void
 zvol_geom_worker(void *arg)
 {
 	zvol_state_t *zv;
 	struct bio *bp;
 
 	thread_lock(curthread);
 	sched_prio(curthread, PRIBIO);
 	thread_unlock(curthread);
 
 	zv = arg;
 	for (;;) {
 		mtx_lock(&zv->zv_queue_mtx);
 		bp = bioq_takefirst(&zv->zv_queue);
 		if (bp == NULL) {
 			if (zv->zv_state == 1) {
 				zv->zv_state = 2;
 				wakeup(&zv->zv_state);
 				mtx_unlock(&zv->zv_queue_mtx);
 				kthread_exit();
 			}
 			msleep(&zv->zv_queue, &zv->zv_queue_mtx, PRIBIO | PDROP,
 			    "zvol:io", 0);
 			continue;
 		}
 		mtx_unlock(&zv->zv_queue_mtx);
 		switch (bp->bio_cmd) {
 		case BIO_FLUSH:
 			zil_commit(zv->zv_zilog, ZVOL_OBJ);
 			g_io_deliver(bp, 0);
 			break;
 		case BIO_READ:
 		case BIO_WRITE:
 		case BIO_DELETE:
 			zvol_strategy(bp);
 			break;
 		default:
 			g_io_deliver(bp, EOPNOTSUPP);
 			break;
 		}
 	}
 }
 
 extern boolean_t dataset_name_hidden(const char *name);
 
 static int
 zvol_create_snapshots(objset_t *os, const char *name)
 {
 	uint64_t cookie, obj;
 	char *sname;
 	int error, len;
 
 	cookie = obj = 0;
 	sname = kmem_alloc(MAXPATHLEN, KM_SLEEP);
 
 #if 0
 	(void) dmu_objset_find(name, dmu_objset_prefetch, NULL,
 	    DS_FIND_SNAPSHOTS);
 #endif
 
 	for (;;) {
 		len = snprintf(sname, MAXPATHLEN, "%s@", name);
 		if (len >= MAXPATHLEN) {
 			dmu_objset_rele(os, FTAG);
 			error = ENAMETOOLONG;
 			break;
 		}
 
 		dsl_pool_config_enter(dmu_objset_pool(os), FTAG);
 		error = dmu_snapshot_list_next(os, MAXPATHLEN - len,
 		    sname + len, &obj, &cookie, NULL);
 		dsl_pool_config_exit(dmu_objset_pool(os), FTAG);
 		if (error != 0) {
 			if (error == ENOENT)
 				error = 0;
 			break;
 		}
 
 		error = zvol_create_minor(sname);
 		if (error != 0 && error != EEXIST) {
 			printf("ZFS WARNING: Unable to create ZVOL %s (error=%d).\n",
 			    sname, error);
 			break;
 		}
 	}
 
 	kmem_free(sname, MAXPATHLEN);
 	return (error);
 }
 
 int
 zvol_create_minors(const char *name)
 {
 	uint64_t cookie;
 	objset_t *os;
 	char *osname, *p;
 	int error, len;
 
 	if (dataset_name_hidden(name))
 		return (0);
 
 	if ((error = dmu_objset_hold(name, FTAG, &os)) != 0) {
 		printf("ZFS WARNING: Unable to put hold on %s (error=%d).\n",
 		    name, error);
 		return (error);
 	}
 	if (dmu_objset_type(os) == DMU_OST_ZVOL) {
 		dsl_dataset_long_hold(os->os_dsl_dataset, FTAG);
 		dsl_pool_rele(dmu_objset_pool(os), FTAG);
 		error = zvol_create_minor(name);
 		if (error == 0 || error == EEXIST) {
 			error = zvol_create_snapshots(os, name);
 		} else {
 			printf("ZFS WARNING: Unable to create ZVOL %s (error=%d).\n",
 			    name, error);
 		}
 		dsl_dataset_long_rele(os->os_dsl_dataset, FTAG);
 		dsl_dataset_rele(os->os_dsl_dataset, FTAG);
 		return (error);
 	}
 	if (dmu_objset_type(os) != DMU_OST_ZFS) {
 		dmu_objset_rele(os, FTAG);
 		return (0);
 	}
 
 	osname = kmem_alloc(MAXPATHLEN, KM_SLEEP);
 	if (snprintf(osname, MAXPATHLEN, "%s/", name) >= MAXPATHLEN) {
 		dmu_objset_rele(os, FTAG);
 		kmem_free(osname, MAXPATHLEN);
 		return (ENOENT);
 	}
 	p = osname + strlen(osname);
 	len = MAXPATHLEN - (p - osname);
 
 #if 0
 	/* Prefetch the datasets. */
 	cookie = 0;
 	while (dmu_dir_list_next(os, len, p, NULL, &cookie) == 0) {
 		if (!dataset_name_hidden(osname))
 			(void) dmu_objset_prefetch(osname, NULL);
 	}
 #endif
 
 	cookie = 0;
 	while (dmu_dir_list_next(os, MAXPATHLEN - (p - osname), p, NULL,
 	    &cookie) == 0) {
 		dmu_objset_rele(os, FTAG);
 		(void)zvol_create_minors(osname);
 		if ((error = dmu_objset_hold(name, FTAG, &os)) != 0) {
 			printf("ZFS WARNING: Unable to put hold on %s (error=%d).\n",
 			    name, error);
 			return (error);
 		}
 	}
 
 	dmu_objset_rele(os, FTAG);
 	kmem_free(osname, MAXPATHLEN);
 	return (0);
 }
 
 static void
 zvol_rename_minor(zvol_state_t *zv, const char *newname)
 {
 	struct g_geom *gp;
 	struct g_provider *pp;
 	struct cdev *dev;
 
 	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
 
 	if (zv->zv_volmode == ZFS_VOLMODE_GEOM) {
 		g_topology_lock();
 		pp = zv->zv_provider;
 		ASSERT(pp != NULL);
 		gp = pp->geom;
 		ASSERT(gp != NULL);
 
 		zv->zv_provider = NULL;
 		g_wither_provider(pp, ENXIO);
 
 		pp = g_new_providerf(gp, "%s/%s", ZVOL_DRIVER, newname);
 		pp->flags |= G_PF_DIRECT_RECEIVE | G_PF_DIRECT_SEND;
 		pp->sectorsize = DEV_BSIZE;
 		pp->mediasize = zv->zv_volsize;
 		pp->private = zv;
 		zv->zv_provider = pp;
 		g_error_provider(pp, 0);
 		g_topology_unlock();
 	} else if (zv->zv_volmode == ZFS_VOLMODE_DEV) {
 		struct make_dev_args args;
 
 		if ((dev = zv->zv_dev) != NULL) {
 			zv->zv_dev = NULL;
 			destroy_dev(dev);
 			if (zv->zv_total_opens > 0) {
 				zv->zv_flags &= ~ZVOL_EXCL;
 				zv->zv_total_opens = 0;
 				zvol_last_close(zv);
 			}
 		}
 
 		make_dev_args_init(&args);
 		args.mda_flags = MAKEDEV_CHECKNAME | MAKEDEV_WAITOK;
 		args.mda_devsw = &zvol_cdevsw;
 		args.mda_cr = NULL;
 		args.mda_uid = UID_ROOT;
 		args.mda_gid = GID_OPERATOR;
 		args.mda_mode = 0640;
 		args.mda_si_drv2 = zv;
 		if (make_dev_s(&args, &zv->zv_dev,
 		    "%s/%s", ZVOL_DRIVER, newname) == 0)
 			zv->zv_dev->si_iosize_max = MAXPHYS;
 	}
 	strlcpy(zv->zv_name, newname, sizeof(zv->zv_name));
 }
 
 void
 zvol_rename_minors(const char *oldname, const char *newname)
 {
 	char name[MAXPATHLEN];
 	struct g_provider *pp;
 	struct g_geom *gp;
 	size_t oldnamelen, newnamelen;
 	zvol_state_t *zv;
 	char *namebuf;
 	boolean_t locked = B_FALSE;
 
 	oldnamelen = strlen(oldname);
 	newnamelen = strlen(newname);
 
 	DROP_GIANT();
 	/* See comment in zvol_open(). */
 	if (!MUTEX_HELD(&zfsdev_state_lock)) {
 		mutex_enter(&zfsdev_state_lock);
 		locked = B_TRUE;
 	}
 
 	LIST_FOREACH(zv, &all_zvols, zv_links) {
 		if (strcmp(zv->zv_name, oldname) == 0) {
 			zvol_rename_minor(zv, newname);
 		} else if (strncmp(zv->zv_name, oldname, oldnamelen) == 0 &&
 		    (zv->zv_name[oldnamelen] == '/' ||
 		     zv->zv_name[oldnamelen] == '@')) {
 			snprintf(name, sizeof(name), "%s%c%s", newname,
 			    zv->zv_name[oldnamelen],
 			    zv->zv_name + oldnamelen + 1);
 			zvol_rename_minor(zv, name);
 		}
 	}
 
 	if (locked)
 		mutex_exit(&zfsdev_state_lock);
 	PICKUP_GIANT();
 }
 
 static int
 zvol_d_open(struct cdev *dev, int flags, int fmt, struct thread *td)
 {
 	zvol_state_t *zv = dev->si_drv2;
 	int err = 0;
 
 	mutex_enter(&zfsdev_state_lock);
 	if (zv->zv_total_opens == 0)
 		err = zvol_first_open(zv);
 	if (err) {
 		mutex_exit(&zfsdev_state_lock);
 		return (err);
 	}
 	if ((flags & FWRITE) && (zv->zv_flags & ZVOL_RDONLY)) {
 		err = SET_ERROR(EROFS);
 		goto out;
 	}
 	if (zv->zv_flags & ZVOL_EXCL) {
 		err = SET_ERROR(EBUSY);
 		goto out;
 	}
 #ifdef FEXCL
 	if (flags & FEXCL) {
 		if (zv->zv_total_opens != 0) {
 			err = SET_ERROR(EBUSY);
 			goto out;
 		}
 		zv->zv_flags |= ZVOL_EXCL;
 	}
 #endif
 
 	zv->zv_total_opens++;
 	if (flags & (FSYNC | FDSYNC)) {
 		zv->zv_sync_cnt++;
 		if (zv->zv_sync_cnt == 1)
 			zil_async_to_sync(zv->zv_zilog, ZVOL_OBJ);
 	}
 	mutex_exit(&zfsdev_state_lock);
 	return (err);
 out:
 	if (zv->zv_total_opens == 0)
 		zvol_last_close(zv);
 	mutex_exit(&zfsdev_state_lock);
 	return (err);
 }
 
 static int
 zvol_d_close(struct cdev *dev, int flags, int fmt, struct thread *td)
 {
 	zvol_state_t *zv = dev->si_drv2;
 
 	mutex_enter(&zfsdev_state_lock);
 	if (zv->zv_flags & ZVOL_EXCL) {
 		ASSERT(zv->zv_total_opens == 1);
 		zv->zv_flags &= ~ZVOL_EXCL;
 	}
 
 	/*
 	 * If the open count is zero, this is a spurious close.
 	 * That indicates a bug in the kernel / DDI framework.
 	 */
 	ASSERT(zv->zv_total_opens != 0);
 
 	/*
 	 * You may get multiple opens, but only one close.
 	 */
 	zv->zv_total_opens--;
 	if (flags & (FSYNC | FDSYNC))
 		zv->zv_sync_cnt--;
 
 	if (zv->zv_total_opens == 0)
 		zvol_last_close(zv);
 
 	mutex_exit(&zfsdev_state_lock);
 	return (0);
 }
 
 static int
 zvol_d_ioctl(struct cdev *dev, u_long cmd, caddr_t data, int fflag, struct thread *td)
 {
 	zvol_state_t *zv;
 	rl_t *rl;
 	off_t offset, length;
 	int i, error;
 	boolean_t sync;
 
 	zv = dev->si_drv2;
 
 	error = 0;
 	KASSERT(zv->zv_total_opens > 0,
 	    ("Device with zero access count in zvol_d_ioctl"));
 
 	i = IOCPARM_LEN(cmd);
 	switch (cmd) {
 	case DIOCGSECTORSIZE:
 		*(u_int *)data = DEV_BSIZE;
 		break;
 	case DIOCGMEDIASIZE:
 		*(off_t *)data = zv->zv_volsize;
 		break;
 	case DIOCGFLUSH:
 		zil_commit(zv->zv_zilog, ZVOL_OBJ);
 		break;
 	case DIOCGDELETE:
 		if (!zvol_unmap_enabled)
 			break;
 
 		offset = ((off_t *)data)[0];
 		length = ((off_t *)data)[1];
 		if ((offset % DEV_BSIZE) != 0 || (length % DEV_BSIZE) != 0 ||
 		    offset < 0 || offset >= zv->zv_volsize ||
 		    length <= 0) {
 			printf("%s: offset=%jd length=%jd\n", __func__, offset,
 			    length);
 			error = EINVAL;
 			break;
 		}
 
 		rl = zfs_range_lock(&zv->zv_znode, offset, length, RL_WRITER);
 		dmu_tx_t *tx = dmu_tx_create(zv->zv_objset);
 		error = dmu_tx_assign(tx, TXG_WAIT);
 		if (error != 0) {
 			sync = FALSE;
 			dmu_tx_abort(tx);
 		} else {
 			sync = (zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS);
 			zvol_log_truncate(zv, tx, offset, length, sync);
 			dmu_tx_commit(tx);
 			error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ,
 			    offset, length);
 		}
 		zfs_range_unlock(rl);
 		if (sync)
 			zil_commit(zv->zv_zilog, ZVOL_OBJ);
 		break;
 	case DIOCGSTRIPESIZE:
 		*(off_t *)data = zv->zv_volblocksize;
 		break;
 	case DIOCGSTRIPEOFFSET:
 		*(off_t *)data = 0;
 		break;
 	case DIOCGATTR: {
 		spa_t *spa = dmu_objset_spa(zv->zv_objset);
 		struct diocgattr_arg *arg = (struct diocgattr_arg *)data;
 		uint64_t refd, avail, usedobjs, availobjs;
 
 		if (strcmp(arg->name, "GEOM::candelete") == 0)
 			arg->value.i = 1;
 		else if (strcmp(arg->name, "blocksavail") == 0) {
 			dmu_objset_space(zv->zv_objset, &refd, &avail,
 			    &usedobjs, &availobjs);
 			arg->value.off = avail / DEV_BSIZE;
 		} else if (strcmp(arg->name, "blocksused") == 0) {
 			dmu_objset_space(zv->zv_objset, &refd, &avail,
 			    &usedobjs, &availobjs);
 			arg->value.off = refd / DEV_BSIZE;
 		} else if (strcmp(arg->name, "poolblocksavail") == 0) {
 			avail = metaslab_class_get_space(spa_normal_class(spa));
 			avail -= metaslab_class_get_alloc(spa_normal_class(spa));
 			arg->value.off = avail / DEV_BSIZE;
 		} else if (strcmp(arg->name, "poolblocksused") == 0) {
 			refd = metaslab_class_get_alloc(spa_normal_class(spa));
 			arg->value.off = refd / DEV_BSIZE;
 		} else
 			error = ENOIOCTL;
 		break;
 	}
 	case FIOSEEKHOLE:
 	case FIOSEEKDATA: {
 		off_t *off = (off_t *)data;
 		uint64_t noff;
 		boolean_t hole;
 
 		hole = (cmd == FIOSEEKHOLE);
 		noff = *off;
 		error = dmu_offset_next(zv->zv_objset, ZVOL_OBJ, hole, &noff);
 		*off = noff;
 		break;
 	}
 	default:
 		error = ENOIOCTL;
 	}
 
 	return (error);
 }
 #endif	/* illumos */
Index: projects/clang400-import/sys/cddl/contrib/opensolaris
===================================================================
--- projects/clang400-import/sys/cddl/contrib/opensolaris	(revision 314522)
+++ projects/clang400-import/sys/cddl/contrib/opensolaris	(revision 314523)

Property changes on: projects/clang400-import/sys/cddl/contrib/opensolaris
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /head/sys/cddl/contrib/opensolaris:r314420-314522
Index: projects/clang400-import/sys/compat/ia32/ia32_sysvec.c
===================================================================
--- projects/clang400-import/sys/compat/ia32/ia32_sysvec.c	(revision 314522)
+++ projects/clang400-import/sys/compat/ia32/ia32_sysvec.c	(revision 314523)
@@ -1,238 +1,235 @@
 /*-
  * Copyright (c) 2002 Doug Rabson
  * Copyright (c) 2003 Peter Wemm
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_compat.h"
 
 #define __ELF_WORD_SIZE 32
 
 #include <sys/param.h>
 #include <sys/exec.h>
 #include <sys/fcntl.h>
 #include <sys/imgact.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mutex.h>
 #include <sys/mman.h>
 #include <sys/namei.h>
 #include <sys/pioctl.h>
 #include <sys/proc.h>
 #include <sys/procfs.h>
 #include <sys/resourcevar.h>
 #include <sys/systm.h>
 #include <sys/signalvar.h>
 #include <sys/stat.h>
 #include <sys/sx.h>
 #include <sys/syscall.h>
 #include <sys/sysctl.h>
 #include <sys/sysent.h>
 #include <sys/vnode.h>
 #include <sys/imgact_elf.h>
 
 #include <vm/vm.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_extern.h>
 
 #include <compat/freebsd32/freebsd32_signal.h>
 #include <compat/freebsd32/freebsd32_util.h>
 #include <compat/freebsd32/freebsd32_proto.h>
 #include <compat/freebsd32/freebsd32_syscall.h>
 #include <compat/ia32/ia32_signal.h>
 #include <machine/frame.h>
 #include <machine/md_var.h>
 #include <machine/pcb.h>
 #include <machine/cpufunc.h>
 
 CTASSERT(sizeof(struct ia32_mcontext) == 640);
 CTASSERT(sizeof(struct ia32_ucontext) == 704);
 CTASSERT(sizeof(struct ia32_sigframe) == 800);
 CTASSERT(sizeof(struct siginfo32) == 64);
 #ifdef COMPAT_FREEBSD4
 CTASSERT(sizeof(struct ia32_mcontext4) == 260);
 CTASSERT(sizeof(struct ia32_ucontext4) == 324);
 CTASSERT(sizeof(struct ia32_sigframe4) == 408);
 #endif
 
 extern const char *freebsd32_syscallnames[];
 
 static SYSCTL_NODE(_compat, OID_AUTO, ia32, CTLFLAG_RW, 0, "ia32 mode");
 
 static u_long	ia32_maxdsiz = IA32_MAXDSIZ;
 SYSCTL_ULONG(_compat_ia32, OID_AUTO, maxdsiz, CTLFLAG_RWTUN, &ia32_maxdsiz, 0, "");
 u_long	ia32_maxssiz = IA32_MAXSSIZ;
 SYSCTL_ULONG(_compat_ia32, OID_AUTO, maxssiz, CTLFLAG_RWTUN, &ia32_maxssiz, 0, "");
 static u_long	ia32_maxvmem = IA32_MAXVMEM;
 SYSCTL_ULONG(_compat_ia32, OID_AUTO, maxvmem, CTLFLAG_RWTUN, &ia32_maxvmem, 0, "");
 
 struct sysentvec ia32_freebsd_sysvec = {
 	.sv_size	= FREEBSD32_SYS_MAXSYSCALL,
 	.sv_table	= freebsd32_sysent,
 	.sv_mask	= 0,
 	.sv_errsize	= 0,
 	.sv_errtbl	= NULL,
 	.sv_transtrap	= NULL,
 	.sv_fixup	= elf32_freebsd_fixup,
 	.sv_sendsig	= ia32_sendsig,
 	.sv_sigcode	= ia32_sigcode,
 	.sv_szsigcode	= &sz_ia32_sigcode,
 	.sv_name	= "FreeBSD ELF32",
 	.sv_coredump	= elf32_coredump,
 	.sv_imgact_try	= NULL,
 	.sv_minsigstksz	= MINSIGSTKSZ,
 	.sv_pagesize	= IA32_PAGE_SIZE,
 	.sv_minuser	= FREEBSD32_MINUSER,
 	.sv_maxuser	= FREEBSD32_MAXUSER,
 	.sv_usrstack	= FREEBSD32_USRSTACK,
 	.sv_psstrings	= FREEBSD32_PS_STRINGS,
 	.sv_stackprot	= VM_PROT_ALL,
 	.sv_copyout_strings	= freebsd32_copyout_strings,
 	.sv_setregs	= ia32_setregs,
 	.sv_fixlimit	= ia32_fixlimit,
 	.sv_maxssiz	= &ia32_maxssiz,
-	.sv_flags	=
-#ifdef __amd64__
-			  SV_SHP | SV_TIMEKEEP |
-#endif
-			  SV_ABI_FREEBSD | SV_IA32 | SV_ILP32,
+	.sv_flags	= SV_ABI_FREEBSD | SV_IA32 | SV_ILP32 |
+			    SV_SHP | SV_TIMEKEEP,
 	.sv_set_syscall_retval = ia32_set_syscall_retval,
 	.sv_fetch_syscall_args = ia32_fetch_syscall_args,
 	.sv_syscallnames = freebsd32_syscallnames,
 	.sv_shared_page_base = FREEBSD32_SHAREDPAGE,
 	.sv_shared_page_len = PAGE_SIZE,
 	.sv_schedtail	= NULL,
 	.sv_thread_detach = NULL,
 	.sv_trap	= NULL,
 };
 INIT_SYSENTVEC(elf_ia32_sysvec, &ia32_freebsd_sysvec);
 
 static Elf32_Brandinfo ia32_brand_info = {
 	.brand		= ELFOSABI_FREEBSD,
 	.machine	= EM_386,
 	.compat_3_brand	= "FreeBSD",
 	.emul_path	= NULL,
 	.interp_path	= "/libexec/ld-elf.so.1",
 	.sysvec		= &ia32_freebsd_sysvec,
 	.interp_newpath	= "/libexec/ld-elf32.so.1",
 	.brand_note	= &elf32_freebsd_brandnote,
 	.flags		= BI_CAN_EXEC_DYN | BI_BRAND_NOTE
 };
 
 SYSINIT(ia32, SI_SUB_EXEC, SI_ORDER_MIDDLE,
 	(sysinit_cfunc_t) elf32_insert_brand_entry,
 	&ia32_brand_info);
 
 static Elf32_Brandinfo ia32_brand_oinfo = {
 	.brand		= ELFOSABI_FREEBSD,
 	.machine	= EM_386,
 	.compat_3_brand	= "FreeBSD",
 	.emul_path	= NULL,
 	.interp_path	= "/usr/libexec/ld-elf.so.1",
 	.sysvec		= &ia32_freebsd_sysvec,
 	.interp_newpath	= "/libexec/ld-elf32.so.1",
 	.brand_note	= &elf32_freebsd_brandnote,
 	.flags		= BI_CAN_EXEC_DYN | BI_BRAND_NOTE
 };
 
 SYSINIT(oia32, SI_SUB_EXEC, SI_ORDER_ANY,
 	(sysinit_cfunc_t) elf32_insert_brand_entry,
 	&ia32_brand_oinfo);
 
 static Elf32_Brandinfo kia32_brand_info = {
 	.brand		= ELFOSABI_FREEBSD,
 	.machine	= EM_386,
 	.compat_3_brand	= "FreeBSD",
 	.emul_path	= NULL,
 	.interp_path	= "/lib/ld.so.1",
 	.sysvec		= &ia32_freebsd_sysvec,
 	.brand_note	= &elf32_kfreebsd_brandnote,
 	.flags		= BI_CAN_EXEC_DYN | BI_BRAND_NOTE_MANDATORY
 };
 
 SYSINIT(kia32, SI_SUB_EXEC, SI_ORDER_ANY,
 	(sysinit_cfunc_t) elf32_insert_brand_entry,
 	&kia32_brand_info);
 
 void
 elf32_dump_thread(struct thread *td, void *dst, size_t *off)
 {
 	void *buf;
 	size_t len;
 
 	len = 0;
 	if (use_xsave) {
 		if (dst != NULL) {
 			fpugetregs(td);
 			len += elf32_populate_note(NT_X86_XSTATE,
 			    get_pcb_user_save_td(td), dst,
 			    cpu_max_ext_state_size, &buf);
 			*(uint64_t *)((char *)buf + X86_XSTATE_XCR0_OFFSET) =
 			    xsave_mask;
 		} else
 			len += elf32_populate_note(NT_X86_XSTATE, NULL, NULL,
 			    cpu_max_ext_state_size, NULL);
 	}
 	*off = len;
 }
 
 void
 ia32_fixlimit(struct rlimit *rl, int which)
 {
 
 	switch (which) {
 	case RLIMIT_DATA:
 		if (ia32_maxdsiz != 0) {
 			if (rl->rlim_cur > ia32_maxdsiz)
 				rl->rlim_cur = ia32_maxdsiz;
 			if (rl->rlim_max > ia32_maxdsiz)
 				rl->rlim_max = ia32_maxdsiz;
 		}
 		break;
 	case RLIMIT_STACK:
 		if (ia32_maxssiz != 0) {
 			if (rl->rlim_cur > ia32_maxssiz)
 				rl->rlim_cur = ia32_maxssiz;
 			if (rl->rlim_max > ia32_maxssiz)
 				rl->rlim_max = ia32_maxssiz;
 		}
 		break;
 	case RLIMIT_VMEM:
 		if (ia32_maxvmem != 0) {
 			if (rl->rlim_cur > ia32_maxvmem)
 				rl->rlim_cur = ia32_maxvmem;
 			if (rl->rlim_max > ia32_maxvmem)
 				rl->rlim_max = ia32_maxvmem;
 		}
 		break;
 	}
 }
Index: projects/clang400-import/sys/dev/cxgbe/iw_cxgbe/mem.c
===================================================================
--- projects/clang400-import/sys/dev/cxgbe/iw_cxgbe/mem.c	(revision 314522)
+++ projects/clang400-import/sys/dev/cxgbe/iw_cxgbe/mem.c	(revision 314523)
@@ -1,845 +1,842 @@
 /*
  * Copyright (c) 2009-2013 Chelsio, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
  * General Public License (GPL) Version 2, available from the file
  * COPYING in the main directory of this source tree, or the
  * OpenIB.org BSD license below:
  *
  *     Redistribution and use in source and binary forms, with or
  *     without modification, are permitted provided that the following
  *     conditions are met:
  *
  *      - Redistributions of source code must retain the above
  *        copyright notice, this list of conditions and the following
  *        disclaimer.
  *
  *      - Redistributions in binary form must reproduce the above
  *        copyright notice, this list of conditions and the following
  *        disclaimer in the documentation and/or other materials
  *        provided with the distribution.
  *
  * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
  * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
  * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
  * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
  * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
  * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  */
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_inet.h"
 
 #ifdef TCP_OFFLOAD
 #include <linux/types.h>
 #include <linux/kref.h>
 #include <rdma/ib_umem.h>
 #include <asm/atomic.h>
 
 #include <common/t4_msg.h>
 #include "iw_cxgbe.h"
 
 #define T4_ULPTX_MIN_IO 32
 #define C4IW_MAX_INLINE_SIZE 96
 
 static int
 mr_exceeds_hw_limits(struct c4iw_dev *dev __unused, u64 length)
 {
 
 	return (length >= 8*1024*1024*1024ULL);
 }
 static int
 write_adapter_mem(struct c4iw_rdev *rdev, u32 addr, u32 len, void *data)
 {
 	struct adapter *sc = rdev->adap;
 	struct ulp_mem_io *ulpmc;
 	struct ulptx_idata *ulpsc;
 	u8 wr_len, *to_dp, *from_dp;
 	int copy_len, num_wqe, i, ret = 0;
 	struct c4iw_wr_wait wr_wait;
 	struct wrqe *wr;
 	u32 cmd;
 
 	cmd = cpu_to_be32(V_ULPTX_CMD(ULP_TX_MEM_WRITE));
 	if (is_t4(sc))
 		cmd |= cpu_to_be32(F_ULP_MEMIO_ORDER);
 	else
 		cmd |= cpu_to_be32(F_T5_ULP_MEMIO_IMM);
 
 	addr &= 0x7FFFFFF;
 	CTR3(KTR_IW_CXGBE, "%s addr 0x%x len %u", __func__, addr, len);
 	num_wqe = DIV_ROUND_UP(len, C4IW_MAX_INLINE_SIZE);
 	c4iw_init_wr_wait(&wr_wait);
 	for (i = 0; i < num_wqe; i++) {
 
 		copy_len = min(len, C4IW_MAX_INLINE_SIZE);
 		wr_len = roundup(sizeof *ulpmc + sizeof *ulpsc +
 				 roundup(copy_len, T4_ULPTX_MIN_IO), 16);
 
 		wr = alloc_wrqe(wr_len, &sc->sge.mgmtq);
 		if (wr == NULL)
 			return (0);
 		ulpmc = wrtod(wr);
 
 		memset(ulpmc, 0, wr_len);
 		INIT_ULPTX_WR(ulpmc, wr_len, 0, 0);
 
 		if (i == (num_wqe-1)) {
 			ulpmc->wr.wr_hi = cpu_to_be32(V_FW_WR_OP(FW_ULPTX_WR) |
 						    F_FW_WR_COMPL);
 			ulpmc->wr.wr_lo = (__force __be64)(unsigned long) &wr_wait;
 		} else
 			ulpmc->wr.wr_hi = cpu_to_be32(V_FW_WR_OP(FW_ULPTX_WR));
 		ulpmc->wr.wr_mid = cpu_to_be32(
 				       V_FW_WR_LEN16(DIV_ROUND_UP(wr_len, 16)));
 
 		ulpmc->cmd = cmd;
 		ulpmc->dlen = cpu_to_be32(V_ULP_MEMIO_DATA_LEN(
 		    DIV_ROUND_UP(copy_len, T4_ULPTX_MIN_IO)));
 		ulpmc->len16 = cpu_to_be32(DIV_ROUND_UP(wr_len-sizeof(ulpmc->wr),
 						      16));
 		ulpmc->lock_addr = cpu_to_be32(V_ULP_MEMIO_ADDR(addr + i * 3));
 
 		ulpsc = (struct ulptx_idata *)(ulpmc + 1);
 		ulpsc->cmd_more = cpu_to_be32(V_ULPTX_CMD(ULP_TX_SC_IMM));
 		ulpsc->len = cpu_to_be32(roundup(copy_len, T4_ULPTX_MIN_IO));
 
 		to_dp = (u8 *)(ulpsc + 1);
 		from_dp = (u8 *)data + i * C4IW_MAX_INLINE_SIZE;
 		if (data)
 			memcpy(to_dp, from_dp, copy_len);
 		else
 			memset(to_dp, 0, copy_len);
 		if (copy_len % T4_ULPTX_MIN_IO)
 			memset(to_dp + copy_len, 0, T4_ULPTX_MIN_IO -
 			       (copy_len % T4_ULPTX_MIN_IO));
 		t4_wrq_tx(sc, wr);
 		len -= C4IW_MAX_INLINE_SIZE;
 	}
 
 	ret = c4iw_wait_for_reply(rdev, &wr_wait, 0, 0, __func__);
 	return ret;
 }
 
 /*
  * Build and write a TPT entry.
  * IN: stag key, pdid, perm, bind_enabled, zbva, to, len, page_size,
  *     pbl_size and pbl_addr
  * OUT: stag index
  */
 static int write_tpt_entry(struct c4iw_rdev *rdev, u32 reset_tpt_entry,
 			   u32 *stag, u8 stag_state, u32 pdid,
 			   enum fw_ri_stag_type type, enum fw_ri_mem_perms perm,
 			   int bind_enabled, u32 zbva, u64 to,
 			   u64 len, u8 page_size, u32 pbl_size, u32 pbl_addr)
 {
 	int err;
 	struct fw_ri_tpte tpt;
 	u32 stag_idx;
 	static atomic_t key;
 
 	if (c4iw_fatal_error(rdev))
 		return -EIO;
 
 	stag_state = stag_state > 0;
 	stag_idx = (*stag) >> 8;
 
 	if ((!reset_tpt_entry) && (*stag == T4_STAG_UNSET)) {
 		stag_idx = c4iw_get_resource(&rdev->resource.tpt_table);
 		if (!stag_idx) {
 			mutex_lock(&rdev->stats.lock);
 			rdev->stats.stag.fail++;
 			mutex_unlock(&rdev->stats.lock);
 			return -ENOMEM;
 		}
 		mutex_lock(&rdev->stats.lock);
 		rdev->stats.stag.cur += 32;
 		if (rdev->stats.stag.cur > rdev->stats.stag.max)
 			rdev->stats.stag.max = rdev->stats.stag.cur;
 		mutex_unlock(&rdev->stats.lock);
 		*stag = (stag_idx << 8) | (atomic_inc_return(&key) & 0xff);
 	}
 	CTR5(KTR_IW_CXGBE,
 	    "%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x",
 	    __func__, stag_state, type, pdid, stag_idx);
 
 	/* write TPT entry */
 	if (reset_tpt_entry)
 		memset(&tpt, 0, sizeof(tpt));
 	else {
 		tpt.valid_to_pdid = cpu_to_be32(F_FW_RI_TPTE_VALID |
 			V_FW_RI_TPTE_STAGKEY((*stag & M_FW_RI_TPTE_STAGKEY)) |
 			V_FW_RI_TPTE_STAGSTATE(stag_state) |
 			V_FW_RI_TPTE_STAGTYPE(type) | V_FW_RI_TPTE_PDID(pdid));
 		tpt.locread_to_qpid = cpu_to_be32(V_FW_RI_TPTE_PERM(perm) |
 			(bind_enabled ? F_FW_RI_TPTE_MWBINDEN : 0) |
 			V_FW_RI_TPTE_ADDRTYPE((zbva ? FW_RI_ZERO_BASED_TO :
 						      FW_RI_VA_BASED_TO))|
 			V_FW_RI_TPTE_PS(page_size));
 		tpt.nosnoop_pbladdr = !pbl_size ? 0 : cpu_to_be32(
 			V_FW_RI_TPTE_PBLADDR(PBL_OFF(rdev, pbl_addr)>>3));
 		tpt.len_lo = cpu_to_be32((u32)(len & 0xffffffffUL));
 		tpt.va_hi = cpu_to_be32((u32)(to >> 32));
 		tpt.va_lo_fbo = cpu_to_be32((u32)(to & 0xffffffffUL));
 		tpt.dca_mwbcnt_pstag = cpu_to_be32(0);
 		tpt.len_hi = cpu_to_be32((u32)(len >> 32));
 	}
 	err = write_adapter_mem(rdev, stag_idx +
 				(rdev->adap->vres.stag.start >> 5),
 				sizeof(tpt), &tpt);
 
 	if (reset_tpt_entry) {
 		c4iw_put_resource(&rdev->resource.tpt_table, stag_idx);
 		mutex_lock(&rdev->stats.lock);
 		rdev->stats.stag.cur -= 32;
 		mutex_unlock(&rdev->stats.lock);
 	}
 	return err;
 }
 
 static int write_pbl(struct c4iw_rdev *rdev, __be64 *pbl,
 		     u32 pbl_addr, u32 pbl_size)
 {
 	int err;
 
 	CTR4(KTR_IW_CXGBE, "%s *pdb_addr 0x%x, pbl_base 0x%x, pbl_size %d",
 	     __func__, pbl_addr, rdev->adap->vres.pbl.start, pbl_size);
 
 	err = write_adapter_mem(rdev, pbl_addr >> 5, pbl_size << 3, pbl);
 	return err;
 }
 
 static int dereg_mem(struct c4iw_rdev *rdev, u32 stag, u32 pbl_size,
 		     u32 pbl_addr)
 {
 	return write_tpt_entry(rdev, 1, &stag, 0, 0, 0, 0, 0, 0, 0UL, 0, 0,
 			       pbl_size, pbl_addr);
 }
 
 static int allocate_window(struct c4iw_rdev *rdev, u32 * stag, u32 pdid)
 {
 	*stag = T4_STAG_UNSET;
 	return write_tpt_entry(rdev, 0, stag, 0, pdid, FW_RI_STAG_MW, 0, 0, 0,
 			       0UL, 0, 0, 0, 0);
 }
 
 static int deallocate_window(struct c4iw_rdev *rdev, u32 stag)
 {
 	return write_tpt_entry(rdev, 1, &stag, 0, 0, 0, 0, 0, 0, 0UL, 0, 0, 0,
 			       0);
 }
 
 static int allocate_stag(struct c4iw_rdev *rdev, u32 *stag, u32 pdid,
 			 u32 pbl_size, u32 pbl_addr)
 {
 	*stag = T4_STAG_UNSET;
 	return write_tpt_entry(rdev, 0, stag, 0, pdid, FW_RI_STAG_NSMR, 0, 0, 0,
 			       0UL, 0, 0, pbl_size, pbl_addr);
 }
 
 static int finish_mem_reg(struct c4iw_mr *mhp, u32 stag)
 {
 	u32 mmid;
 
 	mhp->attr.state = 1;
 	mhp->attr.stag = stag;
 	mmid = stag >> 8;
 	mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
 	CTR3(KTR_IW_CXGBE, "%s mmid 0x%x mhp %p", __func__, mmid, mhp);
 	return insert_handle(mhp->rhp, &mhp->rhp->mmidr, mhp, mmid);
 }
 
 static int register_mem(struct c4iw_dev *rhp, struct c4iw_pd *php,
 		      struct c4iw_mr *mhp, int shift)
 {
 	u32 stag = T4_STAG_UNSET;
 	int ret;
 
 	ret = write_tpt_entry(&rhp->rdev, 0, &stag, 1, mhp->attr.pdid,
 			      FW_RI_STAG_NSMR, mhp->attr.len ? mhp->attr.perms : 0,
 			      mhp->attr.mw_bind_enable, mhp->attr.zbva,
 			      mhp->attr.va_fbo, mhp->attr.len ? mhp->attr.len : -1, shift - 12,
 			      mhp->attr.pbl_size, mhp->attr.pbl_addr);
 	if (ret)
 		return ret;
 
 	ret = finish_mem_reg(mhp, stag);
 	if (ret)
 		dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size,
 		       mhp->attr.pbl_addr);
 	return ret;
 }
 
 static int reregister_mem(struct c4iw_dev *rhp, struct c4iw_pd *php,
 			  struct c4iw_mr *mhp, int shift, int npages)
 {
 	u32 stag;
 	int ret;
 
 	if (npages > mhp->attr.pbl_size)
 		return -ENOMEM;
 
 	stag = mhp->attr.stag;
 	ret = write_tpt_entry(&rhp->rdev, 0, &stag, 1, mhp->attr.pdid,
 			      FW_RI_STAG_NSMR, mhp->attr.perms,
 			      mhp->attr.mw_bind_enable, mhp->attr.zbva,
 			      mhp->attr.va_fbo, mhp->attr.len, shift - 12,
 			      mhp->attr.pbl_size, mhp->attr.pbl_addr);
 	if (ret)
 		return ret;
 
 	ret = finish_mem_reg(mhp, stag);
 	if (ret)
 		dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size,
 		       mhp->attr.pbl_addr);
 
 	return ret;
 }
 
 static int alloc_pbl(struct c4iw_mr *mhp, int npages)
 {
 	mhp->attr.pbl_addr = c4iw_pblpool_alloc(&mhp->rhp->rdev,
 						    npages << 3);
 
 	if (!mhp->attr.pbl_addr)
 		return -ENOMEM;
 
 	mhp->attr.pbl_size = npages;
 
 	return 0;
 }
 
 static int build_phys_page_list(struct ib_phys_buf *buffer_list,
 				int num_phys_buf, u64 *iova_start,
 				u64 *total_size, int *npages,
 				int *shift, __be64 **page_list)
 {
 	u64 mask;
 	int i, j, n;
 
 	mask = 0;
 	*total_size = 0;
 	for (i = 0; i < num_phys_buf; ++i) {
 		if (i != 0 && buffer_list[i].addr & ~PAGE_MASK)
 			return -EINVAL;
 		if (i != 0 && i != num_phys_buf - 1 &&
 		    (buffer_list[i].size & ~PAGE_MASK))
 			return -EINVAL;
 		*total_size += buffer_list[i].size;
 		if (i > 0)
 			mask |= buffer_list[i].addr;
 		else
 			mask |= buffer_list[i].addr & PAGE_MASK;
 		if (i != num_phys_buf - 1)
 			mask |= buffer_list[i].addr + buffer_list[i].size;
 		else
 			mask |= (buffer_list[i].addr + buffer_list[i].size +
 				PAGE_SIZE - 1) & PAGE_MASK;
 	}
 
-	if (*total_size > 0xFFFFFFFFULL)
-		return -ENOMEM;
-
 	/* Find largest page shift we can use to cover buffers */
 	for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift))
 		if ((1ULL << *shift) & mask)
 			break;
 
 	buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1);
 	buffer_list[0].addr &= ~0ull << *shift;
 
 	*npages = 0;
 	for (i = 0; i < num_phys_buf; ++i)
 		*npages += (buffer_list[i].size +
 			(1ULL << *shift) - 1) >> *shift;
 
 	if (!*npages)
 		return -EINVAL;
 
 	*page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL);
 	if (!*page_list)
 		return -ENOMEM;
 
 	n = 0;
 	for (i = 0; i < num_phys_buf; ++i)
 		for (j = 0;
 		     j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift;
 		     ++j)
 			(*page_list)[n++] = cpu_to_be64(buffer_list[i].addr +
 			    ((u64) j << *shift));
 
 	CTR6(KTR_IW_CXGBE,
 	    "%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d", __func__,
 	    (unsigned long long)*iova_start, (unsigned long long)mask, *shift,
 	    (unsigned long long)*total_size, *npages);
 
 	return 0;
 
 }
 
 int c4iw_reregister_phys_mem(struct ib_mr *mr, int mr_rereg_mask,
 			     struct ib_pd *pd, struct ib_phys_buf *buffer_list,
 			     int num_phys_buf, int acc, u64 *iova_start)
 {
 
 	struct c4iw_mr mh, *mhp;
 	struct c4iw_pd *php;
 	struct c4iw_dev *rhp;
 	__be64 *page_list = NULL;
 	int shift = 0;
 	u64 total_size = 0;
 	int npages = 0;
 	int ret;
 
 	CTR3(KTR_IW_CXGBE, "%s ib_mr %p ib_pd %p", __func__, mr, pd);
 
 	/* There can be no memory windows */
 	if (atomic_read(&mr->usecnt))
 		return -EINVAL;
 
 	mhp = to_c4iw_mr(mr);
 	rhp = mhp->rhp;
 	php = to_c4iw_pd(mr->pd);
 
 	/* make sure we are on the same adapter */
 	if (rhp != php->rhp)
 		return -EINVAL;
 
 	memcpy(&mh, mhp, sizeof *mhp);
 
 	if (mr_rereg_mask & IB_MR_REREG_PD)
 		php = to_c4iw_pd(pd);
 	if (mr_rereg_mask & IB_MR_REREG_ACCESS) {
 		mh.attr.perms = c4iw_ib_to_tpt_access(acc);
 		mh.attr.mw_bind_enable = (acc & IB_ACCESS_MW_BIND) ==
 					 IB_ACCESS_MW_BIND;
 	}
 	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
 		ret = build_phys_page_list(buffer_list, num_phys_buf,
 						iova_start,
 						&total_size, &npages,
 						&shift, &page_list);
 		if (ret)
 			return ret;
 	}
 	if (mr_exceeds_hw_limits(rhp, total_size)) {
 		kfree(page_list);
 		return -EINVAL;
 	}
 	ret = reregister_mem(rhp, php, &mh, shift, npages);
 	kfree(page_list);
 	if (ret)
 		return ret;
 	if (mr_rereg_mask & IB_MR_REREG_PD)
 		mhp->attr.pdid = php->pdid;
 	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
 		mhp->attr.perms = c4iw_ib_to_tpt_access(acc);
 	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
 		mhp->attr.zbva = 0;
 		mhp->attr.va_fbo = *iova_start;
 		mhp->attr.page_size = shift - 12;
 		mhp->attr.len = (u32) total_size;
 		mhp->attr.pbl_size = npages;
 	}
 
 	return 0;
 }
 
 struct ib_mr *c4iw_register_phys_mem(struct ib_pd *pd,
 				     struct ib_phys_buf *buffer_list,
 				     int num_phys_buf, int acc, u64 *iova_start)
 {
 	__be64 *page_list;
 	int shift;
 	u64 total_size;
 	int npages;
 	struct c4iw_dev *rhp;
 	struct c4iw_pd *php;
 	struct c4iw_mr *mhp;
 	int ret;
 
 	CTR2(KTR_IW_CXGBE, "%s ib_pd %p", __func__, pd);
 	php = to_c4iw_pd(pd);
 	rhp = php->rhp;
 
 	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
 	if (!mhp)
 		return ERR_PTR(-ENOMEM);
 
 	mhp->rhp = rhp;
 
 	/* First check that we have enough alignment */
 	if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) {
 		ret = -EINVAL;
 		goto err;
 	}
 
 	if (num_phys_buf > 1 &&
 	    ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) {
 		ret = -EINVAL;
 		goto err;
 	}
 
 	ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start,
 					&total_size, &npages, &shift,
 					&page_list);
 	if (ret)
 		goto err;
 
 	if (mr_exceeds_hw_limits(rhp, total_size)) {
 		kfree(page_list);
 		ret = -EINVAL;
 		goto err;
 	}
 	ret = alloc_pbl(mhp, npages);
 	if (ret) {
 		kfree(page_list);
 		goto err;
 	}
 
 	ret = write_pbl(&mhp->rhp->rdev, page_list, mhp->attr.pbl_addr,
 			     npages);
 	kfree(page_list);
 	if (ret)
 		goto err_pbl;
 
 	mhp->attr.pdid = php->pdid;
 	mhp->attr.zbva = 0;
 
 	mhp->attr.perms = c4iw_ib_to_tpt_access(acc);
 	mhp->attr.va_fbo = *iova_start;
 	mhp->attr.page_size = shift - 12;
 
 	mhp->attr.len = (u32) total_size;
 	mhp->attr.pbl_size = npages;
 	ret = register_mem(rhp, php, mhp, shift);
 	if (ret)
 		goto err_pbl;
 
 	return &mhp->ibmr;
 
 err_pbl:
 	c4iw_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr,
 			      mhp->attr.pbl_size << 3);
 
 err:
 	kfree(mhp);
 	return ERR_PTR(ret);
 
 }
 
 struct ib_mr *c4iw_get_dma_mr(struct ib_pd *pd, int acc)
 {
 	struct c4iw_dev *rhp;
 	struct c4iw_pd *php;
 	struct c4iw_mr *mhp;
 	int ret;
 	u32 stag = T4_STAG_UNSET;
 
 	CTR2(KTR_IW_CXGBE, "%s ib_pd %p", __func__, pd);
 	php = to_c4iw_pd(pd);
 	rhp = php->rhp;
 
 	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
 	if (!mhp)
 		return ERR_PTR(-ENOMEM);
 
 	mhp->rhp = rhp;
 	mhp->attr.pdid = php->pdid;
 	mhp->attr.perms = c4iw_ib_to_tpt_access(acc);
 	mhp->attr.mw_bind_enable = (acc&IB_ACCESS_MW_BIND) == IB_ACCESS_MW_BIND;
 	mhp->attr.zbva = 0;
 	mhp->attr.va_fbo = 0;
 	mhp->attr.page_size = 0;
 	mhp->attr.len = ~0UL;
 	mhp->attr.pbl_size = 0;
 
 	ret = write_tpt_entry(&rhp->rdev, 0, &stag, 1, php->pdid,
 			      FW_RI_STAG_NSMR, mhp->attr.perms,
 			      mhp->attr.mw_bind_enable, 0, 0, ~0UL, 0, 0, 0);
 	if (ret)
 		goto err1;
 
 	ret = finish_mem_reg(mhp, stag);
 	if (ret)
 		goto err2;
 	return &mhp->ibmr;
 err2:
 	dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size,
 		  mhp->attr.pbl_addr);
 err1:
 	kfree(mhp);
 	return ERR_PTR(ret);
 }
 
 struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
     u64 virt, int acc, struct ib_udata *udata, int mr_id)
 {
 	__be64 *pages;
 	int shift, n, len;
 	int i, k, entry;
 	int err = 0;
 	struct scatterlist *sg;
 	struct c4iw_dev *rhp;
 	struct c4iw_pd *php;
 	struct c4iw_mr *mhp;
 
 	CTR2(KTR_IW_CXGBE, "%s ib_pd %p", __func__, pd);
 
 	if (length == ~0ULL)
 		return ERR_PTR(-EINVAL);
 
 	if ((length + start) < start)
 		return ERR_PTR(-EINVAL);
 
 	php = to_c4iw_pd(pd);
 	rhp = php->rhp;
 
 	if (mr_exceeds_hw_limits(rhp, length))
 		return ERR_PTR(-EINVAL);
 
 	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
 	if (!mhp)
 		return ERR_PTR(-ENOMEM);
 
 	mhp->rhp = rhp;
 
 	mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
 	if (IS_ERR(mhp->umem)) {
 		err = PTR_ERR(mhp->umem);
 		kfree(mhp);
 		return ERR_PTR(err);
 	}
 
 	shift = ffs(mhp->umem->page_size) - 1;
 	
 	n = mhp->umem->nmap;
 	err = alloc_pbl(mhp, n);
 	if (err)
 		goto err;
 
 	pages = (__be64 *) __get_free_page(GFP_KERNEL);
 	if (!pages) {
 		err = -ENOMEM;
 		goto err_pbl;
 	}
 
 	i = n = 0;
 	for_each_sg(mhp->umem->sg_head.sgl, sg, mhp->umem->nmap, entry) {
 		len = sg_dma_len(sg) >> shift;
 		for (k = 0; k < len; ++k) {
 			pages[i++] = cpu_to_be64(sg_dma_address(sg) +
 					mhp->umem->page_size * k);
 			if (i == PAGE_SIZE / sizeof *pages) {
 				err = write_pbl(&mhp->rhp->rdev,
 						pages,
 						mhp->attr.pbl_addr + (n << 3), i);
 				if (err)
 					goto pbl_done;
 				n += i;
 				i = 0;
 
 			}
 		}
 	}
 
 	if (i)
 		err = write_pbl(&mhp->rhp->rdev, pages,
 				     mhp->attr.pbl_addr + (n << 3), i);
 
 pbl_done:
 	free_page((unsigned long) pages);
 	if (err)
 		goto err_pbl;
 
 	mhp->attr.pdid = php->pdid;
 	mhp->attr.zbva = 0;
 	mhp->attr.perms = c4iw_ib_to_tpt_access(acc);
 	mhp->attr.va_fbo = virt;
 	mhp->attr.page_size = shift - 12;
 	mhp->attr.len = length;
 
 	err = register_mem(rhp, php, mhp, shift);
 	if (err)
 		goto err_pbl;
 
 	return &mhp->ibmr;
 
 err_pbl:
 	c4iw_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr,
 			      mhp->attr.pbl_size << 3);
 
 err:
 	ib_umem_release(mhp->umem);
 	kfree(mhp);
 	return ERR_PTR(err);
 }
 
 struct ib_mw *c4iw_alloc_mw(struct ib_pd *pd, enum ib_mw_type type)
 {
 	struct c4iw_dev *rhp;
 	struct c4iw_pd *php;
 	struct c4iw_mw *mhp;
 	u32 mmid;
 	u32 stag = 0;
 	int ret;
 
 	php = to_c4iw_pd(pd);
 	rhp = php->rhp;
 	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
 	if (!mhp)
 		return ERR_PTR(-ENOMEM);
 	ret = allocate_window(&rhp->rdev, &stag, php->pdid);
 	if (ret) {
 		kfree(mhp);
 		return ERR_PTR(ret);
 	}
 	mhp->rhp = rhp;
 	mhp->attr.pdid = php->pdid;
 	mhp->attr.type = FW_RI_STAG_MW;
 	mhp->attr.stag = stag;
 	mmid = (stag) >> 8;
 	mhp->ibmw.rkey = stag;
 	if (insert_handle(rhp, &rhp->mmidr, mhp, mmid)) {
 		deallocate_window(&rhp->rdev, mhp->attr.stag);
 		kfree(mhp);
 		return ERR_PTR(-ENOMEM);
 	}
 	CTR4(KTR_IW_CXGBE, "%s mmid 0x%x mhp %p stag 0x%x", __func__, mmid, mhp,
 	    stag);
 	return &(mhp->ibmw);
 }
 
 int c4iw_dealloc_mw(struct ib_mw *mw)
 {
 	struct c4iw_dev *rhp;
 	struct c4iw_mw *mhp;
 	u32 mmid;
 
 	mhp = to_c4iw_mw(mw);
 	rhp = mhp->rhp;
 	mmid = (mw->rkey) >> 8;
 	remove_handle(rhp, &rhp->mmidr, mmid);
 	deallocate_window(&rhp->rdev, mhp->attr.stag);
 	kfree(mhp);
 	CTR4(KTR_IW_CXGBE, "%s ib_mw %p mmid 0x%x ptr %p", __func__, mw, mmid,
 	    mhp);
 	return 0;
 }
 
 struct ib_mr *c4iw_alloc_fast_reg_mr(struct ib_pd *pd, int pbl_depth)
 {
 	struct c4iw_dev *rhp;
 	struct c4iw_pd *php;
 	struct c4iw_mr *mhp;
 	u32 mmid;
 	u32 stag = 0;
 	int ret = 0;
 
 	php = to_c4iw_pd(pd);
 	rhp = php->rhp;
 	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
 	if (!mhp) {
 		ret = -ENOMEM;
 		goto err;
 	}
 
 	mhp->rhp = rhp;
 	ret = alloc_pbl(mhp, pbl_depth);
 	if (ret)
 		goto err1;
 	mhp->attr.pbl_size = pbl_depth;
 	ret = allocate_stag(&rhp->rdev, &stag, php->pdid,
 				 mhp->attr.pbl_size, mhp->attr.pbl_addr);
 	if (ret)
 		goto err2;
 	mhp->attr.pdid = php->pdid;
 	mhp->attr.type = FW_RI_STAG_NSMR;
 	mhp->attr.stag = stag;
 	mhp->attr.state = 1;
 	mmid = (stag) >> 8;
 	mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
 	if (insert_handle(rhp, &rhp->mmidr, mhp, mmid)) {
 		ret = -ENOMEM;
 		goto err3;
 	}
 
 	CTR4(KTR_IW_CXGBE, "%s mmid 0x%x mhp %p stag 0x%x", __func__, mmid, mhp,
 	    stag);
 	return &(mhp->ibmr);
 err3:
 	dereg_mem(&rhp->rdev, stag, mhp->attr.pbl_size,
 		       mhp->attr.pbl_addr);
 err2:
 	c4iw_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr,
 			      mhp->attr.pbl_size << 3);
 err1:
 	kfree(mhp);
 err:
 	return ERR_PTR(ret);
 }
 
 struct ib_fast_reg_page_list *c4iw_alloc_fastreg_pbl(struct ib_device *device,
 						     int page_list_len)
 {
 	struct c4iw_fr_page_list *c4pl;
 	struct c4iw_dev *dev = to_c4iw_dev(device);
 	bus_addr_t dma_addr;
 	int size = sizeof *c4pl + page_list_len * sizeof(u64);
 
 	c4pl = contigmalloc(size,
             M_DEVBUF, M_NOWAIT, 0ul, ~0ul, 4096, 0);
         if (c4pl)
                 dma_addr = vtophys(c4pl);
         else
                 return ERR_PTR(-ENOMEM);
 
 	pci_unmap_addr_set(c4pl, mapping, dma_addr);
 	c4pl->dma_addr = dma_addr;
 	c4pl->dev = dev;
 	c4pl->size = size;
 	c4pl->ibpl.page_list = (u64 *)(c4pl + 1);
 	c4pl->ibpl.max_page_list_len = page_list_len;
 
 	return &c4pl->ibpl;
 }
 
 void c4iw_free_fastreg_pbl(struct ib_fast_reg_page_list *ibpl)
 {
 	struct c4iw_fr_page_list *c4pl = to_c4iw_fr_page_list(ibpl);
 	contigfree(c4pl, c4pl->size, M_DEVBUF);
 }
 
 int c4iw_dereg_mr(struct ib_mr *ib_mr)
 {
 	struct c4iw_dev *rhp;
 	struct c4iw_mr *mhp;
 	u32 mmid;
 
 	CTR2(KTR_IW_CXGBE, "%s ib_mr %p", __func__, ib_mr);
 	/* There can be no memory windows */
 	if (atomic_read(&ib_mr->usecnt))
 		return -EINVAL;
 
 	mhp = to_c4iw_mr(ib_mr);
 	rhp = mhp->rhp;
 	mmid = mhp->attr.stag >> 8;
 	remove_handle(rhp, &rhp->mmidr, mmid);
 	dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size,
 		       mhp->attr.pbl_addr);
 	if (mhp->attr.pbl_size)
 		c4iw_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr,
 				  mhp->attr.pbl_size << 3);
 	if (mhp->kva)
 		kfree((void *) (unsigned long) mhp->kva);
 	if (mhp->umem)
 		ib_umem_release(mhp->umem);
 	CTR3(KTR_IW_CXGBE, "%s mmid 0x%x ptr %p", __func__, mmid, mhp);
 	kfree(mhp);
 	return 0;
 }
 #endif
Index: projects/clang400-import/sys/dev/hyperv/netvsc/hn_nvs.c
===================================================================
--- projects/clang400-import/sys/dev/hyperv/netvsc/hn_nvs.c	(revision 314522)
+++ projects/clang400-import/sys/dev/hyperv/netvsc/hn_nvs.c	(revision 314523)
@@ -1,735 +1,740 @@
 /*-
  * Copyright (c) 2009-2012,2016 Microsoft Corp.
  * Copyright (c) 2010-2012 Citrix Inc.
  * Copyright (c) 2012 NetApp Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice unmodified, this list of conditions, and the following
  *    disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 /*
  * Network Virtualization Service.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_inet6.h"
 #include "opt_inet.h"
 
 #include <sys/param.h>
 #include <sys/kernel.h>
 #include <sys/limits.h>
 #include <sys/socket.h>
 #include <sys/systm.h>
 #include <sys/taskqueue.h>
 
 #include <net/if.h>
 #include <net/if_var.h>
 #include <net/if_media.h>
 
 #include <netinet/in.h>
 #include <netinet/tcp_lro.h>
 
 #include <dev/hyperv/include/hyperv.h>
 #include <dev/hyperv/include/hyperv_busdma.h>
 #include <dev/hyperv/include/vmbus.h>
 #include <dev/hyperv/include/vmbus_xact.h>
 
 #include <dev/hyperv/netvsc/ndis.h>
 #include <dev/hyperv/netvsc/if_hnreg.h>
 #include <dev/hyperv/netvsc/if_hnvar.h>
 #include <dev/hyperv/netvsc/hn_nvs.h>
 
 static int			hn_nvs_conn_chim(struct hn_softc *);
 static int			hn_nvs_conn_rxbuf(struct hn_softc *);
 static void			hn_nvs_disconn_chim(struct hn_softc *);
 static void			hn_nvs_disconn_rxbuf(struct hn_softc *);
 static int			hn_nvs_conf_ndis(struct hn_softc *, int);
 static int			hn_nvs_init_ndis(struct hn_softc *);
 static int			hn_nvs_doinit(struct hn_softc *, uint32_t);
 static int			hn_nvs_init(struct hn_softc *);
 static const void		*hn_nvs_xact_execute(struct hn_softc *,
 				    struct vmbus_xact *, void *, int,
 				    size_t *, uint32_t);
 static void			hn_nvs_sent_none(struct hn_nvs_sendctx *,
 				    struct hn_softc *, struct vmbus_channel *,
 				    const void *, int);
 
 struct hn_nvs_sendctx		hn_nvs_sendctx_none =
     HN_NVS_SENDCTX_INITIALIZER(hn_nvs_sent_none, NULL);
 
 static const uint32_t		hn_nvs_version[] = {
 	HN_NVS_VERSION_5,
 	HN_NVS_VERSION_4,
 	HN_NVS_VERSION_2,
 	HN_NVS_VERSION_1
 };
 
 static const void *
 hn_nvs_xact_execute(struct hn_softc *sc, struct vmbus_xact *xact,
     void *req, int reqlen, size_t *resplen0, uint32_t type)
 {
 	struct hn_nvs_sendctx sndc;
 	size_t resplen, min_resplen = *resplen0;
 	const struct hn_nvs_hdr *hdr;
 	int error;
 
 	KASSERT(min_resplen >= sizeof(*hdr),
 	    ("invalid minimum response len %zu", min_resplen));
 
 	/*
 	 * Execute the xact setup by the caller.
 	 */
 	hn_nvs_sendctx_init(&sndc, hn_nvs_sent_xact, xact);
 
 	vmbus_xact_activate(xact);
 	error = hn_nvs_send(sc->hn_prichan, VMBUS_CHANPKT_FLAG_RC,
 	    req, reqlen, &sndc);
 	if (error) {
 		vmbus_xact_deactivate(xact);
 		return (NULL);
 	}
 	hdr = vmbus_chan_xact_wait(sc->hn_prichan, xact, &resplen,
 	    HN_CAN_SLEEP(sc));
 
 	/*
 	 * Check this NVS response message.
 	 */
 	if (resplen < min_resplen) {
 		if_printf(sc->hn_ifp, "invalid NVS resp len %zu\n", resplen);
 		return (NULL);
 	}
 	if (hdr->nvs_type != type) {
 		if_printf(sc->hn_ifp, "unexpected NVS resp 0x%08x, "
 		    "expect 0x%08x\n", hdr->nvs_type, type);
 		return (NULL);
 	}
 	/* All pass! */
 	*resplen0 = resplen;
 	return (hdr);
 }
 
 static __inline int
 hn_nvs_req_send(struct hn_softc *sc, void *req, int reqlen)
 {
 
 	return (hn_nvs_send(sc->hn_prichan, VMBUS_CHANPKT_FLAG_NONE,
 	    req, reqlen, &hn_nvs_sendctx_none));
 }
 
 static int 
 hn_nvs_conn_rxbuf(struct hn_softc *sc)
 {
 	struct vmbus_xact *xact = NULL;
 	struct hn_nvs_rxbuf_conn *conn;
 	const struct hn_nvs_rxbuf_connresp *resp;
 	size_t resp_len;
 	uint32_t status;
 	int error, rxbuf_size;
 
 	/*
 	 * Limit RXBUF size for old NVS.
 	 */
 	if (sc->hn_nvs_ver <= HN_NVS_VERSION_2)
 		rxbuf_size = HN_RXBUF_SIZE_COMPAT;
 	else
 		rxbuf_size = HN_RXBUF_SIZE;
 
 	/*
 	 * Connect the RXBUF GPADL to the primary channel.
 	 *
 	 * NOTE:
 	 * Only primary channel has RXBUF connected to it.  Sub-channels
 	 * just share this RXBUF.
 	 */
 	error = vmbus_chan_gpadl_connect(sc->hn_prichan,
 	    sc->hn_rxbuf_dma.hv_paddr, rxbuf_size, &sc->hn_rxbuf_gpadl);
 	if (error) {
 		if_printf(sc->hn_ifp, "rxbuf gpadl conn failed: %d\n",
 		    error);
 		goto cleanup;
 	}
 
 	/*
 	 * Connect RXBUF to NVS.
 	 */
 
 	xact = vmbus_xact_get(sc->hn_xact, sizeof(*conn));
 	if (xact == NULL) {
 		if_printf(sc->hn_ifp, "no xact for nvs rxbuf conn\n");
 		error = ENXIO;
 		goto cleanup;
 	}
 	conn = vmbus_xact_req_data(xact);
 	conn->nvs_type = HN_NVS_TYPE_RXBUF_CONN;
 	conn->nvs_gpadl = sc->hn_rxbuf_gpadl;
 	conn->nvs_sig = HN_NVS_RXBUF_SIG;
 
 	resp_len = sizeof(*resp);
 	resp = hn_nvs_xact_execute(sc, xact, conn, sizeof(*conn), &resp_len,
 	    HN_NVS_TYPE_RXBUF_CONNRESP);
 	if (resp == NULL) {
 		if_printf(sc->hn_ifp, "exec nvs rxbuf conn failed\n");
 		error = EIO;
 		goto cleanup;
 	}
 
 	status = resp->nvs_status;
 	vmbus_xact_put(xact);
 	xact = NULL;
 
 	if (status != HN_NVS_STATUS_OK) {
 		if_printf(sc->hn_ifp, "nvs rxbuf conn failed: %x\n", status);
 		error = EIO;
 		goto cleanup;
 	}
 	sc->hn_flags |= HN_FLAG_RXBUF_CONNECTED;
 
 	return (0);
 
 cleanup:
 	if (xact != NULL)
 		vmbus_xact_put(xact);
 	hn_nvs_disconn_rxbuf(sc);
 	return (error);
 }
 
 static int 
 hn_nvs_conn_chim(struct hn_softc *sc)
 {
 	struct vmbus_xact *xact = NULL;
 	struct hn_nvs_chim_conn *chim;
 	const struct hn_nvs_chim_connresp *resp;
 	size_t resp_len;
 	uint32_t status, sectsz;
 	int error;
 
 	/*
 	 * Connect chimney sending buffer GPADL to the primary channel.
 	 *
 	 * NOTE:
 	 * Only primary channel has chimney sending buffer connected to it.
 	 * Sub-channels just share this chimney sending buffer.
 	 */
 	error = vmbus_chan_gpadl_connect(sc->hn_prichan,
   	    sc->hn_chim_dma.hv_paddr, HN_CHIM_SIZE, &sc->hn_chim_gpadl);
 	if (error) {
 		if_printf(sc->hn_ifp, "chim gpadl conn failed: %d\n", error);
 		goto cleanup;
 	}
 
 	/*
 	 * Connect chimney sending buffer to NVS
 	 */
 
 	xact = vmbus_xact_get(sc->hn_xact, sizeof(*chim));
 	if (xact == NULL) {
 		if_printf(sc->hn_ifp, "no xact for nvs chim conn\n");
 		error = ENXIO;
 		goto cleanup;
 	}
 	chim = vmbus_xact_req_data(xact);
 	chim->nvs_type = HN_NVS_TYPE_CHIM_CONN;
 	chim->nvs_gpadl = sc->hn_chim_gpadl;
 	chim->nvs_sig = HN_NVS_CHIM_SIG;
 
 	resp_len = sizeof(*resp);
 	resp = hn_nvs_xact_execute(sc, xact, chim, sizeof(*chim), &resp_len,
 	    HN_NVS_TYPE_CHIM_CONNRESP);
 	if (resp == NULL) {
 		if_printf(sc->hn_ifp, "exec nvs chim conn failed\n");
 		error = EIO;
 		goto cleanup;
 	}
 
 	status = resp->nvs_status;
 	sectsz = resp->nvs_sectsz;
 	vmbus_xact_put(xact);
 	xact = NULL;
 
 	if (status != HN_NVS_STATUS_OK) {
 		if_printf(sc->hn_ifp, "nvs chim conn failed: %x\n", status);
 		error = EIO;
 		goto cleanup;
 	}
-	if (sectsz == 0) {
+	if (sectsz == 0 || sectsz % sizeof(uint32_t) != 0) {
 		/*
 		 * Can't use chimney sending buffer; done!
 		 */
-		if_printf(sc->hn_ifp, "zero chimney sending buffer "
-		    "section size\n");
+		if (sectsz == 0) {
+			if_printf(sc->hn_ifp, "zero chimney sending buffer "
+			    "section size\n");
+		} else {
+			if_printf(sc->hn_ifp, "misaligned chimney sending "
+			    "buffers, section size: %u\n", sectsz);
+		}
 		sc->hn_chim_szmax = 0;
 		sc->hn_chim_cnt = 0;
 		sc->hn_flags |= HN_FLAG_CHIM_CONNECTED;
 		return (0);
 	}
 
 	sc->hn_chim_szmax = sectsz;
 	sc->hn_chim_cnt = HN_CHIM_SIZE / sc->hn_chim_szmax;
 	if (HN_CHIM_SIZE % sc->hn_chim_szmax != 0) {
 		if_printf(sc->hn_ifp, "chimney sending sections are "
 		    "not properly aligned\n");
 	}
 	if (sc->hn_chim_cnt % LONG_BIT != 0) {
 		if_printf(sc->hn_ifp, "discard %d chimney sending sections\n",
 		    sc->hn_chim_cnt % LONG_BIT);
 	}
 
 	sc->hn_chim_bmap_cnt = sc->hn_chim_cnt / LONG_BIT;
 	sc->hn_chim_bmap = malloc(sc->hn_chim_bmap_cnt * sizeof(u_long),
 	    M_DEVBUF, M_WAITOK | M_ZERO);
 
 	/* Done! */
 	sc->hn_flags |= HN_FLAG_CHIM_CONNECTED;
 	if (bootverbose) {
 		if_printf(sc->hn_ifp, "chimney sending buffer %d/%d\n",
 		    sc->hn_chim_szmax, sc->hn_chim_cnt);
 	}
 	return (0);
 
 cleanup:
 	if (xact != NULL)
 		vmbus_xact_put(xact);
 	hn_nvs_disconn_chim(sc);
 	return (error);
 }
 
 static void
 hn_nvs_disconn_rxbuf(struct hn_softc *sc)
 {
 	int error;
 
 	if (sc->hn_flags & HN_FLAG_RXBUF_CONNECTED) {
 		struct hn_nvs_rxbuf_disconn disconn;
 
 		/*
 		 * Disconnect RXBUF from NVS.
 		 */
 		memset(&disconn, 0, sizeof(disconn));
 		disconn.nvs_type = HN_NVS_TYPE_RXBUF_DISCONN;
 		disconn.nvs_sig = HN_NVS_RXBUF_SIG;
 
 		/* NOTE: No response. */
 		error = hn_nvs_req_send(sc, &disconn, sizeof(disconn));
 		if (error) {
 			if_printf(sc->hn_ifp,
 			    "send nvs rxbuf disconn failed: %d\n", error);
 			/*
 			 * Fine for a revoked channel, since the hypervisor
 			 * does not drain TX bufring for a revoked channel.
 			 */
 			if (!vmbus_chan_is_revoked(sc->hn_prichan))
 				sc->hn_flags |= HN_FLAG_RXBUF_REF;
 		}
 		sc->hn_flags &= ~HN_FLAG_RXBUF_CONNECTED;
 
 		/*
 		 * Wait for the hypervisor to receive this NVS request.
 		 *
 		 * NOTE:
 		 * The TX bufring will not be drained by the hypervisor,
 		 * if the primary channel is revoked.
 		 */
 		while (!vmbus_chan_tx_empty(sc->hn_prichan) &&
 		    !vmbus_chan_is_revoked(sc->hn_prichan))
 			pause("waittx", 1);
 		/*
 		 * Linger long enough for NVS to disconnect RXBUF.
 		 */
 		pause("lingtx", (200 * hz) / 1000);
 	}
 
 	if (sc->hn_rxbuf_gpadl != 0) {
 		/*
 		 * Disconnect RXBUF from primary channel.
 		 */
 		error = vmbus_chan_gpadl_disconnect(sc->hn_prichan,
 		    sc->hn_rxbuf_gpadl);
 		if (error) {
 			if_printf(sc->hn_ifp,
 			    "rxbuf gpadl disconn failed: %d\n", error);
 			sc->hn_flags |= HN_FLAG_RXBUF_REF;
 		}
 		sc->hn_rxbuf_gpadl = 0;
 	}
 }
 
 static void
 hn_nvs_disconn_chim(struct hn_softc *sc)
 {
 	int error;
 
 	if (sc->hn_flags & HN_FLAG_CHIM_CONNECTED) {
 		struct hn_nvs_chim_disconn disconn;
 
 		/*
 		 * Disconnect chimney sending buffer from NVS.
 		 */
 		memset(&disconn, 0, sizeof(disconn));
 		disconn.nvs_type = HN_NVS_TYPE_CHIM_DISCONN;
 		disconn.nvs_sig = HN_NVS_CHIM_SIG;
 
 		/* NOTE: No response. */
 		error = hn_nvs_req_send(sc, &disconn, sizeof(disconn));
 		if (error) {
 			if_printf(sc->hn_ifp,
 			    "send nvs chim disconn failed: %d\n", error);
 			/*
 			 * Fine for a revoked channel, since the hypervisor
 			 * does not drain TX bufring for a revoked channel.
 			 */
 			if (!vmbus_chan_is_revoked(sc->hn_prichan))
 				sc->hn_flags |= HN_FLAG_CHIM_REF;
 		}
 		sc->hn_flags &= ~HN_FLAG_CHIM_CONNECTED;
 
 		/*
 		 * Wait for the hypervisor to receive this NVS request.
 		 *
 		 * NOTE:
 		 * The TX bufring will not be drained by the hypervisor,
 		 * if the primary channel is revoked.
 		 */
 		while (!vmbus_chan_tx_empty(sc->hn_prichan) &&
 		    !vmbus_chan_is_revoked(sc->hn_prichan))
 			pause("waittx", 1);
 		/*
 		 * Linger long enough for NVS to disconnect chimney
 		 * sending buffer.
 		 */
 		pause("lingtx", (200 * hz) / 1000);
 	}
 
 	if (sc->hn_chim_gpadl != 0) {
 		/*
 		 * Disconnect chimney sending buffer from primary channel.
 		 */
 		error = vmbus_chan_gpadl_disconnect(sc->hn_prichan,
 		    sc->hn_chim_gpadl);
 		if (error) {
 			if_printf(sc->hn_ifp,
 			    "chim gpadl disconn failed: %d\n", error);
 			sc->hn_flags |= HN_FLAG_CHIM_REF;
 		}
 		sc->hn_chim_gpadl = 0;
 	}
 
 	if (sc->hn_chim_bmap != NULL) {
 		free(sc->hn_chim_bmap, M_DEVBUF);
 		sc->hn_chim_bmap = NULL;
 		sc->hn_chim_bmap_cnt = 0;
 	}
 }
 
 static int
 hn_nvs_doinit(struct hn_softc *sc, uint32_t nvs_ver)
 {
 	struct vmbus_xact *xact;
 	struct hn_nvs_init *init;
 	const struct hn_nvs_init_resp *resp;
 	size_t resp_len;
 	uint32_t status;
 
 	xact = vmbus_xact_get(sc->hn_xact, sizeof(*init));
 	if (xact == NULL) {
 		if_printf(sc->hn_ifp, "no xact for nvs init\n");
 		return (ENXIO);
 	}
 	init = vmbus_xact_req_data(xact);
 	init->nvs_type = HN_NVS_TYPE_INIT;
 	init->nvs_ver_min = nvs_ver;
 	init->nvs_ver_max = nvs_ver;
 
 	resp_len = sizeof(*resp);
 	resp = hn_nvs_xact_execute(sc, xact, init, sizeof(*init), &resp_len,
 	    HN_NVS_TYPE_INIT_RESP);
 	if (resp == NULL) {
 		if_printf(sc->hn_ifp, "exec init failed\n");
 		vmbus_xact_put(xact);
 		return (EIO);
 	}
 
 	status = resp->nvs_status;
 	vmbus_xact_put(xact);
 
 	if (status != HN_NVS_STATUS_OK) {
 		if (bootverbose) {
 			/*
 			 * Caller may try another NVS version, and will log
 			 * error if there are no more NVS versions to try,
 			 * so don't bark out loud here.
 			 */
 			if_printf(sc->hn_ifp, "nvs init failed for ver 0x%x\n",
 			    nvs_ver);
 		}
 		return (EINVAL);
 	}
 	return (0);
 }
 
 /*
  * Configure MTU and enable VLAN.
  */
 static int
 hn_nvs_conf_ndis(struct hn_softc *sc, int mtu)
 {
 	struct hn_nvs_ndis_conf conf;
 	int error;
 
 	memset(&conf, 0, sizeof(conf));
 	conf.nvs_type = HN_NVS_TYPE_NDIS_CONF;
 	conf.nvs_mtu = mtu;
 	conf.nvs_caps = HN_NVS_NDIS_CONF_VLAN;
 	if (sc->hn_nvs_ver >= HN_NVS_VERSION_5)
 		conf.nvs_caps |= HN_NVS_NDIS_CONF_SRIOV;
 
 	/* NOTE: No response. */
 	error = hn_nvs_req_send(sc, &conf, sizeof(conf));
 	if (error) {
 		if_printf(sc->hn_ifp, "send nvs ndis conf failed: %d\n", error);
 		return (error);
 	}
 
 	if (bootverbose)
 		if_printf(sc->hn_ifp, "nvs ndis conf done\n");
 	sc->hn_caps |= HN_CAP_MTU | HN_CAP_VLAN;
 	return (0);
 }
 
 static int
 hn_nvs_init_ndis(struct hn_softc *sc)
 {
 	struct hn_nvs_ndis_init ndis;
 	int error;
 
 	memset(&ndis, 0, sizeof(ndis));
 	ndis.nvs_type = HN_NVS_TYPE_NDIS_INIT;
 	ndis.nvs_ndis_major = HN_NDIS_VERSION_MAJOR(sc->hn_ndis_ver);
 	ndis.nvs_ndis_minor = HN_NDIS_VERSION_MINOR(sc->hn_ndis_ver);
 
 	/* NOTE: No response. */
 	error = hn_nvs_req_send(sc, &ndis, sizeof(ndis));
 	if (error)
 		if_printf(sc->hn_ifp, "send nvs ndis init failed: %d\n", error);
 	return (error);
 }
 
 static int
 hn_nvs_init(struct hn_softc *sc)
 {
 	int i, error;
 
 	if (device_is_attached(sc->hn_dev)) {
 		/*
 		 * NVS version and NDIS version MUST NOT be changed.
 		 */
 		if (bootverbose) {
 			if_printf(sc->hn_ifp, "reinit NVS version 0x%x, "
 			    "NDIS version %u.%u\n", sc->hn_nvs_ver,
 			    HN_NDIS_VERSION_MAJOR(sc->hn_ndis_ver),
 			    HN_NDIS_VERSION_MINOR(sc->hn_ndis_ver));
 		}
 
 		error = hn_nvs_doinit(sc, sc->hn_nvs_ver);
 		if (error) {
 			if_printf(sc->hn_ifp, "reinit NVS version 0x%x "
 			    "failed: %d\n", sc->hn_nvs_ver, error);
 			return (error);
 		}
 		goto done;
 	}
 
 	/*
 	 * Find the supported NVS version and set NDIS version accordingly.
 	 */
 	for (i = 0; i < nitems(hn_nvs_version); ++i) {
 		error = hn_nvs_doinit(sc, hn_nvs_version[i]);
 		if (!error) {
 			sc->hn_nvs_ver = hn_nvs_version[i];
 
 			/* Set NDIS version according to NVS version. */
 			sc->hn_ndis_ver = HN_NDIS_VERSION_6_30;
 			if (sc->hn_nvs_ver <= HN_NVS_VERSION_4)
 				sc->hn_ndis_ver = HN_NDIS_VERSION_6_1;
 
 			if (bootverbose) {
 				if_printf(sc->hn_ifp, "NVS version 0x%x, "
 				    "NDIS version %u.%u\n", sc->hn_nvs_ver,
 				    HN_NDIS_VERSION_MAJOR(sc->hn_ndis_ver),
 				    HN_NDIS_VERSION_MINOR(sc->hn_ndis_ver));
 			}
 			goto done;
 		}
 	}
 	if_printf(sc->hn_ifp, "no NVS available\n");
 	return (ENXIO);
 
 done:
 	if (sc->hn_nvs_ver >= HN_NVS_VERSION_5)
 		sc->hn_caps |= HN_CAP_HASHVAL;
 	return (0);
 }
 
 int
 hn_nvs_attach(struct hn_softc *sc, int mtu)
 {
 	int error;
 
 	/*
 	 * Initialize NVS.
 	 */
 	error = hn_nvs_init(sc);
 	if (error)
 		return (error);
 
 	if (sc->hn_nvs_ver >= HN_NVS_VERSION_2) {
 		/*
 		 * Configure NDIS before initializing it.
 		 */
 		error = hn_nvs_conf_ndis(sc, mtu);
 		if (error)
 			return (error);
 	}
 
 	/*
 	 * Initialize NDIS.
 	 */
 	error = hn_nvs_init_ndis(sc);
 	if (error)
 		return (error);
 
 	/*
 	 * Connect RXBUF.
 	 */
 	error = hn_nvs_conn_rxbuf(sc);
 	if (error)
 		return (error);
 
 	/*
 	 * Connect chimney sending buffer.
 	 */
 	error = hn_nvs_conn_chim(sc);
 	if (error) {
 		hn_nvs_disconn_rxbuf(sc);
 		return (error);
 	}
 	return (0);
 }
 
 void
 hn_nvs_detach(struct hn_softc *sc)
 {
 
 	/* NOTE: there are no requests to stop the NVS. */
 	hn_nvs_disconn_rxbuf(sc);
 	hn_nvs_disconn_chim(sc);
 }
 
 void
 hn_nvs_sent_xact(struct hn_nvs_sendctx *sndc,
     struct hn_softc *sc __unused, struct vmbus_channel *chan __unused,
     const void *data, int dlen)
 {
 
 	vmbus_xact_wakeup(sndc->hn_cbarg, data, dlen);
 }
 
 static void
 hn_nvs_sent_none(struct hn_nvs_sendctx *sndc __unused,
     struct hn_softc *sc __unused, struct vmbus_channel *chan __unused,
     const void *data __unused, int dlen __unused)
 {
 	/* EMPTY */
 }
 
 int
 hn_nvs_alloc_subchans(struct hn_softc *sc, int *nsubch0)
 {
 	struct vmbus_xact *xact;
 	struct hn_nvs_subch_req *req;
 	const struct hn_nvs_subch_resp *resp;
 	int error, nsubch_req;
 	uint32_t nsubch;
 	size_t resp_len;
 
 	nsubch_req = *nsubch0;
 	KASSERT(nsubch_req > 0, ("invalid # of sub-channels %d", nsubch_req));
 
 	xact = vmbus_xact_get(sc->hn_xact, sizeof(*req));
 	if (xact == NULL) {
 		if_printf(sc->hn_ifp, "no xact for nvs subch alloc\n");
 		return (ENXIO);
 	}
 	req = vmbus_xact_req_data(xact);
 	req->nvs_type = HN_NVS_TYPE_SUBCH_REQ;
 	req->nvs_op = HN_NVS_SUBCH_OP_ALLOC;
 	req->nvs_nsubch = nsubch_req;
 
 	resp_len = sizeof(*resp);
 	resp = hn_nvs_xact_execute(sc, xact, req, sizeof(*req), &resp_len,
 	    HN_NVS_TYPE_SUBCH_RESP);
 	if (resp == NULL) {
 		if_printf(sc->hn_ifp, "exec nvs subch alloc failed\n");
 		error = EIO;
 		goto done;
 	}
 	if (resp->nvs_status != HN_NVS_STATUS_OK) {
 		if_printf(sc->hn_ifp, "nvs subch alloc failed: %x\n",
 		    resp->nvs_status);
 		error = EIO;
 		goto done;
 	}
 
 	nsubch = resp->nvs_nsubch;
 	if (nsubch > nsubch_req) {
 		if_printf(sc->hn_ifp, "%u subchans are allocated, "
 		    "requested %d\n", nsubch, nsubch_req);
 		nsubch = nsubch_req;
 	}
 	*nsubch0 = nsubch;
 	error = 0;
 done:
 	vmbus_xact_put(xact);
 	return (error);
 }
 
 int
 hn_nvs_send_rndis_ctrl(struct vmbus_channel *chan,
     struct hn_nvs_sendctx *sndc, struct vmbus_gpa *gpa, int gpa_cnt)
 {
 
 	return hn_nvs_send_rndis_sglist(chan, HN_NVS_RNDIS_MTYPE_CTRL,
 	    sndc, gpa, gpa_cnt);
 }
 
 void
 hn_nvs_set_datapath(struct hn_softc *sc, uint32_t path)
 {
 	struct hn_nvs_datapath dp;
 
 	memset(&dp, 0, sizeof(dp));
 	dp.nvs_type = HN_NVS_TYPE_SET_DATAPATH;
 	dp.nvs_active_path = path;
 
 	hn_nvs_req_send(sc, &dp, sizeof(dp));
 }
Index: projects/clang400-import/sys/dev/hyperv/netvsc/hn_rndis.c
===================================================================
--- projects/clang400-import/sys/dev/hyperv/netvsc/hn_rndis.c	(revision 314522)
+++ projects/clang400-import/sys/dev/hyperv/netvsc/hn_rndis.c	(revision 314523)
@@ -1,993 +1,1006 @@
 /*-
  * Copyright (c) 2009-2012,2016 Microsoft Corp.
  * Copyright (c) 2010-2012 Citrix Inc.
  * Copyright (c) 2012 NetApp Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice unmodified, this list of conditions, and the following
  *    disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_inet6.h"
 #include "opt_inet.h"
 
 #include <sys/param.h>
 #include <sys/socket.h>
 #include <sys/systm.h>
 #include <sys/taskqueue.h>
 
 #include <machine/atomic.h>
 
 #include <net/ethernet.h>
 #include <net/if.h>
 #include <net/if_var.h>
 #include <net/if_media.h>
 #include <net/rndis.h>
 
 #include <netinet/in.h>
 #include <netinet/ip.h>
 #include <netinet/tcp_lro.h>
 
 #include <dev/hyperv/include/hyperv.h>
 #include <dev/hyperv/include/hyperv_busdma.h>
 #include <dev/hyperv/include/vmbus.h>
 #include <dev/hyperv/include/vmbus_xact.h>
 
 #include <dev/hyperv/netvsc/ndis.h>
 #include <dev/hyperv/netvsc/if_hnreg.h>
 #include <dev/hyperv/netvsc/if_hnvar.h>
 #include <dev/hyperv/netvsc/hn_nvs.h>
 #include <dev/hyperv/netvsc/hn_rndis.h>
 
 #define HN_RNDIS_RID_COMPAT_MASK	0xffff
 #define HN_RNDIS_RID_COMPAT_MAX		HN_RNDIS_RID_COMPAT_MASK
 
 #define HN_RNDIS_XFER_SIZE		2048
 
 #define HN_NDIS_TXCSUM_CAP_IP4		\
 	(NDIS_TXCSUM_CAP_IP4 | NDIS_TXCSUM_CAP_IP4OPT)
 #define HN_NDIS_TXCSUM_CAP_TCP4		\
 	(NDIS_TXCSUM_CAP_TCP4 | NDIS_TXCSUM_CAP_TCP4OPT)
 #define HN_NDIS_TXCSUM_CAP_TCP6		\
 	(NDIS_TXCSUM_CAP_TCP6 | NDIS_TXCSUM_CAP_TCP6OPT | \
 	 NDIS_TXCSUM_CAP_IP6EXT)
 #define HN_NDIS_TXCSUM_CAP_UDP6		\
 	(NDIS_TXCSUM_CAP_UDP6 | NDIS_TXCSUM_CAP_IP6EXT)
 #define HN_NDIS_LSOV2_CAP_IP6		\
 	(NDIS_LSOV2_CAP_IP6EXT | NDIS_LSOV2_CAP_TCP6OPT)
 
 static const void	*hn_rndis_xact_exec1(struct hn_softc *,
 			    struct vmbus_xact *, size_t,
 			    struct hn_nvs_sendctx *, size_t *);
 static const void	*hn_rndis_xact_execute(struct hn_softc *,
 			    struct vmbus_xact *, uint32_t, size_t, size_t *,
 			    uint32_t);
 static int		hn_rndis_query(struct hn_softc *, uint32_t,
 			    const void *, size_t, void *, size_t *);
 static int		hn_rndis_query2(struct hn_softc *, uint32_t,
 			    const void *, size_t, void *, size_t *, size_t);
 static int		hn_rndis_set(struct hn_softc *, uint32_t,
 			    const void *, size_t);
 static int		hn_rndis_init(struct hn_softc *);
 static int		hn_rndis_halt(struct hn_softc *);
 static int		hn_rndis_conf_offload(struct hn_softc *, int);
 static int		hn_rndis_query_hwcaps(struct hn_softc *,
 			    struct ndis_offload *);
 
 static __inline uint32_t
 hn_rndis_rid(struct hn_softc *sc)
 {
 	uint32_t rid;
 
 again:
 	rid = atomic_fetchadd_int(&sc->hn_rndis_rid, 1);
 	if (rid == 0)
 		goto again;
 
 	/* Use upper 16 bits for non-compat RNDIS messages. */
 	return ((rid & 0xffff) << 16);
 }
 
 void
 hn_rndis_rx_ctrl(struct hn_softc *sc, const void *data, int dlen)
 {
 	const struct rndis_comp_hdr *comp;
 	const struct rndis_msghdr *hdr;
 
 	KASSERT(dlen >= sizeof(*hdr), ("invalid RNDIS msg\n"));
 	hdr = data;
 
 	switch (hdr->rm_type) {
 	case REMOTE_NDIS_INITIALIZE_CMPLT:
 	case REMOTE_NDIS_QUERY_CMPLT:
 	case REMOTE_NDIS_SET_CMPLT:
 	case REMOTE_NDIS_KEEPALIVE_CMPLT:	/* unused */
 		if (dlen < sizeof(*comp)) {
 			if_printf(sc->hn_ifp, "invalid RNDIS cmplt\n");
 			return;
 		}
 		comp = data;
 
 		KASSERT(comp->rm_rid > HN_RNDIS_RID_COMPAT_MAX,
 		    ("invalid RNDIS rid 0x%08x\n", comp->rm_rid));
 		vmbus_xact_ctx_wakeup(sc->hn_xact, comp, dlen);
 		break;
 
 	case REMOTE_NDIS_RESET_CMPLT:
 		/*
 		 * Reset completed, no rid.
 		 *
 		 * NOTE:
 		 * RESET is not issued by hn(4), so this message should
 		 * _not_ be observed.
 		 */
 		if_printf(sc->hn_ifp, "RESET cmplt received\n");
 		break;
 
 	default:
 		if_printf(sc->hn_ifp, "unknown RNDIS msg 0x%x\n",
 		    hdr->rm_type);
 		break;
 	}
 }
 
 int
 hn_rndis_get_eaddr(struct hn_softc *sc, uint8_t *eaddr)
 {
 	size_t eaddr_len;
 	int error;
 
 	eaddr_len = ETHER_ADDR_LEN;
 	error = hn_rndis_query(sc, OID_802_3_PERMANENT_ADDRESS, NULL, 0,
 	    eaddr, &eaddr_len);
 	if (error)
 		return (error);
 	if (eaddr_len != ETHER_ADDR_LEN) {
 		if_printf(sc->hn_ifp, "invalid eaddr len %zu\n", eaddr_len);
 		return (EINVAL);
 	}
 	return (0);
 }
 
 int
 hn_rndis_get_linkstatus(struct hn_softc *sc, uint32_t *link_status)
 {
 	size_t size;
 	int error;
 
 	size = sizeof(*link_status);
 	error = hn_rndis_query(sc, OID_GEN_MEDIA_CONNECT_STATUS, NULL, 0,
 	    link_status, &size);
 	if (error)
 		return (error);
 	if (size != sizeof(uint32_t)) {
 		if_printf(sc->hn_ifp, "invalid link status len %zu\n", size);
 		return (EINVAL);
 	}
 	return (0);
 }
 
 static const void *
 hn_rndis_xact_exec1(struct hn_softc *sc, struct vmbus_xact *xact, size_t reqlen,
     struct hn_nvs_sendctx *sndc, size_t *comp_len)
 {
 	struct vmbus_gpa gpa[HN_XACT_REQ_PGCNT];
 	int gpa_cnt, error;
 	bus_addr_t paddr;
 
 	KASSERT(reqlen <= HN_XACT_REQ_SIZE && reqlen > 0,
 	    ("invalid request length %zu", reqlen));
 
 	/*
 	 * Setup the SG list.
 	 */
 	paddr = vmbus_xact_req_paddr(xact);
 	KASSERT((paddr & PAGE_MASK) == 0,
 	    ("vmbus xact request is not page aligned 0x%jx", (uintmax_t)paddr));
 	for (gpa_cnt = 0; gpa_cnt < HN_XACT_REQ_PGCNT; ++gpa_cnt) {
 		int len = PAGE_SIZE;
 
 		if (reqlen == 0)
 			break;
 		if (reqlen < len)
 			len = reqlen;
 
 		gpa[gpa_cnt].gpa_page = atop(paddr) + gpa_cnt;
 		gpa[gpa_cnt].gpa_len = len;
 		gpa[gpa_cnt].gpa_ofs = 0;
 
 		reqlen -= len;
 	}
 	KASSERT(reqlen == 0, ("still have %zu request data left", reqlen));
 
 	/*
 	 * Send this RNDIS control message and wait for its completion
 	 * message.
 	 */
 	vmbus_xact_activate(xact);
 	error = hn_nvs_send_rndis_ctrl(sc->hn_prichan, sndc, gpa, gpa_cnt);
 	if (error) {
 		vmbus_xact_deactivate(xact);
 		if_printf(sc->hn_ifp, "RNDIS ctrl send failed: %d\n", error);
 		return (NULL);
 	}
 	return (vmbus_chan_xact_wait(sc->hn_prichan, xact, comp_len,
 	    HN_CAN_SLEEP(sc)));
 }
 
 static const void *
 hn_rndis_xact_execute(struct hn_softc *sc, struct vmbus_xact *xact, uint32_t rid,
     size_t reqlen, size_t *comp_len0, uint32_t comp_type)
 {
 	const struct rndis_comp_hdr *comp;
 	size_t comp_len, min_complen = *comp_len0;
 
 	KASSERT(rid > HN_RNDIS_RID_COMPAT_MAX, ("invalid rid %u\n", rid));
 	KASSERT(min_complen >= sizeof(*comp),
 	    ("invalid minimum complete len %zu", min_complen));
 
 	/*
 	 * Execute the xact setup by the caller.
 	 */
 	comp = hn_rndis_xact_exec1(sc, xact, reqlen, &hn_nvs_sendctx_none,
 	    &comp_len);
 	if (comp == NULL)
 		return (NULL);
 
 	/*
 	 * Check this RNDIS complete message.
 	 */
 	if (comp_len < min_complen) {
 		if (comp_len >= sizeof(*comp)) {
 			/* rm_status field is valid */
 			if_printf(sc->hn_ifp, "invalid RNDIS comp len %zu, "
 			    "status 0x%08x\n", comp_len, comp->rm_status);
 		} else {
 			if_printf(sc->hn_ifp, "invalid RNDIS comp len %zu\n",
 			    comp_len);
 		}
 		return (NULL);
 	}
 	if (comp->rm_len < min_complen) {
 		if_printf(sc->hn_ifp, "invalid RNDIS comp msglen %u\n",
 		    comp->rm_len);
 		return (NULL);
 	}
 	if (comp->rm_type != comp_type) {
 		if_printf(sc->hn_ifp, "unexpected RNDIS comp 0x%08x, "
 		    "expect 0x%08x\n", comp->rm_type, comp_type);
 		return (NULL);
 	}
 	if (comp->rm_rid != rid) {
 		if_printf(sc->hn_ifp, "RNDIS comp rid mismatch %u, "
 		    "expect %u\n", comp->rm_rid, rid);
 		return (NULL);
 	}
 	/* All pass! */
 	*comp_len0 = comp_len;
 	return (comp);
 }
 
 static int
 hn_rndis_query(struct hn_softc *sc, uint32_t oid,
     const void *idata, size_t idlen, void *odata, size_t *odlen0)
 {
 
 	return (hn_rndis_query2(sc, oid, idata, idlen, odata, odlen0, *odlen0));
 }
 
 static int
 hn_rndis_query2(struct hn_softc *sc, uint32_t oid,
     const void *idata, size_t idlen, void *odata, size_t *odlen0,
     size_t min_odlen)
 {
 	struct rndis_query_req *req;
 	const struct rndis_query_comp *comp;
 	struct vmbus_xact *xact;
 	size_t reqlen, odlen = *odlen0, comp_len;
 	int error, ofs;
 	uint32_t rid;
 
 	reqlen = sizeof(*req) + idlen;
 	xact = vmbus_xact_get(sc->hn_xact, reqlen);
 	if (xact == NULL) {
 		if_printf(sc->hn_ifp, "no xact for RNDIS query 0x%08x\n", oid);
 		return (ENXIO);
 	}
 	rid = hn_rndis_rid(sc);
 	req = vmbus_xact_req_data(xact);
 	req->rm_type = REMOTE_NDIS_QUERY_MSG;
 	req->rm_len = reqlen;
 	req->rm_rid = rid;
 	req->rm_oid = oid;
 	/*
 	 * XXX
 	 * This is _not_ RNDIS Spec conforming:
 	 * "This MUST be set to 0 when there is no input data
 	 *  associated with the OID."
 	 *
 	 * If this field was set to 0 according to the RNDIS Spec,
 	 * Hyper-V would set non-SUCCESS status in the query
 	 * completion.
 	 */
 	req->rm_infobufoffset = RNDIS_QUERY_REQ_INFOBUFOFFSET;
 
 	if (idlen > 0) {
 		req->rm_infobuflen = idlen;
 		/* Input data immediately follows RNDIS query. */
 		memcpy(req + 1, idata, idlen);
 	}
 
 	comp_len = sizeof(*comp) + min_odlen;
 	comp = hn_rndis_xact_execute(sc, xact, rid, reqlen, &comp_len,
 	    REMOTE_NDIS_QUERY_CMPLT);
 	if (comp == NULL) {
 		if_printf(sc->hn_ifp, "exec RNDIS query 0x%08x failed\n", oid);
 		error = EIO;
 		goto done;
 	}
 
 	if (comp->rm_status != RNDIS_STATUS_SUCCESS) {
 		if_printf(sc->hn_ifp, "RNDIS query 0x%08x failed: "
 		    "status 0x%08x\n", oid, comp->rm_status);
 		error = EIO;
 		goto done;
 	}
 	if (comp->rm_infobuflen == 0 || comp->rm_infobufoffset == 0) {
 		/* No output data! */
 		if_printf(sc->hn_ifp, "RNDIS query 0x%08x, no data\n", oid);
 		*odlen0 = 0;
 		error = 0;
 		goto done;
 	}
 
 	/*
 	 * Check output data length and offset.
 	 */
 	/* ofs is the offset from the beginning of comp. */
 	ofs = RNDIS_QUERY_COMP_INFOBUFOFFSET_ABS(comp->rm_infobufoffset);
 	if (ofs < sizeof(*comp) || ofs + comp->rm_infobuflen > comp_len) {
 		if_printf(sc->hn_ifp, "RNDIS query invalid comp ib off/len, "
 		    "%u/%u\n", comp->rm_infobufoffset, comp->rm_infobuflen);
 		error = EINVAL;
 		goto done;
 	}
 
 	/*
 	 * Save output data.
 	 */
 	if (comp->rm_infobuflen < odlen)
 		odlen = comp->rm_infobuflen;
 	memcpy(odata, ((const uint8_t *)comp) + ofs, odlen);
 	*odlen0 = odlen;
 
 	error = 0;
 done:
 	vmbus_xact_put(xact);
 	return (error);
 }
 
 int
 hn_rndis_query_rsscaps(struct hn_softc *sc, int *rxr_cnt0)
 {
 	struct ndis_rss_caps in, caps;
 	size_t caps_len;
 	int error, indsz, rxr_cnt, hash_fnidx;
 	uint32_t hash_func = 0, hash_types = 0;
 
 	*rxr_cnt0 = 0;
 
 	if (sc->hn_ndis_ver < HN_NDIS_VERSION_6_20)
 		return (EOPNOTSUPP);
 
 	memset(&in, 0, sizeof(in));
 	in.ndis_hdr.ndis_type = NDIS_OBJTYPE_RSS_CAPS;
 	in.ndis_hdr.ndis_rev = NDIS_RSS_CAPS_REV_2;
 	in.ndis_hdr.ndis_size = NDIS_RSS_CAPS_SIZE;
 
 	caps_len = NDIS_RSS_CAPS_SIZE;
 	error = hn_rndis_query2(sc, OID_GEN_RECEIVE_SCALE_CAPABILITIES,
 	    &in, NDIS_RSS_CAPS_SIZE, &caps, &caps_len, NDIS_RSS_CAPS_SIZE_6_0);
 	if (error)
 		return (error);
 
 	/*
 	 * Preliminary verification.
 	 */
 	if (caps.ndis_hdr.ndis_type != NDIS_OBJTYPE_RSS_CAPS) {
 		if_printf(sc->hn_ifp, "invalid NDIS objtype 0x%02x\n",
 		    caps.ndis_hdr.ndis_type);
 		return (EINVAL);
 	}
 	if (caps.ndis_hdr.ndis_rev < NDIS_RSS_CAPS_REV_1) {
 		if_printf(sc->hn_ifp, "invalid NDIS objrev 0x%02x\n",
 		    caps.ndis_hdr.ndis_rev);
 		return (EINVAL);
 	}
 	if (caps.ndis_hdr.ndis_size > caps_len) {
 		if_printf(sc->hn_ifp, "invalid NDIS objsize %u, "
 		    "data size %zu\n", caps.ndis_hdr.ndis_size, caps_len);
 		return (EINVAL);
 	} else if (caps.ndis_hdr.ndis_size < NDIS_RSS_CAPS_SIZE_6_0) {
 		if_printf(sc->hn_ifp, "invalid NDIS objsize %u\n",
 		    caps.ndis_hdr.ndis_size);
 		return (EINVAL);
 	}
 
 	/*
 	 * Save information for later RSS configuration.
 	 */
 	if (caps.ndis_nrxr == 0) {
 		if_printf(sc->hn_ifp, "0 RX rings!?\n");
 		return (EINVAL);
 	}
 	if (bootverbose)
 		if_printf(sc->hn_ifp, "%u RX rings\n", caps.ndis_nrxr);
 	rxr_cnt = caps.ndis_nrxr;
 
 	if (caps.ndis_hdr.ndis_size == NDIS_RSS_CAPS_SIZE &&
 	    caps.ndis_hdr.ndis_rev >= NDIS_RSS_CAPS_REV_2) {
 		if (caps.ndis_nind > NDIS_HASH_INDCNT) {
 			if_printf(sc->hn_ifp,
 			    "too many RSS indirect table entries %u\n",
 			    caps.ndis_nind);
 			return (EOPNOTSUPP);
 		}
 		if (!powerof2(caps.ndis_nind)) {
 			if_printf(sc->hn_ifp, "RSS indirect table size is not "
 			    "power-of-2 %u\n", caps.ndis_nind);
 		}
 
 		if (bootverbose) {
 			if_printf(sc->hn_ifp, "RSS indirect table size %u\n",
 			    caps.ndis_nind);
 		}
 		indsz = caps.ndis_nind;
 	} else {
 		indsz = NDIS_HASH_INDCNT;
 	}
 	if (indsz < rxr_cnt) {
 		if_printf(sc->hn_ifp, "# of RX rings (%d) > "
 		    "RSS indirect table size %d\n", rxr_cnt, indsz);
 		rxr_cnt = indsz;
 	}
 
 	/*
 	 * NOTE:
 	 * Toeplitz is at the lowest bit, and it is prefered; so ffs(),
 	 * instead of fls(), is used here.
 	 */
 	hash_fnidx = ffs(caps.ndis_caps & NDIS_RSS_CAP_HASHFUNC_MASK);
 	if (hash_fnidx == 0) {
 		if_printf(sc->hn_ifp, "no hash functions, caps 0x%08x\n",
 		    caps.ndis_caps);
 		return (EOPNOTSUPP);
 	}
 	hash_func = 1 << (hash_fnidx - 1); /* ffs is 1-based */
 
 	if (caps.ndis_caps & NDIS_RSS_CAP_IPV4)
 		hash_types |= NDIS_HASH_IPV4 | NDIS_HASH_TCP_IPV4;
 	if (caps.ndis_caps & NDIS_RSS_CAP_IPV6)
 		hash_types |= NDIS_HASH_IPV6 | NDIS_HASH_TCP_IPV6;
 	if (caps.ndis_caps & NDIS_RSS_CAP_IPV6_EX)
 		hash_types |= NDIS_HASH_IPV6_EX | NDIS_HASH_TCP_IPV6_EX;
 	if (hash_types == 0) {
 		if_printf(sc->hn_ifp, "no hash types, caps 0x%08x\n",
 		    caps.ndis_caps);
 		return (EOPNOTSUPP);
 	}
 
 	/* Commit! */
 	sc->hn_rss_ind_size = indsz;
 	sc->hn_rss_hash = hash_func | hash_types;
 	*rxr_cnt0 = rxr_cnt;
 	return (0);
 }
 
 static int
 hn_rndis_set(struct hn_softc *sc, uint32_t oid, const void *data, size_t dlen)
 {
 	struct rndis_set_req *req;
 	const struct rndis_set_comp *comp;
 	struct vmbus_xact *xact;
 	size_t reqlen, comp_len;
 	uint32_t rid;
 	int error;
 
 	KASSERT(dlen > 0, ("invalid dlen %zu", dlen));
 
 	reqlen = sizeof(*req) + dlen;
 	xact = vmbus_xact_get(sc->hn_xact, reqlen);
 	if (xact == NULL) {
 		if_printf(sc->hn_ifp, "no xact for RNDIS set 0x%08x\n", oid);
 		return (ENXIO);
 	}
 	rid = hn_rndis_rid(sc);
 	req = vmbus_xact_req_data(xact);
 	req->rm_type = REMOTE_NDIS_SET_MSG;
 	req->rm_len = reqlen;
 	req->rm_rid = rid;
 	req->rm_oid = oid;
 	req->rm_infobuflen = dlen;
 	req->rm_infobufoffset = RNDIS_SET_REQ_INFOBUFOFFSET;
 	/* Data immediately follows RNDIS set. */
 	memcpy(req + 1, data, dlen);
 
 	comp_len = sizeof(*comp);
 	comp = hn_rndis_xact_execute(sc, xact, rid, reqlen, &comp_len,
 	    REMOTE_NDIS_SET_CMPLT);
 	if (comp == NULL) {
 		if_printf(sc->hn_ifp, "exec RNDIS set 0x%08x failed\n", oid);
 		error = EIO;
 		goto done;
 	}
 
 	if (comp->rm_status != RNDIS_STATUS_SUCCESS) {
 		if_printf(sc->hn_ifp, "RNDIS set 0x%08x failed: "
 		    "status 0x%08x\n", oid, comp->rm_status);
 		error = EIO;
 		goto done;
 	}
 	error = 0;
 done:
 	vmbus_xact_put(xact);
 	return (error);
 }
 
 static int
 hn_rndis_conf_offload(struct hn_softc *sc, int mtu)
 {
 	struct ndis_offload hwcaps;
 	struct ndis_offload_params params;
 	uint32_t caps = 0;
 	size_t paramsz;
 	int error, tso_maxsz, tso_minsg;
 
 	error = hn_rndis_query_hwcaps(sc, &hwcaps);
 	if (error) {
 		if_printf(sc->hn_ifp, "hwcaps query failed: %d\n", error);
 		return (error);
 	}
 
 	/* NOTE: 0 means "no change" */
 	memset(&params, 0, sizeof(params));
 
 	params.ndis_hdr.ndis_type = NDIS_OBJTYPE_DEFAULT;
 	if (sc->hn_ndis_ver < HN_NDIS_VERSION_6_30) {
 		params.ndis_hdr.ndis_rev = NDIS_OFFLOAD_PARAMS_REV_2;
 		paramsz = NDIS_OFFLOAD_PARAMS_SIZE_6_1;
 	} else {
 		params.ndis_hdr.ndis_rev = NDIS_OFFLOAD_PARAMS_REV_3;
 		paramsz = NDIS_OFFLOAD_PARAMS_SIZE;
 	}
 	params.ndis_hdr.ndis_size = paramsz;
 
 	/*
 	 * TSO4/TSO6 setup.
 	 */
 	tso_maxsz = IP_MAXPACKET;
 	tso_minsg = 2;
 	if (hwcaps.ndis_lsov2.ndis_ip4_encap & NDIS_OFFLOAD_ENCAP_8023) {
 		caps |= HN_CAP_TSO4;
 		params.ndis_lsov2_ip4 = NDIS_OFFLOAD_LSOV2_ON;
 
 		if (hwcaps.ndis_lsov2.ndis_ip4_maxsz < tso_maxsz)
 			tso_maxsz = hwcaps.ndis_lsov2.ndis_ip4_maxsz;
 		if (hwcaps.ndis_lsov2.ndis_ip4_minsg > tso_minsg)
 			tso_minsg = hwcaps.ndis_lsov2.ndis_ip4_minsg;
 	}
 	if ((hwcaps.ndis_lsov2.ndis_ip6_encap & NDIS_OFFLOAD_ENCAP_8023) &&
 	    (hwcaps.ndis_lsov2.ndis_ip6_opts & HN_NDIS_LSOV2_CAP_IP6) ==
 	    HN_NDIS_LSOV2_CAP_IP6) {
 		caps |= HN_CAP_TSO6;
 		params.ndis_lsov2_ip6 = NDIS_OFFLOAD_LSOV2_ON;
 
 		if (hwcaps.ndis_lsov2.ndis_ip6_maxsz < tso_maxsz)
 			tso_maxsz = hwcaps.ndis_lsov2.ndis_ip6_maxsz;
 		if (hwcaps.ndis_lsov2.ndis_ip6_minsg > tso_minsg)
 			tso_minsg = hwcaps.ndis_lsov2.ndis_ip6_minsg;
 	}
 	sc->hn_ndis_tso_szmax = 0;
 	sc->hn_ndis_tso_sgmin = 0;
 	if (caps & (HN_CAP_TSO4 | HN_CAP_TSO6)) {
 		KASSERT(tso_maxsz <= IP_MAXPACKET,
 		    ("invalid NDIS TSO maxsz %d", tso_maxsz));
 		KASSERT(tso_minsg >= 2,
 		    ("invalid NDIS TSO minsg %d", tso_minsg));
 		if (tso_maxsz < tso_minsg * mtu) {
 			if_printf(sc->hn_ifp, "invalid NDIS TSO config: "
 			    "maxsz %d, minsg %d, mtu %d; "
 			    "disable TSO4 and TSO6\n",
 			    tso_maxsz, tso_minsg, mtu);
 			caps &= ~(HN_CAP_TSO4 | HN_CAP_TSO6);
 			params.ndis_lsov2_ip4 = NDIS_OFFLOAD_LSOV2_OFF;
 			params.ndis_lsov2_ip6 = NDIS_OFFLOAD_LSOV2_OFF;
 		} else {
 			sc->hn_ndis_tso_szmax = tso_maxsz;
 			sc->hn_ndis_tso_sgmin = tso_minsg;
 			if (bootverbose) {
 				if_printf(sc->hn_ifp, "NDIS TSO "
 				    "szmax %d sgmin %d\n",
 				    sc->hn_ndis_tso_szmax,
 				    sc->hn_ndis_tso_sgmin);
 			}
 		}
 	}
 
 	/* IPv4 checksum */
 	if ((hwcaps.ndis_csum.ndis_ip4_txcsum & HN_NDIS_TXCSUM_CAP_IP4) ==
 	    HN_NDIS_TXCSUM_CAP_IP4) {
 		caps |= HN_CAP_IPCS;
 		params.ndis_ip4csum = NDIS_OFFLOAD_PARAM_TX;
 	}
 	if (hwcaps.ndis_csum.ndis_ip4_rxcsum & NDIS_RXCSUM_CAP_IP4) {
 		if (params.ndis_ip4csum == NDIS_OFFLOAD_PARAM_TX)
 			params.ndis_ip4csum = NDIS_OFFLOAD_PARAM_TXRX;
 		else
 			params.ndis_ip4csum = NDIS_OFFLOAD_PARAM_RX;
 	}
 
 	/* TCP4 checksum */
 	if ((hwcaps.ndis_csum.ndis_ip4_txcsum & HN_NDIS_TXCSUM_CAP_TCP4) ==
 	    HN_NDIS_TXCSUM_CAP_TCP4) {
 		caps |= HN_CAP_TCP4CS;
 		params.ndis_tcp4csum = NDIS_OFFLOAD_PARAM_TX;
 	}
 	if (hwcaps.ndis_csum.ndis_ip4_rxcsum & NDIS_RXCSUM_CAP_TCP4) {
 		if (params.ndis_tcp4csum == NDIS_OFFLOAD_PARAM_TX)
 			params.ndis_tcp4csum = NDIS_OFFLOAD_PARAM_TXRX;
 		else
 			params.ndis_tcp4csum = NDIS_OFFLOAD_PARAM_RX;
 	}
 
 	/* UDP4 checksum */
 	if (hwcaps.ndis_csum.ndis_ip4_txcsum & NDIS_TXCSUM_CAP_UDP4) {
 		caps |= HN_CAP_UDP4CS;
 		params.ndis_udp4csum = NDIS_OFFLOAD_PARAM_TX;
 	}
 	if (hwcaps.ndis_csum.ndis_ip4_rxcsum & NDIS_RXCSUM_CAP_UDP4) {
 		if (params.ndis_udp4csum == NDIS_OFFLOAD_PARAM_TX)
 			params.ndis_udp4csum = NDIS_OFFLOAD_PARAM_TXRX;
 		else
 			params.ndis_udp4csum = NDIS_OFFLOAD_PARAM_RX;
 	}
 
 	/* TCP6 checksum */
 	if ((hwcaps.ndis_csum.ndis_ip6_txcsum & HN_NDIS_TXCSUM_CAP_TCP6) ==
 	    HN_NDIS_TXCSUM_CAP_TCP6) {
 		caps |= HN_CAP_TCP6CS;
 		params.ndis_tcp6csum = NDIS_OFFLOAD_PARAM_TX;
 	}
 	if (hwcaps.ndis_csum.ndis_ip6_rxcsum & NDIS_RXCSUM_CAP_TCP6) {
 		if (params.ndis_tcp6csum == NDIS_OFFLOAD_PARAM_TX)
 			params.ndis_tcp6csum = NDIS_OFFLOAD_PARAM_TXRX;
 		else
 			params.ndis_tcp6csum = NDIS_OFFLOAD_PARAM_RX;
 	}
 
 	/* UDP6 checksum */
 	if ((hwcaps.ndis_csum.ndis_ip6_txcsum & HN_NDIS_TXCSUM_CAP_UDP6) ==
 	    HN_NDIS_TXCSUM_CAP_UDP6) {
 		caps |= HN_CAP_UDP6CS;
 		params.ndis_udp6csum = NDIS_OFFLOAD_PARAM_TX;
 	}
 	if (hwcaps.ndis_csum.ndis_ip6_rxcsum & NDIS_RXCSUM_CAP_UDP6) {
 		if (params.ndis_udp6csum == NDIS_OFFLOAD_PARAM_TX)
 			params.ndis_udp6csum = NDIS_OFFLOAD_PARAM_TXRX;
 		else
 			params.ndis_udp6csum = NDIS_OFFLOAD_PARAM_RX;
 	}
 
 	if (bootverbose) {
 		if_printf(sc->hn_ifp, "offload csum: "
 		    "ip4 %u, tcp4 %u, udp4 %u, tcp6 %u, udp6 %u\n",
 		    params.ndis_ip4csum,
 		    params.ndis_tcp4csum,
 		    params.ndis_udp4csum,
 		    params.ndis_tcp6csum,
 		    params.ndis_udp6csum);
 		if_printf(sc->hn_ifp, "offload lsov2: ip4 %u, ip6 %u\n",
 		    params.ndis_lsov2_ip4,
 		    params.ndis_lsov2_ip6);
 	}
 
 	error = hn_rndis_set(sc, OID_TCP_OFFLOAD_PARAMETERS, &params, paramsz);
 	if (error) {
 		if_printf(sc->hn_ifp, "offload config failed: %d\n", error);
 		return (error);
 	}
 
 	if (bootverbose)
 		if_printf(sc->hn_ifp, "offload config done\n");
 	sc->hn_caps |= caps;
 	return (0);
 }
 
 int
 hn_rndis_conf_rss(struct hn_softc *sc, uint16_t flags)
 {
 	struct ndis_rssprm_toeplitz *rss = &sc->hn_rss;
 	struct ndis_rss_params *prm = &rss->rss_params;
 	int error, rss_size;
 
 	/*
 	 * Only NDIS 6.20+ is supported:
 	 * We only support 4bytes element in indirect table, which has been
 	 * adopted since NDIS 6.20.
 	 */
 	KASSERT(sc->hn_ndis_ver >= HN_NDIS_VERSION_6_20,
 	    ("NDIS 6.20+ is required, NDIS version 0x%08x", sc->hn_ndis_ver));
 
 	/* XXX only one can be specified through, popcnt? */
 	KASSERT((sc->hn_rss_hash & NDIS_HASH_FUNCTION_MASK), ("no hash func"));
 	KASSERT((sc->hn_rss_hash & NDIS_HASH_TYPE_MASK), ("no hash types"));
 	KASSERT(sc->hn_rss_ind_size > 0, ("no indirect table size"));
 
 	if (bootverbose) {
 		if_printf(sc->hn_ifp, "RSS indirect table size %d, "
 		    "hash 0x%08x\n", sc->hn_rss_ind_size, sc->hn_rss_hash);
 	}
 
 	/*
 	 * NOTE:
 	 * DO NOT whack rss_key and rss_ind, which are setup by the caller.
 	 */
 	memset(prm, 0, sizeof(*prm));
 	rss_size = NDIS_RSSPRM_TOEPLITZ_SIZE(sc->hn_rss_ind_size);
 
 	prm->ndis_hdr.ndis_type = NDIS_OBJTYPE_RSS_PARAMS;
 	prm->ndis_hdr.ndis_rev = NDIS_RSS_PARAMS_REV_2;
 	prm->ndis_hdr.ndis_size = rss_size;
 	prm->ndis_flags = flags;
 	prm->ndis_hash = sc->hn_rss_hash;
 	prm->ndis_indsize = sizeof(rss->rss_ind[0]) * sc->hn_rss_ind_size;
 	prm->ndis_indoffset =
 	    __offsetof(struct ndis_rssprm_toeplitz, rss_ind[0]);
 	prm->ndis_keysize = sizeof(rss->rss_key);
 	prm->ndis_keyoffset =
 	    __offsetof(struct ndis_rssprm_toeplitz, rss_key[0]);
 
 	error = hn_rndis_set(sc, OID_GEN_RECEIVE_SCALE_PARAMETERS,
 	    rss, rss_size);
 	if (error) {
 		if_printf(sc->hn_ifp, "RSS config failed: %d\n", error);
 	} else {
 		if (bootverbose)
 			if_printf(sc->hn_ifp, "RSS config done\n");
 	}
 	return (error);
 }
 
 int
 hn_rndis_set_rxfilter(struct hn_softc *sc, uint32_t filter)
 {
 	int error;
 
 	error = hn_rndis_set(sc, OID_GEN_CURRENT_PACKET_FILTER,
 	    &filter, sizeof(filter));
 	if (error) {
 		if_printf(sc->hn_ifp, "set RX filter 0x%08x failed: %d\n",
 		    filter, error);
 	} else {
 		if (bootverbose) {
 			if_printf(sc->hn_ifp, "set RX filter 0x%08x done\n",
 			    filter);
 		}
 	}
 	return (error);
 }
 
 static int
 hn_rndis_init(struct hn_softc *sc)
 {
 	struct rndis_init_req *req;
 	const struct rndis_init_comp *comp;
 	struct vmbus_xact *xact;
 	size_t comp_len;
 	uint32_t rid;
 	int error;
 
 	xact = vmbus_xact_get(sc->hn_xact, sizeof(*req));
 	if (xact == NULL) {
 		if_printf(sc->hn_ifp, "no xact for RNDIS init\n");
 		return (ENXIO);
 	}
 	rid = hn_rndis_rid(sc);
 	req = vmbus_xact_req_data(xact);
 	req->rm_type = REMOTE_NDIS_INITIALIZE_MSG;
 	req->rm_len = sizeof(*req);
 	req->rm_rid = rid;
 	req->rm_ver_major = RNDIS_VERSION_MAJOR;
 	req->rm_ver_minor = RNDIS_VERSION_MINOR;
 	req->rm_max_xfersz = HN_RNDIS_XFER_SIZE;
 
 	comp_len = RNDIS_INIT_COMP_SIZE_MIN;
 	comp = hn_rndis_xact_execute(sc, xact, rid, sizeof(*req), &comp_len,
 	    REMOTE_NDIS_INITIALIZE_CMPLT);
 	if (comp == NULL) {
 		if_printf(sc->hn_ifp, "exec RNDIS init failed\n");
 		error = EIO;
 		goto done;
 	}
 
 	if (comp->rm_status != RNDIS_STATUS_SUCCESS) {
 		if_printf(sc->hn_ifp, "RNDIS init failed: status 0x%08x\n",
 		    comp->rm_status);
 		error = EIO;
 		goto done;
 	}
 	sc->hn_rndis_agg_size = comp->rm_pktmaxsz;
 	sc->hn_rndis_agg_pkts = comp->rm_pktmaxcnt;
 	sc->hn_rndis_agg_align = 1U << comp->rm_align;
 
+	if (sc->hn_rndis_agg_align < sizeof(uint32_t)) {
+		/*
+		 * The RNDIS packet messsage encap assumes that the RNDIS
+		 * packet message is at least 4 bytes aligned.  Fix up the
+		 * alignment here, if the remote side sets the alignment
+		 * too low.
+		 */
+		if_printf(sc->hn_ifp, "fixup RNDIS aggpkt align: %u -> %zu\n",
+		    sc->hn_rndis_agg_align, sizeof(uint32_t));
+		sc->hn_rndis_agg_align = sizeof(uint32_t);
+	}
+
 	if (bootverbose) {
-		if_printf(sc->hn_ifp, "RNDIS ver %u.%u, pktsz %u, pktcnt %u, "
-		    "align %u\n", comp->rm_ver_major, comp->rm_ver_minor,
+		if_printf(sc->hn_ifp, "RNDIS ver %u.%u, "
+		    "aggpkt size %u, aggpkt cnt %u, aggpkt align %u\n",
+		    comp->rm_ver_major, comp->rm_ver_minor,
 		    sc->hn_rndis_agg_size, sc->hn_rndis_agg_pkts,
 		    sc->hn_rndis_agg_align);
 	}
 	error = 0;
 done:
 	vmbus_xact_put(xact);
 	return (error);
 }
 
 static int
 hn_rndis_halt(struct hn_softc *sc)
 {
 	struct vmbus_xact *xact;
 	struct rndis_halt_req *halt;
 	struct hn_nvs_sendctx sndc;
 	size_t comp_len;
 
 	xact = vmbus_xact_get(sc->hn_xact, sizeof(*halt));
 	if (xact == NULL) {
 		if_printf(sc->hn_ifp, "no xact for RNDIS halt\n");
 		return (ENXIO);
 	}
 	halt = vmbus_xact_req_data(xact);
 	halt->rm_type = REMOTE_NDIS_HALT_MSG;
 	halt->rm_len = sizeof(*halt);
 	halt->rm_rid = hn_rndis_rid(sc);
 
 	/* No RNDIS completion; rely on NVS message send completion */
 	hn_nvs_sendctx_init(&sndc, hn_nvs_sent_xact, xact);
 	hn_rndis_xact_exec1(sc, xact, sizeof(*halt), &sndc, &comp_len);
 
 	vmbus_xact_put(xact);
 	if (bootverbose)
 		if_printf(sc->hn_ifp, "RNDIS halt done\n");
 	return (0);
 }
 
 static int
 hn_rndis_query_hwcaps(struct hn_softc *sc, struct ndis_offload *caps)
 {
 	struct ndis_offload in;
 	size_t caps_len, size;
 	int error;
 
 	memset(&in, 0, sizeof(in));
 	in.ndis_hdr.ndis_type = NDIS_OBJTYPE_OFFLOAD;
 	if (sc->hn_ndis_ver >= HN_NDIS_VERSION_6_30) {
 		in.ndis_hdr.ndis_rev = NDIS_OFFLOAD_REV_3;
 		size = NDIS_OFFLOAD_SIZE;
 	} else if (sc->hn_ndis_ver >= HN_NDIS_VERSION_6_1) {
 		in.ndis_hdr.ndis_rev = NDIS_OFFLOAD_REV_2;
 		size = NDIS_OFFLOAD_SIZE_6_1;
 	} else {
 		in.ndis_hdr.ndis_rev = NDIS_OFFLOAD_REV_1;
 		size = NDIS_OFFLOAD_SIZE_6_0;
 	}
 	in.ndis_hdr.ndis_size = size;
 
 	caps_len = NDIS_OFFLOAD_SIZE;
 	error = hn_rndis_query2(sc, OID_TCP_OFFLOAD_HARDWARE_CAPABILITIES,
 	    &in, size, caps, &caps_len, NDIS_OFFLOAD_SIZE_6_0);
 	if (error)
 		return (error);
 
 	/*
 	 * Preliminary verification.
 	 */
 	if (caps->ndis_hdr.ndis_type != NDIS_OBJTYPE_OFFLOAD) {
 		if_printf(sc->hn_ifp, "invalid NDIS objtype 0x%02x\n",
 		    caps->ndis_hdr.ndis_type);
 		return (EINVAL);
 	}
 	if (caps->ndis_hdr.ndis_rev < NDIS_OFFLOAD_REV_1) {
 		if_printf(sc->hn_ifp, "invalid NDIS objrev 0x%02x\n",
 		    caps->ndis_hdr.ndis_rev);
 		return (EINVAL);
 	}
 	if (caps->ndis_hdr.ndis_size > caps_len) {
 		if_printf(sc->hn_ifp, "invalid NDIS objsize %u, "
 		    "data size %zu\n", caps->ndis_hdr.ndis_size, caps_len);
 		return (EINVAL);
 	} else if (caps->ndis_hdr.ndis_size < NDIS_OFFLOAD_SIZE_6_0) {
 		if_printf(sc->hn_ifp, "invalid NDIS objsize %u\n",
 		    caps->ndis_hdr.ndis_size);
 		return (EINVAL);
 	}
 
 	if (bootverbose) {
 		/*
 		 * NOTE:
 		 * caps->ndis_hdr.ndis_size MUST be checked before accessing
 		 * NDIS 6.1+ specific fields.
 		 */
 		if_printf(sc->hn_ifp, "hwcaps rev %u\n",
 		    caps->ndis_hdr.ndis_rev);
 
 		if_printf(sc->hn_ifp, "hwcaps csum: "
 		    "ip4 tx 0x%x/0x%x rx 0x%x/0x%x, "
 		    "ip6 tx 0x%x/0x%x rx 0x%x/0x%x\n",
 		    caps->ndis_csum.ndis_ip4_txcsum,
 		    caps->ndis_csum.ndis_ip4_txenc,
 		    caps->ndis_csum.ndis_ip4_rxcsum,
 		    caps->ndis_csum.ndis_ip4_rxenc,
 		    caps->ndis_csum.ndis_ip6_txcsum,
 		    caps->ndis_csum.ndis_ip6_txenc,
 		    caps->ndis_csum.ndis_ip6_rxcsum,
 		    caps->ndis_csum.ndis_ip6_rxenc);
 		if_printf(sc->hn_ifp, "hwcaps lsov2: "
 		    "ip4 maxsz %u minsg %u encap 0x%x, "
 		    "ip6 maxsz %u minsg %u encap 0x%x opts 0x%x\n",
 		    caps->ndis_lsov2.ndis_ip4_maxsz,
 		    caps->ndis_lsov2.ndis_ip4_minsg,
 		    caps->ndis_lsov2.ndis_ip4_encap,
 		    caps->ndis_lsov2.ndis_ip6_maxsz,
 		    caps->ndis_lsov2.ndis_ip6_minsg,
 		    caps->ndis_lsov2.ndis_ip6_encap,
 		    caps->ndis_lsov2.ndis_ip6_opts);
 	}
 	return (0);
 }
 
 int
 hn_rndis_attach(struct hn_softc *sc, int mtu)
 {
 	int error;
 
 	/*
 	 * Initialize RNDIS.
 	 */
 	error = hn_rndis_init(sc);
 	if (error)
 		return (error);
 
 	/*
 	 * Configure NDIS offload settings.
 	 */
 	hn_rndis_conf_offload(sc, mtu);
 	return (0);
 }
 
 void
 hn_rndis_detach(struct hn_softc *sc)
 {
 
 	/* Halt the RNDIS. */
 	hn_rndis_halt(sc);
 }
Index: projects/clang400-import/sys/dev/hyperv/netvsc/if_hn.c
===================================================================
--- projects/clang400-import/sys/dev/hyperv/netvsc/if_hn.c	(revision 314522)
+++ projects/clang400-import/sys/dev/hyperv/netvsc/if_hn.c	(revision 314523)
@@ -1,5740 +1,5739 @@
 /*-
  * Copyright (c) 2010-2012 Citrix Inc.
  * Copyright (c) 2009-2012,2016 Microsoft Corp.
  * Copyright (c) 2012 NetApp Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice unmodified, this list of conditions, and the following
  *    disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 /*-
  * Copyright (c) 2004-2006 Kip Macy
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_hn.h"
 #include "opt_inet6.h"
 #include "opt_inet.h"
 #include "opt_rss.h"
 
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/kernel.h>
 #include <sys/limits.h>
 #include <sys/malloc.h>
 #include <sys/mbuf.h>
 #include <sys/module.h>
 #include <sys/queue.h>
 #include <sys/lock.h>
 #include <sys/smp.h>
 #include <sys/socket.h>
 #include <sys/sockio.h>
 #include <sys/sx.h>
 #include <sys/sysctl.h>
 #include <sys/systm.h>
 #include <sys/taskqueue.h>
 #include <sys/buf_ring.h>
 #include <sys/eventhandler.h>
 
 #include <machine/atomic.h>
 #include <machine/in_cksum.h>
 
 #include <net/bpf.h>
 #include <net/ethernet.h>
 #include <net/if.h>
 #include <net/if_dl.h>
 #include <net/if_media.h>
 #include <net/if_types.h>
 #include <net/if_var.h>
 #include <net/rndis.h>
 #ifdef RSS
 #include <net/rss_config.h>
 #endif
 
 #include <netinet/in_systm.h>
 #include <netinet/in.h>
 #include <netinet/ip.h>
 #include <netinet/ip6.h>
 #include <netinet/tcp.h>
 #include <netinet/tcp_lro.h>
 #include <netinet/udp.h>
 
 #include <dev/hyperv/include/hyperv.h>
 #include <dev/hyperv/include/hyperv_busdma.h>
 #include <dev/hyperv/include/vmbus.h>
 #include <dev/hyperv/include/vmbus_xact.h>
 
 #include <dev/hyperv/netvsc/ndis.h>
 #include <dev/hyperv/netvsc/if_hnreg.h>
 #include <dev/hyperv/netvsc/if_hnvar.h>
 #include <dev/hyperv/netvsc/hn_nvs.h>
 #include <dev/hyperv/netvsc/hn_rndis.h>
 
 #include "vmbus_if.h"
 
 #define HN_IFSTART_SUPPORT
 
 #define HN_RING_CNT_DEF_MAX		8
 
 /* YYY should get it from the underlying channel */
 #define HN_TX_DESC_CNT			512
 
 #define HN_RNDIS_PKT_LEN					\
 	(sizeof(struct rndis_packet_msg) +			\
 	 HN_RNDIS_PKTINFO_SIZE(HN_NDIS_HASH_VALUE_SIZE) +	\
 	 HN_RNDIS_PKTINFO_SIZE(NDIS_VLAN_INFO_SIZE) +		\
 	 HN_RNDIS_PKTINFO_SIZE(NDIS_LSO2_INFO_SIZE) +		\
 	 HN_RNDIS_PKTINFO_SIZE(NDIS_TXCSUM_INFO_SIZE))
 #define HN_RNDIS_PKT_BOUNDARY		PAGE_SIZE
 #define HN_RNDIS_PKT_ALIGN		CACHE_LINE_SIZE
 
 #define HN_TX_DATA_BOUNDARY		PAGE_SIZE
 #define HN_TX_DATA_MAXSIZE		IP_MAXPACKET
 #define HN_TX_DATA_SEGSIZE		PAGE_SIZE
 /* -1 for RNDIS packet message */
 #define HN_TX_DATA_SEGCNT_MAX		(HN_GPACNT_MAX - 1)
 
 #define HN_DIRECT_TX_SIZE_DEF		128
 
 #define HN_EARLY_TXEOF_THRESH		8
 
 #define HN_PKTBUF_LEN_DEF		(16 * 1024)
 
 #define HN_LROENT_CNT_DEF		128
 
 #define HN_LRO_LENLIM_MULTIRX_DEF	(12 * ETHERMTU)
 #define HN_LRO_LENLIM_DEF		(25 * ETHERMTU)
 /* YYY 2*MTU is a bit rough, but should be good enough. */
 #define HN_LRO_LENLIM_MIN(ifp)		(2 * (ifp)->if_mtu)
 
 #define HN_LRO_ACKCNT_DEF		1
 
 #define HN_LOCK_INIT(sc)		\
 	sx_init(&(sc)->hn_lock, device_get_nameunit((sc)->hn_dev))
 #define HN_LOCK_DESTROY(sc)		sx_destroy(&(sc)->hn_lock)
 #define HN_LOCK_ASSERT(sc)		sx_assert(&(sc)->hn_lock, SA_XLOCKED)
 #define HN_LOCK(sc)					\
 do {							\
 	while (sx_try_xlock(&(sc)->hn_lock) == 0)	\
 		DELAY(1000);				\
 } while (0)
 #define HN_UNLOCK(sc)			sx_xunlock(&(sc)->hn_lock)
 
 #define HN_CSUM_IP_MASK			(CSUM_IP | CSUM_IP_TCP | CSUM_IP_UDP)
 #define HN_CSUM_IP6_MASK		(CSUM_IP6_TCP | CSUM_IP6_UDP)
 #define HN_CSUM_IP_HWASSIST(sc)		\
 	((sc)->hn_tx_ring[0].hn_csum_assist & HN_CSUM_IP_MASK)
 #define HN_CSUM_IP6_HWASSIST(sc)	\
 	((sc)->hn_tx_ring[0].hn_csum_assist & HN_CSUM_IP6_MASK)
 
 #define HN_PKTSIZE_MIN(align)		\
 	roundup2(ETHER_MIN_LEN + ETHER_VLAN_ENCAP_LEN - ETHER_CRC_LEN + \
 	    HN_RNDIS_PKT_LEN, (align))
 #define HN_PKTSIZE(m, align)		\
 	roundup2((m)->m_pkthdr.len + HN_RNDIS_PKT_LEN, (align))
 
 #ifdef RSS
 #define HN_RING_IDX2CPU(sc, idx)	rss_getcpu((idx) % rss_getnumbuckets())
 #else
 #define HN_RING_IDX2CPU(sc, idx)	(((sc)->hn_cpu + (idx)) % mp_ncpus)
 #endif
 
 struct hn_txdesc {
 #ifndef HN_USE_TXDESC_BUFRING
 	SLIST_ENTRY(hn_txdesc)		link;
 #endif
 	STAILQ_ENTRY(hn_txdesc)		agg_link;
 
 	/* Aggregated txdescs, in sending order. */
 	STAILQ_HEAD(, hn_txdesc)	agg_list;
 
 	/* The oldest packet, if transmission aggregation happens. */
 	struct mbuf			*m;
 	struct hn_tx_ring		*txr;
 	int				refs;
 	uint32_t			flags;	/* HN_TXD_FLAG_ */
 	struct hn_nvs_sendctx		send_ctx;
 	uint32_t			chim_index;
 	int				chim_size;
 
 	bus_dmamap_t			data_dmap;
 
 	bus_addr_t			rndis_pkt_paddr;
 	struct rndis_packet_msg		*rndis_pkt;
 	bus_dmamap_t			rndis_pkt_dmap;
 };
 
 #define HN_TXD_FLAG_ONLIST		0x0001
 #define HN_TXD_FLAG_DMAMAP		0x0002
 #define HN_TXD_FLAG_ONAGG		0x0004
 
 struct hn_rxinfo {
 	uint32_t			vlan_info;
 	uint32_t			csum_info;
 	uint32_t			hash_info;
 	uint32_t			hash_value;
 };
 
 struct hn_update_vf {
 	struct hn_rx_ring	*rxr;
 	struct ifnet		*vf;
 };
 
 #define HN_RXINFO_VLAN			0x0001
 #define HN_RXINFO_CSUM			0x0002
 #define HN_RXINFO_HASHINF		0x0004
 #define HN_RXINFO_HASHVAL		0x0008
 #define HN_RXINFO_ALL			\
 	(HN_RXINFO_VLAN |		\
 	 HN_RXINFO_CSUM |		\
 	 HN_RXINFO_HASHINF |		\
 	 HN_RXINFO_HASHVAL)
 
 #define HN_NDIS_VLAN_INFO_INVALID	0xffffffff
 #define HN_NDIS_RXCSUM_INFO_INVALID	0
 #define HN_NDIS_HASH_INFO_INVALID	0
 
 static int			hn_probe(device_t);
 static int			hn_attach(device_t);
 static int			hn_detach(device_t);
 static int			hn_shutdown(device_t);
 static void			hn_chan_callback(struct vmbus_channel *,
 				    void *);
 
 static void			hn_init(void *);
 static int			hn_ioctl(struct ifnet *, u_long, caddr_t);
 #ifdef HN_IFSTART_SUPPORT
 static void			hn_start(struct ifnet *);
 #endif
 static int			hn_transmit(struct ifnet *, struct mbuf *);
 static void			hn_xmit_qflush(struct ifnet *);
 static int			hn_ifmedia_upd(struct ifnet *);
 static void			hn_ifmedia_sts(struct ifnet *,
 				    struct ifmediareq *);
 
 static int			hn_rndis_rxinfo(const void *, int,
 				    struct hn_rxinfo *);
 static void			hn_rndis_rx_data(struct hn_rx_ring *,
 				    const void *, int);
 static void			hn_rndis_rx_status(struct hn_softc *,
 				    const void *, int);
 
 static void			hn_nvs_handle_notify(struct hn_softc *,
 				    const struct vmbus_chanpkt_hdr *);
 static void			hn_nvs_handle_comp(struct hn_softc *,
 				    struct vmbus_channel *,
 				    const struct vmbus_chanpkt_hdr *);
 static void			hn_nvs_handle_rxbuf(struct hn_rx_ring *,
 				    struct vmbus_channel *,
 				    const struct vmbus_chanpkt_hdr *);
 static void			hn_nvs_ack_rxbuf(struct hn_rx_ring *,
 				    struct vmbus_channel *, uint64_t);
 
 #if __FreeBSD_version >= 1100099
 static int			hn_lro_lenlim_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_lro_ackcnt_sysctl(SYSCTL_HANDLER_ARGS);
 #endif
 static int			hn_trust_hcsum_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_chim_size_sysctl(SYSCTL_HANDLER_ARGS);
 #if __FreeBSD_version < 1100095
 static int			hn_rx_stat_int_sysctl(SYSCTL_HANDLER_ARGS);
 #else
 static int			hn_rx_stat_u64_sysctl(SYSCTL_HANDLER_ARGS);
 #endif
 static int			hn_rx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_tx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_tx_conf_int_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_ndis_version_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_caps_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_hwassist_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_rxfilter_sysctl(SYSCTL_HANDLER_ARGS);
 #ifndef RSS
 static int			hn_rss_key_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_rss_ind_sysctl(SYSCTL_HANDLER_ARGS);
 #endif
 static int			hn_rss_hash_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_txagg_size_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_txagg_pkts_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_txagg_pktmax_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_txagg_align_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_polling_sysctl(SYSCTL_HANDLER_ARGS);
 static int			hn_vf_sysctl(SYSCTL_HANDLER_ARGS);
 
 static void			hn_stop(struct hn_softc *, bool);
 static void			hn_init_locked(struct hn_softc *);
 static int			hn_chan_attach(struct hn_softc *,
 				    struct vmbus_channel *);
 static void			hn_chan_detach(struct hn_softc *,
 				    struct vmbus_channel *);
 static int			hn_attach_subchans(struct hn_softc *);
 static void			hn_detach_allchans(struct hn_softc *);
 static void			hn_chan_rollup(struct hn_rx_ring *,
 				    struct hn_tx_ring *);
 static void			hn_set_ring_inuse(struct hn_softc *, int);
 static int			hn_synth_attach(struct hn_softc *, int);
 static void			hn_synth_detach(struct hn_softc *);
 static int			hn_synth_alloc_subchans(struct hn_softc *,
 				    int *);
 static bool			hn_synth_attachable(const struct hn_softc *);
 static void			hn_suspend(struct hn_softc *);
 static void			hn_suspend_data(struct hn_softc *);
 static void			hn_suspend_mgmt(struct hn_softc *);
 static void			hn_resume(struct hn_softc *);
 static void			hn_resume_data(struct hn_softc *);
 static void			hn_resume_mgmt(struct hn_softc *);
 static void			hn_suspend_mgmt_taskfunc(void *, int);
 static void			hn_chan_drain(struct hn_softc *,
 				    struct vmbus_channel *);
 static void			hn_polling(struct hn_softc *, u_int);
 static void			hn_chan_polling(struct vmbus_channel *, u_int);
 
 static void			hn_update_link_status(struct hn_softc *);
 static void			hn_change_network(struct hn_softc *);
 static void			hn_link_taskfunc(void *, int);
 static void			hn_netchg_init_taskfunc(void *, int);
 static void			hn_netchg_status_taskfunc(void *, int);
 static void			hn_link_status(struct hn_softc *);
 
 static int			hn_create_rx_data(struct hn_softc *, int);
 static void			hn_destroy_rx_data(struct hn_softc *);
 static int			hn_check_iplen(const struct mbuf *, int);
 static int			hn_set_rxfilter(struct hn_softc *, uint32_t);
 static int			hn_rxfilter_config(struct hn_softc *);
 #ifndef RSS
 static int			hn_rss_reconfig(struct hn_softc *);
 #endif
 static void			hn_rss_ind_fixup(struct hn_softc *);
 static int			hn_rxpkt(struct hn_rx_ring *, const void *,
 				    int, const struct hn_rxinfo *);
 
 static int			hn_tx_ring_create(struct hn_softc *, int);
 static void			hn_tx_ring_destroy(struct hn_tx_ring *);
 static int			hn_create_tx_data(struct hn_softc *, int);
 static void			hn_fixup_tx_data(struct hn_softc *);
 static void			hn_destroy_tx_data(struct hn_softc *);
 static void			hn_txdesc_dmamap_destroy(struct hn_txdesc *);
 static void			hn_txdesc_gc(struct hn_tx_ring *,
 				    struct hn_txdesc *);
 static int			hn_encap(struct ifnet *, struct hn_tx_ring *,
 				    struct hn_txdesc *, struct mbuf **);
 static int			hn_txpkt(struct ifnet *, struct hn_tx_ring *,
 				    struct hn_txdesc *);
 static void			hn_set_chim_size(struct hn_softc *, int);
 static void			hn_set_tso_maxsize(struct hn_softc *, int, int);
 static bool			hn_tx_ring_pending(struct hn_tx_ring *);
 static void			hn_tx_ring_qflush(struct hn_tx_ring *);
 static void			hn_resume_tx(struct hn_softc *, int);
 static void			hn_set_txagg(struct hn_softc *);
 static void			*hn_try_txagg(struct ifnet *,
 				    struct hn_tx_ring *, struct hn_txdesc *,
 				    int);
 static int			hn_get_txswq_depth(const struct hn_tx_ring *);
 static void			hn_txpkt_done(struct hn_nvs_sendctx *,
 				    struct hn_softc *, struct vmbus_channel *,
 				    const void *, int);
 static int			hn_txpkt_sglist(struct hn_tx_ring *,
 				    struct hn_txdesc *);
 static int			hn_txpkt_chim(struct hn_tx_ring *,
 				    struct hn_txdesc *);
 static int			hn_xmit(struct hn_tx_ring *, int);
 static void			hn_xmit_taskfunc(void *, int);
 static void			hn_xmit_txeof(struct hn_tx_ring *);
 static void			hn_xmit_txeof_taskfunc(void *, int);
 #ifdef HN_IFSTART_SUPPORT
 static int			hn_start_locked(struct hn_tx_ring *, int);
 static void			hn_start_taskfunc(void *, int);
 static void			hn_start_txeof(struct hn_tx_ring *);
 static void			hn_start_txeof_taskfunc(void *, int);
 #endif
 
 SYSCTL_NODE(_hw, OID_AUTO, hn, CTLFLAG_RD | CTLFLAG_MPSAFE, NULL,
     "Hyper-V network interface");
 
 /* Trust tcp segements verification on host side. */
 static int			hn_trust_hosttcp = 1;
 SYSCTL_INT(_hw_hn, OID_AUTO, trust_hosttcp, CTLFLAG_RDTUN,
     &hn_trust_hosttcp, 0,
     "Trust tcp segement verification on host side, "
     "when csum info is missing (global setting)");
 
 /* Trust udp datagrams verification on host side. */
 static int			hn_trust_hostudp = 1;
 SYSCTL_INT(_hw_hn, OID_AUTO, trust_hostudp, CTLFLAG_RDTUN,
     &hn_trust_hostudp, 0,
     "Trust udp datagram verification on host side, "
     "when csum info is missing (global setting)");
 
 /* Trust ip packets verification on host side. */
 static int			hn_trust_hostip = 1;
 SYSCTL_INT(_hw_hn, OID_AUTO, trust_hostip, CTLFLAG_RDTUN,
     &hn_trust_hostip, 0,
     "Trust ip packet verification on host side, "
     "when csum info is missing (global setting)");
 
 /* Limit TSO burst size */
 static int			hn_tso_maxlen = IP_MAXPACKET;
 SYSCTL_INT(_hw_hn, OID_AUTO, tso_maxlen, CTLFLAG_RDTUN,
     &hn_tso_maxlen, 0, "TSO burst limit");
 
 /* Limit chimney send size */
 static int			hn_tx_chimney_size = 0;
 SYSCTL_INT(_hw_hn, OID_AUTO, tx_chimney_size, CTLFLAG_RDTUN,
     &hn_tx_chimney_size, 0, "Chimney send packet size limit");
 
 /* Limit the size of packet for direct transmission */
 static int			hn_direct_tx_size = HN_DIRECT_TX_SIZE_DEF;
 SYSCTL_INT(_hw_hn, OID_AUTO, direct_tx_size, CTLFLAG_RDTUN,
     &hn_direct_tx_size, 0, "Size of the packet for direct transmission");
 
 /* # of LRO entries per RX ring */
 #if defined(INET) || defined(INET6)
 #if __FreeBSD_version >= 1100095
 static int			hn_lro_entry_count = HN_LROENT_CNT_DEF;
 SYSCTL_INT(_hw_hn, OID_AUTO, lro_entry_count, CTLFLAG_RDTUN,
     &hn_lro_entry_count, 0, "LRO entry count");
 #endif
 #endif
 
 static int			hn_tx_taskq_cnt = 1;
 SYSCTL_INT(_hw_hn, OID_AUTO, tx_taskq_cnt, CTLFLAG_RDTUN,
     &hn_tx_taskq_cnt, 0, "# of TX taskqueues");
 
 #define HN_TX_TASKQ_M_INDEP	0
 #define HN_TX_TASKQ_M_GLOBAL	1
 #define HN_TX_TASKQ_M_EVTTQ	2
 
 static int			hn_tx_taskq_mode = HN_TX_TASKQ_M_INDEP;
 SYSCTL_INT(_hw_hn, OID_AUTO, tx_taskq_mode, CTLFLAG_RDTUN,
     &hn_tx_taskq_mode, 0, "TX taskqueue modes: "
     "0 - independent, 1 - share global tx taskqs, 2 - share event taskqs");
 
 #ifndef HN_USE_TXDESC_BUFRING
 static int			hn_use_txdesc_bufring = 0;
 #else
 static int			hn_use_txdesc_bufring = 1;
 #endif
 SYSCTL_INT(_hw_hn, OID_AUTO, use_txdesc_bufring, CTLFLAG_RD,
     &hn_use_txdesc_bufring, 0, "Use buf_ring for TX descriptors");
 
 #ifdef HN_IFSTART_SUPPORT
 /* Use ifnet.if_start instead of ifnet.if_transmit */
 static int			hn_use_if_start = 0;
 SYSCTL_INT(_hw_hn, OID_AUTO, use_if_start, CTLFLAG_RDTUN,
     &hn_use_if_start, 0, "Use if_start TX method");
 #endif
 
 /* # of channels to use */
 static int			hn_chan_cnt = 0;
 SYSCTL_INT(_hw_hn, OID_AUTO, chan_cnt, CTLFLAG_RDTUN,
     &hn_chan_cnt, 0,
     "# of channels to use; each channel has one RX ring and one TX ring");
 
 /* # of transmit rings to use */
 static int			hn_tx_ring_cnt = 0;
 SYSCTL_INT(_hw_hn, OID_AUTO, tx_ring_cnt, CTLFLAG_RDTUN,
     &hn_tx_ring_cnt, 0, "# of TX rings to use");
 
 /* Software TX ring deptch */
 static int			hn_tx_swq_depth = 0;
 SYSCTL_INT(_hw_hn, OID_AUTO, tx_swq_depth, CTLFLAG_RDTUN,
     &hn_tx_swq_depth, 0, "Depth of IFQ or BUFRING");
 
 /* Enable sorted LRO, and the depth of the per-channel mbuf queue */
 #if __FreeBSD_version >= 1100095
 static u_int			hn_lro_mbufq_depth = 0;
 SYSCTL_UINT(_hw_hn, OID_AUTO, lro_mbufq_depth, CTLFLAG_RDTUN,
     &hn_lro_mbufq_depth, 0, "Depth of LRO mbuf queue");
 #endif
 
 /* Packet transmission aggregation size limit */
 static int			hn_tx_agg_size = -1;
 SYSCTL_INT(_hw_hn, OID_AUTO, tx_agg_size, CTLFLAG_RDTUN,
     &hn_tx_agg_size, 0, "Packet transmission aggregation size limit");
 
 /* Packet transmission aggregation count limit */
 static int			hn_tx_agg_pkts = -1;
 SYSCTL_INT(_hw_hn, OID_AUTO, tx_agg_pkts, CTLFLAG_RDTUN,
     &hn_tx_agg_pkts, 0, "Packet transmission aggregation packet limit");
 
 static u_int			hn_cpu_index;	/* next CPU for channel */
 static struct taskqueue		**hn_tx_taskque;/* shared TX taskqueues */
 
 #ifndef RSS
 static const uint8_t
 hn_rss_key_default[NDIS_HASH_KEYSIZE_TOEPLITZ] = {
 	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
 	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
 	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
 	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
 	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa
 };
 #endif	/* !RSS */
 
 static device_method_t hn_methods[] = {
 	/* Device interface */
 	DEVMETHOD(device_probe,		hn_probe),
 	DEVMETHOD(device_attach,	hn_attach),
 	DEVMETHOD(device_detach,	hn_detach),
 	DEVMETHOD(device_shutdown,	hn_shutdown),
 	DEVMETHOD_END
 };
 
 static driver_t hn_driver = {
 	"hn",
 	hn_methods,
 	sizeof(struct hn_softc)
 };
 
 static devclass_t hn_devclass;
 
 DRIVER_MODULE(hn, vmbus, hn_driver, hn_devclass, 0, 0);
 MODULE_VERSION(hn, 1);
 MODULE_DEPEND(hn, vmbus, 1, 1, 1);
 
 #if __FreeBSD_version >= 1100099
 static void
 hn_set_lro_lenlim(struct hn_softc *sc, int lenlim)
 {
 	int i;
 
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i)
 		sc->hn_rx_ring[i].hn_lro.lro_length_lim = lenlim;
 }
 #endif
 
 static int
 hn_txpkt_sglist(struct hn_tx_ring *txr, struct hn_txdesc *txd)
 {
 
 	KASSERT(txd->chim_index == HN_NVS_CHIM_IDX_INVALID &&
 	    txd->chim_size == 0, ("invalid rndis sglist txd"));
 	return (hn_nvs_send_rndis_sglist(txr->hn_chan, HN_NVS_RNDIS_MTYPE_DATA,
 	    &txd->send_ctx, txr->hn_gpa, txr->hn_gpa_cnt));
 }
 
 static int
 hn_txpkt_chim(struct hn_tx_ring *txr, struct hn_txdesc *txd)
 {
 	struct hn_nvs_rndis rndis;
 
 	KASSERT(txd->chim_index != HN_NVS_CHIM_IDX_INVALID &&
 	    txd->chim_size > 0, ("invalid rndis chim txd"));
 
 	rndis.nvs_type = HN_NVS_TYPE_RNDIS;
 	rndis.nvs_rndis_mtype = HN_NVS_RNDIS_MTYPE_DATA;
 	rndis.nvs_chim_idx = txd->chim_index;
 	rndis.nvs_chim_sz = txd->chim_size;
 
 	return (hn_nvs_send(txr->hn_chan, VMBUS_CHANPKT_FLAG_RC,
 	    &rndis, sizeof(rndis), &txd->send_ctx));
 }
 
 static __inline uint32_t
 hn_chim_alloc(struct hn_softc *sc)
 {
 	int i, bmap_cnt = sc->hn_chim_bmap_cnt;
 	u_long *bmap = sc->hn_chim_bmap;
 	uint32_t ret = HN_NVS_CHIM_IDX_INVALID;
 
 	for (i = 0; i < bmap_cnt; ++i) {
 		int idx;
 
 		idx = ffsl(~bmap[i]);
 		if (idx == 0)
 			continue;
 
 		--idx; /* ffsl is 1-based */
 		KASSERT(i * LONG_BIT + idx < sc->hn_chim_cnt,
 		    ("invalid i %d and idx %d", i, idx));
 
 		if (atomic_testandset_long(&bmap[i], idx))
 			continue;
 
 		ret = i * LONG_BIT + idx;
 		break;
 	}
 	return (ret);
 }
 
 static __inline void
 hn_chim_free(struct hn_softc *sc, uint32_t chim_idx)
 {
 	u_long mask;
 	uint32_t idx;
 
 	idx = chim_idx / LONG_BIT;
 	KASSERT(idx < sc->hn_chim_bmap_cnt,
 	    ("invalid chimney index 0x%x", chim_idx));
 
 	mask = 1UL << (chim_idx % LONG_BIT);
 	KASSERT(sc->hn_chim_bmap[idx] & mask,
 	    ("index bitmap 0x%lx, chimney index %u, "
 	     "bitmap idx %d, bitmask 0x%lx",
 	     sc->hn_chim_bmap[idx], chim_idx, idx, mask));
 
 	atomic_clear_long(&sc->hn_chim_bmap[idx], mask);
 }
 
 #if defined(INET6) || defined(INET)
 /*
  * NOTE: If this function failed, the m_head would be freed.
  */
 static __inline struct mbuf *
 hn_tso_fixup(struct mbuf *m_head)
 {
 	struct ether_vlan_header *evl;
 	struct tcphdr *th;
 	int ehlen;
 
 	KASSERT(M_WRITABLE(m_head), ("TSO mbuf not writable"));
 
 #define PULLUP_HDR(m, len)				\
 do {							\
 	if (__predict_false((m)->m_len < (len))) {	\
 		(m) = m_pullup((m), (len));		\
 		if ((m) == NULL)			\
 			return (NULL);			\
 	}						\
 } while (0)
 
 	PULLUP_HDR(m_head, sizeof(*evl));
 	evl = mtod(m_head, struct ether_vlan_header *);
 	if (evl->evl_encap_proto == ntohs(ETHERTYPE_VLAN))
 		ehlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN;
 	else
 		ehlen = ETHER_HDR_LEN;
 
 #ifdef INET
 	if (m_head->m_pkthdr.csum_flags & CSUM_IP_TSO) {
 		struct ip *ip;
 		int iphlen;
 
 		PULLUP_HDR(m_head, ehlen + sizeof(*ip));
 		ip = mtodo(m_head, ehlen);
 		iphlen = ip->ip_hl << 2;
 
 		PULLUP_HDR(m_head, ehlen + iphlen + sizeof(*th));
 		th = mtodo(m_head, ehlen + iphlen);
 
 		ip->ip_len = 0;
 		ip->ip_sum = 0;
 		th->th_sum = in_pseudo(ip->ip_src.s_addr,
 		    ip->ip_dst.s_addr, htons(IPPROTO_TCP));
 	}
 #endif
 #if defined(INET6) && defined(INET)
 	else
 #endif
 #ifdef INET6
 	{
 		struct ip6_hdr *ip6;
 
 		PULLUP_HDR(m_head, ehlen + sizeof(*ip6));
 		ip6 = mtodo(m_head, ehlen);
 		if (ip6->ip6_nxt != IPPROTO_TCP) {
 			m_freem(m_head);
 			return (NULL);
 		}
 
 		PULLUP_HDR(m_head, ehlen + sizeof(*ip6) + sizeof(*th));
 		th = mtodo(m_head, ehlen + sizeof(*ip6));
 
 		ip6->ip6_plen = 0;
 		th->th_sum = in6_cksum_pseudo(ip6, 0, IPPROTO_TCP, 0);
 	}
 #endif
 	return (m_head);
 
 #undef PULLUP_HDR
 }
 #endif	/* INET6 || INET */
 
 static int
 hn_set_rxfilter(struct hn_softc *sc, uint32_t filter)
 {
 	int error = 0;
 
 	HN_LOCK_ASSERT(sc);
 
 	if (sc->hn_rx_filter != filter) {
 		error = hn_rndis_set_rxfilter(sc, filter);
 		if (!error)
 			sc->hn_rx_filter = filter;
 	}
 	return (error);
 }
 
 static int
 hn_rxfilter_config(struct hn_softc *sc)
 {
 	struct ifnet *ifp = sc->hn_ifp;
 	uint32_t filter;
 
 	HN_LOCK_ASSERT(sc);
 
 	if ((ifp->if_flags & IFF_PROMISC) ||
 	    (sc->hn_flags & HN_FLAG_VF)) {
 		filter = NDIS_PACKET_TYPE_PROMISCUOUS;
 	} else {
 		filter = NDIS_PACKET_TYPE_DIRECTED;
 		if (ifp->if_flags & IFF_BROADCAST)
 			filter |= NDIS_PACKET_TYPE_BROADCAST;
 		/* TODO: support multicast list */
 		if ((ifp->if_flags & IFF_ALLMULTI) ||
 		    !TAILQ_EMPTY(&ifp->if_multiaddrs))
 			filter |= NDIS_PACKET_TYPE_ALL_MULTICAST;
 	}
 	return (hn_set_rxfilter(sc, filter));
 }
 
 static void
 hn_set_txagg(struct hn_softc *sc)
 {
 	uint32_t size, pkts;
 	int i;
 
 	/*
 	 * Setup aggregation size.
 	 */
 	if (sc->hn_agg_size < 0)
 		size = UINT32_MAX;
 	else
 		size = sc->hn_agg_size;
 
 	if (sc->hn_rndis_agg_size < size)
 		size = sc->hn_rndis_agg_size;
 
 	/* NOTE: We only aggregate packets using chimney sending buffers. */
 	if (size > (uint32_t)sc->hn_chim_szmax)
 		size = sc->hn_chim_szmax;
 
 	if (size <= 2 * HN_PKTSIZE_MIN(sc->hn_rndis_agg_align)) {
 		/* Disable */
 		size = 0;
 		pkts = 0;
 		goto done;
 	}
 
 	/* NOTE: Type of the per TX ring setting is 'int'. */
 	if (size > INT_MAX)
 		size = INT_MAX;
 
 	/*
 	 * Setup aggregation packet count.
 	 */
 	if (sc->hn_agg_pkts < 0)
 		pkts = UINT32_MAX;
 	else
 		pkts = sc->hn_agg_pkts;
 
 	if (sc->hn_rndis_agg_pkts < pkts)
 		pkts = sc->hn_rndis_agg_pkts;
 
 	if (pkts <= 1) {
 		/* Disable */
 		size = 0;
 		pkts = 0;
 		goto done;
 	}
 
 	/* NOTE: Type of the per TX ring setting is 'short'. */
 	if (pkts > SHRT_MAX)
 		pkts = SHRT_MAX;
 
 done:
 	/* NOTE: Type of the per TX ring setting is 'short'. */
 	if (sc->hn_rndis_agg_align > SHRT_MAX) {
 		/* Disable */
 		size = 0;
 		pkts = 0;
 	}
 
 	if (bootverbose) {
 		if_printf(sc->hn_ifp, "TX agg size %u, pkts %u, align %u\n",
 		    size, pkts, sc->hn_rndis_agg_align);
 	}
 
 	for (i = 0; i < sc->hn_tx_ring_cnt; ++i) {
 		struct hn_tx_ring *txr = &sc->hn_tx_ring[i];
 
 		mtx_lock(&txr->hn_tx_lock);
 		txr->hn_agg_szmax = size;
 		txr->hn_agg_pktmax = pkts;
 		txr->hn_agg_align = sc->hn_rndis_agg_align;
 		mtx_unlock(&txr->hn_tx_lock);
 	}
 }
 
 static int
 hn_get_txswq_depth(const struct hn_tx_ring *txr)
 {
 
 	KASSERT(txr->hn_txdesc_cnt > 0, ("tx ring is not setup yet"));
 	if (hn_tx_swq_depth < txr->hn_txdesc_cnt)
 		return txr->hn_txdesc_cnt;
 	return hn_tx_swq_depth;
 }
 
 #ifndef RSS
 static int
 hn_rss_reconfig(struct hn_softc *sc)
 {
 	int error;
 
 	HN_LOCK_ASSERT(sc);
 
 	if ((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0)
 		return (ENXIO);
 
 	/*
 	 * Disable RSS first.
 	 *
 	 * NOTE:
 	 * Direct reconfiguration by setting the UNCHG flags does
 	 * _not_ work properly.
 	 */
 	if (bootverbose)
 		if_printf(sc->hn_ifp, "disable RSS\n");
 	error = hn_rndis_conf_rss(sc, NDIS_RSS_FLAG_DISABLE);
 	if (error) {
 		if_printf(sc->hn_ifp, "RSS disable failed\n");
 		return (error);
 	}
 
 	/*
 	 * Reenable the RSS w/ the updated RSS key or indirect
 	 * table.
 	 */
 	if (bootverbose)
 		if_printf(sc->hn_ifp, "reconfig RSS\n");
 	error = hn_rndis_conf_rss(sc, NDIS_RSS_FLAG_NONE);
 	if (error) {
 		if_printf(sc->hn_ifp, "RSS reconfig failed\n");
 		return (error);
 	}
 	return (0);
 }
 #endif	/* !RSS */
 
 static void
 hn_rss_ind_fixup(struct hn_softc *sc)
 {
 	struct ndis_rssprm_toeplitz *rss = &sc->hn_rss;
 	int i, nchan;
 
 	nchan = sc->hn_rx_ring_inuse;
 	KASSERT(nchan > 1, ("invalid # of channels %d", nchan));
 
 	/*
 	 * Check indirect table to make sure that all channels in it
 	 * can be used.
 	 */
 	for (i = 0; i < NDIS_HASH_INDCNT; ++i) {
 		if (rss->rss_ind[i] >= nchan) {
 			if_printf(sc->hn_ifp,
 			    "RSS indirect table %d fixup: %u -> %d\n",
 			    i, rss->rss_ind[i], nchan - 1);
 			rss->rss_ind[i] = nchan - 1;
 		}
 	}
 }
 
 static int
 hn_ifmedia_upd(struct ifnet *ifp __unused)
 {
 
 	return EOPNOTSUPP;
 }
 
 static void
 hn_ifmedia_sts(struct ifnet *ifp, struct ifmediareq *ifmr)
 {
 	struct hn_softc *sc = ifp->if_softc;
 
 	ifmr->ifm_status = IFM_AVALID;
 	ifmr->ifm_active = IFM_ETHER;
 
 	if ((sc->hn_link_flags & HN_LINK_FLAG_LINKUP) == 0) {
 		ifmr->ifm_active |= IFM_NONE;
 		return;
 	}
 	ifmr->ifm_status |= IFM_ACTIVE;
 	ifmr->ifm_active |= IFM_10G_T | IFM_FDX;
 }
 
 static void
 hn_update_vf_task(void *arg, int pending __unused)
 {
 	struct hn_update_vf *uv = arg;
 
 	uv->rxr->hn_vf = uv->vf;
 }
 
 static void
 hn_update_vf(struct hn_softc *sc, struct ifnet *vf)
 {
 	struct hn_rx_ring *rxr;
 	struct hn_update_vf uv;
 	struct task task;
 	int i;
 
 	HN_LOCK_ASSERT(sc);
 
 	TASK_INIT(&task, 0, hn_update_vf_task, &uv);
 
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		rxr = &sc->hn_rx_ring[i];
 
 		if (i < sc->hn_rx_ring_inuse) {
 			uv.rxr = rxr;
 			uv.vf = vf;
 			vmbus_chan_run_task(rxr->hn_chan, &task);
 		} else {
 			rxr->hn_vf = vf;
 		}
 	}
 }
 
 static void
 hn_set_vf(struct hn_softc *sc, struct ifnet *ifp, bool vf)
 {
 	struct ifnet *hn_ifp;
 
 	HN_LOCK(sc);
 
 	if (!(sc->hn_flags & HN_FLAG_SYNTH_ATTACHED))
 		goto out;
 
 	hn_ifp = sc->hn_ifp;
 
 	if (ifp == hn_ifp)
 		goto out;
 
 	if (ifp->if_alloctype != IFT_ETHER)
 		goto out;
 
 	/* Ignore lagg/vlan interfaces */
 	if (strcmp(ifp->if_dname, "lagg") == 0 ||
 	    strcmp(ifp->if_dname, "vlan") == 0)
 		goto out;
 
 	if (bcmp(IF_LLADDR(ifp), IF_LLADDR(hn_ifp), ETHER_ADDR_LEN) != 0)
 		goto out;
 
 	/* Now we're sure 'ifp' is a real VF device. */
 	if (vf) {
 		if (sc->hn_flags & HN_FLAG_VF)
 			goto out;
 
 		sc->hn_flags |= HN_FLAG_VF;
 		hn_rxfilter_config(sc);
 	} else {
 		if (!(sc->hn_flags & HN_FLAG_VF))
 			goto out;
 
 		sc->hn_flags &= ~HN_FLAG_VF;
 		if (sc->hn_ifp->if_drv_flags & IFF_DRV_RUNNING)
 			hn_rxfilter_config(sc);
 		else
 			hn_set_rxfilter(sc, NDIS_PACKET_TYPE_NONE);
 	}
 
 	hn_nvs_set_datapath(sc,
 	    vf ? HN_NVS_DATAPATH_VF : HN_NVS_DATAPATH_SYNTHETIC);
 
 	hn_update_vf(sc, vf ? ifp : NULL);
 
 	if (vf) {
 		hn_suspend_mgmt(sc);
 		sc->hn_link_flags &=
 		    ~(HN_LINK_FLAG_LINKUP | HN_LINK_FLAG_NETCHG);
 		if_link_state_change(sc->hn_ifp, LINK_STATE_DOWN);
 	} else {
 		hn_resume_mgmt(sc);
 	}
 
 	devctl_notify("HYPERV_NIC_VF", if_name(hn_ifp),
 	    vf ? "VF_UP" : "VF_DOWN", NULL);
 
 	if (bootverbose)
 		if_printf(hn_ifp, "Data path is switched %s %s\n",
 		    vf ? "to" : "from", if_name(ifp));
 out:
 	HN_UNLOCK(sc);
 }
 
 static void
 hn_ifnet_event(void *arg, struct ifnet *ifp, int event)
 {
 	if (event != IFNET_EVENT_UP && event != IFNET_EVENT_DOWN)
 		return;
 
 	hn_set_vf(arg, ifp, event == IFNET_EVENT_UP);
 }
 
 static void
 hn_ifaddr_event(void *arg, struct ifnet *ifp)
 {
 	hn_set_vf(arg, ifp, ifp->if_flags & IFF_UP);
 }
 
 /* {F8615163-DF3E-46c5-913F-F2D2F965ED0E} */
 static const struct hyperv_guid g_net_vsc_device_type = {
 	.hv_guid = {0x63, 0x51, 0x61, 0xF8, 0x3E, 0xDF, 0xc5, 0x46,
 		0x91, 0x3F, 0xF2, 0xD2, 0xF9, 0x65, 0xED, 0x0E}
 };
 
 static int
 hn_probe(device_t dev)
 {
 
 	if (VMBUS_PROBE_GUID(device_get_parent(dev), dev,
 	    &g_net_vsc_device_type) == 0) {
 		device_set_desc(dev, "Hyper-V Network Interface");
 		return BUS_PROBE_DEFAULT;
 	}
 	return ENXIO;
 }
 
 static int
 hn_attach(device_t dev)
 {
 	struct hn_softc *sc = device_get_softc(dev);
 	struct sysctl_oid_list *child;
 	struct sysctl_ctx_list *ctx;
 	uint8_t eaddr[ETHER_ADDR_LEN];
 	struct ifnet *ifp = NULL;
 	int error, ring_cnt, tx_ring_cnt;
 
 	sc->hn_dev = dev;
 	sc->hn_prichan = vmbus_get_channel(dev);
 	HN_LOCK_INIT(sc);
 
 	/*
 	 * Initialize these tunables once.
 	 */
 	sc->hn_agg_size = hn_tx_agg_size;
 	sc->hn_agg_pkts = hn_tx_agg_pkts;
 
 	/*
 	 * Setup taskqueue for transmission.
 	 */
 	if (hn_tx_taskq_mode == HN_TX_TASKQ_M_INDEP) {
 		int i;
 
 		sc->hn_tx_taskqs =
 		    malloc(hn_tx_taskq_cnt * sizeof(struct taskqueue *),
 		    M_DEVBUF, M_WAITOK);
 		for (i = 0; i < hn_tx_taskq_cnt; ++i) {
 			sc->hn_tx_taskqs[i] = taskqueue_create("hn_tx",
 			    M_WAITOK, taskqueue_thread_enqueue,
 			    &sc->hn_tx_taskqs[i]);
 			taskqueue_start_threads(&sc->hn_tx_taskqs[i], 1, PI_NET,
 			    "%s tx%d", device_get_nameunit(dev), i);
 		}
 	} else if (hn_tx_taskq_mode == HN_TX_TASKQ_M_GLOBAL) {
 		sc->hn_tx_taskqs = hn_tx_taskque;
 	}
 
 	/*
 	 * Setup taskqueue for mangement tasks, e.g. link status.
 	 */
 	sc->hn_mgmt_taskq0 = taskqueue_create("hn_mgmt", M_WAITOK,
 	    taskqueue_thread_enqueue, &sc->hn_mgmt_taskq0);
 	taskqueue_start_threads(&sc->hn_mgmt_taskq0, 1, PI_NET, "%s mgmt",
 	    device_get_nameunit(dev));
 	TASK_INIT(&sc->hn_link_task, 0, hn_link_taskfunc, sc);
 	TASK_INIT(&sc->hn_netchg_init, 0, hn_netchg_init_taskfunc, sc);
 	TIMEOUT_TASK_INIT(sc->hn_mgmt_taskq0, &sc->hn_netchg_status, 0,
 	    hn_netchg_status_taskfunc, sc);
 
 	/*
 	 * Allocate ifnet and setup its name earlier, so that if_printf
 	 * can be used by functions, which will be called after
 	 * ether_ifattach().
 	 */
 	ifp = sc->hn_ifp = if_alloc(IFT_ETHER);
 	ifp->if_softc = sc;
 	if_initname(ifp, device_get_name(dev), device_get_unit(dev));
 
 	/*
 	 * Initialize ifmedia earlier so that it can be unconditionally
 	 * destroyed, if error happened later on.
 	 */
 	ifmedia_init(&sc->hn_media, 0, hn_ifmedia_upd, hn_ifmedia_sts);
 
 	/*
 	 * Figure out the # of RX rings (ring_cnt) and the # of TX rings
 	 * to use (tx_ring_cnt).
 	 *
 	 * NOTE:
 	 * The # of RX rings to use is same as the # of channels to use.
 	 */
 	ring_cnt = hn_chan_cnt;
 	if (ring_cnt <= 0) {
 		/* Default */
 		ring_cnt = mp_ncpus;
 		if (ring_cnt > HN_RING_CNT_DEF_MAX)
 			ring_cnt = HN_RING_CNT_DEF_MAX;
 	} else if (ring_cnt > mp_ncpus) {
 		ring_cnt = mp_ncpus;
 	}
 #ifdef RSS
 	if (ring_cnt > rss_getnumbuckets())
 		ring_cnt = rss_getnumbuckets();
 #endif
 
 	tx_ring_cnt = hn_tx_ring_cnt;
 	if (tx_ring_cnt <= 0 || tx_ring_cnt > ring_cnt)
 		tx_ring_cnt = ring_cnt;
 #ifdef HN_IFSTART_SUPPORT
 	if (hn_use_if_start) {
 		/* ifnet.if_start only needs one TX ring. */
 		tx_ring_cnt = 1;
 	}
 #endif
 
 	/*
 	 * Set the leader CPU for channels.
 	 */
 	sc->hn_cpu = atomic_fetchadd_int(&hn_cpu_index, ring_cnt) % mp_ncpus;
 
 	/*
 	 * Create enough TX/RX rings, even if only limited number of
 	 * channels can be allocated.
 	 */
 	error = hn_create_tx_data(sc, tx_ring_cnt);
 	if (error)
 		goto failed;
 	error = hn_create_rx_data(sc, ring_cnt);
 	if (error)
 		goto failed;
 
 	/*
 	 * Create transaction context for NVS and RNDIS transactions.
 	 */
 	sc->hn_xact = vmbus_xact_ctx_create(bus_get_dma_tag(dev),
 	    HN_XACT_REQ_SIZE, HN_XACT_RESP_SIZE, 0);
 	if (sc->hn_xact == NULL) {
 		error = ENXIO;
 		goto failed;
 	}
 
 	/*
 	 * Install orphan handler for the revocation of this device's
 	 * primary channel.
 	 *
 	 * NOTE:
 	 * The processing order is critical here:
 	 * Install the orphan handler, _before_ testing whether this
 	 * device's primary channel has been revoked or not.
 	 */
 	vmbus_chan_set_orphan(sc->hn_prichan, sc->hn_xact);
 	if (vmbus_chan_is_revoked(sc->hn_prichan)) {
 		error = ENXIO;
 		goto failed;
 	}
 
 	/*
 	 * Attach the synthetic parts, i.e. NVS and RNDIS.
 	 */
 	error = hn_synth_attach(sc, ETHERMTU);
 	if (error)
 		goto failed;
 
 	error = hn_rndis_get_eaddr(sc, eaddr);
 	if (error)
 		goto failed;
 
 #if __FreeBSD_version >= 1100099
 	if (sc->hn_rx_ring_inuse > 1) {
 		/*
 		 * Reduce TCP segment aggregation limit for multiple
 		 * RX rings to increase ACK timeliness.
 		 */
 		hn_set_lro_lenlim(sc, HN_LRO_LENLIM_MULTIRX_DEF);
 	}
 #endif
 
 	/*
 	 * Fixup TX stuffs after synthetic parts are attached.
 	 */
 	hn_fixup_tx_data(sc);
 
 	ctx = device_get_sysctl_ctx(dev);
 	child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev));
 	SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "nvs_version", CTLFLAG_RD,
 	    &sc->hn_nvs_ver, 0, "NVS version");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "ndis_version",
 	    CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0,
 	    hn_ndis_version_sysctl, "A", "NDIS version");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "caps",
 	    CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0,
 	    hn_caps_sysctl, "A", "capabilities");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "hwassist",
 	    CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0,
 	    hn_hwassist_sysctl, "A", "hwassist");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "rxfilter",
 	    CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0,
 	    hn_rxfilter_sysctl, "A", "rxfilter");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "rss_hash",
 	    CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0,
 	    hn_rss_hash_sysctl, "A", "RSS hash");
 	SYSCTL_ADD_INT(ctx, child, OID_AUTO, "rss_ind_size",
 	    CTLFLAG_RD, &sc->hn_rss_ind_size, 0, "RSS indirect entry count");
 #ifndef RSS
 	/*
 	 * Don't allow RSS key/indirect table changes, if RSS is defined.
 	 */
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "rss_key",
 	    CTLTYPE_OPAQUE | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0,
 	    hn_rss_key_sysctl, "IU", "RSS key");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "rss_ind",
 	    CTLTYPE_OPAQUE | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0,
 	    hn_rss_ind_sysctl, "IU", "RSS indirect table");
 #endif
 	SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rndis_agg_size",
 	    CTLFLAG_RD, &sc->hn_rndis_agg_size, 0,
 	    "RNDIS offered packet transmission aggregation size limit");
 	SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rndis_agg_pkts",
 	    CTLFLAG_RD, &sc->hn_rndis_agg_pkts, 0,
 	    "RNDIS offered packet transmission aggregation count limit");
 	SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rndis_agg_align",
 	    CTLFLAG_RD, &sc->hn_rndis_agg_align, 0,
 	    "RNDIS packet transmission aggregation alignment");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "agg_size",
 	    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0,
 	    hn_txagg_size_sysctl, "I",
 	    "Packet transmission aggregation size, 0 -- disable, -1 -- auto");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "agg_pkts",
 	    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0,
 	    hn_txagg_pkts_sysctl, "I",
 	    "Packet transmission aggregation packets, "
 	    "0 -- disable, -1 -- auto");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "polling",
 	    CTLTYPE_UINT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0,
 	    hn_polling_sysctl, "I",
 	    "Polling frequency: [100,1000000], 0 disable polling");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "vf",
 	    CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0,
 	    hn_vf_sysctl, "A", "Virtual Function's name");
 
 	/*
 	 * Setup the ifmedia, which has been initialized earlier.
 	 */
 	ifmedia_add(&sc->hn_media, IFM_ETHER | IFM_AUTO, 0, NULL);
 	ifmedia_set(&sc->hn_media, IFM_ETHER | IFM_AUTO);
 	/* XXX ifmedia_set really should do this for us */
 	sc->hn_media.ifm_media = sc->hn_media.ifm_cur->ifm_media;
 
 	/*
 	 * Setup the ifnet for this interface.
 	 */
 
 	ifp->if_baudrate = IF_Gbps(10);
 	ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
 	ifp->if_ioctl = hn_ioctl;
 	ifp->if_init = hn_init;
 #ifdef HN_IFSTART_SUPPORT
 	if (hn_use_if_start) {
 		int qdepth = hn_get_txswq_depth(&sc->hn_tx_ring[0]);
 
 		ifp->if_start = hn_start;
 		IFQ_SET_MAXLEN(&ifp->if_snd, qdepth);
 		ifp->if_snd.ifq_drv_maxlen = qdepth - 1;
 		IFQ_SET_READY(&ifp->if_snd);
 	} else
 #endif
 	{
 		ifp->if_transmit = hn_transmit;
 		ifp->if_qflush = hn_xmit_qflush;
 	}
 
 	ifp->if_capabilities |= IFCAP_RXCSUM | IFCAP_LRO;
 #ifdef foo
 	/* We can't diff IPv6 packets from IPv4 packets on RX path. */
 	ifp->if_capabilities |= IFCAP_RXCSUM_IPV6;
 #endif
 	if (sc->hn_caps & HN_CAP_VLAN) {
 		/* XXX not sure about VLAN_MTU. */
 		ifp->if_capabilities |= IFCAP_VLAN_HWTAGGING | IFCAP_VLAN_MTU;
 	}
 
 	ifp->if_hwassist = sc->hn_tx_ring[0].hn_csum_assist;
 	if (ifp->if_hwassist & HN_CSUM_IP_MASK)
 		ifp->if_capabilities |= IFCAP_TXCSUM;
 	if (ifp->if_hwassist & HN_CSUM_IP6_MASK)
 		ifp->if_capabilities |= IFCAP_TXCSUM_IPV6;
 	if (sc->hn_caps & HN_CAP_TSO4) {
 		ifp->if_capabilities |= IFCAP_TSO4;
 		ifp->if_hwassist |= CSUM_IP_TSO;
 	}
 	if (sc->hn_caps & HN_CAP_TSO6) {
 		ifp->if_capabilities |= IFCAP_TSO6;
 		ifp->if_hwassist |= CSUM_IP6_TSO;
 	}
 
 	/* Enable all available capabilities by default. */
 	ifp->if_capenable = ifp->if_capabilities;
 
 	/*
 	 * Disable IPv6 TSO and TXCSUM by default, they still can
 	 * be enabled through SIOCSIFCAP.
 	 */
 	ifp->if_capenable &= ~(IFCAP_TXCSUM_IPV6 | IFCAP_TSO6);
 	ifp->if_hwassist &= ~(HN_CSUM_IP6_MASK | CSUM_IP6_TSO);
 
 	if (ifp->if_capabilities & (IFCAP_TSO6 | IFCAP_TSO4)) {
 		hn_set_tso_maxsize(sc, hn_tso_maxlen, ETHERMTU);
 		ifp->if_hw_tsomaxsegcount = HN_TX_DATA_SEGCNT_MAX;
 		ifp->if_hw_tsomaxsegsize = PAGE_SIZE;
 	}
 
 	ether_ifattach(ifp, eaddr);
 
 	if ((ifp->if_capabilities & (IFCAP_TSO6 | IFCAP_TSO4)) && bootverbose) {
 		if_printf(ifp, "TSO segcnt %u segsz %u\n",
 		    ifp->if_hw_tsomaxsegcount, ifp->if_hw_tsomaxsegsize);
 	}
 
 	/* Inform the upper layer about the long frame support. */
 	ifp->if_hdrlen = sizeof(struct ether_vlan_header);
 
 	/*
 	 * Kick off link status check.
 	 */
 	sc->hn_mgmt_taskq = sc->hn_mgmt_taskq0;
 	hn_update_link_status(sc);
 
 	sc->hn_ifnet_evthand = EVENTHANDLER_REGISTER(ifnet_event,
 	    hn_ifnet_event, sc, EVENTHANDLER_PRI_ANY);
 
 	sc->hn_ifaddr_evthand = EVENTHANDLER_REGISTER(ifaddr_event,
 	    hn_ifaddr_event, sc, EVENTHANDLER_PRI_ANY);
 
 	return (0);
 failed:
 	if (sc->hn_flags & HN_FLAG_SYNTH_ATTACHED)
 		hn_synth_detach(sc);
 	hn_detach(dev);
 	return (error);
 }
 
 static int
 hn_detach(device_t dev)
 {
 	struct hn_softc *sc = device_get_softc(dev);
 	struct ifnet *ifp = sc->hn_ifp;
 
 	if (sc->hn_ifaddr_evthand != NULL)
 		EVENTHANDLER_DEREGISTER(ifaddr_event, sc->hn_ifaddr_evthand);
 	if (sc->hn_ifnet_evthand != NULL)
 		EVENTHANDLER_DEREGISTER(ifnet_event, sc->hn_ifnet_evthand);
 
 	if (sc->hn_xact != NULL && vmbus_chan_is_revoked(sc->hn_prichan)) {
 		/*
 		 * In case that the vmbus missed the orphan handler
 		 * installation.
 		 */
 		vmbus_xact_ctx_orphan(sc->hn_xact);
 	}
 
 	if (device_is_attached(dev)) {
 		HN_LOCK(sc);
 		if (sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) {
 			if (ifp->if_drv_flags & IFF_DRV_RUNNING)
 				hn_stop(sc, true);
 			/*
 			 * NOTE:
 			 * hn_stop() only suspends data, so managment
 			 * stuffs have to be suspended manually here.
 			 */
 			hn_suspend_mgmt(sc);
 			hn_synth_detach(sc);
 		}
 		HN_UNLOCK(sc);
 		ether_ifdetach(ifp);
 	}
 
 	ifmedia_removeall(&sc->hn_media);
 	hn_destroy_rx_data(sc);
 	hn_destroy_tx_data(sc);
 
 	if (sc->hn_tx_taskqs != NULL && sc->hn_tx_taskqs != hn_tx_taskque) {
 		int i;
 
 		for (i = 0; i < hn_tx_taskq_cnt; ++i)
 			taskqueue_free(sc->hn_tx_taskqs[i]);
 		free(sc->hn_tx_taskqs, M_DEVBUF);
 	}
 	taskqueue_free(sc->hn_mgmt_taskq0);
 
 	if (sc->hn_xact != NULL) {
 		/*
 		 * Uninstall the orphan handler _before_ the xact is
 		 * destructed.
 		 */
 		vmbus_chan_unset_orphan(sc->hn_prichan);
 		vmbus_xact_ctx_destroy(sc->hn_xact);
 	}
 
 	if_free(ifp);
 
 	HN_LOCK_DESTROY(sc);
 	return (0);
 }
 
 static int
 hn_shutdown(device_t dev)
 {
 
 	return (0);
 }
 
 static void
 hn_link_status(struct hn_softc *sc)
 {
 	uint32_t link_status;
 	int error;
 
 	error = hn_rndis_get_linkstatus(sc, &link_status);
 	if (error) {
 		/* XXX what to do? */
 		return;
 	}
 
 	if (link_status == NDIS_MEDIA_STATE_CONNECTED)
 		sc->hn_link_flags |= HN_LINK_FLAG_LINKUP;
 	else
 		sc->hn_link_flags &= ~HN_LINK_FLAG_LINKUP;
 	if_link_state_change(sc->hn_ifp,
 	    (sc->hn_link_flags & HN_LINK_FLAG_LINKUP) ?
 	    LINK_STATE_UP : LINK_STATE_DOWN);
 }
 
 static void
 hn_link_taskfunc(void *xsc, int pending __unused)
 {
 	struct hn_softc *sc = xsc;
 
 	if (sc->hn_link_flags & HN_LINK_FLAG_NETCHG)
 		return;
 	hn_link_status(sc);
 }
 
 static void
 hn_netchg_init_taskfunc(void *xsc, int pending __unused)
 {
 	struct hn_softc *sc = xsc;
 
 	/* Prevent any link status checks from running. */
 	sc->hn_link_flags |= HN_LINK_FLAG_NETCHG;
 
 	/*
 	 * Fake up a [link down --> link up] state change; 5 seconds
 	 * delay is used, which closely simulates miibus reaction
 	 * upon link down event.
 	 */
 	sc->hn_link_flags &= ~HN_LINK_FLAG_LINKUP;
 	if_link_state_change(sc->hn_ifp, LINK_STATE_DOWN);
 	taskqueue_enqueue_timeout(sc->hn_mgmt_taskq0,
 	    &sc->hn_netchg_status, 5 * hz);
 }
 
 static void
 hn_netchg_status_taskfunc(void *xsc, int pending __unused)
 {
 	struct hn_softc *sc = xsc;
 
 	/* Re-allow link status checks. */
 	sc->hn_link_flags &= ~HN_LINK_FLAG_NETCHG;
 	hn_link_status(sc);
 }
 
 static void
 hn_update_link_status(struct hn_softc *sc)
 {
 
 	if (sc->hn_mgmt_taskq != NULL)
 		taskqueue_enqueue(sc->hn_mgmt_taskq, &sc->hn_link_task);
 }
 
 static void
 hn_change_network(struct hn_softc *sc)
 {
 
 	if (sc->hn_mgmt_taskq != NULL)
 		taskqueue_enqueue(sc->hn_mgmt_taskq, &sc->hn_netchg_init);
 }
 
 static __inline int
 hn_txdesc_dmamap_load(struct hn_tx_ring *txr, struct hn_txdesc *txd,
     struct mbuf **m_head, bus_dma_segment_t *segs, int *nsegs)
 {
 	struct mbuf *m = *m_head;
 	int error;
 
 	KASSERT(txd->chim_index == HN_NVS_CHIM_IDX_INVALID, ("txd uses chim"));
 
 	error = bus_dmamap_load_mbuf_sg(txr->hn_tx_data_dtag, txd->data_dmap,
 	    m, segs, nsegs, BUS_DMA_NOWAIT);
 	if (error == EFBIG) {
 		struct mbuf *m_new;
 
 		m_new = m_collapse(m, M_NOWAIT, HN_TX_DATA_SEGCNT_MAX);
 		if (m_new == NULL)
 			return ENOBUFS;
 		else
 			*m_head = m = m_new;
 		txr->hn_tx_collapsed++;
 
 		error = bus_dmamap_load_mbuf_sg(txr->hn_tx_data_dtag,
 		    txd->data_dmap, m, segs, nsegs, BUS_DMA_NOWAIT);
 	}
 	if (!error) {
 		bus_dmamap_sync(txr->hn_tx_data_dtag, txd->data_dmap,
 		    BUS_DMASYNC_PREWRITE);
 		txd->flags |= HN_TXD_FLAG_DMAMAP;
 	}
 	return error;
 }
 
 static __inline int
 hn_txdesc_put(struct hn_tx_ring *txr, struct hn_txdesc *txd)
 {
 
 	KASSERT((txd->flags & HN_TXD_FLAG_ONLIST) == 0,
 	    ("put an onlist txd %#x", txd->flags));
 	KASSERT((txd->flags & HN_TXD_FLAG_ONAGG) == 0,
 	    ("put an onagg txd %#x", txd->flags));
 
 	KASSERT(txd->refs > 0, ("invalid txd refs %d", txd->refs));
 	if (atomic_fetchadd_int(&txd->refs, -1) != 1)
 		return 0;
 
 	if (!STAILQ_EMPTY(&txd->agg_list)) {
 		struct hn_txdesc *tmp_txd;
 
 		while ((tmp_txd = STAILQ_FIRST(&txd->agg_list)) != NULL) {
 			int freed;
 
 			KASSERT(STAILQ_EMPTY(&tmp_txd->agg_list),
 			    ("resursive aggregation on aggregated txdesc"));
 			KASSERT((tmp_txd->flags & HN_TXD_FLAG_ONAGG),
 			    ("not aggregated txdesc"));
 			KASSERT((tmp_txd->flags & HN_TXD_FLAG_DMAMAP) == 0,
 			    ("aggregated txdesc uses dmamap"));
 			KASSERT(tmp_txd->chim_index == HN_NVS_CHIM_IDX_INVALID,
 			    ("aggregated txdesc consumes "
 			     "chimney sending buffer"));
 			KASSERT(tmp_txd->chim_size == 0,
 			    ("aggregated txdesc has non-zero "
 			     "chimney sending size"));
 
 			STAILQ_REMOVE_HEAD(&txd->agg_list, agg_link);
 			tmp_txd->flags &= ~HN_TXD_FLAG_ONAGG;
 			freed = hn_txdesc_put(txr, tmp_txd);
 			KASSERT(freed, ("failed to free aggregated txdesc"));
 		}
 	}
 
 	if (txd->chim_index != HN_NVS_CHIM_IDX_INVALID) {
 		KASSERT((txd->flags & HN_TXD_FLAG_DMAMAP) == 0,
 		    ("chim txd uses dmamap"));
 		hn_chim_free(txr->hn_sc, txd->chim_index);
 		txd->chim_index = HN_NVS_CHIM_IDX_INVALID;
 		txd->chim_size = 0;
 	} else if (txd->flags & HN_TXD_FLAG_DMAMAP) {
 		bus_dmamap_sync(txr->hn_tx_data_dtag,
 		    txd->data_dmap, BUS_DMASYNC_POSTWRITE);
 		bus_dmamap_unload(txr->hn_tx_data_dtag,
 		    txd->data_dmap);
 		txd->flags &= ~HN_TXD_FLAG_DMAMAP;
 	}
 
 	if (txd->m != NULL) {
 		m_freem(txd->m);
 		txd->m = NULL;
 	}
 
 	txd->flags |= HN_TXD_FLAG_ONLIST;
 #ifndef HN_USE_TXDESC_BUFRING
 	mtx_lock_spin(&txr->hn_txlist_spin);
 	KASSERT(txr->hn_txdesc_avail >= 0 &&
 	    txr->hn_txdesc_avail < txr->hn_txdesc_cnt,
 	    ("txdesc_put: invalid txd avail %d", txr->hn_txdesc_avail));
 	txr->hn_txdesc_avail++;
 	SLIST_INSERT_HEAD(&txr->hn_txlist, txd, link);
 	mtx_unlock_spin(&txr->hn_txlist_spin);
 #else	/* HN_USE_TXDESC_BUFRING */
 #ifdef HN_DEBUG
 	atomic_add_int(&txr->hn_txdesc_avail, 1);
 #endif
 	buf_ring_enqueue(txr->hn_txdesc_br, txd);
 #endif	/* !HN_USE_TXDESC_BUFRING */
 
 	return 1;
 }
 
 static __inline struct hn_txdesc *
 hn_txdesc_get(struct hn_tx_ring *txr)
 {
 	struct hn_txdesc *txd;
 
 #ifndef HN_USE_TXDESC_BUFRING
 	mtx_lock_spin(&txr->hn_txlist_spin);
 	txd = SLIST_FIRST(&txr->hn_txlist);
 	if (txd != NULL) {
 		KASSERT(txr->hn_txdesc_avail > 0,
 		    ("txdesc_get: invalid txd avail %d", txr->hn_txdesc_avail));
 		txr->hn_txdesc_avail--;
 		SLIST_REMOVE_HEAD(&txr->hn_txlist, link);
 	}
 	mtx_unlock_spin(&txr->hn_txlist_spin);
 #else
 	txd = buf_ring_dequeue_sc(txr->hn_txdesc_br);
 #endif
 
 	if (txd != NULL) {
 #ifdef HN_USE_TXDESC_BUFRING
 #ifdef HN_DEBUG
 		atomic_subtract_int(&txr->hn_txdesc_avail, 1);
 #endif
 #endif	/* HN_USE_TXDESC_BUFRING */
 		KASSERT(txd->m == NULL && txd->refs == 0 &&
 		    STAILQ_EMPTY(&txd->agg_list) &&
 		    txd->chim_index == HN_NVS_CHIM_IDX_INVALID &&
 		    txd->chim_size == 0 &&
 		    (txd->flags & HN_TXD_FLAG_ONLIST) &&
 		    (txd->flags & HN_TXD_FLAG_ONAGG) == 0 &&
 		    (txd->flags & HN_TXD_FLAG_DMAMAP) == 0, ("invalid txd"));
 		txd->flags &= ~HN_TXD_FLAG_ONLIST;
 		txd->refs = 1;
 	}
 	return txd;
 }
 
 static __inline void
 hn_txdesc_hold(struct hn_txdesc *txd)
 {
 
 	/* 0->1 transition will never work */
 	KASSERT(txd->refs > 0, ("invalid txd refs %d", txd->refs));
 	atomic_add_int(&txd->refs, 1);
 }
 
 static __inline void
 hn_txdesc_agg(struct hn_txdesc *agg_txd, struct hn_txdesc *txd)
 {
 
 	KASSERT((agg_txd->flags & HN_TXD_FLAG_ONAGG) == 0,
 	    ("recursive aggregation on aggregating txdesc"));
 
 	KASSERT((txd->flags & HN_TXD_FLAG_ONAGG) == 0,
 	    ("already aggregated"));
 	KASSERT(STAILQ_EMPTY(&txd->agg_list),
 	    ("recursive aggregation on to-be-aggregated txdesc"));
 
 	txd->flags |= HN_TXD_FLAG_ONAGG;
 	STAILQ_INSERT_TAIL(&agg_txd->agg_list, txd, agg_link);
 }
 
 static bool
 hn_tx_ring_pending(struct hn_tx_ring *txr)
 {
 	bool pending = false;
 
 #ifndef HN_USE_TXDESC_BUFRING
 	mtx_lock_spin(&txr->hn_txlist_spin);
 	if (txr->hn_txdesc_avail != txr->hn_txdesc_cnt)
 		pending = true;
 	mtx_unlock_spin(&txr->hn_txlist_spin);
 #else
 	if (!buf_ring_full(txr->hn_txdesc_br))
 		pending = true;
 #endif
 	return (pending);
 }
 
 static __inline void
 hn_txeof(struct hn_tx_ring *txr)
 {
 	txr->hn_has_txeof = 0;
 	txr->hn_txeof(txr);
 }
 
 static void
 hn_txpkt_done(struct hn_nvs_sendctx *sndc, struct hn_softc *sc,
     struct vmbus_channel *chan, const void *data __unused, int dlen __unused)
 {
 	struct hn_txdesc *txd = sndc->hn_cbarg;
 	struct hn_tx_ring *txr;
 
 	txr = txd->txr;
 	KASSERT(txr->hn_chan == chan,
 	    ("channel mismatch, on chan%u, should be chan%u",
 	     vmbus_chan_id(chan), vmbus_chan_id(txr->hn_chan)));
 
 	txr->hn_has_txeof = 1;
 	hn_txdesc_put(txr, txd);
 
 	++txr->hn_txdone_cnt;
 	if (txr->hn_txdone_cnt >= HN_EARLY_TXEOF_THRESH) {
 		txr->hn_txdone_cnt = 0;
 		if (txr->hn_oactive)
 			hn_txeof(txr);
 	}
 }
 
 static void
 hn_chan_rollup(struct hn_rx_ring *rxr, struct hn_tx_ring *txr)
 {
 #if defined(INET) || defined(INET6)
 	tcp_lro_flush_all(&rxr->hn_lro);
 #endif
 
 	/*
 	 * NOTE:
 	 * 'txr' could be NULL, if multiple channels and
 	 * ifnet.if_start method are enabled.
 	 */
 	if (txr == NULL || !txr->hn_has_txeof)
 		return;
 
 	txr->hn_txdone_cnt = 0;
 	hn_txeof(txr);
 }
 
 static __inline uint32_t
 hn_rndis_pktmsg_offset(uint32_t ofs)
 {
 
 	KASSERT(ofs >= sizeof(struct rndis_packet_msg),
 	    ("invalid RNDIS packet msg offset %u", ofs));
 	return (ofs - __offsetof(struct rndis_packet_msg, rm_dataoffset));
 }
 
 static __inline void *
 hn_rndis_pktinfo_append(struct rndis_packet_msg *pkt, size_t pktsize,
     size_t pi_dlen, uint32_t pi_type)
 {
 	const size_t pi_size = HN_RNDIS_PKTINFO_SIZE(pi_dlen);
 	struct rndis_pktinfo *pi;
 
 	KASSERT((pi_size & RNDIS_PACKET_MSG_OFFSET_ALIGNMASK) == 0,
 	    ("unaligned pktinfo size %zu, pktinfo dlen %zu", pi_size, pi_dlen));
 
 	/*
 	 * Per-packet-info does not move; it only grows.
 	 *
 	 * NOTE:
 	 * rm_pktinfooffset in this phase counts from the beginning
 	 * of rndis_packet_msg.
 	 */
 	KASSERT(pkt->rm_pktinfooffset + pkt->rm_pktinfolen + pi_size <= pktsize,
 	    ("%u pktinfo overflows RNDIS packet msg", pi_type));
 	pi = (struct rndis_pktinfo *)((uint8_t *)pkt + pkt->rm_pktinfooffset +
 	    pkt->rm_pktinfolen);
 	pkt->rm_pktinfolen += pi_size;
 
 	pi->rm_size = pi_size;
 	pi->rm_type = pi_type;
 	pi->rm_pktinfooffset = RNDIS_PKTINFO_OFFSET;
 
-	/* Update RNDIS packet msg length */
-	pkt->rm_len += pi_size;
-
 	return (pi->rm_data);
 }
 
 static __inline int
 hn_flush_txagg(struct ifnet *ifp, struct hn_tx_ring *txr)
 {
 	struct hn_txdesc *txd;
 	struct mbuf *m;
 	int error, pkts;
 
 	txd = txr->hn_agg_txd;
 	KASSERT(txd != NULL, ("no aggregate txdesc"));
 
 	/*
 	 * Since hn_txpkt() will reset this temporary stat, save
 	 * it now, so that oerrors can be updated properly, if
 	 * hn_txpkt() ever fails.
 	 */
 	pkts = txr->hn_stat_pkts;
 
 	/*
 	 * Since txd's mbuf will _not_ be freed upon hn_txpkt()
 	 * failure, save it for later freeing, if hn_txpkt() ever
 	 * fails.
 	 */
 	m = txd->m;
 	error = hn_txpkt(ifp, txr, txd);
 	if (__predict_false(error)) {
 		/* txd is freed, but m is not. */
 		m_freem(m);
 
 		txr->hn_flush_failed++;
 		if_inc_counter(ifp, IFCOUNTER_OERRORS, pkts);
 	}
 
 	/* Reset all aggregation states. */
 	txr->hn_agg_txd = NULL;
 	txr->hn_agg_szleft = 0;
 	txr->hn_agg_pktleft = 0;
 	txr->hn_agg_prevpkt = NULL;
 
 	return (error);
 }
 
 static void *
 hn_try_txagg(struct ifnet *ifp, struct hn_tx_ring *txr, struct hn_txdesc *txd,
     int pktsize)
 {
 	void *chim;
 
 	if (txr->hn_agg_txd != NULL) {
 		if (txr->hn_agg_pktleft >= 1 && txr->hn_agg_szleft > pktsize) {
 			struct hn_txdesc *agg_txd = txr->hn_agg_txd;
 			struct rndis_packet_msg *pkt = txr->hn_agg_prevpkt;
 			int olen;
 
 			/*
 			 * Update the previous RNDIS packet's total length,
 			 * it can be increased due to the mandatory alignment
 			 * padding for this RNDIS packet.  And update the
 			 * aggregating txdesc's chimney sending buffer size
 			 * accordingly.
 			 *
 			 * XXX
 			 * Zero-out the padding, as required by the RNDIS spec.
 			 */
 			olen = pkt->rm_len;
 			pkt->rm_len = roundup2(olen, txr->hn_agg_align);
 			agg_txd->chim_size += pkt->rm_len - olen;
 
 			/* Link this txdesc to the parent. */
 			hn_txdesc_agg(agg_txd, txd);
 
 			chim = (uint8_t *)pkt + pkt->rm_len;
 			/* Save the current packet for later fixup. */
 			txr->hn_agg_prevpkt = chim;
 
 			txr->hn_agg_pktleft--;
 			txr->hn_agg_szleft -= pktsize;
 			if (txr->hn_agg_szleft <=
 			    HN_PKTSIZE_MIN(txr->hn_agg_align)) {
 				/*
 				 * Probably can't aggregate more packets,
 				 * flush this aggregating txdesc proactively.
 				 */
 				txr->hn_agg_pktleft = 0;
 			}
 			/* Done! */
 			return (chim);
 		}
 		hn_flush_txagg(ifp, txr);
 	}
 	KASSERT(txr->hn_agg_txd == NULL, ("lingering aggregating txdesc"));
 
 	txr->hn_tx_chimney_tried++;
 	txd->chim_index = hn_chim_alloc(txr->hn_sc);
 	if (txd->chim_index == HN_NVS_CHIM_IDX_INVALID)
 		return (NULL);
 	txr->hn_tx_chimney++;
 
 	chim = txr->hn_sc->hn_chim +
 	    (txd->chim_index * txr->hn_sc->hn_chim_szmax);
 
 	if (txr->hn_agg_pktmax > 1 &&
 	    txr->hn_agg_szmax > pktsize + HN_PKTSIZE_MIN(txr->hn_agg_align)) {
 		txr->hn_agg_txd = txd;
 		txr->hn_agg_pktleft = txr->hn_agg_pktmax - 1;
 		txr->hn_agg_szleft = txr->hn_agg_szmax - pktsize;
 		txr->hn_agg_prevpkt = chim;
 	}
 	return (chim);
 }
 
 /*
  * NOTE:
  * If this function fails, then both txd and m_head0 will be freed.
  */
 static int
 hn_encap(struct ifnet *ifp, struct hn_tx_ring *txr, struct hn_txdesc *txd,
     struct mbuf **m_head0)
 {
 	bus_dma_segment_t segs[HN_TX_DATA_SEGCNT_MAX];
 	int error, nsegs, i;
 	struct mbuf *m_head = *m_head0;
 	struct rndis_packet_msg *pkt;
 	uint32_t *pi_data;
 	void *chim = NULL;
 	int pkt_hlen, pkt_size;
 
 	pkt = txd->rndis_pkt;
 	pkt_size = HN_PKTSIZE(m_head, txr->hn_agg_align);
 	if (pkt_size < txr->hn_chim_size) {
 		chim = hn_try_txagg(ifp, txr, txd, pkt_size);
 		if (chim != NULL)
 			pkt = chim;
 	} else {
 		if (txr->hn_agg_txd != NULL)
 			hn_flush_txagg(ifp, txr);
 	}
 
 	pkt->rm_type = REMOTE_NDIS_PACKET_MSG;
-	pkt->rm_len = sizeof(*pkt) + m_head->m_pkthdr.len;
+	pkt->rm_len = m_head->m_pkthdr.len;
 	pkt->rm_dataoffset = 0;
 	pkt->rm_datalen = m_head->m_pkthdr.len;
 	pkt->rm_oobdataoffset = 0;
 	pkt->rm_oobdatalen = 0;
 	pkt->rm_oobdataelements = 0;
 	pkt->rm_pktinfooffset = sizeof(*pkt);
 	pkt->rm_pktinfolen = 0;
 	pkt->rm_vchandle = 0;
 	pkt->rm_reserved = 0;
 
 	if (txr->hn_tx_flags & HN_TX_FLAG_HASHVAL) {
 		/*
 		 * Set the hash value for this packet, so that the host could
 		 * dispatch the TX done event for this packet back to this TX
 		 * ring's channel.
 		 */
 		pi_data = hn_rndis_pktinfo_append(pkt, HN_RNDIS_PKT_LEN,
 		    HN_NDIS_HASH_VALUE_SIZE, HN_NDIS_PKTINFO_TYPE_HASHVAL);
 		*pi_data = txr->hn_tx_idx;
 	}
 
 	if (m_head->m_flags & M_VLANTAG) {
 		pi_data = hn_rndis_pktinfo_append(pkt, HN_RNDIS_PKT_LEN,
 		    NDIS_VLAN_INFO_SIZE, NDIS_PKTINFO_TYPE_VLAN);
 		*pi_data = NDIS_VLAN_INFO_MAKE(
 		    EVL_VLANOFTAG(m_head->m_pkthdr.ether_vtag),
 		    EVL_PRIOFTAG(m_head->m_pkthdr.ether_vtag),
 		    EVL_CFIOFTAG(m_head->m_pkthdr.ether_vtag));
 	}
 
 	if (m_head->m_pkthdr.csum_flags & CSUM_TSO) {
 #if defined(INET6) || defined(INET)
 		pi_data = hn_rndis_pktinfo_append(pkt, HN_RNDIS_PKT_LEN,
 		    NDIS_LSO2_INFO_SIZE, NDIS_PKTINFO_TYPE_LSO);
 #ifdef INET
 		if (m_head->m_pkthdr.csum_flags & CSUM_IP_TSO) {
 			*pi_data = NDIS_LSO2_INFO_MAKEIPV4(0,
 			    m_head->m_pkthdr.tso_segsz);
 		}
 #endif
 #if defined(INET6) && defined(INET)
 		else
 #endif
 #ifdef INET6
 		{
 			*pi_data = NDIS_LSO2_INFO_MAKEIPV6(0,
 			    m_head->m_pkthdr.tso_segsz);
 		}
 #endif
 #endif	/* INET6 || INET */
 	} else if (m_head->m_pkthdr.csum_flags & txr->hn_csum_assist) {
 		pi_data = hn_rndis_pktinfo_append(pkt, HN_RNDIS_PKT_LEN,
 		    NDIS_TXCSUM_INFO_SIZE, NDIS_PKTINFO_TYPE_CSUM);
 		if (m_head->m_pkthdr.csum_flags &
 		    (CSUM_IP6_TCP | CSUM_IP6_UDP)) {
 			*pi_data = NDIS_TXCSUM_INFO_IPV6;
 		} else {
 			*pi_data = NDIS_TXCSUM_INFO_IPV4;
 			if (m_head->m_pkthdr.csum_flags & CSUM_IP)
 				*pi_data |= NDIS_TXCSUM_INFO_IPCS;
 		}
 
 		if (m_head->m_pkthdr.csum_flags & (CSUM_IP_TCP | CSUM_IP6_TCP))
 			*pi_data |= NDIS_TXCSUM_INFO_TCPCS;
 		else if (m_head->m_pkthdr.csum_flags &
 		    (CSUM_IP_UDP | CSUM_IP6_UDP))
 			*pi_data |= NDIS_TXCSUM_INFO_UDPCS;
 	}
 
 	pkt_hlen = pkt->rm_pktinfooffset + pkt->rm_pktinfolen;
+	/* Fixup RNDIS packet message total length */
+	pkt->rm_len += pkt_hlen;
 	/* Convert RNDIS packet message offsets */
 	pkt->rm_dataoffset = hn_rndis_pktmsg_offset(pkt_hlen);
 	pkt->rm_pktinfooffset = hn_rndis_pktmsg_offset(pkt->rm_pktinfooffset);
 
 	/*
 	 * Fast path: Chimney sending.
 	 */
 	if (chim != NULL) {
 		struct hn_txdesc *tgt_txd = txd;
 
 		if (txr->hn_agg_txd != NULL) {
 			tgt_txd = txr->hn_agg_txd;
 #ifdef INVARIANTS
 			*m_head0 = NULL;
 #endif
 		}
 
 		KASSERT(pkt == chim,
 		    ("RNDIS pkt not in chimney sending buffer"));
 		KASSERT(tgt_txd->chim_index != HN_NVS_CHIM_IDX_INVALID,
 		    ("chimney sending buffer is not used"));
 		tgt_txd->chim_size += pkt->rm_len;
 
 		m_copydata(m_head, 0, m_head->m_pkthdr.len,
 		    ((uint8_t *)chim) + pkt_hlen);
 
 		txr->hn_gpa_cnt = 0;
 		txr->hn_sendpkt = hn_txpkt_chim;
 		goto done;
 	}
 
 	KASSERT(txr->hn_agg_txd == NULL, ("aggregating sglist txdesc"));
 	KASSERT(txd->chim_index == HN_NVS_CHIM_IDX_INVALID,
 	    ("chimney buffer is used"));
 	KASSERT(pkt == txd->rndis_pkt, ("RNDIS pkt not in txdesc"));
 
 	error = hn_txdesc_dmamap_load(txr, txd, &m_head, segs, &nsegs);
 	if (__predict_false(error)) {
 		int freed;
 
 		/*
 		 * This mbuf is not linked w/ the txd yet, so free it now.
 		 */
 		m_freem(m_head);
 		*m_head0 = NULL;
 
 		freed = hn_txdesc_put(txr, txd);
 		KASSERT(freed != 0,
 		    ("fail to free txd upon txdma error"));
 
 		txr->hn_txdma_failed++;
 		if_inc_counter(ifp, IFCOUNTER_OERRORS, 1);
 		return error;
 	}
 	*m_head0 = m_head;
 
 	/* +1 RNDIS packet message */
 	txr->hn_gpa_cnt = nsegs + 1;
 
 	/* send packet with page buffer */
 	txr->hn_gpa[0].gpa_page = atop(txd->rndis_pkt_paddr);
 	txr->hn_gpa[0].gpa_ofs = txd->rndis_pkt_paddr & PAGE_MASK;
 	txr->hn_gpa[0].gpa_len = pkt_hlen;
 
 	/*
 	 * Fill the page buffers with mbuf info after the page
 	 * buffer for RNDIS packet message.
 	 */
 	for (i = 0; i < nsegs; ++i) {
 		struct vmbus_gpa *gpa = &txr->hn_gpa[i + 1];
 
 		gpa->gpa_page = atop(segs[i].ds_addr);
 		gpa->gpa_ofs = segs[i].ds_addr & PAGE_MASK;
 		gpa->gpa_len = segs[i].ds_len;
 	}
 
 	txd->chim_index = HN_NVS_CHIM_IDX_INVALID;
 	txd->chim_size = 0;
 	txr->hn_sendpkt = hn_txpkt_sglist;
 done:
 	txd->m = m_head;
 
 	/* Set the completion routine */
 	hn_nvs_sendctx_init(&txd->send_ctx, hn_txpkt_done, txd);
 
 	/* Update temporary stats for later use. */
 	txr->hn_stat_pkts++;
 	txr->hn_stat_size += m_head->m_pkthdr.len;
 	if (m_head->m_flags & M_MCAST)
 		txr->hn_stat_mcasts++;
 
 	return 0;
 }
 
 /*
  * NOTE:
  * If this function fails, then txd will be freed, but the mbuf
  * associated w/ the txd will _not_ be freed.
  */
 static int
 hn_txpkt(struct ifnet *ifp, struct hn_tx_ring *txr, struct hn_txdesc *txd)
 {
 	int error, send_failed = 0, has_bpf;
 
 again:
 	has_bpf = bpf_peers_present(ifp->if_bpf);
 	if (has_bpf) {
 		/*
 		 * Make sure that this txd and any aggregated txds are not
 		 * freed before ETHER_BPF_MTAP.
 		 */
 		hn_txdesc_hold(txd);
 	}
 	error = txr->hn_sendpkt(txr, txd);
 	if (!error) {
 		if (has_bpf) {
 			const struct hn_txdesc *tmp_txd;
 
 			ETHER_BPF_MTAP(ifp, txd->m);
 			STAILQ_FOREACH(tmp_txd, &txd->agg_list, agg_link)
 				ETHER_BPF_MTAP(ifp, tmp_txd->m);
 		}
 
 		if_inc_counter(ifp, IFCOUNTER_OPACKETS, txr->hn_stat_pkts);
 #ifdef HN_IFSTART_SUPPORT
 		if (!hn_use_if_start)
 #endif
 		{
 			if_inc_counter(ifp, IFCOUNTER_OBYTES,
 			    txr->hn_stat_size);
 			if (txr->hn_stat_mcasts != 0) {
 				if_inc_counter(ifp, IFCOUNTER_OMCASTS,
 				    txr->hn_stat_mcasts);
 			}
 		}
 		txr->hn_pkts += txr->hn_stat_pkts;
 		txr->hn_sends++;
 	}
 	if (has_bpf)
 		hn_txdesc_put(txr, txd);
 
 	if (__predict_false(error)) {
 		int freed;
 
 		/*
 		 * This should "really rarely" happen.
 		 *
 		 * XXX Too many RX to be acked or too many sideband
 		 * commands to run?  Ask netvsc_channel_rollup()
 		 * to kick start later.
 		 */
 		txr->hn_has_txeof = 1;
 		if (!send_failed) {
 			txr->hn_send_failed++;
 			send_failed = 1;
 			/*
 			 * Try sending again after set hn_has_txeof;
 			 * in case that we missed the last
 			 * netvsc_channel_rollup().
 			 */
 			goto again;
 		}
 		if_printf(ifp, "send failed\n");
 
 		/*
 		 * Caller will perform further processing on the
 		 * associated mbuf, so don't free it in hn_txdesc_put();
 		 * only unload it from the DMA map in hn_txdesc_put(),
 		 * if it was loaded.
 		 */
 		txd->m = NULL;
 		freed = hn_txdesc_put(txr, txd);
 		KASSERT(freed != 0,
 		    ("fail to free txd upon send error"));
 
 		txr->hn_send_failed++;
 	}
 
 	/* Reset temporary stats, after this sending is done. */
 	txr->hn_stat_size = 0;
 	txr->hn_stat_pkts = 0;
 	txr->hn_stat_mcasts = 0;
 
 	return (error);
 }
 
 /*
  * Append the specified data to the indicated mbuf chain,
  * Extend the mbuf chain if the new data does not fit in
  * existing space.
  *
  * This is a minor rewrite of m_append() from sys/kern/uipc_mbuf.c.
  * There should be an equivalent in the kernel mbuf code,
  * but there does not appear to be one yet.
  *
  * Differs from m_append() in that additional mbufs are
  * allocated with cluster size MJUMPAGESIZE, and filled
  * accordingly.
  *
  * Return 1 if able to complete the job; otherwise 0.
  */
 static int
 hv_m_append(struct mbuf *m0, int len, c_caddr_t cp)
 {
 	struct mbuf *m, *n;
 	int remainder, space;
 
 	for (m = m0; m->m_next != NULL; m = m->m_next)
 		;
 	remainder = len;
 	space = M_TRAILINGSPACE(m);
 	if (space > 0) {
 		/*
 		 * Copy into available space.
 		 */
 		if (space > remainder)
 			space = remainder;
 		bcopy(cp, mtod(m, caddr_t) + m->m_len, space);
 		m->m_len += space;
 		cp += space;
 		remainder -= space;
 	}
 	while (remainder > 0) {
 		/*
 		 * Allocate a new mbuf; could check space
 		 * and allocate a cluster instead.
 		 */
 		n = m_getjcl(M_NOWAIT, m->m_type, 0, MJUMPAGESIZE);
 		if (n == NULL)
 			break;
 		n->m_len = min(MJUMPAGESIZE, remainder);
 		bcopy(cp, mtod(n, caddr_t), n->m_len);
 		cp += n->m_len;
 		remainder -= n->m_len;
 		m->m_next = n;
 		m = n;
 	}
 	if (m0->m_flags & M_PKTHDR)
 		m0->m_pkthdr.len += len - remainder;
 
 	return (remainder == 0);
 }
 
 #if defined(INET) || defined(INET6)
 static __inline int
 hn_lro_rx(struct lro_ctrl *lc, struct mbuf *m)
 {
 #if __FreeBSD_version >= 1100095
 	if (hn_lro_mbufq_depth) {
 		tcp_lro_queue_mbuf(lc, m);
 		return 0;
 	}
 #endif
 	return tcp_lro_rx(lc, m, 0);
 }
 #endif
 
 static int
 hn_rxpkt(struct hn_rx_ring *rxr, const void *data, int dlen,
     const struct hn_rxinfo *info)
 {
 	struct ifnet *ifp;
 	struct mbuf *m_new;
 	int size, do_lro = 0, do_csum = 1;
 	int hash_type;
 
 	/* If the VF is active, inject the packet through the VF */
 	ifp = rxr->hn_vf ? rxr->hn_vf : rxr->hn_ifp;
 
 	if (dlen <= MHLEN) {
 		m_new = m_gethdr(M_NOWAIT, MT_DATA);
 		if (m_new == NULL) {
 			if_inc_counter(ifp, IFCOUNTER_IQDROPS, 1);
 			return (0);
 		}
 		memcpy(mtod(m_new, void *), data, dlen);
 		m_new->m_pkthdr.len = m_new->m_len = dlen;
 		rxr->hn_small_pkts++;
 	} else {
 		/*
 		 * Get an mbuf with a cluster.  For packets 2K or less,
 		 * get a standard 2K cluster.  For anything larger, get a
 		 * 4K cluster.  Any buffers larger than 4K can cause problems
 		 * if looped around to the Hyper-V TX channel, so avoid them.
 		 */
 		size = MCLBYTES;
 		if (dlen > MCLBYTES) {
 			/* 4096 */
 			size = MJUMPAGESIZE;
 		}
 
 		m_new = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, size);
 		if (m_new == NULL) {
 			if_inc_counter(ifp, IFCOUNTER_IQDROPS, 1);
 			return (0);
 		}
 
 		hv_m_append(m_new, dlen, data);
 	}
 	m_new->m_pkthdr.rcvif = ifp;
 
 	if (__predict_false((ifp->if_capenable & IFCAP_RXCSUM) == 0))
 		do_csum = 0;
 
 	/* receive side checksum offload */
 	if (info->csum_info != HN_NDIS_RXCSUM_INFO_INVALID) {
 		/* IP csum offload */
 		if ((info->csum_info & NDIS_RXCSUM_INFO_IPCS_OK) && do_csum) {
 			m_new->m_pkthdr.csum_flags |=
 			    (CSUM_IP_CHECKED | CSUM_IP_VALID);
 			rxr->hn_csum_ip++;
 		}
 
 		/* TCP/UDP csum offload */
 		if ((info->csum_info & (NDIS_RXCSUM_INFO_UDPCS_OK |
 		     NDIS_RXCSUM_INFO_TCPCS_OK)) && do_csum) {
 			m_new->m_pkthdr.csum_flags |=
 			    (CSUM_DATA_VALID | CSUM_PSEUDO_HDR);
 			m_new->m_pkthdr.csum_data = 0xffff;
 			if (info->csum_info & NDIS_RXCSUM_INFO_TCPCS_OK)
 				rxr->hn_csum_tcp++;
 			else
 				rxr->hn_csum_udp++;
 		}
 
 		/*
 		 * XXX
 		 * As of this write (Oct 28th, 2016), host side will turn
 		 * on only TCPCS_OK and IPCS_OK even for UDP datagrams, so
 		 * the do_lro setting here is actually _not_ accurate.  We
 		 * depend on the RSS hash type check to reset do_lro.
 		 */
 		if ((info->csum_info &
 		     (NDIS_RXCSUM_INFO_TCPCS_OK | NDIS_RXCSUM_INFO_IPCS_OK)) ==
 		    (NDIS_RXCSUM_INFO_TCPCS_OK | NDIS_RXCSUM_INFO_IPCS_OK))
 			do_lro = 1;
 	} else {
 		const struct ether_header *eh;
 		uint16_t etype;
 		int hoff;
 
 		hoff = sizeof(*eh);
 		if (m_new->m_len < hoff)
 			goto skip;
 		eh = mtod(m_new, struct ether_header *);
 		etype = ntohs(eh->ether_type);
 		if (etype == ETHERTYPE_VLAN) {
 			const struct ether_vlan_header *evl;
 
 			hoff = sizeof(*evl);
 			if (m_new->m_len < hoff)
 				goto skip;
 			evl = mtod(m_new, struct ether_vlan_header *);
 			etype = ntohs(evl->evl_proto);
 		}
 
 		if (etype == ETHERTYPE_IP) {
 			int pr;
 
 			pr = hn_check_iplen(m_new, hoff);
 			if (pr == IPPROTO_TCP) {
 				if (do_csum &&
 				    (rxr->hn_trust_hcsum &
 				     HN_TRUST_HCSUM_TCP)) {
 					rxr->hn_csum_trusted++;
 					m_new->m_pkthdr.csum_flags |=
 					   (CSUM_IP_CHECKED | CSUM_IP_VALID |
 					    CSUM_DATA_VALID | CSUM_PSEUDO_HDR);
 					m_new->m_pkthdr.csum_data = 0xffff;
 				}
 				do_lro = 1;
 			} else if (pr == IPPROTO_UDP) {
 				if (do_csum &&
 				    (rxr->hn_trust_hcsum &
 				     HN_TRUST_HCSUM_UDP)) {
 					rxr->hn_csum_trusted++;
 					m_new->m_pkthdr.csum_flags |=
 					   (CSUM_IP_CHECKED | CSUM_IP_VALID |
 					    CSUM_DATA_VALID | CSUM_PSEUDO_HDR);
 					m_new->m_pkthdr.csum_data = 0xffff;
 				}
 			} else if (pr != IPPROTO_DONE && do_csum &&
 			    (rxr->hn_trust_hcsum & HN_TRUST_HCSUM_IP)) {
 				rxr->hn_csum_trusted++;
 				m_new->m_pkthdr.csum_flags |=
 				    (CSUM_IP_CHECKED | CSUM_IP_VALID);
 			}
 		}
 	}
 skip:
 	if (info->vlan_info != HN_NDIS_VLAN_INFO_INVALID) {
 		m_new->m_pkthdr.ether_vtag = EVL_MAKETAG(
 		    NDIS_VLAN_INFO_ID(info->vlan_info),
 		    NDIS_VLAN_INFO_PRI(info->vlan_info),
 		    NDIS_VLAN_INFO_CFI(info->vlan_info));
 		m_new->m_flags |= M_VLANTAG;
 	}
 
 	if (info->hash_info != HN_NDIS_HASH_INFO_INVALID) {
 		rxr->hn_rss_pkts++;
 		m_new->m_pkthdr.flowid = info->hash_value;
 		hash_type = M_HASHTYPE_OPAQUE_HASH;
 		if ((info->hash_info & NDIS_HASH_FUNCTION_MASK) ==
 		    NDIS_HASH_FUNCTION_TOEPLITZ) {
 			uint32_t type = (info->hash_info & NDIS_HASH_TYPE_MASK);
 
 			/*
 			 * NOTE:
 			 * do_lro is resetted, if the hash types are not TCP
 			 * related.  See the comment in the above csum_flags
 			 * setup section.
 			 */
 			switch (type) {
 			case NDIS_HASH_IPV4:
 				hash_type = M_HASHTYPE_RSS_IPV4;
 				do_lro = 0;
 				break;
 
 			case NDIS_HASH_TCP_IPV4:
 				hash_type = M_HASHTYPE_RSS_TCP_IPV4;
 				break;
 
 			case NDIS_HASH_IPV6:
 				hash_type = M_HASHTYPE_RSS_IPV6;
 				do_lro = 0;
 				break;
 
 			case NDIS_HASH_IPV6_EX:
 				hash_type = M_HASHTYPE_RSS_IPV6_EX;
 				do_lro = 0;
 				break;
 
 			case NDIS_HASH_TCP_IPV6:
 				hash_type = M_HASHTYPE_RSS_TCP_IPV6;
 				break;
 
 			case NDIS_HASH_TCP_IPV6_EX:
 				hash_type = M_HASHTYPE_RSS_TCP_IPV6_EX;
 				break;
 			}
 		}
 	} else {
 		m_new->m_pkthdr.flowid = rxr->hn_rx_idx;
 		hash_type = M_HASHTYPE_OPAQUE;
 	}
 	M_HASHTYPE_SET(m_new, hash_type);
 
 	/*
 	 * Note:  Moved RX completion back to hv_nv_on_receive() so all
 	 * messages (not just data messages) will trigger a response.
 	 */
 
 	if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1);
 	rxr->hn_pkts++;
 
 	if ((ifp->if_capenable & IFCAP_LRO) && do_lro) {
 #if defined(INET) || defined(INET6)
 		struct lro_ctrl *lro = &rxr->hn_lro;
 
 		if (lro->lro_cnt) {
 			rxr->hn_lro_tried++;
 			if (hn_lro_rx(lro, m_new) == 0) {
 				/* DONE! */
 				return 0;
 			}
 		}
 #endif
 	}
 
 	/* We're not holding the lock here, so don't release it */
 	(*ifp->if_input)(ifp, m_new);
 
 	return (0);
 }
 
 static int
 hn_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
 {
 	struct hn_softc *sc = ifp->if_softc;
 	struct ifreq *ifr = (struct ifreq *)data;
 	int mask, error = 0;
 
 	switch (cmd) {
 	case SIOCSIFMTU:
 		if (ifr->ifr_mtu > HN_MTU_MAX) {
 			error = EINVAL;
 			break;
 		}
 
 		HN_LOCK(sc);
 
 		if ((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0) {
 			HN_UNLOCK(sc);
 			break;
 		}
 
 		if ((sc->hn_caps & HN_CAP_MTU) == 0) {
 			/* Can't change MTU */
 			HN_UNLOCK(sc);
 			error = EOPNOTSUPP;
 			break;
 		}
 
 		if (ifp->if_mtu == ifr->ifr_mtu) {
 			HN_UNLOCK(sc);
 			break;
 		}
 
 		/*
 		 * Suspend this interface before the synthetic parts
 		 * are ripped.
 		 */
 		hn_suspend(sc);
 
 		/*
 		 * Detach the synthetics parts, i.e. NVS and RNDIS.
 		 */
 		hn_synth_detach(sc);
 
 		/*
 		 * Reattach the synthetic parts, i.e. NVS and RNDIS,
 		 * with the new MTU setting.
 		 */
 		error = hn_synth_attach(sc, ifr->ifr_mtu);
 		if (error) {
 			HN_UNLOCK(sc);
 			break;
 		}
 
 		/*
 		 * Commit the requested MTU, after the synthetic parts
 		 * have been successfully attached.
 		 */
 		ifp->if_mtu = ifr->ifr_mtu;
 
 		/*
 		 * Make sure that various parameters based on MTU are
 		 * still valid, after the MTU change.
 		 */
 		if (sc->hn_tx_ring[0].hn_chim_size > sc->hn_chim_szmax)
 			hn_set_chim_size(sc, sc->hn_chim_szmax);
 		hn_set_tso_maxsize(sc, hn_tso_maxlen, ifp->if_mtu);
 #if __FreeBSD_version >= 1100099
 		if (sc->hn_rx_ring[0].hn_lro.lro_length_lim <
 		    HN_LRO_LENLIM_MIN(ifp))
 			hn_set_lro_lenlim(sc, HN_LRO_LENLIM_MIN(ifp));
 #endif
 
 		/*
 		 * All done!  Resume the interface now.
 		 */
 		hn_resume(sc);
 
 		HN_UNLOCK(sc);
 		break;
 
 	case SIOCSIFFLAGS:
 		HN_LOCK(sc);
 
 		if ((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0) {
 			HN_UNLOCK(sc);
 			break;
 		}
 
 		if (ifp->if_flags & IFF_UP) {
 			if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
 				/*
 				 * Caller meight hold mutex, e.g.
 				 * bpf; use busy-wait for the RNDIS
 				 * reply.
 				 */
 				HN_NO_SLEEPING(sc);
 				hn_rxfilter_config(sc);
 				HN_SLEEPING_OK(sc);
 			} else {
 				hn_init_locked(sc);
 			}
 		} else {
 			if (ifp->if_drv_flags & IFF_DRV_RUNNING)
 				hn_stop(sc, false);
 		}
 		sc->hn_if_flags = ifp->if_flags;
 
 		HN_UNLOCK(sc);
 		break;
 
 	case SIOCSIFCAP:
 		HN_LOCK(sc);
 		mask = ifr->ifr_reqcap ^ ifp->if_capenable;
 
 		if (mask & IFCAP_TXCSUM) {
 			ifp->if_capenable ^= IFCAP_TXCSUM;
 			if (ifp->if_capenable & IFCAP_TXCSUM)
 				ifp->if_hwassist |= HN_CSUM_IP_HWASSIST(sc);
 			else
 				ifp->if_hwassist &= ~HN_CSUM_IP_HWASSIST(sc);
 		}
 		if (mask & IFCAP_TXCSUM_IPV6) {
 			ifp->if_capenable ^= IFCAP_TXCSUM_IPV6;
 			if (ifp->if_capenable & IFCAP_TXCSUM_IPV6)
 				ifp->if_hwassist |= HN_CSUM_IP6_HWASSIST(sc);
 			else
 				ifp->if_hwassist &= ~HN_CSUM_IP6_HWASSIST(sc);
 		}
 
 		/* TODO: flip RNDIS offload parameters for RXCSUM. */
 		if (mask & IFCAP_RXCSUM)
 			ifp->if_capenable ^= IFCAP_RXCSUM;
 #ifdef foo
 		/* We can't diff IPv6 packets from IPv4 packets on RX path. */
 		if (mask & IFCAP_RXCSUM_IPV6)
 			ifp->if_capenable ^= IFCAP_RXCSUM_IPV6;
 #endif
 
 		if (mask & IFCAP_LRO)
 			ifp->if_capenable ^= IFCAP_LRO;
 
 		if (mask & IFCAP_TSO4) {
 			ifp->if_capenable ^= IFCAP_TSO4;
 			if (ifp->if_capenable & IFCAP_TSO4)
 				ifp->if_hwassist |= CSUM_IP_TSO;
 			else
 				ifp->if_hwassist &= ~CSUM_IP_TSO;
 		}
 		if (mask & IFCAP_TSO6) {
 			ifp->if_capenable ^= IFCAP_TSO6;
 			if (ifp->if_capenable & IFCAP_TSO6)
 				ifp->if_hwassist |= CSUM_IP6_TSO;
 			else
 				ifp->if_hwassist &= ~CSUM_IP6_TSO;
 		}
 
 		HN_UNLOCK(sc);
 		break;
 
 	case SIOCADDMULTI:
 	case SIOCDELMULTI:
 		HN_LOCK(sc);
 
 		if ((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0) {
 			HN_UNLOCK(sc);
 			break;
 		}
 		if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
 			/*
 			 * Multicast uses mutex; use busy-wait for
 			 * the RNDIS reply.
 			 */
 			HN_NO_SLEEPING(sc);
 			hn_rxfilter_config(sc);
 			HN_SLEEPING_OK(sc);
 		}
 
 		HN_UNLOCK(sc);
 		break;
 
 	case SIOCSIFMEDIA:
 	case SIOCGIFMEDIA:
 		error = ifmedia_ioctl(ifp, ifr, &sc->hn_media, cmd);
 		break;
 
 	default:
 		error = ether_ioctl(ifp, cmd, data);
 		break;
 	}
 	return (error);
 }
 
 static void
 hn_stop(struct hn_softc *sc, bool detaching)
 {
 	struct ifnet *ifp = sc->hn_ifp;
 	int i;
 
 	HN_LOCK_ASSERT(sc);
 
 	KASSERT(sc->hn_flags & HN_FLAG_SYNTH_ATTACHED,
 	    ("synthetic parts were not attached"));
 
 	/* Disable polling. */
 	hn_polling(sc, 0);
 
 	/* Clear RUNNING bit _before_ hn_suspend_data() */
 	atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_RUNNING);
 	hn_suspend_data(sc);
 
 	/* Clear OACTIVE bit. */
 	atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE);
 	for (i = 0; i < sc->hn_tx_ring_inuse; ++i)
 		sc->hn_tx_ring[i].hn_oactive = 0;
 
 	/*
 	 * If the VF is active, make sure the filter is not 0, even if
 	 * the synthetic NIC is down.
 	 */
 	if (!detaching && (sc->hn_flags & HN_FLAG_VF))
 		hn_rxfilter_config(sc);
 }
 
 static void
 hn_init_locked(struct hn_softc *sc)
 {
 	struct ifnet *ifp = sc->hn_ifp;
 	int i;
 
 	HN_LOCK_ASSERT(sc);
 
 	if ((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0)
 		return;
 
 	if (ifp->if_drv_flags & IFF_DRV_RUNNING)
 		return;
 
 	/* Configure RX filter */
 	hn_rxfilter_config(sc);
 
 	/* Clear OACTIVE bit. */
 	atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE);
 	for (i = 0; i < sc->hn_tx_ring_inuse; ++i)
 		sc->hn_tx_ring[i].hn_oactive = 0;
 
 	/* Clear TX 'suspended' bit. */
 	hn_resume_tx(sc, sc->hn_tx_ring_inuse);
 
 	/* Everything is ready; unleash! */
 	atomic_set_int(&ifp->if_drv_flags, IFF_DRV_RUNNING);
 
 	/* Re-enable polling if requested. */
 	if (sc->hn_pollhz > 0)
 		hn_polling(sc, sc->hn_pollhz);
 }
 
 static void
 hn_init(void *xsc)
 {
 	struct hn_softc *sc = xsc;
 
 	HN_LOCK(sc);
 	hn_init_locked(sc);
 	HN_UNLOCK(sc);
 }
 
 #if __FreeBSD_version >= 1100099
 
 static int
 hn_lro_lenlim_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	unsigned int lenlim;
 	int error;
 
 	lenlim = sc->hn_rx_ring[0].hn_lro.lro_length_lim;
 	error = sysctl_handle_int(oidp, &lenlim, 0, req);
 	if (error || req->newptr == NULL)
 		return error;
 
 	HN_LOCK(sc);
 	if (lenlim < HN_LRO_LENLIM_MIN(sc->hn_ifp) ||
 	    lenlim > TCP_LRO_LENGTH_MAX) {
 		HN_UNLOCK(sc);
 		return EINVAL;
 	}
 	hn_set_lro_lenlim(sc, lenlim);
 	HN_UNLOCK(sc);
 
 	return 0;
 }
 
 static int
 hn_lro_ackcnt_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int ackcnt, error, i;
 
 	/*
 	 * lro_ackcnt_lim is append count limit,
 	 * +1 to turn it into aggregation limit.
 	 */
 	ackcnt = sc->hn_rx_ring[0].hn_lro.lro_ackcnt_lim + 1;
 	error = sysctl_handle_int(oidp, &ackcnt, 0, req);
 	if (error || req->newptr == NULL)
 		return error;
 
 	if (ackcnt < 2 || ackcnt > (TCP_LRO_ACKCNT_MAX + 1))
 		return EINVAL;
 
 	/*
 	 * Convert aggregation limit back to append
 	 * count limit.
 	 */
 	--ackcnt;
 	HN_LOCK(sc);
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i)
 		sc->hn_rx_ring[i].hn_lro.lro_ackcnt_lim = ackcnt;
 	HN_UNLOCK(sc);
 	return 0;
 }
 
 #endif
 
 static int
 hn_trust_hcsum_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int hcsum = arg2;
 	int on, error, i;
 
 	on = 0;
 	if (sc->hn_rx_ring[0].hn_trust_hcsum & hcsum)
 		on = 1;
 
 	error = sysctl_handle_int(oidp, &on, 0, req);
 	if (error || req->newptr == NULL)
 		return error;
 
 	HN_LOCK(sc);
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		struct hn_rx_ring *rxr = &sc->hn_rx_ring[i];
 
 		if (on)
 			rxr->hn_trust_hcsum |= hcsum;
 		else
 			rxr->hn_trust_hcsum &= ~hcsum;
 	}
 	HN_UNLOCK(sc);
 	return 0;
 }
 
 static int
 hn_chim_size_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int chim_size, error;
 
 	chim_size = sc->hn_tx_ring[0].hn_chim_size;
 	error = sysctl_handle_int(oidp, &chim_size, 0, req);
 	if (error || req->newptr == NULL)
 		return error;
 
 	if (chim_size > sc->hn_chim_szmax || chim_size <= 0)
 		return EINVAL;
 
 	HN_LOCK(sc);
 	hn_set_chim_size(sc, chim_size);
 	HN_UNLOCK(sc);
 	return 0;
 }
 
 #if __FreeBSD_version < 1100095
 static int
 hn_rx_stat_int_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int ofs = arg2, i, error;
 	struct hn_rx_ring *rxr;
 	uint64_t stat;
 
 	stat = 0;
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		rxr = &sc->hn_rx_ring[i];
 		stat += *((int *)((uint8_t *)rxr + ofs));
 	}
 
 	error = sysctl_handle_64(oidp, &stat, 0, req);
 	if (error || req->newptr == NULL)
 		return error;
 
 	/* Zero out this stat. */
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		rxr = &sc->hn_rx_ring[i];
 		*((int *)((uint8_t *)rxr + ofs)) = 0;
 	}
 	return 0;
 }
 #else
 static int
 hn_rx_stat_u64_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int ofs = arg2, i, error;
 	struct hn_rx_ring *rxr;
 	uint64_t stat;
 
 	stat = 0;
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		rxr = &sc->hn_rx_ring[i];
 		stat += *((uint64_t *)((uint8_t *)rxr + ofs));
 	}
 
 	error = sysctl_handle_64(oidp, &stat, 0, req);
 	if (error || req->newptr == NULL)
 		return error;
 
 	/* Zero out this stat. */
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		rxr = &sc->hn_rx_ring[i];
 		*((uint64_t *)((uint8_t *)rxr + ofs)) = 0;
 	}
 	return 0;
 }
 
 #endif
 
 static int
 hn_rx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int ofs = arg2, i, error;
 	struct hn_rx_ring *rxr;
 	u_long stat;
 
 	stat = 0;
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		rxr = &sc->hn_rx_ring[i];
 		stat += *((u_long *)((uint8_t *)rxr + ofs));
 	}
 
 	error = sysctl_handle_long(oidp, &stat, 0, req);
 	if (error || req->newptr == NULL)
 		return error;
 
 	/* Zero out this stat. */
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		rxr = &sc->hn_rx_ring[i];
 		*((u_long *)((uint8_t *)rxr + ofs)) = 0;
 	}
 	return 0;
 }
 
 static int
 hn_tx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int ofs = arg2, i, error;
 	struct hn_tx_ring *txr;
 	u_long stat;
 
 	stat = 0;
 	for (i = 0; i < sc->hn_tx_ring_cnt; ++i) {
 		txr = &sc->hn_tx_ring[i];
 		stat += *((u_long *)((uint8_t *)txr + ofs));
 	}
 
 	error = sysctl_handle_long(oidp, &stat, 0, req);
 	if (error || req->newptr == NULL)
 		return error;
 
 	/* Zero out this stat. */
 	for (i = 0; i < sc->hn_tx_ring_cnt; ++i) {
 		txr = &sc->hn_tx_ring[i];
 		*((u_long *)((uint8_t *)txr + ofs)) = 0;
 	}
 	return 0;
 }
 
 static int
 hn_tx_conf_int_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int ofs = arg2, i, error, conf;
 	struct hn_tx_ring *txr;
 
 	txr = &sc->hn_tx_ring[0];
 	conf = *((int *)((uint8_t *)txr + ofs));
 
 	error = sysctl_handle_int(oidp, &conf, 0, req);
 	if (error || req->newptr == NULL)
 		return error;
 
 	HN_LOCK(sc);
 	for (i = 0; i < sc->hn_tx_ring_cnt; ++i) {
 		txr = &sc->hn_tx_ring[i];
 		*((int *)((uint8_t *)txr + ofs)) = conf;
 	}
 	HN_UNLOCK(sc);
 
 	return 0;
 }
 
 static int
 hn_txagg_size_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int error, size;
 
 	size = sc->hn_agg_size;
 	error = sysctl_handle_int(oidp, &size, 0, req);
 	if (error || req->newptr == NULL)
 		return (error);
 
 	HN_LOCK(sc);
 	sc->hn_agg_size = size;
 	hn_set_txagg(sc);
 	HN_UNLOCK(sc);
 
 	return (0);
 }
 
 static int
 hn_txagg_pkts_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int error, pkts;
 
 	pkts = sc->hn_agg_pkts;
 	error = sysctl_handle_int(oidp, &pkts, 0, req);
 	if (error || req->newptr == NULL)
 		return (error);
 
 	HN_LOCK(sc);
 	sc->hn_agg_pkts = pkts;
 	hn_set_txagg(sc);
 	HN_UNLOCK(sc);
 
 	return (0);
 }
 
 static int
 hn_txagg_pktmax_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int pkts;
 
 	pkts = sc->hn_tx_ring[0].hn_agg_pktmax;
 	return (sysctl_handle_int(oidp, &pkts, 0, req));
 }
 
 static int
 hn_txagg_align_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int align;
 
 	align = sc->hn_tx_ring[0].hn_agg_align;
 	return (sysctl_handle_int(oidp, &align, 0, req));
 }
 
 static void
 hn_chan_polling(struct vmbus_channel *chan, u_int pollhz)
 {
 	if (pollhz == 0)
 		vmbus_chan_poll_disable(chan);
 	else
 		vmbus_chan_poll_enable(chan, pollhz);
 }
 
 static void
 hn_polling(struct hn_softc *sc, u_int pollhz)
 {
 	int nsubch = sc->hn_rx_ring_inuse - 1;
 
 	HN_LOCK_ASSERT(sc);
 
 	if (nsubch > 0) {
 		struct vmbus_channel **subch;
 		int i;
 
 		subch = vmbus_subchan_get(sc->hn_prichan, nsubch);
 		for (i = 0; i < nsubch; ++i)
 			hn_chan_polling(subch[i], pollhz);
 		vmbus_subchan_rel(subch, nsubch);
 	}
 	hn_chan_polling(sc->hn_prichan, pollhz);
 }
 
 static int
 hn_polling_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int pollhz, error;
 
 	pollhz = sc->hn_pollhz;
 	error = sysctl_handle_int(oidp, &pollhz, 0, req);
 	if (error || req->newptr == NULL)
 		return (error);
 
 	if (pollhz != 0 &&
 	    (pollhz < VMBUS_CHAN_POLLHZ_MIN || pollhz > VMBUS_CHAN_POLLHZ_MAX))
 		return (EINVAL);
 
 	HN_LOCK(sc);
 	if (sc->hn_pollhz != pollhz) {
 		sc->hn_pollhz = pollhz;
 		if ((sc->hn_ifp->if_drv_flags & IFF_DRV_RUNNING) &&
 		    (sc->hn_flags & HN_FLAG_SYNTH_ATTACHED))
 			hn_polling(sc, sc->hn_pollhz);
 	}
 	HN_UNLOCK(sc);
 
 	return (0);
 }
 
 static int
 hn_ndis_version_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	char verstr[16];
 
 	snprintf(verstr, sizeof(verstr), "%u.%u",
 	    HN_NDIS_VERSION_MAJOR(sc->hn_ndis_ver),
 	    HN_NDIS_VERSION_MINOR(sc->hn_ndis_ver));
 	return sysctl_handle_string(oidp, verstr, sizeof(verstr), req);
 }
 
 static int
 hn_caps_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	char caps_str[128];
 	uint32_t caps;
 
 	HN_LOCK(sc);
 	caps = sc->hn_caps;
 	HN_UNLOCK(sc);
 	snprintf(caps_str, sizeof(caps_str), "%b", caps, HN_CAP_BITS);
 	return sysctl_handle_string(oidp, caps_str, sizeof(caps_str), req);
 }
 
 static int
 hn_hwassist_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	char assist_str[128];
 	uint32_t hwassist;
 
 	HN_LOCK(sc);
 	hwassist = sc->hn_ifp->if_hwassist;
 	HN_UNLOCK(sc);
 	snprintf(assist_str, sizeof(assist_str), "%b", hwassist, CSUM_BITS);
 	return sysctl_handle_string(oidp, assist_str, sizeof(assist_str), req);
 }
 
 static int
 hn_rxfilter_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	char filter_str[128];
 	uint32_t filter;
 
 	HN_LOCK(sc);
 	filter = sc->hn_rx_filter;
 	HN_UNLOCK(sc);
 	snprintf(filter_str, sizeof(filter_str), "%b", filter,
 	    NDIS_PACKET_TYPES);
 	return sysctl_handle_string(oidp, filter_str, sizeof(filter_str), req);
 }
 
 #ifndef RSS
 
 static int
 hn_rss_key_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int error;
 
 	HN_LOCK(sc);
 
 	error = SYSCTL_OUT(req, sc->hn_rss.rss_key, sizeof(sc->hn_rss.rss_key));
 	if (error || req->newptr == NULL)
 		goto back;
 
 	error = SYSCTL_IN(req, sc->hn_rss.rss_key, sizeof(sc->hn_rss.rss_key));
 	if (error)
 		goto back;
 	sc->hn_flags |= HN_FLAG_HAS_RSSKEY;
 
 	if (sc->hn_rx_ring_inuse > 1) {
 		error = hn_rss_reconfig(sc);
 	} else {
 		/* Not RSS capable, at least for now; just save the RSS key. */
 		error = 0;
 	}
 back:
 	HN_UNLOCK(sc);
 	return (error);
 }
 
 static int
 hn_rss_ind_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	int error;
 
 	HN_LOCK(sc);
 
 	error = SYSCTL_OUT(req, sc->hn_rss.rss_ind, sizeof(sc->hn_rss.rss_ind));
 	if (error || req->newptr == NULL)
 		goto back;
 
 	/*
 	 * Don't allow RSS indirect table change, if this interface is not
 	 * RSS capable currently.
 	 */
 	if (sc->hn_rx_ring_inuse == 1) {
 		error = EOPNOTSUPP;
 		goto back;
 	}
 
 	error = SYSCTL_IN(req, sc->hn_rss.rss_ind, sizeof(sc->hn_rss.rss_ind));
 	if (error)
 		goto back;
 	sc->hn_flags |= HN_FLAG_HAS_RSSIND;
 
 	hn_rss_ind_fixup(sc);
 	error = hn_rss_reconfig(sc);
 back:
 	HN_UNLOCK(sc);
 	return (error);
 }
 
 #endif	/* !RSS */
 
 static int
 hn_rss_hash_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	char hash_str[128];
 	uint32_t hash;
 
 	HN_LOCK(sc);
 	hash = sc->hn_rss_hash;
 	HN_UNLOCK(sc);
 	snprintf(hash_str, sizeof(hash_str), "%b", hash, NDIS_HASH_BITS);
 	return sysctl_handle_string(oidp, hash_str, sizeof(hash_str), req);
 }
 
 static int
 hn_vf_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct hn_softc *sc = arg1;
 	char vf_name[128];
 	struct ifnet *vf;
 
 	HN_LOCK(sc);
 	vf_name[0] = '\0';
 	vf = sc->hn_rx_ring[0].hn_vf;
 	if (vf != NULL)
 		snprintf(vf_name, sizeof(vf_name), "%s", if_name(vf));
 	HN_UNLOCK(sc);
 	return sysctl_handle_string(oidp, vf_name, sizeof(vf_name), req);
 }
 
 static int
 hn_check_iplen(const struct mbuf *m, int hoff)
 {
 	const struct ip *ip;
 	int len, iphlen, iplen;
 	const struct tcphdr *th;
 	int thoff;				/* TCP data offset */
 
 	len = hoff + sizeof(struct ip);
 
 	/* The packet must be at least the size of an IP header. */
 	if (m->m_pkthdr.len < len)
 		return IPPROTO_DONE;
 
 	/* The fixed IP header must reside completely in the first mbuf. */
 	if (m->m_len < len)
 		return IPPROTO_DONE;
 
 	ip = mtodo(m, hoff);
 
 	/* Bound check the packet's stated IP header length. */
 	iphlen = ip->ip_hl << 2;
 	if (iphlen < sizeof(struct ip))		/* minimum header length */
 		return IPPROTO_DONE;
 
 	/* The full IP header must reside completely in the one mbuf. */
 	if (m->m_len < hoff + iphlen)
 		return IPPROTO_DONE;
 
 	iplen = ntohs(ip->ip_len);
 
 	/*
 	 * Check that the amount of data in the buffers is as
 	 * at least much as the IP header would have us expect.
 	 */
 	if (m->m_pkthdr.len < hoff + iplen)
 		return IPPROTO_DONE;
 
 	/*
 	 * Ignore IP fragments.
 	 */
 	if (ntohs(ip->ip_off) & (IP_OFFMASK | IP_MF))
 		return IPPROTO_DONE;
 
 	/*
 	 * The TCP/IP or UDP/IP header must be entirely contained within
 	 * the first fragment of a packet.
 	 */
 	switch (ip->ip_p) {
 	case IPPROTO_TCP:
 		if (iplen < iphlen + sizeof(struct tcphdr))
 			return IPPROTO_DONE;
 		if (m->m_len < hoff + iphlen + sizeof(struct tcphdr))
 			return IPPROTO_DONE;
 		th = (const struct tcphdr *)((const uint8_t *)ip + iphlen);
 		thoff = th->th_off << 2;
 		if (thoff < sizeof(struct tcphdr) || thoff + iphlen > iplen)
 			return IPPROTO_DONE;
 		if (m->m_len < hoff + iphlen + thoff)
 			return IPPROTO_DONE;
 		break;
 	case IPPROTO_UDP:
 		if (iplen < iphlen + sizeof(struct udphdr))
 			return IPPROTO_DONE;
 		if (m->m_len < hoff + iphlen + sizeof(struct udphdr))
 			return IPPROTO_DONE;
 		break;
 	default:
 		if (iplen < iphlen)
 			return IPPROTO_DONE;
 		break;
 	}
 	return ip->ip_p;
 }
 
 static int
 hn_create_rx_data(struct hn_softc *sc, int ring_cnt)
 {
 	struct sysctl_oid_list *child;
 	struct sysctl_ctx_list *ctx;
 	device_t dev = sc->hn_dev;
 #if defined(INET) || defined(INET6)
 #if __FreeBSD_version >= 1100095
 	int lroent_cnt;
 #endif
 #endif
 	int i;
 
 	/*
 	 * Create RXBUF for reception.
 	 *
 	 * NOTE:
 	 * - It is shared by all channels.
 	 * - A large enough buffer is allocated, certain version of NVSes
 	 *   may further limit the usable space.
 	 */
 	sc->hn_rxbuf = hyperv_dmamem_alloc(bus_get_dma_tag(dev),
 	    PAGE_SIZE, 0, HN_RXBUF_SIZE, &sc->hn_rxbuf_dma,
 	    BUS_DMA_WAITOK | BUS_DMA_ZERO);
 	if (sc->hn_rxbuf == NULL) {
 		device_printf(sc->hn_dev, "allocate rxbuf failed\n");
 		return (ENOMEM);
 	}
 
 	sc->hn_rx_ring_cnt = ring_cnt;
 	sc->hn_rx_ring_inuse = sc->hn_rx_ring_cnt;
 
 	sc->hn_rx_ring = malloc(sizeof(struct hn_rx_ring) * sc->hn_rx_ring_cnt,
 	    M_DEVBUF, M_WAITOK | M_ZERO);
 
 #if defined(INET) || defined(INET6)
 #if __FreeBSD_version >= 1100095
 	lroent_cnt = hn_lro_entry_count;
 	if (lroent_cnt < TCP_LRO_ENTRIES)
 		lroent_cnt = TCP_LRO_ENTRIES;
 	if (bootverbose)
 		device_printf(dev, "LRO: entry count %d\n", lroent_cnt);
 #endif
 #endif	/* INET || INET6 */
 
 	ctx = device_get_sysctl_ctx(dev);
 	child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev));
 
 	/* Create dev.hn.UNIT.rx sysctl tree */
 	sc->hn_rx_sysctl_tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "rx",
 	    CTLFLAG_RD | CTLFLAG_MPSAFE, 0, "");
 
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		struct hn_rx_ring *rxr = &sc->hn_rx_ring[i];
 
 		rxr->hn_br = hyperv_dmamem_alloc(bus_get_dma_tag(dev),
 		    PAGE_SIZE, 0, HN_TXBR_SIZE + HN_RXBR_SIZE,
 		    &rxr->hn_br_dma, BUS_DMA_WAITOK);
 		if (rxr->hn_br == NULL) {
 			device_printf(dev, "allocate bufring failed\n");
 			return (ENOMEM);
 		}
 
 		if (hn_trust_hosttcp)
 			rxr->hn_trust_hcsum |= HN_TRUST_HCSUM_TCP;
 		if (hn_trust_hostudp)
 			rxr->hn_trust_hcsum |= HN_TRUST_HCSUM_UDP;
 		if (hn_trust_hostip)
 			rxr->hn_trust_hcsum |= HN_TRUST_HCSUM_IP;
 		rxr->hn_ifp = sc->hn_ifp;
 		if (i < sc->hn_tx_ring_cnt)
 			rxr->hn_txr = &sc->hn_tx_ring[i];
 		rxr->hn_pktbuf_len = HN_PKTBUF_LEN_DEF;
 		rxr->hn_pktbuf = malloc(rxr->hn_pktbuf_len, M_DEVBUF, M_WAITOK);
 		rxr->hn_rx_idx = i;
 		rxr->hn_rxbuf = sc->hn_rxbuf;
 
 		/*
 		 * Initialize LRO.
 		 */
 #if defined(INET) || defined(INET6)
 #if __FreeBSD_version >= 1100095
 		tcp_lro_init_args(&rxr->hn_lro, sc->hn_ifp, lroent_cnt,
 		    hn_lro_mbufq_depth);
 #else
 		tcp_lro_init(&rxr->hn_lro);
 		rxr->hn_lro.ifp = sc->hn_ifp;
 #endif
 #if __FreeBSD_version >= 1100099
 		rxr->hn_lro.lro_length_lim = HN_LRO_LENLIM_DEF;
 		rxr->hn_lro.lro_ackcnt_lim = HN_LRO_ACKCNT_DEF;
 #endif
 #endif	/* INET || INET6 */
 
 		if (sc->hn_rx_sysctl_tree != NULL) {
 			char name[16];
 
 			/*
 			 * Create per RX ring sysctl tree:
 			 * dev.hn.UNIT.rx.RINGID
 			 */
 			snprintf(name, sizeof(name), "%d", i);
 			rxr->hn_rx_sysctl_tree = SYSCTL_ADD_NODE(ctx,
 			    SYSCTL_CHILDREN(sc->hn_rx_sysctl_tree),
 			    OID_AUTO, name, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, "");
 
 			if (rxr->hn_rx_sysctl_tree != NULL) {
 				SYSCTL_ADD_ULONG(ctx,
 				    SYSCTL_CHILDREN(rxr->hn_rx_sysctl_tree),
 				    OID_AUTO, "packets", CTLFLAG_RW,
 				    &rxr->hn_pkts, "# of packets received");
 				SYSCTL_ADD_ULONG(ctx,
 				    SYSCTL_CHILDREN(rxr->hn_rx_sysctl_tree),
 				    OID_AUTO, "rss_pkts", CTLFLAG_RW,
 				    &rxr->hn_rss_pkts,
 				    "# of packets w/ RSS info received");
 				SYSCTL_ADD_INT(ctx,
 				    SYSCTL_CHILDREN(rxr->hn_rx_sysctl_tree),
 				    OID_AUTO, "pktbuf_len", CTLFLAG_RD,
 				    &rxr->hn_pktbuf_len, 0,
 				    "Temporary channel packet buffer length");
 			}
 		}
 	}
 
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_queued",
 	    CTLTYPE_U64 | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_rx_ring, hn_lro.lro_queued),
 #if __FreeBSD_version < 1100095
 	    hn_rx_stat_int_sysctl,
 #else
 	    hn_rx_stat_u64_sysctl,
 #endif
 	    "LU", "LRO queued");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_flushed",
 	    CTLTYPE_U64 | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_rx_ring, hn_lro.lro_flushed),
 #if __FreeBSD_version < 1100095
 	    hn_rx_stat_int_sysctl,
 #else
 	    hn_rx_stat_u64_sysctl,
 #endif
 	    "LU", "LRO flushed");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_tried",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_rx_ring, hn_lro_tried),
 	    hn_rx_stat_ulong_sysctl, "LU", "# of LRO tries");
 #if __FreeBSD_version >= 1100099
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_length_lim",
 	    CTLTYPE_UINT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0,
 	    hn_lro_lenlim_sysctl, "IU",
 	    "Max # of data bytes to be aggregated by LRO");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_ackcnt_lim",
 	    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0,
 	    hn_lro_ackcnt_sysctl, "I",
 	    "Max # of ACKs to be aggregated by LRO");
 #endif
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "trust_hosttcp",
 	    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, HN_TRUST_HCSUM_TCP,
 	    hn_trust_hcsum_sysctl, "I",
 	    "Trust tcp segement verification on host side, "
 	    "when csum info is missing");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "trust_hostudp",
 	    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, HN_TRUST_HCSUM_UDP,
 	    hn_trust_hcsum_sysctl, "I",
 	    "Trust udp datagram verification on host side, "
 	    "when csum info is missing");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "trust_hostip",
 	    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, HN_TRUST_HCSUM_IP,
 	    hn_trust_hcsum_sysctl, "I",
 	    "Trust ip packet verification on host side, "
 	    "when csum info is missing");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_ip",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_rx_ring, hn_csum_ip),
 	    hn_rx_stat_ulong_sysctl, "LU", "RXCSUM IP");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_tcp",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_rx_ring, hn_csum_tcp),
 	    hn_rx_stat_ulong_sysctl, "LU", "RXCSUM TCP");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_udp",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_rx_ring, hn_csum_udp),
 	    hn_rx_stat_ulong_sysctl, "LU", "RXCSUM UDP");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_trusted",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_rx_ring, hn_csum_trusted),
 	    hn_rx_stat_ulong_sysctl, "LU",
 	    "# of packets that we trust host's csum verification");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "small_pkts",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_rx_ring, hn_small_pkts),
 	    hn_rx_stat_ulong_sysctl, "LU", "# of small packets received");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "rx_ack_failed",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_rx_ring, hn_ack_failed),
 	    hn_rx_stat_ulong_sysctl, "LU", "# of RXBUF ack failures");
 	SYSCTL_ADD_INT(ctx, child, OID_AUTO, "rx_ring_cnt",
 	    CTLFLAG_RD, &sc->hn_rx_ring_cnt, 0, "# created RX rings");
 	SYSCTL_ADD_INT(ctx, child, OID_AUTO, "rx_ring_inuse",
 	    CTLFLAG_RD, &sc->hn_rx_ring_inuse, 0, "# used RX rings");
 
 	return (0);
 }
 
 static void
 hn_destroy_rx_data(struct hn_softc *sc)
 {
 	int i;
 
 	if (sc->hn_rxbuf != NULL) {
 		if ((sc->hn_flags & HN_FLAG_RXBUF_REF) == 0)
 			hyperv_dmamem_free(&sc->hn_rxbuf_dma, sc->hn_rxbuf);
 		else
 			device_printf(sc->hn_dev, "RXBUF is referenced\n");
 		sc->hn_rxbuf = NULL;
 	}
 
 	if (sc->hn_rx_ring_cnt == 0)
 		return;
 
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		struct hn_rx_ring *rxr = &sc->hn_rx_ring[i];
 
 		if (rxr->hn_br == NULL)
 			continue;
 		if ((rxr->hn_rx_flags & HN_RX_FLAG_BR_REF) == 0) {
 			hyperv_dmamem_free(&rxr->hn_br_dma, rxr->hn_br);
 		} else {
 			device_printf(sc->hn_dev,
 			    "%dth channel bufring is referenced", i);
 		}
 		rxr->hn_br = NULL;
 
 #if defined(INET) || defined(INET6)
 		tcp_lro_free(&rxr->hn_lro);
 #endif
 		free(rxr->hn_pktbuf, M_DEVBUF);
 	}
 	free(sc->hn_rx_ring, M_DEVBUF);
 	sc->hn_rx_ring = NULL;
 
 	sc->hn_rx_ring_cnt = 0;
 	sc->hn_rx_ring_inuse = 0;
 }
 
 static int
 hn_tx_ring_create(struct hn_softc *sc, int id)
 {
 	struct hn_tx_ring *txr = &sc->hn_tx_ring[id];
 	device_t dev = sc->hn_dev;
 	bus_dma_tag_t parent_dtag;
 	int error, i;
 
 	txr->hn_sc = sc;
 	txr->hn_tx_idx = id;
 
 #ifndef HN_USE_TXDESC_BUFRING
 	mtx_init(&txr->hn_txlist_spin, "hn txlist", NULL, MTX_SPIN);
 #endif
 	mtx_init(&txr->hn_tx_lock, "hn tx", NULL, MTX_DEF);
 
 	txr->hn_txdesc_cnt = HN_TX_DESC_CNT;
 	txr->hn_txdesc = malloc(sizeof(struct hn_txdesc) * txr->hn_txdesc_cnt,
 	    M_DEVBUF, M_WAITOK | M_ZERO);
 #ifndef HN_USE_TXDESC_BUFRING
 	SLIST_INIT(&txr->hn_txlist);
 #else
 	txr->hn_txdesc_br = buf_ring_alloc(txr->hn_txdesc_cnt, M_DEVBUF,
 	    M_WAITOK, &txr->hn_tx_lock);
 #endif
 
 	if (hn_tx_taskq_mode == HN_TX_TASKQ_M_EVTTQ) {
 		txr->hn_tx_taskq = VMBUS_GET_EVENT_TASKQ(
 		    device_get_parent(dev), dev, HN_RING_IDX2CPU(sc, id));
 	} else {
 		txr->hn_tx_taskq = sc->hn_tx_taskqs[id % hn_tx_taskq_cnt];
 	}
 
 #ifdef HN_IFSTART_SUPPORT
 	if (hn_use_if_start) {
 		txr->hn_txeof = hn_start_txeof;
 		TASK_INIT(&txr->hn_tx_task, 0, hn_start_taskfunc, txr);
 		TASK_INIT(&txr->hn_txeof_task, 0, hn_start_txeof_taskfunc, txr);
 	} else
 #endif
 	{
 		int br_depth;
 
 		txr->hn_txeof = hn_xmit_txeof;
 		TASK_INIT(&txr->hn_tx_task, 0, hn_xmit_taskfunc, txr);
 		TASK_INIT(&txr->hn_txeof_task, 0, hn_xmit_txeof_taskfunc, txr);
 
 		br_depth = hn_get_txswq_depth(txr);
 		txr->hn_mbuf_br = buf_ring_alloc(br_depth, M_DEVBUF,
 		    M_WAITOK, &txr->hn_tx_lock);
 	}
 
 	txr->hn_direct_tx_size = hn_direct_tx_size;
 
 	/*
 	 * Always schedule transmission instead of trying to do direct
 	 * transmission.  This one gives the best performance so far.
 	 */
 	txr->hn_sched_tx = 1;
 
 	parent_dtag = bus_get_dma_tag(dev);
 
 	/* DMA tag for RNDIS packet messages. */
 	error = bus_dma_tag_create(parent_dtag, /* parent */
 	    HN_RNDIS_PKT_ALIGN,		/* alignment */
 	    HN_RNDIS_PKT_BOUNDARY,	/* boundary */
 	    BUS_SPACE_MAXADDR,		/* lowaddr */
 	    BUS_SPACE_MAXADDR,		/* highaddr */
 	    NULL, NULL,			/* filter, filterarg */
 	    HN_RNDIS_PKT_LEN,		/* maxsize */
 	    1,				/* nsegments */
 	    HN_RNDIS_PKT_LEN,		/* maxsegsize */
 	    0,				/* flags */
 	    NULL,			/* lockfunc */
 	    NULL,			/* lockfuncarg */
 	    &txr->hn_tx_rndis_dtag);
 	if (error) {
 		device_printf(dev, "failed to create rndis dmatag\n");
 		return error;
 	}
 
 	/* DMA tag for data. */
 	error = bus_dma_tag_create(parent_dtag, /* parent */
 	    1,				/* alignment */
 	    HN_TX_DATA_BOUNDARY,	/* boundary */
 	    BUS_SPACE_MAXADDR,		/* lowaddr */
 	    BUS_SPACE_MAXADDR,		/* highaddr */
 	    NULL, NULL,			/* filter, filterarg */
 	    HN_TX_DATA_MAXSIZE,		/* maxsize */
 	    HN_TX_DATA_SEGCNT_MAX,	/* nsegments */
 	    HN_TX_DATA_SEGSIZE,		/* maxsegsize */
 	    0,				/* flags */
 	    NULL,			/* lockfunc */
 	    NULL,			/* lockfuncarg */
 	    &txr->hn_tx_data_dtag);
 	if (error) {
 		device_printf(dev, "failed to create data dmatag\n");
 		return error;
 	}
 
 	for (i = 0; i < txr->hn_txdesc_cnt; ++i) {
 		struct hn_txdesc *txd = &txr->hn_txdesc[i];
 
 		txd->txr = txr;
 		txd->chim_index = HN_NVS_CHIM_IDX_INVALID;
 		STAILQ_INIT(&txd->agg_list);
 
 		/*
 		 * Allocate and load RNDIS packet message.
 		 */
         	error = bus_dmamem_alloc(txr->hn_tx_rndis_dtag,
 		    (void **)&txd->rndis_pkt,
 		    BUS_DMA_WAITOK | BUS_DMA_COHERENT | BUS_DMA_ZERO,
 		    &txd->rndis_pkt_dmap);
 		if (error) {
 			device_printf(dev,
 			    "failed to allocate rndis_packet_msg, %d\n", i);
 			return error;
 		}
 
 		error = bus_dmamap_load(txr->hn_tx_rndis_dtag,
 		    txd->rndis_pkt_dmap,
 		    txd->rndis_pkt, HN_RNDIS_PKT_LEN,
 		    hyperv_dma_map_paddr, &txd->rndis_pkt_paddr,
 		    BUS_DMA_NOWAIT);
 		if (error) {
 			device_printf(dev,
 			    "failed to load rndis_packet_msg, %d\n", i);
 			bus_dmamem_free(txr->hn_tx_rndis_dtag,
 			    txd->rndis_pkt, txd->rndis_pkt_dmap);
 			return error;
 		}
 
 		/* DMA map for TX data. */
 		error = bus_dmamap_create(txr->hn_tx_data_dtag, 0,
 		    &txd->data_dmap);
 		if (error) {
 			device_printf(dev,
 			    "failed to allocate tx data dmamap\n");
 			bus_dmamap_unload(txr->hn_tx_rndis_dtag,
 			    txd->rndis_pkt_dmap);
 			bus_dmamem_free(txr->hn_tx_rndis_dtag,
 			    txd->rndis_pkt, txd->rndis_pkt_dmap);
 			return error;
 		}
 
 		/* All set, put it to list */
 		txd->flags |= HN_TXD_FLAG_ONLIST;
 #ifndef HN_USE_TXDESC_BUFRING
 		SLIST_INSERT_HEAD(&txr->hn_txlist, txd, link);
 #else
 		buf_ring_enqueue(txr->hn_txdesc_br, txd);
 #endif
 	}
 	txr->hn_txdesc_avail = txr->hn_txdesc_cnt;
 
 	if (sc->hn_tx_sysctl_tree != NULL) {
 		struct sysctl_oid_list *child;
 		struct sysctl_ctx_list *ctx;
 		char name[16];
 
 		/*
 		 * Create per TX ring sysctl tree:
 		 * dev.hn.UNIT.tx.RINGID
 		 */
 		ctx = device_get_sysctl_ctx(dev);
 		child = SYSCTL_CHILDREN(sc->hn_tx_sysctl_tree);
 
 		snprintf(name, sizeof(name), "%d", id);
 		txr->hn_tx_sysctl_tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO,
 		    name, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, "");
 
 		if (txr->hn_tx_sysctl_tree != NULL) {
 			child = SYSCTL_CHILDREN(txr->hn_tx_sysctl_tree);
 
 #ifdef HN_DEBUG
 			SYSCTL_ADD_INT(ctx, child, OID_AUTO, "txdesc_avail",
 			    CTLFLAG_RD, &txr->hn_txdesc_avail, 0,
 			    "# of available TX descs");
 #endif
 #ifdef HN_IFSTART_SUPPORT
 			if (!hn_use_if_start)
 #endif
 			{
 				SYSCTL_ADD_INT(ctx, child, OID_AUTO, "oactive",
 				    CTLFLAG_RD, &txr->hn_oactive, 0,
 				    "over active");
 			}
 			SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "packets",
 			    CTLFLAG_RW, &txr->hn_pkts,
 			    "# of packets transmitted");
 			SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "sends",
 			    CTLFLAG_RW, &txr->hn_sends, "# of sends");
 		}
 	}
 
 	return 0;
 }
 
 static void
 hn_txdesc_dmamap_destroy(struct hn_txdesc *txd)
 {
 	struct hn_tx_ring *txr = txd->txr;
 
 	KASSERT(txd->m == NULL, ("still has mbuf installed"));
 	KASSERT((txd->flags & HN_TXD_FLAG_DMAMAP) == 0, ("still dma mapped"));
 
 	bus_dmamap_unload(txr->hn_tx_rndis_dtag, txd->rndis_pkt_dmap);
 	bus_dmamem_free(txr->hn_tx_rndis_dtag, txd->rndis_pkt,
 	    txd->rndis_pkt_dmap);
 	bus_dmamap_destroy(txr->hn_tx_data_dtag, txd->data_dmap);
 }
 
 static void
 hn_txdesc_gc(struct hn_tx_ring *txr, struct hn_txdesc *txd)
 {
 
 	KASSERT(txd->refs == 0 || txd->refs == 1,
 	    ("invalid txd refs %d", txd->refs));
 
 	/* Aggregated txds will be freed by their aggregating txd. */
 	if (txd->refs > 0 && (txd->flags & HN_TXD_FLAG_ONAGG) == 0) {
 		int freed;
 
 		freed = hn_txdesc_put(txr, txd);
 		KASSERT(freed, ("can't free txdesc"));
 	}
 }
 
 static void
 hn_tx_ring_destroy(struct hn_tx_ring *txr)
 {
 	int i;
 
 	if (txr->hn_txdesc == NULL)
 		return;
 
 	/*
 	 * NOTE:
 	 * Because the freeing of aggregated txds will be deferred
 	 * to the aggregating txd, two passes are used here:
 	 * - The first pass GCes any pending txds.  This GC is necessary,
 	 *   since if the channels are revoked, hypervisor will not
 	 *   deliver send-done for all pending txds.
 	 * - The second pass frees the busdma stuffs, i.e. after all txds
 	 *   were freed.
 	 */
 	for (i = 0; i < txr->hn_txdesc_cnt; ++i)
 		hn_txdesc_gc(txr, &txr->hn_txdesc[i]);
 	for (i = 0; i < txr->hn_txdesc_cnt; ++i)
 		hn_txdesc_dmamap_destroy(&txr->hn_txdesc[i]);
 
 	if (txr->hn_tx_data_dtag != NULL)
 		bus_dma_tag_destroy(txr->hn_tx_data_dtag);
 	if (txr->hn_tx_rndis_dtag != NULL)
 		bus_dma_tag_destroy(txr->hn_tx_rndis_dtag);
 
 #ifdef HN_USE_TXDESC_BUFRING
 	buf_ring_free(txr->hn_txdesc_br, M_DEVBUF);
 #endif
 
 	free(txr->hn_txdesc, M_DEVBUF);
 	txr->hn_txdesc = NULL;
 
 	if (txr->hn_mbuf_br != NULL)
 		buf_ring_free(txr->hn_mbuf_br, M_DEVBUF);
 
 #ifndef HN_USE_TXDESC_BUFRING
 	mtx_destroy(&txr->hn_txlist_spin);
 #endif
 	mtx_destroy(&txr->hn_tx_lock);
 }
 
 static int
 hn_create_tx_data(struct hn_softc *sc, int ring_cnt)
 {
 	struct sysctl_oid_list *child;
 	struct sysctl_ctx_list *ctx;
 	int i;
 
 	/*
 	 * Create TXBUF for chimney sending.
 	 *
 	 * NOTE: It is shared by all channels.
 	 */
 	sc->hn_chim = hyperv_dmamem_alloc(bus_get_dma_tag(sc->hn_dev),
 	    PAGE_SIZE, 0, HN_CHIM_SIZE, &sc->hn_chim_dma,
 	    BUS_DMA_WAITOK | BUS_DMA_ZERO);
 	if (sc->hn_chim == NULL) {
 		device_printf(sc->hn_dev, "allocate txbuf failed\n");
 		return (ENOMEM);
 	}
 
 	sc->hn_tx_ring_cnt = ring_cnt;
 	sc->hn_tx_ring_inuse = sc->hn_tx_ring_cnt;
 
 	sc->hn_tx_ring = malloc(sizeof(struct hn_tx_ring) * sc->hn_tx_ring_cnt,
 	    M_DEVBUF, M_WAITOK | M_ZERO);
 
 	ctx = device_get_sysctl_ctx(sc->hn_dev);
 	child = SYSCTL_CHILDREN(device_get_sysctl_tree(sc->hn_dev));
 
 	/* Create dev.hn.UNIT.tx sysctl tree */
 	sc->hn_tx_sysctl_tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "tx",
 	    CTLFLAG_RD | CTLFLAG_MPSAFE, 0, "");
 
 	for (i = 0; i < sc->hn_tx_ring_cnt; ++i) {
 		int error;
 
 		error = hn_tx_ring_create(sc, i);
 		if (error)
 			return error;
 	}
 
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "no_txdescs",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_tx_ring, hn_no_txdescs),
 	    hn_tx_stat_ulong_sysctl, "LU", "# of times short of TX descs");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "send_failed",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_tx_ring, hn_send_failed),
 	    hn_tx_stat_ulong_sysctl, "LU", "# of hyper-v sending failure");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "txdma_failed",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_tx_ring, hn_txdma_failed),
 	    hn_tx_stat_ulong_sysctl, "LU", "# of TX DMA failure");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "agg_flush_failed",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_tx_ring, hn_flush_failed),
 	    hn_tx_stat_ulong_sysctl, "LU",
 	    "# of packet transmission aggregation flush failure");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_collapsed",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_tx_ring, hn_tx_collapsed),
 	    hn_tx_stat_ulong_sysctl, "LU", "# of TX mbuf collapsed");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_chimney",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_tx_ring, hn_tx_chimney),
 	    hn_tx_stat_ulong_sysctl, "LU", "# of chimney send");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_chimney_tried",
 	    CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_tx_ring, hn_tx_chimney_tried),
 	    hn_tx_stat_ulong_sysctl, "LU", "# of chimney send tries");
 	SYSCTL_ADD_INT(ctx, child, OID_AUTO, "txdesc_cnt",
 	    CTLFLAG_RD, &sc->hn_tx_ring[0].hn_txdesc_cnt, 0,
 	    "# of total TX descs");
 	SYSCTL_ADD_INT(ctx, child, OID_AUTO, "tx_chimney_max",
 	    CTLFLAG_RD, &sc->hn_chim_szmax, 0,
 	    "Chimney send packet size upper boundary");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_chimney_size",
 	    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0,
 	    hn_chim_size_sysctl, "I", "Chimney send packet size limit");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "direct_tx_size",
 	    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_tx_ring, hn_direct_tx_size),
 	    hn_tx_conf_int_sysctl, "I",
 	    "Size of the packet for direct transmission");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "sched_tx",
 	    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc,
 	    __offsetof(struct hn_tx_ring, hn_sched_tx),
 	    hn_tx_conf_int_sysctl, "I",
 	    "Always schedule transmission "
 	    "instead of doing direct transmission");
 	SYSCTL_ADD_INT(ctx, child, OID_AUTO, "tx_ring_cnt",
 	    CTLFLAG_RD, &sc->hn_tx_ring_cnt, 0, "# created TX rings");
 	SYSCTL_ADD_INT(ctx, child, OID_AUTO, "tx_ring_inuse",
 	    CTLFLAG_RD, &sc->hn_tx_ring_inuse, 0, "# used TX rings");
 	SYSCTL_ADD_INT(ctx, child, OID_AUTO, "agg_szmax",
 	    CTLFLAG_RD, &sc->hn_tx_ring[0].hn_agg_szmax, 0,
 	    "Applied packet transmission aggregation size");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "agg_pktmax",
 	    CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0,
 	    hn_txagg_pktmax_sysctl, "I",
 	    "Applied packet transmission aggregation packets");
 	SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "agg_align",
 	    CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0,
 	    hn_txagg_align_sysctl, "I",
 	    "Applied packet transmission aggregation alignment");
 
 	return 0;
 }
 
 static void
 hn_set_chim_size(struct hn_softc *sc, int chim_size)
 {
 	int i;
 
 	for (i = 0; i < sc->hn_tx_ring_cnt; ++i)
 		sc->hn_tx_ring[i].hn_chim_size = chim_size;
 }
 
 static void
 hn_set_tso_maxsize(struct hn_softc *sc, int tso_maxlen, int mtu)
 {
 	struct ifnet *ifp = sc->hn_ifp;
 	int tso_minlen;
 
 	if ((ifp->if_capabilities & (IFCAP_TSO4 | IFCAP_TSO6)) == 0)
 		return;
 
 	KASSERT(sc->hn_ndis_tso_sgmin >= 2,
 	    ("invalid NDIS tso sgmin %d", sc->hn_ndis_tso_sgmin));
 	tso_minlen = sc->hn_ndis_tso_sgmin * mtu;
 
 	KASSERT(sc->hn_ndis_tso_szmax >= tso_minlen &&
 	    sc->hn_ndis_tso_szmax <= IP_MAXPACKET,
 	    ("invalid NDIS tso szmax %d", sc->hn_ndis_tso_szmax));
 
 	if (tso_maxlen < tso_minlen)
 		tso_maxlen = tso_minlen;
 	else if (tso_maxlen > IP_MAXPACKET)
 		tso_maxlen = IP_MAXPACKET;
 	if (tso_maxlen > sc->hn_ndis_tso_szmax)
 		tso_maxlen = sc->hn_ndis_tso_szmax;
 	ifp->if_hw_tsomax = tso_maxlen - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
 	if (bootverbose)
 		if_printf(ifp, "TSO size max %u\n", ifp->if_hw_tsomax);
 }
 
 static void
 hn_fixup_tx_data(struct hn_softc *sc)
 {
 	uint64_t csum_assist;
 	int i;
 
 	hn_set_chim_size(sc, sc->hn_chim_szmax);
 	if (hn_tx_chimney_size > 0 &&
 	    hn_tx_chimney_size < sc->hn_chim_szmax)
 		hn_set_chim_size(sc, hn_tx_chimney_size);
 
 	csum_assist = 0;
 	if (sc->hn_caps & HN_CAP_IPCS)
 		csum_assist |= CSUM_IP;
 	if (sc->hn_caps & HN_CAP_TCP4CS)
 		csum_assist |= CSUM_IP_TCP;
 	if (sc->hn_caps & HN_CAP_UDP4CS)
 		csum_assist |= CSUM_IP_UDP;
 	if (sc->hn_caps & HN_CAP_TCP6CS)
 		csum_assist |= CSUM_IP6_TCP;
 	if (sc->hn_caps & HN_CAP_UDP6CS)
 		csum_assist |= CSUM_IP6_UDP;
 	for (i = 0; i < sc->hn_tx_ring_cnt; ++i)
 		sc->hn_tx_ring[i].hn_csum_assist = csum_assist;
 
 	if (sc->hn_caps & HN_CAP_HASHVAL) {
 		/*
 		 * Support HASHVAL pktinfo on TX path.
 		 */
 		if (bootverbose)
 			if_printf(sc->hn_ifp, "support HASHVAL pktinfo\n");
 		for (i = 0; i < sc->hn_tx_ring_cnt; ++i)
 			sc->hn_tx_ring[i].hn_tx_flags |= HN_TX_FLAG_HASHVAL;
 	}
 }
 
 static void
 hn_destroy_tx_data(struct hn_softc *sc)
 {
 	int i;
 
 	if (sc->hn_chim != NULL) {
 		if ((sc->hn_flags & HN_FLAG_CHIM_REF) == 0) {
 			hyperv_dmamem_free(&sc->hn_chim_dma, sc->hn_chim);
 		} else {
 			device_printf(sc->hn_dev,
 			    "chimney sending buffer is referenced");
 		}
 		sc->hn_chim = NULL;
 	}
 
 	if (sc->hn_tx_ring_cnt == 0)
 		return;
 
 	for (i = 0; i < sc->hn_tx_ring_cnt; ++i)
 		hn_tx_ring_destroy(&sc->hn_tx_ring[i]);
 
 	free(sc->hn_tx_ring, M_DEVBUF);
 	sc->hn_tx_ring = NULL;
 
 	sc->hn_tx_ring_cnt = 0;
 	sc->hn_tx_ring_inuse = 0;
 }
 
 #ifdef HN_IFSTART_SUPPORT
 
 static void
 hn_start_taskfunc(void *xtxr, int pending __unused)
 {
 	struct hn_tx_ring *txr = xtxr;
 
 	mtx_lock(&txr->hn_tx_lock);
 	hn_start_locked(txr, 0);
 	mtx_unlock(&txr->hn_tx_lock);
 }
 
 static int
 hn_start_locked(struct hn_tx_ring *txr, int len)
 {
 	struct hn_softc *sc = txr->hn_sc;
 	struct ifnet *ifp = sc->hn_ifp;
 	int sched = 0;
 
 	KASSERT(hn_use_if_start,
 	    ("hn_start_locked is called, when if_start is disabled"));
 	KASSERT(txr == &sc->hn_tx_ring[0], ("not the first TX ring"));
 	mtx_assert(&txr->hn_tx_lock, MA_OWNED);
 	KASSERT(txr->hn_agg_txd == NULL, ("lingering aggregating txdesc"));
 
 	if (__predict_false(txr->hn_suspended))
 		return (0);
 
 	if ((ifp->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) !=
 	    IFF_DRV_RUNNING)
 		return (0);
 
 	while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) {
 		struct hn_txdesc *txd;
 		struct mbuf *m_head;
 		int error;
 
 		IFQ_DRV_DEQUEUE(&ifp->if_snd, m_head);
 		if (m_head == NULL)
 			break;
 
 		if (len > 0 && m_head->m_pkthdr.len > len) {
 			/*
 			 * This sending could be time consuming; let callers
 			 * dispatch this packet sending (and sending of any
 			 * following up packets) to tx taskqueue.
 			 */
 			IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
 			sched = 1;
 			break;
 		}
 
 #if defined(INET6) || defined(INET)
 		if (m_head->m_pkthdr.csum_flags & CSUM_TSO) {
 			m_head = hn_tso_fixup(m_head);
 			if (__predict_false(m_head == NULL)) {
 				if_inc_counter(ifp, IFCOUNTER_OERRORS, 1);
 				continue;
 			}
 		}
 #endif
 
 		txd = hn_txdesc_get(txr);
 		if (txd == NULL) {
 			txr->hn_no_txdescs++;
 			IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
 			atomic_set_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE);
 			break;
 		}
 
 		error = hn_encap(ifp, txr, txd, &m_head);
 		if (error) {
 			/* Both txd and m_head are freed */
 			KASSERT(txr->hn_agg_txd == NULL,
 			    ("encap failed w/ pending aggregating txdesc"));
 			continue;
 		}
 
 		if (txr->hn_agg_pktleft == 0) {
 			if (txr->hn_agg_txd != NULL) {
 				KASSERT(m_head == NULL,
 				    ("pending mbuf for aggregating txdesc"));
 				error = hn_flush_txagg(ifp, txr);
 				if (__predict_false(error)) {
 					atomic_set_int(&ifp->if_drv_flags,
 					    IFF_DRV_OACTIVE);
 					break;
 				}
 			} else {
 				KASSERT(m_head != NULL, ("mbuf was freed"));
 				error = hn_txpkt(ifp, txr, txd);
 				if (__predict_false(error)) {
 					/* txd is freed, but m_head is not */
 					IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
 					atomic_set_int(&ifp->if_drv_flags,
 					    IFF_DRV_OACTIVE);
 					break;
 				}
 			}
 		}
 #ifdef INVARIANTS
 		else {
 			KASSERT(txr->hn_agg_txd != NULL,
 			    ("no aggregating txdesc"));
 			KASSERT(m_head == NULL,
 			    ("pending mbuf for aggregating txdesc"));
 		}
 #endif
 	}
 
 	/* Flush pending aggerated transmission. */
 	if (txr->hn_agg_txd != NULL)
 		hn_flush_txagg(ifp, txr);
 	return (sched);
 }
 
 static void
 hn_start(struct ifnet *ifp)
 {
 	struct hn_softc *sc = ifp->if_softc;
 	struct hn_tx_ring *txr = &sc->hn_tx_ring[0];
 
 	if (txr->hn_sched_tx)
 		goto do_sched;
 
 	if (mtx_trylock(&txr->hn_tx_lock)) {
 		int sched;
 
 		sched = hn_start_locked(txr, txr->hn_direct_tx_size);
 		mtx_unlock(&txr->hn_tx_lock);
 		if (!sched)
 			return;
 	}
 do_sched:
 	taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_tx_task);
 }
 
 static void
 hn_start_txeof_taskfunc(void *xtxr, int pending __unused)
 {
 	struct hn_tx_ring *txr = xtxr;
 
 	mtx_lock(&txr->hn_tx_lock);
 	atomic_clear_int(&txr->hn_sc->hn_ifp->if_drv_flags, IFF_DRV_OACTIVE);
 	hn_start_locked(txr, 0);
 	mtx_unlock(&txr->hn_tx_lock);
 }
 
 static void
 hn_start_txeof(struct hn_tx_ring *txr)
 {
 	struct hn_softc *sc = txr->hn_sc;
 	struct ifnet *ifp = sc->hn_ifp;
 
 	KASSERT(txr == &sc->hn_tx_ring[0], ("not the first TX ring"));
 
 	if (txr->hn_sched_tx)
 		goto do_sched;
 
 	if (mtx_trylock(&txr->hn_tx_lock)) {
 		int sched;
 
 		atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE);
 		sched = hn_start_locked(txr, txr->hn_direct_tx_size);
 		mtx_unlock(&txr->hn_tx_lock);
 		if (sched) {
 			taskqueue_enqueue(txr->hn_tx_taskq,
 			    &txr->hn_tx_task);
 		}
 	} else {
 do_sched:
 		/*
 		 * Release the OACTIVE earlier, with the hope, that
 		 * others could catch up.  The task will clear the
 		 * flag again with the hn_tx_lock to avoid possible
 		 * races.
 		 */
 		atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE);
 		taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_txeof_task);
 	}
 }
 
 #endif	/* HN_IFSTART_SUPPORT */
 
 static int
 hn_xmit(struct hn_tx_ring *txr, int len)
 {
 	struct hn_softc *sc = txr->hn_sc;
 	struct ifnet *ifp = sc->hn_ifp;
 	struct mbuf *m_head;
 	int sched = 0;
 
 	mtx_assert(&txr->hn_tx_lock, MA_OWNED);
 #ifdef HN_IFSTART_SUPPORT
 	KASSERT(hn_use_if_start == 0,
 	    ("hn_xmit is called, when if_start is enabled"));
 #endif
 	KASSERT(txr->hn_agg_txd == NULL, ("lingering aggregating txdesc"));
 
 	if (__predict_false(txr->hn_suspended))
 		return (0);
 
 	if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0 || txr->hn_oactive)
 		return (0);
 
 	while ((m_head = drbr_peek(ifp, txr->hn_mbuf_br)) != NULL) {
 		struct hn_txdesc *txd;
 		int error;
 
 		if (len > 0 && m_head->m_pkthdr.len > len) {
 			/*
 			 * This sending could be time consuming; let callers
 			 * dispatch this packet sending (and sending of any
 			 * following up packets) to tx taskqueue.
 			 */
 			drbr_putback(ifp, txr->hn_mbuf_br, m_head);
 			sched = 1;
 			break;
 		}
 
 		txd = hn_txdesc_get(txr);
 		if (txd == NULL) {
 			txr->hn_no_txdescs++;
 			drbr_putback(ifp, txr->hn_mbuf_br, m_head);
 			txr->hn_oactive = 1;
 			break;
 		}
 
 		error = hn_encap(ifp, txr, txd, &m_head);
 		if (error) {
 			/* Both txd and m_head are freed; discard */
 			KASSERT(txr->hn_agg_txd == NULL,
 			    ("encap failed w/ pending aggregating txdesc"));
 			drbr_advance(ifp, txr->hn_mbuf_br);
 			continue;
 		}
 
 		if (txr->hn_agg_pktleft == 0) {
 			if (txr->hn_agg_txd != NULL) {
 				KASSERT(m_head == NULL,
 				    ("pending mbuf for aggregating txdesc"));
 				error = hn_flush_txagg(ifp, txr);
 				if (__predict_false(error)) {
 					txr->hn_oactive = 1;
 					break;
 				}
 			} else {
 				KASSERT(m_head != NULL, ("mbuf was freed"));
 				error = hn_txpkt(ifp, txr, txd);
 				if (__predict_false(error)) {
 					/* txd is freed, but m_head is not */
 					drbr_putback(ifp, txr->hn_mbuf_br,
 					    m_head);
 					txr->hn_oactive = 1;
 					break;
 				}
 			}
 		}
 #ifdef INVARIANTS
 		else {
 			KASSERT(txr->hn_agg_txd != NULL,
 			    ("no aggregating txdesc"));
 			KASSERT(m_head == NULL,
 			    ("pending mbuf for aggregating txdesc"));
 		}
 #endif
 
 		/* Sent */
 		drbr_advance(ifp, txr->hn_mbuf_br);
 	}
 
 	/* Flush pending aggerated transmission. */
 	if (txr->hn_agg_txd != NULL)
 		hn_flush_txagg(ifp, txr);
 	return (sched);
 }
 
 static int
 hn_transmit(struct ifnet *ifp, struct mbuf *m)
 {
 	struct hn_softc *sc = ifp->if_softc;
 	struct hn_tx_ring *txr;
 	int error, idx = 0;
 
 #if defined(INET6) || defined(INET)
 	/*
 	 * Perform TSO packet header fixup now, since the TSO
 	 * packet header should be cache-hot.
 	 */
 	if (m->m_pkthdr.csum_flags & CSUM_TSO) {
 		m = hn_tso_fixup(m);
 		if (__predict_false(m == NULL)) {
 			if_inc_counter(ifp, IFCOUNTER_OERRORS, 1);
 			return EIO;
 		}
 	}
 #endif
 
 	/*
 	 * Select the TX ring based on flowid
 	 */
 	if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) {
 #ifdef RSS
 		uint32_t bid;
 
 		if (rss_hash2bucket(m->m_pkthdr.flowid, M_HASHTYPE_GET(m),
 		    &bid) == 0)
 			idx = bid % sc->hn_tx_ring_inuse;
 		else
 #endif
 			idx = m->m_pkthdr.flowid % sc->hn_tx_ring_inuse;
 	}
 	txr = &sc->hn_tx_ring[idx];
 
 	error = drbr_enqueue(ifp, txr->hn_mbuf_br, m);
 	if (error) {
 		if_inc_counter(ifp, IFCOUNTER_OQDROPS, 1);
 		return error;
 	}
 
 	if (txr->hn_oactive)
 		return 0;
 
 	if (txr->hn_sched_tx)
 		goto do_sched;
 
 	if (mtx_trylock(&txr->hn_tx_lock)) {
 		int sched;
 
 		sched = hn_xmit(txr, txr->hn_direct_tx_size);
 		mtx_unlock(&txr->hn_tx_lock);
 		if (!sched)
 			return 0;
 	}
 do_sched:
 	taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_tx_task);
 	return 0;
 }
 
 static void
 hn_tx_ring_qflush(struct hn_tx_ring *txr)
 {
 	struct mbuf *m;
 
 	mtx_lock(&txr->hn_tx_lock);
 	while ((m = buf_ring_dequeue_sc(txr->hn_mbuf_br)) != NULL)
 		m_freem(m);
 	mtx_unlock(&txr->hn_tx_lock);
 }
 
 static void
 hn_xmit_qflush(struct ifnet *ifp)
 {
 	struct hn_softc *sc = ifp->if_softc;
 	int i;
 
 	for (i = 0; i < sc->hn_tx_ring_inuse; ++i)
 		hn_tx_ring_qflush(&sc->hn_tx_ring[i]);
 	if_qflush(ifp);
 }
 
 static void
 hn_xmit_txeof(struct hn_tx_ring *txr)
 {
 
 	if (txr->hn_sched_tx)
 		goto do_sched;
 
 	if (mtx_trylock(&txr->hn_tx_lock)) {
 		int sched;
 
 		txr->hn_oactive = 0;
 		sched = hn_xmit(txr, txr->hn_direct_tx_size);
 		mtx_unlock(&txr->hn_tx_lock);
 		if (sched) {
 			taskqueue_enqueue(txr->hn_tx_taskq,
 			    &txr->hn_tx_task);
 		}
 	} else {
 do_sched:
 		/*
 		 * Release the oactive earlier, with the hope, that
 		 * others could catch up.  The task will clear the
 		 * oactive again with the hn_tx_lock to avoid possible
 		 * races.
 		 */
 		txr->hn_oactive = 0;
 		taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_txeof_task);
 	}
 }
 
 static void
 hn_xmit_taskfunc(void *xtxr, int pending __unused)
 {
 	struct hn_tx_ring *txr = xtxr;
 
 	mtx_lock(&txr->hn_tx_lock);
 	hn_xmit(txr, 0);
 	mtx_unlock(&txr->hn_tx_lock);
 }
 
 static void
 hn_xmit_txeof_taskfunc(void *xtxr, int pending __unused)
 {
 	struct hn_tx_ring *txr = xtxr;
 
 	mtx_lock(&txr->hn_tx_lock);
 	txr->hn_oactive = 0;
 	hn_xmit(txr, 0);
 	mtx_unlock(&txr->hn_tx_lock);
 }
 
 static int
 hn_chan_attach(struct hn_softc *sc, struct vmbus_channel *chan)
 {
 	struct vmbus_chan_br cbr;
 	struct hn_rx_ring *rxr;
 	struct hn_tx_ring *txr = NULL;
 	int idx, error;
 
 	idx = vmbus_chan_subidx(chan);
 
 	/*
 	 * Link this channel to RX/TX ring.
 	 */
 	KASSERT(idx >= 0 && idx < sc->hn_rx_ring_inuse,
 	    ("invalid channel index %d, should > 0 && < %d",
 	     idx, sc->hn_rx_ring_inuse));
 	rxr = &sc->hn_rx_ring[idx];
 	KASSERT((rxr->hn_rx_flags & HN_RX_FLAG_ATTACHED) == 0,
 	    ("RX ring %d already attached", idx));
 	rxr->hn_rx_flags |= HN_RX_FLAG_ATTACHED;
 	rxr->hn_chan = chan;
 
 	if (bootverbose) {
 		if_printf(sc->hn_ifp, "link RX ring %d to chan%u\n",
 		    idx, vmbus_chan_id(chan));
 	}
 
 	if (idx < sc->hn_tx_ring_inuse) {
 		txr = &sc->hn_tx_ring[idx];
 		KASSERT((txr->hn_tx_flags & HN_TX_FLAG_ATTACHED) == 0,
 		    ("TX ring %d already attached", idx));
 		txr->hn_tx_flags |= HN_TX_FLAG_ATTACHED;
 
 		txr->hn_chan = chan;
 		if (bootverbose) {
 			if_printf(sc->hn_ifp, "link TX ring %d to chan%u\n",
 			    idx, vmbus_chan_id(chan));
 		}
 	}
 
 	/* Bind this channel to a proper CPU. */
 	vmbus_chan_cpu_set(chan, HN_RING_IDX2CPU(sc, idx));
 
 	/*
 	 * Open this channel
 	 */
 	cbr.cbr = rxr->hn_br;
 	cbr.cbr_paddr = rxr->hn_br_dma.hv_paddr;
 	cbr.cbr_txsz = HN_TXBR_SIZE;
 	cbr.cbr_rxsz = HN_RXBR_SIZE;
 	error = vmbus_chan_open_br(chan, &cbr, NULL, 0, hn_chan_callback, rxr);
 	if (error) {
 		if (error == EISCONN) {
 			if_printf(sc->hn_ifp, "bufring is connected after "
 			    "chan%u open failure\n", vmbus_chan_id(chan));
 			rxr->hn_rx_flags |= HN_RX_FLAG_BR_REF;
 		} else {
 			if_printf(sc->hn_ifp, "open chan%u failed: %d\n",
 			    vmbus_chan_id(chan), error);
 		}
 	}
 	return (error);
 }
 
 static void
 hn_chan_detach(struct hn_softc *sc, struct vmbus_channel *chan)
 {
 	struct hn_rx_ring *rxr;
 	int idx, error;
 
 	idx = vmbus_chan_subidx(chan);
 
 	/*
 	 * Link this channel to RX/TX ring.
 	 */
 	KASSERT(idx >= 0 && idx < sc->hn_rx_ring_inuse,
 	    ("invalid channel index %d, should > 0 && < %d",
 	     idx, sc->hn_rx_ring_inuse));
 	rxr = &sc->hn_rx_ring[idx];
 	KASSERT((rxr->hn_rx_flags & HN_RX_FLAG_ATTACHED),
 	    ("RX ring %d is not attached", idx));
 	rxr->hn_rx_flags &= ~HN_RX_FLAG_ATTACHED;
 
 	if (idx < sc->hn_tx_ring_inuse) {
 		struct hn_tx_ring *txr = &sc->hn_tx_ring[idx];
 
 		KASSERT((txr->hn_tx_flags & HN_TX_FLAG_ATTACHED),
 		    ("TX ring %d is not attached attached", idx));
 		txr->hn_tx_flags &= ~HN_TX_FLAG_ATTACHED;
 	}
 
 	/*
 	 * Close this channel.
 	 *
 	 * NOTE:
 	 * Channel closing does _not_ destroy the target channel.
 	 */
 	error = vmbus_chan_close_direct(chan);
 	if (error == EISCONN) {
 		if_printf(sc->hn_ifp, "chan%u bufring is connected "
 		    "after being closed\n", vmbus_chan_id(chan));
 		rxr->hn_rx_flags |= HN_RX_FLAG_BR_REF;
 	} else if (error) {
 		if_printf(sc->hn_ifp, "chan%u close failed: %d\n",
 		    vmbus_chan_id(chan), error);
 	}
 }
 
 static int
 hn_attach_subchans(struct hn_softc *sc)
 {
 	struct vmbus_channel **subchans;
 	int subchan_cnt = sc->hn_rx_ring_inuse - 1;
 	int i, error = 0;
 
 	KASSERT(subchan_cnt > 0, ("no sub-channels"));
 
 	/* Attach the sub-channels. */
 	subchans = vmbus_subchan_get(sc->hn_prichan, subchan_cnt);
 	for (i = 0; i < subchan_cnt; ++i) {
 		int error1;
 
 		error1 = hn_chan_attach(sc, subchans[i]);
 		if (error1) {
 			error = error1;
 			/* Move on; all channels will be detached later. */
 		}
 	}
 	vmbus_subchan_rel(subchans, subchan_cnt);
 
 	if (error) {
 		if_printf(sc->hn_ifp, "sub-channels attach failed: %d\n", error);
 	} else {
 		if (bootverbose) {
 			if_printf(sc->hn_ifp, "%d sub-channels attached\n",
 			    subchan_cnt);
 		}
 	}
 	return (error);
 }
 
 static void
 hn_detach_allchans(struct hn_softc *sc)
 {
 	struct vmbus_channel **subchans;
 	int subchan_cnt = sc->hn_rx_ring_inuse - 1;
 	int i;
 
 	if (subchan_cnt == 0)
 		goto back;
 
 	/* Detach the sub-channels. */
 	subchans = vmbus_subchan_get(sc->hn_prichan, subchan_cnt);
 	for (i = 0; i < subchan_cnt; ++i)
 		hn_chan_detach(sc, subchans[i]);
 	vmbus_subchan_rel(subchans, subchan_cnt);
 
 back:
 	/*
 	 * Detach the primary channel, _after_ all sub-channels
 	 * are detached.
 	 */
 	hn_chan_detach(sc, sc->hn_prichan);
 
 	/* Wait for sub-channels to be destroyed, if any. */
 	vmbus_subchan_drain(sc->hn_prichan);
 
 #ifdef INVARIANTS
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		KASSERT((sc->hn_rx_ring[i].hn_rx_flags &
 		    HN_RX_FLAG_ATTACHED) == 0,
 		    ("%dth RX ring is still attached", i));
 	}
 	for (i = 0; i < sc->hn_tx_ring_cnt; ++i) {
 		KASSERT((sc->hn_tx_ring[i].hn_tx_flags &
 		    HN_TX_FLAG_ATTACHED) == 0,
 		    ("%dth TX ring is still attached", i));
 	}
 #endif
 }
 
 static int
 hn_synth_alloc_subchans(struct hn_softc *sc, int *nsubch)
 {
 	struct vmbus_channel **subchans;
 	int nchan, rxr_cnt, error;
 
 	nchan = *nsubch + 1;
 	if (nchan == 1) {
 		/*
 		 * Multiple RX/TX rings are not requested.
 		 */
 		*nsubch = 0;
 		return (0);
 	}
 
 	/*
 	 * Query RSS capabilities, e.g. # of RX rings, and # of indirect
 	 * table entries.
 	 */
 	error = hn_rndis_query_rsscaps(sc, &rxr_cnt);
 	if (error) {
 		/* No RSS; this is benign. */
 		*nsubch = 0;
 		return (0);
 	}
 	if (bootverbose) {
 		if_printf(sc->hn_ifp, "RX rings offered %u, requested %d\n",
 		    rxr_cnt, nchan);
 	}
 
 	if (nchan > rxr_cnt)
 		nchan = rxr_cnt;
 	if (nchan == 1) {
 		if_printf(sc->hn_ifp, "only 1 channel is supported, no vRSS\n");
 		*nsubch = 0;
 		return (0);
 	}
 
 	/*
 	 * Allocate sub-channels from NVS.
 	 */
 	*nsubch = nchan - 1;
 	error = hn_nvs_alloc_subchans(sc, nsubch);
 	if (error || *nsubch == 0) {
 		/* Failed to allocate sub-channels. */
 		*nsubch = 0;
 		return (0);
 	}
 
 	/*
 	 * Wait for all sub-channels to become ready before moving on.
 	 */
 	subchans = vmbus_subchan_get(sc->hn_prichan, *nsubch);
 	vmbus_subchan_rel(subchans, *nsubch);
 	return (0);
 }
 
 static bool
 hn_synth_attachable(const struct hn_softc *sc)
 {
 	int i;
 
 	if (sc->hn_flags & HN_FLAG_ERRORS)
 		return (false);
 
 	for (i = 0; i < sc->hn_rx_ring_cnt; ++i) {
 		const struct hn_rx_ring *rxr = &sc->hn_rx_ring[i];
 
 		if (rxr->hn_rx_flags & HN_RX_FLAG_BR_REF)
 			return (false);
 	}
 	return (true);
 }
 
 static int
 hn_synth_attach(struct hn_softc *sc, int mtu)
 {
 #define ATTACHED_NVS		0x0002
 #define ATTACHED_RNDIS		0x0004
 
 	struct ndis_rssprm_toeplitz *rss = &sc->hn_rss;
 	int error, nsubch, nchan, i;
 	uint32_t old_caps, attached = 0;
 
 	KASSERT((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0,
 	    ("synthetic parts were attached"));
 
 	if (!hn_synth_attachable(sc))
 		return (ENXIO);
 
 	/* Save capabilities for later verification. */
 	old_caps = sc->hn_caps;
 	sc->hn_caps = 0;
 
 	/* Clear RSS stuffs. */
 	sc->hn_rss_ind_size = 0;
 	sc->hn_rss_hash = 0;
 
 	/*
 	 * Attach the primary channel _before_ attaching NVS and RNDIS.
 	 */
 	error = hn_chan_attach(sc, sc->hn_prichan);
 	if (error)
 		goto failed;
 
 	/*
 	 * Attach NVS.
 	 */
 	error = hn_nvs_attach(sc, mtu);
 	if (error)
 		goto failed;
 	attached |= ATTACHED_NVS;
 
 	/*
 	 * Attach RNDIS _after_ NVS is attached.
 	 */
 	error = hn_rndis_attach(sc, mtu);
 	if (error)
 		goto failed;
 	attached |= ATTACHED_RNDIS;
 
 	/*
 	 * Make sure capabilities are not changed.
 	 */
 	if (device_is_attached(sc->hn_dev) && old_caps != sc->hn_caps) {
 		if_printf(sc->hn_ifp, "caps mismatch old 0x%08x, new 0x%08x\n",
 		    old_caps, sc->hn_caps);
 		error = ENXIO;
 		goto failed;
 	}
 
 	/*
 	 * Allocate sub-channels for multi-TX/RX rings.
 	 *
 	 * NOTE:
 	 * The # of RX rings that can be used is equivalent to the # of
 	 * channels to be requested.
 	 */
 	nsubch = sc->hn_rx_ring_cnt - 1;
 	error = hn_synth_alloc_subchans(sc, &nsubch);
 	if (error)
 		goto failed;
 	/* NOTE: _Full_ synthetic parts detach is required now. */
 	sc->hn_flags |= HN_FLAG_SYNTH_ATTACHED;
 
 	/*
 	 * Set the # of TX/RX rings that could be used according to
 	 * the # of channels that NVS offered.
 	 */
 	nchan = nsubch + 1;
 	hn_set_ring_inuse(sc, nchan);
 	if (nchan == 1) {
 		/* Only the primary channel can be used; done */
 		goto back;
 	}
 
 	/*
 	 * Attach the sub-channels.
 	 *
 	 * NOTE: hn_set_ring_inuse() _must_ have been called.
 	 */
 	error = hn_attach_subchans(sc);
 	if (error)
 		goto failed;
 
 	/*
 	 * Configure RSS key and indirect table _after_ all sub-channels
 	 * are attached.
 	 */
 	if ((sc->hn_flags & HN_FLAG_HAS_RSSKEY) == 0) {
 		/*
 		 * RSS key is not set yet; set it to the default RSS key.
 		 */
 		if (bootverbose)
 			if_printf(sc->hn_ifp, "setup default RSS key\n");
 #ifdef RSS
 		rss_getkey(rss->rss_key);
 #else
 		memcpy(rss->rss_key, hn_rss_key_default, sizeof(rss->rss_key));
 #endif
 		sc->hn_flags |= HN_FLAG_HAS_RSSKEY;
 	}
 
 	if ((sc->hn_flags & HN_FLAG_HAS_RSSIND) == 0) {
 		/*
 		 * RSS indirect table is not set yet; set it up in round-
 		 * robin fashion.
 		 */
 		if (bootverbose) {
 			if_printf(sc->hn_ifp, "setup default RSS indirect "
 			    "table\n");
 		}
 		for (i = 0; i < NDIS_HASH_INDCNT; ++i) {
 			uint32_t subidx;
 
 #ifdef RSS
 			subidx = rss_get_indirection_to_bucket(i);
 #else
 			subidx = i;
 #endif
 			rss->rss_ind[i] = subidx % nchan;
 		}
 		sc->hn_flags |= HN_FLAG_HAS_RSSIND;
 	} else {
 		/*
 		 * # of usable channels may be changed, so we have to
 		 * make sure that all entries in RSS indirect table
 		 * are valid.
 		 *
 		 * NOTE: hn_set_ring_inuse() _must_ have been called.
 		 */
 		hn_rss_ind_fixup(sc);
 	}
 
 	error = hn_rndis_conf_rss(sc, NDIS_RSS_FLAG_NONE);
 	if (error)
 		goto failed;
 back:
 	/*
 	 * Fixup transmission aggregation setup.
 	 */
 	hn_set_txagg(sc);
 	return (0);
 
 failed:
 	if (sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) {
 		hn_synth_detach(sc);
 	} else {
 		if (attached & ATTACHED_RNDIS)
 			hn_rndis_detach(sc);
 		if (attached & ATTACHED_NVS)
 			hn_nvs_detach(sc);
 		hn_chan_detach(sc, sc->hn_prichan);
 		/* Restore old capabilities. */
 		sc->hn_caps = old_caps;
 	}
 	return (error);
 
 #undef ATTACHED_RNDIS
 #undef ATTACHED_NVS
 }
 
 /*
  * NOTE:
  * The interface must have been suspended though hn_suspend(), before
  * this function get called.
  */
 static void
 hn_synth_detach(struct hn_softc *sc)
 {
 
 	KASSERT(sc->hn_flags & HN_FLAG_SYNTH_ATTACHED,
 	    ("synthetic parts were not attached"));
 
 	/* Detach the RNDIS first. */
 	hn_rndis_detach(sc);
 
 	/* Detach NVS. */
 	hn_nvs_detach(sc);
 
 	/* Detach all of the channels. */
 	hn_detach_allchans(sc);
 
 	sc->hn_flags &= ~HN_FLAG_SYNTH_ATTACHED;
 }
 
 static void
 hn_set_ring_inuse(struct hn_softc *sc, int ring_cnt)
 {
 	KASSERT(ring_cnt > 0 && ring_cnt <= sc->hn_rx_ring_cnt,
 	    ("invalid ring count %d", ring_cnt));
 
 	if (sc->hn_tx_ring_cnt > ring_cnt)
 		sc->hn_tx_ring_inuse = ring_cnt;
 	else
 		sc->hn_tx_ring_inuse = sc->hn_tx_ring_cnt;
 	sc->hn_rx_ring_inuse = ring_cnt;
 
 #ifdef RSS
 	if (sc->hn_rx_ring_inuse != rss_getnumbuckets()) {
 		if_printf(sc->hn_ifp, "# of RX rings (%d) does not match "
 		    "# of RSS buckets (%d)\n", sc->hn_rx_ring_inuse,
 		    rss_getnumbuckets());
 	}
 #endif
 
 	if (bootverbose) {
 		if_printf(sc->hn_ifp, "%d TX ring, %d RX ring\n",
 		    sc->hn_tx_ring_inuse, sc->hn_rx_ring_inuse);
 	}
 }
 
 static void
 hn_chan_drain(struct hn_softc *sc, struct vmbus_channel *chan)
 {
 
 	/*
 	 * NOTE:
 	 * The TX bufring will not be drained by the hypervisor,
 	 * if the primary channel is revoked.
 	 */
 	while (!vmbus_chan_rx_empty(chan) ||
 	    (!vmbus_chan_is_revoked(sc->hn_prichan) &&
 	     !vmbus_chan_tx_empty(chan)))
 		pause("waitch", 1);
 	vmbus_chan_intr_drain(chan);
 }
 
 static void
 hn_suspend_data(struct hn_softc *sc)
 {
 	struct vmbus_channel **subch = NULL;
 	struct hn_tx_ring *txr;
 	int i, nsubch;
 
 	HN_LOCK_ASSERT(sc);
 
 	/*
 	 * Suspend TX.
 	 */
 	for (i = 0; i < sc->hn_tx_ring_inuse; ++i) {
 		txr = &sc->hn_tx_ring[i];
 
 		mtx_lock(&txr->hn_tx_lock);
 		txr->hn_suspended = 1;
 		mtx_unlock(&txr->hn_tx_lock);
 		/* No one is able send more packets now. */
 
 		/*
 		 * Wait for all pending sends to finish.
 		 *
 		 * NOTE:
 		 * We will _not_ receive all pending send-done, if the
 		 * primary channel is revoked.
 		 */
 		while (hn_tx_ring_pending(txr) &&
 		    !vmbus_chan_is_revoked(sc->hn_prichan))
 			pause("hnwtx", 1 /* 1 tick */);
 	}
 
 	/*
 	 * Disable RX by clearing RX filter.
 	 */
 	hn_set_rxfilter(sc, NDIS_PACKET_TYPE_NONE);
 
 	/*
 	 * Give RNDIS enough time to flush all pending data packets.
 	 */
 	pause("waitrx", (200 * hz) / 1000);
 
 	/*
 	 * Drain RX/TX bufrings and interrupts.
 	 */
 	nsubch = sc->hn_rx_ring_inuse - 1;
 	if (nsubch > 0)
 		subch = vmbus_subchan_get(sc->hn_prichan, nsubch);
 
 	if (subch != NULL) {
 		for (i = 0; i < nsubch; ++i)
 			hn_chan_drain(sc, subch[i]);
 	}
 	hn_chan_drain(sc, sc->hn_prichan);
 
 	if (subch != NULL)
 		vmbus_subchan_rel(subch, nsubch);
 
 	/*
 	 * Drain any pending TX tasks.
 	 *
 	 * NOTE:
 	 * The above hn_chan_drain() can dispatch TX tasks, so the TX
 	 * tasks will have to be drained _after_ the above hn_chan_drain()
 	 * calls.
 	 */
 	for (i = 0; i < sc->hn_tx_ring_inuse; ++i) {
 		txr = &sc->hn_tx_ring[i];
 
 		taskqueue_drain(txr->hn_tx_taskq, &txr->hn_tx_task);
 		taskqueue_drain(txr->hn_tx_taskq, &txr->hn_txeof_task);
 	}
 }
 
 static void
 hn_suspend_mgmt_taskfunc(void *xsc, int pending __unused)
 {
 
 	((struct hn_softc *)xsc)->hn_mgmt_taskq = NULL;
 }
 
 static void
 hn_suspend_mgmt(struct hn_softc *sc)
 {
 	struct task task;
 
 	HN_LOCK_ASSERT(sc);
 
 	/*
 	 * Make sure that hn_mgmt_taskq0 can nolonger be accessed
 	 * through hn_mgmt_taskq.
 	 */
 	TASK_INIT(&task, 0, hn_suspend_mgmt_taskfunc, sc);
 	vmbus_chan_run_task(sc->hn_prichan, &task);
 
 	/*
 	 * Make sure that all pending management tasks are completed.
 	 */
 	taskqueue_drain(sc->hn_mgmt_taskq0, &sc->hn_netchg_init);
 	taskqueue_drain_timeout(sc->hn_mgmt_taskq0, &sc->hn_netchg_status);
 	taskqueue_drain_all(sc->hn_mgmt_taskq0);
 }
 
 static void
 hn_suspend(struct hn_softc *sc)
 {
 
 	/* Disable polling. */
 	hn_polling(sc, 0);
 
 	if ((sc->hn_ifp->if_drv_flags & IFF_DRV_RUNNING) ||
 	    (sc->hn_flags & HN_FLAG_VF))
 		hn_suspend_data(sc);
 	hn_suspend_mgmt(sc);
 }
 
 static void
 hn_resume_tx(struct hn_softc *sc, int tx_ring_cnt)
 {
 	int i;
 
 	KASSERT(tx_ring_cnt <= sc->hn_tx_ring_cnt,
 	    ("invalid TX ring count %d", tx_ring_cnt));
 
 	for (i = 0; i < tx_ring_cnt; ++i) {
 		struct hn_tx_ring *txr = &sc->hn_tx_ring[i];
 
 		mtx_lock(&txr->hn_tx_lock);
 		txr->hn_suspended = 0;
 		mtx_unlock(&txr->hn_tx_lock);
 	}
 }
 
 static void
 hn_resume_data(struct hn_softc *sc)
 {
 	int i;
 
 	HN_LOCK_ASSERT(sc);
 
 	/*
 	 * Re-enable RX.
 	 */
 	hn_rxfilter_config(sc);
 
 	/*
 	 * Make sure to clear suspend status on "all" TX rings,
 	 * since hn_tx_ring_inuse can be changed after
 	 * hn_suspend_data().
 	 */
 	hn_resume_tx(sc, sc->hn_tx_ring_cnt);
 
 #ifdef HN_IFSTART_SUPPORT
 	if (!hn_use_if_start)
 #endif
 	{
 		/*
 		 * Flush unused drbrs, since hn_tx_ring_inuse may be
 		 * reduced.
 		 */
 		for (i = sc->hn_tx_ring_inuse; i < sc->hn_tx_ring_cnt; ++i)
 			hn_tx_ring_qflush(&sc->hn_tx_ring[i]);
 	}
 
 	/*
 	 * Kick start TX.
 	 */
 	for (i = 0; i < sc->hn_tx_ring_inuse; ++i) {
 		struct hn_tx_ring *txr = &sc->hn_tx_ring[i];
 
 		/*
 		 * Use txeof task, so that any pending oactive can be
 		 * cleared properly.
 		 */
 		taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_txeof_task);
 	}
 }
 
 static void
 hn_resume_mgmt(struct hn_softc *sc)
 {
 
 	sc->hn_mgmt_taskq = sc->hn_mgmt_taskq0;
 
 	/*
 	 * Kick off network change detection, if it was pending.
 	 * If no network change was pending, start link status
 	 * checks, which is more lightweight than network change
 	 * detection.
 	 */
 	if (sc->hn_link_flags & HN_LINK_FLAG_NETCHG)
 		hn_change_network(sc);
 	else
 		hn_update_link_status(sc);
 }
 
 static void
 hn_resume(struct hn_softc *sc)
 {
 
 	if ((sc->hn_ifp->if_drv_flags & IFF_DRV_RUNNING) ||
 	    (sc->hn_flags & HN_FLAG_VF))
 		hn_resume_data(sc);
 
 	/*
 	 * When the VF is activated, the synthetic interface is changed
 	 * to DOWN in hn_set_vf(). Here, if the VF is still active, we
 	 * don't call hn_resume_mgmt() until the VF is deactivated in
 	 * hn_set_vf().
 	 */
 	if (!(sc->hn_flags & HN_FLAG_VF))
 		hn_resume_mgmt(sc);
 
 	/*
 	 * Re-enable polling if this interface is running and
 	 * the polling is requested.
 	 */
 	if ((sc->hn_ifp->if_drv_flags & IFF_DRV_RUNNING) && sc->hn_pollhz > 0)
 		hn_polling(sc, sc->hn_pollhz);
 }
 
 static void 
 hn_rndis_rx_status(struct hn_softc *sc, const void *data, int dlen)
 {
 	const struct rndis_status_msg *msg;
 	int ofs;
 
 	if (dlen < sizeof(*msg)) {
 		if_printf(sc->hn_ifp, "invalid RNDIS status\n");
 		return;
 	}
 	msg = data;
 
 	switch (msg->rm_status) {
 	case RNDIS_STATUS_MEDIA_CONNECT:
 	case RNDIS_STATUS_MEDIA_DISCONNECT:
 		hn_update_link_status(sc);
 		break;
 
 	case RNDIS_STATUS_TASK_OFFLOAD_CURRENT_CONFIG:
 		/* Not really useful; ignore. */
 		break;
 
 	case RNDIS_STATUS_NETWORK_CHANGE:
 		ofs = RNDIS_STBUFOFFSET_ABS(msg->rm_stbufoffset);
 		if (dlen < ofs + msg->rm_stbuflen ||
 		    msg->rm_stbuflen < sizeof(uint32_t)) {
 			if_printf(sc->hn_ifp, "network changed\n");
 		} else {
 			uint32_t change;
 
 			memcpy(&change, ((const uint8_t *)msg) + ofs,
 			    sizeof(change));
 			if_printf(sc->hn_ifp, "network changed, change %u\n",
 			    change);
 		}
 		hn_change_network(sc);
 		break;
 
 	default:
 		if_printf(sc->hn_ifp, "unknown RNDIS status 0x%08x\n",
 		    msg->rm_status);
 		break;
 	}
 }
 
 static int
 hn_rndis_rxinfo(const void *info_data, int info_dlen, struct hn_rxinfo *info)
 {
 	const struct rndis_pktinfo *pi = info_data;
 	uint32_t mask = 0;
 
 	while (info_dlen != 0) {
 		const void *data;
 		uint32_t dlen;
 
 		if (__predict_false(info_dlen < sizeof(*pi)))
 			return (EINVAL);
 		if (__predict_false(info_dlen < pi->rm_size))
 			return (EINVAL);
 		info_dlen -= pi->rm_size;
 
 		if (__predict_false(pi->rm_size & RNDIS_PKTINFO_SIZE_ALIGNMASK))
 			return (EINVAL);
 		if (__predict_false(pi->rm_size < pi->rm_pktinfooffset))
 			return (EINVAL);
 		dlen = pi->rm_size - pi->rm_pktinfooffset;
 		data = pi->rm_data;
 
 		switch (pi->rm_type) {
 		case NDIS_PKTINFO_TYPE_VLAN:
 			if (__predict_false(dlen < NDIS_VLAN_INFO_SIZE))
 				return (EINVAL);
 			info->vlan_info = *((const uint32_t *)data);
 			mask |= HN_RXINFO_VLAN;
 			break;
 
 		case NDIS_PKTINFO_TYPE_CSUM:
 			if (__predict_false(dlen < NDIS_RXCSUM_INFO_SIZE))
 				return (EINVAL);
 			info->csum_info = *((const uint32_t *)data);
 			mask |= HN_RXINFO_CSUM;
 			break;
 
 		case HN_NDIS_PKTINFO_TYPE_HASHVAL:
 			if (__predict_false(dlen < HN_NDIS_HASH_VALUE_SIZE))
 				return (EINVAL);
 			info->hash_value = *((const uint32_t *)data);
 			mask |= HN_RXINFO_HASHVAL;
 			break;
 
 		case HN_NDIS_PKTINFO_TYPE_HASHINF:
 			if (__predict_false(dlen < HN_NDIS_HASH_INFO_SIZE))
 				return (EINVAL);
 			info->hash_info = *((const uint32_t *)data);
 			mask |= HN_RXINFO_HASHINF;
 			break;
 
 		default:
 			goto next;
 		}
 
 		if (mask == HN_RXINFO_ALL) {
 			/* All found; done */
 			break;
 		}
 next:
 		pi = (const struct rndis_pktinfo *)
 		    ((const uint8_t *)pi + pi->rm_size);
 	}
 
 	/*
 	 * Final fixup.
 	 * - If there is no hash value, invalidate the hash info.
 	 */
 	if ((mask & HN_RXINFO_HASHVAL) == 0)
 		info->hash_info = HN_NDIS_HASH_INFO_INVALID;
 	return (0);
 }
 
 static __inline bool
 hn_rndis_check_overlap(int off, int len, int check_off, int check_len)
 {
 
 	if (off < check_off) {
 		if (__predict_true(off + len <= check_off))
 			return (false);
 	} else if (off > check_off) {
 		if (__predict_true(check_off + check_len <= off))
 			return (false);
 	}
 	return (true);
 }
 
 static void
 hn_rndis_rx_data(struct hn_rx_ring *rxr, const void *data, int dlen)
 {
 	const struct rndis_packet_msg *pkt;
 	struct hn_rxinfo info;
 	int data_off, pktinfo_off, data_len, pktinfo_len;
 
 	/*
 	 * Check length.
 	 */
 	if (__predict_false(dlen < sizeof(*pkt))) {
 		if_printf(rxr->hn_ifp, "invalid RNDIS packet msg\n");
 		return;
 	}
 	pkt = data;
 
 	if (__predict_false(dlen < pkt->rm_len)) {
 		if_printf(rxr->hn_ifp, "truncated RNDIS packet msg, "
 		    "dlen %d, msglen %u\n", dlen, pkt->rm_len);
 		return;
 	}
 	if (__predict_false(pkt->rm_len <
 	    pkt->rm_datalen + pkt->rm_oobdatalen + pkt->rm_pktinfolen)) {
 		if_printf(rxr->hn_ifp, "invalid RNDIS packet msglen, "
 		    "msglen %u, data %u, oob %u, pktinfo %u\n",
 		    pkt->rm_len, pkt->rm_datalen, pkt->rm_oobdatalen,
 		    pkt->rm_pktinfolen);
 		return;
 	}
 	if (__predict_false(pkt->rm_datalen == 0)) {
 		if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, no data\n");
 		return;
 	}
 
 	/*
 	 * Check offests.
 	 */
 #define IS_OFFSET_INVALID(ofs)			\
 	((ofs) < RNDIS_PACKET_MSG_OFFSET_MIN ||	\
 	 ((ofs) & RNDIS_PACKET_MSG_OFFSET_ALIGNMASK))
 
 	/* XXX Hyper-V does not meet data offset alignment requirement */
 	if (__predict_false(pkt->rm_dataoffset < RNDIS_PACKET_MSG_OFFSET_MIN)) {
 		if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, "
 		    "data offset %u\n", pkt->rm_dataoffset);
 		return;
 	}
 	if (__predict_false(pkt->rm_oobdataoffset > 0 &&
 	    IS_OFFSET_INVALID(pkt->rm_oobdataoffset))) {
 		if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, "
 		    "oob offset %u\n", pkt->rm_oobdataoffset);
 		return;
 	}
 	if (__predict_true(pkt->rm_pktinfooffset > 0) &&
 	    __predict_false(IS_OFFSET_INVALID(pkt->rm_pktinfooffset))) {
 		if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, "
 		    "pktinfo offset %u\n", pkt->rm_pktinfooffset);
 		return;
 	}
 
 #undef IS_OFFSET_INVALID
 
 	data_off = RNDIS_PACKET_MSG_OFFSET_ABS(pkt->rm_dataoffset);
 	data_len = pkt->rm_datalen;
 	pktinfo_off = RNDIS_PACKET_MSG_OFFSET_ABS(pkt->rm_pktinfooffset);
 	pktinfo_len = pkt->rm_pktinfolen;
 
 	/*
 	 * Check OOB coverage.
 	 */
 	if (__predict_false(pkt->rm_oobdatalen != 0)) {
 		int oob_off, oob_len;
 
 		if_printf(rxr->hn_ifp, "got oobdata\n");
 		oob_off = RNDIS_PACKET_MSG_OFFSET_ABS(pkt->rm_oobdataoffset);
 		oob_len = pkt->rm_oobdatalen;
 
 		if (__predict_false(oob_off + oob_len > pkt->rm_len)) {
 			if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, "
 			    "oob overflow, msglen %u, oob abs %d len %d\n",
 			    pkt->rm_len, oob_off, oob_len);
 			return;
 		}
 
 		/*
 		 * Check against data.
 		 */
 		if (hn_rndis_check_overlap(oob_off, oob_len,
 		    data_off, data_len)) {
 			if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, "
 			    "oob overlaps data, oob abs %d len %d, "
 			    "data abs %d len %d\n",
 			    oob_off, oob_len, data_off, data_len);
 			return;
 		}
 
 		/*
 		 * Check against pktinfo.
 		 */
 		if (pktinfo_len != 0 &&
 		    hn_rndis_check_overlap(oob_off, oob_len,
 		    pktinfo_off, pktinfo_len)) {
 			if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, "
 			    "oob overlaps pktinfo, oob abs %d len %d, "
 			    "pktinfo abs %d len %d\n",
 			    oob_off, oob_len, pktinfo_off, pktinfo_len);
 			return;
 		}
 	}
 
 	/*
 	 * Check per-packet-info coverage and find useful per-packet-info.
 	 */
 	info.vlan_info = HN_NDIS_VLAN_INFO_INVALID;
 	info.csum_info = HN_NDIS_RXCSUM_INFO_INVALID;
 	info.hash_info = HN_NDIS_HASH_INFO_INVALID;
 	if (__predict_true(pktinfo_len != 0)) {
 		bool overlap;
 		int error;
 
 		if (__predict_false(pktinfo_off + pktinfo_len > pkt->rm_len)) {
 			if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, "
 			    "pktinfo overflow, msglen %u, "
 			    "pktinfo abs %d len %d\n",
 			    pkt->rm_len, pktinfo_off, pktinfo_len);
 			return;
 		}
 
 		/*
 		 * Check packet info coverage.
 		 */
 		overlap = hn_rndis_check_overlap(pktinfo_off, pktinfo_len,
 		    data_off, data_len);
 		if (__predict_false(overlap)) {
 			if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, "
 			    "pktinfo overlap data, pktinfo abs %d len %d, "
 			    "data abs %d len %d\n",
 			    pktinfo_off, pktinfo_len, data_off, data_len);
 			return;
 		}
 
 		/*
 		 * Find useful per-packet-info.
 		 */
 		error = hn_rndis_rxinfo(((const uint8_t *)pkt) + pktinfo_off,
 		    pktinfo_len, &info);
 		if (__predict_false(error)) {
 			if_printf(rxr->hn_ifp, "invalid RNDIS packet msg "
 			    "pktinfo\n");
 			return;
 		}
 	}
 
 	if (__predict_false(data_off + data_len > pkt->rm_len)) {
 		if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, "
 		    "data overflow, msglen %u, data abs %d len %d\n",
 		    pkt->rm_len, data_off, data_len);
 		return;
 	}
 	hn_rxpkt(rxr, ((const uint8_t *)pkt) + data_off, data_len, &info);
 }
 
 static __inline void
 hn_rndis_rxpkt(struct hn_rx_ring *rxr, const void *data, int dlen)
 {
 	const struct rndis_msghdr *hdr;
 
 	if (__predict_false(dlen < sizeof(*hdr))) {
 		if_printf(rxr->hn_ifp, "invalid RNDIS msg\n");
 		return;
 	}
 	hdr = data;
 
 	if (__predict_true(hdr->rm_type == REMOTE_NDIS_PACKET_MSG)) {
 		/* Hot data path. */
 		hn_rndis_rx_data(rxr, data, dlen);
 		/* Done! */
 		return;
 	}
 
 	if (hdr->rm_type == REMOTE_NDIS_INDICATE_STATUS_MSG)
 		hn_rndis_rx_status(rxr->hn_ifp->if_softc, data, dlen);
 	else
 		hn_rndis_rx_ctrl(rxr->hn_ifp->if_softc, data, dlen);
 }
 
 static void
 hn_nvs_handle_notify(struct hn_softc *sc, const struct vmbus_chanpkt_hdr *pkt)
 {
 	const struct hn_nvs_hdr *hdr;
 
 	if (VMBUS_CHANPKT_DATALEN(pkt) < sizeof(*hdr)) {
 		if_printf(sc->hn_ifp, "invalid nvs notify\n");
 		return;
 	}
 	hdr = VMBUS_CHANPKT_CONST_DATA(pkt);
 
 	if (hdr->nvs_type == HN_NVS_TYPE_TXTBL_NOTE) {
 		/* Useless; ignore */
 		return;
 	}
 	if_printf(sc->hn_ifp, "got notify, nvs type %u\n", hdr->nvs_type);
 }
 
 static void
 hn_nvs_handle_comp(struct hn_softc *sc, struct vmbus_channel *chan,
     const struct vmbus_chanpkt_hdr *pkt)
 {
 	struct hn_nvs_sendctx *sndc;
 
 	sndc = (struct hn_nvs_sendctx *)(uintptr_t)pkt->cph_xactid;
 	sndc->hn_cb(sndc, sc, chan, VMBUS_CHANPKT_CONST_DATA(pkt),
 	    VMBUS_CHANPKT_DATALEN(pkt));
 	/*
 	 * NOTE:
 	 * 'sndc' CAN NOT be accessed anymore, since it can be freed by
 	 * its callback.
 	 */
 }
 
 static void
 hn_nvs_handle_rxbuf(struct hn_rx_ring *rxr, struct vmbus_channel *chan,
     const struct vmbus_chanpkt_hdr *pkthdr)
 {
 	const struct vmbus_chanpkt_rxbuf *pkt;
 	const struct hn_nvs_hdr *nvs_hdr;
 	int count, i, hlen;
 
 	if (__predict_false(VMBUS_CHANPKT_DATALEN(pkthdr) < sizeof(*nvs_hdr))) {
 		if_printf(rxr->hn_ifp, "invalid nvs RNDIS\n");
 		return;
 	}
 	nvs_hdr = VMBUS_CHANPKT_CONST_DATA(pkthdr);
 
 	/* Make sure that this is a RNDIS message. */
 	if (__predict_false(nvs_hdr->nvs_type != HN_NVS_TYPE_RNDIS)) {
 		if_printf(rxr->hn_ifp, "nvs type %u, not RNDIS\n",
 		    nvs_hdr->nvs_type);
 		return;
 	}
 
 	hlen = VMBUS_CHANPKT_GETLEN(pkthdr->cph_hlen);
 	if (__predict_false(hlen < sizeof(*pkt))) {
 		if_printf(rxr->hn_ifp, "invalid rxbuf chanpkt\n");
 		return;
 	}
 	pkt = (const struct vmbus_chanpkt_rxbuf *)pkthdr;
 
 	if (__predict_false(pkt->cp_rxbuf_id != HN_NVS_RXBUF_SIG)) {
 		if_printf(rxr->hn_ifp, "invalid rxbuf_id 0x%08x\n",
 		    pkt->cp_rxbuf_id);
 		return;
 	}
 
 	count = pkt->cp_rxbuf_cnt;
 	if (__predict_false(hlen <
 	    __offsetof(struct vmbus_chanpkt_rxbuf, cp_rxbuf[count]))) {
 		if_printf(rxr->hn_ifp, "invalid rxbuf_cnt %d\n", count);
 		return;
 	}
 
 	/* Each range represents 1 RNDIS pkt that contains 1 Ethernet frame */
 	for (i = 0; i < count; ++i) {
 		int ofs, len;
 
 		ofs = pkt->cp_rxbuf[i].rb_ofs;
 		len = pkt->cp_rxbuf[i].rb_len;
 		if (__predict_false(ofs + len > HN_RXBUF_SIZE)) {
 			if_printf(rxr->hn_ifp, "%dth RNDIS msg overflow rxbuf, "
 			    "ofs %d, len %d\n", i, ofs, len);
 			continue;
 		}
 		hn_rndis_rxpkt(rxr, rxr->hn_rxbuf + ofs, len);
 	}
 
 	/*
 	 * Ack the consumed RXBUF associated w/ this channel packet,
 	 * so that this RXBUF can be recycled by the hypervisor.
 	 */
 	hn_nvs_ack_rxbuf(rxr, chan, pkt->cp_hdr.cph_xactid);
 }
 
 static void
 hn_nvs_ack_rxbuf(struct hn_rx_ring *rxr, struct vmbus_channel *chan,
     uint64_t tid)
 {
 	struct hn_nvs_rndis_ack ack;
 	int retries, error;
 	
 	ack.nvs_type = HN_NVS_TYPE_RNDIS_ACK;
 	ack.nvs_status = HN_NVS_STATUS_OK;
 
 	retries = 0;
 again:
 	error = vmbus_chan_send(chan, VMBUS_CHANPKT_TYPE_COMP,
 	    VMBUS_CHANPKT_FLAG_NONE, &ack, sizeof(ack), tid);
 	if (__predict_false(error == EAGAIN)) {
 		/*
 		 * NOTE:
 		 * This should _not_ happen in real world, since the
 		 * consumption of the TX bufring from the TX path is
 		 * controlled.
 		 */
 		if (rxr->hn_ack_failed == 0)
 			if_printf(rxr->hn_ifp, "RXBUF ack retry\n");
 		rxr->hn_ack_failed++;
 		retries++;
 		if (retries < 10) {
 			DELAY(100);
 			goto again;
 		}
 		/* RXBUF leaks! */
 		if_printf(rxr->hn_ifp, "RXBUF ack failed\n");
 	}
 }
 
 static void
 hn_chan_callback(struct vmbus_channel *chan, void *xrxr)
 {
 	struct hn_rx_ring *rxr = xrxr;
 	struct hn_softc *sc = rxr->hn_ifp->if_softc;
 
 	for (;;) {
 		struct vmbus_chanpkt_hdr *pkt = rxr->hn_pktbuf;
 		int error, pktlen;
 
 		pktlen = rxr->hn_pktbuf_len;
 		error = vmbus_chan_recv_pkt(chan, pkt, &pktlen);
 		if (__predict_false(error == ENOBUFS)) {
 			void *nbuf;
 			int nlen;
 
 			/*
 			 * Expand channel packet buffer.
 			 *
 			 * XXX
 			 * Use M_WAITOK here, since allocation failure
 			 * is fatal.
 			 */
 			nlen = rxr->hn_pktbuf_len * 2;
 			while (nlen < pktlen)
 				nlen *= 2;
 			nbuf = malloc(nlen, M_DEVBUF, M_WAITOK);
 
 			if_printf(rxr->hn_ifp, "expand pktbuf %d -> %d\n",
 			    rxr->hn_pktbuf_len, nlen);
 
 			free(rxr->hn_pktbuf, M_DEVBUF);
 			rxr->hn_pktbuf = nbuf;
 			rxr->hn_pktbuf_len = nlen;
 			/* Retry! */
 			continue;
 		} else if (__predict_false(error == EAGAIN)) {
 			/* No more channel packets; done! */
 			break;
 		}
 		KASSERT(!error, ("vmbus_chan_recv_pkt failed: %d", error));
 
 		switch (pkt->cph_type) {
 		case VMBUS_CHANPKT_TYPE_COMP:
 			hn_nvs_handle_comp(sc, chan, pkt);
 			break;
 
 		case VMBUS_CHANPKT_TYPE_RXBUF:
 			hn_nvs_handle_rxbuf(rxr, chan, pkt);
 			break;
 
 		case VMBUS_CHANPKT_TYPE_INBAND:
 			hn_nvs_handle_notify(sc, pkt);
 			break;
 
 		default:
 			if_printf(rxr->hn_ifp, "unknown chan pkt %u\n",
 			    pkt->cph_type);
 			break;
 		}
 	}
 	hn_chan_rollup(rxr, rxr->hn_txr);
 }
 
 static void
 hn_tx_taskq_create(void *arg __unused)
 {
 	int i;
 
 	/*
 	 * Fix the # of TX taskqueues.
 	 */
 	if (hn_tx_taskq_cnt <= 0)
 		hn_tx_taskq_cnt = 1;
 	else if (hn_tx_taskq_cnt > mp_ncpus)
 		hn_tx_taskq_cnt = mp_ncpus;
 
 	/*
 	 * Fix the TX taskqueue mode.
 	 */
 	switch (hn_tx_taskq_mode) {
 	case HN_TX_TASKQ_M_INDEP:
 	case HN_TX_TASKQ_M_GLOBAL:
 	case HN_TX_TASKQ_M_EVTTQ:
 		break;
 	default:
 		hn_tx_taskq_mode = HN_TX_TASKQ_M_INDEP;
 		break;
 	}
 
 	if (vm_guest != VM_GUEST_HV)
 		return;
 
 	if (hn_tx_taskq_mode != HN_TX_TASKQ_M_GLOBAL)
 		return;
 
 	hn_tx_taskque = malloc(hn_tx_taskq_cnt * sizeof(struct taskqueue *),
 	    M_DEVBUF, M_WAITOK);
 	for (i = 0; i < hn_tx_taskq_cnt; ++i) {
 		hn_tx_taskque[i] = taskqueue_create("hn_tx", M_WAITOK,
 		    taskqueue_thread_enqueue, &hn_tx_taskque[i]);
 		taskqueue_start_threads(&hn_tx_taskque[i], 1, PI_NET,
 		    "hn tx%d", i);
 	}
 }
 SYSINIT(hn_txtq_create, SI_SUB_DRIVERS, SI_ORDER_SECOND,
     hn_tx_taskq_create, NULL);
 
 static void
 hn_tx_taskq_destroy(void *arg __unused)
 {
 
 	if (hn_tx_taskque != NULL) {
 		int i;
 
 		for (i = 0; i < hn_tx_taskq_cnt; ++i)
 			taskqueue_free(hn_tx_taskque[i]);
 		free(hn_tx_taskque, M_DEVBUF);
 	}
 }
 SYSUNINIT(hn_txtq_destroy, SI_SUB_DRIVERS, SI_ORDER_SECOND,
     hn_tx_taskq_destroy, NULL);
Index: projects/clang400-import/sys/kern/imgact_elf.c
===================================================================
--- projects/clang400-import/sys/kern/imgact_elf.c	(revision 314522)
+++ projects/clang400-import/sys/kern/imgact_elf.c	(revision 314523)
@@ -1,2410 +1,2416 @@
 /*-
  * Copyright (c) 2000 David O'Brien
  * Copyright (c) 1995-1996 Søren Schmidt
  * Copyright (c) 1996 Peter Wemm
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer
  *    in this position and unchanged.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. The name of the author may not be used to endorse or promote products
  *    derived from this software without specific prior written permission
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_capsicum.h"
 #include "opt_compat.h"
 #include "opt_gzio.h"
 
 #include <sys/param.h>
 #include <sys/capsicum.h>
 #include <sys/exec.h>
 #include <sys/fcntl.h>
 #include <sys/gzio.h>
 #include <sys/imgact.h>
 #include <sys/imgact_elf.h>
 #include <sys/jail.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mount.h>
 #include <sys/mman.h>
 #include <sys/namei.h>
 #include <sys/pioctl.h>
 #include <sys/proc.h>
 #include <sys/procfs.h>
 #include <sys/racct.h>
 #include <sys/resourcevar.h>
 #include <sys/rwlock.h>
 #include <sys/sbuf.h>
 #include <sys/sf_buf.h>
 #include <sys/smp.h>
 #include <sys/systm.h>
 #include <sys/signalvar.h>
 #include <sys/stat.h>
 #include <sys/sx.h>
 #include <sys/syscall.h>
 #include <sys/sysctl.h>
 #include <sys/sysent.h>
 #include <sys/vnode.h>
 #include <sys/syslog.h>
 #include <sys/eventhandler.h>
 #include <sys/user.h>
 
 #include <vm/vm.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_extern.h>
 
 #include <machine/elf.h>
 #include <machine/md_var.h>
 
 #define ELF_NOTE_ROUNDSIZE	4
 #define OLD_EI_BRAND	8
 
 static int __elfN(check_header)(const Elf_Ehdr *hdr);
 static Elf_Brandinfo *__elfN(get_brandinfo)(struct image_params *imgp,
     const char *interp, int interp_name_len, int32_t *osrel);
 static int __elfN(load_file)(struct proc *p, const char *file, u_long *addr,
     u_long *entry, size_t pagesize);
 static int __elfN(load_section)(struct image_params *imgp, vm_offset_t offset,
     caddr_t vmaddr, size_t memsz, size_t filsz, vm_prot_t prot,
     size_t pagesize);
 static int __CONCAT(exec_, __elfN(imgact))(struct image_params *imgp);
 static boolean_t __elfN(freebsd_trans_osrel)(const Elf_Note *note,
     int32_t *osrel);
 static boolean_t kfreebsd_trans_osrel(const Elf_Note *note, int32_t *osrel);
 static boolean_t __elfN(check_note)(struct image_params *imgp,
     Elf_Brandnote *checknote, int32_t *osrel);
 static vm_prot_t __elfN(trans_prot)(Elf_Word);
 static Elf_Word __elfN(untrans_prot)(vm_prot_t);
 
 SYSCTL_NODE(_kern, OID_AUTO, __CONCAT(elf, __ELF_WORD_SIZE), CTLFLAG_RW, 0,
     "");
 
 #define	CORE_BUF_SIZE	(16 * 1024)
 
 int __elfN(fallback_brand) = -1;
 SYSCTL_INT(__CONCAT(_kern_elf, __ELF_WORD_SIZE), OID_AUTO,
     fallback_brand, CTLFLAG_RWTUN, &__elfN(fallback_brand), 0,
     __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) " brand of last resort");
 
 static int elf_legacy_coredump = 0;
 SYSCTL_INT(_debug, OID_AUTO, __elfN(legacy_coredump), CTLFLAG_RW, 
     &elf_legacy_coredump, 0,
     "include all and only RW pages in core dumps");
 
 int __elfN(nxstack) =
 #if defined(__amd64__) || defined(__powerpc64__) /* both 64 and 32 bit */ || \
     (defined(__arm__) && __ARM_ARCH >= 7) || defined(__aarch64__)
 	1;
 #else
 	0;
 #endif
 SYSCTL_INT(__CONCAT(_kern_elf, __ELF_WORD_SIZE), OID_AUTO,
     nxstack, CTLFLAG_RW, &__elfN(nxstack), 0,
     __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) ": enable non-executable stack");
 
 #if __ELF_WORD_SIZE == 32
 #if defined(__amd64__)
 int i386_read_exec = 0;
 SYSCTL_INT(_kern_elf32, OID_AUTO, read_exec, CTLFLAG_RW, &i386_read_exec, 0,
     "enable execution from readable segments");
 #endif
 #endif
 
 static Elf_Brandinfo *elf_brand_list[MAX_BRANDS];
 
 #define	trunc_page_ps(va, ps)	rounddown2(va, ps)
 #define	round_page_ps(va, ps)	roundup2(va, ps)
 #define	aligned(a, t)	(trunc_page_ps((u_long)(a), sizeof(t)) == (u_long)(a))
 
 static const char FREEBSD_ABI_VENDOR[] = "FreeBSD";
 
 Elf_Brandnote __elfN(freebsd_brandnote) = {
 	.hdr.n_namesz	= sizeof(FREEBSD_ABI_VENDOR),
 	.hdr.n_descsz	= sizeof(int32_t),
 	.hdr.n_type	= NT_FREEBSD_ABI_TAG,
 	.vendor		= FREEBSD_ABI_VENDOR,
 	.flags		= BN_TRANSLATE_OSREL,
 	.trans_osrel	= __elfN(freebsd_trans_osrel)
 };
 
 static boolean_t
 __elfN(freebsd_trans_osrel)(const Elf_Note *note, int32_t *osrel)
 {
 	uintptr_t p;
 
 	p = (uintptr_t)(note + 1);
 	p += roundup2(note->n_namesz, ELF_NOTE_ROUNDSIZE);
 	*osrel = *(const int32_t *)(p);
 
 	return (TRUE);
 }
 
 static const char GNU_ABI_VENDOR[] = "GNU";
 static int GNU_KFREEBSD_ABI_DESC = 3;
 
 Elf_Brandnote __elfN(kfreebsd_brandnote) = {
 	.hdr.n_namesz	= sizeof(GNU_ABI_VENDOR),
 	.hdr.n_descsz	= 16,	/* XXX at least 16 */
 	.hdr.n_type	= 1,
 	.vendor		= GNU_ABI_VENDOR,
 	.flags		= BN_TRANSLATE_OSREL,
 	.trans_osrel	= kfreebsd_trans_osrel
 };
 
 static boolean_t
 kfreebsd_trans_osrel(const Elf_Note *note, int32_t *osrel)
 {
 	const Elf32_Word *desc;
 	uintptr_t p;
 
 	p = (uintptr_t)(note + 1);
 	p += roundup2(note->n_namesz, ELF_NOTE_ROUNDSIZE);
 
 	desc = (const Elf32_Word *)p;
 	if (desc[0] != GNU_KFREEBSD_ABI_DESC)
 		return (FALSE);
 
 	/*
 	 * Debian GNU/kFreeBSD embed the earliest compatible kernel version
 	 * (__FreeBSD_version: <major><two digit minor>Rxx) in the LSB way.
 	 */
 	*osrel = desc[1] * 100000 + desc[2] * 1000 + desc[3];
 
 	return (TRUE);
 }
 
 int
 __elfN(insert_brand_entry)(Elf_Brandinfo *entry)
 {
 	int i;
 
 	for (i = 0; i < MAX_BRANDS; i++) {
 		if (elf_brand_list[i] == NULL) {
 			elf_brand_list[i] = entry;
 			break;
 		}
 	}
 	if (i == MAX_BRANDS) {
 		printf("WARNING: %s: could not insert brandinfo entry: %p\n",
 			__func__, entry);
 		return (-1);
 	}
 	return (0);
 }
 
 int
 __elfN(remove_brand_entry)(Elf_Brandinfo *entry)
 {
 	int i;
 
 	for (i = 0; i < MAX_BRANDS; i++) {
 		if (elf_brand_list[i] == entry) {
 			elf_brand_list[i] = NULL;
 			break;
 		}
 	}
 	if (i == MAX_BRANDS)
 		return (-1);
 	return (0);
 }
 
 int
 __elfN(brand_inuse)(Elf_Brandinfo *entry)
 {
 	struct proc *p;
 	int rval = FALSE;
 
 	sx_slock(&allproc_lock);
 	FOREACH_PROC_IN_SYSTEM(p) {
 		if (p->p_sysent == entry->sysvec) {
 			rval = TRUE;
 			break;
 		}
 	}
 	sx_sunlock(&allproc_lock);
 
 	return (rval);
 }
 
 static Elf_Brandinfo *
 __elfN(get_brandinfo)(struct image_params *imgp, const char *interp,
     int interp_name_len, int32_t *osrel)
 {
 	const Elf_Ehdr *hdr = (const Elf_Ehdr *)imgp->image_header;
 	Elf_Brandinfo *bi, *bi_m;
 	boolean_t ret;
 	int i;
 
 	/*
 	 * We support four types of branding -- (1) the ELF EI_OSABI field
 	 * that SCO added to the ELF spec, (2) FreeBSD 3.x's traditional string
 	 * branding w/in the ELF header, (3) path of the `interp_path'
 	 * field, and (4) the ".note.ABI-tag" ELF section.
 	 */
 
 	/* Look for an ".note.ABI-tag" ELF section */
 	bi_m = NULL;
 	for (i = 0; i < MAX_BRANDS; i++) {
 		bi = elf_brand_list[i];
 		if (bi == NULL)
 			continue;
 		if (hdr->e_machine == bi->machine && (bi->flags &
 		    (BI_BRAND_NOTE|BI_BRAND_NOTE_MANDATORY)) != 0) {
 			ret = __elfN(check_note)(imgp, bi->brand_note, osrel);
 			/* Give brand a chance to veto check_note's guess */
 			if (ret && bi->header_supported)
 				ret = bi->header_supported(imgp);
 			/*
 			 * If note checker claimed the binary, but the
 			 * interpreter path in the image does not
 			 * match default one for the brand, try to
 			 * search for other brands with the same
 			 * interpreter.  Either there is better brand
 			 * with the right interpreter, or, failing
 			 * this, we return first brand which accepted
 			 * our note and, optionally, header.
 			 */
 			if (ret && bi_m == NULL && (strlen(bi->interp_path) +
 			    1 != interp_name_len || strncmp(interp,
 			    bi->interp_path, interp_name_len) != 0)) {
 				bi_m = bi;
 				ret = 0;
 			}
 			if (ret)
 				return (bi);
 		}
 	}
 	if (bi_m != NULL)
 		return (bi_m);
 
 	/* If the executable has a brand, search for it in the brand list. */
 	for (i = 0; i < MAX_BRANDS; i++) {
 		bi = elf_brand_list[i];
 		if (bi == NULL || bi->flags & BI_BRAND_NOTE_MANDATORY)
 			continue;
 		if (hdr->e_machine == bi->machine &&
 		    (hdr->e_ident[EI_OSABI] == bi->brand ||
 		    strncmp((const char *)&hdr->e_ident[OLD_EI_BRAND],
 		    bi->compat_3_brand, strlen(bi->compat_3_brand)) == 0)) {
 			/* Looks good, but give brand a chance to veto */
 			if (!bi->header_supported || bi->header_supported(imgp))
 				return (bi);
 		}
 	}
 
 	/* No known brand, see if the header is recognized by any brand */
 	for (i = 0; i < MAX_BRANDS; i++) {
 		bi = elf_brand_list[i];
 		if (bi == NULL || bi->flags & BI_BRAND_NOTE_MANDATORY ||
 		    bi->header_supported == NULL)
 			continue;
 		if (hdr->e_machine == bi->machine) {
 			ret = bi->header_supported(imgp);
 			if (ret)
 				return (bi);
 		}
 	}
 
 	/* Lacking a known brand, search for a recognized interpreter. */
 	if (interp != NULL) {
 		for (i = 0; i < MAX_BRANDS; i++) {
 			bi = elf_brand_list[i];
 			if (bi == NULL || bi->flags & BI_BRAND_NOTE_MANDATORY)
 				continue;
 			if (hdr->e_machine == bi->machine &&
 			    /* ELF image p_filesz includes terminating zero */
 			    strlen(bi->interp_path) + 1 == interp_name_len &&
 			    strncmp(interp, bi->interp_path, interp_name_len)
 			    == 0)
 				return (bi);
 		}
 	}
 
 	/* Lacking a recognized interpreter, try the default brand */
 	for (i = 0; i < MAX_BRANDS; i++) {
 		bi = elf_brand_list[i];
 		if (bi == NULL || bi->flags & BI_BRAND_NOTE_MANDATORY)
 			continue;
 		if (hdr->e_machine == bi->machine &&
 		    __elfN(fallback_brand) == bi->brand)
 			return (bi);
 	}
 	return (NULL);
 }
 
 static int
 __elfN(check_header)(const Elf_Ehdr *hdr)
 {
 	Elf_Brandinfo *bi;
 	int i;
 
 	if (!IS_ELF(*hdr) ||
 	    hdr->e_ident[EI_CLASS] != ELF_TARG_CLASS ||
 	    hdr->e_ident[EI_DATA] != ELF_TARG_DATA ||
 	    hdr->e_ident[EI_VERSION] != EV_CURRENT ||
 	    hdr->e_phentsize != sizeof(Elf_Phdr) ||
 	    hdr->e_version != ELF_TARG_VER)
 		return (ENOEXEC);
 
 	/*
 	 * Make sure we have at least one brand for this machine.
 	 */
 
 	for (i = 0; i < MAX_BRANDS; i++) {
 		bi = elf_brand_list[i];
 		if (bi != NULL && bi->machine == hdr->e_machine)
 			break;
 	}
 	if (i == MAX_BRANDS)
 		return (ENOEXEC);
 
 	return (0);
 }
 
 static int
 __elfN(map_partial)(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
     vm_offset_t start, vm_offset_t end, vm_prot_t prot)
 {
 	struct sf_buf *sf;
 	int error;
 	vm_offset_t off;
 
 	/*
 	 * Create the page if it doesn't exist yet. Ignore errors.
 	 */
 	vm_map_lock(map);
 	vm_map_insert(map, NULL, 0, trunc_page(start), round_page(end),
 	    VM_PROT_ALL, VM_PROT_ALL, 0);
 	vm_map_unlock(map);
 
 	/*
 	 * Find the page from the underlying object.
 	 */
 	if (object) {
 		sf = vm_imgact_map_page(object, offset);
 		if (sf == NULL)
 			return (KERN_FAILURE);
 		off = offset - trunc_page(offset);
 		error = copyout((caddr_t)sf_buf_kva(sf) + off, (caddr_t)start,
 		    end - start);
 		vm_imgact_unmap_page(sf);
 		if (error) {
 			return (KERN_FAILURE);
 		}
 	}
 
 	return (KERN_SUCCESS);
 }
 
 static int
-__elfN(map_insert)(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
-    vm_offset_t start, vm_offset_t end, vm_prot_t prot, int cow)
+__elfN(map_insert)(struct image_params *imgp, vm_map_t map, vm_object_t object,
+    vm_ooffset_t offset, vm_offset_t start, vm_offset_t end, vm_prot_t prot,
+    int cow)
 {
 	struct sf_buf *sf;
 	vm_offset_t off;
 	vm_size_t sz;
-	int error, rv;
+	int error, locked, rv;
 
 	if (start != trunc_page(start)) {
 		rv = __elfN(map_partial)(map, object, offset, start,
 		    round_page(start), prot);
 		if (rv)
 			return (rv);
 		offset += round_page(start) - start;
 		start = round_page(start);
 	}
 	if (end != round_page(end)) {
 		rv = __elfN(map_partial)(map, object, offset +
 		    trunc_page(end) - start, trunc_page(end), end, prot);
 		if (rv)
 			return (rv);
 		end = trunc_page(end);
 	}
 	if (end > start) {
 		if (offset & PAGE_MASK) {
 			/*
 			 * The mapping is not page aligned. This means we have
 			 * to copy the data. Sigh.
 			 */
-			rv = vm_map_find(map, NULL, 0, &start, end - start, 0,
-			    VMFS_NO_SPACE, prot | VM_PROT_WRITE, VM_PROT_ALL,
-			    0);
+			vm_map_lock(map);
+			rv = vm_map_insert(map, NULL, 0, start, end,
+			    prot | VM_PROT_WRITE, VM_PROT_ALL, 0);
+			vm_map_unlock(map);
 			if (rv != KERN_SUCCESS)
 				return (rv);
 			if (object == NULL)
 				return (KERN_SUCCESS);
 			for (; start < end; start += sz) {
 				sf = vm_imgact_map_page(object, offset);
 				if (sf == NULL)
 					return (KERN_FAILURE);
 				off = offset - trunc_page(offset);
 				sz = end - start;
 				if (sz > PAGE_SIZE - off)
 					sz = PAGE_SIZE - off;
 				error = copyout((caddr_t)sf_buf_kva(sf) + off,
 				    (caddr_t)start, sz);
 				vm_imgact_unmap_page(sf);
 				if (error != 0)
 					return (KERN_FAILURE);
 				offset += sz;
 			}
 			rv = KERN_SUCCESS;
 		} else {
 			vm_object_reference(object);
 			vm_map_lock(map);
 			rv = vm_map_insert(map, object, offset, start, end,
 			    prot, VM_PROT_ALL, cow);
 			vm_map_unlock(map);
-			if (rv != KERN_SUCCESS)
+			if (rv != KERN_SUCCESS) {
+				locked = VOP_ISLOCKED(imgp->vp);
+				VOP_UNLOCK(imgp->vp, 0);
 				vm_object_deallocate(object);
+				vn_lock(imgp->vp, locked | LK_RETRY);
+			}
 		}
 		return (rv);
 	} else {
 		return (KERN_SUCCESS);
 	}
 }
 
 static int
 __elfN(load_section)(struct image_params *imgp, vm_offset_t offset,
     caddr_t vmaddr, size_t memsz, size_t filsz, vm_prot_t prot,
     size_t pagesize)
 {
 	struct sf_buf *sf;
 	size_t map_len;
 	vm_map_t map;
 	vm_object_t object;
 	vm_offset_t map_addr;
 	int error, rv, cow;
 	size_t copy_len;
 	vm_offset_t file_addr;
 
 	/*
 	 * It's necessary to fail if the filsz + offset taken from the
 	 * header is greater than the actual file pager object's size.
 	 * If we were to allow this, then the vm_map_find() below would
 	 * walk right off the end of the file object and into the ether.
 	 *
 	 * While I'm here, might as well check for something else that
 	 * is invalid: filsz cannot be greater than memsz.
 	 */
 	if ((off_t)filsz + offset > imgp->attr->va_size || filsz > memsz) {
 		uprintf("elf_load_section: truncated ELF file\n");
 		return (ENOEXEC);
 	}
 
 	object = imgp->object;
 	map = &imgp->proc->p_vmspace->vm_map;
 	map_addr = trunc_page_ps((vm_offset_t)vmaddr, pagesize);
 	file_addr = trunc_page_ps(offset, pagesize);
 
 	/*
 	 * We have two choices.  We can either clear the data in the last page
 	 * of an oversized mapping, or we can start the anon mapping a page
 	 * early and copy the initialized data into that first page.  We
 	 * choose the second..
 	 */
 	if (memsz > filsz)
 		map_len = trunc_page_ps(offset + filsz, pagesize) - file_addr;
 	else
 		map_len = round_page_ps(offset + filsz, pagesize) - file_addr;
 
 	if (map_len != 0) {
 		/* cow flags: don't dump readonly sections in core */
 		cow = MAP_COPY_ON_WRITE | MAP_PREFAULT |
 		    (prot & VM_PROT_WRITE ? 0 : MAP_DISABLE_COREDUMP);
 
-		rv = __elfN(map_insert)(map,
+		rv = __elfN(map_insert)(imgp, map,
 				      object,
 				      file_addr,	/* file offset */
 				      map_addr,		/* virtual start */
 				      map_addr + map_len,/* virtual end */
 				      prot,
 				      cow);
 		if (rv != KERN_SUCCESS)
 			return (EINVAL);
 
 		/* we can stop now if we've covered it all */
 		if (memsz == filsz) {
 			return (0);
 		}
 	}
 
 
 	/*
 	 * We have to get the remaining bit of the file into the first part
 	 * of the oversized map segment.  This is normally because the .data
 	 * segment in the file is extended to provide bss.  It's a neat idea
 	 * to try and save a page, but it's a pain in the behind to implement.
 	 */
 	copy_len = (offset + filsz) - trunc_page_ps(offset + filsz, pagesize);
 	map_addr = trunc_page_ps((vm_offset_t)vmaddr + filsz, pagesize);
 	map_len = round_page_ps((vm_offset_t)vmaddr + memsz, pagesize) -
 	    map_addr;
 
 	/* This had damn well better be true! */
 	if (map_len != 0) {
-		rv = __elfN(map_insert)(map, NULL, 0, map_addr, map_addr +
-		    map_len, VM_PROT_ALL, 0);
+		rv = __elfN(map_insert)(imgp, map, NULL, 0, map_addr,
+		    map_addr + map_len, VM_PROT_ALL, 0);
 		if (rv != KERN_SUCCESS) {
 			return (EINVAL);
 		}
 	}
 
 	if (copy_len != 0) {
 		vm_offset_t off;
 
 		sf = vm_imgact_map_page(object, offset + filsz);
 		if (sf == NULL)
 			return (EIO);
 
 		/* send the page fragment to user space */
 		off = trunc_page_ps(offset + filsz, pagesize) -
 		    trunc_page(offset + filsz);
 		error = copyout((caddr_t)sf_buf_kva(sf) + off,
 		    (caddr_t)map_addr, copy_len);
 		vm_imgact_unmap_page(sf);
 		if (error) {
 			return (error);
 		}
 	}
 
 	/*
 	 * set it to the specified protection.
 	 * XXX had better undo the damage from pasting over the cracks here!
 	 */
 	vm_map_protect(map, trunc_page(map_addr), round_page(map_addr +
 	    map_len), prot, FALSE);
 
 	return (0);
 }
 
 /*
  * Load the file "file" into memory.  It may be either a shared object
  * or an executable.
  *
  * The "addr" reference parameter is in/out.  On entry, it specifies
  * the address where a shared object should be loaded.  If the file is
  * an executable, this value is ignored.  On exit, "addr" specifies
  * where the file was actually loaded.
  *
  * The "entry" reference parameter is out only.  On exit, it specifies
  * the entry point for the loaded file.
  */
 static int
 __elfN(load_file)(struct proc *p, const char *file, u_long *addr,
 	u_long *entry, size_t pagesize)
 {
 	struct {
 		struct nameidata nd;
 		struct vattr attr;
 		struct image_params image_params;
 	} *tempdata;
 	const Elf_Ehdr *hdr = NULL;
 	const Elf_Phdr *phdr = NULL;
 	struct nameidata *nd;
 	struct vattr *attr;
 	struct image_params *imgp;
 	vm_prot_t prot;
 	u_long rbase;
 	u_long base_addr = 0;
 	int error, i, numsegs;
 
 #ifdef CAPABILITY_MODE
 	/*
 	 * XXXJA: This check can go away once we are sufficiently confident
 	 * that the checks in namei() are correct.
 	 */
 	if (IN_CAPABILITY_MODE(curthread))
 		return (ECAPMODE);
 #endif
 
 	tempdata = malloc(sizeof(*tempdata), M_TEMP, M_WAITOK);
 	nd = &tempdata->nd;
 	attr = &tempdata->attr;
 	imgp = &tempdata->image_params;
 
 	/*
 	 * Initialize part of the common data
 	 */
 	imgp->proc = p;
 	imgp->attr = attr;
 	imgp->firstpage = NULL;
 	imgp->image_header = NULL;
 	imgp->object = NULL;
 	imgp->execlabel = NULL;
 
 	NDINIT(nd, LOOKUP, LOCKLEAF | FOLLOW, UIO_SYSSPACE, file, curthread);
 	if ((error = namei(nd)) != 0) {
 		nd->ni_vp = NULL;
 		goto fail;
 	}
 	NDFREE(nd, NDF_ONLY_PNBUF);
 	imgp->vp = nd->ni_vp;
 
 	/*
 	 * Check permissions, modes, uid, etc on the file, and "open" it.
 	 */
 	error = exec_check_permissions(imgp);
 	if (error)
 		goto fail;
 
 	error = exec_map_first_page(imgp);
 	if (error)
 		goto fail;
 
 	/*
 	 * Also make certain that the interpreter stays the same, so set
 	 * its VV_TEXT flag, too.
 	 */
 	VOP_SET_TEXT(nd->ni_vp);
 
 	imgp->object = nd->ni_vp->v_object;
 
 	hdr = (const Elf_Ehdr *)imgp->image_header;
 	if ((error = __elfN(check_header)(hdr)) != 0)
 		goto fail;
 	if (hdr->e_type == ET_DYN)
 		rbase = *addr;
 	else if (hdr->e_type == ET_EXEC)
 		rbase = 0;
 	else {
 		error = ENOEXEC;
 		goto fail;
 	}
 
 	/* Only support headers that fit within first page for now      */
 	if ((hdr->e_phoff > PAGE_SIZE) ||
 	    (u_int)hdr->e_phentsize * hdr->e_phnum > PAGE_SIZE - hdr->e_phoff) {
 		error = ENOEXEC;
 		goto fail;
 	}
 
 	phdr = (const Elf_Phdr *)(imgp->image_header + hdr->e_phoff);
 	if (!aligned(phdr, Elf_Addr)) {
 		error = ENOEXEC;
 		goto fail;
 	}
 
 	for (i = 0, numsegs = 0; i < hdr->e_phnum; i++) {
 		if (phdr[i].p_type == PT_LOAD && phdr[i].p_memsz != 0) {
 			/* Loadable segment */
 			prot = __elfN(trans_prot)(phdr[i].p_flags);
 			error = __elfN(load_section)(imgp, phdr[i].p_offset,
 			    (caddr_t)(uintptr_t)phdr[i].p_vaddr + rbase,
 			    phdr[i].p_memsz, phdr[i].p_filesz, prot, pagesize);
 			if (error != 0)
 				goto fail;
 			/*
 			 * Establish the base address if this is the
 			 * first segment.
 			 */
 			if (numsegs == 0)
   				base_addr = trunc_page(phdr[i].p_vaddr +
 				    rbase);
 			numsegs++;
 		}
 	}
 	*addr = base_addr;
 	*entry = (unsigned long)hdr->e_entry + rbase;
 
 fail:
 	if (imgp->firstpage)
 		exec_unmap_first_page(imgp);
 
 	if (nd->ni_vp)
 		vput(nd->ni_vp);
 
 	free(tempdata, M_TEMP);
 
 	return (error);
 }
 
 static int
 __CONCAT(exec_, __elfN(imgact))(struct image_params *imgp)
 {
 	struct thread *td;
 	const Elf_Ehdr *hdr;
 	const Elf_Phdr *phdr;
 	Elf_Auxargs *elf_auxargs;
 	struct vmspace *vmspace;
 	const char *err_str, *newinterp;
 	char *interp, *interp_buf, *path;
 	Elf_Brandinfo *brand_info;
 	struct sysentvec *sv;
 	vm_prot_t prot;
 	u_long text_size, data_size, total_size, text_addr, data_addr;
 	u_long seg_size, seg_addr, addr, baddr, et_dyn_addr, entry, proghdr;
 	int32_t osrel;
 	int error, i, n, interp_name_len, have_interp;
 
 	hdr = (const Elf_Ehdr *)imgp->image_header;
 
 	/*
 	 * Do we have a valid ELF header ?
 	 *
 	 * Only allow ET_EXEC & ET_DYN here, reject ET_DYN later
 	 * if particular brand doesn't support it.
 	 */
 	if (__elfN(check_header)(hdr) != 0 ||
 	    (hdr->e_type != ET_EXEC && hdr->e_type != ET_DYN))
 		return (-1);
 
 	/*
 	 * From here on down, we return an errno, not -1, as we've
 	 * detected an ELF file.
 	 */
 
 	if ((hdr->e_phoff > PAGE_SIZE) ||
 	    (u_int)hdr->e_phentsize * hdr->e_phnum > PAGE_SIZE - hdr->e_phoff) {
 		/* Only support headers in first page for now */
 		uprintf("Program headers not in the first page\n");
 		return (ENOEXEC);
 	}
 	phdr = (const Elf_Phdr *)(imgp->image_header + hdr->e_phoff); 
 	if (!aligned(phdr, Elf_Addr)) {
 		uprintf("Unaligned program headers\n");
 		return (ENOEXEC);
 	}
 
 	n = error = 0;
 	baddr = 0;
 	osrel = 0;
 	text_size = data_size = total_size = text_addr = data_addr = 0;
 	entry = proghdr = 0;
 	interp_name_len = 0;
 	err_str = newinterp = NULL;
 	interp = interp_buf = NULL;
 	td = curthread;
 
 	for (i = 0; i < hdr->e_phnum; i++) {
 		switch (phdr[i].p_type) {
 		case PT_LOAD:
 			if (n == 0)
 				baddr = phdr[i].p_vaddr;
 			n++;
 			break;
 		case PT_INTERP:
 			/* Path to interpreter */
 			if (phdr[i].p_filesz > MAXPATHLEN) {
 				uprintf("Invalid PT_INTERP\n");
 				error = ENOEXEC;
 				goto ret;
 			}
 			if (interp != NULL) {
 				uprintf("Multiple PT_INTERP headers\n");
 				error = ENOEXEC;
 				goto ret;
 			}
 			interp_name_len = phdr[i].p_filesz;
 			if (phdr[i].p_offset > PAGE_SIZE ||
 			    interp_name_len > PAGE_SIZE - phdr[i].p_offset) {
 				VOP_UNLOCK(imgp->vp, 0);
 				interp_buf = malloc(interp_name_len + 1, M_TEMP,
 				    M_WAITOK);
 				vn_lock(imgp->vp, LK_EXCLUSIVE | LK_RETRY);
 				error = vn_rdwr(UIO_READ, imgp->vp, interp_buf,
 				    interp_name_len, phdr[i].p_offset,
 				    UIO_SYSSPACE, IO_NODELOCKED, td->td_ucred,
 				    NOCRED, NULL, td);
 				if (error != 0) {
 					uprintf("i/o error PT_INTERP\n");
 					goto ret;
 				}
 				interp_buf[interp_name_len] = '\0';
 				interp = interp_buf;
 			} else {
 				interp = __DECONST(char *, imgp->image_header) +
 				    phdr[i].p_offset;
 			}
 			break;
 		case PT_GNU_STACK:
 			if (__elfN(nxstack))
 				imgp->stack_prot =
 				    __elfN(trans_prot)(phdr[i].p_flags);
 			imgp->stack_sz = phdr[i].p_memsz;
 			break;
 		}
 	}
 
 	brand_info = __elfN(get_brandinfo)(imgp, interp, interp_name_len,
 	    &osrel);
 	if (brand_info == NULL) {
 		uprintf("ELF binary type \"%u\" not known.\n",
 		    hdr->e_ident[EI_OSABI]);
 		error = ENOEXEC;
 		goto ret;
 	}
 	et_dyn_addr = 0;
 	if (hdr->e_type == ET_DYN) {
 		if ((brand_info->flags & BI_CAN_EXEC_DYN) == 0) {
 			uprintf("Cannot execute shared object\n");
 			error = ENOEXEC;
 			goto ret;
 		}
 		/*
 		 * Honour the base load address from the dso if it is
 		 * non-zero for some reason.
 		 */
 		if (baddr == 0)
 			et_dyn_addr = ET_DYN_LOAD_ADDR;
 	}
 	sv = brand_info->sysvec;
 	if (interp != NULL && brand_info->interp_newpath != NULL)
 		newinterp = brand_info->interp_newpath;
 
 	/*
 	 * Avoid a possible deadlock if the current address space is destroyed
 	 * and that address space maps the locked vnode.  In the common case,
 	 * the locked vnode's v_usecount is decremented but remains greater
 	 * than zero.  Consequently, the vnode lock is not needed by vrele().
 	 * However, in cases where the vnode lock is external, such as nullfs,
 	 * v_usecount may become zero.
 	 *
 	 * The VV_TEXT flag prevents modifications to the executable while
 	 * the vnode is unlocked.
 	 */
 	VOP_UNLOCK(imgp->vp, 0);
 
 	error = exec_new_vmspace(imgp, sv);
 	imgp->proc->p_sysent = sv;
 
 	vn_lock(imgp->vp, LK_EXCLUSIVE | LK_RETRY);
 	if (error != 0)
 		goto ret;
 
 	for (i = 0; i < hdr->e_phnum; i++) {
 		switch (phdr[i].p_type) {
 		case PT_LOAD:	/* Loadable segment */
 			if (phdr[i].p_memsz == 0)
 				break;
 			prot = __elfN(trans_prot)(phdr[i].p_flags);
 			error = __elfN(load_section)(imgp, phdr[i].p_offset,
 			    (caddr_t)(uintptr_t)phdr[i].p_vaddr + et_dyn_addr,
 			    phdr[i].p_memsz, phdr[i].p_filesz, prot,
 			    sv->sv_pagesize);
 			if (error != 0)
 				goto ret;
 
 			/*
 			 * If this segment contains the program headers,
 			 * remember their virtual address for the AT_PHDR
 			 * aux entry. Static binaries don't usually include
 			 * a PT_PHDR entry.
 			 */
 			if (phdr[i].p_offset == 0 &&
 			    hdr->e_phoff + hdr->e_phnum * hdr->e_phentsize
 				<= phdr[i].p_filesz)
 				proghdr = phdr[i].p_vaddr + hdr->e_phoff +
 				    et_dyn_addr;
 
 			seg_addr = trunc_page(phdr[i].p_vaddr + et_dyn_addr);
 			seg_size = round_page(phdr[i].p_memsz +
 			    phdr[i].p_vaddr + et_dyn_addr - seg_addr);
 
 			/*
 			 * Make the largest executable segment the official
 			 * text segment and all others data.
 			 *
 			 * Note that obreak() assumes that data_addr + 
 			 * data_size == end of data load area, and the ELF
 			 * file format expects segments to be sorted by
 			 * address.  If multiple data segments exist, the
 			 * last one will be used.
 			 */
 
 			if (phdr[i].p_flags & PF_X && text_size < seg_size) {
 				text_size = seg_size;
 				text_addr = seg_addr;
 			} else {
 				data_size = seg_size;
 				data_addr = seg_addr;
 			}
 			total_size += seg_size;
 			break;
 		case PT_PHDR: 	/* Program header table info */
 			proghdr = phdr[i].p_vaddr + et_dyn_addr;
 			break;
 		default:
 			break;
 		}
 	}
 	
 	if (data_addr == 0 && data_size == 0) {
 		data_addr = text_addr;
 		data_size = text_size;
 	}
 
 	entry = (u_long)hdr->e_entry + et_dyn_addr;
 
 	/*
 	 * Check limits.  It should be safe to check the
 	 * limits after loading the segments since we do
 	 * not actually fault in all the segments pages.
 	 */
 	PROC_LOCK(imgp->proc);
 	if (data_size > lim_cur_proc(imgp->proc, RLIMIT_DATA))
 		err_str = "Data segment size exceeds process limit";
 	else if (text_size > maxtsiz)
 		err_str = "Text segment size exceeds system limit";
 	else if (total_size > lim_cur_proc(imgp->proc, RLIMIT_VMEM))
 		err_str = "Total segment size exceeds process limit";
 	else if (racct_set(imgp->proc, RACCT_DATA, data_size) != 0)
 		err_str = "Data segment size exceeds resource limit";
 	else if (racct_set(imgp->proc, RACCT_VMEM, total_size) != 0)
 		err_str = "Total segment size exceeds resource limit";
 	if (err_str != NULL) {
 		PROC_UNLOCK(imgp->proc);
 		uprintf("%s\n", err_str);
 		error = ENOMEM;
 		goto ret;
 	}
 
 	vmspace = imgp->proc->p_vmspace;
 	vmspace->vm_tsize = text_size >> PAGE_SHIFT;
 	vmspace->vm_taddr = (caddr_t)(uintptr_t)text_addr;
 	vmspace->vm_dsize = data_size >> PAGE_SHIFT;
 	vmspace->vm_daddr = (caddr_t)(uintptr_t)data_addr;
 
 	/*
 	 * We load the dynamic linker where a userland call
 	 * to mmap(0, ...) would put it.  The rationale behind this
 	 * calculation is that it leaves room for the heap to grow to
 	 * its maximum allowed size.
 	 */
 	addr = round_page((vm_offset_t)vmspace->vm_daddr + lim_max(td,
 	    RLIMIT_DATA));
 	PROC_UNLOCK(imgp->proc);
 
 	imgp->entry_addr = entry;
 
 	if (interp != NULL) {
 		have_interp = FALSE;
 		VOP_UNLOCK(imgp->vp, 0);
 		if (brand_info->emul_path != NULL &&
 		    brand_info->emul_path[0] != '\0') {
 			path = malloc(MAXPATHLEN, M_TEMP, M_WAITOK);
 			snprintf(path, MAXPATHLEN, "%s%s",
 			    brand_info->emul_path, interp);
 			error = __elfN(load_file)(imgp->proc, path, &addr,
 			    &imgp->entry_addr, sv->sv_pagesize);
 			free(path, M_TEMP);
 			if (error == 0)
 				have_interp = TRUE;
 		}
 		if (!have_interp && newinterp != NULL &&
 		    (brand_info->interp_path == NULL ||
 		    strcmp(interp, brand_info->interp_path) == 0)) {
 			error = __elfN(load_file)(imgp->proc, newinterp, &addr,
 			    &imgp->entry_addr, sv->sv_pagesize);
 			if (error == 0)
 				have_interp = TRUE;
 		}
 		if (!have_interp) {
 			error = __elfN(load_file)(imgp->proc, interp, &addr,
 			    &imgp->entry_addr, sv->sv_pagesize);
 		}
 		vn_lock(imgp->vp, LK_EXCLUSIVE | LK_RETRY);
 		if (error != 0) {
 			uprintf("ELF interpreter %s not found, error %d\n",
 			    interp, error);
 			goto ret;
 		}
 	} else
 		addr = et_dyn_addr;
 
 	/*
 	 * Construct auxargs table (used by the fixup routine)
 	 */
 	elf_auxargs = malloc(sizeof(Elf_Auxargs), M_TEMP, M_WAITOK);
 	elf_auxargs->execfd = -1;
 	elf_auxargs->phdr = proghdr;
 	elf_auxargs->phent = hdr->e_phentsize;
 	elf_auxargs->phnum = hdr->e_phnum;
 	elf_auxargs->pagesz = PAGE_SIZE;
 	elf_auxargs->base = addr;
 	elf_auxargs->flags = 0;
 	elf_auxargs->entry = entry;
 	elf_auxargs->hdr_eflags = hdr->e_flags;
 
 	imgp->auxargs = elf_auxargs;
 	imgp->interpreted = 0;
 	imgp->reloc_base = addr;
 	imgp->proc->p_osrel = osrel;
 	imgp->proc->p_elf_machine = hdr->e_machine;
 	imgp->proc->p_elf_flags = hdr->e_flags;
 
 ret:
 	free(interp_buf, M_TEMP);
 	return (error);
 }
 
 #define	suword __CONCAT(suword, __ELF_WORD_SIZE)
 
 int
 __elfN(freebsd_fixup)(register_t **stack_base, struct image_params *imgp)
 {
 	Elf_Auxargs *args = (Elf_Auxargs *)imgp->auxargs;
 	Elf_Addr *base;
 	Elf_Addr *pos;
 
 	base = (Elf_Addr *)*stack_base;
 	pos = base + (imgp->args->argc + imgp->args->envc + 2);
 
 	if (args->execfd != -1)
 		AUXARGS_ENTRY(pos, AT_EXECFD, args->execfd);
 	AUXARGS_ENTRY(pos, AT_PHDR, args->phdr);
 	AUXARGS_ENTRY(pos, AT_PHENT, args->phent);
 	AUXARGS_ENTRY(pos, AT_PHNUM, args->phnum);
 	AUXARGS_ENTRY(pos, AT_PAGESZ, args->pagesz);
 	AUXARGS_ENTRY(pos, AT_FLAGS, args->flags);
 	AUXARGS_ENTRY(pos, AT_ENTRY, args->entry);
 	AUXARGS_ENTRY(pos, AT_BASE, args->base);
 #ifdef AT_EHDRFLAGS
 	AUXARGS_ENTRY(pos, AT_EHDRFLAGS, args->hdr_eflags);
 #endif
 	if (imgp->execpathp != 0)
 		AUXARGS_ENTRY(pos, AT_EXECPATH, imgp->execpathp);
 	AUXARGS_ENTRY(pos, AT_OSRELDATE,
 	    imgp->proc->p_ucred->cr_prison->pr_osreldate);
 	if (imgp->canary != 0) {
 		AUXARGS_ENTRY(pos, AT_CANARY, imgp->canary);
 		AUXARGS_ENTRY(pos, AT_CANARYLEN, imgp->canarylen);
 	}
 	AUXARGS_ENTRY(pos, AT_NCPUS, mp_ncpus);
 	if (imgp->pagesizes != 0) {
 		AUXARGS_ENTRY(pos, AT_PAGESIZES, imgp->pagesizes);
 		AUXARGS_ENTRY(pos, AT_PAGESIZESLEN, imgp->pagesizeslen);
 	}
 	if (imgp->sysent->sv_timekeep_base != 0) {
 		AUXARGS_ENTRY(pos, AT_TIMEKEEP,
 		    imgp->sysent->sv_timekeep_base);
 	}
 	AUXARGS_ENTRY(pos, AT_STACKPROT, imgp->sysent->sv_shared_page_obj
 	    != NULL && imgp->stack_prot != 0 ? imgp->stack_prot :
 	    imgp->sysent->sv_stackprot);
 	AUXARGS_ENTRY(pos, AT_NULL, 0);
 
 	free(imgp->auxargs, M_TEMP);
 	imgp->auxargs = NULL;
 
 	base--;
 	suword(base, (long)imgp->args->argc);
 	*stack_base = (register_t *)base;
 	return (0);
 }
 
 /*
  * Code for generating ELF core dumps.
  */
 
 typedef void (*segment_callback)(vm_map_entry_t, void *);
 
 /* Closure for cb_put_phdr(). */
 struct phdr_closure {
 	Elf_Phdr *phdr;		/* Program header to fill in */
 	Elf_Off offset;		/* Offset of segment in core file */
 };
 
 /* Closure for cb_size_segment(). */
 struct sseg_closure {
 	int count;		/* Count of writable segments. */
 	size_t size;		/* Total size of all writable segments. */
 };
 
 typedef void (*outfunc_t)(void *, struct sbuf *, size_t *);
 
 struct note_info {
 	int		type;		/* Note type. */
 	outfunc_t 	outfunc; 	/* Output function. */
 	void		*outarg;	/* Argument for the output function. */
 	size_t		outsize;	/* Output size. */
 	TAILQ_ENTRY(note_info) link;	/* Link to the next note info. */
 };
 
 TAILQ_HEAD(note_info_list, note_info);
 
 /* Coredump output parameters. */
 struct coredump_params {
 	off_t		offset;
 	struct ucred	*active_cred;
 	struct ucred	*file_cred;
 	struct thread	*td;
 	struct vnode	*vp;
 	struct gzio_stream *gzs;
 };
 
 static void cb_put_phdr(vm_map_entry_t, void *);
 static void cb_size_segment(vm_map_entry_t, void *);
 static int core_write(struct coredump_params *, const void *, size_t, off_t,
     enum uio_seg);
 static void each_dumpable_segment(struct thread *, segment_callback, void *);
 static int __elfN(corehdr)(struct coredump_params *, int, void *, size_t,
     struct note_info_list *, size_t);
 static void __elfN(prepare_notes)(struct thread *, struct note_info_list *,
     size_t *);
 static void __elfN(puthdr)(struct thread *, void *, size_t, int, size_t);
 static void __elfN(putnote)(struct note_info *, struct sbuf *);
 static size_t register_note(struct note_info_list *, int, outfunc_t, void *);
 static int sbuf_drain_core_output(void *, const char *, int);
 static int sbuf_drain_count(void *arg, const char *data, int len);
 
 static void __elfN(note_fpregset)(void *, struct sbuf *, size_t *);
 static void __elfN(note_prpsinfo)(void *, struct sbuf *, size_t *);
 static void __elfN(note_prstatus)(void *, struct sbuf *, size_t *);
 static void __elfN(note_threadmd)(void *, struct sbuf *, size_t *);
 static void __elfN(note_thrmisc)(void *, struct sbuf *, size_t *);
 static void __elfN(note_procstat_auxv)(void *, struct sbuf *, size_t *);
 static void __elfN(note_procstat_proc)(void *, struct sbuf *, size_t *);
 static void __elfN(note_procstat_psstrings)(void *, struct sbuf *, size_t *);
 static void note_procstat_files(void *, struct sbuf *, size_t *);
 static void note_procstat_groups(void *, struct sbuf *, size_t *);
 static void note_procstat_osrel(void *, struct sbuf *, size_t *);
 static void note_procstat_rlimit(void *, struct sbuf *, size_t *);
 static void note_procstat_umask(void *, struct sbuf *, size_t *);
 static void note_procstat_vmmap(void *, struct sbuf *, size_t *);
 
 #ifdef GZIO
 extern int compress_user_cores_gzlevel;
 
 /*
  * Write out a core segment to the compression stream.
  */
 static int
 compress_chunk(struct coredump_params *p, char *base, char *buf, u_int len)
 {
 	u_int chunk_len;
 	int error;
 
 	while (len > 0) {
 		chunk_len = MIN(len, CORE_BUF_SIZE);
 
 		/*
 		 * We can get EFAULT error here.
 		 * In that case zero out the current chunk of the segment.
 		 */
 		error = copyin(base, buf, chunk_len);
 		if (error != 0)
 			bzero(buf, chunk_len);
 		error = gzio_write(p->gzs, buf, chunk_len);
 		if (error != 0)
 			break;
 		base += chunk_len;
 		len -= chunk_len;
 	}
 	return (error);
 }
 
 static int
 core_gz_write(void *base, size_t len, off_t offset, void *arg)
 {
 
 	return (core_write((struct coredump_params *)arg, base, len, offset,
 	    UIO_SYSSPACE));
 }
 #endif /* GZIO */
 
 static int
 core_write(struct coredump_params *p, const void *base, size_t len,
     off_t offset, enum uio_seg seg)
 {
 
 	return (vn_rdwr_inchunks(UIO_WRITE, p->vp, __DECONST(void *, base),
 	    len, offset, seg, IO_UNIT | IO_DIRECT | IO_RANGELOCKED,
 	    p->active_cred, p->file_cred, NULL, p->td));
 }
 
 static int
 core_output(void *base, size_t len, off_t offset, struct coredump_params *p,
     void *tmpbuf)
 {
 	int error;
 
 #ifdef GZIO
 	if (p->gzs != NULL)
 		return (compress_chunk(p, base, tmpbuf, len));
 #endif
 	/*
 	 * EFAULT is a non-fatal error that we can get, for example,
 	 * if the segment is backed by a file but extends beyond its
 	 * end.
 	 */
 	error = core_write(p, base, len, offset, UIO_USERSPACE);
 	if (error == EFAULT) {
 		log(LOG_WARNING, "Failed to fully fault in a core file segment "
 		    "at VA %p with size 0x%zx to be written at offset 0x%jx "
 		    "for process %s\n", base, len, offset, curproc->p_comm);
 
 		/*
 		 * Write a "real" zero byte at the end of the target region
 		 * in the case this is the last segment.
 		 * The intermediate space will be implicitly zero-filled.
 		 */
 		error = core_write(p, zero_region, 1, offset + len - 1,
 		    UIO_SYSSPACE);
 	}
 	return (error);
 }
 
 /*
  * Drain into a core file.
  */
 static int
 sbuf_drain_core_output(void *arg, const char *data, int len)
 {
 	struct coredump_params *p;
 	int error, locked;
 
 	p = (struct coredump_params *)arg;
 
 	/*
 	 * Some kern_proc out routines that print to this sbuf may
 	 * call us with the process lock held. Draining with the
 	 * non-sleepable lock held is unsafe. The lock is needed for
 	 * those routines when dumping a live process. In our case we
 	 * can safely release the lock before draining and acquire
 	 * again after.
 	 */
 	locked = PROC_LOCKED(p->td->td_proc);
 	if (locked)
 		PROC_UNLOCK(p->td->td_proc);
 #ifdef GZIO
 	if (p->gzs != NULL)
 		error = gzio_write(p->gzs, __DECONST(char *, data), len);
 	else
 #endif
 		error = core_write(p, __DECONST(void *, data), len, p->offset,
 		    UIO_SYSSPACE);
 	if (locked)
 		PROC_LOCK(p->td->td_proc);
 	if (error != 0)
 		return (-error);
 	p->offset += len;
 	return (len);
 }
 
 /*
  * Drain into a counter.
  */
 static int
 sbuf_drain_count(void *arg, const char *data __unused, int len)
 {
 	size_t *sizep;
 
 	sizep = (size_t *)arg;
 	*sizep += len;
 	return (len);
 }
 
 int
 __elfN(coredump)(struct thread *td, struct vnode *vp, off_t limit, int flags)
 {
 	struct ucred *cred = td->td_ucred;
 	int error = 0;
 	struct sseg_closure seginfo;
 	struct note_info_list notelst;
 	struct coredump_params params;
 	struct note_info *ninfo;
 	void *hdr, *tmpbuf;
 	size_t hdrsize, notesz, coresize;
 #ifdef GZIO
 	boolean_t compress;
 
 	compress = (flags & IMGACT_CORE_COMPRESS) != 0;
 #endif
 	hdr = NULL;
 	tmpbuf = NULL;
 	TAILQ_INIT(&notelst);
 
 	/* Size the program segments. */
 	seginfo.count = 0;
 	seginfo.size = 0;
 	each_dumpable_segment(td, cb_size_segment, &seginfo);
 
 	/*
 	 * Collect info about the core file header area.
 	 */
 	hdrsize = sizeof(Elf_Ehdr) + sizeof(Elf_Phdr) * (1 + seginfo.count);
 	if (seginfo.count + 1 >= PN_XNUM)
 		hdrsize += sizeof(Elf_Shdr);
 	__elfN(prepare_notes)(td, &notelst, &notesz);
 	coresize = round_page(hdrsize + notesz) + seginfo.size;
 
 	/* Set up core dump parameters. */
 	params.offset = 0;
 	params.active_cred = cred;
 	params.file_cred = NOCRED;
 	params.td = td;
 	params.vp = vp;
 	params.gzs = NULL;
 
 #ifdef RACCT
 	if (racct_enable) {
 		PROC_LOCK(td->td_proc);
 		error = racct_add(td->td_proc, RACCT_CORE, coresize);
 		PROC_UNLOCK(td->td_proc);
 		if (error != 0) {
 			error = EFAULT;
 			goto done;
 		}
 	}
 #endif
 	if (coresize >= limit) {
 		error = EFAULT;
 		goto done;
 	}
 
 #ifdef GZIO
 	/* Create a compression stream if necessary. */
 	if (compress) {
 		params.gzs = gzio_init(core_gz_write, GZIO_DEFLATE,
 		    CORE_BUF_SIZE, compress_user_cores_gzlevel, &params);
 		if (params.gzs == NULL) {
 			error = EFAULT;
 			goto done;
 		}
 		tmpbuf = malloc(CORE_BUF_SIZE, M_TEMP, M_WAITOK | M_ZERO);
         }
 #endif
 
 	/*
 	 * Allocate memory for building the header, fill it up,
 	 * and write it out following the notes.
 	 */
 	hdr = malloc(hdrsize, M_TEMP, M_WAITOK);
 	error = __elfN(corehdr)(&params, seginfo.count, hdr, hdrsize, &notelst,
 	    notesz);
 
 	/* Write the contents of all of the writable segments. */
 	if (error == 0) {
 		Elf_Phdr *php;
 		off_t offset;
 		int i;
 
 		php = (Elf_Phdr *)((char *)hdr + sizeof(Elf_Ehdr)) + 1;
 		offset = round_page(hdrsize + notesz);
 		for (i = 0; i < seginfo.count; i++) {
 			error = core_output((caddr_t)(uintptr_t)php->p_vaddr,
 			    php->p_filesz, offset, &params, tmpbuf);
 			if (error != 0)
 				break;
 			offset += php->p_filesz;
 			php++;
 		}
 #ifdef GZIO
 		if (error == 0 && compress)
 			error = gzio_flush(params.gzs);
 #endif
 	}
 	if (error) {
 		log(LOG_WARNING,
 		    "Failed to write core file for process %s (error %d)\n",
 		    curproc->p_comm, error);
 	}
 
 done:
 #ifdef GZIO
 	if (compress) {
 		free(tmpbuf, M_TEMP);
 		if (params.gzs != NULL)
 			gzio_fini(params.gzs);
 	}
 #endif
 	while ((ninfo = TAILQ_FIRST(&notelst)) != NULL) {
 		TAILQ_REMOVE(&notelst, ninfo, link);
 		free(ninfo, M_TEMP);
 	}
 	if (hdr != NULL)
 		free(hdr, M_TEMP);
 
 	return (error);
 }
 
 /*
  * A callback for each_dumpable_segment() to write out the segment's
  * program header entry.
  */
 static void
 cb_put_phdr(entry, closure)
 	vm_map_entry_t entry;
 	void *closure;
 {
 	struct phdr_closure *phc = (struct phdr_closure *)closure;
 	Elf_Phdr *phdr = phc->phdr;
 
 	phc->offset = round_page(phc->offset);
 
 	phdr->p_type = PT_LOAD;
 	phdr->p_offset = phc->offset;
 	phdr->p_vaddr = entry->start;
 	phdr->p_paddr = 0;
 	phdr->p_filesz = phdr->p_memsz = entry->end - entry->start;
 	phdr->p_align = PAGE_SIZE;
 	phdr->p_flags = __elfN(untrans_prot)(entry->protection);
 
 	phc->offset += phdr->p_filesz;
 	phc->phdr++;
 }
 
 /*
  * A callback for each_dumpable_segment() to gather information about
  * the number of segments and their total size.
  */
 static void
 cb_size_segment(vm_map_entry_t entry, void *closure)
 {
 	struct sseg_closure *ssc = (struct sseg_closure *)closure;
 
 	ssc->count++;
 	ssc->size += entry->end - entry->start;
 }
 
 /*
  * For each writable segment in the process's memory map, call the given
  * function with a pointer to the map entry and some arbitrary
  * caller-supplied data.
  */
 static void
 each_dumpable_segment(struct thread *td, segment_callback func, void *closure)
 {
 	struct proc *p = td->td_proc;
 	vm_map_t map = &p->p_vmspace->vm_map;
 	vm_map_entry_t entry;
 	vm_object_t backing_object, object;
 	boolean_t ignore_entry;
 
 	vm_map_lock_read(map);
 	for (entry = map->header.next; entry != &map->header;
 	    entry = entry->next) {
 		/*
 		 * Don't dump inaccessible mappings, deal with legacy
 		 * coredump mode.
 		 *
 		 * Note that read-only segments related to the elf binary
 		 * are marked MAP_ENTRY_NOCOREDUMP now so we no longer
 		 * need to arbitrarily ignore such segments.
 		 */
 		if (elf_legacy_coredump) {
 			if ((entry->protection & VM_PROT_RW) != VM_PROT_RW)
 				continue;
 		} else {
 			if ((entry->protection & VM_PROT_ALL) == 0)
 				continue;
 		}
 
 		/*
 		 * Dont include memory segment in the coredump if
 		 * MAP_NOCORE is set in mmap(2) or MADV_NOCORE in
 		 * madvise(2).  Do not dump submaps (i.e. parts of the
 		 * kernel map).
 		 */
 		if (entry->eflags & (MAP_ENTRY_NOCOREDUMP|MAP_ENTRY_IS_SUB_MAP))
 			continue;
 
 		if ((object = entry->object.vm_object) == NULL)
 			continue;
 
 		/* Ignore memory-mapped devices and such things. */
 		VM_OBJECT_RLOCK(object);
 		while ((backing_object = object->backing_object) != NULL) {
 			VM_OBJECT_RLOCK(backing_object);
 			VM_OBJECT_RUNLOCK(object);
 			object = backing_object;
 		}
 		ignore_entry = object->type != OBJT_DEFAULT &&
 		    object->type != OBJT_SWAP && object->type != OBJT_VNODE &&
 		    object->type != OBJT_PHYS;
 		VM_OBJECT_RUNLOCK(object);
 		if (ignore_entry)
 			continue;
 
 		(*func)(entry, closure);
 	}
 	vm_map_unlock_read(map);
 }
 
 /*
  * Write the core file header to the file, including padding up to
  * the page boundary.
  */
 static int
 __elfN(corehdr)(struct coredump_params *p, int numsegs, void *hdr,
     size_t hdrsize, struct note_info_list *notelst, size_t notesz)
 {
 	struct note_info *ninfo;
 	struct sbuf *sb;
 	int error;
 
 	/* Fill in the header. */
 	bzero(hdr, hdrsize);
 	__elfN(puthdr)(p->td, hdr, hdrsize, numsegs, notesz);
 
 	sb = sbuf_new(NULL, NULL, CORE_BUF_SIZE, SBUF_FIXEDLEN);
 	sbuf_set_drain(sb, sbuf_drain_core_output, p);
 	sbuf_start_section(sb, NULL);
 	sbuf_bcat(sb, hdr, hdrsize);
 	TAILQ_FOREACH(ninfo, notelst, link)
 	    __elfN(putnote)(ninfo, sb);
 	/* Align up to a page boundary for the program segments. */
 	sbuf_end_section(sb, -1, PAGE_SIZE, 0);
 	error = sbuf_finish(sb);
 	sbuf_delete(sb);
 
 	return (error);
 }
 
 static void
 __elfN(prepare_notes)(struct thread *td, struct note_info_list *list,
     size_t *sizep)
 {
 	struct proc *p;
 	struct thread *thr;
 	size_t size;
 
 	p = td->td_proc;
 	size = 0;
 
 	size += register_note(list, NT_PRPSINFO, __elfN(note_prpsinfo), p);
 
 	/*
 	 * To have the debugger select the right thread (LWP) as the initial
 	 * thread, we dump the state of the thread passed to us in td first.
 	 * This is the thread that causes the core dump and thus likely to
 	 * be the right thread one wants to have selected in the debugger.
 	 */
 	thr = td;
 	while (thr != NULL) {
 		size += register_note(list, NT_PRSTATUS,
 		    __elfN(note_prstatus), thr);
 		size += register_note(list, NT_FPREGSET,
 		    __elfN(note_fpregset), thr);
 		size += register_note(list, NT_THRMISC,
 		    __elfN(note_thrmisc), thr);
 		size += register_note(list, -1,
 		    __elfN(note_threadmd), thr);
 
 		thr = (thr == td) ? TAILQ_FIRST(&p->p_threads) :
 		    TAILQ_NEXT(thr, td_plist);
 		if (thr == td)
 			thr = TAILQ_NEXT(thr, td_plist);
 	}
 
 	size += register_note(list, NT_PROCSTAT_PROC,
 	    __elfN(note_procstat_proc), p);
 	size += register_note(list, NT_PROCSTAT_FILES,
 	    note_procstat_files, p);
 	size += register_note(list, NT_PROCSTAT_VMMAP,
 	    note_procstat_vmmap, p);
 	size += register_note(list, NT_PROCSTAT_GROUPS,
 	    note_procstat_groups, p);
 	size += register_note(list, NT_PROCSTAT_UMASK,
 	    note_procstat_umask, p);
 	size += register_note(list, NT_PROCSTAT_RLIMIT,
 	    note_procstat_rlimit, p);
 	size += register_note(list, NT_PROCSTAT_OSREL,
 	    note_procstat_osrel, p);
 	size += register_note(list, NT_PROCSTAT_PSSTRINGS,
 	    __elfN(note_procstat_psstrings), p);
 	size += register_note(list, NT_PROCSTAT_AUXV,
 	    __elfN(note_procstat_auxv), p);
 
 	*sizep = size;
 }
 
 static void
 __elfN(puthdr)(struct thread *td, void *hdr, size_t hdrsize, int numsegs,
     size_t notesz)
 {
 	Elf_Ehdr *ehdr;
 	Elf_Phdr *phdr;
 	Elf_Shdr *shdr;
 	struct phdr_closure phc;
 
 	ehdr = (Elf_Ehdr *)hdr;
 
 	ehdr->e_ident[EI_MAG0] = ELFMAG0;
 	ehdr->e_ident[EI_MAG1] = ELFMAG1;
 	ehdr->e_ident[EI_MAG2] = ELFMAG2;
 	ehdr->e_ident[EI_MAG3] = ELFMAG3;
 	ehdr->e_ident[EI_CLASS] = ELF_CLASS;
 	ehdr->e_ident[EI_DATA] = ELF_DATA;
 	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
 	ehdr->e_ident[EI_OSABI] = ELFOSABI_FREEBSD;
 	ehdr->e_ident[EI_ABIVERSION] = 0;
 	ehdr->e_ident[EI_PAD] = 0;
 	ehdr->e_type = ET_CORE;
 	ehdr->e_machine = td->td_proc->p_elf_machine;
 	ehdr->e_version = EV_CURRENT;
 	ehdr->e_entry = 0;
 	ehdr->e_phoff = sizeof(Elf_Ehdr);
 	ehdr->e_flags = td->td_proc->p_elf_flags;
 	ehdr->e_ehsize = sizeof(Elf_Ehdr);
 	ehdr->e_phentsize = sizeof(Elf_Phdr);
 	ehdr->e_shentsize = sizeof(Elf_Shdr);
 	ehdr->e_shstrndx = SHN_UNDEF;
 	if (numsegs + 1 < PN_XNUM) {
 		ehdr->e_phnum = numsegs + 1;
 		ehdr->e_shnum = 0;
 	} else {
 		ehdr->e_phnum = PN_XNUM;
 		ehdr->e_shnum = 1;
 
 		ehdr->e_shoff = ehdr->e_phoff +
 		    (numsegs + 1) * ehdr->e_phentsize;
 		KASSERT(ehdr->e_shoff == hdrsize - sizeof(Elf_Shdr),
 		    ("e_shoff: %zu, hdrsize - shdr: %zu",
 		     (size_t)ehdr->e_shoff, hdrsize - sizeof(Elf_Shdr)));
 
 		shdr = (Elf_Shdr *)((char *)hdr + ehdr->e_shoff);
 		memset(shdr, 0, sizeof(*shdr));
 		/*
 		 * A special first section is used to hold large segment and
 		 * section counts.  This was proposed by Sun Microsystems in
 		 * Solaris and has been adopted by Linux; the standard ELF
 		 * tools are already familiar with the technique.
 		 *
 		 * See table 7-7 of the Solaris "Linker and Libraries Guide"
 		 * (or 12-7 depending on the version of the document) for more
 		 * details.
 		 */
 		shdr->sh_type = SHT_NULL;
 		shdr->sh_size = ehdr->e_shnum;
 		shdr->sh_link = ehdr->e_shstrndx;
 		shdr->sh_info = numsegs + 1;
 	}
 
 	/*
 	 * Fill in the program header entries.
 	 */
 	phdr = (Elf_Phdr *)((char *)hdr + ehdr->e_phoff);
 
 	/* The note segement. */
 	phdr->p_type = PT_NOTE;
 	phdr->p_offset = hdrsize;
 	phdr->p_vaddr = 0;
 	phdr->p_paddr = 0;
 	phdr->p_filesz = notesz;
 	phdr->p_memsz = 0;
 	phdr->p_flags = PF_R;
 	phdr->p_align = ELF_NOTE_ROUNDSIZE;
 	phdr++;
 
 	/* All the writable segments from the program. */
 	phc.phdr = phdr;
 	phc.offset = round_page(hdrsize + notesz);
 	each_dumpable_segment(td, cb_put_phdr, &phc);
 }
 
 static size_t
 register_note(struct note_info_list *list, int type, outfunc_t out, void *arg)
 {
 	struct note_info *ninfo;
 	size_t size, notesize;
 
 	size = 0;
 	out(arg, NULL, &size);
 	ninfo = malloc(sizeof(*ninfo), M_TEMP, M_ZERO | M_WAITOK);
 	ninfo->type = type;
 	ninfo->outfunc = out;
 	ninfo->outarg = arg;
 	ninfo->outsize = size;
 	TAILQ_INSERT_TAIL(list, ninfo, link);
 
 	if (type == -1)
 		return (size);
 
 	notesize = sizeof(Elf_Note) +		/* note header */
 	    roundup2(sizeof(FREEBSD_ABI_VENDOR), ELF_NOTE_ROUNDSIZE) +
 						/* note name */
 	    roundup2(size, ELF_NOTE_ROUNDSIZE);	/* note description */
 
 	return (notesize);
 }
 
 static size_t
 append_note_data(const void *src, void *dst, size_t len)
 {
 	size_t padded_len;
 
 	padded_len = roundup2(len, ELF_NOTE_ROUNDSIZE);
 	if (dst != NULL) {
 		bcopy(src, dst, len);
 		bzero((char *)dst + len, padded_len - len);
 	}
 	return (padded_len);
 }
 
 size_t
 __elfN(populate_note)(int type, void *src, void *dst, size_t size, void **descp)
 {
 	Elf_Note *note;
 	char *buf;
 	size_t notesize;
 
 	buf = dst;
 	if (buf != NULL) {
 		note = (Elf_Note *)buf;
 		note->n_namesz = sizeof(FREEBSD_ABI_VENDOR);
 		note->n_descsz = size;
 		note->n_type = type;
 		buf += sizeof(*note);
 		buf += append_note_data(FREEBSD_ABI_VENDOR, buf,
 		    sizeof(FREEBSD_ABI_VENDOR));
 		append_note_data(src, buf, size);
 		if (descp != NULL)
 			*descp = buf;
 	}
 
 	notesize = sizeof(Elf_Note) +		/* note header */
 	    roundup2(sizeof(FREEBSD_ABI_VENDOR), ELF_NOTE_ROUNDSIZE) +
 						/* note name */
 	    roundup2(size, ELF_NOTE_ROUNDSIZE);	/* note description */
 
 	return (notesize);
 }
 
 static void
 __elfN(putnote)(struct note_info *ninfo, struct sbuf *sb)
 {
 	Elf_Note note;
 	ssize_t old_len, sect_len;
 	size_t new_len, descsz, i;
 
 	if (ninfo->type == -1) {
 		ninfo->outfunc(ninfo->outarg, sb, &ninfo->outsize);
 		return;
 	}
 
 	note.n_namesz = sizeof(FREEBSD_ABI_VENDOR);
 	note.n_descsz = ninfo->outsize;
 	note.n_type = ninfo->type;
 
 	sbuf_bcat(sb, &note, sizeof(note));
 	sbuf_start_section(sb, &old_len);
 	sbuf_bcat(sb, FREEBSD_ABI_VENDOR, sizeof(FREEBSD_ABI_VENDOR));
 	sbuf_end_section(sb, old_len, ELF_NOTE_ROUNDSIZE, 0);
 	if (note.n_descsz == 0)
 		return;
 	sbuf_start_section(sb, &old_len);
 	ninfo->outfunc(ninfo->outarg, sb, &ninfo->outsize);
 	sect_len = sbuf_end_section(sb, old_len, ELF_NOTE_ROUNDSIZE, 0);
 	if (sect_len < 0)
 		return;
 
 	new_len = (size_t)sect_len;
 	descsz = roundup(note.n_descsz, ELF_NOTE_ROUNDSIZE);
 	if (new_len < descsz) {
 		/*
 		 * It is expected that individual note emitters will correctly
 		 * predict their expected output size and fill up to that size
 		 * themselves, padding in a format-specific way if needed.
 		 * However, in case they don't, just do it here with zeros.
 		 */
 		for (i = 0; i < descsz - new_len; i++)
 			sbuf_putc(sb, 0);
 	} else if (new_len > descsz) {
 		/*
 		 * We can't always truncate sb -- we may have drained some
 		 * of it already.
 		 */
 		KASSERT(new_len == descsz, ("%s: Note type %u changed as we "
 		    "read it (%zu > %zu).  Since it is longer than "
 		    "expected, this coredump's notes are corrupt.  THIS "
 		    "IS A BUG in the note_procstat routine for type %u.\n",
 		    __func__, (unsigned)note.n_type, new_len, descsz,
 		    (unsigned)note.n_type));
 	}
 }
 
 /*
  * Miscellaneous note out functions.
  */
 
 #if defined(COMPAT_FREEBSD32) && __ELF_WORD_SIZE == 32
 #include <compat/freebsd32/freebsd32.h>
 
 typedef struct prstatus32 elf_prstatus_t;
 typedef struct prpsinfo32 elf_prpsinfo_t;
 typedef struct fpreg32 elf_prfpregset_t;
 typedef struct fpreg32 elf_fpregset_t;
 typedef struct reg32 elf_gregset_t;
 typedef struct thrmisc32 elf_thrmisc_t;
 #define ELF_KERN_PROC_MASK	KERN_PROC_MASK32
 typedef struct kinfo_proc32 elf_kinfo_proc_t;
 typedef uint32_t elf_ps_strings_t;
 #else
 typedef prstatus_t elf_prstatus_t;
 typedef prpsinfo_t elf_prpsinfo_t;
 typedef prfpregset_t elf_prfpregset_t;
 typedef prfpregset_t elf_fpregset_t;
 typedef gregset_t elf_gregset_t;
 typedef thrmisc_t elf_thrmisc_t;
 #define ELF_KERN_PROC_MASK	0
 typedef struct kinfo_proc elf_kinfo_proc_t;
 typedef vm_offset_t elf_ps_strings_t;
 #endif
 
 static void
 __elfN(note_prpsinfo)(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct sbuf sbarg;
 	size_t len;
 	char *cp, *end;
 	struct proc *p;
 	elf_prpsinfo_t *psinfo;
 	int error;
 
 	p = (struct proc *)arg;
 	if (sb != NULL) {
 		KASSERT(*sizep == sizeof(*psinfo), ("invalid size"));
 		psinfo = malloc(sizeof(*psinfo), M_TEMP, M_ZERO | M_WAITOK);
 		psinfo->pr_version = PRPSINFO_VERSION;
 		psinfo->pr_psinfosz = sizeof(elf_prpsinfo_t);
 		strlcpy(psinfo->pr_fname, p->p_comm, sizeof(psinfo->pr_fname));
 		PROC_LOCK(p);
 		if (p->p_args != NULL) {
 			len = sizeof(psinfo->pr_psargs) - 1;
 			if (len > p->p_args->ar_length)
 				len = p->p_args->ar_length;
 			memcpy(psinfo->pr_psargs, p->p_args->ar_args, len);
 			PROC_UNLOCK(p);
 			error = 0;
 		} else {
 			_PHOLD(p);
 			PROC_UNLOCK(p);
 			sbuf_new(&sbarg, psinfo->pr_psargs,
 			    sizeof(psinfo->pr_psargs), SBUF_FIXEDLEN);
 			error = proc_getargv(curthread, p, &sbarg);
 			PRELE(p);
 			if (sbuf_finish(&sbarg) == 0)
 				len = sbuf_len(&sbarg) - 1;
 			else
 				len = sizeof(psinfo->pr_psargs) - 1;
 			sbuf_delete(&sbarg);
 		}
 		if (error || len == 0)
 			strlcpy(psinfo->pr_psargs, p->p_comm,
 			    sizeof(psinfo->pr_psargs));
 		else {
 			KASSERT(len < sizeof(psinfo->pr_psargs),
 			    ("len is too long: %zu vs %zu", len,
 			    sizeof(psinfo->pr_psargs)));
 			cp = psinfo->pr_psargs;
 			end = cp + len - 1;
 			for (;;) {
 				cp = memchr(cp, '\0', end - cp);
 				if (cp == NULL)
 					break;
 				*cp = ' ';
 			}
 		}
 		psinfo->pr_pid = p->p_pid;
 		sbuf_bcat(sb, psinfo, sizeof(*psinfo));
 		free(psinfo, M_TEMP);
 	}
 	*sizep = sizeof(*psinfo);
 }
 
 static void
 __elfN(note_prstatus)(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct thread *td;
 	elf_prstatus_t *status;
 
 	td = (struct thread *)arg;
 	if (sb != NULL) {
 		KASSERT(*sizep == sizeof(*status), ("invalid size"));
 		status = malloc(sizeof(*status), M_TEMP, M_ZERO | M_WAITOK);
 		status->pr_version = PRSTATUS_VERSION;
 		status->pr_statussz = sizeof(elf_prstatus_t);
 		status->pr_gregsetsz = sizeof(elf_gregset_t);
 		status->pr_fpregsetsz = sizeof(elf_fpregset_t);
 		status->pr_osreldate = osreldate;
 		status->pr_cursig = td->td_proc->p_sig;
 		status->pr_pid = td->td_tid;
 #if defined(COMPAT_FREEBSD32) && __ELF_WORD_SIZE == 32
 		fill_regs32(td, &status->pr_reg);
 #else
 		fill_regs(td, &status->pr_reg);
 #endif
 		sbuf_bcat(sb, status, sizeof(*status));
 		free(status, M_TEMP);
 	}
 	*sizep = sizeof(*status);
 }
 
 static void
 __elfN(note_fpregset)(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct thread *td;
 	elf_prfpregset_t *fpregset;
 
 	td = (struct thread *)arg;
 	if (sb != NULL) {
 		KASSERT(*sizep == sizeof(*fpregset), ("invalid size"));
 		fpregset = malloc(sizeof(*fpregset), M_TEMP, M_ZERO | M_WAITOK);
 #if defined(COMPAT_FREEBSD32) && __ELF_WORD_SIZE == 32
 		fill_fpregs32(td, fpregset);
 #else
 		fill_fpregs(td, fpregset);
 #endif
 		sbuf_bcat(sb, fpregset, sizeof(*fpregset));
 		free(fpregset, M_TEMP);
 	}
 	*sizep = sizeof(*fpregset);
 }
 
 static void
 __elfN(note_thrmisc)(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct thread *td;
 	elf_thrmisc_t thrmisc;
 
 	td = (struct thread *)arg;
 	if (sb != NULL) {
 		KASSERT(*sizep == sizeof(thrmisc), ("invalid size"));
 		bzero(&thrmisc._pad, sizeof(thrmisc._pad));
 		strcpy(thrmisc.pr_tname, td->td_name);
 		sbuf_bcat(sb, &thrmisc, sizeof(thrmisc));
 	}
 	*sizep = sizeof(thrmisc);
 }
 
 /*
  * Allow for MD specific notes, as well as any MD
  * specific preparations for writing MI notes.
  */
 static void
 __elfN(note_threadmd)(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct thread *td;
 	void *buf;
 	size_t size;
 
 	td = (struct thread *)arg;
 	size = *sizep;
 	if (size != 0 && sb != NULL)
 		buf = malloc(size, M_TEMP, M_ZERO | M_WAITOK);
 	else
 		buf = NULL;
 	size = 0;
 	__elfN(dump_thread)(td, buf, &size);
 	KASSERT(sb == NULL || *sizep == size, ("invalid size"));
 	if (size != 0 && sb != NULL)
 		sbuf_bcat(sb, buf, size);
 	free(buf, M_TEMP);
 	*sizep = size;
 }
 
 #ifdef KINFO_PROC_SIZE
 CTASSERT(sizeof(struct kinfo_proc) == KINFO_PROC_SIZE);
 #endif
 
 static void
 __elfN(note_procstat_proc)(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct proc *p;
 	size_t size;
 	int structsize;
 
 	p = (struct proc *)arg;
 	size = sizeof(structsize) + p->p_numthreads *
 	    sizeof(elf_kinfo_proc_t);
 
 	if (sb != NULL) {
 		KASSERT(*sizep == size, ("invalid size"));
 		structsize = sizeof(elf_kinfo_proc_t);
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		sx_slock(&proctree_lock);
 		PROC_LOCK(p);
 		kern_proc_out(p, sb, ELF_KERN_PROC_MASK);
 		sx_sunlock(&proctree_lock);
 	}
 	*sizep = size;
 }
 
 #ifdef KINFO_FILE_SIZE
 CTASSERT(sizeof(struct kinfo_file) == KINFO_FILE_SIZE);
 #endif
 
 static void
 note_procstat_files(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct proc *p;
 	size_t size, sect_sz, i;
 	ssize_t start_len, sect_len;
 	int structsize, filedesc_flags;
 
 	if (coredump_pack_fileinfo)
 		filedesc_flags = KERN_FILEDESC_PACK_KINFO;
 	else
 		filedesc_flags = 0;
 
 	p = (struct proc *)arg;
 	structsize = sizeof(struct kinfo_file);
 	if (sb == NULL) {
 		size = 0;
 		sb = sbuf_new(NULL, NULL, 128, SBUF_FIXEDLEN);
 		sbuf_set_drain(sb, sbuf_drain_count, &size);
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		PROC_LOCK(p);
 		kern_proc_filedesc_out(p, sb, -1, filedesc_flags);
 		sbuf_finish(sb);
 		sbuf_delete(sb);
 		*sizep = size;
 	} else {
 		sbuf_start_section(sb, &start_len);
 
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		PROC_LOCK(p);
 		kern_proc_filedesc_out(p, sb, *sizep - sizeof(structsize),
 		    filedesc_flags);
 
 		sect_len = sbuf_end_section(sb, start_len, 0, 0);
 		if (sect_len < 0)
 			return;
 		sect_sz = sect_len;
 
 		KASSERT(sect_sz <= *sizep,
 		    ("kern_proc_filedesc_out did not respect maxlen; "
 		     "requested %zu, got %zu", *sizep - sizeof(structsize),
 		     sect_sz - sizeof(structsize)));
 
 		for (i = 0; i < *sizep - sect_sz && sb->s_error == 0; i++)
 			sbuf_putc(sb, 0);
 	}
 }
 
 #ifdef KINFO_VMENTRY_SIZE
 CTASSERT(sizeof(struct kinfo_vmentry) == KINFO_VMENTRY_SIZE);
 #endif
 
 static void
 note_procstat_vmmap(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct proc *p;
 	size_t size;
 	int structsize, vmmap_flags;
 
 	if (coredump_pack_vmmapinfo)
 		vmmap_flags = KERN_VMMAP_PACK_KINFO;
 	else
 		vmmap_flags = 0;
 
 	p = (struct proc *)arg;
 	structsize = sizeof(struct kinfo_vmentry);
 	if (sb == NULL) {
 		size = 0;
 		sb = sbuf_new(NULL, NULL, 128, SBUF_FIXEDLEN);
 		sbuf_set_drain(sb, sbuf_drain_count, &size);
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		PROC_LOCK(p);
 		kern_proc_vmmap_out(p, sb, -1, vmmap_flags);
 		sbuf_finish(sb);
 		sbuf_delete(sb);
 		*sizep = size;
 	} else {
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		PROC_LOCK(p);
 		kern_proc_vmmap_out(p, sb, *sizep - sizeof(structsize),
 		    vmmap_flags);
 	}
 }
 
 static void
 note_procstat_groups(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct proc *p;
 	size_t size;
 	int structsize;
 
 	p = (struct proc *)arg;
 	size = sizeof(structsize) + p->p_ucred->cr_ngroups * sizeof(gid_t);
 	if (sb != NULL) {
 		KASSERT(*sizep == size, ("invalid size"));
 		structsize = sizeof(gid_t);
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		sbuf_bcat(sb, p->p_ucred->cr_groups, p->p_ucred->cr_ngroups *
 		    sizeof(gid_t));
 	}
 	*sizep = size;
 }
 
 static void
 note_procstat_umask(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct proc *p;
 	size_t size;
 	int structsize;
 
 	p = (struct proc *)arg;
 	size = sizeof(structsize) + sizeof(p->p_fd->fd_cmask);
 	if (sb != NULL) {
 		KASSERT(*sizep == size, ("invalid size"));
 		structsize = sizeof(p->p_fd->fd_cmask);
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		sbuf_bcat(sb, &p->p_fd->fd_cmask, sizeof(p->p_fd->fd_cmask));
 	}
 	*sizep = size;
 }
 
 static void
 note_procstat_rlimit(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct proc *p;
 	struct rlimit rlim[RLIM_NLIMITS];
 	size_t size;
 	int structsize, i;
 
 	p = (struct proc *)arg;
 	size = sizeof(structsize) + sizeof(rlim);
 	if (sb != NULL) {
 		KASSERT(*sizep == size, ("invalid size"));
 		structsize = sizeof(rlim);
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		PROC_LOCK(p);
 		for (i = 0; i < RLIM_NLIMITS; i++)
 			lim_rlimit_proc(p, i, &rlim[i]);
 		PROC_UNLOCK(p);
 		sbuf_bcat(sb, rlim, sizeof(rlim));
 	}
 	*sizep = size;
 }
 
 static void
 note_procstat_osrel(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct proc *p;
 	size_t size;
 	int structsize;
 
 	p = (struct proc *)arg;
 	size = sizeof(structsize) + sizeof(p->p_osrel);
 	if (sb != NULL) {
 		KASSERT(*sizep == size, ("invalid size"));
 		structsize = sizeof(p->p_osrel);
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		sbuf_bcat(sb, &p->p_osrel, sizeof(p->p_osrel));
 	}
 	*sizep = size;
 }
 
 static void
 __elfN(note_procstat_psstrings)(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct proc *p;
 	elf_ps_strings_t ps_strings;
 	size_t size;
 	int structsize;
 
 	p = (struct proc *)arg;
 	size = sizeof(structsize) + sizeof(ps_strings);
 	if (sb != NULL) {
 		KASSERT(*sizep == size, ("invalid size"));
 		structsize = sizeof(ps_strings);
 #if defined(COMPAT_FREEBSD32) && __ELF_WORD_SIZE == 32
 		ps_strings = PTROUT(p->p_sysent->sv_psstrings);
 #else
 		ps_strings = p->p_sysent->sv_psstrings;
 #endif
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		sbuf_bcat(sb, &ps_strings, sizeof(ps_strings));
 	}
 	*sizep = size;
 }
 
 static void
 __elfN(note_procstat_auxv)(void *arg, struct sbuf *sb, size_t *sizep)
 {
 	struct proc *p;
 	size_t size;
 	int structsize;
 
 	p = (struct proc *)arg;
 	if (sb == NULL) {
 		size = 0;
 		sb = sbuf_new(NULL, NULL, 128, SBUF_FIXEDLEN);
 		sbuf_set_drain(sb, sbuf_drain_count, &size);
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		PHOLD(p);
 		proc_getauxv(curthread, p, sb);
 		PRELE(p);
 		sbuf_finish(sb);
 		sbuf_delete(sb);
 		*sizep = size;
 	} else {
 		structsize = sizeof(Elf_Auxinfo);
 		sbuf_bcat(sb, &structsize, sizeof(structsize));
 		PHOLD(p);
 		proc_getauxv(curthread, p, sb);
 		PRELE(p);
 	}
 }
 
 static boolean_t
 __elfN(parse_notes)(struct image_params *imgp, Elf_Brandnote *checknote,
     int32_t *osrel, const Elf_Phdr *pnote)
 {
 	const Elf_Note *note, *note0, *note_end;
 	const char *note_name;
 	char *buf;
 	int i, error;
 	boolean_t res;
 
 	/* We need some limit, might as well use PAGE_SIZE. */
 	if (pnote == NULL || pnote->p_filesz > PAGE_SIZE)
 		return (FALSE);
 	ASSERT_VOP_LOCKED(imgp->vp, "parse_notes");
 	if (pnote->p_offset > PAGE_SIZE ||
 	    pnote->p_filesz > PAGE_SIZE - pnote->p_offset) {
 		VOP_UNLOCK(imgp->vp, 0);
 		buf = malloc(pnote->p_filesz, M_TEMP, M_WAITOK);
 		vn_lock(imgp->vp, LK_EXCLUSIVE | LK_RETRY);
 		error = vn_rdwr(UIO_READ, imgp->vp, buf, pnote->p_filesz,
 		    pnote->p_offset, UIO_SYSSPACE, IO_NODELOCKED,
 		    curthread->td_ucred, NOCRED, NULL, curthread);
 		if (error != 0) {
 			uprintf("i/o error PT_NOTE\n");
 			res = FALSE;
 			goto ret;
 		}
 		note = note0 = (const Elf_Note *)buf;
 		note_end = (const Elf_Note *)(buf + pnote->p_filesz);
 	} else {
 		note = note0 = (const Elf_Note *)(imgp->image_header +
 		    pnote->p_offset);
 		note_end = (const Elf_Note *)(imgp->image_header +
 		    pnote->p_offset + pnote->p_filesz);
 		buf = NULL;
 	}
 	for (i = 0; i < 100 && note >= note0 && note < note_end; i++) {
 		if (!aligned(note, Elf32_Addr) || (const char *)note_end -
 		    (const char *)note < sizeof(Elf_Note)) {
 			res = FALSE;
 			goto ret;
 		}
 		if (note->n_namesz != checknote->hdr.n_namesz ||
 		    note->n_descsz != checknote->hdr.n_descsz ||
 		    note->n_type != checknote->hdr.n_type)
 			goto nextnote;
 		note_name = (const char *)(note + 1);
 		if (note_name + checknote->hdr.n_namesz >=
 		    (const char *)note_end || strncmp(checknote->vendor,
 		    note_name, checknote->hdr.n_namesz) != 0)
 			goto nextnote;
 
 		/*
 		 * Fetch the osreldate for binary
 		 * from the ELF OSABI-note if necessary.
 		 */
 		if ((checknote->flags & BN_TRANSLATE_OSREL) != 0 &&
 		    checknote->trans_osrel != NULL) {
 			res = checknote->trans_osrel(note, osrel);
 			goto ret;
 		}
 		res = TRUE;
 		goto ret;
 nextnote:
 		note = (const Elf_Note *)((const char *)(note + 1) +
 		    roundup2(note->n_namesz, ELF_NOTE_ROUNDSIZE) +
 		    roundup2(note->n_descsz, ELF_NOTE_ROUNDSIZE));
 	}
 	res = FALSE;
 ret:
 	free(buf, M_TEMP);
 	return (res);
 }
 
 /*
  * Try to find the appropriate ABI-note section for checknote,
  * fetch the osreldate for binary from the ELF OSABI-note. Only the
  * first page of the image is searched, the same as for headers.
  */
 static boolean_t
 __elfN(check_note)(struct image_params *imgp, Elf_Brandnote *checknote,
     int32_t *osrel)
 {
 	const Elf_Phdr *phdr;
 	const Elf_Ehdr *hdr;
 	int i;
 
 	hdr = (const Elf_Ehdr *)imgp->image_header;
 	phdr = (const Elf_Phdr *)(imgp->image_header + hdr->e_phoff);
 
 	for (i = 0; i < hdr->e_phnum; i++) {
 		if (phdr[i].p_type == PT_NOTE &&
 		    __elfN(parse_notes)(imgp, checknote, osrel, &phdr[i]))
 			return (TRUE);
 	}
 	return (FALSE);
 
 }
 
 /*
  * Tell kern_execve.c about it, with a little help from the linker.
  */
 static struct execsw __elfN(execsw) = {
 	__CONCAT(exec_, __elfN(imgact)),
 	__XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE))
 };
 EXEC_SET(__CONCAT(elf, __ELF_WORD_SIZE), __elfN(execsw));
 
 static vm_prot_t
 __elfN(trans_prot)(Elf_Word flags)
 {
 	vm_prot_t prot;
 
 	prot = 0;
 	if (flags & PF_X)
 		prot |= VM_PROT_EXECUTE;
 	if (flags & PF_W)
 		prot |= VM_PROT_WRITE;
 	if (flags & PF_R)
 		prot |= VM_PROT_READ;
 #if __ELF_WORD_SIZE == 32
 #if defined(__amd64__)
 	if (i386_read_exec && (flags & PF_R))
 		prot |= VM_PROT_EXECUTE;
 #endif
 #endif
 	return (prot);
 }
 
 static Elf_Word
 __elfN(untrans_prot)(vm_prot_t prot)
 {
 	Elf_Word flags;
 
 	flags = 0;
 	if (prot & VM_PROT_EXECUTE)
 		flags |= PF_X;
 	if (prot & VM_PROT_READ)
 		flags |= PF_R;
 	if (prot & VM_PROT_WRITE)
 		flags |= PF_W;
 	return (flags);
 }
Index: projects/clang400-import/sys/kern/subr_gtaskqueue.c
===================================================================
--- projects/clang400-import/sys/kern/subr_gtaskqueue.c	(revision 314522)
+++ projects/clang400-import/sys/kern/subr_gtaskqueue.c	(revision 314523)
@@ -1,963 +1,965 @@
 /*-
  * Copyright (c) 2000 Doug Rabson
  * Copyright (c) 2014 Jeff Roberson
  * Copyright (c) 2016 Matthew Macy
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
 #include <sys/cpuset.h>
 #include <sys/interrupt.h>
 #include <sys/kernel.h>
 #include <sys/kthread.h>
 #include <sys/libkern.h>
 #include <sys/limits.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/gtaskqueue.h>
 #include <sys/unistd.h>
 #include <machine/stdarg.h>
 
 static MALLOC_DEFINE(M_GTASKQUEUE, "taskqueue", "Task Queues");
 static void	gtaskqueue_thread_enqueue(void *);
 static void	gtaskqueue_thread_loop(void *arg);
 
+TASKQGROUP_DEFINE(softirq, mp_ncpus, 1);
+
 struct gtaskqueue_busy {
 	struct gtask	*tb_running;
 	TAILQ_ENTRY(gtaskqueue_busy) tb_link;
 };
 
 static struct gtask * const TB_DRAIN_WAITER = (struct gtask *)0x1;
 
 struct gtaskqueue {
 	STAILQ_HEAD(, gtask)	tq_queue;
 	gtaskqueue_enqueue_fn	tq_enqueue;
 	void			*tq_context;
 	char			*tq_name;
 	TAILQ_HEAD(, gtaskqueue_busy) tq_active;
 	struct mtx		tq_mutex;
 	struct thread		**tq_threads;
 	int			tq_tcount;
 	int			tq_spin;
 	int			tq_flags;
 	int			tq_callouts;
 	taskqueue_callback_fn	tq_callbacks[TASKQUEUE_NUM_CALLBACKS];
 	void			*tq_cb_contexts[TASKQUEUE_NUM_CALLBACKS];
 };
 
 #define	TQ_FLAGS_ACTIVE		(1 << 0)
 #define	TQ_FLAGS_BLOCKED	(1 << 1)
 #define	TQ_FLAGS_UNLOCKED_ENQUEUE	(1 << 2)
 
 #define	DT_CALLOUT_ARMED	(1 << 0)
 
 #define	TQ_LOCK(tq)							\
 	do {								\
 		if ((tq)->tq_spin)					\
 			mtx_lock_spin(&(tq)->tq_mutex);			\
 		else							\
 			mtx_lock(&(tq)->tq_mutex);			\
 	} while (0)
 #define	TQ_ASSERT_LOCKED(tq)	mtx_assert(&(tq)->tq_mutex, MA_OWNED)
 
 #define	TQ_UNLOCK(tq)							\
 	do {								\
 		if ((tq)->tq_spin)					\
 			mtx_unlock_spin(&(tq)->tq_mutex);		\
 		else							\
 			mtx_unlock(&(tq)->tq_mutex);			\
 	} while (0)
 #define	TQ_ASSERT_UNLOCKED(tq)	mtx_assert(&(tq)->tq_mutex, MA_NOTOWNED)
 
 #ifdef INVARIANTS
 static void
 gtask_dump(struct gtask *gtask)
 {
 	printf("gtask: %p ta_flags=%x ta_priority=%d ta_func=%p ta_context=%p\n",
 	       gtask, gtask->ta_flags, gtask->ta_priority, gtask->ta_func, gtask->ta_context);
 }
 #endif
 
 static __inline int
 TQ_SLEEP(struct gtaskqueue *tq, void *p, struct mtx *m, int pri, const char *wm,
     int t)
 {
 	if (tq->tq_spin)
 		return (msleep_spin(p, m, wm, t));
 	return (msleep(p, m, pri, wm, t));
 }
 
 static struct gtaskqueue *
 _gtaskqueue_create(const char *name, int mflags,
 		 taskqueue_enqueue_fn enqueue, void *context,
 		 int mtxflags, const char *mtxname __unused)
 {
 	struct gtaskqueue *queue;
 	char *tq_name;
 
 	tq_name = malloc(TASKQUEUE_NAMELEN, M_GTASKQUEUE, mflags | M_ZERO);
 	if (!tq_name)
 		return (NULL);
 
 	snprintf(tq_name, TASKQUEUE_NAMELEN, "%s", (name) ? name : "taskqueue");
 
 	queue = malloc(sizeof(struct gtaskqueue), M_GTASKQUEUE, mflags | M_ZERO);
 	if (!queue)
 		return (NULL);
 
 	STAILQ_INIT(&queue->tq_queue);
 	TAILQ_INIT(&queue->tq_active);
 	queue->tq_enqueue = enqueue;
 	queue->tq_context = context;
 	queue->tq_name = tq_name;
 	queue->tq_spin = (mtxflags & MTX_SPIN) != 0;
 	queue->tq_flags |= TQ_FLAGS_ACTIVE;
 	if (enqueue == gtaskqueue_thread_enqueue)
 		queue->tq_flags |= TQ_FLAGS_UNLOCKED_ENQUEUE;
 	mtx_init(&queue->tq_mutex, tq_name, NULL, mtxflags);
 
 	return (queue);
 }
 
 
 /*
  * Signal a taskqueue thread to terminate.
  */
 static void
 gtaskqueue_terminate(struct thread **pp, struct gtaskqueue *tq)
 {
 
 	while (tq->tq_tcount > 0 || tq->tq_callouts > 0) {
 		wakeup(tq);
 		TQ_SLEEP(tq, pp, &tq->tq_mutex, PWAIT, "taskqueue_destroy", 0);
 	}
 }
 
 static void
 gtaskqueue_free(struct gtaskqueue *queue)
 {
 
 	TQ_LOCK(queue);
 	queue->tq_flags &= ~TQ_FLAGS_ACTIVE;
 	gtaskqueue_terminate(queue->tq_threads, queue);
 	KASSERT(TAILQ_EMPTY(&queue->tq_active), ("Tasks still running?"));
 	KASSERT(queue->tq_callouts == 0, ("Armed timeout tasks"));
 	mtx_destroy(&queue->tq_mutex);
 	free(queue->tq_threads, M_GTASKQUEUE);
 	free(queue->tq_name, M_GTASKQUEUE);
 	free(queue, M_GTASKQUEUE);
 }
 
 int
 grouptaskqueue_enqueue(struct gtaskqueue *queue, struct gtask *gtask)
 {
 #ifdef INVARIANTS
 	if (queue == NULL) {
 		gtask_dump(gtask);
 		panic("queue == NULL");
 	}
 #endif
 	TQ_LOCK(queue);
 	if (gtask->ta_flags & TASK_ENQUEUED) {
 		TQ_UNLOCK(queue);
 		return (0);
 	}
 	STAILQ_INSERT_TAIL(&queue->tq_queue, gtask, ta_link);
 	gtask->ta_flags |= TASK_ENQUEUED;
 	TQ_UNLOCK(queue);
 	if ((queue->tq_flags & TQ_FLAGS_BLOCKED) == 0)
 		queue->tq_enqueue(queue->tq_context);
 	return (0);
 }
 
 static void
 gtaskqueue_task_nop_fn(void *context)
 {
 }
 
 /*
  * Block until all currently queued tasks in this taskqueue
  * have begun execution.  Tasks queued during execution of
  * this function are ignored.
  */
 static void
 gtaskqueue_drain_tq_queue(struct gtaskqueue *queue)
 {
 	struct gtask t_barrier;
 
 	if (STAILQ_EMPTY(&queue->tq_queue))
 		return;
 
 	/*
 	 * Enqueue our barrier after all current tasks, but with
 	 * the highest priority so that newly queued tasks cannot
 	 * pass it.  Because of the high priority, we can not use
 	 * taskqueue_enqueue_locked directly (which drops the lock
 	 * anyway) so just insert it at tail while we have the
 	 * queue lock.
 	 */
 	GTASK_INIT(&t_barrier, 0, USHRT_MAX, gtaskqueue_task_nop_fn, &t_barrier);
 	STAILQ_INSERT_TAIL(&queue->tq_queue, &t_barrier, ta_link);
 	t_barrier.ta_flags |= TASK_ENQUEUED;
 
 	/*
 	 * Once the barrier has executed, all previously queued tasks
 	 * have completed or are currently executing.
 	 */
 	while (t_barrier.ta_flags & TASK_ENQUEUED)
 		TQ_SLEEP(queue, &t_barrier, &queue->tq_mutex, PWAIT, "-", 0);
 }
 
 /*
  * Block until all currently executing tasks for this taskqueue
  * complete.  Tasks that begin execution during the execution
  * of this function are ignored.
  */
 static void
 gtaskqueue_drain_tq_active(struct gtaskqueue *queue)
 {
 	struct gtaskqueue_busy tb_marker, *tb_first;
 
 	if (TAILQ_EMPTY(&queue->tq_active))
 		return;
 
 	/* Block taskq_terminate().*/
 	queue->tq_callouts++;
 
 	/*
 	 * Wait for all currently executing taskqueue threads
 	 * to go idle.
 	 */
 	tb_marker.tb_running = TB_DRAIN_WAITER;
 	TAILQ_INSERT_TAIL(&queue->tq_active, &tb_marker, tb_link);
 	while (TAILQ_FIRST(&queue->tq_active) != &tb_marker)
 		TQ_SLEEP(queue, &tb_marker, &queue->tq_mutex, PWAIT, "-", 0);
 	TAILQ_REMOVE(&queue->tq_active, &tb_marker, tb_link);
 
 	/*
 	 * Wakeup any other drain waiter that happened to queue up
 	 * without any intervening active thread.
 	 */
 	tb_first = TAILQ_FIRST(&queue->tq_active);
 	if (tb_first != NULL && tb_first->tb_running == TB_DRAIN_WAITER)
 		wakeup(tb_first);
 
 	/* Release taskqueue_terminate(). */
 	queue->tq_callouts--;
 	if ((queue->tq_flags & TQ_FLAGS_ACTIVE) == 0)
 		wakeup_one(queue->tq_threads);
 }
 
 void
 gtaskqueue_block(struct gtaskqueue *queue)
 {
 
 	TQ_LOCK(queue);
 	queue->tq_flags |= TQ_FLAGS_BLOCKED;
 	TQ_UNLOCK(queue);
 }
 
 void
 gtaskqueue_unblock(struct gtaskqueue *queue)
 {
 
 	TQ_LOCK(queue);
 	queue->tq_flags &= ~TQ_FLAGS_BLOCKED;
 	if (!STAILQ_EMPTY(&queue->tq_queue))
 		queue->tq_enqueue(queue->tq_context);
 	TQ_UNLOCK(queue);
 }
 
 static void
 gtaskqueue_run_locked(struct gtaskqueue *queue)
 {
 	struct gtaskqueue_busy tb;
 	struct gtaskqueue_busy *tb_first;
 	struct gtask *gtask;
 
 	KASSERT(queue != NULL, ("tq is NULL"));
 	TQ_ASSERT_LOCKED(queue);
 	tb.tb_running = NULL;
 
 	while (STAILQ_FIRST(&queue->tq_queue)) {
 		TAILQ_INSERT_TAIL(&queue->tq_active, &tb, tb_link);
 
 		/*
 		 * Carefully remove the first task from the queue and
 		 * clear its TASK_ENQUEUED flag
 		 */
 		gtask = STAILQ_FIRST(&queue->tq_queue);
 		KASSERT(gtask != NULL, ("task is NULL"));
 		STAILQ_REMOVE_HEAD(&queue->tq_queue, ta_link);
 		gtask->ta_flags &= ~TASK_ENQUEUED;
 		tb.tb_running = gtask;
 		TQ_UNLOCK(queue);
 
 		KASSERT(gtask->ta_func != NULL, ("task->ta_func is NULL"));
 		gtask->ta_func(gtask->ta_context);
 
 		TQ_LOCK(queue);
 		tb.tb_running = NULL;
 		wakeup(gtask);
 
 		TAILQ_REMOVE(&queue->tq_active, &tb, tb_link);
 		tb_first = TAILQ_FIRST(&queue->tq_active);
 		if (tb_first != NULL &&
 		    tb_first->tb_running == TB_DRAIN_WAITER)
 			wakeup(tb_first);
 	}
 }
 
 static int
 task_is_running(struct gtaskqueue *queue, struct gtask *gtask)
 {
 	struct gtaskqueue_busy *tb;
 
 	TQ_ASSERT_LOCKED(queue);
 	TAILQ_FOREACH(tb, &queue->tq_active, tb_link) {
 		if (tb->tb_running == gtask)
 			return (1);
 	}
 	return (0);
 }
 
 static int
 gtaskqueue_cancel_locked(struct gtaskqueue *queue, struct gtask *gtask)
 {
 
 	if (gtask->ta_flags & TASK_ENQUEUED)
 		STAILQ_REMOVE(&queue->tq_queue, gtask, gtask, ta_link);
 	gtask->ta_flags &= ~TASK_ENQUEUED;
 	return (task_is_running(queue, gtask) ? EBUSY : 0);
 }
 
 int
 gtaskqueue_cancel(struct gtaskqueue *queue, struct gtask *gtask)
 {
 	int error;
 
 	TQ_LOCK(queue);
 	error = gtaskqueue_cancel_locked(queue, gtask);
 	TQ_UNLOCK(queue);
 
 	return (error);
 }
 
 void
 gtaskqueue_drain(struct gtaskqueue *queue, struct gtask *gtask)
 {
 
 	if (!queue->tq_spin)
 		WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, __func__);
 
 	TQ_LOCK(queue);
 	while ((gtask->ta_flags & TASK_ENQUEUED) || task_is_running(queue, gtask))
 		TQ_SLEEP(queue, gtask, &queue->tq_mutex, PWAIT, "-", 0);
 	TQ_UNLOCK(queue);
 }
 
 void
 gtaskqueue_drain_all(struct gtaskqueue *queue)
 {
 
 	if (!queue->tq_spin)
 		WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, __func__);
 
 	TQ_LOCK(queue);
 	gtaskqueue_drain_tq_queue(queue);
 	gtaskqueue_drain_tq_active(queue);
 	TQ_UNLOCK(queue);
 }
 
 static int
 _gtaskqueue_start_threads(struct gtaskqueue **tqp, int count, int pri,
     cpuset_t *mask, const char *name, va_list ap)
 {
 	char ktname[MAXCOMLEN + 1];
 	struct thread *td;
 	struct gtaskqueue *tq;
 	int i, error;
 
 	if (count <= 0)
 		return (EINVAL);
 
 	vsnprintf(ktname, sizeof(ktname), name, ap);
 	tq = *tqp;
 
 	tq->tq_threads = malloc(sizeof(struct thread *) * count, M_GTASKQUEUE,
 	    M_NOWAIT | M_ZERO);
 	if (tq->tq_threads == NULL) {
 		printf("%s: no memory for %s threads\n", __func__, ktname);
 		return (ENOMEM);
 	}
 
 	for (i = 0; i < count; i++) {
 		if (count == 1)
 			error = kthread_add(gtaskqueue_thread_loop, tqp, NULL,
 			    &tq->tq_threads[i], RFSTOPPED, 0, "%s", ktname);
 		else
 			error = kthread_add(gtaskqueue_thread_loop, tqp, NULL,
 			    &tq->tq_threads[i], RFSTOPPED, 0,
 			    "%s_%d", ktname, i);
 		if (error) {
 			/* should be ok to continue, taskqueue_free will dtrt */
 			printf("%s: kthread_add(%s): error %d", __func__,
 			    ktname, error);
 			tq->tq_threads[i] = NULL;		/* paranoid */
 		} else
 			tq->tq_tcount++;
 	}
 	for (i = 0; i < count; i++) {
 		if (tq->tq_threads[i] == NULL)
 			continue;
 		td = tq->tq_threads[i];
 		if (mask) {
 			error = cpuset_setthread(td->td_tid, mask);
 			/*
 			 * Failing to pin is rarely an actual fatal error;
 			 * it'll just affect performance.
 			 */
 			if (error)
 				printf("%s: curthread=%llu: can't pin; "
 				    "error=%d\n",
 				    __func__,
 				    (unsigned long long) td->td_tid,
 				    error);
 		}
 		thread_lock(td);
 		sched_prio(td, pri);
 		sched_add(td, SRQ_BORING);
 		thread_unlock(td);
 	}
 
 	return (0);
 }
 
 static int
 gtaskqueue_start_threads(struct gtaskqueue **tqp, int count, int pri,
     const char *name, ...)
 {
 	va_list ap;
 	int error;
 
 	va_start(ap, name);
 	error = _gtaskqueue_start_threads(tqp, count, pri, NULL, name, ap);
 	va_end(ap);
 	return (error);
 }
 
 static inline void
 gtaskqueue_run_callback(struct gtaskqueue *tq,
     enum taskqueue_callback_type cb_type)
 {
 	taskqueue_callback_fn tq_callback;
 
 	TQ_ASSERT_UNLOCKED(tq);
 	tq_callback = tq->tq_callbacks[cb_type];
 	if (tq_callback != NULL)
 		tq_callback(tq->tq_cb_contexts[cb_type]);
 }
 
 static void
 gtaskqueue_thread_loop(void *arg)
 {
 	struct gtaskqueue **tqp, *tq;
 
 	tqp = arg;
 	tq = *tqp;
 	gtaskqueue_run_callback(tq, TASKQUEUE_CALLBACK_TYPE_INIT);
 	TQ_LOCK(tq);
 	while ((tq->tq_flags & TQ_FLAGS_ACTIVE) != 0) {
 		/* XXX ? */
 		gtaskqueue_run_locked(tq);
 		/*
 		 * Because taskqueue_run() can drop tq_mutex, we need to
 		 * check if the TQ_FLAGS_ACTIVE flag wasn't removed in the
 		 * meantime, which means we missed a wakeup.
 		 */
 		if ((tq->tq_flags & TQ_FLAGS_ACTIVE) == 0)
 			break;
 		TQ_SLEEP(tq, tq, &tq->tq_mutex, 0, "-", 0);
 	}
 	gtaskqueue_run_locked(tq);
 	/*
 	 * This thread is on its way out, so just drop the lock temporarily
 	 * in order to call the shutdown callback.  This allows the callback
 	 * to look at the taskqueue, even just before it dies.
 	 */
 	TQ_UNLOCK(tq);
 	gtaskqueue_run_callback(tq, TASKQUEUE_CALLBACK_TYPE_SHUTDOWN);
 	TQ_LOCK(tq);
 
 	/* rendezvous with thread that asked us to terminate */
 	tq->tq_tcount--;
 	wakeup_one(tq->tq_threads);
 	TQ_UNLOCK(tq);
 	kthread_exit();
 }
 
 static void
 gtaskqueue_thread_enqueue(void *context)
 {
 	struct gtaskqueue **tqp, *tq;
 
 	tqp = context;
 	tq = *tqp;
 	wakeup_one(tq);
 }
 
 
 static struct gtaskqueue *
 gtaskqueue_create_fast(const char *name, int mflags,
 		 taskqueue_enqueue_fn enqueue, void *context)
 {
 	return _gtaskqueue_create(name, mflags, enqueue, context,
 			MTX_SPIN, "fast_taskqueue");
 }
 
 
 struct taskqgroup_cpu {
 	LIST_HEAD(, grouptask)	tgc_tasks;
 	struct gtaskqueue	*tgc_taskq;
 	int	tgc_cnt;
 	int	tgc_cpu;
 };
 
 struct taskqgroup {
 	struct taskqgroup_cpu tqg_queue[MAXCPU];
 	struct mtx	tqg_lock;
 	char *		tqg_name;
 	int		tqg_adjusting;
 	int		tqg_stride;
 	int		tqg_cnt;
 };
 
 struct taskq_bind_task {
 	struct gtask bt_task;
 	int	bt_cpuid;
 };
 
 static void
 taskqgroup_cpu_create(struct taskqgroup *qgroup, int idx, int cpu)
 {
 	struct taskqgroup_cpu *qcpu;
 
 	qcpu = &qgroup->tqg_queue[idx];
 	LIST_INIT(&qcpu->tgc_tasks);
 	qcpu->tgc_taskq = gtaskqueue_create_fast(NULL, M_WAITOK,
 	    taskqueue_thread_enqueue, &qcpu->tgc_taskq);
 	gtaskqueue_start_threads(&qcpu->tgc_taskq, 1, PI_SOFT,
 	    "%s_%d", qgroup->tqg_name, idx);
 	qcpu->tgc_cpu = cpu;
 }
 
 static void
 taskqgroup_cpu_remove(struct taskqgroup *qgroup, int idx)
 {
 
 	gtaskqueue_free(qgroup->tqg_queue[idx].tgc_taskq);
 }
 
 /*
  * Find the taskq with least # of tasks that doesn't currently have any
  * other queues from the uniq identifier.
  */
 static int
 taskqgroup_find(struct taskqgroup *qgroup, void *uniq)
 {
 	struct grouptask *n;
 	int i, idx, mincnt;
 	int strict;
 
 	mtx_assert(&qgroup->tqg_lock, MA_OWNED);
 	if (qgroup->tqg_cnt == 0)
 		return (0);
 	idx = -1;
 	mincnt = INT_MAX;
 	/*
 	 * Two passes;  First scan for a queue with the least tasks that
 	 * does not already service this uniq id.  If that fails simply find
 	 * the queue with the least total tasks;
 	 */
 	for (strict = 1; mincnt == INT_MAX; strict = 0) {
 		for (i = 0; i < qgroup->tqg_cnt; i++) {
 			if (qgroup->tqg_queue[i].tgc_cnt > mincnt)
 				continue;
 			if (strict) {
 				LIST_FOREACH(n,
 				    &qgroup->tqg_queue[i].tgc_tasks, gt_list)
 					if (n->gt_uniq == uniq)
 						break;
 				if (n != NULL)
 					continue;
 			}
 			mincnt = qgroup->tqg_queue[i].tgc_cnt;
 			idx = i;
 		}
 	}
 	if (idx == -1)
 		panic("taskqgroup_find: Failed to pick a qid.");
 
 	return (idx);
 }
 
 /*
  * smp_started is unusable since it is not set for UP kernels or even for
  * SMP kernels when there is 1 CPU.  This is usually handled by adding a
  * (mp_ncpus == 1) test, but that would be broken here since we need to
  * to synchronize with the SI_SUB_SMP ordering.  Even in the pure SMP case
  * smp_started only gives a fuzzy ordering relative to SI_SUB_SMP.
  *
  * So maintain our own flag.  It must be set after all CPUs are started
  * and before SI_SUB_SMP:SI_ORDER_ANY so that the SYSINIT for delayed
  * adjustment is properly delayed.  SI_ORDER_FOURTH is clearly before
  * SI_ORDER_ANY and unclearly after the CPUs are started.  It would be
  * simpler for adjustment to pass a flag indicating if it is delayed.
  */ 
 
 static int tqg_smp_started;
 
 static void
 tqg_record_smp_started(void *arg)
 {
 	tqg_smp_started = 1;
 }
 
 SYSINIT(tqg_record_smp_started, SI_SUB_SMP, SI_ORDER_FOURTH,
 	tqg_record_smp_started, NULL);
 
 void
 taskqgroup_attach(struct taskqgroup *qgroup, struct grouptask *gtask,
     void *uniq, int irq, char *name)
 {
 	cpuset_t mask;
 	int qid;
 
 	gtask->gt_uniq = uniq;
 	gtask->gt_name = name;
 	gtask->gt_irq = irq;
 	gtask->gt_cpu = -1;
 	mtx_lock(&qgroup->tqg_lock);
 	qid = taskqgroup_find(qgroup, uniq);
 	qgroup->tqg_queue[qid].tgc_cnt++;
 	LIST_INSERT_HEAD(&qgroup->tqg_queue[qid].tgc_tasks, gtask, gt_list);
 	gtask->gt_taskqueue = qgroup->tqg_queue[qid].tgc_taskq;
 	if (irq != -1 && tqg_smp_started) {
 		gtask->gt_cpu = qgroup->tqg_queue[qid].tgc_cpu;
 		CPU_ZERO(&mask);
 		CPU_SET(qgroup->tqg_queue[qid].tgc_cpu, &mask);
 		mtx_unlock(&qgroup->tqg_lock);
 		intr_setaffinity(irq, &mask);
 	} else
 		mtx_unlock(&qgroup->tqg_lock);
 }
 
 static void
 taskqgroup_attach_deferred(struct taskqgroup *qgroup, struct grouptask *gtask)
 {
 	cpuset_t mask;
 	int qid, cpu;
 
 	mtx_lock(&qgroup->tqg_lock);
 	qid = taskqgroup_find(qgroup, gtask->gt_uniq);
 	cpu = qgroup->tqg_queue[qid].tgc_cpu;
 	if (gtask->gt_irq != -1) {
 		mtx_unlock(&qgroup->tqg_lock);
 
 		CPU_ZERO(&mask);
 		CPU_SET(cpu, &mask);
 		intr_setaffinity(gtask->gt_irq, &mask);
 
 		mtx_lock(&qgroup->tqg_lock);
 	}
 	qgroup->tqg_queue[qid].tgc_cnt++;
 
 	LIST_INSERT_HEAD(&qgroup->tqg_queue[qid].tgc_tasks, gtask,
 			 gt_list);
 	MPASS(qgroup->tqg_queue[qid].tgc_taskq != NULL);
 	gtask->gt_taskqueue = qgroup->tqg_queue[qid].tgc_taskq;
 	mtx_unlock(&qgroup->tqg_lock);
 }
 
 int
 taskqgroup_attach_cpu(struct taskqgroup *qgroup, struct grouptask *gtask,
 	void *uniq, int cpu, int irq, char *name)
 {
 	cpuset_t mask;
 	int i, qid;
 
 	qid = -1;
 	gtask->gt_uniq = uniq;
 	gtask->gt_name = name;
 	gtask->gt_irq = irq;
 	gtask->gt_cpu = cpu;
 	mtx_lock(&qgroup->tqg_lock);
 	if (tqg_smp_started) {
 		for (i = 0; i < qgroup->tqg_cnt; i++)
 			if (qgroup->tqg_queue[i].tgc_cpu == cpu) {
 				qid = i;
 				break;
 			}
 		if (qid == -1) {
 			mtx_unlock(&qgroup->tqg_lock);
 			return (EINVAL);
 		}
 	} else
 		qid = 0;
 	qgroup->tqg_queue[qid].tgc_cnt++;
 	LIST_INSERT_HEAD(&qgroup->tqg_queue[qid].tgc_tasks, gtask, gt_list);
 	gtask->gt_taskqueue = qgroup->tqg_queue[qid].tgc_taskq;
 	cpu = qgroup->tqg_queue[qid].tgc_cpu;
 	mtx_unlock(&qgroup->tqg_lock);
 
 	CPU_ZERO(&mask);
 	CPU_SET(cpu, &mask);
 	if (irq != -1 && tqg_smp_started)
 		intr_setaffinity(irq, &mask);
 	return (0);
 }
 
 static int
 taskqgroup_attach_cpu_deferred(struct taskqgroup *qgroup, struct grouptask *gtask)
 {
 	cpuset_t mask;
 	int i, qid, irq, cpu;
 
 	qid = -1;
 	irq = gtask->gt_irq;
 	cpu = gtask->gt_cpu;
 	MPASS(tqg_smp_started);
 	mtx_lock(&qgroup->tqg_lock);
 	for (i = 0; i < qgroup->tqg_cnt; i++)
 		if (qgroup->tqg_queue[i].tgc_cpu == cpu) {
 			qid = i;
 			break;
 		}
 	if (qid == -1) {
 		mtx_unlock(&qgroup->tqg_lock);
 		return (EINVAL);
 	}
 	qgroup->tqg_queue[qid].tgc_cnt++;
 	LIST_INSERT_HEAD(&qgroup->tqg_queue[qid].tgc_tasks, gtask, gt_list);
 	MPASS(qgroup->tqg_queue[qid].tgc_taskq != NULL);
 	gtask->gt_taskqueue = qgroup->tqg_queue[qid].tgc_taskq;
 	mtx_unlock(&qgroup->tqg_lock);
 
 	CPU_ZERO(&mask);
 	CPU_SET(cpu, &mask);
 
 	if (irq != -1)
 		intr_setaffinity(irq, &mask);
 	return (0);
 }
 
 void
 taskqgroup_detach(struct taskqgroup *qgroup, struct grouptask *gtask)
 {
 	int i;
 
 	mtx_lock(&qgroup->tqg_lock);
 	for (i = 0; i < qgroup->tqg_cnt; i++)
 		if (qgroup->tqg_queue[i].tgc_taskq == gtask->gt_taskqueue)
 			break;
 	if (i == qgroup->tqg_cnt)
 		panic("taskqgroup_detach: task not in group\n");
 	qgroup->tqg_queue[i].tgc_cnt--;
 	LIST_REMOVE(gtask, gt_list);
 	mtx_unlock(&qgroup->tqg_lock);
 	gtask->gt_taskqueue = NULL;
 }
 
 static void
 taskqgroup_binder(void *ctx)
 {
 	struct taskq_bind_task *gtask = (struct taskq_bind_task *)ctx;
 	cpuset_t mask;
 	int error;
 
 	CPU_ZERO(&mask);
 	CPU_SET(gtask->bt_cpuid, &mask);
 	error = cpuset_setthread(curthread->td_tid, &mask);
 	thread_lock(curthread);
 	sched_bind(curthread, gtask->bt_cpuid);
 	thread_unlock(curthread);
 
 	if (error)
 		printf("taskqgroup_binder: setaffinity failed: %d\n",
 		    error);
 	free(gtask, M_DEVBUF);
 }
 
 static void
 taskqgroup_bind(struct taskqgroup *qgroup)
 {
 	struct taskq_bind_task *gtask;
 	int i;
 
 	/*
 	 * Bind taskqueue threads to specific CPUs, if they have been assigned
 	 * one.
 	 */
 	if (qgroup->tqg_cnt == 1)
 		return;
 
 	for (i = 0; i < qgroup->tqg_cnt; i++) {
 		gtask = malloc(sizeof (*gtask), M_DEVBUF, M_WAITOK);
 		GTASK_INIT(&gtask->bt_task, 0, 0, taskqgroup_binder, gtask);
 		gtask->bt_cpuid = qgroup->tqg_queue[i].tgc_cpu;
 		grouptaskqueue_enqueue(qgroup->tqg_queue[i].tgc_taskq,
 		    &gtask->bt_task);
 	}
 }
 
 static int
 _taskqgroup_adjust(struct taskqgroup *qgroup, int cnt, int stride)
 {
 	LIST_HEAD(, grouptask) gtask_head = LIST_HEAD_INITIALIZER(NULL);
 	struct grouptask *gtask;
 	int i, k, old_cnt, old_cpu, cpu;
 
 	mtx_assert(&qgroup->tqg_lock, MA_OWNED);
 
 	if (cnt < 1 || cnt * stride > mp_ncpus || !tqg_smp_started) {
 		printf("%s: failed cnt: %d stride: %d "
 		    "mp_ncpus: %d tqg_smp_started: %d\n",
 		    __func__, cnt, stride, mp_ncpus, tqg_smp_started);
 		return (EINVAL);
 	}
 	if (qgroup->tqg_adjusting) {
 		printf("taskqgroup_adjust failed: adjusting\n");
 		return (EBUSY);
 	}
 	qgroup->tqg_adjusting = 1;
 	old_cnt = qgroup->tqg_cnt;
 	old_cpu = 0;
 	if (old_cnt < cnt)
 		old_cpu = qgroup->tqg_queue[old_cnt].tgc_cpu;
 	mtx_unlock(&qgroup->tqg_lock);
 	/*
 	 * Set up queue for tasks added before boot.
 	 */
 	if (old_cnt == 0) {
 		LIST_SWAP(&gtask_head, &qgroup->tqg_queue[0].tgc_tasks,
 		    grouptask, gt_list);
 		qgroup->tqg_queue[0].tgc_cnt = 0;
 	}
 
 	/*
 	 * If new taskq threads have been added.
 	 */
 	cpu = old_cpu;
 	for (i = old_cnt; i < cnt; i++) {
 		taskqgroup_cpu_create(qgroup, i, cpu);
 
 		for (k = 0; k < stride; k++)
 			cpu = CPU_NEXT(cpu);
 	}
 	mtx_lock(&qgroup->tqg_lock);
 	qgroup->tqg_cnt = cnt;
 	qgroup->tqg_stride = stride;
 
 	/*
 	 * Adjust drivers to use new taskqs.
 	 */
 	for (i = 0; i < old_cnt; i++) {
 		while ((gtask = LIST_FIRST(&qgroup->tqg_queue[i].tgc_tasks))) {
 			LIST_REMOVE(gtask, gt_list);
 			qgroup->tqg_queue[i].tgc_cnt--;
 			LIST_INSERT_HEAD(&gtask_head, gtask, gt_list);
 		}
 	}
 	mtx_unlock(&qgroup->tqg_lock);
 
 	while ((gtask = LIST_FIRST(&gtask_head))) {
 		LIST_REMOVE(gtask, gt_list);
 		if (gtask->gt_cpu == -1)
 			taskqgroup_attach_deferred(qgroup, gtask);
 		else if (taskqgroup_attach_cpu_deferred(qgroup, gtask))
 			taskqgroup_attach_deferred(qgroup, gtask);
 	}
 
 #ifdef INVARIANTS
 	mtx_lock(&qgroup->tqg_lock);
 	for (i = 0; i < qgroup->tqg_cnt; i++) {
 		MPASS(qgroup->tqg_queue[i].tgc_taskq != NULL);
 		LIST_FOREACH(gtask, &qgroup->tqg_queue[i].tgc_tasks, gt_list)
 			MPASS(gtask->gt_taskqueue != NULL);
 	}
 	mtx_unlock(&qgroup->tqg_lock);
 #endif
 	/*
 	 * If taskq thread count has been reduced.
 	 */
 	for (i = cnt; i < old_cnt; i++)
 		taskqgroup_cpu_remove(qgroup, i);
 
 	taskqgroup_bind(qgroup);
 
 	mtx_lock(&qgroup->tqg_lock);
 	qgroup->tqg_adjusting = 0;
 
 	return (0);
 }
 
 int
 taskqgroup_adjust(struct taskqgroup *qgroup, int cnt, int stride)
 {
 	int error;
 
 	mtx_lock(&qgroup->tqg_lock);
 	error = _taskqgroup_adjust(qgroup, cnt, stride);
 	mtx_unlock(&qgroup->tqg_lock);
 
 	return (error);
 }
 
 struct taskqgroup *
 taskqgroup_create(char *name)
 {
 	struct taskqgroup *qgroup;
 
 	qgroup = malloc(sizeof(*qgroup), M_GTASKQUEUE, M_WAITOK | M_ZERO);
 	mtx_init(&qgroup->tqg_lock, "taskqgroup", NULL, MTX_DEF);
 	qgroup->tqg_name = name;
 	LIST_INIT(&qgroup->tqg_queue[0].tgc_tasks);
 
 	return (qgroup);
 }
 
 void
 taskqgroup_destroy(struct taskqgroup *qgroup)
 {
 
 }
Index: projects/clang400-import/sys/net/iflib.c
===================================================================
--- projects/clang400-import/sys/net/iflib.c	(revision 314522)
+++ projects/clang400-import/sys/net/iflib.c	(revision 314523)
@@ -1,5244 +1,5243 @@
 /*-
  * Copyright (c) 2014-2017, Matthew Macy <mmacy@nextbsd.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
  *
  *  1. Redistributions of source code must retain the above copyright notice,
  *     this list of conditions and the following disclaimer.
  *
  *  2. Neither the name of Matthew Macy nor the names of its
  *     contributors may be used to endorse or promote products derived from
  *     this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
  * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
  * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
  * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
  * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
  * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
  * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  * POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_inet.h"
 #include "opt_inet6.h"
 #include "opt_acpi.h"
 
 #include <sys/param.h>
 #include <sys/types.h>
 #include <sys/bus.h>
 #include <sys/eventhandler.h>
 #include <sys/sockio.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/module.h>
 #include <sys/kobj.h>
 #include <sys/rman.h>
 #include <sys/sbuf.h>
 #include <sys/smp.h>
 #include <sys/socket.h>
 #include <sys/sysctl.h>
 #include <sys/syslog.h>
 #include <sys/taskqueue.h>
 #include <sys/limits.h>
 
 
 #include <net/if.h>
 #include <net/if_var.h>
 #include <net/if_types.h>
 #include <net/if_media.h>
 #include <net/bpf.h>
 #include <net/ethernet.h>
 #include <net/mp_ring.h>
 
 #include <netinet/in.h>
 #include <netinet/in_pcb.h>
 #include <netinet/tcp_lro.h>
 #include <netinet/in_systm.h>
 #include <netinet/if_ether.h>
 #include <netinet/ip.h>
 #include <netinet/ip6.h>
 #include <netinet/tcp.h>
 
 #include <machine/bus.h>
 #include <machine/in_cksum.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 
 #include <dev/led/led.h>
 #include <dev/pci/pcireg.h>
 #include <dev/pci/pcivar.h>
 #include <dev/pci/pci_private.h>
 
 #include <net/iflib.h>
 
 #include "ifdi_if.h"
 
 #if defined(__i386__) || defined(__amd64__)
 #include <sys/memdesc.h>
 #include <machine/bus.h>
 #include <machine/md_var.h>
 #include <machine/specialreg.h>
 #include <x86/include/busdma_impl.h>
 #include <x86/iommu/busdma_dmar.h>
 #endif
 
 /*
  * enable accounting of every mbuf as it comes in to and goes out of iflib's software descriptor references
  */
 #define MEMORY_LOGGING 0
 /*
  * Enable mbuf vectors for compressing long mbuf chains
  */
 
 /*
  * NB:
  * - Prefetching in tx cleaning should perhaps be a tunable. The distance ahead
  *   we prefetch needs to be determined by the time spent in m_free vis a vis
  *   the cost of a prefetch. This will of course vary based on the workload:
  *      - NFLX's m_free path is dominated by vm-based M_EXT manipulation which
  *        is quite expensive, thus suggesting very little prefetch.
  *      - small packet forwarding which is just returning a single mbuf to
  *        UMA will typically be very fast vis a vis the cost of a memory
  *        access.
  */
 
 
 /*
  * File organization:
  *  - private structures
  *  - iflib private utility functions
  *  - ifnet functions
  *  - vlan registry and other exported functions
  *  - iflib public core functions
  *
  *
  */
 static MALLOC_DEFINE(M_IFLIB, "iflib", "ifnet library");
 
 struct iflib_txq;
 typedef struct iflib_txq *iflib_txq_t;
 struct iflib_rxq;
 typedef struct iflib_rxq *iflib_rxq_t;
 struct iflib_fl;
 typedef struct iflib_fl *iflib_fl_t;
 
 struct iflib_ctx;
 
 typedef struct iflib_filter_info {
 	driver_filter_t *ifi_filter;
 	void *ifi_filter_arg;
 	struct grouptask *ifi_task;
 	struct iflib_ctx *ifi_ctx;
 } *iflib_filter_info_t;
 
 struct iflib_ctx {
 	KOBJ_FIELDS;
    /*
    * Pointer to hardware driver's softc
    */
 	void *ifc_softc;
 	device_t ifc_dev;
 	if_t ifc_ifp;
 
 	cpuset_t ifc_cpus;
 	if_shared_ctx_t ifc_sctx;
 	struct if_softc_ctx ifc_softc_ctx;
 
 	struct mtx ifc_mtx;
 
 	uint16_t ifc_nhwtxqs;
 	uint16_t ifc_nhwrxqs;
 
 	iflib_txq_t ifc_txqs;
 	iflib_rxq_t ifc_rxqs;
 	uint32_t ifc_if_flags;
 	uint32_t ifc_flags;
 	uint32_t ifc_max_fl_buf_size;
 	int ifc_in_detach;
 
 	int ifc_link_state;
 	int ifc_link_irq;
 	int ifc_pause_frames;
 	int ifc_watchdog_events;
 	struct cdev *ifc_led_dev;
 	struct resource *ifc_msix_mem;
 
 	struct if_irq ifc_legacy_irq;
 	struct grouptask ifc_admin_task;
 	struct grouptask ifc_vflr_task;
 	struct iflib_filter_info ifc_filter_info;
 	struct ifmedia	ifc_media;
 
 	struct sysctl_oid *ifc_sysctl_node;
 	uint16_t ifc_sysctl_ntxqs;
 	uint16_t ifc_sysctl_nrxqs;
 	uint16_t ifc_sysctl_qs_eq_override;
 
 	uint16_t ifc_sysctl_ntxds[8];
 	uint16_t ifc_sysctl_nrxds[8];
 	struct if_txrx ifc_txrx;
 #define isc_txd_encap  ifc_txrx.ift_txd_encap
 #define isc_txd_flush  ifc_txrx.ift_txd_flush
 #define isc_txd_credits_update  ifc_txrx.ift_txd_credits_update
 #define isc_rxd_available ifc_txrx.ift_rxd_available
 #define isc_rxd_pkt_get ifc_txrx.ift_rxd_pkt_get
 #define isc_rxd_refill ifc_txrx.ift_rxd_refill
 #define isc_rxd_flush ifc_txrx.ift_rxd_flush
 #define isc_rxd_refill ifc_txrx.ift_rxd_refill
 #define isc_rxd_refill ifc_txrx.ift_rxd_refill
 #define isc_legacy_intr ifc_txrx.ift_legacy_intr
 	eventhandler_tag ifc_vlan_attach_event;
 	eventhandler_tag ifc_vlan_detach_event;
 	uint8_t ifc_mac[ETHER_ADDR_LEN];
 	char ifc_mtx_name[16];
 };
 
 
 void *
 iflib_get_softc(if_ctx_t ctx)
 {
 
 	return (ctx->ifc_softc);
 }
 
 device_t
 iflib_get_dev(if_ctx_t ctx)
 {
 
 	return (ctx->ifc_dev);
 }
 
 if_t
 iflib_get_ifp(if_ctx_t ctx)
 {
 
 	return (ctx->ifc_ifp);
 }
 
 struct ifmedia *
 iflib_get_media(if_ctx_t ctx)
 {
 
 	return (&ctx->ifc_media);
 }
 
 void
 iflib_set_mac(if_ctx_t ctx, uint8_t mac[ETHER_ADDR_LEN])
 {
 
 	bcopy(mac, ctx->ifc_mac, ETHER_ADDR_LEN);
 }
 
 if_softc_ctx_t
 iflib_get_softc_ctx(if_ctx_t ctx)
 {
 
 	return (&ctx->ifc_softc_ctx);
 }
 
 if_shared_ctx_t
 iflib_get_sctx(if_ctx_t ctx)
 {
 
 	return (ctx->ifc_sctx);
 }
 
 #define CACHE_PTR_INCREMENT (CACHE_LINE_SIZE/sizeof(void*))
 
 #define LINK_ACTIVE(ctx) ((ctx)->ifc_link_state == LINK_STATE_UP)
 #define CTX_IS_VF(ctx) ((ctx)->ifc_sctx->isc_flags & IFLIB_IS_VF)
 
 #define RX_SW_DESC_MAP_CREATED	(1 << 0)
 #define TX_SW_DESC_MAP_CREATED	(1 << 1)
 #define RX_SW_DESC_INUSE        (1 << 3)
 #define TX_SW_DESC_MAPPED       (1 << 4)
 
 typedef struct iflib_sw_rx_desc_array {
 	bus_dmamap_t	*ifsd_map;         /* bus_dma maps for packet */
 	struct mbuf	**ifsd_m;           /* pkthdr mbufs */
 	caddr_t		*ifsd_cl;          /* direct cluster pointer for rx */
 	uint8_t		*ifsd_flags;
 } iflib_rxsd_array_t;
 
 typedef struct iflib_sw_tx_desc_array {
 	bus_dmamap_t    *ifsd_map;         /* bus_dma maps for packet */
 	struct mbuf    **ifsd_m;           /* pkthdr mbufs */
 	uint8_t		*ifsd_flags;
 } iflib_txsd_array_t;
 
 
 /* magic number that should be high enough for any hardware */
 #define IFLIB_MAX_TX_SEGS		128
 #define IFLIB_MAX_RX_SEGS		32
 #define IFLIB_RX_COPY_THRESH		63
 #define IFLIB_MAX_RX_REFRESH		32
 #define IFLIB_QUEUE_IDLE		0
 #define IFLIB_QUEUE_HUNG		1
 #define IFLIB_QUEUE_WORKING		2
 
 /* this should really scale with ring size - 32 is a fairly arbitrary value for this */
 #define TX_BATCH_SIZE			16
 
 #define IFLIB_RESTART_BUDGET		8
 
 #define	IFC_LEGACY		0x01
 #define	IFC_QFLUSH		0x02
 #define	IFC_MULTISEG		0x04
 #define	IFC_DMAR		0x08
 #define	IFC_SC_ALLOCATED	0x10
 #define	IFC_INIT_DONE		0x20
 
 
 #define CSUM_OFFLOAD		(CSUM_IP_TSO|CSUM_IP6_TSO|CSUM_IP| \
 				 CSUM_IP_UDP|CSUM_IP_TCP|CSUM_IP_SCTP| \
 				 CSUM_IP6_UDP|CSUM_IP6_TCP|CSUM_IP6_SCTP)
 struct iflib_txq {
 	uint16_t	ift_in_use;
 	uint16_t	ift_cidx;
 	uint16_t	ift_cidx_processed;
 	uint16_t	ift_pidx;
 	uint8_t		ift_gen;
 	uint8_t		ift_db_pending;
 	uint8_t		ift_db_pending_queued;
 	uint8_t		ift_npending;
 	uint8_t		ift_br_offset;
 	/* implicit pad */
 	uint64_t	ift_processed;
 	uint64_t	ift_cleaned;
 #if MEMORY_LOGGING
 	uint64_t	ift_enqueued;
 	uint64_t	ift_dequeued;
 #endif
 	uint64_t	ift_no_tx_dma_setup;
 	uint64_t	ift_no_desc_avail;
 	uint64_t	ift_mbuf_defrag_failed;
 	uint64_t	ift_mbuf_defrag;
 	uint64_t	ift_map_failed;
 	uint64_t	ift_txd_encap_efbig;
 	uint64_t	ift_pullups;
 
 	struct mtx	ift_mtx;
 	struct mtx	ift_db_mtx;
 
 	/* constant values */
 	if_ctx_t	ift_ctx;
 	struct ifmp_ring        **ift_br;
 	struct grouptask	ift_task;
 	uint16_t	ift_size;
 	uint16_t	ift_id;
 	struct callout	ift_timer;
 	struct callout	ift_db_check;
 
 	iflib_txsd_array_t	ift_sds;
 	uint8_t			ift_nbr;
 	uint8_t			ift_qstatus;
 	uint8_t			ift_active;
 	uint8_t			ift_closed;
 	int			ift_watchdog_time;
 	struct iflib_filter_info ift_filter_info;
 	bus_dma_tag_t		ift_desc_tag;
 	bus_dma_tag_t		ift_tso_desc_tag;
 	iflib_dma_info_t	ift_ifdi;
 #define MTX_NAME_LEN 16
 	char                    ift_mtx_name[MTX_NAME_LEN];
 	char                    ift_db_mtx_name[MTX_NAME_LEN];
 	bus_dma_segment_t	ift_segs[IFLIB_MAX_TX_SEGS]  __aligned(CACHE_LINE_SIZE);
 #ifdef IFLIB_DIAGNOSTICS
 	uint64_t ift_cpu_exec_count[256];
 #endif
 } __aligned(CACHE_LINE_SIZE);
 
 struct iflib_fl {
 	uint16_t	ifl_cidx;
 	uint16_t	ifl_pidx;
 	uint16_t	ifl_credits;
 	uint8_t		ifl_gen;
 #if MEMORY_LOGGING
 	uint64_t	ifl_m_enqueued;
 	uint64_t	ifl_m_dequeued;
 	uint64_t	ifl_cl_enqueued;
 	uint64_t	ifl_cl_dequeued;
 #endif
 	/* implicit pad */
 
 	/* constant */
 	uint16_t	ifl_size;
 	uint16_t	ifl_buf_size;
 	uint16_t	ifl_cltype;
 	uma_zone_t	ifl_zone;
 	iflib_rxsd_array_t	ifl_sds;
 	iflib_rxq_t	ifl_rxq;
 	uint8_t		ifl_id;
 	bus_dma_tag_t           ifl_desc_tag;
 	iflib_dma_info_t	ifl_ifdi;
 	uint64_t	ifl_bus_addrs[IFLIB_MAX_RX_REFRESH] __aligned(CACHE_LINE_SIZE);
 	caddr_t		ifl_vm_addrs[IFLIB_MAX_RX_REFRESH];
 }  __aligned(CACHE_LINE_SIZE);
 
 static inline int
 get_inuse(int size, int cidx, int pidx, int gen)
 {
 	int used;
 
 	if (pidx > cidx)
 		used = pidx - cidx;
 	else if (pidx < cidx)
 		used = size - cidx + pidx;
 	else if (gen == 0 && pidx == cidx)
 		used = 0;
 	else if (gen == 1 && pidx == cidx)
 		used = size;
 	else
 		panic("bad state");
 
 	return (used);
 }
 
 #define TXQ_AVAIL(txq) (txq->ift_size - get_inuse(txq->ift_size, txq->ift_cidx, txq->ift_pidx, txq->ift_gen))
 
 #define IDXDIFF(head, tail, wrap) \
 	((head) >= (tail) ? (head) - (tail) : (wrap) - (tail) + (head))
 
 struct iflib_rxq {
 	/* If there is a separate completion queue -
 	 * these are the cq cidx and pidx. Otherwise
 	 * these are unused.
 	 */
 	uint16_t	ifr_size;
 	uint16_t	ifr_cq_cidx;
 	uint16_t	ifr_cq_pidx;
 	uint8_t		ifr_cq_gen;
 	uint8_t		ifr_fl_offset;
 
 	if_ctx_t	ifr_ctx;
 	iflib_fl_t	ifr_fl;
 	uint64_t	ifr_rx_irq;
 	uint16_t	ifr_id;
 	uint8_t		ifr_lro_enabled;
 	uint8_t		ifr_nfl;
 	struct lro_ctrl			ifr_lc;
 	struct grouptask        ifr_task;
 	struct iflib_filter_info ifr_filter_info;
 	iflib_dma_info_t		ifr_ifdi;
 	/* dynamically allocate if any drivers need a value substantially larger than this */
 	struct if_rxd_frag	ifr_frags[IFLIB_MAX_RX_SEGS] __aligned(CACHE_LINE_SIZE);
 #ifdef IFLIB_DIAGNOSTICS
 	uint64_t ifr_cpu_exec_count[256];
 #endif
 }  __aligned(CACHE_LINE_SIZE);
 
 /*
  * Only allow a single packet to take up most 1/nth of the tx ring
  */
 #define MAX_SINGLE_PACKET_FRACTION 12
 #define IF_BAD_DMA (bus_addr_t)-1
 
 static int enable_msix = 1;
 
 #define CTX_ACTIVE(ctx) ((if_getdrvflags((ctx)->ifc_ifp) & IFF_DRV_RUNNING))
 
 #define CTX_LOCK_INIT(_sc, _name)  mtx_init(&(_sc)->ifc_mtx, _name, "iflib ctx lock", MTX_DEF)
 
 #define CTX_LOCK(ctx) mtx_lock(&(ctx)->ifc_mtx)
 #define CTX_UNLOCK(ctx) mtx_unlock(&(ctx)->ifc_mtx)
 #define CTX_LOCK_DESTROY(ctx) mtx_destroy(&(ctx)->ifc_mtx)
 
 
 #define TXDB_LOCK_INIT(txq)  mtx_init(&(txq)->ift_db_mtx, (txq)->ift_db_mtx_name, NULL, MTX_DEF)
 #define TXDB_TRYLOCK(txq) mtx_trylock(&(txq)->ift_db_mtx)
 #define TXDB_LOCK(txq) mtx_lock(&(txq)->ift_db_mtx)
 #define TXDB_UNLOCK(txq) mtx_unlock(&(txq)->ift_db_mtx)
 #define TXDB_LOCK_DESTROY(txq) mtx_destroy(&(txq)->ift_db_mtx)
 
 #define CALLOUT_LOCK(txq)	mtx_lock(&txq->ift_mtx)
 #define CALLOUT_UNLOCK(txq) 	mtx_unlock(&txq->ift_mtx)
 
 
 /* Our boot-time initialization hook */
 static int	iflib_module_event_handler(module_t, int, void *);
 
 static moduledata_t iflib_moduledata = {
 	"iflib",
 	iflib_module_event_handler,
 	NULL
 };
 
 DECLARE_MODULE(iflib, iflib_moduledata, SI_SUB_INIT_IF, SI_ORDER_ANY);
 MODULE_VERSION(iflib, 1);
 
 MODULE_DEPEND(iflib, pci, 1, 1, 1);
 MODULE_DEPEND(iflib, ether, 1, 1, 1);
 
-TASKQGROUP_DEFINE(if_io_tqg, mp_ncpus, 1);
 TASKQGROUP_DEFINE(if_config_tqg, 1, 1);
 
 #ifndef IFLIB_DEBUG_COUNTERS
 #ifdef INVARIANTS
 #define IFLIB_DEBUG_COUNTERS 1
 #else
 #define IFLIB_DEBUG_COUNTERS 0
 #endif /* !INVARIANTS */
 #endif
 
 static SYSCTL_NODE(_net, OID_AUTO, iflib, CTLFLAG_RD, 0,
                    "iflib driver parameters");
 
 /*
  * XXX need to ensure that this can't accidentally cause the head to be moved backwards 
  */
 static int iflib_min_tx_latency = 0;
 
 SYSCTL_INT(_net_iflib, OID_AUTO, min_tx_latency, CTLFLAG_RW,
 		   &iflib_min_tx_latency, 0, "minimize transmit latency at the possible expense of throughput");
 
 
 #if IFLIB_DEBUG_COUNTERS
 
 static int iflib_tx_seen;
 static int iflib_tx_sent;
 static int iflib_tx_encap;
 static int iflib_rx_allocs;
 static int iflib_fl_refills;
 static int iflib_fl_refills_large;
 static int iflib_tx_frees;
 
 SYSCTL_INT(_net_iflib, OID_AUTO, tx_seen, CTLFLAG_RD,
 		   &iflib_tx_seen, 0, "# tx mbufs seen");
 SYSCTL_INT(_net_iflib, OID_AUTO, tx_sent, CTLFLAG_RD,
 		   &iflib_tx_sent, 0, "# tx mbufs sent");
 SYSCTL_INT(_net_iflib, OID_AUTO, tx_encap, CTLFLAG_RD,
 		   &iflib_tx_encap, 0, "# tx mbufs encapped");
 SYSCTL_INT(_net_iflib, OID_AUTO, tx_frees, CTLFLAG_RD,
 		   &iflib_tx_frees, 0, "# tx frees");
 SYSCTL_INT(_net_iflib, OID_AUTO, rx_allocs, CTLFLAG_RD,
 		   &iflib_rx_allocs, 0, "# rx allocations");
 SYSCTL_INT(_net_iflib, OID_AUTO, fl_refills, CTLFLAG_RD,
 		   &iflib_fl_refills, 0, "# refills");
 SYSCTL_INT(_net_iflib, OID_AUTO, fl_refills_large, CTLFLAG_RD,
 		   &iflib_fl_refills_large, 0, "# large refills");
 
 
 static int iflib_txq_drain_flushing;
 static int iflib_txq_drain_oactive;
 static int iflib_txq_drain_notready;
 static int iflib_txq_drain_encapfail;
 
 SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_flushing, CTLFLAG_RD,
 		   &iflib_txq_drain_flushing, 0, "# drain flushes");
 SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_oactive, CTLFLAG_RD,
 		   &iflib_txq_drain_oactive, 0, "# drain oactives");
 SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_notready, CTLFLAG_RD,
 		   &iflib_txq_drain_notready, 0, "# drain notready");
 SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_encapfail, CTLFLAG_RD,
 		   &iflib_txq_drain_encapfail, 0, "# drain encap fails");
 
 
 static int iflib_encap_load_mbuf_fail;
 static int iflib_encap_txq_avail_fail;
 static int iflib_encap_txd_encap_fail;
 
 SYSCTL_INT(_net_iflib, OID_AUTO, encap_load_mbuf_fail, CTLFLAG_RD,
 		   &iflib_encap_load_mbuf_fail, 0, "# busdma load failures");
 SYSCTL_INT(_net_iflib, OID_AUTO, encap_txq_avail_fail, CTLFLAG_RD,
 		   &iflib_encap_txq_avail_fail, 0, "# txq avail failures");
 SYSCTL_INT(_net_iflib, OID_AUTO, encap_txd_encap_fail, CTLFLAG_RD,
 		   &iflib_encap_txd_encap_fail, 0, "# driver encap failures");
 
 static int iflib_task_fn_rxs;
 static int iflib_rx_intr_enables;
 static int iflib_fast_intrs;
 static int iflib_intr_link;
 static int iflib_intr_msix; 
 static int iflib_rx_unavail;
 static int iflib_rx_ctx_inactive;
 static int iflib_rx_zero_len;
 static int iflib_rx_if_input;
 static int iflib_rx_mbuf_null;
 static int iflib_rxd_flush;
 
 static int iflib_verbose_debug;
 
 SYSCTL_INT(_net_iflib, OID_AUTO, intr_link, CTLFLAG_RD,
 		   &iflib_intr_link, 0, "# intr link calls");
 SYSCTL_INT(_net_iflib, OID_AUTO, intr_msix, CTLFLAG_RD,
 		   &iflib_intr_msix, 0, "# intr msix calls");
 SYSCTL_INT(_net_iflib, OID_AUTO, task_fn_rx, CTLFLAG_RD,
 		   &iflib_task_fn_rxs, 0, "# task_fn_rx calls");
 SYSCTL_INT(_net_iflib, OID_AUTO, rx_intr_enables, CTLFLAG_RD,
 		   &iflib_rx_intr_enables, 0, "# rx intr enables");
 SYSCTL_INT(_net_iflib, OID_AUTO, fast_intrs, CTLFLAG_RD,
 		   &iflib_fast_intrs, 0, "# fast_intr calls");
 SYSCTL_INT(_net_iflib, OID_AUTO, rx_unavail, CTLFLAG_RD,
 		   &iflib_rx_unavail, 0, "# times rxeof called with no available data");
 SYSCTL_INT(_net_iflib, OID_AUTO, rx_ctx_inactive, CTLFLAG_RD,
 		   &iflib_rx_ctx_inactive, 0, "# times rxeof called with inactive context");
 SYSCTL_INT(_net_iflib, OID_AUTO, rx_zero_len, CTLFLAG_RD,
 		   &iflib_rx_zero_len, 0, "# times rxeof saw zero len mbuf");
 SYSCTL_INT(_net_iflib, OID_AUTO, rx_if_input, CTLFLAG_RD,
 		   &iflib_rx_if_input, 0, "# times rxeof called if_input");
 SYSCTL_INT(_net_iflib, OID_AUTO, rx_mbuf_null, CTLFLAG_RD,
 		   &iflib_rx_mbuf_null, 0, "# times rxeof got null mbuf");
 SYSCTL_INT(_net_iflib, OID_AUTO, rxd_flush, CTLFLAG_RD,
 	         &iflib_rxd_flush, 0, "# times rxd_flush called");
 SYSCTL_INT(_net_iflib, OID_AUTO, verbose_debug, CTLFLAG_RW,
 		   &iflib_verbose_debug, 0, "enable verbose debugging");
 
 #define DBG_COUNTER_INC(name) atomic_add_int(&(iflib_ ## name), 1)
 static void
 iflib_debug_reset(void)
 {
 	iflib_tx_seen = iflib_tx_sent = iflib_tx_encap = iflib_rx_allocs =
 		iflib_fl_refills = iflib_fl_refills_large = iflib_tx_frees =
 		iflib_txq_drain_flushing = iflib_txq_drain_oactive =
 		iflib_txq_drain_notready = iflib_txq_drain_encapfail =
 		iflib_encap_load_mbuf_fail = iflib_encap_txq_avail_fail =
 		iflib_encap_txd_encap_fail = iflib_task_fn_rxs = iflib_rx_intr_enables =
 		iflib_fast_intrs = iflib_intr_link = iflib_intr_msix = iflib_rx_unavail =
 		iflib_rx_ctx_inactive = iflib_rx_zero_len = iflib_rx_if_input =
 		iflib_rx_mbuf_null = iflib_rxd_flush = 0;
 }
 
 #else
 #define DBG_COUNTER_INC(name)
 static void iflib_debug_reset(void) {}
 #endif
 
 
 
 #define IFLIB_DEBUG 0
 
 static void iflib_tx_structures_free(if_ctx_t ctx);
 static void iflib_rx_structures_free(if_ctx_t ctx);
 static int iflib_queues_alloc(if_ctx_t ctx);
 static int iflib_tx_credits_update(if_ctx_t ctx, iflib_txq_t txq);
 static int iflib_rxd_avail(if_ctx_t ctx, iflib_rxq_t rxq, int cidx, int budget);
 static int iflib_qset_structures_setup(if_ctx_t ctx);
 static int iflib_msix_init(if_ctx_t ctx);
 static int iflib_legacy_setup(if_ctx_t ctx, driver_filter_t filter, void *filterarg, int *rid, char *str);
 static void iflib_txq_check_drain(iflib_txq_t txq, int budget);
 static uint32_t iflib_txq_can_drain(struct ifmp_ring *);
 static int iflib_register(if_ctx_t);
 static void iflib_init_locked(if_ctx_t ctx);
 static void iflib_add_device_sysctl_pre(if_ctx_t ctx);
 static void iflib_add_device_sysctl_post(if_ctx_t ctx);
 static void iflib_ifmp_purge(iflib_txq_t txq);
 static void _iflib_pre_assert(if_softc_ctx_t scctx);
 
 #ifdef DEV_NETMAP
 #include <sys/selinfo.h>
 #include <net/netmap.h>
 #include <dev/netmap/netmap_kern.h>
 
 MODULE_DEPEND(iflib, netmap, 1, 1, 1);
 
 /*
  * device-specific sysctl variables:
  *
  * iflib_crcstrip: 0: keep CRC in rx frames (default), 1: strip it.
  *	During regular operations the CRC is stripped, but on some
  *	hardware reception of frames not multiple of 64 is slower,
  *	so using crcstrip=0 helps in benchmarks.
  *
  * iflib_rx_miss, iflib_rx_miss_bufs:
  *	count packets that might be missed due to lost interrupts.
  */
 SYSCTL_DECL(_dev_netmap);
 /*
  * The xl driver by default strips CRCs and we do not override it.
  */
 
 int iflib_crcstrip = 1;
 SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_crcstrip,
     CTLFLAG_RW, &iflib_crcstrip, 1, "strip CRC on rx frames");
 
 int iflib_rx_miss, iflib_rx_miss_bufs;
 SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_rx_miss,
     CTLFLAG_RW, &iflib_rx_miss, 0, "potentially missed rx intr");
 SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_rx_miss_bufs,
     CTLFLAG_RW, &iflib_rx_miss_bufs, 0, "potentially missed rx intr bufs");
 
 /*
  * Register/unregister. We are already under netmap lock.
  * Only called on the first register or the last unregister.
  */
 static int
 iflib_netmap_register(struct netmap_adapter *na, int onoff)
 {
 	struct ifnet *ifp = na->ifp;
 	if_ctx_t ctx = ifp->if_softc;
 
 	CTX_LOCK(ctx);
 	IFDI_INTR_DISABLE(ctx);
 
 	/* Tell the stack that the interface is no longer active */
 	ifp->if_drv_flags &= ~(IFF_DRV_RUNNING | IFF_DRV_OACTIVE);
 
 	if (!CTX_IS_VF(ctx))
 		IFDI_CRCSTRIP_SET(ctx, onoff, iflib_crcstrip);
 
 	/* enable or disable flags and callbacks in na and ifp */
 	if (onoff) {
 		nm_set_native_flags(na);
 	} else {
 		nm_clear_native_flags(na);
 	}
 	IFDI_INIT(ctx);
 	IFDI_CRCSTRIP_SET(ctx, onoff, iflib_crcstrip); // XXX why twice ?
 	CTX_UNLOCK(ctx);
 	return (ifp->if_drv_flags & IFF_DRV_RUNNING ? 0 : 1);
 }
 
 /*
  * Reconcile kernel and user view of the transmit ring.
  *
  * All information is in the kring.
  * Userspace wants to send packets up to the one before kring->rhead,
  * kernel knows kring->nr_hwcur is the first unsent packet.
  *
  * Here we push packets out (as many as possible), and possibly
  * reclaim buffers from previously completed transmission.
  *
  * The caller (netmap) guarantees that there is only one instance
  * running at any time. Any interference with other driver
  * methods should be handled by the individual drivers.
  */
 static int
 iflib_netmap_txsync(struct netmap_kring *kring, int flags)
 {
 	struct netmap_adapter *na = kring->na;
 	struct ifnet *ifp = na->ifp;
 	struct netmap_ring *ring = kring->ring;
 	u_int nm_i;	/* index into the netmap ring */
 	u_int nic_i;	/* index into the NIC ring */
 	u_int n;
 	u_int const lim = kring->nkr_num_slots - 1;
 	u_int const head = kring->rhead;
 	struct if_pkt_info pi;
 
 	/*
 	 * interrupts on every tx packet are expensive so request
 	 * them every half ring, or where NS_REPORT is set
 	 */
 	u_int report_frequency = kring->nkr_num_slots >> 1;
 	/* device-specific */
 	if_ctx_t ctx = ifp->if_softc;
 	iflib_txq_t txq = &ctx->ifc_txqs[kring->ring_id];
 
 	pi.ipi_segs = txq->ift_segs;
 	pi.ipi_qsidx = kring->ring_id;
 	pi.ipi_ndescs = 0;
 
 	bus_dmamap_sync(txq->ift_desc_tag, txq->ift_ifdi->idi_map,
 					BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
 
 
 	/*
 	 * First part: process new packets to send.
 	 * nm_i is the current index in the netmap ring,
 	 * nic_i is the corresponding index in the NIC ring.
 	 *
 	 * If we have packets to send (nm_i != head)
 	 * iterate over the netmap ring, fetch length and update
 	 * the corresponding slot in the NIC ring. Some drivers also
 	 * need to update the buffer's physical address in the NIC slot
 	 * even NS_BUF_CHANGED is not set (PNMB computes the addresses).
 	 *
 	 * The netmap_reload_map() calls is especially expensive,
 	 * even when (as in this case) the tag is 0, so do only
 	 * when the buffer has actually changed.
 	 *
 	 * If possible do not set the report/intr bit on all slots,
 	 * but only a few times per ring or when NS_REPORT is set.
 	 *
 	 * Finally, on 10G and faster drivers, it might be useful
 	 * to prefetch the next slot and txr entry.
 	 */
 
 	nm_i = kring->nr_hwcur;
 	if (nm_i != head) {	/* we have new packets to send */
 		nic_i = netmap_idx_k2n(kring, nm_i);
 
 		__builtin_prefetch(&ring->slot[nm_i]);
 		__builtin_prefetch(&txq->ift_sds.ifsd_m[nic_i]);
 		__builtin_prefetch(&txq->ift_sds.ifsd_map[nic_i]);
 
 		for (n = 0; nm_i != head; n++) {
 			struct netmap_slot *slot = &ring->slot[nm_i];
 			u_int len = slot->len;
 			uint64_t paddr;
 			void *addr = PNMB(na, slot, &paddr);
 			int flags = (slot->flags & NS_REPORT ||
 				nic_i == 0 || nic_i == report_frequency) ?
 				IPI_TX_INTR : 0;
 
 			/* device-specific */
 			pi.ipi_pidx = nic_i;
 			pi.ipi_flags = flags;
 
 			/* Fill the slot in the NIC ring. */
 			ctx->isc_txd_encap(ctx->ifc_softc, &pi);
 
 			/* prefetch for next round */
 			__builtin_prefetch(&ring->slot[nm_i + 1]);
 			__builtin_prefetch(&txq->ift_sds.ifsd_m[nic_i + 1]);
 			__builtin_prefetch(&txq->ift_sds.ifsd_map[nic_i + 1]);
 
 			NM_CHECK_ADDR_LEN(na, addr, len);
 
 			if (slot->flags & NS_BUF_CHANGED) {
 				/* buffer has changed, reload map */
 				netmap_reload_map(na, txq->ift_desc_tag, txq->ift_sds.ifsd_map[nic_i], addr);
 			}
 			slot->flags &= ~(NS_REPORT | NS_BUF_CHANGED);
 
 			/* make sure changes to the buffer are synced */
 			bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_sds.ifsd_map[nic_i],
 							BUS_DMASYNC_PREWRITE);
 
 			nm_i = nm_next(nm_i, lim);
 			nic_i = nm_next(nic_i, lim);
 		}
 		kring->nr_hwcur = head;
 
 		/* synchronize the NIC ring */
 		bus_dmamap_sync(txq->ift_desc_tag, txq->ift_ifdi->idi_map,
 						BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
 
 		/* (re)start the tx unit up to slot nic_i (excluded) */
 		ctx->isc_txd_flush(ctx->ifc_softc, txq->ift_id, nic_i);
 	}
 
 	/*
 	 * Second part: reclaim buffers for completed transmissions.
 	 */
 	if (iflib_tx_credits_update(ctx, txq)) {
 		/* some tx completed, increment avail */
 		nic_i = txq->ift_cidx_processed;
 		kring->nr_hwtail = nm_prev(netmap_idx_n2k(kring, nic_i), lim);
 	}
 	return (0);
 }
 
 /*
  * Reconcile kernel and user view of the receive ring.
  * Same as for the txsync, this routine must be efficient.
  * The caller guarantees a single invocations, but races against
  * the rest of the driver should be handled here.
  *
  * On call, kring->rhead is the first packet that userspace wants
  * to keep, and kring->rcur is the wakeup point.
  * The kernel has previously reported packets up to kring->rtail.
  *
  * If (flags & NAF_FORCE_READ) also check for incoming packets irrespective
  * of whether or not we received an interrupt.
  */
 static int
 iflib_netmap_rxsync(struct netmap_kring *kring, int flags)
 {
 	struct netmap_adapter *na = kring->na;
 	struct ifnet *ifp = na->ifp;
 	struct netmap_ring *ring = kring->ring;
 	u_int nm_i;	/* index into the netmap ring */
 	u_int nic_i;	/* index into the NIC ring */
 	u_int i, n;
 	u_int const lim = kring->nkr_num_slots - 1;
 	u_int const head = kring->rhead;
 	int force_update = (flags & NAF_FORCE_READ) || kring->nr_kflags & NKR_PENDINTR;
 	struct if_rxd_info ri;
 	/* device-specific */
 	if_ctx_t ctx = ifp->if_softc;
 	iflib_rxq_t rxq = &ctx->ifc_rxqs[kring->ring_id];
 	iflib_fl_t fl = rxq->ifr_fl;
 	if (head > lim)
 		return netmap_ring_reinit(kring);
 
 	bzero(&ri, sizeof(ri));
 	ri.iri_qsidx = kring->ring_id;
 	ri.iri_ifp = ctx->ifc_ifp;
 	/* XXX check sync modes */
 	for (i = 0, fl = rxq->ifr_fl; i < rxq->ifr_nfl; i++, fl++)
 		bus_dmamap_sync(rxq->ifr_fl[i].ifl_desc_tag, fl->ifl_ifdi->idi_map,
 				BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
 
 	/*
 	 * First part: import newly received packets.
 	 *
 	 * nm_i is the index of the next free slot in the netmap ring,
 	 * nic_i is the index of the next received packet in the NIC ring,
 	 * and they may differ in case if_init() has been called while
 	 * in netmap mode. For the receive ring we have
 	 *
 	 *	nic_i = rxr->next_check;
 	 *	nm_i = kring->nr_hwtail (previous)
 	 * and
 	 *	nm_i == (nic_i + kring->nkr_hwofs) % ring_size
 	 *
 	 * rxr->next_check is set to 0 on a ring reinit
 	 */
 	if (netmap_no_pendintr || force_update) {
 		int crclen = iflib_crcstrip ? 0 : 4;
 		int error, avail;
 		uint16_t slot_flags = kring->nkr_slot_flags;
 
 		for (fl = rxq->ifr_fl, i = 0; i < rxq->ifr_nfl; i++, fl++) {
 			nic_i = fl->ifl_cidx;
 			nm_i = netmap_idx_n2k(kring, nic_i);
 			avail = ctx->isc_rxd_available(ctx->ifc_softc, kring->ring_id, nic_i, INT_MAX);
 			for (n = 0; avail > 0; n++, avail--) {
 				error = ctx->isc_rxd_pkt_get(ctx->ifc_softc, &ri);
 				if (error)
 					ring->slot[nm_i].len = 0;
 				else
 					ring->slot[nm_i].len = ri.iri_len - crclen;
 				ring->slot[nm_i].flags = slot_flags;
 				bus_dmamap_sync(fl->ifl_ifdi->idi_tag,
 								fl->ifl_sds.ifsd_map[nic_i], BUS_DMASYNC_POSTREAD);
 				nm_i = nm_next(nm_i, lim);
 				nic_i = nm_next(nic_i, lim);
 			}
 			if (n) { /* update the state variables */
 				if (netmap_no_pendintr && !force_update) {
 					/* diagnostics */
 					iflib_rx_miss ++;
 					iflib_rx_miss_bufs += n;
 				}
 				fl->ifl_cidx = nic_i;
 				kring->nr_hwtail = nm_i;
 			}
 			kring->nr_kflags &= ~NKR_PENDINTR;
 		}
 	}
 	/*
 	 * Second part: skip past packets that userspace has released.
 	 * (kring->nr_hwcur to head excluded),
 	 * and make the buffers available for reception.
 	 * As usual nm_i is the index in the netmap ring,
 	 * nic_i is the index in the NIC ring, and
 	 * nm_i == (nic_i + kring->nkr_hwofs) % ring_size
 	 */
 	/* XXX not sure how this will work with multiple free lists */
 	nm_i = kring->nr_hwcur;
 	if (nm_i != head) {
 		nic_i = netmap_idx_k2n(kring, nm_i);
 		for (n = 0; nm_i != head; n++) {
 			struct netmap_slot *slot = &ring->slot[nm_i];
 			uint64_t paddr;
 			caddr_t vaddr;
 			void *addr = PNMB(na, slot, &paddr);
 
 			if (addr == NETMAP_BUF_BASE(na)) /* bad buf */
 				goto ring_reset;
 
 			vaddr = addr;
 			if (slot->flags & NS_BUF_CHANGED) {
 				/* buffer has changed, reload map */
 				netmap_reload_map(na, fl->ifl_ifdi->idi_tag, fl->ifl_sds.ifsd_map[nic_i], addr);
 				slot->flags &= ~NS_BUF_CHANGED;
 			}
 			/*
 			 * XXX we should be batching this operation - TODO
 			 */
 			ctx->isc_rxd_refill(ctx->ifc_softc, rxq->ifr_id, fl->ifl_id, nic_i, &paddr, &vaddr, 1, fl->ifl_buf_size);
 			bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_sds.ifsd_map[nic_i],
 			    BUS_DMASYNC_PREREAD);
 			nm_i = nm_next(nm_i, lim);
 			nic_i = nm_next(nic_i, lim);
 		}
 		kring->nr_hwcur = head;
 
 		bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map,
 		    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
 		/*
 		 * IMPORTANT: we must leave one free slot in the ring,
 		 * so move nic_i back by one unit
 		 */
 		nic_i = nm_prev(nic_i, lim);
 		ctx->isc_rxd_flush(ctx->ifc_softc, rxq->ifr_id, fl->ifl_id, nic_i);
 	}
 
 	return 0;
 
 ring_reset:
 	return netmap_ring_reinit(kring);
 }
 
 static int
 iflib_netmap_attach(if_ctx_t ctx)
 {
 	struct netmap_adapter na;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
 
 	bzero(&na, sizeof(na));
 
 	na.ifp = ctx->ifc_ifp;
 	na.na_flags = NAF_BDG_MAYSLEEP;
 	MPASS(ctx->ifc_softc_ctx.isc_ntxqsets);
 	MPASS(ctx->ifc_softc_ctx.isc_nrxqsets);
 
 	na.num_tx_desc = scctx->isc_ntxd[0];
 	na.num_rx_desc = scctx->isc_nrxd[0];
 	na.nm_txsync = iflib_netmap_txsync;
 	na.nm_rxsync = iflib_netmap_rxsync;
 	na.nm_register = iflib_netmap_register;
 	na.num_tx_rings = ctx->ifc_softc_ctx.isc_ntxqsets;
 	na.num_rx_rings = ctx->ifc_softc_ctx.isc_nrxqsets;
 	return (netmap_attach(&na));
 }
 
 static void
 iflib_netmap_txq_init(if_ctx_t ctx, iflib_txq_t txq)
 {
 	struct netmap_adapter *na = NA(ctx->ifc_ifp);
 	struct netmap_slot *slot;
 
 	slot = netmap_reset(na, NR_TX, txq->ift_id, 0);
 	if (slot == NULL)
 		return;
 
 	for (int i = 0; i < ctx->ifc_softc_ctx.isc_ntxd[0]; i++) {
 
 		/*
 		 * In netmap mode, set the map for the packet buffer.
 		 * NOTE: Some drivers (not this one) also need to set
 		 * the physical buffer address in the NIC ring.
 		 * netmap_idx_n2k() maps a nic index, i, into the corresponding
 		 * netmap slot index, si
 		 */
 		int si = netmap_idx_n2k(&na->tx_rings[txq->ift_id], i);
 		netmap_load_map(na, txq->ift_desc_tag, txq->ift_sds.ifsd_map[i], NMB(na, slot + si));
 	}
 }
 static void
 iflib_netmap_rxq_init(if_ctx_t ctx, iflib_rxq_t rxq)
 {
 	struct netmap_adapter *na = NA(ctx->ifc_ifp);
 	struct netmap_slot *slot;
 	bus_dmamap_t *map;
 	int nrxd;
 
 	slot = netmap_reset(na, NR_RX, rxq->ifr_id, 0);
 	if (slot == NULL)
 		return;
 	map = rxq->ifr_fl[0].ifl_sds.ifsd_map;
 	nrxd = ctx->ifc_softc_ctx.isc_nrxd[0];
 	for (int i = 0; i < nrxd; i++, map++) {
 			int sj = netmap_idx_n2k(&na->rx_rings[rxq->ifr_id], i);
 			uint64_t paddr;
 			void *addr;
 			caddr_t vaddr;
 
 			vaddr = addr = PNMB(na, slot + sj, &paddr);
 			netmap_load_map(na, rxq->ifr_fl[0].ifl_ifdi->idi_tag, *map, addr);
 			/* Update descriptor and the cached value */
 			ctx->isc_rxd_refill(ctx->ifc_softc, rxq->ifr_id, 0 /* fl_id */, i, &paddr, &vaddr, 1, rxq->ifr_fl[0].ifl_buf_size);
 	}
 	/* preserve queue */
 	if (ctx->ifc_ifp->if_capenable & IFCAP_NETMAP) {
 		struct netmap_kring *kring = &na->rx_rings[rxq->ifr_id];
 		int t = na->num_rx_desc - 1 - nm_kr_rxspace(kring);
 		ctx->isc_rxd_flush(ctx->ifc_softc, rxq->ifr_id, 0 /* fl_id */, t);
 	} else
 		ctx->isc_rxd_flush(ctx->ifc_softc, rxq->ifr_id, 0 /* fl_id */, nrxd-1);
 }
 
 #define iflib_netmap_detach(ifp) netmap_detach(ifp)
 
 #else
 #define iflib_netmap_txq_init(ctx, txq)
 #define iflib_netmap_rxq_init(ctx, rxq)
 #define iflib_netmap_detach(ifp)
 
 #define iflib_netmap_attach(ctx) (0)
 #define netmap_rx_irq(ifp, qid, budget) (0)
 
 #endif
 
 #if defined(__i386__) || defined(__amd64__)
 static __inline void
 prefetch(void *x)
 {
 	__asm volatile("prefetcht0 %0" :: "m" (*(unsigned long *)x));
 }
 #else
 #define prefetch(x)
 #endif
 
 static void
 _iflib_dmamap_cb(void *arg, bus_dma_segment_t *segs, int nseg, int err)
 {
 	if (err)
 		return;
 	*(bus_addr_t *) arg = segs[0].ds_addr;
 }
 
 int
 iflib_dma_alloc(if_ctx_t ctx, int size, iflib_dma_info_t dma, int mapflags)
 {
 	int err;
 	if_shared_ctx_t sctx = ctx->ifc_sctx;
 	device_t dev = ctx->ifc_dev;
 
 	KASSERT(sctx->isc_q_align != 0, ("alignment value not initialized"));
 
 	err = bus_dma_tag_create(bus_get_dma_tag(dev), /* parent */
 				sctx->isc_q_align, 0,	/* alignment, bounds */
 				BUS_SPACE_MAXADDR,	/* lowaddr */
 				BUS_SPACE_MAXADDR,	/* highaddr */
 				NULL, NULL,		/* filter, filterarg */
 				size,			/* maxsize */
 				1,			/* nsegments */
 				size,			/* maxsegsize */
 				BUS_DMA_ALLOCNOW,	/* flags */
 				NULL,			/* lockfunc */
 				NULL,			/* lockarg */
 				&dma->idi_tag);
 	if (err) {
 		device_printf(dev,
 		    "%s: bus_dma_tag_create failed: %d\n",
 		    __func__, err);
 		goto fail_0;
 	}
 
 	err = bus_dmamem_alloc(dma->idi_tag, (void**) &dma->idi_vaddr,
 	    BUS_DMA_NOWAIT | BUS_DMA_COHERENT | BUS_DMA_ZERO, &dma->idi_map);
 	if (err) {
 		device_printf(dev,
 		    "%s: bus_dmamem_alloc(%ju) failed: %d\n",
 		    __func__, (uintmax_t)size, err);
 		goto fail_1;
 	}
 
 	dma->idi_paddr = IF_BAD_DMA;
 	err = bus_dmamap_load(dma->idi_tag, dma->idi_map, dma->idi_vaddr,
 	    size, _iflib_dmamap_cb, &dma->idi_paddr, mapflags | BUS_DMA_NOWAIT);
 	if (err || dma->idi_paddr == IF_BAD_DMA) {
 		device_printf(dev,
 		    "%s: bus_dmamap_load failed: %d\n",
 		    __func__, err);
 		goto fail_2;
 	}
 
 	dma->idi_size = size;
 	return (0);
 
 fail_2:
 	bus_dmamem_free(dma->idi_tag, dma->idi_vaddr, dma->idi_map);
 fail_1:
 	bus_dma_tag_destroy(dma->idi_tag);
 fail_0:
 	dma->idi_tag = NULL;
 
 	return (err);
 }
 
 int
 iflib_dma_alloc_multi(if_ctx_t ctx, int *sizes, iflib_dma_info_t *dmalist, int mapflags, int count)
 {
 	int i, err;
 	iflib_dma_info_t *dmaiter;
 
 	dmaiter = dmalist;
 	for (i = 0; i < count; i++, dmaiter++) {
 		if ((err = iflib_dma_alloc(ctx, sizes[i], *dmaiter, mapflags)) != 0)
 			break;
 	}
 	if (err)
 		iflib_dma_free_multi(dmalist, i);
 	return (err);
 }
 
 void
 iflib_dma_free(iflib_dma_info_t dma)
 {
 	if (dma->idi_tag == NULL)
 		return;
 	if (dma->idi_paddr != IF_BAD_DMA) {
 		bus_dmamap_sync(dma->idi_tag, dma->idi_map,
 		    BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
 		bus_dmamap_unload(dma->idi_tag, dma->idi_map);
 		dma->idi_paddr = IF_BAD_DMA;
 	}
 	if (dma->idi_vaddr != NULL) {
 		bus_dmamem_free(dma->idi_tag, dma->idi_vaddr, dma->idi_map);
 		dma->idi_vaddr = NULL;
 	}
 	bus_dma_tag_destroy(dma->idi_tag);
 	dma->idi_tag = NULL;
 }
 
 void
 iflib_dma_free_multi(iflib_dma_info_t *dmalist, int count)
 {
 	int i;
 	iflib_dma_info_t *dmaiter = dmalist;
 
 	for (i = 0; i < count; i++, dmaiter++)
 		iflib_dma_free(*dmaiter);
 }
 
 #ifdef EARLY_AP_STARTUP
 static const int iflib_started = 1;
 #else
 /*
  * We used to abuse the smp_started flag to decide if the queues have been
  * fully initialized (by late taskqgroup_adjust() calls in a SYSINIT()).
  * That gave bad races, since the SYSINIT() runs strictly after smp_started
  * is set.  Run a SYSINIT() strictly after that to just set a usable
  * completion flag.
  */
 
 static int iflib_started;
 
 static void
 iflib_record_started(void *arg)
 {
 	iflib_started = 1;
 }
 
 SYSINIT(iflib_record_started, SI_SUB_SMP + 1, SI_ORDER_FIRST,
 	iflib_record_started, NULL);
 #endif
 
 static int
 iflib_fast_intr(void *arg)
 {
 	iflib_filter_info_t info = arg;
 	struct grouptask *gtask = info->ifi_task;
 
 	if (!iflib_started)
 		return (FILTER_HANDLED);
 
 	DBG_COUNTER_INC(fast_intrs);
 	if (info->ifi_filter != NULL && info->ifi_filter(info->ifi_filter_arg) == FILTER_HANDLED)
 		return (FILTER_HANDLED);
 
 	GROUPTASK_ENQUEUE(gtask);
 	return (FILTER_HANDLED);
 }
 
 static int
 _iflib_irq_alloc(if_ctx_t ctx, if_irq_t irq, int rid,
 	driver_filter_t filter, driver_intr_t handler, void *arg,
 				 char *name)
 {
 	int rc;
 	struct resource *res;
 	void *tag;
 	device_t dev = ctx->ifc_dev;
 
 	MPASS(rid < 512);
 	irq->ii_rid = rid;
 	res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &irq->ii_rid,
 				     RF_SHAREABLE | RF_ACTIVE);
 	if (res == NULL) {
 		device_printf(dev,
 		    "failed to allocate IRQ for rid %d, name %s.\n", rid, name);
 		return (ENOMEM);
 	}
 	irq->ii_res = res;
 	KASSERT(filter == NULL || handler == NULL, ("filter and handler can't both be non-NULL"));
 	rc = bus_setup_intr(dev, res, INTR_MPSAFE | INTR_TYPE_NET,
 						filter, handler, arg, &tag);
 	if (rc != 0) {
 		device_printf(dev,
 		    "failed to setup interrupt for rid %d, name %s: %d\n",
 					  rid, name ? name : "unknown", rc);
 		return (rc);
 	} else if (name)
 		bus_describe_intr(dev, res, tag, "%s", name);
 
 	irq->ii_tag = tag;
 	return (0);
 }
 
 
 /*********************************************************************
  *
  *  Allocate memory for tx_buffer structures. The tx_buffer stores all
  *  the information needed to transmit a packet on the wire. This is
  *  called only once at attach, setup is done every reset.
  *
  **********************************************************************/
 
 static int
 iflib_txsd_alloc(iflib_txq_t txq)
 {
 	if_ctx_t ctx = txq->ift_ctx;
 	if_shared_ctx_t sctx = ctx->ifc_sctx;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
 	device_t dev = ctx->ifc_dev;
 	int err, nsegments, ntsosegments;
 
 	nsegments = scctx->isc_tx_nsegments;
 	ntsosegments = scctx->isc_tx_tso_segments_max;
 	MPASS(scctx->isc_ntxd[0] > 0);
 	MPASS(scctx->isc_ntxd[txq->ift_br_offset] > 0);
 	MPASS(nsegments > 0);
 	MPASS(ntsosegments > 0);
 	/*
 	 * Setup DMA descriptor areas.
 	 */
 	if ((err = bus_dma_tag_create(bus_get_dma_tag(dev),
 			       1, 0,			/* alignment, bounds */
 			       BUS_SPACE_MAXADDR,	/* lowaddr */
 			       BUS_SPACE_MAXADDR,	/* highaddr */
 			       NULL, NULL,		/* filter, filterarg */
 			       sctx->isc_tx_maxsize,		/* maxsize */
 			       nsegments,	/* nsegments */
 			       sctx->isc_tx_maxsegsize,	/* maxsegsize */
 			       0,			/* flags */
 			       NULL,			/* lockfunc */
 			       NULL,			/* lockfuncarg */
 			       &txq->ift_desc_tag))) {
 		device_printf(dev,"Unable to allocate TX DMA tag: %d\n", err);
 		device_printf(dev,"maxsize: %zd nsegments: %d maxsegsize: %zd\n",
 					  sctx->isc_tx_maxsize, nsegments, sctx->isc_tx_maxsegsize);
 		goto fail;
 	}
 	if ((err = bus_dma_tag_create(bus_get_dma_tag(dev),
 			       1, 0,			/* alignment, bounds */
 			       BUS_SPACE_MAXADDR,	/* lowaddr */
 			       BUS_SPACE_MAXADDR,	/* highaddr */
 			       NULL, NULL,		/* filter, filterarg */
 			       scctx->isc_tx_tso_size_max,		/* maxsize */
 			       ntsosegments,	/* nsegments */
 			       scctx->isc_tx_tso_segsize_max,	/* maxsegsize */
 			       0,			/* flags */
 			       NULL,			/* lockfunc */
 			       NULL,			/* lockfuncarg */
 			       &txq->ift_tso_desc_tag))) {
 		device_printf(dev,"Unable to allocate TX TSO DMA tag: %d\n", err);
 
 		goto fail;
 	}
 	if (!(txq->ift_sds.ifsd_flags =
 	    (uint8_t *) malloc(sizeof(uint8_t) *
 	    scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO))) {
 		device_printf(dev, "Unable to allocate tx_buffer memory\n");
 		err = ENOMEM;
 		goto fail;
 	}
 	if (!(txq->ift_sds.ifsd_m =
 	    (struct mbuf **) malloc(sizeof(struct mbuf *) *
 	    scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO))) {
 		device_printf(dev, "Unable to allocate tx_buffer memory\n");
 		err = ENOMEM;
 		goto fail;
 	}
 
         /* Create the descriptor buffer dma maps */
 #if defined(ACPI_DMAR) || (!(defined(__i386__) && !defined(__amd64__)))
 	if ((ctx->ifc_flags & IFC_DMAR) == 0)
 		return (0);
 
 	if (!(txq->ift_sds.ifsd_map =
 	    (bus_dmamap_t *) malloc(sizeof(bus_dmamap_t) * scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO))) {
 		device_printf(dev, "Unable to allocate tx_buffer map memory\n");
 		err = ENOMEM;
 		goto fail;
 	}
 
 	for (int i = 0; i < scctx->isc_ntxd[txq->ift_br_offset]; i++) {
 		err = bus_dmamap_create(txq->ift_desc_tag, 0, &txq->ift_sds.ifsd_map[i]);
 		if (err != 0) {
 			device_printf(dev, "Unable to create TX DMA map\n");
 			goto fail;
 		}
 	}
 #endif
 	return (0);
 fail:
 	/* We free all, it handles case where we are in the middle */
 	iflib_tx_structures_free(ctx);
 	return (err);
 }
 
 static void
 iflib_txsd_destroy(if_ctx_t ctx, iflib_txq_t txq, int i)
 {
 	bus_dmamap_t map;
 
 	map = NULL;
 	if (txq->ift_sds.ifsd_map != NULL)
 		map = txq->ift_sds.ifsd_map[i];
 	if (map != NULL) {
 		bus_dmamap_unload(txq->ift_desc_tag, map);
 		bus_dmamap_destroy(txq->ift_desc_tag, map);
 		txq->ift_sds.ifsd_map[i] = NULL;
 	}
 }
 
 static void
 iflib_txq_destroy(iflib_txq_t txq)
 {
 	if_ctx_t ctx = txq->ift_ctx;
 
 	for (int i = 0; i < txq->ift_size; i++)
 		iflib_txsd_destroy(ctx, txq, i);
 	if (txq->ift_sds.ifsd_map != NULL) {
 		free(txq->ift_sds.ifsd_map, M_IFLIB);
 		txq->ift_sds.ifsd_map = NULL;
 	}
 	if (txq->ift_sds.ifsd_m != NULL) {
 		free(txq->ift_sds.ifsd_m, M_IFLIB);
 		txq->ift_sds.ifsd_m = NULL;
 	}
 	if (txq->ift_sds.ifsd_flags != NULL) {
 		free(txq->ift_sds.ifsd_flags, M_IFLIB);
 		txq->ift_sds.ifsd_flags = NULL;
 	}
 	if (txq->ift_desc_tag != NULL) {
 		bus_dma_tag_destroy(txq->ift_desc_tag);
 		txq->ift_desc_tag = NULL;
 	}
 	if (txq->ift_tso_desc_tag != NULL) {
 		bus_dma_tag_destroy(txq->ift_tso_desc_tag);
 		txq->ift_tso_desc_tag = NULL;
 	}
 }
 
 static void
 iflib_txsd_free(if_ctx_t ctx, iflib_txq_t txq, int i)
 {
 	struct mbuf **mp;
 
 	mp = &txq->ift_sds.ifsd_m[i];
 	if (*mp == NULL)
 		return;
 
 	if (txq->ift_sds.ifsd_map != NULL) {
 		bus_dmamap_sync(txq->ift_desc_tag,
 				txq->ift_sds.ifsd_map[i],
 				BUS_DMASYNC_POSTWRITE);
 		bus_dmamap_unload(txq->ift_desc_tag,
 				  txq->ift_sds.ifsd_map[i]);
 	}
 	m_free(*mp);
 	DBG_COUNTER_INC(tx_frees);
 	*mp = NULL;
 }
 
 static int
 iflib_txq_setup(iflib_txq_t txq)
 {
 	if_ctx_t ctx = txq->ift_ctx;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
 	iflib_dma_info_t di;
 	int i;
 
 	/* Set number of descriptors available */
 	txq->ift_qstatus = IFLIB_QUEUE_IDLE;
 
 	/* Reset indices */
 	txq->ift_cidx_processed = txq->ift_pidx = txq->ift_cidx = txq->ift_npending = 0;
 	txq->ift_size = scctx->isc_ntxd[txq->ift_br_offset];
 
 	for (i = 0, di = txq->ift_ifdi; i < ctx->ifc_nhwtxqs; i++, di++)
 		bzero((void *)di->idi_vaddr, di->idi_size);
 
 	IFDI_TXQ_SETUP(ctx, txq->ift_id);
 	for (i = 0, di = txq->ift_ifdi; i < ctx->ifc_nhwtxqs; i++, di++)
 		bus_dmamap_sync(di->idi_tag, di->idi_map,
 						BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
 	return (0);
 }
 
 /*********************************************************************
  *
  *  Allocate memory for rx_buffer structures. Since we use one
  *  rx_buffer per received packet, the maximum number of rx_buffer's
  *  that we'll need is equal to the number of receive descriptors
  *  that we've allocated.
  *
  **********************************************************************/
 static int
 iflib_rxsd_alloc(iflib_rxq_t rxq)
 {
 	if_ctx_t ctx = rxq->ifr_ctx;
 	if_shared_ctx_t sctx = ctx->ifc_sctx;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
 	device_t dev = ctx->ifc_dev;
 	iflib_fl_t fl;
 	int			err;
 
 	MPASS(scctx->isc_nrxd[0] > 0);
 	MPASS(scctx->isc_nrxd[rxq->ifr_fl_offset] > 0);
 
 	fl = rxq->ifr_fl;
 	for (int i = 0; i <  rxq->ifr_nfl; i++, fl++) {
 		fl->ifl_size = scctx->isc_nrxd[rxq->ifr_fl_offset]; /* this isn't necessarily the same */
 		err = bus_dma_tag_create(bus_get_dma_tag(dev), /* parent */
 					 1, 0,			/* alignment, bounds */
 					 BUS_SPACE_MAXADDR,	/* lowaddr */
 					 BUS_SPACE_MAXADDR,	/* highaddr */
 					 NULL, NULL,		/* filter, filterarg */
 					 sctx->isc_rx_maxsize,	/* maxsize */
 					 sctx->isc_rx_nsegments,	/* nsegments */
 					 sctx->isc_rx_maxsegsize,	/* maxsegsize */
 					 0,			/* flags */
 					 NULL,			/* lockfunc */
 					 NULL,			/* lockarg */
 					 &fl->ifl_desc_tag);
 		if (err) {
 			device_printf(dev, "%s: bus_dma_tag_create failed %d\n",
 				__func__, err);
 			goto fail;
 		}
 		if (!(fl->ifl_sds.ifsd_flags =
 		      (uint8_t *) malloc(sizeof(uint8_t) *
 					 scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) {
 			device_printf(dev, "Unable to allocate tx_buffer memory\n");
 			err = ENOMEM;
 			goto fail;
 		}
 		if (!(fl->ifl_sds.ifsd_m =
 		      (struct mbuf **) malloc(sizeof(struct mbuf *) *
 					      scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) {
 			device_printf(dev, "Unable to allocate tx_buffer memory\n");
 			err = ENOMEM;
 			goto fail;
 		}
 		if (!(fl->ifl_sds.ifsd_cl =
 		      (caddr_t *) malloc(sizeof(caddr_t) *
 					      scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) {
 			device_printf(dev, "Unable to allocate tx_buffer memory\n");
 			err = ENOMEM;
 			goto fail;
 		}
 
 		/* Create the descriptor buffer dma maps */
 #if defined(ACPI_DMAR) || (!(defined(__i386__) && !defined(__amd64__)))
 		if ((ctx->ifc_flags & IFC_DMAR) == 0)
 			continue;
 
 		if (!(fl->ifl_sds.ifsd_map =
 		      (bus_dmamap_t *) malloc(sizeof(bus_dmamap_t) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) {
 			device_printf(dev, "Unable to allocate tx_buffer map memory\n");
 			err = ENOMEM;
 			goto fail;
 		}
 
 		for (int i = 0; i < scctx->isc_nrxd[rxq->ifr_fl_offset]; i++) {
 			err = bus_dmamap_create(fl->ifl_desc_tag, 0, &fl->ifl_sds.ifsd_map[i]);
 			if (err != 0) {
 				device_printf(dev, "Unable to create TX DMA map\n");
 				goto fail;
 			}
 		}
 #endif
 	}
 	return (0);
 
 fail:
 	iflib_rx_structures_free(ctx);
 	return (err);
 }
 
 
 /*
  * Internal service routines
  */
 
 struct rxq_refill_cb_arg {
 	int               error;
 	bus_dma_segment_t seg;
 	int               nseg;
 };
 
 static void
 _rxq_refill_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error)
 {
 	struct rxq_refill_cb_arg *cb_arg = arg;
 
 	cb_arg->error = error;
 	cb_arg->seg = segs[0];
 	cb_arg->nseg = nseg;
 }
 
 
 #ifdef ACPI_DMAR
 #define IS_DMAR(ctx) (ctx->ifc_flags & IFC_DMAR)
 #else
 #define IS_DMAR(ctx) (0)
 #endif
 
 /**
  *	rxq_refill - refill an rxq  free-buffer list
  *	@ctx: the iflib context
  *	@rxq: the free-list to refill
  *	@n: the number of new buffers to allocate
  *
  *	(Re)populate an rxq free-buffer list with up to @n new packet buffers.
  *	The caller must assure that @n does not exceed the queue's capacity.
  */
 static void
 _iflib_fl_refill(if_ctx_t ctx, iflib_fl_t fl, int count)
 {
 	struct mbuf *m;
 	int idx, pidx = fl->ifl_pidx;
 	caddr_t cl, *sd_cl;
 	struct mbuf **sd_m;
 	uint8_t *sd_flags;
 	bus_dmamap_t *sd_map;
 	int n, i = 0;
 	uint64_t bus_addr;
 	int err;
 
 	sd_m = fl->ifl_sds.ifsd_m;
 	sd_map = fl->ifl_sds.ifsd_map;
 	sd_cl = fl->ifl_sds.ifsd_cl;
 	sd_flags = fl->ifl_sds.ifsd_flags;
 	idx = pidx;
 
 	n  = count;
 	MPASS(n > 0);
 	MPASS(fl->ifl_credits + n <= fl->ifl_size);
 
 	if (pidx < fl->ifl_cidx)
 		MPASS(pidx + n <= fl->ifl_cidx);
 	if (pidx == fl->ifl_cidx && (fl->ifl_credits < fl->ifl_size))
 		MPASS(fl->ifl_gen == 0);
 	if (pidx > fl->ifl_cidx)
 		MPASS(n <= fl->ifl_size - pidx + fl->ifl_cidx);
 
 	DBG_COUNTER_INC(fl_refills);
 	if (n > 8)
 		DBG_COUNTER_INC(fl_refills_large);
 
 	while (n--) {
 		/*
 		 * We allocate an uninitialized mbuf + cluster, mbuf is
 		 * initialized after rx.
 		 *
 		 * If the cluster is still set then we know a minimum sized packet was received
 		 */
 		if ((cl = sd_cl[idx]) == NULL) {
 			if ((cl = sd_cl[idx] = m_cljget(NULL, M_NOWAIT, fl->ifl_buf_size)) == NULL)
 				break;
 #if MEMORY_LOGGING
 			fl->ifl_cl_enqueued++;
 #endif
 		}
 		if ((m = m_gethdr(M_NOWAIT, MT_NOINIT)) == NULL) {
 			break;
 		}
 #if MEMORY_LOGGING
 		fl->ifl_m_enqueued++;
 #endif
 
 		DBG_COUNTER_INC(rx_allocs);
 #ifdef notyet
 		if ((sd_flags[pidx] & RX_SW_DESC_MAP_CREATED) == 0) {
 			int err;
 
 			if ((err = bus_dmamap_create(fl->ifl_ifdi->idi_tag, 0, &sd_map[idx]))) {
 				log(LOG_WARNING, "bus_dmamap_create failed %d\n", err);
 				uma_zfree(fl->ifl_zone, cl);
 				n = 0;
 				goto done;
 			}
 			sd_flags[idx] |= RX_SW_DESC_MAP_CREATED;
 		}
 #endif
 #if defined(__i386__) || defined(__amd64__)
 		if (!IS_DMAR(ctx)) {
 			bus_addr = pmap_kextract((vm_offset_t)cl);
 		} else
 #endif
 		{
 			struct rxq_refill_cb_arg cb_arg;
 			iflib_rxq_t q;
 
 			cb_arg.error = 0;
 			q = fl->ifl_rxq;
 			err = bus_dmamap_load(fl->ifl_desc_tag, sd_map[idx],
 		         cl, fl->ifl_buf_size, _rxq_refill_cb, &cb_arg, 0);
 
 			if (err != 0 || cb_arg.error) {
 				/*
 				 * !zone_pack ?
 				 */
 				if (fl->ifl_zone == zone_pack)
 					uma_zfree(fl->ifl_zone, cl);
 				m_free(m);
 				n = 0;
 				goto done;
 			}
 			bus_addr = cb_arg.seg.ds_addr;
 		}
 		sd_flags[idx] |= RX_SW_DESC_INUSE;
 
 		MPASS(sd_m[idx] == NULL);
 		sd_cl[idx] = cl;
 		sd_m[idx] = m;
 		fl->ifl_bus_addrs[i] = bus_addr;
 		fl->ifl_vm_addrs[i] = cl;
 		fl->ifl_credits++;
 		i++;
 		MPASS(fl->ifl_credits <= fl->ifl_size);
 		if (++idx == fl->ifl_size) {
 			fl->ifl_gen = 1;
 			idx = 0;
 		}
 		if (n == 0 || i == IFLIB_MAX_RX_REFRESH) {
 			ctx->isc_rxd_refill(ctx->ifc_softc, fl->ifl_rxq->ifr_id, fl->ifl_id, pidx,
 								 fl->ifl_bus_addrs, fl->ifl_vm_addrs, i, fl->ifl_buf_size);
 			i = 0;
 			pidx = idx;
 		}
 		fl->ifl_pidx = idx;
 
 	}
 done:
 	DBG_COUNTER_INC(rxd_flush);
 	if (fl->ifl_pidx == 0)
 		pidx = fl->ifl_size - 1;
 	else
 		pidx = fl->ifl_pidx - 1;
 	ctx->isc_rxd_flush(ctx->ifc_softc, fl->ifl_rxq->ifr_id, fl->ifl_id, pidx);
 }
 
 static __inline void
 __iflib_fl_refill_lt(if_ctx_t ctx, iflib_fl_t fl, int max)
 {
 	/* we avoid allowing pidx to catch up with cidx as it confuses ixl */
 	int32_t reclaimable = fl->ifl_size - fl->ifl_credits - 1;
 #ifdef INVARIANTS
 	int32_t delta = fl->ifl_size - get_inuse(fl->ifl_size, fl->ifl_cidx, fl->ifl_pidx, fl->ifl_gen) - 1;
 #endif
 
 	MPASS(fl->ifl_credits <= fl->ifl_size);
 	MPASS(reclaimable == delta);
 
 	if (reclaimable > 0)
 		_iflib_fl_refill(ctx, fl, min(max, reclaimable));
 }
 
 static void
 iflib_fl_bufs_free(iflib_fl_t fl)
 {
 	iflib_dma_info_t idi = fl->ifl_ifdi;
 	uint32_t i;
 
 	for (i = 0; i < fl->ifl_size; i++) {
 		struct mbuf **sd_m = &fl->ifl_sds.ifsd_m[i];
 		uint8_t *sd_flags = &fl->ifl_sds.ifsd_flags[i];
 		caddr_t *sd_cl = &fl->ifl_sds.ifsd_cl[i];
 
 		if (*sd_flags & RX_SW_DESC_INUSE) {
 			if (fl->ifl_sds.ifsd_map != NULL) {
 				bus_dmamap_t sd_map = fl->ifl_sds.ifsd_map[i];
 				bus_dmamap_unload(fl->ifl_desc_tag, sd_map);
 				bus_dmamap_destroy(fl->ifl_desc_tag, sd_map);
 			}
 			if (*sd_m != NULL) {
 				m_init(*sd_m, M_NOWAIT, MT_DATA, 0);
 				uma_zfree(zone_mbuf, *sd_m);
 			}
 			if (*sd_cl != NULL)
 				uma_zfree(fl->ifl_zone, *sd_cl);
 			*sd_flags = 0;
 		} else {
 			MPASS(*sd_cl == NULL);
 			MPASS(*sd_m == NULL);
 		}
 #if MEMORY_LOGGING
 		fl->ifl_m_dequeued++;
 		fl->ifl_cl_dequeued++;
 #endif
 		*sd_cl = NULL;
 		*sd_m = NULL;
 	}
 	/*
 	 * Reset free list values
 	 */
 	fl->ifl_credits = fl->ifl_cidx = fl->ifl_pidx = fl->ifl_gen = 0;;
 	bzero(idi->idi_vaddr, idi->idi_size);
 }
 
 /*********************************************************************
  *
  *  Initialize a receive ring and its buffers.
  *
  **********************************************************************/
 static int
 iflib_fl_setup(iflib_fl_t fl)
 {
 	iflib_rxq_t rxq = fl->ifl_rxq;
 	if_ctx_t ctx = rxq->ifr_ctx;
 	if_softc_ctx_t sctx = &ctx->ifc_softc_ctx;
 
 	/*
 	** Free current RX buffer structs and their mbufs
 	*/
 	iflib_fl_bufs_free(fl);
 	/* Now replenish the mbufs */
 	MPASS(fl->ifl_credits == 0);
 	/*
 	 * XXX don't set the max_frame_size to larger
 	 * than the hardware can handle
 	 */
 	if (sctx->isc_max_frame_size <= 2048)
 		fl->ifl_buf_size = MCLBYTES;
 	else if (sctx->isc_max_frame_size <= 4096)
 		fl->ifl_buf_size = MJUMPAGESIZE;
 	else if (sctx->isc_max_frame_size <= 9216)
 		fl->ifl_buf_size = MJUM9BYTES;
 	else
 		fl->ifl_buf_size = MJUM16BYTES;
 	if (fl->ifl_buf_size > ctx->ifc_max_fl_buf_size)
 		ctx->ifc_max_fl_buf_size = fl->ifl_buf_size;
 	fl->ifl_cltype = m_gettype(fl->ifl_buf_size);
 	fl->ifl_zone = m_getzone(fl->ifl_buf_size);
 
 
 	/* avoid pre-allocating zillions of clusters to an idle card
 	 * potentially speeding up attach
 	 */
 	_iflib_fl_refill(ctx, fl, min(128, fl->ifl_size));
 	MPASS(min(128, fl->ifl_size) == fl->ifl_credits);
 	if (min(128, fl->ifl_size) != fl->ifl_credits)
 		return (ENOBUFS);
 	/*
 	 * handle failure
 	 */
 	MPASS(rxq != NULL);
 	MPASS(fl->ifl_ifdi != NULL);
 	bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map,
 	    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
 	return (0);
 }
 
 /*********************************************************************
  *
  *  Free receive ring data structures
  *
  **********************************************************************/
 static void
 iflib_rx_sds_free(iflib_rxq_t rxq)
 {
 	iflib_fl_t fl;
 	int i;
 
 	if (rxq->ifr_fl != NULL) {
 		for (i = 0; i < rxq->ifr_nfl; i++) {
 			fl = &rxq->ifr_fl[i];
 			if (fl->ifl_desc_tag != NULL) {
 				bus_dma_tag_destroy(fl->ifl_desc_tag);
 				fl->ifl_desc_tag = NULL;
 			}
 			free(fl->ifl_sds.ifsd_m, M_IFLIB);
 			free(fl->ifl_sds.ifsd_cl, M_IFLIB);
 			/* XXX destroy maps first */
 			free(fl->ifl_sds.ifsd_map, M_IFLIB);
 			fl->ifl_sds.ifsd_m = NULL;
 			fl->ifl_sds.ifsd_cl = NULL;
 			fl->ifl_sds.ifsd_map = NULL;
 		}
 		free(rxq->ifr_fl, M_IFLIB);
 		rxq->ifr_fl = NULL;
 		rxq->ifr_cq_gen = rxq->ifr_cq_cidx = rxq->ifr_cq_pidx = 0;
 	}
 }
 
 /*
  * MI independent logic
  *
  */
 static void
 iflib_timer(void *arg)
 {
 	iflib_txq_t txq = arg;
 	if_ctx_t ctx = txq->ift_ctx;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
 
 	if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))
 		return;
 	/*
 	** Check on the state of the TX queue(s), this
 	** can be done without the lock because its RO
 	** and the HUNG state will be static if set.
 	*/
 	IFDI_TIMER(ctx, txq->ift_id);
 	if ((txq->ift_qstatus == IFLIB_QUEUE_HUNG) &&
 		(ctx->ifc_pause_frames == 0))
 		goto hung;
 
 	if (TXQ_AVAIL(txq) <= 2*scctx->isc_tx_nsegments ||
 	    ifmp_ring_is_stalled(txq->ift_br[0]))
 		GROUPTASK_ENQUEUE(&txq->ift_task);
 
 	ctx->ifc_pause_frames = 0;
 	if (if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING) 
 		callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq, txq->ift_timer.c_cpu);
 	return;
 hung:
 	CTX_LOCK(ctx);
 	if_setdrvflagbits(ctx->ifc_ifp, 0, IFF_DRV_RUNNING);
 	device_printf(ctx->ifc_dev,  "TX(%d) desc avail = %d, pidx = %d\n",
 				  txq->ift_id, TXQ_AVAIL(txq), txq->ift_pidx);
 
 	IFDI_WATCHDOG_RESET(ctx);
 	ctx->ifc_watchdog_events++;
 	ctx->ifc_pause_frames = 0;
 
 	iflib_init_locked(ctx);
 	CTX_UNLOCK(ctx);
 }
 
 static void
 iflib_init_locked(if_ctx_t ctx)
 {
 	if_softc_ctx_t sctx = &ctx->ifc_softc_ctx;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
 	if_t ifp = ctx->ifc_ifp;
 	iflib_fl_t fl;
 	iflib_txq_t txq;
 	iflib_rxq_t rxq;
 	int i, j, tx_ip_csum_flags, tx_ip6_csum_flags;
 
 
 	if_setdrvflagbits(ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING);
 	IFDI_INTR_DISABLE(ctx);
 
 	tx_ip_csum_flags = scctx->isc_tx_csum_flags & (CSUM_IP | CSUM_TCP | CSUM_UDP | CSUM_SCTP);
 	tx_ip6_csum_flags = scctx->isc_tx_csum_flags & (CSUM_IP6_TCP | CSUM_IP6_UDP | CSUM_IP6_SCTP);
 	/* Set hardware offload abilities */
 	if_clearhwassist(ifp);
 	if (if_getcapenable(ifp) & IFCAP_TXCSUM)
 		if_sethwassistbits(ifp, tx_ip_csum_flags, 0);
 	if (if_getcapenable(ifp) & IFCAP_TXCSUM_IPV6)
 		if_sethwassistbits(ifp,  tx_ip6_csum_flags, 0);
 	if (if_getcapenable(ifp) & IFCAP_TSO4)
 		if_sethwassistbits(ifp, CSUM_IP_TSO, 0);
 	if (if_getcapenable(ifp) & IFCAP_TSO6)
 		if_sethwassistbits(ifp, CSUM_IP6_TSO, 0);
 
 	for (i = 0, txq = ctx->ifc_txqs; i < sctx->isc_ntxqsets; i++, txq++) {
 		CALLOUT_LOCK(txq);
 		callout_stop(&txq->ift_timer);
 		callout_stop(&txq->ift_db_check);
 		CALLOUT_UNLOCK(txq);
 		iflib_netmap_txq_init(ctx, txq);
 	}
 	for (i = 0, rxq = ctx->ifc_rxqs; i < sctx->isc_nrxqsets; i++, rxq++) {
 		iflib_netmap_rxq_init(ctx, rxq);
 	}
 #ifdef INVARIANTS
 	i = if_getdrvflags(ifp);
 #endif
 	IFDI_INIT(ctx);
 	MPASS(if_getdrvflags(ifp) == i);
 	for (i = 0, rxq = ctx->ifc_rxqs; i < sctx->isc_nrxqsets; i++, rxq++) {
 		for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) {
 			if (iflib_fl_setup(fl)) {
 				device_printf(ctx->ifc_dev, "freelist setup failed - check cluster settings\n");
 				goto done;
 			}
 		}
 	}
 	done:
 	if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_RUNNING, IFF_DRV_OACTIVE);
 	IFDI_INTR_ENABLE(ctx);
 	txq = ctx->ifc_txqs;
 	for (i = 0; i < sctx->isc_ntxqsets; i++, txq++)
 		callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq,
 			txq->ift_timer.c_cpu);
 }
 
 static int
 iflib_media_change(if_t ifp)
 {
 	if_ctx_t ctx = if_getsoftc(ifp);
 	int err;
 
 	CTX_LOCK(ctx);
 	if ((err = IFDI_MEDIA_CHANGE(ctx)) == 0)
 		iflib_init_locked(ctx);
 	CTX_UNLOCK(ctx);
 	return (err);
 }
 
 static void
 iflib_media_status(if_t ifp, struct ifmediareq *ifmr)
 {
 	if_ctx_t ctx = if_getsoftc(ifp);
 
 	CTX_LOCK(ctx);
 	IFDI_UPDATE_ADMIN_STATUS(ctx);
 	IFDI_MEDIA_STATUS(ctx, ifmr);
 	CTX_UNLOCK(ctx);
 }
 
 static void
 iflib_stop(if_ctx_t ctx)
 {
 	iflib_txq_t txq = ctx->ifc_txqs;
 	iflib_rxq_t rxq = ctx->ifc_rxqs;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
 	iflib_dma_info_t di;
 	iflib_fl_t fl;
 	int i, j;
 
 	/* Tell the stack that the interface is no longer active */
 	if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING);
 
 	IFDI_INTR_DISABLE(ctx);
 	DELAY(100000);
 	IFDI_STOP(ctx);
 	DELAY(100000);
 
 	iflib_debug_reset();
 	/* Wait for current tx queue users to exit to disarm watchdog timer. */
 	for (i = 0; i < scctx->isc_ntxqsets; i++, txq++) {
 		/* make sure all transmitters have completed before proceeding XXX */
 
 		/* clean any enqueued buffers */
 		iflib_ifmp_purge(txq);
 		/* Free any existing tx buffers. */
 		for (j = 0; j < txq->ift_size; j++) {
 			iflib_txsd_free(ctx, txq, j);
 		}
 		txq->ift_processed = txq->ift_cleaned = txq->ift_cidx_processed = 0;
 		txq->ift_in_use = txq->ift_gen = txq->ift_cidx = txq->ift_pidx = txq->ift_no_desc_avail = 0;
 		txq->ift_closed = txq->ift_mbuf_defrag = txq->ift_mbuf_defrag_failed = 0;
 		txq->ift_no_tx_dma_setup = txq->ift_txd_encap_efbig = txq->ift_map_failed = 0;
 		txq->ift_pullups = 0;
 		ifmp_ring_reset_stats(txq->ift_br[0]);
 		for (j = 0, di = txq->ift_ifdi; j < ctx->ifc_nhwtxqs; j++, di++)
 			bzero((void *)di->idi_vaddr, di->idi_size);
 	}
 	for (i = 0; i < scctx->isc_nrxqsets; i++, rxq++) {
 		/* make sure all transmitters have completed before proceeding XXX */
 
 		for (j = 0, di = txq->ift_ifdi; j < ctx->ifc_nhwrxqs; j++, di++)
 			bzero((void *)di->idi_vaddr, di->idi_size);
 		/* also resets the free lists pidx/cidx */
 		for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++)
 			iflib_fl_bufs_free(fl);
 	}
 }
 
 static inline void
 prefetch_pkts(iflib_fl_t fl, int cidx)
 {
 	int nextptr;
 	int nrxd = fl->ifl_size;
 
 	nextptr = (cidx + CACHE_PTR_INCREMENT) & (nrxd-1);
 	prefetch(&fl->ifl_sds.ifsd_m[nextptr]);
 	prefetch(&fl->ifl_sds.ifsd_cl[nextptr]);
 	prefetch(fl->ifl_sds.ifsd_m[(cidx + 1) & (nrxd-1)]);
 	prefetch(fl->ifl_sds.ifsd_m[(cidx + 2) & (nrxd-1)]);
 	prefetch(fl->ifl_sds.ifsd_m[(cidx + 3) & (nrxd-1)]);
 	prefetch(fl->ifl_sds.ifsd_m[(cidx + 4) & (nrxd-1)]);
 	prefetch(fl->ifl_sds.ifsd_cl[(cidx + 1) & (nrxd-1)]);
 	prefetch(fl->ifl_sds.ifsd_cl[(cidx + 2) & (nrxd-1)]);
 	prefetch(fl->ifl_sds.ifsd_cl[(cidx + 3) & (nrxd-1)]);
 	prefetch(fl->ifl_sds.ifsd_cl[(cidx + 4) & (nrxd-1)]);
 }
 
 static void
 rxd_frag_to_sd(iflib_rxq_t rxq, if_rxd_frag_t irf, int *cltype, int unload, iflib_fl_t *pfl, int *pcidx)
 {
 	int flid, cidx;
 	bus_dmamap_t map;
 	iflib_fl_t fl;
 	iflib_dma_info_t di;
 	int next;
 
 	flid = irf->irf_flid;
 	cidx = irf->irf_idx;
 	fl = &rxq->ifr_fl[flid];
 	fl->ifl_credits--;
 #if MEMORY_LOGGING
 	fl->ifl_m_dequeued++;
 	if (cltype)
 		fl->ifl_cl_dequeued++;
 #endif
 	prefetch_pkts(fl, cidx);
 	if (fl->ifl_sds.ifsd_map != NULL) {
 		next = (cidx + CACHE_PTR_INCREMENT) & (fl->ifl_size-1);
 		prefetch(&fl->ifl_sds.ifsd_map[next]);
 		map = fl->ifl_sds.ifsd_map[cidx];
 		di = fl->ifl_ifdi;
 		next = (cidx + CACHE_LINE_SIZE) & (fl->ifl_size-1);
 		prefetch(&fl->ifl_sds.ifsd_flags[next]);
 		bus_dmamap_sync(di->idi_tag, di->idi_map,
 				BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
 
 	/* not valid assert if bxe really does SGE from non-contiguous elements */
 		MPASS(fl->ifl_cidx == cidx);
 		if (unload)
 			bus_dmamap_unload(fl->ifl_desc_tag, map);
 	}
 	if (__predict_false(++fl->ifl_cidx == fl->ifl_size)) {
 		fl->ifl_cidx = 0;
 		fl->ifl_gen = 0;
 	}
 	/* YES ick */
 	if (cltype)
 		*cltype = fl->ifl_cltype;
 	*pfl = fl;
 	*pcidx = cidx;
 }
 
 static struct mbuf *
 assemble_segments(iflib_rxq_t rxq, if_rxd_info_t ri)
 {
 	int i, padlen , flags, cltype;
 	struct mbuf *m, *mh, *mt, *sd_m;
 	iflib_fl_t fl;
 	int cidx;
 	caddr_t cl, sd_cl;
 
 	i = 0;
 	mh = NULL;
 	do {
 		rxd_frag_to_sd(rxq, &ri->iri_frags[i], &cltype, TRUE, &fl, &cidx);
 		sd_m = fl->ifl_sds.ifsd_m[cidx];
 		sd_cl = fl->ifl_sds.ifsd_cl[cidx];
 
 		MPASS(sd_cl != NULL);
 		MPASS(sd_m != NULL);
 
 		/* Don't include zero-length frags */
 		if (ri->iri_frags[i].irf_len == 0) {
 			/* XXX we can save the cluster here, but not the mbuf */
 			m_init(sd_m, M_NOWAIT, MT_DATA, 0);
 			m_free(sd_m);
 			fl->ifl_sds.ifsd_m[cidx] = NULL;
 			continue;
 		}
 		m = sd_m;
 		if (mh == NULL) {
 			flags = M_PKTHDR|M_EXT;
 			mh = mt = m;
 			padlen = ri->iri_pad;
 		} else {
 			flags = M_EXT;
 			mt->m_next = m;
 			mt = m;
 			/* assuming padding is only on the first fragment */
 			padlen = 0;
 		}
 		fl->ifl_sds.ifsd_m[cidx] = NULL;
 		cl = fl->ifl_sds.ifsd_cl[cidx];
 		fl->ifl_sds.ifsd_cl[cidx] = NULL;
 
 		/* Can these two be made one ? */
 		m_init(m, M_NOWAIT, MT_DATA, flags);
 		m_cljset(m, cl, cltype);
 		/*
 		 * These must follow m_init and m_cljset
 		 */
 		m->m_data += padlen;
 		ri->iri_len -= padlen;
 		m->m_len = ri->iri_frags[i].irf_len;
 	} while (++i < ri->iri_nfrags);
 
 	return (mh);
 }
 
 /*
  * Process one software descriptor
  */
 static struct mbuf *
 iflib_rxd_pkt_get(iflib_rxq_t rxq, if_rxd_info_t ri)
 {
 	struct mbuf *m;
 	iflib_fl_t fl;
 	caddr_t sd_cl;
 	int cidx;
 
 	/* should I merge this back in now that the two paths are basically duplicated? */
 	if (ri->iri_nfrags == 1 &&
 	    ri->iri_frags[0].irf_len <= IFLIB_RX_COPY_THRESH) {
 		rxd_frag_to_sd(rxq, &ri->iri_frags[0], NULL, FALSE, &fl, &cidx);
 		m = fl->ifl_sds.ifsd_m[cidx];
 		fl->ifl_sds.ifsd_m[cidx] = NULL;
 		sd_cl = fl->ifl_sds.ifsd_cl[cidx];
 		m_init(m, M_NOWAIT, MT_DATA, M_PKTHDR);
 		memcpy(m->m_data, sd_cl, ri->iri_len);
 		m->m_len = ri->iri_frags[0].irf_len;
        } else {
 		m = assemble_segments(rxq, ri);
 	}
 	m->m_pkthdr.len = ri->iri_len;
 	m->m_pkthdr.rcvif = ri->iri_ifp;
 	m->m_flags |= ri->iri_flags;
 	m->m_pkthdr.ether_vtag = ri->iri_vtag;
 	m->m_pkthdr.flowid = ri->iri_flowid;
 	M_HASHTYPE_SET(m, ri->iri_rsstype);
 	m->m_pkthdr.csum_flags = ri->iri_csum_flags;
 	m->m_pkthdr.csum_data = ri->iri_csum_data;
 	return (m);
 }
 
 static bool
 iflib_rxeof(iflib_rxq_t rxq, int budget)
 {
 	if_ctx_t ctx = rxq->ifr_ctx;
 	if_shared_ctx_t sctx = ctx->ifc_sctx;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
 	int avail, i;
 	uint16_t *cidxp;
 	struct if_rxd_info ri;
 	int err, budget_left, rx_bytes, rx_pkts;
 	iflib_fl_t fl;
 	struct ifnet *ifp;
 	int lro_enabled;
 	/*
 	 * XXX early demux data packets so that if_input processing only handles
 	 * acks in interrupt context
 	 */
 	struct mbuf *m, *mh, *mt;
 
 	if (netmap_rx_irq(ctx->ifc_ifp, rxq->ifr_id, &budget)) {
 		return (FALSE);
 	}
 
 	mh = mt = NULL;
 	MPASS(budget > 0);
 	rx_pkts	= rx_bytes = 0;
 	if (sctx->isc_flags & IFLIB_HAS_RXCQ)
 		cidxp = &rxq->ifr_cq_cidx;
 	else
 		cidxp = &rxq->ifr_fl[0].ifl_cidx;
 	if ((avail = iflib_rxd_avail(ctx, rxq, *cidxp, budget)) == 0) {
 		for (i = 0, fl = &rxq->ifr_fl[0]; i < sctx->isc_nfl; i++, fl++)
 			__iflib_fl_refill_lt(ctx, fl, budget + 8);
 		DBG_COUNTER_INC(rx_unavail);
 		return (false);
 	}
 
 	for (budget_left = budget; (budget_left > 0) && (avail > 0); budget_left--, avail--) {
 		if (__predict_false(!CTX_ACTIVE(ctx))) {
 			DBG_COUNTER_INC(rx_ctx_inactive);
 			break;
 		}
 		/*
 		 * Reset client set fields to their default values
 		 */
 		bzero(&ri, sizeof(ri));
 		ri.iri_qsidx = rxq->ifr_id;
 		ri.iri_cidx = *cidxp;
 		ri.iri_ifp = ctx->ifc_ifp;
 		ri.iri_frags = rxq->ifr_frags;
 		err = ctx->isc_rxd_pkt_get(ctx->ifc_softc, &ri);
 
 		/* in lieu of handling correctly - make sure it isn't being unhandled */
 		MPASS(err == 0);
 		if (sctx->isc_flags & IFLIB_HAS_RXCQ) {
 			*cidxp = ri.iri_cidx;
 			/* Update our consumer index */
 			while (rxq->ifr_cq_cidx >= scctx->isc_nrxd[0]) {
 				rxq->ifr_cq_cidx -= scctx->isc_nrxd[0];
 				rxq->ifr_cq_gen = 0;
 			}
 			/* was this only a completion queue message? */
 			if (__predict_false(ri.iri_nfrags == 0))
 				continue;
 		}
 		MPASS(ri.iri_nfrags != 0);
 		MPASS(ri.iri_len != 0);
 
 		/* will advance the cidx on the corresponding free lists */
 		m = iflib_rxd_pkt_get(rxq, &ri);
 		if (avail == 0 && budget_left)
 			avail = iflib_rxd_avail(ctx, rxq, *cidxp, budget_left);
 
 		if (__predict_false(m == NULL)) {
 			DBG_COUNTER_INC(rx_mbuf_null);
 			continue;
 		}
 		/* imm_pkt: -- cxgb */
 		if (mh == NULL)
 			mh = mt = m;
 		else {
 			mt->m_nextpkt = m;
 			mt = m;
 		}
 	}
 	/* make sure that we can refill faster than drain */
 	for (i = 0, fl = &rxq->ifr_fl[0]; i < sctx->isc_nfl; i++, fl++)
 		__iflib_fl_refill_lt(ctx, fl, budget + 8);
 
 	ifp = ctx->ifc_ifp;
 	lro_enabled = (if_getcapenable(ifp) & IFCAP_LRO);
 	while (mh != NULL) {
 		m = mh;
 		mh = mh->m_nextpkt;
 		m->m_nextpkt = NULL;
 		rx_bytes += m->m_pkthdr.len;
 		rx_pkts++;
 #if defined(INET6) || defined(INET)
 		if (lro_enabled && tcp_lro_rx(&rxq->ifr_lc, m, 0) == 0)
 			continue;
 #endif
 		DBG_COUNTER_INC(rx_if_input);
 		ifp->if_input(ifp, m);
 	}
 
 	if_inc_counter(ifp, IFCOUNTER_IBYTES, rx_bytes);
 	if_inc_counter(ifp, IFCOUNTER_IPACKETS, rx_pkts);
 
 	/*
 	 * Flush any outstanding LRO work
 	 */
 #if defined(INET6) || defined(INET)
 	tcp_lro_flush_all(&rxq->ifr_lc);
 #endif
 	if (avail)
 		return true;
 	return (iflib_rxd_avail(ctx, rxq, *cidxp, 1));
 }
 
 #define M_CSUM_FLAGS(m) ((m)->m_pkthdr.csum_flags)
 #define M_HAS_VLANTAG(m) (m->m_flags & M_VLANTAG)
 #define TXQ_MAX_DB_DEFERRED(size) (size >> 5)
 #define TXQ_MAX_DB_CONSUMED(size) (size >> 4)
 
 static __inline void
 iflib_txd_db_check(if_ctx_t ctx, iflib_txq_t txq, int ring)
 {
 	uint32_t dbval;
 
 	if (ring || txq->ift_db_pending >=
 	    TXQ_MAX_DB_DEFERRED(txq->ift_size)) {
 
 		/* the lock will only ever be contended in the !min_latency case */
 		if (!TXDB_TRYLOCK(txq))
 			return;
 		dbval = txq->ift_npending ? txq->ift_npending : txq->ift_pidx;
 		ctx->isc_txd_flush(ctx->ifc_softc, txq->ift_id, dbval);
 		txq->ift_db_pending = txq->ift_npending = 0;
 		TXDB_UNLOCK(txq);
 	}
 }
 
 static void
 iflib_txd_deferred_db_check(void * arg)
 {
 	iflib_txq_t txq = arg;
 
 	/* simple non-zero boolean so use bitwise OR */
 	if ((txq->ift_db_pending | txq->ift_npending) &&
 	    txq->ift_db_pending >= txq->ift_db_pending_queued)
 		iflib_txd_db_check(txq->ift_ctx, txq, TRUE);
 	txq->ift_db_pending_queued = 0;
 	if (ifmp_ring_is_stalled(txq->ift_br[0]))
 		iflib_txq_check_drain(txq, 4);
 }
 
 #ifdef PKT_DEBUG
 static void
 print_pkt(if_pkt_info_t pi)
 {
 	printf("pi len:  %d qsidx: %d nsegs: %d ndescs: %d flags: %x pidx: %d\n",
 	       pi->ipi_len, pi->ipi_qsidx, pi->ipi_nsegs, pi->ipi_ndescs, pi->ipi_flags, pi->ipi_pidx);
 	printf("pi new_pidx: %d csum_flags: %lx tso_segsz: %d mflags: %x vtag: %d\n",
 	       pi->ipi_new_pidx, pi->ipi_csum_flags, pi->ipi_tso_segsz, pi->ipi_mflags, pi->ipi_vtag);
 	printf("pi etype: %d ehdrlen: %d ip_hlen: %d ipproto: %d\n",
 	       pi->ipi_etype, pi->ipi_ehdrlen, pi->ipi_ip_hlen, pi->ipi_ipproto);
 }
 #endif
 
 #define IS_TSO4(pi) ((pi)->ipi_csum_flags & CSUM_IP_TSO)
 #define IS_TSO6(pi) ((pi)->ipi_csum_flags & CSUM_IP6_TSO)
 
 static int
 iflib_parse_header(iflib_txq_t txq, if_pkt_info_t pi, struct mbuf **mp)
 {
 	if_shared_ctx_t sctx = txq->ift_ctx->ifc_sctx;
 	struct ether_vlan_header *eh;
 	struct mbuf *m, *n;
 
 	n = m = *mp;
 	if ((sctx->isc_flags & IFLIB_NEED_SCRATCH) &&
 	    M_WRITABLE(m) == 0) {
 		if ((m = m_dup(m, M_NOWAIT)) == NULL) {
 			return (ENOMEM);
 		} else {
 			m_freem(*mp);
 			n = *mp = m;
 		}
 	}
 
 	/*
 	 * Determine where frame payload starts.
 	 * Jump over vlan headers if already present,
 	 * helpful for QinQ too.
 	 */
 	if (__predict_false(m->m_len < sizeof(*eh))) {
 		txq->ift_pullups++;
 		if (__predict_false((m = m_pullup(m, sizeof(*eh))) == NULL))
 			return (ENOMEM);
 	}
 	eh = mtod(m, struct ether_vlan_header *);
 	if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) {
 		pi->ipi_etype = ntohs(eh->evl_proto);
 		pi->ipi_ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN;
 	} else {
 		pi->ipi_etype = ntohs(eh->evl_encap_proto);
 		pi->ipi_ehdrlen = ETHER_HDR_LEN;
 	}
 
 	switch (pi->ipi_etype) {
 #ifdef INET
 	case ETHERTYPE_IP:
 	{
 		struct ip *ip = NULL;
 		struct tcphdr *th = NULL;
 		int minthlen;
 
 		minthlen = min(m->m_pkthdr.len, pi->ipi_ehdrlen + sizeof(*ip) + sizeof(*th));
 		if (__predict_false(m->m_len < minthlen)) {
 			/*
 			 * if this code bloat is causing too much of a hit
 			 * move it to a separate function and mark it noinline
 			 */
 			if (m->m_len == pi->ipi_ehdrlen) {
 				n = m->m_next;
 				MPASS(n);
 				if (n->m_len >= sizeof(*ip))  {
 					ip = (struct ip *)n->m_data;
 					if (n->m_len >= (ip->ip_hl << 2) + sizeof(*th))
 						th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2));
 				} else {
 					txq->ift_pullups++;
 					if (__predict_false((m = m_pullup(m, minthlen)) == NULL))
 						return (ENOMEM);
 					ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen);
 				}
 			} else {
 				txq->ift_pullups++;
 				if (__predict_false((m = m_pullup(m, minthlen)) == NULL))
 					return (ENOMEM);
 				ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen);
 				if (m->m_len >= (ip->ip_hl << 2) + sizeof(*th))
 					th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2));
 			}
 		} else {
 			ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen);
 			if (m->m_len >= (ip->ip_hl << 2) + sizeof(*th))
 				th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2));
 		}
 		pi->ipi_ip_hlen = ip->ip_hl << 2;
 		pi->ipi_ipproto = ip->ip_p;
 		pi->ipi_flags |= IPI_TX_IPV4;
 
 		if (pi->ipi_csum_flags & CSUM_IP)
                        ip->ip_sum = 0;
 
 		if (pi->ipi_ipproto == IPPROTO_TCP) {
 			if (__predict_false(th == NULL)) {
 				txq->ift_pullups++;
 				if (__predict_false((m = m_pullup(m, (ip->ip_hl << 2) + sizeof(*th))) == NULL))
 					return (ENOMEM);
 				th = (struct tcphdr *)((caddr_t)ip + pi->ipi_ip_hlen);
 			}
 			pi->ipi_tcp_hflags = th->th_flags;
 			pi->ipi_tcp_hlen = th->th_off << 2;
 			pi->ipi_tcp_seq = th->th_seq;
 		}
 		if (IS_TSO4(pi)) {
 			if (__predict_false(ip->ip_p != IPPROTO_TCP))
 				return (ENXIO);
 			th->th_sum = in_pseudo(ip->ip_src.s_addr,
 					       ip->ip_dst.s_addr, htons(IPPROTO_TCP));
 			pi->ipi_tso_segsz = m->m_pkthdr.tso_segsz;
 			if (sctx->isc_flags & IFLIB_TSO_INIT_IP) {
 				ip->ip_sum = 0;
 				ip->ip_len = htons(pi->ipi_ip_hlen + pi->ipi_tcp_hlen + pi->ipi_tso_segsz);
 			}
 		}
 		break;
 	}
 #endif
 #ifdef INET6
 	case ETHERTYPE_IPV6:
 	{
 		struct ip6_hdr *ip6 = (struct ip6_hdr *)(m->m_data + pi->ipi_ehdrlen);
 		struct tcphdr *th;
 		pi->ipi_ip_hlen = sizeof(struct ip6_hdr);
 
 		if (__predict_false(m->m_len < pi->ipi_ehdrlen + sizeof(struct ip6_hdr))) {
 			if (__predict_false((m = m_pullup(m, pi->ipi_ehdrlen + sizeof(struct ip6_hdr))) == NULL))
 				return (ENOMEM);
 		}
 		th = (struct tcphdr *)((caddr_t)ip6 + pi->ipi_ip_hlen);
 
 		/* XXX-BZ this will go badly in case of ext hdrs. */
 		pi->ipi_ipproto = ip6->ip6_nxt;
 		pi->ipi_flags |= IPI_TX_IPV6;
 
 		if (pi->ipi_ipproto == IPPROTO_TCP) {
 			if (__predict_false(m->m_len < pi->ipi_ehdrlen + sizeof(struct ip6_hdr) + sizeof(struct tcphdr))) {
 				if (__predict_false((m = m_pullup(m, pi->ipi_ehdrlen + sizeof(struct ip6_hdr) + sizeof(struct tcphdr))) == NULL))
 					return (ENOMEM);
 			}
 			pi->ipi_tcp_hflags = th->th_flags;
 			pi->ipi_tcp_hlen = th->th_off << 2;
 		}
 		if (IS_TSO6(pi)) {
 
 			if (__predict_false(ip6->ip6_nxt != IPPROTO_TCP))
 				return (ENXIO);
 			/*
 			 * The corresponding flag is set by the stack in the IPv4
 			 * TSO case, but not in IPv6 (at least in FreeBSD 10.2).
 			 * So, set it here because the rest of the flow requires it.
 			 */
 			pi->ipi_csum_flags |= CSUM_TCP_IPV6;
 			th->th_sum = in6_cksum_pseudo(ip6, 0, IPPROTO_TCP, 0);
 			pi->ipi_tso_segsz = m->m_pkthdr.tso_segsz;
 		}
 		break;
 	}
 #endif
 	default:
 		pi->ipi_csum_flags &= ~CSUM_OFFLOAD;
 		pi->ipi_ip_hlen = 0;
 		break;
 	}
 	*mp = m;
 
 	return (0);
 }
 
 static  __noinline  struct mbuf *
 collapse_pkthdr(struct mbuf *m0)
 {
 	struct mbuf *m, *m_next, *tmp;
 
 	m = m0;
 	m_next = m->m_next;
 	while (m_next != NULL && m_next->m_len == 0) {
 		m = m_next;
 		m->m_next = NULL;
 		m_free(m);
 		m_next = m_next->m_next;
 	}
 	m = m0;
 	m->m_next = m_next;
 	if ((m_next->m_flags & M_EXT) == 0) {
 		m = m_defrag(m, M_NOWAIT);
 	} else {
 		tmp = m_next->m_next;
 		memcpy(m_next, m, MPKTHSIZE);
 		m = m_next;
 		m->m_next = tmp;
 	}
 	return (m);
 }
 
 /*
  * If dodgy hardware rejects the scatter gather chain we've handed it
  * we'll need to remove the mbuf chain from ifsg_m[] before we can add the
  * m_defrag'd mbufs
  */
 static __noinline struct mbuf *
 iflib_remove_mbuf(iflib_txq_t txq)
 {
 	int ntxd, i, pidx;
 	struct mbuf *m, *mh, **ifsd_m;
 
 	pidx = txq->ift_pidx;
 	ifsd_m = txq->ift_sds.ifsd_m;
 	ntxd = txq->ift_size;
 	mh = m = ifsd_m[pidx];
 	ifsd_m[pidx] = NULL;
 #if MEMORY_LOGGING
 	txq->ift_dequeued++;
 #endif
 	i = 1;
 
 	while (m) {
 		ifsd_m[(pidx + i) & (ntxd -1)] = NULL;
 #if MEMORY_LOGGING
 		txq->ift_dequeued++;
 #endif
 		m = m->m_next;
 		i++;
 	}
 	return (mh);
 }
 
 static int
 iflib_busdma_load_mbuf_sg(iflib_txq_t txq, bus_dma_tag_t tag, bus_dmamap_t map,
 			  struct mbuf **m0, bus_dma_segment_t *segs, int *nsegs,
 			  int max_segs, int flags)
 {
 	if_ctx_t ctx;
 	if_shared_ctx_t		sctx;
 	if_softc_ctx_t		scctx;
 	int i, next, pidx, mask, err, maxsegsz, ntxd, count;
 	struct mbuf *m, *tmp, **ifsd_m, **mp;
 
 	m = *m0;
 
 	/*
 	 * Please don't ever do this
 	 */
 	if (__predict_false(m->m_len == 0))
 		*m0 = m = collapse_pkthdr(m);
 
 	ctx = txq->ift_ctx;
 	sctx = ctx->ifc_sctx;
 	scctx = &ctx->ifc_softc_ctx;
 	ifsd_m = txq->ift_sds.ifsd_m;
 	ntxd = txq->ift_size;
 	pidx = txq->ift_pidx;
 	if (map != NULL) {
 		uint8_t *ifsd_flags = txq->ift_sds.ifsd_flags;
 
 		err = bus_dmamap_load_mbuf_sg(tag, map,
 					      *m0, segs, nsegs, BUS_DMA_NOWAIT);
 		if (err)
 			return (err);
 		ifsd_flags[pidx] |= TX_SW_DESC_MAPPED;
 		i = 0;
 		next = pidx;
 		mask = (txq->ift_size-1);
 		m = *m0;
 		do {
 			mp = &ifsd_m[next];
 			*mp = m;
 			m = m->m_next;
 			if (__predict_false((*mp)->m_len == 0)) {
 				m_free(*mp);
 				*mp = NULL;
 			} else
 				next = (pidx + i) & (ntxd-1);
 		} while (m != NULL);
 	} else {
 		int buflen, sgsize, max_sgsize;
 		vm_offset_t vaddr;
 		vm_paddr_t curaddr;
 
 		count = i = 0;
 		maxsegsz = sctx->isc_tx_maxsize;
 		m = *m0;
 		do {
 			if (__predict_false(m->m_len <= 0)) {
 				tmp = m;
 				m = m->m_next;
 				tmp->m_next = NULL;
 				m_free(tmp);
 				continue;
 			}
 			buflen = m->m_len;
 			vaddr = (vm_offset_t)m->m_data;
 			/*
 			 * see if we can't be smarter about physically
 			 * contiguous mappings
 			 */
 			next = (pidx + count) & (ntxd-1);
 			MPASS(ifsd_m[next] == NULL);
 #if MEMORY_LOGGING
 			txq->ift_enqueued++;
 #endif
 			ifsd_m[next] = m;
 			while (buflen > 0) {
 				max_sgsize = MIN(buflen, maxsegsz);
 				curaddr = pmap_kextract(vaddr);
 				sgsize = PAGE_SIZE - (curaddr & PAGE_MASK);
 				sgsize = MIN(sgsize, max_sgsize);
 				segs[i].ds_addr = curaddr;
 				segs[i].ds_len = sgsize;
 				vaddr += sgsize;
 				buflen -= sgsize;
 				i++;
 				if (i >= max_segs)
 					goto err;
 			}
 			count++;
 			tmp = m;
 			m = m->m_next;
 		} while (m != NULL);
 		*nsegs = i;
 	}
 	return (0);
 err:
 	*m0 = iflib_remove_mbuf(txq);
 	return (EFBIG);
 }
 
 static int
 iflib_encap(iflib_txq_t txq, struct mbuf **m_headp)
 {
 	if_ctx_t		ctx;
 	if_shared_ctx_t		sctx;
 	if_softc_ctx_t		scctx;
 	bus_dma_segment_t	*segs;
 	struct mbuf		*m_head;
 	bus_dmamap_t		map;
 	struct if_pkt_info	pi;
 	int remap = 0;
 	int err, nsegs, ndesc, max_segs, pidx, cidx, next, ntxd;
 	bus_dma_tag_t desc_tag;
 
 	segs = txq->ift_segs;
 	ctx = txq->ift_ctx;
 	sctx = ctx->ifc_sctx;
 	scctx = &ctx->ifc_softc_ctx;
 	segs = txq->ift_segs;
 	ntxd = txq->ift_size;
 	m_head = *m_headp;
 	map = NULL;
 
 	/*
 	 * If we're doing TSO the next descriptor to clean may be quite far ahead
 	 */
 	cidx = txq->ift_cidx;
 	pidx = txq->ift_pidx;
 	next = (cidx + CACHE_PTR_INCREMENT) & (ntxd-1);
 
 	/* prefetch the next cache line of mbuf pointers and flags */
 	prefetch(&txq->ift_sds.ifsd_m[next]);
 	if (txq->ift_sds.ifsd_map != NULL) {
 		prefetch(&txq->ift_sds.ifsd_map[next]);
 		map = txq->ift_sds.ifsd_map[pidx];
 		next = (cidx + CACHE_LINE_SIZE) & (ntxd-1);
 		prefetch(&txq->ift_sds.ifsd_flags[next]);
 	}
 
 
 	if (m_head->m_pkthdr.csum_flags & CSUM_TSO) {
 		desc_tag = txq->ift_tso_desc_tag;
 		max_segs = scctx->isc_tx_tso_segments_max;
 	} else {
 		desc_tag = txq->ift_desc_tag;
 		max_segs = scctx->isc_tx_nsegments;
 	}
 	m_head = *m_headp;
 	bzero(&pi, sizeof(pi));
 	pi.ipi_len = m_head->m_pkthdr.len;
 	pi.ipi_mflags = (m_head->m_flags & (M_VLANTAG|M_BCAST|M_MCAST));
 	pi.ipi_csum_flags = m_head->m_pkthdr.csum_flags;
 	pi.ipi_vtag = (m_head->m_flags & M_VLANTAG) ? m_head->m_pkthdr.ether_vtag : 0;
 	pi.ipi_pidx = pidx;
 	pi.ipi_qsidx = txq->ift_id;
 
 	/* deliberate bitwise OR to make one condition */
 	if (__predict_true((pi.ipi_csum_flags | pi.ipi_vtag))) {
 		if (__predict_false((err = iflib_parse_header(txq, &pi, m_headp)) != 0))
 			return (err);
 		m_head = *m_headp;
 	}
 
 retry:
 	err = iflib_busdma_load_mbuf_sg(txq, desc_tag, map, m_headp, segs, &nsegs, max_segs, BUS_DMA_NOWAIT);
 defrag:
 	if (__predict_false(err)) {
 		switch (err) {
 		case EFBIG:
 			/* try collapse once and defrag once */
 			if (remap == 0)
 				m_head = m_collapse(*m_headp, M_NOWAIT, max_segs);
 			if (remap == 1)
 				m_head = m_defrag(*m_headp, M_NOWAIT);
 			remap++;
 			if (__predict_false(m_head == NULL))
 				goto defrag_failed;
 			txq->ift_mbuf_defrag++;
 			*m_headp = m_head;
 			goto retry;
 			break;
 		case ENOMEM:
 			txq->ift_no_tx_dma_setup++;
 			break;
 		default:
 			txq->ift_no_tx_dma_setup++;
 			m_freem(*m_headp);
 			DBG_COUNTER_INC(tx_frees);
 			*m_headp = NULL;
 			break;
 		}
 		txq->ift_map_failed++;
 		DBG_COUNTER_INC(encap_load_mbuf_fail);
 		return (err);
 	}
 
 	/*
 	 * XXX assumes a 1 to 1 relationship between segments and
 	 *        descriptors - this does not hold true on all drivers, e.g.
 	 *        cxgb
 	 */
 	if (__predict_false(nsegs + 2 > TXQ_AVAIL(txq))) {
 		txq->ift_no_desc_avail++;
 		if (map != NULL)
 			bus_dmamap_unload(desc_tag, map);
 		DBG_COUNTER_INC(encap_txq_avail_fail);
 		if ((txq->ift_task.gt_task.ta_flags & TASK_ENQUEUED) == 0)
 			GROUPTASK_ENQUEUE(&txq->ift_task);
 		return (ENOBUFS);
 	}
 	pi.ipi_segs = segs;
 	pi.ipi_nsegs = nsegs;
 
 	MPASS(pidx >= 0 && pidx < txq->ift_size);
 #ifdef PKT_DEBUG
 	print_pkt(&pi);
 #endif
 	if ((err = ctx->isc_txd_encap(ctx->ifc_softc, &pi)) == 0) {
 		bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map,
 						BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
 
 		DBG_COUNTER_INC(tx_encap);
 		MPASS(pi.ipi_new_pidx >= 0 &&
 		    pi.ipi_new_pidx < txq->ift_size);
 
 		ndesc = pi.ipi_new_pidx - pi.ipi_pidx;
 		if (pi.ipi_new_pidx < pi.ipi_pidx) {
 			ndesc += txq->ift_size;
 			txq->ift_gen = 1;
 		}
 		/*
 		 * drivers can need as many as 
 		 * two sentinels
 		 */
 		MPASS(ndesc <= pi.ipi_nsegs + 2);
 		MPASS(pi.ipi_new_pidx != pidx);
 		MPASS(ndesc > 0);
 		txq->ift_in_use += ndesc;
 		/*
 		 * We update the last software descriptor again here because there may
 		 * be a sentinel and/or there may be more mbufs than segments
 		 */
 		txq->ift_pidx = pi.ipi_new_pidx;
 		txq->ift_npending += pi.ipi_ndescs;
 	} else if (__predict_false(err == EFBIG && remap < 2)) {
 		*m_headp = m_head = iflib_remove_mbuf(txq);
 		remap = 1;
 		txq->ift_txd_encap_efbig++;
 		goto defrag;
 	} else
 		DBG_COUNTER_INC(encap_txd_encap_fail);
 	return (err);
 
 defrag_failed:
 	txq->ift_mbuf_defrag_failed++;
 	txq->ift_map_failed++;
 	m_freem(*m_headp);
 	DBG_COUNTER_INC(tx_frees);
 	*m_headp = NULL;
 	return (ENOMEM);
 }
 
 /* forward compatibility for cxgb */
 #define FIRST_QSET(ctx) 0
 
 #define NTXQSETS(ctx) ((ctx)->ifc_softc_ctx.isc_ntxqsets)
 #define NRXQSETS(ctx) ((ctx)->ifc_softc_ctx.isc_nrxqsets)
 #define QIDX(ctx, m) ((((m)->m_pkthdr.flowid & ctx->ifc_softc_ctx.isc_rss_table_mask) % NTXQSETS(ctx)) + FIRST_QSET(ctx))
 #define DESC_RECLAIMABLE(q) ((int)((q)->ift_processed - (q)->ift_cleaned - (q)->ift_ctx->ifc_softc_ctx.isc_tx_nsegments))
 #define RECLAIM_THRESH(ctx) ((ctx)->ifc_sctx->isc_tx_reclaim_thresh)
 #define MAX_TX_DESC(ctx) ((ctx)->ifc_softc_ctx.isc_tx_tso_segments_max)
 
 
 
 /* if there are more than TXQ_MIN_OCCUPANCY packets pending we consider deferring
  * doorbell writes
  *
  * ORing with 2 assures that min occupancy is never less than 2 without any conditional logic
  */
 #define TXQ_MIN_OCCUPANCY(size) ((size >> 6)| 0x2)
 
 static inline int
 iflib_txq_min_occupancy(iflib_txq_t txq)
 {
 	if_ctx_t ctx;
 
 	ctx = txq->ift_ctx;
 	return (get_inuse(txq->ift_size, txq->ift_cidx, txq->ift_pidx,
 	    txq->ift_gen) < TXQ_MIN_OCCUPANCY(txq->ift_size) +
 	    MAX_TX_DESC(ctx));
 }
 
 static void
 iflib_tx_desc_free(iflib_txq_t txq, int n)
 {
 	int hasmap;
 	uint32_t qsize, cidx, mask, gen;
 	struct mbuf *m, **ifsd_m;
 	uint8_t *ifsd_flags;
 	bus_dmamap_t *ifsd_map;
 
 	cidx = txq->ift_cidx;
 	gen = txq->ift_gen;
 	qsize = txq->ift_size;
 	mask = qsize-1;
 	hasmap = txq->ift_sds.ifsd_map != NULL;
 	ifsd_flags = txq->ift_sds.ifsd_flags;
 	ifsd_m = txq->ift_sds.ifsd_m;
 	ifsd_map = txq->ift_sds.ifsd_map;
 
 	while (n--) {
 		prefetch(ifsd_m[(cidx + 3) & mask]);
 		prefetch(ifsd_m[(cidx + 4) & mask]);
 
 		if (ifsd_m[cidx] != NULL) {
 			prefetch(&ifsd_m[(cidx + CACHE_PTR_INCREMENT) & mask]);
 			prefetch(&ifsd_flags[(cidx + CACHE_PTR_INCREMENT) & mask]);
 			if (hasmap && (ifsd_flags[cidx] & TX_SW_DESC_MAPPED)) {
 				/*
 				 * does it matter if it's not the TSO tag? If so we'll
 				 * have to add the type to flags
 				 */
 				bus_dmamap_unload(txq->ift_desc_tag, ifsd_map[cidx]);
 				ifsd_flags[cidx] &= ~TX_SW_DESC_MAPPED;
 			}
 			if ((m = ifsd_m[cidx]) != NULL) {
 				/* XXX we don't support any drivers that batch packets yet */
 				MPASS(m->m_nextpkt == NULL);
 
 				m_free(m);
 				ifsd_m[cidx] = NULL;
 #if MEMORY_LOGGING
 				txq->ift_dequeued++;
 #endif
 				DBG_COUNTER_INC(tx_frees);
 			}
 		}
 		if (__predict_false(++cidx == qsize)) {
 			cidx = 0;
 			gen = 0;
 		}
 	}
 	txq->ift_cidx = cidx;
 	txq->ift_gen = gen;
 }
 
 static __inline int
 iflib_completed_tx_reclaim(iflib_txq_t txq, int thresh)
 {
 	int reclaim;
 	if_ctx_t ctx = txq->ift_ctx;
 
 	KASSERT(thresh >= 0, ("invalid threshold to reclaim"));
 	MPASS(thresh /*+ MAX_TX_DESC(txq->ift_ctx) */ < txq->ift_size);
 
 	/*
 	 * Need a rate-limiting check so that this isn't called every time
 	 */
 	iflib_tx_credits_update(ctx, txq);
 	reclaim = DESC_RECLAIMABLE(txq);
 
 	if (reclaim <= thresh /* + MAX_TX_DESC(txq->ift_ctx) */) {
 #ifdef INVARIANTS
 		if (iflib_verbose_debug) {
 			printf("%s processed=%ju cleaned=%ju tx_nsegments=%d reclaim=%d thresh=%d\n", __FUNCTION__,
 			       txq->ift_processed, txq->ift_cleaned, txq->ift_ctx->ifc_softc_ctx.isc_tx_nsegments,
 			       reclaim, thresh);
 
 		}
 #endif
 		return (0);
 	}
 	iflib_tx_desc_free(txq, reclaim);
 	txq->ift_cleaned += reclaim;
 	txq->ift_in_use -= reclaim;
 
 	if (txq->ift_active == FALSE)
 		txq->ift_active = TRUE;
 
 	return (reclaim);
 }
 
 static struct mbuf **
 _ring_peek_one(struct ifmp_ring *r, int cidx, int offset)
 {
 
 	return (__DEVOLATILE(struct mbuf **, &r->items[(cidx + offset) & (r->size-1)]));
 }
 
 static void
 iflib_txq_check_drain(iflib_txq_t txq, int budget)
 {
 
 	ifmp_ring_check_drainage(txq->ift_br[0], budget);
 }
 
 static uint32_t
 iflib_txq_can_drain(struct ifmp_ring *r)
 {
 	iflib_txq_t txq = r->cookie;
 	if_ctx_t ctx = txq->ift_ctx;
 
 	return ((TXQ_AVAIL(txq) > MAX_TX_DESC(ctx) + 2) ||
 		ctx->isc_txd_credits_update(ctx->ifc_softc, txq->ift_id, txq->ift_cidx_processed, false));
 }
 
 static uint32_t
 iflib_txq_drain(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx)
 {
 	iflib_txq_t txq = r->cookie;
 	if_ctx_t ctx = txq->ift_ctx;
 	if_t ifp = ctx->ifc_ifp;
 	struct mbuf **mp, *m;
 	int i, count, consumed, pkt_sent, bytes_sent, mcast_sent, avail, err, in_use_prev, desc_used;
 
 	if (__predict_false(!(if_getdrvflags(ifp) & IFF_DRV_RUNNING) ||
 			    !LINK_ACTIVE(ctx))) {
 		DBG_COUNTER_INC(txq_drain_notready);
 		return (0);
 	}
 
 	avail = IDXDIFF(pidx, cidx, r->size);
 	if (__predict_false(ctx->ifc_flags & IFC_QFLUSH)) {
 		DBG_COUNTER_INC(txq_drain_flushing);
 		for (i = 0; i < avail; i++) {
 			m_free(r->items[(cidx + i) & (r->size-1)]);
 			r->items[(cidx + i) & (r->size-1)] = NULL;
 		}
 		return (avail);
 	}
 	iflib_completed_tx_reclaim(txq, RECLAIM_THRESH(ctx));
 	if (__predict_false(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_OACTIVE)) {
 		txq->ift_qstatus = IFLIB_QUEUE_IDLE;
 		CALLOUT_LOCK(txq);
 		callout_stop(&txq->ift_timer);
 		callout_stop(&txq->ift_db_check);
 		CALLOUT_UNLOCK(txq);
 		DBG_COUNTER_INC(txq_drain_oactive);
 		return (0);
 	}
 	consumed = mcast_sent = bytes_sent = pkt_sent = 0;
 	count = MIN(avail, TX_BATCH_SIZE);
 #ifdef INVARIANTS
 	if (iflib_verbose_debug)
 		printf("%s avail=%d ifc_flags=%x txq_avail=%d ", __FUNCTION__,
 		       avail, ctx->ifc_flags, TXQ_AVAIL(txq));
 #endif
 
 	for (desc_used = i = 0; i < count && TXQ_AVAIL(txq) > MAX_TX_DESC(ctx) + 2; i++) {
 		mp = _ring_peek_one(r, cidx, i);
 		MPASS(mp != NULL && *mp != NULL);
 		in_use_prev = txq->ift_in_use;
 		if ((err = iflib_encap(txq, mp)) == ENOBUFS) {
 			DBG_COUNTER_INC(txq_drain_encapfail);
 			/* no room - bail out */
 			break;
 		}
 		consumed++;
 		if (err) {
 			DBG_COUNTER_INC(txq_drain_encapfail);
 			/* we can't send this packet - skip it */
 			continue;
 		}
 		pkt_sent++;
 		m = *mp;
 		DBG_COUNTER_INC(tx_sent);
 		bytes_sent += m->m_pkthdr.len;
 		if (m->m_flags & M_MCAST)
 			mcast_sent++;
 
 		txq->ift_db_pending += (txq->ift_in_use - in_use_prev);
 		desc_used += (txq->ift_in_use - in_use_prev);
 		iflib_txd_db_check(ctx, txq, FALSE);
 		ETHER_BPF_MTAP(ifp, m);
 		if (__predict_false(!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)))
 			break;
 
 		if (desc_used >= TXQ_MAX_DB_CONSUMED(txq->ift_size))
 			break;
 	}
 
 	if ((iflib_min_tx_latency || iflib_txq_min_occupancy(txq)) && txq->ift_db_pending)
 		iflib_txd_db_check(ctx, txq, TRUE);
 	else if ((txq->ift_db_pending || TXQ_AVAIL(txq) <= MAX_TX_DESC(ctx) + 2) &&
 		 (callout_pending(&txq->ift_db_check) == 0)) {
 		txq->ift_db_pending_queued = txq->ift_db_pending;
 		callout_reset_on(&txq->ift_db_check, 1, iflib_txd_deferred_db_check,
 				 txq, txq->ift_db_check.c_cpu);
 	}
 	if_inc_counter(ifp, IFCOUNTER_OBYTES, bytes_sent);
 	if_inc_counter(ifp, IFCOUNTER_OPACKETS, pkt_sent);
 	if (mcast_sent)
 		if_inc_counter(ifp, IFCOUNTER_OMCASTS, mcast_sent);
 #ifdef INVARIANTS
 	if (iflib_verbose_debug)
 		printf("consumed=%d\n", consumed);
 #endif
 	return (consumed);
 }
 
 static uint32_t
 iflib_txq_drain_always(struct ifmp_ring *r)
 {
 	return (1);
 }
 
 static uint32_t
 iflib_txq_drain_free(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx)
 {
 	int i, avail;
 	struct mbuf **mp;
 	iflib_txq_t txq;
 
 	txq = r->cookie;
 
 	txq->ift_qstatus = IFLIB_QUEUE_IDLE;
 	CALLOUT_LOCK(txq);
 	callout_stop(&txq->ift_timer);
 	callout_stop(&txq->ift_db_check);
 	CALLOUT_UNLOCK(txq);
 
 	avail = IDXDIFF(pidx, cidx, r->size);
 	for (i = 0; i < avail; i++) {
 		mp = _ring_peek_one(r, cidx, i);
 		m_freem(*mp);
 	}
 	MPASS(ifmp_ring_is_stalled(r) == 0);
 	return (avail);
 }
 
 static void
 iflib_ifmp_purge(iflib_txq_t txq)
 {
 	struct ifmp_ring *r;
 
 	r = txq->ift_br[0];
 	r->drain = iflib_txq_drain_free;
 	r->can_drain = iflib_txq_drain_always;
 
 	ifmp_ring_check_drainage(r, r->size);
 
 	r->drain = iflib_txq_drain;
 	r->can_drain = iflib_txq_can_drain;
 }
 
 static void
 _task_fn_tx(void *context)
 {
 	iflib_txq_t txq = context;
 	if_ctx_t ctx = txq->ift_ctx;
 
 #ifdef IFLIB_DIAGNOSTICS
 	txq->ift_cpu_exec_count[curcpu]++;
 #endif
 	if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))
 		return;
 	ifmp_ring_check_drainage(txq->ift_br[0], TX_BATCH_SIZE);
 }
 
 static void
 _task_fn_rx(void *context)
 {
 	iflib_rxq_t rxq = context;
 	if_ctx_t ctx = rxq->ifr_ctx;
 	bool more;
 	int rc;
 
 #ifdef IFLIB_DIAGNOSTICS
 	rxq->ifr_cpu_exec_count[curcpu]++;
 #endif
 	DBG_COUNTER_INC(task_fn_rxs);
 	if (__predict_false(!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)))
 		return;
 
 	if ((more = iflib_rxeof(rxq, 16 /* XXX */)) == false) {
 		if (ctx->ifc_flags & IFC_LEGACY)
 			IFDI_INTR_ENABLE(ctx);
 		else {
 			DBG_COUNTER_INC(rx_intr_enables);
 			rc = IFDI_QUEUE_INTR_ENABLE(ctx, rxq->ifr_id);
 			KASSERT(rc != ENOTSUP, ("MSI-X support requires queue_intr_enable, but not implemented in driver"));
 		}
 	}
 	if (__predict_false(!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)))
 		return;
 	if (more)
 		GROUPTASK_ENQUEUE(&rxq->ifr_task);
 }
 
 static void
 _task_fn_admin(void *context)
 {
 	if_ctx_t ctx = context;
 	if_softc_ctx_t sctx = &ctx->ifc_softc_ctx;
 	iflib_txq_t txq;
 	int i;
 
 	if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))
 		return;
 
 	CTX_LOCK(ctx);
 	for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++) {
 		CALLOUT_LOCK(txq);
 		callout_stop(&txq->ift_timer);
 		CALLOUT_UNLOCK(txq);
 	}
 	IFDI_UPDATE_ADMIN_STATUS(ctx);
 	for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++)
 		callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq, txq->ift_timer.c_cpu);
 	IFDI_LINK_INTR_ENABLE(ctx);
 	CTX_UNLOCK(ctx);
 
 	if (LINK_ACTIVE(ctx) == 0)
 		return;
 	for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++)
 		iflib_txq_check_drain(txq, IFLIB_RESTART_BUDGET);
 }
 
 
 static void
 _task_fn_iov(void *context)
 {
 	if_ctx_t ctx = context;
 
 	if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))
 		return;
 
 	CTX_LOCK(ctx);
 	IFDI_VFLR_HANDLE(ctx);
 	CTX_UNLOCK(ctx);
 }
 
 static int
 iflib_sysctl_int_delay(SYSCTL_HANDLER_ARGS)
 {
 	int err;
 	if_int_delay_info_t info;
 	if_ctx_t ctx;
 
 	info = (if_int_delay_info_t)arg1;
 	ctx = info->iidi_ctx;
 	info->iidi_req = req;
 	info->iidi_oidp = oidp;
 	CTX_LOCK(ctx);
 	err = IFDI_SYSCTL_INT_DELAY(ctx, info);
 	CTX_UNLOCK(ctx);
 	return (err);
 }
 
 /*********************************************************************
  *
  *  IFNET FUNCTIONS
  *
  **********************************************************************/
 
 static void
 iflib_if_init_locked(if_ctx_t ctx)
 {
 	iflib_stop(ctx);
 	iflib_init_locked(ctx);
 }
 
 
 static void
 iflib_if_init(void *arg)
 {
 	if_ctx_t ctx = arg;
 
 	CTX_LOCK(ctx);
 	iflib_if_init_locked(ctx);
 	CTX_UNLOCK(ctx);
 }
 
 static int
 iflib_if_transmit(if_t ifp, struct mbuf *m)
 {
 	if_ctx_t	ctx = if_getsoftc(ifp);
 
 	iflib_txq_t txq;
 	int err, qidx;
 
 	if (__predict_false((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0 || !LINK_ACTIVE(ctx))) {
 		DBG_COUNTER_INC(tx_frees);
 		m_freem(m);
 		return (ENOBUFS);
 	}
 
 	MPASS(m->m_nextpkt == NULL);
 	qidx = 0;
 	if ((NTXQSETS(ctx) > 1) && M_HASHTYPE_GET(m))
 		qidx = QIDX(ctx, m);
 	/*
 	 * XXX calculate buf_ring based on flowid (divvy up bits?)
 	 */
 	txq = &ctx->ifc_txqs[qidx];
 
 #ifdef DRIVER_BACKPRESSURE
 	if (txq->ift_closed) {
 		while (m != NULL) {
 			next = m->m_nextpkt;
 			m->m_nextpkt = NULL;
 			m_freem(m);
 			m = next;
 		}
 		return (ENOBUFS);
 	}
 #endif
 #ifdef notyet
 	qidx = count = 0;
 	mp = marr;
 	next = m;
 	do {
 		count++;
 		next = next->m_nextpkt;
 	} while (next != NULL);
 
 	if (count > nitems(marr))
 		if ((mp = malloc(count*sizeof(struct mbuf *), M_IFLIB, M_NOWAIT)) == NULL) {
 			/* XXX check nextpkt */
 			m_freem(m);
 			/* XXX simplify for now */
 			DBG_COUNTER_INC(tx_frees);
 			return (ENOBUFS);
 		}
 	for (next = m, i = 0; next != NULL; i++) {
 		mp[i] = next;
 		next = next->m_nextpkt;
 		mp[i]->m_nextpkt = NULL;
 	}
 #endif
 	DBG_COUNTER_INC(tx_seen);
 	err = ifmp_ring_enqueue(txq->ift_br[0], (void **)&m, 1, TX_BATCH_SIZE);
 
 	if (err) {
 		GROUPTASK_ENQUEUE(&txq->ift_task);
 		/* support forthcoming later */
 #ifdef DRIVER_BACKPRESSURE
 		txq->ift_closed = TRUE;
 #endif
 		ifmp_ring_check_drainage(txq->ift_br[0], TX_BATCH_SIZE);
 		m_freem(m);
 	} else if (TXQ_AVAIL(txq) < (txq->ift_size >> 1)) {
 		GROUPTASK_ENQUEUE(&txq->ift_task);
 	}
 
 	return (err);
 }
 
 static void
 iflib_if_qflush(if_t ifp)
 {
 	if_ctx_t ctx = if_getsoftc(ifp);
 	iflib_txq_t txq = ctx->ifc_txqs;
 	int i;
 
 	CTX_LOCK(ctx);
 	ctx->ifc_flags |= IFC_QFLUSH;
 	CTX_UNLOCK(ctx);
 	for (i = 0; i < NTXQSETS(ctx); i++, txq++)
 		while (!(ifmp_ring_is_idle(txq->ift_br[0]) || ifmp_ring_is_stalled(txq->ift_br[0])))
 			iflib_txq_check_drain(txq, 0);
 	CTX_LOCK(ctx);
 	ctx->ifc_flags &= ~IFC_QFLUSH;
 	CTX_UNLOCK(ctx);
 
 	if_qflush(ifp);
 }
 
 
 #define IFCAP_FLAGS (IFCAP_TXCSUM_IPV6 | IFCAP_RXCSUM_IPV6 | IFCAP_HWCSUM | IFCAP_LRO | \
 		     IFCAP_TSO4 | IFCAP_TSO6 | IFCAP_VLAN_HWTAGGING |	\
 		     IFCAP_VLAN_MTU | IFCAP_VLAN_HWFILTER | IFCAP_VLAN_HWTSO)
 
 static int
 iflib_if_ioctl(if_t ifp, u_long command, caddr_t data)
 {
 	if_ctx_t ctx = if_getsoftc(ifp);
 	struct ifreq	*ifr = (struct ifreq *)data;
 #if defined(INET) || defined(INET6)
 	struct ifaddr	*ifa = (struct ifaddr *)data;
 #endif
 	bool		avoid_reset = FALSE;
 	int		err = 0, reinit = 0, bits;
 
 	switch (command) {
 	case SIOCSIFADDR:
 #ifdef INET
 		if (ifa->ifa_addr->sa_family == AF_INET)
 			avoid_reset = TRUE;
 #endif
 #ifdef INET6
 		if (ifa->ifa_addr->sa_family == AF_INET6)
 			avoid_reset = TRUE;
 #endif
 		/*
 		** Calling init results in link renegotiation,
 		** so we avoid doing it when possible.
 		*/
 		if (avoid_reset) {
 			if_setflagbits(ifp, IFF_UP,0);
 			if (!(if_getdrvflags(ifp)& IFF_DRV_RUNNING))
 				reinit = 1;
 #ifdef INET
 			if (!(if_getflags(ifp) & IFF_NOARP))
 				arp_ifinit(ifp, ifa);
 #endif
 		} else
 			err = ether_ioctl(ifp, command, data);
 		break;
 	case SIOCSIFMTU:
 		CTX_LOCK(ctx);
 		if (ifr->ifr_mtu == if_getmtu(ifp)) {
 			CTX_UNLOCK(ctx);
 			break;
 		}
 		bits = if_getdrvflags(ifp);
 		/* stop the driver and free any clusters before proceeding */
 		iflib_stop(ctx);
 
 		if ((err = IFDI_MTU_SET(ctx, ifr->ifr_mtu)) == 0) {
 			if (ifr->ifr_mtu > ctx->ifc_max_fl_buf_size)
 				ctx->ifc_flags |= IFC_MULTISEG;
 			else
 				ctx->ifc_flags &= ~IFC_MULTISEG;
 			err = if_setmtu(ifp, ifr->ifr_mtu);
 		}
 		iflib_init_locked(ctx);
 		if_setdrvflags(ifp, bits);
 		CTX_UNLOCK(ctx);
 		break;
 	case SIOCSIFFLAGS:
 		CTX_LOCK(ctx);
 		if (if_getflags(ifp) & IFF_UP) {
 			if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) {
 				if ((if_getflags(ifp) ^ ctx->ifc_if_flags) &
 				    (IFF_PROMISC | IFF_ALLMULTI)) {
 					err = IFDI_PROMISC_SET(ctx, if_getflags(ifp));
 				}
 			} else
 				reinit = 1;
 		} else if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) {
 			iflib_stop(ctx);
 		}
 		ctx->ifc_if_flags = if_getflags(ifp);
 		CTX_UNLOCK(ctx);
 		break;
 	case SIOCADDMULTI:
 	case SIOCDELMULTI:
 		if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) {
 			CTX_LOCK(ctx);
 			IFDI_INTR_DISABLE(ctx);
 			IFDI_MULTI_SET(ctx);
 			IFDI_INTR_ENABLE(ctx);
 			CTX_UNLOCK(ctx);
 		}
 		break;
 	case SIOCSIFMEDIA:
 		CTX_LOCK(ctx);
 		IFDI_MEDIA_SET(ctx);
 		CTX_UNLOCK(ctx);
 		/* falls thru */
 	case SIOCGIFMEDIA:
 		err = ifmedia_ioctl(ifp, ifr, &ctx->ifc_media, command);
 		break;
 	case SIOCGI2C:
 	{
 		struct ifi2creq i2c;
 
 		err = copyin(ifr->ifr_data, &i2c, sizeof(i2c));
 		if (err != 0)
 			break;
 		if (i2c.dev_addr != 0xA0 && i2c.dev_addr != 0xA2) {
 			err = EINVAL;
 			break;
 		}
 		if (i2c.len > sizeof(i2c.data)) {
 			err = EINVAL;
 			break;
 		}
 
 		if ((err = IFDI_I2C_REQ(ctx, &i2c)) == 0)
 			err = copyout(&i2c, ifr->ifr_data, sizeof(i2c));
 		break;
 	}
 	case SIOCSIFCAP:
 	{
 		int mask, setmask;
 
 		mask = ifr->ifr_reqcap ^ if_getcapenable(ifp);
 		setmask = 0;
 #ifdef TCP_OFFLOAD
 		setmask |= mask & (IFCAP_TOE4|IFCAP_TOE6);
 #endif
 		setmask |= (mask & IFCAP_FLAGS);
 
 		if (setmask  & (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6))
 			setmask |= (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6);
 		if ((mask & IFCAP_WOL) &&
 		    (if_getcapabilities(ifp) & IFCAP_WOL) != 0)
 			setmask |= (mask & (IFCAP_WOL_MCAST|IFCAP_WOL_MAGIC));
 		if_vlancap(ifp);
 		/*
 		 * want to ensure that traffic has stopped before we change any of the flags
 		 */
 		if (setmask) {
 			CTX_LOCK(ctx);
 			bits = if_getdrvflags(ifp);
 			if (bits & IFF_DRV_RUNNING)
 				iflib_stop(ctx);
 			if_togglecapenable(ifp, setmask);
 			if (bits & IFF_DRV_RUNNING)
 				iflib_init_locked(ctx);
 			if_setdrvflags(ifp, bits);
 			CTX_UNLOCK(ctx);
 		}
 		break;
 	    }
 	case SIOCGPRIVATE_0:
 	case SIOCSDRVSPEC:
 	case SIOCGDRVSPEC:
 		CTX_LOCK(ctx);
 		err = IFDI_PRIV_IOCTL(ctx, command, data);
 		CTX_UNLOCK(ctx);
 		break;
 	default:
 		err = ether_ioctl(ifp, command, data);
 		break;
 	}
 	if (reinit)
 		iflib_if_init(ctx);
 	return (err);
 }
 
 static uint64_t
 iflib_if_get_counter(if_t ifp, ift_counter cnt)
 {
 	if_ctx_t ctx = if_getsoftc(ifp);
 
 	return (IFDI_GET_COUNTER(ctx, cnt));
 }
 
 /*********************************************************************
  *
  *  OTHER FUNCTIONS EXPORTED TO THE STACK
  *
  **********************************************************************/
 
 static void
 iflib_vlan_register(void *arg, if_t ifp, uint16_t vtag)
 {
 	if_ctx_t ctx = if_getsoftc(ifp);
 
 	if ((void *)ctx != arg)
 		return;
 
 	if ((vtag == 0) || (vtag > 4095))
 		return;
 
 	CTX_LOCK(ctx);
 	IFDI_VLAN_REGISTER(ctx, vtag);
 	/* Re-init to load the changes */
 	if (if_getcapenable(ifp) & IFCAP_VLAN_HWFILTER)
 		iflib_init_locked(ctx);
 	CTX_UNLOCK(ctx);
 }
 
 static void
 iflib_vlan_unregister(void *arg, if_t ifp, uint16_t vtag)
 {
 	if_ctx_t ctx = if_getsoftc(ifp);
 
 	if ((void *)ctx != arg)
 		return;
 
 	if ((vtag == 0) || (vtag > 4095))
 		return;
 
 	CTX_LOCK(ctx);
 	IFDI_VLAN_UNREGISTER(ctx, vtag);
 	/* Re-init to load the changes */
 	if (if_getcapenable(ifp) & IFCAP_VLAN_HWFILTER)
 		iflib_init_locked(ctx);
 	CTX_UNLOCK(ctx);
 }
 
 static void
 iflib_led_func(void *arg, int onoff)
 {
 	if_ctx_t ctx = arg;
 
 	CTX_LOCK(ctx);
 	IFDI_LED_FUNC(ctx, onoff);
 	CTX_UNLOCK(ctx);
 }
 
 /*********************************************************************
  *
  *  BUS FUNCTION DEFINITIONS
  *
  **********************************************************************/
 
 int
 iflib_device_probe(device_t dev)
 {
 	pci_vendor_info_t *ent;
 
 	uint16_t	pci_vendor_id, pci_device_id;
 	uint16_t	pci_subvendor_id, pci_subdevice_id;
 	uint16_t	pci_rev_id;
 	if_shared_ctx_t sctx;
 
 	if ((sctx = DEVICE_REGISTER(dev)) == NULL || sctx->isc_magic != IFLIB_MAGIC)
 		return (ENOTSUP);
 
 	pci_vendor_id = pci_get_vendor(dev);
 	pci_device_id = pci_get_device(dev);
 	pci_subvendor_id = pci_get_subvendor(dev);
 	pci_subdevice_id = pci_get_subdevice(dev);
 	pci_rev_id = pci_get_revid(dev);
 	if (sctx->isc_parse_devinfo != NULL)
 		sctx->isc_parse_devinfo(&pci_device_id, &pci_subvendor_id, &pci_subdevice_id, &pci_rev_id);
 
 	ent = sctx->isc_vendor_info;
 	while (ent->pvi_vendor_id != 0) {
 		if (pci_vendor_id != ent->pvi_vendor_id) {
 			ent++;
 			continue;
 		}
 		if ((pci_device_id == ent->pvi_device_id) &&
 		    ((pci_subvendor_id == ent->pvi_subvendor_id) ||
 		     (ent->pvi_subvendor_id == 0)) &&
 		    ((pci_subdevice_id == ent->pvi_subdevice_id) ||
 		     (ent->pvi_subdevice_id == 0)) &&
 		    ((pci_rev_id == ent->pvi_rev_id) ||
 		     (ent->pvi_rev_id == 0))) {
 
 			device_set_desc_copy(dev, ent->pvi_name);
 			/* this needs to be changed to zero if the bus probing code
 			 * ever stops re-probing on best match because the sctx
 			 * may have its values over written by register calls
 			 * in subsequent probes
 			 */
 			return (BUS_PROBE_DEFAULT);
 		}
 		ent++;
 	}
 	return (ENXIO);
 }
 
 int
 iflib_device_register(device_t dev, void *sc, if_shared_ctx_t sctx, if_ctx_t *ctxp)
 {
 	int err, rid, msix, msix_bar;
 	if_ctx_t ctx;
 	if_t ifp;
 	if_softc_ctx_t scctx;
 	int i;
 	uint16_t main_txq;
 	uint16_t main_rxq;
 
 
 	ctx = malloc(sizeof(* ctx), M_IFLIB, M_WAITOK|M_ZERO);
 
 	if (sc == NULL) {
 		sc = malloc(sctx->isc_driver->size, M_IFLIB, M_WAITOK|M_ZERO);
 		device_set_softc(dev, ctx);
 		ctx->ifc_flags |= IFC_SC_ALLOCATED;
 	}
 
 	ctx->ifc_sctx = sctx;
 	ctx->ifc_dev = dev;
 	ctx->ifc_softc = sc;
 
 	if ((err = iflib_register(ctx)) != 0) {
 		device_printf(dev, "iflib_register failed %d\n", err);
 		return (err);
 	}
 	iflib_add_device_sysctl_pre(ctx);
 
 	scctx = &ctx->ifc_softc_ctx;
 	ifp = ctx->ifc_ifp;
 
 	/*
 	 * XXX sanity check that ntxd & nrxd are a power of 2
 	 */
 	if (ctx->ifc_sysctl_ntxqs != 0)
 		scctx->isc_ntxqsets = ctx->ifc_sysctl_ntxqs;
 	if (ctx->ifc_sysctl_nrxqs != 0)
 		scctx->isc_nrxqsets = ctx->ifc_sysctl_nrxqs;
 
 	for (i = 0; i < sctx->isc_ntxqs; i++) {
 		if (ctx->ifc_sysctl_ntxds[i] != 0)
 			scctx->isc_ntxd[i] = ctx->ifc_sysctl_ntxds[i];
 		else
 			scctx->isc_ntxd[i] = sctx->isc_ntxd_default[i];
 	}
 
 	for (i = 0; i < sctx->isc_nrxqs; i++) {
 		if (ctx->ifc_sysctl_nrxds[i] != 0)
 			scctx->isc_nrxd[i] = ctx->ifc_sysctl_nrxds[i];
 		else
 			scctx->isc_nrxd[i] = sctx->isc_nrxd_default[i];
 	}
 
 	for (i = 0; i < sctx->isc_nrxqs; i++) {
 		if (scctx->isc_nrxd[i] < sctx->isc_nrxd_min[i]) {
 			device_printf(dev, "nrxd%d: %d less than nrxd_min %d - resetting to min\n",
 				      i, scctx->isc_nrxd[i], sctx->isc_nrxd_min[i]);
 			scctx->isc_nrxd[i] = sctx->isc_nrxd_min[i];
 		}
 		if (scctx->isc_nrxd[i] > sctx->isc_nrxd_max[i]) {
 			device_printf(dev, "nrxd%d: %d greater than nrxd_max %d - resetting to max\n",
 				      i, scctx->isc_nrxd[i], sctx->isc_nrxd_max[i]);
 			scctx->isc_nrxd[i] = sctx->isc_nrxd_max[i];
 		}
 	}
 
 	for (i = 0; i < sctx->isc_ntxqs; i++) {
 		if (scctx->isc_ntxd[i] < sctx->isc_ntxd_min[i]) {
 			device_printf(dev, "ntxd%d: %d less than ntxd_min %d - resetting to min\n",
 				      i, scctx->isc_ntxd[i], sctx->isc_ntxd_min[i]);
 			scctx->isc_ntxd[i] = sctx->isc_ntxd_min[i];
 		}
 		if (scctx->isc_ntxd[i] > sctx->isc_ntxd_max[i]) {
 			device_printf(dev, "ntxd%d: %d greater than ntxd_max %d - resetting to max\n",
 				      i, scctx->isc_ntxd[i], sctx->isc_ntxd_max[i]);
 			scctx->isc_ntxd[i] = sctx->isc_ntxd_max[i];
 		}
 	}
 
 	if ((err = IFDI_ATTACH_PRE(ctx)) != 0) {
 		device_printf(dev, "IFDI_ATTACH_PRE failed %d\n", err);
 		return (err);
 	}
 	_iflib_pre_assert(scctx);
 	ctx->ifc_txrx = *scctx->isc_txrx;
 
 #ifdef INVARIANTS
 	MPASS(scctx->isc_capenable);
 	if (scctx->isc_capenable & IFCAP_TXCSUM)
 		MPASS(scctx->isc_tx_csum_flags);
 #endif
 
 	if_setcapabilities(ifp, scctx->isc_capenable);
 	if_setcapenable(ifp, scctx->isc_capenable);
 
 	if (scctx->isc_ntxqsets == 0 || (scctx->isc_ntxqsets_max && scctx->isc_ntxqsets_max < scctx->isc_ntxqsets))
 		scctx->isc_ntxqsets = scctx->isc_ntxqsets_max;
 	if (scctx->isc_nrxqsets == 0 || (scctx->isc_nrxqsets_max && scctx->isc_nrxqsets_max < scctx->isc_nrxqsets))
 		scctx->isc_nrxqsets = scctx->isc_nrxqsets_max;
 
 #ifdef ACPI_DMAR
 	if (dmar_get_dma_tag(device_get_parent(dev), dev) != NULL)
 		ctx->ifc_flags |= IFC_DMAR;
 #endif
 
 	msix_bar = scctx->isc_msix_bar;
 
 	if(sctx->isc_flags & IFLIB_HAS_TXCQ)
 		main_txq = 1;
 	else
 		main_txq = 0;
 
 	if(sctx->isc_flags & IFLIB_HAS_RXCQ)
 		main_rxq = 1;
 	else
 		main_rxq = 0;
 
 	/* XXX change for per-queue sizes */
 	device_printf(dev, "using %d tx descriptors and %d rx descriptors\n",
 		      scctx->isc_ntxd[main_txq], scctx->isc_nrxd[main_rxq]);
 	for (i = 0; i < sctx->isc_nrxqs; i++) {
 		if (!powerof2(scctx->isc_nrxd[i])) {
 			/* round down instead? */
 			device_printf(dev, "# rx descriptors must be a power of 2\n");
 			err = EINVAL;
 			goto fail;
 		}
 	}
 	for (i = 0; i < sctx->isc_ntxqs; i++) {
 		if (!powerof2(scctx->isc_ntxd[i])) {
 			device_printf(dev,
 			    "# tx descriptors must be a power of 2");
 			err = EINVAL;
 			goto fail;
 		}
 	}
 
 	if (scctx->isc_tx_nsegments > scctx->isc_ntxd[main_txq] /
 	    MAX_SINGLE_PACKET_FRACTION)
 		scctx->isc_tx_nsegments = max(1, scctx->isc_ntxd[main_txq] /
 		    MAX_SINGLE_PACKET_FRACTION);
 	if (scctx->isc_tx_tso_segments_max > scctx->isc_ntxd[main_txq] /
 	    MAX_SINGLE_PACKET_FRACTION)
 		scctx->isc_tx_tso_segments_max = max(1,
 		    scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION);
 
 	/*
 	 * Protect the stack against modern hardware
 	 */
 	if (scctx->isc_tx_tso_size_max > FREEBSD_TSO_SIZE_MAX)
 		scctx->isc_tx_tso_size_max = FREEBSD_TSO_SIZE_MAX;
 
 	/* TSO parameters - dig these out of the data sheet - simply correspond to tag setup */
 	ifp->if_hw_tsomaxsegcount = scctx->isc_tx_tso_segments_max;
 	ifp->if_hw_tsomax = scctx->isc_tx_tso_size_max;
 	ifp->if_hw_tsomaxsegsize = scctx->isc_tx_tso_segsize_max;
 	if (scctx->isc_rss_table_size == 0)
 		scctx->isc_rss_table_size = 64;
 	scctx->isc_rss_table_mask = scctx->isc_rss_table_size-1;
 
 	GROUPTASK_INIT(&ctx->ifc_admin_task, 0, _task_fn_admin, ctx);
 	/* XXX format name */
 	taskqgroup_attach(qgroup_if_config_tqg, &ctx->ifc_admin_task, ctx, -1, "admin");
 	/*
 	** Now setup MSI or MSI/X, should
 	** return us the number of supported
 	** vectors. (Will be 1 for MSI)
 	*/
 	if (sctx->isc_flags & IFLIB_SKIP_MSIX) {
 		msix = scctx->isc_vectors;
 	} else if (scctx->isc_msix_bar != 0)
 	       /*
 		* The simple fact that isc_msix_bar is not 0 does not mean we
 		* we have a good value there that is known to work.
 		*/
 		msix = iflib_msix_init(ctx);
 	else {
 		scctx->isc_vectors = 1;
 		scctx->isc_ntxqsets = 1;
 		scctx->isc_nrxqsets = 1;
 		scctx->isc_intr = IFLIB_INTR_LEGACY;
 		msix = 0;
 	}
 	/* Get memory for the station queues */
 	if ((err = iflib_queues_alloc(ctx))) {
 		device_printf(dev, "Unable to allocate queue memory\n");
 		goto fail;
 	}
 
 	if ((err = iflib_qset_structures_setup(ctx))) {
 		device_printf(dev, "qset structure setup failed %d\n", err);
 		goto fail_queues;
 	}
 
 	/*
 	 * Group taskqueues aren't properly set up until SMP is started,
 	 * so we disable interrupts until we can handle them post
 	 * SI_SUB_SMP.
 	 *
 	 * XXX: disabling interrupts doesn't actually work, at least for
 	 * the non-MSI case.  When they occur before SI_SUB_SMP completes,
 	 * we do null handling and depend on this not causing too large an
 	 * interrupt storm.
 	 */
 	IFDI_INTR_DISABLE(ctx);
 	if (msix > 1 && (err = IFDI_MSIX_INTR_ASSIGN(ctx, msix)) != 0) {
 		device_printf(dev, "IFDI_MSIX_INTR_ASSIGN failed %d\n", err);
 		goto fail_intr_free;
 	}
 	if (msix <= 1) {
 		rid = 0;
 		if (scctx->isc_intr == IFLIB_INTR_MSI) {
 			MPASS(msix == 1);
 			rid = 1;
 		}
 		if ((err = iflib_legacy_setup(ctx, ctx->isc_legacy_intr, ctx->ifc_softc, &rid, "irq0")) != 0) {
 			device_printf(dev, "iflib_legacy_setup failed %d\n", err);
 			goto fail_intr_free;
 		}
 	}
 	ether_ifattach(ctx->ifc_ifp, ctx->ifc_mac);
 	if ((err = IFDI_ATTACH_POST(ctx)) != 0) {
 		device_printf(dev, "IFDI_ATTACH_POST failed %d\n", err);
 		goto fail_detach;
 	}
 	if ((err = iflib_netmap_attach(ctx))) {
 		device_printf(ctx->ifc_dev, "netmap attach failed: %d\n", err);
 		goto fail_detach;
 	}
 	*ctxp = ctx;
 
 	if_setgetcounterfn(ctx->ifc_ifp, iflib_if_get_counter);
 	iflib_add_device_sysctl_post(ctx);
 	ctx->ifc_flags |= IFC_INIT_DONE;
 	return (0);
 fail_detach:
 	ether_ifdetach(ctx->ifc_ifp);
 fail_intr_free:
 	if (scctx->isc_intr == IFLIB_INTR_MSIX || scctx->isc_intr == IFLIB_INTR_MSI)
 		pci_release_msi(ctx->ifc_dev);
 fail_queues:
 	/* XXX free queues */
 fail:
 	IFDI_DETACH(ctx);
 	return (err);
 }
 
 int
 iflib_device_attach(device_t dev)
 {
 	if_ctx_t ctx;
 	if_shared_ctx_t sctx;
 
 	if ((sctx = DEVICE_REGISTER(dev)) == NULL || sctx->isc_magic != IFLIB_MAGIC)
 		return (ENOTSUP);
 
 	pci_enable_busmaster(dev);
 
 	return (iflib_device_register(dev, NULL, sctx, &ctx));
 }
 
 int
 iflib_device_deregister(if_ctx_t ctx)
 {
 	if_t ifp = ctx->ifc_ifp;
 	iflib_txq_t txq;
 	iflib_rxq_t rxq;
 	device_t dev = ctx->ifc_dev;
 	int i;
 	struct taskqgroup *tqg;
 
 	/* Make sure VLANS are not using driver */
 	if (if_vlantrunkinuse(ifp)) {
 		device_printf(dev,"Vlan in use, detach first\n");
 		return (EBUSY);
 	}
 
 	CTX_LOCK(ctx);
 	ctx->ifc_in_detach = 1;
 	iflib_stop(ctx);
 	CTX_UNLOCK(ctx);
 
 	/* Unregister VLAN events */
 	if (ctx->ifc_vlan_attach_event != NULL)
 		EVENTHANDLER_DEREGISTER(vlan_config, ctx->ifc_vlan_attach_event);
 	if (ctx->ifc_vlan_detach_event != NULL)
 		EVENTHANDLER_DEREGISTER(vlan_unconfig, ctx->ifc_vlan_detach_event);
 
 	iflib_netmap_detach(ifp);
 	ether_ifdetach(ifp);
 	/* ether_ifdetach calls if_qflush - lock must be destroy afterwards*/
 	CTX_LOCK_DESTROY(ctx);
 	if (ctx->ifc_led_dev != NULL)
 		led_destroy(ctx->ifc_led_dev);
 	/* XXX drain any dependent tasks */
-	tqg = qgroup_if_io_tqg;
+	tqg = qgroup_softirq;
 	for (txq = ctx->ifc_txqs, i = 0; i < NTXQSETS(ctx); i++, txq++) {
 		callout_drain(&txq->ift_timer);
 		callout_drain(&txq->ift_db_check);
 		if (txq->ift_task.gt_uniq != NULL)
 			taskqgroup_detach(tqg, &txq->ift_task);
 	}
 	for (i = 0, rxq = ctx->ifc_rxqs; i < NRXQSETS(ctx); i++, rxq++) {
 		if (rxq->ifr_task.gt_uniq != NULL)
 			taskqgroup_detach(tqg, &rxq->ifr_task);
 	}
 	tqg = qgroup_if_config_tqg;
 	if (ctx->ifc_admin_task.gt_uniq != NULL)
 		taskqgroup_detach(tqg, &ctx->ifc_admin_task);
 	if (ctx->ifc_vflr_task.gt_uniq != NULL)
 		taskqgroup_detach(tqg, &ctx->ifc_vflr_task);
 
 	IFDI_DETACH(ctx);
 	device_set_softc(ctx->ifc_dev, NULL);
 	if (ctx->ifc_softc_ctx.isc_intr != IFLIB_INTR_LEGACY) {
 		pci_release_msi(dev);
 	}
 	if (ctx->ifc_softc_ctx.isc_intr != IFLIB_INTR_MSIX) {
 		iflib_irq_free(ctx, &ctx->ifc_legacy_irq);
 	}
 	if (ctx->ifc_msix_mem != NULL) {
 		bus_release_resource(ctx->ifc_dev, SYS_RES_MEMORY,
 			ctx->ifc_softc_ctx.isc_msix_bar, ctx->ifc_msix_mem);
 		ctx->ifc_msix_mem = NULL;
 	}
 
 	bus_generic_detach(dev);
 	if_free(ifp);
 
 	iflib_tx_structures_free(ctx);
 	iflib_rx_structures_free(ctx);
 	if (ctx->ifc_flags & IFC_SC_ALLOCATED)
 		free(ctx->ifc_softc, M_IFLIB);
 	free(ctx, M_IFLIB);
 	return (0);
 }
 
 
 int
 iflib_device_detach(device_t dev)
 {
 	if_ctx_t ctx = device_get_softc(dev);
 
 	return (iflib_device_deregister(ctx));
 }
 
 int
 iflib_device_suspend(device_t dev)
 {
 	if_ctx_t ctx = device_get_softc(dev);
 
 	CTX_LOCK(ctx);
 	IFDI_SUSPEND(ctx);
 	CTX_UNLOCK(ctx);
 
 	return bus_generic_suspend(dev);
 }
 int
 iflib_device_shutdown(device_t dev)
 {
 	if_ctx_t ctx = device_get_softc(dev);
 
 	CTX_LOCK(ctx);
 	IFDI_SHUTDOWN(ctx);
 	CTX_UNLOCK(ctx);
 
 	return bus_generic_suspend(dev);
 }
 
 
 int
 iflib_device_resume(device_t dev)
 {
 	if_ctx_t ctx = device_get_softc(dev);
 	iflib_txq_t txq = ctx->ifc_txqs;
 
 	CTX_LOCK(ctx);
 	IFDI_RESUME(ctx);
 	iflib_init_locked(ctx);
 	CTX_UNLOCK(ctx);
 	for (int i = 0; i < NTXQSETS(ctx); i++, txq++)
 		iflib_txq_check_drain(txq, IFLIB_RESTART_BUDGET);
 
 	return (bus_generic_resume(dev));
 }
 
 int
 iflib_device_iov_init(device_t dev, uint16_t num_vfs, const nvlist_t *params)
 {
 	int error;
 	if_ctx_t ctx = device_get_softc(dev);
 
 	CTX_LOCK(ctx);
 	error = IFDI_IOV_INIT(ctx, num_vfs, params);
 	CTX_UNLOCK(ctx);
 
 	return (error);
 }
 
 void
 iflib_device_iov_uninit(device_t dev)
 {
 	if_ctx_t ctx = device_get_softc(dev);
 
 	CTX_LOCK(ctx);
 	IFDI_IOV_UNINIT(ctx);
 	CTX_UNLOCK(ctx);
 }
 
 int
 iflib_device_iov_add_vf(device_t dev, uint16_t vfnum, const nvlist_t *params)
 {
 	int error;
 	if_ctx_t ctx = device_get_softc(dev);
 
 	CTX_LOCK(ctx);
 	error = IFDI_IOV_VF_ADD(ctx, vfnum, params);
 	CTX_UNLOCK(ctx);
 
 	return (error);
 }
 
 /*********************************************************************
  *
  *  MODULE FUNCTION DEFINITIONS
  *
  **********************************************************************/
 
 /*
  * - Start a fast taskqueue thread for each core
  * - Start a taskqueue for control operations
  */
 static int
 iflib_module_init(void)
 {
 	return (0);
 }
 
 static int
 iflib_module_event_handler(module_t mod, int what, void *arg)
 {
 	int err;
 
 	switch (what) {
 	case MOD_LOAD:
 		if ((err = iflib_module_init()) != 0)
 			return (err);
 		break;
 	case MOD_UNLOAD:
 		return (EBUSY);
 	default:
 		return (EOPNOTSUPP);
 	}
 
 	return (0);
 }
 
 /*********************************************************************
  *
  *  PUBLIC FUNCTION DEFINITIONS
  *     ordered as in iflib.h
  *
  **********************************************************************/
 
 
 static void
 _iflib_assert(if_shared_ctx_t sctx)
 {
 	MPASS(sctx->isc_tx_maxsize);
 	MPASS(sctx->isc_tx_maxsegsize);
 
 	MPASS(sctx->isc_rx_maxsize);
 	MPASS(sctx->isc_rx_nsegments);
 	MPASS(sctx->isc_rx_maxsegsize);
 
 	MPASS(sctx->isc_nrxd_min[0]);
 	MPASS(sctx->isc_nrxd_max[0]);
 	MPASS(sctx->isc_nrxd_default[0]);
 	MPASS(sctx->isc_ntxd_min[0]);
 	MPASS(sctx->isc_ntxd_max[0]);
 	MPASS(sctx->isc_ntxd_default[0]);
 }
 
 static void
 _iflib_pre_assert(if_softc_ctx_t scctx)
 {
 
 	MPASS(scctx->isc_txrx->ift_txd_encap);
 	MPASS(scctx->isc_txrx->ift_txd_flush);
 	MPASS(scctx->isc_txrx->ift_txd_credits_update);
 	MPASS(scctx->isc_txrx->ift_rxd_available);
 	MPASS(scctx->isc_txrx->ift_rxd_pkt_get);
 	MPASS(scctx->isc_txrx->ift_rxd_refill);
 	MPASS(scctx->isc_txrx->ift_rxd_flush);
 }
 
 static int
 iflib_register(if_ctx_t ctx)
 {
 	if_shared_ctx_t sctx = ctx->ifc_sctx;
 	driver_t *driver = sctx->isc_driver;
 	device_t dev = ctx->ifc_dev;
 	if_t ifp;
 
 	_iflib_assert(sctx);
 
 	CTX_LOCK_INIT(ctx, device_get_nameunit(ctx->ifc_dev));
 
 	ifp = ctx->ifc_ifp = if_gethandle(IFT_ETHER);
 	if (ifp == NULL) {
 		device_printf(dev, "can not allocate ifnet structure\n");
 		return (ENOMEM);
 	}
 
 	/*
 	 * Initialize our context's device specific methods
 	 */
 	kobj_init((kobj_t) ctx, (kobj_class_t) driver);
 	kobj_class_compile((kobj_class_t) driver);
 	driver->refs++;
 
 	if_initname(ifp, device_get_name(dev), device_get_unit(dev));
 	if_setsoftc(ifp, ctx);
 	if_setdev(ifp, dev);
 	if_setinitfn(ifp, iflib_if_init);
 	if_setioctlfn(ifp, iflib_if_ioctl);
 	if_settransmitfn(ifp, iflib_if_transmit);
 	if_setqflushfn(ifp, iflib_if_qflush);
 	if_setflags(ifp, IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST);
 
 	ctx->ifc_vlan_attach_event =
 		EVENTHANDLER_REGISTER(vlan_config, iflib_vlan_register, ctx,
 							  EVENTHANDLER_PRI_FIRST);
 	ctx->ifc_vlan_detach_event =
 		EVENTHANDLER_REGISTER(vlan_unconfig, iflib_vlan_unregister, ctx,
 							  EVENTHANDLER_PRI_FIRST);
 
 	ifmedia_init(&ctx->ifc_media, IFM_IMASK,
 					 iflib_media_change, iflib_media_status);
 
 	return (0);
 }
 
 
 static int
 iflib_queues_alloc(if_ctx_t ctx)
 {
 	if_shared_ctx_t sctx = ctx->ifc_sctx;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
 	device_t dev = ctx->ifc_dev;
 	int nrxqsets = scctx->isc_nrxqsets;
 	int ntxqsets = scctx->isc_ntxqsets;
 	iflib_txq_t txq;
 	iflib_rxq_t rxq;
 	iflib_fl_t fl = NULL;
 	int i, j, cpu, err, txconf, rxconf;
 	iflib_dma_info_t ifdip;
 	uint32_t *rxqsizes = scctx->isc_rxqsizes;
 	uint32_t *txqsizes = scctx->isc_txqsizes;
 	uint8_t nrxqs = sctx->isc_nrxqs;
 	uint8_t ntxqs = sctx->isc_ntxqs;
 	int nfree_lists = sctx->isc_nfl ? sctx->isc_nfl : 1;
 	caddr_t *vaddrs;
 	uint64_t *paddrs;
 	struct ifmp_ring **brscp;
 	int nbuf_rings = 1; /* XXX determine dynamically */
 
 	KASSERT(ntxqs > 0, ("number of queues per qset must be at least 1"));
 	KASSERT(nrxqs > 0, ("number of queues per qset must be at least 1"));
 
 	brscp = NULL;
 	txq = NULL;
 	rxq = NULL;
 
 /* Allocate the TX ring struct memory */
 	if (!(txq =
 	    (iflib_txq_t) malloc(sizeof(struct iflib_txq) *
 	    ntxqsets, M_IFLIB, M_NOWAIT | M_ZERO))) {
 		device_printf(dev, "Unable to allocate TX ring memory\n");
 		err = ENOMEM;
 		goto fail;
 	}
 
 	/* Now allocate the RX */
 	if (!(rxq =
 	    (iflib_rxq_t) malloc(sizeof(struct iflib_rxq) *
 	    nrxqsets, M_IFLIB, M_NOWAIT | M_ZERO))) {
 		device_printf(dev, "Unable to allocate RX ring memory\n");
 		err = ENOMEM;
 		goto rx_fail;
 	}
 	if (!(brscp = malloc(sizeof(void *) * nbuf_rings * nrxqsets, M_IFLIB, M_NOWAIT | M_ZERO))) {
 		device_printf(dev, "Unable to buf_ring_sc * memory\n");
 		err = ENOMEM;
 		goto rx_fail;
 	}
 
 	ctx->ifc_txqs = txq;
 	ctx->ifc_rxqs = rxq;
 
 	/*
 	 * XXX handle allocation failure
 	 */
 	for (txconf = i = 0, cpu = CPU_FIRST(); i < ntxqsets; i++, txconf++, txq++, cpu = CPU_NEXT(cpu)) {
 		/* Set up some basics */
 
 		if ((ifdip = malloc(sizeof(struct iflib_dma_info) * ntxqs, M_IFLIB, M_WAITOK|M_ZERO)) == NULL) {
 			device_printf(dev, "failed to allocate iflib_dma_info\n");
 			err = ENOMEM;
 			goto err_tx_desc;
 		}
 		txq->ift_ifdi = ifdip;
 		for (j = 0; j < ntxqs; j++, ifdip++) {
 			if (iflib_dma_alloc(ctx, txqsizes[j], ifdip, BUS_DMA_NOWAIT)) {
 				device_printf(dev, "Unable to allocate Descriptor memory\n");
 				err = ENOMEM;
 				goto err_tx_desc;
 			}
 			bzero((void *)ifdip->idi_vaddr, txqsizes[j]);
 		}
 		txq->ift_ctx = ctx;
 		txq->ift_id = i;
 		if (sctx->isc_flags & IFLIB_HAS_TXCQ) {
 			txq->ift_br_offset = 1;
 		} else {
 			txq->ift_br_offset = 0;
 		}
 		/* XXX fix this */
 		txq->ift_timer.c_cpu = cpu;
 		txq->ift_db_check.c_cpu = cpu;
 		txq->ift_nbr = nbuf_rings;
 
 		if (iflib_txsd_alloc(txq)) {
 			device_printf(dev, "Critical Failure setting up TX buffers\n");
 			err = ENOMEM;
 			goto err_tx_desc;
 		}
 
 		/* Initialize the TX lock */
 		snprintf(txq->ift_mtx_name, MTX_NAME_LEN, "%s:tx(%d):callout",
 		    device_get_nameunit(dev), txq->ift_id);
 		mtx_init(&txq->ift_mtx, txq->ift_mtx_name, NULL, MTX_DEF);
 		callout_init_mtx(&txq->ift_timer, &txq->ift_mtx, 0);
 		callout_init_mtx(&txq->ift_db_check, &txq->ift_mtx, 0);
 
 		snprintf(txq->ift_db_mtx_name, MTX_NAME_LEN, "%s:tx(%d):db",
 			 device_get_nameunit(dev), txq->ift_id);
 		TXDB_LOCK_INIT(txq);
 
 		txq->ift_br = brscp + i*nbuf_rings;
 		for (j = 0; j < nbuf_rings; j++) {
 			err = ifmp_ring_alloc(&txq->ift_br[j], 2048, txq, iflib_txq_drain,
 					      iflib_txq_can_drain, M_IFLIB, M_WAITOK);
 			if (err) {
 				/* XXX free any allocated rings */
 				device_printf(dev, "Unable to allocate buf_ring\n");
 				goto err_tx_desc;
 			}
 		}
 	}
 
 	for (rxconf = i = 0; i < nrxqsets; i++, rxconf++, rxq++) {
 		/* Set up some basics */
 
 		if ((ifdip = malloc(sizeof(struct iflib_dma_info) * nrxqs, M_IFLIB, M_WAITOK|M_ZERO)) == NULL) {
 			device_printf(dev, "failed to allocate iflib_dma_info\n");
 			err = ENOMEM;
 			goto err_tx_desc;
 		}
 
 		rxq->ifr_ifdi = ifdip;
 		for (j = 0; j < nrxqs; j++, ifdip++) {
 			if (iflib_dma_alloc(ctx, rxqsizes[j], ifdip, BUS_DMA_NOWAIT)) {
 				device_printf(dev, "Unable to allocate Descriptor memory\n");
 				err = ENOMEM;
 				goto err_tx_desc;
 			}
 			bzero((void *)ifdip->idi_vaddr, rxqsizes[j]);
 		}
 		rxq->ifr_ctx = ctx;
 		rxq->ifr_id = i;
 		if (sctx->isc_flags & IFLIB_HAS_RXCQ) {
 			rxq->ifr_fl_offset = 1;
 		} else {
 			rxq->ifr_fl_offset = 0;
 		}
 		rxq->ifr_nfl = nfree_lists;
 		if (!(fl =
 			  (iflib_fl_t) malloc(sizeof(struct iflib_fl) * nfree_lists, M_IFLIB, M_NOWAIT | M_ZERO))) {
 			device_printf(dev, "Unable to allocate free list memory\n");
 			err = ENOMEM;
 			goto err_tx_desc;
 		}
 		rxq->ifr_fl = fl;
 		for (j = 0; j < nfree_lists; j++) {
 			rxq->ifr_fl[j].ifl_rxq = rxq;
 			rxq->ifr_fl[j].ifl_id = j;
 			rxq->ifr_fl[j].ifl_ifdi =
 			    &rxq->ifr_ifdi[j + rxq->ifr_fl_offset];
 		}
         /* Allocate receive buffers for the ring*/
 		if (iflib_rxsd_alloc(rxq)) {
 			device_printf(dev,
 			    "Critical Failure setting up receive buffers\n");
 			err = ENOMEM;
 			goto err_rx_desc;
 		}
 	}
 
 	/* TXQs */
 	vaddrs = malloc(sizeof(caddr_t)*ntxqsets*ntxqs, M_IFLIB, M_WAITOK);
 	paddrs = malloc(sizeof(uint64_t)*ntxqsets*ntxqs, M_IFLIB, M_WAITOK);
 	for (i = 0; i < ntxqsets; i++) {
 		iflib_dma_info_t di = ctx->ifc_txqs[i].ift_ifdi;
 
 		for (j = 0; j < ntxqs; j++, di++) {
 			vaddrs[i*ntxqs + j] = di->idi_vaddr;
 			paddrs[i*ntxqs + j] = di->idi_paddr;
 		}
 	}
 	if ((err = IFDI_TX_QUEUES_ALLOC(ctx, vaddrs, paddrs, ntxqs, ntxqsets)) != 0) {
 		device_printf(ctx->ifc_dev, "device queue allocation failed\n");
 		iflib_tx_structures_free(ctx);
 		free(vaddrs, M_IFLIB);
 		free(paddrs, M_IFLIB);
 		goto err_rx_desc;
 	}
 	free(vaddrs, M_IFLIB);
 	free(paddrs, M_IFLIB);
 
 	/* RXQs */
 	vaddrs = malloc(sizeof(caddr_t)*nrxqsets*nrxqs, M_IFLIB, M_WAITOK);
 	paddrs = malloc(sizeof(uint64_t)*nrxqsets*nrxqs, M_IFLIB, M_WAITOK);
 	for (i = 0; i < nrxqsets; i++) {
 		iflib_dma_info_t di = ctx->ifc_rxqs[i].ifr_ifdi;
 
 		for (j = 0; j < nrxqs; j++, di++) {
 			vaddrs[i*nrxqs + j] = di->idi_vaddr;
 			paddrs[i*nrxqs + j] = di->idi_paddr;
 		}
 	}
 	if ((err = IFDI_RX_QUEUES_ALLOC(ctx, vaddrs, paddrs, nrxqs, nrxqsets)) != 0) {
 		device_printf(ctx->ifc_dev, "device queue allocation failed\n");
 		iflib_tx_structures_free(ctx);
 		free(vaddrs, M_IFLIB);
 		free(paddrs, M_IFLIB);
 		goto err_rx_desc;
 	}
 	free(vaddrs, M_IFLIB);
 	free(paddrs, M_IFLIB);
 
 	return (0);
 
 /* XXX handle allocation failure changes */
 err_rx_desc:
 err_tx_desc:
 	if (ctx->ifc_rxqs != NULL)
 		free(ctx->ifc_rxqs, M_IFLIB);
 	ctx->ifc_rxqs = NULL;
 	if (ctx->ifc_txqs != NULL)
 		free(ctx->ifc_txqs, M_IFLIB);
 	ctx->ifc_txqs = NULL;
 rx_fail:
 	if (brscp != NULL)
 		free(brscp, M_IFLIB);
 	if (rxq != NULL)
 		free(rxq, M_IFLIB);
 	if (txq != NULL)
 		free(txq, M_IFLIB);
 fail:
 	return (err);
 }
 
 static int
 iflib_tx_structures_setup(if_ctx_t ctx)
 {
 	iflib_txq_t txq = ctx->ifc_txqs;
 	int i;
 
 	for (i = 0; i < NTXQSETS(ctx); i++, txq++)
 		iflib_txq_setup(txq);
 
 	return (0);
 }
 
 static void
 iflib_tx_structures_free(if_ctx_t ctx)
 {
 	iflib_txq_t txq = ctx->ifc_txqs;
 	int i, j;
 
 	for (i = 0; i < NTXQSETS(ctx); i++, txq++) {
 		iflib_txq_destroy(txq);
 		for (j = 0; j < ctx->ifc_nhwtxqs; j++)
 			iflib_dma_free(&txq->ift_ifdi[j]);
 	}
 	free(ctx->ifc_txqs, M_IFLIB);
 	ctx->ifc_txqs = NULL;
 	IFDI_QUEUES_FREE(ctx);
 }
 
 /*********************************************************************
  *
  *  Initialize all receive rings.
  *
  **********************************************************************/
 static int
 iflib_rx_structures_setup(if_ctx_t ctx)
 {
 	iflib_rxq_t rxq = ctx->ifc_rxqs;
 	int q;
 #if defined(INET6) || defined(INET)
 	int i, err;
 #endif
 
 	for (q = 0; q < ctx->ifc_softc_ctx.isc_nrxqsets; q++, rxq++) {
 #if defined(INET6) || defined(INET)
 		tcp_lro_free(&rxq->ifr_lc);
 		if ((err = tcp_lro_init_args(&rxq->ifr_lc, ctx->ifc_ifp,
 		    TCP_LRO_ENTRIES, min(1024,
 		    ctx->ifc_softc_ctx.isc_nrxd[rxq->ifr_fl_offset]))) != 0) {
 			device_printf(ctx->ifc_dev, "LRO Initialization failed!\n");
 			goto fail;
 		}
 		rxq->ifr_lro_enabled = TRUE;
 #endif
 		IFDI_RXQ_SETUP(ctx, rxq->ifr_id);
 	}
 	return (0);
 #if defined(INET6) || defined(INET)
 fail:
 	/*
 	 * Free RX software descriptors allocated so far, we will only handle
 	 * the rings that completed, the failing case will have
 	 * cleaned up for itself. 'q' failed, so its the terminus.
 	 */
 	rxq = ctx->ifc_rxqs;
 	for (i = 0; i < q; ++i, rxq++) {
 		iflib_rx_sds_free(rxq);
 		rxq->ifr_cq_gen = rxq->ifr_cq_cidx = rxq->ifr_cq_pidx = 0;
 	}
 	return (err);
 #endif
 }
 
 /*********************************************************************
  *
  *  Free all receive rings.
  *
  **********************************************************************/
 static void
 iflib_rx_structures_free(if_ctx_t ctx)
 {
 	iflib_rxq_t rxq = ctx->ifc_rxqs;
 
 	for (int i = 0; i < ctx->ifc_softc_ctx.isc_nrxqsets; i++, rxq++) {
 		iflib_rx_sds_free(rxq);
 	}
 }
 
 static int
 iflib_qset_structures_setup(if_ctx_t ctx)
 {
 	int err;
 
 	if ((err = iflib_tx_structures_setup(ctx)) != 0)
 		return (err);
 
 	if ((err = iflib_rx_structures_setup(ctx)) != 0) {
 		device_printf(ctx->ifc_dev, "iflib_rx_structures_setup failed: %d\n", err);
 		iflib_tx_structures_free(ctx);
 		iflib_rx_structures_free(ctx);
 	}
 	return (err);
 }
 
 int
 iflib_irq_alloc(if_ctx_t ctx, if_irq_t irq, int rid,
 				driver_filter_t filter, void *filter_arg, driver_intr_t handler, void *arg, char *name)
 {
 
 	return (_iflib_irq_alloc(ctx, irq, rid, filter, handler, arg, name));
 }
 
 static int
 find_nth(if_ctx_t ctx, cpuset_t *cpus, int qid)
 {
 	int i, cpuid, eqid, count;
 
 	CPU_COPY(&ctx->ifc_cpus, cpus);
 	count = CPU_COUNT(&ctx->ifc_cpus);
 	eqid = qid % count;
 	/* clear up to the qid'th bit */
 	for (i = 0; i < eqid; i++) {
 		cpuid = CPU_FFS(cpus);
 		MPASS(cpuid != 0);
 		CPU_CLR(cpuid-1, cpus);
 	}
 	cpuid = CPU_FFS(cpus);
 	MPASS(cpuid != 0);
 	return (cpuid-1);
 }
 
 int
 iflib_irq_alloc_generic(if_ctx_t ctx, if_irq_t irq, int rid,
 						iflib_intr_type_t type, driver_filter_t *filter,
 						void *filter_arg, int qid, char *name)
 {
 	struct grouptask *gtask;
 	struct taskqgroup *tqg;
 	iflib_filter_info_t info;
 	cpuset_t cpus;
 	gtask_fn_t *fn;
 	int tqrid, err, cpuid;
 	void *q;
 
 	info = &ctx->ifc_filter_info;
 	tqrid = rid;
 
 	switch (type) {
 	/* XXX merge tx/rx for netmap? */
 	case IFLIB_INTR_TX:
 		q = &ctx->ifc_txqs[qid];
 		info = &ctx->ifc_txqs[qid].ift_filter_info;
 		gtask = &ctx->ifc_txqs[qid].ift_task;
-		tqg = qgroup_if_io_tqg;
+		tqg = qgroup_softirq;
 		fn = _task_fn_tx;
 		GROUPTASK_INIT(gtask, 0, fn, q);
 		break;
 	case IFLIB_INTR_RX:
 		q = &ctx->ifc_rxqs[qid];
 		info = &ctx->ifc_rxqs[qid].ifr_filter_info;
 		gtask = &ctx->ifc_rxqs[qid].ifr_task;
-		tqg = qgroup_if_io_tqg;
+		tqg = qgroup_softirq;
 		fn = _task_fn_rx;
 		GROUPTASK_INIT(gtask, 0, fn, q);
 		break;
 	case IFLIB_INTR_ADMIN:
 		q = ctx;
 		tqrid = -1;
 		info = &ctx->ifc_filter_info;
 		gtask = &ctx->ifc_admin_task;
 		tqg = qgroup_if_config_tqg;
 		fn = _task_fn_admin;
 		break;
 	default:
 		panic("unknown net intr type");
 	}
 
 	info->ifi_filter = filter;
 	info->ifi_filter_arg = filter_arg;
 	info->ifi_task = gtask;
 	info->ifi_ctx = ctx;
 
 	err = _iflib_irq_alloc(ctx, irq, rid, iflib_fast_intr, NULL, info,  name);
 	if (err != 0) {
 		device_printf(ctx->ifc_dev, "_iflib_irq_alloc failed %d\n", err);
 		return (err);
 	}
 	if (type == IFLIB_INTR_ADMIN)
 		return (0);
 
 	if (tqrid != -1) {
 		cpuid = find_nth(ctx, &cpus, qid);
 		taskqgroup_attach_cpu(tqg, gtask, q, cpuid, irq->ii_rid, name);
 	} else {
 		taskqgroup_attach(tqg, gtask, q, tqrid, name);
 	}
 
 	return (0);
 }
 
 void
 iflib_softirq_alloc_generic(if_ctx_t ctx, int rid, iflib_intr_type_t type,  void *arg, int qid, char *name)
 {
 	struct grouptask *gtask;
 	struct taskqgroup *tqg;
 	gtask_fn_t *fn;
 	void *q;
 
 	switch (type) {
 	case IFLIB_INTR_TX:
 		q = &ctx->ifc_txqs[qid];
 		gtask = &ctx->ifc_txqs[qid].ift_task;
-		tqg = qgroup_if_io_tqg;
+		tqg = qgroup_softirq;
 		fn = _task_fn_tx;
 		break;
 	case IFLIB_INTR_RX:
 		q = &ctx->ifc_rxqs[qid];
 		gtask = &ctx->ifc_rxqs[qid].ifr_task;
-		tqg = qgroup_if_io_tqg;
+		tqg = qgroup_softirq;
 		fn = _task_fn_rx;
 		break;
 	case IFLIB_INTR_IOV:
 		q = ctx;
 		gtask = &ctx->ifc_vflr_task;
 		tqg = qgroup_if_config_tqg;
 		rid = -1;
 		fn = _task_fn_iov;
 		break;
 	default:
 		panic("unknown net intr type");
 	}
 	GROUPTASK_INIT(gtask, 0, fn, q);
 	taskqgroup_attach(tqg, gtask, q, rid, name);
 }
 
 void
 iflib_irq_free(if_ctx_t ctx, if_irq_t irq)
 {
 	if (irq->ii_tag)
 		bus_teardown_intr(ctx->ifc_dev, irq->ii_res, irq->ii_tag);
 
 	if (irq->ii_res)
 		bus_release_resource(ctx->ifc_dev, SYS_RES_IRQ, irq->ii_rid, irq->ii_res);
 }
 
 static int
 iflib_legacy_setup(if_ctx_t ctx, driver_filter_t filter, void *filter_arg, int *rid, char *name)
 {
 	iflib_txq_t txq = ctx->ifc_txqs;
 	iflib_rxq_t rxq = ctx->ifc_rxqs;
 	if_irq_t irq = &ctx->ifc_legacy_irq;
 	iflib_filter_info_t info;
 	struct grouptask *gtask;
 	struct taskqgroup *tqg;
 	gtask_fn_t *fn;
 	int tqrid;
 	void *q;
 	int err;
 
 	q = &ctx->ifc_rxqs[0];
 	info = &rxq[0].ifr_filter_info;
 	gtask = &rxq[0].ifr_task;
-	tqg = qgroup_if_io_tqg;
+	tqg = qgroup_softirq;
 	tqrid = irq->ii_rid = *rid;
 	fn = _task_fn_rx;
 
 	ctx->ifc_flags |= IFC_LEGACY;
 	info->ifi_filter = filter;
 	info->ifi_filter_arg = filter_arg;
 	info->ifi_task = gtask;
 	info->ifi_ctx = ctx;
 
 	/* We allocate a single interrupt resource */
 	if ((err = _iflib_irq_alloc(ctx, irq, tqrid, iflib_fast_intr, NULL, info, name)) != 0)
 		return (err);
 	GROUPTASK_INIT(gtask, 0, fn, q);
 	taskqgroup_attach(tqg, gtask, q, tqrid, name);
 
 	GROUPTASK_INIT(&txq->ift_task, 0, _task_fn_tx, txq);
-	taskqgroup_attach(qgroup_if_io_tqg, &txq->ift_task, txq, tqrid, "tx");
+	taskqgroup_attach(qgroup_softirq, &txq->ift_task, txq, tqrid, "tx");
 	return (0);
 }
 
 void
 iflib_led_create(if_ctx_t ctx)
 {
 
 	ctx->ifc_led_dev = led_create(iflib_led_func, ctx,
 								  device_get_nameunit(ctx->ifc_dev));
 }
 
 void
 iflib_tx_intr_deferred(if_ctx_t ctx, int txqid)
 {
 
 	GROUPTASK_ENQUEUE(&ctx->ifc_txqs[txqid].ift_task);
 }
 
 void
 iflib_rx_intr_deferred(if_ctx_t ctx, int rxqid)
 {
 
 	GROUPTASK_ENQUEUE(&ctx->ifc_rxqs[rxqid].ifr_task);
 }
 
 void
 iflib_admin_intr_deferred(if_ctx_t ctx)
 {
 #ifdef INVARIANTS
 	struct grouptask *gtask;
 
 	gtask = &ctx->ifc_admin_task;
 	MPASS(gtask->gt_taskqueue != NULL);
 #endif
 
 	GROUPTASK_ENQUEUE(&ctx->ifc_admin_task);
 }
 
 void
 iflib_iov_intr_deferred(if_ctx_t ctx)
 {
 
 	GROUPTASK_ENQUEUE(&ctx->ifc_vflr_task);
 }
 
 void
 iflib_io_tqg_attach(struct grouptask *gt, void *uniq, int cpu, char *name)
 {
 
-	taskqgroup_attach_cpu(qgroup_if_io_tqg, gt, uniq, cpu, -1, name);
+	taskqgroup_attach_cpu(qgroup_softirq, gt, uniq, cpu, -1, name);
 }
 
 void
 iflib_config_gtask_init(if_ctx_t ctx, struct grouptask *gtask, gtask_fn_t *fn,
 	char *name)
 {
 
 	GROUPTASK_INIT(gtask, 0, fn, ctx);
 	taskqgroup_attach(qgroup_if_config_tqg, gtask, gtask, -1, name);
 }
 
 void
 iflib_config_gtask_deinit(struct grouptask *gtask)
 {
 
 	taskqgroup_detach(qgroup_if_config_tqg, gtask);	
 }
 
 void
 iflib_link_state_change(if_ctx_t ctx, int link_state, uint64_t baudrate)
 {
 	if_t ifp = ctx->ifc_ifp;
 	iflib_txq_t txq = ctx->ifc_txqs;
 
 	if_setbaudrate(ifp, baudrate);
 
 	/* If link down, disable watchdog */
 	if ((ctx->ifc_link_state == LINK_STATE_UP) && (link_state == LINK_STATE_DOWN)) {
 		for (int i = 0; i < ctx->ifc_softc_ctx.isc_ntxqsets; i++, txq++)
 			txq->ift_qstatus = IFLIB_QUEUE_IDLE;
 	}
 	ctx->ifc_link_state = link_state;
 	if_link_state_change(ifp, link_state);
 }
 
 static int
 iflib_tx_credits_update(if_ctx_t ctx, iflib_txq_t txq)
 {
 	int credits;
 #ifdef INVARIANTS
 	int credits_pre = txq->ift_cidx_processed;
 #endif	
 
 	if (ctx->isc_txd_credits_update == NULL)
 		return (0);
 
 	if ((credits = ctx->isc_txd_credits_update(ctx->ifc_softc, txq->ift_id, txq->ift_cidx_processed, true)) == 0)
 		return (0);
 
 	txq->ift_processed += credits;
 	txq->ift_cidx_processed += credits;
 
 	MPASS(credits_pre + credits == txq->ift_cidx_processed);
 	if (txq->ift_cidx_processed >= txq->ift_size)
 		txq->ift_cidx_processed -= txq->ift_size;
 	return (credits);
 }
 
 static int
 iflib_rxd_avail(if_ctx_t ctx, iflib_rxq_t rxq, int cidx, int budget)
 {
 
 	return (ctx->isc_rxd_available(ctx->ifc_softc, rxq->ifr_id, cidx,
 	    budget));
 }
 
 void
 iflib_add_int_delay_sysctl(if_ctx_t ctx, const char *name,
 	const char *description, if_int_delay_info_t info,
 	int offset, int value)
 {
 	info->iidi_ctx = ctx;
 	info->iidi_offset = offset;
 	info->iidi_value = value;
 	SYSCTL_ADD_PROC(device_get_sysctl_ctx(ctx->ifc_dev),
 	    SYSCTL_CHILDREN(device_get_sysctl_tree(ctx->ifc_dev)),
 	    OID_AUTO, name, CTLTYPE_INT|CTLFLAG_RW,
 	    info, 0, iflib_sysctl_int_delay, "I", description);
 }
 
 struct mtx *
 iflib_ctx_lock_get(if_ctx_t ctx)
 {
 
 	return (&ctx->ifc_mtx);
 }
 
 static int
 iflib_msix_init(if_ctx_t ctx)
 {
 	device_t dev = ctx->ifc_dev;
 	if_shared_ctx_t sctx = ctx->ifc_sctx;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
 	int vectors, queues, rx_queues, tx_queues, queuemsgs, msgs;
 	int iflib_num_tx_queues, iflib_num_rx_queues;
 	int err, admincnt, bar;
 
 	iflib_num_tx_queues = scctx->isc_ntxqsets;
 	iflib_num_rx_queues = scctx->isc_nrxqsets;
 
 	device_printf(dev, "msix_init qsets capped at %d\n", iflib_num_tx_queues);
 	
 	bar = ctx->ifc_softc_ctx.isc_msix_bar;
 	admincnt = sctx->isc_admin_intrcnt;
 	/* Override by tuneable */
 	if (enable_msix == 0)
 		goto msi;
 
 	/*
 	** When used in a virtualized environment
 	** PCI BUSMASTER capability may not be set
 	** so explicity set it here and rewrite
 	** the ENABLE in the MSIX control register
 	** at this point to cause the host to
 	** successfully initialize us.
 	*/
 	{
 		int msix_ctrl, rid;
 
  		pci_enable_busmaster(dev);
 		rid = 0;
 		if (pci_find_cap(dev, PCIY_MSIX, &rid) == 0 && rid != 0) {
 			rid += PCIR_MSIX_CTRL;
 			msix_ctrl = pci_read_config(dev, rid, 2);
 			msix_ctrl |= PCIM_MSIXCTRL_MSIX_ENABLE;
 			pci_write_config(dev, rid, msix_ctrl, 2);
 		} else {
 			device_printf(dev, "PCIY_MSIX capability not found; "
 			                   "or rid %d == 0.\n", rid);
 			goto msi;
 		}
 	}
 
 	/*
 	 * bar == -1 => "trust me I know what I'm doing"
 	 * https://www.youtube.com/watch?v=nnwWKkNau4I
 	 * Some drivers are for hardware that is so shoddily
 	 * documented that no one knows which bars are which
 	 * so the developer has to map all bars. This hack
 	 * allows shoddy garbage to use msix in this framework.
 	 */
 	if (bar != -1) {
 		ctx->ifc_msix_mem = bus_alloc_resource_any(dev,
 	            SYS_RES_MEMORY, &bar, RF_ACTIVE);
 		if (ctx->ifc_msix_mem == NULL) {
 			/* May not be enabled */
 			device_printf(dev, "Unable to map MSIX table \n");
 			goto msi;
 		}
 	}
 	/* First try MSI/X */
 	if ((msgs = pci_msix_count(dev)) == 0) { /* system has msix disabled */
 		device_printf(dev, "System has MSIX disabled \n");
 		bus_release_resource(dev, SYS_RES_MEMORY,
 		    bar, ctx->ifc_msix_mem);
 		ctx->ifc_msix_mem = NULL;
 		goto msi;
 	}
 #if IFLIB_DEBUG
 	/* use only 1 qset in debug mode */
 	queuemsgs = min(msgs - admincnt, 1);
 #else
 	queuemsgs = msgs - admincnt;
 #endif
 	if (bus_get_cpus(dev, INTR_CPUS, sizeof(ctx->ifc_cpus), &ctx->ifc_cpus) == 0) {
 #ifdef RSS
 		queues = imin(queuemsgs, rss_getnumbuckets());
 #else
 		queues = queuemsgs;
 #endif
 		queues = imin(CPU_COUNT(&ctx->ifc_cpus), queues);
 		device_printf(dev, "pxm cpus: %d queue msgs: %d admincnt: %d\n",
 					  CPU_COUNT(&ctx->ifc_cpus), queuemsgs, admincnt);
 	} else {
 		device_printf(dev, "Unable to fetch CPU list\n");
 		/* Figure out a reasonable auto config value */
 		queues = min(queuemsgs, mp_ncpus);
 	}
 #ifdef  RSS
 	/* If we're doing RSS, clamp at the number of RSS buckets */
 	if (queues > rss_getnumbuckets())
 		queues = rss_getnumbuckets();
 #endif
 	if (iflib_num_rx_queues > 0 && iflib_num_rx_queues < queuemsgs - admincnt)
 		rx_queues = iflib_num_rx_queues;
 	else
 		rx_queues = queues;
 	/*
 	 * We want this to be all logical CPUs by default
 	 */
 	if (iflib_num_tx_queues > 0 && iflib_num_tx_queues < queues)
 		tx_queues = iflib_num_tx_queues;
 	else
 		tx_queues = mp_ncpus;
 
 	if (ctx->ifc_sysctl_qs_eq_override == 0) {
 #ifdef INVARIANTS
 		if (tx_queues != rx_queues)
 			device_printf(dev, "queue equality override not set, capping rx_queues at %d and tx_queues at %d\n",
 				      min(rx_queues, tx_queues), min(rx_queues, tx_queues));
 #endif
 		tx_queues = min(rx_queues, tx_queues);
 		rx_queues = min(rx_queues, tx_queues);
 	}
 
 	device_printf(dev, "using %d rx queues %d tx queues \n", rx_queues, tx_queues);
 
 	vectors = rx_queues + admincnt;
 	if ((err = pci_alloc_msix(dev, &vectors)) == 0) {
 		device_printf(dev,
 					  "Using MSIX interrupts with %d vectors\n", vectors);
 		scctx->isc_vectors = vectors;
 		scctx->isc_nrxqsets = rx_queues;
 		scctx->isc_ntxqsets = tx_queues;
 		scctx->isc_intr = IFLIB_INTR_MSIX;
 
 		return (vectors);
 	} else {
 		device_printf(dev, "failed to allocate %d msix vectors, err: %d - using MSI\n", vectors, err);
 	}
 msi:
 	vectors = pci_msi_count(dev);
 	scctx->isc_nrxqsets = 1;
 	scctx->isc_ntxqsets = 1;
 	scctx->isc_vectors = vectors;
 	if (vectors == 1 && pci_alloc_msi(dev, &vectors) == 0) {
 		device_printf(dev,"Using an MSI interrupt\n");
 		scctx->isc_intr = IFLIB_INTR_MSI;
 	} else {
 		device_printf(dev,"Using a Legacy interrupt\n");
 		scctx->isc_intr = IFLIB_INTR_LEGACY;
 	}
 
 	return (vectors);
 }
 
 char * ring_states[] = { "IDLE", "BUSY", "STALLED", "ABDICATED" };
 
 static int
 mp_ring_state_handler(SYSCTL_HANDLER_ARGS)
 {
 	int rc;
 	uint16_t *state = ((uint16_t *)oidp->oid_arg1);
 	struct sbuf *sb;
 	char *ring_state = "UNKNOWN";
 
 	/* XXX needed ? */
 	rc = sysctl_wire_old_buffer(req, 0);
 	MPASS(rc == 0);
 	if (rc != 0)
 		return (rc);
 	sb = sbuf_new_for_sysctl(NULL, NULL, 80, req);
 	MPASS(sb != NULL);
 	if (sb == NULL)
 		return (ENOMEM);
 	if (state[3] <= 3)
 		ring_state = ring_states[state[3]];
 
 	sbuf_printf(sb, "pidx_head: %04hd pidx_tail: %04hd cidx: %04hd state: %s",
 		    state[0], state[1], state[2], ring_state);
 	rc = sbuf_finish(sb);
 	sbuf_delete(sb);
         return(rc);
 }
 
 enum iflib_ndesc_handler {
 	IFLIB_NTXD_HANDLER,
 	IFLIB_NRXD_HANDLER,
 };
 
 static int
 mp_ndesc_handler(SYSCTL_HANDLER_ARGS)
 {
 	if_ctx_t ctx = (void *)arg1;
 	enum iflib_ndesc_handler type = arg2;
 	char buf[256] = {0};
 	uint16_t *ndesc;
 	char *p, *next;
 	int nqs, rc, i;
 
 	MPASS(type == IFLIB_NTXD_HANDLER || type == IFLIB_NRXD_HANDLER);
 
 	nqs = 8;
 	switch(type) {
 	case IFLIB_NTXD_HANDLER:
 		ndesc = ctx->ifc_sysctl_ntxds;
 		if (ctx->ifc_sctx)
 			nqs = ctx->ifc_sctx->isc_ntxqs;
 		break;
 	case IFLIB_NRXD_HANDLER:
 		ndesc = ctx->ifc_sysctl_nrxds;
 		if (ctx->ifc_sctx)
 			nqs = ctx->ifc_sctx->isc_nrxqs;
 		break;
 	}
 	if (nqs == 0)
 		nqs = 8;
 
 	for (i=0; i<8; i++) {
 		if (i >= nqs)
 			break;
 		if (i)
 			strcat(buf, ",");
 		sprintf(strchr(buf, 0), "%d", ndesc[i]);
 	}
 
 	rc = sysctl_handle_string(oidp, buf, sizeof(buf), req);
 	if (rc || req->newptr == NULL)
 		return rc;
 
 	for (i = 0, next = buf, p = strsep(&next, " ,"); i < 8 && p;
 	    i++, p = strsep(&next, " ,")) {
 		ndesc[i] = strtoul(p, NULL, 10);
 	}
 
 	return(rc);
 }
 
 #define NAME_BUFLEN 32
 static void
 iflib_add_device_sysctl_pre(if_ctx_t ctx)
 {
         device_t dev = iflib_get_dev(ctx);
 	struct sysctl_oid_list *child, *oid_list;
 	struct sysctl_ctx_list *ctx_list;
 	struct sysctl_oid *node;
 
 	ctx_list = device_get_sysctl_ctx(dev);
 	child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev));
 	ctx->ifc_sysctl_node = node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, "iflib",
 						      CTLFLAG_RD, NULL, "IFLIB fields");
 	oid_list = SYSCTL_CHILDREN(node);
 
 	SYSCTL_ADD_STRING(ctx_list, oid_list, OID_AUTO, "driver_version",
 		       CTLFLAG_RD, ctx->ifc_sctx->isc_driver_version, 0,
 		       "driver version");
 
 	SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_ntxqs",
 		       CTLFLAG_RWTUN, &ctx->ifc_sysctl_ntxqs, 0,
 			"# of txqs to use, 0 => use default #");
 	SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_nrxqs",
 		       CTLFLAG_RWTUN, &ctx->ifc_sysctl_nrxqs, 0,
 			"# of rxqs to use, 0 => use default #");
 	SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_qs_enable",
 		       CTLFLAG_RWTUN, &ctx->ifc_sysctl_qs_eq_override, 0,
                        "permit #txq != #rxq");
 
 	/* XXX change for per-queue sizes */
 	SYSCTL_ADD_PROC(ctx_list, oid_list, OID_AUTO, "override_ntxds",
 		       CTLTYPE_STRING|CTLFLAG_RWTUN, ctx, IFLIB_NTXD_HANDLER,
                        mp_ndesc_handler, "A",
                        "list of # of tx descriptors to use, 0 = use default #");
 	SYSCTL_ADD_PROC(ctx_list, oid_list, OID_AUTO, "override_nrxds",
 		       CTLTYPE_STRING|CTLFLAG_RWTUN, ctx, IFLIB_NRXD_HANDLER,
                        mp_ndesc_handler, "A",
                        "list of # of rx descriptors to use, 0 = use default #");
 }
 
 static void
 iflib_add_device_sysctl_post(if_ctx_t ctx)
 {
 	if_shared_ctx_t sctx = ctx->ifc_sctx;
 	if_softc_ctx_t scctx = &ctx->ifc_softc_ctx;
         device_t dev = iflib_get_dev(ctx);
 	struct sysctl_oid_list *child;
 	struct sysctl_ctx_list *ctx_list;
 	iflib_fl_t fl;
 	iflib_txq_t txq;
 	iflib_rxq_t rxq;
 	int i, j;
 	char namebuf[NAME_BUFLEN];
 	char *qfmt;
 	struct sysctl_oid *queue_node, *fl_node, *node;
 	struct sysctl_oid_list *queue_list, *fl_list;
 	ctx_list = device_get_sysctl_ctx(dev);
 
 	node = ctx->ifc_sysctl_node;
 	child = SYSCTL_CHILDREN(node);
 
 	if (scctx->isc_ntxqsets > 100)
 		qfmt = "txq%03d";
 	else if (scctx->isc_ntxqsets > 10)
 		qfmt = "txq%02d";
 	else
 		qfmt = "txq%d";
 	for (i = 0, txq = ctx->ifc_txqs; i < scctx->isc_ntxqsets; i++, txq++) {
 		snprintf(namebuf, NAME_BUFLEN, qfmt, i);
 		queue_node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, namebuf,
 					     CTLFLAG_RD, NULL, "Queue Name");
 		queue_list = SYSCTL_CHILDREN(queue_node);
 #if MEMORY_LOGGING
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_dequeued",
 				CTLFLAG_RD,
 				&txq->ift_dequeued, "total mbufs freed");
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_enqueued",
 				CTLFLAG_RD,
 				&txq->ift_enqueued, "total mbufs enqueued");
 #endif
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "mbuf_defrag",
 				   CTLFLAG_RD,
 				   &txq->ift_mbuf_defrag, "# of times m_defrag was called");
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "m_pullups",
 				   CTLFLAG_RD,
 				   &txq->ift_pullups, "# of times m_pullup was called");
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "mbuf_defrag_failed",
 				   CTLFLAG_RD,
 				   &txq->ift_mbuf_defrag_failed, "# of times m_defrag failed");
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "no_desc_avail",
 				   CTLFLAG_RD,
 				   &txq->ift_no_desc_avail, "# of times no descriptors were available");
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "tx_map_failed",
 				   CTLFLAG_RD,
 				   &txq->ift_map_failed, "# of times dma map failed");
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txd_encap_efbig",
 				   CTLFLAG_RD,
 				   &txq->ift_txd_encap_efbig, "# of times txd_encap returned EFBIG");
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "no_tx_dma_setup",
 				   CTLFLAG_RD,
 				   &txq->ift_no_tx_dma_setup, "# of times map failed for other than EFBIG");
 		SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_pidx",
 				   CTLFLAG_RD,
 				   &txq->ift_pidx, 1, "Producer Index");
 		SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_cidx",
 				   CTLFLAG_RD,
 				   &txq->ift_cidx, 1, "Consumer Index");
 		SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_cidx_processed",
 				   CTLFLAG_RD,
 				   &txq->ift_cidx_processed, 1, "Consumer Index seen by credit update");
 		SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_in_use",
 				   CTLFLAG_RD,
 				   &txq->ift_in_use, 1, "descriptors in use");
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_processed",
 				   CTLFLAG_RD,
 				   &txq->ift_processed, "descriptors procesed for clean");
 		SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_cleaned",
 				   CTLFLAG_RD,
 				   &txq->ift_cleaned, "total cleaned");
 		SYSCTL_ADD_PROC(ctx_list, queue_list, OID_AUTO, "ring_state",
 				CTLTYPE_STRING | CTLFLAG_RD, __DEVOLATILE(uint64_t *, &txq->ift_br[0]->state),
 				0, mp_ring_state_handler, "A", "soft ring state");
 		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_enqueues",
 				       CTLFLAG_RD, &txq->ift_br[0]->enqueues,
 				       "# of enqueues to the mp_ring for this queue");
 		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_drops",
 				       CTLFLAG_RD, &txq->ift_br[0]->drops,
 				       "# of drops in the mp_ring for this queue");
 		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_starts",
 				       CTLFLAG_RD, &txq->ift_br[0]->starts,
 				       "# of normal consumer starts in the mp_ring for this queue");
 		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_stalls",
 				       CTLFLAG_RD, &txq->ift_br[0]->stalls,
 					       "# of consumer stalls in the mp_ring for this queue");
 		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_restarts",
 			       CTLFLAG_RD, &txq->ift_br[0]->restarts,
 				       "# of consumer restarts in the mp_ring for this queue");
 		SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_abdications",
 				       CTLFLAG_RD, &txq->ift_br[0]->abdications,
 				       "# of consumer abdications in the mp_ring for this queue");
 	}
 
 	if (scctx->isc_nrxqsets > 100)
 		qfmt = "rxq%03d";
 	else if (scctx->isc_nrxqsets > 10)
 		qfmt = "rxq%02d";
 	else
 		qfmt = "rxq%d";
 	for (i = 0, rxq = ctx->ifc_rxqs; i < scctx->isc_nrxqsets; i++, rxq++) {
 		snprintf(namebuf, NAME_BUFLEN, qfmt, i);
 		queue_node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, namebuf,
 					     CTLFLAG_RD, NULL, "Queue Name");
 		queue_list = SYSCTL_CHILDREN(queue_node);
 		if (sctx->isc_flags & IFLIB_HAS_RXCQ) {
 			SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "rxq_cq_pidx",
 				       CTLFLAG_RD,
 				       &rxq->ifr_cq_pidx, 1, "Producer Index");
 			SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "rxq_cq_cidx",
 				       CTLFLAG_RD,
 				       &rxq->ifr_cq_cidx, 1, "Consumer Index");
 		}
 
 		for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) {
 			snprintf(namebuf, NAME_BUFLEN, "rxq_fl%d", j);
 			fl_node = SYSCTL_ADD_NODE(ctx_list, queue_list, OID_AUTO, namebuf,
 						     CTLFLAG_RD, NULL, "freelist Name");
 			fl_list = SYSCTL_CHILDREN(fl_node);
 			SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "pidx",
 				       CTLFLAG_RD,
 				       &fl->ifl_pidx, 1, "Producer Index");
 			SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "cidx",
 				       CTLFLAG_RD,
 				       &fl->ifl_cidx, 1, "Consumer Index");
 			SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "credits",
 				       CTLFLAG_RD,
 				       &fl->ifl_credits, 1, "credits available");
 #if MEMORY_LOGGING
 			SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_m_enqueued",
 					CTLFLAG_RD,
 					&fl->ifl_m_enqueued, "mbufs allocated");
 			SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_m_dequeued",
 					CTLFLAG_RD,
 					&fl->ifl_m_dequeued, "mbufs freed");
 			SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_cl_enqueued",
 					CTLFLAG_RD,
 					&fl->ifl_cl_enqueued, "clusters allocated");
 			SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_cl_dequeued",
 					CTLFLAG_RD,
 					&fl->ifl_cl_dequeued, "clusters freed");
 #endif
 
 		}
 	}
 
 }
Index: projects/clang400-import/sys/netpfil/ipfw/ip_fw2.c
===================================================================
--- projects/clang400-import/sys/netpfil/ipfw/ip_fw2.c	(revision 314522)
+++ projects/clang400-import/sys/netpfil/ipfw/ip_fw2.c	(revision 314523)
@@ -1,2941 +1,2948 @@
 /*-
  * Copyright (c) 2002-2009 Luigi Rizzo, Universita` di Pisa
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /*
  * The FreeBSD IP packet firewall, main file
  */
 
 #include "opt_ipfw.h"
 #include "opt_ipdivert.h"
 #include "opt_inet.h"
 #ifndef INET
 #error "IPFIREWALL requires INET"
 #endif /* INET */
 #include "opt_inet6.h"
 #include "opt_ipsec.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/condvar.h>
 #include <sys/counter.h>
 #include <sys/eventhandler.h>
 #include <sys/malloc.h>
 #include <sys/mbuf.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/jail.h>
 #include <sys/module.h>
 #include <sys/priv.h>
 #include <sys/proc.h>
 #include <sys/rwlock.h>
 #include <sys/rmlock.h>
 #include <sys/socket.h>
 #include <sys/socketvar.h>
 #include <sys/sysctl.h>
 #include <sys/syslog.h>
 #include <sys/ucred.h>
 #include <net/ethernet.h> /* for ETHERTYPE_IP */
 #include <net/if.h>
 #include <net/if_var.h>
 #include <net/route.h>
 #include <net/pfil.h>
 #include <net/vnet.h>
 
 #include <netpfil/pf/pf_mtag.h>
 
 #include <netinet/in.h>
 #include <netinet/in_var.h>
 #include <netinet/in_pcb.h>
 #include <netinet/ip.h>
 #include <netinet/ip_var.h>
 #include <netinet/ip_icmp.h>
 #include <netinet/ip_fw.h>
 #include <netinet/ip_carp.h>
 #include <netinet/pim.h>
 #include <netinet/tcp_var.h>
 #include <netinet/udp.h>
 #include <netinet/udp_var.h>
 #include <netinet/sctp.h>
 
 #include <netinet/ip6.h>
 #include <netinet/icmp6.h>
 #include <netinet/in_fib.h>
 #ifdef INET6
 #include <netinet6/in6_fib.h>
 #include <netinet6/in6_pcb.h>
 #include <netinet6/scope6_var.h>
 #include <netinet6/ip6_var.h>
 #endif
 
 #include <netpfil/ipfw/ip_fw_private.h>
 
 #include <machine/in_cksum.h>	/* XXX for in_cksum */
 
 #ifdef MAC
 #include <security/mac/mac_framework.h>
 #endif
 
 /*
  * static variables followed by global ones.
  * All ipfw global variables are here.
  */
 
 static VNET_DEFINE(int, fw_deny_unknown_exthdrs);
 #define	V_fw_deny_unknown_exthdrs	VNET(fw_deny_unknown_exthdrs)
 
 static VNET_DEFINE(int, fw_permit_single_frag6) = 1;
 #define	V_fw_permit_single_frag6	VNET(fw_permit_single_frag6)
 
 #ifdef IPFIREWALL_DEFAULT_TO_ACCEPT
 static int default_to_accept = 1;
 #else
 static int default_to_accept;
 #endif
 
 VNET_DEFINE(int, autoinc_step);
 VNET_DEFINE(int, fw_one_pass) = 1;
 
 VNET_DEFINE(unsigned int, fw_tables_max);
 VNET_DEFINE(unsigned int, fw_tables_sets) = 0;	/* Don't use set-aware tables */
 /* Use 128 tables by default */
 static unsigned int default_fw_tables = IPFW_TABLES_DEFAULT;
 
 #ifndef LINEAR_SKIPTO
 static int jump_fast(struct ip_fw_chain *chain, struct ip_fw *f, int num,
     int tablearg, int jump_backwards);
 #define	JUMP(ch, f, num, targ, back)	jump_fast(ch, f, num, targ, back)
 #else
 static int jump_linear(struct ip_fw_chain *chain, struct ip_fw *f, int num,
     int tablearg, int jump_backwards);
 #define	JUMP(ch, f, num, targ, back)	jump_linear(ch, f, num, targ, back)
 #endif
 
 /*
  * Each rule belongs to one of 32 different sets (0..31).
  * The variable set_disable contains one bit per set.
  * If the bit is set, all rules in the corresponding set
  * are disabled. Set RESVD_SET(31) is reserved for the default rule
  * and rules that are not deleted by the flush command,
  * and CANNOT be disabled.
  * Rules in set RESVD_SET can only be deleted individually.
  */
 VNET_DEFINE(u_int32_t, set_disable);
 #define	V_set_disable			VNET(set_disable)
 
 VNET_DEFINE(int, fw_verbose);
 /* counter for ipfw_log(NULL...) */
 VNET_DEFINE(u_int64_t, norule_counter);
 VNET_DEFINE(int, verbose_limit);
 
 /* layer3_chain contains the list of rules for layer 3 */
 VNET_DEFINE(struct ip_fw_chain, layer3_chain);
 
 /* ipfw_vnet_ready controls when we are open for business */
 VNET_DEFINE(int, ipfw_vnet_ready) = 0;
 
 VNET_DEFINE(int, ipfw_nat_ready) = 0;
 
 ipfw_nat_t *ipfw_nat_ptr = NULL;
 struct cfg_nat *(*lookup_nat_ptr)(struct nat_list *, int);
 ipfw_nat_cfg_t *ipfw_nat_cfg_ptr;
 ipfw_nat_cfg_t *ipfw_nat_del_ptr;
 ipfw_nat_cfg_t *ipfw_nat_get_cfg_ptr;
 ipfw_nat_cfg_t *ipfw_nat_get_log_ptr;
 
 #ifdef SYSCTL_NODE
 uint32_t dummy_def = IPFW_DEFAULT_RULE;
 static int sysctl_ipfw_table_num(SYSCTL_HANDLER_ARGS);
 static int sysctl_ipfw_tables_sets(SYSCTL_HANDLER_ARGS);
 
 SYSBEGIN(f3)
 
 SYSCTL_NODE(_net_inet_ip, OID_AUTO, fw, CTLFLAG_RW, 0, "Firewall");
 SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, one_pass,
     CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fw_one_pass), 0,
     "Only do a single pass through ipfw when using dummynet(4)");
 SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, autoinc_step,
     CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(autoinc_step), 0,
     "Rule number auto-increment step");
 SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, verbose,
     CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fw_verbose), 0,
     "Log matches to ipfw rules");
 SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, verbose_limit,
     CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(verbose_limit), 0,
     "Set upper limit of matches of ipfw rules logged");
 SYSCTL_UINT(_net_inet_ip_fw, OID_AUTO, default_rule, CTLFLAG_RD,
     &dummy_def, 0,
     "The default/max possible rule number.");
 SYSCTL_PROC(_net_inet_ip_fw, OID_AUTO, tables_max,
     CTLFLAG_VNET | CTLTYPE_UINT | CTLFLAG_RW, 0, 0, sysctl_ipfw_table_num, "IU",
     "Maximum number of concurrently used tables");
 SYSCTL_PROC(_net_inet_ip_fw, OID_AUTO, tables_sets,
     CTLFLAG_VNET | CTLTYPE_UINT | CTLFLAG_RW,
     0, 0, sysctl_ipfw_tables_sets, "IU",
     "Use per-set namespace for tables");
 SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, default_to_accept, CTLFLAG_RDTUN,
     &default_to_accept, 0,
     "Make the default rule accept all packets.");
 TUNABLE_INT("net.inet.ip.fw.tables_max", (int *)&default_fw_tables);
 SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, static_count,
     CTLFLAG_VNET | CTLFLAG_RD, &VNET_NAME(layer3_chain.n_rules), 0,
     "Number of static rules");
 
 #ifdef INET6
 SYSCTL_DECL(_net_inet6_ip6);
 SYSCTL_NODE(_net_inet6_ip6, OID_AUTO, fw, CTLFLAG_RW, 0, "Firewall");
 SYSCTL_INT(_net_inet6_ip6_fw, OID_AUTO, deny_unknown_exthdrs,
     CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE,
     &VNET_NAME(fw_deny_unknown_exthdrs), 0,
     "Deny packets with unknown IPv6 Extension Headers");
 SYSCTL_INT(_net_inet6_ip6_fw, OID_AUTO, permit_single_frag6,
     CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE,
     &VNET_NAME(fw_permit_single_frag6), 0,
     "Permit single packet IPv6 fragments");
 #endif /* INET6 */
 
 SYSEND
 
 #endif /* SYSCTL_NODE */
 
 
 /*
  * Some macros used in the various matching options.
  * L3HDR maps an ipv4 pointer into a layer3 header pointer of type T
  * Other macros just cast void * into the appropriate type
  */
 #define	L3HDR(T, ip)	((T *)((u_int32_t *)(ip) + (ip)->ip_hl))
 #define	TCP(p)		((struct tcphdr *)(p))
 #define	SCTP(p)		((struct sctphdr *)(p))
 #define	UDP(p)		((struct udphdr *)(p))
 #define	ICMP(p)		((struct icmphdr *)(p))
 #define	ICMP6(p)	((struct icmp6_hdr *)(p))
 
 static __inline int
 icmptype_match(struct icmphdr *icmp, ipfw_insn_u32 *cmd)
 {
 	int type = icmp->icmp_type;
 
 	return (type <= ICMP_MAXTYPE && (cmd->d[0] & (1<<type)) );
 }
 
 #define TT	( (1 << ICMP_ECHO) | (1 << ICMP_ROUTERSOLICIT) | \
     (1 << ICMP_TSTAMP) | (1 << ICMP_IREQ) | (1 << ICMP_MASKREQ) )
 
 static int
 is_icmp_query(struct icmphdr *icmp)
 {
 	int type = icmp->icmp_type;
 
 	return (type <= ICMP_MAXTYPE && (TT & (1<<type)) );
 }
 #undef TT
 
 /*
  * The following checks use two arrays of 8 or 16 bits to store the
  * bits that we want set or clear, respectively. They are in the
  * low and high half of cmd->arg1 or cmd->d[0].
  *
  * We scan options and store the bits we find set. We succeed if
  *
  *	(want_set & ~bits) == 0 && (want_clear & ~bits) == want_clear
  *
  * The code is sometimes optimized not to store additional variables.
  */
 
 static int
 flags_match(ipfw_insn *cmd, u_int8_t bits)
 {
 	u_char want_clear;
 	bits = ~bits;
 
 	if ( ((cmd->arg1 & 0xff) & bits) != 0)
 		return 0; /* some bits we want set were clear */
 	want_clear = (cmd->arg1 >> 8) & 0xff;
 	if ( (want_clear & bits) != want_clear)
 		return 0; /* some bits we want clear were set */
 	return 1;
 }
 
 static int
 ipopts_match(struct ip *ip, ipfw_insn *cmd)
 {
 	int optlen, bits = 0;
 	u_char *cp = (u_char *)(ip + 1);
 	int x = (ip->ip_hl << 2) - sizeof (struct ip);
 
 	for (; x > 0; x -= optlen, cp += optlen) {
 		int opt = cp[IPOPT_OPTVAL];
 
 		if (opt == IPOPT_EOL)
 			break;
 		if (opt == IPOPT_NOP)
 			optlen = 1;
 		else {
 			optlen = cp[IPOPT_OLEN];
 			if (optlen <= 0 || optlen > x)
 				return 0; /* invalid or truncated */
 		}
 		switch (opt) {
 
 		default:
 			break;
 
 		case IPOPT_LSRR:
 			bits |= IP_FW_IPOPT_LSRR;
 			break;
 
 		case IPOPT_SSRR:
 			bits |= IP_FW_IPOPT_SSRR;
 			break;
 
 		case IPOPT_RR:
 			bits |= IP_FW_IPOPT_RR;
 			break;
 
 		case IPOPT_TS:
 			bits |= IP_FW_IPOPT_TS;
 			break;
 		}
 	}
 	return (flags_match(cmd, bits));
 }
 
 static int
 tcpopts_match(struct tcphdr *tcp, ipfw_insn *cmd)
 {
 	int optlen, bits = 0;
 	u_char *cp = (u_char *)(tcp + 1);
 	int x = (tcp->th_off << 2) - sizeof(struct tcphdr);
 
 	for (; x > 0; x -= optlen, cp += optlen) {
 		int opt = cp[0];
 		if (opt == TCPOPT_EOL)
 			break;
 		if (opt == TCPOPT_NOP)
 			optlen = 1;
 		else {
 			optlen = cp[1];
 			if (optlen <= 0)
 				break;
 		}
 
 		switch (opt) {
 
 		default:
 			break;
 
 		case TCPOPT_MAXSEG:
 			bits |= IP_FW_TCPOPT_MSS;
 			break;
 
 		case TCPOPT_WINDOW:
 			bits |= IP_FW_TCPOPT_WINDOW;
 			break;
 
 		case TCPOPT_SACK_PERMITTED:
 		case TCPOPT_SACK:
 			bits |= IP_FW_TCPOPT_SACK;
 			break;
 
 		case TCPOPT_TIMESTAMP:
 			bits |= IP_FW_TCPOPT_TS;
 			break;
 
 		}
 	}
 	return (flags_match(cmd, bits));
 }
 
 static int
 iface_match(struct ifnet *ifp, ipfw_insn_if *cmd, struct ip_fw_chain *chain,
     uint32_t *tablearg)
 {
 
 	if (ifp == NULL)	/* no iface with this packet, match fails */
 		return (0);
 
 	/* Check by name or by IP address */
 	if (cmd->name[0] != '\0') { /* match by name */
 		if (cmd->name[0] == '\1') /* use tablearg to match */
 			return ipfw_lookup_table_extended(chain, cmd->p.kidx, 0,
 				&ifp->if_index, tablearg);
 		/* Check name */
 		if (cmd->p.glob) {
 			if (fnmatch(cmd->name, ifp->if_xname, 0) == 0)
 				return(1);
 		} else {
 			if (strncmp(ifp->if_xname, cmd->name, IFNAMSIZ) == 0)
 				return(1);
 		}
 	} else {
 #if !defined(USERSPACE) && defined(__FreeBSD__)	/* and OSX too ? */
 		struct ifaddr *ia;
 
 		if_addr_rlock(ifp);
 		TAILQ_FOREACH(ia, &ifp->if_addrhead, ifa_link) {
 			if (ia->ifa_addr->sa_family != AF_INET)
 				continue;
 			if (cmd->p.ip.s_addr == ((struct sockaddr_in *)
 			    (ia->ifa_addr))->sin_addr.s_addr) {
 				if_addr_runlock(ifp);
 				return(1);	/* match */
 			}
 		}
 		if_addr_runlock(ifp);
 #endif /* __FreeBSD__ */
 	}
 	return(0);	/* no match, fail ... */
 }
 
 /*
  * The verify_path function checks if a route to the src exists and
  * if it is reachable via ifp (when provided).
  * 
  * The 'verrevpath' option checks that the interface that an IP packet
  * arrives on is the same interface that traffic destined for the
  * packet's source address would be routed out of.
  * The 'versrcreach' option just checks that the source address is
  * reachable via any route (except default) in the routing table.
  * These two are a measure to block forged packets. This is also
  * commonly known as "anti-spoofing" or Unicast Reverse Path
  * Forwarding (Unicast RFP) in Cisco-ese. The name of the knobs
  * is purposely reminiscent of the Cisco IOS command,
  *
  *   ip verify unicast reverse-path
  *   ip verify unicast source reachable-via any
  *
  * which implements the same functionality. But note that the syntax
  * is misleading, and the check may be performed on all IP packets
  * whether unicast, multicast, or broadcast.
  */
 static int
 verify_path(struct in_addr src, struct ifnet *ifp, u_int fib)
 {
 #if defined(USERSPACE) || !defined(__FreeBSD__)
 	return 0;
 #else
 	struct nhop4_basic nh4;
 
 	if (fib4_lookup_nh_basic(fib, src, NHR_IFAIF, 0, &nh4) != 0)
 		return (0);
 
 	/*
 	 * If ifp is provided, check for equality with rtentry.
 	 * We should use rt->rt_ifa->ifa_ifp, instead of rt->rt_ifp,
 	 * in order to pass packets injected back by if_simloop():
 	 * routing entry (via lo0) for our own address
 	 * may exist, so we need to handle routing assymetry.
 	 */
 	if (ifp != NULL && ifp != nh4.nh_ifp)
 		return (0);
 
 	/* if no ifp provided, check if rtentry is not default route */
 	if (ifp == NULL && (nh4.nh_flags & NHF_DEFAULT) != 0)
 		return (0);
 
 	/* or if this is a blackhole/reject route */
 	if (ifp == NULL && (nh4.nh_flags & (NHF_REJECT|NHF_BLACKHOLE)) != 0)
 		return (0);
 
 	/* found valid route */
 	return 1;
 #endif /* __FreeBSD__ */
 }
 
 #ifdef INET6
 /*
  * ipv6 specific rules here...
  */
 static __inline int
 icmp6type_match (int type, ipfw_insn_u32 *cmd)
 {
 	return (type <= ICMP6_MAXTYPE && (cmd->d[type/32] & (1<<(type%32)) ) );
 }
 
 static int
 flow6id_match( int curr_flow, ipfw_insn_u32 *cmd )
 {
 	int i;
 	for (i=0; i <= cmd->o.arg1; ++i )
 		if (curr_flow == cmd->d[i] )
 			return 1;
 	return 0;
 }
 
 /* support for IP6_*_ME opcodes */
 static const struct in6_addr lla_mask = {{{
 	0xff, 0xff, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff,
 	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
 }}};
 
 static int
 ipfw_localip6(struct in6_addr *in6)
 {
 	struct rm_priotracker in6_ifa_tracker;
 	struct in6_ifaddr *ia;
 
 	if (IN6_IS_ADDR_MULTICAST(in6))
 		return (0);
 
 	if (!IN6_IS_ADDR_LINKLOCAL(in6))
 		return (in6_localip(in6));
 
 	IN6_IFADDR_RLOCK(&in6_ifa_tracker);
 	TAILQ_FOREACH(ia, &V_in6_ifaddrhead, ia_link) {
 		if (!IN6_IS_ADDR_LINKLOCAL(&ia->ia_addr.sin6_addr))
 			continue;
 		if (IN6_ARE_MASKED_ADDR_EQUAL(&ia->ia_addr.sin6_addr,
 		    in6, &lla_mask)) {
 			IN6_IFADDR_RUNLOCK(&in6_ifa_tracker);
 			return (1);
 		}
 	}
 	IN6_IFADDR_RUNLOCK(&in6_ifa_tracker);
 	return (0);
 }
 
 static int
 verify_path6(struct in6_addr *src, struct ifnet *ifp, u_int fib)
 {
 	struct nhop6_basic nh6;
 
 	if (IN6_IS_SCOPE_LINKLOCAL(src))
 		return (1);
 
 	if (fib6_lookup_nh_basic(fib, src, 0, NHR_IFAIF, 0, &nh6) != 0)
 		return (0);
 
 	/* If ifp is provided, check for equality with route table. */
 	if (ifp != NULL && ifp != nh6.nh_ifp)
 		return (0);
 
 	/* if no ifp provided, check if rtentry is not default route */
 	if (ifp == NULL && (nh6.nh_flags & NHF_DEFAULT) != 0)
 		return (0);
 
 	/* or if this is a blackhole/reject route */
 	if (ifp == NULL && (nh6.nh_flags & (NHF_REJECT|NHF_BLACKHOLE)) != 0)
 		return (0);
 
 	/* found valid route */
 	return 1;
 }
 
 static int
 is_icmp6_query(int icmp6_type)
 {
 	if ((icmp6_type <= ICMP6_MAXTYPE) &&
 	    (icmp6_type == ICMP6_ECHO_REQUEST ||
 	    icmp6_type == ICMP6_MEMBERSHIP_QUERY ||
 	    icmp6_type == ICMP6_WRUREQUEST ||
 	    icmp6_type == ICMP6_FQDN_QUERY ||
 	    icmp6_type == ICMP6_NI_QUERY))
 		return (1);
 
 	return (0);
 }
 
 static void
 send_reject6(struct ip_fw_args *args, int code, u_int hlen, struct ip6_hdr *ip6)
 {
 	struct mbuf *m;
 
 	m = args->m;
 	if (code == ICMP6_UNREACH_RST && args->f_id.proto == IPPROTO_TCP) {
 		struct tcphdr *tcp;
 		tcp = (struct tcphdr *)((char *)ip6 + hlen);
 
 		if ((tcp->th_flags & TH_RST) == 0) {
 			struct mbuf *m0;
 			m0 = ipfw_send_pkt(args->m, &(args->f_id),
 			    ntohl(tcp->th_seq), ntohl(tcp->th_ack),
 			    tcp->th_flags | TH_RST);
 			if (m0 != NULL)
 				ip6_output(m0, NULL, NULL, 0, NULL, NULL,
 				    NULL);
 		}
 		FREE_PKT(m);
 	} else if (code != ICMP6_UNREACH_RST) { /* Send an ICMPv6 unreach. */
 #if 0
 		/*
 		 * Unlike above, the mbufs need to line up with the ip6 hdr,
 		 * as the contents are read. We need to m_adj() the
 		 * needed amount.
 		 * The mbuf will however be thrown away so we can adjust it.
 		 * Remember we did an m_pullup on it already so we
 		 * can make some assumptions about contiguousness.
 		 */
 		if (args->L3offset)
 			m_adj(m, args->L3offset);
 #endif
 		icmp6_error(m, ICMP6_DST_UNREACH, code, 0);
 	} else
 		FREE_PKT(m);
 
 	args->m = NULL;
 }
 
 #endif /* INET6 */
 
 
 /*
  * sends a reject message, consuming the mbuf passed as an argument.
  */
 static void
 send_reject(struct ip_fw_args *args, int code, int iplen, struct ip *ip)
 {
 
 #if 0
 	/* XXX When ip is not guaranteed to be at mtod() we will
 	 * need to account for this */
 	 * The mbuf will however be thrown away so we can adjust it.
 	 * Remember we did an m_pullup on it already so we
 	 * can make some assumptions about contiguousness.
 	 */
 	if (args->L3offset)
 		m_adj(m, args->L3offset);
 #endif
 	if (code != ICMP_REJECT_RST) { /* Send an ICMP unreach */
 		icmp_error(args->m, ICMP_UNREACH, code, 0L, 0);
 	} else if (args->f_id.proto == IPPROTO_TCP) {
 		struct tcphdr *const tcp =
 		    L3HDR(struct tcphdr, mtod(args->m, struct ip *));
 		if ( (tcp->th_flags & TH_RST) == 0) {
 			struct mbuf *m;
 			m = ipfw_send_pkt(args->m, &(args->f_id),
 				ntohl(tcp->th_seq), ntohl(tcp->th_ack),
 				tcp->th_flags | TH_RST);
 			if (m != NULL)
 				ip_output(m, NULL, NULL, 0, NULL, NULL);
 		}
 		FREE_PKT(args->m);
 	} else
 		FREE_PKT(args->m);
 	args->m = NULL;
 }
 
 /*
  * Support for uid/gid/jail lookup. These tests are expensive
  * (because we may need to look into the list of active sockets)
  * so we cache the results. ugid_lookupp is 0 if we have not
  * yet done a lookup, 1 if we succeeded, and -1 if we tried
  * and failed. The function always returns the match value.
  * We could actually spare the variable and use *uc, setting
  * it to '(void *)check_uidgid if we have no info, NULL if
  * we tried and failed, or any other value if successful.
  */
 static int
 check_uidgid(ipfw_insn_u32 *insn, struct ip_fw_args *args, int *ugid_lookupp,
     struct ucred **uc)
 {
 #if defined(USERSPACE)
 	return 0;	// not supported in userspace
 #else
 #ifndef __FreeBSD__
 	/* XXX */
 	return cred_check(insn, proto, oif,
 	    dst_ip, dst_port, src_ip, src_port,
 	    (struct bsd_ucred *)uc, ugid_lookupp, ((struct mbuf *)inp)->m_skb);
 #else  /* FreeBSD */
 	struct in_addr src_ip, dst_ip;
 	struct inpcbinfo *pi;
 	struct ipfw_flow_id *id;
 	struct inpcb *pcb, *inp;
 	struct ifnet *oif;
 	int lookupflags;
 	int match;
 
 	id = &args->f_id;
 	inp = args->inp;
 	oif = args->oif;
 
 	/*
 	 * Check to see if the UDP or TCP stack supplied us with
 	 * the PCB. If so, rather then holding a lock and looking
 	 * up the PCB, we can use the one that was supplied.
 	 */
 	if (inp && *ugid_lookupp == 0) {
 		INP_LOCK_ASSERT(inp);
 		if (inp->inp_socket != NULL) {
 			*uc = crhold(inp->inp_cred);
 			*ugid_lookupp = 1;
 		} else
 			*ugid_lookupp = -1;
 	}
 	/*
 	 * If we have already been here and the packet has no
 	 * PCB entry associated with it, then we can safely
 	 * assume that this is a no match.
 	 */
 	if (*ugid_lookupp == -1)
 		return (0);
 	if (id->proto == IPPROTO_TCP) {
 		lookupflags = 0;
 		pi = &V_tcbinfo;
 	} else if (id->proto == IPPROTO_UDP) {
 		lookupflags = INPLOOKUP_WILDCARD;
 		pi = &V_udbinfo;
 	} else
 		return 0;
 	lookupflags |= INPLOOKUP_RLOCKPCB;
 	match = 0;
 	if (*ugid_lookupp == 0) {
 		if (id->addr_type == 6) {
 #ifdef INET6
 			if (oif == NULL)
 				pcb = in6_pcblookup_mbuf(pi,
 				    &id->src_ip6, htons(id->src_port),
 				    &id->dst_ip6, htons(id->dst_port),
 				    lookupflags, oif, args->m);
 			else
 				pcb = in6_pcblookup_mbuf(pi,
 				    &id->dst_ip6, htons(id->dst_port),
 				    &id->src_ip6, htons(id->src_port),
 				    lookupflags, oif, args->m);
 #else
 			*ugid_lookupp = -1;
 			return (0);
 #endif
 		} else {
 			src_ip.s_addr = htonl(id->src_ip);
 			dst_ip.s_addr = htonl(id->dst_ip);
 			if (oif == NULL)
 				pcb = in_pcblookup_mbuf(pi,
 				    src_ip, htons(id->src_port),
 				    dst_ip, htons(id->dst_port),
 				    lookupflags, oif, args->m);
 			else
 				pcb = in_pcblookup_mbuf(pi,
 				    dst_ip, htons(id->dst_port),
 				    src_ip, htons(id->src_port),
 				    lookupflags, oif, args->m);
 		}
 		if (pcb != NULL) {
 			INP_RLOCK_ASSERT(pcb);
 			*uc = crhold(pcb->inp_cred);
 			*ugid_lookupp = 1;
 			INP_RUNLOCK(pcb);
 		}
 		if (*ugid_lookupp == 0) {
 			/*
 			 * We tried and failed, set the variable to -1
 			 * so we will not try again on this packet.
 			 */
 			*ugid_lookupp = -1;
 			return (0);
 		}
 	}
 	if (insn->o.opcode == O_UID)
 		match = ((*uc)->cr_uid == (uid_t)insn->d[0]);
 	else if (insn->o.opcode == O_GID)
 		match = groupmember((gid_t)insn->d[0], *uc);
 	else if (insn->o.opcode == O_JAIL)
 		match = ((*uc)->cr_prison->pr_id == (int)insn->d[0]);
 	return (match);
 #endif /* __FreeBSD__ */
 #endif /* not supported in userspace */
 }
 
 /*
  * Helper function to set args with info on the rule after the matching
  * one. slot is precise, whereas we guess rule_id as they are
  * assigned sequentially.
  */
 static inline void
 set_match(struct ip_fw_args *args, int slot,
 	struct ip_fw_chain *chain)
 {
 	args->rule.chain_id = chain->id;
 	args->rule.slot = slot + 1; /* we use 0 as a marker */
 	args->rule.rule_id = 1 + chain->map[slot]->id;
 	args->rule.rulenum = chain->map[slot]->rulenum;
 }
 
 #ifndef LINEAR_SKIPTO
 /*
  * Helper function to enable cached rule lookups using
  * cached_id and cached_pos fields in ipfw rule.
  */
 static int
 jump_fast(struct ip_fw_chain *chain, struct ip_fw *f, int num,
     int tablearg, int jump_backwards)
 {
 	int f_pos;
 
 	/* If possible use cached f_pos (in f->cached_pos),
 	 * whose version is written in f->cached_id
 	 * (horrible hacks to avoid changing the ABI).
 	 */
 	if (num != IP_FW_TARG && f->cached_id == chain->id)
 		f_pos = f->cached_pos;
 	else {
 		int i = IP_FW_ARG_TABLEARG(chain, num, skipto);
 		/* make sure we do not jump backward */
 		if (jump_backwards == 0 && i <= f->rulenum)
 			i = f->rulenum + 1;
 		if (chain->idxmap != NULL)
 			f_pos = chain->idxmap[i];
 		else
 			f_pos = ipfw_find_rule(chain, i, 0);
 		/* update the cache */
 		if (num != IP_FW_TARG) {
 			f->cached_id = chain->id;
 			f->cached_pos = f_pos;
 		}
 	}
 
 	return (f_pos);
 }
 #else
 /*
  * Helper function to enable real fast rule lookups.
  */
 static int
 jump_linear(struct ip_fw_chain *chain, struct ip_fw *f, int num,
     int tablearg, int jump_backwards)
 {
 	int f_pos;
 
 	num = IP_FW_ARG_TABLEARG(chain, num, skipto);
 	/* make sure we do not jump backward */
 	if (jump_backwards == 0 && num <= f->rulenum)
 		num = f->rulenum + 1;
 	f_pos = chain->idxmap[num];
 
 	return (f_pos);
 }
 #endif
 
 #define	TARG(k, f)	IP_FW_ARG_TABLEARG(chain, k, f)
 /*
  * The main check routine for the firewall.
  *
  * All arguments are in args so we can modify them and return them
  * back to the caller.
  *
  * Parameters:
  *
  *	args->m	(in/out) The packet; we set to NULL when/if we nuke it.
  *		Starts with the IP header.
  *	args->eh (in)	Mac header if present, NULL for layer3 packet.
  *	args->L3offset	Number of bytes bypassed if we came from L2.
  *			e.g. often sizeof(eh)  ** NOTYET **
  *	args->oif	Outgoing interface, NULL if packet is incoming.
  *		The incoming interface is in the mbuf. (in)
  *	args->divert_rule (in/out)
  *		Skip up to the first rule past this rule number;
  *		upon return, non-zero port number for divert or tee.
  *
  *	args->rule	Pointer to the last matching rule (in/out)
  *	args->next_hop	Socket we are forwarding to (out).
  *	args->next_hop6	IPv6 next hop we are forwarding to (out).
  *	args->f_id	Addresses grabbed from the packet (out)
  * 	args->rule.info	a cookie depending on rule action
  *
  * Return value:
  *
  *	IP_FW_PASS	the packet must be accepted
  *	IP_FW_DENY	the packet must be dropped
  *	IP_FW_DIVERT	divert packet, port in m_tag
  *	IP_FW_TEE	tee packet, port in m_tag
  *	IP_FW_DUMMYNET	to dummynet, pipe in args->cookie
  *	IP_FW_NETGRAPH	into netgraph, cookie args->cookie
  *		args->rule contains the matching rule,
  *		args->rule.info has additional information.
  *
  */
 int
 ipfw_chk(struct ip_fw_args *args)
 {
 
 	/*
 	 * Local variables holding state while processing a packet:
 	 *
 	 * IMPORTANT NOTE: to speed up the processing of rules, there
 	 * are some assumption on the values of the variables, which
 	 * are documented here. Should you change them, please check
 	 * the implementation of the various instructions to make sure
 	 * that they still work.
 	 *
 	 * args->eh	The MAC header. It is non-null for a layer2
 	 *	packet, it is NULL for a layer-3 packet.
 	 * **notyet**
 	 * args->L3offset Offset in the packet to the L3 (IP or equiv.) header.
 	 *
 	 * m | args->m	Pointer to the mbuf, as received from the caller.
 	 *	It may change if ipfw_chk() does an m_pullup, or if it
 	 *	consumes the packet because it calls send_reject().
 	 *	XXX This has to change, so that ipfw_chk() never modifies
 	 *	or consumes the buffer.
 	 * ip	is the beginning of the ip(4 or 6) header.
 	 *	Calculated by adding the L3offset to the start of data.
 	 *	(Until we start using L3offset, the packet is
 	 *	supposed to start with the ip header).
 	 */
 	struct mbuf *m = args->m;
 	struct ip *ip = mtod(m, struct ip *);
 
 	/*
 	 * For rules which contain uid/gid or jail constraints, cache
 	 * a copy of the users credentials after the pcb lookup has been
 	 * executed. This will speed up the processing of rules with
 	 * these types of constraints, as well as decrease contention
 	 * on pcb related locks.
 	 */
 #ifndef __FreeBSD__
 	struct bsd_ucred ucred_cache;
 #else
 	struct ucred *ucred_cache = NULL;
 #endif
 	int ucred_lookup = 0;
 
 	/*
 	 * oif | args->oif	If NULL, ipfw_chk has been called on the
 	 *	inbound path (ether_input, ip_input).
 	 *	If non-NULL, ipfw_chk has been called on the outbound path
 	 *	(ether_output, ip_output).
 	 */
 	struct ifnet *oif = args->oif;
 
 	int f_pos = 0;		/* index of current rule in the array */
 	int retval = 0;
 
 	/*
 	 * hlen	The length of the IP header.
 	 */
 	u_int hlen = 0;		/* hlen >0 means we have an IP pkt */
 
 	/*
 	 * offset	The offset of a fragment. offset != 0 means that
 	 *	we have a fragment at this offset of an IPv4 packet.
 	 *	offset == 0 means that (if this is an IPv4 packet)
 	 *	this is the first or only fragment.
 	 *	For IPv6 offset|ip6f_mf == 0 means there is no Fragment Header
 	 *	or there is a single packet fragment (fragment header added
 	 *	without needed).  We will treat a single packet fragment as if
 	 *	there was no fragment header (or log/block depending on the
 	 *	V_fw_permit_single_frag6 sysctl setting).
 	 */
 	u_short offset = 0;
 	u_short ip6f_mf = 0;
 
 	/*
 	 * Local copies of addresses. They are only valid if we have
 	 * an IP packet.
 	 *
 	 * proto	The protocol. Set to 0 for non-ip packets,
 	 *	or to the protocol read from the packet otherwise.
 	 *	proto != 0 means that we have an IPv4 packet.
 	 *
 	 * src_port, dst_port	port numbers, in HOST format. Only
 	 *	valid for TCP and UDP packets.
 	 *
 	 * src_ip, dst_ip	ip addresses, in NETWORK format.
 	 *	Only valid for IPv4 packets.
 	 */
 	uint8_t proto;
 	uint16_t src_port = 0, dst_port = 0;	/* NOTE: host format	*/
 	struct in_addr src_ip, dst_ip;		/* NOTE: network format	*/
 	uint16_t iplen=0;
 	int pktlen;
 	uint16_t	etype = 0;	/* Host order stored ether type */
 
 	/*
 	 * dyn_dir = MATCH_UNKNOWN when rules unchecked,
 	 * 	MATCH_NONE when checked and not matched (q = NULL),
 	 *	MATCH_FORWARD or MATCH_REVERSE otherwise (q != NULL)
 	 */
 	int dyn_dir = MATCH_UNKNOWN;
 	uint16_t dyn_name = 0;
 	ipfw_dyn_rule *q = NULL;
 	struct ip_fw_chain *chain = &V_layer3_chain;
 
 	/*
 	 * We store in ulp a pointer to the upper layer protocol header.
 	 * In the ipv4 case this is easy to determine from the header,
 	 * but for ipv6 we might have some additional headers in the middle.
 	 * ulp is NULL if not found.
 	 */
 	void *ulp = NULL;		/* upper layer protocol pointer. */
 
 	/* XXX ipv6 variables */
 	int is_ipv6 = 0;
 	uint8_t	icmp6_type = 0;
 	uint16_t ext_hd = 0;	/* bits vector for extension header filtering */
 	/* end of ipv6 variables */
 
 	int is_ipv4 = 0;
 
 	int done = 0;		/* flag to exit the outer loop */
 	IPFW_RLOCK_TRACKER;
 
 	if (m->m_flags & M_SKIP_FIREWALL || (! V_ipfw_vnet_ready))
 		return (IP_FW_PASS);	/* accept */
 
 	dst_ip.s_addr = 0;		/* make sure it is initialized */
 	src_ip.s_addr = 0;		/* make sure it is initialized */
 	pktlen = m->m_pkthdr.len;
 	args->f_id.fib = M_GETFIB(m); /* note mbuf not altered) */
 	proto = args->f_id.proto = 0;	/* mark f_id invalid */
 		/* XXX 0 is a valid proto: IP/IPv6 Hop-by-Hop Option */
 
 /*
  * PULLUP_TO(len, p, T) makes sure that len + sizeof(T) is contiguous,
  * then it sets p to point at the offset "len" in the mbuf. WARNING: the
  * pointer might become stale after other pullups (but we never use it
  * this way).
  */
 #define PULLUP_TO(_len, p, T)	PULLUP_LEN(_len, p, sizeof(T))
 #define PULLUP_LEN(_len, p, T)					\
 do {								\
 	int x = (_len) + T;					\
 	if ((m)->m_len < x) {					\
 		args->m = m = m_pullup(m, x);			\
 		if (m == NULL)					\
 			goto pullup_failed;			\
 	}							\
 	p = (mtod(m, char *) + (_len));				\
 } while (0)
 
 	/*
 	 * if we have an ether header,
 	 */
 	if (args->eh)
 		etype = ntohs(args->eh->ether_type);
 
 	/* Identify IP packets and fill up variables. */
 	if (pktlen >= sizeof(struct ip6_hdr) &&
 	    (args->eh == NULL || etype == ETHERTYPE_IPV6) && ip->ip_v == 6) {
 		struct ip6_hdr *ip6 = (struct ip6_hdr *)ip;
 		is_ipv6 = 1;
 		args->f_id.addr_type = 6;
 		hlen = sizeof(struct ip6_hdr);
 		proto = ip6->ip6_nxt;
 
 		/* Search extension headers to find upper layer protocols */
 		while (ulp == NULL && offset == 0) {
 			switch (proto) {
 			case IPPROTO_ICMPV6:
 				PULLUP_TO(hlen, ulp, struct icmp6_hdr);
 				icmp6_type = ICMP6(ulp)->icmp6_type;
 				break;
 
 			case IPPROTO_TCP:
 				PULLUP_TO(hlen, ulp, struct tcphdr);
 				dst_port = TCP(ulp)->th_dport;
 				src_port = TCP(ulp)->th_sport;
 				/* save flags for dynamic rules */
 				args->f_id._flags = TCP(ulp)->th_flags;
 				break;
 
 			case IPPROTO_SCTP:
 				PULLUP_TO(hlen, ulp, struct sctphdr);
 				src_port = SCTP(ulp)->src_port;
 				dst_port = SCTP(ulp)->dest_port;
 				break;
 
 			case IPPROTO_UDP:
 				PULLUP_TO(hlen, ulp, struct udphdr);
 				dst_port = UDP(ulp)->uh_dport;
 				src_port = UDP(ulp)->uh_sport;
 				break;
 
 			case IPPROTO_HOPOPTS:	/* RFC 2460 */
 				PULLUP_TO(hlen, ulp, struct ip6_hbh);
 				ext_hd |= EXT_HOPOPTS;
 				hlen += (((struct ip6_hbh *)ulp)->ip6h_len + 1) << 3;
 				proto = ((struct ip6_hbh *)ulp)->ip6h_nxt;
 				ulp = NULL;
 				break;
 
 			case IPPROTO_ROUTING:	/* RFC 2460 */
 				PULLUP_TO(hlen, ulp, struct ip6_rthdr);
 				switch (((struct ip6_rthdr *)ulp)->ip6r_type) {
 				case 0:
 					ext_hd |= EXT_RTHDR0;
 					break;
 				case 2:
 					ext_hd |= EXT_RTHDR2;
 					break;
 				default:
 					if (V_fw_verbose)
 						printf("IPFW2: IPV6 - Unknown "
 						    "Routing Header type(%d)\n",
 						    ((struct ip6_rthdr *)
 						    ulp)->ip6r_type);
 					if (V_fw_deny_unknown_exthdrs)
 					    return (IP_FW_DENY);
 					break;
 				}
 				ext_hd |= EXT_ROUTING;
 				hlen += (((struct ip6_rthdr *)ulp)->ip6r_len + 1) << 3;
 				proto = ((struct ip6_rthdr *)ulp)->ip6r_nxt;
 				ulp = NULL;
 				break;
 
 			case IPPROTO_FRAGMENT:	/* RFC 2460 */
 				PULLUP_TO(hlen, ulp, struct ip6_frag);
 				ext_hd |= EXT_FRAGMENT;
 				hlen += sizeof (struct ip6_frag);
 				proto = ((struct ip6_frag *)ulp)->ip6f_nxt;
 				offset = ((struct ip6_frag *)ulp)->ip6f_offlg &
 					IP6F_OFF_MASK;
 				ip6f_mf = ((struct ip6_frag *)ulp)->ip6f_offlg &
 					IP6F_MORE_FRAG;
 				if (V_fw_permit_single_frag6 == 0 &&
 				    offset == 0 && ip6f_mf == 0) {
 					if (V_fw_verbose)
 						printf("IPFW2: IPV6 - Invalid "
 						    "Fragment Header\n");
 					if (V_fw_deny_unknown_exthdrs)
 					    return (IP_FW_DENY);
 					break;
 				}
 				args->f_id.extra =
 				    ntohl(((struct ip6_frag *)ulp)->ip6f_ident);
 				ulp = NULL;
 				break;
 
 			case IPPROTO_DSTOPTS:	/* RFC 2460 */
 				PULLUP_TO(hlen, ulp, struct ip6_hbh);
 				ext_hd |= EXT_DSTOPTS;
 				hlen += (((struct ip6_hbh *)ulp)->ip6h_len + 1) << 3;
 				proto = ((struct ip6_hbh *)ulp)->ip6h_nxt;
 				ulp = NULL;
 				break;
 
 			case IPPROTO_AH:	/* RFC 2402 */
 				PULLUP_TO(hlen, ulp, struct ip6_ext);
 				ext_hd |= EXT_AH;
 				hlen += (((struct ip6_ext *)ulp)->ip6e_len + 2) << 2;
 				proto = ((struct ip6_ext *)ulp)->ip6e_nxt;
 				ulp = NULL;
 				break;
 
 			case IPPROTO_ESP:	/* RFC 2406 */
 				PULLUP_TO(hlen, ulp, uint32_t);	/* SPI, Seq# */
 				/* Anything past Seq# is variable length and
 				 * data past this ext. header is encrypted. */
 				ext_hd |= EXT_ESP;
 				break;
 
 			case IPPROTO_NONE:	/* RFC 2460 */
 				/*
 				 * Packet ends here, and IPv6 header has
 				 * already been pulled up. If ip6e_len!=0
 				 * then octets must be ignored.
 				 */
 				ulp = ip; /* non-NULL to get out of loop. */
 				break;
 
 			case IPPROTO_OSPFIGP:
 				/* XXX OSPF header check? */
 				PULLUP_TO(hlen, ulp, struct ip6_ext);
 				break;
 
 			case IPPROTO_PIM:
 				/* XXX PIM header check? */
 				PULLUP_TO(hlen, ulp, struct pim);
 				break;
 
 			case IPPROTO_CARP:
 				PULLUP_TO(hlen, ulp, struct carp_header);
 				if (((struct carp_header *)ulp)->carp_version !=
 				    CARP_VERSION) 
 					return (IP_FW_DENY);
 				if (((struct carp_header *)ulp)->carp_type !=
 				    CARP_ADVERTISEMENT) 
 					return (IP_FW_DENY);
 				break;
 
 			case IPPROTO_IPV6:	/* RFC 2893 */
 				PULLUP_TO(hlen, ulp, struct ip6_hdr);
 				break;
 
 			case IPPROTO_IPV4:	/* RFC 2893 */
 				PULLUP_TO(hlen, ulp, struct ip);
 				break;
 
 			default:
 				if (V_fw_verbose)
 					printf("IPFW2: IPV6 - Unknown "
 					    "Extension Header(%d), ext_hd=%x\n",
 					     proto, ext_hd);
 				if (V_fw_deny_unknown_exthdrs)
 				    return (IP_FW_DENY);
 				PULLUP_TO(hlen, ulp, struct ip6_ext);
 				break;
 			} /*switch */
 		}
 		ip = mtod(m, struct ip *);
 		ip6 = (struct ip6_hdr *)ip;
 		args->f_id.src_ip6 = ip6->ip6_src;
 		args->f_id.dst_ip6 = ip6->ip6_dst;
 		args->f_id.src_ip = 0;
 		args->f_id.dst_ip = 0;
 		args->f_id.flow_id6 = ntohl(ip6->ip6_flow);
 	} else if (pktlen >= sizeof(struct ip) &&
 	    (args->eh == NULL || etype == ETHERTYPE_IP) && ip->ip_v == 4) {
 	    	is_ipv4 = 1;
 		hlen = ip->ip_hl << 2;
 		args->f_id.addr_type = 4;
 
 		/*
 		 * Collect parameters into local variables for faster matching.
 		 */
 		proto = ip->ip_p;
 		src_ip = ip->ip_src;
 		dst_ip = ip->ip_dst;
 		offset = ntohs(ip->ip_off) & IP_OFFMASK;
 		iplen = ntohs(ip->ip_len);
 		pktlen = iplen < pktlen ? iplen : pktlen;
 
 		if (offset == 0) {
 			switch (proto) {
 			case IPPROTO_TCP:
 				PULLUP_TO(hlen, ulp, struct tcphdr);
 				dst_port = TCP(ulp)->th_dport;
 				src_port = TCP(ulp)->th_sport;
 				/* save flags for dynamic rules */
 				args->f_id._flags = TCP(ulp)->th_flags;
 				break;
 
 			case IPPROTO_SCTP:
 				PULLUP_TO(hlen, ulp, struct sctphdr);
 				src_port = SCTP(ulp)->src_port;
 				dst_port = SCTP(ulp)->dest_port;
 				break;
 
 			case IPPROTO_UDP:
 				PULLUP_TO(hlen, ulp, struct udphdr);
 				dst_port = UDP(ulp)->uh_dport;
 				src_port = UDP(ulp)->uh_sport;
 				break;
 
 			case IPPROTO_ICMP:
 				PULLUP_TO(hlen, ulp, struct icmphdr);
 				//args->f_id.flags = ICMP(ulp)->icmp_type;
 				break;
 
 			default:
 				break;
 			}
 		}
 
 		ip = mtod(m, struct ip *);
 		args->f_id.src_ip = ntohl(src_ip.s_addr);
 		args->f_id.dst_ip = ntohl(dst_ip.s_addr);
 	}
 #undef PULLUP_TO
 	if (proto) { /* we may have port numbers, store them */
 		args->f_id.proto = proto;
 		args->f_id.src_port = src_port = ntohs(src_port);
 		args->f_id.dst_port = dst_port = ntohs(dst_port);
 	}
 
 	IPFW_PF_RLOCK(chain);
 	if (! V_ipfw_vnet_ready) { /* shutting down, leave NOW. */
 		IPFW_PF_RUNLOCK(chain);
 		return (IP_FW_PASS);	/* accept */
 	}
 	if (args->rule.slot) {
 		/*
 		 * Packet has already been tagged as a result of a previous
 		 * match on rule args->rule aka args->rule_id (PIPE, QUEUE,
 		 * REASS, NETGRAPH, DIVERT/TEE...)
 		 * Validate the slot and continue from the next one
 		 * if still present, otherwise do a lookup.
 		 */
 		f_pos = (args->rule.chain_id == chain->id) ?
 		    args->rule.slot :
 		    ipfw_find_rule(chain, args->rule.rulenum,
 			args->rule.rule_id);
 	} else {
 		f_pos = 0;
 	}
 
 	/*
 	 * Now scan the rules, and parse microinstructions for each rule.
 	 * We have two nested loops and an inner switch. Sometimes we
 	 * need to break out of one or both loops, or re-enter one of
 	 * the loops with updated variables. Loop variables are:
 	 *
 	 *	f_pos (outer loop) points to the current rule.
 	 *		On output it points to the matching rule.
 	 *	done (outer loop) is used as a flag to break the loop.
 	 *	l (inner loop)	residual length of current rule.
 	 *		cmd points to the current microinstruction.
 	 *
 	 * We break the inner loop by setting l=0 and possibly
 	 * cmdlen=0 if we don't want to advance cmd.
 	 * We break the outer loop by setting done=1
 	 * We can restart the inner loop by setting l>0 and f_pos, f, cmd
 	 * as needed.
 	 */
 	for (; f_pos < chain->n_rules; f_pos++) {
 		ipfw_insn *cmd;
 		uint32_t tablearg = 0;
 		int l, cmdlen, skip_or; /* skip rest of OR block */
 		struct ip_fw *f;
 
 		f = chain->map[f_pos];
 		if (V_set_disable & (1 << f->set) )
 			continue;
 
 		skip_or = 0;
 		for (l = f->cmd_len, cmd = f->cmd ; l > 0 ;
 		    l -= cmdlen, cmd += cmdlen) {
 			int match;
 
 			/*
 			 * check_body is a jump target used when we find a
 			 * CHECK_STATE, and need to jump to the body of
 			 * the target rule.
 			 */
 
 /* check_body: */
 			cmdlen = F_LEN(cmd);
 			/*
 			 * An OR block (insn_1 || .. || insn_n) has the
 			 * F_OR bit set in all but the last instruction.
 			 * The first match will set "skip_or", and cause
 			 * the following instructions to be skipped until
 			 * past the one with the F_OR bit clear.
 			 */
 			if (skip_or) {		/* skip this instruction */
 				if ((cmd->len & F_OR) == 0)
 					skip_or = 0;	/* next one is good */
 				continue;
 			}
 			match = 0; /* set to 1 if we succeed */
 
 			switch (cmd->opcode) {
 			/*
 			 * The first set of opcodes compares the packet's
 			 * fields with some pattern, setting 'match' if a
 			 * match is found. At the end of the loop there is
 			 * logic to deal with F_NOT and F_OR flags associated
 			 * with the opcode.
 			 */
 			case O_NOP:
 				match = 1;
 				break;
 
 			case O_FORWARD_MAC:
 				printf("ipfw: opcode %d unimplemented\n",
 				    cmd->opcode);
 				break;
 
 			case O_GID:
 			case O_UID:
 			case O_JAIL:
 				/*
 				 * We only check offset == 0 && proto != 0,
 				 * as this ensures that we have a
 				 * packet with the ports info.
 				 */
 				if (offset != 0)
 					break;
 				if (proto == IPPROTO_TCP ||
 				    proto == IPPROTO_UDP)
 					match = check_uidgid(
 						    (ipfw_insn_u32 *)cmd,
 						    args, &ucred_lookup,
 #ifdef __FreeBSD__
 						    &ucred_cache);
 #else
 						    (void *)&ucred_cache);
 #endif
 				break;
 
 			case O_RECV:
 				match = iface_match(m->m_pkthdr.rcvif,
 				    (ipfw_insn_if *)cmd, chain, &tablearg);
 				break;
 
 			case O_XMIT:
 				match = iface_match(oif, (ipfw_insn_if *)cmd,
 				    chain, &tablearg);
 				break;
 
 			case O_VIA:
 				match = iface_match(oif ? oif :
 				    m->m_pkthdr.rcvif, (ipfw_insn_if *)cmd,
 				    chain, &tablearg);
 				break;
 
 			case O_MACADDR2:
 				if (args->eh != NULL) {	/* have MAC header */
 					u_int32_t *want = (u_int32_t *)
 						((ipfw_insn_mac *)cmd)->addr;
 					u_int32_t *mask = (u_int32_t *)
 						((ipfw_insn_mac *)cmd)->mask;
 					u_int32_t *hdr = (u_int32_t *)args->eh;
 
 					match =
 					    ( want[0] == (hdr[0] & mask[0]) &&
 					      want[1] == (hdr[1] & mask[1]) &&
 					      want[2] == (hdr[2] & mask[2]) );
 				}
 				break;
 
 			case O_MAC_TYPE:
 				if (args->eh != NULL) {
 					u_int16_t *p =
 					    ((ipfw_insn_u16 *)cmd)->ports;
 					int i;
 
 					for (i = cmdlen - 1; !match && i>0;
 					    i--, p += 2)
 						match = (etype >= p[0] &&
 						    etype <= p[1]);
 				}
 				break;
 
 			case O_FRAG:
 				match = (offset != 0);
 				break;
 
 			case O_IN:	/* "out" is "not in" */
 				match = (oif == NULL);
 				break;
 
 			case O_LAYER2:
 				match = (args->eh != NULL);
 				break;
 
 			case O_DIVERTED:
 			    {
 				/* For diverted packets, args->rule.info
 				 * contains the divert port (in host format)
 				 * reason and direction.
 				 */
 				uint32_t i = args->rule.info;
 				match = (i&IPFW_IS_MASK) == IPFW_IS_DIVERT &&
 				    cmd->arg1 & ((i & IPFW_INFO_IN) ? 1 : 2);
 			    }
 				break;
 
 			case O_PROTO:
 				/*
 				 * We do not allow an arg of 0 so the
 				 * check of "proto" only suffices.
 				 */
 				match = (proto == cmd->arg1);
 				break;
 
 			case O_IP_SRC:
 				match = is_ipv4 &&
 				    (((ipfw_insn_ip *)cmd)->addr.s_addr ==
 				    src_ip.s_addr);
 				break;
 
 			case O_IP_SRC_LOOKUP:
 			case O_IP_DST_LOOKUP:
 				if (is_ipv4) {
 				    uint32_t key =
 					(cmd->opcode == O_IP_DST_LOOKUP) ?
 					    dst_ip.s_addr : src_ip.s_addr;
 				    uint32_t v = 0;
 
 				    if (cmdlen > F_INSN_SIZE(ipfw_insn_u32)) {
 					/* generic lookup. The key must be
 					 * in 32bit big-endian format.
 					 */
 					v = ((ipfw_insn_u32 *)cmd)->d[1];
 					if (v == 0)
 					    key = dst_ip.s_addr;
 					else if (v == 1)
 					    key = src_ip.s_addr;
 					else if (v == 6) /* dscp */
 					    key = (ip->ip_tos >> 2) & 0x3f;
 					else if (offset != 0)
 					    break;
 					else if (proto != IPPROTO_TCP &&
 						proto != IPPROTO_UDP)
 					    break;
 					else if (v == 2)
 					    key = dst_port;
 					else if (v == 3)
 					    key = src_port;
 #ifndef USERSPACE
 					else if (v == 4 || v == 5) {
 					    check_uidgid(
 						(ipfw_insn_u32 *)cmd,
 						args, &ucred_lookup,
 #ifdef __FreeBSD__
 						&ucred_cache);
 					    if (v == 4 /* O_UID */)
 						key = ucred_cache->cr_uid;
 					    else if (v == 5 /* O_JAIL */)
 						key = ucred_cache->cr_prison->pr_id;
 #else /* !__FreeBSD__ */
 						(void *)&ucred_cache);
 					    if (v ==4 /* O_UID */)
 						key = ucred_cache.uid;
 					    else if (v == 5 /* O_JAIL */)
 						key = ucred_cache.xid;
 #endif /* !__FreeBSD__ */
 					}
 #endif /* !USERSPACE */
 					else
 					    break;
 				    }
 				    match = ipfw_lookup_table(chain,
 					cmd->arg1, key, &v);
 				    if (!match)
 					break;
 				    if (cmdlen == F_INSN_SIZE(ipfw_insn_u32))
 					match =
 					    ((ipfw_insn_u32 *)cmd)->d[0] == v;
 				    else
 					tablearg = v;
 				} else if (is_ipv6) {
 					uint32_t v = 0;
 					void *pkey = (cmd->opcode == O_IP_DST_LOOKUP) ?
 						&args->f_id.dst_ip6: &args->f_id.src_ip6;
 					match = ipfw_lookup_table_extended(chain,
 							cmd->arg1,
 							sizeof(struct in6_addr),
 							pkey, &v);
 					if (cmdlen == F_INSN_SIZE(ipfw_insn_u32))
 						match = ((ipfw_insn_u32 *)cmd)->d[0] == v;
 					if (match)
 						tablearg = v;
 				}
 				break;
 
 			case O_IP_FLOW_LOOKUP:
 				{
 					uint32_t v = 0;
 					match = ipfw_lookup_table_extended(chain,
 					    cmd->arg1, 0, &args->f_id, &v);
 					if (cmdlen == F_INSN_SIZE(ipfw_insn_u32))
 						match = ((ipfw_insn_u32 *)cmd)->d[0] == v;
 					if (match)
 						tablearg = v;
 				}
 				break;
 			case O_IP_SRC_MASK:
 			case O_IP_DST_MASK:
 				if (is_ipv4) {
 				    uint32_t a =
 					(cmd->opcode == O_IP_DST_MASK) ?
 					    dst_ip.s_addr : src_ip.s_addr;
 				    uint32_t *p = ((ipfw_insn_u32 *)cmd)->d;
 				    int i = cmdlen-1;
 
 				    for (; !match && i>0; i-= 2, p+= 2)
 					match = (p[0] == (a & p[1]));
 				}
 				break;
 
 			case O_IP_SRC_ME:
 				if (is_ipv4) {
 					struct ifnet *tif;
 
 					INADDR_TO_IFP(src_ip, tif);
 					match = (tif != NULL);
 					break;
 				}
 #ifdef INET6
 				/* FALLTHROUGH */
 			case O_IP6_SRC_ME:
 				match= is_ipv6 && ipfw_localip6(&args->f_id.src_ip6);
 #endif
 				break;
 
 			case O_IP_DST_SET:
 			case O_IP_SRC_SET:
 				if (is_ipv4) {
 					u_int32_t *d = (u_int32_t *)(cmd+1);
 					u_int32_t addr =
 					    cmd->opcode == O_IP_DST_SET ?
 						args->f_id.dst_ip :
 						args->f_id.src_ip;
 
 					    if (addr < d[0])
 						    break;
 					    addr -= d[0]; /* subtract base */
 					    match = (addr < cmd->arg1) &&
 						( d[ 1 + (addr>>5)] &
 						  (1<<(addr & 0x1f)) );
 				}
 				break;
 
 			case O_IP_DST:
 				match = is_ipv4 &&
 				    (((ipfw_insn_ip *)cmd)->addr.s_addr ==
 				    dst_ip.s_addr);
 				break;
 
 			case O_IP_DST_ME:
 				if (is_ipv4) {
 					struct ifnet *tif;
 
 					INADDR_TO_IFP(dst_ip, tif);
 					match = (tif != NULL);
 					break;
 				}
 #ifdef INET6
 				/* FALLTHROUGH */
 			case O_IP6_DST_ME:
 				match= is_ipv6 && ipfw_localip6(&args->f_id.dst_ip6);
 #endif
 				break;
 
 
 			case O_IP_SRCPORT:
 			case O_IP_DSTPORT:
 				/*
 				 * offset == 0 && proto != 0 is enough
 				 * to guarantee that we have a
 				 * packet with port info.
 				 */
 				if ((proto==IPPROTO_UDP || proto==IPPROTO_TCP)
 				    && offset == 0) {
 					u_int16_t x =
 					    (cmd->opcode == O_IP_SRCPORT) ?
 						src_port : dst_port ;
 					u_int16_t *p =
 					    ((ipfw_insn_u16 *)cmd)->ports;
 					int i;
 
 					for (i = cmdlen - 1; !match && i>0;
 					    i--, p += 2)
 						match = (x>=p[0] && x<=p[1]);
 				}
 				break;
 
 			case O_ICMPTYPE:
 				match = (offset == 0 && proto==IPPROTO_ICMP &&
 				    icmptype_match(ICMP(ulp), (ipfw_insn_u32 *)cmd) );
 				break;
 
 #ifdef INET6
 			case O_ICMP6TYPE:
 				match = is_ipv6 && offset == 0 &&
 				    proto==IPPROTO_ICMPV6 &&
 				    icmp6type_match(
 					ICMP6(ulp)->icmp6_type,
 					(ipfw_insn_u32 *)cmd);
 				break;
 #endif /* INET6 */
 
 			case O_IPOPT:
 				match = (is_ipv4 &&
 				    ipopts_match(ip, cmd) );
 				break;
 
 			case O_IPVER:
 				match = (is_ipv4 &&
 				    cmd->arg1 == ip->ip_v);
 				break;
 
 			case O_IPID:
 			case O_IPLEN:
 			case O_IPTTL:
 				if (is_ipv4) {	/* only for IP packets */
 				    uint16_t x;
 				    uint16_t *p;
 				    int i;
 
 				    if (cmd->opcode == O_IPLEN)
 					x = iplen;
 				    else if (cmd->opcode == O_IPTTL)
 					x = ip->ip_ttl;
 				    else /* must be IPID */
 					x = ntohs(ip->ip_id);
 				    if (cmdlen == 1) {
 					match = (cmd->arg1 == x);
 					break;
 				    }
 				    /* otherwise we have ranges */
 				    p = ((ipfw_insn_u16 *)cmd)->ports;
 				    i = cmdlen - 1;
 				    for (; !match && i>0; i--, p += 2)
 					match = (x >= p[0] && x <= p[1]);
 				}
 				break;
 
 			case O_IPPRECEDENCE:
 				match = (is_ipv4 &&
 				    (cmd->arg1 == (ip->ip_tos & 0xe0)) );
 				break;
 
 			case O_IPTOS:
 				match = (is_ipv4 &&
 				    flags_match(cmd, ip->ip_tos));
 				break;
 
 			case O_DSCP:
 			    {
 				uint32_t *p;
 				uint16_t x;
 
 				p = ((ipfw_insn_u32 *)cmd)->d;
 
 				if (is_ipv4)
 					x = ip->ip_tos >> 2;
 				else if (is_ipv6) {
 					uint8_t *v;
 					v = &((struct ip6_hdr *)ip)->ip6_vfc;
 					x = (*v & 0x0F) << 2;
 					v++;
 					x |= *v >> 6;
 				} else
 					break;
 
 				/* DSCP bitmask is stored as low_u32 high_u32 */
 				if (x >= 32)
 					match = *(p + 1) & (1 << (x - 32));
 				else
 					match = *p & (1 << x);
 			    }
 				break;
 
 			case O_TCPDATALEN:
 				if (proto == IPPROTO_TCP && offset == 0) {
 				    struct tcphdr *tcp;
 				    uint16_t x;
 				    uint16_t *p;
 				    int i;
 
 				    tcp = TCP(ulp);
 				    x = iplen -
 					((ip->ip_hl + tcp->th_off) << 2);
 				    if (cmdlen == 1) {
 					match = (cmd->arg1 == x);
 					break;
 				    }
 				    /* otherwise we have ranges */
 				    p = ((ipfw_insn_u16 *)cmd)->ports;
 				    i = cmdlen - 1;
 				    for (; !match && i>0; i--, p += 2)
 					match = (x >= p[0] && x <= p[1]);
 				}
 				break;
 
 			case O_TCPFLAGS:
 				match = (proto == IPPROTO_TCP && offset == 0 &&
 				    flags_match(cmd, TCP(ulp)->th_flags));
 				break;
 
 			case O_TCPOPTS:
 				if (proto == IPPROTO_TCP && offset == 0 && ulp){
 					PULLUP_LEN(hlen, ulp,
 					    (TCP(ulp)->th_off << 2));
 					match = tcpopts_match(TCP(ulp), cmd);
 				}
 				break;
 
 			case O_TCPSEQ:
 				match = (proto == IPPROTO_TCP && offset == 0 &&
 				    ((ipfw_insn_u32 *)cmd)->d[0] ==
 					TCP(ulp)->th_seq);
 				break;
 
 			case O_TCPACK:
 				match = (proto == IPPROTO_TCP && offset == 0 &&
 				    ((ipfw_insn_u32 *)cmd)->d[0] ==
 					TCP(ulp)->th_ack);
 				break;
 
 			case O_TCPWIN:
 				if (proto == IPPROTO_TCP && offset == 0) {
 				    uint16_t x;
 				    uint16_t *p;
 				    int i;
 
 				    x = ntohs(TCP(ulp)->th_win);
 				    if (cmdlen == 1) {
 					match = (cmd->arg1 == x);
 					break;
 				    }
 				    /* Otherwise we have ranges. */
 				    p = ((ipfw_insn_u16 *)cmd)->ports;
 				    i = cmdlen - 1;
 				    for (; !match && i > 0; i--, p += 2)
 					match = (x >= p[0] && x <= p[1]);
 				}
 				break;
 
 			case O_ESTAB:
 				/* reject packets which have SYN only */
 				/* XXX should i also check for TH_ACK ? */
 				match = (proto == IPPROTO_TCP && offset == 0 &&
 				    (TCP(ulp)->th_flags &
 				     (TH_RST | TH_ACK | TH_SYN)) != TH_SYN);
 				break;
 
 			case O_ALTQ: {
 				struct pf_mtag *at;
 				struct m_tag *mtag;
 				ipfw_insn_altq *altq = (ipfw_insn_altq *)cmd;
 
 				/*
 				 * ALTQ uses mbuf tags from another
 				 * packet filtering system - pf(4).
 				 * We allocate a tag in its format
 				 * and fill it in, pretending to be pf(4).
 				 */
 				match = 1;
 				at = pf_find_mtag(m);
 				if (at != NULL && at->qid != 0)
 					break;
 				mtag = m_tag_get(PACKET_TAG_PF,
 				    sizeof(struct pf_mtag), M_NOWAIT | M_ZERO);
 				if (mtag == NULL) {
 					/*
 					 * Let the packet fall back to the
 					 * default ALTQ.
 					 */
 					break;
 				}
 				m_tag_prepend(m, mtag);
 				at = (struct pf_mtag *)(mtag + 1);
 				at->qid = altq->qid;
 				at->hdr = ip;
 				break;
 			}
 
 			case O_LOG:
 				ipfw_log(chain, f, hlen, args, m,
 				    oif, offset | ip6f_mf, tablearg, ip);
 				match = 1;
 				break;
 
 			case O_PROB:
 				match = (random()<((ipfw_insn_u32 *)cmd)->d[0]);
 				break;
 
 			case O_VERREVPATH:
 				/* Outgoing packets automatically pass/match */
 				match = ((oif != NULL) ||
 				    (m->m_pkthdr.rcvif == NULL) ||
 				    (
 #ifdef INET6
 				    is_ipv6 ?
 					verify_path6(&(args->f_id.src_ip6),
 					    m->m_pkthdr.rcvif, args->f_id.fib) :
 #endif
 				    verify_path(src_ip, m->m_pkthdr.rcvif,
 				        args->f_id.fib)));
 				break;
 
 			case O_VERSRCREACH:
 				/* Outgoing packets automatically pass/match */
 				match = (hlen > 0 && ((oif != NULL) ||
 #ifdef INET6
 				    is_ipv6 ?
 				        verify_path6(&(args->f_id.src_ip6),
 				            NULL, args->f_id.fib) :
 #endif
 				    verify_path(src_ip, NULL, args->f_id.fib)));
 				break;
 
 			case O_ANTISPOOF:
 				/* Outgoing packets automatically pass/match */
 				if (oif == NULL && hlen > 0 &&
 				    (  (is_ipv4 && in_localaddr(src_ip))
 #ifdef INET6
 				    || (is_ipv6 &&
 				        in6_localaddr(&(args->f_id.src_ip6)))
 #endif
 				    ))
 					match =
 #ifdef INET6
 					    is_ipv6 ? verify_path6(
 					        &(args->f_id.src_ip6),
 					        m->m_pkthdr.rcvif,
 						args->f_id.fib) :
 #endif
 					    verify_path(src_ip,
 					    	m->m_pkthdr.rcvif,
 					        args->f_id.fib);
 				else
 					match = 1;
 				break;
 
 			case O_IPSEC:
 #ifdef IPSEC
 				match = (m_tag_find(m,
 				    PACKET_TAG_IPSEC_IN_DONE, NULL) != NULL);
 #endif
 				/* otherwise no match */
 				break;
 
 #ifdef INET6
 			case O_IP6_SRC:
 				match = is_ipv6 &&
 				    IN6_ARE_ADDR_EQUAL(&args->f_id.src_ip6,
 				    &((ipfw_insn_ip6 *)cmd)->addr6);
 				break;
 
 			case O_IP6_DST:
 				match = is_ipv6 &&
 				IN6_ARE_ADDR_EQUAL(&args->f_id.dst_ip6,
 				    &((ipfw_insn_ip6 *)cmd)->addr6);
 				break;
 			case O_IP6_SRC_MASK:
 			case O_IP6_DST_MASK:
 				if (is_ipv6) {
 					int i = cmdlen - 1;
 					struct in6_addr p;
 					struct in6_addr *d =
 					    &((ipfw_insn_ip6 *)cmd)->addr6;
 
 					for (; !match && i > 0; d += 2,
 					    i -= F_INSN_SIZE(struct in6_addr)
 					    * 2) {
 						p = (cmd->opcode ==
 						    O_IP6_SRC_MASK) ?
 						    args->f_id.src_ip6:
 						    args->f_id.dst_ip6;
 						APPLY_MASK(&p, &d[1]);
 						match =
 						    IN6_ARE_ADDR_EQUAL(&d[0],
 						    &p);
 					}
 				}
 				break;
 
 			case O_FLOW6ID:
 				match = is_ipv6 &&
 				    flow6id_match(args->f_id.flow_id6,
 				    (ipfw_insn_u32 *) cmd);
 				break;
 
 			case O_EXT_HDR:
 				match = is_ipv6 &&
 				    (ext_hd & ((ipfw_insn *) cmd)->arg1);
 				break;
 
 			case O_IP6:
 				match = is_ipv6;
 				break;
 #endif
 
 			case O_IP4:
 				match = is_ipv4;
 				break;
 
 			case O_TAG: {
 				struct m_tag *mtag;
 				uint32_t tag = TARG(cmd->arg1, tag);
 
 				/* Packet is already tagged with this tag? */
 				mtag = m_tag_locate(m, MTAG_IPFW, tag, NULL);
 
 				/* We have `untag' action when F_NOT flag is
 				 * present. And we must remove this mtag from
 				 * mbuf and reset `match' to zero (`match' will
 				 * be inversed later).
 				 * Otherwise we should allocate new mtag and
 				 * push it into mbuf.
 				 */
 				if (cmd->len & F_NOT) { /* `untag' action */
 					if (mtag != NULL)
 						m_tag_delete(m, mtag);
 					match = 0;
 				} else {
 					if (mtag == NULL) {
 						mtag = m_tag_alloc( MTAG_IPFW,
 						    tag, 0, M_NOWAIT);
 						if (mtag != NULL)
 							m_tag_prepend(m, mtag);
 					}
 					match = 1;
 				}
 				break;
 			}
 
 			case O_FIB: /* try match the specified fib */
 				if (args->f_id.fib == cmd->arg1)
 					match = 1;
 				break;
 
 			case O_SOCKARG:	{
 #ifndef USERSPACE	/* not supported in userspace */
 				struct inpcb *inp = args->inp;
 				struct inpcbinfo *pi;
 				
 				if (is_ipv6) /* XXX can we remove this ? */
 					break;
 
 				if (proto == IPPROTO_TCP)
 					pi = &V_tcbinfo;
 				else if (proto == IPPROTO_UDP)
 					pi = &V_udbinfo;
 				else
 					break;
 
 				/*
 				 * XXXRW: so_user_cookie should almost
 				 * certainly be inp_user_cookie?
 				 */
 
 				/* For incoming packet, lookup up the 
 				inpcb using the src/dest ip/port tuple */
 				if (inp == NULL) {
 					inp = in_pcblookup(pi, 
 						src_ip, htons(src_port),
 						dst_ip, htons(dst_port),
 						INPLOOKUP_RLOCKPCB, NULL);
 					if (inp != NULL) {
 						tablearg =
 						    inp->inp_socket->so_user_cookie;
 						if (tablearg)
 							match = 1;
 						INP_RUNLOCK(inp);
 					}
 				} else {
 					if (inp->inp_socket) {
 						tablearg =
 						    inp->inp_socket->so_user_cookie;
 						if (tablearg)
 							match = 1;
 					}
 				}
 #endif /* !USERSPACE */
 				break;
 			}
 
 			case O_TAGGED: {
 				struct m_tag *mtag;
 				uint32_t tag = TARG(cmd->arg1, tag);
 
 				if (cmdlen == 1) {
 					match = m_tag_locate(m, MTAG_IPFW,
 					    tag, NULL) != NULL;
 					break;
 				}
 
 				/* we have ranges */
 				for (mtag = m_tag_first(m);
 				    mtag != NULL && !match;
 				    mtag = m_tag_next(m, mtag)) {
 					uint16_t *p;
 					int i;
 
 					if (mtag->m_tag_cookie != MTAG_IPFW)
 						continue;
 
 					p = ((ipfw_insn_u16 *)cmd)->ports;
 					i = cmdlen - 1;
 					for(; !match && i > 0; i--, p += 2)
 						match =
 						    mtag->m_tag_id >= p[0] &&
 						    mtag->m_tag_id <= p[1];
 				}
 				break;
 			}
 				
 			/*
 			 * The second set of opcodes represents 'actions',
 			 * i.e. the terminal part of a rule once the packet
 			 * matches all previous patterns.
 			 * Typically there is only one action for each rule,
 			 * and the opcode is stored at the end of the rule
 			 * (but there are exceptions -- see below).
 			 *
 			 * In general, here we set retval and terminate the
 			 * outer loop (would be a 'break 3' in some language,
 			 * but we need to set l=0, done=1)
 			 *
 			 * Exceptions:
 			 * O_COUNT and O_SKIPTO actions:
 			 *   instead of terminating, we jump to the next rule
 			 *   (setting l=0), or to the SKIPTO target (setting
 			 *   f/f_len, cmd and l as needed), respectively.
 			 *
 			 * O_TAG, O_LOG and O_ALTQ action parameters:
 			 *   perform some action and set match = 1;
 			 *
 			 * O_LIMIT and O_KEEP_STATE: these opcodes are
 			 *   not real 'actions', and are stored right
 			 *   before the 'action' part of the rule.
 			 *   These opcodes try to install an entry in the
 			 *   state tables; if successful, we continue with
 			 *   the next opcode (match=1; break;), otherwise
 			 *   the packet must be dropped (set retval,
 			 *   break loops with l=0, done=1)
 			 *
 			 * O_PROBE_STATE and O_CHECK_STATE: these opcodes
 			 *   cause a lookup of the state table, and a jump
 			 *   to the 'action' part of the parent rule
 			 *   if an entry is found, or
 			 *   (CHECK_STATE only) a jump to the next rule if
 			 *   the entry is not found.
 			 *   The result of the lookup is cached so that
 			 *   further instances of these opcodes become NOPs.
 			 *   The jump to the next rule is done by setting
 			 *   l=0, cmdlen=0.
 			 */
 			case O_LIMIT:
 			case O_KEEP_STATE:
 				if (ipfw_install_state(chain, f,
 				    (ipfw_insn_limit *)cmd, args, tablearg)) {
 					/* error or limit violation */
 					retval = IP_FW_DENY;
 					l = 0;	/* exit inner loop */
 					done = 1; /* exit outer loop */
 				}
 				match = 1;
 				break;
 
 			case O_PROBE_STATE:
 			case O_CHECK_STATE:
 				/*
 				 * dynamic rules are checked at the first
 				 * keep-state or check-state occurrence,
 				 * with the result being stored in dyn_dir
 				 * and dyn_name.
 				 * The compiler introduces a PROBE_STATE
 				 * instruction for us when we have a
 				 * KEEP_STATE (because PROBE_STATE needs
 				 * to be run first).
 				 *
 				 * (dyn_dir == MATCH_UNKNOWN) means this is
 				 * first lookup for such f_id. Do lookup.
 				 *
 				 * (dyn_dir != MATCH_UNKNOWN &&
 				 *  dyn_name != 0 && dyn_name != cmd->arg1)
 				 * means previous lookup didn't find dynamic
 				 * rule for specific state name and current
 				 * lookup will search rule with another state
 				 * name. Redo lookup.
 				 *
 				 * (dyn_dir != MATCH_UNKNOWN && dyn_name == 0)
 				 * means previous lookup was for `any' name
 				 * and it didn't find rule. No need to do
 				 * lookup again.
 				 */
 				if ((dyn_dir == MATCH_UNKNOWN ||
 				    (dyn_name != 0 &&
 				    dyn_name != cmd->arg1)) &&
 				    (q = ipfw_lookup_dyn_rule(&args->f_id,
 				     &dyn_dir, proto == IPPROTO_TCP ?
 				     TCP(ulp): NULL,
 				     (dyn_name = cmd->arg1))) != NULL) {
 					/*
 					 * Found dynamic entry, update stats
 					 * and jump to the 'action' part of
 					 * the parent rule by setting
 					 * f, cmd, l and clearing cmdlen.
 					 */
 					IPFW_INC_DYN_COUNTER(q, pktlen);
 					/* XXX we would like to have f_pos
 					 * readily accessible in the dynamic
 				         * rule, instead of having to
 					 * lookup q->rule.
 					 */
 					f = q->rule;
 					f_pos = ipfw_find_rule(chain,
 						f->rulenum, f->id);
 					cmd = ACTION_PTR(f);
 					l = f->cmd_len - f->act_ofs;
 					ipfw_dyn_unlock(q);
 					cmdlen = 0;
 					match = 1;
 					break;
 				}
 				/*
 				 * Dynamic entry not found. If CHECK_STATE,
 				 * skip to next rule, if PROBE_STATE just
 				 * ignore and continue with next opcode.
 				 */
 				if (cmd->opcode == O_CHECK_STATE)
 					l = 0;	/* exit inner loop */
 				match = 1;
 				break;
 
 			case O_ACCEPT:
 				retval = 0;	/* accept */
 				l = 0;		/* exit inner loop */
 				done = 1;	/* exit outer loop */
 				break;
 
 			case O_PIPE:
 			case O_QUEUE:
 				set_match(args, f_pos, chain);
 				args->rule.info = TARG(cmd->arg1, pipe);
 				if (cmd->opcode == O_PIPE)
 					args->rule.info |= IPFW_IS_PIPE;
 				if (V_fw_one_pass)
 					args->rule.info |= IPFW_ONEPASS;
 				retval = IP_FW_DUMMYNET;
 				l = 0;          /* exit inner loop */
 				done = 1;       /* exit outer loop */
 				break;
 
 			case O_DIVERT:
 			case O_TEE:
 				if (args->eh) /* not on layer 2 */
 				    break;
 				/* otherwise this is terminal */
 				l = 0;		/* exit inner loop */
 				done = 1;	/* exit outer loop */
 				retval = (cmd->opcode == O_DIVERT) ?
 					IP_FW_DIVERT : IP_FW_TEE;
 				set_match(args, f_pos, chain);
 				args->rule.info = TARG(cmd->arg1, divert);
 				break;
 
 			case O_COUNT:
 				IPFW_INC_RULE_COUNTER(f, pktlen);
 				l = 0;		/* exit inner loop */
 				break;
 
 			case O_SKIPTO:
 			    IPFW_INC_RULE_COUNTER(f, pktlen);
 			    f_pos = JUMP(chain, f, cmd->arg1, tablearg, 0);
 			    /*
 			     * Skip disabled rules, and re-enter
 			     * the inner loop with the correct
 			     * f_pos, f, l and cmd.
 			     * Also clear cmdlen and skip_or
 			     */
 			    for (; f_pos < chain->n_rules - 1 &&
 				    (V_set_disable &
 				     (1 << chain->map[f_pos]->set));
 				    f_pos++)
 				;
 			    /* Re-enter the inner loop at the skipto rule. */
 			    f = chain->map[f_pos];
 			    l = f->cmd_len;
 			    cmd = f->cmd;
 			    match = 1;
 			    cmdlen = 0;
 			    skip_or = 0;
 			    continue;
 			    break;	/* not reached */
 
 			case O_CALLRETURN: {
 				/*
 				 * Implementation of `subroutine' call/return,
 				 * in the stack carried in an mbuf tag. This
 				 * is different from `skipto' in that any call
 				 * address is possible (`skipto' must prevent
 				 * backward jumps to avoid endless loops).
 				 * We have `return' action when F_NOT flag is
 				 * present. The `m_tag_id' field is used as
 				 * stack pointer.
 				 */
 				struct m_tag *mtag;
 				uint16_t jmpto, *stack;
 
 #define	IS_CALL		((cmd->len & F_NOT) == 0)
 #define	IS_RETURN	((cmd->len & F_NOT) != 0)
 				/*
 				 * Hand-rolled version of m_tag_locate() with
 				 * wildcard `type'.
 				 * If not already tagged, allocate new tag.
 				 */
 				mtag = m_tag_first(m);
 				while (mtag != NULL) {
 					if (mtag->m_tag_cookie ==
 					    MTAG_IPFW_CALL)
 						break;
 					mtag = m_tag_next(m, mtag);
 				}
 				if (mtag == NULL && IS_CALL) {
 					mtag = m_tag_alloc(MTAG_IPFW_CALL, 0,
 					    IPFW_CALLSTACK_SIZE *
 					    sizeof(uint16_t), M_NOWAIT);
 					if (mtag != NULL)
 						m_tag_prepend(m, mtag);
 				}
 
 				/*
 				 * On error both `call' and `return' just
 				 * continue with next rule.
 				 */
 				if (IS_RETURN && (mtag == NULL ||
 				    mtag->m_tag_id == 0)) {
 					l = 0;		/* exit inner loop */
 					break;
 				}
 				if (IS_CALL && (mtag == NULL ||
 				    mtag->m_tag_id >= IPFW_CALLSTACK_SIZE)) {
 					printf("ipfw: call stack error, "
 					    "go to next rule\n");
 					l = 0;		/* exit inner loop */
 					break;
 				}
 
 				IPFW_INC_RULE_COUNTER(f, pktlen);
 				stack = (uint16_t *)(mtag + 1);
 
 				/*
 				 * The `call' action may use cached f_pos
 				 * (in f->next_rule), whose version is written
 				 * in f->next_rule.
 				 * The `return' action, however, doesn't have
 				 * fixed jump address in cmd->arg1 and can't use
 				 * cache.
 				 */
 				if (IS_CALL) {
 					stack[mtag->m_tag_id] = f->rulenum;
 					mtag->m_tag_id++;
 			    		f_pos = JUMP(chain, f, cmd->arg1,
 					    tablearg, 1);
 				} else {	/* `return' action */
 					mtag->m_tag_id--;
 					jmpto = stack[mtag->m_tag_id] + 1;
 					f_pos = ipfw_find_rule(chain, jmpto, 0);
 				}
 
 				/*
 				 * Skip disabled rules, and re-enter
 				 * the inner loop with the correct
 				 * f_pos, f, l and cmd.
 				 * Also clear cmdlen and skip_or
 				 */
 				for (; f_pos < chain->n_rules - 1 &&
 				    (V_set_disable &
 				    (1 << chain->map[f_pos]->set)); f_pos++)
 					;
 				/* Re-enter the inner loop at the dest rule. */
 				f = chain->map[f_pos];
 				l = f->cmd_len;
 				cmd = f->cmd;
 				cmdlen = 0;
 				skip_or = 0;
 				continue;
 				break;	/* NOTREACHED */
 			}
 #undef IS_CALL
 #undef IS_RETURN
 
 			case O_REJECT:
 				/*
 				 * Drop the packet and send a reject notice
 				 * if the packet is not ICMP (or is an ICMP
 				 * query), and it is not multicast/broadcast.
 				 */
 				if (hlen > 0 && is_ipv4 && offset == 0 &&
 				    (proto != IPPROTO_ICMP ||
 				     is_icmp_query(ICMP(ulp))) &&
 				    !(m->m_flags & (M_BCAST|M_MCAST)) &&
 				    !IN_MULTICAST(ntohl(dst_ip.s_addr))) {
 					send_reject(args, cmd->arg1, iplen, ip);
 					m = args->m;
 				}
 				/* FALLTHROUGH */
 #ifdef INET6
 			case O_UNREACH6:
 				if (hlen > 0 && is_ipv6 &&
 				    ((offset & IP6F_OFF_MASK) == 0) &&
 				    (proto != IPPROTO_ICMPV6 ||
 				     (is_icmp6_query(icmp6_type) == 1)) &&
 				    !(m->m_flags & (M_BCAST|M_MCAST)) &&
 				    !IN6_IS_ADDR_MULTICAST(&args->f_id.dst_ip6)) {
 					send_reject6(
 					    args, cmd->arg1, hlen,
 					    (struct ip6_hdr *)ip);
 					m = args->m;
 				}
 				/* FALLTHROUGH */
 #endif
 			case O_DENY:
 				retval = IP_FW_DENY;
 				l = 0;		/* exit inner loop */
 				done = 1;	/* exit outer loop */
 				break;
 
 			case O_FORWARD_IP:
 				if (args->eh)	/* not valid on layer2 pkts */
 					break;
 				if (q == NULL || q->rule != f ||
 				    dyn_dir == MATCH_FORWARD) {
 				    struct sockaddr_in *sa;
 
 				    sa = &(((ipfw_insn_sa *)cmd)->sa);
 				    if (sa->sin_addr.s_addr == INADDR_ANY) {
 #ifdef INET6
 					/*
 					 * We use O_FORWARD_IP opcode for
 					 * fwd rule with tablearg, but tables
 					 * now support IPv6 addresses. And
 					 * when we are inspecting IPv6 packet,
 					 * we can use nh6 field from
 					 * table_value as next_hop6 address.
 					 */
 					if (is_ipv6) {
 						struct sockaddr_in6 *sa6;
 
 						sa6 = args->next_hop6 =
 						    &args->hopstore6;
 						sa6->sin6_family = AF_INET6;
 						sa6->sin6_len = sizeof(*sa6);
 						sa6->sin6_addr = TARG_VAL(
 						    chain, tablearg, nh6);
 						/*
 						 * Set sin6_scope_id only for
 						 * link-local unicast addresses.
 						 */
 						if (IN6_IS_ADDR_LINKLOCAL(
 						    &sa6->sin6_addr))
 							sa6->sin6_scope_id =
 							    TARG_VAL(chain,
 								tablearg,
 								zoneid);
 					} else
 #endif
 					{
 						sa = args->next_hop =
 						    &args->hopstore;
 						sa->sin_family = AF_INET;
 						sa->sin_len = sizeof(*sa);
 						sa->sin_addr.s_addr = htonl(
 						    TARG_VAL(chain, tablearg,
 						    nh4));
 					}
 				    } else {
 					args->next_hop = sa;
 				    }
 				}
 				retval = IP_FW_PASS;
 				l = 0;          /* exit inner loop */
 				done = 1;       /* exit outer loop */
 				break;
 
 #ifdef INET6
 			case O_FORWARD_IP6:
 				if (args->eh)	/* not valid on layer2 pkts */
 					break;
 				if (q == NULL || q->rule != f ||
 				    dyn_dir == MATCH_FORWARD) {
 					struct sockaddr_in6 *sin6;
 
 					sin6 = &(((ipfw_insn_sa6 *)cmd)->sa);
 					args->next_hop6 = sin6;
 				}
 				retval = IP_FW_PASS;
 				l = 0;		/* exit inner loop */
 				done = 1;	/* exit outer loop */
 				break;
 #endif
 
 			case O_NETGRAPH:
 			case O_NGTEE:
 				set_match(args, f_pos, chain);
 				args->rule.info = TARG(cmd->arg1, netgraph);
 				if (V_fw_one_pass)
 					args->rule.info |= IPFW_ONEPASS;
 				retval = (cmd->opcode == O_NETGRAPH) ?
 				    IP_FW_NETGRAPH : IP_FW_NGTEE;
 				l = 0;          /* exit inner loop */
 				done = 1;       /* exit outer loop */
 				break;
 
 			case O_SETFIB: {
 				uint32_t fib;
 
 				IPFW_INC_RULE_COUNTER(f, pktlen);
 				fib = TARG(cmd->arg1, fib) & 0x7FFF;
 				if (fib >= rt_numfibs)
 					fib = 0;
 				M_SETFIB(m, fib);
 				args->f_id.fib = fib;
 				l = 0;		/* exit inner loop */
 				break;
 		        }
 
 			case O_SETDSCP: {
 				uint16_t code;
 
 				code = TARG(cmd->arg1, dscp) & 0x3F;
 				l = 0;		/* exit inner loop */
 				if (is_ipv4) {
 					uint16_t old;
 
 					old = *(uint16_t *)ip;
 					ip->ip_tos = (code << 2) |
 					    (ip->ip_tos & 0x03);
 					ip->ip_sum = cksum_adjust(ip->ip_sum,
 					    old, *(uint16_t *)ip);
 				} else if (is_ipv6) {
 					uint8_t *v;
 
 					v = &((struct ip6_hdr *)ip)->ip6_vfc;
 					*v = (*v & 0xF0) | (code >> 2);
 					v++;
 					*v = (*v & 0x3F) | ((code & 0x03) << 6);
 				} else
 					break;
 
 				IPFW_INC_RULE_COUNTER(f, pktlen);
 				break;
 			}
 
 			case O_NAT:
 				l = 0;          /* exit inner loop */
 				done = 1;       /* exit outer loop */
  				if (!IPFW_NAT_LOADED) {
 				    retval = IP_FW_DENY;
 				    break;
 				}
 
 				struct cfg_nat *t;
 				int nat_id;
 
 				set_match(args, f_pos, chain);
 				/* Check if this is 'global' nat rule */
 				if (cmd->arg1 == IP_FW_NAT44_GLOBAL) {
 					retval = ipfw_nat_ptr(args, NULL, m);
 					break;
 				}
 				t = ((ipfw_insn_nat *)cmd)->nat;
 				if (t == NULL) {
 					nat_id = TARG(cmd->arg1, nat);
 					t = (*lookup_nat_ptr)(&chain->nat, nat_id);
 
 					if (t == NULL) {
 					    retval = IP_FW_DENY;
 					    break;
 					}
 					if (cmd->arg1 != IP_FW_TARG)
 					    ((ipfw_insn_nat *)cmd)->nat = t;
 				}
 				retval = ipfw_nat_ptr(args, t, m);
 				break;
 
 			case O_REASS: {
 				int ip_off;
 
 				IPFW_INC_RULE_COUNTER(f, pktlen);
 				l = 0;	/* in any case exit inner loop */
 				ip_off = ntohs(ip->ip_off);
 
 				/* if not fragmented, go to next rule */
 				if ((ip_off & (IP_MF | IP_OFFMASK)) == 0)
 				    break;
 
 				args->m = m = ip_reass(m);
 
 				/*
 				 * do IP header checksum fixup.
 				 */
 				if (m == NULL) { /* fragment got swallowed */
 				    retval = IP_FW_DENY;
 				} else { /* good, packet complete */
 				    int hlen;
 
 				    ip = mtod(m, struct ip *);
 				    hlen = ip->ip_hl << 2;
 				    ip->ip_sum = 0;
 				    if (hlen == sizeof(struct ip))
 					ip->ip_sum = in_cksum_hdr(ip);
 				    else
 					ip->ip_sum = in_cksum(m, hlen);
 				    retval = IP_FW_REASS;
 				    set_match(args, f_pos, chain);
 				}
 				done = 1;	/* exit outer loop */
 				break;
 			}
 			case O_EXTERNAL_ACTION:
 				l = 0; /* in any case exit inner loop */
 				retval = ipfw_run_eaction(chain, args,
 				    cmd, &done);
+				/*
+				 * If both @retval and @done are zero,
+				 * consider this as rule matching and
+				 * update counters.
+				 */
+				if (retval == 0 && done == 0)
+					IPFW_INC_RULE_COUNTER(f, pktlen);
 				break;
 
 			default:
 				panic("-- unknown opcode %d\n", cmd->opcode);
 			} /* end of switch() on opcodes */
 			/*
 			 * if we get here with l=0, then match is irrelevant.
 			 */
 
 			if (cmd->len & F_NOT)
 				match = !match;
 
 			if (match) {
 				if (cmd->len & F_OR)
 					skip_or = 1;
 			} else {
 				if (!(cmd->len & F_OR)) /* not an OR block, */
 					break;		/* try next rule    */
 			}
 
 		}	/* end of inner loop, scan opcodes */
 #undef PULLUP_LEN
 
 		if (done)
 			break;
 
 /* next_rule:; */	/* try next rule		*/
 
 	}		/* end of outer for, scan rules */
 
 	if (done) {
 		struct ip_fw *rule = chain->map[f_pos];
 		/* Update statistics */
 		IPFW_INC_RULE_COUNTER(rule, pktlen);
 	} else {
 		retval = IP_FW_DENY;
 		printf("ipfw: ouch!, skip past end of rules, denying packet\n");
 	}
 	IPFW_PF_RUNLOCK(chain);
 #ifdef __FreeBSD__
 	if (ucred_cache != NULL)
 		crfree(ucred_cache);
 #endif
 	return (retval);
 
 pullup_failed:
 	if (V_fw_verbose)
 		printf("ipfw: pullup failed\n");
 	return (IP_FW_DENY);
 }
 
 /*
  * Set maximum number of tables that can be used in given VNET ipfw instance.
  */
 #ifdef SYSCTL_NODE
 static int
 sysctl_ipfw_table_num(SYSCTL_HANDLER_ARGS)
 {
 	int error;
 	unsigned int ntables;
 
 	ntables = V_fw_tables_max;
 
 	error = sysctl_handle_int(oidp, &ntables, 0, req);
 	/* Read operation or some error */
 	if ((error != 0) || (req->newptr == NULL))
 		return (error);
 
 	return (ipfw_resize_tables(&V_layer3_chain, ntables));
 }
 
 /*
  * Switches table namespace between global and per-set.
  */
 static int
 sysctl_ipfw_tables_sets(SYSCTL_HANDLER_ARGS)
 {
 	int error;
 	unsigned int sets;
 
 	sets = V_fw_tables_sets;
 
 	error = sysctl_handle_int(oidp, &sets, 0, req);
 	/* Read operation or some error */
 	if ((error != 0) || (req->newptr == NULL))
 		return (error);
 
 	return (ipfw_switch_tables_namespace(&V_layer3_chain, sets));
 }
 #endif
 
 /*
  * Module and VNET glue
  */
 
 /*
  * Stuff that must be initialised only on boot or module load
  */
 static int
 ipfw_init(void)
 {
 	int error = 0;
 
 	/*
  	 * Only print out this stuff the first time around,
 	 * when called from the sysinit code.
 	 */
 	printf("ipfw2 "
 #ifdef INET6
 		"(+ipv6) "
 #endif
 		"initialized, divert %s, nat %s, "
 		"default to %s, logging ",
 #ifdef IPDIVERT
 		"enabled",
 #else
 		"loadable",
 #endif
 #ifdef IPFIREWALL_NAT
 		"enabled",
 #else
 		"loadable",
 #endif
 		default_to_accept ? "accept" : "deny");
 
 	/*
 	 * Note: V_xxx variables can be accessed here but the vnet specific
 	 * initializer may not have been called yet for the VIMAGE case.
 	 * Tuneables will have been processed. We will print out values for
 	 * the default vnet. 
 	 * XXX This should all be rationalized AFTER 8.0
 	 */
 	if (V_fw_verbose == 0)
 		printf("disabled\n");
 	else if (V_verbose_limit == 0)
 		printf("unlimited\n");
 	else
 		printf("limited to %d packets/entry by default\n",
 		    V_verbose_limit);
 
 	/* Check user-supplied table count for validness */
 	if (default_fw_tables > IPFW_TABLES_MAX)
 	  default_fw_tables = IPFW_TABLES_MAX;
 
 	ipfw_init_sopt_handler();
 	ipfw_init_obj_rewriter();
 	ipfw_iface_init();
 	return (error);
 }
 
 /*
  * Called for the removal of the last instance only on module unload.
  */
 static void
 ipfw_destroy(void)
 {
 
 	ipfw_iface_destroy();
 	ipfw_destroy_sopt_handler();
 	ipfw_destroy_obj_rewriter();
 	printf("IP firewall unloaded\n");
 }
 
 /*
  * Stuff that must be initialized for every instance
  * (including the first of course).
  */
 static int
 vnet_ipfw_init(const void *unused)
 {
 	int error, first;
 	struct ip_fw *rule = NULL;
 	struct ip_fw_chain *chain;
 
 	chain = &V_layer3_chain;
 
 	first = IS_DEFAULT_VNET(curvnet) ? 1 : 0;
 
 	/* First set up some values that are compile time options */
 	V_autoinc_step = 100;	/* bounded to 1..1000 in add_rule() */
 	V_fw_deny_unknown_exthdrs = 1;
 #ifdef IPFIREWALL_VERBOSE
 	V_fw_verbose = 1;
 #endif
 #ifdef IPFIREWALL_VERBOSE_LIMIT
 	V_verbose_limit = IPFIREWALL_VERBOSE_LIMIT;
 #endif
 #ifdef IPFIREWALL_NAT
 	LIST_INIT(&chain->nat);
 #endif
 
 	/* Init shared services hash table */
 	ipfw_init_srv(chain);
 
 	ipfw_init_counters();
 	/* insert the default rule and create the initial map */
 	chain->n_rules = 1;
 	chain->map = malloc(sizeof(struct ip_fw *), M_IPFW, M_WAITOK | M_ZERO);
 	rule = ipfw_alloc_rule(chain, sizeof(struct ip_fw));
 
 	/* Set initial number of tables */
 	V_fw_tables_max = default_fw_tables;
 	error = ipfw_init_tables(chain, first);
 	if (error) {
 		printf("ipfw2: setting up tables failed\n");
 		free(chain->map, M_IPFW);
 		free(rule, M_IPFW);
 		return (ENOSPC);
 	}
 
 	/* fill and insert the default rule */
 	rule->act_ofs = 0;
 	rule->rulenum = IPFW_DEFAULT_RULE;
 	rule->cmd_len = 1;
 	rule->set = RESVD_SET;
 	rule->cmd[0].len = 1;
 	rule->cmd[0].opcode = default_to_accept ? O_ACCEPT : O_DENY;
 	chain->default_rule = chain->map[0] = rule;
 	chain->id = rule->id = 1;
 	/* Pre-calculate rules length for legacy dump format */
 	chain->static_len = sizeof(struct ip_fw_rule0);
 
 	IPFW_LOCK_INIT(chain);
 	ipfw_dyn_init(chain);
 	ipfw_eaction_init(chain, first);
 #ifdef LINEAR_SKIPTO
 	ipfw_init_skipto_cache(chain);
 #endif
 	ipfw_bpf_init(first);
 
 	/* First set up some values that are compile time options */
 	V_ipfw_vnet_ready = 1;		/* Open for business */
 
 	/*
 	 * Hook the sockopt handler and pfil hooks for ipv4 and ipv6.
 	 * Even if the latter two fail we still keep the module alive
 	 * because the sockopt and layer2 paths are still useful.
 	 * ipfw[6]_hook return 0 on success, ENOENT on failure,
 	 * so we can ignore the exact return value and just set a flag.
 	 *
 	 * Note that V_fw[6]_enable are manipulated by a SYSCTL_PROC so
 	 * changes in the underlying (per-vnet) variables trigger
 	 * immediate hook()/unhook() calls.
 	 * In layer2 we have the same behaviour, except that V_ether_ipfw
 	 * is checked on each packet because there are no pfil hooks.
 	 */
 	V_ip_fw_ctl_ptr = ipfw_ctl3;
 	error = ipfw_attach_hooks(1);
 	return (error);
 }
 
 /*
  * Called for the removal of each instance.
  */
 static int
 vnet_ipfw_uninit(const void *unused)
 {
 	struct ip_fw *reap;
 	struct ip_fw_chain *chain = &V_layer3_chain;
 	int i, last;
 
 	V_ipfw_vnet_ready = 0; /* tell new callers to go away */
 	/*
 	 * disconnect from ipv4, ipv6, layer2 and sockopt.
 	 * Then grab, release and grab again the WLOCK so we make
 	 * sure the update is propagated and nobody will be in.
 	 */
 	(void)ipfw_attach_hooks(0 /* detach */);
 	V_ip_fw_ctl_ptr = NULL;
 
 	last = IS_DEFAULT_VNET(curvnet) ? 1 : 0;
 
 	IPFW_UH_WLOCK(chain);
 	IPFW_UH_WUNLOCK(chain);
 
 	ipfw_dyn_uninit(0);	/* run the callout_drain */
 
 	IPFW_UH_WLOCK(chain);
 
 	reap = NULL;
 	IPFW_WLOCK(chain);
 	for (i = 0; i < chain->n_rules; i++)
 		ipfw_reap_add(chain, &reap, chain->map[i]);
 	free(chain->map, M_IPFW);
 #ifdef LINEAR_SKIPTO
 	ipfw_destroy_skipto_cache(chain);
 #endif
 	IPFW_WUNLOCK(chain);
 	IPFW_UH_WUNLOCK(chain);
 	ipfw_destroy_tables(chain, last);
 	ipfw_eaction_uninit(chain, last);
 	if (reap != NULL)
 		ipfw_reap_rules(reap);
 	vnet_ipfw_iface_destroy(chain);
 	ipfw_destroy_srv(chain);
 	IPFW_LOCK_DESTROY(chain);
 	ipfw_dyn_uninit(1);	/* free the remaining parts */
 	ipfw_destroy_counters();
 	ipfw_bpf_uninit(last);
 	return (0);
 }
 
 /*
  * Module event handler.
  * In general we have the choice of handling most of these events by the
  * event handler or by the (VNET_)SYS(UN)INIT handlers. I have chosen to
  * use the SYSINIT handlers as they are more capable of expressing the
  * flow of control during module and vnet operations, so this is just
  * a skeleton. Note there is no SYSINIT equivalent of the module
  * SHUTDOWN handler, but we don't have anything to do in that case anyhow.
  */
 static int
 ipfw_modevent(module_t mod, int type, void *unused)
 {
 	int err = 0;
 
 	switch (type) {
 	case MOD_LOAD:
 		/* Called once at module load or
 	 	 * system boot if compiled in. */
 		break;
 	case MOD_QUIESCE:
 		/* Called before unload. May veto unloading. */
 		break;
 	case MOD_UNLOAD:
 		/* Called during unload. */
 		break;
 	case MOD_SHUTDOWN:
 		/* Called during system shutdown. */
 		break;
 	default:
 		err = EOPNOTSUPP;
 		break;
 	}
 	return err;
 }
 
 static moduledata_t ipfwmod = {
 	"ipfw",
 	ipfw_modevent,
 	0
 };
 
 /* Define startup order. */
 #define	IPFW_SI_SUB_FIREWALL	SI_SUB_PROTO_FIREWALL
 #define	IPFW_MODEVENT_ORDER	(SI_ORDER_ANY - 255) /* On boot slot in here. */
 #define	IPFW_MODULE_ORDER	(IPFW_MODEVENT_ORDER + 1) /* A little later. */
 #define	IPFW_VNET_ORDER		(IPFW_MODEVENT_ORDER + 2) /* Later still. */
 
 DECLARE_MODULE(ipfw, ipfwmod, IPFW_SI_SUB_FIREWALL, IPFW_MODEVENT_ORDER);
 FEATURE(ipfw_ctl3, "ipfw new sockopt calls");
 MODULE_VERSION(ipfw, 3);
 /* should declare some dependencies here */
 
 /*
  * Starting up. Done in order after ipfwmod() has been called.
  * VNET_SYSINIT is also called for each existing vnet and each new vnet.
  */
 SYSINIT(ipfw_init, IPFW_SI_SUB_FIREWALL, IPFW_MODULE_ORDER,
 	    ipfw_init, NULL);
 VNET_SYSINIT(vnet_ipfw_init, IPFW_SI_SUB_FIREWALL, IPFW_VNET_ORDER,
 	    vnet_ipfw_init, NULL);
  
 /*
  * Closing up shop. These are done in REVERSE ORDER, but still
  * after ipfwmod() has been called. Not called on reboot.
  * VNET_SYSUNINIT is also called for each exiting vnet as it exits.
  * or when the module is unloaded.
  */
 SYSUNINIT(ipfw_destroy, IPFW_SI_SUB_FIREWALL, IPFW_MODULE_ORDER,
 	    ipfw_destroy, NULL);
 VNET_SYSUNINIT(vnet_ipfw_uninit, IPFW_SI_SUB_FIREWALL, IPFW_VNET_ORDER,
 	    vnet_ipfw_uninit, NULL);
 /* end of file */
Index: projects/clang400-import/sys/netpfil/ipfw/nptv6/nptv6.c
===================================================================
--- projects/clang400-import/sys/netpfil/ipfw/nptv6/nptv6.c	(revision 314522)
+++ projects/clang400-import/sys/netpfil/ipfw/nptv6/nptv6.c	(revision 314523)
@@ -1,892 +1,894 @@
 /*-
  * Copyright (c) 2016 Yandex LLC
  * Copyright (c) 2016 Andrey V. Elsukov <ae@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  *
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/counter.h>
 #include <sys/errno.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mbuf.h>
 #include <sys/module.h>
 #include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/socket.h>
 #include <sys/queue.h>
 #include <sys/syslog.h>
 #include <sys/sysctl.h>
 
 #include <net/if.h>
 #include <net/if_var.h>
 #include <net/netisr.h>
 #include <net/pfil.h>
 #include <net/vnet.h>
 
 #include <netinet/in.h>
 #include <netinet/ip_var.h>
 #include <netinet/ip_fw.h>
 #include <netinet/ip6.h>
 #include <netinet/icmp6.h>
 #include <netinet6/in6_var.h>
 #include <netinet6/ip6_var.h>
 
 #include <netpfil/ipfw/ip_fw_private.h>
 #include <netpfil/ipfw/nptv6/nptv6.h>
 
 static VNET_DEFINE(uint16_t, nptv6_eid) = 0;
 #define	V_nptv6_eid	VNET(nptv6_eid)
 #define	IPFW_TLV_NPTV6_NAME	IPFW_TLV_EACTION_NAME(V_nptv6_eid)
 
 static struct nptv6_cfg *nptv6_alloc_config(const char *name, uint8_t set);
 static void nptv6_free_config(struct nptv6_cfg *cfg);
 static struct nptv6_cfg *nptv6_find(struct namedobj_instance *ni,
     const char *name, uint8_t set);
 static int nptv6_rewrite_internal(struct nptv6_cfg *cfg, struct mbuf **mp,
     int offset);
 static int nptv6_rewrite_external(struct nptv6_cfg *cfg, struct mbuf **mp,
     int offset);
 
 #define	NPTV6_LOOKUP(chain, cmd)	\
     (struct nptv6_cfg *)SRV_OBJECT((chain), (cmd)->arg1)
 
 #ifndef IN6_MASK_ADDR
 #define IN6_MASK_ADDR(a, m)	do { \
 	(a)->s6_addr32[0] &= (m)->s6_addr32[0]; \
 	(a)->s6_addr32[1] &= (m)->s6_addr32[1]; \
 	(a)->s6_addr32[2] &= (m)->s6_addr32[2]; \
 	(a)->s6_addr32[3] &= (m)->s6_addr32[3]; \
 } while (0)
 #endif
 #ifndef IN6_ARE_MASKED_ADDR_EQUAL
 #define IN6_ARE_MASKED_ADDR_EQUAL(d, a, m)	(	\
 	(((d)->s6_addr32[0] ^ (a)->s6_addr32[0]) & (m)->s6_addr32[0]) == 0 && \
 	(((d)->s6_addr32[1] ^ (a)->s6_addr32[1]) & (m)->s6_addr32[1]) == 0 && \
 	(((d)->s6_addr32[2] ^ (a)->s6_addr32[2]) & (m)->s6_addr32[2]) == 0 && \
 	(((d)->s6_addr32[3] ^ (a)->s6_addr32[3]) & (m)->s6_addr32[3]) == 0 )
 #endif
 
 #if 0
 #define	NPTV6_DEBUG(fmt, ...)	do {			\
 	printf("%s: " fmt "\n", __func__, ## __VA_ARGS__);	\
 } while (0)
 #define	NPTV6_IPDEBUG(fmt, ...)	do {			\
 	char _s[INET6_ADDRSTRLEN], _d[INET6_ADDRSTRLEN];	\
 	printf("%s: " fmt "\n", __func__, ## __VA_ARGS__);	\
 } while (0)
 #else
 #define	NPTV6_DEBUG(fmt, ...)
 #define	NPTV6_IPDEBUG(fmt, ...)
 #endif
 
 static int
 nptv6_getlasthdr(struct nptv6_cfg *cfg, struct mbuf *m, int *offset)
 {
 	struct ip6_hdr *ip6;
 	struct ip6_hbh *hbh;
 	int proto, hlen;
 
 	hlen = (offset == NULL) ? 0: *offset;
 	if (m->m_len < hlen)
 		return (-1);
 	ip6 = mtodo(m, hlen);
 	hlen += sizeof(*ip6);
 	proto = ip6->ip6_nxt;
 	while (proto == IPPROTO_HOPOPTS || proto == IPPROTO_ROUTING ||
 	    proto == IPPROTO_DSTOPTS) {
 		hbh = mtodo(m, hlen);
 		if (m->m_len < hlen)
 			return (-1);
 		proto = hbh->ip6h_nxt;
 		hlen += hbh->ip6h_len << 3;
 	}
 	if (offset != NULL)
 		*offset = hlen;
 	return (proto);
 }
 
 static int
 nptv6_translate_icmpv6(struct nptv6_cfg *cfg, struct mbuf **mp, int offset)
 {
 	struct icmp6_hdr *icmp6;
 	struct ip6_hdr *ip6;
 	struct mbuf *m;
 
 	m = *mp;
 	if (offset > m->m_len)
 		return (-1);
 	icmp6 = mtodo(m, offset);
 	NPTV6_DEBUG("ICMPv6 type %d", icmp6->icmp6_type);
 	switch (icmp6->icmp6_type) {
 	case ICMP6_DST_UNREACH:
 	case ICMP6_PACKET_TOO_BIG:
 	case ICMP6_TIME_EXCEEDED:
 	case ICMP6_PARAM_PROB:
 		break;
 	case ICMP6_ECHO_REQUEST:
 	case ICMP6_ECHO_REPLY:
 		/* nothing to translate */
 		return (0);
 	default:
 		/*
 		 * XXX: We can add some checks to not translate NDP and MLD
 		 * messages. Currently user must explicitly allow these message
 		 * types, otherwise packets will be dropped.
 		 */
 		return (-1);
 	}
 	offset += sizeof(*icmp6);
 	if (offset + sizeof(*ip6) > m->m_pkthdr.len)
 		return (-1);
 	if (offset + sizeof(*ip6) > m->m_len)
 		*mp = m = m_pullup(m, offset + sizeof(*ip6));
 	if (m == NULL)
 		return (-1);
 	ip6 = mtodo(m, offset);
 	NPTV6_IPDEBUG("offset %d, %s -> %s %d", offset,
 	    inet_ntop(AF_INET6, &ip6->ip6_src, _s, sizeof(_s)),
 	    inet_ntop(AF_INET6, &ip6->ip6_dst, _d, sizeof(_d)),
 	    ip6->ip6_nxt);
 	if (IN6_ARE_MASKED_ADDR_EQUAL(&ip6->ip6_src,
 	    &cfg->external, &cfg->mask))
 		return (nptv6_rewrite_external(cfg, mp, offset));
 	else if (IN6_ARE_MASKED_ADDR_EQUAL(&ip6->ip6_dst,
 	    &cfg->internal, &cfg->mask))
 		return (nptv6_rewrite_internal(cfg, mp, offset));
 	/*
 	 * Addresses in the inner IPv6 header doesn't matched to
 	 * our prefixes.
 	 */
 	return (-1);
 }
 
 static int
 nptv6_search_index(struct nptv6_cfg *cfg, struct in6_addr *a)
 {
 	int idx;
 
 	if (cfg->flags & NPTV6_48PLEN)
 		return (3);
 
 	/* Search suitable word index for adjustment */
 	for (idx = 4; idx < 8; idx++)
 		if (a->s6_addr16[idx] != 0xffff)
 			break;
 	/*
 	 * RFC 6296 p3.7: If an NPTv6 Translator discovers a datagram with
 	 * an IID of all-zeros while performing address mapping, that
 	 * datagram MUST be dropped, and an ICMPv6 Parameter Problem error
 	 * SHOULD be generated.
 	 */
 	if (idx == 8 ||
 	    (a->s6_addr32[2] == 0 && a->s6_addr32[3] == 0))
 		return (-1);
 	return (idx);
 }
 
 static void
 nptv6_copy_addr(struct in6_addr *src, struct in6_addr *dst,
     struct in6_addr *mask)
 {
 	int i;
 
 	for (i = 0; i < 8 && mask->s6_addr8[i] != 0; i++) {
 		dst->s6_addr8[i] &=  ~mask->s6_addr8[i];
 		dst->s6_addr8[i] |= src->s6_addr8[i] & mask->s6_addr8[i];
 	}
 }
 
 static int
 nptv6_rewrite_internal(struct nptv6_cfg *cfg, struct mbuf **mp, int offset)
 {
 	struct in6_addr *addr;
 	struct ip6_hdr *ip6;
 	int idx, proto;
 	uint16_t adj;
 
 	ip6 = mtodo(*mp, offset);
 	NPTV6_IPDEBUG("offset %d, %s -> %s %d", offset,
 	    inet_ntop(AF_INET6, &ip6->ip6_src, _s, sizeof(_s)),
 	    inet_ntop(AF_INET6, &ip6->ip6_dst, _d, sizeof(_d)),
 	    ip6->ip6_nxt);
 	if (offset == 0)
 		addr = &ip6->ip6_src;
 	else {
 		/*
 		 * When we rewriting inner IPv6 header, we need to rewrite
 		 * destination address back to external prefix. The datagram in
 		 * the ICMPv6 payload should looks like it was send from
 		 * external prefix.
 		 */
 		addr = &ip6->ip6_dst;
 	}
 	idx = nptv6_search_index(cfg, addr);
 	if (idx < 0) {
 		/*
 		 * Do not send ICMPv6 error when offset isn't zero.
 		 * This means we are rewriting inner IPv6 header in the
 		 * ICMPv6 error message.
 		 */
 		if (offset == 0) {
 			icmp6_error2(*mp, ICMP6_DST_UNREACH,
 			    ICMP6_DST_UNREACH_ADDR, 0, (*mp)->m_pkthdr.rcvif);
 			*mp = NULL;
 		}
 		return (IP_FW_DENY);
 	}
 	adj = addr->s6_addr16[idx];
 	nptv6_copy_addr(&cfg->external, addr, &cfg->mask);
 	adj = cksum_add(adj, cfg->adjustment);
 	if (adj == 0xffff)
 		adj = 0;
 	addr->s6_addr16[idx] = adj;
 	if (offset == 0) {
 		/*
 		 * We may need to translate addresses in the inner IPv6
 		 * header for ICMPv6 error messages.
 		 */
 		proto = nptv6_getlasthdr(cfg, *mp, &offset);
 		if (proto < 0 || (proto == IPPROTO_ICMPV6 &&
 		    nptv6_translate_icmpv6(cfg, mp, offset) != 0))
 			return (IP_FW_DENY);
 		NPTV6STAT_INC(cfg, in2ex);
 	}
 	return (0);
 }
 
 static int
 nptv6_rewrite_external(struct nptv6_cfg *cfg, struct mbuf **mp, int offset)
 {
 	struct in6_addr *addr;
 	struct ip6_hdr *ip6;
 	int idx, proto;
 	uint16_t adj;
 
 	ip6 = mtodo(*mp, offset);
 	NPTV6_IPDEBUG("offset %d, %s -> %s %d", offset,
 	    inet_ntop(AF_INET6, &ip6->ip6_src, _s, sizeof(_s)),
 	    inet_ntop(AF_INET6, &ip6->ip6_dst, _d, sizeof(_d)),
 	    ip6->ip6_nxt);
 	if (offset == 0)
 		addr = &ip6->ip6_dst;
 	else {
 		/*
 		 * When we rewriting inner IPv6 header, we need to rewrite
 		 * source address back to internal prefix. The datagram in
 		 * the ICMPv6 payload should looks like it was send from
 		 * internal prefix.
 		 */
 		addr = &ip6->ip6_src;
 	}
 	idx = nptv6_search_index(cfg, addr);
 	if (idx < 0) {
 		/*
 		 * Do not send ICMPv6 error when offset isn't zero.
 		 * This means we are rewriting inner IPv6 header in the
 		 * ICMPv6 error message.
 		 */
 		if (offset == 0) {
 			icmp6_error2(*mp, ICMP6_DST_UNREACH,
 			    ICMP6_DST_UNREACH_ADDR, 0, (*mp)->m_pkthdr.rcvif);
 			*mp = NULL;
 		}
 		return (IP_FW_DENY);
 	}
 	adj = addr->s6_addr16[idx];
 	nptv6_copy_addr(&cfg->internal, addr, &cfg->mask);
 	adj = cksum_add(adj, ~cfg->adjustment);
 	if (adj == 0xffff)
 		adj = 0;
 	addr->s6_addr16[idx] = adj;
 	if (offset == 0) {
 		/*
 		 * We may need to translate addresses in the inner IPv6
 		 * header for ICMPv6 error messages.
 		 */
 		proto = nptv6_getlasthdr(cfg, *mp, &offset);
 		if (proto < 0 || (proto == IPPROTO_ICMPV6 &&
 		    nptv6_translate_icmpv6(cfg, mp, offset) != 0))
 			return (IP_FW_DENY);
 		NPTV6STAT_INC(cfg, ex2in);
 	}
 	return (0);
 }
 
 /*
  * ipfw external action handler.
  */
 static int
 ipfw_nptv6(struct ip_fw_chain *chain, struct ip_fw_args *args,
     ipfw_insn *cmd, int *done)
 {
 	struct ip6_hdr *ip6;
 	struct nptv6_cfg *cfg;
 	ipfw_insn *icmd;
 	int ret;
 
 	*done = 0; /* try next rule if not matched */
+	ret = IP_FW_DENY;
 	icmd = cmd + 1;
 	if (cmd->opcode != O_EXTERNAL_ACTION ||
 	    cmd->arg1 != V_nptv6_eid ||
 	    icmd->opcode != O_EXTERNAL_INSTANCE ||
 	    (cfg = NPTV6_LOOKUP(chain, icmd)) == NULL)
-		return (0);
+		return (ret);
 	/*
 	 * We need act as router, so when forwarding is disabled -
 	 * do nothing.
 	 */
 	if (V_ip6_forwarding == 0 || args->f_id.addr_type != 6)
-		return (0);
+		return (ret);
 	/*
 	 * NOTE: we expect ipfw_chk() did m_pullup() up to upper level
 	 * protocol's headers. Also we skip some checks, that ip6_input(),
 	 * ip6_forward(), ip6_fastfwd() and ipfw_chk() already did.
 	 */
-	ret = IP_FW_DENY;
 	ip6 = mtod(args->m, struct ip6_hdr *);
 	NPTV6_IPDEBUG("eid %u, oid %u, %s -> %s %d",
 	    cmd->arg1, icmd->arg1,
 	    inet_ntop(AF_INET6, &ip6->ip6_src, _s, sizeof(_s)),
 	    inet_ntop(AF_INET6, &ip6->ip6_dst, _d, sizeof(_d)),
 	    ip6->ip6_nxt);
 	if (IN6_ARE_MASKED_ADDR_EQUAL(&ip6->ip6_src,
 	    &cfg->internal, &cfg->mask)) {
 		/*
 		 * XXX: Do not translate packets when both src and dst
 		 * are from internal prefix.
 		 */
 		if (IN6_ARE_MASKED_ADDR_EQUAL(&ip6->ip6_dst,
 		    &cfg->internal, &cfg->mask))
-			return (0);
+			return (ret);
 		ret = nptv6_rewrite_internal(cfg, &args->m, 0);
 	} else if (IN6_ARE_MASKED_ADDR_EQUAL(&ip6->ip6_dst,
 	    &cfg->external, &cfg->mask))
 		ret = nptv6_rewrite_external(cfg, &args->m, 0);
 	else
-		return (0);
+		return (ret);
 	/*
-	 * If address wasn't rewrited - free mbuf.
+	 * If address wasn't rewrited - free mbuf and terminate the search.
 	 */
 	if (ret != 0) {
 		if (args->m != NULL) {
 			m_freem(args->m);
 			args->m = NULL; /* mark mbuf as consumed */
 		}
 		NPTV6STAT_INC(cfg, dropped);
-	}
-	/* Terminate the search if one_pass is set */
-	*done = V_fw_one_pass;
-	/* Update args->f_id when one_pass is off */
-	if (*done == 0 && ret == 0) {
-		ip6 = mtod(args->m, struct ip6_hdr *);
-		args->f_id.src_ip6 = ip6->ip6_src;
-		args->f_id.dst_ip6 = ip6->ip6_dst;
+		*done = 1;
+	} else {
+		/* Terminate the search if one_pass is set */
+		*done = V_fw_one_pass;
+		/* Update args->f_id when one_pass is off */
+		if (*done == 0) {
+			ip6 = mtod(args->m, struct ip6_hdr *);
+			args->f_id.src_ip6 = ip6->ip6_src;
+			args->f_id.dst_ip6 = ip6->ip6_dst;
+		}
 	}
 	return (ret);
 }
 
 static struct nptv6_cfg *
 nptv6_alloc_config(const char *name, uint8_t set)
 {
 	struct nptv6_cfg *cfg;
 
 	cfg = malloc(sizeof(struct nptv6_cfg), M_IPFW, M_WAITOK | M_ZERO);
 	COUNTER_ARRAY_ALLOC(cfg->stats, NPTV6STATS, M_WAITOK);
 	cfg->no.name = cfg->name;
 	cfg->no.etlv = IPFW_TLV_NPTV6_NAME;
 	cfg->no.set = set;
 	strlcpy(cfg->name, name, sizeof(cfg->name));
 	return (cfg);
 }
 
 static void
 nptv6_free_config(struct nptv6_cfg *cfg)
 {
 
 	COUNTER_ARRAY_FREE(cfg->stats, NPTV6STATS);
 	free(cfg, M_IPFW);
 }
 
 static void
 nptv6_export_config(struct ip_fw_chain *ch, struct nptv6_cfg *cfg,
     ipfw_nptv6_cfg *uc)
 {
 
 	uc->internal = cfg->internal;
 	uc->external = cfg->external;
 	uc->plen = cfg->plen;
 	uc->flags = cfg->flags & NPTV6_FLAGSMASK;
 	uc->set = cfg->no.set;
 	strlcpy(uc->name, cfg->no.name, sizeof(uc->name));
 }
 
 struct nptv6_dump_arg {
 	struct ip_fw_chain *ch;
 	struct sockopt_data *sd;
 };
 
 static int
 export_config_cb(struct namedobj_instance *ni, struct named_object *no,
     void *arg)
 {
 	struct nptv6_dump_arg *da = (struct nptv6_dump_arg *)arg;
 	ipfw_nptv6_cfg *uc;
 
 	uc = (ipfw_nptv6_cfg *)ipfw_get_sopt_space(da->sd, sizeof(*uc));
 	nptv6_export_config(da->ch, (struct nptv6_cfg *)no, uc);
 	return (0);
 }
 
 static struct nptv6_cfg *
 nptv6_find(struct namedobj_instance *ni, const char *name, uint8_t set)
 {
 	struct nptv6_cfg *cfg;
 
 	cfg = (struct nptv6_cfg *)ipfw_objhash_lookup_name_type(ni, set,
 	    IPFW_TLV_NPTV6_NAME, name);
 
 	return (cfg);
 }
 
 static void
 nptv6_calculate_adjustment(struct nptv6_cfg *cfg)
 {
 	uint16_t i, e;
 	uint16_t *p;
 
 	/* Calculate checksum of internal prefix */
 	for (i = 0, p = (uint16_t *)&cfg->internal;
 	    p < (uint16_t *)(&cfg->internal + 1); p++)
 		i = cksum_add(i, *p);
 
 	/* Calculate checksum of external prefix */
 	for (e = 0, p = (uint16_t *)&cfg->external;
 	    p < (uint16_t *)(&cfg->external + 1); p++)
 		e = cksum_add(e, *p);
 
 	/* Adjustment value for Int->Ext direction */
 	cfg->adjustment = cksum_add(~e, i);
 }
 
 /*
  * Creates new NPTv6 instance.
  * Data layout (v0)(current):
  * Request: [ ipfw_obj_lheader ipfw_nptv6_cfg ]
  *
  * Returns 0 on success
  */
 static int
 nptv6_create(struct ip_fw_chain *ch, ip_fw3_opheader *op3,
     struct sockopt_data *sd)
 {
 	struct in6_addr mask;
 	ipfw_obj_lheader *olh;
 	ipfw_nptv6_cfg *uc;
 	struct namedobj_instance *ni;
 	struct nptv6_cfg *cfg;
 
 	if (sd->valsize != sizeof(*olh) + sizeof(*uc))
 		return (EINVAL);
 
 	olh = (ipfw_obj_lheader *)sd->kbuf;
 	uc = (ipfw_nptv6_cfg *)(olh + 1);
 	if (ipfw_check_object_name_generic(uc->name) != 0)
 		return (EINVAL);
 	if (uc->plen < 8 || uc->plen > 64 || uc->set >= IPFW_MAX_SETS)
 		return (EINVAL);
 	if (IN6_IS_ADDR_MULTICAST(&uc->internal) ||
 	    IN6_IS_ADDR_MULTICAST(&uc->external) ||
 	    IN6_IS_ADDR_UNSPECIFIED(&uc->internal) ||
 	    IN6_IS_ADDR_UNSPECIFIED(&uc->external) ||
 	    IN6_IS_ADDR_LINKLOCAL(&uc->internal) ||
 	    IN6_IS_ADDR_LINKLOCAL(&uc->external))
 		return (EINVAL);
 	in6_prefixlen2mask(&mask, uc->plen);
 	if (IN6_ARE_MASKED_ADDR_EQUAL(&uc->internal, &uc->external, &mask))
 		return (EINVAL);
 
 	ni = CHAIN_TO_SRV(ch);
 	IPFW_UH_RLOCK(ch);
 	if (nptv6_find(ni, uc->name, uc->set) != NULL) {
 		IPFW_UH_RUNLOCK(ch);
 		return (EEXIST);
 	}
 	IPFW_UH_RUNLOCK(ch);
 
 	cfg = nptv6_alloc_config(uc->name, uc->set);
 	cfg->plen = uc->plen;
 	if (cfg->plen <= 48)
 		cfg->flags |= NPTV6_48PLEN;
 	cfg->internal = uc->internal;
 	cfg->external = uc->external;
 	cfg->mask = mask;
 	IN6_MASK_ADDR(&cfg->internal, &mask);
 	IN6_MASK_ADDR(&cfg->external, &mask);
 	nptv6_calculate_adjustment(cfg);
 
 	IPFW_UH_WLOCK(ch);
 	if (ipfw_objhash_alloc_idx(ni, &cfg->no.kidx) != 0) {
 		IPFW_UH_WUNLOCK(ch);
 		nptv6_free_config(cfg);
 		return (ENOSPC);
 	}
 	ipfw_objhash_add(ni, &cfg->no);
 	IPFW_WLOCK(ch);
 	SRV_OBJECT(ch, cfg->no.kidx) = cfg;
 	IPFW_WUNLOCK(ch);
 	IPFW_UH_WUNLOCK(ch);
 	return (0);
 }
 
 /*
  * Destroys NPTv6 instance.
  * Data layout (v0)(current):
  * Request: [ ipfw_obj_header ]
  *
  * Returns 0 on success
  */
 static int
 nptv6_destroy(struct ip_fw_chain *ch, ip_fw3_opheader *op3,
     struct sockopt_data *sd)
 {
 	ipfw_obj_header *oh;
 	struct nptv6_cfg *cfg;
 
 	if (sd->valsize != sizeof(*oh))
 		return (EINVAL);
 
 	oh = (ipfw_obj_header *)sd->kbuf;
 	if (ipfw_check_object_name_generic(oh->ntlv.name) != 0)
 		return (EINVAL);
 
 	IPFW_UH_WLOCK(ch);
 	cfg = nptv6_find(CHAIN_TO_SRV(ch), oh->ntlv.name, oh->ntlv.set);
 	if (cfg == NULL) {
 		IPFW_UH_WUNLOCK(ch);
 		return (ESRCH);
 	}
 	if (cfg->no.refcnt > 0) {
 		IPFW_UH_WUNLOCK(ch);
 		return (EBUSY);
 	}
 
 	IPFW_WLOCK(ch);
 	SRV_OBJECT(ch, cfg->no.kidx) = NULL;
 	IPFW_WUNLOCK(ch);
 
 	ipfw_objhash_del(CHAIN_TO_SRV(ch), &cfg->no);
 	ipfw_objhash_free_idx(CHAIN_TO_SRV(ch), cfg->no.kidx);
 	IPFW_UH_WUNLOCK(ch);
 
 	nptv6_free_config(cfg);
 	return (0);
 }
 
 /*
  * Get or change nptv6 instance config.
  * Request: [ ipfw_obj_header [ ipfw_nptv6_cfg ] ]
  */
 static int
 nptv6_config(struct ip_fw_chain *chain, ip_fw3_opheader *op,
     struct sockopt_data *sd)
 {
 
 	return (EOPNOTSUPP);
 }
 
 /*
  * Lists all NPTv6 instances currently available in kernel.
  * Data layout (v0)(current):
  * Request: [ ipfw_obj_lheader ]
  * Reply: [ ipfw_obj_lheader ipfw_nptv6_cfg x N ]
  *
  * Returns 0 on success
  */
 static int
 nptv6_list(struct ip_fw_chain *ch, ip_fw3_opheader *op3,
     struct sockopt_data *sd)
 {
 	ipfw_obj_lheader *olh;
 	struct nptv6_dump_arg da;
 
 	/* Check minimum header size */
 	if (sd->valsize < sizeof(ipfw_obj_lheader))
 		return (EINVAL);
 
 	olh = (ipfw_obj_lheader *)ipfw_get_sopt_header(sd, sizeof(*olh));
 
 	IPFW_UH_RLOCK(ch);
 	olh->count = ipfw_objhash_count_type(CHAIN_TO_SRV(ch),
 	    IPFW_TLV_NPTV6_NAME);
 	olh->objsize = sizeof(ipfw_nptv6_cfg);
 	olh->size = sizeof(*olh) + olh->count * olh->objsize;
 
 	if (sd->valsize < olh->size) {
 		IPFW_UH_RUNLOCK(ch);
 		return (ENOMEM);
 	}
 	memset(&da, 0, sizeof(da));
 	da.ch = ch;
 	da.sd = sd;
 	ipfw_objhash_foreach_type(CHAIN_TO_SRV(ch), export_config_cb,
 	    &da, IPFW_TLV_NPTV6_NAME);
 	IPFW_UH_RUNLOCK(ch);
 
 	return (0);
 }
 
 #define	__COPY_STAT_FIELD(_cfg, _stats, _field)	\
 	(_stats)->_field = NPTV6STAT_FETCH(_cfg, _field)
 static void
 export_stats(struct ip_fw_chain *ch, struct nptv6_cfg *cfg,
     struct ipfw_nptv6_stats *stats)
 {
 
 	__COPY_STAT_FIELD(cfg, stats, in2ex);
 	__COPY_STAT_FIELD(cfg, stats, ex2in);
 	__COPY_STAT_FIELD(cfg, stats, dropped);
 }
 
 /*
  * Get NPTv6 statistics.
  * Data layout (v0)(current):
  * Request: [ ipfw_obj_header ]
  * Reply: [ ipfw_obj_header ipfw_obj_ctlv [ uint64_t x N ]]
  *
  * Returns 0 on success
  */
 static int
 nptv6_stats(struct ip_fw_chain *ch, ip_fw3_opheader *op,
     struct sockopt_data *sd)
 {
 	struct ipfw_nptv6_stats stats;
 	struct nptv6_cfg *cfg;
 	ipfw_obj_header *oh;
 	ipfw_obj_ctlv *ctlv;
 	size_t sz;
 
 	sz = sizeof(ipfw_obj_header) + sizeof(ipfw_obj_ctlv) + sizeof(stats);
 	if (sd->valsize % sizeof(uint64_t))
 		return (EINVAL);
 	if (sd->valsize < sz)
 		return (ENOMEM);
 	oh = (ipfw_obj_header *)ipfw_get_sopt_header(sd, sz);
 	if (oh == NULL)
 		return (EINVAL);
 	if (ipfw_check_object_name_generic(oh->ntlv.name) != 0 ||
 	    oh->ntlv.set >= IPFW_MAX_SETS)
 		return (EINVAL);
 	memset(&stats, 0, sizeof(stats));
 
 	IPFW_UH_RLOCK(ch);
 	cfg = nptv6_find(CHAIN_TO_SRV(ch), oh->ntlv.name, oh->ntlv.set);
 	if (cfg == NULL) {
 		IPFW_UH_RUNLOCK(ch);
 		return (ESRCH);
 	}
 	export_stats(ch, cfg, &stats);
 	IPFW_UH_RUNLOCK(ch);
 
 	ctlv = (ipfw_obj_ctlv *)(oh + 1);
 	memset(ctlv, 0, sizeof(*ctlv));
 	ctlv->head.type = IPFW_TLV_COUNTERS;
 	ctlv->head.length = sz - sizeof(ipfw_obj_header);
 	ctlv->count = sizeof(stats) / sizeof(uint64_t);
 	ctlv->objsize = sizeof(uint64_t);
 	ctlv->version = 1;
 	memcpy(ctlv + 1, &stats, sizeof(stats));
 	return (0);
 }
 
 /*
  * Reset NPTv6 statistics.
  * Data layout (v0)(current):
  * Request: [ ipfw_obj_header ]
  *
  * Returns 0 on success
  */
 static int
 nptv6_reset_stats(struct ip_fw_chain *ch, ip_fw3_opheader *op,
     struct sockopt_data *sd)
 {
 	struct nptv6_cfg *cfg;
 	ipfw_obj_header *oh;
 
 	if (sd->valsize != sizeof(*oh))
 		return (EINVAL);
 	oh = (ipfw_obj_header *)sd->kbuf;
 	if (ipfw_check_object_name_generic(oh->ntlv.name) != 0 ||
 	    oh->ntlv.set >= IPFW_MAX_SETS)
 		return (EINVAL);
 
 	IPFW_UH_WLOCK(ch);
 	cfg = nptv6_find(CHAIN_TO_SRV(ch), oh->ntlv.name, oh->ntlv.set);
 	if (cfg == NULL) {
 		IPFW_UH_WUNLOCK(ch);
 		return (ESRCH);
 	}
 	COUNTER_ARRAY_ZERO(cfg->stats, NPTV6STATS);
 	IPFW_UH_WUNLOCK(ch);
 	return (0);
 }
 
 static struct ipfw_sopt_handler	scodes[] = {
 	{ IP_FW_NPTV6_CREATE, 0,	HDIR_SET,	nptv6_create },
 	{ IP_FW_NPTV6_DESTROY,0,	HDIR_SET,	nptv6_destroy },
 	{ IP_FW_NPTV6_CONFIG, 0,	HDIR_BOTH,	nptv6_config },
 	{ IP_FW_NPTV6_LIST,   0,	HDIR_GET,	nptv6_list },
 	{ IP_FW_NPTV6_STATS,  0,	HDIR_GET,	nptv6_stats },
 	{ IP_FW_NPTV6_RESET_STATS,0,	HDIR_SET,	nptv6_reset_stats },
 };
 
 static int
 nptv6_classify(ipfw_insn *cmd, uint16_t *puidx, uint8_t *ptype)
 {
 	ipfw_insn *icmd;
 
 	icmd = cmd - 1;
 	NPTV6_DEBUG("opcode %d, arg1 %d, opcode0 %d, arg1 %d",
 	    cmd->opcode, cmd->arg1, icmd->opcode, icmd->arg1);
 	if (icmd->opcode != O_EXTERNAL_ACTION ||
 	    icmd->arg1 != V_nptv6_eid)
 		return (1);
 
 	*puidx = cmd->arg1;
 	*ptype = 0;
 	return (0);
 }
 
 static void
 nptv6_update_arg1(ipfw_insn *cmd, uint16_t idx)
 {
 
 	cmd->arg1 = idx;
 	NPTV6_DEBUG("opcode %d, arg1 -> %d", cmd->opcode, cmd->arg1);
 }
 
 static int
 nptv6_findbyname(struct ip_fw_chain *ch, struct tid_info *ti,
     struct named_object **pno)
 {
 	int err;
 
 	err = ipfw_objhash_find_type(CHAIN_TO_SRV(ch), ti,
 	    IPFW_TLV_NPTV6_NAME, pno);
 	NPTV6_DEBUG("uidx %u, type %u, err %d", ti->uidx, ti->type, err);
 	return (err);
 }
 
 static struct named_object *
 nptv6_findbykidx(struct ip_fw_chain *ch, uint16_t idx)
 {
 	struct namedobj_instance *ni;
 	struct named_object *no;
 
 	IPFW_UH_WLOCK_ASSERT(ch);
 	ni = CHAIN_TO_SRV(ch);
 	no = ipfw_objhash_lookup_kidx(ni, idx);
 	KASSERT(no != NULL, ("NPT with index %d not found", idx));
 
 	NPTV6_DEBUG("kidx %u -> %s", idx, no->name);
 	return (no);
 }
 
 static int
 nptv6_manage_sets(struct ip_fw_chain *ch, uint16_t set, uint8_t new_set,
     enum ipfw_sets_cmd cmd)
 {
 
 	return (ipfw_obj_manage_sets(CHAIN_TO_SRV(ch), IPFW_TLV_NPTV6_NAME,
 	    set, new_set, cmd));
 }
 
 static struct opcode_obj_rewrite opcodes[] = {
 	{
 		.opcode	= O_EXTERNAL_INSTANCE,
 		.etlv = IPFW_TLV_EACTION /* just show it isn't table */,
 		.classifier = nptv6_classify,
 		.update = nptv6_update_arg1,
 		.find_byname = nptv6_findbyname,
 		.find_bykidx = nptv6_findbykidx,
 		.manage_sets = nptv6_manage_sets,
 	},
 };
 
 static int
 destroy_config_cb(struct namedobj_instance *ni, struct named_object *no,
     void *arg)
 {
 	struct nptv6_cfg *cfg;
 	struct ip_fw_chain *ch;
 
 	ch = (struct ip_fw_chain *)arg;
 	IPFW_UH_WLOCK_ASSERT(ch);
 
 	cfg = (struct nptv6_cfg *)SRV_OBJECT(ch, no->kidx);
 	SRV_OBJECT(ch, no->kidx) = NULL;
 	ipfw_objhash_del(ni, &cfg->no);
 	ipfw_objhash_free_idx(ni, cfg->no.kidx);
 	nptv6_free_config(cfg);
 	return (0);
 }
 
 int
 nptv6_init(struct ip_fw_chain *ch, int first)
 {
 
 	V_nptv6_eid = ipfw_add_eaction(ch, ipfw_nptv6, "nptv6");
 	if (V_nptv6_eid == 0)
 		return (ENXIO);
 	IPFW_ADD_SOPT_HANDLER(first, scodes);
 	IPFW_ADD_OBJ_REWRITER(first, opcodes);
 	return (0);
 }
 
 void
 nptv6_uninit(struct ip_fw_chain *ch, int last)
 {
 
 	IPFW_DEL_OBJ_REWRITER(last, opcodes);
 	IPFW_DEL_SOPT_HANDLER(last, scodes);
 	ipfw_del_eaction(ch, V_nptv6_eid);
 	/*
 	 * Since we already have deregistered external action,
 	 * our named objects become unaccessible via rules, because
 	 * all rules were truncated by ipfw_del_eaction().
 	 * So, we can unlink and destroy our named objects without holding
 	 * IPFW_WLOCK().
 	 */
 	IPFW_UH_WLOCK(ch);
 	ipfw_objhash_foreach_type(CHAIN_TO_SRV(ch), destroy_config_cb, ch,
 	    IPFW_TLV_NPTV6_NAME);
 	V_nptv6_eid = 0;
 	IPFW_UH_WUNLOCK(ch);
 }
 
Index: projects/clang400-import/sys/sys/gtaskqueue.h
===================================================================
--- projects/clang400-import/sys/sys/gtaskqueue.h	(revision 314522)
+++ projects/clang400-import/sys/sys/gtaskqueue.h	(revision 314523)
@@ -1,123 +1,124 @@
 /*-
  * Copyright (c) 2014 Jeffrey Roberson <jeff@freebsd.org>
  * Copyright (c) 2016 Matthew Macy <mmacy@nextbsd.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _SYS_GTASKQUEUE_H_
 #define _SYS_GTASKQUEUE_H_
 #include <sys/taskqueue.h>
 
 #ifndef _KERNEL
 #error "no user-serviceable parts inside"
 #endif
 
 struct gtaskqueue;
 typedef void (*gtaskqueue_enqueue_fn)(void *context);
 
 /*
  * Taskqueue groups.  Manages dynamic thread groups and irq binding for
  * device and other tasks.
  */
 
 void	gtaskqueue_block(struct gtaskqueue *queue);
 void	gtaskqueue_unblock(struct gtaskqueue *queue);
 
 int	gtaskqueue_cancel(struct gtaskqueue *queue, struct gtask *gtask);
 void	gtaskqueue_drain(struct gtaskqueue *queue, struct gtask *task);
 void	gtaskqueue_drain_all(struct gtaskqueue *queue);
 
 int grouptaskqueue_enqueue(struct gtaskqueue *queue, struct gtask *task);
 void	taskqgroup_attach(struct taskqgroup *qgroup, struct grouptask *grptask,
 	    void *uniq, int irq, char *name);
 int		taskqgroup_attach_cpu(struct taskqgroup *qgroup, struct grouptask *grptask,
 		void *uniq, int cpu, int irq, char *name);
 void	taskqgroup_detach(struct taskqgroup *qgroup, struct grouptask *gtask);
 struct taskqgroup *taskqgroup_create(char *name);
 void	taskqgroup_destroy(struct taskqgroup *qgroup);
 int	taskqgroup_adjust(struct taskqgroup *qgroup, int cnt, int stride);
 
 #define TASK_ENQUEUED			0x1
 #define TASK_SKIP_WAKEUP		0x2
 
 
 #define GTASK_INIT(task, flags, priority, func, context) do {	\
 	(task)->ta_flags = flags;				\
 	(task)->ta_priority = (priority);		\
 	(task)->ta_func = (func);			\
 	(task)->ta_context = (context);			\
 } while (0)
 
 #define	GROUPTASK_INIT(gtask, priority, func, context)	\
 	GTASK_INIT(&(gtask)->gt_task, TASK_SKIP_WAKEUP, priority, func, context)
 
 #define	GROUPTASK_ENQUEUE(gtask)			\
 	grouptaskqueue_enqueue((gtask)->gt_taskqueue, &(gtask)->gt_task)
 
 #define TASKQGROUP_DECLARE(name)			\
 extern struct taskqgroup *qgroup_##name
 
 #ifdef EARLY_AP_STARTUP
 #define TASKQGROUP_DEFINE(name, cnt, stride)				\
 									\
 struct taskqgroup *qgroup_##name;					\
 									\
 static void								\
 taskqgroup_define_##name(void *arg)					\
 {									\
 	qgroup_##name = taskqgroup_create(#name);			\
 	taskqgroup_adjust(qgroup_##name, (cnt), (stride));		\
 }									\
 									\
 SYSINIT(taskqgroup_##name, SI_SUB_INIT_IF, SI_ORDER_FIRST,		\
 	taskqgroup_define_##name, NULL)
 #else /* !EARLY_AP_STARTUP */
 #define TASKQGROUP_DEFINE(name, cnt, stride)				\
 									\
 struct taskqgroup *qgroup_##name;					\
 									\
 static void								\
 taskqgroup_define_##name(void *arg)					\
 {									\
 	qgroup_##name = taskqgroup_create(#name);			\
 }									\
 									\
 SYSINIT(taskqgroup_##name, SI_SUB_INIT_IF, SI_ORDER_FIRST,		\
 	taskqgroup_define_##name, NULL);				\
 									\
 static void								\
 taskqgroup_adjust_##name(void *arg)					\
 {									\
 	taskqgroup_adjust(qgroup_##name, (cnt), (stride));		\
 }									\
 									\
 SYSINIT(taskqgroup_adj_##name, SI_SUB_SMP, SI_ORDER_ANY,		\
 	taskqgroup_adjust_##name, NULL)
 #endif /* EARLY_AP_STARTUP */
 
 TASKQGROUP_DECLARE(net);
+TASKQGROUP_DECLARE(softirq);
 
 #endif /* !_SYS_GTASKQUEUE_H_ */
Index: projects/clang400-import/sys/sys/sysent.h
===================================================================
--- projects/clang400-import/sys/sys/sysent.h	(revision 314522)
+++ projects/clang400-import/sys/sys/sysent.h	(revision 314523)
@@ -1,287 +1,287 @@
 /*-
  * Copyright (c) 1982, 1988, 1991 The Regents of the University of California.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _SYS_SYSENT_H_
 #define	_SYS_SYSENT_H_
 
 #include <bsm/audit.h>
 
 struct rlimit;
 struct sysent;
 struct thread;
 struct ksiginfo;
 struct syscall_args;
 
 enum systrace_probe_t {
 	SYSTRACE_ENTRY,
 	SYSTRACE_RETURN,
 };
 
 typedef	int	sy_call_t(struct thread *, void *);
 
 typedef	void	(*systrace_probe_func_t)(struct syscall_args *,
 		    enum systrace_probe_t, int);
 typedef	void	(*systrace_args_func_t)(int, void *, uint64_t *, int *);
 
 extern systrace_probe_func_t	systrace_probe_func;
 
 struct sysent {			/* system call table */
 	int	sy_narg;	/* number of arguments */
 	sy_call_t *sy_call;	/* implementing function */
 	au_event_t sy_auevent;	/* audit event associated with syscall */
 	systrace_args_func_t sy_systrace_args_func;
 				/* optional argument conversion function. */
 	u_int32_t sy_entry;	/* DTrace entry ID for systrace. */
 	u_int32_t sy_return;	/* DTrace return ID for systrace. */
 	u_int32_t sy_flags;	/* General flags for system calls. */
 	u_int32_t sy_thrcnt;
 };
 
 /*
  * A system call is permitted in capability mode.
  */
 #define	SYF_CAPENABLED	0x00000001
 
 #define	SY_THR_FLAGMASK	0x7
 #define	SY_THR_STATIC	0x1
 #define	SY_THR_DRAINING	0x2
 #define	SY_THR_ABSENT	0x4
 #define	SY_THR_INCR	0x8
 
 #ifdef KLD_MODULE
 #define	SY_THR_STATIC_KLD	0
 #else
 #define	SY_THR_STATIC_KLD	SY_THR_STATIC
 #endif
 
 struct image_params;
 struct __sigset;
 struct trapframe;
 struct vnode;
 
 struct sysentvec {
 	int		sv_size;	/* number of entries */
 	struct sysent	*sv_table;	/* pointer to sysent */
 	u_int		sv_mask;	/* optional mask to index */
 	int		sv_errsize;	/* size of errno translation table */
 	int 		*sv_errtbl;	/* errno translation table */
 	int		(*sv_transtrap)(int, int);
 					/* translate trap-to-signal mapping */
 	int		(*sv_fixup)(register_t **, struct image_params *);
 					/* stack fixup function */
 	void		(*sv_sendsig)(void (*)(int), struct ksiginfo *, struct __sigset *);
 			    		/* send signal */
 	char 		*sv_sigcode;	/* start of sigtramp code */
 	int 		*sv_szsigcode;	/* size of sigtramp code */
 	char		*sv_name;	/* name of binary type */
 	int		(*sv_coredump)(struct thread *, struct vnode *, off_t, int);
 					/* function to dump core, or NULL */
 	int		(*sv_imgact_try)(struct image_params *);
 	int		sv_minsigstksz;	/* minimum signal stack size */
 	int		sv_pagesize;	/* pagesize */
 	vm_offset_t	sv_minuser;	/* VM_MIN_ADDRESS */
 	vm_offset_t	sv_maxuser;	/* VM_MAXUSER_ADDRESS */
 	vm_offset_t	sv_usrstack;	/* USRSTACK */
 	vm_offset_t	sv_psstrings;	/* PS_STRINGS */
 	int		sv_stackprot;	/* vm protection for stack */
 	register_t	*(*sv_copyout_strings)(struct image_params *);
 	void		(*sv_setregs)(struct thread *, struct image_params *,
 			    u_long);
 	void		(*sv_fixlimit)(struct rlimit *, int);
 	u_long		*sv_maxssiz;
 	u_int		sv_flags;
 	void		(*sv_set_syscall_retval)(struct thread *, int);
 	int		(*sv_fetch_syscall_args)(struct thread *, struct
 			    syscall_args *);
 	const char	**sv_syscallnames;
 	vm_offset_t	sv_timekeep_base;
 	vm_offset_t	sv_shared_page_base;
 	vm_offset_t	sv_shared_page_len;
 	vm_offset_t	sv_sigcode_base;
 	void		*sv_shared_page_obj;
 	void		(*sv_schedtail)(struct thread *);
 	void		(*sv_thread_detach)(struct thread *);
 	int		(*sv_trap)(struct thread *);
 };
 
 #define	SV_ILP32	0x000100	/* 32-bit executable. */
 #define	SV_LP64		0x000200	/* 64-bit executable. */
 #define	SV_IA32		0x004000	/* Intel 32-bit executable. */
 #define	SV_AOUT		0x008000	/* a.out executable. */
 #define	SV_SHP		0x010000	/* Shared page. */
 #define	SV_CAPSICUM	0x020000	/* Force cap_enter() on startup. */
-#define	SV_TIMEKEEP	0x040000
+#define	SV_TIMEKEEP	0x040000	/* Shared page timehands. */
 
 #define	SV_ABI_MASK	0xff
 #define	SV_ABI_ERRNO(p, e)	((p)->p_sysent->sv_errsize <= 0 ? e :	\
 	((e) >= (p)->p_sysent->sv_errsize ? -1 : (p)->p_sysent->sv_errtbl[e]))
 #define	SV_PROC_FLAG(p, x)	((p)->p_sysent->sv_flags & (x))
 #define	SV_PROC_ABI(p)		((p)->p_sysent->sv_flags & SV_ABI_MASK)
 #define	SV_CURPROC_FLAG(x)	SV_PROC_FLAG(curproc, x)
 #define	SV_CURPROC_ABI()	SV_PROC_ABI(curproc)
 /* same as ELFOSABI_XXX, to prevent header pollution */
 #define	SV_ABI_LINUX	3
 #define	SV_ABI_FREEBSD 	9
 #define	SV_ABI_CLOUDABI	17
 #define	SV_ABI_UNDEF	255
 
 #ifdef _KERNEL
 extern struct sysentvec aout_sysvec;
 extern struct sysent sysent[];
 extern const char *syscallnames[];
 
 #if defined(__amd64__)
 extern int i386_read_exec;
 #endif
 
 #define	NO_SYSCALL (-1)
 
 struct module;
 
 struct syscall_module_data {
 	int	(*chainevh)(struct module *, int, void *); /* next handler */
 	void	*chainarg;		/* arg for next event handler */
 	int	*offset;		/* offset into sysent */
 	struct sysent *new_sysent;	/* new sysent */
 	struct sysent old_sysent;	/* old sysent */
 	int	flags;			/* flags for syscall_register */
 };
 
 /* separate initialization vector so it can be used in a substructure */
 #define SYSENT_INIT_VALS(_syscallname) {			\
 	.sy_narg = (sizeof(struct _syscallname ## _args )	\
 	    / sizeof(register_t)),				\
 	.sy_call = (sy_call_t *)&sys_##_syscallname,		\
 	.sy_auevent = SYS_AUE_##_syscallname,			\
 	.sy_systrace_args_func = NULL,				\
 	.sy_entry = 0,						\
 	.sy_return = 0,						\
 	.sy_flags = 0,						\
 	.sy_thrcnt = 0						\
 }							
 
 #define	MAKE_SYSENT(syscallname)				\
 static struct sysent syscallname##_sysent = SYSENT_INIT_VALS(syscallname);
 
 #define	MAKE_SYSENT_COMPAT(syscallname)				\
 static struct sysent syscallname##_sysent = {			\
 	(sizeof(struct syscallname ## _args )			\
 	    / sizeof(register_t)),				\
 	(sy_call_t *)& syscallname,				\
 	SYS_AUE_##syscallname					\
 }
 
 #define SYSCALL_MODULE(name, offset, new_sysent, evh, arg)	\
 static struct syscall_module_data name##_syscall_mod = {	\
 	evh, arg, offset, new_sysent, { 0, NULL, AUE_NULL }	\
 };								\
 								\
 static moduledata_t name##_mod = {				\
 	"sys/" #name,						\
 	syscall_module_handler,					\
 	&name##_syscall_mod					\
 };								\
 DECLARE_MODULE(name, name##_mod, SI_SUB_SYSCALLS, SI_ORDER_MIDDLE)
 
 #define	SYSCALL_MODULE_HELPER(syscallname)			\
 static int syscallname##_syscall = SYS_##syscallname;		\
 MAKE_SYSENT(syscallname);					\
 SYSCALL_MODULE(syscallname,					\
     & syscallname##_syscall, & syscallname##_sysent,		\
     NULL, NULL)
 
 #define	SYSCALL_MODULE_PRESENT(syscallname)				\
 	(sysent[SYS_##syscallname].sy_call != (sy_call_t *)lkmnosys &&	\
 	sysent[SYS_##syscallname].sy_call != (sy_call_t *)lkmressys)
 
 /*
  * Syscall registration helpers with resource allocation handling.
  */
 struct syscall_helper_data {
 	struct sysent new_sysent;
 	struct sysent old_sysent;
 	int syscall_no;
 	int registered;
 };
 #define SYSCALL_INIT_HELPER(syscallname) {			\
     .new_sysent = {						\
 	.sy_narg = (sizeof(struct syscallname ## _args )	\
 	    / sizeof(register_t)),				\
 	.sy_call = (sy_call_t *)& sys_ ## syscallname,		\
 	.sy_auevent = SYS_AUE_##syscallname			\
     },								\
     .syscall_no = SYS_##syscallname				\
 }
 #define SYSCALL_INIT_HELPER_COMPAT(syscallname) {		\
     .new_sysent = {						\
 	.sy_narg = (sizeof(struct syscallname ## _args )	\
 	    / sizeof(register_t)),				\
 	.sy_call = (sy_call_t *)& syscallname,			\
 	.sy_auevent = SYS_AUE_##syscallname			\
     },								\
     .syscall_no = SYS_##syscallname				\
 }
 #define SYSCALL_INIT_LAST {					\
     .syscall_no = NO_SYSCALL					\
 }
 
 int	syscall_register(int *offset, struct sysent *new_sysent,
 	    struct sysent *old_sysent, int flags);
 int	syscall_deregister(int *offset, struct sysent *old_sysent);
 int	syscall_module_handler(struct module *mod, int what, void *arg);
 int	syscall_helper_register(struct syscall_helper_data *sd, int flags);
 int	syscall_helper_unregister(struct syscall_helper_data *sd);
 
 struct proc;
 const char *syscallname(struct proc *p, u_int code);
 
 /* Special purpose system call functions. */
 struct nosys_args;
 
 int	lkmnosys(struct thread *, struct nosys_args *);
 int	lkmressys(struct thread *, struct nosys_args *);
 
 int	syscall_thread_enter(struct thread *td, struct sysent *se);
 void	syscall_thread_exit(struct thread *td, struct sysent *se);
 
 int shared_page_alloc(int size, int align);
 int shared_page_fill(int size, int align, const void *data);
 void shared_page_write(int base, int size, const void *data);
 void exec_sysvec_init(void *param);
 void exec_inittk(void);
 
 #define INIT_SYSENTVEC(name, sv)					\
     SYSINIT(name, SI_SUB_EXEC, SI_ORDER_ANY,				\
 	(sysinit_cfunc_t)exec_sysvec_init, sv);
 
 #endif /* _KERNEL */
 
 #endif /* !_SYS_SYSENT_H_ */
Index: projects/clang400-import
===================================================================
--- projects/clang400-import	(revision 314522)
+++ projects/clang400-import	(revision 314523)

Property changes on: projects/clang400-import
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /head:r314482-314522