Index: projects/clang400-import/README.md =================================================================== --- projects/clang400-import/README.md (revision 314522) +++ projects/clang400-import/README.md (revision 314523) @@ -1,86 +1,86 @@ FreeBSD Source: --------------- -This is the top level of the FreeBSD source directory. This file -was last revised on: +This is the top level of the FreeBSD source directory. This file +was last revised on: $FreeBSD$ -For copyright information, please see the file COPYRIGHT in this -directory (additional copyright information also exists for some -sources in this tree - please see the specific source directories for +For copyright information, please see the file COPYRIGHT in this +directory (additional copyright information also exists for some +sources in this tree - please see the specific source directories for more information). -The Makefile in this directory supports a number of targets for -building components (or all) of the FreeBSD source tree. See build(7) -and http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/makeworld.html -for more information, including setting make(1) variables. +The Makefile in this directory supports a number of targets for +building components (or all) of the FreeBSD source tree. See build(7) +and http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/makeworld.html +for more information, including setting make(1) variables. -The `buildkernel` and `installkernel` targets build and install -the kernel and the modules (see below). Please see the top of -the Makefile in this directory for more information on the +The `buildkernel` and `installkernel` targets build and install +the kernel and the modules (see below). Please see the top of +the Makefile in this directory for more information on the standard build targets and compile-time flags. -Building a kernel is a somewhat more involved process. See build(7), config(8), -and http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/kernelconfig.html +Building a kernel is a somewhat more involved process. See build(7), config(8), +and http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/kernelconfig.html for more information. -Note: If you want to build and install the kernel with the -`buildkernel` and `installkernel` targets, you might need to build +Note: If you want to build and install the kernel with the +`buildkernel` and `installkernel` targets, you might need to build world before. More information is available in the handbook. -The kernel configuration files reside in the `sys//conf` -sub-directory. GENERIC is the default configuration used in release builds. -NOTES contains entries and documentation for all possible +The kernel configuration files reside in the `sys//conf` +sub-directory. GENERIC is the default configuration used in release builds. +NOTES contains entries and documentation for all possible devices, not just those commonly used. Source Roadmap: --------------- ``` bin System/user commands. cddl Various commands and libraries under the Common Development and Distribution License. contrib Packages contributed by 3rd parties. crypto Cryptography stuff (see crypto/README). etc Template files for /etc. gnu Various commands and libraries under the GNU Public License. Please see gnu/COPYING* for more information. include System include files. kerberos5 Kerberos5 (Heimdal) package. lib System libraries. libexec System daemons. release Release building Makefile & associated tools. rescue Build system for statically linked /rescue utilities. sbin System commands. secure Cryptographic libraries and commands. share Shared resources. sys Kernel sources. tests Regression tests which can be run by Kyua. See tests/README for additional information. tools Utilities for regression testing and miscellaneous tasks. usr.bin User commands. usr.sbin System administration commands. ``` -For information on synchronizing your source tree with one or more of +For information on synchronizing your source tree with one or more of the FreeBSD Project's development branches, please see: http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/synching.html Index: projects/clang400-import/contrib/dma/dma.c =================================================================== --- projects/clang400-import/contrib/dma/dma.c (revision 314522) +++ projects/clang400-import/contrib/dma/dma.c (revision 314523) @@ -1,632 +1,633 @@ /* * Copyright (c) 2008-2014, Simon Schubert <2@0x2c.org>. * Copyright (c) 2008 The DragonFly Project. All rights reserved. * * This code is derived from software contributed to The DragonFly Project * by Simon Schubert <2@0x2c.org>. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in * the documentation and/or other materials provided with the * distribution. * 3. Neither the name of The DragonFly Project nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific, prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include "dfcompat.h" #include #include #include #include #include #include #include #include #include #include #include +#include #include #include #include #include #include #include #include #include #include #include "dma.h" static void deliver(struct qitem *); struct aliases aliases = LIST_HEAD_INITIALIZER(aliases); struct strlist tmpfs = SLIST_HEAD_INITIALIZER(tmpfs); struct authusers authusers = LIST_HEAD_INITIALIZER(authusers); char username[USERNAME_SIZE]; uid_t useruid; const char *logident_base; char errmsg[ERRMSG_SIZE]; static int daemonize = 1; static int doqueue = 0; struct config config = { .smarthost = NULL, .port = 25, .aliases = "/etc/aliases", .spooldir = "/var/spool/dma", .authpath = NULL, .certfile = NULL, .features = 0, .mailname = NULL, .masquerade_host = NULL, .masquerade_user = NULL, }; static void sighup_handler(int signo) { (void)signo; /* so that gcc doesn't complain */ } static char * set_from(struct queue *queue, const char *osender) { const char *addr; char *sender; if (osender) { addr = osender; } else if (getenv("EMAIL") != NULL) { addr = getenv("EMAIL"); } else { if (config.masquerade_user) addr = config.masquerade_user; else addr = username; } if (!strchr(addr, '@')) { const char *from_host = hostname(); if (config.masquerade_host) from_host = config.masquerade_host; if (asprintf(&sender, "%s@%s", addr, from_host) <= 0) return (NULL); } else { sender = strdup(addr); if (sender == NULL) return (NULL); } if (strchr(sender, '\n') != NULL) { errno = EINVAL; return (NULL); } queue->sender = sender; return (sender); } static int read_aliases(void) { yyin = fopen(config.aliases, "r"); if (yyin == NULL) { /* * Non-existing aliases file is not a fatal error */ if (errno == ENOENT) return (0); /* Other problems are. */ return (-1); } if (yyparse()) return (-1); /* fatal error, probably malloc() */ fclose(yyin); return (0); } static int do_alias(struct queue *queue, const char *addr) { struct alias *al; struct stritem *sit; int aliased = 0; LIST_FOREACH(al, &aliases, next) { if (strcmp(al->alias, addr) != 0) continue; SLIST_FOREACH(sit, &al->dests, next) { if (add_recp(queue, sit->str, EXPAND_ADDR) != 0) return (-1); } aliased = 1; } return (aliased); } int add_recp(struct queue *queue, const char *str, int expand) { struct qitem *it, *tit; struct passwd *pw; char *host; int aliased = 0; it = calloc(1, sizeof(*it)); if (it == NULL) return (-1); it->addr = strdup(str); if (it->addr == NULL) return (-1); it->sender = queue->sender; host = strrchr(it->addr, '@'); if (host != NULL && (strcmp(host + 1, hostname()) == 0 || strcmp(host + 1, "localhost") == 0)) { *host = 0; } LIST_FOREACH(tit, &queue->queue, next) { /* weed out duplicate dests */ if (strcmp(tit->addr, it->addr) == 0) { free(it->addr); free(it); return (0); } } LIST_INSERT_HEAD(&queue->queue, it, next); /** * Do local delivery if there is no @. * Do not do local delivery when NULLCLIENT is set. */ if (strrchr(it->addr, '@') == NULL && (config.features & NULLCLIENT) == 0) { it->remote = 0; if (expand) { aliased = do_alias(queue, it->addr); if (!aliased && expand == EXPAND_WILDCARD) aliased = do_alias(queue, "*"); if (aliased < 0) return (-1); if (aliased) { LIST_REMOVE(it, next); } else { /* Local destination, check */ pw = getpwnam(it->addr); if (pw == NULL) goto out; /* XXX read .forward */ endpwent(); } } } else { it->remote = 1; } return (0); out: free(it->addr); free(it); return (-1); } static struct qitem * go_background(struct queue *queue) { struct sigaction sa; struct qitem *it; pid_t pid; if (daemonize && daemon(0, 0) != 0) { syslog(LOG_ERR, "can not daemonize: %m"); exit(EX_OSERR); } daemonize = 0; bzero(&sa, sizeof(sa)); sa.sa_handler = SIG_IGN; sigaction(SIGCHLD, &sa, NULL); LIST_FOREACH(it, &queue->queue, next) { /* No need to fork for the last dest */ if (LIST_NEXT(it, next) == NULL) goto retit; pid = fork(); switch (pid) { case -1: syslog(LOG_ERR, "can not fork: %m"); exit(EX_OSERR); break; case 0: /* * Child: * * return and deliver mail */ retit: /* * If necessary, acquire the queue and * mail files. * If this fails, we probably were raced by another * process. It is okay to be raced if we're supposed * to flush the queue. */ setlogident("%s", it->queueid); switch (acquirespool(it)) { case 0: break; case 1: if (doqueue) exit(EX_OK); syslog(LOG_WARNING, "could not lock queue file"); exit(EX_SOFTWARE); default: exit(EX_SOFTWARE); } dropspool(queue, it); return (it); default: /* * Parent: * * fork next child */ break; } } syslog(LOG_CRIT, "reached dead code"); exit(EX_SOFTWARE); } static void deliver(struct qitem *it) { int error; unsigned int backoff = MIN_RETRY, slept; struct timeval now; struct stat st; snprintf(errmsg, sizeof(errmsg), "unknown bounce reason"); retry: syslog(LOG_INFO, "<%s> trying delivery", it->addr); if (it->remote) error = deliver_remote(it); else error = deliver_local(it); switch (error) { case 0: delqueue(it); syslog(LOG_INFO, "<%s> delivery successful", it->addr); exit(EX_OK); case 1: if (stat(it->queuefn, &st) != 0) { syslog(LOG_ERR, "lost queue file `%s'", it->queuefn); exit(EX_SOFTWARE); } if (gettimeofday(&now, NULL) == 0 && (now.tv_sec - st.st_mtim.tv_sec > MAX_TIMEOUT)) { snprintf(errmsg, sizeof(errmsg), "Could not deliver for the last %d seconds. Giving up.", MAX_TIMEOUT); goto bounce; } for (slept = 0; slept < backoff;) { slept += SLEEP_TIMEOUT - sleep(SLEEP_TIMEOUT); if (flushqueue_since(slept)) { backoff = MIN_RETRY; goto retry; } } if (slept >= backoff) { /* pick the next backoff between [1.5, 2.5) times backoff */ backoff = backoff + backoff / 2 + random() % backoff; if (backoff > MAX_RETRY) backoff = MAX_RETRY; } goto retry; case -1: default: break; } bounce: bounce(it, errmsg); /* NOTREACHED */ } void run_queue(struct queue *queue) { struct qitem *it; if (LIST_EMPTY(&queue->queue)) return; it = go_background(queue); deliver(it); /* NOTREACHED */ } static void show_queue(struct queue *queue) { struct qitem *it; int locked = 0; /* XXX */ if (LIST_EMPTY(&queue->queue)) { printf("Mail queue is empty\n"); return; } LIST_FOREACH(it, &queue->queue, next) { printf("ID\t: %s%s\n" "From\t: %s\n" "To\t: %s\n", it->queueid, locked ? "*" : "", it->sender, it->addr); if (LIST_NEXT(it, next) != NULL) printf("--\n"); } } /* * TODO: * * - alias processing * - use group permissions * - proper sysexit codes */ int main(int argc, char **argv) { struct sigaction act; char *sender = NULL; struct queue queue; int i, ch; int nodot = 0, showq = 0, queue_only = 0; int recp_from_header = 0; set_username(); /* * We never run as root. If called by root, drop permissions * to the mail user. */ if (geteuid() == 0 || getuid() == 0) { struct passwd *pw; errno = 0; pw = getpwnam(DMA_ROOT_USER); if (pw == NULL) { if (errno == 0) errx(EX_CONFIG, "user '%s' not found", DMA_ROOT_USER); else err(EX_OSERR, "cannot drop root privileges"); } if (setuid(pw->pw_uid) != 0) err(EX_OSERR, "cannot drop root privileges"); if (geteuid() == 0 || getuid() == 0) errx(EX_OSERR, "cannot drop root privileges"); } atexit(deltmp); init_random(); bzero(&queue, sizeof(queue)); LIST_INIT(&queue.queue); - if (strcmp(argv[0], "mailq") == 0) { + if (strcmp(basename(argv[0]), "mailq") == 0) { argv++; argc--; showq = 1; if (argc != 0) errx(EX_USAGE, "invalid arguments"); goto skipopts; } else if (strcmp(argv[0], "newaliases") == 0) { logident_base = "dma"; setlogident("%s", logident_base); if (read_aliases() != 0) errx(EX_SOFTWARE, "could not parse aliases file `%s'", config.aliases); exit(EX_OK); } opterr = 0; while ((ch = getopt(argc, argv, ":A:b:B:C:d:Df:F:h:iL:N:no:O:q:r:R:tUV:vX:")) != -1) { switch (ch) { case 'A': /* -AX is being ignored, except for -A{c,m} */ if (optarg[0] == 'c' || optarg[0] == 'm') { break; } /* else FALLTRHOUGH */ case 'b': /* -bX is being ignored, except for -bp */ if (optarg[0] == 'p') { showq = 1; break; } else if (optarg[0] == 'q') { queue_only = 1; break; } /* else FALLTRHOUGH */ case 'D': daemonize = 0; break; case 'L': logident_base = optarg; break; case 'f': case 'r': sender = optarg; break; case 't': recp_from_header = 1; break; case 'o': /* -oX is being ignored, except for -oi */ if (optarg[0] != 'i') break; /* else FALLTRHOUGH */ case 'O': break; case 'i': nodot = 1; break; case 'q': /* Don't let getopt slup up other arguments */ if (optarg && *optarg == '-') optind--; doqueue = 1; break; /* Ignored options */ case 'B': case 'C': case 'd': case 'F': case 'h': case 'N': case 'n': case 'R': case 'U': case 'V': case 'v': case 'X': break; case ':': if (optopt == 'q') { doqueue = 1; break; } /* FALLTHROUGH */ default: fprintf(stderr, "invalid argument: `-%c'\n", optopt); exit(EX_USAGE); } } argc -= optind; argv += optind; opterr = 1; if (argc != 0 && (showq || doqueue)) errx(EX_USAGE, "sending mail and queue operations are mutually exclusive"); if (showq + doqueue > 1) errx(EX_USAGE, "conflicting queue operations"); skipopts: if (logident_base == NULL) logident_base = "dma"; setlogident("%s", logident_base); act.sa_handler = sighup_handler; act.sa_flags = 0; sigemptyset(&act.sa_mask); if (sigaction(SIGHUP, &act, NULL) != 0) syslog(LOG_WARNING, "can not set signal handler: %m"); parse_conf(CONF_PATH "/dma.conf"); if (config.authpath != NULL) parse_authfile(config.authpath); if (showq) { if (load_queue(&queue) < 0) errlog(EX_NOINPUT, "can not load queue"); show_queue(&queue); return (0); } if (doqueue) { flushqueue_signal(); if (load_queue(&queue) < 0) errlog(EX_NOINPUT, "can not load queue"); run_queue(&queue); return (0); } if (read_aliases() != 0) errlog(EX_SOFTWARE, "could not parse aliases file `%s'", config.aliases); if ((sender = set_from(&queue, sender)) == NULL) errlog(EX_SOFTWARE, "set_from()"); if (newspoolf(&queue) != 0) errlog(EX_CANTCREAT, "can not create temp file in `%s'", config.spooldir); setlogident("%s", queue.id); for (i = 0; i < argc; i++) { if (add_recp(&queue, argv[i], EXPAND_WILDCARD) != 0) errlogx(EX_DATAERR, "invalid recipient `%s'", argv[i]); } if (LIST_EMPTY(&queue.queue) && !recp_from_header) errlogx(EX_NOINPUT, "no recipients"); if (readmail(&queue, nodot, recp_from_header) != 0) errlog(EX_NOINPUT, "can not read mail"); if (LIST_EMPTY(&queue.queue)) errlogx(EX_NOINPUT, "no recipients"); if (linkspool(&queue) != 0) errlog(EX_CANTCREAT, "can not create spools"); /* From here on the mail is safe. */ if (config.features & DEFER || queue_only) return (0); run_queue(&queue); /* NOTREACHED */ return (0); } Index: projects/clang400-import/contrib/dma =================================================================== --- projects/clang400-import/contrib/dma (revision 314522) +++ projects/clang400-import/contrib/dma (revision 314523) Property changes on: projects/clang400-import/contrib/dma ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,2 ## Merged /head/contrib/dma:r311132-312925,312928-314522 Merged /vendor/dma/dist:r306541-314519 Index: projects/clang400-import/sys/boot/arm/uboot/Makefile =================================================================== --- projects/clang400-import/sys/boot/arm/uboot/Makefile (revision 314522) +++ projects/clang400-import/sys/boot/arm/uboot/Makefile (revision 314523) @@ -1,156 +1,157 @@ # $FreeBSD$ .include FILES= ubldr ubldr.bin NEWVERSWHAT= "U-Boot loader" ${MACHINE_ARCH} BINDIR?= /boot INSTALLFLAGS= -b WARNS?= 1 # Address at which ubldr will be loaded. # This varies for different boards and SOCs. UBLDR_LOADADDR?= 0x1000000 # Architecture-specific loader code SRCS= start.S conf.c self_reloc.c vers.c .if !defined(LOADER_NO_DISK_SUPPORT) LOADER_DISK_SUPPORT?= yes .else LOADER_DISK_SUPPORT= no .endif LOADER_UFS_SUPPORT?= yes LOADER_CD9660_SUPPORT?= no LOADER_EXT2FS_SUPPORT?= no .if ${MK_NAND} != "no" LOADER_NANDFS_SUPPORT?= yes .else LOADER_NANDFS_SUPPORT?= no .endif LOADER_NET_SUPPORT?= yes LOADER_NFS_SUPPORT?= yes LOADER_TFTP_SUPPORT?= no LOADER_GZIP_SUPPORT?= no LOADER_BZIP2_SUPPORT?= no .if ${MK_FDT} != "no" LOADER_FDT_SUPPORT= yes .else LOADER_FDT_SUPPORT= no .endif .if ${LOADER_DISK_SUPPORT} == "yes" CFLAGS+= -DLOADER_DISK_SUPPORT .endif .if ${LOADER_UFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_UFS_SUPPORT .endif .if ${LOADER_CD9660_SUPPORT} == "yes" CFLAGS+= -DLOADER_CD9660_SUPPORT .endif .if ${LOADER_EXT2FS_SUPPORT} == "yes" CFLAGS+= -DLOADER_EXT2FS_SUPPORT .endif .if ${LOADER_NANDFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_NANDFS_SUPPORT .endif .if ${LOADER_GZIP_SUPPORT} == "yes" CFLAGS+= -DLOADER_GZIP_SUPPORT .endif .if ${LOADER_BZIP2_SUPPORT} == "yes" CFLAGS+= -DLOADER_BZIP2_SUPPORT .endif .if ${LOADER_NET_SUPPORT} == "yes" CFLAGS+= -DLOADER_NET_SUPPORT .endif .if ${LOADER_NFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_NFS_SUPPORT .endif .if ${LOADER_TFTP_SUPPORT} == "yes" CFLAGS+= -DLOADER_TFTP_SUPPORT .endif .if ${LOADER_FDT_SUPPORT} == "yes" CFLAGS+= -I${.CURDIR}/../../fdt CFLAGS+= -I${.OBJDIR}/../../fdt CFLAGS+= -DLOADER_FDT_SUPPORT LIBUBOOT_FDT= ${.OBJDIR}/../../uboot/fdt/libuboot_fdt.a LIBFDT= ${.OBJDIR}/../../fdt/libfdt.a .endif .if ${MK_FORTH} != "no" # Enable BootForth BOOT_FORTH= yes -CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/arm +CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl +CFLAGS+= -I${.CURDIR}/../../ficl/arm LIBFICL= ${.OBJDIR}/../../ficl/libficl.a .endif # Always add MI sources .PATH: ${.CURDIR}/../../common .include "${.CURDIR}/../../common/Makefile.inc" CFLAGS+= -I${.CURDIR}/../../common CFLAGS+= -I. CLEANFILES+= loader.help CFLAGS+= -ffreestanding -msoft-float LDFLAGS= -nostdlib -static -T ${.CURDIR}/ldscript.${MACHINE_CPUARCH} # Pull in common loader code .PATH: ${.CURDIR}/../../uboot/common .include "${.CURDIR}/../../uboot/common/Makefile.inc" CFLAGS+= -I${.CURDIR}/../../uboot/common # U-Boot standalone support library LIBUBOOT= ${.OBJDIR}/../../uboot/lib/libuboot.a CFLAGS+= -I${.CURDIR}/../../uboot/lib CFLAGS+= -I${.OBJDIR}/../../uboot/lib # where to get libstand from CFLAGS+= -I${.CURDIR}/../../../../lib/libstand/ CFLAGS+= -fPIC # clang doesn't understand %D as a specifier to printf NO_WERROR.clang= DPADD= ${LIBFICL} ${LIBUBOOT} ${LIBFDT} ${LIBUBOOT_FDT} ${LIBSTAND} LDADD= ${LIBFICL} ${LIBUBOOT} ${LIBFDT} ${LIBUBOOT_FDT} -lstand OBJS+= ${SRCS:N*.h:R:S/$/.o/g} loader.help: help.common help.uboot ${.CURDIR}/../../fdt/help.fdt cat ${.ALLSRC} | \ awk -f ${.CURDIR}/../../common/merge_help.awk > ${.TARGET} ldscript.abs: echo "UBLDR_LOADADDR = ${UBLDR_LOADADDR};" >${.TARGET} ldscript.pie: echo "UBLDR_LOADADDR = 0;" >${.TARGET} ubldr: ${OBJS} ldscript.abs ${.CURDIR}/ldscript.${MACHINE_CPUARCH} ${DPADD} ${CC} ${CFLAGS} -T ldscript.abs ${LDFLAGS} \ -o ${.TARGET} ${OBJS} ${LDADD} ubldr.pie: ${OBJS} ldscript.pie ${.CURDIR}/ldscript.${MACHINE_CPUARCH} ${DPADD} ${CC} ${CFLAGS} -T ldscript.pie ${LDFLAGS} -pie -Wl,-Bsymbolic \ -o ${.TARGET} ${OBJS} ${LDADD} ubldr.bin: ubldr.pie ${OBJCOPY} -S -O binary ubldr.pie ${.TARGET} CLEANFILES+= ldscript.abs ldscript.pie ubldr ubldr.pie ubldr.bin .if !defined(LOADER_ONLY) .PATH: ${.CURDIR}/../../forth .include "${.CURDIR}/../../forth/Makefile.inc" # Install loader.rc. FILES+= loader.rc # Put sample menu.rc on disk but don't enable it by default. FILES+= menu.rc FILESNAME_menu.rc= menu.rc.sample .endif .include Index: projects/clang400-import/sys/boot/fdt/dts/arm/socfpga_arria10_socdk_sdmmc.dts =================================================================== --- projects/clang400-import/sys/boot/fdt/dts/arm/socfpga_arria10_socdk_sdmmc.dts (revision 314522) +++ projects/clang400-import/sys/boot/fdt/dts/arm/socfpga_arria10_socdk_sdmmc.dts (revision 314523) @@ -1,82 +1,86 @@ /*- * Copyright (c) 2017 Ruslan Bukin * All rights reserved. * * This software was developed by SRI International and the University of * Cambridge Computer Laboratory under DARPA/AFRL contract FA8750-10-C-0237 * ("CTSRD"), as part of the DARPA CRASH research programme. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /dts-v1/; #include "socfpga_arria10_socdk.dtsi" / { model = "Altera SOCFPGA Arria 10"; compatible = "altr,socfpga-arria10", "altr,socfpga"; /* Reserve first page for secondary CPU trampoline code */ memreserve = < 0x00000000 0x1000 >; soc { /* Local timer */ timer@ffffc600 { clock-frequency = <200000000>; }; /* Global timer */ global_timer: timer@ffffc200 { compatible = "arm,cortex-a9-global-timer"; reg = <0xffffc200 0x20>; interrupts = <1 11 0x301>; clock-frequency = <200000000>; }; }; chosen { stdin = "serial1"; stdout = "serial1"; }; }; &uart1 { clock-frequency = < 50000000 >; }; &mmc { status = "okay"; num-slots = <1>; cap-sd-highspeed; broken-cd; bus-width = <4>; bus-frequency = <200000000>; }; &i2c1 { lcd@28 { compatible = "newhaven,nhd-0216k3z-nsw-bbw"; reg = <0x28>; }; }; + +&usb0 { + dr_mode = "host"; +}; Index: projects/clang400-import/sys/boot/powerpc/kboot/Makefile =================================================================== --- projects/clang400-import/sys/boot/powerpc/kboot/Makefile (revision 314522) +++ projects/clang400-import/sys/boot/powerpc/kboot/Makefile (revision 314523) @@ -1,111 +1,112 @@ # $FreeBSD$ .include MK_SSP= no MAN= PROG= loader.kboot NEWVERSWHAT= "kboot loader" ${MACHINE_ARCH} BINDIR?= /boot INSTALLFLAGS= -b # Architecture-specific loader code SRCS= conf.c metadata.c vers.c main.c ppc64_elf_freebsd.c SRCS+= host_syscall.S hostcons.c hostdisk.c kerneltramp.S kbootfdt.c SRCS+= ucmpdi2.c LOADER_DISK_SUPPORT?= yes LOADER_UFS_SUPPORT?= yes LOADER_CD9660_SUPPORT?= yes LOADER_EXT2FS_SUPPORT?= yes LOADER_NET_SUPPORT?= yes LOADER_NFS_SUPPORT?= yes LOADER_TFTP_SUPPORT?= no LOADER_GZIP_SUPPORT?= yes LOADER_FDT_SUPPORT= yes LOADER_BZIP2_SUPPORT?= no .if ${LOADER_DISK_SUPPORT} == "yes" CFLAGS+= -DLOADER_DISK_SUPPORT .endif .if ${LOADER_UFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_UFS_SUPPORT .endif .if ${LOADER_CD9660_SUPPORT} == "yes" CFLAGS+= -DLOADER_CD9660_SUPPORT .endif .if ${LOADER_EXT2FS_SUPPORT} == "yes" CFLAGS+= -DLOADER_EXT2FS_SUPPORT .endif .if ${LOADER_GZIP_SUPPORT} == "yes" CFLAGS+= -DLOADER_GZIP_SUPPORT .endif .if ${LOADER_BZIP2_SUPPORT} == "yes" CFLAGS+= -DLOADER_BZIP2_SUPPORT .endif .if ${LOADER_NET_SUPPORT} == "yes" CFLAGS+= -DLOADER_NET_SUPPORT .endif .if ${LOADER_NFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_NFS_SUPPORT .endif .if ${LOADER_TFTP_SUPPORT} == "yes" CFLAGS+= -DLOADER_TFTP_SUPPORT .endif .if ${LOADER_FDT_SUPPORT} == "yes" CFLAGS+= -I${.CURDIR}/../../fdt CFLAGS+= -I${.OBJDIR}/../../fdt CFLAGS+= -I${.CURDIR}/../../../contrib/libfdt CFLAGS+= -DLOADER_FDT_SUPPORT LIBFDT= ${.OBJDIR}/../../fdt/libfdt.a .endif .if ${MK_FORTH} != "no" # Enable BootForth BOOT_FORTH= yes -CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/powerpc +CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl +CFLAGS+= -I${.CURDIR}/../../ficl/powerpc LIBFICL= ${.OBJDIR}/../../ficl/libficl.a .endif CFLAGS+= -mcpu=powerpc64 # Always add MI sources .PATH: ${.CURDIR}/../../common ${.CURDIR}/../../../libkern .include "${.CURDIR}/../../common/Makefile.inc" CFLAGS+= -I${.CURDIR}/../../common -I${.CURDIR}/../../.. CFLAGS+= -I. CLEANFILES+= loader.help CFLAGS+= -Wall -ffreestanding -msoft-float -DAIM # load address. set in linker script RELOC?= 0x0 CFLAGS+= -DRELOC=${RELOC} LDFLAGS= -nostdlib -static -T ${.CURDIR}/ldscript.powerpc # 64-bit bridge extensions CFLAGS+= -Wa,-mppc64bridge # Pull in common loader code #.PATH: ${.CURDIR}/../../ofw/common #.include "${.CURDIR}/../../ofw/common/Makefile.inc" # where to get libstand from LIBSTAND= ${.OBJDIR}/../../libstand32/libstand.a CFLAGS+= -I${.CURDIR}/../../../../lib/libstand/ DPADD= ${LIBFICL} ${LIBOFW} ${LIBFDT} ${LIBSTAND} LDADD= ${LIBFICL} ${LIBOFW} ${LIBFDT} ${LIBSTAND} loader.help: help.common help.kboot ${.CURDIR}/../../fdt/help.fdt cat ${.ALLSRC} | \ awk -f ${.CURDIR}/../../common/merge_help.awk > ${.TARGET} .PATH: ${.CURDIR}/../../forth .include "${.CURDIR}/../../forth/Makefile.inc" FILES+= loader.rc menu.rc .include Index: projects/clang400-import/sys/boot/powerpc/ofw/Makefile =================================================================== --- projects/clang400-import/sys/boot/powerpc/ofw/Makefile (revision 314522) +++ projects/clang400-import/sys/boot/powerpc/ofw/Makefile (revision 314523) @@ -1,109 +1,110 @@ # $FreeBSD$ .include MK_SSP= no MAN= PROG= loader NEWVERSWHAT= "Open Firmware loader" ${MACHINE_ARCH} BINDIR?= /boot INSTALLFLAGS= -b # Architecture-specific loader code SRCS= conf.c metadata.c vers.c start.c SRCS+= ucmpdi2.c LOADER_DISK_SUPPORT?= yes LOADER_UFS_SUPPORT?= yes LOADER_CD9660_SUPPORT?= yes LOADER_EXT2FS_SUPPORT?= no LOADER_NET_SUPPORT?= yes LOADER_NFS_SUPPORT?= yes LOADER_TFTP_SUPPORT?= no LOADER_GZIP_SUPPORT?= yes LOADER_BZIP2_SUPPORT?= no LOADER_FDT_SUPPORT?= yes .if ${LOADER_DISK_SUPPORT} == "yes" CFLAGS+= -DLOADER_DISK_SUPPORT .endif .if ${LOADER_UFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_UFS_SUPPORT .endif .if ${LOADER_CD9660_SUPPORT} == "yes" CFLAGS+= -DLOADER_CD9660_SUPPORT .endif .if ${LOADER_EXT2FS_SUPPORT} == "yes" CFLAGS+= -DLOADER_EXT2FS_SUPPORT .endif .if ${LOADER_GZIP_SUPPORT} == "yes" CFLAGS+= -DLOADER_GZIP_SUPPORT .endif .if ${LOADER_BZIP2_SUPPORT} == "yes" CFLAGS+= -DLOADER_BZIP2_SUPPORT .endif .if ${LOADER_NET_SUPPORT} == "yes" CFLAGS+= -DLOADER_NET_SUPPORT .endif .if ${LOADER_NFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_NFS_SUPPORT .endif .if ${LOADER_TFTP_SUPPORT} == "yes" CFLAGS+= -DLOADER_TFTP_SUPPORT .endif .if ${LOADER_FDT_SUPPORT} == "yes" SRCS+= ofwfdt.c CFLAGS+= -I${.CURDIR}/../../fdt CFLAGS+= -I${.OBJDIR}/../../fdt CFLAGS+= -I${.CURDIR}/../../../contrib/libfdt CFLAGS+= -DLOADER_FDT_SUPPORT LIBFDT= ${.OBJDIR}/../../fdt/libfdt.a .endif .if ${MK_FORTH} != "no" # Enable BootForth BOOT_FORTH= yes -CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/powerpc +CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl +CFLAGS+= -I${.CURDIR}/../../ficl/powerpc LIBFICL= ${.OBJDIR}/../../ficl/libficl.a .endif # Always add MI sources .PATH: ${.CURDIR}/../../common ${.CURDIR}/../../../libkern .include "${.CURDIR}/../../common/Makefile.inc" CFLAGS+= -I${.CURDIR}/../../common -I${.CURDIR}/../../.. CFLAGS+= -I. CLEANFILES+= loader.help CFLAGS+= -ffreestanding -msoft-float # load address. set in linker script RELOC?= 0x1C00000 CFLAGS+= -DRELOC=${RELOC} LDFLAGS= -nostdlib -static -T ${.CURDIR}/ldscript.powerpc # Pull in common loader code .PATH: ${.CURDIR}/../../ofw/common .include "${.CURDIR}/../../ofw/common/Makefile.inc" # Open Firmware standalone support library LIBOFW= ${.OBJDIR}/../../ofw/libofw/libofw.a CFLAGS+= -I${.CURDIR}/../../ofw/libofw # where to get libstand from LIBSTAND= ${.OBJDIR}/../../libstand32/libstand.a CFLAGS+= -I${.CURDIR}/../../../../lib/libstand/ DPADD= ${LIBFICL} ${LIBOFW} ${LIBFDT} ${LIBSTAND} LDADD= ${LIBFICL} ${LIBOFW} ${LIBFDT} ${LIBSTAND} loader.help: help.common help.ofw ${.CURDIR}/../../fdt/help.fdt cat ${.ALLSRC} | \ awk -f ${.CURDIR}/../../common/merge_help.awk > ${.TARGET} .PATH: ${.CURDIR}/../../forth .include "${.CURDIR}/../../forth/Makefile.inc" FILES+= loader.rc menu.rc .include Index: projects/clang400-import/sys/boot/powerpc/ps3/Makefile =================================================================== --- projects/clang400-import/sys/boot/powerpc/ps3/Makefile (revision 314522) +++ projects/clang400-import/sys/boot/powerpc/ps3/Makefile (revision 314523) @@ -1,113 +1,114 @@ # $FreeBSD$ .include MK_SSP= no MAN= PROG= loader.ps3 NEWVERSWHAT= "Playstation 3 loader" ${MACHINE_ARCH} BINDIR?= /boot INSTALLFLAGS= -b # Architecture-specific loader code SRCS= start.S conf.c metadata.c vers.c main.c devicename.c ppc64_elf_freebsd.c SRCS+= lv1call.S ps3cons.c font.h ps3mmu.c ps3net.c ps3repo.c \ ps3stor.c ps3disk.c ps3cdrom.c SRCS+= ucmpdi2.c LOADER_DISK_SUPPORT?= yes LOADER_UFS_SUPPORT?= yes LOADER_CD9660_SUPPORT?= yes LOADER_EXT2FS_SUPPORT?= yes LOADER_NET_SUPPORT?= yes LOADER_NFS_SUPPORT?= yes LOADER_TFTP_SUPPORT?= no LOADER_GZIP_SUPPORT?= yes LOADER_FDT_SUPPORT?= no LOADER_BZIP2_SUPPORT?= no .if ${LOADER_DISK_SUPPORT} == "yes" CFLAGS+= -DLOADER_DISK_SUPPORT .endif .if ${LOADER_UFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_UFS_SUPPORT .endif .if ${LOADER_CD9660_SUPPORT} == "yes" CFLAGS+= -DLOADER_CD9660_SUPPORT .endif .if ${LOADER_EXT2FS_SUPPORT} == "yes" CFLAGS+= -DLOADER_EXT2FS_SUPPORT .endif .if ${LOADER_GZIP_SUPPORT} == "yes" CFLAGS+= -DLOADER_GZIP_SUPPORT .endif .if ${LOADER_BZIP2_SUPPORT} == "yes" CFLAGS+= -DLOADER_BZIP2_SUPPORT .endif .if ${LOADER_NET_SUPPORT} == "yes" CFLAGS+= -DLOADER_NET_SUPPORT .endif .if ${LOADER_NFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_NFS_SUPPORT .endif .if ${LOADER_TFTP_SUPPORT} == "yes" CFLAGS+= -DLOADER_TFTP_SUPPORT .endif .if ${LOADER_FDT_SUPPORT} == "yes" CFLAGS+= -I${.CURDIR}/../../fdt CFLAGS+= -I${.OBJDIR}/../../fdt CFLAGS+= -DLOADER_FDT_SUPPORT LIBFDT= ${.OBJDIR}/../../fdt/libfdt.a .endif .if ${MK_FORTH} != "no" # Enable BootForth BOOT_FORTH= yes -CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/powerpc +CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl +CFLAGS+= -I${.CURDIR}/../../ficl/powerpc LIBFICL= ${.OBJDIR}/../../ficl/libficl.a .endif CFLAGS+= -mcpu=powerpc64 # Always add MI sources .PATH: ${.CURDIR}/../../common ${.CURDIR}/../../../libkern .include "${.CURDIR}/../../common/Makefile.inc" CFLAGS+= -I${.CURDIR}/../../common -I${.CURDIR}/../../.. CFLAGS+= -I. CLEANFILES+= loader.help CFLAGS+= -Wall -ffreestanding -msoft-float -DAIM # load address. set in linker script RELOC?= 0x0 CFLAGS+= -DRELOC=${RELOC} LDFLAGS= -nostdlib -static -T ${.CURDIR}/ldscript.powerpc # Pull in common loader code #.PATH: ${.CURDIR}/../../ofw/common #.include "${.CURDIR}/../../ofw/common/Makefile.inc" # where to get libstand from LIBSTAND= ${.OBJDIR}/../../libstand32/libstand.a CFLAGS+= -I${.CURDIR}/../../../../lib/libstand/ DPADD= ${LIBFICL} ${LIBOFW} ${LIBSTAND} LDADD= ${LIBFICL} ${LIBOFW} ${LIBSTAND} SC_DFLT_FONT=cp437 font.h: uudecode < /usr/share/syscons/fonts/${SC_DFLT_FONT}-8x16.fnt && file2c 'u_char dflt_font_16[16*256] = {' '};' < ${SC_DFLT_FONT}-8x16 > font.h && uudecode < /usr/share/syscons/fonts/${SC_DFLT_FONT}-8x14.fnt && file2c 'u_char dflt_font_14[14*256] = {' '};' < ${SC_DFLT_FONT}-8x14 >> font.h && uudecode < /usr/share/syscons/fonts/${SC_DFLT_FONT}-8x8.fnt && file2c 'u_char dflt_font_8[8*256] = {' '};' < ${SC_DFLT_FONT}-8x8 >> font.h loader.help: help.common help.ps3 ${.CURDIR}/../../fdt/help.fdt cat ${.ALLSRC} | \ awk -f ${.CURDIR}/../../common/merge_help.awk > ${.TARGET} .PATH: ${.CURDIR}/../../forth .include "${.CURDIR}/../../forth/Makefile.inc" FILES+= loader.rc menu.rc .include Index: projects/clang400-import/sys/boot/powerpc/uboot/Makefile =================================================================== --- projects/clang400-import/sys/boot/powerpc/uboot/Makefile (revision 314522) +++ projects/clang400-import/sys/boot/powerpc/uboot/Makefile (revision 314523) @@ -1,112 +1,113 @@ # $FreeBSD$ .include PROG= ubldr NEWVERSWHAT= "U-Boot loader" ${MACHINE_ARCH} BINDIR?= /boot INSTALLFLAGS= -b MAN= # Architecture-specific loader code SRCS= start.S conf.c vers.c SRCS+= ucmpdi2.c .if !defined(LOADER_NO_DISK_SUPPORT) LOADER_DISK_SUPPORT?= yes .else LOADER_DISK_SUPPORT= no .endif LOADER_UFS_SUPPORT?= yes LOADER_CD9660_SUPPORT?= no LOADER_EXT2FS_SUPPORT?= no LOADER_NET_SUPPORT?= yes LOADER_NFS_SUPPORT?= yes LOADER_TFTP_SUPPORT?= no LOADER_GZIP_SUPPORT?= no LOADER_BZIP2_SUPPORT?= no .if ${MK_FDT} != "no" LOADER_FDT_SUPPORT= yes .else LOADER_FDT_SUPPORT= no .endif .if ${LOADER_DISK_SUPPORT} == "yes" CFLAGS+= -DLOADER_DISK_SUPPORT .endif .if ${LOADER_UFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_UFS_SUPPORT .endif .if ${LOADER_CD9660_SUPPORT} == "yes" CFLAGS+= -DLOADER_CD9660_SUPPORT .endif .if ${LOADER_EXT2FS_SUPPORT} == "yes" CFLAGS+= -DLOADER_EXT2FS_SUPPORT .endif .if ${LOADER_GZIP_SUPPORT} == "yes" CFLAGS+= -DLOADER_GZIP_SUPPORT .endif .if ${LOADER_BZIP2_SUPPORT} == "yes" CFLAGS+= -DLOADER_BZIP2_SUPPORT .endif .if ${LOADER_NET_SUPPORT} == "yes" CFLAGS+= -DLOADER_NET_SUPPORT .endif .if ${LOADER_NFS_SUPPORT} == "yes" CFLAGS+= -DLOADER_NFS_SUPPORT .endif .if ${LOADER_TFTP_SUPPORT} == "yes" CFLAGS+= -DLOADER_TFTP_SUPPORT .endif .if ${LOADER_FDT_SUPPORT} == "yes" CFLAGS+= -I${.CURDIR}/../../fdt CFLAGS+= -I${.OBJDIR}/../../fdt CFLAGS+= -DLOADER_FDT_SUPPORT LIBUBOOT_FDT= ${.OBJDIR}/../../uboot/fdt/libuboot_fdt.a LIBFDT= ${.OBJDIR}/../../fdt/libfdt.a .endif .if ${MK_FORTH} != "no" # Enable BootForth BOOT_FORTH= yes -CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/powerpc +CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl +CFLAGS+= -I${.CURDIR}/../../ficl/powerpc LIBFICL= ${.OBJDIR}/../../ficl/libficl.a .endif # Always add MI sources .PATH: ${.CURDIR}/../../common ${.CURDIR}/../../../libkern .include "${.CURDIR}/../../common/Makefile.inc" CFLAGS+= -I${.CURDIR}/../../common -I${.CURDIR}/../../.. CFLAGS+= -I. CLEANFILES+= ${PROG}.help CFLAGS+= -ffreestanding LDFLAGS= -nostdlib -static -T ${.CURDIR}/ldscript.powerpc # Pull in common loader code .PATH: ${.CURDIR}/../../uboot/common .include "${.CURDIR}/../../uboot/common/Makefile.inc" CFLAGS+= -I${.CURDIR}/../../uboot/common # U-Boot standalone support library LIBUBOOT= ${.OBJDIR}/../../uboot/lib/libuboot.a CFLAGS+= -I${.CURDIR}/../../uboot/lib CFLAGS+= -I${.OBJDIR}/../../uboot/lib # where to get libstand from LIBSTAND= ${.OBJDIR}/../../libstand32/libstand.a CFLAGS+= -I${.CURDIR}/../../../../lib/libstand/ DPADD= ${LIBFICL} ${LIBUBOOT} ${LIBFDT} ${LIBUBOOT_FDT} ${LIBSTAND} LDADD= ${LIBFICL} ${LIBUBOOT} ${LIBFDT} ${LIBUBOOT_FDT} ${LIBSTAND} loader.help: help.common help.uboot ${.CURDIR}/../../fdt/help.fdt cat ${.ALLSRC} | \ awk -f ${.CURDIR}/../../common/merge_help.awk > ${.TARGET} .PATH: ${.CURDIR}/../../forth FILES= loader.help .include Index: projects/clang400-import/sys/boot/userboot/userboot/Makefile =================================================================== --- projects/clang400-import/sys/boot/userboot/userboot/Makefile (revision 314522) +++ projects/clang400-import/sys/boot/userboot/userboot/Makefile (revision 314523) @@ -1,66 +1,67 @@ # $FreeBSD$ MAN= .include MK_SSP= no SHLIB_NAME= userboot.so MK_CTF= no STRIP= LIBDIR= /boot SRCS= autoload.c SRCS+= bcache.c SRCS+= biossmap.c SRCS+= bootinfo.c SRCS+= bootinfo32.c SRCS+= bootinfo64.c SRCS+= conf.c SRCS+= console.c SRCS+= copy.c SRCS+= devicename.c SRCS+= elf32_freebsd.c SRCS+= elf64_freebsd.c SRCS+= host.c SRCS+= main.c SRCS+= userboot_cons.c SRCS+= userboot_disk.c SRCS+= vers.c CFLAGS+= -Wall CFLAGS+= -I${.CURDIR}/.. CFLAGS+= -I${.CURDIR}/../../common CFLAGS+= -I${.CURDIR}/../../.. CFLAGS+= -I${.CURDIR}/../../../../lib/libstand CFLAGS+= -ffreestanding -I. CWARNFLAGS.main.c += -Wno-implicit-function-declaration LDFLAGS+= -nostdlib -Wl,-Bsymbolic NEWVERSWHAT= "User boot" ${MACHINE_CPUARCH} .if ${MK_FORTH} != "no" BOOT_FORTH= yes -CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl -I${.CURDIR}/../../ficl/i386 +CFLAGS+= -DBOOT_FORTH -I${.CURDIR}/../../ficl +CFLAGS+= -I${.CURDIR}/../../ficl/i386 CFLAGS+= -DBF_DICTSIZE=15000 LIBFICL= ${.OBJDIR}/../ficl/libficl.a .endif LIBSTAND= ${.OBJDIR}/../libstand/libstand.a .if ${MK_ZFS} != "no" CFLAGS+= -DUSERBOOT_ZFS_SUPPORT LIBZFSBOOT= ${.OBJDIR}/../zfs/libzfsboot.a .endif # Always add MI sources .PATH: ${.CURDIR}/../../common .include "${.CURDIR}/../../common/Makefile.inc" CFLAGS+= -I${.CURDIR}/../../common CFLAGS+= -I. DPADD+= ${LIBFICL} ${LIBZFSBOOT} ${LIBSTAND} LDADD+= ${LIBFICL} ${LIBZFSBOOT} ${LIBSTAND} .include Index: projects/clang400-import/sys/boot/zfs/zfsimpl.c =================================================================== --- projects/clang400-import/sys/boot/zfs/zfsimpl.c (revision 314522) +++ projects/clang400-import/sys/boot/zfs/zfsimpl.c (revision 314523) @@ -1,2488 +1,2488 @@ /*- * Copyright (c) 2007 Doug Rabson * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); /* * Stand-alone ZFS file reader. */ #include #include #include "zfsimpl.h" #include "zfssubr.c" struct zfsmount { const spa_t *spa; objset_phys_t objset; uint64_t rootobj; }; /* * List of all vdevs, chained through v_alllink. */ static vdev_list_t zfs_vdevs; /* * List of ZFS features supported for read */ static const char *features_for_read[] = { "org.illumos:lz4_compress", "com.delphix:hole_birth", "com.delphix:extensible_dataset", "com.delphix:embedded_data", "org.open-zfs:large_blocks", "org.illumos:sha512", "org.illumos:skein", NULL }; /* * List of all pools, chained through spa_link. */ static spa_list_t zfs_pools; static uint64_t zfs_crc64_table[256]; static const dnode_phys_t *dnode_cache_obj = NULL; static uint64_t dnode_cache_bn; static char *dnode_cache_buf; static char *zap_scratch; static char *zfs_temp_buf, *zfs_temp_end, *zfs_temp_ptr; #define TEMP_SIZE (1024 * 1024) static int zio_read(const spa_t *spa, const blkptr_t *bp, void *buf); static int zfs_get_root(const spa_t *spa, uint64_t *objid); static int zfs_rlookup(const spa_t *spa, uint64_t objnum, char *result); static int zap_lookup(const spa_t *spa, const dnode_phys_t *dnode, const char *name, uint64_t integer_size, uint64_t num_integers, void *value); static void zfs_init(void) { STAILQ_INIT(&zfs_vdevs); STAILQ_INIT(&zfs_pools); zfs_temp_buf = malloc(TEMP_SIZE); zfs_temp_end = zfs_temp_buf + TEMP_SIZE; zfs_temp_ptr = zfs_temp_buf; dnode_cache_buf = malloc(SPA_MAXBLOCKSIZE); zap_scratch = malloc(SPA_MAXBLOCKSIZE); zfs_init_crc(); } static void * zfs_alloc(size_t size) { char *ptr; if (zfs_temp_ptr + size > zfs_temp_end) { printf("ZFS: out of temporary buffer space\n"); for (;;) ; } ptr = zfs_temp_ptr; zfs_temp_ptr += size; return (ptr); } static void zfs_free(void *ptr, size_t size) { zfs_temp_ptr -= size; if (zfs_temp_ptr != ptr) { printf("ZFS: zfs_alloc()/zfs_free() mismatch\n"); for (;;) ; } } static int xdr_int(const unsigned char **xdr, int *ip) { *ip = ((*xdr)[0] << 24) | ((*xdr)[1] << 16) | ((*xdr)[2] << 8) | ((*xdr)[3] << 0); (*xdr) += 4; return (0); } static int xdr_u_int(const unsigned char **xdr, u_int *ip) { *ip = ((*xdr)[0] << 24) | ((*xdr)[1] << 16) | ((*xdr)[2] << 8) | ((*xdr)[3] << 0); (*xdr) += 4; return (0); } static int xdr_uint64_t(const unsigned char **xdr, uint64_t *lp) { u_int hi, lo; xdr_u_int(xdr, &hi); xdr_u_int(xdr, &lo); *lp = (((uint64_t) hi) << 32) | lo; return (0); } static int nvlist_find(const unsigned char *nvlist, const char *name, int type, int* elementsp, void *valuep) { const unsigned char *p, *pair; int junk; int encoded_size, decoded_size; p = nvlist; xdr_int(&p, &junk); xdr_int(&p, &junk); pair = p; xdr_int(&p, &encoded_size); xdr_int(&p, &decoded_size); while (encoded_size && decoded_size) { int namelen, pairtype, elements; const char *pairname; xdr_int(&p, &namelen); pairname = (const char*) p; p += roundup(namelen, 4); xdr_int(&p, &pairtype); if (!memcmp(name, pairname, namelen) && type == pairtype) { xdr_int(&p, &elements); if (elementsp) *elementsp = elements; if (type == DATA_TYPE_UINT64) { xdr_uint64_t(&p, (uint64_t *) valuep); return (0); } else if (type == DATA_TYPE_STRING) { int len; xdr_int(&p, &len); (*(const char**) valuep) = (const char*) p; return (0); } else if (type == DATA_TYPE_NVLIST || type == DATA_TYPE_NVLIST_ARRAY) { (*(const unsigned char**) valuep) = (const unsigned char*) p; return (0); } else { return (EIO); } } else { /* * Not the pair we are looking for, skip to the next one. */ p = pair + encoded_size; } pair = p; xdr_int(&p, &encoded_size); xdr_int(&p, &decoded_size); } return (EIO); } static int nvlist_check_features_for_read(const unsigned char *nvlist) { const unsigned char *p, *pair; int junk; int encoded_size, decoded_size; int rc; rc = 0; p = nvlist; xdr_int(&p, &junk); xdr_int(&p, &junk); pair = p; xdr_int(&p, &encoded_size); xdr_int(&p, &decoded_size); while (encoded_size && decoded_size) { int namelen, pairtype; const char *pairname; int i, found; found = 0; xdr_int(&p, &namelen); pairname = (const char*) p; p += roundup(namelen, 4); xdr_int(&p, &pairtype); for (i = 0; features_for_read[i] != NULL; i++) { if (!memcmp(pairname, features_for_read[i], namelen)) { found = 1; break; } } if (!found) { printf("ZFS: unsupported feature: %s\n", pairname); rc = EIO; } p = pair + encoded_size; pair = p; xdr_int(&p, &encoded_size); xdr_int(&p, &decoded_size); } return (rc); } /* * Return the next nvlist in an nvlist array. */ static const unsigned char * nvlist_next(const unsigned char *nvlist) { const unsigned char *p, *pair; int junk; int encoded_size, decoded_size; p = nvlist; xdr_int(&p, &junk); xdr_int(&p, &junk); pair = p; xdr_int(&p, &encoded_size); xdr_int(&p, &decoded_size); while (encoded_size && decoded_size) { p = pair + encoded_size; pair = p; xdr_int(&p, &encoded_size); xdr_int(&p, &decoded_size); } return p; } #ifdef TEST static const unsigned char * nvlist_print(const unsigned char *nvlist, unsigned int indent) { static const char* typenames[] = { "DATA_TYPE_UNKNOWN", "DATA_TYPE_BOOLEAN", "DATA_TYPE_BYTE", "DATA_TYPE_INT16", "DATA_TYPE_UINT16", "DATA_TYPE_INT32", "DATA_TYPE_UINT32", "DATA_TYPE_INT64", "DATA_TYPE_UINT64", "DATA_TYPE_STRING", "DATA_TYPE_BYTE_ARRAY", "DATA_TYPE_INT16_ARRAY", "DATA_TYPE_UINT16_ARRAY", "DATA_TYPE_INT32_ARRAY", "DATA_TYPE_UINT32_ARRAY", "DATA_TYPE_INT64_ARRAY", "DATA_TYPE_UINT64_ARRAY", "DATA_TYPE_STRING_ARRAY", "DATA_TYPE_HRTIME", "DATA_TYPE_NVLIST", "DATA_TYPE_NVLIST_ARRAY", "DATA_TYPE_BOOLEAN_VALUE", "DATA_TYPE_INT8", "DATA_TYPE_UINT8", "DATA_TYPE_BOOLEAN_ARRAY", "DATA_TYPE_INT8_ARRAY", "DATA_TYPE_UINT8_ARRAY" }; unsigned int i, j; const unsigned char *p, *pair; int junk; int encoded_size, decoded_size; p = nvlist; xdr_int(&p, &junk); xdr_int(&p, &junk); pair = p; xdr_int(&p, &encoded_size); xdr_int(&p, &decoded_size); while (encoded_size && decoded_size) { int namelen, pairtype, elements; const char *pairname; xdr_int(&p, &namelen); pairname = (const char*) p; p += roundup(namelen, 4); xdr_int(&p, &pairtype); for (i = 0; i < indent; i++) printf(" "); printf("%s %s", typenames[pairtype], pairname); xdr_int(&p, &elements); switch (pairtype) { case DATA_TYPE_UINT64: { uint64_t val; xdr_uint64_t(&p, &val); printf(" = 0x%jx\n", (uintmax_t)val); break; } case DATA_TYPE_STRING: { int len; xdr_int(&p, &len); printf(" = \"%s\"\n", p); break; } case DATA_TYPE_NVLIST: printf("\n"); nvlist_print(p, indent + 1); break; case DATA_TYPE_NVLIST_ARRAY: for (j = 0; j < elements; j++) { printf("[%d]\n", j); p = nvlist_print(p, indent + 1); if (j != elements - 1) { for (i = 0; i < indent; i++) printf(" "); printf("%s %s", typenames[pairtype], pairname); } } break; default: printf("\n"); } p = pair + encoded_size; pair = p; xdr_int(&p, &encoded_size); xdr_int(&p, &decoded_size); } return p; } #endif static int vdev_read_phys(vdev_t *vdev, const blkptr_t *bp, void *buf, off_t offset, size_t size) { size_t psize; int rc; if (!vdev->v_phys_read) return (EIO); if (bp) { psize = BP_GET_PSIZE(bp); } else { psize = size; } /*printf("ZFS: reading %d bytes at 0x%jx to %p\n", psize, (uintmax_t)offset, buf);*/ rc = vdev->v_phys_read(vdev, vdev->v_read_priv, offset, buf, psize); if (rc) return (rc); if (bp && zio_checksum_verify(vdev->spa, bp, buf)) return (EIO); return (0); } static int vdev_disk_read(vdev_t *vdev, const blkptr_t *bp, void *buf, off_t offset, size_t bytes) { return (vdev_read_phys(vdev, bp, buf, offset + VDEV_LABEL_START_SIZE, bytes)); } static int vdev_mirror_read(vdev_t *vdev, const blkptr_t *bp, void *buf, off_t offset, size_t bytes) { vdev_t *kid; int rc; rc = EIO; STAILQ_FOREACH(kid, &vdev->v_children, v_childlink) { if (kid->v_state != VDEV_STATE_HEALTHY) continue; rc = kid->v_read(kid, bp, buf, offset, bytes); if (!rc) return (0); } return (rc); } static int vdev_replacing_read(vdev_t *vdev, const blkptr_t *bp, void *buf, off_t offset, size_t bytes) { vdev_t *kid; /* * Here we should have two kids: * First one which is the one we are replacing and we can trust * only this one to have valid data, but it might not be present. * Second one is that one we are replacing with. It is most likely * healthy, but we can't trust it has needed data, so we won't use it. */ kid = STAILQ_FIRST(&vdev->v_children); if (kid == NULL) return (EIO); if (kid->v_state != VDEV_STATE_HEALTHY) return (EIO); return (kid->v_read(kid, bp, buf, offset, bytes)); } static vdev_t * vdev_find(uint64_t guid) { vdev_t *vdev; STAILQ_FOREACH(vdev, &zfs_vdevs, v_alllink) if (vdev->v_guid == guid) return (vdev); return (0); } static vdev_t * vdev_create(uint64_t guid, vdev_read_t *read) { vdev_t *vdev; vdev = malloc(sizeof(vdev_t)); memset(vdev, 0, sizeof(vdev_t)); STAILQ_INIT(&vdev->v_children); vdev->v_guid = guid; vdev->v_state = VDEV_STATE_OFFLINE; vdev->v_read = read; vdev->v_phys_read = 0; vdev->v_read_priv = 0; STAILQ_INSERT_TAIL(&zfs_vdevs, vdev, v_alllink); return (vdev); } static int vdev_init_from_nvlist(const unsigned char *nvlist, vdev_t *pvdev, vdev_t **vdevp, int is_newer) { int rc; uint64_t guid, id, ashift, nparity; const char *type; const char *path; vdev_t *vdev, *kid; const unsigned char *kids; int nkids, i, is_new; uint64_t is_offline, is_faulted, is_degraded, is_removed, isnt_present; if (nvlist_find(nvlist, ZPOOL_CONFIG_GUID, DATA_TYPE_UINT64, 0, &guid) || nvlist_find(nvlist, ZPOOL_CONFIG_ID, DATA_TYPE_UINT64, 0, &id) || nvlist_find(nvlist, ZPOOL_CONFIG_TYPE, DATA_TYPE_STRING, 0, &type)) { printf("ZFS: can't find vdev details\n"); return (ENOENT); } if (strcmp(type, VDEV_TYPE_MIRROR) && strcmp(type, VDEV_TYPE_DISK) #ifdef ZFS_TEST && strcmp(type, VDEV_TYPE_FILE) #endif && strcmp(type, VDEV_TYPE_RAIDZ) && strcmp(type, VDEV_TYPE_REPLACING)) { printf("ZFS: can only boot from disk, mirror, raidz1, raidz2 and raidz3 vdevs\n"); return (EIO); } is_offline = is_removed = is_faulted = is_degraded = isnt_present = 0; nvlist_find(nvlist, ZPOOL_CONFIG_OFFLINE, DATA_TYPE_UINT64, 0, &is_offline); nvlist_find(nvlist, ZPOOL_CONFIG_REMOVED, DATA_TYPE_UINT64, 0, &is_removed); nvlist_find(nvlist, ZPOOL_CONFIG_FAULTED, DATA_TYPE_UINT64, 0, &is_faulted); nvlist_find(nvlist, ZPOOL_CONFIG_DEGRADED, DATA_TYPE_UINT64, 0, &is_degraded); nvlist_find(nvlist, ZPOOL_CONFIG_NOT_PRESENT, DATA_TYPE_UINT64, 0, &isnt_present); vdev = vdev_find(guid); if (!vdev) { is_new = 1; if (!strcmp(type, VDEV_TYPE_MIRROR)) vdev = vdev_create(guid, vdev_mirror_read); else if (!strcmp(type, VDEV_TYPE_RAIDZ)) vdev = vdev_create(guid, vdev_raidz_read); else if (!strcmp(type, VDEV_TYPE_REPLACING)) vdev = vdev_create(guid, vdev_replacing_read); else vdev = vdev_create(guid, vdev_disk_read); vdev->v_id = id; vdev->v_top = pvdev != NULL ? pvdev : vdev; if (nvlist_find(nvlist, ZPOOL_CONFIG_ASHIFT, DATA_TYPE_UINT64, 0, &ashift) == 0) vdev->v_ashift = ashift; else vdev->v_ashift = 0; if (nvlist_find(nvlist, ZPOOL_CONFIG_NPARITY, DATA_TYPE_UINT64, 0, &nparity) == 0) vdev->v_nparity = nparity; else vdev->v_nparity = 0; if (nvlist_find(nvlist, ZPOOL_CONFIG_PATH, DATA_TYPE_STRING, 0, &path) == 0) { if (strncmp(path, "/dev/", 5) == 0) path += 5; vdev->v_name = strdup(path); } else { if (!strcmp(type, "raidz")) { if (vdev->v_nparity == 1) vdev->v_name = "raidz1"; else if (vdev->v_nparity == 2) vdev->v_name = "raidz2"; else if (vdev->v_nparity == 3) vdev->v_name = "raidz3"; else { printf("ZFS: can only boot from disk, mirror, raidz1, raidz2 and raidz3 vdevs\n"); return (EIO); } } else { vdev->v_name = strdup(type); } } } else { is_new = 0; } if (is_new || is_newer) { /* * This is either new vdev or we've already seen this vdev, * but from an older vdev label, so let's refresh its state * from the newer label. */ if (is_offline) vdev->v_state = VDEV_STATE_OFFLINE; else if (is_removed) vdev->v_state = VDEV_STATE_REMOVED; else if (is_faulted) vdev->v_state = VDEV_STATE_FAULTED; else if (is_degraded) vdev->v_state = VDEV_STATE_DEGRADED; else if (isnt_present) vdev->v_state = VDEV_STATE_CANT_OPEN; } rc = nvlist_find(nvlist, ZPOOL_CONFIG_CHILDREN, DATA_TYPE_NVLIST_ARRAY, &nkids, &kids); /* * Its ok if we don't have any kids. */ if (rc == 0) { vdev->v_nchildren = nkids; for (i = 0; i < nkids; i++) { rc = vdev_init_from_nvlist(kids, vdev, &kid, is_newer); if (rc) return (rc); if (is_new) STAILQ_INSERT_TAIL(&vdev->v_children, kid, v_childlink); kids = nvlist_next(kids); } } else { vdev->v_nchildren = 0; } if (vdevp) *vdevp = vdev; return (0); } static void vdev_set_state(vdev_t *vdev) { vdev_t *kid; int good_kids; int bad_kids; /* * A mirror or raidz is healthy if all its kids are healthy. A * mirror is degraded if any of its kids is healthy; a raidz * is degraded if at most nparity kids are offline. */ if (STAILQ_FIRST(&vdev->v_children)) { good_kids = 0; bad_kids = 0; STAILQ_FOREACH(kid, &vdev->v_children, v_childlink) { if (kid->v_state == VDEV_STATE_HEALTHY) good_kids++; else bad_kids++; } if (bad_kids == 0) { vdev->v_state = VDEV_STATE_HEALTHY; } else { if (vdev->v_read == vdev_mirror_read) { if (good_kids) { vdev->v_state = VDEV_STATE_DEGRADED; } else { vdev->v_state = VDEV_STATE_OFFLINE; } } else if (vdev->v_read == vdev_raidz_read) { if (bad_kids > vdev->v_nparity) { vdev->v_state = VDEV_STATE_OFFLINE; } else { vdev->v_state = VDEV_STATE_DEGRADED; } } } } } static spa_t * spa_find_by_guid(uint64_t guid) { spa_t *spa; STAILQ_FOREACH(spa, &zfs_pools, spa_link) if (spa->spa_guid == guid) return (spa); return (0); } static spa_t * spa_find_by_name(const char *name) { spa_t *spa; STAILQ_FOREACH(spa, &zfs_pools, spa_link) if (!strcmp(spa->spa_name, name)) return (spa); return (0); } #ifdef BOOT2 static spa_t * spa_get_primary(void) { return (STAILQ_FIRST(&zfs_pools)); } static vdev_t * spa_get_primary_vdev(const spa_t *spa) { vdev_t *vdev; vdev_t *kid; if (spa == NULL) spa = spa_get_primary(); if (spa == NULL) return (NULL); vdev = STAILQ_FIRST(&spa->spa_vdevs); if (vdev == NULL) return (NULL); for (kid = STAILQ_FIRST(&vdev->v_children); kid != NULL; kid = STAILQ_FIRST(&vdev->v_children)) vdev = kid; return (vdev); } #endif static spa_t * spa_create(uint64_t guid) { spa_t *spa; spa = malloc(sizeof(spa_t)); memset(spa, 0, sizeof(spa_t)); STAILQ_INIT(&spa->spa_vdevs); spa->spa_guid = guid; STAILQ_INSERT_TAIL(&zfs_pools, spa, spa_link); return (spa); } static const char * state_name(vdev_state_t state) { static const char* names[] = { "UNKNOWN", "CLOSED", "OFFLINE", "REMOVED", "CANT_OPEN", "FAULTED", "DEGRADED", "ONLINE" }; return names[state]; } #ifdef BOOT2 #define pager_printf printf #else static int pager_printf(const char *fmt, ...) { char line[80]; va_list args; va_start(args, fmt); vsprintf(line, fmt, args); va_end(args); return (pager_output(line)); } #endif #define STATUS_FORMAT " %s %s\n" static int print_state(int indent, const char *name, vdev_state_t state) { int i; char buf[512]; buf[0] = 0; for (i = 0; i < indent; i++) strcat(buf, " "); strcat(buf, name); return (pager_printf(STATUS_FORMAT, buf, state_name(state))); } static int vdev_status(vdev_t *vdev, int indent) { vdev_t *kid; int ret; ret = print_state(indent, vdev->v_name, vdev->v_state); if (ret != 0) return (ret); STAILQ_FOREACH(kid, &vdev->v_children, v_childlink) { ret = vdev_status(kid, indent + 1); if (ret != 0) return (ret); } return (ret); } static int spa_status(spa_t *spa) { static char bootfs[ZFS_MAXNAMELEN]; uint64_t rootid; vdev_t *vdev; int good_kids, bad_kids, degraded_kids, ret; vdev_state_t state; ret = pager_printf(" pool: %s\n", spa->spa_name); if (ret != 0) return (ret); if (zfs_get_root(spa, &rootid) == 0 && zfs_rlookup(spa, rootid, bootfs) == 0) { if (bootfs[0] == '\0') ret = pager_printf("bootfs: %s\n", spa->spa_name); else ret = pager_printf("bootfs: %s/%s\n", spa->spa_name, bootfs); if (ret != 0) return (ret); } ret = pager_printf("config:\n\n"); if (ret != 0) return (ret); ret = pager_printf(STATUS_FORMAT, "NAME", "STATE"); if (ret != 0) return (ret); good_kids = 0; degraded_kids = 0; bad_kids = 0; STAILQ_FOREACH(vdev, &spa->spa_vdevs, v_childlink) { if (vdev->v_state == VDEV_STATE_HEALTHY) good_kids++; else if (vdev->v_state == VDEV_STATE_DEGRADED) degraded_kids++; else bad_kids++; } state = VDEV_STATE_CLOSED; if (good_kids > 0 && (degraded_kids + bad_kids) == 0) state = VDEV_STATE_HEALTHY; else if ((good_kids + degraded_kids) > 0) state = VDEV_STATE_DEGRADED; ret = print_state(0, spa->spa_name, state); if (ret != 0) return (ret); STAILQ_FOREACH(vdev, &spa->spa_vdevs, v_childlink) { ret = vdev_status(vdev, 1); if (ret != 0) return (ret); } return (ret); } static int spa_all_status(void) { spa_t *spa; int first = 1, ret = 0; STAILQ_FOREACH(spa, &zfs_pools, spa_link) { if (!first) { ret = pager_printf("\n"); if (ret != 0) return (ret); } first = 0; ret = spa_status(spa); if (ret != 0) return (ret); } return (ret); } static int vdev_probe(vdev_phys_read_t *read, void *read_priv, spa_t **spap) { vdev_t vtmp; vdev_phys_t *vdev_label = (vdev_phys_t *) zap_scratch; spa_t *spa; vdev_t *vdev, *top_vdev, *pool_vdev; off_t off; blkptr_t bp; const unsigned char *nvlist; uint64_t val; uint64_t guid; uint64_t pool_txg, pool_guid; uint64_t is_log; const char *pool_name; const unsigned char *vdevs; const unsigned char *features; int i, rc, is_newer; char *upbuf; const struct uberblock *up; /* * Load the vdev label and figure out which * uberblock is most current. */ memset(&vtmp, 0, sizeof(vtmp)); vtmp.v_phys_read = read; vtmp.v_read_priv = read_priv; off = offsetof(vdev_label_t, vl_vdev_phys); BP_ZERO(&bp); BP_SET_LSIZE(&bp, sizeof(vdev_phys_t)); BP_SET_PSIZE(&bp, sizeof(vdev_phys_t)); BP_SET_CHECKSUM(&bp, ZIO_CHECKSUM_LABEL); BP_SET_COMPRESS(&bp, ZIO_COMPRESS_OFF); DVA_SET_OFFSET(BP_IDENTITY(&bp), off); ZIO_SET_CHECKSUM(&bp.blk_cksum, off, 0, 0, 0); if (vdev_read_phys(&vtmp, &bp, vdev_label, off, 0)) return (EIO); if (vdev_label->vp_nvlist[0] != NV_ENCODE_XDR) { return (EIO); } nvlist = (const unsigned char *) vdev_label->vp_nvlist + 4; if (nvlist_find(nvlist, ZPOOL_CONFIG_VERSION, DATA_TYPE_UINT64, 0, &val)) { return (EIO); } if (!SPA_VERSION_IS_SUPPORTED(val)) { printf("ZFS: unsupported ZFS version %u (should be %u)\n", (unsigned) val, (unsigned) SPA_VERSION); return (EIO); } /* Check ZFS features for read */ if (nvlist_find(nvlist, ZPOOL_CONFIG_FEATURES_FOR_READ, DATA_TYPE_NVLIST, 0, &features) == 0 && nvlist_check_features_for_read(features) != 0) return (EIO); if (nvlist_find(nvlist, ZPOOL_CONFIG_POOL_STATE, DATA_TYPE_UINT64, 0, &val)) { return (EIO); } if (val == POOL_STATE_DESTROYED) { /* We don't boot only from destroyed pools. */ return (EIO); } if (nvlist_find(nvlist, ZPOOL_CONFIG_POOL_TXG, DATA_TYPE_UINT64, 0, &pool_txg) || nvlist_find(nvlist, ZPOOL_CONFIG_POOL_GUID, DATA_TYPE_UINT64, 0, &pool_guid) || nvlist_find(nvlist, ZPOOL_CONFIG_POOL_NAME, DATA_TYPE_STRING, 0, &pool_name)) { /* * Cache and spare devices end up here - just ignore * them. */ /*printf("ZFS: can't find pool details\n");*/ return (EIO); } is_log = 0; (void) nvlist_find(nvlist, ZPOOL_CONFIG_IS_LOG, DATA_TYPE_UINT64, 0, &is_log); if (is_log) return (EIO); /* * Create the pool if this is the first time we've seen it. */ spa = spa_find_by_guid(pool_guid); if (!spa) { spa = spa_create(pool_guid); spa->spa_name = strdup(pool_name); } if (pool_txg > spa->spa_txg) { spa->spa_txg = pool_txg; is_newer = 1; } else is_newer = 0; /* * Get the vdev tree and create our in-core copy of it. * If we already have a vdev with this guid, this must * be some kind of alias (overlapping slices, dangerously dedicated * disks etc). */ if (nvlist_find(nvlist, ZPOOL_CONFIG_GUID, DATA_TYPE_UINT64, 0, &guid)) { return (EIO); } vdev = vdev_find(guid); if (vdev && vdev->v_phys_read) /* Has this vdev already been inited? */ return (EIO); if (nvlist_find(nvlist, ZPOOL_CONFIG_VDEV_TREE, DATA_TYPE_NVLIST, 0, &vdevs)) { return (EIO); } rc = vdev_init_from_nvlist(vdevs, NULL, &top_vdev, is_newer); if (rc) return (rc); /* * Add the toplevel vdev to the pool if its not already there. */ STAILQ_FOREACH(pool_vdev, &spa->spa_vdevs, v_childlink) if (top_vdev == pool_vdev) break; if (!pool_vdev && top_vdev) { top_vdev->spa = spa; STAILQ_INSERT_TAIL(&spa->spa_vdevs, top_vdev, v_childlink); } /* * We should already have created an incomplete vdev for this * vdev. Find it and initialise it with our read proc. */ vdev = vdev_find(guid); if (vdev) { vdev->v_phys_read = read; vdev->v_read_priv = read_priv; vdev->v_state = VDEV_STATE_HEALTHY; } else { printf("ZFS: inconsistent nvlist contents\n"); return (EIO); } /* * Re-evaluate top-level vdev state. */ vdev_set_state(top_vdev); /* * Ok, we are happy with the pool so far. Lets find * the best uberblock and then we can actually access * the contents of the pool. */ upbuf = zfs_alloc(VDEV_UBERBLOCK_SIZE(vdev)); up = (const struct uberblock *)upbuf; for (i = 0; i < VDEV_UBERBLOCK_COUNT(vdev); i++) { off = VDEV_UBERBLOCK_OFFSET(vdev, i); BP_ZERO(&bp); DVA_SET_OFFSET(&bp.blk_dva[0], off); BP_SET_LSIZE(&bp, VDEV_UBERBLOCK_SIZE(vdev)); BP_SET_PSIZE(&bp, VDEV_UBERBLOCK_SIZE(vdev)); BP_SET_CHECKSUM(&bp, ZIO_CHECKSUM_LABEL); BP_SET_COMPRESS(&bp, ZIO_COMPRESS_OFF); ZIO_SET_CHECKSUM(&bp.blk_cksum, off, 0, 0, 0); if (vdev_read_phys(vdev, &bp, upbuf, off, 0)) continue; if (up->ub_magic != UBERBLOCK_MAGIC) continue; if (up->ub_txg < spa->spa_txg) continue; if (up->ub_txg > spa->spa_uberblock.ub_txg) { spa->spa_uberblock = *up; } else if (up->ub_txg == spa->spa_uberblock.ub_txg) { if (up->ub_timestamp > spa->spa_uberblock.ub_timestamp) spa->spa_uberblock = *up; } } zfs_free(upbuf, VDEV_UBERBLOCK_SIZE(vdev)); vdev->spa = spa; if (spap) *spap = spa; return (0); } static int ilog2(int n) { int v; for (v = 0; v < 32; v++) if (n == (1 << v)) return v; return -1; } static int zio_read_gang(const spa_t *spa, const blkptr_t *bp, void *buf) { blkptr_t gbh_bp; zio_gbh_phys_t zio_gb; char *pbuf; int i; /* Artificial BP for gang block header. */ gbh_bp = *bp; BP_SET_PSIZE(&gbh_bp, SPA_GANGBLOCKSIZE); BP_SET_LSIZE(&gbh_bp, SPA_GANGBLOCKSIZE); BP_SET_CHECKSUM(&gbh_bp, ZIO_CHECKSUM_GANG_HEADER); BP_SET_COMPRESS(&gbh_bp, ZIO_COMPRESS_OFF); for (i = 0; i < SPA_DVAS_PER_BP; i++) DVA_SET_GANG(&gbh_bp.blk_dva[i], 0); /* Read gang header block using the artificial BP. */ if (zio_read(spa, &gbh_bp, &zio_gb)) return (EIO); pbuf = buf; for (i = 0; i < SPA_GBH_NBLKPTRS; i++) { blkptr_t *gbp = &zio_gb.zg_blkptr[i]; if (BP_IS_HOLE(gbp)) continue; if (zio_read(spa, gbp, pbuf)) return (EIO); pbuf += BP_GET_PSIZE(gbp); } if (zio_checksum_verify(spa, bp, buf)) return (EIO); return (0); } static int zio_read(const spa_t *spa, const blkptr_t *bp, void *buf) { int cpfunc = BP_GET_COMPRESS(bp); uint64_t align, size; void *pbuf; int i, error; /* * Process data embedded in block pointer */ if (BP_IS_EMBEDDED(bp)) { ASSERT(BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA); size = BPE_GET_PSIZE(bp); ASSERT(size <= BPE_PAYLOAD_SIZE); if (cpfunc != ZIO_COMPRESS_OFF) pbuf = zfs_alloc(size); else pbuf = buf; decode_embedded_bp_compressed(bp, pbuf); error = 0; if (cpfunc != ZIO_COMPRESS_OFF) { error = zio_decompress_data(cpfunc, pbuf, size, buf, BP_GET_LSIZE(bp)); zfs_free(pbuf, size); } if (error != 0) printf("ZFS: i/o error - unable to decompress block pointer data, error %d\n", error); return (error); } error = EIO; for (i = 0; i < SPA_DVAS_PER_BP; i++) { const dva_t *dva = &bp->blk_dva[i]; vdev_t *vdev; int vdevid; off_t offset; if (!dva->dva_word[0] && !dva->dva_word[1]) continue; vdevid = DVA_GET_VDEV(dva); offset = DVA_GET_OFFSET(dva); STAILQ_FOREACH(vdev, &spa->spa_vdevs, v_childlink) { if (vdev->v_id == vdevid) break; } if (!vdev || !vdev->v_read) continue; size = BP_GET_PSIZE(bp); if (vdev->v_read == vdev_raidz_read) { align = 1ULL << vdev->v_top->v_ashift; if (P2PHASE(size, align) != 0) size = P2ROUNDUP(size, align); } if (size != BP_GET_PSIZE(bp) || cpfunc != ZIO_COMPRESS_OFF) pbuf = zfs_alloc(size); else pbuf = buf; if (DVA_GET_GANG(dva)) error = zio_read_gang(spa, bp, pbuf); else error = vdev->v_read(vdev, bp, pbuf, offset, size); if (error == 0) { if (cpfunc != ZIO_COMPRESS_OFF) error = zio_decompress_data(cpfunc, pbuf, BP_GET_PSIZE(bp), buf, BP_GET_LSIZE(bp)); else if (size != BP_GET_PSIZE(bp)) bcopy(pbuf, buf, BP_GET_PSIZE(bp)); } if (buf != pbuf) zfs_free(pbuf, size); if (error == 0) break; } if (error != 0) printf("ZFS: i/o error - all block copies unavailable\n"); return (error); } static int dnode_read(const spa_t *spa, const dnode_phys_t *dnode, off_t offset, void *buf, size_t buflen) { int ibshift = dnode->dn_indblkshift - SPA_BLKPTRSHIFT; int bsize = dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT; int nlevels = dnode->dn_nlevels; int i, rc; if (bsize > SPA_MAXBLOCKSIZE) { printf("ZFS: I/O error - blocks larger than %llu are not " "supported\n", SPA_MAXBLOCKSIZE); return (EIO); } /* * Note: bsize may not be a power of two here so we need to do an * actual divide rather than a bitshift. */ while (buflen > 0) { uint64_t bn = offset / bsize; int boff = offset % bsize; int ibn; const blkptr_t *indbp; blkptr_t bp; if (bn > dnode->dn_maxblkid) return (EIO); if (dnode == dnode_cache_obj && bn == dnode_cache_bn) goto cached; indbp = dnode->dn_blkptr; for (i = 0; i < nlevels; i++) { /* * Copy the bp from the indirect array so that * we can re-use the scratch buffer for multi-level * objects. */ ibn = bn >> ((nlevels - i - 1) * ibshift); ibn &= ((1 << ibshift) - 1); bp = indbp[ibn]; if (BP_IS_HOLE(&bp)) { memset(dnode_cache_buf, 0, bsize); break; } rc = zio_read(spa, &bp, dnode_cache_buf); if (rc) return (rc); indbp = (const blkptr_t *) dnode_cache_buf; } dnode_cache_obj = dnode; dnode_cache_bn = bn; cached: /* * The buffer contains our data block. Copy what we * need from it and loop. */ i = bsize - boff; if (i > buflen) i = buflen; memcpy(buf, &dnode_cache_buf[boff], i); buf = ((char*) buf) + i; offset += i; buflen -= i; } return (0); } /* * Lookup a value in a microzap directory. Assumes that the zap * scratch buffer contains the directory contents. */ static int mzap_lookup(const dnode_phys_t *dnode, const char *name, uint64_t *value) { const mzap_phys_t *mz; const mzap_ent_phys_t *mze; size_t size; int chunks, i; /* * Microzap objects use exactly one block. Read the whole * thing. */ size = dnode->dn_datablkszsec * 512; mz = (const mzap_phys_t *) zap_scratch; chunks = size / MZAP_ENT_LEN - 1; for (i = 0; i < chunks; i++) { mze = &mz->mz_chunk[i]; if (!strcmp(mze->mze_name, name)) { *value = mze->mze_value; return (0); } } return (ENOENT); } /* * Compare a name with a zap leaf entry. Return non-zero if the name * matches. */ static int fzap_name_equal(const zap_leaf_t *zl, const zap_leaf_chunk_t *zc, const char *name) { size_t namelen; const zap_leaf_chunk_t *nc; const char *p; namelen = zc->l_entry.le_name_numints; nc = &ZAP_LEAF_CHUNK(zl, zc->l_entry.le_name_chunk); p = name; while (namelen > 0) { size_t len; len = namelen; if (len > ZAP_LEAF_ARRAY_BYTES) len = ZAP_LEAF_ARRAY_BYTES; if (memcmp(p, nc->l_array.la_array, len)) return (0); p += len; namelen -= len; nc = &ZAP_LEAF_CHUNK(zl, nc->l_array.la_next); } return 1; } /* * Extract a uint64_t value from a zap leaf entry. */ static uint64_t fzap_leaf_value(const zap_leaf_t *zl, const zap_leaf_chunk_t *zc) { const zap_leaf_chunk_t *vc; int i; uint64_t value; const uint8_t *p; vc = &ZAP_LEAF_CHUNK(zl, zc->l_entry.le_value_chunk); for (i = 0, value = 0, p = vc->l_array.la_array; i < 8; i++) { value = (value << 8) | p[i]; } return value; } static void stv(int len, void *addr, uint64_t value) { switch (len) { case 1: *(uint8_t *)addr = value; return; case 2: *(uint16_t *)addr = value; return; case 4: *(uint32_t *)addr = value; return; case 8: *(uint64_t *)addr = value; return; } } /* * Extract a array from a zap leaf entry. */ static void fzap_leaf_array(const zap_leaf_t *zl, const zap_leaf_chunk_t *zc, uint64_t integer_size, uint64_t num_integers, void *buf) { uint64_t array_int_len = zc->l_entry.le_value_intlen; uint64_t value = 0; uint64_t *u64 = buf; char *p = buf; int len = MIN(zc->l_entry.le_value_numints, num_integers); int chunk = zc->l_entry.le_value_chunk; int byten = 0; if (integer_size == 8 && len == 1) { *u64 = fzap_leaf_value(zl, zc); return; } while (len > 0) { struct zap_leaf_array *la = &ZAP_LEAF_CHUNK(zl, chunk).l_array; int i; ASSERT3U(chunk, <, ZAP_LEAF_NUMCHUNKS(zl)); for (i = 0; i < ZAP_LEAF_ARRAY_BYTES && len > 0; i++) { value = (value << 8) | la->la_array[i]; byten++; if (byten == array_int_len) { stv(integer_size, p, value); byten = 0; len--; if (len == 0) return; p += integer_size; } } chunk = la->la_next; } } /* * Lookup a value in a fatzap directory. Assumes that the zap scratch * buffer contains the directory header. */ static int fzap_lookup(const spa_t *spa, const dnode_phys_t *dnode, const char *name, uint64_t integer_size, uint64_t num_integers, void *value) { int bsize = dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT; zap_phys_t zh = *(zap_phys_t *) zap_scratch; fat_zap_t z; uint64_t *ptrtbl; uint64_t hash; int rc; if (zh.zap_magic != ZAP_MAGIC) return (EIO); z.zap_block_shift = ilog2(bsize); z.zap_phys = (zap_phys_t *) zap_scratch; /* * Figure out where the pointer table is and read it in if necessary. */ if (zh.zap_ptrtbl.zt_blk) { rc = dnode_read(spa, dnode, zh.zap_ptrtbl.zt_blk * bsize, zap_scratch, bsize); if (rc) return (rc); ptrtbl = (uint64_t *) zap_scratch; } else { ptrtbl = &ZAP_EMBEDDED_PTRTBL_ENT(&z, 0); } hash = zap_hash(zh.zap_salt, name); zap_leaf_t zl; zl.l_bs = z.zap_block_shift; off_t off = ptrtbl[hash >> (64 - zh.zap_ptrtbl.zt_shift)] << zl.l_bs; zap_leaf_chunk_t *zc; rc = dnode_read(spa, dnode, off, zap_scratch, bsize); if (rc) return (rc); zl.l_phys = (zap_leaf_phys_t *) zap_scratch; /* * Make sure this chunk matches our hash. */ if (zl.l_phys->l_hdr.lh_prefix_len > 0 && zl.l_phys->l_hdr.lh_prefix != hash >> (64 - zl.l_phys->l_hdr.lh_prefix_len)) return (ENOENT); /* * Hash within the chunk to find our entry. */ int shift = (64 - ZAP_LEAF_HASH_SHIFT(&zl) - zl.l_phys->l_hdr.lh_prefix_len); int h = (hash >> shift) & ((1 << ZAP_LEAF_HASH_SHIFT(&zl)) - 1); h = zl.l_phys->l_hash[h]; if (h == 0xffff) return (ENOENT); zc = &ZAP_LEAF_CHUNK(&zl, h); while (zc->l_entry.le_hash != hash) { if (zc->l_entry.le_next == 0xffff) { zc = NULL; break; } zc = &ZAP_LEAF_CHUNK(&zl, zc->l_entry.le_next); } if (fzap_name_equal(&zl, zc, name)) { if (zc->l_entry.le_value_intlen * zc->l_entry.le_value_numints > integer_size * num_integers) return (E2BIG); fzap_leaf_array(&zl, zc, integer_size, num_integers, value); return (0); } return (ENOENT); } /* * Lookup a name in a zap object and return its value as a uint64_t. */ static int zap_lookup(const spa_t *spa, const dnode_phys_t *dnode, const char *name, uint64_t integer_size, uint64_t num_integers, void *value) { int rc; uint64_t zap_type; size_t size = dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT; rc = dnode_read(spa, dnode, 0, zap_scratch, size); if (rc) return (rc); zap_type = *(uint64_t *) zap_scratch; if (zap_type == ZBT_MICRO) return mzap_lookup(dnode, name, value); else if (zap_type == ZBT_HEADER) { return fzap_lookup(spa, dnode, name, integer_size, num_integers, value); } printf("ZFS: invalid zap_type=%d\n", (int)zap_type); return (EIO); } /* * List a microzap directory. Assumes that the zap scratch buffer contains * the directory contents. */ static int mzap_list(const dnode_phys_t *dnode, int (*callback)(const char *, uint64_t)) { const mzap_phys_t *mz; const mzap_ent_phys_t *mze; size_t size; int chunks, i, rc; /* * Microzap objects use exactly one block. Read the whole * thing. */ size = dnode->dn_datablkszsec * 512; mz = (const mzap_phys_t *) zap_scratch; chunks = size / MZAP_ENT_LEN - 1; for (i = 0; i < chunks; i++) { mze = &mz->mz_chunk[i]; if (mze->mze_name[0]) { rc = callback(mze->mze_name, mze->mze_value); if (rc != 0) return (rc); } } return (0); } /* * List a fatzap directory. Assumes that the zap scratch buffer contains * the directory header. */ static int fzap_list(const spa_t *spa, const dnode_phys_t *dnode, int (*callback)(const char *, uint64_t)) { int bsize = dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT; zap_phys_t zh = *(zap_phys_t *) zap_scratch; fat_zap_t z; int i, j, rc; if (zh.zap_magic != ZAP_MAGIC) return (EIO); z.zap_block_shift = ilog2(bsize); z.zap_phys = (zap_phys_t *) zap_scratch; /* * This assumes that the leaf blocks start at block 1. The * documentation isn't exactly clear on this. */ zap_leaf_t zl; zl.l_bs = z.zap_block_shift; for (i = 0; i < zh.zap_num_leafs; i++) { off_t off = (i + 1) << zl.l_bs; char name[256], *p; uint64_t value; if (dnode_read(spa, dnode, off, zap_scratch, bsize)) return (EIO); zl.l_phys = (zap_leaf_phys_t *) zap_scratch; for (j = 0; j < ZAP_LEAF_NUMCHUNKS(&zl); j++) { zap_leaf_chunk_t *zc, *nc; int namelen; zc = &ZAP_LEAF_CHUNK(&zl, j); if (zc->l_entry.le_type != ZAP_CHUNK_ENTRY) continue; namelen = zc->l_entry.le_name_numints; if (namelen > sizeof(name)) namelen = sizeof(name); /* * Paste the name back together. */ nc = &ZAP_LEAF_CHUNK(&zl, zc->l_entry.le_name_chunk); p = name; while (namelen > 0) { int len; len = namelen; if (len > ZAP_LEAF_ARRAY_BYTES) len = ZAP_LEAF_ARRAY_BYTES; memcpy(p, nc->l_array.la_array, len); p += len; namelen -= len; nc = &ZAP_LEAF_CHUNK(&zl, nc->l_array.la_next); } /* * Assume the first eight bytes of the value are * a uint64_t. */ value = fzap_leaf_value(&zl, zc); //printf("%s 0x%jx\n", name, (uintmax_t)value); rc = callback((const char *)name, value); if (rc != 0) return (rc); } } return (0); } static int zfs_printf(const char *name, uint64_t value __unused) { printf("%s\n", name); return (0); } /* * List a zap directory. */ static int zap_list(const spa_t *spa, const dnode_phys_t *dnode) { uint64_t zap_type; size_t size = dnode->dn_datablkszsec * 512; if (dnode_read(spa, dnode, 0, zap_scratch, size)) return (EIO); zap_type = *(uint64_t *) zap_scratch; if (zap_type == ZBT_MICRO) return mzap_list(dnode, zfs_printf); else return fzap_list(spa, dnode, zfs_printf); } static int objset_get_dnode(const spa_t *spa, const objset_phys_t *os, uint64_t objnum, dnode_phys_t *dnode) { off_t offset; offset = objnum * sizeof(dnode_phys_t); return dnode_read(spa, &os->os_meta_dnode, offset, dnode, sizeof(dnode_phys_t)); } static int mzap_rlookup(const spa_t *spa, const dnode_phys_t *dnode, char *name, uint64_t value) { const mzap_phys_t *mz; const mzap_ent_phys_t *mze; size_t size; int chunks, i; /* * Microzap objects use exactly one block. Read the whole * thing. */ size = dnode->dn_datablkszsec * 512; mz = (const mzap_phys_t *) zap_scratch; chunks = size / MZAP_ENT_LEN - 1; for (i = 0; i < chunks; i++) { mze = &mz->mz_chunk[i]; if (value == mze->mze_value) { strcpy(name, mze->mze_name); return (0); } } return (ENOENT); } static void fzap_name_copy(const zap_leaf_t *zl, const zap_leaf_chunk_t *zc, char *name) { size_t namelen; const zap_leaf_chunk_t *nc; char *p; namelen = zc->l_entry.le_name_numints; nc = &ZAP_LEAF_CHUNK(zl, zc->l_entry.le_name_chunk); p = name; while (namelen > 0) { size_t len; len = namelen; if (len > ZAP_LEAF_ARRAY_BYTES) len = ZAP_LEAF_ARRAY_BYTES; memcpy(p, nc->l_array.la_array, len); p += len; namelen -= len; nc = &ZAP_LEAF_CHUNK(zl, nc->l_array.la_next); } *p = '\0'; } static int fzap_rlookup(const spa_t *spa, const dnode_phys_t *dnode, char *name, uint64_t value) { int bsize = dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT; zap_phys_t zh = *(zap_phys_t *) zap_scratch; fat_zap_t z; int i, j; if (zh.zap_magic != ZAP_MAGIC) return (EIO); z.zap_block_shift = ilog2(bsize); z.zap_phys = (zap_phys_t *) zap_scratch; /* * This assumes that the leaf blocks start at block 1. The * documentation isn't exactly clear on this. */ zap_leaf_t zl; zl.l_bs = z.zap_block_shift; for (i = 0; i < zh.zap_num_leafs; i++) { off_t off = (i + 1) << zl.l_bs; if (dnode_read(spa, dnode, off, zap_scratch, bsize)) return (EIO); zl.l_phys = (zap_leaf_phys_t *) zap_scratch; for (j = 0; j < ZAP_LEAF_NUMCHUNKS(&zl); j++) { zap_leaf_chunk_t *zc; zc = &ZAP_LEAF_CHUNK(&zl, j); if (zc->l_entry.le_type != ZAP_CHUNK_ENTRY) continue; if (zc->l_entry.le_value_intlen != 8 || zc->l_entry.le_value_numints != 1) continue; if (fzap_leaf_value(&zl, zc) == value) { fzap_name_copy(&zl, zc, name); return (0); } } } return (ENOENT); } static int zap_rlookup(const spa_t *spa, const dnode_phys_t *dnode, char *name, uint64_t value) { int rc; uint64_t zap_type; size_t size = dnode->dn_datablkszsec * 512; rc = dnode_read(spa, dnode, 0, zap_scratch, size); if (rc) return (rc); zap_type = *(uint64_t *) zap_scratch; if (zap_type == ZBT_MICRO) return mzap_rlookup(spa, dnode, name, value); else return fzap_rlookup(spa, dnode, name, value); } static int zfs_rlookup(const spa_t *spa, uint64_t objnum, char *result) { char name[256]; char component[256]; uint64_t dir_obj, parent_obj, child_dir_zapobj; dnode_phys_t child_dir_zap, dataset, dir, parent; dsl_dir_phys_t *dd; dsl_dataset_phys_t *ds; char *p; int len; p = &name[sizeof(name) - 1]; *p = '\0'; if (objset_get_dnode(spa, &spa->spa_mos, objnum, &dataset)) { printf("ZFS: can't find dataset %ju\n", (uintmax_t)objnum); return (EIO); } ds = (dsl_dataset_phys_t *)&dataset.dn_bonus; dir_obj = ds->ds_dir_obj; for (;;) { if (objset_get_dnode(spa, &spa->spa_mos, dir_obj, &dir) != 0) return (EIO); dd = (dsl_dir_phys_t *)&dir.dn_bonus; /* Actual loop condition. */ parent_obj = dd->dd_parent_obj; if (parent_obj == 0) break; if (objset_get_dnode(spa, &spa->spa_mos, parent_obj, &parent) != 0) return (EIO); dd = (dsl_dir_phys_t *)&parent.dn_bonus; child_dir_zapobj = dd->dd_child_dir_zapobj; if (objset_get_dnode(spa, &spa->spa_mos, child_dir_zapobj, &child_dir_zap) != 0) return (EIO); if (zap_rlookup(spa, &child_dir_zap, component, dir_obj) != 0) return (EIO); len = strlen(component); p -= len; memcpy(p, component, len); --p; *p = '/'; /* Actual loop iteration. */ dir_obj = parent_obj; } if (*p != '\0') ++p; strcpy(result, p); return (0); } static int zfs_lookup_dataset(const spa_t *spa, const char *name, uint64_t *objnum) { char element[256]; uint64_t dir_obj, child_dir_zapobj; dnode_phys_t child_dir_zap, dir; dsl_dir_phys_t *dd; const char *p, *q; if (objset_get_dnode(spa, &spa->spa_mos, DMU_POOL_DIRECTORY_OBJECT, &dir)) return (EIO); if (zap_lookup(spa, &dir, DMU_POOL_ROOT_DATASET, sizeof (dir_obj), 1, &dir_obj)) return (EIO); p = name; for (;;) { if (objset_get_dnode(spa, &spa->spa_mos, dir_obj, &dir)) return (EIO); dd = (dsl_dir_phys_t *)&dir.dn_bonus; while (*p == '/') p++; /* Actual loop condition #1. */ if (*p == '\0') break; q = strchr(p, '/'); if (q) { memcpy(element, p, q - p); element[q - p] = '\0'; p = q + 1; } else { strcpy(element, p); p += strlen(p); } child_dir_zapobj = dd->dd_child_dir_zapobj; if (objset_get_dnode(spa, &spa->spa_mos, child_dir_zapobj, &child_dir_zap) != 0) return (EIO); /* Actual loop condition #2. */ if (zap_lookup(spa, &child_dir_zap, element, sizeof (dir_obj), 1, &dir_obj) != 0) return (ENOENT); } *objnum = dd->dd_head_dataset_obj; return (0); } #ifndef BOOT2 static int zfs_list_dataset(const spa_t *spa, uint64_t objnum/*, int pos, char *entry*/) { uint64_t dir_obj, child_dir_zapobj; dnode_phys_t child_dir_zap, dir, dataset; dsl_dataset_phys_t *ds; dsl_dir_phys_t *dd; if (objset_get_dnode(spa, &spa->spa_mos, objnum, &dataset)) { printf("ZFS: can't find dataset %ju\n", (uintmax_t)objnum); return (EIO); } ds = (dsl_dataset_phys_t *) &dataset.dn_bonus; dir_obj = ds->ds_dir_obj; if (objset_get_dnode(spa, &spa->spa_mos, dir_obj, &dir)) { printf("ZFS: can't find dirobj %ju\n", (uintmax_t)dir_obj); return (EIO); } dd = (dsl_dir_phys_t *)&dir.dn_bonus; child_dir_zapobj = dd->dd_child_dir_zapobj; if (objset_get_dnode(spa, &spa->spa_mos, child_dir_zapobj, &child_dir_zap) != 0) { printf("ZFS: can't find child zap %ju\n", (uintmax_t)dir_obj); return (EIO); } return (zap_list(spa, &child_dir_zap) != 0); } int zfs_callback_dataset(const spa_t *spa, uint64_t objnum, int (*callback)(const char *, uint64_t)) { uint64_t dir_obj, child_dir_zapobj, zap_type; dnode_phys_t child_dir_zap, dir, dataset; dsl_dataset_phys_t *ds; dsl_dir_phys_t *dd; int err; err = objset_get_dnode(spa, &spa->spa_mos, objnum, &dataset); if (err != 0) { printf("ZFS: can't find dataset %ju\n", (uintmax_t)objnum); return (err); } ds = (dsl_dataset_phys_t *) &dataset.dn_bonus; dir_obj = ds->ds_dir_obj; err = objset_get_dnode(spa, &spa->spa_mos, dir_obj, &dir); if (err != 0) { printf("ZFS: can't find dirobj %ju\n", (uintmax_t)dir_obj); return (err); } dd = (dsl_dir_phys_t *)&dir.dn_bonus; child_dir_zapobj = dd->dd_child_dir_zapobj; err = objset_get_dnode(spa, &spa->spa_mos, child_dir_zapobj, &child_dir_zap); if (err != 0) { printf("ZFS: can't find child zap %ju\n", (uintmax_t)dir_obj); return (err); } err = dnode_read(spa, &child_dir_zap, 0, zap_scratch, child_dir_zap.dn_datablkszsec * 512); if (err != 0) return (err); zap_type = *(uint64_t *) zap_scratch; if (zap_type == ZBT_MICRO) return mzap_list(&child_dir_zap, callback); else return fzap_list(spa, &child_dir_zap, callback); } #endif /* * Find the object set given the object number of its dataset object * and return its details in *objset */ static int zfs_mount_dataset(const spa_t *spa, uint64_t objnum, objset_phys_t *objset) { dnode_phys_t dataset; dsl_dataset_phys_t *ds; if (objset_get_dnode(spa, &spa->spa_mos, objnum, &dataset)) { printf("ZFS: can't find dataset %ju\n", (uintmax_t)objnum); return (EIO); } ds = (dsl_dataset_phys_t *) &dataset.dn_bonus; if (zio_read(spa, &ds->ds_bp, objset)) { printf("ZFS: can't read object set for dataset %ju\n", (uintmax_t)objnum); return (EIO); } return (0); } /* * Find the object set pointed to by the BOOTFS property or the root * dataset if there is none and return its details in *objset */ static int zfs_get_root(const spa_t *spa, uint64_t *objid) { dnode_phys_t dir, propdir; uint64_t props, bootfs, root; *objid = 0; /* * Start with the MOS directory object. */ if (objset_get_dnode(spa, &spa->spa_mos, DMU_POOL_DIRECTORY_OBJECT, &dir)) { printf("ZFS: can't read MOS object directory\n"); return (EIO); } /* * Lookup the pool_props and see if we can find a bootfs. */ if (zap_lookup(spa, &dir, DMU_POOL_PROPS, sizeof (props), 1, &props) == 0 && objset_get_dnode(spa, &spa->spa_mos, props, &propdir) == 0 && zap_lookup(spa, &propdir, "bootfs", sizeof (bootfs), 1, &bootfs) == 0 && bootfs != 0) { *objid = bootfs; return (0); } /* * Lookup the root dataset directory */ if (zap_lookup(spa, &dir, DMU_POOL_ROOT_DATASET, sizeof (root), 1, &root) || objset_get_dnode(spa, &spa->spa_mos, root, &dir)) { printf("ZFS: can't find root dsl_dir\n"); return (EIO); } /* * Use the information from the dataset directory's bonus buffer * to find the dataset object and from that the object set itself. */ dsl_dir_phys_t *dd = (dsl_dir_phys_t *) &dir.dn_bonus; *objid = dd->dd_head_dataset_obj; return (0); } static int zfs_mount(const spa_t *spa, uint64_t rootobj, struct zfsmount *mount) { mount->spa = spa; /* * Find the root object set if not explicitly provided */ if (rootobj == 0 && zfs_get_root(spa, &rootobj)) { printf("ZFS: can't find root filesystem\n"); return (EIO); } if (zfs_mount_dataset(spa, rootobj, &mount->objset)) { printf("ZFS: can't open root filesystem\n"); return (EIO); } mount->rootobj = rootobj; return (0); } /* * callback function for feature name checks. */ static int check_feature(const char *name, uint64_t value) { int i; if (value == 0) return (0); if (name[0] == '\0') return (0); for (i = 0; features_for_read[i] != NULL; i++) { if (strcmp(name, features_for_read[i]) == 0) return (0); } printf("ZFS: unsupported feature: %s\n", name); return (EIO); } /* * Checks whether the MOS features that are active are supported. */ static int check_mos_features(const spa_t *spa) { dnode_phys_t dir; uint64_t objnum, zap_type; size_t size; int rc; if ((rc = objset_get_dnode(spa, &spa->spa_mos, DMU_OT_OBJECT_DIRECTORY, &dir)) != 0) return (rc); if ((rc = zap_lookup(spa, &dir, DMU_POOL_FEATURES_FOR_READ, sizeof (objnum), 1, &objnum)) != 0) { /* * It is older pool without features. As we have already * tested the label, just return without raising the error. */ return (0); } if ((rc = objset_get_dnode(spa, &spa->spa_mos, objnum, &dir)) != 0) return (rc); if (dir.dn_type != DMU_OTN_ZAP_METADATA) return (EIO); size = dir.dn_datablkszsec * 512; if (dnode_read(spa, &dir, 0, zap_scratch, size)) return (EIO); zap_type = *(uint64_t *) zap_scratch; if (zap_type == ZBT_MICRO) rc = mzap_list(&dir, check_feature); else rc = fzap_list(spa, &dir, check_feature); return (rc); } static int zfs_spa_init(spa_t *spa) { dnode_phys_t dir; int rc; if (zio_read(spa, &spa->spa_uberblock.ub_rootbp, &spa->spa_mos)) { printf("ZFS: can't read MOS of pool %s\n", spa->spa_name); return (EIO); } if (spa->spa_mos.os_type != DMU_OST_META) { printf("ZFS: corrupted MOS of pool %s\n", spa->spa_name); return (EIO); } if (objset_get_dnode(spa, &spa->spa_mos, DMU_POOL_DIRECTORY_OBJECT, &dir)) { printf("ZFS: failed to read pool %s directory object\n", spa->spa_name); return (EIO); } /* this is allowed to fail, older pools do not have salt */ rc = zap_lookup(spa, &dir, DMU_POOL_CHECKSUM_SALT, 1, sizeof (spa->spa_cksum_salt.zcs_bytes), spa->spa_cksum_salt.zcs_bytes); rc = check_mos_features(spa); if (rc != 0) { printf("ZFS: pool %s is not supported\n", spa->spa_name); } return (rc); } static int zfs_dnode_stat(const spa_t *spa, dnode_phys_t *dn, struct stat *sb) { if (dn->dn_bonustype != DMU_OT_SA) { znode_phys_t *zp = (znode_phys_t *)dn->dn_bonus; sb->st_mode = zp->zp_mode; sb->st_uid = zp->zp_uid; sb->st_gid = zp->zp_gid; sb->st_size = zp->zp_size; } else { sa_hdr_phys_t *sahdrp; int hdrsize; size_t size = 0; void *buf = NULL; if (dn->dn_bonuslen != 0) sahdrp = (sa_hdr_phys_t *)DN_BONUS(dn); else { if ((dn->dn_flags & DNODE_FLAG_SPILL_BLKPTR) != 0) { blkptr_t *bp = &dn->dn_spill; int error; size = BP_GET_LSIZE(bp); buf = zfs_alloc(size); error = zio_read(spa, bp, buf); if (error != 0) { zfs_free(buf, size); return (error); } sahdrp = buf; } else { return (EIO); } } hdrsize = SA_HDR_SIZE(sahdrp); sb->st_mode = *(uint64_t *)((char *)sahdrp + hdrsize + SA_MODE_OFFSET); sb->st_uid = *(uint64_t *)((char *)sahdrp + hdrsize + SA_UID_OFFSET); sb->st_gid = *(uint64_t *)((char *)sahdrp + hdrsize + SA_GID_OFFSET); sb->st_size = *(uint64_t *)((char *)sahdrp + hdrsize + SA_SIZE_OFFSET); if (buf != NULL) zfs_free(buf, size); } return (0); } static int zfs_dnode_readlink(const spa_t *spa, dnode_phys_t *dn, char *path, size_t psize) { int rc = 0; if (dn->dn_bonustype == DMU_OT_SA) { sa_hdr_phys_t *sahdrp = NULL; size_t size = 0; void *buf = NULL; int hdrsize; char *p; if (dn->dn_bonuslen != 0) sahdrp = (sa_hdr_phys_t *)DN_BONUS(dn); else { blkptr_t *bp; if ((dn->dn_flags & DNODE_FLAG_SPILL_BLKPTR) == 0) return (EIO); bp = &dn->dn_spill; size = BP_GET_LSIZE(bp); buf = zfs_alloc(size); rc = zio_read(spa, bp, buf); if (rc != 0) { zfs_free(buf, size); return (rc); } sahdrp = buf; } hdrsize = SA_HDR_SIZE(sahdrp); p = (char *)((uintptr_t)sahdrp + hdrsize + SA_SYMLINK_OFFSET); memcpy(path, p, psize); if (buf != NULL) zfs_free(buf, size); return (0); } /* * Second test is purely to silence bogus compiler * warning about accessing past the end of dn_bonus. */ if (psize + sizeof(znode_phys_t) <= dn->dn_bonuslen && sizeof(znode_phys_t) <= sizeof(dn->dn_bonus)) { memcpy(path, &dn->dn_bonus[sizeof(znode_phys_t)], psize); } else { rc = dnode_read(spa, dn, 0, path, psize); } return (rc); } struct obj_list { uint64_t objnum; STAILQ_ENTRY(obj_list) entry; }; /* * Lookup a file and return its dnode. */ static int zfs_lookup(const struct zfsmount *mount, const char *upath, dnode_phys_t *dnode) { int rc; uint64_t objnum; const spa_t *spa; dnode_phys_t dn; const char *p, *q; char element[256]; char path[1024]; int symlinks_followed = 0; struct stat sb; - struct obj_list *entry; + struct obj_list *entry, *tentry; STAILQ_HEAD(, obj_list) on_cache = STAILQ_HEAD_INITIALIZER(on_cache); spa = mount->spa; if (mount->objset.os_type != DMU_OST_ZFS) { printf("ZFS: unexpected object set type %ju\n", (uintmax_t)mount->objset.os_type); return (EIO); } if ((entry = malloc(sizeof(struct obj_list))) == NULL) return (ENOMEM); /* * Get the root directory dnode. */ rc = objset_get_dnode(spa, &mount->objset, MASTER_NODE_OBJ, &dn); if (rc) { free(entry); return (rc); } rc = zap_lookup(spa, &dn, ZFS_ROOT_OBJ, sizeof (objnum), 1, &objnum); if (rc) { free(entry); return (rc); } entry->objnum = objnum; STAILQ_INSERT_HEAD(&on_cache, entry, entry); rc = objset_get_dnode(spa, &mount->objset, objnum, &dn); if (rc != 0) goto done; p = upath; while (p && *p) { rc = objset_get_dnode(spa, &mount->objset, objnum, &dn); if (rc != 0) goto done; while (*p == '/') p++; if (*p == '\0') break; q = p; while (*q != '\0' && *q != '/') q++; /* skip dot */ if (p + 1 == q && p[0] == '.') { p++; continue; } /* double dot */ if (p + 2 == q && p[0] == '.' && p[1] == '.') { p += 2; if (STAILQ_FIRST(&on_cache) == STAILQ_LAST(&on_cache, obj_list, entry)) { rc = ENOENT; goto done; } entry = STAILQ_FIRST(&on_cache); STAILQ_REMOVE_HEAD(&on_cache, entry); free(entry); objnum = (STAILQ_FIRST(&on_cache))->objnum; continue; } if (q - p + 1 > sizeof(element)) { rc = ENAMETOOLONG; goto done; } memcpy(element, p, q - p); element[q - p] = 0; p = q; if ((rc = zfs_dnode_stat(spa, &dn, &sb)) != 0) goto done; if (!S_ISDIR(sb.st_mode)) { rc = ENOTDIR; goto done; } rc = zap_lookup(spa, &dn, element, sizeof (objnum), 1, &objnum); if (rc) goto done; objnum = ZFS_DIRENT_OBJ(objnum); if ((entry = malloc(sizeof(struct obj_list))) == NULL) { rc = ENOMEM; goto done; } entry->objnum = objnum; STAILQ_INSERT_HEAD(&on_cache, entry, entry); rc = objset_get_dnode(spa, &mount->objset, objnum, &dn); if (rc) goto done; /* * Check for symlink. */ rc = zfs_dnode_stat(spa, &dn, &sb); if (rc) goto done; if (S_ISLNK(sb.st_mode)) { if (symlinks_followed > 10) { rc = EMLINK; goto done; } symlinks_followed++; /* * Read the link value and copy the tail of our * current path onto the end. */ if (sb.st_size + strlen(p) + 1 > sizeof(path)) { rc = ENAMETOOLONG; goto done; } strcpy(&path[sb.st_size], p); rc = zfs_dnode_readlink(spa, &dn, path, sb.st_size); if (rc != 0) goto done; /* * Restart with the new path, starting either at * the root or at the parent depending whether or * not the link is relative. */ p = path; if (*p == '/') { while (STAILQ_FIRST(&on_cache) != STAILQ_LAST(&on_cache, obj_list, entry)) { entry = STAILQ_FIRST(&on_cache); STAILQ_REMOVE_HEAD(&on_cache, entry); free(entry); } } else { entry = STAILQ_FIRST(&on_cache); STAILQ_REMOVE_HEAD(&on_cache, entry); free(entry); } objnum = (STAILQ_FIRST(&on_cache))->objnum; } } *dnode = dn; done: - STAILQ_FOREACH(entry, &on_cache, entry) + STAILQ_FOREACH_SAFE(entry, &on_cache, entry, tentry) free(entry); return (rc); } Index: projects/clang400-import/sys/cam/ctl/ctl.c =================================================================== --- projects/clang400-import/sys/cam/ctl/ctl.c (revision 314522) +++ projects/clang400-import/sys/cam/ctl/ctl.c (revision 314523) @@ -1,13537 +1,13537 @@ /*- * Copyright (c) 2003-2009 Silicon Graphics International Corp. * Copyright (c) 2012 The FreeBSD Foundation * Copyright (c) 2014-2017 Alexander Motin * All rights reserved. * * Portions of this software were developed by Edward Tomasz Napierala * under sponsorship from the FreeBSD Foundation. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions, and the following disclaimer, * without modification. * 2. Redistributions in binary form must reproduce at minimum a disclaimer * substantially similar to the "NO WARRANTY" disclaimer below * ("Disclaimer") and any redistribution must be conditioned upon * including a substantially similar Disclaimer requirement for further * binary redistribution. * * NO WARRANTY * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGES. * * $Id$ */ /* * CAM Target Layer, a SCSI device emulation subsystem. * * Author: Ken Merry */ #define _CTL_C #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct ctl_softc *control_softc = NULL; /* * Template mode pages. */ /* * Note that these are default values only. The actual values will be * filled in when the user does a mode sense. */ const static struct scsi_da_rw_recovery_page rw_er_page_default = { /*page_code*/SMS_RW_ERROR_RECOVERY_PAGE, /*page_length*/sizeof(struct scsi_da_rw_recovery_page) - 2, /*byte3*/SMS_RWER_AWRE|SMS_RWER_ARRE, /*read_retry_count*/0, /*correction_span*/0, /*head_offset_count*/0, /*data_strobe_offset_cnt*/0, /*byte8*/SMS_RWER_LBPERE, /*write_retry_count*/0, /*reserved2*/0, /*recovery_time_limit*/{0, 0}, }; const static struct scsi_da_rw_recovery_page rw_er_page_changeable = { /*page_code*/SMS_RW_ERROR_RECOVERY_PAGE, /*page_length*/sizeof(struct scsi_da_rw_recovery_page) - 2, /*byte3*/SMS_RWER_PER, /*read_retry_count*/0, /*correction_span*/0, /*head_offset_count*/0, /*data_strobe_offset_cnt*/0, /*byte8*/SMS_RWER_LBPERE, /*write_retry_count*/0, /*reserved2*/0, /*recovery_time_limit*/{0, 0}, }; const static struct scsi_format_page format_page_default = { /*page_code*/SMS_FORMAT_DEVICE_PAGE, /*page_length*/sizeof(struct scsi_format_page) - 2, /*tracks_per_zone*/ {0, 0}, /*alt_sectors_per_zone*/ {0, 0}, /*alt_tracks_per_zone*/ {0, 0}, /*alt_tracks_per_lun*/ {0, 0}, /*sectors_per_track*/ {(CTL_DEFAULT_SECTORS_PER_TRACK >> 8) & 0xff, CTL_DEFAULT_SECTORS_PER_TRACK & 0xff}, /*bytes_per_sector*/ {0, 0}, /*interleave*/ {0, 0}, /*track_skew*/ {0, 0}, /*cylinder_skew*/ {0, 0}, /*flags*/ SFP_HSEC, /*reserved*/ {0, 0, 0} }; const static struct scsi_format_page format_page_changeable = { /*page_code*/SMS_FORMAT_DEVICE_PAGE, /*page_length*/sizeof(struct scsi_format_page) - 2, /*tracks_per_zone*/ {0, 0}, /*alt_sectors_per_zone*/ {0, 0}, /*alt_tracks_per_zone*/ {0, 0}, /*alt_tracks_per_lun*/ {0, 0}, /*sectors_per_track*/ {0, 0}, /*bytes_per_sector*/ {0, 0}, /*interleave*/ {0, 0}, /*track_skew*/ {0, 0}, /*cylinder_skew*/ {0, 0}, /*flags*/ 0, /*reserved*/ {0, 0, 0} }; const static struct scsi_rigid_disk_page rigid_disk_page_default = { /*page_code*/SMS_RIGID_DISK_PAGE, /*page_length*/sizeof(struct scsi_rigid_disk_page) - 2, /*cylinders*/ {0, 0, 0}, /*heads*/ CTL_DEFAULT_HEADS, /*start_write_precomp*/ {0, 0, 0}, /*start_reduced_current*/ {0, 0, 0}, /*step_rate*/ {0, 0}, /*landing_zone_cylinder*/ {0, 0, 0}, /*rpl*/ SRDP_RPL_DISABLED, /*rotational_offset*/ 0, /*reserved1*/ 0, /*rotation_rate*/ {(CTL_DEFAULT_ROTATION_RATE >> 8) & 0xff, CTL_DEFAULT_ROTATION_RATE & 0xff}, /*reserved2*/ {0, 0} }; const static struct scsi_rigid_disk_page rigid_disk_page_changeable = { /*page_code*/SMS_RIGID_DISK_PAGE, /*page_length*/sizeof(struct scsi_rigid_disk_page) - 2, /*cylinders*/ {0, 0, 0}, /*heads*/ 0, /*start_write_precomp*/ {0, 0, 0}, /*start_reduced_current*/ {0, 0, 0}, /*step_rate*/ {0, 0}, /*landing_zone_cylinder*/ {0, 0, 0}, /*rpl*/ 0, /*rotational_offset*/ 0, /*reserved1*/ 0, /*rotation_rate*/ {0, 0}, /*reserved2*/ {0, 0} }; const static struct scsi_da_verify_recovery_page verify_er_page_default = { /*page_code*/SMS_VERIFY_ERROR_RECOVERY_PAGE, /*page_length*/sizeof(struct scsi_da_verify_recovery_page) - 2, /*byte3*/0, /*read_retry_count*/0, /*reserved*/{ 0, 0, 0, 0, 0, 0 }, /*recovery_time_limit*/{0, 0}, }; const static struct scsi_da_verify_recovery_page verify_er_page_changeable = { /*page_code*/SMS_VERIFY_ERROR_RECOVERY_PAGE, /*page_length*/sizeof(struct scsi_da_verify_recovery_page) - 2, /*byte3*/SMS_VER_PER, /*read_retry_count*/0, /*reserved*/{ 0, 0, 0, 0, 0, 0 }, /*recovery_time_limit*/{0, 0}, }; const static struct scsi_caching_page caching_page_default = { /*page_code*/SMS_CACHING_PAGE, /*page_length*/sizeof(struct scsi_caching_page) - 2, /*flags1*/ SCP_DISC | SCP_WCE, /*ret_priority*/ 0, /*disable_pf_transfer_len*/ {0xff, 0xff}, /*min_prefetch*/ {0, 0}, /*max_prefetch*/ {0xff, 0xff}, /*max_pf_ceiling*/ {0xff, 0xff}, /*flags2*/ 0, /*cache_segments*/ 0, /*cache_seg_size*/ {0, 0}, /*reserved*/ 0, /*non_cache_seg_size*/ {0, 0, 0} }; const static struct scsi_caching_page caching_page_changeable = { /*page_code*/SMS_CACHING_PAGE, /*page_length*/sizeof(struct scsi_caching_page) - 2, /*flags1*/ SCP_WCE | SCP_RCD, /*ret_priority*/ 0, /*disable_pf_transfer_len*/ {0, 0}, /*min_prefetch*/ {0, 0}, /*max_prefetch*/ {0, 0}, /*max_pf_ceiling*/ {0, 0}, /*flags2*/ 0, /*cache_segments*/ 0, /*cache_seg_size*/ {0, 0}, /*reserved*/ 0, /*non_cache_seg_size*/ {0, 0, 0} }; const static struct scsi_control_page control_page_default = { /*page_code*/SMS_CONTROL_MODE_PAGE, /*page_length*/sizeof(struct scsi_control_page) - 2, /*rlec*/0, /*queue_flags*/SCP_QUEUE_ALG_RESTRICTED, /*eca_and_aen*/0, /*flags4*/SCP_TAS, /*aen_holdoff_period*/{0, 0}, /*busy_timeout_period*/{0, 0}, /*extended_selftest_completion_time*/{0, 0} }; const static struct scsi_control_page control_page_changeable = { /*page_code*/SMS_CONTROL_MODE_PAGE, /*page_length*/sizeof(struct scsi_control_page) - 2, /*rlec*/SCP_DSENSE, /*queue_flags*/SCP_QUEUE_ALG_MASK | SCP_NUAR, /*eca_and_aen*/SCP_SWP, /*flags4*/0, /*aen_holdoff_period*/{0, 0}, /*busy_timeout_period*/{0, 0}, /*extended_selftest_completion_time*/{0, 0} }; #define CTL_CEM_LEN (sizeof(struct scsi_control_ext_page) - 4) const static struct scsi_control_ext_page control_ext_page_default = { /*page_code*/SMS_CONTROL_MODE_PAGE | SMPH_SPF, /*subpage_code*/0x01, /*page_length*/{CTL_CEM_LEN >> 8, CTL_CEM_LEN}, /*flags*/0, /*prio*/0, /*max_sense*/0 }; const static struct scsi_control_ext_page control_ext_page_changeable = { /*page_code*/SMS_CONTROL_MODE_PAGE | SMPH_SPF, /*subpage_code*/0x01, /*page_length*/{CTL_CEM_LEN >> 8, CTL_CEM_LEN}, /*flags*/0, /*prio*/0, /*max_sense*/0xff }; const static struct scsi_info_exceptions_page ie_page_default = { /*page_code*/SMS_INFO_EXCEPTIONS_PAGE, /*page_length*/sizeof(struct scsi_info_exceptions_page) - 2, /*info_flags*/SIEP_FLAGS_EWASC, /*mrie*/SIEP_MRIE_NO, /*interval_timer*/{0, 0, 0, 0}, /*report_count*/{0, 0, 0, 1} }; const static struct scsi_info_exceptions_page ie_page_changeable = { /*page_code*/SMS_INFO_EXCEPTIONS_PAGE, /*page_length*/sizeof(struct scsi_info_exceptions_page) - 2, /*info_flags*/SIEP_FLAGS_EWASC | SIEP_FLAGS_DEXCPT | SIEP_FLAGS_TEST | SIEP_FLAGS_LOGERR, /*mrie*/0x0f, /*interval_timer*/{0xff, 0xff, 0xff, 0xff}, /*report_count*/{0xff, 0xff, 0xff, 0xff} }; #define CTL_LBPM_LEN (sizeof(struct ctl_logical_block_provisioning_page) - 4) const static struct ctl_logical_block_provisioning_page lbp_page_default = {{ /*page_code*/SMS_INFO_EXCEPTIONS_PAGE | SMPH_SPF, /*subpage_code*/0x02, /*page_length*/{CTL_LBPM_LEN >> 8, CTL_LBPM_LEN}, /*flags*/0, /*reserved*/{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, /*descr*/{}}, {{/*flags*/0, /*resource*/0x01, /*reserved*/{0, 0}, /*count*/{0, 0, 0, 0}}, {/*flags*/0, /*resource*/0x02, /*reserved*/{0, 0}, /*count*/{0, 0, 0, 0}}, {/*flags*/0, /*resource*/0xf1, /*reserved*/{0, 0}, /*count*/{0, 0, 0, 0}}, {/*flags*/0, /*resource*/0xf2, /*reserved*/{0, 0}, /*count*/{0, 0, 0, 0}} } }; const static struct ctl_logical_block_provisioning_page lbp_page_changeable = {{ /*page_code*/SMS_INFO_EXCEPTIONS_PAGE | SMPH_SPF, /*subpage_code*/0x02, /*page_length*/{CTL_LBPM_LEN >> 8, CTL_LBPM_LEN}, /*flags*/SLBPP_SITUA, /*reserved*/{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, /*descr*/{}}, {{/*flags*/0, /*resource*/0, /*reserved*/{0, 0}, /*count*/{0, 0, 0, 0}}, {/*flags*/0, /*resource*/0, /*reserved*/{0, 0}, /*count*/{0, 0, 0, 0}}, {/*flags*/0, /*resource*/0, /*reserved*/{0, 0}, /*count*/{0, 0, 0, 0}}, {/*flags*/0, /*resource*/0, /*reserved*/{0, 0}, /*count*/{0, 0, 0, 0}} } }; const static struct scsi_cddvd_capabilities_page cddvd_page_default = { /*page_code*/SMS_CDDVD_CAPS_PAGE, /*page_length*/sizeof(struct scsi_cddvd_capabilities_page) - 2, /*caps1*/0x3f, /*caps2*/0x00, /*caps3*/0xf0, /*caps4*/0x00, /*caps5*/0x29, /*caps6*/0x00, /*obsolete*/{0, 0}, /*nvol_levels*/{0, 0}, /*buffer_size*/{8, 0}, /*obsolete2*/{0, 0}, /*reserved*/0, /*digital*/0, /*obsolete3*/0, /*copy_management*/0, /*reserved2*/0, /*rotation_control*/0, /*cur_write_speed*/0, /*num_speed_descr*/0, }; const static struct scsi_cddvd_capabilities_page cddvd_page_changeable = { /*page_code*/SMS_CDDVD_CAPS_PAGE, /*page_length*/sizeof(struct scsi_cddvd_capabilities_page) - 2, /*caps1*/0, /*caps2*/0, /*caps3*/0, /*caps4*/0, /*caps5*/0, /*caps6*/0, /*obsolete*/{0, 0}, /*nvol_levels*/{0, 0}, /*buffer_size*/{0, 0}, /*obsolete2*/{0, 0}, /*reserved*/0, /*digital*/0, /*obsolete3*/0, /*copy_management*/0, /*reserved2*/0, /*rotation_control*/0, /*cur_write_speed*/0, /*num_speed_descr*/0, }; SYSCTL_NODE(_kern_cam, OID_AUTO, ctl, CTLFLAG_RD, 0, "CAM Target Layer"); static int worker_threads = -1; SYSCTL_INT(_kern_cam_ctl, OID_AUTO, worker_threads, CTLFLAG_RDTUN, &worker_threads, 1, "Number of worker threads"); static int ctl_debug = CTL_DEBUG_NONE; SYSCTL_INT(_kern_cam_ctl, OID_AUTO, debug, CTLFLAG_RWTUN, &ctl_debug, 0, "Enabled debug flags"); static int ctl_lun_map_size = 1024; SYSCTL_INT(_kern_cam_ctl, OID_AUTO, lun_map_size, CTLFLAG_RWTUN, &ctl_lun_map_size, 0, "Size of per-port LUN map (max LUN + 1)"); /* * Supported pages (0x00), Serial number (0x80), Device ID (0x83), * Extended INQUIRY Data (0x86), Mode Page Policy (0x87), * SCSI Ports (0x88), Third-party Copy (0x8F), Block limits (0xB0), * Block Device Characteristics (0xB1) and Logical Block Provisioning (0xB2) */ #define SCSI_EVPD_NUM_SUPPORTED_PAGES 10 static void ctl_isc_event_handler(ctl_ha_channel chanel, ctl_ha_event event, int param); static void ctl_copy_sense_data(union ctl_ha_msg *src, union ctl_io *dest); static void ctl_copy_sense_data_back(union ctl_io *src, union ctl_ha_msg *dest); static int ctl_init(void); static int ctl_shutdown(void); static int ctl_open(struct cdev *dev, int flags, int fmt, struct thread *td); static int ctl_close(struct cdev *dev, int flags, int fmt, struct thread *td); static void ctl_serialize_other_sc_cmd(struct ctl_scsiio *ctsio); static void ctl_ioctl_fill_ooa(struct ctl_lun *lun, uint32_t *cur_fill_num, struct ctl_ooa *ooa_hdr, struct ctl_ooa_entry *kern_entries); static int ctl_ioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flag, struct thread *td); static int ctl_alloc_lun(struct ctl_softc *ctl_softc, struct ctl_lun *lun, struct ctl_be_lun *be_lun); static int ctl_free_lun(struct ctl_lun *lun); static void ctl_create_lun(struct ctl_be_lun *be_lun); static int ctl_do_mode_select(union ctl_io *io); static int ctl_pro_preempt(struct ctl_softc *softc, struct ctl_lun *lun, uint64_t res_key, uint64_t sa_res_key, uint8_t type, uint32_t residx, struct ctl_scsiio *ctsio, struct scsi_per_res_out *cdb, struct scsi_per_res_out_parms* param); static void ctl_pro_preempt_other(struct ctl_lun *lun, union ctl_ha_msg *msg); static void ctl_hndl_per_res_out_on_other_sc(union ctl_io *io); static int ctl_inquiry_evpd_supported(struct ctl_scsiio *ctsio, int alloc_len); static int ctl_inquiry_evpd_serial(struct ctl_scsiio *ctsio, int alloc_len); static int ctl_inquiry_evpd_devid(struct ctl_scsiio *ctsio, int alloc_len); static int ctl_inquiry_evpd_eid(struct ctl_scsiio *ctsio, int alloc_len); static int ctl_inquiry_evpd_mpp(struct ctl_scsiio *ctsio, int alloc_len); static int ctl_inquiry_evpd_scsi_ports(struct ctl_scsiio *ctsio, int alloc_len); static int ctl_inquiry_evpd_block_limits(struct ctl_scsiio *ctsio, int alloc_len); static int ctl_inquiry_evpd_bdc(struct ctl_scsiio *ctsio, int alloc_len); static int ctl_inquiry_evpd_lbp(struct ctl_scsiio *ctsio, int alloc_len); static int ctl_inquiry_evpd(struct ctl_scsiio *ctsio); static int ctl_inquiry_std(struct ctl_scsiio *ctsio); static int ctl_get_lba_len(union ctl_io *io, uint64_t *lba, uint64_t *len); static ctl_action ctl_extent_check(union ctl_io *io1, union ctl_io *io2, bool seq); static ctl_action ctl_extent_check_seq(union ctl_io *io1, union ctl_io *io2); static ctl_action ctl_check_for_blockage(struct ctl_lun *lun, union ctl_io *pending_io, union ctl_io *ooa_io); static ctl_action ctl_check_ooa(struct ctl_lun *lun, union ctl_io *pending_io, union ctl_io *starting_io); static int ctl_check_blocked(struct ctl_lun *lun); static int ctl_scsiio_lun_check(struct ctl_lun *lun, const struct ctl_cmd_entry *entry, struct ctl_scsiio *ctsio); static void ctl_failover_lun(union ctl_io *io); static int ctl_scsiio_precheck(struct ctl_softc *ctl_softc, struct ctl_scsiio *ctsio); static int ctl_scsiio(struct ctl_scsiio *ctsio); static int ctl_target_reset(union ctl_io *io); static void ctl_do_lun_reset(struct ctl_lun *lun, uint32_t initidx, ctl_ua_type ua_type); static int ctl_lun_reset(union ctl_io *io); static int ctl_abort_task(union ctl_io *io); static int ctl_abort_task_set(union ctl_io *io); static int ctl_query_task(union ctl_io *io, int task_set); static void ctl_i_t_nexus_loss(struct ctl_softc *softc, uint32_t initidx, ctl_ua_type ua_type); static int ctl_i_t_nexus_reset(union ctl_io *io); static int ctl_query_async_event(union ctl_io *io); static void ctl_run_task(union ctl_io *io); #ifdef CTL_IO_DELAY static void ctl_datamove_timer_wakeup(void *arg); static void ctl_done_timer_wakeup(void *arg); #endif /* CTL_IO_DELAY */ static void ctl_send_datamove_done(union ctl_io *io, int have_lock); static void ctl_datamove_remote_write_cb(struct ctl_ha_dt_req *rq); static int ctl_datamove_remote_dm_write_cb(union ctl_io *io); static void ctl_datamove_remote_write(union ctl_io *io); static int ctl_datamove_remote_dm_read_cb(union ctl_io *io); static void ctl_datamove_remote_read_cb(struct ctl_ha_dt_req *rq); static int ctl_datamove_remote_sgl_setup(union ctl_io *io); static int ctl_datamove_remote_xfer(union ctl_io *io, unsigned command, ctl_ha_dt_cb callback); static void ctl_datamove_remote_read(union ctl_io *io); static void ctl_datamove_remote(union ctl_io *io); static void ctl_process_done(union ctl_io *io); static void ctl_lun_thread(void *arg); static void ctl_thresh_thread(void *arg); static void ctl_work_thread(void *arg); static void ctl_enqueue_incoming(union ctl_io *io); static void ctl_enqueue_rtr(union ctl_io *io); static void ctl_enqueue_done(union ctl_io *io); static void ctl_enqueue_isc(union ctl_io *io); static const struct ctl_cmd_entry * ctl_get_cmd_entry(struct ctl_scsiio *ctsio, int *sa); static const struct ctl_cmd_entry * ctl_validate_command(struct ctl_scsiio *ctsio); static int ctl_cmd_applicable(uint8_t lun_type, const struct ctl_cmd_entry *entry); static int ctl_ha_init(void); static int ctl_ha_shutdown(void); static uint64_t ctl_get_prkey(struct ctl_lun *lun, uint32_t residx); static void ctl_clr_prkey(struct ctl_lun *lun, uint32_t residx); static void ctl_alloc_prkey(struct ctl_lun *lun, uint32_t residx); static void ctl_set_prkey(struct ctl_lun *lun, uint32_t residx, uint64_t key); /* * Load the serialization table. This isn't very pretty, but is probably * the easiest way to do it. */ #include "ctl_ser_table.c" /* * We only need to define open, close and ioctl routines for this driver. */ static struct cdevsw ctl_cdevsw = { .d_version = D_VERSION, .d_flags = 0, .d_open = ctl_open, .d_close = ctl_close, .d_ioctl = ctl_ioctl, .d_name = "ctl", }; MALLOC_DEFINE(M_CTL, "ctlmem", "Memory used for CTL"); static int ctl_module_event_handler(module_t, int /*modeventtype_t*/, void *); static moduledata_t ctl_moduledata = { "ctl", ctl_module_event_handler, NULL }; DECLARE_MODULE(ctl, ctl_moduledata, SI_SUB_CONFIGURE, SI_ORDER_THIRD); MODULE_VERSION(ctl, 1); static struct ctl_frontend ha_frontend = { .name = "ha", .init = ctl_ha_init, .shutdown = ctl_ha_shutdown, }; static int ctl_ha_init(void) { struct ctl_softc *softc = control_softc; if (ctl_pool_create(softc, "othersc", CTL_POOL_ENTRIES_OTHER_SC, &softc->othersc_pool) != 0) return (ENOMEM); if (ctl_ha_msg_init(softc) != CTL_HA_STATUS_SUCCESS) { ctl_pool_free(softc->othersc_pool); return (EIO); } if (ctl_ha_msg_register(CTL_HA_CHAN_CTL, ctl_isc_event_handler) != CTL_HA_STATUS_SUCCESS) { ctl_ha_msg_destroy(softc); ctl_pool_free(softc->othersc_pool); return (EIO); } return (0); }; static int ctl_ha_shutdown(void) { struct ctl_softc *softc = control_softc; struct ctl_port *port; ctl_ha_msg_shutdown(softc); if (ctl_ha_msg_deregister(CTL_HA_CHAN_CTL) != CTL_HA_STATUS_SUCCESS) return (EIO); if (ctl_ha_msg_destroy(softc) != CTL_HA_STATUS_SUCCESS) return (EIO); ctl_pool_free(softc->othersc_pool); while ((port = STAILQ_FIRST(&ha_frontend.port_list)) != NULL) { ctl_port_deregister(port); free(port->port_name, M_CTL); free(port, M_CTL); } return (0); }; static void ctl_ha_datamove(union ctl_io *io) { struct ctl_lun *lun = CTL_LUN(io); struct ctl_sg_entry *sgl; union ctl_ha_msg msg; uint32_t sg_entries_sent; int do_sg_copy, i, j; memset(&msg.dt, 0, sizeof(msg.dt)); msg.hdr.msg_type = CTL_MSG_DATAMOVE; msg.hdr.original_sc = io->io_hdr.original_sc; msg.hdr.serializing_sc = io; msg.hdr.nexus = io->io_hdr.nexus; msg.hdr.status = io->io_hdr.status; msg.dt.flags = io->io_hdr.flags; /* * We convert everything into a S/G list here. We can't * pass by reference, only by value between controllers. * So we can't pass a pointer to the S/G list, only as many * S/G entries as we can fit in here. If it's possible for * us to get more than CTL_HA_MAX_SG_ENTRIES S/G entries, * then we need to break this up into multiple transfers. */ if (io->scsiio.kern_sg_entries == 0) { msg.dt.kern_sg_entries = 1; #if 0 if (io->io_hdr.flags & CTL_FLAG_BUS_ADDR) { msg.dt.sg_list[0].addr = io->scsiio.kern_data_ptr; } else { /* XXX KDM use busdma here! */ msg.dt.sg_list[0].addr = (void *)vtophys(io->scsiio.kern_data_ptr); } #else KASSERT((io->io_hdr.flags & CTL_FLAG_BUS_ADDR) == 0, ("HA does not support BUS_ADDR")); msg.dt.sg_list[0].addr = io->scsiio.kern_data_ptr; #endif msg.dt.sg_list[0].len = io->scsiio.kern_data_len; do_sg_copy = 0; } else { msg.dt.kern_sg_entries = io->scsiio.kern_sg_entries; do_sg_copy = 1; } msg.dt.kern_data_len = io->scsiio.kern_data_len; msg.dt.kern_total_len = io->scsiio.kern_total_len; msg.dt.kern_data_resid = io->scsiio.kern_data_resid; msg.dt.kern_rel_offset = io->scsiio.kern_rel_offset; msg.dt.sg_sequence = 0; /* * Loop until we've sent all of the S/G entries. On the * other end, we'll recompose these S/G entries into one * contiguous list before processing. */ for (sg_entries_sent = 0; sg_entries_sent < msg.dt.kern_sg_entries; msg.dt.sg_sequence++) { msg.dt.cur_sg_entries = MIN((sizeof(msg.dt.sg_list) / sizeof(msg.dt.sg_list[0])), msg.dt.kern_sg_entries - sg_entries_sent); if (do_sg_copy != 0) { sgl = (struct ctl_sg_entry *)io->scsiio.kern_data_ptr; for (i = sg_entries_sent, j = 0; i < msg.dt.cur_sg_entries; i++, j++) { #if 0 if (io->io_hdr.flags & CTL_FLAG_BUS_ADDR) { msg.dt.sg_list[j].addr = sgl[i].addr; } else { /* XXX KDM use busdma here! */ msg.dt.sg_list[j].addr = (void *)vtophys(sgl[i].addr); } #else KASSERT((io->io_hdr.flags & CTL_FLAG_BUS_ADDR) == 0, ("HA does not support BUS_ADDR")); msg.dt.sg_list[j].addr = sgl[i].addr; #endif msg.dt.sg_list[j].len = sgl[i].len; } } sg_entries_sent += msg.dt.cur_sg_entries; msg.dt.sg_last = (sg_entries_sent >= msg.dt.kern_sg_entries); if (ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg, sizeof(msg.dt) - sizeof(msg.dt.sg_list) + sizeof(struct ctl_sg_entry) * msg.dt.cur_sg_entries, M_WAITOK) > CTL_HA_STATUS_SUCCESS) { io->io_hdr.port_status = 31341; io->scsiio.be_move_done(io); return; } msg.dt.sent_sg_entries = sg_entries_sent; } /* * Officially handover the request from us to peer. * If failover has just happened, then we must return error. * If failover happen just after, then it is not our problem. */ if (lun) mtx_lock(&lun->lun_lock); if (io->io_hdr.flags & CTL_FLAG_FAILOVER) { if (lun) mtx_unlock(&lun->lun_lock); io->io_hdr.port_status = 31342; io->scsiio.be_move_done(io); return; } io->io_hdr.flags &= ~CTL_FLAG_IO_ACTIVE; io->io_hdr.flags |= CTL_FLAG_DMA_INPROG; if (lun) mtx_unlock(&lun->lun_lock); } static void ctl_ha_done(union ctl_io *io) { union ctl_ha_msg msg; if (io->io_hdr.io_type == CTL_IO_SCSI) { memset(&msg, 0, sizeof(msg)); msg.hdr.msg_type = CTL_MSG_FINISH_IO; msg.hdr.original_sc = io->io_hdr.original_sc; msg.hdr.nexus = io->io_hdr.nexus; msg.hdr.status = io->io_hdr.status; msg.scsi.scsi_status = io->scsiio.scsi_status; msg.scsi.tag_num = io->scsiio.tag_num; msg.scsi.tag_type = io->scsiio.tag_type; msg.scsi.sense_len = io->scsiio.sense_len; memcpy(&msg.scsi.sense_data, &io->scsiio.sense_data, io->scsiio.sense_len); ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg, sizeof(msg.scsi) - sizeof(msg.scsi.sense_data) + msg.scsi.sense_len, M_WAITOK); } ctl_free_io(io); } static void ctl_isc_handler_finish_xfer(struct ctl_softc *ctl_softc, union ctl_ha_msg *msg_info) { struct ctl_scsiio *ctsio; if (msg_info->hdr.original_sc == NULL) { printf("%s: original_sc == NULL!\n", __func__); /* XXX KDM now what? */ return; } ctsio = &msg_info->hdr.original_sc->scsiio; ctsio->io_hdr.flags |= CTL_FLAG_IO_ACTIVE; ctsio->io_hdr.msg_type = CTL_MSG_FINISH_IO; ctsio->io_hdr.status = msg_info->hdr.status; ctsio->scsi_status = msg_info->scsi.scsi_status; ctsio->sense_len = msg_info->scsi.sense_len; memcpy(&ctsio->sense_data, &msg_info->scsi.sense_data, msg_info->scsi.sense_len); ctl_enqueue_isc((union ctl_io *)ctsio); } static void ctl_isc_handler_finish_ser_only(struct ctl_softc *ctl_softc, union ctl_ha_msg *msg_info) { struct ctl_scsiio *ctsio; if (msg_info->hdr.serializing_sc == NULL) { printf("%s: serializing_sc == NULL!\n", __func__); /* XXX KDM now what? */ return; } ctsio = &msg_info->hdr.serializing_sc->scsiio; ctsio->io_hdr.msg_type = CTL_MSG_FINISH_IO; ctl_enqueue_isc((union ctl_io *)ctsio); } void ctl_isc_announce_lun(struct ctl_lun *lun) { struct ctl_softc *softc = lun->ctl_softc; union ctl_ha_msg *msg; struct ctl_ha_msg_lun_pr_key pr_key; int i, k; if (softc->ha_link != CTL_HA_LINK_ONLINE) return; mtx_lock(&lun->lun_lock); i = sizeof(msg->lun); if (lun->lun_devid) i += lun->lun_devid->len; i += sizeof(pr_key) * lun->pr_key_count; alloc: mtx_unlock(&lun->lun_lock); msg = malloc(i, M_CTL, M_WAITOK); mtx_lock(&lun->lun_lock); k = sizeof(msg->lun); if (lun->lun_devid) k += lun->lun_devid->len; k += sizeof(pr_key) * lun->pr_key_count; if (i < k) { free(msg, M_CTL); i = k; goto alloc; } bzero(&msg->lun, sizeof(msg->lun)); msg->hdr.msg_type = CTL_MSG_LUN_SYNC; msg->hdr.nexus.targ_lun = lun->lun; msg->hdr.nexus.targ_mapped_lun = lun->lun; msg->lun.flags = lun->flags; msg->lun.pr_generation = lun->pr_generation; msg->lun.pr_res_idx = lun->pr_res_idx; msg->lun.pr_res_type = lun->pr_res_type; msg->lun.pr_key_count = lun->pr_key_count; i = 0; if (lun->lun_devid) { msg->lun.lun_devid_len = lun->lun_devid->len; memcpy(&msg->lun.data[i], lun->lun_devid->data, msg->lun.lun_devid_len); i += msg->lun.lun_devid_len; } for (k = 0; k < CTL_MAX_INITIATORS; k++) { if ((pr_key.pr_key = ctl_get_prkey(lun, k)) == 0) continue; pr_key.pr_iid = k; memcpy(&msg->lun.data[i], &pr_key, sizeof(pr_key)); i += sizeof(pr_key); } mtx_unlock(&lun->lun_lock); ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg->port, sizeof(msg->port) + i, M_WAITOK); free(msg, M_CTL); if (lun->flags & CTL_LUN_PRIMARY_SC) { for (i = 0; i < CTL_NUM_MODE_PAGES; i++) { ctl_isc_announce_mode(lun, -1, lun->mode_pages.index[i].page_code & SMPH_PC_MASK, lun->mode_pages.index[i].subpage); } } } void ctl_isc_announce_port(struct ctl_port *port) { struct ctl_softc *softc = port->ctl_softc; union ctl_ha_msg *msg; int i; if (port->targ_port < softc->port_min || port->targ_port >= softc->port_max || softc->ha_link != CTL_HA_LINK_ONLINE) return; i = sizeof(msg->port) + strlen(port->port_name) + 1; if (port->lun_map) i += port->lun_map_size * sizeof(uint32_t); if (port->port_devid) i += port->port_devid->len; if (port->target_devid) i += port->target_devid->len; if (port->init_devid) i += port->init_devid->len; msg = malloc(i, M_CTL, M_WAITOK); bzero(&msg->port, sizeof(msg->port)); msg->hdr.msg_type = CTL_MSG_PORT_SYNC; msg->hdr.nexus.targ_port = port->targ_port; msg->port.port_type = port->port_type; msg->port.physical_port = port->physical_port; msg->port.virtual_port = port->virtual_port; msg->port.status = port->status; i = 0; msg->port.name_len = sprintf(&msg->port.data[i], "%d:%s", softc->ha_id, port->port_name) + 1; i += msg->port.name_len; if (port->lun_map) { msg->port.lun_map_len = port->lun_map_size * sizeof(uint32_t); memcpy(&msg->port.data[i], port->lun_map, msg->port.lun_map_len); i += msg->port.lun_map_len; } if (port->port_devid) { msg->port.port_devid_len = port->port_devid->len; memcpy(&msg->port.data[i], port->port_devid->data, msg->port.port_devid_len); i += msg->port.port_devid_len; } if (port->target_devid) { msg->port.target_devid_len = port->target_devid->len; memcpy(&msg->port.data[i], port->target_devid->data, msg->port.target_devid_len); i += msg->port.target_devid_len; } if (port->init_devid) { msg->port.init_devid_len = port->init_devid->len; memcpy(&msg->port.data[i], port->init_devid->data, msg->port.init_devid_len); i += msg->port.init_devid_len; } ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg->port, sizeof(msg->port) + i, M_WAITOK); free(msg, M_CTL); } void ctl_isc_announce_iid(struct ctl_port *port, int iid) { struct ctl_softc *softc = port->ctl_softc; union ctl_ha_msg *msg; int i, l; if (port->targ_port < softc->port_min || port->targ_port >= softc->port_max || softc->ha_link != CTL_HA_LINK_ONLINE) return; mtx_lock(&softc->ctl_lock); i = sizeof(msg->iid); l = 0; if (port->wwpn_iid[iid].name) l = strlen(port->wwpn_iid[iid].name) + 1; i += l; msg = malloc(i, M_CTL, M_NOWAIT); if (msg == NULL) { mtx_unlock(&softc->ctl_lock); return; } bzero(&msg->iid, sizeof(msg->iid)); msg->hdr.msg_type = CTL_MSG_IID_SYNC; msg->hdr.nexus.targ_port = port->targ_port; msg->hdr.nexus.initid = iid; msg->iid.in_use = port->wwpn_iid[iid].in_use; msg->iid.name_len = l; msg->iid.wwpn = port->wwpn_iid[iid].wwpn; if (port->wwpn_iid[iid].name) strlcpy(msg->iid.data, port->wwpn_iid[iid].name, l); mtx_unlock(&softc->ctl_lock); ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg->iid, i, M_NOWAIT); free(msg, M_CTL); } void ctl_isc_announce_mode(struct ctl_lun *lun, uint32_t initidx, uint8_t page, uint8_t subpage) { struct ctl_softc *softc = lun->ctl_softc; union ctl_ha_msg msg; u_int i; if (softc->ha_link != CTL_HA_LINK_ONLINE) return; for (i = 0; i < CTL_NUM_MODE_PAGES; i++) { if ((lun->mode_pages.index[i].page_code & SMPH_PC_MASK) == page && lun->mode_pages.index[i].subpage == subpage) break; } if (i == CTL_NUM_MODE_PAGES) return; /* Don't try to replicate pages not present on this device. */ if (lun->mode_pages.index[i].page_data == NULL) return; bzero(&msg.mode, sizeof(msg.mode)); msg.hdr.msg_type = CTL_MSG_MODE_SYNC; msg.hdr.nexus.targ_port = initidx / CTL_MAX_INIT_PER_PORT; msg.hdr.nexus.initid = initidx % CTL_MAX_INIT_PER_PORT; msg.hdr.nexus.targ_lun = lun->lun; msg.hdr.nexus.targ_mapped_lun = lun->lun; msg.mode.page_code = page; msg.mode.subpage = subpage; msg.mode.page_len = lun->mode_pages.index[i].page_len; memcpy(msg.mode.data, lun->mode_pages.index[i].page_data, msg.mode.page_len); ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg.mode, sizeof(msg.mode), M_WAITOK); } static void ctl_isc_ha_link_up(struct ctl_softc *softc) { struct ctl_port *port; struct ctl_lun *lun; union ctl_ha_msg msg; int i; /* Announce this node parameters to peer for validation. */ msg.login.msg_type = CTL_MSG_LOGIN; msg.login.version = CTL_HA_VERSION; msg.login.ha_mode = softc->ha_mode; msg.login.ha_id = softc->ha_id; msg.login.max_luns = CTL_MAX_LUNS; msg.login.max_ports = CTL_MAX_PORTS; msg.login.max_init_per_port = CTL_MAX_INIT_PER_PORT; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg.login, sizeof(msg.login), M_WAITOK); STAILQ_FOREACH(port, &softc->port_list, links) { ctl_isc_announce_port(port); for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) { if (port->wwpn_iid[i].in_use) ctl_isc_announce_iid(port, i); } } STAILQ_FOREACH(lun, &softc->lun_list, links) ctl_isc_announce_lun(lun); } static void ctl_isc_ha_link_down(struct ctl_softc *softc) { struct ctl_port *port; struct ctl_lun *lun; union ctl_io *io; int i; mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(lun, &softc->lun_list, links) { mtx_lock(&lun->lun_lock); if (lun->flags & CTL_LUN_PEER_SC_PRIMARY) { lun->flags &= ~CTL_LUN_PEER_SC_PRIMARY; ctl_est_ua_all(lun, -1, CTL_UA_ASYM_ACC_CHANGE); } mtx_unlock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); io = ctl_alloc_io(softc->othersc_pool); mtx_lock(&softc->ctl_lock); ctl_zero_io(io); io->io_hdr.msg_type = CTL_MSG_FAILOVER; io->io_hdr.nexus.targ_mapped_lun = lun->lun; ctl_enqueue_isc(io); } STAILQ_FOREACH(port, &softc->port_list, links) { if (port->targ_port >= softc->port_min && port->targ_port < softc->port_max) continue; port->status &= ~CTL_PORT_STATUS_ONLINE; for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) { port->wwpn_iid[i].in_use = 0; free(port->wwpn_iid[i].name, M_CTL); port->wwpn_iid[i].name = NULL; } } mtx_unlock(&softc->ctl_lock); } static void ctl_isc_ua(struct ctl_softc *softc, union ctl_ha_msg *msg, int len) { struct ctl_lun *lun; uint32_t iid = ctl_get_initindex(&msg->hdr.nexus); mtx_lock(&softc->ctl_lock); if (msg->hdr.nexus.targ_mapped_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[msg->hdr.nexus.targ_mapped_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); return; } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); if (msg->ua.ua_type == CTL_UA_THIN_PROV_THRES && msg->ua.ua_set) memcpy(lun->ua_tpt_info, msg->ua.ua_info, 8); if (msg->ua.ua_all) { if (msg->ua.ua_set) ctl_est_ua_all(lun, iid, msg->ua.ua_type); else ctl_clr_ua_all(lun, iid, msg->ua.ua_type); } else { if (msg->ua.ua_set) ctl_est_ua(lun, iid, msg->ua.ua_type); else ctl_clr_ua(lun, iid, msg->ua.ua_type); } mtx_unlock(&lun->lun_lock); } static void ctl_isc_lun_sync(struct ctl_softc *softc, union ctl_ha_msg *msg, int len) { struct ctl_lun *lun; struct ctl_ha_msg_lun_pr_key pr_key; int i, k; ctl_lun_flags oflags; uint32_t targ_lun; targ_lun = msg->hdr.nexus.targ_mapped_lun; mtx_lock(&softc->ctl_lock); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); return; } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); if (lun->flags & CTL_LUN_DISABLED) { mtx_unlock(&lun->lun_lock); return; } i = (lun->lun_devid != NULL) ? lun->lun_devid->len : 0; if (msg->lun.lun_devid_len != i || (i > 0 && memcmp(&msg->lun.data[0], lun->lun_devid->data, i) != 0)) { mtx_unlock(&lun->lun_lock); printf("%s: Received conflicting HA LUN %d\n", __func__, targ_lun); return; } else { /* Record whether peer is primary. */ oflags = lun->flags; if ((msg->lun.flags & CTL_LUN_PRIMARY_SC) && (msg->lun.flags & CTL_LUN_DISABLED) == 0) lun->flags |= CTL_LUN_PEER_SC_PRIMARY; else lun->flags &= ~CTL_LUN_PEER_SC_PRIMARY; if (oflags != lun->flags) ctl_est_ua_all(lun, -1, CTL_UA_ASYM_ACC_CHANGE); /* If peer is primary and we are not -- use data */ if ((lun->flags & CTL_LUN_PRIMARY_SC) == 0 && (lun->flags & CTL_LUN_PEER_SC_PRIMARY)) { lun->pr_generation = msg->lun.pr_generation; lun->pr_res_idx = msg->lun.pr_res_idx; lun->pr_res_type = msg->lun.pr_res_type; lun->pr_key_count = msg->lun.pr_key_count; for (k = 0; k < CTL_MAX_INITIATORS; k++) ctl_clr_prkey(lun, k); for (k = 0; k < msg->lun.pr_key_count; k++) { memcpy(&pr_key, &msg->lun.data[i], sizeof(pr_key)); ctl_alloc_prkey(lun, pr_key.pr_iid); ctl_set_prkey(lun, pr_key.pr_iid, pr_key.pr_key); i += sizeof(pr_key); } } mtx_unlock(&lun->lun_lock); CTL_DEBUG_PRINT(("%s: Known LUN %d, peer is %s\n", __func__, targ_lun, (msg->lun.flags & CTL_LUN_PRIMARY_SC) ? "primary" : "secondary")); /* If we are primary but peer doesn't know -- notify */ if ((lun->flags & CTL_LUN_PRIMARY_SC) && (msg->lun.flags & CTL_LUN_PEER_SC_PRIMARY) == 0) ctl_isc_announce_lun(lun); } } static void ctl_isc_port_sync(struct ctl_softc *softc, union ctl_ha_msg *msg, int len) { struct ctl_port *port; struct ctl_lun *lun; int i, new; port = softc->ctl_ports[msg->hdr.nexus.targ_port]; if (port == NULL) { CTL_DEBUG_PRINT(("%s: New port %d\n", __func__, msg->hdr.nexus.targ_port)); new = 1; port = malloc(sizeof(*port), M_CTL, M_WAITOK | M_ZERO); port->frontend = &ha_frontend; port->targ_port = msg->hdr.nexus.targ_port; port->fe_datamove = ctl_ha_datamove; port->fe_done = ctl_ha_done; } else if (port->frontend == &ha_frontend) { CTL_DEBUG_PRINT(("%s: Updated port %d\n", __func__, msg->hdr.nexus.targ_port)); new = 0; } else { printf("%s: Received conflicting HA port %d\n", __func__, msg->hdr.nexus.targ_port); return; } port->port_type = msg->port.port_type; port->physical_port = msg->port.physical_port; port->virtual_port = msg->port.virtual_port; port->status = msg->port.status; i = 0; free(port->port_name, M_CTL); port->port_name = strndup(&msg->port.data[i], msg->port.name_len, M_CTL); i += msg->port.name_len; if (msg->port.lun_map_len != 0) { if (port->lun_map == NULL || port->lun_map_size * sizeof(uint32_t) < msg->port.lun_map_len) { port->lun_map_size = 0; free(port->lun_map, M_CTL); port->lun_map = malloc(msg->port.lun_map_len, M_CTL, M_WAITOK); } memcpy(port->lun_map, &msg->port.data[i], msg->port.lun_map_len); port->lun_map_size = msg->port.lun_map_len / sizeof(uint32_t); i += msg->port.lun_map_len; } else { port->lun_map_size = 0; free(port->lun_map, M_CTL); port->lun_map = NULL; } if (msg->port.port_devid_len != 0) { if (port->port_devid == NULL || port->port_devid->len < msg->port.port_devid_len) { free(port->port_devid, M_CTL); port->port_devid = malloc(sizeof(struct ctl_devid) + msg->port.port_devid_len, M_CTL, M_WAITOK); } memcpy(port->port_devid->data, &msg->port.data[i], msg->port.port_devid_len); port->port_devid->len = msg->port.port_devid_len; i += msg->port.port_devid_len; } else { free(port->port_devid, M_CTL); port->port_devid = NULL; } if (msg->port.target_devid_len != 0) { if (port->target_devid == NULL || port->target_devid->len < msg->port.target_devid_len) { free(port->target_devid, M_CTL); port->target_devid = malloc(sizeof(struct ctl_devid) + msg->port.target_devid_len, M_CTL, M_WAITOK); } memcpy(port->target_devid->data, &msg->port.data[i], msg->port.target_devid_len); port->target_devid->len = msg->port.target_devid_len; i += msg->port.target_devid_len; } else { free(port->target_devid, M_CTL); port->target_devid = NULL; } if (msg->port.init_devid_len != 0) { if (port->init_devid == NULL || port->init_devid->len < msg->port.init_devid_len) { free(port->init_devid, M_CTL); port->init_devid = malloc(sizeof(struct ctl_devid) + msg->port.init_devid_len, M_CTL, M_WAITOK); } memcpy(port->init_devid->data, &msg->port.data[i], msg->port.init_devid_len); port->init_devid->len = msg->port.init_devid_len; i += msg->port.init_devid_len; } else { free(port->init_devid, M_CTL); port->init_devid = NULL; } if (new) { if (ctl_port_register(port) != 0) { printf("%s: ctl_port_register() failed with error\n", __func__); } } mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(lun, &softc->lun_list, links) { if (ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX) continue; mtx_lock(&lun->lun_lock); ctl_est_ua_all(lun, -1, CTL_UA_INQ_CHANGE); mtx_unlock(&lun->lun_lock); } mtx_unlock(&softc->ctl_lock); } static void ctl_isc_iid_sync(struct ctl_softc *softc, union ctl_ha_msg *msg, int len) { struct ctl_port *port; int iid; port = softc->ctl_ports[msg->hdr.nexus.targ_port]; if (port == NULL) { printf("%s: Received IID for unknown port %d\n", __func__, msg->hdr.nexus.targ_port); return; } iid = msg->hdr.nexus.initid; if (port->wwpn_iid[iid].in_use != 0 && msg->iid.in_use == 0) ctl_i_t_nexus_loss(softc, iid, CTL_UA_POWERON); port->wwpn_iid[iid].in_use = msg->iid.in_use; port->wwpn_iid[iid].wwpn = msg->iid.wwpn; free(port->wwpn_iid[iid].name, M_CTL); if (msg->iid.name_len) { port->wwpn_iid[iid].name = strndup(&msg->iid.data[0], msg->iid.name_len, M_CTL); } else port->wwpn_iid[iid].name = NULL; } static void ctl_isc_login(struct ctl_softc *softc, union ctl_ha_msg *msg, int len) { if (msg->login.version != CTL_HA_VERSION) { printf("CTL HA peers have different versions %d != %d\n", msg->login.version, CTL_HA_VERSION); ctl_ha_msg_abort(CTL_HA_CHAN_CTL); return; } if (msg->login.ha_mode != softc->ha_mode) { printf("CTL HA peers have different ha_mode %d != %d\n", msg->login.ha_mode, softc->ha_mode); ctl_ha_msg_abort(CTL_HA_CHAN_CTL); return; } if (msg->login.ha_id == softc->ha_id) { printf("CTL HA peers have same ha_id %d\n", msg->login.ha_id); ctl_ha_msg_abort(CTL_HA_CHAN_CTL); return; } if (msg->login.max_luns != CTL_MAX_LUNS || msg->login.max_ports != CTL_MAX_PORTS || msg->login.max_init_per_port != CTL_MAX_INIT_PER_PORT) { printf("CTL HA peers have different limits\n"); ctl_ha_msg_abort(CTL_HA_CHAN_CTL); return; } } static void ctl_isc_mode_sync(struct ctl_softc *softc, union ctl_ha_msg *msg, int len) { struct ctl_lun *lun; u_int i; uint32_t initidx, targ_lun; targ_lun = msg->hdr.nexus.targ_mapped_lun; mtx_lock(&softc->ctl_lock); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); return; } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); if (lun->flags & CTL_LUN_DISABLED) { mtx_unlock(&lun->lun_lock); return; } for (i = 0; i < CTL_NUM_MODE_PAGES; i++) { if ((lun->mode_pages.index[i].page_code & SMPH_PC_MASK) == msg->mode.page_code && lun->mode_pages.index[i].subpage == msg->mode.subpage) break; } if (i == CTL_NUM_MODE_PAGES) { mtx_unlock(&lun->lun_lock); return; } memcpy(lun->mode_pages.index[i].page_data, msg->mode.data, lun->mode_pages.index[i].page_len); initidx = ctl_get_initindex(&msg->hdr.nexus); if (initidx != -1) ctl_est_ua_all(lun, initidx, CTL_UA_MODE_CHANGE); mtx_unlock(&lun->lun_lock); } /* * ISC (Inter Shelf Communication) event handler. Events from the HA * subsystem come in here. */ static void ctl_isc_event_handler(ctl_ha_channel channel, ctl_ha_event event, int param) { struct ctl_softc *softc = control_softc; union ctl_io *io; struct ctl_prio *presio; ctl_ha_status isc_status; CTL_DEBUG_PRINT(("CTL: Isc Msg event %d\n", event)); if (event == CTL_HA_EVT_MSG_RECV) { union ctl_ha_msg *msg, msgbuf; if (param > sizeof(msgbuf)) msg = malloc(param, M_CTL, M_WAITOK); else msg = &msgbuf; isc_status = ctl_ha_msg_recv(CTL_HA_CHAN_CTL, msg, param, M_WAITOK); if (isc_status != CTL_HA_STATUS_SUCCESS) { printf("%s: Error receiving message: %d\n", __func__, isc_status); if (msg != &msgbuf) free(msg, M_CTL); return; } CTL_DEBUG_PRINT(("CTL: msg_type %d\n", msg->msg_type)); switch (msg->hdr.msg_type) { case CTL_MSG_SERIALIZE: io = ctl_alloc_io(softc->othersc_pool); ctl_zero_io(io); // populate ctsio from msg io->io_hdr.io_type = CTL_IO_SCSI; io->io_hdr.msg_type = CTL_MSG_SERIALIZE; io->io_hdr.original_sc = msg->hdr.original_sc; io->io_hdr.flags |= CTL_FLAG_FROM_OTHER_SC | CTL_FLAG_IO_ACTIVE; /* * If we're in serialization-only mode, we don't * want to go through full done processing. Thus * the COPY flag. * * XXX KDM add another flag that is more specific. */ if (softc->ha_mode != CTL_HA_MODE_XFER) io->io_hdr.flags |= CTL_FLAG_INT_COPY; io->io_hdr.nexus = msg->hdr.nexus; #if 0 printf("port %u, iid %u, lun %u\n", io->io_hdr.nexus.targ_port, io->io_hdr.nexus.initid, io->io_hdr.nexus.targ_lun); #endif io->scsiio.tag_num = msg->scsi.tag_num; io->scsiio.tag_type = msg->scsi.tag_type; #ifdef CTL_TIME_IO io->io_hdr.start_time = time_uptime; getbinuptime(&io->io_hdr.start_bt); #endif /* CTL_TIME_IO */ io->scsiio.cdb_len = msg->scsi.cdb_len; memcpy(io->scsiio.cdb, msg->scsi.cdb, CTL_MAX_CDBLEN); if (softc->ha_mode == CTL_HA_MODE_XFER) { const struct ctl_cmd_entry *entry; entry = ctl_get_cmd_entry(&io->scsiio, NULL); io->io_hdr.flags &= ~CTL_FLAG_DATA_MASK; io->io_hdr.flags |= entry->flags & CTL_FLAG_DATA_MASK; } ctl_enqueue_isc(io); break; /* Performed on the Originating SC, XFER mode only */ case CTL_MSG_DATAMOVE: { struct ctl_sg_entry *sgl; int i, j; io = msg->hdr.original_sc; if (io == NULL) { printf("%s: original_sc == NULL!\n", __func__); /* XXX KDM do something here */ break; } io->io_hdr.msg_type = CTL_MSG_DATAMOVE; io->io_hdr.flags |= CTL_FLAG_IO_ACTIVE; /* * Keep track of this, we need to send it back over * when the datamove is complete. */ io->io_hdr.serializing_sc = msg->hdr.serializing_sc; if (msg->hdr.status == CTL_SUCCESS) io->io_hdr.status = msg->hdr.status; if (msg->dt.sg_sequence == 0) { #ifdef CTL_TIME_IO getbinuptime(&io->io_hdr.dma_start_bt); #endif i = msg->dt.kern_sg_entries + msg->dt.kern_data_len / CTL_HA_DATAMOVE_SEGMENT + 1; sgl = malloc(sizeof(*sgl) * i, M_CTL, M_WAITOK | M_ZERO); io->io_hdr.remote_sglist = sgl; io->io_hdr.local_sglist = &sgl[msg->dt.kern_sg_entries]; io->scsiio.kern_data_ptr = (uint8_t *)sgl; io->scsiio.kern_sg_entries = msg->dt.kern_sg_entries; io->scsiio.rem_sg_entries = msg->dt.kern_sg_entries; io->scsiio.kern_data_len = msg->dt.kern_data_len; io->scsiio.kern_total_len = msg->dt.kern_total_len; io->scsiio.kern_data_resid = msg->dt.kern_data_resid; io->scsiio.kern_rel_offset = msg->dt.kern_rel_offset; io->io_hdr.flags &= ~CTL_FLAG_BUS_ADDR; io->io_hdr.flags |= msg->dt.flags & CTL_FLAG_BUS_ADDR; } else sgl = (struct ctl_sg_entry *) io->scsiio.kern_data_ptr; for (i = msg->dt.sent_sg_entries, j = 0; i < (msg->dt.sent_sg_entries + msg->dt.cur_sg_entries); i++, j++) { sgl[i].addr = msg->dt.sg_list[j].addr; sgl[i].len = msg->dt.sg_list[j].len; #if 0 printf("%s: DATAMOVE: %p,%lu j=%d, i=%d\n", __func__, sgl[i].addr, sgl[i].len, j, i); #endif } /* * If this is the last piece of the I/O, we've got * the full S/G list. Queue processing in the thread. * Otherwise wait for the next piece. */ if (msg->dt.sg_last != 0) ctl_enqueue_isc(io); break; } /* Performed on the Serializing (primary) SC, XFER mode only */ case CTL_MSG_DATAMOVE_DONE: { if (msg->hdr.serializing_sc == NULL) { printf("%s: serializing_sc == NULL!\n", __func__); /* XXX KDM now what? */ break; } /* * We grab the sense information here in case * there was a failure, so we can return status * back to the initiator. */ io = msg->hdr.serializing_sc; io->io_hdr.msg_type = CTL_MSG_DATAMOVE_DONE; io->io_hdr.flags &= ~CTL_FLAG_DMA_INPROG; io->io_hdr.flags |= CTL_FLAG_IO_ACTIVE; io->io_hdr.port_status = msg->scsi.port_status; io->scsiio.kern_data_resid = msg->scsi.kern_data_resid; if (msg->hdr.status != CTL_STATUS_NONE) { io->io_hdr.status = msg->hdr.status; io->scsiio.scsi_status = msg->scsi.scsi_status; io->scsiio.sense_len = msg->scsi.sense_len; memcpy(&io->scsiio.sense_data, &msg->scsi.sense_data, msg->scsi.sense_len); if (msg->hdr.status == CTL_SUCCESS) io->io_hdr.flags |= CTL_FLAG_STATUS_SENT; } ctl_enqueue_isc(io); break; } /* Preformed on Originating SC, SER_ONLY mode */ case CTL_MSG_R2R: io = msg->hdr.original_sc; if (io == NULL) { printf("%s: original_sc == NULL!\n", __func__); break; } io->io_hdr.flags |= CTL_FLAG_IO_ACTIVE; io->io_hdr.msg_type = CTL_MSG_R2R; io->io_hdr.serializing_sc = msg->hdr.serializing_sc; ctl_enqueue_isc(io); break; /* * Performed on Serializing(i.e. primary SC) SC in SER_ONLY * mode. * Performed on the Originating (i.e. secondary) SC in XFER * mode */ case CTL_MSG_FINISH_IO: if (softc->ha_mode == CTL_HA_MODE_XFER) ctl_isc_handler_finish_xfer(softc, msg); else ctl_isc_handler_finish_ser_only(softc, msg); break; /* Preformed on Originating SC */ case CTL_MSG_BAD_JUJU: io = msg->hdr.original_sc; if (io == NULL) { printf("%s: Bad JUJU!, original_sc is NULL!\n", __func__); break; } ctl_copy_sense_data(msg, io); /* * IO should have already been cleaned up on other * SC so clear this flag so we won't send a message * back to finish the IO there. */ io->io_hdr.flags &= ~CTL_FLAG_SENT_2OTHER_SC; io->io_hdr.flags |= CTL_FLAG_IO_ACTIVE; /* io = msg->hdr.serializing_sc; */ io->io_hdr.msg_type = CTL_MSG_BAD_JUJU; ctl_enqueue_isc(io); break; /* Handle resets sent from the other side */ case CTL_MSG_MANAGE_TASKS: { struct ctl_taskio *taskio; taskio = (struct ctl_taskio *)ctl_alloc_io( softc->othersc_pool); ctl_zero_io((union ctl_io *)taskio); taskio->io_hdr.io_type = CTL_IO_TASK; taskio->io_hdr.flags |= CTL_FLAG_FROM_OTHER_SC; taskio->io_hdr.nexus = msg->hdr.nexus; taskio->task_action = msg->task.task_action; taskio->tag_num = msg->task.tag_num; taskio->tag_type = msg->task.tag_type; #ifdef CTL_TIME_IO taskio->io_hdr.start_time = time_uptime; getbinuptime(&taskio->io_hdr.start_bt); #endif /* CTL_TIME_IO */ ctl_run_task((union ctl_io *)taskio); break; } /* Persistent Reserve action which needs attention */ case CTL_MSG_PERS_ACTION: presio = (struct ctl_prio *)ctl_alloc_io( softc->othersc_pool); ctl_zero_io((union ctl_io *)presio); presio->io_hdr.msg_type = CTL_MSG_PERS_ACTION; presio->io_hdr.flags |= CTL_FLAG_FROM_OTHER_SC; presio->io_hdr.nexus = msg->hdr.nexus; presio->pr_msg = msg->pr; ctl_enqueue_isc((union ctl_io *)presio); break; case CTL_MSG_UA: ctl_isc_ua(softc, msg, param); break; case CTL_MSG_PORT_SYNC: ctl_isc_port_sync(softc, msg, param); break; case CTL_MSG_LUN_SYNC: ctl_isc_lun_sync(softc, msg, param); break; case CTL_MSG_IID_SYNC: ctl_isc_iid_sync(softc, msg, param); break; case CTL_MSG_LOGIN: ctl_isc_login(softc, msg, param); break; case CTL_MSG_MODE_SYNC: ctl_isc_mode_sync(softc, msg, param); break; default: printf("Received HA message of unknown type %d\n", msg->hdr.msg_type); ctl_ha_msg_abort(CTL_HA_CHAN_CTL); break; } if (msg != &msgbuf) free(msg, M_CTL); } else if (event == CTL_HA_EVT_LINK_CHANGE) { printf("CTL: HA link status changed from %d to %d\n", softc->ha_link, param); if (param == softc->ha_link) return; if (softc->ha_link == CTL_HA_LINK_ONLINE) { softc->ha_link = param; ctl_isc_ha_link_down(softc); } else { softc->ha_link = param; if (softc->ha_link == CTL_HA_LINK_ONLINE) ctl_isc_ha_link_up(softc); } return; } else { printf("ctl_isc_event_handler: Unknown event %d\n", event); return; } } static void ctl_copy_sense_data(union ctl_ha_msg *src, union ctl_io *dest) { memcpy(&dest->scsiio.sense_data, &src->scsi.sense_data, src->scsi.sense_len); dest->scsiio.scsi_status = src->scsi.scsi_status; dest->scsiio.sense_len = src->scsi.sense_len; dest->io_hdr.status = src->hdr.status; } static void ctl_copy_sense_data_back(union ctl_io *src, union ctl_ha_msg *dest) { memcpy(&dest->scsi.sense_data, &src->scsiio.sense_data, src->scsiio.sense_len); dest->scsi.scsi_status = src->scsiio.scsi_status; dest->scsi.sense_len = src->scsiio.sense_len; dest->hdr.status = src->io_hdr.status; } void ctl_est_ua(struct ctl_lun *lun, uint32_t initidx, ctl_ua_type ua) { struct ctl_softc *softc = lun->ctl_softc; ctl_ua_type *pu; if (initidx < softc->init_min || initidx >= softc->init_max) return; mtx_assert(&lun->lun_lock, MA_OWNED); pu = lun->pending_ua[initidx / CTL_MAX_INIT_PER_PORT]; if (pu == NULL) return; pu[initidx % CTL_MAX_INIT_PER_PORT] |= ua; } void ctl_est_ua_port(struct ctl_lun *lun, int port, uint32_t except, ctl_ua_type ua) { int i; mtx_assert(&lun->lun_lock, MA_OWNED); if (lun->pending_ua[port] == NULL) return; for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) { if (port * CTL_MAX_INIT_PER_PORT + i == except) continue; lun->pending_ua[port][i] |= ua; } } void ctl_est_ua_all(struct ctl_lun *lun, uint32_t except, ctl_ua_type ua) { struct ctl_softc *softc = lun->ctl_softc; int i; mtx_assert(&lun->lun_lock, MA_OWNED); for (i = softc->port_min; i < softc->port_max; i++) ctl_est_ua_port(lun, i, except, ua); } void ctl_clr_ua(struct ctl_lun *lun, uint32_t initidx, ctl_ua_type ua) { struct ctl_softc *softc = lun->ctl_softc; ctl_ua_type *pu; if (initidx < softc->init_min || initidx >= softc->init_max) return; mtx_assert(&lun->lun_lock, MA_OWNED); pu = lun->pending_ua[initidx / CTL_MAX_INIT_PER_PORT]; if (pu == NULL) return; pu[initidx % CTL_MAX_INIT_PER_PORT] &= ~ua; } void ctl_clr_ua_all(struct ctl_lun *lun, uint32_t except, ctl_ua_type ua) { struct ctl_softc *softc = lun->ctl_softc; int i, j; mtx_assert(&lun->lun_lock, MA_OWNED); for (i = softc->port_min; i < softc->port_max; i++) { if (lun->pending_ua[i] == NULL) continue; for (j = 0; j < CTL_MAX_INIT_PER_PORT; j++) { if (i * CTL_MAX_INIT_PER_PORT + j == except) continue; lun->pending_ua[i][j] &= ~ua; } } } void ctl_clr_ua_allluns(struct ctl_softc *ctl_softc, uint32_t initidx, ctl_ua_type ua_type) { struct ctl_lun *lun; mtx_assert(&ctl_softc->ctl_lock, MA_OWNED); STAILQ_FOREACH(lun, &ctl_softc->lun_list, links) { mtx_lock(&lun->lun_lock); ctl_clr_ua(lun, initidx, ua_type); mtx_unlock(&lun->lun_lock); } } static int ctl_ha_role_sysctl(SYSCTL_HANDLER_ARGS) { struct ctl_softc *softc = (struct ctl_softc *)arg1; struct ctl_lun *lun; struct ctl_lun_req ireq; int error, value; value = (softc->flags & CTL_FLAG_ACTIVE_SHELF) ? 0 : 1; error = sysctl_handle_int(oidp, &value, 0, req); if ((error != 0) || (req->newptr == NULL)) return (error); mtx_lock(&softc->ctl_lock); if (value == 0) softc->flags |= CTL_FLAG_ACTIVE_SHELF; else softc->flags &= ~CTL_FLAG_ACTIVE_SHELF; STAILQ_FOREACH(lun, &softc->lun_list, links) { mtx_unlock(&softc->ctl_lock); bzero(&ireq, sizeof(ireq)); ireq.reqtype = CTL_LUNREQ_MODIFY; ireq.reqdata.modify.lun_id = lun->lun; lun->backend->ioctl(NULL, CTL_LUN_REQ, (caddr_t)&ireq, 0, curthread); if (ireq.status != CTL_LUN_OK) { printf("%s: CTL_LUNREQ_MODIFY returned %d '%s'\n", __func__, ireq.status, ireq.error_str); } mtx_lock(&softc->ctl_lock); } mtx_unlock(&softc->ctl_lock); return (0); } static int ctl_init(void) { struct make_dev_args args; struct ctl_softc *softc; int i, error; softc = control_softc = malloc(sizeof(*control_softc), M_DEVBUF, M_WAITOK | M_ZERO); make_dev_args_init(&args); args.mda_devsw = &ctl_cdevsw; args.mda_uid = UID_ROOT; args.mda_gid = GID_OPERATOR; args.mda_mode = 0600; args.mda_si_drv1 = softc; error = make_dev_s(&args, &softc->dev, "cam/ctl"); if (error != 0) { free(softc, M_DEVBUF); control_softc = NULL; return (error); } sysctl_ctx_init(&softc->sysctl_ctx); softc->sysctl_tree = SYSCTL_ADD_NODE(&softc->sysctl_ctx, SYSCTL_STATIC_CHILDREN(_kern_cam), OID_AUTO, "ctl", CTLFLAG_RD, 0, "CAM Target Layer"); if (softc->sysctl_tree == NULL) { printf("%s: unable to allocate sysctl tree\n", __func__); destroy_dev(softc->dev); free(softc, M_DEVBUF); control_softc = NULL; return (ENOMEM); } mtx_init(&softc->ctl_lock, "CTL mutex", NULL, MTX_DEF); softc->io_zone = uma_zcreate("CTL IO", sizeof(union ctl_io), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); softc->flags = 0; SYSCTL_ADD_INT(&softc->sysctl_ctx, SYSCTL_CHILDREN(softc->sysctl_tree), OID_AUTO, "ha_mode", CTLFLAG_RDTUN, (int *)&softc->ha_mode, 0, "HA mode (0 - act/stby, 1 - serialize only, 2 - xfer)"); /* * In Copan's HA scheme, the "master" and "slave" roles are * figured out through the slot the controller is in. Although it * is an active/active system, someone has to be in charge. */ SYSCTL_ADD_INT(&softc->sysctl_ctx, SYSCTL_CHILDREN(softc->sysctl_tree), OID_AUTO, "ha_id", CTLFLAG_RDTUN, &softc->ha_id, 0, "HA head ID (0 - no HA)"); if (softc->ha_id == 0 || softc->ha_id > NUM_HA_SHELVES) { softc->flags |= CTL_FLAG_ACTIVE_SHELF; softc->is_single = 1; softc->port_cnt = CTL_MAX_PORTS; softc->port_min = 0; } else { softc->port_cnt = CTL_MAX_PORTS / NUM_HA_SHELVES; softc->port_min = (softc->ha_id - 1) * softc->port_cnt; } softc->port_max = softc->port_min + softc->port_cnt; softc->init_min = softc->port_min * CTL_MAX_INIT_PER_PORT; softc->init_max = softc->port_max * CTL_MAX_INIT_PER_PORT; SYSCTL_ADD_INT(&softc->sysctl_ctx, SYSCTL_CHILDREN(softc->sysctl_tree), OID_AUTO, "ha_link", CTLFLAG_RD, (int *)&softc->ha_link, 0, "HA link state (0 - offline, 1 - unknown, 2 - online)"); STAILQ_INIT(&softc->lun_list); STAILQ_INIT(&softc->pending_lun_queue); STAILQ_INIT(&softc->fe_list); STAILQ_INIT(&softc->port_list); STAILQ_INIT(&softc->be_list); ctl_tpc_init(softc); if (worker_threads <= 0) worker_threads = max(1, mp_ncpus / 4); if (worker_threads > CTL_MAX_THREADS) worker_threads = CTL_MAX_THREADS; for (i = 0; i < worker_threads; i++) { struct ctl_thread *thr = &softc->threads[i]; mtx_init(&thr->queue_lock, "CTL queue mutex", NULL, MTX_DEF); thr->ctl_softc = softc; STAILQ_INIT(&thr->incoming_queue); STAILQ_INIT(&thr->rtr_queue); STAILQ_INIT(&thr->done_queue); STAILQ_INIT(&thr->isc_queue); error = kproc_kthread_add(ctl_work_thread, thr, &softc->ctl_proc, &thr->thread, 0, 0, "ctl", "work%d", i); if (error != 0) { printf("error creating CTL work thread!\n"); return (error); } } error = kproc_kthread_add(ctl_lun_thread, softc, &softc->ctl_proc, &softc->lun_thread, 0, 0, "ctl", "lun"); if (error != 0) { printf("error creating CTL lun thread!\n"); return (error); } error = kproc_kthread_add(ctl_thresh_thread, softc, &softc->ctl_proc, &softc->thresh_thread, 0, 0, "ctl", "thresh"); if (error != 0) { printf("error creating CTL threshold thread!\n"); return (error); } SYSCTL_ADD_PROC(&softc->sysctl_ctx,SYSCTL_CHILDREN(softc->sysctl_tree), OID_AUTO, "ha_role", CTLTYPE_INT | CTLFLAG_RWTUN, softc, 0, ctl_ha_role_sysctl, "I", "HA role for this head"); if (softc->is_single == 0) { if (ctl_frontend_register(&ha_frontend) != 0) softc->is_single = 1; } return (0); } static int ctl_shutdown(void) { struct ctl_softc *softc = control_softc; int i; if (softc->is_single == 0) ctl_frontend_deregister(&ha_frontend); destroy_dev(softc->dev); /* Shutdown CTL threads. */ softc->shutdown = 1; for (i = 0; i < worker_threads; i++) { struct ctl_thread *thr = &softc->threads[i]; while (thr->thread != NULL) { wakeup(thr); if (thr->thread != NULL) pause("CTL thr shutdown", 1); } mtx_destroy(&thr->queue_lock); } while (softc->lun_thread != NULL) { wakeup(&softc->pending_lun_queue); if (softc->lun_thread != NULL) pause("CTL thr shutdown", 1); } while (softc->thresh_thread != NULL) { wakeup(softc->thresh_thread); if (softc->thresh_thread != NULL) pause("CTL thr shutdown", 1); } ctl_tpc_shutdown(softc); uma_zdestroy(softc->io_zone); mtx_destroy(&softc->ctl_lock); sysctl_ctx_free(&softc->sysctl_ctx); free(softc, M_DEVBUF); control_softc = NULL; return (0); } static int ctl_module_event_handler(module_t mod, int what, void *arg) { switch (what) { case MOD_LOAD: return (ctl_init()); case MOD_UNLOAD: return (ctl_shutdown()); default: return (EOPNOTSUPP); } } /* * XXX KDM should we do some access checks here? Bump a reference count to * prevent a CTL module from being unloaded while someone has it open? */ static int ctl_open(struct cdev *dev, int flags, int fmt, struct thread *td) { return (0); } static int ctl_close(struct cdev *dev, int flags, int fmt, struct thread *td) { return (0); } /* * Remove an initiator by port number and initiator ID. * Returns 0 for success, -1 for failure. */ int ctl_remove_initiator(struct ctl_port *port, int iid) { struct ctl_softc *softc = port->ctl_softc; int last; mtx_assert(&softc->ctl_lock, MA_NOTOWNED); if (iid > CTL_MAX_INIT_PER_PORT) { printf("%s: initiator ID %u > maximun %u!\n", __func__, iid, CTL_MAX_INIT_PER_PORT); return (-1); } mtx_lock(&softc->ctl_lock); last = (--port->wwpn_iid[iid].in_use == 0); port->wwpn_iid[iid].last_use = time_uptime; mtx_unlock(&softc->ctl_lock); if (last) ctl_i_t_nexus_loss(softc, iid, CTL_UA_POWERON); ctl_isc_announce_iid(port, iid); return (0); } /* * Add an initiator to the initiator map. * Returns iid for success, < 0 for failure. */ int ctl_add_initiator(struct ctl_port *port, int iid, uint64_t wwpn, char *name) { struct ctl_softc *softc = port->ctl_softc; time_t best_time; int i, best; mtx_assert(&softc->ctl_lock, MA_NOTOWNED); if (iid >= CTL_MAX_INIT_PER_PORT) { printf("%s: WWPN %#jx initiator ID %u > maximum %u!\n", __func__, wwpn, iid, CTL_MAX_INIT_PER_PORT); free(name, M_CTL); return (-1); } mtx_lock(&softc->ctl_lock); if (iid < 0 && (wwpn != 0 || name != NULL)) { for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) { if (wwpn != 0 && wwpn == port->wwpn_iid[i].wwpn) { iid = i; break; } if (name != NULL && port->wwpn_iid[i].name != NULL && strcmp(name, port->wwpn_iid[i].name) == 0) { iid = i; break; } } } if (iid < 0) { for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) { if (port->wwpn_iid[i].in_use == 0 && port->wwpn_iid[i].wwpn == 0 && port->wwpn_iid[i].name == NULL) { iid = i; break; } } } if (iid < 0) { best = -1; best_time = INT32_MAX; for (i = 0; i < CTL_MAX_INIT_PER_PORT; i++) { if (port->wwpn_iid[i].in_use == 0) { if (port->wwpn_iid[i].last_use < best_time) { best = i; best_time = port->wwpn_iid[i].last_use; } } } iid = best; } if (iid < 0) { mtx_unlock(&softc->ctl_lock); free(name, M_CTL); return (-2); } if (port->wwpn_iid[iid].in_use > 0 && (wwpn != 0 || name != NULL)) { /* * This is not an error yet. */ if (wwpn != 0 && wwpn == port->wwpn_iid[iid].wwpn) { #if 0 printf("%s: port %d iid %u WWPN %#jx arrived" " again\n", __func__, port->targ_port, iid, (uintmax_t)wwpn); #endif goto take; } if (name != NULL && port->wwpn_iid[iid].name != NULL && strcmp(name, port->wwpn_iid[iid].name) == 0) { #if 0 printf("%s: port %d iid %u name '%s' arrived" " again\n", __func__, port->targ_port, iid, name); #endif goto take; } /* * This is an error, but what do we do about it? The * driver is telling us we have a new WWPN for this * initiator ID, so we pretty much need to use it. */ printf("%s: port %d iid %u WWPN %#jx '%s' arrived," " but WWPN %#jx '%s' is still at that address\n", __func__, port->targ_port, iid, wwpn, name, (uintmax_t)port->wwpn_iid[iid].wwpn, port->wwpn_iid[iid].name); } take: free(port->wwpn_iid[iid].name, M_CTL); port->wwpn_iid[iid].name = name; port->wwpn_iid[iid].wwpn = wwpn; port->wwpn_iid[iid].in_use++; mtx_unlock(&softc->ctl_lock); ctl_isc_announce_iid(port, iid); return (iid); } static int ctl_create_iid(struct ctl_port *port, int iid, uint8_t *buf) { int len; switch (port->port_type) { case CTL_PORT_FC: { struct scsi_transportid_fcp *id = (struct scsi_transportid_fcp *)buf; if (port->wwpn_iid[iid].wwpn == 0) return (0); memset(id, 0, sizeof(*id)); id->format_protocol = SCSI_PROTO_FC; scsi_u64to8b(port->wwpn_iid[iid].wwpn, id->n_port_name); return (sizeof(*id)); } case CTL_PORT_ISCSI: { struct scsi_transportid_iscsi_port *id = (struct scsi_transportid_iscsi_port *)buf; if (port->wwpn_iid[iid].name == NULL) return (0); memset(id, 0, 256); id->format_protocol = SCSI_TRN_ISCSI_FORMAT_PORT | SCSI_PROTO_ISCSI; len = strlcpy(id->iscsi_name, port->wwpn_iid[iid].name, 252) + 1; len = roundup2(min(len, 252), 4); scsi_ulto2b(len, id->additional_length); return (sizeof(*id) + len); } case CTL_PORT_SAS: { struct scsi_transportid_sas *id = (struct scsi_transportid_sas *)buf; if (port->wwpn_iid[iid].wwpn == 0) return (0); memset(id, 0, sizeof(*id)); id->format_protocol = SCSI_PROTO_SAS; scsi_u64to8b(port->wwpn_iid[iid].wwpn, id->sas_address); return (sizeof(*id)); } default: { struct scsi_transportid_spi *id = (struct scsi_transportid_spi *)buf; memset(id, 0, sizeof(*id)); id->format_protocol = SCSI_PROTO_SPI; scsi_ulto2b(iid, id->scsi_addr); scsi_ulto2b(port->targ_port, id->rel_trgt_port_id); return (sizeof(*id)); } } } /* * Serialize a command that went down the "wrong" side, and so was sent to * this controller for execution. The logic is a little different than the * standard case in ctl_scsiio_precheck(). Errors in this case need to get * sent back to the other side, but in the success case, we execute the * command on this side (XFER mode) or tell the other side to execute it * (SER_ONLY mode). */ static void ctl_serialize_other_sc_cmd(struct ctl_scsiio *ctsio) { struct ctl_softc *softc = CTL_SOFTC(ctsio); struct ctl_port *port = CTL_PORT(ctsio); union ctl_ha_msg msg_info; struct ctl_lun *lun; const struct ctl_cmd_entry *entry; uint32_t targ_lun; targ_lun = ctsio->io_hdr.nexus.targ_mapped_lun; /* Make sure that we know about this port. */ if (port == NULL || (port->status & CTL_PORT_STATUS_ONLINE) == 0) { ctl_set_internal_failure(ctsio, /*sks_valid*/ 0, /*retry_count*/ 1); goto badjuju; } /* Make sure that we know about this LUN. */ mtx_lock(&softc->ctl_lock); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); /* * The other node would not send this request to us unless * received announce that we are primary node for this LUN. * If this LUN does not exist now, it is probably result of * a race, so respond to initiator in the most opaque way. */ ctl_set_busy(ctsio); goto badjuju; } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); /* * If the LUN is invalid, pretend that it doesn't exist. * It will go away as soon as all pending I/Os completed. */ if (lun->flags & CTL_LUN_DISABLED) { mtx_unlock(&lun->lun_lock); ctl_set_busy(ctsio); goto badjuju; } entry = ctl_get_cmd_entry(ctsio, NULL); if (ctl_scsiio_lun_check(lun, entry, ctsio) != 0) { mtx_unlock(&lun->lun_lock); goto badjuju; } CTL_LUN(ctsio) = lun; CTL_BACKEND_LUN(ctsio) = lun->be_lun; /* * Every I/O goes into the OOA queue for a * particular LUN, and stays there until completion. */ #ifdef CTL_TIME_IO if (TAILQ_EMPTY(&lun->ooa_queue)) lun->idle_time += getsbinuptime() - lun->last_busy; #endif TAILQ_INSERT_TAIL(&lun->ooa_queue, &ctsio->io_hdr, ooa_links); switch (ctl_check_ooa(lun, (union ctl_io *)ctsio, (union ctl_io *)TAILQ_PREV(&ctsio->io_hdr, ctl_ooaq, ooa_links))) { case CTL_ACTION_BLOCK: ctsio->io_hdr.flags |= CTL_FLAG_BLOCKED; TAILQ_INSERT_TAIL(&lun->blocked_queue, &ctsio->io_hdr, blocked_links); mtx_unlock(&lun->lun_lock); break; case CTL_ACTION_PASS: case CTL_ACTION_SKIP: if (softc->ha_mode == CTL_HA_MODE_XFER) { ctsio->io_hdr.flags |= CTL_FLAG_IS_WAS_ON_RTR; ctl_enqueue_rtr((union ctl_io *)ctsio); mtx_unlock(&lun->lun_lock); } else { ctsio->io_hdr.flags &= ~CTL_FLAG_IO_ACTIVE; mtx_unlock(&lun->lun_lock); /* send msg back to other side */ msg_info.hdr.original_sc = ctsio->io_hdr.original_sc; msg_info.hdr.serializing_sc = (union ctl_io *)ctsio; msg_info.hdr.msg_type = CTL_MSG_R2R; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info, sizeof(msg_info.hdr), M_WAITOK); } break; case CTL_ACTION_OVERLAP: TAILQ_REMOVE(&lun->ooa_queue, &ctsio->io_hdr, ooa_links); mtx_unlock(&lun->lun_lock); ctl_set_overlapped_cmd(ctsio); goto badjuju; case CTL_ACTION_OVERLAP_TAG: TAILQ_REMOVE(&lun->ooa_queue, &ctsio->io_hdr, ooa_links); mtx_unlock(&lun->lun_lock); ctl_set_overlapped_tag(ctsio, ctsio->tag_num); goto badjuju; case CTL_ACTION_ERROR: default: TAILQ_REMOVE(&lun->ooa_queue, &ctsio->io_hdr, ooa_links); mtx_unlock(&lun->lun_lock); ctl_set_internal_failure(ctsio, /*sks_valid*/ 0, /*retry_count*/ 0); badjuju: ctl_copy_sense_data_back((union ctl_io *)ctsio, &msg_info); msg_info.hdr.original_sc = ctsio->io_hdr.original_sc; msg_info.hdr.serializing_sc = NULL; msg_info.hdr.msg_type = CTL_MSG_BAD_JUJU; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info, sizeof(msg_info.scsi), M_WAITOK); ctl_free_io((union ctl_io *)ctsio); break; } } /* * Returns 0 for success, errno for failure. */ static void ctl_ioctl_fill_ooa(struct ctl_lun *lun, uint32_t *cur_fill_num, struct ctl_ooa *ooa_hdr, struct ctl_ooa_entry *kern_entries) { union ctl_io *io; mtx_lock(&lun->lun_lock); for (io = (union ctl_io *)TAILQ_FIRST(&lun->ooa_queue); (io != NULL); (*cur_fill_num)++, io = (union ctl_io *)TAILQ_NEXT(&io->io_hdr, ooa_links)) { struct ctl_ooa_entry *entry; /* * If we've got more than we can fit, just count the * remaining entries. */ if (*cur_fill_num >= ooa_hdr->alloc_num) continue; entry = &kern_entries[*cur_fill_num]; entry->tag_num = io->scsiio.tag_num; entry->lun_num = lun->lun; #ifdef CTL_TIME_IO entry->start_bt = io->io_hdr.start_bt; #endif bcopy(io->scsiio.cdb, entry->cdb, io->scsiio.cdb_len); entry->cdb_len = io->scsiio.cdb_len; if (io->io_hdr.flags & CTL_FLAG_BLOCKED) entry->cmd_flags |= CTL_OOACMD_FLAG_BLOCKED; if (io->io_hdr.flags & CTL_FLAG_DMA_INPROG) entry->cmd_flags |= CTL_OOACMD_FLAG_DMA; if (io->io_hdr.flags & CTL_FLAG_ABORT) entry->cmd_flags |= CTL_OOACMD_FLAG_ABORT; if (io->io_hdr.flags & CTL_FLAG_IS_WAS_ON_RTR) entry->cmd_flags |= CTL_OOACMD_FLAG_RTR; if (io->io_hdr.flags & CTL_FLAG_DMA_QUEUED) entry->cmd_flags |= CTL_OOACMD_FLAG_DMA_QUEUED; } mtx_unlock(&lun->lun_lock); } static void * ctl_copyin_alloc(void *user_addr, unsigned int len, char *error_str, size_t error_str_len) { void *kptr; kptr = malloc(len, M_CTL, M_WAITOK | M_ZERO); if (copyin(user_addr, kptr, len) != 0) { snprintf(error_str, error_str_len, "Error copying %d bytes " "from user address %p to kernel address %p", len, user_addr, kptr); free(kptr, M_CTL); return (NULL); } return (kptr); } static void ctl_free_args(int num_args, struct ctl_be_arg *args) { int i; if (args == NULL) return; for (i = 0; i < num_args; i++) { free(args[i].kname, M_CTL); free(args[i].kvalue, M_CTL); } free(args, M_CTL); } static struct ctl_be_arg * ctl_copyin_args(int num_args, struct ctl_be_arg *uargs, char *error_str, size_t error_str_len) { struct ctl_be_arg *args; int i; args = ctl_copyin_alloc(uargs, num_args * sizeof(*args), error_str, error_str_len); if (args == NULL) goto bailout; for (i = 0; i < num_args; i++) { args[i].kname = NULL; args[i].kvalue = NULL; } for (i = 0; i < num_args; i++) { uint8_t *tmpptr; if (args[i].namelen == 0) { snprintf(error_str, error_str_len, "Argument %d " "name length is zero", i); goto bailout; } args[i].kname = ctl_copyin_alloc(args[i].name, args[i].namelen, error_str, error_str_len); if (args[i].kname == NULL) goto bailout; if (args[i].kname[args[i].namelen - 1] != '\0') { snprintf(error_str, error_str_len, "Argument %d " "name is not NUL-terminated", i); goto bailout; } if (args[i].flags & CTL_BEARG_RD) { if (args[i].vallen == 0) { snprintf(error_str, error_str_len, "Argument %d " "value length is zero", i); goto bailout; } tmpptr = ctl_copyin_alloc(args[i].value, args[i].vallen, error_str, error_str_len); if (tmpptr == NULL) goto bailout; if ((args[i].flags & CTL_BEARG_ASCII) && (tmpptr[args[i].vallen - 1] != '\0')) { snprintf(error_str, error_str_len, "Argument " "%d value is not NUL-terminated", i); free(tmpptr, M_CTL); goto bailout; } args[i].kvalue = tmpptr; } else { args[i].kvalue = malloc(args[i].vallen, M_CTL, M_WAITOK | M_ZERO); } } return (args); bailout: ctl_free_args(num_args, args); return (NULL); } static void ctl_copyout_args(int num_args, struct ctl_be_arg *args) { int i; for (i = 0; i < num_args; i++) { if (args[i].flags & CTL_BEARG_WR) copyout(args[i].kvalue, args[i].value, args[i].vallen); } } /* * Escape characters that are illegal or not recommended in XML. */ int ctl_sbuf_printf_esc(struct sbuf *sb, char *str, int size) { char *end = str + size; int retval; retval = 0; for (; *str && str < end; str++) { switch (*str) { case '&': retval = sbuf_printf(sb, "&"); break; case '>': retval = sbuf_printf(sb, ">"); break; case '<': retval = sbuf_printf(sb, "<"); break; default: retval = sbuf_putc(sb, *str); break; } if (retval != 0) break; } return (retval); } static void ctl_id_sbuf(struct ctl_devid *id, struct sbuf *sb) { struct scsi_vpd_id_descriptor *desc; int i; if (id == NULL || id->len < 4) return; desc = (struct scsi_vpd_id_descriptor *)id->data; switch (desc->id_type & SVPD_ID_TYPE_MASK) { case SVPD_ID_TYPE_T10: sbuf_printf(sb, "t10."); break; case SVPD_ID_TYPE_EUI64: sbuf_printf(sb, "eui."); break; case SVPD_ID_TYPE_NAA: sbuf_printf(sb, "naa."); break; case SVPD_ID_TYPE_SCSI_NAME: break; } switch (desc->proto_codeset & SVPD_ID_CODESET_MASK) { case SVPD_ID_CODESET_BINARY: for (i = 0; i < desc->length; i++) sbuf_printf(sb, "%02x", desc->identifier[i]); break; case SVPD_ID_CODESET_ASCII: sbuf_printf(sb, "%.*s", (int)desc->length, (char *)desc->identifier); break; case SVPD_ID_CODESET_UTF8: sbuf_printf(sb, "%s", (char *)desc->identifier); break; } } static int ctl_ioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flag, struct thread *td) { struct ctl_softc *softc = dev->si_drv1; struct ctl_port *port; struct ctl_lun *lun; int retval; retval = 0; switch (cmd) { case CTL_IO: retval = ctl_ioctl_io(dev, cmd, addr, flag, td); break; case CTL_ENABLE_PORT: case CTL_DISABLE_PORT: case CTL_SET_PORT_WWNS: { struct ctl_port *port; struct ctl_port_entry *entry; entry = (struct ctl_port_entry *)addr; mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(port, &softc->port_list, links) { int action, done; if (port->targ_port < softc->port_min || port->targ_port >= softc->port_max) continue; action = 0; done = 0; if ((entry->port_type == CTL_PORT_NONE) && (entry->targ_port == port->targ_port)) { /* * If the user only wants to enable or * disable or set WWNs on a specific port, * do the operation and we're done. */ action = 1; done = 1; } else if (entry->port_type & port->port_type) { /* * Compare the user's type mask with the * particular frontend type to see if we * have a match. */ action = 1; done = 0; /* * Make sure the user isn't trying to set * WWNs on multiple ports at the same time. */ if (cmd == CTL_SET_PORT_WWNS) { printf("%s: Can't set WWNs on " "multiple ports\n", __func__); retval = EINVAL; break; } } if (action == 0) continue; /* * XXX KDM we have to drop the lock here, because * the online/offline operations can potentially * block. We need to reference count the frontends * so they can't go away, */ if (cmd == CTL_ENABLE_PORT) { mtx_unlock(&softc->ctl_lock); ctl_port_online(port); mtx_lock(&softc->ctl_lock); } else if (cmd == CTL_DISABLE_PORT) { mtx_unlock(&softc->ctl_lock); ctl_port_offline(port); mtx_lock(&softc->ctl_lock); } else if (cmd == CTL_SET_PORT_WWNS) { ctl_port_set_wwns(port, (entry->flags & CTL_PORT_WWNN_VALID) ? 1 : 0, entry->wwnn, (entry->flags & CTL_PORT_WWPN_VALID) ? 1 : 0, entry->wwpn); } if (done != 0) break; } mtx_unlock(&softc->ctl_lock); break; } case CTL_GET_OOA: { struct ctl_ooa *ooa_hdr; struct ctl_ooa_entry *entries; uint32_t cur_fill_num; ooa_hdr = (struct ctl_ooa *)addr; if ((ooa_hdr->alloc_len == 0) || (ooa_hdr->alloc_num == 0)) { printf("%s: CTL_GET_OOA: alloc len %u and alloc num %u " "must be non-zero\n", __func__, ooa_hdr->alloc_len, ooa_hdr->alloc_num); retval = EINVAL; break; } if (ooa_hdr->alloc_len != (ooa_hdr->alloc_num * sizeof(struct ctl_ooa_entry))) { printf("%s: CTL_GET_OOA: alloc len %u must be alloc " "num %d * sizeof(struct ctl_ooa_entry) %zd\n", __func__, ooa_hdr->alloc_len, ooa_hdr->alloc_num,sizeof(struct ctl_ooa_entry)); retval = EINVAL; break; } entries = malloc(ooa_hdr->alloc_len, M_CTL, M_WAITOK | M_ZERO); if (entries == NULL) { printf("%s: could not allocate %d bytes for OOA " "dump\n", __func__, ooa_hdr->alloc_len); retval = ENOMEM; break; } mtx_lock(&softc->ctl_lock); if ((ooa_hdr->flags & CTL_OOA_FLAG_ALL_LUNS) == 0 && (ooa_hdr->lun_num >= CTL_MAX_LUNS || softc->ctl_luns[ooa_hdr->lun_num] == NULL)) { mtx_unlock(&softc->ctl_lock); free(entries, M_CTL); printf("%s: CTL_GET_OOA: invalid LUN %ju\n", __func__, (uintmax_t)ooa_hdr->lun_num); retval = EINVAL; break; } cur_fill_num = 0; if (ooa_hdr->flags & CTL_OOA_FLAG_ALL_LUNS) { STAILQ_FOREACH(lun, &softc->lun_list, links) { ctl_ioctl_fill_ooa(lun, &cur_fill_num, ooa_hdr, entries); } } else { lun = softc->ctl_luns[ooa_hdr->lun_num]; ctl_ioctl_fill_ooa(lun, &cur_fill_num, ooa_hdr, entries); } mtx_unlock(&softc->ctl_lock); ooa_hdr->fill_num = min(cur_fill_num, ooa_hdr->alloc_num); ooa_hdr->fill_len = ooa_hdr->fill_num * sizeof(struct ctl_ooa_entry); retval = copyout(entries, ooa_hdr->entries, ooa_hdr->fill_len); if (retval != 0) { printf("%s: error copying out %d bytes for OOA dump\n", __func__, ooa_hdr->fill_len); } getbinuptime(&ooa_hdr->cur_bt); if (cur_fill_num > ooa_hdr->alloc_num) { ooa_hdr->dropped_num = cur_fill_num -ooa_hdr->alloc_num; ooa_hdr->status = CTL_OOA_NEED_MORE_SPACE; } else { ooa_hdr->dropped_num = 0; ooa_hdr->status = CTL_OOA_OK; } free(entries, M_CTL); break; } case CTL_DELAY_IO: { struct ctl_io_delay_info *delay_info; delay_info = (struct ctl_io_delay_info *)addr; #ifdef CTL_IO_DELAY mtx_lock(&softc->ctl_lock); if (delay_info->lun_id >= CTL_MAX_LUNS || (lun = softc->ctl_luns[delay_info->lun_id]) == NULL) { mtx_unlock(&softc->ctl_lock); delay_info->status = CTL_DELAY_STATUS_INVALID_LUN; break; } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); delay_info->status = CTL_DELAY_STATUS_OK; switch (delay_info->delay_type) { case CTL_DELAY_TYPE_CONT: case CTL_DELAY_TYPE_ONESHOT: break; default: delay_info->status = CTL_DELAY_STATUS_INVALID_TYPE; break; } switch (delay_info->delay_loc) { case CTL_DELAY_LOC_DATAMOVE: lun->delay_info.datamove_type = delay_info->delay_type; lun->delay_info.datamove_delay = delay_info->delay_secs; break; case CTL_DELAY_LOC_DONE: lun->delay_info.done_type = delay_info->delay_type; lun->delay_info.done_delay = delay_info->delay_secs; break; default: delay_info->status = CTL_DELAY_STATUS_INVALID_LOC; break; } mtx_unlock(&lun->lun_lock); #else delay_info->status = CTL_DELAY_STATUS_NOT_IMPLEMENTED; #endif /* CTL_IO_DELAY */ break; } #ifdef CTL_LEGACY_STATS case CTL_GETSTATS: { struct ctl_stats *stats = (struct ctl_stats *)addr; int i; /* * XXX KDM no locking here. If the LUN list changes, * things can blow up. */ i = 0; stats->status = CTL_SS_OK; stats->fill_len = 0; STAILQ_FOREACH(lun, &softc->lun_list, links) { if (stats->fill_len + sizeof(lun->legacy_stats) > stats->alloc_len) { stats->status = CTL_SS_NEED_MORE_SPACE; break; } retval = copyout(&lun->legacy_stats, &stats->lun_stats[i++], sizeof(lun->legacy_stats)); if (retval != 0) break; stats->fill_len += sizeof(lun->legacy_stats); } stats->num_luns = softc->num_luns; stats->flags = CTL_STATS_FLAG_NONE; #ifdef CTL_TIME_IO stats->flags |= CTL_STATS_FLAG_TIME_VALID; #endif getnanouptime(&stats->timestamp); break; } #endif /* CTL_LEGACY_STATS */ case CTL_ERROR_INJECT: { struct ctl_error_desc *err_desc, *new_err_desc; err_desc = (struct ctl_error_desc *)addr; new_err_desc = malloc(sizeof(*new_err_desc), M_CTL, M_WAITOK | M_ZERO); bcopy(err_desc, new_err_desc, sizeof(*new_err_desc)); mtx_lock(&softc->ctl_lock); if (err_desc->lun_id >= CTL_MAX_LUNS || (lun = softc->ctl_luns[err_desc->lun_id]) == NULL) { mtx_unlock(&softc->ctl_lock); free(new_err_desc, M_CTL); printf("%s: CTL_ERROR_INJECT: invalid LUN %ju\n", __func__, (uintmax_t)err_desc->lun_id); retval = EINVAL; break; } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); /* * We could do some checking here to verify the validity * of the request, but given the complexity of error * injection requests, the checking logic would be fairly * complex. * * For now, if the request is invalid, it just won't get * executed and might get deleted. */ STAILQ_INSERT_TAIL(&lun->error_list, new_err_desc, links); /* * XXX KDM check to make sure the serial number is unique, * in case we somehow manage to wrap. That shouldn't * happen for a very long time, but it's the right thing to * do. */ new_err_desc->serial = lun->error_serial; err_desc->serial = lun->error_serial; lun->error_serial++; mtx_unlock(&lun->lun_lock); break; } case CTL_ERROR_INJECT_DELETE: { struct ctl_error_desc *delete_desc, *desc, *desc2; int delete_done; delete_desc = (struct ctl_error_desc *)addr; delete_done = 0; mtx_lock(&softc->ctl_lock); if (delete_desc->lun_id >= CTL_MAX_LUNS || (lun = softc->ctl_luns[delete_desc->lun_id]) == NULL) { mtx_unlock(&softc->ctl_lock); printf("%s: CTL_ERROR_INJECT_DELETE: invalid LUN %ju\n", __func__, (uintmax_t)delete_desc->lun_id); retval = EINVAL; break; } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); STAILQ_FOREACH_SAFE(desc, &lun->error_list, links, desc2) { if (desc->serial != delete_desc->serial) continue; STAILQ_REMOVE(&lun->error_list, desc, ctl_error_desc, links); free(desc, M_CTL); delete_done = 1; } mtx_unlock(&lun->lun_lock); if (delete_done == 0) { printf("%s: CTL_ERROR_INJECT_DELETE: can't find " "error serial %ju on LUN %u\n", __func__, delete_desc->serial, delete_desc->lun_id); retval = EINVAL; break; } break; } case CTL_DUMP_STRUCTS: { int j, k; struct ctl_port *port; struct ctl_frontend *fe; mtx_lock(&softc->ctl_lock); printf("CTL Persistent Reservation information start:\n"); STAILQ_FOREACH(lun, &softc->lun_list, links) { mtx_lock(&lun->lun_lock); if ((lun->flags & CTL_LUN_DISABLED) != 0) { mtx_unlock(&lun->lun_lock); continue; } for (j = 0; j < CTL_MAX_PORTS; j++) { if (lun->pr_keys[j] == NULL) continue; for (k = 0; k < CTL_MAX_INIT_PER_PORT; k++){ if (lun->pr_keys[j][k] == 0) continue; printf(" LUN %ju port %d iid %d key " "%#jx\n", lun->lun, j, k, (uintmax_t)lun->pr_keys[j][k]); } } mtx_unlock(&lun->lun_lock); } printf("CTL Persistent Reservation information end\n"); printf("CTL Ports:\n"); STAILQ_FOREACH(port, &softc->port_list, links) { printf(" Port %d '%s' Frontend '%s' Type %u pp %d vp %d WWNN " "%#jx WWPN %#jx\n", port->targ_port, port->port_name, port->frontend->name, port->port_type, port->physical_port, port->virtual_port, (uintmax_t)port->wwnn, (uintmax_t)port->wwpn); for (j = 0; j < CTL_MAX_INIT_PER_PORT; j++) { if (port->wwpn_iid[j].in_use == 0 && port->wwpn_iid[j].wwpn == 0 && port->wwpn_iid[j].name == NULL) continue; printf(" iid %u use %d WWPN %#jx '%s'\n", j, port->wwpn_iid[j].in_use, (uintmax_t)port->wwpn_iid[j].wwpn, port->wwpn_iid[j].name); } } printf("CTL Port information end\n"); mtx_unlock(&softc->ctl_lock); /* * XXX KDM calling this without a lock. We'd likely want * to drop the lock before calling the frontend's dump * routine anyway. */ printf("CTL Frontends:\n"); STAILQ_FOREACH(fe, &softc->fe_list, links) { printf(" Frontend '%s'\n", fe->name); if (fe->fe_dump != NULL) fe->fe_dump(); } printf("CTL Frontend information end\n"); break; } case CTL_LUN_REQ: { struct ctl_lun_req *lun_req; struct ctl_backend_driver *backend; lun_req = (struct ctl_lun_req *)addr; backend = ctl_backend_find(lun_req->backend); if (backend == NULL) { lun_req->status = CTL_LUN_ERROR; snprintf(lun_req->error_str, sizeof(lun_req->error_str), "Backend \"%s\" not found.", lun_req->backend); break; } if (lun_req->num_be_args > 0) { lun_req->kern_be_args = ctl_copyin_args( lun_req->num_be_args, lun_req->be_args, lun_req->error_str, sizeof(lun_req->error_str)); if (lun_req->kern_be_args == NULL) { lun_req->status = CTL_LUN_ERROR; break; } } retval = backend->ioctl(dev, cmd, addr, flag, td); if (lun_req->num_be_args > 0) { ctl_copyout_args(lun_req->num_be_args, lun_req->kern_be_args); ctl_free_args(lun_req->num_be_args, lun_req->kern_be_args); } break; } case CTL_LUN_LIST: { struct sbuf *sb; struct ctl_lun_list *list; struct ctl_option *opt; list = (struct ctl_lun_list *)addr; /* * Allocate a fixed length sbuf here, based on the length * of the user's buffer. We could allocate an auto-extending * buffer, and then tell the user how much larger our * amount of data is than his buffer, but that presents * some problems: * * 1. The sbuf(9) routines use a blocking malloc, and so * we can't hold a lock while calling them with an * auto-extending buffer. * * 2. There is not currently a LUN reference counting * mechanism, outside of outstanding transactions on * the LUN's OOA queue. So a LUN could go away on us * while we're getting the LUN number, backend-specific * information, etc. Thus, given the way things * currently work, we need to hold the CTL lock while * grabbing LUN information. * * So, from the user's standpoint, the best thing to do is * allocate what he thinks is a reasonable buffer length, * and then if he gets a CTL_LUN_LIST_NEED_MORE_SPACE error, * double the buffer length and try again. (And repeat * that until he succeeds.) */ sb = sbuf_new(NULL, NULL, list->alloc_len, SBUF_FIXEDLEN); if (sb == NULL) { list->status = CTL_LUN_LIST_ERROR; snprintf(list->error_str, sizeof(list->error_str), "Unable to allocate %d bytes for LUN list", list->alloc_len); break; } sbuf_printf(sb, "\n"); mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(lun, &softc->lun_list, links) { mtx_lock(&lun->lun_lock); retval = sbuf_printf(sb, "\n", (uintmax_t)lun->lun); /* * Bail out as soon as we see that we've overfilled * the buffer. */ if (retval != 0) break; retval = sbuf_printf(sb, "\t%s" "\n", (lun->backend == NULL) ? "none" : lun->backend->name); if (retval != 0) break; retval = sbuf_printf(sb, "\t%d\n", lun->be_lun->lun_type); if (retval != 0) break; if (lun->backend == NULL) { retval = sbuf_printf(sb, "\n"); if (retval != 0) break; continue; } retval = sbuf_printf(sb, "\t%ju\n", (lun->be_lun->maxlba > 0) ? lun->be_lun->maxlba + 1 : 0); if (retval != 0) break; retval = sbuf_printf(sb, "\t%u\n", lun->be_lun->blocksize); if (retval != 0) break; retval = sbuf_printf(sb, "\t"); if (retval != 0) break; retval = ctl_sbuf_printf_esc(sb, lun->be_lun->serial_num, sizeof(lun->be_lun->serial_num)); if (retval != 0) break; retval = sbuf_printf(sb, "\n"); if (retval != 0) break; retval = sbuf_printf(sb, "\t"); if (retval != 0) break; retval = ctl_sbuf_printf_esc(sb, lun->be_lun->device_id, sizeof(lun->be_lun->device_id)); if (retval != 0) break; retval = sbuf_printf(sb, "\n"); if (retval != 0) break; if (lun->backend->lun_info != NULL) { retval = lun->backend->lun_info(lun->be_lun->be_lun, sb); if (retval != 0) break; } STAILQ_FOREACH(opt, &lun->be_lun->options, links) { retval = sbuf_printf(sb, "\t<%s>%s\n", opt->name, opt->value, opt->name); if (retval != 0) break; } retval = sbuf_printf(sb, "\n"); if (retval != 0) break; mtx_unlock(&lun->lun_lock); } if (lun != NULL) mtx_unlock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); if ((retval != 0) || ((retval = sbuf_printf(sb, "\n")) != 0)) { retval = 0; sbuf_delete(sb); list->status = CTL_LUN_LIST_NEED_MORE_SPACE; snprintf(list->error_str, sizeof(list->error_str), "Out of space, %d bytes is too small", list->alloc_len); break; } sbuf_finish(sb); retval = copyout(sbuf_data(sb), list->lun_xml, sbuf_len(sb) + 1); list->fill_len = sbuf_len(sb) + 1; list->status = CTL_LUN_LIST_OK; sbuf_delete(sb); break; } case CTL_ISCSI: { struct ctl_iscsi *ci; struct ctl_frontend *fe; ci = (struct ctl_iscsi *)addr; fe = ctl_frontend_find("iscsi"); if (fe == NULL) { ci->status = CTL_ISCSI_ERROR; snprintf(ci->error_str, sizeof(ci->error_str), "Frontend \"iscsi\" not found."); break; } retval = fe->ioctl(dev, cmd, addr, flag, td); break; } case CTL_PORT_REQ: { struct ctl_req *req; struct ctl_frontend *fe; req = (struct ctl_req *)addr; fe = ctl_frontend_find(req->driver); if (fe == NULL) { req->status = CTL_LUN_ERROR; snprintf(req->error_str, sizeof(req->error_str), "Frontend \"%s\" not found.", req->driver); break; } if (req->num_args > 0) { req->kern_args = ctl_copyin_args(req->num_args, req->args, req->error_str, sizeof(req->error_str)); if (req->kern_args == NULL) { req->status = CTL_LUN_ERROR; break; } } if (fe->ioctl) retval = fe->ioctl(dev, cmd, addr, flag, td); else retval = ENODEV; if (req->num_args > 0) { ctl_copyout_args(req->num_args, req->kern_args); ctl_free_args(req->num_args, req->kern_args); } break; } case CTL_PORT_LIST: { struct sbuf *sb; struct ctl_port *port; struct ctl_lun_list *list; struct ctl_option *opt; int j; uint32_t plun; list = (struct ctl_lun_list *)addr; sb = sbuf_new(NULL, NULL, list->alloc_len, SBUF_FIXEDLEN); if (sb == NULL) { list->status = CTL_LUN_LIST_ERROR; snprintf(list->error_str, sizeof(list->error_str), "Unable to allocate %d bytes for LUN list", list->alloc_len); break; } sbuf_printf(sb, "\n"); mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(port, &softc->port_list, links) { retval = sbuf_printf(sb, "\n", (uintmax_t)port->targ_port); /* * Bail out as soon as we see that we've overfilled * the buffer. */ if (retval != 0) break; retval = sbuf_printf(sb, "\t%s" "\n", port->frontend->name); if (retval != 0) break; retval = sbuf_printf(sb, "\t%d\n", port->port_type); if (retval != 0) break; retval = sbuf_printf(sb, "\t%s\n", (port->status & CTL_PORT_STATUS_ONLINE) ? "YES" : "NO"); if (retval != 0) break; retval = sbuf_printf(sb, "\t%s\n", port->port_name); if (retval != 0) break; retval = sbuf_printf(sb, "\t%d\n", port->physical_port); if (retval != 0) break; retval = sbuf_printf(sb, "\t%d\n", port->virtual_port); if (retval != 0) break; if (port->target_devid != NULL) { sbuf_printf(sb, "\t"); ctl_id_sbuf(port->target_devid, sb); sbuf_printf(sb, "\n"); } if (port->port_devid != NULL) { sbuf_printf(sb, "\t"); ctl_id_sbuf(port->port_devid, sb); sbuf_printf(sb, "\n"); } if (port->port_info != NULL) { retval = port->port_info(port->onoff_arg, sb); if (retval != 0) break; } STAILQ_FOREACH(opt, &port->options, links) { retval = sbuf_printf(sb, "\t<%s>%s\n", opt->name, opt->value, opt->name); if (retval != 0) break; } if (port->lun_map != NULL) { sbuf_printf(sb, "\ton\n"); for (j = 0; j < port->lun_map_size; j++) { plun = ctl_lun_map_from_port(port, j); if (plun == UINT32_MAX) continue; sbuf_printf(sb, "\t%u\n", j, plun); } } for (j = 0; j < CTL_MAX_INIT_PER_PORT; j++) { if (port->wwpn_iid[j].in_use == 0 || (port->wwpn_iid[j].wwpn == 0 && port->wwpn_iid[j].name == NULL)) continue; if (port->wwpn_iid[j].name != NULL) retval = sbuf_printf(sb, "\t%s\n", j, port->wwpn_iid[j].name); else retval = sbuf_printf(sb, "\tnaa.%08jx\n", j, port->wwpn_iid[j].wwpn); if (retval != 0) break; } if (retval != 0) break; retval = sbuf_printf(sb, "\n"); if (retval != 0) break; } mtx_unlock(&softc->ctl_lock); if ((retval != 0) || ((retval = sbuf_printf(sb, "\n")) != 0)) { retval = 0; sbuf_delete(sb); list->status = CTL_LUN_LIST_NEED_MORE_SPACE; snprintf(list->error_str, sizeof(list->error_str), "Out of space, %d bytes is too small", list->alloc_len); break; } sbuf_finish(sb); retval = copyout(sbuf_data(sb), list->lun_xml, sbuf_len(sb) + 1); list->fill_len = sbuf_len(sb) + 1; list->status = CTL_LUN_LIST_OK; sbuf_delete(sb); break; } case CTL_LUN_MAP: { struct ctl_lun_map *lm = (struct ctl_lun_map *)addr; struct ctl_port *port; mtx_lock(&softc->ctl_lock); if (lm->port < softc->port_min || lm->port >= softc->port_max || (port = softc->ctl_ports[lm->port]) == NULL) { mtx_unlock(&softc->ctl_lock); return (ENXIO); } if (port->status & CTL_PORT_STATUS_ONLINE) { STAILQ_FOREACH(lun, &softc->lun_list, links) { if (ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX) continue; mtx_lock(&lun->lun_lock); ctl_est_ua_port(lun, lm->port, -1, CTL_UA_LUN_CHANGE); mtx_unlock(&lun->lun_lock); } } mtx_unlock(&softc->ctl_lock); // XXX: port_enable sleeps if (lm->plun != UINT32_MAX) { if (lm->lun == UINT32_MAX) retval = ctl_lun_map_unset(port, lm->plun); else if (lm->lun < CTL_MAX_LUNS && softc->ctl_luns[lm->lun] != NULL) retval = ctl_lun_map_set(port, lm->plun, lm->lun); else return (ENXIO); } else { if (lm->lun == UINT32_MAX) retval = ctl_lun_map_deinit(port); else retval = ctl_lun_map_init(port); } if (port->status & CTL_PORT_STATUS_ONLINE) ctl_isc_announce_port(port); break; } case CTL_GET_LUN_STATS: { struct ctl_get_io_stats *stats = (struct ctl_get_io_stats *)addr; int i; /* * XXX KDM no locking here. If the LUN list changes, * things can blow up. */ i = 0; stats->status = CTL_SS_OK; stats->fill_len = 0; STAILQ_FOREACH(lun, &softc->lun_list, links) { if (lun->lun < stats->first_item) continue; if (stats->fill_len + sizeof(lun->stats) > stats->alloc_len) { stats->status = CTL_SS_NEED_MORE_SPACE; break; } retval = copyout(&lun->stats, &stats->stats[i++], sizeof(lun->stats)); if (retval != 0) break; stats->fill_len += sizeof(lun->stats); } stats->num_items = softc->num_luns; stats->flags = CTL_STATS_FLAG_NONE; #ifdef CTL_TIME_IO stats->flags |= CTL_STATS_FLAG_TIME_VALID; #endif getnanouptime(&stats->timestamp); break; } case CTL_GET_PORT_STATS: { struct ctl_get_io_stats *stats = (struct ctl_get_io_stats *)addr; int i; /* * XXX KDM no locking here. If the LUN list changes, * things can blow up. */ i = 0; stats->status = CTL_SS_OK; stats->fill_len = 0; STAILQ_FOREACH(port, &softc->port_list, links) { if (port->targ_port < stats->first_item) continue; if (stats->fill_len + sizeof(port->stats) > stats->alloc_len) { stats->status = CTL_SS_NEED_MORE_SPACE; break; } retval = copyout(&port->stats, &stats->stats[i++], sizeof(port->stats)); if (retval != 0) break; stats->fill_len += sizeof(port->stats); } stats->num_items = softc->num_ports; stats->flags = CTL_STATS_FLAG_NONE; #ifdef CTL_TIME_IO stats->flags |= CTL_STATS_FLAG_TIME_VALID; #endif getnanouptime(&stats->timestamp); break; } default: { /* XXX KDM should we fix this? */ #if 0 struct ctl_backend_driver *backend; unsigned int type; int found; found = 0; /* * We encode the backend type as the ioctl type for backend * ioctls. So parse it out here, and then search for a * backend of this type. */ type = _IOC_TYPE(cmd); STAILQ_FOREACH(backend, &softc->be_list, links) { if (backend->type == type) { found = 1; break; } } if (found == 0) { printf("ctl: unknown ioctl command %#lx or backend " "%d\n", cmd, type); retval = EINVAL; break; } retval = backend->ioctl(dev, cmd, addr, flag, td); #endif retval = ENOTTY; break; } } return (retval); } uint32_t ctl_get_initindex(struct ctl_nexus *nexus) { return (nexus->initid + (nexus->targ_port * CTL_MAX_INIT_PER_PORT)); } int ctl_lun_map_init(struct ctl_port *port) { struct ctl_softc *softc = port->ctl_softc; struct ctl_lun *lun; int size = ctl_lun_map_size; uint32_t i; if (port->lun_map == NULL || port->lun_map_size < size) { port->lun_map_size = 0; free(port->lun_map, M_CTL); port->lun_map = malloc(size * sizeof(uint32_t), M_CTL, M_NOWAIT); } if (port->lun_map == NULL) return (ENOMEM); for (i = 0; i < size; i++) port->lun_map[i] = UINT32_MAX; port->lun_map_size = size; if (port->status & CTL_PORT_STATUS_ONLINE) { if (port->lun_disable != NULL) { STAILQ_FOREACH(lun, &softc->lun_list, links) port->lun_disable(port->targ_lun_arg, lun->lun); } ctl_isc_announce_port(port); } return (0); } int ctl_lun_map_deinit(struct ctl_port *port) { struct ctl_softc *softc = port->ctl_softc; struct ctl_lun *lun; if (port->lun_map == NULL) return (0); port->lun_map_size = 0; free(port->lun_map, M_CTL); port->lun_map = NULL; if (port->status & CTL_PORT_STATUS_ONLINE) { if (port->lun_enable != NULL) { STAILQ_FOREACH(lun, &softc->lun_list, links) port->lun_enable(port->targ_lun_arg, lun->lun); } ctl_isc_announce_port(port); } return (0); } int ctl_lun_map_set(struct ctl_port *port, uint32_t plun, uint32_t glun) { int status; uint32_t old; if (port->lun_map == NULL) { status = ctl_lun_map_init(port); if (status != 0) return (status); } if (plun >= port->lun_map_size) return (EINVAL); old = port->lun_map[plun]; port->lun_map[plun] = glun; if ((port->status & CTL_PORT_STATUS_ONLINE) && old == UINT32_MAX) { if (port->lun_enable != NULL) port->lun_enable(port->targ_lun_arg, plun); ctl_isc_announce_port(port); } return (0); } int ctl_lun_map_unset(struct ctl_port *port, uint32_t plun) { uint32_t old; if (port->lun_map == NULL || plun >= port->lun_map_size) return (0); old = port->lun_map[plun]; port->lun_map[plun] = UINT32_MAX; if ((port->status & CTL_PORT_STATUS_ONLINE) && old != UINT32_MAX) { if (port->lun_disable != NULL) port->lun_disable(port->targ_lun_arg, plun); ctl_isc_announce_port(port); } return (0); } uint32_t ctl_lun_map_from_port(struct ctl_port *port, uint32_t lun_id) { if (port == NULL) return (UINT32_MAX); if (port->lun_map == NULL) return (lun_id); if (lun_id > port->lun_map_size) return (UINT32_MAX); return (port->lun_map[lun_id]); } uint32_t ctl_lun_map_to_port(struct ctl_port *port, uint32_t lun_id) { uint32_t i; if (port == NULL) return (UINT32_MAX); if (port->lun_map == NULL) return (lun_id); for (i = 0; i < port->lun_map_size; i++) { if (port->lun_map[i] == lun_id) return (i); } return (UINT32_MAX); } uint32_t ctl_decode_lun(uint64_t encoded) { uint8_t lun[8]; uint32_t result = 0xffffffff; be64enc(lun, encoded); switch (lun[0] & RPL_LUNDATA_ATYP_MASK) { case RPL_LUNDATA_ATYP_PERIPH: if ((lun[0] & 0x3f) == 0 && lun[2] == 0 && lun[3] == 0 && lun[4] == 0 && lun[5] == 0 && lun[6] == 0 && lun[7] == 0) result = lun[1]; break; case RPL_LUNDATA_ATYP_FLAT: if (lun[2] == 0 && lun[3] == 0 && lun[4] == 0 && lun[5] == 0 && lun[6] == 0 && lun[7] == 0) result = ((lun[0] & 0x3f) << 8) + lun[1]; break; case RPL_LUNDATA_ATYP_EXTLUN: switch (lun[0] & RPL_LUNDATA_EXT_EAM_MASK) { case 0x02: switch (lun[0] & RPL_LUNDATA_EXT_LEN_MASK) { case 0x00: result = lun[1]; break; case 0x10: result = (lun[1] << 16) + (lun[2] << 8) + lun[3]; break; case 0x20: if (lun[1] == 0 && lun[6] == 0 && lun[7] == 0) result = (lun[2] << 24) + (lun[3] << 16) + (lun[4] << 8) + lun[5]; break; } break; case RPL_LUNDATA_EXT_EAM_NOT_SPEC: result = 0xffffffff; break; } break; } return (result); } uint64_t ctl_encode_lun(uint32_t decoded) { uint64_t l = decoded; if (l <= 0xff) return (((uint64_t)RPL_LUNDATA_ATYP_PERIPH << 56) | (l << 48)); if (l <= 0x3fff) return (((uint64_t)RPL_LUNDATA_ATYP_FLAT << 56) | (l << 48)); if (l <= 0xffffff) return (((uint64_t)(RPL_LUNDATA_ATYP_EXTLUN | 0x12) << 56) | (l << 32)); return ((((uint64_t)RPL_LUNDATA_ATYP_EXTLUN | 0x22) << 56) | (l << 16)); } int ctl_ffz(uint32_t *mask, uint32_t first, uint32_t last) { int i; for (i = first; i < last; i++) { if ((mask[i / 32] & (1 << (i % 32))) == 0) return (i); } return (-1); } int ctl_set_mask(uint32_t *mask, uint32_t bit) { uint32_t chunk, piece; chunk = bit >> 5; piece = bit % (sizeof(uint32_t) * 8); if ((mask[chunk] & (1 << piece)) != 0) return (-1); else mask[chunk] |= (1 << piece); return (0); } int ctl_clear_mask(uint32_t *mask, uint32_t bit) { uint32_t chunk, piece; chunk = bit >> 5; piece = bit % (sizeof(uint32_t) * 8); if ((mask[chunk] & (1 << piece)) == 0) return (-1); else mask[chunk] &= ~(1 << piece); return (0); } int ctl_is_set(uint32_t *mask, uint32_t bit) { uint32_t chunk, piece; chunk = bit >> 5; piece = bit % (sizeof(uint32_t) * 8); if ((mask[chunk] & (1 << piece)) == 0) return (0); else return (1); } static uint64_t ctl_get_prkey(struct ctl_lun *lun, uint32_t residx) { uint64_t *t; t = lun->pr_keys[residx/CTL_MAX_INIT_PER_PORT]; if (t == NULL) return (0); return (t[residx % CTL_MAX_INIT_PER_PORT]); } static void ctl_clr_prkey(struct ctl_lun *lun, uint32_t residx) { uint64_t *t; t = lun->pr_keys[residx/CTL_MAX_INIT_PER_PORT]; if (t == NULL) return; t[residx % CTL_MAX_INIT_PER_PORT] = 0; } static void ctl_alloc_prkey(struct ctl_lun *lun, uint32_t residx) { uint64_t *p; u_int i; i = residx/CTL_MAX_INIT_PER_PORT; if (lun->pr_keys[i] != NULL) return; mtx_unlock(&lun->lun_lock); p = malloc(sizeof(uint64_t) * CTL_MAX_INIT_PER_PORT, M_CTL, M_WAITOK | M_ZERO); mtx_lock(&lun->lun_lock); if (lun->pr_keys[i] == NULL) lun->pr_keys[i] = p; else free(p, M_CTL); } static void ctl_set_prkey(struct ctl_lun *lun, uint32_t residx, uint64_t key) { uint64_t *t; t = lun->pr_keys[residx/CTL_MAX_INIT_PER_PORT]; KASSERT(t != NULL, ("prkey %d is not allocated", residx)); t[residx % CTL_MAX_INIT_PER_PORT] = key; } /* * ctl_softc, pool_name, total_ctl_io are passed in. * npool is passed out. */ int ctl_pool_create(struct ctl_softc *ctl_softc, const char *pool_name, uint32_t total_ctl_io, void **npool) { struct ctl_io_pool *pool; pool = (struct ctl_io_pool *)malloc(sizeof(*pool), M_CTL, M_NOWAIT | M_ZERO); if (pool == NULL) return (ENOMEM); snprintf(pool->name, sizeof(pool->name), "CTL IO %s", pool_name); pool->ctl_softc = ctl_softc; #ifdef IO_POOLS pool->zone = uma_zsecond_create(pool->name, NULL, NULL, NULL, NULL, ctl_softc->io_zone); /* uma_prealloc(pool->zone, total_ctl_io); */ #else pool->zone = ctl_softc->io_zone; #endif *npool = pool; return (0); } void ctl_pool_free(struct ctl_io_pool *pool) { if (pool == NULL) return; #ifdef IO_POOLS uma_zdestroy(pool->zone); #endif free(pool, M_CTL); } union ctl_io * ctl_alloc_io(void *pool_ref) { struct ctl_io_pool *pool = (struct ctl_io_pool *)pool_ref; union ctl_io *io; io = uma_zalloc(pool->zone, M_WAITOK); if (io != NULL) { io->io_hdr.pool = pool_ref; CTL_SOFTC(io) = pool->ctl_softc; } return (io); } union ctl_io * ctl_alloc_io_nowait(void *pool_ref) { struct ctl_io_pool *pool = (struct ctl_io_pool *)pool_ref; union ctl_io *io; io = uma_zalloc(pool->zone, M_NOWAIT); if (io != NULL) { io->io_hdr.pool = pool_ref; CTL_SOFTC(io) = pool->ctl_softc; } return (io); } void ctl_free_io(union ctl_io *io) { struct ctl_io_pool *pool; if (io == NULL) return; pool = (struct ctl_io_pool *)io->io_hdr.pool; uma_zfree(pool->zone, io); } void ctl_zero_io(union ctl_io *io) { struct ctl_io_pool *pool; if (io == NULL) return; /* * May need to preserve linked list pointers at some point too. */ pool = io->io_hdr.pool; memset(io, 0, sizeof(*io)); io->io_hdr.pool = pool; CTL_SOFTC(io) = pool->ctl_softc; } int ctl_expand_number(const char *buf, uint64_t *num) { char *endptr; uint64_t number; unsigned shift; number = strtoq(buf, &endptr, 0); switch (tolower((unsigned char)*endptr)) { case 'e': shift = 60; break; case 'p': shift = 50; break; case 't': shift = 40; break; case 'g': shift = 30; break; case 'm': shift = 20; break; case 'k': shift = 10; break; case 'b': case '\0': /* No unit. */ *num = number; return (0); default: /* Unrecognized unit. */ return (-1); } if ((number << shift) >> shift != number) { /* Overflow */ return (-1); } *num = number << shift; return (0); } /* * This routine could be used in the future to load default and/or saved * mode page parameters for a particuar lun. */ static int ctl_init_page_index(struct ctl_lun *lun) { int i, page_code; struct ctl_page_index *page_index; const char *value; uint64_t ival; memcpy(&lun->mode_pages.index, page_index_template, sizeof(page_index_template)); for (i = 0; i < CTL_NUM_MODE_PAGES; i++) { page_index = &lun->mode_pages.index[i]; if (lun->be_lun->lun_type == T_DIRECT && (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0) continue; if (lun->be_lun->lun_type == T_PROCESSOR && (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0) continue; if (lun->be_lun->lun_type == T_CDROM && (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0) continue; page_code = page_index->page_code & SMPH_PC_MASK; switch (page_code) { case SMS_RW_ERROR_RECOVERY_PAGE: { KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0, ("subpage %#x for page %#x is incorrect!", page_index->subpage, page_code)); memcpy(&lun->mode_pages.rw_er_page[CTL_PAGE_CURRENT], &rw_er_page_default, sizeof(rw_er_page_default)); memcpy(&lun->mode_pages.rw_er_page[CTL_PAGE_CHANGEABLE], &rw_er_page_changeable, sizeof(rw_er_page_changeable)); memcpy(&lun->mode_pages.rw_er_page[CTL_PAGE_DEFAULT], &rw_er_page_default, sizeof(rw_er_page_default)); memcpy(&lun->mode_pages.rw_er_page[CTL_PAGE_SAVED], &rw_er_page_default, sizeof(rw_er_page_default)); page_index->page_data = (uint8_t *)lun->mode_pages.rw_er_page; break; } case SMS_FORMAT_DEVICE_PAGE: { struct scsi_format_page *format_page; KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0, ("subpage %#x for page %#x is incorrect!", page_index->subpage, page_code)); /* * Sectors per track are set above. Bytes per * sector need to be set here on a per-LUN basis. */ memcpy(&lun->mode_pages.format_page[CTL_PAGE_CURRENT], &format_page_default, sizeof(format_page_default)); memcpy(&lun->mode_pages.format_page[ CTL_PAGE_CHANGEABLE], &format_page_changeable, sizeof(format_page_changeable)); memcpy(&lun->mode_pages.format_page[CTL_PAGE_DEFAULT], &format_page_default, sizeof(format_page_default)); memcpy(&lun->mode_pages.format_page[CTL_PAGE_SAVED], &format_page_default, sizeof(format_page_default)); format_page = &lun->mode_pages.format_page[ CTL_PAGE_CURRENT]; scsi_ulto2b(lun->be_lun->blocksize, format_page->bytes_per_sector); format_page = &lun->mode_pages.format_page[ CTL_PAGE_DEFAULT]; scsi_ulto2b(lun->be_lun->blocksize, format_page->bytes_per_sector); format_page = &lun->mode_pages.format_page[ CTL_PAGE_SAVED]; scsi_ulto2b(lun->be_lun->blocksize, format_page->bytes_per_sector); page_index->page_data = (uint8_t *)lun->mode_pages.format_page; break; } case SMS_RIGID_DISK_PAGE: { struct scsi_rigid_disk_page *rigid_disk_page; uint32_t sectors_per_cylinder; uint64_t cylinders; #ifndef __XSCALE__ int shift; #endif /* !__XSCALE__ */ KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0, ("subpage %#x for page %#x is incorrect!", page_index->subpage, page_code)); /* * Rotation rate and sectors per track are set * above. We calculate the cylinders here based on * capacity. Due to the number of heads and * sectors per track we're using, smaller arrays * may turn out to have 0 cylinders. Linux and * FreeBSD don't pay attention to these mode pages * to figure out capacity, but Solaris does. It * seems to deal with 0 cylinders just fine, and * works out a fake geometry based on the capacity. */ memcpy(&lun->mode_pages.rigid_disk_page[ CTL_PAGE_DEFAULT], &rigid_disk_page_default, sizeof(rigid_disk_page_default)); memcpy(&lun->mode_pages.rigid_disk_page[ CTL_PAGE_CHANGEABLE],&rigid_disk_page_changeable, sizeof(rigid_disk_page_changeable)); sectors_per_cylinder = CTL_DEFAULT_SECTORS_PER_TRACK * CTL_DEFAULT_HEADS; /* * The divide method here will be more accurate, * probably, but results in floating point being * used in the kernel on i386 (__udivdi3()). On the * XScale, though, __udivdi3() is implemented in * software. * * The shift method for cylinder calculation is * accurate if sectors_per_cylinder is a power of * 2. Otherwise it might be slightly off -- you * might have a bit of a truncation problem. */ #ifdef __XSCALE__ cylinders = (lun->be_lun->maxlba + 1) / sectors_per_cylinder; #else for (shift = 31; shift > 0; shift--) { if (sectors_per_cylinder & (1 << shift)) break; } cylinders = (lun->be_lun->maxlba + 1) >> shift; #endif /* * We've basically got 3 bytes, or 24 bits for the * cylinder size in the mode page. If we're over, * just round down to 2^24. */ if (cylinders > 0xffffff) cylinders = 0xffffff; rigid_disk_page = &lun->mode_pages.rigid_disk_page[ CTL_PAGE_DEFAULT]; scsi_ulto3b(cylinders, rigid_disk_page->cylinders); if ((value = ctl_get_opt(&lun->be_lun->options, "rpm")) != NULL) { scsi_ulto2b(strtol(value, NULL, 0), rigid_disk_page->rotation_rate); } memcpy(&lun->mode_pages.rigid_disk_page[CTL_PAGE_CURRENT], &lun->mode_pages.rigid_disk_page[CTL_PAGE_DEFAULT], sizeof(rigid_disk_page_default)); memcpy(&lun->mode_pages.rigid_disk_page[CTL_PAGE_SAVED], &lun->mode_pages.rigid_disk_page[CTL_PAGE_DEFAULT], sizeof(rigid_disk_page_default)); page_index->page_data = (uint8_t *)lun->mode_pages.rigid_disk_page; break; } case SMS_VERIFY_ERROR_RECOVERY_PAGE: { KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0, ("subpage %#x for page %#x is incorrect!", page_index->subpage, page_code)); memcpy(&lun->mode_pages.verify_er_page[CTL_PAGE_CURRENT], &verify_er_page_default, sizeof(verify_er_page_default)); memcpy(&lun->mode_pages.verify_er_page[CTL_PAGE_CHANGEABLE], &verify_er_page_changeable, sizeof(verify_er_page_changeable)); memcpy(&lun->mode_pages.verify_er_page[CTL_PAGE_DEFAULT], &verify_er_page_default, sizeof(verify_er_page_default)); memcpy(&lun->mode_pages.verify_er_page[CTL_PAGE_SAVED], &verify_er_page_default, sizeof(verify_er_page_default)); page_index->page_data = (uint8_t *)lun->mode_pages.verify_er_page; break; } case SMS_CACHING_PAGE: { struct scsi_caching_page *caching_page; KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0, ("subpage %#x for page %#x is incorrect!", page_index->subpage, page_code)); memcpy(&lun->mode_pages.caching_page[CTL_PAGE_DEFAULT], &caching_page_default, sizeof(caching_page_default)); memcpy(&lun->mode_pages.caching_page[ CTL_PAGE_CHANGEABLE], &caching_page_changeable, sizeof(caching_page_changeable)); memcpy(&lun->mode_pages.caching_page[CTL_PAGE_SAVED], &caching_page_default, sizeof(caching_page_default)); caching_page = &lun->mode_pages.caching_page[ CTL_PAGE_SAVED]; value = ctl_get_opt(&lun->be_lun->options, "writecache"); if (value != NULL && strcmp(value, "off") == 0) caching_page->flags1 &= ~SCP_WCE; value = ctl_get_opt(&lun->be_lun->options, "readcache"); if (value != NULL && strcmp(value, "off") == 0) caching_page->flags1 |= SCP_RCD; memcpy(&lun->mode_pages.caching_page[CTL_PAGE_CURRENT], &lun->mode_pages.caching_page[CTL_PAGE_SAVED], sizeof(caching_page_default)); page_index->page_data = (uint8_t *)lun->mode_pages.caching_page; break; } case SMS_CONTROL_MODE_PAGE: { switch (page_index->subpage) { case SMS_SUBPAGE_PAGE_0: { struct scsi_control_page *control_page; memcpy(&lun->mode_pages.control_page[ CTL_PAGE_DEFAULT], &control_page_default, sizeof(control_page_default)); memcpy(&lun->mode_pages.control_page[ CTL_PAGE_CHANGEABLE], &control_page_changeable, sizeof(control_page_changeable)); memcpy(&lun->mode_pages.control_page[ CTL_PAGE_SAVED], &control_page_default, sizeof(control_page_default)); control_page = &lun->mode_pages.control_page[ CTL_PAGE_SAVED]; value = ctl_get_opt(&lun->be_lun->options, "reordering"); if (value != NULL && strcmp(value, "unrestricted") == 0) { control_page->queue_flags &= ~SCP_QUEUE_ALG_MASK; control_page->queue_flags |= SCP_QUEUE_ALG_UNRESTRICTED; } memcpy(&lun->mode_pages.control_page[ CTL_PAGE_CURRENT], &lun->mode_pages.control_page[ CTL_PAGE_SAVED], sizeof(control_page_default)); page_index->page_data = (uint8_t *)lun->mode_pages.control_page; break; } case 0x01: memcpy(&lun->mode_pages.control_ext_page[ CTL_PAGE_DEFAULT], &control_ext_page_default, sizeof(control_ext_page_default)); memcpy(&lun->mode_pages.control_ext_page[ CTL_PAGE_CHANGEABLE], &control_ext_page_changeable, sizeof(control_ext_page_changeable)); memcpy(&lun->mode_pages.control_ext_page[ CTL_PAGE_SAVED], &control_ext_page_default, sizeof(control_ext_page_default)); memcpy(&lun->mode_pages.control_ext_page[ CTL_PAGE_CURRENT], &lun->mode_pages.control_ext_page[ CTL_PAGE_SAVED], sizeof(control_ext_page_default)); page_index->page_data = (uint8_t *)lun->mode_pages.control_ext_page; break; default: panic("subpage %#x for page %#x is incorrect!", page_index->subpage, page_code); } break; } case SMS_INFO_EXCEPTIONS_PAGE: { switch (page_index->subpage) { case SMS_SUBPAGE_PAGE_0: memcpy(&lun->mode_pages.ie_page[CTL_PAGE_CURRENT], &ie_page_default, sizeof(ie_page_default)); memcpy(&lun->mode_pages.ie_page[ CTL_PAGE_CHANGEABLE], &ie_page_changeable, sizeof(ie_page_changeable)); memcpy(&lun->mode_pages.ie_page[CTL_PAGE_DEFAULT], &ie_page_default, sizeof(ie_page_default)); memcpy(&lun->mode_pages.ie_page[CTL_PAGE_SAVED], &ie_page_default, sizeof(ie_page_default)); page_index->page_data = (uint8_t *)lun->mode_pages.ie_page; break; case 0x02: { struct ctl_logical_block_provisioning_page *page; memcpy(&lun->mode_pages.lbp_page[CTL_PAGE_DEFAULT], &lbp_page_default, sizeof(lbp_page_default)); memcpy(&lun->mode_pages.lbp_page[ CTL_PAGE_CHANGEABLE], &lbp_page_changeable, sizeof(lbp_page_changeable)); memcpy(&lun->mode_pages.lbp_page[CTL_PAGE_SAVED], &lbp_page_default, sizeof(lbp_page_default)); page = &lun->mode_pages.lbp_page[CTL_PAGE_SAVED]; value = ctl_get_opt(&lun->be_lun->options, "avail-threshold"); if (value != NULL && ctl_expand_number(value, &ival) == 0) { page->descr[0].flags |= SLBPPD_ENABLED | SLBPPD_ARMING_DEC; if (lun->be_lun->blocksize) ival /= lun->be_lun->blocksize; else ival /= 512; scsi_ulto4b(ival >> CTL_LBP_EXPONENT, page->descr[0].count); } value = ctl_get_opt(&lun->be_lun->options, "used-threshold"); if (value != NULL && ctl_expand_number(value, &ival) == 0) { page->descr[1].flags |= SLBPPD_ENABLED | SLBPPD_ARMING_INC; if (lun->be_lun->blocksize) ival /= lun->be_lun->blocksize; else ival /= 512; scsi_ulto4b(ival >> CTL_LBP_EXPONENT, page->descr[1].count); } value = ctl_get_opt(&lun->be_lun->options, "pool-avail-threshold"); if (value != NULL && ctl_expand_number(value, &ival) == 0) { page->descr[2].flags |= SLBPPD_ENABLED | SLBPPD_ARMING_DEC; if (lun->be_lun->blocksize) ival /= lun->be_lun->blocksize; else ival /= 512; scsi_ulto4b(ival >> CTL_LBP_EXPONENT, page->descr[2].count); } value = ctl_get_opt(&lun->be_lun->options, "pool-used-threshold"); if (value != NULL && ctl_expand_number(value, &ival) == 0) { page->descr[3].flags |= SLBPPD_ENABLED | SLBPPD_ARMING_INC; if (lun->be_lun->blocksize) ival /= lun->be_lun->blocksize; else ival /= 512; scsi_ulto4b(ival >> CTL_LBP_EXPONENT, page->descr[3].count); } memcpy(&lun->mode_pages.lbp_page[CTL_PAGE_CURRENT], &lun->mode_pages.lbp_page[CTL_PAGE_SAVED], sizeof(lbp_page_default)); page_index->page_data = (uint8_t *)lun->mode_pages.lbp_page; break; } default: panic("subpage %#x for page %#x is incorrect!", page_index->subpage, page_code); } break; } case SMS_CDDVD_CAPS_PAGE:{ KASSERT(page_index->subpage == SMS_SUBPAGE_PAGE_0, ("subpage %#x for page %#x is incorrect!", page_index->subpage, page_code)); memcpy(&lun->mode_pages.cddvd_page[CTL_PAGE_DEFAULT], &cddvd_page_default, sizeof(cddvd_page_default)); memcpy(&lun->mode_pages.cddvd_page[ CTL_PAGE_CHANGEABLE], &cddvd_page_changeable, sizeof(cddvd_page_changeable)); memcpy(&lun->mode_pages.cddvd_page[CTL_PAGE_SAVED], &cddvd_page_default, sizeof(cddvd_page_default)); memcpy(&lun->mode_pages.cddvd_page[CTL_PAGE_CURRENT], &lun->mode_pages.cddvd_page[CTL_PAGE_SAVED], sizeof(cddvd_page_default)); page_index->page_data = (uint8_t *)lun->mode_pages.cddvd_page; break; } default: panic("invalid page code value %#x", page_code); } } return (CTL_RETVAL_COMPLETE); } static int ctl_init_log_page_index(struct ctl_lun *lun) { struct ctl_page_index *page_index; int i, j, k, prev; memcpy(&lun->log_pages.index, log_page_index_template, sizeof(log_page_index_template)); prev = -1; for (i = 0, j = 0, k = 0; i < CTL_NUM_LOG_PAGES; i++) { page_index = &lun->log_pages.index[i]; if (lun->be_lun->lun_type == T_DIRECT && (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0) continue; if (lun->be_lun->lun_type == T_PROCESSOR && (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0) continue; if (lun->be_lun->lun_type == T_CDROM && (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0) continue; if (page_index->page_code == SLS_LOGICAL_BLOCK_PROVISIONING && lun->backend->lun_attr == NULL) continue; if (page_index->page_code != prev) { lun->log_pages.pages_page[j] = page_index->page_code; prev = page_index->page_code; j++; } lun->log_pages.subpages_page[k*2] = page_index->page_code; lun->log_pages.subpages_page[k*2+1] = page_index->subpage; k++; } lun->log_pages.index[0].page_data = &lun->log_pages.pages_page[0]; lun->log_pages.index[0].page_len = j; lun->log_pages.index[1].page_data = &lun->log_pages.subpages_page[0]; lun->log_pages.index[1].page_len = k * 2; lun->log_pages.index[2].page_data = &lun->log_pages.lbp_page[0]; lun->log_pages.index[2].page_len = 12*CTL_NUM_LBP_PARAMS; lun->log_pages.index[3].page_data = (uint8_t *)&lun->log_pages.stat_page; lun->log_pages.index[3].page_len = sizeof(lun->log_pages.stat_page); lun->log_pages.index[4].page_data = (uint8_t *)&lun->log_pages.ie_page; lun->log_pages.index[4].page_len = sizeof(lun->log_pages.ie_page); return (CTL_RETVAL_COMPLETE); } static int hex2bin(const char *str, uint8_t *buf, int buf_size) { int i; u_char c; memset(buf, 0, buf_size); while (isspace(str[0])) str++; if (str[0] == '0' && (str[1] == 'x' || str[1] == 'X')) str += 2; buf_size *= 2; for (i = 0; str[i] != 0 && i < buf_size; i++) { while (str[i] == '-') /* Skip dashes in UUIDs. */ str++; c = str[i]; if (isdigit(c)) c -= '0'; else if (isalpha(c)) c -= isupper(c) ? 'A' - 10 : 'a' - 10; else break; if (c >= 16) break; if ((i & 1) == 0) buf[i / 2] |= (c << 4); else buf[i / 2] |= c; } return ((i + 1) / 2); } /* * LUN allocation. * * Requirements: * - caller allocates and zeros LUN storage, or passes in a NULL LUN if he * wants us to allocate the LUN and he can block. * - ctl_softc is always set * - be_lun is set if the LUN has a backend (needed for disk LUNs) * * Returns 0 for success, non-zero (errno) for failure. */ static int ctl_alloc_lun(struct ctl_softc *ctl_softc, struct ctl_lun *ctl_lun, struct ctl_be_lun *const be_lun) { struct ctl_lun *nlun, *lun; struct scsi_vpd_id_descriptor *desc; struct scsi_vpd_id_t10 *t10id; const char *eui, *naa, *scsiname, *uuid, *vendor, *value; int lun_number, lun_malloced; int devidlen, idlen1, idlen2 = 0, len; if (be_lun == NULL) return (EINVAL); /* * We currently only support Direct Access or Processor LUN types. */ switch (be_lun->lun_type) { case T_DIRECT: case T_PROCESSOR: case T_CDROM: break; case T_SEQUENTIAL: case T_CHANGER: default: be_lun->lun_config_status(be_lun->be_lun, CTL_LUN_CONFIG_FAILURE); break; } if (ctl_lun == NULL) { lun = malloc(sizeof(*lun), M_CTL, M_WAITOK); lun_malloced = 1; } else { lun_malloced = 0; lun = ctl_lun; } memset(lun, 0, sizeof(*lun)); if (lun_malloced) lun->flags = CTL_LUN_MALLOCED; /* Generate LUN ID. */ devidlen = max(CTL_DEVID_MIN_LEN, strnlen(be_lun->device_id, CTL_DEVID_LEN)); idlen1 = sizeof(*t10id) + devidlen; len = sizeof(struct scsi_vpd_id_descriptor) + idlen1; scsiname = ctl_get_opt(&be_lun->options, "scsiname"); if (scsiname != NULL) { idlen2 = roundup2(strlen(scsiname) + 1, 4); len += sizeof(struct scsi_vpd_id_descriptor) + idlen2; } eui = ctl_get_opt(&be_lun->options, "eui"); if (eui != NULL) { len += sizeof(struct scsi_vpd_id_descriptor) + 16; } naa = ctl_get_opt(&be_lun->options, "naa"); if (naa != NULL) { len += sizeof(struct scsi_vpd_id_descriptor) + 16; } uuid = ctl_get_opt(&be_lun->options, "uuid"); if (uuid != NULL) { len += sizeof(struct scsi_vpd_id_descriptor) + 18; } lun->lun_devid = malloc(sizeof(struct ctl_devid) + len, M_CTL, M_WAITOK | M_ZERO); desc = (struct scsi_vpd_id_descriptor *)lun->lun_devid->data; desc->proto_codeset = SVPD_ID_CODESET_ASCII; desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_LUN | SVPD_ID_TYPE_T10; desc->length = idlen1; t10id = (struct scsi_vpd_id_t10 *)&desc->identifier[0]; memset(t10id->vendor, ' ', sizeof(t10id->vendor)); if ((vendor = ctl_get_opt(&be_lun->options, "vendor")) == NULL) { strncpy((char *)t10id->vendor, CTL_VENDOR, sizeof(t10id->vendor)); } else { strncpy(t10id->vendor, vendor, min(sizeof(t10id->vendor), strlen(vendor))); } strncpy((char *)t10id->vendor_spec_id, (char *)be_lun->device_id, devidlen); if (scsiname != NULL) { desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] + desc->length); desc->proto_codeset = SVPD_ID_CODESET_UTF8; desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_LUN | SVPD_ID_TYPE_SCSI_NAME; desc->length = idlen2; strlcpy(desc->identifier, scsiname, idlen2); } if (eui != NULL) { desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] + desc->length); desc->proto_codeset = SVPD_ID_CODESET_BINARY; desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_LUN | SVPD_ID_TYPE_EUI64; desc->length = hex2bin(eui, desc->identifier, 16); desc->length = desc->length > 12 ? 16 : (desc->length > 8 ? 12 : 8); len -= 16 - desc->length; } if (naa != NULL) { desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] + desc->length); desc->proto_codeset = SVPD_ID_CODESET_BINARY; desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_LUN | SVPD_ID_TYPE_NAA; desc->length = hex2bin(naa, desc->identifier, 16); desc->length = desc->length > 8 ? 16 : 8; len -= 16 - desc->length; } if (uuid != NULL) { desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] + desc->length); desc->proto_codeset = SVPD_ID_CODESET_BINARY; desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_LUN | SVPD_ID_TYPE_UUID; desc->identifier[0] = 0x10; hex2bin(uuid, &desc->identifier[2], 16); desc->length = 18; } lun->lun_devid->len = len; mtx_lock(&ctl_softc->ctl_lock); /* * See if the caller requested a particular LUN number. If so, see * if it is available. Otherwise, allocate the first available LUN. */ if (be_lun->flags & CTL_LUN_FLAG_ID_REQ) { if ((be_lun->req_lun_id > (CTL_MAX_LUNS - 1)) || (ctl_is_set(ctl_softc->ctl_lun_mask, be_lun->req_lun_id))) { mtx_unlock(&ctl_softc->ctl_lock); if (be_lun->req_lun_id > (CTL_MAX_LUNS - 1)) { printf("ctl: requested LUN ID %d is higher " "than CTL_MAX_LUNS - 1 (%d)\n", be_lun->req_lun_id, CTL_MAX_LUNS - 1); } else { /* * XXX KDM return an error, or just assign * another LUN ID in this case?? */ printf("ctl: requested LUN ID %d is already " "in use\n", be_lun->req_lun_id); } fail: free(lun->lun_devid, M_CTL); if (lun->flags & CTL_LUN_MALLOCED) free(lun, M_CTL); be_lun->lun_config_status(be_lun->be_lun, CTL_LUN_CONFIG_FAILURE); return (ENOSPC); } lun_number = be_lun->req_lun_id; } else { lun_number = ctl_ffz(ctl_softc->ctl_lun_mask, 0, CTL_MAX_LUNS); if (lun_number == -1) { mtx_unlock(&ctl_softc->ctl_lock); printf("ctl: can't allocate LUN, out of LUNs\n"); goto fail; } } ctl_set_mask(ctl_softc->ctl_lun_mask, lun_number); mtx_unlock(&ctl_softc->ctl_lock); mtx_init(&lun->lun_lock, "CTL LUN", NULL, MTX_DEF); lun->lun = lun_number; lun->be_lun = be_lun; /* * The processor LUN is always enabled. Disk LUNs come on line * disabled, and must be enabled by the backend. */ lun->flags |= CTL_LUN_DISABLED; lun->backend = be_lun->be; be_lun->ctl_lun = lun; be_lun->lun_id = lun_number; atomic_add_int(&be_lun->be->num_luns, 1); if (be_lun->flags & CTL_LUN_FLAG_EJECTED) lun->flags |= CTL_LUN_EJECTED; if (be_lun->flags & CTL_LUN_FLAG_NO_MEDIA) lun->flags |= CTL_LUN_NO_MEDIA; if (be_lun->flags & CTL_LUN_FLAG_STOPPED) lun->flags |= CTL_LUN_STOPPED; if (be_lun->flags & CTL_LUN_FLAG_PRIMARY) lun->flags |= CTL_LUN_PRIMARY_SC; value = ctl_get_opt(&be_lun->options, "removable"); if (value != NULL) { if (strcmp(value, "on") == 0) lun->flags |= CTL_LUN_REMOVABLE; } else if (be_lun->lun_type == T_CDROM) lun->flags |= CTL_LUN_REMOVABLE; lun->ctl_softc = ctl_softc; #ifdef CTL_TIME_IO lun->last_busy = getsbinuptime(); #endif TAILQ_INIT(&lun->ooa_queue); TAILQ_INIT(&lun->blocked_queue); STAILQ_INIT(&lun->error_list); lun->ie_reported = 1; callout_init_mtx(&lun->ie_callout, &lun->lun_lock, 0); ctl_tpc_lun_init(lun); if (lun->flags & CTL_LUN_REMOVABLE) { lun->prevent = malloc((CTL_MAX_INITIATORS + 31) / 32 * 4, M_CTL, M_WAITOK); } /* * Initialize the mode and log page index. */ ctl_init_page_index(lun); ctl_init_log_page_index(lun); /* Setup statistics gathering */ #ifdef CTL_LEGACY_STATS lun->legacy_stats.device_type = be_lun->lun_type; lun->legacy_stats.lun_number = lun_number; lun->legacy_stats.blocksize = be_lun->blocksize; if (be_lun->blocksize == 0) lun->legacy_stats.flags = CTL_LUN_STATS_NO_BLOCKSIZE; for (len = 0; len < CTL_MAX_PORTS; len++) lun->legacy_stats.ports[len].targ_port = len; #endif /* CTL_LEGACY_STATS */ lun->stats.item = lun_number; /* * Now, before we insert this lun on the lun list, set the lun * inventory changed UA for all other luns. */ mtx_lock(&ctl_softc->ctl_lock); STAILQ_FOREACH(nlun, &ctl_softc->lun_list, links) { mtx_lock(&nlun->lun_lock); ctl_est_ua_all(nlun, -1, CTL_UA_LUN_CHANGE); mtx_unlock(&nlun->lun_lock); } STAILQ_INSERT_TAIL(&ctl_softc->lun_list, lun, links); ctl_softc->ctl_luns[lun_number] = lun; ctl_softc->num_luns++; mtx_unlock(&ctl_softc->ctl_lock); lun->be_lun->lun_config_status(lun->be_lun->be_lun, CTL_LUN_CONFIG_OK); return (0); } /* * Delete a LUN. * Assumptions: * - LUN has already been marked invalid and any pending I/O has been taken * care of. */ static int ctl_free_lun(struct ctl_lun *lun) { struct ctl_softc *softc = lun->ctl_softc; struct ctl_lun *nlun; int i; mtx_assert(&softc->ctl_lock, MA_OWNED); STAILQ_REMOVE(&softc->lun_list, lun, ctl_lun, links); ctl_clear_mask(softc->ctl_lun_mask, lun->lun); softc->ctl_luns[lun->lun] = NULL; if (!TAILQ_EMPTY(&lun->ooa_queue)) panic("Freeing a LUN %p with outstanding I/O!!\n", lun); softc->num_luns--; /* * Tell the backend to free resources, if this LUN has a backend. */ atomic_subtract_int(&lun->be_lun->be->num_luns, 1); lun->be_lun->lun_shutdown(lun->be_lun->be_lun); lun->ie_reportcnt = UINT32_MAX; callout_drain(&lun->ie_callout); ctl_tpc_lun_shutdown(lun); mtx_destroy(&lun->lun_lock); free(lun->lun_devid, M_CTL); for (i = 0; i < CTL_MAX_PORTS; i++) free(lun->pending_ua[i], M_CTL); for (i = 0; i < CTL_MAX_PORTS; i++) free(lun->pr_keys[i], M_CTL); free(lun->write_buffer, M_CTL); free(lun->prevent, M_CTL); if (lun->flags & CTL_LUN_MALLOCED) free(lun, M_CTL); STAILQ_FOREACH(nlun, &softc->lun_list, links) { mtx_lock(&nlun->lun_lock); ctl_est_ua_all(nlun, -1, CTL_UA_LUN_CHANGE); mtx_unlock(&nlun->lun_lock); } return (0); } static void ctl_create_lun(struct ctl_be_lun *be_lun) { /* * ctl_alloc_lun() should handle all potential failure cases. */ ctl_alloc_lun(control_softc, NULL, be_lun); } int ctl_add_lun(struct ctl_be_lun *be_lun) { struct ctl_softc *softc = control_softc; mtx_lock(&softc->ctl_lock); STAILQ_INSERT_TAIL(&softc->pending_lun_queue, be_lun, links); mtx_unlock(&softc->ctl_lock); wakeup(&softc->pending_lun_queue); return (0); } int ctl_enable_lun(struct ctl_be_lun *be_lun) { struct ctl_softc *softc; struct ctl_port *port, *nport; struct ctl_lun *lun; int retval; lun = (struct ctl_lun *)be_lun->ctl_lun; softc = lun->ctl_softc; mtx_lock(&softc->ctl_lock); mtx_lock(&lun->lun_lock); if ((lun->flags & CTL_LUN_DISABLED) == 0) { /* * eh? Why did we get called if the LUN is already * enabled? */ mtx_unlock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); return (0); } lun->flags &= ~CTL_LUN_DISABLED; mtx_unlock(&lun->lun_lock); STAILQ_FOREACH_SAFE(port, &softc->port_list, links, nport) { if ((port->status & CTL_PORT_STATUS_ONLINE) == 0 || port->lun_map != NULL || port->lun_enable == NULL) continue; /* * Drop the lock while we call the FETD's enable routine. * This can lead to a callback into CTL (at least in the * case of the internal initiator frontend. */ mtx_unlock(&softc->ctl_lock); retval = port->lun_enable(port->targ_lun_arg, lun->lun); mtx_lock(&softc->ctl_lock); if (retval != 0) { printf("%s: FETD %s port %d returned error " "%d for lun_enable on lun %jd\n", __func__, port->port_name, port->targ_port, retval, (intmax_t)lun->lun); } } mtx_unlock(&softc->ctl_lock); ctl_isc_announce_lun(lun); return (0); } int ctl_disable_lun(struct ctl_be_lun *be_lun) { struct ctl_softc *softc; struct ctl_port *port; struct ctl_lun *lun; int retval; lun = (struct ctl_lun *)be_lun->ctl_lun; softc = lun->ctl_softc; mtx_lock(&softc->ctl_lock); mtx_lock(&lun->lun_lock); if (lun->flags & CTL_LUN_DISABLED) { mtx_unlock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); return (0); } lun->flags |= CTL_LUN_DISABLED; mtx_unlock(&lun->lun_lock); STAILQ_FOREACH(port, &softc->port_list, links) { if ((port->status & CTL_PORT_STATUS_ONLINE) == 0 || port->lun_map != NULL || port->lun_disable == NULL) continue; /* * Drop the lock before we call the frontend's disable * routine, to avoid lock order reversals. * * XXX KDM what happens if the frontend list changes while * we're traversing it? It's unlikely, but should be handled. */ mtx_unlock(&softc->ctl_lock); retval = port->lun_disable(port->targ_lun_arg, lun->lun); mtx_lock(&softc->ctl_lock); if (retval != 0) { printf("%s: FETD %s port %d returned error " "%d for lun_disable on lun %jd\n", __func__, port->port_name, port->targ_port, retval, (intmax_t)lun->lun); } } mtx_unlock(&softc->ctl_lock); ctl_isc_announce_lun(lun); return (0); } int ctl_start_lun(struct ctl_be_lun *be_lun) { struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun; mtx_lock(&lun->lun_lock); lun->flags &= ~CTL_LUN_STOPPED; mtx_unlock(&lun->lun_lock); return (0); } int ctl_stop_lun(struct ctl_be_lun *be_lun) { struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun; mtx_lock(&lun->lun_lock); lun->flags |= CTL_LUN_STOPPED; mtx_unlock(&lun->lun_lock); return (0); } int ctl_lun_no_media(struct ctl_be_lun *be_lun) { struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun; mtx_lock(&lun->lun_lock); lun->flags |= CTL_LUN_NO_MEDIA; mtx_unlock(&lun->lun_lock); return (0); } int ctl_lun_has_media(struct ctl_be_lun *be_lun) { struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun; union ctl_ha_msg msg; mtx_lock(&lun->lun_lock); lun->flags &= ~(CTL_LUN_NO_MEDIA | CTL_LUN_EJECTED); if (lun->flags & CTL_LUN_REMOVABLE) ctl_est_ua_all(lun, -1, CTL_UA_MEDIUM_CHANGE); mtx_unlock(&lun->lun_lock); if ((lun->flags & CTL_LUN_REMOVABLE) && lun->ctl_softc->ha_mode == CTL_HA_MODE_XFER) { bzero(&msg.ua, sizeof(msg.ua)); msg.hdr.msg_type = CTL_MSG_UA; msg.hdr.nexus.initid = -1; msg.hdr.nexus.targ_port = -1; msg.hdr.nexus.targ_lun = lun->lun; msg.hdr.nexus.targ_mapped_lun = lun->lun; msg.ua.ua_all = 1; msg.ua.ua_set = 1; msg.ua.ua_type = CTL_UA_MEDIUM_CHANGE; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg, sizeof(msg.ua), M_WAITOK); } return (0); } int ctl_lun_ejected(struct ctl_be_lun *be_lun) { struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun; mtx_lock(&lun->lun_lock); lun->flags |= CTL_LUN_EJECTED; mtx_unlock(&lun->lun_lock); return (0); } int ctl_lun_primary(struct ctl_be_lun *be_lun) { struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun; mtx_lock(&lun->lun_lock); lun->flags |= CTL_LUN_PRIMARY_SC; ctl_est_ua_all(lun, -1, CTL_UA_ASYM_ACC_CHANGE); mtx_unlock(&lun->lun_lock); ctl_isc_announce_lun(lun); return (0); } int ctl_lun_secondary(struct ctl_be_lun *be_lun) { struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun; mtx_lock(&lun->lun_lock); lun->flags &= ~CTL_LUN_PRIMARY_SC; ctl_est_ua_all(lun, -1, CTL_UA_ASYM_ACC_CHANGE); mtx_unlock(&lun->lun_lock); ctl_isc_announce_lun(lun); return (0); } int ctl_invalidate_lun(struct ctl_be_lun *be_lun) { struct ctl_softc *softc; struct ctl_lun *lun; lun = (struct ctl_lun *)be_lun->ctl_lun; softc = lun->ctl_softc; mtx_lock(&lun->lun_lock); /* * The LUN needs to be disabled before it can be marked invalid. */ if ((lun->flags & CTL_LUN_DISABLED) == 0) { mtx_unlock(&lun->lun_lock); return (-1); } /* * Mark the LUN invalid. */ lun->flags |= CTL_LUN_INVALID; /* * If there is nothing in the OOA queue, go ahead and free the LUN. * If we have something in the OOA queue, we'll free it when the * last I/O completes. */ if (TAILQ_EMPTY(&lun->ooa_queue)) { mtx_unlock(&lun->lun_lock); mtx_lock(&softc->ctl_lock); ctl_free_lun(lun); mtx_unlock(&softc->ctl_lock); } else mtx_unlock(&lun->lun_lock); return (0); } void ctl_lun_capacity_changed(struct ctl_be_lun *be_lun) { struct ctl_lun *lun = (struct ctl_lun *)be_lun->ctl_lun; union ctl_ha_msg msg; mtx_lock(&lun->lun_lock); ctl_est_ua_all(lun, -1, CTL_UA_CAPACITY_CHANGE); mtx_unlock(&lun->lun_lock); if (lun->ctl_softc->ha_mode == CTL_HA_MODE_XFER) { /* Send msg to other side. */ bzero(&msg.ua, sizeof(msg.ua)); msg.hdr.msg_type = CTL_MSG_UA; msg.hdr.nexus.initid = -1; msg.hdr.nexus.targ_port = -1; msg.hdr.nexus.targ_lun = lun->lun; msg.hdr.nexus.targ_mapped_lun = lun->lun; msg.ua.ua_all = 1; msg.ua.ua_set = 1; msg.ua.ua_type = CTL_UA_CAPACITY_CHANGE; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg, sizeof(msg.ua), M_WAITOK); } } /* * Backend "memory move is complete" callback for requests that never * make it down to say RAIDCore's configuration code. */ int ctl_config_move_done(union ctl_io *io) { int retval; CTL_DEBUG_PRINT(("ctl_config_move_done\n")); KASSERT(io->io_hdr.io_type == CTL_IO_SCSI, ("Config I/O type isn't CTL_IO_SCSI (%d)!", io->io_hdr.io_type)); if ((io->io_hdr.port_status != 0) && ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE || (io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS)) { ctl_set_internal_failure(&io->scsiio, /*sks_valid*/ 1, /*retry_count*/ io->io_hdr.port_status); } else if (io->scsiio.kern_data_resid != 0 && (io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_OUT && ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE || (io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS)) { ctl_set_invalid_field_ciu(&io->scsiio); } if (ctl_debug & CTL_DEBUG_CDB_DATA) ctl_data_print(io); if (((io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_IN) || ((io->io_hdr.status & CTL_STATUS_MASK) != CTL_STATUS_NONE && (io->io_hdr.status & CTL_STATUS_MASK) != CTL_SUCCESS) || ((io->io_hdr.flags & CTL_FLAG_ABORT) != 0)) { /* * XXX KDM just assuming a single pointer here, and not a * S/G list. If we start using S/G lists for config data, * we'll need to know how to clean them up here as well. */ if (io->io_hdr.flags & CTL_FLAG_ALLOCATED) free(io->scsiio.kern_data_ptr, M_CTL); ctl_done(io); retval = CTL_RETVAL_COMPLETE; } else { /* * XXX KDM now we need to continue data movement. Some * options: * - call ctl_scsiio() again? We don't do this for data * writes, because for those at least we know ahead of * time where the write will go and how long it is. For * config writes, though, that information is largely * contained within the write itself, thus we need to * parse out the data again. * * - Call some other function once the data is in? */ /* * XXX KDM call ctl_scsiio() again for now, and check flag * bits to see whether we're allocated or not. */ retval = ctl_scsiio(&io->scsiio); } return (retval); } /* * This gets called by a backend driver when it is done with a * data_submit method. */ void ctl_data_submit_done(union ctl_io *io) { /* * If the IO_CONT flag is set, we need to call the supplied * function to continue processing the I/O, instead of completing * the I/O just yet. * * If there is an error, though, we don't want to keep processing. * Instead, just send status back to the initiator. */ if ((io->io_hdr.flags & CTL_FLAG_IO_CONT) && (io->io_hdr.flags & CTL_FLAG_ABORT) == 0 && ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE || (io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS)) { io->scsiio.io_cont(io); return; } ctl_done(io); } /* * This gets called by a backend driver when it is done with a * configuration write. */ void ctl_config_write_done(union ctl_io *io) { uint8_t *buf; /* * If the IO_CONT flag is set, we need to call the supplied * function to continue processing the I/O, instead of completing * the I/O just yet. * * If there is an error, though, we don't want to keep processing. * Instead, just send status back to the initiator. */ if ((io->io_hdr.flags & CTL_FLAG_IO_CONT) && (io->io_hdr.flags & CTL_FLAG_ABORT) == 0 && ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE || (io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS)) { io->scsiio.io_cont(io); return; } /* * Since a configuration write can be done for commands that actually * have data allocated, like write buffer, and commands that have * no data, like start/stop unit, we need to check here. */ if (io->io_hdr.flags & CTL_FLAG_ALLOCATED) buf = io->scsiio.kern_data_ptr; else buf = NULL; ctl_done(io); if (buf) free(buf, M_CTL); } void ctl_config_read_done(union ctl_io *io) { uint8_t *buf; /* * If there is some error -- we are done, skip data transfer. */ if ((io->io_hdr.flags & CTL_FLAG_ABORT) != 0 || ((io->io_hdr.status & CTL_STATUS_MASK) != CTL_STATUS_NONE && (io->io_hdr.status & CTL_STATUS_MASK) != CTL_SUCCESS)) { if (io->io_hdr.flags & CTL_FLAG_ALLOCATED) buf = io->scsiio.kern_data_ptr; else buf = NULL; ctl_done(io); if (buf) free(buf, M_CTL); return; } /* * If the IO_CONT flag is set, we need to call the supplied * function to continue processing the I/O, instead of completing * the I/O just yet. */ if (io->io_hdr.flags & CTL_FLAG_IO_CONT) { io->scsiio.io_cont(io); return; } ctl_datamove(io); } /* * SCSI release command. */ int ctl_scsi_release(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); uint32_t residx; CTL_DEBUG_PRINT(("ctl_scsi_release\n")); residx = ctl_get_initindex(&ctsio->io_hdr.nexus); /* * XXX KDM right now, we only support LUN reservation. We don't * support 3rd party reservations, or extent reservations, which * might actually need the parameter list. If we've gotten this * far, we've got a LUN reservation. Anything else got kicked out * above. So, according to SPC, ignore the length. */ mtx_lock(&lun->lun_lock); /* * According to SPC, it is not an error for an intiator to attempt * to release a reservation on a LUN that isn't reserved, or that * is reserved by another initiator. The reservation can only be * released, though, by the initiator who made it or by one of * several reset type events. */ if ((lun->flags & CTL_LUN_RESERVED) && (lun->res_idx == residx)) lun->flags &= ~CTL_LUN_RESERVED; mtx_unlock(&lun->lun_lock); ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_scsi_reserve(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); uint32_t residx; CTL_DEBUG_PRINT(("ctl_reserve\n")); residx = ctl_get_initindex(&ctsio->io_hdr.nexus); /* * XXX KDM right now, we only support LUN reservation. We don't * support 3rd party reservations, or extent reservations, which * might actually need the parameter list. If we've gotten this * far, we've got a LUN reservation. Anything else got kicked out * above. So, according to SPC, ignore the length. */ mtx_lock(&lun->lun_lock); if ((lun->flags & CTL_LUN_RESERVED) && (lun->res_idx != residx)) { ctl_set_reservation_conflict(ctsio); goto bailout; } /* SPC-3 exceptions to SPC-2 RESERVE and RELEASE behavior. */ if (lun->flags & CTL_LUN_PR_RESERVED) { ctl_set_success(ctsio); goto bailout; } lun->flags |= CTL_LUN_RESERVED; lun->res_idx = residx; ctl_set_success(ctsio); bailout: mtx_unlock(&lun->lun_lock); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_start_stop(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_start_stop_unit *cdb; int retval; CTL_DEBUG_PRINT(("ctl_start_stop\n")); cdb = (struct scsi_start_stop_unit *)ctsio->cdb; if ((cdb->how & SSS_PC_MASK) == 0) { if ((lun->flags & CTL_LUN_PR_RESERVED) && (cdb->how & SSS_START) == 0) { uint32_t residx; residx = ctl_get_initindex(&ctsio->io_hdr.nexus); if (ctl_get_prkey(lun, residx) == 0 || (lun->pr_res_idx != residx && lun->pr_res_type < 4)) { ctl_set_reservation_conflict(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } } if ((cdb->how & SSS_LOEJ) && (lun->flags & CTL_LUN_REMOVABLE) == 0) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 4, /*bit_valid*/ 1, /*bit*/ 1); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } if ((cdb->how & SSS_START) == 0 && (cdb->how & SSS_LOEJ) && lun->prevent_count > 0) { /* "Medium removal prevented" */ ctl_set_sense(ctsio, /*current_error*/ 1, /*sense_key*/(lun->flags & CTL_LUN_NO_MEDIA) ? SSD_KEY_NOT_READY : SSD_KEY_ILLEGAL_REQUEST, /*asc*/ 0x53, /*ascq*/ 0x02, SSD_ELEM_NONE); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } } retval = lun->backend->config_write((union ctl_io *)ctsio); return (retval); } int ctl_prevent_allow(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_prevent *cdb; int retval; uint32_t initidx; CTL_DEBUG_PRINT(("ctl_prevent_allow\n")); cdb = (struct scsi_prevent *)ctsio->cdb; if ((lun->flags & CTL_LUN_REMOVABLE) == 0 || lun->prevent == NULL) { ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } initidx = ctl_get_initindex(&ctsio->io_hdr.nexus); mtx_lock(&lun->lun_lock); if ((cdb->how & PR_PREVENT) && ctl_is_set(lun->prevent, initidx) == 0) { ctl_set_mask(lun->prevent, initidx); lun->prevent_count++; } else if ((cdb->how & PR_PREVENT) == 0 && ctl_is_set(lun->prevent, initidx)) { ctl_clear_mask(lun->prevent, initidx); lun->prevent_count--; } mtx_unlock(&lun->lun_lock); retval = lun->backend->config_write((union ctl_io *)ctsio); return (retval); } /* * We support the SYNCHRONIZE CACHE command (10 and 16 byte versions), but * we don't really do anything with the LBA and length fields if the user * passes them in. Instead we'll just flush out the cache for the entire * LUN. */ int ctl_sync_cache(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct ctl_lba_len_flags *lbalen; uint64_t starting_lba; uint32_t block_count; int retval; uint8_t byte2; CTL_DEBUG_PRINT(("ctl_sync_cache\n")); retval = 0; switch (ctsio->cdb[0]) { case SYNCHRONIZE_CACHE: { struct scsi_sync_cache *cdb; cdb = (struct scsi_sync_cache *)ctsio->cdb; starting_lba = scsi_4btoul(cdb->begin_lba); block_count = scsi_2btoul(cdb->lb_count); byte2 = cdb->byte2; break; } case SYNCHRONIZE_CACHE_16: { struct scsi_sync_cache_16 *cdb; cdb = (struct scsi_sync_cache_16 *)ctsio->cdb; starting_lba = scsi_8btou64(cdb->begin_lba); block_count = scsi_4btoul(cdb->lb_count); byte2 = cdb->byte2; break; } default: ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); goto bailout; break; /* NOTREACHED */ } /* * We check the LBA and length, but don't do anything with them. * A SYNCHRONIZE CACHE will cause the entire cache for this lun to * get flushed. This check will just help satisfy anyone who wants * to see an error for an out of range LBA. */ if ((starting_lba + block_count) > (lun->be_lun->maxlba + 1)) { ctl_set_lba_out_of_range(ctsio, MAX(starting_lba, lun->be_lun->maxlba + 1)); ctl_done((union ctl_io *)ctsio); goto bailout; } lbalen = (struct ctl_lba_len_flags *)&ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN]; lbalen->lba = starting_lba; lbalen->len = block_count; lbalen->flags = byte2; retval = lun->backend->config_write((union ctl_io *)ctsio); bailout: return (retval); } int ctl_format(struct ctl_scsiio *ctsio) { struct scsi_format *cdb; int length, defect_list_len; CTL_DEBUG_PRINT(("ctl_format\n")); cdb = (struct scsi_format *)ctsio->cdb; length = 0; if (cdb->byte2 & SF_FMTDATA) { if (cdb->byte2 & SF_LONGLIST) length = sizeof(struct scsi_format_header_long); else length = sizeof(struct scsi_format_header_short); } if (((ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) && (length > 0)) { ctsio->kern_data_ptr = malloc(length, M_CTL, M_WAITOK); ctsio->kern_data_len = length; ctsio->kern_total_len = length; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } defect_list_len = 0; if (cdb->byte2 & SF_FMTDATA) { if (cdb->byte2 & SF_LONGLIST) { struct scsi_format_header_long *header; header = (struct scsi_format_header_long *) ctsio->kern_data_ptr; defect_list_len = scsi_4btoul(header->defect_list_len); if (defect_list_len != 0) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 0, /*field*/ 2, /*bit_valid*/ 0, /*bit*/ 0); goto bailout; } } else { struct scsi_format_header_short *header; header = (struct scsi_format_header_short *) ctsio->kern_data_ptr; defect_list_len = scsi_2btoul(header->defect_list_len); if (defect_list_len != 0) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 0, /*field*/ 2, /*bit_valid*/ 0, /*bit*/ 0); goto bailout; } } } ctl_set_success(ctsio); bailout: if (ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) { free(ctsio->kern_data_ptr, M_CTL); ctsio->io_hdr.flags &= ~CTL_FLAG_ALLOCATED; } ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_read_buffer(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); uint64_t buffer_offset; uint32_t len; uint8_t byte2; static uint8_t descr[4]; static uint8_t echo_descr[4] = { 0 }; CTL_DEBUG_PRINT(("ctl_read_buffer\n")); switch (ctsio->cdb[0]) { case READ_BUFFER: { struct scsi_read_buffer *cdb; cdb = (struct scsi_read_buffer *)ctsio->cdb; buffer_offset = scsi_3btoul(cdb->offset); len = scsi_3btoul(cdb->length); byte2 = cdb->byte2; break; } case READ_BUFFER_16: { struct scsi_read_buffer_16 *cdb; cdb = (struct scsi_read_buffer_16 *)ctsio->cdb; buffer_offset = scsi_8btou64(cdb->offset); len = scsi_4btoul(cdb->length); byte2 = cdb->byte2; break; } default: /* This shouldn't happen. */ ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } if (buffer_offset > CTL_WRITE_BUFFER_SIZE || buffer_offset + len > CTL_WRITE_BUFFER_SIZE) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 6, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } if ((byte2 & RWB_MODE) == RWB_MODE_DESCR) { descr[0] = 0; scsi_ulto3b(CTL_WRITE_BUFFER_SIZE, &descr[1]); ctsio->kern_data_ptr = descr; len = min(len, sizeof(descr)); } else if ((byte2 & RWB_MODE) == RWB_MODE_ECHO_DESCR) { ctsio->kern_data_ptr = echo_descr; len = min(len, sizeof(echo_descr)); } else { if (lun->write_buffer == NULL) { lun->write_buffer = malloc(CTL_WRITE_BUFFER_SIZE, M_CTL, M_WAITOK); } ctsio->kern_data_ptr = lun->write_buffer + buffer_offset; } ctsio->kern_data_len = len; ctsio->kern_total_len = len; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctl_set_success(ctsio); ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_write_buffer(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_write_buffer *cdb; int buffer_offset, len; CTL_DEBUG_PRINT(("ctl_write_buffer\n")); cdb = (struct scsi_write_buffer *)ctsio->cdb; len = scsi_3btoul(cdb->length); buffer_offset = scsi_3btoul(cdb->offset); if (buffer_offset + len > CTL_WRITE_BUFFER_SIZE) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 6, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * If we've got a kernel request that hasn't been malloced yet, * malloc it and tell the caller the data buffer is here. */ if ((ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) { if (lun->write_buffer == NULL) { lun->write_buffer = malloc(CTL_WRITE_BUFFER_SIZE, M_CTL, M_WAITOK); } ctsio->kern_data_ptr = lun->write_buffer + buffer_offset; ctsio->kern_data_len = len; ctsio->kern_total_len = len; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_write_same(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct ctl_lba_len_flags *lbalen; uint64_t lba; uint32_t num_blocks; int len, retval; uint8_t byte2; CTL_DEBUG_PRINT(("ctl_write_same\n")); switch (ctsio->cdb[0]) { case WRITE_SAME_10: { struct scsi_write_same_10 *cdb; cdb = (struct scsi_write_same_10 *)ctsio->cdb; lba = scsi_4btoul(cdb->addr); num_blocks = scsi_2btoul(cdb->length); byte2 = cdb->byte2; break; } case WRITE_SAME_16: { struct scsi_write_same_16 *cdb; cdb = (struct scsi_write_same_16 *)ctsio->cdb; lba = scsi_8btou64(cdb->addr); num_blocks = scsi_4btoul(cdb->length); byte2 = cdb->byte2; break; } default: /* * We got a command we don't support. This shouldn't * happen, commands should be filtered out above us. */ ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); break; /* NOTREACHED */ } /* ANCHOR flag can be used only together with UNMAP */ if ((byte2 & SWS_UNMAP) == 0 && (byte2 & SWS_ANCHOR) != 0) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 1, /*bit_valid*/ 1, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * The first check is to make sure we're in bounds, the second * check is to catch wrap-around problems. If the lba + num blocks * is less than the lba, then we've wrapped around and the block * range is invalid anyway. */ if (((lba + num_blocks) > (lun->be_lun->maxlba + 1)) || ((lba + num_blocks) < lba)) { ctl_set_lba_out_of_range(ctsio, MAX(lba, lun->be_lun->maxlba + 1)); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* Zero number of blocks means "to the last logical block" */ if (num_blocks == 0) { if ((lun->be_lun->maxlba + 1) - lba > UINT32_MAX) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 0, /*command*/ 1, /*field*/ 0, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } num_blocks = (lun->be_lun->maxlba + 1) - lba; } len = lun->be_lun->blocksize; /* * If we've got a kernel request that hasn't been malloced yet, * malloc it and tell the caller the data buffer is here. */ if ((byte2 & SWS_NDOB) == 0 && (ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) { ctsio->kern_data_ptr = malloc(len, M_CTL, M_WAITOK); ctsio->kern_data_len = len; ctsio->kern_total_len = len; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } lbalen = (struct ctl_lba_len_flags *)&ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN]; lbalen->lba = lba; lbalen->len = num_blocks; lbalen->flags = byte2; retval = lun->backend->config_write((union ctl_io *)ctsio); return (retval); } int ctl_unmap(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_unmap *cdb; struct ctl_ptr_len_flags *ptrlen; struct scsi_unmap_header *hdr; struct scsi_unmap_desc *buf, *end, *endnz, *range; uint64_t lba; uint32_t num_blocks; int len, retval; uint8_t byte2; CTL_DEBUG_PRINT(("ctl_unmap\n")); cdb = (struct scsi_unmap *)ctsio->cdb; len = scsi_2btoul(cdb->length); byte2 = cdb->byte2; /* * If we've got a kernel request that hasn't been malloced yet, * malloc it and tell the caller the data buffer is here. */ if ((ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) { ctsio->kern_data_ptr = malloc(len, M_CTL, M_WAITOK); ctsio->kern_data_len = len; ctsio->kern_total_len = len; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } len = ctsio->kern_total_len - ctsio->kern_data_resid; hdr = (struct scsi_unmap_header *)ctsio->kern_data_ptr; if (len < sizeof (*hdr) || len < (scsi_2btoul(hdr->length) + sizeof(hdr->length)) || len < (scsi_2btoul(hdr->desc_length) + sizeof (*hdr)) || scsi_2btoul(hdr->desc_length) % sizeof(*buf) != 0) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 0, /*command*/ 0, /*field*/ 0, /*bit_valid*/ 0, /*bit*/ 0); goto done; } len = scsi_2btoul(hdr->desc_length); buf = (struct scsi_unmap_desc *)(hdr + 1); end = buf + len / sizeof(*buf); endnz = buf; for (range = buf; range < end; range++) { lba = scsi_8btou64(range->lba); num_blocks = scsi_4btoul(range->length); if (((lba + num_blocks) > (lun->be_lun->maxlba + 1)) || ((lba + num_blocks) < lba)) { ctl_set_lba_out_of_range(ctsio, MAX(lba, lun->be_lun->maxlba + 1)); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } if (num_blocks != 0) endnz = range + 1; } /* * Block backend can not handle zero last range. * Filter it out and return if there is nothing left. */ len = (uint8_t *)endnz - (uint8_t *)buf; if (len == 0) { ctl_set_success(ctsio); goto done; } mtx_lock(&lun->lun_lock); ptrlen = (struct ctl_ptr_len_flags *) &ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN]; ptrlen->ptr = (void *)buf; ptrlen->len = len; ptrlen->flags = byte2; ctl_check_blocked(lun); mtx_unlock(&lun->lun_lock); retval = lun->backend->config_write((union ctl_io *)ctsio); return (retval); done: if (ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) { free(ctsio->kern_data_ptr, M_CTL); ctsio->io_hdr.flags &= ~CTL_FLAG_ALLOCATED; } ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_default_page_handler(struct ctl_scsiio *ctsio, struct ctl_page_index *page_index, uint8_t *page_ptr) { struct ctl_lun *lun = CTL_LUN(ctsio); uint8_t *current_cp; int set_ua; uint32_t initidx; initidx = ctl_get_initindex(&ctsio->io_hdr.nexus); set_ua = 0; current_cp = (page_index->page_data + (page_index->page_len * CTL_PAGE_CURRENT)); mtx_lock(&lun->lun_lock); if (memcmp(current_cp, page_ptr, page_index->page_len)) { memcpy(current_cp, page_ptr, page_index->page_len); set_ua = 1; } if (set_ua != 0) ctl_est_ua_all(lun, initidx, CTL_UA_MODE_CHANGE); mtx_unlock(&lun->lun_lock); if (set_ua) { ctl_isc_announce_mode(lun, ctl_get_initindex(&ctsio->io_hdr.nexus), page_index->page_code, page_index->subpage); } return (CTL_RETVAL_COMPLETE); } static void ctl_ie_timer(void *arg) { struct ctl_lun *lun = arg; uint64_t t; if (lun->ie_asc == 0) return; if (lun->MODE_IE.mrie == SIEP_MRIE_UA) ctl_est_ua_all(lun, -1, CTL_UA_IE); else lun->ie_reported = 0; if (lun->ie_reportcnt < scsi_4btoul(lun->MODE_IE.report_count)) { lun->ie_reportcnt++; t = scsi_4btoul(lun->MODE_IE.interval_timer); if (t == 0 || t == UINT32_MAX) t = 3000; /* 5 min */ callout_schedule(&lun->ie_callout, t * hz / 10); } } int ctl_ie_page_handler(struct ctl_scsiio *ctsio, struct ctl_page_index *page_index, uint8_t *page_ptr) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_info_exceptions_page *pg; uint64_t t; (void)ctl_default_page_handler(ctsio, page_index, page_ptr); pg = (struct scsi_info_exceptions_page *)page_ptr; mtx_lock(&lun->lun_lock); if (pg->info_flags & SIEP_FLAGS_TEST) { lun->ie_asc = 0x5d; lun->ie_ascq = 0xff; if (pg->mrie == SIEP_MRIE_UA) { ctl_est_ua_all(lun, -1, CTL_UA_IE); lun->ie_reported = 1; } else { ctl_clr_ua_all(lun, -1, CTL_UA_IE); lun->ie_reported = -1; } lun->ie_reportcnt = 1; if (lun->ie_reportcnt < scsi_4btoul(pg->report_count)) { lun->ie_reportcnt++; t = scsi_4btoul(pg->interval_timer); if (t == 0 || t == UINT32_MAX) t = 3000; /* 5 min */ callout_reset(&lun->ie_callout, t * hz / 10, ctl_ie_timer, lun); } } else { lun->ie_asc = 0; lun->ie_ascq = 0; lun->ie_reported = 1; ctl_clr_ua_all(lun, -1, CTL_UA_IE); lun->ie_reportcnt = UINT32_MAX; callout_stop(&lun->ie_callout); } mtx_unlock(&lun->lun_lock); return (CTL_RETVAL_COMPLETE); } static int ctl_do_mode_select(union ctl_io *io) { struct ctl_lun *lun = CTL_LUN(io); struct scsi_mode_page_header *page_header; struct ctl_page_index *page_index; struct ctl_scsiio *ctsio; int page_len, page_len_offset, page_len_size; union ctl_modepage_info *modepage_info; uint16_t *len_left, *len_used; int retval, i; ctsio = &io->scsiio; page_index = NULL; page_len = 0; modepage_info = (union ctl_modepage_info *) ctsio->io_hdr.ctl_private[CTL_PRIV_MODEPAGE].bytes; len_left = &modepage_info->header.len_left; len_used = &modepage_info->header.len_used; do_next_page: page_header = (struct scsi_mode_page_header *) (ctsio->kern_data_ptr + *len_used); if (*len_left == 0) { free(ctsio->kern_data_ptr, M_CTL); ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } else if (*len_left < sizeof(struct scsi_mode_page_header)) { free(ctsio->kern_data_ptr, M_CTL); ctl_set_param_len_error(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } else if ((page_header->page_code & SMPH_SPF) && (*len_left < sizeof(struct scsi_mode_page_header_sp))) { free(ctsio->kern_data_ptr, M_CTL); ctl_set_param_len_error(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * XXX KDM should we do something with the block descriptor? */ for (i = 0; i < CTL_NUM_MODE_PAGES; i++) { page_index = &lun->mode_pages.index[i]; if (lun->be_lun->lun_type == T_DIRECT && (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0) continue; if (lun->be_lun->lun_type == T_PROCESSOR && (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0) continue; if (lun->be_lun->lun_type == T_CDROM && (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0) continue; if ((page_index->page_code & SMPH_PC_MASK) != (page_header->page_code & SMPH_PC_MASK)) continue; /* * If neither page has a subpage code, then we've got a * match. */ if (((page_index->page_code & SMPH_SPF) == 0) && ((page_header->page_code & SMPH_SPF) == 0)) { page_len = page_header->page_length; break; } /* * If both pages have subpages, then the subpage numbers * have to match. */ if ((page_index->page_code & SMPH_SPF) && (page_header->page_code & SMPH_SPF)) { struct scsi_mode_page_header_sp *sph; sph = (struct scsi_mode_page_header_sp *)page_header; if (page_index->subpage == sph->subpage) { page_len = scsi_2btoul(sph->page_length); break; } } } /* * If we couldn't find the page, or if we don't have a mode select * handler for it, send back an error to the user. */ if ((i >= CTL_NUM_MODE_PAGES) || (page_index->select_handler == NULL)) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 0, /*field*/ *len_used, /*bit_valid*/ 0, /*bit*/ 0); free(ctsio->kern_data_ptr, M_CTL); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } if (page_index->page_code & SMPH_SPF) { page_len_offset = 2; page_len_size = 2; } else { page_len_size = 1; page_len_offset = 1; } /* * If the length the initiator gives us isn't the one we specify in * the mode page header, or if they didn't specify enough data in * the CDB to avoid truncating this page, kick out the request. */ if (page_len != page_index->page_len - page_len_offset - page_len_size) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 0, /*field*/ *len_used + page_len_offset, /*bit_valid*/ 0, /*bit*/ 0); free(ctsio->kern_data_ptr, M_CTL); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } if (*len_left < page_index->page_len) { free(ctsio->kern_data_ptr, M_CTL); ctl_set_param_len_error(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * Run through the mode page, checking to make sure that the bits * the user changed are actually legal for him to change. */ for (i = 0; i < page_index->page_len; i++) { uint8_t *user_byte, *change_mask, *current_byte; int bad_bit; int j; user_byte = (uint8_t *)page_header + i; change_mask = page_index->page_data + (page_index->page_len * CTL_PAGE_CHANGEABLE) + i; current_byte = page_index->page_data + (page_index->page_len * CTL_PAGE_CURRENT) + i; /* * Check to see whether the user set any bits in this byte * that he is not allowed to set. */ if ((*user_byte & ~(*change_mask)) == (*current_byte & ~(*change_mask))) continue; /* * Go through bit by bit to determine which one is illegal. */ bad_bit = 0; for (j = 7; j >= 0; j--) { if ((((1 << i) & ~(*change_mask)) & *user_byte) != (((1 << i) & ~(*change_mask)) & *current_byte)) { bad_bit = i; break; } } ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 0, /*field*/ *len_used + i, /*bit_valid*/ 1, /*bit*/ bad_bit); free(ctsio->kern_data_ptr, M_CTL); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * Decrement these before we call the page handler, since we may * end up getting called back one way or another before the handler * returns to this context. */ *len_left -= page_index->page_len; *len_used += page_index->page_len; retval = page_index->select_handler(ctsio, page_index, (uint8_t *)page_header); /* * If the page handler returns CTL_RETVAL_QUEUED, then we need to * wait until this queued command completes to finish processing * the mode page. If it returns anything other than * CTL_RETVAL_COMPLETE (e.g. CTL_RETVAL_ERROR), then it should have * already set the sense information, freed the data pointer, and * completed the io for us. */ if (retval != CTL_RETVAL_COMPLETE) goto bailout_no_done; /* * If the initiator sent us more than one page, parse the next one. */ if (*len_left > 0) goto do_next_page; ctl_set_success(ctsio); free(ctsio->kern_data_ptr, M_CTL); ctl_done((union ctl_io *)ctsio); bailout_no_done: return (CTL_RETVAL_COMPLETE); } int ctl_mode_select(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); union ctl_modepage_info *modepage_info; int bd_len, i, header_size, param_len, pf, rtd, sp; uint32_t initidx; initidx = ctl_get_initindex(&ctsio->io_hdr.nexus); switch (ctsio->cdb[0]) { case MODE_SELECT_6: { struct scsi_mode_select_6 *cdb; cdb = (struct scsi_mode_select_6 *)ctsio->cdb; pf = (cdb->byte2 & SMS_PF) ? 1 : 0; rtd = (cdb->byte2 & SMS_RTD) ? 1 : 0; sp = (cdb->byte2 & SMS_SP) ? 1 : 0; param_len = cdb->length; header_size = sizeof(struct scsi_mode_header_6); break; } case MODE_SELECT_10: { struct scsi_mode_select_10 *cdb; cdb = (struct scsi_mode_select_10 *)ctsio->cdb; pf = (cdb->byte2 & SMS_PF) ? 1 : 0; rtd = (cdb->byte2 & SMS_RTD) ? 1 : 0; sp = (cdb->byte2 & SMS_SP) ? 1 : 0; param_len = scsi_2btoul(cdb->length); header_size = sizeof(struct scsi_mode_header_10); break; } default: ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } if (rtd) { if (param_len != 0) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 0, /*command*/ 1, /*field*/ 0, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* Revert to defaults. */ ctl_init_page_index(lun); mtx_lock(&lun->lun_lock); ctl_est_ua_all(lun, initidx, CTL_UA_MODE_CHANGE); mtx_unlock(&lun->lun_lock); for (i = 0; i < CTL_NUM_MODE_PAGES; i++) { ctl_isc_announce_mode(lun, -1, lun->mode_pages.index[i].page_code & SMPH_PC_MASK, lun->mode_pages.index[i].subpage); } ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * From SPC-3: * "A parameter list length of zero indicates that the Data-Out Buffer * shall be empty. This condition shall not be considered as an error." */ if (param_len == 0) { ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * Since we'll hit this the first time through, prior to * allocation, we don't need to free a data buffer here. */ if (param_len < header_size) { ctl_set_param_len_error(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * Allocate the data buffer and grab the user's data. In theory, * we shouldn't have to sanity check the parameter list length here * because the maximum size is 64K. We should be able to malloc * that much without too many problems. */ if ((ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) { ctsio->kern_data_ptr = malloc(param_len, M_CTL, M_WAITOK); ctsio->kern_data_len = param_len; ctsio->kern_total_len = param_len; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } switch (ctsio->cdb[0]) { case MODE_SELECT_6: { struct scsi_mode_header_6 *mh6; mh6 = (struct scsi_mode_header_6 *)ctsio->kern_data_ptr; bd_len = mh6->blk_desc_len; break; } case MODE_SELECT_10: { struct scsi_mode_header_10 *mh10; mh10 = (struct scsi_mode_header_10 *)ctsio->kern_data_ptr; bd_len = scsi_2btoul(mh10->blk_desc_len); break; } default: panic("%s: Invalid CDB type %#x", __func__, ctsio->cdb[0]); } if (param_len < (header_size + bd_len)) { free(ctsio->kern_data_ptr, M_CTL); ctl_set_param_len_error(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * Set the IO_CONT flag, so that if this I/O gets passed to * ctl_config_write_done(), it'll get passed back to * ctl_do_mode_select() for further processing, or completion if * we're all done. */ ctsio->io_hdr.flags |= CTL_FLAG_IO_CONT; ctsio->io_cont = ctl_do_mode_select; modepage_info = (union ctl_modepage_info *) ctsio->io_hdr.ctl_private[CTL_PRIV_MODEPAGE].bytes; memset(modepage_info, 0, sizeof(*modepage_info)); modepage_info->header.len_left = param_len - header_size - bd_len; modepage_info->header.len_used = header_size + bd_len; return (ctl_do_mode_select((union ctl_io *)ctsio)); } int ctl_mode_sense(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); int pc, page_code, dbd, llba, subpage; int alloc_len, page_len, header_len, total_len; struct scsi_mode_block_descr *block_desc; struct ctl_page_index *page_index; dbd = 0; llba = 0; block_desc = NULL; CTL_DEBUG_PRINT(("ctl_mode_sense\n")); switch (ctsio->cdb[0]) { case MODE_SENSE_6: { struct scsi_mode_sense_6 *cdb; cdb = (struct scsi_mode_sense_6 *)ctsio->cdb; header_len = sizeof(struct scsi_mode_hdr_6); if (cdb->byte2 & SMS_DBD) dbd = 1; else header_len += sizeof(struct scsi_mode_block_descr); pc = (cdb->page & SMS_PAGE_CTRL_MASK) >> 6; page_code = cdb->page & SMS_PAGE_CODE; subpage = cdb->subpage; alloc_len = cdb->length; break; } case MODE_SENSE_10: { struct scsi_mode_sense_10 *cdb; cdb = (struct scsi_mode_sense_10 *)ctsio->cdb; header_len = sizeof(struct scsi_mode_hdr_10); if (cdb->byte2 & SMS_DBD) dbd = 1; else header_len += sizeof(struct scsi_mode_block_descr); if (cdb->byte2 & SMS10_LLBAA) llba = 1; pc = (cdb->page & SMS_PAGE_CTRL_MASK) >> 6; page_code = cdb->page & SMS_PAGE_CODE; subpage = cdb->subpage; alloc_len = scsi_2btoul(cdb->length); break; } default: ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); break; /* NOTREACHED */ } /* * We have to make a first pass through to calculate the size of * the pages that match the user's query. Then we allocate enough * memory to hold it, and actually copy the data into the buffer. */ switch (page_code) { case SMS_ALL_PAGES_PAGE: { u_int i; page_len = 0; /* * At the moment, values other than 0 and 0xff here are * reserved according to SPC-3. */ if ((subpage != SMS_SUBPAGE_PAGE_0) && (subpage != SMS_SUBPAGE_ALL)) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 3, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } for (i = 0; i < CTL_NUM_MODE_PAGES; i++) { page_index = &lun->mode_pages.index[i]; /* Make sure the page is supported for this dev type */ if (lun->be_lun->lun_type == T_DIRECT && (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0) continue; if (lun->be_lun->lun_type == T_PROCESSOR && (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0) continue; if (lun->be_lun->lun_type == T_CDROM && (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0) continue; /* * We don't use this subpage if the user didn't * request all subpages. */ if ((page_index->subpage != 0) && (subpage == SMS_SUBPAGE_PAGE_0)) continue; #if 0 printf("found page %#x len %d\n", page_index->page_code & SMPH_PC_MASK, page_index->page_len); #endif page_len += page_index->page_len; } break; } default: { u_int i; page_len = 0; for (i = 0; i < CTL_NUM_MODE_PAGES; i++) { page_index = &lun->mode_pages.index[i]; /* Make sure the page is supported for this dev type */ if (lun->be_lun->lun_type == T_DIRECT && (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0) continue; if (lun->be_lun->lun_type == T_PROCESSOR && (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0) continue; if (lun->be_lun->lun_type == T_CDROM && (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0) continue; /* Look for the right page code */ if ((page_index->page_code & SMPH_PC_MASK) != page_code) continue; /* Look for the right subpage or the subpage wildcard*/ if ((page_index->subpage != subpage) && (subpage != SMS_SUBPAGE_ALL)) continue; #if 0 printf("found page %#x len %d\n", page_index->page_code & SMPH_PC_MASK, page_index->page_len); #endif page_len += page_index->page_len; } if (page_len == 0) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 5); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } break; } } total_len = header_len + page_len; #if 0 printf("header_len = %d, page_len = %d, total_len = %d\n", header_len, page_len, total_len); #endif ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(total_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; switch (ctsio->cdb[0]) { case MODE_SENSE_6: { struct scsi_mode_hdr_6 *header; header = (struct scsi_mode_hdr_6 *)ctsio->kern_data_ptr; header->datalen = MIN(total_len - 1, 254); if (lun->be_lun->lun_type == T_DIRECT) { header->dev_specific = 0x10; /* DPOFUA */ if ((lun->be_lun->flags & CTL_LUN_FLAG_READONLY) || (lun->MODE_CTRL.eca_and_aen & SCP_SWP) != 0) header->dev_specific |= 0x80; /* WP */ } if (dbd) header->block_descr_len = 0; else header->block_descr_len = sizeof(struct scsi_mode_block_descr); block_desc = (struct scsi_mode_block_descr *)&header[1]; break; } case MODE_SENSE_10: { struct scsi_mode_hdr_10 *header; int datalen; header = (struct scsi_mode_hdr_10 *)ctsio->kern_data_ptr; datalen = MIN(total_len - 2, 65533); scsi_ulto2b(datalen, header->datalen); if (lun->be_lun->lun_type == T_DIRECT) { header->dev_specific = 0x10; /* DPOFUA */ if ((lun->be_lun->flags & CTL_LUN_FLAG_READONLY) || (lun->MODE_CTRL.eca_and_aen & SCP_SWP) != 0) header->dev_specific |= 0x80; /* WP */ } if (dbd) scsi_ulto2b(0, header->block_descr_len); else scsi_ulto2b(sizeof(struct scsi_mode_block_descr), header->block_descr_len); block_desc = (struct scsi_mode_block_descr *)&header[1]; break; } default: panic("%s: Invalid CDB type %#x", __func__, ctsio->cdb[0]); } /* * If we've got a disk, use its blocksize in the block * descriptor. Otherwise, just set it to 0. */ if (dbd == 0) { if (lun->be_lun->lun_type == T_DIRECT) scsi_ulto3b(lun->be_lun->blocksize, block_desc->block_len); else scsi_ulto3b(0, block_desc->block_len); } switch (page_code) { case SMS_ALL_PAGES_PAGE: { int i, data_used; data_used = header_len; for (i = 0; i < CTL_NUM_MODE_PAGES; i++) { struct ctl_page_index *page_index; page_index = &lun->mode_pages.index[i]; if (lun->be_lun->lun_type == T_DIRECT && (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0) continue; if (lun->be_lun->lun_type == T_PROCESSOR && (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0) continue; if (lun->be_lun->lun_type == T_CDROM && (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0) continue; /* * We don't use this subpage if the user didn't * request all subpages. We already checked (above) * to make sure the user only specified a subpage * of 0 or 0xff in the SMS_ALL_PAGES_PAGE case. */ if ((page_index->subpage != 0) && (subpage == SMS_SUBPAGE_PAGE_0)) continue; /* * Call the handler, if it exists, to update the * page to the latest values. */ if (page_index->sense_handler != NULL) page_index->sense_handler(ctsio, page_index,pc); memcpy(ctsio->kern_data_ptr + data_used, page_index->page_data + (page_index->page_len * pc), page_index->page_len); data_used += page_index->page_len; } break; } default: { int i, data_used; data_used = header_len; for (i = 0; i < CTL_NUM_MODE_PAGES; i++) { struct ctl_page_index *page_index; page_index = &lun->mode_pages.index[i]; /* Look for the right page code */ if ((page_index->page_code & SMPH_PC_MASK) != page_code) continue; /* Look for the right subpage or the subpage wildcard*/ if ((page_index->subpage != subpage) && (subpage != SMS_SUBPAGE_ALL)) continue; /* Make sure the page is supported for this dev type */ if (lun->be_lun->lun_type == T_DIRECT && (page_index->page_flags & CTL_PAGE_FLAG_DIRECT) == 0) continue; if (lun->be_lun->lun_type == T_PROCESSOR && (page_index->page_flags & CTL_PAGE_FLAG_PROC) == 0) continue; if (lun->be_lun->lun_type == T_CDROM && (page_index->page_flags & CTL_PAGE_FLAG_CDROM) == 0) continue; /* * Call the handler, if it exists, to update the * page to the latest values. */ if (page_index->sense_handler != NULL) page_index->sense_handler(ctsio, page_index,pc); memcpy(ctsio->kern_data_ptr + data_used, page_index->page_data + (page_index->page_len * pc), page_index->page_len); data_used += page_index->page_len; } break; } } ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_lbp_log_sense_handler(struct ctl_scsiio *ctsio, struct ctl_page_index *page_index, int pc) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_log_param_header *phdr; uint8_t *data; uint64_t val; data = page_index->page_data; if (lun->backend->lun_attr != NULL && (val = lun->backend->lun_attr(lun->be_lun->be_lun, "blocksavail")) != UINT64_MAX) { phdr = (struct scsi_log_param_header *)data; scsi_ulto2b(0x0001, phdr->param_code); phdr->param_control = SLP_LBIN | SLP_LP; phdr->param_len = 8; data = (uint8_t *)(phdr + 1); scsi_ulto4b(val >> CTL_LBP_EXPONENT, data); data[4] = 0x02; /* per-pool */ data += phdr->param_len; } if (lun->backend->lun_attr != NULL && (val = lun->backend->lun_attr(lun->be_lun->be_lun, "blocksused")) != UINT64_MAX) { phdr = (struct scsi_log_param_header *)data; scsi_ulto2b(0x0002, phdr->param_code); phdr->param_control = SLP_LBIN | SLP_LP; phdr->param_len = 8; data = (uint8_t *)(phdr + 1); scsi_ulto4b(val >> CTL_LBP_EXPONENT, data); data[4] = 0x01; /* per-LUN */ data += phdr->param_len; } if (lun->backend->lun_attr != NULL && (val = lun->backend->lun_attr(lun->be_lun->be_lun, "poolblocksavail")) != UINT64_MAX) { phdr = (struct scsi_log_param_header *)data; scsi_ulto2b(0x00f1, phdr->param_code); phdr->param_control = SLP_LBIN | SLP_LP; phdr->param_len = 8; data = (uint8_t *)(phdr + 1); scsi_ulto4b(val >> CTL_LBP_EXPONENT, data); data[4] = 0x02; /* per-pool */ data += phdr->param_len; } if (lun->backend->lun_attr != NULL && (val = lun->backend->lun_attr(lun->be_lun->be_lun, "poolblocksused")) != UINT64_MAX) { phdr = (struct scsi_log_param_header *)data; scsi_ulto2b(0x00f2, phdr->param_code); phdr->param_control = SLP_LBIN | SLP_LP; phdr->param_len = 8; data = (uint8_t *)(phdr + 1); scsi_ulto4b(val >> CTL_LBP_EXPONENT, data); data[4] = 0x02; /* per-pool */ data += phdr->param_len; } page_index->page_len = data - page_index->page_data; return (0); } int ctl_sap_log_sense_handler(struct ctl_scsiio *ctsio, struct ctl_page_index *page_index, int pc) { struct ctl_lun *lun = CTL_LUN(ctsio); struct stat_page *data; struct bintime *t; data = (struct stat_page *)page_index->page_data; scsi_ulto2b(SLP_SAP, data->sap.hdr.param_code); data->sap.hdr.param_control = SLP_LBIN; data->sap.hdr.param_len = sizeof(struct scsi_log_stat_and_perf) - sizeof(struct scsi_log_param_header); scsi_u64to8b(lun->stats.operations[CTL_STATS_READ], data->sap.read_num); scsi_u64to8b(lun->stats.operations[CTL_STATS_WRITE], data->sap.write_num); if (lun->be_lun->blocksize > 0) { scsi_u64to8b(lun->stats.bytes[CTL_STATS_WRITE] / lun->be_lun->blocksize, data->sap.recvieved_lba); scsi_u64to8b(lun->stats.bytes[CTL_STATS_READ] / lun->be_lun->blocksize, data->sap.transmitted_lba); } t = &lun->stats.time[CTL_STATS_READ]; scsi_u64to8b((uint64_t)t->sec * 1000 + t->frac / (UINT64_MAX / 1000), data->sap.read_int); t = &lun->stats.time[CTL_STATS_WRITE]; scsi_u64to8b((uint64_t)t->sec * 1000 + t->frac / (UINT64_MAX / 1000), data->sap.write_int); scsi_u64to8b(0, data->sap.weighted_num); scsi_u64to8b(0, data->sap.weighted_int); scsi_ulto2b(SLP_IT, data->it.hdr.param_code); data->it.hdr.param_control = SLP_LBIN; data->it.hdr.param_len = sizeof(struct scsi_log_idle_time) - sizeof(struct scsi_log_param_header); #ifdef CTL_TIME_IO scsi_u64to8b(lun->idle_time / SBT_1MS, data->it.idle_int); #endif scsi_ulto2b(SLP_TI, data->ti.hdr.param_code); data->it.hdr.param_control = SLP_LBIN; data->ti.hdr.param_len = sizeof(struct scsi_log_time_interval) - sizeof(struct scsi_log_param_header); scsi_ulto4b(3, data->ti.exponent); scsi_ulto4b(1, data->ti.integer); return (0); } int ctl_ie_log_sense_handler(struct ctl_scsiio *ctsio, struct ctl_page_index *page_index, int pc) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_log_informational_exceptions *data; data = (struct scsi_log_informational_exceptions *)page_index->page_data; scsi_ulto2b(SLP_IE_GEN, data->hdr.param_code); data->hdr.param_control = SLP_LBIN; data->hdr.param_len = sizeof(struct scsi_log_informational_exceptions) - sizeof(struct scsi_log_param_header); data->ie_asc = lun->ie_asc; data->ie_ascq = lun->ie_ascq; data->temperature = 0xff; return (0); } int ctl_log_sense(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); int i, pc, page_code, subpage; int alloc_len, total_len; struct ctl_page_index *page_index; struct scsi_log_sense *cdb; struct scsi_log_header *header; CTL_DEBUG_PRINT(("ctl_log_sense\n")); cdb = (struct scsi_log_sense *)ctsio->cdb; pc = (cdb->page & SLS_PAGE_CTRL_MASK) >> 6; page_code = cdb->page & SLS_PAGE_CODE; subpage = cdb->subpage; alloc_len = scsi_2btoul(cdb->length); page_index = NULL; for (i = 0; i < CTL_NUM_LOG_PAGES; i++) { page_index = &lun->log_pages.index[i]; /* Look for the right page code */ if ((page_index->page_code & SL_PAGE_CODE) != page_code) continue; /* Look for the right subpage or the subpage wildcard*/ if (page_index->subpage != subpage) continue; break; } if (i >= CTL_NUM_LOG_PAGES) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } total_len = sizeof(struct scsi_log_header) + page_index->page_len; ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(total_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; header = (struct scsi_log_header *)ctsio->kern_data_ptr; header->page = page_index->page_code; if (page_index->page_code == SLS_LOGICAL_BLOCK_PROVISIONING) header->page |= SL_DS; if (page_index->subpage) { header->page |= SL_SPF; header->subpage = page_index->subpage; } scsi_ulto2b(page_index->page_len, header->datalen); /* * Call the handler, if it exists, to update the * page to the latest values. */ if (page_index->sense_handler != NULL) page_index->sense_handler(ctsio, page_index, pc); memcpy(header + 1, page_index->page_data, page_index->page_len); ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_read_capacity(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_read_capacity *cdb; struct scsi_read_capacity_data *data; uint32_t lba; CTL_DEBUG_PRINT(("ctl_read_capacity\n")); cdb = (struct scsi_read_capacity *)ctsio->cdb; lba = scsi_4btoul(cdb->addr); if (((cdb->pmi & SRC_PMI) == 0) && (lba != 0)) { ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } ctsio->kern_data_ptr = malloc(sizeof(*data), M_CTL, M_WAITOK | M_ZERO); data = (struct scsi_read_capacity_data *)ctsio->kern_data_ptr; ctsio->kern_data_len = sizeof(*data); ctsio->kern_total_len = sizeof(*data); ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; /* * If the maximum LBA is greater than 0xfffffffe, the user must * issue a SERVICE ACTION IN (16) command, with the read capacity * serivce action set. */ if (lun->be_lun->maxlba > 0xfffffffe) scsi_ulto4b(0xffffffff, data->addr); else scsi_ulto4b(lun->be_lun->maxlba, data->addr); /* * XXX KDM this may not be 512 bytes... */ scsi_ulto4b(lun->be_lun->blocksize, data->length); ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_read_capacity_16(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_read_capacity_16 *cdb; struct scsi_read_capacity_data_long *data; uint64_t lba; uint32_t alloc_len; CTL_DEBUG_PRINT(("ctl_read_capacity_16\n")); cdb = (struct scsi_read_capacity_16 *)ctsio->cdb; alloc_len = scsi_4btoul(cdb->alloc_len); lba = scsi_8btou64(cdb->addr); if ((cdb->reladr & SRC16_PMI) && (lba != 0)) { ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } ctsio->kern_data_ptr = malloc(sizeof(*data), M_CTL, M_WAITOK | M_ZERO); data = (struct scsi_read_capacity_data_long *)ctsio->kern_data_ptr; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(sizeof(*data), alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; scsi_u64to8b(lun->be_lun->maxlba, data->addr); /* XXX KDM this may not be 512 bytes... */ scsi_ulto4b(lun->be_lun->blocksize, data->length); data->prot_lbppbe = lun->be_lun->pblockexp & SRC16_LBPPBE; scsi_ulto2b(lun->be_lun->pblockoff & SRC16_LALBA_A, data->lalba_lbp); if (lun->be_lun->flags & CTL_LUN_FLAG_UNMAP) data->lalba_lbp[0] |= SRC16_LBPME | SRC16_LBPRZ; ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_get_lba_status(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_get_lba_status *cdb; struct scsi_get_lba_status_data *data; struct ctl_lba_len_flags *lbalen; uint64_t lba; uint32_t alloc_len, total_len; int retval; CTL_DEBUG_PRINT(("ctl_get_lba_status\n")); cdb = (struct scsi_get_lba_status *)ctsio->cdb; lba = scsi_8btou64(cdb->addr); alloc_len = scsi_4btoul(cdb->alloc_len); if (lba > lun->be_lun->maxlba) { ctl_set_lba_out_of_range(ctsio, lba); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } total_len = sizeof(*data) + sizeof(data->descr[0]); ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO); data = (struct scsi_get_lba_status_data *)ctsio->kern_data_ptr; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(total_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* Fill dummy data in case backend can't tell anything. */ scsi_ulto4b(4 + sizeof(data->descr[0]), data->length); scsi_u64to8b(lba, data->descr[0].addr); scsi_ulto4b(MIN(UINT32_MAX, lun->be_lun->maxlba + 1 - lba), data->descr[0].length); data->descr[0].status = 0; /* Mapped or unknown. */ ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; lbalen = (struct ctl_lba_len_flags *)&ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN]; lbalen->lba = lba; lbalen->len = total_len; lbalen->flags = 0; retval = lun->backend->config_read((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_read_defect(struct ctl_scsiio *ctsio) { struct scsi_read_defect_data_10 *ccb10; struct scsi_read_defect_data_12 *ccb12; struct scsi_read_defect_data_hdr_10 *data10; struct scsi_read_defect_data_hdr_12 *data12; uint32_t alloc_len, data_len; uint8_t format; CTL_DEBUG_PRINT(("ctl_read_defect\n")); if (ctsio->cdb[0] == READ_DEFECT_DATA_10) { ccb10 = (struct scsi_read_defect_data_10 *)&ctsio->cdb; format = ccb10->format; alloc_len = scsi_2btoul(ccb10->alloc_length); data_len = sizeof(*data10); } else { ccb12 = (struct scsi_read_defect_data_12 *)&ctsio->cdb; format = ccb12->format; alloc_len = scsi_4btoul(ccb12->alloc_length); data_len = sizeof(*data12); } if (alloc_len == 0) { ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; if (ctsio->cdb[0] == READ_DEFECT_DATA_10) { data10 = (struct scsi_read_defect_data_hdr_10 *) ctsio->kern_data_ptr; data10->format = format; scsi_ulto2b(0, data10->length); } else { data12 = (struct scsi_read_defect_data_hdr_12 *) ctsio->kern_data_ptr; data12->format = format; scsi_ulto2b(0, data12->generation); scsi_ulto4b(0, data12->length); } ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_report_tagret_port_groups(struct ctl_scsiio *ctsio) { struct ctl_softc *softc = CTL_SOFTC(ctsio); struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_maintenance_in *cdb; int retval; int alloc_len, ext, total_len = 0, g, pc, pg, ts, os; int num_ha_groups, num_target_ports, shared_group; struct ctl_port *port; struct scsi_target_group_data *rtg_ptr; struct scsi_target_group_data_extended *rtg_ext_ptr; struct scsi_target_port_group_descriptor *tpg_desc; CTL_DEBUG_PRINT(("ctl_report_tagret_port_groups\n")); cdb = (struct scsi_maintenance_in *)ctsio->cdb; retval = CTL_RETVAL_COMPLETE; switch (cdb->byte2 & STG_PDF_MASK) { case STG_PDF_LENGTH: ext = 0; break; case STG_PDF_EXTENDED: ext = 1; break; default: ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 5); ctl_done((union ctl_io *)ctsio); return(retval); } num_target_ports = 0; shared_group = (softc->is_single != 0); mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(port, &softc->port_list, links) { if ((port->status & CTL_PORT_STATUS_ONLINE) == 0) continue; if (ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX) continue; num_target_ports++; if (port->status & CTL_PORT_STATUS_HA_SHARED) shared_group = 1; } mtx_unlock(&softc->ctl_lock); num_ha_groups = (softc->is_single) ? 0 : NUM_HA_SHELVES; if (ext) total_len = sizeof(struct scsi_target_group_data_extended); else total_len = sizeof(struct scsi_target_group_data); total_len += sizeof(struct scsi_target_port_group_descriptor) * (shared_group + num_ha_groups) + sizeof(struct scsi_target_port_descriptor) * num_target_ports; alloc_len = scsi_4btoul(cdb->length); ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(total_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; if (ext) { rtg_ext_ptr = (struct scsi_target_group_data_extended *) ctsio->kern_data_ptr; scsi_ulto4b(total_len - 4, rtg_ext_ptr->length); rtg_ext_ptr->format_type = 0x10; rtg_ext_ptr->implicit_transition_time = 0; tpg_desc = &rtg_ext_ptr->groups[0]; } else { rtg_ptr = (struct scsi_target_group_data *) ctsio->kern_data_ptr; scsi_ulto4b(total_len - 4, rtg_ptr->length); tpg_desc = &rtg_ptr->groups[0]; } mtx_lock(&softc->ctl_lock); pg = softc->port_min / softc->port_cnt; if (lun->flags & (CTL_LUN_PRIMARY_SC | CTL_LUN_PEER_SC_PRIMARY)) { /* Some shelf is known to be primary. */ if (softc->ha_link == CTL_HA_LINK_OFFLINE) os = TPG_ASYMMETRIC_ACCESS_UNAVAILABLE; else if (softc->ha_link == CTL_HA_LINK_UNKNOWN) os = TPG_ASYMMETRIC_ACCESS_TRANSITIONING; else if (softc->ha_mode == CTL_HA_MODE_ACT_STBY) os = TPG_ASYMMETRIC_ACCESS_STANDBY; else os = TPG_ASYMMETRIC_ACCESS_NONOPTIMIZED; if (lun->flags & CTL_LUN_PRIMARY_SC) { ts = TPG_ASYMMETRIC_ACCESS_OPTIMIZED; } else { ts = os; os = TPG_ASYMMETRIC_ACCESS_OPTIMIZED; } } else { /* No known primary shelf. */ if (softc->ha_link == CTL_HA_LINK_OFFLINE) { ts = TPG_ASYMMETRIC_ACCESS_UNAVAILABLE; os = TPG_ASYMMETRIC_ACCESS_OPTIMIZED; } else if (softc->ha_link == CTL_HA_LINK_UNKNOWN) { ts = TPG_ASYMMETRIC_ACCESS_TRANSITIONING; os = TPG_ASYMMETRIC_ACCESS_OPTIMIZED; } else { ts = os = TPG_ASYMMETRIC_ACCESS_TRANSITIONING; } } if (shared_group) { tpg_desc->pref_state = ts; tpg_desc->support = TPG_AO_SUP | TPG_AN_SUP | TPG_S_SUP | TPG_U_SUP | TPG_T_SUP; scsi_ulto2b(1, tpg_desc->target_port_group); tpg_desc->status = TPG_IMPLICIT; pc = 0; STAILQ_FOREACH(port, &softc->port_list, links) { if ((port->status & CTL_PORT_STATUS_ONLINE) == 0) continue; if (!softc->is_single && (port->status & CTL_PORT_STATUS_HA_SHARED) == 0) continue; if (ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX) continue; scsi_ulto2b(port->targ_port, tpg_desc->descriptors[pc]. relative_target_port_identifier); pc++; } tpg_desc->target_port_count = pc; tpg_desc = (struct scsi_target_port_group_descriptor *) &tpg_desc->descriptors[pc]; } for (g = 0; g < num_ha_groups; g++) { tpg_desc->pref_state = (g == pg) ? ts : os; tpg_desc->support = TPG_AO_SUP | TPG_AN_SUP | TPG_S_SUP | TPG_U_SUP | TPG_T_SUP; scsi_ulto2b(2 + g, tpg_desc->target_port_group); tpg_desc->status = TPG_IMPLICIT; pc = 0; STAILQ_FOREACH(port, &softc->port_list, links) { if (port->targ_port < g * softc->port_cnt || port->targ_port >= (g + 1) * softc->port_cnt) continue; if ((port->status & CTL_PORT_STATUS_ONLINE) == 0) continue; if (port->status & CTL_PORT_STATUS_HA_SHARED) continue; if (ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX) continue; scsi_ulto2b(port->targ_port, tpg_desc->descriptors[pc]. relative_target_port_identifier); pc++; } tpg_desc->target_port_count = pc; tpg_desc = (struct scsi_target_port_group_descriptor *) &tpg_desc->descriptors[pc]; } mtx_unlock(&softc->ctl_lock); ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return(retval); } int ctl_report_supported_opcodes(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_report_supported_opcodes *cdb; const struct ctl_cmd_entry *entry, *sentry; struct scsi_report_supported_opcodes_all *all; struct scsi_report_supported_opcodes_descr *descr; struct scsi_report_supported_opcodes_one *one; int retval; int alloc_len, total_len; int opcode, service_action, i, j, num; CTL_DEBUG_PRINT(("ctl_report_supported_opcodes\n")); cdb = (struct scsi_report_supported_opcodes *)ctsio->cdb; retval = CTL_RETVAL_COMPLETE; opcode = cdb->requested_opcode; service_action = scsi_2btoul(cdb->requested_service_action); switch (cdb->options & RSO_OPTIONS_MASK) { case RSO_OPTIONS_ALL: num = 0; for (i = 0; i < 256; i++) { entry = &ctl_cmd_table[i]; if (entry->flags & CTL_CMD_FLAG_SA5) { for (j = 0; j < 32; j++) { sentry = &((const struct ctl_cmd_entry *) entry->execute)[j]; if (ctl_cmd_applicable( lun->be_lun->lun_type, sentry)) num++; } } else { if (ctl_cmd_applicable(lun->be_lun->lun_type, entry)) num++; } } total_len = sizeof(struct scsi_report_supported_opcodes_all) + num * sizeof(struct scsi_report_supported_opcodes_descr); break; case RSO_OPTIONS_OC: if (ctl_cmd_table[opcode].flags & CTL_CMD_FLAG_SA5) { ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 2); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } total_len = sizeof(struct scsi_report_supported_opcodes_one) + 32; break; case RSO_OPTIONS_OC_SA: if ((ctl_cmd_table[opcode].flags & CTL_CMD_FLAG_SA5) == 0 || service_action >= 32) { ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 2); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* FALLTHROUGH */ case RSO_OPTIONS_OC_ASA: total_len = sizeof(struct scsi_report_supported_opcodes_one) + 32; break; default: ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 2); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } alloc_len = scsi_4btoul(cdb->length); ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(total_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; switch (cdb->options & RSO_OPTIONS_MASK) { case RSO_OPTIONS_ALL: all = (struct scsi_report_supported_opcodes_all *) ctsio->kern_data_ptr; num = 0; for (i = 0; i < 256; i++) { entry = &ctl_cmd_table[i]; if (entry->flags & CTL_CMD_FLAG_SA5) { for (j = 0; j < 32; j++) { sentry = &((const struct ctl_cmd_entry *) entry->execute)[j]; if (!ctl_cmd_applicable( lun->be_lun->lun_type, sentry)) continue; descr = &all->descr[num++]; descr->opcode = i; scsi_ulto2b(j, descr->service_action); descr->flags = RSO_SERVACTV; scsi_ulto2b(sentry->length, descr->cdb_length); } } else { if (!ctl_cmd_applicable(lun->be_lun->lun_type, entry)) continue; descr = &all->descr[num++]; descr->opcode = i; scsi_ulto2b(0, descr->service_action); descr->flags = 0; scsi_ulto2b(entry->length, descr->cdb_length); } } scsi_ulto4b( num * sizeof(struct scsi_report_supported_opcodes_descr), all->length); break; case RSO_OPTIONS_OC: one = (struct scsi_report_supported_opcodes_one *) ctsio->kern_data_ptr; entry = &ctl_cmd_table[opcode]; goto fill_one; case RSO_OPTIONS_OC_SA: one = (struct scsi_report_supported_opcodes_one *) ctsio->kern_data_ptr; entry = &ctl_cmd_table[opcode]; entry = &((const struct ctl_cmd_entry *) entry->execute)[service_action]; fill_one: if (ctl_cmd_applicable(lun->be_lun->lun_type, entry)) { one->support = 3; scsi_ulto2b(entry->length, one->cdb_length); one->cdb_usage[0] = opcode; memcpy(&one->cdb_usage[1], entry->usage, entry->length - 1); } else one->support = 1; break; case RSO_OPTIONS_OC_ASA: one = (struct scsi_report_supported_opcodes_one *) ctsio->kern_data_ptr; entry = &ctl_cmd_table[opcode]; if (entry->flags & CTL_CMD_FLAG_SA5) { entry = &((const struct ctl_cmd_entry *) entry->execute)[service_action]; } else if (service_action != 0) { one->support = 1; break; } goto fill_one; } ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return(retval); } int ctl_report_supported_tmf(struct ctl_scsiio *ctsio) { struct scsi_report_supported_tmf *cdb; struct scsi_report_supported_tmf_ext_data *data; int retval; int alloc_len, total_len; CTL_DEBUG_PRINT(("ctl_report_supported_tmf\n")); cdb = (struct scsi_report_supported_tmf *)ctsio->cdb; retval = CTL_RETVAL_COMPLETE; if (cdb->options & RST_REPD) total_len = sizeof(struct scsi_report_supported_tmf_ext_data); else total_len = sizeof(struct scsi_report_supported_tmf_data); alloc_len = scsi_4btoul(cdb->length); ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(total_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; data = (struct scsi_report_supported_tmf_ext_data *)ctsio->kern_data_ptr; data->byte1 |= RST_ATS | RST_ATSS | RST_CTSS | RST_LURS | RST_QTS | RST_TRS; data->byte2 |= RST_QAES | RST_QTSS | RST_ITNRS; data->length = total_len - 4; ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (retval); } int ctl_report_timestamp(struct ctl_scsiio *ctsio) { struct scsi_report_timestamp *cdb; struct scsi_report_timestamp_data *data; struct timeval tv; int64_t timestamp; int retval; int alloc_len, total_len; CTL_DEBUG_PRINT(("ctl_report_timestamp\n")); cdb = (struct scsi_report_timestamp *)ctsio->cdb; retval = CTL_RETVAL_COMPLETE; total_len = sizeof(struct scsi_report_timestamp_data); alloc_len = scsi_4btoul(cdb->length); ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(total_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; data = (struct scsi_report_timestamp_data *)ctsio->kern_data_ptr; scsi_ulto2b(sizeof(*data) - 2, data->length); data->origin = RTS_ORIG_OUTSIDE; getmicrotime(&tv); timestamp = (int64_t)tv.tv_sec * 1000 + tv.tv_usec / 1000; scsi_ulto4b(timestamp >> 16, data->timestamp); scsi_ulto2b(timestamp & 0xffff, &data->timestamp[4]); ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (retval); } int ctl_persistent_reserve_in(struct ctl_scsiio *ctsio) { struct ctl_softc *softc = CTL_SOFTC(ctsio); struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_per_res_in *cdb; int alloc_len, total_len = 0; /* struct scsi_per_res_in_rsrv in_data; */ uint64_t key; CTL_DEBUG_PRINT(("ctl_persistent_reserve_in\n")); cdb = (struct scsi_per_res_in *)ctsio->cdb; alloc_len = scsi_2btoul(cdb->length); retry: mtx_lock(&lun->lun_lock); switch (cdb->action) { case SPRI_RK: /* read keys */ total_len = sizeof(struct scsi_per_res_in_keys) + lun->pr_key_count * sizeof(struct scsi_per_res_key); break; case SPRI_RR: /* read reservation */ if (lun->flags & CTL_LUN_PR_RESERVED) total_len = sizeof(struct scsi_per_res_in_rsrv); else total_len = sizeof(struct scsi_per_res_in_header); break; case SPRI_RC: /* report capabilities */ total_len = sizeof(struct scsi_per_res_cap); break; case SPRI_RS: /* read full status */ total_len = sizeof(struct scsi_per_res_in_header) + (sizeof(struct scsi_per_res_in_full_desc) + 256) * lun->pr_key_count; break; default: panic("%s: Invalid PR type %#x", __func__, cdb->action); } mtx_unlock(&lun->lun_lock); ctsio->kern_data_ptr = malloc(total_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(total_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; mtx_lock(&lun->lun_lock); switch (cdb->action) { case SPRI_RK: { // read keys struct scsi_per_res_in_keys *res_keys; int i, key_count; res_keys = (struct scsi_per_res_in_keys*)ctsio->kern_data_ptr; /* * We had to drop the lock to allocate our buffer, which * leaves time for someone to come in with another * persistent reservation. (That is unlikely, though, * since this should be the only persistent reservation * command active right now.) */ if (total_len != (sizeof(struct scsi_per_res_in_keys) + (lun->pr_key_count * sizeof(struct scsi_per_res_key)))){ mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); printf("%s: reservation length changed, retrying\n", __func__); goto retry; } scsi_ulto4b(lun->pr_generation, res_keys->header.generation); scsi_ulto4b(sizeof(struct scsi_per_res_key) * lun->pr_key_count, res_keys->header.length); for (i = 0, key_count = 0; i < CTL_MAX_INITIATORS; i++) { if ((key = ctl_get_prkey(lun, i)) == 0) continue; /* * We used lun->pr_key_count to calculate the * size to allocate. If it turns out the number of * initiators with the registered flag set is * larger than that (i.e. they haven't been kept in * sync), we've got a problem. */ if (key_count >= lun->pr_key_count) { key_count++; continue; } scsi_u64to8b(key, res_keys->keys[key_count].key); key_count++; } break; } case SPRI_RR: { // read reservation struct scsi_per_res_in_rsrv *res; int tmp_len, header_only; res = (struct scsi_per_res_in_rsrv *)ctsio->kern_data_ptr; scsi_ulto4b(lun->pr_generation, res->header.generation); if (lun->flags & CTL_LUN_PR_RESERVED) { tmp_len = sizeof(struct scsi_per_res_in_rsrv); scsi_ulto4b(sizeof(struct scsi_per_res_in_rsrv_data), res->header.length); header_only = 0; } else { tmp_len = sizeof(struct scsi_per_res_in_header); scsi_ulto4b(0, res->header.length); header_only = 1; } /* * We had to drop the lock to allocate our buffer, which * leaves time for someone to come in with another * persistent reservation. (That is unlikely, though, * since this should be the only persistent reservation * command active right now.) */ if (tmp_len != total_len) { mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); printf("%s: reservation status changed, retrying\n", __func__); goto retry; } /* * No reservation held, so we're done. */ if (header_only != 0) break; /* * If the registration is an All Registrants type, the key * is 0, since it doesn't really matter. */ if (lun->pr_res_idx != CTL_PR_ALL_REGISTRANTS) { scsi_u64to8b(ctl_get_prkey(lun, lun->pr_res_idx), res->data.reservation); } res->data.scopetype = lun->pr_res_type; break; } case SPRI_RC: //report capabilities { struct scsi_per_res_cap *res_cap; uint16_t type_mask; res_cap = (struct scsi_per_res_cap *)ctsio->kern_data_ptr; scsi_ulto2b(sizeof(*res_cap), res_cap->length); res_cap->flags1 = SPRI_CRH; res_cap->flags2 = SPRI_TMV | SPRI_ALLOW_5; type_mask = SPRI_TM_WR_EX_AR | SPRI_TM_EX_AC_RO | SPRI_TM_WR_EX_RO | SPRI_TM_EX_AC | SPRI_TM_WR_EX | SPRI_TM_EX_AC_AR; scsi_ulto2b(type_mask, res_cap->type_mask); break; } case SPRI_RS: { // read full status struct scsi_per_res_in_full *res_status; struct scsi_per_res_in_full_desc *res_desc; struct ctl_port *port; int i, len; res_status = (struct scsi_per_res_in_full*)ctsio->kern_data_ptr; /* * We had to drop the lock to allocate our buffer, which * leaves time for someone to come in with another * persistent reservation. (That is unlikely, though, * since this should be the only persistent reservation * command active right now.) */ if (total_len < (sizeof(struct scsi_per_res_in_header) + (sizeof(struct scsi_per_res_in_full_desc) + 256) * lun->pr_key_count)){ mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); printf("%s: reservation length changed, retrying\n", __func__); goto retry; } scsi_ulto4b(lun->pr_generation, res_status->header.generation); res_desc = &res_status->desc[0]; for (i = 0; i < CTL_MAX_INITIATORS; i++) { if ((key = ctl_get_prkey(lun, i)) == 0) continue; scsi_u64to8b(key, res_desc->res_key.key); if ((lun->flags & CTL_LUN_PR_RESERVED) && (lun->pr_res_idx == i || lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS)) { res_desc->flags = SPRI_FULL_R_HOLDER; res_desc->scopetype = lun->pr_res_type; } scsi_ulto2b(i / CTL_MAX_INIT_PER_PORT, res_desc->rel_trgt_port_id); len = 0; port = softc->ctl_ports[i / CTL_MAX_INIT_PER_PORT]; if (port != NULL) len = ctl_create_iid(port, i % CTL_MAX_INIT_PER_PORT, res_desc->transport_id); scsi_ulto4b(len, res_desc->additional_length); res_desc = (struct scsi_per_res_in_full_desc *) &res_desc->transport_id[len]; } scsi_ulto4b((uint8_t *)res_desc - (uint8_t *)&res_status->desc[0], res_status->header.length); break; } default: panic("%s: Invalid PR type %#x", __func__, cdb->action); } mtx_unlock(&lun->lun_lock); ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * Returns 0 if ctl_persistent_reserve_out() should continue, non-zero if * it should return. */ static int ctl_pro_preempt(struct ctl_softc *softc, struct ctl_lun *lun, uint64_t res_key, uint64_t sa_res_key, uint8_t type, uint32_t residx, struct ctl_scsiio *ctsio, struct scsi_per_res_out *cdb, struct scsi_per_res_out_parms* param) { union ctl_ha_msg persis_io; int i; mtx_lock(&lun->lun_lock); if (sa_res_key == 0) { if (lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS) { /* validate scope and type */ if ((cdb->scope_type & SPR_SCOPE_MASK) != SPR_LU_SCOPE) { mtx_unlock(&lun->lun_lock); ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 4); ctl_done((union ctl_io *)ctsio); return (1); } if (type>8 || type==2 || type==4 || type==0) { mtx_unlock(&lun->lun_lock); ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (1); } /* * Unregister everybody else and build UA for * them */ for(i = 0; i < CTL_MAX_INITIATORS; i++) { if (i == residx || ctl_get_prkey(lun, i) == 0) continue; ctl_clr_prkey(lun, i); ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT); } lun->pr_key_count = 1; lun->pr_res_type = type; if (lun->pr_res_type != SPR_TYPE_WR_EX_AR && lun->pr_res_type != SPR_TYPE_EX_AC_AR) lun->pr_res_idx = residx; lun->pr_generation++; mtx_unlock(&lun->lun_lock); /* send msg to other side */ persis_io.hdr.nexus = ctsio->io_hdr.nexus; persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION; persis_io.pr.pr_info.action = CTL_PR_PREEMPT; persis_io.pr.pr_info.residx = lun->pr_res_idx; persis_io.pr.pr_info.res_type = type; memcpy(persis_io.pr.pr_info.sa_res_key, param->serv_act_res_key, sizeof(param->serv_act_res_key)); ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io, sizeof(persis_io.pr), M_WAITOK); } else { /* not all registrants */ mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 0, /*field*/ 8, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (1); } } else if (lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS || !(lun->flags & CTL_LUN_PR_RESERVED)) { int found = 0; if (res_key == sa_res_key) { /* special case */ /* * The spec implies this is not good but doesn't * say what to do. There are two choices either * generate a res conflict or check condition * with illegal field in parameter data. Since * that is what is done when the sa_res_key is * zero I'll take that approach since this has * to do with the sa_res_key. */ mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 0, /*field*/ 8, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (1); } for (i = 0; i < CTL_MAX_INITIATORS; i++) { if (ctl_get_prkey(lun, i) != sa_res_key) continue; found = 1; ctl_clr_prkey(lun, i); lun->pr_key_count--; ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT); } if (!found) { mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); ctl_set_reservation_conflict(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } lun->pr_generation++; mtx_unlock(&lun->lun_lock); /* send msg to other side */ persis_io.hdr.nexus = ctsio->io_hdr.nexus; persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION; persis_io.pr.pr_info.action = CTL_PR_PREEMPT; persis_io.pr.pr_info.residx = lun->pr_res_idx; persis_io.pr.pr_info.res_type = type; memcpy(persis_io.pr.pr_info.sa_res_key, param->serv_act_res_key, sizeof(param->serv_act_res_key)); ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io, sizeof(persis_io.pr), M_WAITOK); } else { /* Reserved but not all registrants */ /* sa_res_key is res holder */ if (sa_res_key == ctl_get_prkey(lun, lun->pr_res_idx)) { /* validate scope and type */ if ((cdb->scope_type & SPR_SCOPE_MASK) != SPR_LU_SCOPE) { mtx_unlock(&lun->lun_lock); ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 4); ctl_done((union ctl_io *)ctsio); return (1); } if (type>8 || type==2 || type==4 || type==0) { mtx_unlock(&lun->lun_lock); ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (1); } /* * Do the following: * if sa_res_key != res_key remove all * registrants w/sa_res_key and generate UA * for these registrants(Registrations * Preempted) if it wasn't an exclusive * reservation generate UA(Reservations * Preempted) for all other registered nexuses * if the type has changed. Establish the new * reservation and holder. If res_key and * sa_res_key are the same do the above * except don't unregister the res holder. */ for(i = 0; i < CTL_MAX_INITIATORS; i++) { if (i == residx || ctl_get_prkey(lun, i) == 0) continue; if (sa_res_key == ctl_get_prkey(lun, i)) { ctl_clr_prkey(lun, i); lun->pr_key_count--; ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT); } else if (type != lun->pr_res_type && (lun->pr_res_type == SPR_TYPE_WR_EX_RO || lun->pr_res_type == SPR_TYPE_EX_AC_RO)) { ctl_est_ua(lun, i, CTL_UA_RES_RELEASE); } } lun->pr_res_type = type; if (lun->pr_res_type != SPR_TYPE_WR_EX_AR && lun->pr_res_type != SPR_TYPE_EX_AC_AR) lun->pr_res_idx = residx; else lun->pr_res_idx = CTL_PR_ALL_REGISTRANTS; lun->pr_generation++; mtx_unlock(&lun->lun_lock); persis_io.hdr.nexus = ctsio->io_hdr.nexus; persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION; persis_io.pr.pr_info.action = CTL_PR_PREEMPT; persis_io.pr.pr_info.residx = lun->pr_res_idx; persis_io.pr.pr_info.res_type = type; memcpy(persis_io.pr.pr_info.sa_res_key, param->serv_act_res_key, sizeof(param->serv_act_res_key)); ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io, sizeof(persis_io.pr), M_WAITOK); } else { /* * sa_res_key is not the res holder just * remove registrants */ int found=0; for (i = 0; i < CTL_MAX_INITIATORS; i++) { if (sa_res_key != ctl_get_prkey(lun, i)) continue; found = 1; ctl_clr_prkey(lun, i); lun->pr_key_count--; ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT); } if (!found) { mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); ctl_set_reservation_conflict(ctsio); ctl_done((union ctl_io *)ctsio); return (1); } lun->pr_generation++; mtx_unlock(&lun->lun_lock); persis_io.hdr.nexus = ctsio->io_hdr.nexus; persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION; persis_io.pr.pr_info.action = CTL_PR_PREEMPT; persis_io.pr.pr_info.residx = lun->pr_res_idx; persis_io.pr.pr_info.res_type = type; memcpy(persis_io.pr.pr_info.sa_res_key, param->serv_act_res_key, sizeof(param->serv_act_res_key)); ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io, sizeof(persis_io.pr), M_WAITOK); } } return (0); } static void ctl_pro_preempt_other(struct ctl_lun *lun, union ctl_ha_msg *msg) { uint64_t sa_res_key; int i; sa_res_key = scsi_8btou64(msg->pr.pr_info.sa_res_key); if (lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS || lun->pr_res_idx == CTL_PR_NO_RESERVATION || sa_res_key != ctl_get_prkey(lun, lun->pr_res_idx)) { if (sa_res_key == 0) { /* * Unregister everybody else and build UA for * them */ for(i = 0; i < CTL_MAX_INITIATORS; i++) { if (i == msg->pr.pr_info.residx || ctl_get_prkey(lun, i) == 0) continue; ctl_clr_prkey(lun, i); ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT); } lun->pr_key_count = 1; lun->pr_res_type = msg->pr.pr_info.res_type; if (lun->pr_res_type != SPR_TYPE_WR_EX_AR && lun->pr_res_type != SPR_TYPE_EX_AC_AR) lun->pr_res_idx = msg->pr.pr_info.residx; } else { for (i = 0; i < CTL_MAX_INITIATORS; i++) { if (sa_res_key == ctl_get_prkey(lun, i)) continue; ctl_clr_prkey(lun, i); lun->pr_key_count--; ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT); } } } else { for (i = 0; i < CTL_MAX_INITIATORS; i++) { if (i == msg->pr.pr_info.residx || ctl_get_prkey(lun, i) == 0) continue; if (sa_res_key == ctl_get_prkey(lun, i)) { ctl_clr_prkey(lun, i); lun->pr_key_count--; ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT); } else if (msg->pr.pr_info.res_type != lun->pr_res_type && (lun->pr_res_type == SPR_TYPE_WR_EX_RO || lun->pr_res_type == SPR_TYPE_EX_AC_RO)) { ctl_est_ua(lun, i, CTL_UA_RES_RELEASE); } } lun->pr_res_type = msg->pr.pr_info.res_type; if (lun->pr_res_type != SPR_TYPE_WR_EX_AR && lun->pr_res_type != SPR_TYPE_EX_AC_AR) lun->pr_res_idx = msg->pr.pr_info.residx; else lun->pr_res_idx = CTL_PR_ALL_REGISTRANTS; } lun->pr_generation++; } int ctl_persistent_reserve_out(struct ctl_scsiio *ctsio) { struct ctl_softc *softc = CTL_SOFTC(ctsio); struct ctl_lun *lun = CTL_LUN(ctsio); int retval; u_int32_t param_len; struct scsi_per_res_out *cdb; struct scsi_per_res_out_parms* param; uint32_t residx; uint64_t res_key, sa_res_key, key; uint8_t type; union ctl_ha_msg persis_io; int i; CTL_DEBUG_PRINT(("ctl_persistent_reserve_out\n")); cdb = (struct scsi_per_res_out *)ctsio->cdb; retval = CTL_RETVAL_COMPLETE; /* * We only support whole-LUN scope. The scope & type are ignored for * register, register and ignore existing key and clear. * We sometimes ignore scope and type on preempts too!! * Verify reservation type here as well. */ type = cdb->scope_type & SPR_TYPE_MASK; if ((cdb->action == SPRO_RESERVE) || (cdb->action == SPRO_RELEASE)) { if ((cdb->scope_type & SPR_SCOPE_MASK) != SPR_LU_SCOPE) { ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 4); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } if (type>8 || type==2 || type==4 || type==0) { ctl_set_invalid_field(/*ctsio*/ ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 1, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } } param_len = scsi_4btoul(cdb->length); if ((ctsio->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0) { ctsio->kern_data_ptr = malloc(param_len, M_CTL, M_WAITOK); ctsio->kern_data_len = param_len; ctsio->kern_total_len = param_len; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } param = (struct scsi_per_res_out_parms *)ctsio->kern_data_ptr; residx = ctl_get_initindex(&ctsio->io_hdr.nexus); res_key = scsi_8btou64(param->res_key.key); sa_res_key = scsi_8btou64(param->serv_act_res_key); /* * Validate the reservation key here except for SPRO_REG_IGNO * This must be done for all other service actions */ if ((cdb->action & SPRO_ACTION_MASK) != SPRO_REG_IGNO) { mtx_lock(&lun->lun_lock); if ((key = ctl_get_prkey(lun, residx)) != 0) { if (res_key != key) { /* * The current key passed in doesn't match * the one the initiator previously * registered. */ mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); ctl_set_reservation_conflict(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } } else if ((cdb->action & SPRO_ACTION_MASK) != SPRO_REGISTER) { /* * We are not registered */ mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); ctl_set_reservation_conflict(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } else if (res_key != 0) { /* * We are not registered and trying to register but * the register key isn't zero. */ mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); ctl_set_reservation_conflict(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } mtx_unlock(&lun->lun_lock); } switch (cdb->action & SPRO_ACTION_MASK) { case SPRO_REGISTER: case SPRO_REG_IGNO: { #if 0 printf("Registration received\n"); #endif /* * We don't support any of these options, as we report in * the read capabilities request (see * ctl_persistent_reserve_in(), above). */ if ((param->flags & SPR_SPEC_I_PT) || (param->flags & SPR_ALL_TG_PT) || (param->flags & SPR_APTPL)) { int bit_ptr; if (param->flags & SPR_APTPL) bit_ptr = 0; else if (param->flags & SPR_ALL_TG_PT) bit_ptr = 2; else /* SPR_SPEC_I_PT */ bit_ptr = 3; free(ctsio->kern_data_ptr, M_CTL); ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 0, /*field*/ 20, /*bit_valid*/ 1, /*bit*/ bit_ptr); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } mtx_lock(&lun->lun_lock); /* * The initiator wants to clear the * key/unregister. */ if (sa_res_key == 0) { if ((res_key == 0 && (cdb->action & SPRO_ACTION_MASK) == SPRO_REGISTER) || ((cdb->action & SPRO_ACTION_MASK) == SPRO_REG_IGNO && ctl_get_prkey(lun, residx) == 0)) { mtx_unlock(&lun->lun_lock); goto done; } ctl_clr_prkey(lun, residx); lun->pr_key_count--; if (residx == lun->pr_res_idx) { lun->flags &= ~CTL_LUN_PR_RESERVED; lun->pr_res_idx = CTL_PR_NO_RESERVATION; if ((lun->pr_res_type == SPR_TYPE_WR_EX_RO || lun->pr_res_type == SPR_TYPE_EX_AC_RO) && lun->pr_key_count) { /* * If the reservation is a registrants * only type we need to generate a UA * for other registered inits. The * sense code should be RESERVATIONS * RELEASED */ for (i = softc->init_min; i < softc->init_max; i++){ if (ctl_get_prkey(lun, i) == 0) continue; ctl_est_ua(lun, i, CTL_UA_RES_RELEASE); } } lun->pr_res_type = 0; } else if (lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS) { if (lun->pr_key_count==0) { lun->flags &= ~CTL_LUN_PR_RESERVED; lun->pr_res_type = 0; lun->pr_res_idx = CTL_PR_NO_RESERVATION; } } lun->pr_generation++; mtx_unlock(&lun->lun_lock); persis_io.hdr.nexus = ctsio->io_hdr.nexus; persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION; persis_io.pr.pr_info.action = CTL_PR_UNREG_KEY; persis_io.pr.pr_info.residx = residx; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io, sizeof(persis_io.pr), M_WAITOK); } else /* sa_res_key != 0 */ { /* * If we aren't registered currently then increment * the key count and set the registered flag. */ ctl_alloc_prkey(lun, residx); if (ctl_get_prkey(lun, residx) == 0) lun->pr_key_count++; ctl_set_prkey(lun, residx, sa_res_key); lun->pr_generation++; mtx_unlock(&lun->lun_lock); persis_io.hdr.nexus = ctsio->io_hdr.nexus; persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION; persis_io.pr.pr_info.action = CTL_PR_REG_KEY; persis_io.pr.pr_info.residx = residx; memcpy(persis_io.pr.pr_info.sa_res_key, param->serv_act_res_key, sizeof(param->serv_act_res_key)); ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io, sizeof(persis_io.pr), M_WAITOK); } break; } case SPRO_RESERVE: #if 0 printf("Reserve executed type %d\n", type); #endif mtx_lock(&lun->lun_lock); if (lun->flags & CTL_LUN_PR_RESERVED) { /* * if this isn't the reservation holder and it's * not a "all registrants" type or if the type is * different then we have a conflict */ if ((lun->pr_res_idx != residx && lun->pr_res_idx != CTL_PR_ALL_REGISTRANTS) || lun->pr_res_type != type) { mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); ctl_set_reservation_conflict(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } mtx_unlock(&lun->lun_lock); } else /* create a reservation */ { /* * If it's not an "all registrants" type record * reservation holder */ if (type != SPR_TYPE_WR_EX_AR && type != SPR_TYPE_EX_AC_AR) lun->pr_res_idx = residx; /* Res holder */ else lun->pr_res_idx = CTL_PR_ALL_REGISTRANTS; lun->flags |= CTL_LUN_PR_RESERVED; lun->pr_res_type = type; mtx_unlock(&lun->lun_lock); /* send msg to other side */ persis_io.hdr.nexus = ctsio->io_hdr.nexus; persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION; persis_io.pr.pr_info.action = CTL_PR_RESERVE; persis_io.pr.pr_info.residx = lun->pr_res_idx; persis_io.pr.pr_info.res_type = type; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io, sizeof(persis_io.pr), M_WAITOK); } break; case SPRO_RELEASE: mtx_lock(&lun->lun_lock); if ((lun->flags & CTL_LUN_PR_RESERVED) == 0) { /* No reservation exists return good status */ mtx_unlock(&lun->lun_lock); goto done; } /* * Is this nexus a reservation holder? */ if (lun->pr_res_idx != residx && lun->pr_res_idx != CTL_PR_ALL_REGISTRANTS) { /* * not a res holder return good status but * do nothing */ mtx_unlock(&lun->lun_lock); goto done; } if (lun->pr_res_type != type) { mtx_unlock(&lun->lun_lock); free(ctsio->kern_data_ptr, M_CTL); ctl_set_illegal_pr_release(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* okay to release */ lun->flags &= ~CTL_LUN_PR_RESERVED; lun->pr_res_idx = CTL_PR_NO_RESERVATION; lun->pr_res_type = 0; /* * If this isn't an exclusive access reservation and NUAR * is not set, generate UA for all other registrants. */ if (type != SPR_TYPE_EX_AC && type != SPR_TYPE_WR_EX && (lun->MODE_CTRL.queue_flags & SCP_NUAR) == 0) { for (i = softc->init_min; i < softc->init_max; i++) { if (i == residx || ctl_get_prkey(lun, i) == 0) continue; ctl_est_ua(lun, i, CTL_UA_RES_RELEASE); } } mtx_unlock(&lun->lun_lock); /* Send msg to other side */ persis_io.hdr.nexus = ctsio->io_hdr.nexus; persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION; persis_io.pr.pr_info.action = CTL_PR_RELEASE; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io, sizeof(persis_io.pr), M_WAITOK); break; case SPRO_CLEAR: /* send msg to other side */ mtx_lock(&lun->lun_lock); lun->flags &= ~CTL_LUN_PR_RESERVED; lun->pr_res_type = 0; lun->pr_key_count = 0; lun->pr_res_idx = CTL_PR_NO_RESERVATION; ctl_clr_prkey(lun, residx); for (i = 0; i < CTL_MAX_INITIATORS; i++) if (ctl_get_prkey(lun, i) != 0) { ctl_clr_prkey(lun, i); ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT); } lun->pr_generation++; mtx_unlock(&lun->lun_lock); persis_io.hdr.nexus = ctsio->io_hdr.nexus; persis_io.hdr.msg_type = CTL_MSG_PERS_ACTION; persis_io.pr.pr_info.action = CTL_PR_CLEAR; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &persis_io, sizeof(persis_io.pr), M_WAITOK); break; case SPRO_PREEMPT: case SPRO_PRE_ABO: { int nretval; nretval = ctl_pro_preempt(softc, lun, res_key, sa_res_key, type, residx, ctsio, cdb, param); if (nretval != 0) return (CTL_RETVAL_COMPLETE); break; } default: panic("%s: Invalid PR type %#x", __func__, cdb->action); } done: free(ctsio->kern_data_ptr, M_CTL); ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (retval); } /* * This routine is for handling a message from the other SC pertaining to * persistent reserve out. All the error checking will have been done * so only perorming the action need be done here to keep the two * in sync. */ static void ctl_hndl_per_res_out_on_other_sc(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); union ctl_ha_msg *msg = (union ctl_ha_msg *)&io->presio.pr_msg; struct ctl_lun *lun; int i; uint32_t residx, targ_lun; targ_lun = msg->hdr.nexus.targ_mapped_lun; mtx_lock(&softc->ctl_lock); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); return; } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); if (lun->flags & CTL_LUN_DISABLED) { mtx_unlock(&lun->lun_lock); return; } residx = ctl_get_initindex(&msg->hdr.nexus); switch(msg->pr.pr_info.action) { case CTL_PR_REG_KEY: ctl_alloc_prkey(lun, msg->pr.pr_info.residx); if (ctl_get_prkey(lun, msg->pr.pr_info.residx) == 0) lun->pr_key_count++; ctl_set_prkey(lun, msg->pr.pr_info.residx, scsi_8btou64(msg->pr.pr_info.sa_res_key)); lun->pr_generation++; break; case CTL_PR_UNREG_KEY: ctl_clr_prkey(lun, msg->pr.pr_info.residx); lun->pr_key_count--; /* XXX Need to see if the reservation has been released */ /* if so do we need to generate UA? */ if (msg->pr.pr_info.residx == lun->pr_res_idx) { lun->flags &= ~CTL_LUN_PR_RESERVED; lun->pr_res_idx = CTL_PR_NO_RESERVATION; if ((lun->pr_res_type == SPR_TYPE_WR_EX_RO || lun->pr_res_type == SPR_TYPE_EX_AC_RO) && lun->pr_key_count) { /* * If the reservation is a registrants * only type we need to generate a UA * for other registered inits. The * sense code should be RESERVATIONS * RELEASED */ for (i = softc->init_min; i < softc->init_max; i++) { if (ctl_get_prkey(lun, i) == 0) continue; ctl_est_ua(lun, i, CTL_UA_RES_RELEASE); } } lun->pr_res_type = 0; } else if (lun->pr_res_idx == CTL_PR_ALL_REGISTRANTS) { if (lun->pr_key_count==0) { lun->flags &= ~CTL_LUN_PR_RESERVED; lun->pr_res_type = 0; lun->pr_res_idx = CTL_PR_NO_RESERVATION; } } lun->pr_generation++; break; case CTL_PR_RESERVE: lun->flags |= CTL_LUN_PR_RESERVED; lun->pr_res_type = msg->pr.pr_info.res_type; lun->pr_res_idx = msg->pr.pr_info.residx; break; case CTL_PR_RELEASE: /* * If this isn't an exclusive access reservation and NUAR * is not set, generate UA for all other registrants. */ if (lun->pr_res_type != SPR_TYPE_EX_AC && lun->pr_res_type != SPR_TYPE_WR_EX && (lun->MODE_CTRL.queue_flags & SCP_NUAR) == 0) { for (i = softc->init_min; i < softc->init_max; i++) if (i == residx || ctl_get_prkey(lun, i) == 0) continue; ctl_est_ua(lun, i, CTL_UA_RES_RELEASE); } lun->flags &= ~CTL_LUN_PR_RESERVED; lun->pr_res_idx = CTL_PR_NO_RESERVATION; lun->pr_res_type = 0; break; case CTL_PR_PREEMPT: ctl_pro_preempt_other(lun, msg); break; case CTL_PR_CLEAR: lun->flags &= ~CTL_LUN_PR_RESERVED; lun->pr_res_type = 0; lun->pr_key_count = 0; lun->pr_res_idx = CTL_PR_NO_RESERVATION; for (i=0; i < CTL_MAX_INITIATORS; i++) { if (ctl_get_prkey(lun, i) == 0) continue; ctl_clr_prkey(lun, i); ctl_est_ua(lun, i, CTL_UA_REG_PREEMPT); } lun->pr_generation++; break; } mtx_unlock(&lun->lun_lock); } int ctl_read_write(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct ctl_lba_len_flags *lbalen; uint64_t lba; uint32_t num_blocks; int flags, retval; int isread; CTL_DEBUG_PRINT(("ctl_read_write: command: %#x\n", ctsio->cdb[0])); flags = 0; isread = ctsio->cdb[0] == READ_6 || ctsio->cdb[0] == READ_10 || ctsio->cdb[0] == READ_12 || ctsio->cdb[0] == READ_16; switch (ctsio->cdb[0]) { case READ_6: case WRITE_6: { struct scsi_rw_6 *cdb; cdb = (struct scsi_rw_6 *)ctsio->cdb; lba = scsi_3btoul(cdb->addr); /* only 5 bits are valid in the most significant address byte */ lba &= 0x1fffff; num_blocks = cdb->length; /* * This is correct according to SBC-2. */ if (num_blocks == 0) num_blocks = 256; break; } case READ_10: case WRITE_10: { struct scsi_rw_10 *cdb; cdb = (struct scsi_rw_10 *)ctsio->cdb; if (cdb->byte2 & SRW10_FUA) flags |= CTL_LLF_FUA; if (cdb->byte2 & SRW10_DPO) flags |= CTL_LLF_DPO; lba = scsi_4btoul(cdb->addr); num_blocks = scsi_2btoul(cdb->length); break; } case WRITE_VERIFY_10: { struct scsi_write_verify_10 *cdb; cdb = (struct scsi_write_verify_10 *)ctsio->cdb; flags |= CTL_LLF_FUA; if (cdb->byte2 & SWV_DPO) flags |= CTL_LLF_DPO; lba = scsi_4btoul(cdb->addr); num_blocks = scsi_2btoul(cdb->length); break; } case READ_12: case WRITE_12: { struct scsi_rw_12 *cdb; cdb = (struct scsi_rw_12 *)ctsio->cdb; if (cdb->byte2 & SRW12_FUA) flags |= CTL_LLF_FUA; if (cdb->byte2 & SRW12_DPO) flags |= CTL_LLF_DPO; lba = scsi_4btoul(cdb->addr); num_blocks = scsi_4btoul(cdb->length); break; } case WRITE_VERIFY_12: { struct scsi_write_verify_12 *cdb; cdb = (struct scsi_write_verify_12 *)ctsio->cdb; flags |= CTL_LLF_FUA; if (cdb->byte2 & SWV_DPO) flags |= CTL_LLF_DPO; lba = scsi_4btoul(cdb->addr); num_blocks = scsi_4btoul(cdb->length); break; } case READ_16: case WRITE_16: { struct scsi_rw_16 *cdb; cdb = (struct scsi_rw_16 *)ctsio->cdb; if (cdb->byte2 & SRW12_FUA) flags |= CTL_LLF_FUA; if (cdb->byte2 & SRW12_DPO) flags |= CTL_LLF_DPO; lba = scsi_8btou64(cdb->addr); num_blocks = scsi_4btoul(cdb->length); break; } case WRITE_ATOMIC_16: { struct scsi_write_atomic_16 *cdb; if (lun->be_lun->atomicblock == 0) { ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } cdb = (struct scsi_write_atomic_16 *)ctsio->cdb; if (cdb->byte2 & SRW12_FUA) flags |= CTL_LLF_FUA; if (cdb->byte2 & SRW12_DPO) flags |= CTL_LLF_DPO; lba = scsi_8btou64(cdb->addr); num_blocks = scsi_2btoul(cdb->length); if (num_blocks > lun->be_lun->atomicblock) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 12, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } break; } case WRITE_VERIFY_16: { struct scsi_write_verify_16 *cdb; cdb = (struct scsi_write_verify_16 *)ctsio->cdb; flags |= CTL_LLF_FUA; if (cdb->byte2 & SWV_DPO) flags |= CTL_LLF_DPO; lba = scsi_8btou64(cdb->addr); num_blocks = scsi_4btoul(cdb->length); break; } default: /* * We got a command we don't support. This shouldn't * happen, commands should be filtered out above us. */ ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); break; /* NOTREACHED */ } /* * The first check is to make sure we're in bounds, the second * check is to catch wrap-around problems. If the lba + num blocks * is less than the lba, then we've wrapped around and the block * range is invalid anyway. */ if (((lba + num_blocks) > (lun->be_lun->maxlba + 1)) || ((lba + num_blocks) < lba)) { ctl_set_lba_out_of_range(ctsio, MAX(lba, lun->be_lun->maxlba + 1)); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * According to SBC-3, a transfer length of 0 is not an error. * Note that this cannot happen with WRITE(6) or READ(6), since 0 * translates to 256 blocks for those commands. */ if (num_blocks == 0) { ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* Set FUA and/or DPO if caches are disabled. */ if (isread) { if ((lun->MODE_CACHING.flags1 & SCP_RCD) != 0) flags |= CTL_LLF_FUA | CTL_LLF_DPO; } else { if ((lun->MODE_CACHING.flags1 & SCP_WCE) == 0) flags |= CTL_LLF_FUA; } lbalen = (struct ctl_lba_len_flags *) &ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN]; lbalen->lba = lba; lbalen->len = num_blocks; lbalen->flags = (isread ? CTL_LLF_READ : CTL_LLF_WRITE) | flags; ctsio->kern_total_len = num_blocks * lun->be_lun->blocksize; ctsio->kern_rel_offset = 0; CTL_DEBUG_PRINT(("ctl_read_write: calling data_submit()\n")); retval = lun->backend->data_submit((union ctl_io *)ctsio); return (retval); } static int ctl_cnw_cont(union ctl_io *io) { struct ctl_lun *lun = CTL_LUN(io); struct ctl_scsiio *ctsio; struct ctl_lba_len_flags *lbalen; int retval; ctsio = &io->scsiio; ctsio->io_hdr.status = CTL_STATUS_NONE; ctsio->io_hdr.flags &= ~CTL_FLAG_IO_CONT; lbalen = (struct ctl_lba_len_flags *) &ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN]; lbalen->flags &= ~CTL_LLF_COMPARE; lbalen->flags |= CTL_LLF_WRITE; CTL_DEBUG_PRINT(("ctl_cnw_cont: calling data_submit()\n")); retval = lun->backend->data_submit((union ctl_io *)ctsio); return (retval); } int ctl_cnw(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct ctl_lba_len_flags *lbalen; uint64_t lba; uint32_t num_blocks; int flags, retval; CTL_DEBUG_PRINT(("ctl_cnw: command: %#x\n", ctsio->cdb[0])); flags = 0; switch (ctsio->cdb[0]) { case COMPARE_AND_WRITE: { struct scsi_compare_and_write *cdb; cdb = (struct scsi_compare_and_write *)ctsio->cdb; if (cdb->byte2 & SRW10_FUA) flags |= CTL_LLF_FUA; if (cdb->byte2 & SRW10_DPO) flags |= CTL_LLF_DPO; lba = scsi_8btou64(cdb->addr); num_blocks = cdb->length; break; } default: /* * We got a command we don't support. This shouldn't * happen, commands should be filtered out above us. */ ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); break; /* NOTREACHED */ } /* * The first check is to make sure we're in bounds, the second * check is to catch wrap-around problems. If the lba + num blocks * is less than the lba, then we've wrapped around and the block * range is invalid anyway. */ if (((lba + num_blocks) > (lun->be_lun->maxlba + 1)) || ((lba + num_blocks) < lba)) { ctl_set_lba_out_of_range(ctsio, MAX(lba, lun->be_lun->maxlba + 1)); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * According to SBC-3, a transfer length of 0 is not an error. */ if (num_blocks == 0) { ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* Set FUA if write cache is disabled. */ if ((lun->MODE_CACHING.flags1 & SCP_WCE) == 0) flags |= CTL_LLF_FUA; ctsio->kern_total_len = 2 * num_blocks * lun->be_lun->blocksize; ctsio->kern_rel_offset = 0; /* * Set the IO_CONT flag, so that if this I/O gets passed to * ctl_data_submit_done(), it'll get passed back to * ctl_ctl_cnw_cont() for further processing. */ ctsio->io_hdr.flags |= CTL_FLAG_IO_CONT; ctsio->io_cont = ctl_cnw_cont; lbalen = (struct ctl_lba_len_flags *) &ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN]; lbalen->lba = lba; lbalen->len = num_blocks; lbalen->flags = CTL_LLF_COMPARE | flags; CTL_DEBUG_PRINT(("ctl_cnw: calling data_submit()\n")); retval = lun->backend->data_submit((union ctl_io *)ctsio); return (retval); } int ctl_verify(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct ctl_lba_len_flags *lbalen; uint64_t lba; uint32_t num_blocks; int bytchk, flags; int retval; CTL_DEBUG_PRINT(("ctl_verify: command: %#x\n", ctsio->cdb[0])); bytchk = 0; flags = CTL_LLF_FUA; switch (ctsio->cdb[0]) { case VERIFY_10: { struct scsi_verify_10 *cdb; cdb = (struct scsi_verify_10 *)ctsio->cdb; if (cdb->byte2 & SVFY_BYTCHK) bytchk = 1; if (cdb->byte2 & SVFY_DPO) flags |= CTL_LLF_DPO; lba = scsi_4btoul(cdb->addr); num_blocks = scsi_2btoul(cdb->length); break; } case VERIFY_12: { struct scsi_verify_12 *cdb; cdb = (struct scsi_verify_12 *)ctsio->cdb; if (cdb->byte2 & SVFY_BYTCHK) bytchk = 1; if (cdb->byte2 & SVFY_DPO) flags |= CTL_LLF_DPO; lba = scsi_4btoul(cdb->addr); num_blocks = scsi_4btoul(cdb->length); break; } case VERIFY_16: { struct scsi_rw_16 *cdb; cdb = (struct scsi_rw_16 *)ctsio->cdb; if (cdb->byte2 & SVFY_BYTCHK) bytchk = 1; if (cdb->byte2 & SVFY_DPO) flags |= CTL_LLF_DPO; lba = scsi_8btou64(cdb->addr); num_blocks = scsi_4btoul(cdb->length); break; } default: /* * We got a command we don't support. This shouldn't * happen, commands should be filtered out above us. */ ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * The first check is to make sure we're in bounds, the second * check is to catch wrap-around problems. If the lba + num blocks * is less than the lba, then we've wrapped around and the block * range is invalid anyway. */ if (((lba + num_blocks) > (lun->be_lun->maxlba + 1)) || ((lba + num_blocks) < lba)) { ctl_set_lba_out_of_range(ctsio, MAX(lba, lun->be_lun->maxlba + 1)); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * According to SBC-3, a transfer length of 0 is not an error. */ if (num_blocks == 0) { ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } lbalen = (struct ctl_lba_len_flags *) &ctsio->io_hdr.ctl_private[CTL_PRIV_LBA_LEN]; lbalen->lba = lba; lbalen->len = num_blocks; if (bytchk) { lbalen->flags = CTL_LLF_COMPARE | flags; ctsio->kern_total_len = num_blocks * lun->be_lun->blocksize; } else { lbalen->flags = CTL_LLF_VERIFY | flags; ctsio->kern_total_len = 0; } ctsio->kern_rel_offset = 0; CTL_DEBUG_PRINT(("ctl_verify: calling data_submit()\n")); retval = lun->backend->data_submit((union ctl_io *)ctsio); return (retval); } int ctl_report_luns(struct ctl_scsiio *ctsio) { struct ctl_softc *softc = CTL_SOFTC(ctsio); struct ctl_port *port = CTL_PORT(ctsio); struct ctl_lun *lun, *request_lun = CTL_LUN(ctsio); struct scsi_report_luns *cdb; struct scsi_report_luns_data *lun_data; int num_filled, num_luns, num_port_luns, retval; uint32_t alloc_len, lun_datalen; uint32_t initidx, targ_lun_id, lun_id; retval = CTL_RETVAL_COMPLETE; cdb = (struct scsi_report_luns *)ctsio->cdb; CTL_DEBUG_PRINT(("ctl_report_luns\n")); num_luns = 0; num_port_luns = port->lun_map ? port->lun_map_size : CTL_MAX_LUNS; mtx_lock(&softc->ctl_lock); for (targ_lun_id = 0; targ_lun_id < num_port_luns; targ_lun_id++) { if (ctl_lun_map_from_port(port, targ_lun_id) != UINT32_MAX) num_luns++; } mtx_unlock(&softc->ctl_lock); switch (cdb->select_report) { case RPL_REPORT_DEFAULT: case RPL_REPORT_ALL: case RPL_REPORT_NONSUBSID: break; case RPL_REPORT_WELLKNOWN: case RPL_REPORT_ADMIN: case RPL_REPORT_CONGLOM: num_luns = 0; break; default: ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (retval); break; /* NOTREACHED */ } alloc_len = scsi_4btoul(cdb->length); /* * The initiator has to allocate at least 16 bytes for this request, * so he can at least get the header and the first LUN. Otherwise * we reject the request (per SPC-3 rev 14, section 6.21). */ if (alloc_len < (sizeof(struct scsi_report_luns_data) + sizeof(struct scsi_report_luns_lundata))) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 6, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (retval); } lun_datalen = sizeof(*lun_data) + (num_luns * sizeof(struct scsi_report_luns_lundata)); ctsio->kern_data_ptr = malloc(lun_datalen, M_CTL, M_WAITOK | M_ZERO); lun_data = (struct scsi_report_luns_data *)ctsio->kern_data_ptr; ctsio->kern_sg_entries = 0; initidx = ctl_get_initindex(&ctsio->io_hdr.nexus); mtx_lock(&softc->ctl_lock); for (targ_lun_id = 0, num_filled = 0; targ_lun_id < num_port_luns && num_filled < num_luns; targ_lun_id++) { lun_id = ctl_lun_map_from_port(port, targ_lun_id); if (lun_id == UINT32_MAX) continue; lun = softc->ctl_luns[lun_id]; if (lun == NULL) continue; be64enc(lun_data->luns[num_filled++].lundata, ctl_encode_lun(targ_lun_id)); /* * According to SPC-3, rev 14 section 6.21: * * "The execution of a REPORT LUNS command to any valid and * installed logical unit shall clear the REPORTED LUNS DATA * HAS CHANGED unit attention condition for all logical * units of that target with respect to the requesting * initiator. A valid and installed logical unit is one * having a PERIPHERAL QUALIFIER of 000b in the standard * INQUIRY data (see 6.4.2)." * * If request_lun is NULL, the LUN this report luns command * was issued to is either disabled or doesn't exist. In that * case, we shouldn't clear any pending lun change unit * attention. */ if (request_lun != NULL) { mtx_lock(&lun->lun_lock); ctl_clr_ua(lun, initidx, CTL_UA_LUN_CHANGE); mtx_unlock(&lun->lun_lock); } } mtx_unlock(&softc->ctl_lock); /* * It's quite possible that we've returned fewer LUNs than we allocated * space for. Trim it. */ lun_datalen = sizeof(*lun_data) + (num_filled * sizeof(struct scsi_report_luns_lundata)); ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(lun_datalen, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* * We set this to the actual data length, regardless of how much * space we actually have to return results. If the user looks at * this value, he'll know whether or not he allocated enough space * and reissue the command if necessary. We don't support well * known logical units, so if the user asks for that, return none. */ scsi_ulto4b(lun_datalen - 8, lun_data->length); /* * We can only return SCSI_STATUS_CHECK_COND when we can't satisfy * this request. */ ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (retval); } int ctl_request_sense(struct ctl_scsiio *ctsio) { struct ctl_softc *softc = CTL_SOFTC(ctsio); struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_request_sense *cdb; struct scsi_sense_data *sense_ptr, *ps; uint32_t initidx; int have_error; u_int sense_len = SSD_FULL_SIZE; scsi_sense_data_type sense_format; ctl_ua_type ua_type; uint8_t asc = 0, ascq = 0; cdb = (struct scsi_request_sense *)ctsio->cdb; CTL_DEBUG_PRINT(("ctl_request_sense\n")); /* * Determine which sense format the user wants. */ if (cdb->byte2 & SRS_DESC) sense_format = SSD_TYPE_DESC; else sense_format = SSD_TYPE_FIXED; ctsio->kern_data_ptr = malloc(sizeof(*sense_ptr), M_CTL, M_WAITOK); sense_ptr = (struct scsi_sense_data *)ctsio->kern_data_ptr; ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; /* * struct scsi_sense_data, which is currently set to 256 bytes, is * larger than the largest allowed value for the length field in the * REQUEST SENSE CDB, which is 252 bytes as of SPC-4. */ ctsio->kern_data_len = cdb->length; ctsio->kern_total_len = cdb->length; /* * If we don't have a LUN, we don't have any pending sense. */ if (lun == NULL || ((lun->flags & CTL_LUN_PRIMARY_SC) == 0 && softc->ha_link < CTL_HA_LINK_UNKNOWN)) { /* "Logical unit not supported" */ ctl_set_sense_data(sense_ptr, &sense_len, NULL, sense_format, /*current_error*/ 1, /*sense_key*/ SSD_KEY_ILLEGAL_REQUEST, /*asc*/ 0x25, /*ascq*/ 0x00, SSD_ELEM_NONE); goto send; } have_error = 0; initidx = ctl_get_initindex(&ctsio->io_hdr.nexus); /* * Check for pending sense, and then for pending unit attentions. * Pending sense gets returned first, then pending unit attentions. */ mtx_lock(&lun->lun_lock); ps = lun->pending_sense[initidx / CTL_MAX_INIT_PER_PORT]; if (ps != NULL) ps += initidx % CTL_MAX_INIT_PER_PORT; if (ps != NULL && ps->error_code != 0) { scsi_sense_data_type stored_format; /* * Check to see which sense format was used for the stored * sense data. */ stored_format = scsi_sense_type(ps); /* * If the user requested a different sense format than the * one we stored, then we need to convert it to the other * format. If we're going from descriptor to fixed format * sense data, we may lose things in translation, depending * on what options were used. * * If the stored format is SSD_TYPE_NONE (i.e. invalid), * for some reason we'll just copy it out as-is. */ if ((stored_format == SSD_TYPE_FIXED) && (sense_format == SSD_TYPE_DESC)) ctl_sense_to_desc((struct scsi_sense_data_fixed *) ps, (struct scsi_sense_data_desc *)sense_ptr); else if ((stored_format == SSD_TYPE_DESC) && (sense_format == SSD_TYPE_FIXED)) ctl_sense_to_fixed((struct scsi_sense_data_desc *) ps, (struct scsi_sense_data_fixed *)sense_ptr); else memcpy(sense_ptr, ps, sizeof(*sense_ptr)); ps->error_code = 0; have_error = 1; } else { ua_type = ctl_build_ua(lun, initidx, sense_ptr, &sense_len, sense_format); if (ua_type != CTL_UA_NONE) have_error = 1; } if (have_error == 0) { /* * Report informational exception if have one and allowed. */ if (lun->MODE_IE.mrie != SIEP_MRIE_NO) { asc = lun->ie_asc; ascq = lun->ie_ascq; } ctl_set_sense_data(sense_ptr, &sense_len, lun, sense_format, /*current_error*/ 1, /*sense_key*/ SSD_KEY_NO_SENSE, /*asc*/ asc, /*ascq*/ ascq, SSD_ELEM_NONE); } mtx_unlock(&lun->lun_lock); send: /* * We report the SCSI status as OK, since the status of the command * itself is OK. We're reporting sense as parameter data. */ ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_tur(struct ctl_scsiio *ctsio) { CTL_DEBUG_PRINT(("ctl_tur\n")); ctl_set_success(ctsio); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * SCSI VPD page 0x00, the Supported VPD Pages page. */ static int ctl_inquiry_evpd_supported(struct ctl_scsiio *ctsio, int alloc_len) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_vpd_supported_pages *pages; int sup_page_size; int p; sup_page_size = sizeof(struct scsi_vpd_supported_pages) * SCSI_EVPD_NUM_SUPPORTED_PAGES; ctsio->kern_data_ptr = malloc(sup_page_size, M_CTL, M_WAITOK | M_ZERO); pages = (struct scsi_vpd_supported_pages *)ctsio->kern_data_ptr; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(sup_page_size, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* * The control device is always connected. The disk device, on the * other hand, may not be online all the time. Need to change this * to figure out whether the disk device is actually online or not. */ if (lun != NULL) pages->device = (SID_QUAL_LU_CONNECTED << 5) | lun->be_lun->lun_type; else pages->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT; p = 0; /* Supported VPD pages */ pages->page_list[p++] = SVPD_SUPPORTED_PAGES; /* Serial Number */ pages->page_list[p++] = SVPD_UNIT_SERIAL_NUMBER; /* Device Identification */ pages->page_list[p++] = SVPD_DEVICE_ID; /* Extended INQUIRY Data */ pages->page_list[p++] = SVPD_EXTENDED_INQUIRY_DATA; /* Mode Page Policy */ pages->page_list[p++] = SVPD_MODE_PAGE_POLICY; /* SCSI Ports */ pages->page_list[p++] = SVPD_SCSI_PORTS; /* Third-party Copy */ pages->page_list[p++] = SVPD_SCSI_TPC; if (lun != NULL && lun->be_lun->lun_type == T_DIRECT) { /* Block limits */ pages->page_list[p++] = SVPD_BLOCK_LIMITS; /* Block Device Characteristics */ pages->page_list[p++] = SVPD_BDC; /* Logical Block Provisioning */ pages->page_list[p++] = SVPD_LBP; } pages->length = p; ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * SCSI VPD page 0x80, the Unit Serial Number page. */ static int ctl_inquiry_evpd_serial(struct ctl_scsiio *ctsio, int alloc_len) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_vpd_unit_serial_number *sn_ptr; int data_len; data_len = 4 + CTL_SN_LEN; ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); sn_ptr = (struct scsi_vpd_unit_serial_number *)ctsio->kern_data_ptr; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* * The control device is always connected. The disk device, on the * other hand, may not be online all the time. Need to change this * to figure out whether the disk device is actually online or not. */ if (lun != NULL) sn_ptr->device = (SID_QUAL_LU_CONNECTED << 5) | lun->be_lun->lun_type; else sn_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT; sn_ptr->page_code = SVPD_UNIT_SERIAL_NUMBER; sn_ptr->length = CTL_SN_LEN; /* * If we don't have a LUN, we just leave the serial number as * all spaces. */ if (lun != NULL) { strncpy((char *)sn_ptr->serial_num, (char *)lun->be_lun->serial_num, CTL_SN_LEN); } else memset(sn_ptr->serial_num, 0x20, CTL_SN_LEN); ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * SCSI VPD page 0x86, the Extended INQUIRY Data page. */ static int ctl_inquiry_evpd_eid(struct ctl_scsiio *ctsio, int alloc_len) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_vpd_extended_inquiry_data *eid_ptr; int data_len; data_len = sizeof(struct scsi_vpd_extended_inquiry_data); ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); eid_ptr = (struct scsi_vpd_extended_inquiry_data *)ctsio->kern_data_ptr; ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* * The control device is always connected. The disk device, on the * other hand, may not be online all the time. */ if (lun != NULL) eid_ptr->device = (SID_QUAL_LU_CONNECTED << 5) | lun->be_lun->lun_type; else eid_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT; eid_ptr->page_code = SVPD_EXTENDED_INQUIRY_DATA; scsi_ulto2b(data_len - 4, eid_ptr->page_length); /* * We support head of queue, ordered and simple tags. */ eid_ptr->flags2 = SVPD_EID_HEADSUP | SVPD_EID_ORDSUP | SVPD_EID_SIMPSUP; /* * Volatile cache supported. */ eid_ptr->flags3 = SVPD_EID_V_SUP; /* * This means that we clear the REPORTED LUNS DATA HAS CHANGED unit * attention for a particular IT nexus on all LUNs once we report * it to that nexus once. This bit is required as of SPC-4. */ eid_ptr->flags4 = SVPD_EID_LUICLR; /* * We support revert to defaults (RTD) bit in MODE SELECT. */ eid_ptr->flags5 = SVPD_EID_RTD_SUP; /* * XXX KDM in order to correctly answer this, we would need * information from the SIM to determine how much sense data it * can send. So this would really be a path inquiry field, most * likely. This can be set to a maximum of 252 according to SPC-4, * but the hardware may or may not be able to support that much. * 0 just means that the maximum sense data length is not reported. */ eid_ptr->max_sense_length = 0; ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } static int ctl_inquiry_evpd_mpp(struct ctl_scsiio *ctsio, int alloc_len) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_vpd_mode_page_policy *mpp_ptr; int data_len; data_len = sizeof(struct scsi_vpd_mode_page_policy) + sizeof(struct scsi_vpd_mode_page_policy_descr); ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); mpp_ptr = (struct scsi_vpd_mode_page_policy *)ctsio->kern_data_ptr; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* * The control device is always connected. The disk device, on the * other hand, may not be online all the time. */ if (lun != NULL) mpp_ptr->device = (SID_QUAL_LU_CONNECTED << 5) | lun->be_lun->lun_type; else mpp_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT; mpp_ptr->page_code = SVPD_MODE_PAGE_POLICY; scsi_ulto2b(data_len - 4, mpp_ptr->page_length); mpp_ptr->descr[0].page_code = 0x3f; mpp_ptr->descr[0].subpage_code = 0xff; mpp_ptr->descr[0].policy = SVPD_MPP_SHARED; ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * SCSI VPD page 0x83, the Device Identification page. */ static int ctl_inquiry_evpd_devid(struct ctl_scsiio *ctsio, int alloc_len) { struct ctl_softc *softc = CTL_SOFTC(ctsio); struct ctl_port *port = CTL_PORT(ctsio); struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_vpd_device_id *devid_ptr; struct scsi_vpd_id_descriptor *desc; int data_len, g; uint8_t proto; data_len = sizeof(struct scsi_vpd_device_id) + sizeof(struct scsi_vpd_id_descriptor) + sizeof(struct scsi_vpd_id_rel_trgt_port_id) + sizeof(struct scsi_vpd_id_descriptor) + sizeof(struct scsi_vpd_id_trgt_port_grp_id); if (lun && lun->lun_devid) data_len += lun->lun_devid->len; if (port && port->port_devid) data_len += port->port_devid->len; if (port && port->target_devid) data_len += port->target_devid->len; ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); devid_ptr = (struct scsi_vpd_device_id *)ctsio->kern_data_ptr; ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* * The control device is always connected. The disk device, on the * other hand, may not be online all the time. */ if (lun != NULL) devid_ptr->device = (SID_QUAL_LU_CONNECTED << 5) | lun->be_lun->lun_type; else devid_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT; devid_ptr->page_code = SVPD_DEVICE_ID; scsi_ulto2b(data_len - 4, devid_ptr->length); if (port && port->port_type == CTL_PORT_FC) proto = SCSI_PROTO_FC << 4; - else if (port->port_type == CTL_PORT_SAS) + else if (port && port->port_type == CTL_PORT_SAS) proto = SCSI_PROTO_SAS << 4; else if (port && port->port_type == CTL_PORT_ISCSI) proto = SCSI_PROTO_ISCSI << 4; else proto = SCSI_PROTO_SPI << 4; desc = (struct scsi_vpd_id_descriptor *)devid_ptr->desc_list; /* * We're using a LUN association here. i.e., this device ID is a * per-LUN identifier. */ if (lun && lun->lun_devid) { memcpy(desc, lun->lun_devid->data, lun->lun_devid->len); desc = (struct scsi_vpd_id_descriptor *)((uint8_t *)desc + lun->lun_devid->len); } /* * This is for the WWPN which is a port association. */ if (port && port->port_devid) { memcpy(desc, port->port_devid->data, port->port_devid->len); desc = (struct scsi_vpd_id_descriptor *)((uint8_t *)desc + port->port_devid->len); } /* * This is for the Relative Target Port(type 4h) identifier */ desc->proto_codeset = proto | SVPD_ID_CODESET_BINARY; desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_PORT | SVPD_ID_TYPE_RELTARG; desc->length = 4; scsi_ulto2b(ctsio->io_hdr.nexus.targ_port, &desc->identifier[2]); desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] + sizeof(struct scsi_vpd_id_rel_trgt_port_id)); /* * This is for the Target Port Group(type 5h) identifier */ desc->proto_codeset = proto | SVPD_ID_CODESET_BINARY; desc->id_type = SVPD_ID_PIV | SVPD_ID_ASSOC_PORT | SVPD_ID_TYPE_TPORTGRP; desc->length = 4; if (softc->is_single || (port && port->status & CTL_PORT_STATUS_HA_SHARED)) g = 1; else g = 2 + ctsio->io_hdr.nexus.targ_port / softc->port_cnt; scsi_ulto2b(g, &desc->identifier[2]); desc = (struct scsi_vpd_id_descriptor *)(&desc->identifier[0] + sizeof(struct scsi_vpd_id_trgt_port_grp_id)); /* * This is for the Target identifier */ if (port && port->target_devid) { memcpy(desc, port->target_devid->data, port->target_devid->len); } ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } static int ctl_inquiry_evpd_scsi_ports(struct ctl_scsiio *ctsio, int alloc_len) { struct ctl_softc *softc = CTL_SOFTC(ctsio); struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_vpd_scsi_ports *sp; struct scsi_vpd_port_designation *pd; struct scsi_vpd_port_designation_cont *pdc; struct ctl_port *port; int data_len, num_target_ports, iid_len, id_len; num_target_ports = 0; iid_len = 0; id_len = 0; mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(port, &softc->port_list, links) { if ((port->status & CTL_PORT_STATUS_ONLINE) == 0) continue; if (lun != NULL && ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX) continue; num_target_ports++; if (port->init_devid) iid_len += port->init_devid->len; if (port->port_devid) id_len += port->port_devid->len; } mtx_unlock(&softc->ctl_lock); data_len = sizeof(struct scsi_vpd_scsi_ports) + num_target_ports * (sizeof(struct scsi_vpd_port_designation) + sizeof(struct scsi_vpd_port_designation_cont)) + iid_len + id_len; ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); sp = (struct scsi_vpd_scsi_ports *)ctsio->kern_data_ptr; ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* * The control device is always connected. The disk device, on the * other hand, may not be online all the time. Need to change this * to figure out whether the disk device is actually online or not. */ if (lun != NULL) sp->device = (SID_QUAL_LU_CONNECTED << 5) | lun->be_lun->lun_type; else sp->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT; sp->page_code = SVPD_SCSI_PORTS; scsi_ulto2b(data_len - sizeof(struct scsi_vpd_scsi_ports), sp->page_length); pd = &sp->design[0]; mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(port, &softc->port_list, links) { if ((port->status & CTL_PORT_STATUS_ONLINE) == 0) continue; if (lun != NULL && ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX) continue; scsi_ulto2b(port->targ_port, pd->relative_port_id); if (port->init_devid) { iid_len = port->init_devid->len; memcpy(pd->initiator_transportid, port->init_devid->data, port->init_devid->len); } else iid_len = 0; scsi_ulto2b(iid_len, pd->initiator_transportid_length); pdc = (struct scsi_vpd_port_designation_cont *) (&pd->initiator_transportid[iid_len]); if (port->port_devid) { id_len = port->port_devid->len; memcpy(pdc->target_port_descriptors, port->port_devid->data, port->port_devid->len); } else id_len = 0; scsi_ulto2b(id_len, pdc->target_port_descriptors_length); pd = (struct scsi_vpd_port_designation *) ((uint8_t *)pdc->target_port_descriptors + id_len); } mtx_unlock(&softc->ctl_lock); ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } static int ctl_inquiry_evpd_block_limits(struct ctl_scsiio *ctsio, int alloc_len) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_vpd_block_limits *bl_ptr; uint64_t ival; ctsio->kern_data_ptr = malloc(sizeof(*bl_ptr), M_CTL, M_WAITOK | M_ZERO); bl_ptr = (struct scsi_vpd_block_limits *)ctsio->kern_data_ptr; ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_sg_entries = 0; ctsio->kern_data_len = min(sizeof(*bl_ptr), alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* * The control device is always connected. The disk device, on the * other hand, may not be online all the time. Need to change this * to figure out whether the disk device is actually online or not. */ if (lun != NULL) bl_ptr->device = (SID_QUAL_LU_CONNECTED << 5) | lun->be_lun->lun_type; else bl_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT; bl_ptr->page_code = SVPD_BLOCK_LIMITS; scsi_ulto2b(sizeof(*bl_ptr) - 4, bl_ptr->page_length); bl_ptr->max_cmp_write_len = 0xff; scsi_ulto4b(0xffffffff, bl_ptr->max_txfer_len); if (lun != NULL) { scsi_ulto4b(lun->be_lun->opttxferlen, bl_ptr->opt_txfer_len); if (lun->be_lun->flags & CTL_LUN_FLAG_UNMAP) { ival = 0xffffffff; ctl_get_opt_number(&lun->be_lun->options, "unmap_max_lba", &ival); scsi_ulto4b(ival, bl_ptr->max_unmap_lba_cnt); ival = 0xffffffff; ctl_get_opt_number(&lun->be_lun->options, "unmap_max_descr", &ival); scsi_ulto4b(ival, bl_ptr->max_unmap_blk_cnt); if (lun->be_lun->ublockexp != 0) { scsi_ulto4b((1 << lun->be_lun->ublockexp), bl_ptr->opt_unmap_grain); scsi_ulto4b(0x80000000 | lun->be_lun->ublockoff, bl_ptr->unmap_grain_align); } } scsi_ulto4b(lun->be_lun->atomicblock, bl_ptr->max_atomic_transfer_length); scsi_ulto4b(0, bl_ptr->atomic_alignment); scsi_ulto4b(0, bl_ptr->atomic_transfer_length_granularity); scsi_ulto4b(0, bl_ptr->max_atomic_transfer_length_with_atomic_boundary); scsi_ulto4b(0, bl_ptr->max_atomic_boundary_size); ival = UINT64_MAX; ctl_get_opt_number(&lun->be_lun->options, "write_same_max_lba", &ival); scsi_u64to8b(ival, bl_ptr->max_write_same_length); } ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } static int ctl_inquiry_evpd_bdc(struct ctl_scsiio *ctsio, int alloc_len) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_vpd_block_device_characteristics *bdc_ptr; const char *value; u_int i; ctsio->kern_data_ptr = malloc(sizeof(*bdc_ptr), M_CTL, M_WAITOK | M_ZERO); bdc_ptr = (struct scsi_vpd_block_device_characteristics *)ctsio->kern_data_ptr; ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(sizeof(*bdc_ptr), alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* * The control device is always connected. The disk device, on the * other hand, may not be online all the time. Need to change this * to figure out whether the disk device is actually online or not. */ if (lun != NULL) bdc_ptr->device = (SID_QUAL_LU_CONNECTED << 5) | lun->be_lun->lun_type; else bdc_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT; bdc_ptr->page_code = SVPD_BDC; scsi_ulto2b(sizeof(*bdc_ptr) - 4, bdc_ptr->page_length); if (lun != NULL && (value = ctl_get_opt(&lun->be_lun->options, "rpm")) != NULL) i = strtol(value, NULL, 0); else i = CTL_DEFAULT_ROTATION_RATE; scsi_ulto2b(i, bdc_ptr->medium_rotation_rate); if (lun != NULL && (value = ctl_get_opt(&lun->be_lun->options, "formfactor")) != NULL) i = strtol(value, NULL, 0); else i = 0; bdc_ptr->wab_wac_ff = (i & 0x0f); bdc_ptr->flags = SVPD_FUAB | SVPD_VBULS; ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } static int ctl_inquiry_evpd_lbp(struct ctl_scsiio *ctsio, int alloc_len) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_vpd_logical_block_prov *lbp_ptr; const char *value; ctsio->kern_data_ptr = malloc(sizeof(*lbp_ptr), M_CTL, M_WAITOK | M_ZERO); lbp_ptr = (struct scsi_vpd_logical_block_prov *)ctsio->kern_data_ptr; ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(sizeof(*lbp_ptr), alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; /* * The control device is always connected. The disk device, on the * other hand, may not be online all the time. Need to change this * to figure out whether the disk device is actually online or not. */ if (lun != NULL) lbp_ptr->device = (SID_QUAL_LU_CONNECTED << 5) | lun->be_lun->lun_type; else lbp_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | T_DIRECT; lbp_ptr->page_code = SVPD_LBP; scsi_ulto2b(sizeof(*lbp_ptr) - 4, lbp_ptr->page_length); lbp_ptr->threshold_exponent = CTL_LBP_EXPONENT; if (lun != NULL && lun->be_lun->flags & CTL_LUN_FLAG_UNMAP) { lbp_ptr->flags = SVPD_LBP_UNMAP | SVPD_LBP_WS16 | SVPD_LBP_WS10 | SVPD_LBP_RZ | SVPD_LBP_ANC_SUP; value = ctl_get_opt(&lun->be_lun->options, "provisioning_type"); if (value != NULL) { if (strcmp(value, "resource") == 0) lbp_ptr->prov_type = SVPD_LBP_RESOURCE; else if (strcmp(value, "thin") == 0) lbp_ptr->prov_type = SVPD_LBP_THIN; } else lbp_ptr->prov_type = SVPD_LBP_THIN; } ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * INQUIRY with the EVPD bit set. */ static int ctl_inquiry_evpd(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_inquiry *cdb; int alloc_len, retval; cdb = (struct scsi_inquiry *)ctsio->cdb; alloc_len = scsi_2btoul(cdb->length); switch (cdb->page_code) { case SVPD_SUPPORTED_PAGES: retval = ctl_inquiry_evpd_supported(ctsio, alloc_len); break; case SVPD_UNIT_SERIAL_NUMBER: retval = ctl_inquiry_evpd_serial(ctsio, alloc_len); break; case SVPD_DEVICE_ID: retval = ctl_inquiry_evpd_devid(ctsio, alloc_len); break; case SVPD_EXTENDED_INQUIRY_DATA: retval = ctl_inquiry_evpd_eid(ctsio, alloc_len); break; case SVPD_MODE_PAGE_POLICY: retval = ctl_inquiry_evpd_mpp(ctsio, alloc_len); break; case SVPD_SCSI_PORTS: retval = ctl_inquiry_evpd_scsi_ports(ctsio, alloc_len); break; case SVPD_SCSI_TPC: retval = ctl_inquiry_evpd_tpc(ctsio, alloc_len); break; case SVPD_BLOCK_LIMITS: if (lun == NULL || lun->be_lun->lun_type != T_DIRECT) goto err; retval = ctl_inquiry_evpd_block_limits(ctsio, alloc_len); break; case SVPD_BDC: if (lun == NULL || lun->be_lun->lun_type != T_DIRECT) goto err; retval = ctl_inquiry_evpd_bdc(ctsio, alloc_len); break; case SVPD_LBP: if (lun == NULL || lun->be_lun->lun_type != T_DIRECT) goto err; retval = ctl_inquiry_evpd_lbp(ctsio, alloc_len); break; default: err: ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); retval = CTL_RETVAL_COMPLETE; break; } return (retval); } /* * Standard INQUIRY data. */ static int ctl_inquiry_std(struct ctl_scsiio *ctsio) { struct ctl_softc *softc = CTL_SOFTC(ctsio); struct ctl_port *port = CTL_PORT(ctsio); struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_inquiry_data *inq_ptr; struct scsi_inquiry *cdb; char *val; uint32_t alloc_len, data_len; ctl_port_type port_type; port_type = port->port_type; if (port_type == CTL_PORT_IOCTL || port_type == CTL_PORT_INTERNAL) port_type = CTL_PORT_SCSI; cdb = (struct scsi_inquiry *)ctsio->cdb; alloc_len = scsi_2btoul(cdb->length); /* * We malloc the full inquiry data size here and fill it * in. If the user only asks for less, we'll give him * that much. */ data_len = offsetof(struct scsi_inquiry_data, vendor_specific1); ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); inq_ptr = (struct scsi_inquiry_data *)ctsio->kern_data_ptr; ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; if (lun != NULL) { if ((lun->flags & CTL_LUN_PRIMARY_SC) || softc->ha_link >= CTL_HA_LINK_UNKNOWN) { inq_ptr->device = (SID_QUAL_LU_CONNECTED << 5) | lun->be_lun->lun_type; } else { inq_ptr->device = (SID_QUAL_LU_OFFLINE << 5) | lun->be_lun->lun_type; } if (lun->flags & CTL_LUN_REMOVABLE) inq_ptr->dev_qual2 |= SID_RMB; } else inq_ptr->device = (SID_QUAL_BAD_LU << 5) | T_NODEVICE; /* RMB in byte 2 is 0 */ inq_ptr->version = SCSI_REV_SPC5; /* * According to SAM-3, even if a device only supports a single * level of LUN addressing, it should still set the HISUP bit: * * 4.9.1 Logical unit numbers overview * * All logical unit number formats described in this standard are * hierarchical in structure even when only a single level in that * hierarchy is used. The HISUP bit shall be set to one in the * standard INQUIRY data (see SPC-2) when any logical unit number * format described in this standard is used. Non-hierarchical * formats are outside the scope of this standard. * * Therefore we set the HiSup bit here. * * The response format is 2, per SPC-3. */ inq_ptr->response_format = SID_HiSup | 2; inq_ptr->additional_length = data_len - (offsetof(struct scsi_inquiry_data, additional_length) + 1); CTL_DEBUG_PRINT(("additional_length = %d\n", inq_ptr->additional_length)); inq_ptr->spc3_flags = SPC3_SID_3PC | SPC3_SID_TPGS_IMPLICIT; if (port_type == CTL_PORT_SCSI) inq_ptr->spc2_flags = SPC2_SID_ADDR16; inq_ptr->spc2_flags |= SPC2_SID_MultiP; inq_ptr->flags = SID_CmdQue; if (port_type == CTL_PORT_SCSI) inq_ptr->flags |= SID_WBus16 | SID_Sync; /* * Per SPC-3, unused bytes in ASCII strings are filled with spaces. * We have 8 bytes for the vendor name, and 16 bytes for the device * name and 4 bytes for the revision. */ if (lun == NULL || (val = ctl_get_opt(&lun->be_lun->options, "vendor")) == NULL) { strncpy(inq_ptr->vendor, CTL_VENDOR, sizeof(inq_ptr->vendor)); } else { memset(inq_ptr->vendor, ' ', sizeof(inq_ptr->vendor)); strncpy(inq_ptr->vendor, val, min(sizeof(inq_ptr->vendor), strlen(val))); } if (lun == NULL) { strncpy(inq_ptr->product, CTL_DIRECT_PRODUCT, sizeof(inq_ptr->product)); } else if ((val = ctl_get_opt(&lun->be_lun->options, "product")) == NULL) { switch (lun->be_lun->lun_type) { case T_DIRECT: strncpy(inq_ptr->product, CTL_DIRECT_PRODUCT, sizeof(inq_ptr->product)); break; case T_PROCESSOR: strncpy(inq_ptr->product, CTL_PROCESSOR_PRODUCT, sizeof(inq_ptr->product)); break; case T_CDROM: strncpy(inq_ptr->product, CTL_CDROM_PRODUCT, sizeof(inq_ptr->product)); break; default: strncpy(inq_ptr->product, CTL_UNKNOWN_PRODUCT, sizeof(inq_ptr->product)); break; } } else { memset(inq_ptr->product, ' ', sizeof(inq_ptr->product)); strncpy(inq_ptr->product, val, min(sizeof(inq_ptr->product), strlen(val))); } /* * XXX make this a macro somewhere so it automatically gets * incremented when we make changes. */ if (lun == NULL || (val = ctl_get_opt(&lun->be_lun->options, "revision")) == NULL) { strncpy(inq_ptr->revision, "0001", sizeof(inq_ptr->revision)); } else { memset(inq_ptr->revision, ' ', sizeof(inq_ptr->revision)); strncpy(inq_ptr->revision, val, min(sizeof(inq_ptr->revision), strlen(val))); } /* * For parallel SCSI, we support double transition and single * transition clocking. We also support QAS (Quick Arbitration * and Selection) and Information Unit transfers on both the * control and array devices. */ if (port_type == CTL_PORT_SCSI) inq_ptr->spi3data = SID_SPI_CLOCK_DT_ST | SID_SPI_QAS | SID_SPI_IUS; /* SAM-6 (no version claimed) */ scsi_ulto2b(0x00C0, inq_ptr->version1); /* SPC-5 (no version claimed) */ scsi_ulto2b(0x05C0, inq_ptr->version2); if (port_type == CTL_PORT_FC) { /* FCP-2 ANSI INCITS.350:2003 */ scsi_ulto2b(0x0917, inq_ptr->version3); } else if (port_type == CTL_PORT_SCSI) { /* SPI-4 ANSI INCITS.362:200x */ scsi_ulto2b(0x0B56, inq_ptr->version3); } else if (port_type == CTL_PORT_ISCSI) { /* iSCSI (no version claimed) */ scsi_ulto2b(0x0960, inq_ptr->version3); } else if (port_type == CTL_PORT_SAS) { /* SAS (no version claimed) */ scsi_ulto2b(0x0BE0, inq_ptr->version3); } else if (port_type == CTL_PORT_UMASS) { /* USB Mass Storage Class Bulk-Only Transport, Revision 1.0 */ scsi_ulto2b(0x1730, inq_ptr->version3); } if (lun == NULL) { /* SBC-4 (no version claimed) */ scsi_ulto2b(0x0600, inq_ptr->version4); } else { switch (lun->be_lun->lun_type) { case T_DIRECT: /* SBC-4 (no version claimed) */ scsi_ulto2b(0x0600, inq_ptr->version4); break; case T_PROCESSOR: break; case T_CDROM: /* MMC-6 (no version claimed) */ scsi_ulto2b(0x04E0, inq_ptr->version4); break; default: break; } } ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_inquiry(struct ctl_scsiio *ctsio) { struct scsi_inquiry *cdb; int retval; CTL_DEBUG_PRINT(("ctl_inquiry\n")); cdb = (struct scsi_inquiry *)ctsio->cdb; if (cdb->byte2 & SI_EVPD) retval = ctl_inquiry_evpd(ctsio); else if (cdb->page_code == 0) retval = ctl_inquiry_std(ctsio); else { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 2, /*bit_valid*/ 0, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } return (retval); } int ctl_get_config(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_get_config_header *hdr; struct scsi_get_config_feature *feature; struct scsi_get_config *cdb; uint32_t alloc_len, data_len; int rt, starting; cdb = (struct scsi_get_config *)ctsio->cdb; rt = (cdb->rt & SGC_RT_MASK); starting = scsi_2btoul(cdb->starting_feature); alloc_len = scsi_2btoul(cdb->length); data_len = sizeof(struct scsi_get_config_header) + sizeof(struct scsi_get_config_feature) + 8 + sizeof(struct scsi_get_config_feature) + 8 + sizeof(struct scsi_get_config_feature) + 4 + sizeof(struct scsi_get_config_feature) + 4 + sizeof(struct scsi_get_config_feature) + 8 + sizeof(struct scsi_get_config_feature) + sizeof(struct scsi_get_config_feature) + 4 + sizeof(struct scsi_get_config_feature) + 4 + sizeof(struct scsi_get_config_feature) + 4 + sizeof(struct scsi_get_config_feature) + 4 + sizeof(struct scsi_get_config_feature) + 4 + sizeof(struct scsi_get_config_feature) + 4; ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; hdr = (struct scsi_get_config_header *)ctsio->kern_data_ptr; if (lun->flags & CTL_LUN_NO_MEDIA) scsi_ulto2b(0x0000, hdr->current_profile); else scsi_ulto2b(0x0010, hdr->current_profile); feature = (struct scsi_get_config_feature *)(hdr + 1); if (starting > 0x003b) goto done; if (starting > 0x003a) goto f3b; if (starting > 0x002b) goto f3a; if (starting > 0x002a) goto f2b; if (starting > 0x001f) goto f2a; if (starting > 0x001e) goto f1f; if (starting > 0x001d) goto f1e; if (starting > 0x0010) goto f1d; if (starting > 0x0003) goto f10; if (starting > 0x0002) goto f3; if (starting > 0x0001) goto f2; if (starting > 0x0000) goto f1; /* Profile List */ scsi_ulto2b(0x0000, feature->feature_code); feature->flags = SGC_F_PERSISTENT | SGC_F_CURRENT; feature->add_length = 8; scsi_ulto2b(0x0008, &feature->feature_data[0]); /* CD-ROM */ feature->feature_data[2] = 0x00; scsi_ulto2b(0x0010, &feature->feature_data[4]); /* DVD-ROM */ feature->feature_data[6] = 0x01; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; f1: /* Core */ scsi_ulto2b(0x0001, feature->feature_code); feature->flags = 0x08 | SGC_F_PERSISTENT | SGC_F_CURRENT; feature->add_length = 8; scsi_ulto4b(0x00000000, &feature->feature_data[0]); feature->feature_data[4] = 0x03; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; f2: /* Morphing */ scsi_ulto2b(0x0002, feature->feature_code); feature->flags = 0x04 | SGC_F_PERSISTENT | SGC_F_CURRENT; feature->add_length = 4; feature->feature_data[0] = 0x02; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; f3: /* Removable Medium */ scsi_ulto2b(0x0003, feature->feature_code); feature->flags = 0x04 | SGC_F_PERSISTENT | SGC_F_CURRENT; feature->add_length = 4; feature->feature_data[0] = 0x39; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; if (rt == SGC_RT_CURRENT && (lun->flags & CTL_LUN_NO_MEDIA)) goto done; f10: /* Random Read */ scsi_ulto2b(0x0010, feature->feature_code); feature->flags = 0x00; if ((lun->flags & CTL_LUN_NO_MEDIA) == 0) feature->flags |= SGC_F_CURRENT; feature->add_length = 8; scsi_ulto4b(lun->be_lun->blocksize, &feature->feature_data[0]); scsi_ulto2b(1, &feature->feature_data[4]); feature->feature_data[6] = 0x00; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; f1d: /* Multi-Read */ scsi_ulto2b(0x001D, feature->feature_code); feature->flags = 0x00; if ((lun->flags & CTL_LUN_NO_MEDIA) == 0) feature->flags |= SGC_F_CURRENT; feature->add_length = 0; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; f1e: /* CD Read */ scsi_ulto2b(0x001E, feature->feature_code); feature->flags = 0x00; if ((lun->flags & CTL_LUN_NO_MEDIA) == 0) feature->flags |= SGC_F_CURRENT; feature->add_length = 4; feature->feature_data[0] = 0x00; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; f1f: /* DVD Read */ scsi_ulto2b(0x001F, feature->feature_code); feature->flags = 0x08; if ((lun->flags & CTL_LUN_NO_MEDIA) == 0) feature->flags |= SGC_F_CURRENT; feature->add_length = 4; feature->feature_data[0] = 0x01; feature->feature_data[2] = 0x03; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; f2a: /* DVD+RW */ scsi_ulto2b(0x002A, feature->feature_code); feature->flags = 0x04; if ((lun->flags & CTL_LUN_NO_MEDIA) == 0) feature->flags |= SGC_F_CURRENT; feature->add_length = 4; feature->feature_data[0] = 0x00; feature->feature_data[1] = 0x00; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; f2b: /* DVD+R */ scsi_ulto2b(0x002B, feature->feature_code); feature->flags = 0x00; if ((lun->flags & CTL_LUN_NO_MEDIA) == 0) feature->flags |= SGC_F_CURRENT; feature->add_length = 4; feature->feature_data[0] = 0x00; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; f3a: /* DVD+RW Dual Layer */ scsi_ulto2b(0x003A, feature->feature_code); feature->flags = 0x00; if ((lun->flags & CTL_LUN_NO_MEDIA) == 0) feature->flags |= SGC_F_CURRENT; feature->add_length = 4; feature->feature_data[0] = 0x00; feature->feature_data[1] = 0x00; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; f3b: /* DVD+R Dual Layer */ scsi_ulto2b(0x003B, feature->feature_code); feature->flags = 0x00; if ((lun->flags & CTL_LUN_NO_MEDIA) == 0) feature->flags |= SGC_F_CURRENT; feature->add_length = 4; feature->feature_data[0] = 0x00; feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; done: data_len = (uint8_t *)feature - (uint8_t *)hdr; if (rt == SGC_RT_SPECIFIC && data_len > 4) { feature = (struct scsi_get_config_feature *)(hdr + 1); if (scsi_2btoul(feature->feature_code) == starting) feature = (struct scsi_get_config_feature *) &feature->feature_data[feature->add_length]; data_len = (uint8_t *)feature - (uint8_t *)hdr; } scsi_ulto4b(data_len - 4, hdr->data_length); ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_get_event_status(struct ctl_scsiio *ctsio) { struct scsi_get_event_status_header *hdr; struct scsi_get_event_status *cdb; uint32_t alloc_len, data_len; int notif_class; cdb = (struct scsi_get_event_status *)ctsio->cdb; if ((cdb->byte2 & SGESN_POLLED) == 0) { ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 1, /*bit_valid*/ 1, /*bit*/ 0); ctl_done((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } notif_class = cdb->notif_class; alloc_len = scsi_2btoul(cdb->length); data_len = sizeof(struct scsi_get_event_status_header); ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; hdr = (struct scsi_get_event_status_header *)ctsio->kern_data_ptr; scsi_ulto2b(0, hdr->descr_length); hdr->nea_class = SGESN_NEA; hdr->supported_class = 0; ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } int ctl_mechanism_status(struct ctl_scsiio *ctsio) { struct scsi_mechanism_status_header *hdr; struct scsi_mechanism_status *cdb; uint32_t alloc_len, data_len; cdb = (struct scsi_mechanism_status *)ctsio->cdb; alloc_len = scsi_2btoul(cdb->length); data_len = sizeof(struct scsi_mechanism_status_header); ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; hdr = (struct scsi_mechanism_status_header *)ctsio->kern_data_ptr; hdr->state1 = 0x00; hdr->state2 = 0xe0; scsi_ulto3b(0, hdr->lba); hdr->slots_num = 0; scsi_ulto2b(0, hdr->slots_length); ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } static void ctl_ultomsf(uint32_t lba, uint8_t *buf) { lba += 150; buf[0] = 0; buf[1] = bin2bcd((lba / 75) / 60); buf[2] = bin2bcd((lba / 75) % 60); buf[3] = bin2bcd(lba % 75); } int ctl_read_toc(struct ctl_scsiio *ctsio) { struct ctl_lun *lun = CTL_LUN(ctsio); struct scsi_read_toc_hdr *hdr; struct scsi_read_toc_type01_descr *descr; struct scsi_read_toc *cdb; uint32_t alloc_len, data_len; int format, msf; cdb = (struct scsi_read_toc *)ctsio->cdb; msf = (cdb->byte2 & CD_MSF) != 0; format = cdb->format; alloc_len = scsi_2btoul(cdb->data_len); data_len = sizeof(struct scsi_read_toc_hdr); if (format == 0) data_len += 2 * sizeof(struct scsi_read_toc_type01_descr); else data_len += sizeof(struct scsi_read_toc_type01_descr); ctsio->kern_data_ptr = malloc(data_len, M_CTL, M_WAITOK | M_ZERO); ctsio->kern_sg_entries = 0; ctsio->kern_rel_offset = 0; ctsio->kern_data_len = min(data_len, alloc_len); ctsio->kern_total_len = ctsio->kern_data_len; hdr = (struct scsi_read_toc_hdr *)ctsio->kern_data_ptr; if (format == 0) { scsi_ulto2b(0x12, hdr->data_length); hdr->first = 1; hdr->last = 1; descr = (struct scsi_read_toc_type01_descr *)(hdr + 1); descr->addr_ctl = 0x14; descr->track_number = 1; if (msf) ctl_ultomsf(0, descr->track_start); else scsi_ulto4b(0, descr->track_start); descr++; descr->addr_ctl = 0x14; descr->track_number = 0xaa; if (msf) ctl_ultomsf(lun->be_lun->maxlba+1, descr->track_start); else scsi_ulto4b(lun->be_lun->maxlba+1, descr->track_start); } else { scsi_ulto2b(0x0a, hdr->data_length); hdr->first = 1; hdr->last = 1; descr = (struct scsi_read_toc_type01_descr *)(hdr + 1); descr->addr_ctl = 0x14; descr->track_number = 1; if (msf) ctl_ultomsf(0, descr->track_start); else scsi_ulto4b(0, descr->track_start); } ctl_set_success(ctsio); ctsio->io_hdr.flags |= CTL_FLAG_ALLOCATED; ctsio->be_move_done = ctl_config_move_done; ctl_datamove((union ctl_io *)ctsio); return (CTL_RETVAL_COMPLETE); } /* * For known CDB types, parse the LBA and length. */ static int ctl_get_lba_len(union ctl_io *io, uint64_t *lba, uint64_t *len) { if (io->io_hdr.io_type != CTL_IO_SCSI) return (1); switch (io->scsiio.cdb[0]) { case COMPARE_AND_WRITE: { struct scsi_compare_and_write *cdb; cdb = (struct scsi_compare_and_write *)io->scsiio.cdb; *lba = scsi_8btou64(cdb->addr); *len = cdb->length; break; } case READ_6: case WRITE_6: { struct scsi_rw_6 *cdb; cdb = (struct scsi_rw_6 *)io->scsiio.cdb; *lba = scsi_3btoul(cdb->addr); /* only 5 bits are valid in the most significant address byte */ *lba &= 0x1fffff; *len = cdb->length; break; } case READ_10: case WRITE_10: { struct scsi_rw_10 *cdb; cdb = (struct scsi_rw_10 *)io->scsiio.cdb; *lba = scsi_4btoul(cdb->addr); *len = scsi_2btoul(cdb->length); break; } case WRITE_VERIFY_10: { struct scsi_write_verify_10 *cdb; cdb = (struct scsi_write_verify_10 *)io->scsiio.cdb; *lba = scsi_4btoul(cdb->addr); *len = scsi_2btoul(cdb->length); break; } case READ_12: case WRITE_12: { struct scsi_rw_12 *cdb; cdb = (struct scsi_rw_12 *)io->scsiio.cdb; *lba = scsi_4btoul(cdb->addr); *len = scsi_4btoul(cdb->length); break; } case WRITE_VERIFY_12: { struct scsi_write_verify_12 *cdb; cdb = (struct scsi_write_verify_12 *)io->scsiio.cdb; *lba = scsi_4btoul(cdb->addr); *len = scsi_4btoul(cdb->length); break; } case READ_16: case WRITE_16: { struct scsi_rw_16 *cdb; cdb = (struct scsi_rw_16 *)io->scsiio.cdb; *lba = scsi_8btou64(cdb->addr); *len = scsi_4btoul(cdb->length); break; } case WRITE_ATOMIC_16: { struct scsi_write_atomic_16 *cdb; cdb = (struct scsi_write_atomic_16 *)io->scsiio.cdb; *lba = scsi_8btou64(cdb->addr); *len = scsi_2btoul(cdb->length); break; } case WRITE_VERIFY_16: { struct scsi_write_verify_16 *cdb; cdb = (struct scsi_write_verify_16 *)io->scsiio.cdb; *lba = scsi_8btou64(cdb->addr); *len = scsi_4btoul(cdb->length); break; } case WRITE_SAME_10: { struct scsi_write_same_10 *cdb; cdb = (struct scsi_write_same_10 *)io->scsiio.cdb; *lba = scsi_4btoul(cdb->addr); *len = scsi_2btoul(cdb->length); break; } case WRITE_SAME_16: { struct scsi_write_same_16 *cdb; cdb = (struct scsi_write_same_16 *)io->scsiio.cdb; *lba = scsi_8btou64(cdb->addr); *len = scsi_4btoul(cdb->length); break; } case VERIFY_10: { struct scsi_verify_10 *cdb; cdb = (struct scsi_verify_10 *)io->scsiio.cdb; *lba = scsi_4btoul(cdb->addr); *len = scsi_2btoul(cdb->length); break; } case VERIFY_12: { struct scsi_verify_12 *cdb; cdb = (struct scsi_verify_12 *)io->scsiio.cdb; *lba = scsi_4btoul(cdb->addr); *len = scsi_4btoul(cdb->length); break; } case VERIFY_16: { struct scsi_verify_16 *cdb; cdb = (struct scsi_verify_16 *)io->scsiio.cdb; *lba = scsi_8btou64(cdb->addr); *len = scsi_4btoul(cdb->length); break; } case UNMAP: { *lba = 0; *len = UINT64_MAX; break; } case SERVICE_ACTION_IN: { /* GET LBA STATUS */ struct scsi_get_lba_status *cdb; cdb = (struct scsi_get_lba_status *)io->scsiio.cdb; *lba = scsi_8btou64(cdb->addr); *len = UINT32_MAX; break; } default: return (1); break; /* NOTREACHED */ } return (0); } static ctl_action ctl_extent_check_lba(uint64_t lba1, uint64_t len1, uint64_t lba2, uint64_t len2, bool seq) { uint64_t endlba1, endlba2; endlba1 = lba1 + len1 - (seq ? 0 : 1); endlba2 = lba2 + len2 - 1; if ((endlba1 < lba2) || (endlba2 < lba1)) return (CTL_ACTION_PASS); else return (CTL_ACTION_BLOCK); } static int ctl_extent_check_unmap(union ctl_io *io, uint64_t lba2, uint64_t len2) { struct ctl_ptr_len_flags *ptrlen; struct scsi_unmap_desc *buf, *end, *range; uint64_t lba; uint32_t len; /* If not UNMAP -- go other way. */ if (io->io_hdr.io_type != CTL_IO_SCSI || io->scsiio.cdb[0] != UNMAP) return (CTL_ACTION_ERROR); /* If UNMAP without data -- block and wait for data. */ ptrlen = (struct ctl_ptr_len_flags *) &io->io_hdr.ctl_private[CTL_PRIV_LBA_LEN]; if ((io->io_hdr.flags & CTL_FLAG_ALLOCATED) == 0 || ptrlen->ptr == NULL) return (CTL_ACTION_BLOCK); /* UNMAP with data -- check for collision. */ buf = (struct scsi_unmap_desc *)ptrlen->ptr; end = buf + ptrlen->len / sizeof(*buf); for (range = buf; range < end; range++) { lba = scsi_8btou64(range->lba); len = scsi_4btoul(range->length); if ((lba < lba2 + len2) && (lba + len > lba2)) return (CTL_ACTION_BLOCK); } return (CTL_ACTION_PASS); } static ctl_action ctl_extent_check(union ctl_io *io1, union ctl_io *io2, bool seq) { uint64_t lba1, lba2; uint64_t len1, len2; int retval; if (ctl_get_lba_len(io2, &lba2, &len2) != 0) return (CTL_ACTION_ERROR); retval = ctl_extent_check_unmap(io1, lba2, len2); if (retval != CTL_ACTION_ERROR) return (retval); if (ctl_get_lba_len(io1, &lba1, &len1) != 0) return (CTL_ACTION_ERROR); if (io1->io_hdr.flags & CTL_FLAG_SERSEQ_DONE) seq = FALSE; return (ctl_extent_check_lba(lba1, len1, lba2, len2, seq)); } static ctl_action ctl_extent_check_seq(union ctl_io *io1, union ctl_io *io2) { uint64_t lba1, lba2; uint64_t len1, len2; if (io1->io_hdr.flags & CTL_FLAG_SERSEQ_DONE) return (CTL_ACTION_PASS); if (ctl_get_lba_len(io1, &lba1, &len1) != 0) return (CTL_ACTION_ERROR); if (ctl_get_lba_len(io2, &lba2, &len2) != 0) return (CTL_ACTION_ERROR); if (lba1 + len1 == lba2) return (CTL_ACTION_BLOCK); return (CTL_ACTION_PASS); } static ctl_action ctl_check_for_blockage(struct ctl_lun *lun, union ctl_io *pending_io, union ctl_io *ooa_io) { const struct ctl_cmd_entry *pending_entry, *ooa_entry; const ctl_serialize_action *serialize_row; /* * The initiator attempted multiple untagged commands at the same * time. Can't do that. */ if ((pending_io->scsiio.tag_type == CTL_TAG_UNTAGGED) && (ooa_io->scsiio.tag_type == CTL_TAG_UNTAGGED) && ((pending_io->io_hdr.nexus.targ_port == ooa_io->io_hdr.nexus.targ_port) && (pending_io->io_hdr.nexus.initid == ooa_io->io_hdr.nexus.initid)) && ((ooa_io->io_hdr.flags & (CTL_FLAG_ABORT | CTL_FLAG_STATUS_SENT)) == 0)) return (CTL_ACTION_OVERLAP); /* * The initiator attempted to send multiple tagged commands with * the same ID. (It's fine if different initiators have the same * tag ID.) * * Even if all of those conditions are true, we don't kill the I/O * if the command ahead of us has been aborted. We won't end up * sending it to the FETD, and it's perfectly legal to resend a * command with the same tag number as long as the previous * instance of this tag number has been aborted somehow. */ if ((pending_io->scsiio.tag_type != CTL_TAG_UNTAGGED) && (ooa_io->scsiio.tag_type != CTL_TAG_UNTAGGED) && (pending_io->scsiio.tag_num == ooa_io->scsiio.tag_num) && ((pending_io->io_hdr.nexus.targ_port == ooa_io->io_hdr.nexus.targ_port) && (pending_io->io_hdr.nexus.initid == ooa_io->io_hdr.nexus.initid)) && ((ooa_io->io_hdr.flags & (CTL_FLAG_ABORT | CTL_FLAG_STATUS_SENT)) == 0)) return (CTL_ACTION_OVERLAP_TAG); /* * If we get a head of queue tag, SAM-3 says that we should * immediately execute it. * * What happens if this command would normally block for some other * reason? e.g. a request sense with a head of queue tag * immediately after a write. Normally that would block, but this * will result in its getting executed immediately... * * We currently return "pass" instead of "skip", so we'll end up * going through the rest of the queue to check for overlapped tags. * * XXX KDM check for other types of blockage first?? */ if (pending_io->scsiio.tag_type == CTL_TAG_HEAD_OF_QUEUE) return (CTL_ACTION_PASS); /* * Ordered tags have to block until all items ahead of them * have completed. If we get called with an ordered tag, we always * block, if something else is ahead of us in the queue. */ if (pending_io->scsiio.tag_type == CTL_TAG_ORDERED) return (CTL_ACTION_BLOCK); /* * Simple tags get blocked until all head of queue and ordered tags * ahead of them have completed. I'm lumping untagged commands in * with simple tags here. XXX KDM is that the right thing to do? */ if (((pending_io->scsiio.tag_type == CTL_TAG_UNTAGGED) || (pending_io->scsiio.tag_type == CTL_TAG_SIMPLE)) && ((ooa_io->scsiio.tag_type == CTL_TAG_HEAD_OF_QUEUE) || (ooa_io->scsiio.tag_type == CTL_TAG_ORDERED))) return (CTL_ACTION_BLOCK); pending_entry = ctl_get_cmd_entry(&pending_io->scsiio, NULL); KASSERT(pending_entry->seridx < CTL_SERIDX_COUNT, ("%s: Invalid seridx %d for pending CDB %02x %02x @ %p", __func__, pending_entry->seridx, pending_io->scsiio.cdb[0], pending_io->scsiio.cdb[1], pending_io)); ooa_entry = ctl_get_cmd_entry(&ooa_io->scsiio, NULL); if (ooa_entry->seridx == CTL_SERIDX_INVLD) return (CTL_ACTION_PASS); /* Unsupported command in OOA queue */ KASSERT(ooa_entry->seridx < CTL_SERIDX_COUNT, ("%s: Invalid seridx %d for ooa CDB %02x %02x @ %p", __func__, ooa_entry->seridx, ooa_io->scsiio.cdb[0], ooa_io->scsiio.cdb[1], ooa_io)); serialize_row = ctl_serialize_table[ooa_entry->seridx]; switch (serialize_row[pending_entry->seridx]) { case CTL_SER_BLOCK: return (CTL_ACTION_BLOCK); case CTL_SER_EXTENT: return (ctl_extent_check(ooa_io, pending_io, (lun->be_lun && lun->be_lun->serseq == CTL_LUN_SERSEQ_ON))); case CTL_SER_EXTENTOPT: if ((lun->MODE_CTRL.queue_flags & SCP_QUEUE_ALG_MASK) != SCP_QUEUE_ALG_UNRESTRICTED) return (ctl_extent_check(ooa_io, pending_io, (lun->be_lun && lun->be_lun->serseq == CTL_LUN_SERSEQ_ON))); return (CTL_ACTION_PASS); case CTL_SER_EXTENTSEQ: if (lun->be_lun && lun->be_lun->serseq != CTL_LUN_SERSEQ_OFF) return (ctl_extent_check_seq(ooa_io, pending_io)); return (CTL_ACTION_PASS); case CTL_SER_PASS: return (CTL_ACTION_PASS); case CTL_SER_BLOCKOPT: if ((lun->MODE_CTRL.queue_flags & SCP_QUEUE_ALG_MASK) != SCP_QUEUE_ALG_UNRESTRICTED) return (CTL_ACTION_BLOCK); return (CTL_ACTION_PASS); case CTL_SER_SKIP: return (CTL_ACTION_SKIP); default: panic("%s: Invalid serialization value %d for %d => %d", __func__, serialize_row[pending_entry->seridx], pending_entry->seridx, ooa_entry->seridx); } return (CTL_ACTION_ERROR); } /* * Check for blockage or overlaps against the OOA (Order Of Arrival) queue. * Assumptions: * - pending_io is generally either incoming, or on the blocked queue * - starting I/O is the I/O we want to start the check with. */ static ctl_action ctl_check_ooa(struct ctl_lun *lun, union ctl_io *pending_io, union ctl_io *starting_io) { union ctl_io *ooa_io; ctl_action action; mtx_assert(&lun->lun_lock, MA_OWNED); /* * Run back along the OOA queue, starting with the current * blocked I/O and going through every I/O before it on the * queue. If starting_io is NULL, we'll just end up returning * CTL_ACTION_PASS. */ for (ooa_io = starting_io; ooa_io != NULL; ooa_io = (union ctl_io *)TAILQ_PREV(&ooa_io->io_hdr, ctl_ooaq, ooa_links)){ /* * This routine just checks to see whether * cur_blocked is blocked by ooa_io, which is ahead * of it in the queue. It doesn't queue/dequeue * cur_blocked. */ action = ctl_check_for_blockage(lun, pending_io, ooa_io); switch (action) { case CTL_ACTION_BLOCK: case CTL_ACTION_OVERLAP: case CTL_ACTION_OVERLAP_TAG: case CTL_ACTION_SKIP: case CTL_ACTION_ERROR: return (action); break; /* NOTREACHED */ case CTL_ACTION_PASS: break; default: panic("%s: Invalid action %d\n", __func__, action); } } return (CTL_ACTION_PASS); } /* * Assumptions: * - An I/O has just completed, and has been removed from the per-LUN OOA * queue, so some items on the blocked queue may now be unblocked. */ static int ctl_check_blocked(struct ctl_lun *lun) { struct ctl_softc *softc = lun->ctl_softc; union ctl_io *cur_blocked, *next_blocked; mtx_assert(&lun->lun_lock, MA_OWNED); /* * Run forward from the head of the blocked queue, checking each * entry against the I/Os prior to it on the OOA queue to see if * there is still any blockage. * * We cannot use the TAILQ_FOREACH() macro, because it can't deal * with our removing a variable on it while it is traversing the * list. */ for (cur_blocked = (union ctl_io *)TAILQ_FIRST(&lun->blocked_queue); cur_blocked != NULL; cur_blocked = next_blocked) { union ctl_io *prev_ooa; ctl_action action; next_blocked = (union ctl_io *)TAILQ_NEXT(&cur_blocked->io_hdr, blocked_links); prev_ooa = (union ctl_io *)TAILQ_PREV(&cur_blocked->io_hdr, ctl_ooaq, ooa_links); /* * If cur_blocked happens to be the first item in the OOA * queue now, prev_ooa will be NULL, and the action * returned will just be CTL_ACTION_PASS. */ action = ctl_check_ooa(lun, cur_blocked, prev_ooa); switch (action) { case CTL_ACTION_BLOCK: /* Nothing to do here, still blocked */ break; case CTL_ACTION_OVERLAP: case CTL_ACTION_OVERLAP_TAG: /* * This shouldn't happen! In theory we've already * checked this command for overlap... */ break; case CTL_ACTION_PASS: case CTL_ACTION_SKIP: { const struct ctl_cmd_entry *entry; /* * The skip case shouldn't happen, this transaction * should have never made it onto the blocked queue. */ /* * This I/O is no longer blocked, we can remove it * from the blocked queue. Since this is a TAILQ * (doubly linked list), we can do O(1) removals * from any place on the list. */ TAILQ_REMOVE(&lun->blocked_queue, &cur_blocked->io_hdr, blocked_links); cur_blocked->io_hdr.flags &= ~CTL_FLAG_BLOCKED; if ((softc->ha_mode != CTL_HA_MODE_XFER) && (cur_blocked->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC)){ /* * Need to send IO back to original side to * run */ union ctl_ha_msg msg_info; cur_blocked->io_hdr.flags &= ~CTL_FLAG_IO_ACTIVE; msg_info.hdr.original_sc = cur_blocked->io_hdr.original_sc; msg_info.hdr.serializing_sc = cur_blocked; msg_info.hdr.msg_type = CTL_MSG_R2R; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info, sizeof(msg_info.hdr), M_NOWAIT); break; } entry = ctl_get_cmd_entry(&cur_blocked->scsiio, NULL); /* * Check this I/O for LUN state changes that may * have happened while this command was blocked. * The LUN state may have been changed by a command * ahead of us in the queue, so we need to re-check * for any states that can be caused by SCSI * commands. */ if (ctl_scsiio_lun_check(lun, entry, &cur_blocked->scsiio) == 0) { cur_blocked->io_hdr.flags |= CTL_FLAG_IS_WAS_ON_RTR; ctl_enqueue_rtr(cur_blocked); } else ctl_done(cur_blocked); break; } default: /* * This probably shouldn't happen -- we shouldn't * get CTL_ACTION_ERROR, or anything else. */ break; } } return (CTL_RETVAL_COMPLETE); } /* * This routine (with one exception) checks LUN flags that can be set by * commands ahead of us in the OOA queue. These flags have to be checked * when a command initially comes in, and when we pull a command off the * blocked queue and are preparing to execute it. The reason we have to * check these flags for commands on the blocked queue is that the LUN * state may have been changed by a command ahead of us while we're on the * blocked queue. * * Ordering is somewhat important with these checks, so please pay * careful attention to the placement of any new checks. */ static int ctl_scsiio_lun_check(struct ctl_lun *lun, const struct ctl_cmd_entry *entry, struct ctl_scsiio *ctsio) { struct ctl_softc *softc = lun->ctl_softc; int retval; uint32_t residx; retval = 0; mtx_assert(&lun->lun_lock, MA_OWNED); /* * If this shelf is a secondary shelf controller, we may have to * reject some commands disallowed by HA mode and link state. */ if ((lun->flags & CTL_LUN_PRIMARY_SC) == 0) { if (softc->ha_link == CTL_HA_LINK_OFFLINE && (entry->flags & CTL_CMD_FLAG_OK_ON_UNAVAIL) == 0) { ctl_set_lun_unavail(ctsio); retval = 1; goto bailout; } if ((lun->flags & CTL_LUN_PEER_SC_PRIMARY) == 0 && (entry->flags & CTL_CMD_FLAG_OK_ON_UNAVAIL) == 0) { ctl_set_lun_transit(ctsio); retval = 1; goto bailout; } if (softc->ha_mode == CTL_HA_MODE_ACT_STBY && (entry->flags & CTL_CMD_FLAG_OK_ON_STANDBY) == 0) { ctl_set_lun_standby(ctsio); retval = 1; goto bailout; } /* The rest of checks are only done on executing side */ if (softc->ha_mode == CTL_HA_MODE_XFER) goto bailout; } if (entry->pattern & CTL_LUN_PAT_WRITE) { if (lun->be_lun && lun->be_lun->flags & CTL_LUN_FLAG_READONLY) { ctl_set_hw_write_protected(ctsio); retval = 1; goto bailout; } if ((lun->MODE_CTRL.eca_and_aen & SCP_SWP) != 0) { ctl_set_sense(ctsio, /*current_error*/ 1, /*sense_key*/ SSD_KEY_DATA_PROTECT, /*asc*/ 0x27, /*ascq*/ 0x02, SSD_ELEM_NONE); retval = 1; goto bailout; } } /* * Check for a reservation conflict. If this command isn't allowed * even on reserved LUNs, and if this initiator isn't the one who * reserved us, reject the command with a reservation conflict. */ residx = ctl_get_initindex(&ctsio->io_hdr.nexus); if ((lun->flags & CTL_LUN_RESERVED) && ((entry->flags & CTL_CMD_FLAG_ALLOW_ON_RESV) == 0)) { if (lun->res_idx != residx) { ctl_set_reservation_conflict(ctsio); retval = 1; goto bailout; } } if ((lun->flags & CTL_LUN_PR_RESERVED) == 0 || (entry->flags & CTL_CMD_FLAG_ALLOW_ON_PR_RESV)) { /* No reservation or command is allowed. */; } else if ((entry->flags & CTL_CMD_FLAG_ALLOW_ON_PR_WRESV) && (lun->pr_res_type == SPR_TYPE_WR_EX || lun->pr_res_type == SPR_TYPE_WR_EX_RO || lun->pr_res_type == SPR_TYPE_WR_EX_AR)) { /* The command is allowed for Write Exclusive resv. */; } else { /* * if we aren't registered or it's a res holder type * reservation and this isn't the res holder then set a * conflict. */ if (ctl_get_prkey(lun, residx) == 0 || (residx != lun->pr_res_idx && lun->pr_res_type < 4)) { ctl_set_reservation_conflict(ctsio); retval = 1; goto bailout; } } if ((entry->flags & CTL_CMD_FLAG_OK_ON_NO_MEDIA) == 0) { if (lun->flags & CTL_LUN_EJECTED) ctl_set_lun_ejected(ctsio); else if (lun->flags & CTL_LUN_NO_MEDIA) { if (lun->flags & CTL_LUN_REMOVABLE) ctl_set_lun_no_media(ctsio); else ctl_set_lun_int_reqd(ctsio); } else if (lun->flags & CTL_LUN_STOPPED) ctl_set_lun_stopped(ctsio); else goto bailout; retval = 1; goto bailout; } bailout: return (retval); } static void ctl_failover_io(union ctl_io *io, int have_lock) { ctl_set_busy(&io->scsiio); ctl_done(io); } static void ctl_failover_lun(union ctl_io *rio) { struct ctl_softc *softc = CTL_SOFTC(rio); struct ctl_lun *lun; struct ctl_io_hdr *io, *next_io; uint32_t targ_lun; targ_lun = rio->io_hdr.nexus.targ_mapped_lun; CTL_DEBUG_PRINT(("FAILOVER for lun %ju\n", targ_lun)); /* Find and lock the LUN. */ mtx_lock(&softc->ctl_lock); if (targ_lun > CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); return; } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); if (lun->flags & CTL_LUN_DISABLED) { mtx_unlock(&lun->lun_lock); return; } if (softc->ha_mode == CTL_HA_MODE_XFER) { TAILQ_FOREACH_SAFE(io, &lun->ooa_queue, ooa_links, next_io) { /* We are master */ if (io->flags & CTL_FLAG_FROM_OTHER_SC) { if (io->flags & CTL_FLAG_IO_ACTIVE) { io->flags |= CTL_FLAG_ABORT; io->flags |= CTL_FLAG_FAILOVER; } else { /* This can be only due to DATAMOVE */ io->msg_type = CTL_MSG_DATAMOVE_DONE; io->flags &= ~CTL_FLAG_DMA_INPROG; io->flags |= CTL_FLAG_IO_ACTIVE; io->port_status = 31340; ctl_enqueue_isc((union ctl_io *)io); } } /* We are slave */ if (io->flags & CTL_FLAG_SENT_2OTHER_SC) { io->flags &= ~CTL_FLAG_SENT_2OTHER_SC; if (io->flags & CTL_FLAG_IO_ACTIVE) { io->flags |= CTL_FLAG_FAILOVER; } else { ctl_set_busy(&((union ctl_io *)io)-> scsiio); ctl_done((union ctl_io *)io); } } } } else { /* SERIALIZE modes */ TAILQ_FOREACH_SAFE(io, &lun->blocked_queue, blocked_links, next_io) { /* We are master */ if (io->flags & CTL_FLAG_FROM_OTHER_SC) { TAILQ_REMOVE(&lun->blocked_queue, io, blocked_links); io->flags &= ~CTL_FLAG_BLOCKED; TAILQ_REMOVE(&lun->ooa_queue, io, ooa_links); ctl_free_io((union ctl_io *)io); } } TAILQ_FOREACH_SAFE(io, &lun->ooa_queue, ooa_links, next_io) { /* We are master */ if (io->flags & CTL_FLAG_FROM_OTHER_SC) { TAILQ_REMOVE(&lun->ooa_queue, io, ooa_links); ctl_free_io((union ctl_io *)io); } /* We are slave */ if (io->flags & CTL_FLAG_SENT_2OTHER_SC) { io->flags &= ~CTL_FLAG_SENT_2OTHER_SC; if (!(io->flags & CTL_FLAG_IO_ACTIVE)) { ctl_set_busy(&((union ctl_io *)io)-> scsiio); ctl_done((union ctl_io *)io); } } } ctl_check_blocked(lun); } mtx_unlock(&lun->lun_lock); } static int ctl_scsiio_precheck(struct ctl_softc *softc, struct ctl_scsiio *ctsio) { struct ctl_lun *lun; const struct ctl_cmd_entry *entry; uint32_t initidx, targ_lun; int retval = 0; lun = NULL; targ_lun = ctsio->io_hdr.nexus.targ_mapped_lun; if (targ_lun < CTL_MAX_LUNS) lun = softc->ctl_luns[targ_lun]; if (lun) { /* * If the LUN is invalid, pretend that it doesn't exist. * It will go away as soon as all pending I/O has been * completed. */ mtx_lock(&lun->lun_lock); if (lun->flags & CTL_LUN_DISABLED) { mtx_unlock(&lun->lun_lock); lun = NULL; } } CTL_LUN(ctsio) = lun; if (lun) { CTL_BACKEND_LUN(ctsio) = lun->be_lun; /* * Every I/O goes into the OOA queue for a particular LUN, * and stays there until completion. */ #ifdef CTL_TIME_IO if (TAILQ_EMPTY(&lun->ooa_queue)) lun->idle_time += getsbinuptime() - lun->last_busy; #endif TAILQ_INSERT_TAIL(&lun->ooa_queue, &ctsio->io_hdr, ooa_links); } /* Get command entry and return error if it is unsuppotyed. */ entry = ctl_validate_command(ctsio); if (entry == NULL) { if (lun) mtx_unlock(&lun->lun_lock); return (retval); } ctsio->io_hdr.flags &= ~CTL_FLAG_DATA_MASK; ctsio->io_hdr.flags |= entry->flags & CTL_FLAG_DATA_MASK; /* * Check to see whether we can send this command to LUNs that don't * exist. This should pretty much only be the case for inquiry * and request sense. Further checks, below, really require having * a LUN, so we can't really check the command anymore. Just put * it on the rtr queue. */ if (lun == NULL) { if (entry->flags & CTL_CMD_FLAG_OK_ON_NO_LUN) { ctsio->io_hdr.flags |= CTL_FLAG_IS_WAS_ON_RTR; ctl_enqueue_rtr((union ctl_io *)ctsio); return (retval); } ctl_set_unsupported_lun(ctsio); ctl_done((union ctl_io *)ctsio); CTL_DEBUG_PRINT(("ctl_scsiio_precheck: bailing out due to invalid LUN\n")); return (retval); } else { /* * Make sure we support this particular command on this LUN. * e.g., we don't support writes to the control LUN. */ if (!ctl_cmd_applicable(lun->be_lun->lun_type, entry)) { mtx_unlock(&lun->lun_lock); ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (retval); } } initidx = ctl_get_initindex(&ctsio->io_hdr.nexus); /* * If we've got a request sense, it'll clear the contingent * allegiance condition. Otherwise, if we have a CA condition for * this initiator, clear it, because it sent down a command other * than request sense. */ if (ctsio->cdb[0] != REQUEST_SENSE) { struct scsi_sense_data *ps; ps = lun->pending_sense[initidx / CTL_MAX_INIT_PER_PORT]; if (ps != NULL) ps[initidx % CTL_MAX_INIT_PER_PORT].error_code = 0; } /* * If the command has this flag set, it handles its own unit * attention reporting, we shouldn't do anything. Otherwise we * check for any pending unit attentions, and send them back to the * initiator. We only do this when a command initially comes in, * not when we pull it off the blocked queue. * * According to SAM-3, section 5.3.2, the order that things get * presented back to the host is basically unit attentions caused * by some sort of reset event, busy status, reservation conflicts * or task set full, and finally any other status. * * One issue here is that some of the unit attentions we report * don't fall into the "reset" category (e.g. "reported luns data * has changed"). So reporting it here, before the reservation * check, may be technically wrong. I guess the only thing to do * would be to check for and report the reset events here, and then * check for the other unit attention types after we check for a * reservation conflict. * * XXX KDM need to fix this */ if ((entry->flags & CTL_CMD_FLAG_NO_SENSE) == 0) { ctl_ua_type ua_type; u_int sense_len = 0; ua_type = ctl_build_ua(lun, initidx, &ctsio->sense_data, &sense_len, SSD_TYPE_NONE); if (ua_type != CTL_UA_NONE) { mtx_unlock(&lun->lun_lock); ctsio->scsi_status = SCSI_STATUS_CHECK_COND; ctsio->io_hdr.status = CTL_SCSI_ERROR | CTL_AUTOSENSE; ctsio->sense_len = sense_len; ctl_done((union ctl_io *)ctsio); return (retval); } } if (ctl_scsiio_lun_check(lun, entry, ctsio) != 0) { mtx_unlock(&lun->lun_lock); ctl_done((union ctl_io *)ctsio); return (retval); } /* * XXX CHD this is where we want to send IO to other side if * this LUN is secondary on this SC. We will need to make a copy * of the IO and flag the IO on this side as SENT_2OTHER and the flag * the copy we send as FROM_OTHER. * We also need to stuff the address of the original IO so we can * find it easily. Something similar will need be done on the other * side so when we are done we can find the copy. */ if ((lun->flags & CTL_LUN_PRIMARY_SC) == 0 && (lun->flags & CTL_LUN_PEER_SC_PRIMARY) != 0 && (entry->flags & CTL_CMD_FLAG_RUN_HERE) == 0) { union ctl_ha_msg msg_info; int isc_retval; ctsio->io_hdr.flags |= CTL_FLAG_SENT_2OTHER_SC; ctsio->io_hdr.flags &= ~CTL_FLAG_IO_ACTIVE; mtx_unlock(&lun->lun_lock); msg_info.hdr.msg_type = CTL_MSG_SERIALIZE; msg_info.hdr.original_sc = (union ctl_io *)ctsio; msg_info.hdr.serializing_sc = NULL; msg_info.hdr.nexus = ctsio->io_hdr.nexus; msg_info.scsi.tag_num = ctsio->tag_num; msg_info.scsi.tag_type = ctsio->tag_type; msg_info.scsi.cdb_len = ctsio->cdb_len; memcpy(msg_info.scsi.cdb, ctsio->cdb, CTL_MAX_CDBLEN); if ((isc_retval = ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info, sizeof(msg_info.scsi) - sizeof(msg_info.scsi.sense_data), M_WAITOK)) > CTL_HA_STATUS_SUCCESS) { ctl_set_busy(ctsio); ctl_done((union ctl_io *)ctsio); return (retval); } return (retval); } switch (ctl_check_ooa(lun, (union ctl_io *)ctsio, (union ctl_io *)TAILQ_PREV(&ctsio->io_hdr, ctl_ooaq, ooa_links))) { case CTL_ACTION_BLOCK: ctsio->io_hdr.flags |= CTL_FLAG_BLOCKED; TAILQ_INSERT_TAIL(&lun->blocked_queue, &ctsio->io_hdr, blocked_links); mtx_unlock(&lun->lun_lock); return (retval); case CTL_ACTION_PASS: case CTL_ACTION_SKIP: ctsio->io_hdr.flags |= CTL_FLAG_IS_WAS_ON_RTR; mtx_unlock(&lun->lun_lock); ctl_enqueue_rtr((union ctl_io *)ctsio); break; case CTL_ACTION_OVERLAP: mtx_unlock(&lun->lun_lock); ctl_set_overlapped_cmd(ctsio); ctl_done((union ctl_io *)ctsio); break; case CTL_ACTION_OVERLAP_TAG: mtx_unlock(&lun->lun_lock); ctl_set_overlapped_tag(ctsio, ctsio->tag_num & 0xff); ctl_done((union ctl_io *)ctsio); break; case CTL_ACTION_ERROR: default: mtx_unlock(&lun->lun_lock); ctl_set_internal_failure(ctsio, /*sks_valid*/ 0, /*retry_count*/ 0); ctl_done((union ctl_io *)ctsio); break; } return (retval); } const struct ctl_cmd_entry * ctl_get_cmd_entry(struct ctl_scsiio *ctsio, int *sa) { const struct ctl_cmd_entry *entry; int service_action; entry = &ctl_cmd_table[ctsio->cdb[0]]; if (sa) *sa = ((entry->flags & CTL_CMD_FLAG_SA5) != 0); if (entry->flags & CTL_CMD_FLAG_SA5) { service_action = ctsio->cdb[1] & SERVICE_ACTION_MASK; entry = &((const struct ctl_cmd_entry *) entry->execute)[service_action]; } return (entry); } const struct ctl_cmd_entry * ctl_validate_command(struct ctl_scsiio *ctsio) { const struct ctl_cmd_entry *entry; int i, sa; uint8_t diff; entry = ctl_get_cmd_entry(ctsio, &sa); if (entry->execute == NULL) { if (sa) ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ 1, /*bit_valid*/ 1, /*bit*/ 4); else ctl_set_invalid_opcode(ctsio); ctl_done((union ctl_io *)ctsio); return (NULL); } KASSERT(entry->length > 0, ("Not defined length for command 0x%02x/0x%02x", ctsio->cdb[0], ctsio->cdb[1])); for (i = 1; i < entry->length; i++) { diff = ctsio->cdb[i] & ~entry->usage[i - 1]; if (diff == 0) continue; ctl_set_invalid_field(ctsio, /*sks_valid*/ 1, /*command*/ 1, /*field*/ i, /*bit_valid*/ 1, /*bit*/ fls(diff) - 1); ctl_done((union ctl_io *)ctsio); return (NULL); } return (entry); } static int ctl_cmd_applicable(uint8_t lun_type, const struct ctl_cmd_entry *entry) { switch (lun_type) { case T_DIRECT: if ((entry->flags & CTL_CMD_FLAG_OK_ON_DIRECT) == 0) return (0); break; case T_PROCESSOR: if ((entry->flags & CTL_CMD_FLAG_OK_ON_PROC) == 0) return (0); break; case T_CDROM: if ((entry->flags & CTL_CMD_FLAG_OK_ON_CDROM) == 0) return (0); break; default: return (0); } return (1); } static int ctl_scsiio(struct ctl_scsiio *ctsio) { int retval; const struct ctl_cmd_entry *entry; retval = CTL_RETVAL_COMPLETE; CTL_DEBUG_PRINT(("ctl_scsiio cdb[0]=%02X\n", ctsio->cdb[0])); entry = ctl_get_cmd_entry(ctsio, NULL); /* * If this I/O has been aborted, just send it straight to * ctl_done() without executing it. */ if (ctsio->io_hdr.flags & CTL_FLAG_ABORT) { ctl_done((union ctl_io *)ctsio); goto bailout; } /* * All the checks should have been handled by ctl_scsiio_precheck(). * We should be clear now to just execute the I/O. */ retval = entry->execute(ctsio); bailout: return (retval); } static int ctl_target_reset(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_port *port = CTL_PORT(io); struct ctl_lun *lun; uint32_t initidx; ctl_ua_type ua_type; if (!(io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC)) { union ctl_ha_msg msg_info; msg_info.hdr.nexus = io->io_hdr.nexus; msg_info.task.task_action = io->taskio.task_action; msg_info.hdr.msg_type = CTL_MSG_MANAGE_TASKS; msg_info.hdr.original_sc = NULL; msg_info.hdr.serializing_sc = NULL; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info, sizeof(msg_info.task), M_WAITOK); } initidx = ctl_get_initindex(&io->io_hdr.nexus); if (io->taskio.task_action == CTL_TASK_TARGET_RESET) ua_type = CTL_UA_TARG_RESET; else ua_type = CTL_UA_BUS_RESET; mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(lun, &softc->lun_list, links) { if (port != NULL && ctl_lun_map_to_port(port, lun->lun) == UINT32_MAX) continue; ctl_do_lun_reset(lun, initidx, ua_type); } mtx_unlock(&softc->ctl_lock); io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE; return (0); } /* * The LUN should always be set. The I/O is optional, and is used to * distinguish between I/Os sent by this initiator, and by other * initiators. We set unit attention for initiators other than this one. * SAM-3 is vague on this point. It does say that a unit attention should * be established for other initiators when a LUN is reset (see section * 5.7.3), but it doesn't specifically say that the unit attention should * be established for this particular initiator when a LUN is reset. Here * is the relevant text, from SAM-3 rev 8: * * 5.7.2 When a SCSI initiator port aborts its own tasks * * When a SCSI initiator port causes its own task(s) to be aborted, no * notification that the task(s) have been aborted shall be returned to * the SCSI initiator port other than the completion response for the * command or task management function action that caused the task(s) to * be aborted and notification(s) associated with related effects of the * action (e.g., a reset unit attention condition). * * XXX KDM for now, we're setting unit attention for all initiators. */ static void ctl_do_lun_reset(struct ctl_lun *lun, uint32_t initidx, ctl_ua_type ua_type) { union ctl_io *xio; int i; mtx_lock(&lun->lun_lock); /* Abort tasks. */ for (xio = (union ctl_io *)TAILQ_FIRST(&lun->ooa_queue); xio != NULL; xio = (union ctl_io *)TAILQ_NEXT(&xio->io_hdr, ooa_links)) { xio->io_hdr.flags |= CTL_FLAG_ABORT | CTL_FLAG_ABORT_STATUS; } /* Clear CA. */ for (i = 0; i < CTL_MAX_PORTS; i++) { free(lun->pending_sense[i], M_CTL); lun->pending_sense[i] = NULL; } /* Clear reservation. */ lun->flags &= ~CTL_LUN_RESERVED; /* Clear prevent media removal. */ if (lun->prevent) { for (i = 0; i < CTL_MAX_INITIATORS; i++) ctl_clear_mask(lun->prevent, i); lun->prevent_count = 0; } /* Clear TPC status */ ctl_tpc_lun_clear(lun, -1); /* Establish UA. */ #if 0 ctl_est_ua_all(lun, initidx, ua_type); #else ctl_est_ua_all(lun, -1, ua_type); #endif mtx_unlock(&lun->lun_lock); } static int ctl_lun_reset(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_lun *lun; uint32_t targ_lun, initidx; targ_lun = io->io_hdr.nexus.targ_mapped_lun; initidx = ctl_get_initindex(&io->io_hdr.nexus); mtx_lock(&softc->ctl_lock); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); io->taskio.task_status = CTL_TASK_LUN_DOES_NOT_EXIST; return (1); } ctl_do_lun_reset(lun, initidx, CTL_UA_LUN_RESET); mtx_unlock(&softc->ctl_lock); io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE; if ((io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC) == 0) { union ctl_ha_msg msg_info; msg_info.hdr.msg_type = CTL_MSG_MANAGE_TASKS; msg_info.hdr.nexus = io->io_hdr.nexus; msg_info.task.task_action = CTL_TASK_LUN_RESET; msg_info.hdr.original_sc = NULL; msg_info.hdr.serializing_sc = NULL; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info, sizeof(msg_info.task), M_WAITOK); } return (0); } static void ctl_abort_tasks_lun(struct ctl_lun *lun, uint32_t targ_port, uint32_t init_id, int other_sc) { union ctl_io *xio; mtx_assert(&lun->lun_lock, MA_OWNED); /* * Run through the OOA queue and attempt to find the given I/O. * The target port, initiator ID, tag type and tag number have to * match the values that we got from the initiator. If we have an * untagged command to abort, simply abort the first untagged command * we come to. We only allow one untagged command at a time of course. */ for (xio = (union ctl_io *)TAILQ_FIRST(&lun->ooa_queue); xio != NULL; xio = (union ctl_io *)TAILQ_NEXT(&xio->io_hdr, ooa_links)) { if ((targ_port == UINT32_MAX || targ_port == xio->io_hdr.nexus.targ_port) && (init_id == UINT32_MAX || init_id == xio->io_hdr.nexus.initid)) { if (targ_port != xio->io_hdr.nexus.targ_port || init_id != xio->io_hdr.nexus.initid) xio->io_hdr.flags |= CTL_FLAG_ABORT_STATUS; xio->io_hdr.flags |= CTL_FLAG_ABORT; if (!other_sc && !(lun->flags & CTL_LUN_PRIMARY_SC)) { union ctl_ha_msg msg_info; msg_info.hdr.nexus = xio->io_hdr.nexus; msg_info.task.task_action = CTL_TASK_ABORT_TASK; msg_info.task.tag_num = xio->scsiio.tag_num; msg_info.task.tag_type = xio->scsiio.tag_type; msg_info.hdr.msg_type = CTL_MSG_MANAGE_TASKS; msg_info.hdr.original_sc = NULL; msg_info.hdr.serializing_sc = NULL; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info, sizeof(msg_info.task), M_NOWAIT); } } } } static int ctl_abort_task_set(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_lun *lun; uint32_t targ_lun; /* * Look up the LUN. */ targ_lun = io->io_hdr.nexus.targ_mapped_lun; mtx_lock(&softc->ctl_lock); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); io->taskio.task_status = CTL_TASK_LUN_DOES_NOT_EXIST; return (1); } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); if (io->taskio.task_action == CTL_TASK_ABORT_TASK_SET) { ctl_abort_tasks_lun(lun, io->io_hdr.nexus.targ_port, io->io_hdr.nexus.initid, (io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC) != 0); } else { /* CTL_TASK_CLEAR_TASK_SET */ ctl_abort_tasks_lun(lun, UINT32_MAX, UINT32_MAX, (io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC) != 0); } mtx_unlock(&lun->lun_lock); io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE; return (0); } static void ctl_i_t_nexus_loss(struct ctl_softc *softc, uint32_t initidx, ctl_ua_type ua_type) { struct ctl_lun *lun; struct scsi_sense_data *ps; uint32_t p, i; p = initidx / CTL_MAX_INIT_PER_PORT; i = initidx % CTL_MAX_INIT_PER_PORT; mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(lun, &softc->lun_list, links) { mtx_lock(&lun->lun_lock); /* Abort tasks. */ ctl_abort_tasks_lun(lun, p, i, 1); /* Clear CA. */ ps = lun->pending_sense[p]; if (ps != NULL) ps[i].error_code = 0; /* Clear reservation. */ if ((lun->flags & CTL_LUN_RESERVED) && (lun->res_idx == initidx)) lun->flags &= ~CTL_LUN_RESERVED; /* Clear prevent media removal. */ if (lun->prevent && ctl_is_set(lun->prevent, initidx)) { ctl_clear_mask(lun->prevent, initidx); lun->prevent_count--; } /* Clear TPC status */ ctl_tpc_lun_clear(lun, initidx); /* Establish UA. */ ctl_est_ua(lun, initidx, ua_type); mtx_unlock(&lun->lun_lock); } mtx_unlock(&softc->ctl_lock); } static int ctl_i_t_nexus_reset(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); uint32_t initidx; if (!(io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC)) { union ctl_ha_msg msg_info; msg_info.hdr.nexus = io->io_hdr.nexus; msg_info.task.task_action = CTL_TASK_I_T_NEXUS_RESET; msg_info.hdr.msg_type = CTL_MSG_MANAGE_TASKS; msg_info.hdr.original_sc = NULL; msg_info.hdr.serializing_sc = NULL; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info, sizeof(msg_info.task), M_WAITOK); } initidx = ctl_get_initindex(&io->io_hdr.nexus); ctl_i_t_nexus_loss(softc, initidx, CTL_UA_I_T_NEXUS_LOSS); io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE; return (0); } static int ctl_abort_task(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); union ctl_io *xio; struct ctl_lun *lun; #if 0 struct sbuf sb; char printbuf[128]; #endif int found; uint32_t targ_lun; found = 0; /* * Look up the LUN. */ targ_lun = io->io_hdr.nexus.targ_mapped_lun; mtx_lock(&softc->ctl_lock); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); io->taskio.task_status = CTL_TASK_LUN_DOES_NOT_EXIST; return (1); } #if 0 printf("ctl_abort_task: called for lun %lld, tag %d type %d\n", lun->lun, io->taskio.tag_num, io->taskio.tag_type); #endif mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); /* * Run through the OOA queue and attempt to find the given I/O. * The target port, initiator ID, tag type and tag number have to * match the values that we got from the initiator. If we have an * untagged command to abort, simply abort the first untagged command * we come to. We only allow one untagged command at a time of course. */ for (xio = (union ctl_io *)TAILQ_FIRST(&lun->ooa_queue); xio != NULL; xio = (union ctl_io *)TAILQ_NEXT(&xio->io_hdr, ooa_links)) { #if 0 sbuf_new(&sb, printbuf, sizeof(printbuf), SBUF_FIXEDLEN); sbuf_printf(&sb, "LUN %lld tag %d type %d%s%s%s%s: ", lun->lun, xio->scsiio.tag_num, xio->scsiio.tag_type, (xio->io_hdr.blocked_links.tqe_prev == NULL) ? "" : " BLOCKED", (xio->io_hdr.flags & CTL_FLAG_DMA_INPROG) ? " DMA" : "", (xio->io_hdr.flags & CTL_FLAG_ABORT) ? " ABORT" : "", (xio->io_hdr.flags & CTL_FLAG_IS_WAS_ON_RTR ? " RTR" : "")); ctl_scsi_command_string(&xio->scsiio, NULL, &sb); sbuf_finish(&sb); printf("%s\n", sbuf_data(&sb)); #endif if ((xio->io_hdr.nexus.targ_port != io->io_hdr.nexus.targ_port) || (xio->io_hdr.nexus.initid != io->io_hdr.nexus.initid) || (xio->io_hdr.flags & CTL_FLAG_ABORT)) continue; /* * If the abort says that the task is untagged, the * task in the queue must be untagged. Otherwise, * we just check to see whether the tag numbers * match. This is because the QLogic firmware * doesn't pass back the tag type in an abort * request. */ #if 0 if (((xio->scsiio.tag_type == CTL_TAG_UNTAGGED) && (io->taskio.tag_type == CTL_TAG_UNTAGGED)) || (xio->scsiio.tag_num == io->taskio.tag_num)) #endif /* * XXX KDM we've got problems with FC, because it * doesn't send down a tag type with aborts. So we * can only really go by the tag number... * This may cause problems with parallel SCSI. * Need to figure that out!! */ if (xio->scsiio.tag_num == io->taskio.tag_num) { xio->io_hdr.flags |= CTL_FLAG_ABORT; found = 1; if ((io->io_hdr.flags & CTL_FLAG_FROM_OTHER_SC) == 0 && !(lun->flags & CTL_LUN_PRIMARY_SC)) { union ctl_ha_msg msg_info; msg_info.hdr.nexus = io->io_hdr.nexus; msg_info.task.task_action = CTL_TASK_ABORT_TASK; msg_info.task.tag_num = io->taskio.tag_num; msg_info.task.tag_type = io->taskio.tag_type; msg_info.hdr.msg_type = CTL_MSG_MANAGE_TASKS; msg_info.hdr.original_sc = NULL; msg_info.hdr.serializing_sc = NULL; #if 0 printf("Sent Abort to other side\n"); #endif ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg_info, sizeof(msg_info.task), M_NOWAIT); } #if 0 printf("ctl_abort_task: found I/O to abort\n"); #endif } } mtx_unlock(&lun->lun_lock); if (found == 0) { /* * This isn't really an error. It's entirely possible for * the abort and command completion to cross on the wire. * This is more of an informative/diagnostic error. */ #if 0 printf("ctl_abort_task: ABORT sent for nonexistent I/O: " "%u:%u:%u tag %d type %d\n", io->io_hdr.nexus.initid, io->io_hdr.nexus.targ_port, io->io_hdr.nexus.targ_lun, io->taskio.tag_num, io->taskio.tag_type); #endif } io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE; return (0); } static int ctl_query_task(union ctl_io *io, int task_set) { struct ctl_softc *softc = CTL_SOFTC(io); union ctl_io *xio; struct ctl_lun *lun; int found = 0; uint32_t targ_lun; targ_lun = io->io_hdr.nexus.targ_mapped_lun; mtx_lock(&softc->ctl_lock); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); io->taskio.task_status = CTL_TASK_LUN_DOES_NOT_EXIST; return (1); } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); for (xio = (union ctl_io *)TAILQ_FIRST(&lun->ooa_queue); xio != NULL; xio = (union ctl_io *)TAILQ_NEXT(&xio->io_hdr, ooa_links)) { if ((xio->io_hdr.nexus.targ_port != io->io_hdr.nexus.targ_port) || (xio->io_hdr.nexus.initid != io->io_hdr.nexus.initid) || (xio->io_hdr.flags & CTL_FLAG_ABORT)) continue; if (task_set || xio->scsiio.tag_num == io->taskio.tag_num) { found = 1; break; } } mtx_unlock(&lun->lun_lock); if (found) io->taskio.task_status = CTL_TASK_FUNCTION_SUCCEEDED; else io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE; return (0); } static int ctl_query_async_event(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_lun *lun; ctl_ua_type ua; uint32_t targ_lun, initidx; targ_lun = io->io_hdr.nexus.targ_mapped_lun; mtx_lock(&softc->ctl_lock); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); io->taskio.task_status = CTL_TASK_LUN_DOES_NOT_EXIST; return (1); } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); initidx = ctl_get_initindex(&io->io_hdr.nexus); ua = ctl_build_qae(lun, initidx, io->taskio.task_resp); mtx_unlock(&lun->lun_lock); if (ua != CTL_UA_NONE) io->taskio.task_status = CTL_TASK_FUNCTION_SUCCEEDED; else io->taskio.task_status = CTL_TASK_FUNCTION_COMPLETE; return (0); } static void ctl_run_task(union ctl_io *io) { int retval = 1; CTL_DEBUG_PRINT(("ctl_run_task\n")); KASSERT(io->io_hdr.io_type == CTL_IO_TASK, ("ctl_run_task: Unextected io_type %d\n", io->io_hdr.io_type)); io->taskio.task_status = CTL_TASK_FUNCTION_NOT_SUPPORTED; bzero(io->taskio.task_resp, sizeof(io->taskio.task_resp)); switch (io->taskio.task_action) { case CTL_TASK_ABORT_TASK: retval = ctl_abort_task(io); break; case CTL_TASK_ABORT_TASK_SET: case CTL_TASK_CLEAR_TASK_SET: retval = ctl_abort_task_set(io); break; case CTL_TASK_CLEAR_ACA: break; case CTL_TASK_I_T_NEXUS_RESET: retval = ctl_i_t_nexus_reset(io); break; case CTL_TASK_LUN_RESET: retval = ctl_lun_reset(io); break; case CTL_TASK_TARGET_RESET: case CTL_TASK_BUS_RESET: retval = ctl_target_reset(io); break; case CTL_TASK_PORT_LOGIN: break; case CTL_TASK_PORT_LOGOUT: break; case CTL_TASK_QUERY_TASK: retval = ctl_query_task(io, 0); break; case CTL_TASK_QUERY_TASK_SET: retval = ctl_query_task(io, 1); break; case CTL_TASK_QUERY_ASYNC_EVENT: retval = ctl_query_async_event(io); break; default: printf("%s: got unknown task management event %d\n", __func__, io->taskio.task_action); break; } if (retval == 0) io->io_hdr.status = CTL_SUCCESS; else io->io_hdr.status = CTL_ERROR; ctl_done(io); } /* * For HA operation. Handle commands that come in from the other * controller. */ static void ctl_handle_isc(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_lun *lun; const struct ctl_cmd_entry *entry; uint32_t targ_lun; targ_lun = io->io_hdr.nexus.targ_mapped_lun; switch (io->io_hdr.msg_type) { case CTL_MSG_SERIALIZE: ctl_serialize_other_sc_cmd(&io->scsiio); break; case CTL_MSG_R2R: /* Only used in SER_ONLY mode. */ entry = ctl_get_cmd_entry(&io->scsiio, NULL); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { ctl_done(io); break; } mtx_lock(&lun->lun_lock); if (ctl_scsiio_lun_check(lun, entry, &io->scsiio) != 0) { mtx_unlock(&lun->lun_lock); ctl_done(io); break; } io->io_hdr.flags |= CTL_FLAG_IS_WAS_ON_RTR; mtx_unlock(&lun->lun_lock); ctl_enqueue_rtr(io); break; case CTL_MSG_FINISH_IO: if (softc->ha_mode == CTL_HA_MODE_XFER) { ctl_done(io); break; } if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { ctl_free_io(io); break; } mtx_lock(&lun->lun_lock); TAILQ_REMOVE(&lun->ooa_queue, &io->io_hdr, ooa_links); ctl_check_blocked(lun); mtx_unlock(&lun->lun_lock); ctl_free_io(io); break; case CTL_MSG_PERS_ACTION: ctl_hndl_per_res_out_on_other_sc(io); ctl_free_io(io); break; case CTL_MSG_BAD_JUJU: ctl_done(io); break; case CTL_MSG_DATAMOVE: /* Only used in XFER mode */ ctl_datamove_remote(io); break; case CTL_MSG_DATAMOVE_DONE: /* Only used in XFER mode */ io->scsiio.be_move_done(io); break; case CTL_MSG_FAILOVER: ctl_failover_lun(io); ctl_free_io(io); break; default: printf("%s: Invalid message type %d\n", __func__, io->io_hdr.msg_type); ctl_free_io(io); break; } } /* * Returns the match type in the case of a match, or CTL_LUN_PAT_NONE if * there is no match. */ static ctl_lun_error_pattern ctl_cmd_pattern_match(struct ctl_scsiio *ctsio, struct ctl_error_desc *desc) { const struct ctl_cmd_entry *entry; ctl_lun_error_pattern filtered_pattern, pattern; pattern = desc->error_pattern; /* * XXX KDM we need more data passed into this function to match a * custom pattern, and we actually need to implement custom pattern * matching. */ if (pattern & CTL_LUN_PAT_CMD) return (CTL_LUN_PAT_CMD); if ((pattern & CTL_LUN_PAT_MASK) == CTL_LUN_PAT_ANY) return (CTL_LUN_PAT_ANY); entry = ctl_get_cmd_entry(ctsio, NULL); filtered_pattern = entry->pattern & pattern; /* * If the user requested specific flags in the pattern (e.g. * CTL_LUN_PAT_RANGE), make sure the command supports all of those * flags. * * If the user did not specify any flags, it doesn't matter whether * or not the command supports the flags. */ if ((filtered_pattern & ~CTL_LUN_PAT_MASK) != (pattern & ~CTL_LUN_PAT_MASK)) return (CTL_LUN_PAT_NONE); /* * If the user asked for a range check, see if the requested LBA * range overlaps with this command's LBA range. */ if (filtered_pattern & CTL_LUN_PAT_RANGE) { uint64_t lba1; uint64_t len1; ctl_action action; int retval; retval = ctl_get_lba_len((union ctl_io *)ctsio, &lba1, &len1); if (retval != 0) return (CTL_LUN_PAT_NONE); action = ctl_extent_check_lba(lba1, len1, desc->lba_range.lba, desc->lba_range.len, FALSE); /* * A "pass" means that the LBA ranges don't overlap, so * this doesn't match the user's range criteria. */ if (action == CTL_ACTION_PASS) return (CTL_LUN_PAT_NONE); } return (filtered_pattern); } static void ctl_inject_error(struct ctl_lun *lun, union ctl_io *io) { struct ctl_error_desc *desc, *desc2; mtx_assert(&lun->lun_lock, MA_OWNED); STAILQ_FOREACH_SAFE(desc, &lun->error_list, links, desc2) { ctl_lun_error_pattern pattern; /* * Check to see whether this particular command matches * the pattern in the descriptor. */ pattern = ctl_cmd_pattern_match(&io->scsiio, desc); if ((pattern & CTL_LUN_PAT_MASK) == CTL_LUN_PAT_NONE) continue; switch (desc->lun_error & CTL_LUN_INJ_TYPE) { case CTL_LUN_INJ_ABORTED: ctl_set_aborted(&io->scsiio); break; case CTL_LUN_INJ_MEDIUM_ERR: ctl_set_medium_error(&io->scsiio, (io->io_hdr.flags & CTL_FLAG_DATA_MASK) != CTL_FLAG_DATA_OUT); break; case CTL_LUN_INJ_UA: /* 29h/00h POWER ON, RESET, OR BUS DEVICE RESET * OCCURRED */ ctl_set_ua(&io->scsiio, 0x29, 0x00); break; case CTL_LUN_INJ_CUSTOM: /* * We're assuming the user knows what he is doing. * Just copy the sense information without doing * checks. */ bcopy(&desc->custom_sense, &io->scsiio.sense_data, MIN(sizeof(desc->custom_sense), sizeof(io->scsiio.sense_data))); io->scsiio.scsi_status = SCSI_STATUS_CHECK_COND; io->scsiio.sense_len = SSD_FULL_SIZE; io->io_hdr.status = CTL_SCSI_ERROR | CTL_AUTOSENSE; break; case CTL_LUN_INJ_NONE: default: /* * If this is an error injection type we don't know * about, clear the continuous flag (if it is set) * so it will get deleted below. */ desc->lun_error &= ~CTL_LUN_INJ_CONTINUOUS; break; } /* * By default, each error injection action is a one-shot */ if (desc->lun_error & CTL_LUN_INJ_CONTINUOUS) continue; STAILQ_REMOVE(&lun->error_list, desc, ctl_error_desc, links); free(desc, M_CTL); } } #ifdef CTL_IO_DELAY static void ctl_datamove_timer_wakeup(void *arg) { union ctl_io *io; io = (union ctl_io *)arg; ctl_datamove(io); } #endif /* CTL_IO_DELAY */ void ctl_datamove(union ctl_io *io) { void (*fe_datamove)(union ctl_io *io); mtx_assert(&((struct ctl_softc *)CTL_SOFTC(io))->ctl_lock, MA_NOTOWNED); CTL_DEBUG_PRINT(("ctl_datamove\n")); /* No data transferred yet. Frontend must update this when done. */ io->scsiio.kern_data_resid = io->scsiio.kern_data_len; #ifdef CTL_TIME_IO if ((time_uptime - io->io_hdr.start_time) > ctl_time_io_secs) { char str[256]; char path_str[64]; struct sbuf sb; ctl_scsi_path_string(io, path_str, sizeof(path_str)); sbuf_new(&sb, str, sizeof(str), SBUF_FIXEDLEN); sbuf_cat(&sb, path_str); switch (io->io_hdr.io_type) { case CTL_IO_SCSI: ctl_scsi_command_string(&io->scsiio, NULL, &sb); sbuf_printf(&sb, "\n"); sbuf_cat(&sb, path_str); sbuf_printf(&sb, "Tag: 0x%04x, type %d\n", io->scsiio.tag_num, io->scsiio.tag_type); break; case CTL_IO_TASK: sbuf_printf(&sb, "Task I/O type: %d, Tag: 0x%04x, " "Tag Type: %d\n", io->taskio.task_action, io->taskio.tag_num, io->taskio.tag_type); break; default: panic("%s: Invalid CTL I/O type %d\n", __func__, io->io_hdr.io_type); } sbuf_cat(&sb, path_str); sbuf_printf(&sb, "ctl_datamove: %jd seconds\n", (intmax_t)time_uptime - io->io_hdr.start_time); sbuf_finish(&sb); printf("%s", sbuf_data(&sb)); } #endif /* CTL_TIME_IO */ #ifdef CTL_IO_DELAY if (io->io_hdr.flags & CTL_FLAG_DELAY_DONE) { io->io_hdr.flags &= ~CTL_FLAG_DELAY_DONE; } else { struct ctl_lun *lun; lun = CTL_LUN(io); if ((lun != NULL) && (lun->delay_info.datamove_delay > 0)) { callout_init(&io->io_hdr.delay_callout, /*mpsafe*/ 1); io->io_hdr.flags |= CTL_FLAG_DELAY_DONE; callout_reset(&io->io_hdr.delay_callout, lun->delay_info.datamove_delay * hz, ctl_datamove_timer_wakeup, io); if (lun->delay_info.datamove_type == CTL_DELAY_TYPE_ONESHOT) lun->delay_info.datamove_delay = 0; return; } } #endif /* * This command has been aborted. Set the port status, so we fail * the data move. */ if (io->io_hdr.flags & CTL_FLAG_ABORT) { printf("ctl_datamove: tag 0x%04x on (%u:%u:%u) aborted\n", io->scsiio.tag_num, io->io_hdr.nexus.initid, io->io_hdr.nexus.targ_port, io->io_hdr.nexus.targ_lun); io->io_hdr.port_status = 31337; /* * Note that the backend, in this case, will get the * callback in its context. In other cases it may get * called in the frontend's interrupt thread context. */ io->scsiio.be_move_done(io); return; } /* Don't confuse frontend with zero length data move. */ if (io->scsiio.kern_data_len == 0) { io->scsiio.be_move_done(io); return; } fe_datamove = CTL_PORT(io)->fe_datamove; fe_datamove(io); } static void ctl_send_datamove_done(union ctl_io *io, int have_lock) { union ctl_ha_msg msg; #ifdef CTL_TIME_IO struct bintime cur_bt; #endif memset(&msg, 0, sizeof(msg)); msg.hdr.msg_type = CTL_MSG_DATAMOVE_DONE; msg.hdr.original_sc = io; msg.hdr.serializing_sc = io->io_hdr.serializing_sc; msg.hdr.nexus = io->io_hdr.nexus; msg.hdr.status = io->io_hdr.status; msg.scsi.kern_data_resid = io->scsiio.kern_data_resid; msg.scsi.tag_num = io->scsiio.tag_num; msg.scsi.tag_type = io->scsiio.tag_type; msg.scsi.scsi_status = io->scsiio.scsi_status; memcpy(&msg.scsi.sense_data, &io->scsiio.sense_data, io->scsiio.sense_len); msg.scsi.sense_len = io->scsiio.sense_len; msg.scsi.port_status = io->io_hdr.port_status; io->io_hdr.flags &= ~CTL_FLAG_IO_ACTIVE; if (io->io_hdr.flags & CTL_FLAG_FAILOVER) { ctl_failover_io(io, /*have_lock*/ have_lock); return; } ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg, sizeof(msg.scsi) - sizeof(msg.scsi.sense_data) + msg.scsi.sense_len, M_WAITOK); #ifdef CTL_TIME_IO getbinuptime(&cur_bt); bintime_sub(&cur_bt, &io->io_hdr.dma_start_bt); bintime_add(&io->io_hdr.dma_bt, &cur_bt); #endif io->io_hdr.num_dmas++; } /* * The DMA to the remote side is done, now we need to tell the other side * we're done so it can continue with its data movement. */ static void ctl_datamove_remote_write_cb(struct ctl_ha_dt_req *rq) { union ctl_io *io; uint32_t i; io = rq->context; if (rq->ret != CTL_HA_STATUS_SUCCESS) { printf("%s: ISC DMA write failed with error %d", __func__, rq->ret); ctl_set_internal_failure(&io->scsiio, /*sks_valid*/ 1, /*retry_count*/ rq->ret); } ctl_dt_req_free(rq); for (i = 0; i < io->scsiio.kern_sg_entries; i++) free(io->io_hdr.local_sglist[i].addr, M_CTL); free(io->io_hdr.remote_sglist, M_CTL); io->io_hdr.remote_sglist = NULL; io->io_hdr.local_sglist = NULL; /* * The data is in local and remote memory, so now we need to send * status (good or back) back to the other side. */ ctl_send_datamove_done(io, /*have_lock*/ 0); } /* * We've moved the data from the host/controller into local memory. Now we * need to push it over to the remote controller's memory. */ static int ctl_datamove_remote_dm_write_cb(union ctl_io *io) { int retval; retval = ctl_datamove_remote_xfer(io, CTL_HA_DT_CMD_WRITE, ctl_datamove_remote_write_cb); return (retval); } static void ctl_datamove_remote_write(union ctl_io *io) { int retval; void (*fe_datamove)(union ctl_io *io); /* * - Get the data from the host/HBA into local memory. * - DMA memory from the local controller to the remote controller. * - Send status back to the remote controller. */ retval = ctl_datamove_remote_sgl_setup(io); if (retval != 0) return; /* Switch the pointer over so the FETD knows what to do */ io->scsiio.kern_data_ptr = (uint8_t *)io->io_hdr.local_sglist; /* * Use a custom move done callback, since we need to send completion * back to the other controller, not to the backend on this side. */ io->scsiio.be_move_done = ctl_datamove_remote_dm_write_cb; fe_datamove = CTL_PORT(io)->fe_datamove; fe_datamove(io); } static int ctl_datamove_remote_dm_read_cb(union ctl_io *io) { #if 0 char str[256]; char path_str[64]; struct sbuf sb; #endif uint32_t i; for (i = 0; i < io->scsiio.kern_sg_entries; i++) free(io->io_hdr.local_sglist[i].addr, M_CTL); free(io->io_hdr.remote_sglist, M_CTL); io->io_hdr.remote_sglist = NULL; io->io_hdr.local_sglist = NULL; #if 0 scsi_path_string(io, path_str, sizeof(path_str)); sbuf_new(&sb, str, sizeof(str), SBUF_FIXEDLEN); sbuf_cat(&sb, path_str); scsi_command_string(&io->scsiio, NULL, &sb); sbuf_printf(&sb, "\n"); sbuf_cat(&sb, path_str); sbuf_printf(&sb, "Tag: 0x%04x, type %d\n", io->scsiio.tag_num, io->scsiio.tag_type); sbuf_cat(&sb, path_str); sbuf_printf(&sb, "%s: flags %#x, status %#x\n", __func__, io->io_hdr.flags, io->io_hdr.status); sbuf_finish(&sb); printk("%s", sbuf_data(&sb)); #endif /* * The read is done, now we need to send status (good or bad) back * to the other side. */ ctl_send_datamove_done(io, /*have_lock*/ 0); return (0); } static void ctl_datamove_remote_read_cb(struct ctl_ha_dt_req *rq) { union ctl_io *io; void (*fe_datamove)(union ctl_io *io); io = rq->context; if (rq->ret != CTL_HA_STATUS_SUCCESS) { printf("%s: ISC DMA read failed with error %d\n", __func__, rq->ret); ctl_set_internal_failure(&io->scsiio, /*sks_valid*/ 1, /*retry_count*/ rq->ret); } ctl_dt_req_free(rq); /* Switch the pointer over so the FETD knows what to do */ io->scsiio.kern_data_ptr = (uint8_t *)io->io_hdr.local_sglist; /* * Use a custom move done callback, since we need to send completion * back to the other controller, not to the backend on this side. */ io->scsiio.be_move_done = ctl_datamove_remote_dm_read_cb; /* XXX KDM add checks like the ones in ctl_datamove? */ fe_datamove = CTL_PORT(io)->fe_datamove; fe_datamove(io); } static int ctl_datamove_remote_sgl_setup(union ctl_io *io) { struct ctl_sg_entry *local_sglist; uint32_t len_to_go; int retval; int i; retval = 0; local_sglist = io->io_hdr.local_sglist; len_to_go = io->scsiio.kern_data_len; /* * The difficult thing here is that the size of the various * S/G segments may be different than the size from the * remote controller. That'll make it harder when DMAing * the data back to the other side. */ for (i = 0; len_to_go > 0; i++) { local_sglist[i].len = MIN(len_to_go, CTL_HA_DATAMOVE_SEGMENT); local_sglist[i].addr = malloc(local_sglist[i].len, M_CTL, M_WAITOK); len_to_go -= local_sglist[i].len; } /* * Reset the number of S/G entries accordingly. The original * number of S/G entries is available in rem_sg_entries. */ io->scsiio.kern_sg_entries = i; #if 0 printf("%s: kern_sg_entries = %d\n", __func__, io->scsiio.kern_sg_entries); for (i = 0; i < io->scsiio.kern_sg_entries; i++) printf("%s: sg[%d] = %p, %lu\n", __func__, i, local_sglist[i].addr, local_sglist[i].len); #endif return (retval); } static int ctl_datamove_remote_xfer(union ctl_io *io, unsigned command, ctl_ha_dt_cb callback) { struct ctl_ha_dt_req *rq; struct ctl_sg_entry *remote_sglist, *local_sglist; uint32_t local_used, remote_used, total_used; int i, j, isc_ret; rq = ctl_dt_req_alloc(); /* * If we failed to allocate the request, and if the DMA didn't fail * anyway, set busy status. This is just a resource allocation * failure. */ if ((rq == NULL) && ((io->io_hdr.status & CTL_STATUS_MASK) != CTL_STATUS_NONE && (io->io_hdr.status & CTL_STATUS_MASK) != CTL_SUCCESS)) ctl_set_busy(&io->scsiio); if ((io->io_hdr.status & CTL_STATUS_MASK) != CTL_STATUS_NONE && (io->io_hdr.status & CTL_STATUS_MASK) != CTL_SUCCESS) { if (rq != NULL) ctl_dt_req_free(rq); /* * The data move failed. We need to return status back * to the other controller. No point in trying to DMA * data to the remote controller. */ ctl_send_datamove_done(io, /*have_lock*/ 0); return (1); } local_sglist = io->io_hdr.local_sglist; remote_sglist = io->io_hdr.remote_sglist; local_used = 0; remote_used = 0; total_used = 0; /* * Pull/push the data over the wire from/to the other controller. * This takes into account the possibility that the local and * remote sglists may not be identical in terms of the size of * the elements and the number of elements. * * One fundamental assumption here is that the length allocated for * both the local and remote sglists is identical. Otherwise, we've * essentially got a coding error of some sort. */ isc_ret = CTL_HA_STATUS_SUCCESS; for (i = 0, j = 0; total_used < io->scsiio.kern_data_len; ) { uint32_t cur_len; uint8_t *tmp_ptr; rq->command = command; rq->context = io; /* * Both pointers should be aligned. But it is possible * that the allocation length is not. They should both * also have enough slack left over at the end, though, * to round up to the next 8 byte boundary. */ cur_len = MIN(local_sglist[i].len - local_used, remote_sglist[j].len - remote_used); rq->size = cur_len; tmp_ptr = (uint8_t *)local_sglist[i].addr; tmp_ptr += local_used; #if 0 /* Use physical addresses when talking to ISC hardware */ if ((io->io_hdr.flags & CTL_FLAG_BUS_ADDR) == 0) { /* XXX KDM use busdma */ rq->local = vtophys(tmp_ptr); } else rq->local = tmp_ptr; #else KASSERT((io->io_hdr.flags & CTL_FLAG_BUS_ADDR) == 0, ("HA does not support BUS_ADDR")); rq->local = tmp_ptr; #endif tmp_ptr = (uint8_t *)remote_sglist[j].addr; tmp_ptr += remote_used; rq->remote = tmp_ptr; rq->callback = NULL; local_used += cur_len; if (local_used >= local_sglist[i].len) { i++; local_used = 0; } remote_used += cur_len; if (remote_used >= remote_sglist[j].len) { j++; remote_used = 0; } total_used += cur_len; if (total_used >= io->scsiio.kern_data_len) rq->callback = callback; #if 0 printf("%s: %s: local %p remote %p size %d\n", __func__, (command == CTL_HA_DT_CMD_WRITE) ? "WRITE" : "READ", rq->local, rq->remote, rq->size); #endif isc_ret = ctl_dt_single(rq); if (isc_ret > CTL_HA_STATUS_SUCCESS) break; } if (isc_ret != CTL_HA_STATUS_WAIT) { rq->ret = isc_ret; callback(rq); } return (0); } static void ctl_datamove_remote_read(union ctl_io *io) { int retval; uint32_t i; /* * This will send an error to the other controller in the case of a * failure. */ retval = ctl_datamove_remote_sgl_setup(io); if (retval != 0) return; retval = ctl_datamove_remote_xfer(io, CTL_HA_DT_CMD_READ, ctl_datamove_remote_read_cb); if (retval != 0) { /* * Make sure we free memory if there was an error.. The * ctl_datamove_remote_xfer() function will send the * datamove done message, or call the callback with an * error if there is a problem. */ for (i = 0; i < io->scsiio.kern_sg_entries; i++) free(io->io_hdr.local_sglist[i].addr, M_CTL); free(io->io_hdr.remote_sglist, M_CTL); io->io_hdr.remote_sglist = NULL; io->io_hdr.local_sglist = NULL; } } /* * Process a datamove request from the other controller. This is used for * XFER mode only, not SER_ONLY mode. For writes, we DMA into local memory * first. Once that is complete, the data gets DMAed into the remote * controller's memory. For reads, we DMA from the remote controller's * memory into our memory first, and then move it out to the FETD. */ static void ctl_datamove_remote(union ctl_io *io) { mtx_assert(&((struct ctl_softc *)CTL_SOFTC(io))->ctl_lock, MA_NOTOWNED); if (io->io_hdr.flags & CTL_FLAG_FAILOVER) { ctl_failover_io(io, /*have_lock*/ 0); return; } /* * Note that we look for an aborted I/O here, but don't do some of * the other checks that ctl_datamove() normally does. * We don't need to run the datamove delay code, since that should * have been done if need be on the other controller. */ if (io->io_hdr.flags & CTL_FLAG_ABORT) { printf("%s: tag 0x%04x on (%u:%u:%u) aborted\n", __func__, io->scsiio.tag_num, io->io_hdr.nexus.initid, io->io_hdr.nexus.targ_port, io->io_hdr.nexus.targ_lun); io->io_hdr.port_status = 31338; ctl_send_datamove_done(io, /*have_lock*/ 0); return; } if ((io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_OUT) ctl_datamove_remote_write(io); else if ((io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_IN) ctl_datamove_remote_read(io); else { io->io_hdr.port_status = 31339; ctl_send_datamove_done(io, /*have_lock*/ 0); } } static void ctl_process_done(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_port *port = CTL_PORT(io); struct ctl_lun *lun = CTL_LUN(io); void (*fe_done)(union ctl_io *io); union ctl_ha_msg msg; CTL_DEBUG_PRINT(("ctl_process_done\n")); fe_done = port->fe_done; #ifdef CTL_TIME_IO if ((time_uptime - io->io_hdr.start_time) > ctl_time_io_secs) { char str[256]; char path_str[64]; struct sbuf sb; ctl_scsi_path_string(io, path_str, sizeof(path_str)); sbuf_new(&sb, str, sizeof(str), SBUF_FIXEDLEN); sbuf_cat(&sb, path_str); switch (io->io_hdr.io_type) { case CTL_IO_SCSI: ctl_scsi_command_string(&io->scsiio, NULL, &sb); sbuf_printf(&sb, "\n"); sbuf_cat(&sb, path_str); sbuf_printf(&sb, "Tag: 0x%04x, type %d\n", io->scsiio.tag_num, io->scsiio.tag_type); break; case CTL_IO_TASK: sbuf_printf(&sb, "Task I/O type: %d, Tag: 0x%04x, " "Tag Type: %d\n", io->taskio.task_action, io->taskio.tag_num, io->taskio.tag_type); break; default: panic("%s: Invalid CTL I/O type %d\n", __func__, io->io_hdr.io_type); } sbuf_cat(&sb, path_str); sbuf_printf(&sb, "ctl_process_done: %jd seconds\n", (intmax_t)time_uptime - io->io_hdr.start_time); sbuf_finish(&sb); printf("%s", sbuf_data(&sb)); } #endif /* CTL_TIME_IO */ switch (io->io_hdr.io_type) { case CTL_IO_SCSI: break; case CTL_IO_TASK: if (ctl_debug & CTL_DEBUG_INFO) ctl_io_error_print(io, NULL); fe_done(io); return; default: panic("%s: Invalid CTL I/O type %d\n", __func__, io->io_hdr.io_type); } if (lun == NULL) { CTL_DEBUG_PRINT(("NULL LUN for lun %d\n", io->io_hdr.nexus.targ_mapped_lun)); goto bailout; } mtx_lock(&lun->lun_lock); /* * Check to see if we have any informational exception and status * of this command can be modified to report it in form of either * RECOVERED ERROR or NO SENSE, depending on MRIE mode page field. */ if (lun->ie_reported == 0 && lun->ie_asc != 0 && io->io_hdr.status == CTL_SUCCESS && (io->io_hdr.flags & CTL_FLAG_STATUS_SENT) == 0) { uint8_t mrie = lun->MODE_IE.mrie; uint8_t per = ((lun->MODE_RWER.byte3 & SMS_RWER_PER) || (lun->MODE_VER.byte3 & SMS_VER_PER)); if (((mrie == SIEP_MRIE_REC_COND && per) || mrie == SIEP_MRIE_REC_UNCOND || mrie == SIEP_MRIE_NO_SENSE) && (ctl_get_cmd_entry(&io->scsiio, NULL)->flags & CTL_CMD_FLAG_NO_SENSE) == 0) { ctl_set_sense(&io->scsiio, /*current_error*/ 1, /*sense_key*/ (mrie == SIEP_MRIE_NO_SENSE) ? SSD_KEY_NO_SENSE : SSD_KEY_RECOVERED_ERROR, /*asc*/ lun->ie_asc, /*ascq*/ lun->ie_ascq, SSD_ELEM_NONE); lun->ie_reported = 1; } } else if (lun->ie_reported < 0) lun->ie_reported = 0; /* * Check to see if we have any errors to inject here. We only * inject errors for commands that don't already have errors set. */ if (!STAILQ_EMPTY(&lun->error_list) && ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS) && ((io->io_hdr.flags & CTL_FLAG_STATUS_SENT) == 0)) ctl_inject_error(lun, io); /* * XXX KDM how do we treat commands that aren't completed * successfully? * * XXX KDM should we also track I/O latency? */ if ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS && io->io_hdr.io_type == CTL_IO_SCSI) { int type; #ifdef CTL_TIME_IO struct bintime bt; getbinuptime(&bt); bintime_sub(&bt, &io->io_hdr.start_bt); #endif if ((io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_IN) type = CTL_STATS_READ; else if ((io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_OUT) type = CTL_STATS_WRITE; else type = CTL_STATS_NO_IO; #ifdef CTL_LEGACY_STATS uint32_t targ_port = port->targ_port; lun->legacy_stats.ports[targ_port].bytes[type] += io->scsiio.kern_total_len; lun->legacy_stats.ports[targ_port].operations[type] ++; lun->legacy_stats.ports[targ_port].num_dmas[type] += io->io_hdr.num_dmas; #ifdef CTL_TIME_IO bintime_add(&lun->legacy_stats.ports[targ_port].dma_time[type], &io->io_hdr.dma_bt); bintime_add(&lun->legacy_stats.ports[targ_port].time[type], &bt); #endif #endif /* CTL_LEGACY_STATS */ lun->stats.bytes[type] += io->scsiio.kern_total_len; lun->stats.operations[type] ++; lun->stats.dmas[type] += io->io_hdr.num_dmas; #ifdef CTL_TIME_IO bintime_add(&lun->stats.dma_time[type], &io->io_hdr.dma_bt); bintime_add(&lun->stats.time[type], &bt); #endif mtx_lock(&port->port_lock); port->stats.bytes[type] += io->scsiio.kern_total_len; port->stats.operations[type] ++; port->stats.dmas[type] += io->io_hdr.num_dmas; #ifdef CTL_TIME_IO bintime_add(&port->stats.dma_time[type], &io->io_hdr.dma_bt); bintime_add(&port->stats.time[type], &bt); #endif mtx_unlock(&port->port_lock); } /* * Remove this from the OOA queue. */ TAILQ_REMOVE(&lun->ooa_queue, &io->io_hdr, ooa_links); #ifdef CTL_TIME_IO if (TAILQ_EMPTY(&lun->ooa_queue)) lun->last_busy = getsbinuptime(); #endif /* * Run through the blocked queue on this LUN and see if anything * has become unblocked, now that this transaction is done. */ ctl_check_blocked(lun); /* * If the LUN has been invalidated, free it if there is nothing * left on its OOA queue. */ if ((lun->flags & CTL_LUN_INVALID) && TAILQ_EMPTY(&lun->ooa_queue)) { mtx_unlock(&lun->lun_lock); mtx_lock(&softc->ctl_lock); ctl_free_lun(lun); mtx_unlock(&softc->ctl_lock); } else mtx_unlock(&lun->lun_lock); bailout: /* * If this command has been aborted, make sure we set the status * properly. The FETD is responsible for freeing the I/O and doing * whatever it needs to do to clean up its state. */ if (io->io_hdr.flags & CTL_FLAG_ABORT) ctl_set_task_aborted(&io->scsiio); /* * If enabled, print command error status. */ if ((io->io_hdr.status & CTL_STATUS_MASK) != CTL_SUCCESS && (ctl_debug & CTL_DEBUG_INFO) != 0) ctl_io_error_print(io, NULL); /* * Tell the FETD or the other shelf controller we're done with this * command. Note that only SCSI commands get to this point. Task * management commands are completed above. */ if ((softc->ha_mode != CTL_HA_MODE_XFER) && (io->io_hdr.flags & CTL_FLAG_SENT_2OTHER_SC)) { memset(&msg, 0, sizeof(msg)); msg.hdr.msg_type = CTL_MSG_FINISH_IO; msg.hdr.serializing_sc = io->io_hdr.serializing_sc; msg.hdr.nexus = io->io_hdr.nexus; ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg, sizeof(msg.scsi) - sizeof(msg.scsi.sense_data), M_WAITOK); } fe_done(io); } /* * Front end should call this if it doesn't do autosense. When the request * sense comes back in from the initiator, we'll dequeue this and send it. */ int ctl_queue_sense(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_port *port = CTL_PORT(io); struct ctl_lun *lun; struct scsi_sense_data *ps; uint32_t initidx, p, targ_lun; CTL_DEBUG_PRINT(("ctl_queue_sense\n")); targ_lun = ctl_lun_map_from_port(port, io->io_hdr.nexus.targ_lun); /* * LUN lookup will likely move to the ctl_work_thread() once we * have our new queueing infrastructure (that doesn't put things on * a per-LUN queue initially). That is so that we can handle * things like an INQUIRY to a LUN that we don't have enabled. We * can't deal with that right now. * If we don't have a LUN for this, just toss the sense information. */ mtx_lock(&softc->ctl_lock); if (targ_lun >= CTL_MAX_LUNS || (lun = softc->ctl_luns[targ_lun]) == NULL) { mtx_unlock(&softc->ctl_lock); goto bailout; } mtx_lock(&lun->lun_lock); mtx_unlock(&softc->ctl_lock); initidx = ctl_get_initindex(&io->io_hdr.nexus); p = initidx / CTL_MAX_INIT_PER_PORT; if (lun->pending_sense[p] == NULL) { lun->pending_sense[p] = malloc(sizeof(*ps) * CTL_MAX_INIT_PER_PORT, M_CTL, M_NOWAIT | M_ZERO); } if ((ps = lun->pending_sense[p]) != NULL) { ps += initidx % CTL_MAX_INIT_PER_PORT; memset(ps, 0, sizeof(*ps)); memcpy(ps, &io->scsiio.sense_data, io->scsiio.sense_len); } mtx_unlock(&lun->lun_lock); bailout: ctl_free_io(io); return (CTL_RETVAL_COMPLETE); } /* * Primary command inlet from frontend ports. All SCSI and task I/O * requests must go through this function. */ int ctl_queue(union ctl_io *io) { struct ctl_port *port = CTL_PORT(io); CTL_DEBUG_PRINT(("ctl_queue cdb[0]=%02X\n", io->scsiio.cdb[0])); #ifdef CTL_TIME_IO io->io_hdr.start_time = time_uptime; getbinuptime(&io->io_hdr.start_bt); #endif /* CTL_TIME_IO */ /* Map FE-specific LUN ID into global one. */ io->io_hdr.nexus.targ_mapped_lun = ctl_lun_map_from_port(port, io->io_hdr.nexus.targ_lun); switch (io->io_hdr.io_type) { case CTL_IO_SCSI: case CTL_IO_TASK: if (ctl_debug & CTL_DEBUG_CDB) ctl_io_print(io); ctl_enqueue_incoming(io); break; default: printf("ctl_queue: unknown I/O type %d\n", io->io_hdr.io_type); return (EINVAL); } return (CTL_RETVAL_COMPLETE); } #ifdef CTL_IO_DELAY static void ctl_done_timer_wakeup(void *arg) { union ctl_io *io; io = (union ctl_io *)arg; ctl_done(io); } #endif /* CTL_IO_DELAY */ void ctl_serseq_done(union ctl_io *io) { struct ctl_lun *lun = CTL_LUN(io);; if (lun->be_lun == NULL || lun->be_lun->serseq == CTL_LUN_SERSEQ_OFF) return; mtx_lock(&lun->lun_lock); io->io_hdr.flags |= CTL_FLAG_SERSEQ_DONE; ctl_check_blocked(lun); mtx_unlock(&lun->lun_lock); } void ctl_done(union ctl_io *io) { /* * Enable this to catch duplicate completion issues. */ #if 0 if (io->io_hdr.flags & CTL_FLAG_ALREADY_DONE) { printf("%s: type %d msg %d cdb %x iptl: " "%u:%u:%u tag 0x%04x " "flag %#x status %x\n", __func__, io->io_hdr.io_type, io->io_hdr.msg_type, io->scsiio.cdb[0], io->io_hdr.nexus.initid, io->io_hdr.nexus.targ_port, io->io_hdr.nexus.targ_lun, (io->io_hdr.io_type == CTL_IO_TASK) ? io->taskio.tag_num : io->scsiio.tag_num, io->io_hdr.flags, io->io_hdr.status); } else io->io_hdr.flags |= CTL_FLAG_ALREADY_DONE; #endif /* * This is an internal copy of an I/O, and should not go through * the normal done processing logic. */ if (io->io_hdr.flags & CTL_FLAG_INT_COPY) return; #ifdef CTL_IO_DELAY if (io->io_hdr.flags & CTL_FLAG_DELAY_DONE) { io->io_hdr.flags &= ~CTL_FLAG_DELAY_DONE; } else { struct ctl_lun *lun = CTL_LUN(io); if ((lun != NULL) && (lun->delay_info.done_delay > 0)) { callout_init(&io->io_hdr.delay_callout, /*mpsafe*/ 1); io->io_hdr.flags |= CTL_FLAG_DELAY_DONE; callout_reset(&io->io_hdr.delay_callout, lun->delay_info.done_delay * hz, ctl_done_timer_wakeup, io); if (lun->delay_info.done_type == CTL_DELAY_TYPE_ONESHOT) lun->delay_info.done_delay = 0; return; } } #endif /* CTL_IO_DELAY */ ctl_enqueue_done(io); } static void ctl_work_thread(void *arg) { struct ctl_thread *thr = (struct ctl_thread *)arg; struct ctl_softc *softc = thr->ctl_softc; union ctl_io *io; int retval; CTL_DEBUG_PRINT(("ctl_work_thread starting\n")); while (!softc->shutdown) { /* * We handle the queues in this order: * - ISC * - done queue (to free up resources, unblock other commands) * - RtR queue * - incoming queue * * If those queues are empty, we break out of the loop and * go to sleep. */ mtx_lock(&thr->queue_lock); io = (union ctl_io *)STAILQ_FIRST(&thr->isc_queue); if (io != NULL) { STAILQ_REMOVE_HEAD(&thr->isc_queue, links); mtx_unlock(&thr->queue_lock); ctl_handle_isc(io); continue; } io = (union ctl_io *)STAILQ_FIRST(&thr->done_queue); if (io != NULL) { STAILQ_REMOVE_HEAD(&thr->done_queue, links); /* clear any blocked commands, call fe_done */ mtx_unlock(&thr->queue_lock); ctl_process_done(io); continue; } io = (union ctl_io *)STAILQ_FIRST(&thr->incoming_queue); if (io != NULL) { STAILQ_REMOVE_HEAD(&thr->incoming_queue, links); mtx_unlock(&thr->queue_lock); if (io->io_hdr.io_type == CTL_IO_TASK) ctl_run_task(io); else ctl_scsiio_precheck(softc, &io->scsiio); continue; } io = (union ctl_io *)STAILQ_FIRST(&thr->rtr_queue); if (io != NULL) { STAILQ_REMOVE_HEAD(&thr->rtr_queue, links); mtx_unlock(&thr->queue_lock); retval = ctl_scsiio(&io->scsiio); if (retval != CTL_RETVAL_COMPLETE) CTL_DEBUG_PRINT(("ctl_scsiio failed\n")); continue; } /* Sleep until we have something to do. */ mtx_sleep(thr, &thr->queue_lock, PDROP | PRIBIO, "-", 0); } thr->thread = NULL; kthread_exit(); } static void ctl_lun_thread(void *arg) { struct ctl_softc *softc = (struct ctl_softc *)arg; struct ctl_be_lun *be_lun; CTL_DEBUG_PRINT(("ctl_lun_thread starting\n")); while (!softc->shutdown) { mtx_lock(&softc->ctl_lock); be_lun = STAILQ_FIRST(&softc->pending_lun_queue); if (be_lun != NULL) { STAILQ_REMOVE_HEAD(&softc->pending_lun_queue, links); mtx_unlock(&softc->ctl_lock); ctl_create_lun(be_lun); continue; } /* Sleep until we have something to do. */ mtx_sleep(&softc->pending_lun_queue, &softc->ctl_lock, PDROP | PRIBIO, "-", 0); } softc->lun_thread = NULL; kthread_exit(); } static void ctl_thresh_thread(void *arg) { struct ctl_softc *softc = (struct ctl_softc *)arg; struct ctl_lun *lun; struct ctl_logical_block_provisioning_page *page; const char *attr; union ctl_ha_msg msg; uint64_t thres, val; int i, e, set; CTL_DEBUG_PRINT(("ctl_thresh_thread starting\n")); while (!softc->shutdown) { mtx_lock(&softc->ctl_lock); STAILQ_FOREACH(lun, &softc->lun_list, links) { if ((lun->flags & CTL_LUN_DISABLED) || (lun->flags & CTL_LUN_NO_MEDIA) || lun->backend->lun_attr == NULL) continue; if ((lun->flags & CTL_LUN_PRIMARY_SC) == 0 && softc->ha_mode == CTL_HA_MODE_XFER) continue; if ((lun->MODE_RWER.byte8 & SMS_RWER_LBPERE) == 0) continue; e = 0; page = &lun->MODE_LBP; for (i = 0; i < CTL_NUM_LBP_THRESH; i++) { if ((page->descr[i].flags & SLBPPD_ENABLED) == 0) continue; thres = scsi_4btoul(page->descr[i].count); thres <<= CTL_LBP_EXPONENT; switch (page->descr[i].resource) { case 0x01: attr = "blocksavail"; break; case 0x02: attr = "blocksused"; break; case 0xf1: attr = "poolblocksavail"; break; case 0xf2: attr = "poolblocksused"; break; default: continue; } mtx_unlock(&softc->ctl_lock); // XXX val = lun->backend->lun_attr( lun->be_lun->be_lun, attr); mtx_lock(&softc->ctl_lock); if (val == UINT64_MAX) continue; if ((page->descr[i].flags & SLBPPD_ARMING_MASK) == SLBPPD_ARMING_INC) e = (val >= thres); else e = (val <= thres); if (e) break; } mtx_lock(&lun->lun_lock); if (e) { scsi_u64to8b((uint8_t *)&page->descr[i] - (uint8_t *)page, lun->ua_tpt_info); if (lun->lasttpt == 0 || time_uptime - lun->lasttpt >= CTL_LBP_UA_PERIOD) { lun->lasttpt = time_uptime; ctl_est_ua_all(lun, -1, CTL_UA_THIN_PROV_THRES); set = 1; } else set = 0; } else { lun->lasttpt = 0; ctl_clr_ua_all(lun, -1, CTL_UA_THIN_PROV_THRES); set = -1; } mtx_unlock(&lun->lun_lock); if (set != 0 && lun->ctl_softc->ha_mode == CTL_HA_MODE_XFER) { /* Send msg to other side. */ bzero(&msg.ua, sizeof(msg.ua)); msg.hdr.msg_type = CTL_MSG_UA; msg.hdr.nexus.initid = -1; msg.hdr.nexus.targ_port = -1; msg.hdr.nexus.targ_lun = lun->lun; msg.hdr.nexus.targ_mapped_lun = lun->lun; msg.ua.ua_all = 1; msg.ua.ua_set = (set > 0); msg.ua.ua_type = CTL_UA_THIN_PROV_THRES; memcpy(msg.ua.ua_info, lun->ua_tpt_info, 8); mtx_unlock(&softc->ctl_lock); // XXX ctl_ha_msg_send(CTL_HA_CHAN_CTL, &msg, sizeof(msg.ua), M_WAITOK); mtx_lock(&softc->ctl_lock); } } mtx_sleep(&softc->thresh_thread, &softc->ctl_lock, PDROP | PRIBIO, "-", CTL_LBP_PERIOD * hz); } softc->thresh_thread = NULL; kthread_exit(); } static void ctl_enqueue_incoming(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_thread *thr; u_int idx; idx = (io->io_hdr.nexus.targ_port * 127 + io->io_hdr.nexus.initid) % worker_threads; thr = &softc->threads[idx]; mtx_lock(&thr->queue_lock); STAILQ_INSERT_TAIL(&thr->incoming_queue, &io->io_hdr, links); mtx_unlock(&thr->queue_lock); wakeup(thr); } static void ctl_enqueue_rtr(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_thread *thr; thr = &softc->threads[io->io_hdr.nexus.targ_mapped_lun % worker_threads]; mtx_lock(&thr->queue_lock); STAILQ_INSERT_TAIL(&thr->rtr_queue, &io->io_hdr, links); mtx_unlock(&thr->queue_lock); wakeup(thr); } static void ctl_enqueue_done(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_thread *thr; thr = &softc->threads[io->io_hdr.nexus.targ_mapped_lun % worker_threads]; mtx_lock(&thr->queue_lock); STAILQ_INSERT_TAIL(&thr->done_queue, &io->io_hdr, links); mtx_unlock(&thr->queue_lock); wakeup(thr); } static void ctl_enqueue_isc(union ctl_io *io) { struct ctl_softc *softc = CTL_SOFTC(io); struct ctl_thread *thr; thr = &softc->threads[io->io_hdr.nexus.targ_mapped_lun % worker_threads]; mtx_lock(&thr->queue_lock); STAILQ_INSERT_TAIL(&thr->isc_queue, &io->io_hdr, links); mtx_unlock(&thr->queue_lock); wakeup(thr); } /* * vim: ts=8 */ Index: projects/clang400-import/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c =================================================================== --- projects/clang400-import/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c (revision 314522) +++ projects/clang400-import/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c (revision 314523) @@ -1,3234 +1,3243 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * * Copyright (c) 2006-2010 Pawel Jakub Dawidek * All rights reserved. * * Portions Copyright 2010 Robert Milkowski * * Copyright 2011 Nexenta Systems, Inc. All rights reserved. * Copyright (c) 2012, 2016 by Delphix. All rights reserved. * Copyright (c) 2013, Joyent, Inc. All rights reserved. * Copyright (c) 2014 Integros [integros.com] */ /* Portions Copyright 2011 Martin Matuska */ /* * ZFS volume emulation driver. * * Makes a DMU object look like a volume of arbitrary size, up to 2^64 bytes. * Volumes are accessed through the symbolic links named: * * /dev/zvol/dsk// * /dev/zvol/rdsk// * * These links are created by the /dev filesystem (sdev_zvolops.c). * Volumes are persistent through reboot. No user command needs to be * run before opening and using a device. * * FreeBSD notes. * On FreeBSD ZVOLs are simply GEOM providers like any other storage device * in the system. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "zfs_namecheck.h" #ifndef illumos struct g_class zfs_zvol_class = { .name = "ZFS::ZVOL", .version = G_VERSION, }; DECLARE_GEOM_CLASS(zfs_zvol_class, zfs_zvol); #endif void *zfsdev_state; static char *zvol_tag = "zvol_tag"; #define ZVOL_DUMPSIZE "dumpsize" /* * This lock protects the zfsdev_state structure from being modified * while it's being used, e.g. an open that comes in before a create * finishes. It also protects temporary opens of the dataset so that, * e.g., an open doesn't get a spurious EBUSY. */ #ifdef illumos kmutex_t zfsdev_state_lock; #else /* * In FreeBSD we've replaced the upstream zfsdev_state_lock with the * spa_namespace_lock in the ZVOL code. */ #define zfsdev_state_lock spa_namespace_lock #endif static uint32_t zvol_minors; #ifndef illumos SYSCTL_DECL(_vfs_zfs); SYSCTL_NODE(_vfs_zfs, OID_AUTO, vol, CTLFLAG_RW, 0, "ZFS VOLUME"); static int volmode = ZFS_VOLMODE_GEOM; SYSCTL_INT(_vfs_zfs_vol, OID_AUTO, mode, CTLFLAG_RWTUN, &volmode, 0, "Expose as GEOM providers (1), device files (2) or neither"); static boolean_t zpool_on_zvol = B_FALSE; SYSCTL_INT(_vfs_zfs_vol, OID_AUTO, recursive, CTLFLAG_RWTUN, &zpool_on_zvol, 0, "Allow zpools to use zvols as vdevs (DANGEROUS)"); #endif typedef struct zvol_extent { list_node_t ze_node; dva_t ze_dva; /* dva associated with this extent */ uint64_t ze_nblks; /* number of blocks in extent */ } zvol_extent_t; /* * The in-core state of each volume. */ typedef struct zvol_state { #ifndef illumos LIST_ENTRY(zvol_state) zv_links; #endif char zv_name[MAXPATHLEN]; /* pool/dd name */ uint64_t zv_volsize; /* amount of space we advertise */ uint64_t zv_volblocksize; /* volume block size */ #ifdef illumos minor_t zv_minor; /* minor number */ #else struct cdev *zv_dev; /* non-GEOM device */ struct g_provider *zv_provider; /* GEOM provider */ #endif uint8_t zv_min_bs; /* minimum addressable block shift */ uint8_t zv_flags; /* readonly, dumpified, etc. */ objset_t *zv_objset; /* objset handle */ #ifdef illumos uint32_t zv_open_count[OTYPCNT]; /* open counts */ #endif uint32_t zv_total_opens; /* total open count */ uint32_t zv_sync_cnt; /* synchronous open count */ zilog_t *zv_zilog; /* ZIL handle */ list_t zv_extents; /* List of extents for dump */ znode_t zv_znode; /* for range locking */ dmu_buf_t *zv_dbuf; /* bonus handle */ #ifndef illumos int zv_state; int zv_volmode; /* Provide GEOM or cdev */ struct bio_queue_head zv_queue; struct mtx zv_queue_mtx; /* zv_queue mutex */ #endif } zvol_state_t; #ifndef illumos static LIST_HEAD(, zvol_state) all_zvols; #endif /* * zvol specific flags */ #define ZVOL_RDONLY 0x1 #define ZVOL_DUMPIFIED 0x2 #define ZVOL_EXCL 0x4 #define ZVOL_WCE 0x8 /* * zvol maximum transfer in one DMU tx. */ int zvol_maxphys = DMU_MAX_ACCESS/2; /* * Toggle unmap functionality. */ boolean_t zvol_unmap_enabled = B_TRUE; /* * If true, unmaps requested as synchronous are executed synchronously, * otherwise all unmaps are asynchronous. */ boolean_t zvol_unmap_sync_enabled = B_FALSE; #ifndef illumos SYSCTL_INT(_vfs_zfs_vol, OID_AUTO, unmap_enabled, CTLFLAG_RWTUN, &zvol_unmap_enabled, 0, "Enable UNMAP functionality"); SYSCTL_INT(_vfs_zfs_vol, OID_AUTO, unmap_sync_enabled, CTLFLAG_RWTUN, &zvol_unmap_sync_enabled, 0, "UNMAPs requested as sync are executed synchronously"); static d_open_t zvol_d_open; static d_close_t zvol_d_close; static d_read_t zvol_read; static d_write_t zvol_write; static d_ioctl_t zvol_d_ioctl; static d_strategy_t zvol_strategy; static struct cdevsw zvol_cdevsw = { .d_version = D_VERSION, .d_open = zvol_d_open, .d_close = zvol_d_close, .d_read = zvol_read, .d_write = zvol_write, .d_ioctl = zvol_d_ioctl, .d_strategy = zvol_strategy, .d_name = "zvol", .d_flags = D_DISK | D_TRACKCLOSE, }; static void zvol_geom_run(zvol_state_t *zv); static void zvol_geom_destroy(zvol_state_t *zv); static int zvol_geom_access(struct g_provider *pp, int acr, int acw, int ace); static void zvol_geom_start(struct bio *bp); static void zvol_geom_worker(void *arg); static void zvol_log_truncate(zvol_state_t *zv, dmu_tx_t *tx, uint64_t off, uint64_t len, boolean_t sync); #endif /* !illumos */ extern int zfs_set_prop_nvlist(const char *, zprop_source_t, nvlist_t *, nvlist_t *); static int zvol_remove_zv(zvol_state_t *); static int zvol_get_data(void *arg, lr_write_t *lr, char *buf, zio_t *zio); static int zvol_dumpify(zvol_state_t *zv); static int zvol_dump_fini(zvol_state_t *zv); static int zvol_dump_init(zvol_state_t *zv, boolean_t resize); static void zvol_size_changed(zvol_state_t *zv, uint64_t volsize) { #ifdef illumos dev_t dev = makedevice(ddi_driver_major(zfs_dip), zv->zv_minor); zv->zv_volsize = volsize; VERIFY(ddi_prop_update_int64(dev, zfs_dip, "Size", volsize) == DDI_SUCCESS); VERIFY(ddi_prop_update_int64(dev, zfs_dip, "Nblocks", lbtodb(volsize)) == DDI_SUCCESS); /* Notify specfs to invalidate the cached size */ spec_size_invalidate(dev, VBLK); spec_size_invalidate(dev, VCHR); #else /* !illumos */ zv->zv_volsize = volsize; if (zv->zv_volmode == ZFS_VOLMODE_GEOM) { struct g_provider *pp; pp = zv->zv_provider; if (pp == NULL) return; g_topology_lock(); - g_resize_provider(pp, zv->zv_volsize); + + /* + * Do not invoke resize event when initial size was zero. + * ZVOL initializes the size on first open, this is not + * real resizing. + */ + if (pp->mediasize == 0) + pp->mediasize = zv->zv_volsize; + else + g_resize_provider(pp, zv->zv_volsize); g_topology_unlock(); } #endif /* illumos */ } int zvol_check_volsize(uint64_t volsize, uint64_t blocksize) { if (volsize == 0) return (SET_ERROR(EINVAL)); if (volsize % blocksize != 0) return (SET_ERROR(EINVAL)); #ifdef _ILP32 if (volsize - 1 > SPEC_MAXOFFSET_T) return (SET_ERROR(EOVERFLOW)); #endif return (0); } int zvol_check_volblocksize(uint64_t volblocksize) { if (volblocksize < SPA_MINBLOCKSIZE || volblocksize > SPA_OLD_MAXBLOCKSIZE || !ISP2(volblocksize)) return (SET_ERROR(EDOM)); return (0); } int zvol_get_stats(objset_t *os, nvlist_t *nv) { int error; dmu_object_info_t doi; uint64_t val; error = zap_lookup(os, ZVOL_ZAP_OBJ, "size", 8, 1, &val); if (error) return (error); dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_VOLSIZE, val); error = dmu_object_info(os, ZVOL_OBJ, &doi); if (error == 0) { dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_VOLBLOCKSIZE, doi.doi_data_block_size); } return (error); } static zvol_state_t * zvol_minor_lookup(const char *name) { #ifdef illumos minor_t minor; #endif zvol_state_t *zv; ASSERT(MUTEX_HELD(&zfsdev_state_lock)); #ifdef illumos for (minor = 1; minor <= ZFSDEV_MAX_MINOR; minor++) { zv = zfsdev_get_soft_state(minor, ZSST_ZVOL); if (zv == NULL) continue; #else LIST_FOREACH(zv, &all_zvols, zv_links) { #endif if (strcmp(zv->zv_name, name) == 0) return (zv); } return (NULL); } /* extent mapping arg */ struct maparg { zvol_state_t *ma_zv; uint64_t ma_blks; }; /*ARGSUSED*/ static int zvol_map_block(spa_t *spa, zilog_t *zilog, const blkptr_t *bp, const zbookmark_phys_t *zb, const dnode_phys_t *dnp, void *arg) { struct maparg *ma = arg; zvol_extent_t *ze; int bs = ma->ma_zv->zv_volblocksize; if (bp == NULL || BP_IS_HOLE(bp) || zb->zb_object != ZVOL_OBJ || zb->zb_level != 0) return (0); VERIFY(!BP_IS_EMBEDDED(bp)); VERIFY3U(ma->ma_blks, ==, zb->zb_blkid); ma->ma_blks++; /* Abort immediately if we have encountered gang blocks */ if (BP_IS_GANG(bp)) return (SET_ERROR(EFRAGS)); /* * See if the block is at the end of the previous extent. */ ze = list_tail(&ma->ma_zv->zv_extents); if (ze && DVA_GET_VDEV(BP_IDENTITY(bp)) == DVA_GET_VDEV(&ze->ze_dva) && DVA_GET_OFFSET(BP_IDENTITY(bp)) == DVA_GET_OFFSET(&ze->ze_dva) + ze->ze_nblks * bs) { ze->ze_nblks++; return (0); } dprintf_bp(bp, "%s", "next blkptr:"); /* start a new extent */ ze = kmem_zalloc(sizeof (zvol_extent_t), KM_SLEEP); ze->ze_dva = bp->blk_dva[0]; /* structure assignment */ ze->ze_nblks = 1; list_insert_tail(&ma->ma_zv->zv_extents, ze); return (0); } static void zvol_free_extents(zvol_state_t *zv) { zvol_extent_t *ze; while (ze = list_head(&zv->zv_extents)) { list_remove(&zv->zv_extents, ze); kmem_free(ze, sizeof (zvol_extent_t)); } } static int zvol_get_lbas(zvol_state_t *zv) { objset_t *os = zv->zv_objset; struct maparg ma; int err; ma.ma_zv = zv; ma.ma_blks = 0; zvol_free_extents(zv); /* commit any in-flight changes before traversing the dataset */ txg_wait_synced(dmu_objset_pool(os), 0); err = traverse_dataset(dmu_objset_ds(os), 0, TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA, zvol_map_block, &ma); if (err || ma.ma_blks != (zv->zv_volsize / zv->zv_volblocksize)) { zvol_free_extents(zv); return (err ? err : EIO); } return (0); } /* ARGSUSED */ void zvol_create_cb(objset_t *os, void *arg, cred_t *cr, dmu_tx_t *tx) { zfs_creat_t *zct = arg; nvlist_t *nvprops = zct->zct_props; int error; uint64_t volblocksize, volsize; VERIFY(nvlist_lookup_uint64(nvprops, zfs_prop_to_name(ZFS_PROP_VOLSIZE), &volsize) == 0); if (nvlist_lookup_uint64(nvprops, zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), &volblocksize) != 0) volblocksize = zfs_prop_default_numeric(ZFS_PROP_VOLBLOCKSIZE); /* * These properties must be removed from the list so the generic * property setting step won't apply to them. */ VERIFY(nvlist_remove_all(nvprops, zfs_prop_to_name(ZFS_PROP_VOLSIZE)) == 0); (void) nvlist_remove_all(nvprops, zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE)); error = dmu_object_claim(os, ZVOL_OBJ, DMU_OT_ZVOL, volblocksize, DMU_OT_NONE, 0, tx); ASSERT(error == 0); error = zap_create_claim(os, ZVOL_ZAP_OBJ, DMU_OT_ZVOL_PROP, DMU_OT_NONE, 0, tx); ASSERT(error == 0); error = zap_update(os, ZVOL_ZAP_OBJ, "size", 8, 1, &volsize, tx); ASSERT(error == 0); } /* * Replay a TX_TRUNCATE ZIL transaction if asked. TX_TRUNCATE is how we * implement DKIOCFREE/free-long-range. */ static int zvol_replay_truncate(zvol_state_t *zv, lr_truncate_t *lr, boolean_t byteswap) { uint64_t offset, length; if (byteswap) byteswap_uint64_array(lr, sizeof (*lr)); offset = lr->lr_offset; length = lr->lr_length; return (dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, offset, length)); } /* * Replay a TX_WRITE ZIL transaction that didn't get committed * after a system failure */ static int zvol_replay_write(zvol_state_t *zv, lr_write_t *lr, boolean_t byteswap) { objset_t *os = zv->zv_objset; char *data = (char *)(lr + 1); /* data follows lr_write_t */ uint64_t offset, length; dmu_tx_t *tx; int error; if (byteswap) byteswap_uint64_array(lr, sizeof (*lr)); offset = lr->lr_offset; length = lr->lr_length; /* If it's a dmu_sync() block, write the whole block */ if (lr->lr_common.lrc_reclen == sizeof (lr_write_t)) { uint64_t blocksize = BP_GET_LSIZE(&lr->lr_blkptr); if (length < blocksize) { offset -= offset % blocksize; length = blocksize; } } tx = dmu_tx_create(os); dmu_tx_hold_write(tx, ZVOL_OBJ, offset, length); error = dmu_tx_assign(tx, TXG_WAIT); if (error) { dmu_tx_abort(tx); } else { dmu_write(os, ZVOL_OBJ, offset, length, data, tx); dmu_tx_commit(tx); } return (error); } /* ARGSUSED */ static int zvol_replay_err(zvol_state_t *zv, lr_t *lr, boolean_t byteswap) { return (SET_ERROR(ENOTSUP)); } /* * Callback vectors for replaying records. * Only TX_WRITE and TX_TRUNCATE are needed for zvol. */ zil_replay_func_t *zvol_replay_vector[TX_MAX_TYPE] = { zvol_replay_err, /* 0 no such transaction type */ zvol_replay_err, /* TX_CREATE */ zvol_replay_err, /* TX_MKDIR */ zvol_replay_err, /* TX_MKXATTR */ zvol_replay_err, /* TX_SYMLINK */ zvol_replay_err, /* TX_REMOVE */ zvol_replay_err, /* TX_RMDIR */ zvol_replay_err, /* TX_LINK */ zvol_replay_err, /* TX_RENAME */ zvol_replay_write, /* TX_WRITE */ zvol_replay_truncate, /* TX_TRUNCATE */ zvol_replay_err, /* TX_SETATTR */ zvol_replay_err, /* TX_ACL */ zvol_replay_err, /* TX_CREATE_ACL */ zvol_replay_err, /* TX_CREATE_ATTR */ zvol_replay_err, /* TX_CREATE_ACL_ATTR */ zvol_replay_err, /* TX_MKDIR_ACL */ zvol_replay_err, /* TX_MKDIR_ATTR */ zvol_replay_err, /* TX_MKDIR_ACL_ATTR */ zvol_replay_err, /* TX_WRITE2 */ }; #ifdef illumos int zvol_name2minor(const char *name, minor_t *minor) { zvol_state_t *zv; mutex_enter(&zfsdev_state_lock); zv = zvol_minor_lookup(name); if (minor && zv) *minor = zv->zv_minor; mutex_exit(&zfsdev_state_lock); return (zv ? 0 : -1); } #endif /* illumos */ /* * Create a minor node (plus a whole lot more) for the specified volume. */ int zvol_create_minor(const char *name) { zfs_soft_state_t *zs; zvol_state_t *zv; objset_t *os; #ifdef illumos dmu_object_info_t doi; minor_t minor = 0; char chrbuf[30], blkbuf[30]; #else struct g_provider *pp; struct g_geom *gp; uint64_t mode; #endif int error; #ifndef illumos ZFS_LOG(1, "Creating ZVOL %s...", name); #endif mutex_enter(&zfsdev_state_lock); if (zvol_minor_lookup(name) != NULL) { mutex_exit(&zfsdev_state_lock); return (SET_ERROR(EEXIST)); } /* lie and say we're read-only */ error = dmu_objset_own(name, DMU_OST_ZVOL, B_TRUE, FTAG, &os); if (error) { mutex_exit(&zfsdev_state_lock); return (error); } #ifdef illumos if ((minor = zfsdev_minor_alloc()) == 0) { dmu_objset_disown(os, FTAG); mutex_exit(&zfsdev_state_lock); return (SET_ERROR(ENXIO)); } if (ddi_soft_state_zalloc(zfsdev_state, minor) != DDI_SUCCESS) { dmu_objset_disown(os, FTAG); mutex_exit(&zfsdev_state_lock); return (SET_ERROR(EAGAIN)); } (void) ddi_prop_update_string(minor, zfs_dip, ZVOL_PROP_NAME, (char *)name); (void) snprintf(chrbuf, sizeof (chrbuf), "%u,raw", minor); if (ddi_create_minor_node(zfs_dip, chrbuf, S_IFCHR, minor, DDI_PSEUDO, 0) == DDI_FAILURE) { ddi_soft_state_free(zfsdev_state, minor); dmu_objset_disown(os, FTAG); mutex_exit(&zfsdev_state_lock); return (SET_ERROR(EAGAIN)); } (void) snprintf(blkbuf, sizeof (blkbuf), "%u", minor); if (ddi_create_minor_node(zfs_dip, blkbuf, S_IFBLK, minor, DDI_PSEUDO, 0) == DDI_FAILURE) { ddi_remove_minor_node(zfs_dip, chrbuf); ddi_soft_state_free(zfsdev_state, minor); dmu_objset_disown(os, FTAG); mutex_exit(&zfsdev_state_lock); return (SET_ERROR(EAGAIN)); } zs = ddi_get_soft_state(zfsdev_state, minor); zs->zss_type = ZSST_ZVOL; zv = zs->zss_data = kmem_zalloc(sizeof (zvol_state_t), KM_SLEEP); #else /* !illumos */ zv = kmem_zalloc(sizeof(*zv), KM_SLEEP); zv->zv_state = 0; error = dsl_prop_get_integer(name, zfs_prop_to_name(ZFS_PROP_VOLMODE), &mode, NULL); if (error != 0 || mode == ZFS_VOLMODE_DEFAULT) mode = volmode; DROP_GIANT(); zv->zv_volmode = mode; if (zv->zv_volmode == ZFS_VOLMODE_GEOM) { g_topology_lock(); gp = g_new_geomf(&zfs_zvol_class, "zfs::zvol::%s", name); gp->start = zvol_geom_start; gp->access = zvol_geom_access; pp = g_new_providerf(gp, "%s/%s", ZVOL_DRIVER, name); pp->flags |= G_PF_DIRECT_RECEIVE | G_PF_DIRECT_SEND; pp->sectorsize = DEV_BSIZE; pp->mediasize = 0; pp->private = zv; zv->zv_provider = pp; bioq_init(&zv->zv_queue); mtx_init(&zv->zv_queue_mtx, "zvol", NULL, MTX_DEF); } else if (zv->zv_volmode == ZFS_VOLMODE_DEV) { struct make_dev_args args; make_dev_args_init(&args); args.mda_flags = MAKEDEV_CHECKNAME | MAKEDEV_WAITOK; args.mda_devsw = &zvol_cdevsw; args.mda_cr = NULL; args.mda_uid = UID_ROOT; args.mda_gid = GID_OPERATOR; args.mda_mode = 0640; args.mda_si_drv2 = zv; error = make_dev_s(&args, &zv->zv_dev, "%s/%s", ZVOL_DRIVER, name); if (error != 0) { kmem_free(zv, sizeof(*zv)); dmu_objset_disown(os, FTAG); mutex_exit(&zfsdev_state_lock); return (error); } zv->zv_dev->si_iosize_max = MAXPHYS; } LIST_INSERT_HEAD(&all_zvols, zv, zv_links); #endif /* illumos */ (void) strlcpy(zv->zv_name, name, MAXPATHLEN); zv->zv_min_bs = DEV_BSHIFT; #ifdef illumos zv->zv_minor = minor; #endif zv->zv_objset = os; if (dmu_objset_is_snapshot(os) || !spa_writeable(dmu_objset_spa(os))) zv->zv_flags |= ZVOL_RDONLY; mutex_init(&zv->zv_znode.z_range_lock, NULL, MUTEX_DEFAULT, NULL); avl_create(&zv->zv_znode.z_range_avl, zfs_range_compare, sizeof (rl_t), offsetof(rl_t, r_node)); list_create(&zv->zv_extents, sizeof (zvol_extent_t), offsetof(zvol_extent_t, ze_node)); #ifdef illumos /* get and cache the blocksize */ error = dmu_object_info(os, ZVOL_OBJ, &doi); ASSERT(error == 0); zv->zv_volblocksize = doi.doi_data_block_size; #endif if (spa_writeable(dmu_objset_spa(os))) { if (zil_replay_disable) zil_destroy(dmu_objset_zil(os), B_FALSE); else zil_replay(os, zv, zvol_replay_vector); } dmu_objset_disown(os, FTAG); zv->zv_objset = NULL; zvol_minors++; mutex_exit(&zfsdev_state_lock); #ifndef illumos if (zv->zv_volmode == ZFS_VOLMODE_GEOM) { zvol_geom_run(zv); g_topology_unlock(); } PICKUP_GIANT(); ZFS_LOG(1, "ZVOL %s created.", name); #endif return (0); } /* * Remove minor node for the specified volume. */ static int zvol_remove_zv(zvol_state_t *zv) { #ifdef illumos char nmbuf[20]; minor_t minor = zv->zv_minor; #endif ASSERT(MUTEX_HELD(&zfsdev_state_lock)); if (zv->zv_total_opens != 0) return (SET_ERROR(EBUSY)); #ifdef illumos (void) snprintf(nmbuf, sizeof (nmbuf), "%u,raw", minor); ddi_remove_minor_node(zfs_dip, nmbuf); (void) snprintf(nmbuf, sizeof (nmbuf), "%u", minor); ddi_remove_minor_node(zfs_dip, nmbuf); #else ZFS_LOG(1, "ZVOL %s destroyed.", zv->zv_name); LIST_REMOVE(zv, zv_links); if (zv->zv_volmode == ZFS_VOLMODE_GEOM) { g_topology_lock(); zvol_geom_destroy(zv); g_topology_unlock(); } else if (zv->zv_volmode == ZFS_VOLMODE_DEV) { if (zv->zv_dev != NULL) destroy_dev(zv->zv_dev); } #endif avl_destroy(&zv->zv_znode.z_range_avl); mutex_destroy(&zv->zv_znode.z_range_lock); kmem_free(zv, sizeof (zvol_state_t)); #ifdef illumos ddi_soft_state_free(zfsdev_state, minor); #endif zvol_minors--; return (0); } int zvol_remove_minor(const char *name) { zvol_state_t *zv; int rc; mutex_enter(&zfsdev_state_lock); if ((zv = zvol_minor_lookup(name)) == NULL) { mutex_exit(&zfsdev_state_lock); return (SET_ERROR(ENXIO)); } rc = zvol_remove_zv(zv); mutex_exit(&zfsdev_state_lock); return (rc); } int zvol_first_open(zvol_state_t *zv) { dmu_object_info_t doi; objset_t *os; uint64_t volsize; int error; uint64_t readonly; /* lie and say we're read-only */ error = dmu_objset_own(zv->zv_name, DMU_OST_ZVOL, B_TRUE, zvol_tag, &os); if (error) return (error); zv->zv_objset = os; error = zap_lookup(os, ZVOL_ZAP_OBJ, "size", 8, 1, &volsize); if (error) { ASSERT(error == 0); dmu_objset_disown(os, zvol_tag); return (error); } /* get and cache the blocksize */ error = dmu_object_info(os, ZVOL_OBJ, &doi); if (error) { ASSERT(error == 0); dmu_objset_disown(os, zvol_tag); return (error); } zv->zv_volblocksize = doi.doi_data_block_size; error = dmu_bonus_hold(os, ZVOL_OBJ, zvol_tag, &zv->zv_dbuf); if (error) { dmu_objset_disown(os, zvol_tag); return (error); } zvol_size_changed(zv, volsize); zv->zv_zilog = zil_open(os, zvol_get_data); VERIFY(dsl_prop_get_integer(zv->zv_name, "readonly", &readonly, NULL) == 0); if (readonly || dmu_objset_is_snapshot(os) || !spa_writeable(dmu_objset_spa(os))) zv->zv_flags |= ZVOL_RDONLY; else zv->zv_flags &= ~ZVOL_RDONLY; return (error); } void zvol_last_close(zvol_state_t *zv) { zil_close(zv->zv_zilog); zv->zv_zilog = NULL; dmu_buf_rele(zv->zv_dbuf, zvol_tag); zv->zv_dbuf = NULL; /* * Evict cached data */ if (dsl_dataset_is_dirty(dmu_objset_ds(zv->zv_objset)) && !(zv->zv_flags & ZVOL_RDONLY)) txg_wait_synced(dmu_objset_pool(zv->zv_objset), 0); dmu_objset_evict_dbufs(zv->zv_objset); dmu_objset_disown(zv->zv_objset, zvol_tag); zv->zv_objset = NULL; } #ifdef illumos int zvol_prealloc(zvol_state_t *zv) { objset_t *os = zv->zv_objset; dmu_tx_t *tx; uint64_t refd, avail, usedobjs, availobjs; uint64_t resid = zv->zv_volsize; uint64_t off = 0; /* Check the space usage before attempting to allocate the space */ dmu_objset_space(os, &refd, &avail, &usedobjs, &availobjs); if (avail < zv->zv_volsize) return (SET_ERROR(ENOSPC)); /* Free old extents if they exist */ zvol_free_extents(zv); while (resid != 0) { int error; uint64_t bytes = MIN(resid, SPA_OLD_MAXBLOCKSIZE); tx = dmu_tx_create(os); dmu_tx_hold_write(tx, ZVOL_OBJ, off, bytes); error = dmu_tx_assign(tx, TXG_WAIT); if (error) { dmu_tx_abort(tx); (void) dmu_free_long_range(os, ZVOL_OBJ, 0, off); return (error); } dmu_prealloc(os, ZVOL_OBJ, off, bytes, tx); dmu_tx_commit(tx); off += bytes; resid -= bytes; } txg_wait_synced(dmu_objset_pool(os), 0); return (0); } #endif /* illumos */ static int zvol_update_volsize(objset_t *os, uint64_t volsize) { dmu_tx_t *tx; int error; ASSERT(MUTEX_HELD(&zfsdev_state_lock)); tx = dmu_tx_create(os); dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL); dmu_tx_mark_netfree(tx); error = dmu_tx_assign(tx, TXG_WAIT); if (error) { dmu_tx_abort(tx); return (error); } error = zap_update(os, ZVOL_ZAP_OBJ, "size", 8, 1, &volsize, tx); dmu_tx_commit(tx); if (error == 0) error = dmu_free_long_range(os, ZVOL_OBJ, volsize, DMU_OBJECT_END); return (error); } void zvol_remove_minors(const char *name) { #ifdef illumos zvol_state_t *zv; char *namebuf; minor_t minor; namebuf = kmem_zalloc(strlen(name) + 2, KM_SLEEP); (void) strncpy(namebuf, name, strlen(name)); (void) strcat(namebuf, "/"); mutex_enter(&zfsdev_state_lock); for (minor = 1; minor <= ZFSDEV_MAX_MINOR; minor++) { zv = zfsdev_get_soft_state(minor, ZSST_ZVOL); if (zv == NULL) continue; if (strncmp(namebuf, zv->zv_name, strlen(namebuf)) == 0) (void) zvol_remove_zv(zv); } kmem_free(namebuf, strlen(name) + 2); mutex_exit(&zfsdev_state_lock); #else /* !illumos */ zvol_state_t *zv, *tzv; size_t namelen; namelen = strlen(name); DROP_GIANT(); mutex_enter(&zfsdev_state_lock); LIST_FOREACH_SAFE(zv, &all_zvols, zv_links, tzv) { if (strcmp(zv->zv_name, name) == 0 || (strncmp(zv->zv_name, name, namelen) == 0 && strlen(zv->zv_name) > namelen && (zv->zv_name[namelen] == '/' || zv->zv_name[namelen] == '@'))) { (void) zvol_remove_zv(zv); } } mutex_exit(&zfsdev_state_lock); PICKUP_GIANT(); #endif /* illumos */ } static int zvol_update_live_volsize(zvol_state_t *zv, uint64_t volsize) { uint64_t old_volsize = 0ULL; int error = 0; ASSERT(MUTEX_HELD(&zfsdev_state_lock)); /* * Reinitialize the dump area to the new size. If we * failed to resize the dump area then restore it back to * its original size. We must set the new volsize prior * to calling dumpvp_resize() to ensure that the devices' * size(9P) is not visible by the dump subsystem. */ old_volsize = zv->zv_volsize; zvol_size_changed(zv, volsize); #ifdef ZVOL_DUMP if (zv->zv_flags & ZVOL_DUMPIFIED) { if ((error = zvol_dumpify(zv)) != 0 || (error = dumpvp_resize()) != 0) { int dumpify_error; (void) zvol_update_volsize(zv->zv_objset, old_volsize); zvol_size_changed(zv, old_volsize); dumpify_error = zvol_dumpify(zv); error = dumpify_error ? dumpify_error : error; } } #endif /* ZVOL_DUMP */ #ifdef illumos /* * Generate a LUN expansion event. */ if (error == 0) { sysevent_id_t eid; nvlist_t *attr; char *physpath = kmem_zalloc(MAXPATHLEN, KM_SLEEP); (void) snprintf(physpath, MAXPATHLEN, "%s%u", ZVOL_PSEUDO_DEV, zv->zv_minor); VERIFY(nvlist_alloc(&attr, NV_UNIQUE_NAME, KM_SLEEP) == 0); VERIFY(nvlist_add_string(attr, DEV_PHYS_PATH, physpath) == 0); (void) ddi_log_sysevent(zfs_dip, SUNW_VENDOR, EC_DEV_STATUS, ESC_DEV_DLE, attr, &eid, DDI_SLEEP); nvlist_free(attr); kmem_free(physpath, MAXPATHLEN); } #endif /* illumos */ return (error); } int zvol_set_volsize(const char *name, uint64_t volsize) { zvol_state_t *zv = NULL; objset_t *os; int error; dmu_object_info_t doi; uint64_t readonly; boolean_t owned = B_FALSE; error = dsl_prop_get_integer(name, zfs_prop_to_name(ZFS_PROP_READONLY), &readonly, NULL); if (error != 0) return (error); if (readonly) return (SET_ERROR(EROFS)); mutex_enter(&zfsdev_state_lock); zv = zvol_minor_lookup(name); if (zv == NULL || zv->zv_objset == NULL) { if ((error = dmu_objset_own(name, DMU_OST_ZVOL, B_FALSE, FTAG, &os)) != 0) { mutex_exit(&zfsdev_state_lock); return (error); } owned = B_TRUE; if (zv != NULL) zv->zv_objset = os; } else { os = zv->zv_objset; } if ((error = dmu_object_info(os, ZVOL_OBJ, &doi)) != 0 || (error = zvol_check_volsize(volsize, doi.doi_data_block_size)) != 0) goto out; error = zvol_update_volsize(os, volsize); if (error == 0 && zv != NULL) error = zvol_update_live_volsize(zv, volsize); out: if (owned) { dmu_objset_disown(os, FTAG); if (zv != NULL) zv->zv_objset = NULL; } mutex_exit(&zfsdev_state_lock); return (error); } /*ARGSUSED*/ #ifdef illumos int zvol_open(dev_t *devp, int flag, int otyp, cred_t *cr) #else static int zvol_open(struct g_provider *pp, int flag, int count) #endif { zvol_state_t *zv; int err = 0; #ifdef illumos mutex_enter(&zfsdev_state_lock); zv = zfsdev_get_soft_state(getminor(*devp), ZSST_ZVOL); if (zv == NULL) { mutex_exit(&zfsdev_state_lock); return (SET_ERROR(ENXIO)); } if (zv->zv_total_opens == 0) err = zvol_first_open(zv); if (err) { mutex_exit(&zfsdev_state_lock); return (err); } #else /* !illumos */ boolean_t locked = B_FALSE; if (!zpool_on_zvol && tsd_get(zfs_geom_probe_vdev_key) != NULL) { /* * if zfs_geom_probe_vdev_key is set, that means that zfs is * attempting to probe geom providers while looking for a * replacement for a missing VDEV. In this case, the * spa_namespace_lock will not be held, but it is still illegal * to use a zvol as a vdev. Deadlocks can result if another * thread has spa_namespace_lock */ return (EOPNOTSUPP); } /* * Protect against recursively entering spa_namespace_lock * when spa_open() is used for a pool on a (local) ZVOL(s). * This is needed since we replaced upstream zfsdev_state_lock * with spa_namespace_lock in the ZVOL code. * We are using the same trick as spa_open(). * Note that calls in zvol_first_open which need to resolve * pool name to a spa object will enter spa_open() * recursively, but that function already has all the * necessary protection. */ if (!MUTEX_HELD(&zfsdev_state_lock)) { mutex_enter(&zfsdev_state_lock); locked = B_TRUE; } zv = pp->private; if (zv == NULL) { if (locked) mutex_exit(&zfsdev_state_lock); return (SET_ERROR(ENXIO)); } if (zv->zv_total_opens == 0) { err = zvol_first_open(zv); if (err) { if (locked) mutex_exit(&zfsdev_state_lock); return (err); } pp->mediasize = zv->zv_volsize; pp->stripeoffset = 0; pp->stripesize = zv->zv_volblocksize; } #endif /* illumos */ if ((flag & FWRITE) && (zv->zv_flags & ZVOL_RDONLY)) { err = SET_ERROR(EROFS); goto out; } if (zv->zv_flags & ZVOL_EXCL) { err = SET_ERROR(EBUSY); goto out; } #ifdef FEXCL if (flag & FEXCL) { if (zv->zv_total_opens != 0) { err = SET_ERROR(EBUSY); goto out; } zv->zv_flags |= ZVOL_EXCL; } #endif #ifdef illumos if (zv->zv_open_count[otyp] == 0 || otyp == OTYP_LYR) { zv->zv_open_count[otyp]++; zv->zv_total_opens++; } mutex_exit(&zfsdev_state_lock); #else zv->zv_total_opens += count; if (locked) mutex_exit(&zfsdev_state_lock); #endif return (err); out: if (zv->zv_total_opens == 0) zvol_last_close(zv); #ifdef illumos mutex_exit(&zfsdev_state_lock); #else if (locked) mutex_exit(&zfsdev_state_lock); #endif return (err); } /*ARGSUSED*/ #ifdef illumos int zvol_close(dev_t dev, int flag, int otyp, cred_t *cr) { minor_t minor = getminor(dev); zvol_state_t *zv; int error = 0; mutex_enter(&zfsdev_state_lock); zv = zfsdev_get_soft_state(minor, ZSST_ZVOL); if (zv == NULL) { mutex_exit(&zfsdev_state_lock); #else /* !illumos */ static int zvol_close(struct g_provider *pp, int flag, int count) { zvol_state_t *zv; int error = 0; boolean_t locked = B_FALSE; /* See comment in zvol_open(). */ if (!MUTEX_HELD(&zfsdev_state_lock)) { mutex_enter(&zfsdev_state_lock); locked = B_TRUE; } zv = pp->private; if (zv == NULL) { if (locked) mutex_exit(&zfsdev_state_lock); #endif /* illumos */ return (SET_ERROR(ENXIO)); } if (zv->zv_flags & ZVOL_EXCL) { ASSERT(zv->zv_total_opens == 1); zv->zv_flags &= ~ZVOL_EXCL; } /* * If the open count is zero, this is a spurious close. * That indicates a bug in the kernel / DDI framework. */ #ifdef illumos ASSERT(zv->zv_open_count[otyp] != 0); #endif ASSERT(zv->zv_total_opens != 0); /* * You may get multiple opens, but only one close. */ #ifdef illumos zv->zv_open_count[otyp]--; zv->zv_total_opens--; #else zv->zv_total_opens -= count; #endif if (zv->zv_total_opens == 0) zvol_last_close(zv); #ifdef illumos mutex_exit(&zfsdev_state_lock); #else if (locked) mutex_exit(&zfsdev_state_lock); #endif return (error); } static void zvol_get_done(zgd_t *zgd, int error) { if (zgd->zgd_db) dmu_buf_rele(zgd->zgd_db, zgd); zfs_range_unlock(zgd->zgd_rl); if (error == 0 && zgd->zgd_bp) zil_add_block(zgd->zgd_zilog, zgd->zgd_bp); kmem_free(zgd, sizeof (zgd_t)); } /* * Get data to generate a TX_WRITE intent log record. */ static int zvol_get_data(void *arg, lr_write_t *lr, char *buf, zio_t *zio) { zvol_state_t *zv = arg; objset_t *os = zv->zv_objset; uint64_t object = ZVOL_OBJ; uint64_t offset = lr->lr_offset; uint64_t size = lr->lr_length; /* length of user data */ blkptr_t *bp = &lr->lr_blkptr; dmu_buf_t *db; zgd_t *zgd; int error; ASSERT(zio != NULL); ASSERT(size != 0); zgd = kmem_zalloc(sizeof (zgd_t), KM_SLEEP); zgd->zgd_zilog = zv->zv_zilog; zgd->zgd_rl = zfs_range_lock(&zv->zv_znode, offset, size, RL_READER); /* * Write records come in two flavors: immediate and indirect. * For small writes it's cheaper to store the data with the * log record (immediate); for large writes it's cheaper to * sync the data and get a pointer to it (indirect) so that * we don't have to write the data twice. */ if (buf != NULL) { /* immediate write */ error = dmu_read(os, object, offset, size, buf, DMU_READ_NO_PREFETCH); } else { size = zv->zv_volblocksize; offset = P2ALIGN(offset, size); error = dmu_buf_hold(os, object, offset, zgd, &db, DMU_READ_NO_PREFETCH); if (error == 0) { blkptr_t *obp = dmu_buf_get_blkptr(db); if (obp) { ASSERT(BP_IS_HOLE(bp)); *bp = *obp; } zgd->zgd_db = db; zgd->zgd_bp = bp; ASSERT(db->db_offset == offset); ASSERT(db->db_size == size); error = dmu_sync(zio, lr->lr_common.lrc_txg, zvol_get_done, zgd); if (error == 0) return (0); } } zvol_get_done(zgd, error); return (error); } /* * zvol_log_write() handles synchronous writes using TX_WRITE ZIL transactions. * * We store data in the log buffers if it's small enough. * Otherwise we will later flush the data out via dmu_sync(). */ ssize_t zvol_immediate_write_sz = 32768; #ifdef _KERNEL SYSCTL_LONG(_vfs_zfs_vol, OID_AUTO, immediate_write_sz, CTLFLAG_RWTUN, &zvol_immediate_write_sz, 0, "Minimal size for indirect log write"); #endif static void zvol_log_write(zvol_state_t *zv, dmu_tx_t *tx, offset_t off, ssize_t resid, boolean_t sync) { uint32_t blocksize = zv->zv_volblocksize; zilog_t *zilog = zv->zv_zilog; itx_wr_state_t write_state; if (zil_replaying(zilog, tx)) return; if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT) write_state = WR_INDIRECT; else if (!spa_has_slogs(zilog->zl_spa) && resid >= blocksize && blocksize > zvol_immediate_write_sz) write_state = WR_INDIRECT; else if (sync) write_state = WR_COPIED; else write_state = WR_NEED_COPY; while (resid) { itx_t *itx; lr_write_t *lr; itx_wr_state_t wr_state = write_state; ssize_t len = resid; if (wr_state == WR_COPIED && resid > ZIL_MAX_COPIED_DATA) wr_state = WR_NEED_COPY; else if (wr_state == WR_INDIRECT) len = MIN(blocksize - P2PHASE(off, blocksize), resid); itx = zil_itx_create(TX_WRITE, sizeof (*lr) + (wr_state == WR_COPIED ? len : 0)); lr = (lr_write_t *)&itx->itx_lr; if (wr_state == WR_COPIED && dmu_read(zv->zv_objset, ZVOL_OBJ, off, len, lr + 1, DMU_READ_NO_PREFETCH) != 0) { zil_itx_destroy(itx); itx = zil_itx_create(TX_WRITE, sizeof (*lr)); lr = (lr_write_t *)&itx->itx_lr; wr_state = WR_NEED_COPY; } itx->itx_wr_state = wr_state; lr->lr_foid = ZVOL_OBJ; lr->lr_offset = off; lr->lr_length = len; lr->lr_blkoff = 0; BP_ZERO(&lr->lr_blkptr); itx->itx_private = zv; if (!sync && (zv->zv_sync_cnt == 0)) itx->itx_sync = B_FALSE; zil_itx_assign(zilog, itx, tx); off += len; resid -= len; } } #ifdef illumos static int zvol_dumpio_vdev(vdev_t *vd, void *addr, uint64_t offset, uint64_t origoffset, uint64_t size, boolean_t doread, boolean_t isdump) { vdev_disk_t *dvd; int c; int numerrors = 0; if (vd->vdev_ops == &vdev_mirror_ops || vd->vdev_ops == &vdev_replacing_ops || vd->vdev_ops == &vdev_spare_ops) { for (c = 0; c < vd->vdev_children; c++) { int err = zvol_dumpio_vdev(vd->vdev_child[c], addr, offset, origoffset, size, doread, isdump); if (err != 0) { numerrors++; } else if (doread) { break; } } } if (!vd->vdev_ops->vdev_op_leaf && vd->vdev_ops != &vdev_raidz_ops) return (numerrors < vd->vdev_children ? 0 : EIO); if (doread && !vdev_readable(vd)) return (SET_ERROR(EIO)); else if (!doread && !vdev_writeable(vd)) return (SET_ERROR(EIO)); if (vd->vdev_ops == &vdev_raidz_ops) { return (vdev_raidz_physio(vd, addr, size, offset, origoffset, doread, isdump)); } offset += VDEV_LABEL_START_SIZE; if (ddi_in_panic() || isdump) { ASSERT(!doread); if (doread) return (SET_ERROR(EIO)); dvd = vd->vdev_tsd; ASSERT3P(dvd, !=, NULL); return (ldi_dump(dvd->vd_lh, addr, lbtodb(offset), lbtodb(size))); } else { dvd = vd->vdev_tsd; ASSERT3P(dvd, !=, NULL); return (vdev_disk_ldi_physio(dvd->vd_lh, addr, size, offset, doread ? B_READ : B_WRITE)); } } static int zvol_dumpio(zvol_state_t *zv, void *addr, uint64_t offset, uint64_t size, boolean_t doread, boolean_t isdump) { vdev_t *vd; int error; zvol_extent_t *ze; spa_t *spa = dmu_objset_spa(zv->zv_objset); /* Must be sector aligned, and not stradle a block boundary. */ if (P2PHASE(offset, DEV_BSIZE) || P2PHASE(size, DEV_BSIZE) || P2BOUNDARY(offset, size, zv->zv_volblocksize)) { return (SET_ERROR(EINVAL)); } ASSERT(size <= zv->zv_volblocksize); /* Locate the extent this belongs to */ ze = list_head(&zv->zv_extents); while (offset >= ze->ze_nblks * zv->zv_volblocksize) { offset -= ze->ze_nblks * zv->zv_volblocksize; ze = list_next(&zv->zv_extents, ze); } if (ze == NULL) return (SET_ERROR(EINVAL)); if (!ddi_in_panic()) spa_config_enter(spa, SCL_STATE, FTAG, RW_READER); vd = vdev_lookup_top(spa, DVA_GET_VDEV(&ze->ze_dva)); offset += DVA_GET_OFFSET(&ze->ze_dva); error = zvol_dumpio_vdev(vd, addr, offset, DVA_GET_OFFSET(&ze->ze_dva), size, doread, isdump); if (!ddi_in_panic()) spa_config_exit(spa, SCL_STATE, FTAG); return (error); } int zvol_strategy(buf_t *bp) { zfs_soft_state_t *zs = NULL; #else /* !illumos */ void zvol_strategy(struct bio *bp) { #endif /* illumos */ zvol_state_t *zv; uint64_t off, volsize; size_t resid; char *addr; objset_t *os; rl_t *rl; int error = 0; #ifdef illumos boolean_t doread = bp->b_flags & B_READ; #else boolean_t doread = 0; #endif boolean_t is_dumpified; boolean_t sync; #ifdef illumos if (getminor(bp->b_edev) == 0) { error = SET_ERROR(EINVAL); } else { zs = ddi_get_soft_state(zfsdev_state, getminor(bp->b_edev)); if (zs == NULL) error = SET_ERROR(ENXIO); else if (zs->zss_type != ZSST_ZVOL) error = SET_ERROR(EINVAL); } if (error) { bioerror(bp, error); biodone(bp); return (0); } zv = zs->zss_data; if (!(bp->b_flags & B_READ) && (zv->zv_flags & ZVOL_RDONLY)) { bioerror(bp, EROFS); biodone(bp); return (0); } off = ldbtob(bp->b_blkno); #else /* !illumos */ if (bp->bio_to) zv = bp->bio_to->private; else zv = bp->bio_dev->si_drv2; if (zv == NULL) { error = SET_ERROR(ENXIO); goto out; } if (bp->bio_cmd != BIO_READ && (zv->zv_flags & ZVOL_RDONLY)) { error = SET_ERROR(EROFS); goto out; } switch (bp->bio_cmd) { case BIO_FLUSH: goto sync; case BIO_READ: doread = 1; case BIO_WRITE: case BIO_DELETE: break; default: error = EOPNOTSUPP; goto out; } off = bp->bio_offset; #endif /* illumos */ volsize = zv->zv_volsize; os = zv->zv_objset; ASSERT(os != NULL); #ifdef illumos bp_mapin(bp); addr = bp->b_un.b_addr; resid = bp->b_bcount; if (resid > 0 && (off < 0 || off >= volsize)) { bioerror(bp, EIO); biodone(bp); return (0); } is_dumpified = zv->zv_flags & ZVOL_DUMPIFIED; sync = ((!(bp->b_flags & B_ASYNC) && !(zv->zv_flags & ZVOL_WCE)) || (zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS)) && !doread && !is_dumpified; #else /* !illumos */ addr = bp->bio_data; resid = bp->bio_length; if (resid > 0 && (off < 0 || off >= volsize)) { error = SET_ERROR(EIO); goto out; } is_dumpified = B_FALSE; sync = !doread && !is_dumpified && zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS; #endif /* illumos */ /* * There must be no buffer changes when doing a dmu_sync() because * we can't change the data whilst calculating the checksum. */ rl = zfs_range_lock(&zv->zv_znode, off, resid, doread ? RL_READER : RL_WRITER); #ifndef illumos if (bp->bio_cmd == BIO_DELETE) { dmu_tx_t *tx = dmu_tx_create(zv->zv_objset); error = dmu_tx_assign(tx, TXG_WAIT); if (error != 0) { dmu_tx_abort(tx); } else { zvol_log_truncate(zv, tx, off, resid, sync); dmu_tx_commit(tx); error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, off, resid); resid = 0; } goto unlock; } #endif while (resid != 0 && off < volsize) { size_t size = MIN(resid, zvol_maxphys); #ifdef illumos if (is_dumpified) { size = MIN(size, P2END(off, zv->zv_volblocksize) - off); error = zvol_dumpio(zv, addr, off, size, doread, B_FALSE); } else if (doread) { #else if (doread) { #endif error = dmu_read(os, ZVOL_OBJ, off, size, addr, DMU_READ_PREFETCH); } else { dmu_tx_t *tx = dmu_tx_create(os); dmu_tx_hold_write(tx, ZVOL_OBJ, off, size); error = dmu_tx_assign(tx, TXG_WAIT); if (error) { dmu_tx_abort(tx); } else { dmu_write(os, ZVOL_OBJ, off, size, addr, tx); zvol_log_write(zv, tx, off, size, sync); dmu_tx_commit(tx); } } if (error) { /* convert checksum errors into IO errors */ if (error == ECKSUM) error = SET_ERROR(EIO); break; } off += size; addr += size; resid -= size; } #ifndef illumos unlock: #endif zfs_range_unlock(rl); #ifdef illumos if ((bp->b_resid = resid) == bp->b_bcount) bioerror(bp, off > volsize ? EINVAL : error); if (sync) zil_commit(zv->zv_zilog, ZVOL_OBJ); biodone(bp); return (0); #else /* !illumos */ bp->bio_completed = bp->bio_length - resid; if (bp->bio_completed < bp->bio_length && off > volsize) error = EINVAL; if (sync) { sync: zil_commit(zv->zv_zilog, ZVOL_OBJ); } out: if (bp->bio_to) g_io_deliver(bp, error); else biofinish(bp, NULL, error); #endif /* illumos */ } #ifdef illumos /* * Set the buffer count to the zvol maximum transfer. * Using our own routine instead of the default minphys() * means that for larger writes we write bigger buffers on X86 * (128K instead of 56K) and flush the disk write cache less often * (every zvol_maxphys - currently 1MB) instead of minphys (currently * 56K on X86 and 128K on sparc). */ void zvol_minphys(struct buf *bp) { if (bp->b_bcount > zvol_maxphys) bp->b_bcount = zvol_maxphys; } int zvol_dump(dev_t dev, caddr_t addr, daddr_t blkno, int nblocks) { minor_t minor = getminor(dev); zvol_state_t *zv; int error = 0; uint64_t size; uint64_t boff; uint64_t resid; zv = zfsdev_get_soft_state(minor, ZSST_ZVOL); if (zv == NULL) return (SET_ERROR(ENXIO)); if ((zv->zv_flags & ZVOL_DUMPIFIED) == 0) return (SET_ERROR(EINVAL)); boff = ldbtob(blkno); resid = ldbtob(nblocks); VERIFY3U(boff + resid, <=, zv->zv_volsize); while (resid) { size = MIN(resid, P2END(boff, zv->zv_volblocksize) - boff); error = zvol_dumpio(zv, addr, boff, size, B_FALSE, B_TRUE); if (error) break; boff += size; addr += size; resid -= size; } return (error); } /*ARGSUSED*/ int zvol_read(dev_t dev, uio_t *uio, cred_t *cr) { minor_t minor = getminor(dev); #else /* !illumos */ int zvol_read(struct cdev *dev, struct uio *uio, int ioflag) { #endif /* illumos */ zvol_state_t *zv; uint64_t volsize; rl_t *rl; int error = 0; #ifdef illumos zv = zfsdev_get_soft_state(minor, ZSST_ZVOL); if (zv == NULL) return (SET_ERROR(ENXIO)); #else zv = dev->si_drv2; #endif volsize = zv->zv_volsize; /* uio_loffset == volsize isn't an error as its required for EOF processing. */ if (uio->uio_resid > 0 && (uio->uio_loffset < 0 || uio->uio_loffset > volsize)) return (SET_ERROR(EIO)); #ifdef illumos if (zv->zv_flags & ZVOL_DUMPIFIED) { error = physio(zvol_strategy, NULL, dev, B_READ, zvol_minphys, uio); return (error); } #endif rl = zfs_range_lock(&zv->zv_znode, uio->uio_loffset, uio->uio_resid, RL_READER); while (uio->uio_resid > 0 && uio->uio_loffset < volsize) { uint64_t bytes = MIN(uio->uio_resid, DMU_MAX_ACCESS >> 1); /* don't read past the end */ if (bytes > volsize - uio->uio_loffset) bytes = volsize - uio->uio_loffset; error = dmu_read_uio_dbuf(zv->zv_dbuf, uio, bytes); if (error) { /* convert checksum errors into IO errors */ if (error == ECKSUM) error = SET_ERROR(EIO); break; } } zfs_range_unlock(rl); return (error); } #ifdef illumos /*ARGSUSED*/ int zvol_write(dev_t dev, uio_t *uio, cred_t *cr) { minor_t minor = getminor(dev); #else /* !illumos */ int zvol_write(struct cdev *dev, struct uio *uio, int ioflag) { #endif /* illumos */ zvol_state_t *zv; uint64_t volsize; rl_t *rl; int error = 0; boolean_t sync; #ifdef illumos zv = zfsdev_get_soft_state(minor, ZSST_ZVOL); if (zv == NULL) return (SET_ERROR(ENXIO)); #else zv = dev->si_drv2; #endif volsize = zv->zv_volsize; /* uio_loffset == volsize isn't an error as its required for EOF processing. */ if (uio->uio_resid > 0 && (uio->uio_loffset < 0 || uio->uio_loffset > volsize)) return (SET_ERROR(EIO)); #ifdef illumos if (zv->zv_flags & ZVOL_DUMPIFIED) { error = physio(zvol_strategy, NULL, dev, B_WRITE, zvol_minphys, uio); return (error); } sync = !(zv->zv_flags & ZVOL_WCE) || #else sync = (ioflag & IO_SYNC) || #endif (zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS); rl = zfs_range_lock(&zv->zv_znode, uio->uio_loffset, uio->uio_resid, RL_WRITER); while (uio->uio_resid > 0 && uio->uio_loffset < volsize) { uint64_t bytes = MIN(uio->uio_resid, DMU_MAX_ACCESS >> 1); uint64_t off = uio->uio_loffset; dmu_tx_t *tx = dmu_tx_create(zv->zv_objset); if (bytes > volsize - off) /* don't write past the end */ bytes = volsize - off; dmu_tx_hold_write(tx, ZVOL_OBJ, off, bytes); error = dmu_tx_assign(tx, TXG_WAIT); if (error) { dmu_tx_abort(tx); break; } error = dmu_write_uio_dbuf(zv->zv_dbuf, uio, bytes, tx); if (error == 0) zvol_log_write(zv, tx, off, bytes, sync); dmu_tx_commit(tx); if (error) break; } zfs_range_unlock(rl); if (sync) zil_commit(zv->zv_zilog, ZVOL_OBJ); return (error); } #ifdef illumos int zvol_getefi(void *arg, int flag, uint64_t vs, uint8_t bs) { struct uuid uuid = EFI_RESERVED; efi_gpe_t gpe = { 0 }; uint32_t crc; dk_efi_t efi; int length; char *ptr; if (ddi_copyin(arg, &efi, sizeof (dk_efi_t), flag)) return (SET_ERROR(EFAULT)); ptr = (char *)(uintptr_t)efi.dki_data_64; length = efi.dki_length; /* * Some clients may attempt to request a PMBR for the * zvol. Currently this interface will return EINVAL to * such requests. These requests could be supported by * adding a check for lba == 0 and consing up an appropriate * PMBR. */ if (efi.dki_lba < 1 || efi.dki_lba > 2 || length <= 0) return (SET_ERROR(EINVAL)); gpe.efi_gpe_StartingLBA = LE_64(34ULL); gpe.efi_gpe_EndingLBA = LE_64((vs >> bs) - 1); UUID_LE_CONVERT(gpe.efi_gpe_PartitionTypeGUID, uuid); if (efi.dki_lba == 1) { efi_gpt_t gpt = { 0 }; gpt.efi_gpt_Signature = LE_64(EFI_SIGNATURE); gpt.efi_gpt_Revision = LE_32(EFI_VERSION_CURRENT); gpt.efi_gpt_HeaderSize = LE_32(sizeof (gpt)); gpt.efi_gpt_MyLBA = LE_64(1ULL); gpt.efi_gpt_FirstUsableLBA = LE_64(34ULL); gpt.efi_gpt_LastUsableLBA = LE_64((vs >> bs) - 1); gpt.efi_gpt_PartitionEntryLBA = LE_64(2ULL); gpt.efi_gpt_NumberOfPartitionEntries = LE_32(1); gpt.efi_gpt_SizeOfPartitionEntry = LE_32(sizeof (efi_gpe_t)); CRC32(crc, &gpe, sizeof (gpe), -1U, crc32_table); gpt.efi_gpt_PartitionEntryArrayCRC32 = LE_32(~crc); CRC32(crc, &gpt, sizeof (gpt), -1U, crc32_table); gpt.efi_gpt_HeaderCRC32 = LE_32(~crc); if (ddi_copyout(&gpt, ptr, MIN(sizeof (gpt), length), flag)) return (SET_ERROR(EFAULT)); ptr += sizeof (gpt); length -= sizeof (gpt); } if (length > 0 && ddi_copyout(&gpe, ptr, MIN(sizeof (gpe), length), flag)) return (SET_ERROR(EFAULT)); return (0); } /* * BEGIN entry points to allow external callers access to the volume. */ /* * Return the volume parameters needed for access from an external caller. * These values are invariant as long as the volume is held open. */ int zvol_get_volume_params(minor_t minor, uint64_t *blksize, uint64_t *max_xfer_len, void **minor_hdl, void **objset_hdl, void **zil_hdl, void **rl_hdl, void **bonus_hdl) { zvol_state_t *zv; zv = zfsdev_get_soft_state(minor, ZSST_ZVOL); if (zv == NULL) return (SET_ERROR(ENXIO)); if (zv->zv_flags & ZVOL_DUMPIFIED) return (SET_ERROR(ENXIO)); ASSERT(blksize && max_xfer_len && minor_hdl && objset_hdl && zil_hdl && rl_hdl && bonus_hdl); *blksize = zv->zv_volblocksize; *max_xfer_len = (uint64_t)zvol_maxphys; *minor_hdl = zv; *objset_hdl = zv->zv_objset; *zil_hdl = zv->zv_zilog; *rl_hdl = &zv->zv_znode; *bonus_hdl = zv->zv_dbuf; return (0); } /* * Return the current volume size to an external caller. * The size can change while the volume is open. */ uint64_t zvol_get_volume_size(void *minor_hdl) { zvol_state_t *zv = minor_hdl; return (zv->zv_volsize); } /* * Return the current WCE setting to an external caller. * The WCE setting can change while the volume is open. */ int zvol_get_volume_wce(void *minor_hdl) { zvol_state_t *zv = minor_hdl; return ((zv->zv_flags & ZVOL_WCE) ? 1 : 0); } /* * Entry point for external callers to zvol_log_write */ void zvol_log_write_minor(void *minor_hdl, dmu_tx_t *tx, offset_t off, ssize_t resid, boolean_t sync) { zvol_state_t *zv = minor_hdl; zvol_log_write(zv, tx, off, resid, sync); } /* * END entry points to allow external callers access to the volume. */ #endif /* illumos */ /* * Log a DKIOCFREE/free-long-range to the ZIL with TX_TRUNCATE. */ static void zvol_log_truncate(zvol_state_t *zv, dmu_tx_t *tx, uint64_t off, uint64_t len, boolean_t sync) { itx_t *itx; lr_truncate_t *lr; zilog_t *zilog = zv->zv_zilog; if (zil_replaying(zilog, tx)) return; itx = zil_itx_create(TX_TRUNCATE, sizeof (*lr)); lr = (lr_truncate_t *)&itx->itx_lr; lr->lr_foid = ZVOL_OBJ; lr->lr_offset = off; lr->lr_length = len; itx->itx_sync = (sync || zv->zv_sync_cnt != 0); zil_itx_assign(zilog, itx, tx); } #ifdef illumos /* * Dirtbag ioctls to support mkfs(1M) for UFS filesystems. See dkio(7I). * Also a dirtbag dkio ioctl for unmap/free-block functionality. */ /*ARGSUSED*/ int zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int flag, cred_t *cr, int *rvalp) { zvol_state_t *zv; struct dk_callback *dkc; int error = 0; rl_t *rl; mutex_enter(&zfsdev_state_lock); zv = zfsdev_get_soft_state(getminor(dev), ZSST_ZVOL); if (zv == NULL) { mutex_exit(&zfsdev_state_lock); return (SET_ERROR(ENXIO)); } ASSERT(zv->zv_total_opens > 0); switch (cmd) { case DKIOCINFO: { struct dk_cinfo dki; bzero(&dki, sizeof (dki)); (void) strcpy(dki.dki_cname, "zvol"); (void) strcpy(dki.dki_dname, "zvol"); dki.dki_ctype = DKC_UNKNOWN; dki.dki_unit = getminor(dev); dki.dki_maxtransfer = 1 << (SPA_OLD_MAXBLOCKSHIFT - zv->zv_min_bs); mutex_exit(&zfsdev_state_lock); if (ddi_copyout(&dki, (void *)arg, sizeof (dki), flag)) error = SET_ERROR(EFAULT); return (error); } case DKIOCGMEDIAINFO: { struct dk_minfo dkm; bzero(&dkm, sizeof (dkm)); dkm.dki_lbsize = 1U << zv->zv_min_bs; dkm.dki_capacity = zv->zv_volsize >> zv->zv_min_bs; dkm.dki_media_type = DK_UNKNOWN; mutex_exit(&zfsdev_state_lock); if (ddi_copyout(&dkm, (void *)arg, sizeof (dkm), flag)) error = SET_ERROR(EFAULT); return (error); } case DKIOCGMEDIAINFOEXT: { struct dk_minfo_ext dkmext; bzero(&dkmext, sizeof (dkmext)); dkmext.dki_lbsize = 1U << zv->zv_min_bs; dkmext.dki_pbsize = zv->zv_volblocksize; dkmext.dki_capacity = zv->zv_volsize >> zv->zv_min_bs; dkmext.dki_media_type = DK_UNKNOWN; mutex_exit(&zfsdev_state_lock); if (ddi_copyout(&dkmext, (void *)arg, sizeof (dkmext), flag)) error = SET_ERROR(EFAULT); return (error); } case DKIOCGETEFI: { uint64_t vs = zv->zv_volsize; uint8_t bs = zv->zv_min_bs; mutex_exit(&zfsdev_state_lock); error = zvol_getefi((void *)arg, flag, vs, bs); return (error); } case DKIOCFLUSHWRITECACHE: dkc = (struct dk_callback *)arg; mutex_exit(&zfsdev_state_lock); zil_commit(zv->zv_zilog, ZVOL_OBJ); if ((flag & FKIOCTL) && dkc != NULL && dkc->dkc_callback) { (*dkc->dkc_callback)(dkc->dkc_cookie, error); error = 0; } return (error); case DKIOCGETWCE: { int wce = (zv->zv_flags & ZVOL_WCE) ? 1 : 0; if (ddi_copyout(&wce, (void *)arg, sizeof (int), flag)) error = SET_ERROR(EFAULT); break; } case DKIOCSETWCE: { int wce; if (ddi_copyin((void *)arg, &wce, sizeof (int), flag)) { error = SET_ERROR(EFAULT); break; } if (wce) { zv->zv_flags |= ZVOL_WCE; mutex_exit(&zfsdev_state_lock); } else { zv->zv_flags &= ~ZVOL_WCE; mutex_exit(&zfsdev_state_lock); zil_commit(zv->zv_zilog, ZVOL_OBJ); } return (0); } case DKIOCGGEOM: case DKIOCGVTOC: /* * commands using these (like prtvtoc) expect ENOTSUP * since we're emulating an EFI label */ error = SET_ERROR(ENOTSUP); break; case DKIOCDUMPINIT: rl = zfs_range_lock(&zv->zv_znode, 0, zv->zv_volsize, RL_WRITER); error = zvol_dumpify(zv); zfs_range_unlock(rl); break; case DKIOCDUMPFINI: if (!(zv->zv_flags & ZVOL_DUMPIFIED)) break; rl = zfs_range_lock(&zv->zv_znode, 0, zv->zv_volsize, RL_WRITER); error = zvol_dump_fini(zv); zfs_range_unlock(rl); break; case DKIOCFREE: { dkioc_free_t df; dmu_tx_t *tx; if (!zvol_unmap_enabled) break; if (ddi_copyin((void *)arg, &df, sizeof (df), flag)) { error = SET_ERROR(EFAULT); break; } /* * Apply Postel's Law to length-checking. If they overshoot, * just blank out until the end, if there's a need to blank * out anything. */ if (df.df_start >= zv->zv_volsize) break; /* No need to do anything... */ mutex_exit(&zfsdev_state_lock); rl = zfs_range_lock(&zv->zv_znode, df.df_start, df.df_length, RL_WRITER); tx = dmu_tx_create(zv->zv_objset); dmu_tx_mark_netfree(tx); error = dmu_tx_assign(tx, TXG_WAIT); if (error != 0) { dmu_tx_abort(tx); } else { zvol_log_truncate(zv, tx, df.df_start, df.df_length, B_TRUE); dmu_tx_commit(tx); error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, df.df_start, df.df_length); } zfs_range_unlock(rl); /* * If the write-cache is disabled, 'sync' property * is set to 'always', or if the caller is asking for * a synchronous free, commit this operation to the zil. * This will sync any previous uncommitted writes to the * zvol object. * Can be overridden by the zvol_unmap_sync_enabled tunable. */ if ((error == 0) && zvol_unmap_sync_enabled && (!(zv->zv_flags & ZVOL_WCE) || (zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS) || (df.df_flags & DF_WAIT_SYNC))) { zil_commit(zv->zv_zilog, ZVOL_OBJ); } return (error); } default: error = SET_ERROR(ENOTTY); break; } mutex_exit(&zfsdev_state_lock); return (error); } #endif /* illumos */ int zvol_busy(void) { return (zvol_minors != 0); } void zvol_init(void) { VERIFY(ddi_soft_state_init(&zfsdev_state, sizeof (zfs_soft_state_t), 1) == 0); #ifdef illumos mutex_init(&zfsdev_state_lock, NULL, MUTEX_DEFAULT, NULL); #else ZFS_LOG(1, "ZVOL Initialized."); #endif } void zvol_fini(void) { #ifdef illumos mutex_destroy(&zfsdev_state_lock); #endif ddi_soft_state_fini(&zfsdev_state); ZFS_LOG(1, "ZVOL Deinitialized."); } #ifdef illumos /*ARGSUSED*/ static int zfs_mvdev_dump_feature_check(void *arg, dmu_tx_t *tx) { spa_t *spa = dmu_tx_pool(tx)->dp_spa; if (spa_feature_is_active(spa, SPA_FEATURE_MULTI_VDEV_CRASH_DUMP)) return (1); return (0); } /*ARGSUSED*/ static void zfs_mvdev_dump_activate_feature_sync(void *arg, dmu_tx_t *tx) { spa_t *spa = dmu_tx_pool(tx)->dp_spa; spa_feature_incr(spa, SPA_FEATURE_MULTI_VDEV_CRASH_DUMP, tx); } static int zvol_dump_init(zvol_state_t *zv, boolean_t resize) { dmu_tx_t *tx; int error; objset_t *os = zv->zv_objset; spa_t *spa = dmu_objset_spa(os); vdev_t *vd = spa->spa_root_vdev; nvlist_t *nv = NULL; uint64_t version = spa_version(spa); uint64_t checksum, compress, refresrv, vbs, dedup; ASSERT(MUTEX_HELD(&zfsdev_state_lock)); ASSERT(vd->vdev_ops == &vdev_root_ops); error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, 0, DMU_OBJECT_END); if (error != 0) return (error); /* wait for dmu_free_long_range to actually free the blocks */ txg_wait_synced(dmu_objset_pool(zv->zv_objset), 0); /* * If the pool on which the dump device is being initialized has more * than one child vdev, check that the MULTI_VDEV_CRASH_DUMP feature is * enabled. If so, bump that feature's counter to indicate that the * feature is active. We also check the vdev type to handle the * following case: * # zpool create test raidz disk1 disk2 disk3 * Now have spa_root_vdev->vdev_children == 1 (the raidz vdev), * the raidz vdev itself has 3 children. */ if (vd->vdev_children > 1 || vd->vdev_ops == &vdev_raidz_ops) { if (!spa_feature_is_enabled(spa, SPA_FEATURE_MULTI_VDEV_CRASH_DUMP)) return (SET_ERROR(ENOTSUP)); (void) dsl_sync_task(spa_name(spa), zfs_mvdev_dump_feature_check, zfs_mvdev_dump_activate_feature_sync, NULL, 2, ZFS_SPACE_CHECK_RESERVED); } if (!resize) { error = dsl_prop_get_integer(zv->zv_name, zfs_prop_to_name(ZFS_PROP_COMPRESSION), &compress, NULL); if (error == 0) { error = dsl_prop_get_integer(zv->zv_name, zfs_prop_to_name(ZFS_PROP_CHECKSUM), &checksum, NULL); } if (error == 0) { error = dsl_prop_get_integer(zv->zv_name, zfs_prop_to_name(ZFS_PROP_REFRESERVATION), &refresrv, NULL); } if (error == 0) { error = dsl_prop_get_integer(zv->zv_name, zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), &vbs, NULL); } if (version >= SPA_VERSION_DEDUP && error == 0) { error = dsl_prop_get_integer(zv->zv_name, zfs_prop_to_name(ZFS_PROP_DEDUP), &dedup, NULL); } } if (error != 0) return (error); tx = dmu_tx_create(os); dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL); dmu_tx_hold_bonus(tx, ZVOL_OBJ); error = dmu_tx_assign(tx, TXG_WAIT); if (error != 0) { dmu_tx_abort(tx); return (error); } /* * If we are resizing the dump device then we only need to * update the refreservation to match the newly updated * zvolsize. Otherwise, we save off the original state of the * zvol so that we can restore them if the zvol is ever undumpified. */ if (resize) { error = zap_update(os, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_REFRESERVATION), 8, 1, &zv->zv_volsize, tx); } else { error = zap_update(os, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_COMPRESSION), 8, 1, &compress, tx); if (error == 0) { error = zap_update(os, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_CHECKSUM), 8, 1, &checksum, tx); } if (error == 0) { error = zap_update(os, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_REFRESERVATION), 8, 1, &refresrv, tx); } if (error == 0) { error = zap_update(os, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), 8, 1, &vbs, tx); } if (error == 0) { error = dmu_object_set_blocksize( os, ZVOL_OBJ, SPA_OLD_MAXBLOCKSIZE, 0, tx); } if (version >= SPA_VERSION_DEDUP && error == 0) { error = zap_update(os, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_DEDUP), 8, 1, &dedup, tx); } if (error == 0) zv->zv_volblocksize = SPA_OLD_MAXBLOCKSIZE; } dmu_tx_commit(tx); /* * We only need update the zvol's property if we are initializing * the dump area for the first time. */ if (error == 0 && !resize) { /* * If MULTI_VDEV_CRASH_DUMP is active, use the NOPARITY checksum * function. Otherwise, use the old default -- OFF. */ checksum = spa_feature_is_active(spa, SPA_FEATURE_MULTI_VDEV_CRASH_DUMP) ? ZIO_CHECKSUM_NOPARITY : ZIO_CHECKSUM_OFF; VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0); VERIFY(nvlist_add_uint64(nv, zfs_prop_to_name(ZFS_PROP_REFRESERVATION), 0) == 0); VERIFY(nvlist_add_uint64(nv, zfs_prop_to_name(ZFS_PROP_COMPRESSION), ZIO_COMPRESS_OFF) == 0); VERIFY(nvlist_add_uint64(nv, zfs_prop_to_name(ZFS_PROP_CHECKSUM), checksum) == 0); if (version >= SPA_VERSION_DEDUP) { VERIFY(nvlist_add_uint64(nv, zfs_prop_to_name(ZFS_PROP_DEDUP), ZIO_CHECKSUM_OFF) == 0); } error = zfs_set_prop_nvlist(zv->zv_name, ZPROP_SRC_LOCAL, nv, NULL); nvlist_free(nv); } /* Allocate the space for the dump */ if (error == 0) error = zvol_prealloc(zv); return (error); } static int zvol_dumpify(zvol_state_t *zv) { int error = 0; uint64_t dumpsize = 0; dmu_tx_t *tx; objset_t *os = zv->zv_objset; if (zv->zv_flags & ZVOL_RDONLY) return (SET_ERROR(EROFS)); if (zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ, ZVOL_DUMPSIZE, 8, 1, &dumpsize) != 0 || dumpsize != zv->zv_volsize) { boolean_t resize = (dumpsize > 0); if ((error = zvol_dump_init(zv, resize)) != 0) { (void) zvol_dump_fini(zv); return (error); } } /* * Build up our lba mapping. */ error = zvol_get_lbas(zv); if (error) { (void) zvol_dump_fini(zv); return (error); } tx = dmu_tx_create(os); dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL); error = dmu_tx_assign(tx, TXG_WAIT); if (error) { dmu_tx_abort(tx); (void) zvol_dump_fini(zv); return (error); } zv->zv_flags |= ZVOL_DUMPIFIED; error = zap_update(os, ZVOL_ZAP_OBJ, ZVOL_DUMPSIZE, 8, 1, &zv->zv_volsize, tx); dmu_tx_commit(tx); if (error) { (void) zvol_dump_fini(zv); return (error); } txg_wait_synced(dmu_objset_pool(os), 0); return (0); } static int zvol_dump_fini(zvol_state_t *zv) { dmu_tx_t *tx; objset_t *os = zv->zv_objset; nvlist_t *nv; int error = 0; uint64_t checksum, compress, refresrv, vbs, dedup; uint64_t version = spa_version(dmu_objset_spa(zv->zv_objset)); /* * Attempt to restore the zvol back to its pre-dumpified state. * This is a best-effort attempt as it's possible that not all * of these properties were initialized during the dumpify process * (i.e. error during zvol_dump_init). */ tx = dmu_tx_create(os); dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL); error = dmu_tx_assign(tx, TXG_WAIT); if (error) { dmu_tx_abort(tx); return (error); } (void) zap_remove(os, ZVOL_ZAP_OBJ, ZVOL_DUMPSIZE, tx); dmu_tx_commit(tx); (void) zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_CHECKSUM), 8, 1, &checksum); (void) zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_COMPRESSION), 8, 1, &compress); (void) zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_REFRESERVATION), 8, 1, &refresrv); (void) zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), 8, 1, &vbs); VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0); (void) nvlist_add_uint64(nv, zfs_prop_to_name(ZFS_PROP_CHECKSUM), checksum); (void) nvlist_add_uint64(nv, zfs_prop_to_name(ZFS_PROP_COMPRESSION), compress); (void) nvlist_add_uint64(nv, zfs_prop_to_name(ZFS_PROP_REFRESERVATION), refresrv); if (version >= SPA_VERSION_DEDUP && zap_lookup(zv->zv_objset, ZVOL_ZAP_OBJ, zfs_prop_to_name(ZFS_PROP_DEDUP), 8, 1, &dedup) == 0) { (void) nvlist_add_uint64(nv, zfs_prop_to_name(ZFS_PROP_DEDUP), dedup); } (void) zfs_set_prop_nvlist(zv->zv_name, ZPROP_SRC_LOCAL, nv, NULL); nvlist_free(nv); zvol_free_extents(zv); zv->zv_flags &= ~ZVOL_DUMPIFIED; (void) dmu_free_long_range(os, ZVOL_OBJ, 0, DMU_OBJECT_END); /* wait for dmu_free_long_range to actually free the blocks */ txg_wait_synced(dmu_objset_pool(zv->zv_objset), 0); tx = dmu_tx_create(os); dmu_tx_hold_bonus(tx, ZVOL_OBJ); error = dmu_tx_assign(tx, TXG_WAIT); if (error) { dmu_tx_abort(tx); return (error); } if (dmu_object_set_blocksize(os, ZVOL_OBJ, vbs, 0, tx) == 0) zv->zv_volblocksize = vbs; dmu_tx_commit(tx); return (0); } #else /* !illumos */ static void zvol_geom_run(zvol_state_t *zv) { struct g_provider *pp; pp = zv->zv_provider; g_error_provider(pp, 0); kproc_kthread_add(zvol_geom_worker, zv, &zfsproc, NULL, 0, 0, "zfskern", "zvol %s", pp->name + sizeof(ZVOL_DRIVER)); } static void zvol_geom_destroy(zvol_state_t *zv) { struct g_provider *pp; g_topology_assert(); mtx_lock(&zv->zv_queue_mtx); zv->zv_state = 1; wakeup_one(&zv->zv_queue); while (zv->zv_state != 2) msleep(&zv->zv_state, &zv->zv_queue_mtx, 0, "zvol:w", 0); mtx_destroy(&zv->zv_queue_mtx); pp = zv->zv_provider; zv->zv_provider = NULL; pp->private = NULL; g_wither_geom(pp->geom, ENXIO); } static int zvol_geom_access(struct g_provider *pp, int acr, int acw, int ace) { int count, error, flags; g_topology_assert(); /* * To make it easier we expect either open or close, but not both * at the same time. */ KASSERT((acr >= 0 && acw >= 0 && ace >= 0) || (acr <= 0 && acw <= 0 && ace <= 0), ("Unsupported access request to %s (acr=%d, acw=%d, ace=%d).", pp->name, acr, acw, ace)); if (pp->private == NULL) { if (acr <= 0 && acw <= 0 && ace <= 0) return (0); return (pp->error); } /* * We don't pass FEXCL flag to zvol_open()/zvol_close() if ace != 0, * because GEOM already handles that and handles it a bit differently. * GEOM allows for multiple read/exclusive consumers and ZFS allows * only one exclusive consumer, no matter if it is reader or writer. * I like better the way GEOM works so I'll leave it for GEOM to * decide what to do. */ count = acr + acw + ace; if (count == 0) return (0); flags = 0; if (acr != 0 || ace != 0) flags |= FREAD; if (acw != 0) flags |= FWRITE; g_topology_unlock(); if (count > 0) error = zvol_open(pp, flags, count); else error = zvol_close(pp, flags, -count); g_topology_lock(); return (error); } static void zvol_geom_start(struct bio *bp) { zvol_state_t *zv; boolean_t first; zv = bp->bio_to->private; ASSERT(zv != NULL); switch (bp->bio_cmd) { case BIO_FLUSH: if (!THREAD_CAN_SLEEP()) goto enqueue; zil_commit(zv->zv_zilog, ZVOL_OBJ); g_io_deliver(bp, 0); break; case BIO_READ: case BIO_WRITE: case BIO_DELETE: if (!THREAD_CAN_SLEEP()) goto enqueue; zvol_strategy(bp); break; case BIO_GETATTR: { spa_t *spa = dmu_objset_spa(zv->zv_objset); uint64_t refd, avail, usedobjs, availobjs, val; if (g_handleattr_int(bp, "GEOM::candelete", 1)) return; if (strcmp(bp->bio_attribute, "blocksavail") == 0) { dmu_objset_space(zv->zv_objset, &refd, &avail, &usedobjs, &availobjs); if (g_handleattr_off_t(bp, "blocksavail", avail / DEV_BSIZE)) return; } else if (strcmp(bp->bio_attribute, "blocksused") == 0) { dmu_objset_space(zv->zv_objset, &refd, &avail, &usedobjs, &availobjs); if (g_handleattr_off_t(bp, "blocksused", refd / DEV_BSIZE)) return; } else if (strcmp(bp->bio_attribute, "poolblocksavail") == 0) { avail = metaslab_class_get_space(spa_normal_class(spa)); avail -= metaslab_class_get_alloc(spa_normal_class(spa)); if (g_handleattr_off_t(bp, "poolblocksavail", avail / DEV_BSIZE)) return; } else if (strcmp(bp->bio_attribute, "poolblocksused") == 0) { refd = metaslab_class_get_alloc(spa_normal_class(spa)); if (g_handleattr_off_t(bp, "poolblocksused", refd / DEV_BSIZE)) return; } /* FALLTHROUGH */ } default: g_io_deliver(bp, EOPNOTSUPP); break; } return; enqueue: mtx_lock(&zv->zv_queue_mtx); first = (bioq_first(&zv->zv_queue) == NULL); bioq_insert_tail(&zv->zv_queue, bp); mtx_unlock(&zv->zv_queue_mtx); if (first) wakeup_one(&zv->zv_queue); } static void zvol_geom_worker(void *arg) { zvol_state_t *zv; struct bio *bp; thread_lock(curthread); sched_prio(curthread, PRIBIO); thread_unlock(curthread); zv = arg; for (;;) { mtx_lock(&zv->zv_queue_mtx); bp = bioq_takefirst(&zv->zv_queue); if (bp == NULL) { if (zv->zv_state == 1) { zv->zv_state = 2; wakeup(&zv->zv_state); mtx_unlock(&zv->zv_queue_mtx); kthread_exit(); } msleep(&zv->zv_queue, &zv->zv_queue_mtx, PRIBIO | PDROP, "zvol:io", 0); continue; } mtx_unlock(&zv->zv_queue_mtx); switch (bp->bio_cmd) { case BIO_FLUSH: zil_commit(zv->zv_zilog, ZVOL_OBJ); g_io_deliver(bp, 0); break; case BIO_READ: case BIO_WRITE: case BIO_DELETE: zvol_strategy(bp); break; default: g_io_deliver(bp, EOPNOTSUPP); break; } } } extern boolean_t dataset_name_hidden(const char *name); static int zvol_create_snapshots(objset_t *os, const char *name) { uint64_t cookie, obj; char *sname; int error, len; cookie = obj = 0; sname = kmem_alloc(MAXPATHLEN, KM_SLEEP); #if 0 (void) dmu_objset_find(name, dmu_objset_prefetch, NULL, DS_FIND_SNAPSHOTS); #endif for (;;) { len = snprintf(sname, MAXPATHLEN, "%s@", name); if (len >= MAXPATHLEN) { dmu_objset_rele(os, FTAG); error = ENAMETOOLONG; break; } dsl_pool_config_enter(dmu_objset_pool(os), FTAG); error = dmu_snapshot_list_next(os, MAXPATHLEN - len, sname + len, &obj, &cookie, NULL); dsl_pool_config_exit(dmu_objset_pool(os), FTAG); if (error != 0) { if (error == ENOENT) error = 0; break; } error = zvol_create_minor(sname); if (error != 0 && error != EEXIST) { printf("ZFS WARNING: Unable to create ZVOL %s (error=%d).\n", sname, error); break; } } kmem_free(sname, MAXPATHLEN); return (error); } int zvol_create_minors(const char *name) { uint64_t cookie; objset_t *os; char *osname, *p; int error, len; if (dataset_name_hidden(name)) return (0); if ((error = dmu_objset_hold(name, FTAG, &os)) != 0) { printf("ZFS WARNING: Unable to put hold on %s (error=%d).\n", name, error); return (error); } if (dmu_objset_type(os) == DMU_OST_ZVOL) { dsl_dataset_long_hold(os->os_dsl_dataset, FTAG); dsl_pool_rele(dmu_objset_pool(os), FTAG); error = zvol_create_minor(name); if (error == 0 || error == EEXIST) { error = zvol_create_snapshots(os, name); } else { printf("ZFS WARNING: Unable to create ZVOL %s (error=%d).\n", name, error); } dsl_dataset_long_rele(os->os_dsl_dataset, FTAG); dsl_dataset_rele(os->os_dsl_dataset, FTAG); return (error); } if (dmu_objset_type(os) != DMU_OST_ZFS) { dmu_objset_rele(os, FTAG); return (0); } osname = kmem_alloc(MAXPATHLEN, KM_SLEEP); if (snprintf(osname, MAXPATHLEN, "%s/", name) >= MAXPATHLEN) { dmu_objset_rele(os, FTAG); kmem_free(osname, MAXPATHLEN); return (ENOENT); } p = osname + strlen(osname); len = MAXPATHLEN - (p - osname); #if 0 /* Prefetch the datasets. */ cookie = 0; while (dmu_dir_list_next(os, len, p, NULL, &cookie) == 0) { if (!dataset_name_hidden(osname)) (void) dmu_objset_prefetch(osname, NULL); } #endif cookie = 0; while (dmu_dir_list_next(os, MAXPATHLEN - (p - osname), p, NULL, &cookie) == 0) { dmu_objset_rele(os, FTAG); (void)zvol_create_minors(osname); if ((error = dmu_objset_hold(name, FTAG, &os)) != 0) { printf("ZFS WARNING: Unable to put hold on %s (error=%d).\n", name, error); return (error); } } dmu_objset_rele(os, FTAG); kmem_free(osname, MAXPATHLEN); return (0); } static void zvol_rename_minor(zvol_state_t *zv, const char *newname) { struct g_geom *gp; struct g_provider *pp; struct cdev *dev; ASSERT(MUTEX_HELD(&zfsdev_state_lock)); if (zv->zv_volmode == ZFS_VOLMODE_GEOM) { g_topology_lock(); pp = zv->zv_provider; ASSERT(pp != NULL); gp = pp->geom; ASSERT(gp != NULL); zv->zv_provider = NULL; g_wither_provider(pp, ENXIO); pp = g_new_providerf(gp, "%s/%s", ZVOL_DRIVER, newname); pp->flags |= G_PF_DIRECT_RECEIVE | G_PF_DIRECT_SEND; pp->sectorsize = DEV_BSIZE; pp->mediasize = zv->zv_volsize; pp->private = zv; zv->zv_provider = pp; g_error_provider(pp, 0); g_topology_unlock(); } else if (zv->zv_volmode == ZFS_VOLMODE_DEV) { struct make_dev_args args; if ((dev = zv->zv_dev) != NULL) { zv->zv_dev = NULL; destroy_dev(dev); if (zv->zv_total_opens > 0) { zv->zv_flags &= ~ZVOL_EXCL; zv->zv_total_opens = 0; zvol_last_close(zv); } } make_dev_args_init(&args); args.mda_flags = MAKEDEV_CHECKNAME | MAKEDEV_WAITOK; args.mda_devsw = &zvol_cdevsw; args.mda_cr = NULL; args.mda_uid = UID_ROOT; args.mda_gid = GID_OPERATOR; args.mda_mode = 0640; args.mda_si_drv2 = zv; if (make_dev_s(&args, &zv->zv_dev, "%s/%s", ZVOL_DRIVER, newname) == 0) zv->zv_dev->si_iosize_max = MAXPHYS; } strlcpy(zv->zv_name, newname, sizeof(zv->zv_name)); } void zvol_rename_minors(const char *oldname, const char *newname) { char name[MAXPATHLEN]; struct g_provider *pp; struct g_geom *gp; size_t oldnamelen, newnamelen; zvol_state_t *zv; char *namebuf; boolean_t locked = B_FALSE; oldnamelen = strlen(oldname); newnamelen = strlen(newname); DROP_GIANT(); /* See comment in zvol_open(). */ if (!MUTEX_HELD(&zfsdev_state_lock)) { mutex_enter(&zfsdev_state_lock); locked = B_TRUE; } LIST_FOREACH(zv, &all_zvols, zv_links) { if (strcmp(zv->zv_name, oldname) == 0) { zvol_rename_minor(zv, newname); } else if (strncmp(zv->zv_name, oldname, oldnamelen) == 0 && (zv->zv_name[oldnamelen] == '/' || zv->zv_name[oldnamelen] == '@')) { snprintf(name, sizeof(name), "%s%c%s", newname, zv->zv_name[oldnamelen], zv->zv_name + oldnamelen + 1); zvol_rename_minor(zv, name); } } if (locked) mutex_exit(&zfsdev_state_lock); PICKUP_GIANT(); } static int zvol_d_open(struct cdev *dev, int flags, int fmt, struct thread *td) { zvol_state_t *zv = dev->si_drv2; int err = 0; mutex_enter(&zfsdev_state_lock); if (zv->zv_total_opens == 0) err = zvol_first_open(zv); if (err) { mutex_exit(&zfsdev_state_lock); return (err); } if ((flags & FWRITE) && (zv->zv_flags & ZVOL_RDONLY)) { err = SET_ERROR(EROFS); goto out; } if (zv->zv_flags & ZVOL_EXCL) { err = SET_ERROR(EBUSY); goto out; } #ifdef FEXCL if (flags & FEXCL) { if (zv->zv_total_opens != 0) { err = SET_ERROR(EBUSY); goto out; } zv->zv_flags |= ZVOL_EXCL; } #endif zv->zv_total_opens++; if (flags & (FSYNC | FDSYNC)) { zv->zv_sync_cnt++; if (zv->zv_sync_cnt == 1) zil_async_to_sync(zv->zv_zilog, ZVOL_OBJ); } mutex_exit(&zfsdev_state_lock); return (err); out: if (zv->zv_total_opens == 0) zvol_last_close(zv); mutex_exit(&zfsdev_state_lock); return (err); } static int zvol_d_close(struct cdev *dev, int flags, int fmt, struct thread *td) { zvol_state_t *zv = dev->si_drv2; mutex_enter(&zfsdev_state_lock); if (zv->zv_flags & ZVOL_EXCL) { ASSERT(zv->zv_total_opens == 1); zv->zv_flags &= ~ZVOL_EXCL; } /* * If the open count is zero, this is a spurious close. * That indicates a bug in the kernel / DDI framework. */ ASSERT(zv->zv_total_opens != 0); /* * You may get multiple opens, but only one close. */ zv->zv_total_opens--; if (flags & (FSYNC | FDSYNC)) zv->zv_sync_cnt--; if (zv->zv_total_opens == 0) zvol_last_close(zv); mutex_exit(&zfsdev_state_lock); return (0); } static int zvol_d_ioctl(struct cdev *dev, u_long cmd, caddr_t data, int fflag, struct thread *td) { zvol_state_t *zv; rl_t *rl; off_t offset, length; int i, error; boolean_t sync; zv = dev->si_drv2; error = 0; KASSERT(zv->zv_total_opens > 0, ("Device with zero access count in zvol_d_ioctl")); i = IOCPARM_LEN(cmd); switch (cmd) { case DIOCGSECTORSIZE: *(u_int *)data = DEV_BSIZE; break; case DIOCGMEDIASIZE: *(off_t *)data = zv->zv_volsize; break; case DIOCGFLUSH: zil_commit(zv->zv_zilog, ZVOL_OBJ); break; case DIOCGDELETE: if (!zvol_unmap_enabled) break; offset = ((off_t *)data)[0]; length = ((off_t *)data)[1]; if ((offset % DEV_BSIZE) != 0 || (length % DEV_BSIZE) != 0 || offset < 0 || offset >= zv->zv_volsize || length <= 0) { printf("%s: offset=%jd length=%jd\n", __func__, offset, length); error = EINVAL; break; } rl = zfs_range_lock(&zv->zv_znode, offset, length, RL_WRITER); dmu_tx_t *tx = dmu_tx_create(zv->zv_objset); error = dmu_tx_assign(tx, TXG_WAIT); if (error != 0) { sync = FALSE; dmu_tx_abort(tx); } else { sync = (zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS); zvol_log_truncate(zv, tx, offset, length, sync); dmu_tx_commit(tx); error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, offset, length); } zfs_range_unlock(rl); if (sync) zil_commit(zv->zv_zilog, ZVOL_OBJ); break; case DIOCGSTRIPESIZE: *(off_t *)data = zv->zv_volblocksize; break; case DIOCGSTRIPEOFFSET: *(off_t *)data = 0; break; case DIOCGATTR: { spa_t *spa = dmu_objset_spa(zv->zv_objset); struct diocgattr_arg *arg = (struct diocgattr_arg *)data; uint64_t refd, avail, usedobjs, availobjs; if (strcmp(arg->name, "GEOM::candelete") == 0) arg->value.i = 1; else if (strcmp(arg->name, "blocksavail") == 0) { dmu_objset_space(zv->zv_objset, &refd, &avail, &usedobjs, &availobjs); arg->value.off = avail / DEV_BSIZE; } else if (strcmp(arg->name, "blocksused") == 0) { dmu_objset_space(zv->zv_objset, &refd, &avail, &usedobjs, &availobjs); arg->value.off = refd / DEV_BSIZE; } else if (strcmp(arg->name, "poolblocksavail") == 0) { avail = metaslab_class_get_space(spa_normal_class(spa)); avail -= metaslab_class_get_alloc(spa_normal_class(spa)); arg->value.off = avail / DEV_BSIZE; } else if (strcmp(arg->name, "poolblocksused") == 0) { refd = metaslab_class_get_alloc(spa_normal_class(spa)); arg->value.off = refd / DEV_BSIZE; } else error = ENOIOCTL; break; } case FIOSEEKHOLE: case FIOSEEKDATA: { off_t *off = (off_t *)data; uint64_t noff; boolean_t hole; hole = (cmd == FIOSEEKHOLE); noff = *off; error = dmu_offset_next(zv->zv_objset, ZVOL_OBJ, hole, &noff); *off = noff; break; } default: error = ENOIOCTL; } return (error); } #endif /* illumos */ Index: projects/clang400-import/sys/cddl/contrib/opensolaris =================================================================== --- projects/clang400-import/sys/cddl/contrib/opensolaris (revision 314522) +++ projects/clang400-import/sys/cddl/contrib/opensolaris (revision 314523) Property changes on: projects/clang400-import/sys/cddl/contrib/opensolaris ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/sys/cddl/contrib/opensolaris:r314420-314522 Index: projects/clang400-import/sys/compat/ia32/ia32_sysvec.c =================================================================== --- projects/clang400-import/sys/compat/ia32/ia32_sysvec.c (revision 314522) +++ projects/clang400-import/sys/compat/ia32/ia32_sysvec.c (revision 314523) @@ -1,238 +1,235 @@ /*- * Copyright (c) 2002 Doug Rabson * Copyright (c) 2003 Peter Wemm * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_compat.h" #define __ELF_WORD_SIZE 32 #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include CTASSERT(sizeof(struct ia32_mcontext) == 640); CTASSERT(sizeof(struct ia32_ucontext) == 704); CTASSERT(sizeof(struct ia32_sigframe) == 800); CTASSERT(sizeof(struct siginfo32) == 64); #ifdef COMPAT_FREEBSD4 CTASSERT(sizeof(struct ia32_mcontext4) == 260); CTASSERT(sizeof(struct ia32_ucontext4) == 324); CTASSERT(sizeof(struct ia32_sigframe4) == 408); #endif extern const char *freebsd32_syscallnames[]; static SYSCTL_NODE(_compat, OID_AUTO, ia32, CTLFLAG_RW, 0, "ia32 mode"); static u_long ia32_maxdsiz = IA32_MAXDSIZ; SYSCTL_ULONG(_compat_ia32, OID_AUTO, maxdsiz, CTLFLAG_RWTUN, &ia32_maxdsiz, 0, ""); u_long ia32_maxssiz = IA32_MAXSSIZ; SYSCTL_ULONG(_compat_ia32, OID_AUTO, maxssiz, CTLFLAG_RWTUN, &ia32_maxssiz, 0, ""); static u_long ia32_maxvmem = IA32_MAXVMEM; SYSCTL_ULONG(_compat_ia32, OID_AUTO, maxvmem, CTLFLAG_RWTUN, &ia32_maxvmem, 0, ""); struct sysentvec ia32_freebsd_sysvec = { .sv_size = FREEBSD32_SYS_MAXSYSCALL, .sv_table = freebsd32_sysent, .sv_mask = 0, .sv_errsize = 0, .sv_errtbl = NULL, .sv_transtrap = NULL, .sv_fixup = elf32_freebsd_fixup, .sv_sendsig = ia32_sendsig, .sv_sigcode = ia32_sigcode, .sv_szsigcode = &sz_ia32_sigcode, .sv_name = "FreeBSD ELF32", .sv_coredump = elf32_coredump, .sv_imgact_try = NULL, .sv_minsigstksz = MINSIGSTKSZ, .sv_pagesize = IA32_PAGE_SIZE, .sv_minuser = FREEBSD32_MINUSER, .sv_maxuser = FREEBSD32_MAXUSER, .sv_usrstack = FREEBSD32_USRSTACK, .sv_psstrings = FREEBSD32_PS_STRINGS, .sv_stackprot = VM_PROT_ALL, .sv_copyout_strings = freebsd32_copyout_strings, .sv_setregs = ia32_setregs, .sv_fixlimit = ia32_fixlimit, .sv_maxssiz = &ia32_maxssiz, - .sv_flags = -#ifdef __amd64__ - SV_SHP | SV_TIMEKEEP | -#endif - SV_ABI_FREEBSD | SV_IA32 | SV_ILP32, + .sv_flags = SV_ABI_FREEBSD | SV_IA32 | SV_ILP32 | + SV_SHP | SV_TIMEKEEP, .sv_set_syscall_retval = ia32_set_syscall_retval, .sv_fetch_syscall_args = ia32_fetch_syscall_args, .sv_syscallnames = freebsd32_syscallnames, .sv_shared_page_base = FREEBSD32_SHAREDPAGE, .sv_shared_page_len = PAGE_SIZE, .sv_schedtail = NULL, .sv_thread_detach = NULL, .sv_trap = NULL, }; INIT_SYSENTVEC(elf_ia32_sysvec, &ia32_freebsd_sysvec); static Elf32_Brandinfo ia32_brand_info = { .brand = ELFOSABI_FREEBSD, .machine = EM_386, .compat_3_brand = "FreeBSD", .emul_path = NULL, .interp_path = "/libexec/ld-elf.so.1", .sysvec = &ia32_freebsd_sysvec, .interp_newpath = "/libexec/ld-elf32.so.1", .brand_note = &elf32_freebsd_brandnote, .flags = BI_CAN_EXEC_DYN | BI_BRAND_NOTE }; SYSINIT(ia32, SI_SUB_EXEC, SI_ORDER_MIDDLE, (sysinit_cfunc_t) elf32_insert_brand_entry, &ia32_brand_info); static Elf32_Brandinfo ia32_brand_oinfo = { .brand = ELFOSABI_FREEBSD, .machine = EM_386, .compat_3_brand = "FreeBSD", .emul_path = NULL, .interp_path = "/usr/libexec/ld-elf.so.1", .sysvec = &ia32_freebsd_sysvec, .interp_newpath = "/libexec/ld-elf32.so.1", .brand_note = &elf32_freebsd_brandnote, .flags = BI_CAN_EXEC_DYN | BI_BRAND_NOTE }; SYSINIT(oia32, SI_SUB_EXEC, SI_ORDER_ANY, (sysinit_cfunc_t) elf32_insert_brand_entry, &ia32_brand_oinfo); static Elf32_Brandinfo kia32_brand_info = { .brand = ELFOSABI_FREEBSD, .machine = EM_386, .compat_3_brand = "FreeBSD", .emul_path = NULL, .interp_path = "/lib/ld.so.1", .sysvec = &ia32_freebsd_sysvec, .brand_note = &elf32_kfreebsd_brandnote, .flags = BI_CAN_EXEC_DYN | BI_BRAND_NOTE_MANDATORY }; SYSINIT(kia32, SI_SUB_EXEC, SI_ORDER_ANY, (sysinit_cfunc_t) elf32_insert_brand_entry, &kia32_brand_info); void elf32_dump_thread(struct thread *td, void *dst, size_t *off) { void *buf; size_t len; len = 0; if (use_xsave) { if (dst != NULL) { fpugetregs(td); len += elf32_populate_note(NT_X86_XSTATE, get_pcb_user_save_td(td), dst, cpu_max_ext_state_size, &buf); *(uint64_t *)((char *)buf + X86_XSTATE_XCR0_OFFSET) = xsave_mask; } else len += elf32_populate_note(NT_X86_XSTATE, NULL, NULL, cpu_max_ext_state_size, NULL); } *off = len; } void ia32_fixlimit(struct rlimit *rl, int which) { switch (which) { case RLIMIT_DATA: if (ia32_maxdsiz != 0) { if (rl->rlim_cur > ia32_maxdsiz) rl->rlim_cur = ia32_maxdsiz; if (rl->rlim_max > ia32_maxdsiz) rl->rlim_max = ia32_maxdsiz; } break; case RLIMIT_STACK: if (ia32_maxssiz != 0) { if (rl->rlim_cur > ia32_maxssiz) rl->rlim_cur = ia32_maxssiz; if (rl->rlim_max > ia32_maxssiz) rl->rlim_max = ia32_maxssiz; } break; case RLIMIT_VMEM: if (ia32_maxvmem != 0) { if (rl->rlim_cur > ia32_maxvmem) rl->rlim_cur = ia32_maxvmem; if (rl->rlim_max > ia32_maxvmem) rl->rlim_max = ia32_maxvmem; } break; } } Index: projects/clang400-import/sys/dev/cxgbe/iw_cxgbe/mem.c =================================================================== --- projects/clang400-import/sys/dev/cxgbe/iw_cxgbe/mem.c (revision 314522) +++ projects/clang400-import/sys/dev/cxgbe/iw_cxgbe/mem.c (revision 314523) @@ -1,845 +1,842 @@ /* * Copyright (c) 2009-2013 Chelsio, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #ifdef TCP_OFFLOAD #include #include #include #include #include #include "iw_cxgbe.h" #define T4_ULPTX_MIN_IO 32 #define C4IW_MAX_INLINE_SIZE 96 static int mr_exceeds_hw_limits(struct c4iw_dev *dev __unused, u64 length) { return (length >= 8*1024*1024*1024ULL); } static int write_adapter_mem(struct c4iw_rdev *rdev, u32 addr, u32 len, void *data) { struct adapter *sc = rdev->adap; struct ulp_mem_io *ulpmc; struct ulptx_idata *ulpsc; u8 wr_len, *to_dp, *from_dp; int copy_len, num_wqe, i, ret = 0; struct c4iw_wr_wait wr_wait; struct wrqe *wr; u32 cmd; cmd = cpu_to_be32(V_ULPTX_CMD(ULP_TX_MEM_WRITE)); if (is_t4(sc)) cmd |= cpu_to_be32(F_ULP_MEMIO_ORDER); else cmd |= cpu_to_be32(F_T5_ULP_MEMIO_IMM); addr &= 0x7FFFFFF; CTR3(KTR_IW_CXGBE, "%s addr 0x%x len %u", __func__, addr, len); num_wqe = DIV_ROUND_UP(len, C4IW_MAX_INLINE_SIZE); c4iw_init_wr_wait(&wr_wait); for (i = 0; i < num_wqe; i++) { copy_len = min(len, C4IW_MAX_INLINE_SIZE); wr_len = roundup(sizeof *ulpmc + sizeof *ulpsc + roundup(copy_len, T4_ULPTX_MIN_IO), 16); wr = alloc_wrqe(wr_len, &sc->sge.mgmtq); if (wr == NULL) return (0); ulpmc = wrtod(wr); memset(ulpmc, 0, wr_len); INIT_ULPTX_WR(ulpmc, wr_len, 0, 0); if (i == (num_wqe-1)) { ulpmc->wr.wr_hi = cpu_to_be32(V_FW_WR_OP(FW_ULPTX_WR) | F_FW_WR_COMPL); ulpmc->wr.wr_lo = (__force __be64)(unsigned long) &wr_wait; } else ulpmc->wr.wr_hi = cpu_to_be32(V_FW_WR_OP(FW_ULPTX_WR)); ulpmc->wr.wr_mid = cpu_to_be32( V_FW_WR_LEN16(DIV_ROUND_UP(wr_len, 16))); ulpmc->cmd = cmd; ulpmc->dlen = cpu_to_be32(V_ULP_MEMIO_DATA_LEN( DIV_ROUND_UP(copy_len, T4_ULPTX_MIN_IO))); ulpmc->len16 = cpu_to_be32(DIV_ROUND_UP(wr_len-sizeof(ulpmc->wr), 16)); ulpmc->lock_addr = cpu_to_be32(V_ULP_MEMIO_ADDR(addr + i * 3)); ulpsc = (struct ulptx_idata *)(ulpmc + 1); ulpsc->cmd_more = cpu_to_be32(V_ULPTX_CMD(ULP_TX_SC_IMM)); ulpsc->len = cpu_to_be32(roundup(copy_len, T4_ULPTX_MIN_IO)); to_dp = (u8 *)(ulpsc + 1); from_dp = (u8 *)data + i * C4IW_MAX_INLINE_SIZE; if (data) memcpy(to_dp, from_dp, copy_len); else memset(to_dp, 0, copy_len); if (copy_len % T4_ULPTX_MIN_IO) memset(to_dp + copy_len, 0, T4_ULPTX_MIN_IO - (copy_len % T4_ULPTX_MIN_IO)); t4_wrq_tx(sc, wr); len -= C4IW_MAX_INLINE_SIZE; } ret = c4iw_wait_for_reply(rdev, &wr_wait, 0, 0, __func__); return ret; } /* * Build and write a TPT entry. * IN: stag key, pdid, perm, bind_enabled, zbva, to, len, page_size, * pbl_size and pbl_addr * OUT: stag index */ static int write_tpt_entry(struct c4iw_rdev *rdev, u32 reset_tpt_entry, u32 *stag, u8 stag_state, u32 pdid, enum fw_ri_stag_type type, enum fw_ri_mem_perms perm, int bind_enabled, u32 zbva, u64 to, u64 len, u8 page_size, u32 pbl_size, u32 pbl_addr) { int err; struct fw_ri_tpte tpt; u32 stag_idx; static atomic_t key; if (c4iw_fatal_error(rdev)) return -EIO; stag_state = stag_state > 0; stag_idx = (*stag) >> 8; if ((!reset_tpt_entry) && (*stag == T4_STAG_UNSET)) { stag_idx = c4iw_get_resource(&rdev->resource.tpt_table); if (!stag_idx) { mutex_lock(&rdev->stats.lock); rdev->stats.stag.fail++; mutex_unlock(&rdev->stats.lock); return -ENOMEM; } mutex_lock(&rdev->stats.lock); rdev->stats.stag.cur += 32; if (rdev->stats.stag.cur > rdev->stats.stag.max) rdev->stats.stag.max = rdev->stats.stag.cur; mutex_unlock(&rdev->stats.lock); *stag = (stag_idx << 8) | (atomic_inc_return(&key) & 0xff); } CTR5(KTR_IW_CXGBE, "%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x", __func__, stag_state, type, pdid, stag_idx); /* write TPT entry */ if (reset_tpt_entry) memset(&tpt, 0, sizeof(tpt)); else { tpt.valid_to_pdid = cpu_to_be32(F_FW_RI_TPTE_VALID | V_FW_RI_TPTE_STAGKEY((*stag & M_FW_RI_TPTE_STAGKEY)) | V_FW_RI_TPTE_STAGSTATE(stag_state) | V_FW_RI_TPTE_STAGTYPE(type) | V_FW_RI_TPTE_PDID(pdid)); tpt.locread_to_qpid = cpu_to_be32(V_FW_RI_TPTE_PERM(perm) | (bind_enabled ? F_FW_RI_TPTE_MWBINDEN : 0) | V_FW_RI_TPTE_ADDRTYPE((zbva ? FW_RI_ZERO_BASED_TO : FW_RI_VA_BASED_TO))| V_FW_RI_TPTE_PS(page_size)); tpt.nosnoop_pbladdr = !pbl_size ? 0 : cpu_to_be32( V_FW_RI_TPTE_PBLADDR(PBL_OFF(rdev, pbl_addr)>>3)); tpt.len_lo = cpu_to_be32((u32)(len & 0xffffffffUL)); tpt.va_hi = cpu_to_be32((u32)(to >> 32)); tpt.va_lo_fbo = cpu_to_be32((u32)(to & 0xffffffffUL)); tpt.dca_mwbcnt_pstag = cpu_to_be32(0); tpt.len_hi = cpu_to_be32((u32)(len >> 32)); } err = write_adapter_mem(rdev, stag_idx + (rdev->adap->vres.stag.start >> 5), sizeof(tpt), &tpt); if (reset_tpt_entry) { c4iw_put_resource(&rdev->resource.tpt_table, stag_idx); mutex_lock(&rdev->stats.lock); rdev->stats.stag.cur -= 32; mutex_unlock(&rdev->stats.lock); } return err; } static int write_pbl(struct c4iw_rdev *rdev, __be64 *pbl, u32 pbl_addr, u32 pbl_size) { int err; CTR4(KTR_IW_CXGBE, "%s *pdb_addr 0x%x, pbl_base 0x%x, pbl_size %d", __func__, pbl_addr, rdev->adap->vres.pbl.start, pbl_size); err = write_adapter_mem(rdev, pbl_addr >> 5, pbl_size << 3, pbl); return err; } static int dereg_mem(struct c4iw_rdev *rdev, u32 stag, u32 pbl_size, u32 pbl_addr) { return write_tpt_entry(rdev, 1, &stag, 0, 0, 0, 0, 0, 0, 0UL, 0, 0, pbl_size, pbl_addr); } static int allocate_window(struct c4iw_rdev *rdev, u32 * stag, u32 pdid) { *stag = T4_STAG_UNSET; return write_tpt_entry(rdev, 0, stag, 0, pdid, FW_RI_STAG_MW, 0, 0, 0, 0UL, 0, 0, 0, 0); } static int deallocate_window(struct c4iw_rdev *rdev, u32 stag) { return write_tpt_entry(rdev, 1, &stag, 0, 0, 0, 0, 0, 0, 0UL, 0, 0, 0, 0); } static int allocate_stag(struct c4iw_rdev *rdev, u32 *stag, u32 pdid, u32 pbl_size, u32 pbl_addr) { *stag = T4_STAG_UNSET; return write_tpt_entry(rdev, 0, stag, 0, pdid, FW_RI_STAG_NSMR, 0, 0, 0, 0UL, 0, 0, pbl_size, pbl_addr); } static int finish_mem_reg(struct c4iw_mr *mhp, u32 stag) { u32 mmid; mhp->attr.state = 1; mhp->attr.stag = stag; mmid = stag >> 8; mhp->ibmr.rkey = mhp->ibmr.lkey = stag; CTR3(KTR_IW_CXGBE, "%s mmid 0x%x mhp %p", __func__, mmid, mhp); return insert_handle(mhp->rhp, &mhp->rhp->mmidr, mhp, mmid); } static int register_mem(struct c4iw_dev *rhp, struct c4iw_pd *php, struct c4iw_mr *mhp, int shift) { u32 stag = T4_STAG_UNSET; int ret; ret = write_tpt_entry(&rhp->rdev, 0, &stag, 1, mhp->attr.pdid, FW_RI_STAG_NSMR, mhp->attr.len ? mhp->attr.perms : 0, mhp->attr.mw_bind_enable, mhp->attr.zbva, mhp->attr.va_fbo, mhp->attr.len ? mhp->attr.len : -1, shift - 12, mhp->attr.pbl_size, mhp->attr.pbl_addr); if (ret) return ret; ret = finish_mem_reg(mhp, stag); if (ret) dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, mhp->attr.pbl_addr); return ret; } static int reregister_mem(struct c4iw_dev *rhp, struct c4iw_pd *php, struct c4iw_mr *mhp, int shift, int npages) { u32 stag; int ret; if (npages > mhp->attr.pbl_size) return -ENOMEM; stag = mhp->attr.stag; ret = write_tpt_entry(&rhp->rdev, 0, &stag, 1, mhp->attr.pdid, FW_RI_STAG_NSMR, mhp->attr.perms, mhp->attr.mw_bind_enable, mhp->attr.zbva, mhp->attr.va_fbo, mhp->attr.len, shift - 12, mhp->attr.pbl_size, mhp->attr.pbl_addr); if (ret) return ret; ret = finish_mem_reg(mhp, stag); if (ret) dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, mhp->attr.pbl_addr); return ret; } static int alloc_pbl(struct c4iw_mr *mhp, int npages) { mhp->attr.pbl_addr = c4iw_pblpool_alloc(&mhp->rhp->rdev, npages << 3); if (!mhp->attr.pbl_addr) return -ENOMEM; mhp->attr.pbl_size = npages; return 0; } static int build_phys_page_list(struct ib_phys_buf *buffer_list, int num_phys_buf, u64 *iova_start, u64 *total_size, int *npages, int *shift, __be64 **page_list) { u64 mask; int i, j, n; mask = 0; *total_size = 0; for (i = 0; i < num_phys_buf; ++i) { if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) return -EINVAL; if (i != 0 && i != num_phys_buf - 1 && (buffer_list[i].size & ~PAGE_MASK)) return -EINVAL; *total_size += buffer_list[i].size; if (i > 0) mask |= buffer_list[i].addr; else mask |= buffer_list[i].addr & PAGE_MASK; if (i != num_phys_buf - 1) mask |= buffer_list[i].addr + buffer_list[i].size; else mask |= (buffer_list[i].addr + buffer_list[i].size + PAGE_SIZE - 1) & PAGE_MASK; } - if (*total_size > 0xFFFFFFFFULL) - return -ENOMEM; - /* Find largest page shift we can use to cover buffers */ for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift)) if ((1ULL << *shift) & mask) break; buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1); buffer_list[0].addr &= ~0ull << *shift; *npages = 0; for (i = 0; i < num_phys_buf; ++i) *npages += (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift; if (!*npages) return -EINVAL; *page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL); if (!*page_list) return -ENOMEM; n = 0; for (i = 0; i < num_phys_buf; ++i) for (j = 0; j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift; ++j) (*page_list)[n++] = cpu_to_be64(buffer_list[i].addr + ((u64) j << *shift)); CTR6(KTR_IW_CXGBE, "%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d", __func__, (unsigned long long)*iova_start, (unsigned long long)mask, *shift, (unsigned long long)*total_size, *npages); return 0; } int c4iw_reregister_phys_mem(struct ib_mr *mr, int mr_rereg_mask, struct ib_pd *pd, struct ib_phys_buf *buffer_list, int num_phys_buf, int acc, u64 *iova_start) { struct c4iw_mr mh, *mhp; struct c4iw_pd *php; struct c4iw_dev *rhp; __be64 *page_list = NULL; int shift = 0; u64 total_size = 0; int npages = 0; int ret; CTR3(KTR_IW_CXGBE, "%s ib_mr %p ib_pd %p", __func__, mr, pd); /* There can be no memory windows */ if (atomic_read(&mr->usecnt)) return -EINVAL; mhp = to_c4iw_mr(mr); rhp = mhp->rhp; php = to_c4iw_pd(mr->pd); /* make sure we are on the same adapter */ if (rhp != php->rhp) return -EINVAL; memcpy(&mh, mhp, sizeof *mhp); if (mr_rereg_mask & IB_MR_REREG_PD) php = to_c4iw_pd(pd); if (mr_rereg_mask & IB_MR_REREG_ACCESS) { mh.attr.perms = c4iw_ib_to_tpt_access(acc); mh.attr.mw_bind_enable = (acc & IB_ACCESS_MW_BIND) == IB_ACCESS_MW_BIND; } if (mr_rereg_mask & IB_MR_REREG_TRANS) { ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start, &total_size, &npages, &shift, &page_list); if (ret) return ret; } if (mr_exceeds_hw_limits(rhp, total_size)) { kfree(page_list); return -EINVAL; } ret = reregister_mem(rhp, php, &mh, shift, npages); kfree(page_list); if (ret) return ret; if (mr_rereg_mask & IB_MR_REREG_PD) mhp->attr.pdid = php->pdid; if (mr_rereg_mask & IB_MR_REREG_ACCESS) mhp->attr.perms = c4iw_ib_to_tpt_access(acc); if (mr_rereg_mask & IB_MR_REREG_TRANS) { mhp->attr.zbva = 0; mhp->attr.va_fbo = *iova_start; mhp->attr.page_size = shift - 12; mhp->attr.len = (u32) total_size; mhp->attr.pbl_size = npages; } return 0; } struct ib_mr *c4iw_register_phys_mem(struct ib_pd *pd, struct ib_phys_buf *buffer_list, int num_phys_buf, int acc, u64 *iova_start) { __be64 *page_list; int shift; u64 total_size; int npages; struct c4iw_dev *rhp; struct c4iw_pd *php; struct c4iw_mr *mhp; int ret; CTR2(KTR_IW_CXGBE, "%s ib_pd %p", __func__, pd); php = to_c4iw_pd(pd); rhp = php->rhp; mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); if (!mhp) return ERR_PTR(-ENOMEM); mhp->rhp = rhp; /* First check that we have enough alignment */ if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) { ret = -EINVAL; goto err; } if (num_phys_buf > 1 && ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) { ret = -EINVAL; goto err; } ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start, &total_size, &npages, &shift, &page_list); if (ret) goto err; if (mr_exceeds_hw_limits(rhp, total_size)) { kfree(page_list); ret = -EINVAL; goto err; } ret = alloc_pbl(mhp, npages); if (ret) { kfree(page_list); goto err; } ret = write_pbl(&mhp->rhp->rdev, page_list, mhp->attr.pbl_addr, npages); kfree(page_list); if (ret) goto err_pbl; mhp->attr.pdid = php->pdid; mhp->attr.zbva = 0; mhp->attr.perms = c4iw_ib_to_tpt_access(acc); mhp->attr.va_fbo = *iova_start; mhp->attr.page_size = shift - 12; mhp->attr.len = (u32) total_size; mhp->attr.pbl_size = npages; ret = register_mem(rhp, php, mhp, shift); if (ret) goto err_pbl; return &mhp->ibmr; err_pbl: c4iw_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr, mhp->attr.pbl_size << 3); err: kfree(mhp); return ERR_PTR(ret); } struct ib_mr *c4iw_get_dma_mr(struct ib_pd *pd, int acc) { struct c4iw_dev *rhp; struct c4iw_pd *php; struct c4iw_mr *mhp; int ret; u32 stag = T4_STAG_UNSET; CTR2(KTR_IW_CXGBE, "%s ib_pd %p", __func__, pd); php = to_c4iw_pd(pd); rhp = php->rhp; mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); if (!mhp) return ERR_PTR(-ENOMEM); mhp->rhp = rhp; mhp->attr.pdid = php->pdid; mhp->attr.perms = c4iw_ib_to_tpt_access(acc); mhp->attr.mw_bind_enable = (acc&IB_ACCESS_MW_BIND) == IB_ACCESS_MW_BIND; mhp->attr.zbva = 0; mhp->attr.va_fbo = 0; mhp->attr.page_size = 0; mhp->attr.len = ~0UL; mhp->attr.pbl_size = 0; ret = write_tpt_entry(&rhp->rdev, 0, &stag, 1, php->pdid, FW_RI_STAG_NSMR, mhp->attr.perms, mhp->attr.mw_bind_enable, 0, 0, ~0UL, 0, 0, 0); if (ret) goto err1; ret = finish_mem_reg(mhp, stag); if (ret) goto err2; return &mhp->ibmr; err2: dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, mhp->attr.pbl_addr); err1: kfree(mhp); return ERR_PTR(ret); } struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt, int acc, struct ib_udata *udata, int mr_id) { __be64 *pages; int shift, n, len; int i, k, entry; int err = 0; struct scatterlist *sg; struct c4iw_dev *rhp; struct c4iw_pd *php; struct c4iw_mr *mhp; CTR2(KTR_IW_CXGBE, "%s ib_pd %p", __func__, pd); if (length == ~0ULL) return ERR_PTR(-EINVAL); if ((length + start) < start) return ERR_PTR(-EINVAL); php = to_c4iw_pd(pd); rhp = php->rhp; if (mr_exceeds_hw_limits(rhp, length)) return ERR_PTR(-EINVAL); mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); if (!mhp) return ERR_PTR(-ENOMEM); mhp->rhp = rhp; mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0); if (IS_ERR(mhp->umem)) { err = PTR_ERR(mhp->umem); kfree(mhp); return ERR_PTR(err); } shift = ffs(mhp->umem->page_size) - 1; n = mhp->umem->nmap; err = alloc_pbl(mhp, n); if (err) goto err; pages = (__be64 *) __get_free_page(GFP_KERNEL); if (!pages) { err = -ENOMEM; goto err_pbl; } i = n = 0; for_each_sg(mhp->umem->sg_head.sgl, sg, mhp->umem->nmap, entry) { len = sg_dma_len(sg) >> shift; for (k = 0; k < len; ++k) { pages[i++] = cpu_to_be64(sg_dma_address(sg) + mhp->umem->page_size * k); if (i == PAGE_SIZE / sizeof *pages) { err = write_pbl(&mhp->rhp->rdev, pages, mhp->attr.pbl_addr + (n << 3), i); if (err) goto pbl_done; n += i; i = 0; } } } if (i) err = write_pbl(&mhp->rhp->rdev, pages, mhp->attr.pbl_addr + (n << 3), i); pbl_done: free_page((unsigned long) pages); if (err) goto err_pbl; mhp->attr.pdid = php->pdid; mhp->attr.zbva = 0; mhp->attr.perms = c4iw_ib_to_tpt_access(acc); mhp->attr.va_fbo = virt; mhp->attr.page_size = shift - 12; mhp->attr.len = length; err = register_mem(rhp, php, mhp, shift); if (err) goto err_pbl; return &mhp->ibmr; err_pbl: c4iw_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr, mhp->attr.pbl_size << 3); err: ib_umem_release(mhp->umem); kfree(mhp); return ERR_PTR(err); } struct ib_mw *c4iw_alloc_mw(struct ib_pd *pd, enum ib_mw_type type) { struct c4iw_dev *rhp; struct c4iw_pd *php; struct c4iw_mw *mhp; u32 mmid; u32 stag = 0; int ret; php = to_c4iw_pd(pd); rhp = php->rhp; mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); if (!mhp) return ERR_PTR(-ENOMEM); ret = allocate_window(&rhp->rdev, &stag, php->pdid); if (ret) { kfree(mhp); return ERR_PTR(ret); } mhp->rhp = rhp; mhp->attr.pdid = php->pdid; mhp->attr.type = FW_RI_STAG_MW; mhp->attr.stag = stag; mmid = (stag) >> 8; mhp->ibmw.rkey = stag; if (insert_handle(rhp, &rhp->mmidr, mhp, mmid)) { deallocate_window(&rhp->rdev, mhp->attr.stag); kfree(mhp); return ERR_PTR(-ENOMEM); } CTR4(KTR_IW_CXGBE, "%s mmid 0x%x mhp %p stag 0x%x", __func__, mmid, mhp, stag); return &(mhp->ibmw); } int c4iw_dealloc_mw(struct ib_mw *mw) { struct c4iw_dev *rhp; struct c4iw_mw *mhp; u32 mmid; mhp = to_c4iw_mw(mw); rhp = mhp->rhp; mmid = (mw->rkey) >> 8; remove_handle(rhp, &rhp->mmidr, mmid); deallocate_window(&rhp->rdev, mhp->attr.stag); kfree(mhp); CTR4(KTR_IW_CXGBE, "%s ib_mw %p mmid 0x%x ptr %p", __func__, mw, mmid, mhp); return 0; } struct ib_mr *c4iw_alloc_fast_reg_mr(struct ib_pd *pd, int pbl_depth) { struct c4iw_dev *rhp; struct c4iw_pd *php; struct c4iw_mr *mhp; u32 mmid; u32 stag = 0; int ret = 0; php = to_c4iw_pd(pd); rhp = php->rhp; mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); if (!mhp) { ret = -ENOMEM; goto err; } mhp->rhp = rhp; ret = alloc_pbl(mhp, pbl_depth); if (ret) goto err1; mhp->attr.pbl_size = pbl_depth; ret = allocate_stag(&rhp->rdev, &stag, php->pdid, mhp->attr.pbl_size, mhp->attr.pbl_addr); if (ret) goto err2; mhp->attr.pdid = php->pdid; mhp->attr.type = FW_RI_STAG_NSMR; mhp->attr.stag = stag; mhp->attr.state = 1; mmid = (stag) >> 8; mhp->ibmr.rkey = mhp->ibmr.lkey = stag; if (insert_handle(rhp, &rhp->mmidr, mhp, mmid)) { ret = -ENOMEM; goto err3; } CTR4(KTR_IW_CXGBE, "%s mmid 0x%x mhp %p stag 0x%x", __func__, mmid, mhp, stag); return &(mhp->ibmr); err3: dereg_mem(&rhp->rdev, stag, mhp->attr.pbl_size, mhp->attr.pbl_addr); err2: c4iw_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr, mhp->attr.pbl_size << 3); err1: kfree(mhp); err: return ERR_PTR(ret); } struct ib_fast_reg_page_list *c4iw_alloc_fastreg_pbl(struct ib_device *device, int page_list_len) { struct c4iw_fr_page_list *c4pl; struct c4iw_dev *dev = to_c4iw_dev(device); bus_addr_t dma_addr; int size = sizeof *c4pl + page_list_len * sizeof(u64); c4pl = contigmalloc(size, M_DEVBUF, M_NOWAIT, 0ul, ~0ul, 4096, 0); if (c4pl) dma_addr = vtophys(c4pl); else return ERR_PTR(-ENOMEM); pci_unmap_addr_set(c4pl, mapping, dma_addr); c4pl->dma_addr = dma_addr; c4pl->dev = dev; c4pl->size = size; c4pl->ibpl.page_list = (u64 *)(c4pl + 1); c4pl->ibpl.max_page_list_len = page_list_len; return &c4pl->ibpl; } void c4iw_free_fastreg_pbl(struct ib_fast_reg_page_list *ibpl) { struct c4iw_fr_page_list *c4pl = to_c4iw_fr_page_list(ibpl); contigfree(c4pl, c4pl->size, M_DEVBUF); } int c4iw_dereg_mr(struct ib_mr *ib_mr) { struct c4iw_dev *rhp; struct c4iw_mr *mhp; u32 mmid; CTR2(KTR_IW_CXGBE, "%s ib_mr %p", __func__, ib_mr); /* There can be no memory windows */ if (atomic_read(&ib_mr->usecnt)) return -EINVAL; mhp = to_c4iw_mr(ib_mr); rhp = mhp->rhp; mmid = mhp->attr.stag >> 8; remove_handle(rhp, &rhp->mmidr, mmid); dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, mhp->attr.pbl_addr); if (mhp->attr.pbl_size) c4iw_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr, mhp->attr.pbl_size << 3); if (mhp->kva) kfree((void *) (unsigned long) mhp->kva); if (mhp->umem) ib_umem_release(mhp->umem); CTR3(KTR_IW_CXGBE, "%s mmid 0x%x ptr %p", __func__, mmid, mhp); kfree(mhp); return 0; } #endif Index: projects/clang400-import/sys/dev/hyperv/netvsc/hn_nvs.c =================================================================== --- projects/clang400-import/sys/dev/hyperv/netvsc/hn_nvs.c (revision 314522) +++ projects/clang400-import/sys/dev/hyperv/netvsc/hn_nvs.c (revision 314523) @@ -1,735 +1,740 @@ /*- * Copyright (c) 2009-2012,2016 Microsoft Corp. * Copyright (c) 2010-2012 Citrix Inc. * Copyright (c) 2012 NetApp Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /* * Network Virtualization Service. */ #include __FBSDID("$FreeBSD$"); #include "opt_inet6.h" #include "opt_inet.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static int hn_nvs_conn_chim(struct hn_softc *); static int hn_nvs_conn_rxbuf(struct hn_softc *); static void hn_nvs_disconn_chim(struct hn_softc *); static void hn_nvs_disconn_rxbuf(struct hn_softc *); static int hn_nvs_conf_ndis(struct hn_softc *, int); static int hn_nvs_init_ndis(struct hn_softc *); static int hn_nvs_doinit(struct hn_softc *, uint32_t); static int hn_nvs_init(struct hn_softc *); static const void *hn_nvs_xact_execute(struct hn_softc *, struct vmbus_xact *, void *, int, size_t *, uint32_t); static void hn_nvs_sent_none(struct hn_nvs_sendctx *, struct hn_softc *, struct vmbus_channel *, const void *, int); struct hn_nvs_sendctx hn_nvs_sendctx_none = HN_NVS_SENDCTX_INITIALIZER(hn_nvs_sent_none, NULL); static const uint32_t hn_nvs_version[] = { HN_NVS_VERSION_5, HN_NVS_VERSION_4, HN_NVS_VERSION_2, HN_NVS_VERSION_1 }; static const void * hn_nvs_xact_execute(struct hn_softc *sc, struct vmbus_xact *xact, void *req, int reqlen, size_t *resplen0, uint32_t type) { struct hn_nvs_sendctx sndc; size_t resplen, min_resplen = *resplen0; const struct hn_nvs_hdr *hdr; int error; KASSERT(min_resplen >= sizeof(*hdr), ("invalid minimum response len %zu", min_resplen)); /* * Execute the xact setup by the caller. */ hn_nvs_sendctx_init(&sndc, hn_nvs_sent_xact, xact); vmbus_xact_activate(xact); error = hn_nvs_send(sc->hn_prichan, VMBUS_CHANPKT_FLAG_RC, req, reqlen, &sndc); if (error) { vmbus_xact_deactivate(xact); return (NULL); } hdr = vmbus_chan_xact_wait(sc->hn_prichan, xact, &resplen, HN_CAN_SLEEP(sc)); /* * Check this NVS response message. */ if (resplen < min_resplen) { if_printf(sc->hn_ifp, "invalid NVS resp len %zu\n", resplen); return (NULL); } if (hdr->nvs_type != type) { if_printf(sc->hn_ifp, "unexpected NVS resp 0x%08x, " "expect 0x%08x\n", hdr->nvs_type, type); return (NULL); } /* All pass! */ *resplen0 = resplen; return (hdr); } static __inline int hn_nvs_req_send(struct hn_softc *sc, void *req, int reqlen) { return (hn_nvs_send(sc->hn_prichan, VMBUS_CHANPKT_FLAG_NONE, req, reqlen, &hn_nvs_sendctx_none)); } static int hn_nvs_conn_rxbuf(struct hn_softc *sc) { struct vmbus_xact *xact = NULL; struct hn_nvs_rxbuf_conn *conn; const struct hn_nvs_rxbuf_connresp *resp; size_t resp_len; uint32_t status; int error, rxbuf_size; /* * Limit RXBUF size for old NVS. */ if (sc->hn_nvs_ver <= HN_NVS_VERSION_2) rxbuf_size = HN_RXBUF_SIZE_COMPAT; else rxbuf_size = HN_RXBUF_SIZE; /* * Connect the RXBUF GPADL to the primary channel. * * NOTE: * Only primary channel has RXBUF connected to it. Sub-channels * just share this RXBUF. */ error = vmbus_chan_gpadl_connect(sc->hn_prichan, sc->hn_rxbuf_dma.hv_paddr, rxbuf_size, &sc->hn_rxbuf_gpadl); if (error) { if_printf(sc->hn_ifp, "rxbuf gpadl conn failed: %d\n", error); goto cleanup; } /* * Connect RXBUF to NVS. */ xact = vmbus_xact_get(sc->hn_xact, sizeof(*conn)); if (xact == NULL) { if_printf(sc->hn_ifp, "no xact for nvs rxbuf conn\n"); error = ENXIO; goto cleanup; } conn = vmbus_xact_req_data(xact); conn->nvs_type = HN_NVS_TYPE_RXBUF_CONN; conn->nvs_gpadl = sc->hn_rxbuf_gpadl; conn->nvs_sig = HN_NVS_RXBUF_SIG; resp_len = sizeof(*resp); resp = hn_nvs_xact_execute(sc, xact, conn, sizeof(*conn), &resp_len, HN_NVS_TYPE_RXBUF_CONNRESP); if (resp == NULL) { if_printf(sc->hn_ifp, "exec nvs rxbuf conn failed\n"); error = EIO; goto cleanup; } status = resp->nvs_status; vmbus_xact_put(xact); xact = NULL; if (status != HN_NVS_STATUS_OK) { if_printf(sc->hn_ifp, "nvs rxbuf conn failed: %x\n", status); error = EIO; goto cleanup; } sc->hn_flags |= HN_FLAG_RXBUF_CONNECTED; return (0); cleanup: if (xact != NULL) vmbus_xact_put(xact); hn_nvs_disconn_rxbuf(sc); return (error); } static int hn_nvs_conn_chim(struct hn_softc *sc) { struct vmbus_xact *xact = NULL; struct hn_nvs_chim_conn *chim; const struct hn_nvs_chim_connresp *resp; size_t resp_len; uint32_t status, sectsz; int error; /* * Connect chimney sending buffer GPADL to the primary channel. * * NOTE: * Only primary channel has chimney sending buffer connected to it. * Sub-channels just share this chimney sending buffer. */ error = vmbus_chan_gpadl_connect(sc->hn_prichan, sc->hn_chim_dma.hv_paddr, HN_CHIM_SIZE, &sc->hn_chim_gpadl); if (error) { if_printf(sc->hn_ifp, "chim gpadl conn failed: %d\n", error); goto cleanup; } /* * Connect chimney sending buffer to NVS */ xact = vmbus_xact_get(sc->hn_xact, sizeof(*chim)); if (xact == NULL) { if_printf(sc->hn_ifp, "no xact for nvs chim conn\n"); error = ENXIO; goto cleanup; } chim = vmbus_xact_req_data(xact); chim->nvs_type = HN_NVS_TYPE_CHIM_CONN; chim->nvs_gpadl = sc->hn_chim_gpadl; chim->nvs_sig = HN_NVS_CHIM_SIG; resp_len = sizeof(*resp); resp = hn_nvs_xact_execute(sc, xact, chim, sizeof(*chim), &resp_len, HN_NVS_TYPE_CHIM_CONNRESP); if (resp == NULL) { if_printf(sc->hn_ifp, "exec nvs chim conn failed\n"); error = EIO; goto cleanup; } status = resp->nvs_status; sectsz = resp->nvs_sectsz; vmbus_xact_put(xact); xact = NULL; if (status != HN_NVS_STATUS_OK) { if_printf(sc->hn_ifp, "nvs chim conn failed: %x\n", status); error = EIO; goto cleanup; } - if (sectsz == 0) { + if (sectsz == 0 || sectsz % sizeof(uint32_t) != 0) { /* * Can't use chimney sending buffer; done! */ - if_printf(sc->hn_ifp, "zero chimney sending buffer " - "section size\n"); + if (sectsz == 0) { + if_printf(sc->hn_ifp, "zero chimney sending buffer " + "section size\n"); + } else { + if_printf(sc->hn_ifp, "misaligned chimney sending " + "buffers, section size: %u\n", sectsz); + } sc->hn_chim_szmax = 0; sc->hn_chim_cnt = 0; sc->hn_flags |= HN_FLAG_CHIM_CONNECTED; return (0); } sc->hn_chim_szmax = sectsz; sc->hn_chim_cnt = HN_CHIM_SIZE / sc->hn_chim_szmax; if (HN_CHIM_SIZE % sc->hn_chim_szmax != 0) { if_printf(sc->hn_ifp, "chimney sending sections are " "not properly aligned\n"); } if (sc->hn_chim_cnt % LONG_BIT != 0) { if_printf(sc->hn_ifp, "discard %d chimney sending sections\n", sc->hn_chim_cnt % LONG_BIT); } sc->hn_chim_bmap_cnt = sc->hn_chim_cnt / LONG_BIT; sc->hn_chim_bmap = malloc(sc->hn_chim_bmap_cnt * sizeof(u_long), M_DEVBUF, M_WAITOK | M_ZERO); /* Done! */ sc->hn_flags |= HN_FLAG_CHIM_CONNECTED; if (bootverbose) { if_printf(sc->hn_ifp, "chimney sending buffer %d/%d\n", sc->hn_chim_szmax, sc->hn_chim_cnt); } return (0); cleanup: if (xact != NULL) vmbus_xact_put(xact); hn_nvs_disconn_chim(sc); return (error); } static void hn_nvs_disconn_rxbuf(struct hn_softc *sc) { int error; if (sc->hn_flags & HN_FLAG_RXBUF_CONNECTED) { struct hn_nvs_rxbuf_disconn disconn; /* * Disconnect RXBUF from NVS. */ memset(&disconn, 0, sizeof(disconn)); disconn.nvs_type = HN_NVS_TYPE_RXBUF_DISCONN; disconn.nvs_sig = HN_NVS_RXBUF_SIG; /* NOTE: No response. */ error = hn_nvs_req_send(sc, &disconn, sizeof(disconn)); if (error) { if_printf(sc->hn_ifp, "send nvs rxbuf disconn failed: %d\n", error); /* * Fine for a revoked channel, since the hypervisor * does not drain TX bufring for a revoked channel. */ if (!vmbus_chan_is_revoked(sc->hn_prichan)) sc->hn_flags |= HN_FLAG_RXBUF_REF; } sc->hn_flags &= ~HN_FLAG_RXBUF_CONNECTED; /* * Wait for the hypervisor to receive this NVS request. * * NOTE: * The TX bufring will not be drained by the hypervisor, * if the primary channel is revoked. */ while (!vmbus_chan_tx_empty(sc->hn_prichan) && !vmbus_chan_is_revoked(sc->hn_prichan)) pause("waittx", 1); /* * Linger long enough for NVS to disconnect RXBUF. */ pause("lingtx", (200 * hz) / 1000); } if (sc->hn_rxbuf_gpadl != 0) { /* * Disconnect RXBUF from primary channel. */ error = vmbus_chan_gpadl_disconnect(sc->hn_prichan, sc->hn_rxbuf_gpadl); if (error) { if_printf(sc->hn_ifp, "rxbuf gpadl disconn failed: %d\n", error); sc->hn_flags |= HN_FLAG_RXBUF_REF; } sc->hn_rxbuf_gpadl = 0; } } static void hn_nvs_disconn_chim(struct hn_softc *sc) { int error; if (sc->hn_flags & HN_FLAG_CHIM_CONNECTED) { struct hn_nvs_chim_disconn disconn; /* * Disconnect chimney sending buffer from NVS. */ memset(&disconn, 0, sizeof(disconn)); disconn.nvs_type = HN_NVS_TYPE_CHIM_DISCONN; disconn.nvs_sig = HN_NVS_CHIM_SIG; /* NOTE: No response. */ error = hn_nvs_req_send(sc, &disconn, sizeof(disconn)); if (error) { if_printf(sc->hn_ifp, "send nvs chim disconn failed: %d\n", error); /* * Fine for a revoked channel, since the hypervisor * does not drain TX bufring for a revoked channel. */ if (!vmbus_chan_is_revoked(sc->hn_prichan)) sc->hn_flags |= HN_FLAG_CHIM_REF; } sc->hn_flags &= ~HN_FLAG_CHIM_CONNECTED; /* * Wait for the hypervisor to receive this NVS request. * * NOTE: * The TX bufring will not be drained by the hypervisor, * if the primary channel is revoked. */ while (!vmbus_chan_tx_empty(sc->hn_prichan) && !vmbus_chan_is_revoked(sc->hn_prichan)) pause("waittx", 1); /* * Linger long enough for NVS to disconnect chimney * sending buffer. */ pause("lingtx", (200 * hz) / 1000); } if (sc->hn_chim_gpadl != 0) { /* * Disconnect chimney sending buffer from primary channel. */ error = vmbus_chan_gpadl_disconnect(sc->hn_prichan, sc->hn_chim_gpadl); if (error) { if_printf(sc->hn_ifp, "chim gpadl disconn failed: %d\n", error); sc->hn_flags |= HN_FLAG_CHIM_REF; } sc->hn_chim_gpadl = 0; } if (sc->hn_chim_bmap != NULL) { free(sc->hn_chim_bmap, M_DEVBUF); sc->hn_chim_bmap = NULL; sc->hn_chim_bmap_cnt = 0; } } static int hn_nvs_doinit(struct hn_softc *sc, uint32_t nvs_ver) { struct vmbus_xact *xact; struct hn_nvs_init *init; const struct hn_nvs_init_resp *resp; size_t resp_len; uint32_t status; xact = vmbus_xact_get(sc->hn_xact, sizeof(*init)); if (xact == NULL) { if_printf(sc->hn_ifp, "no xact for nvs init\n"); return (ENXIO); } init = vmbus_xact_req_data(xact); init->nvs_type = HN_NVS_TYPE_INIT; init->nvs_ver_min = nvs_ver; init->nvs_ver_max = nvs_ver; resp_len = sizeof(*resp); resp = hn_nvs_xact_execute(sc, xact, init, sizeof(*init), &resp_len, HN_NVS_TYPE_INIT_RESP); if (resp == NULL) { if_printf(sc->hn_ifp, "exec init failed\n"); vmbus_xact_put(xact); return (EIO); } status = resp->nvs_status; vmbus_xact_put(xact); if (status != HN_NVS_STATUS_OK) { if (bootverbose) { /* * Caller may try another NVS version, and will log * error if there are no more NVS versions to try, * so don't bark out loud here. */ if_printf(sc->hn_ifp, "nvs init failed for ver 0x%x\n", nvs_ver); } return (EINVAL); } return (0); } /* * Configure MTU and enable VLAN. */ static int hn_nvs_conf_ndis(struct hn_softc *sc, int mtu) { struct hn_nvs_ndis_conf conf; int error; memset(&conf, 0, sizeof(conf)); conf.nvs_type = HN_NVS_TYPE_NDIS_CONF; conf.nvs_mtu = mtu; conf.nvs_caps = HN_NVS_NDIS_CONF_VLAN; if (sc->hn_nvs_ver >= HN_NVS_VERSION_5) conf.nvs_caps |= HN_NVS_NDIS_CONF_SRIOV; /* NOTE: No response. */ error = hn_nvs_req_send(sc, &conf, sizeof(conf)); if (error) { if_printf(sc->hn_ifp, "send nvs ndis conf failed: %d\n", error); return (error); } if (bootverbose) if_printf(sc->hn_ifp, "nvs ndis conf done\n"); sc->hn_caps |= HN_CAP_MTU | HN_CAP_VLAN; return (0); } static int hn_nvs_init_ndis(struct hn_softc *sc) { struct hn_nvs_ndis_init ndis; int error; memset(&ndis, 0, sizeof(ndis)); ndis.nvs_type = HN_NVS_TYPE_NDIS_INIT; ndis.nvs_ndis_major = HN_NDIS_VERSION_MAJOR(sc->hn_ndis_ver); ndis.nvs_ndis_minor = HN_NDIS_VERSION_MINOR(sc->hn_ndis_ver); /* NOTE: No response. */ error = hn_nvs_req_send(sc, &ndis, sizeof(ndis)); if (error) if_printf(sc->hn_ifp, "send nvs ndis init failed: %d\n", error); return (error); } static int hn_nvs_init(struct hn_softc *sc) { int i, error; if (device_is_attached(sc->hn_dev)) { /* * NVS version and NDIS version MUST NOT be changed. */ if (bootverbose) { if_printf(sc->hn_ifp, "reinit NVS version 0x%x, " "NDIS version %u.%u\n", sc->hn_nvs_ver, HN_NDIS_VERSION_MAJOR(sc->hn_ndis_ver), HN_NDIS_VERSION_MINOR(sc->hn_ndis_ver)); } error = hn_nvs_doinit(sc, sc->hn_nvs_ver); if (error) { if_printf(sc->hn_ifp, "reinit NVS version 0x%x " "failed: %d\n", sc->hn_nvs_ver, error); return (error); } goto done; } /* * Find the supported NVS version and set NDIS version accordingly. */ for (i = 0; i < nitems(hn_nvs_version); ++i) { error = hn_nvs_doinit(sc, hn_nvs_version[i]); if (!error) { sc->hn_nvs_ver = hn_nvs_version[i]; /* Set NDIS version according to NVS version. */ sc->hn_ndis_ver = HN_NDIS_VERSION_6_30; if (sc->hn_nvs_ver <= HN_NVS_VERSION_4) sc->hn_ndis_ver = HN_NDIS_VERSION_6_1; if (bootverbose) { if_printf(sc->hn_ifp, "NVS version 0x%x, " "NDIS version %u.%u\n", sc->hn_nvs_ver, HN_NDIS_VERSION_MAJOR(sc->hn_ndis_ver), HN_NDIS_VERSION_MINOR(sc->hn_ndis_ver)); } goto done; } } if_printf(sc->hn_ifp, "no NVS available\n"); return (ENXIO); done: if (sc->hn_nvs_ver >= HN_NVS_VERSION_5) sc->hn_caps |= HN_CAP_HASHVAL; return (0); } int hn_nvs_attach(struct hn_softc *sc, int mtu) { int error; /* * Initialize NVS. */ error = hn_nvs_init(sc); if (error) return (error); if (sc->hn_nvs_ver >= HN_NVS_VERSION_2) { /* * Configure NDIS before initializing it. */ error = hn_nvs_conf_ndis(sc, mtu); if (error) return (error); } /* * Initialize NDIS. */ error = hn_nvs_init_ndis(sc); if (error) return (error); /* * Connect RXBUF. */ error = hn_nvs_conn_rxbuf(sc); if (error) return (error); /* * Connect chimney sending buffer. */ error = hn_nvs_conn_chim(sc); if (error) { hn_nvs_disconn_rxbuf(sc); return (error); } return (0); } void hn_nvs_detach(struct hn_softc *sc) { /* NOTE: there are no requests to stop the NVS. */ hn_nvs_disconn_rxbuf(sc); hn_nvs_disconn_chim(sc); } void hn_nvs_sent_xact(struct hn_nvs_sendctx *sndc, struct hn_softc *sc __unused, struct vmbus_channel *chan __unused, const void *data, int dlen) { vmbus_xact_wakeup(sndc->hn_cbarg, data, dlen); } static void hn_nvs_sent_none(struct hn_nvs_sendctx *sndc __unused, struct hn_softc *sc __unused, struct vmbus_channel *chan __unused, const void *data __unused, int dlen __unused) { /* EMPTY */ } int hn_nvs_alloc_subchans(struct hn_softc *sc, int *nsubch0) { struct vmbus_xact *xact; struct hn_nvs_subch_req *req; const struct hn_nvs_subch_resp *resp; int error, nsubch_req; uint32_t nsubch; size_t resp_len; nsubch_req = *nsubch0; KASSERT(nsubch_req > 0, ("invalid # of sub-channels %d", nsubch_req)); xact = vmbus_xact_get(sc->hn_xact, sizeof(*req)); if (xact == NULL) { if_printf(sc->hn_ifp, "no xact for nvs subch alloc\n"); return (ENXIO); } req = vmbus_xact_req_data(xact); req->nvs_type = HN_NVS_TYPE_SUBCH_REQ; req->nvs_op = HN_NVS_SUBCH_OP_ALLOC; req->nvs_nsubch = nsubch_req; resp_len = sizeof(*resp); resp = hn_nvs_xact_execute(sc, xact, req, sizeof(*req), &resp_len, HN_NVS_TYPE_SUBCH_RESP); if (resp == NULL) { if_printf(sc->hn_ifp, "exec nvs subch alloc failed\n"); error = EIO; goto done; } if (resp->nvs_status != HN_NVS_STATUS_OK) { if_printf(sc->hn_ifp, "nvs subch alloc failed: %x\n", resp->nvs_status); error = EIO; goto done; } nsubch = resp->nvs_nsubch; if (nsubch > nsubch_req) { if_printf(sc->hn_ifp, "%u subchans are allocated, " "requested %d\n", nsubch, nsubch_req); nsubch = nsubch_req; } *nsubch0 = nsubch; error = 0; done: vmbus_xact_put(xact); return (error); } int hn_nvs_send_rndis_ctrl(struct vmbus_channel *chan, struct hn_nvs_sendctx *sndc, struct vmbus_gpa *gpa, int gpa_cnt) { return hn_nvs_send_rndis_sglist(chan, HN_NVS_RNDIS_MTYPE_CTRL, sndc, gpa, gpa_cnt); } void hn_nvs_set_datapath(struct hn_softc *sc, uint32_t path) { struct hn_nvs_datapath dp; memset(&dp, 0, sizeof(dp)); dp.nvs_type = HN_NVS_TYPE_SET_DATAPATH; dp.nvs_active_path = path; hn_nvs_req_send(sc, &dp, sizeof(dp)); } Index: projects/clang400-import/sys/dev/hyperv/netvsc/hn_rndis.c =================================================================== --- projects/clang400-import/sys/dev/hyperv/netvsc/hn_rndis.c (revision 314522) +++ projects/clang400-import/sys/dev/hyperv/netvsc/hn_rndis.c (revision 314523) @@ -1,993 +1,1006 @@ /*- * Copyright (c) 2009-2012,2016 Microsoft Corp. * Copyright (c) 2010-2012 Citrix Inc. * Copyright (c) 2012 NetApp Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_inet6.h" #include "opt_inet.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define HN_RNDIS_RID_COMPAT_MASK 0xffff #define HN_RNDIS_RID_COMPAT_MAX HN_RNDIS_RID_COMPAT_MASK #define HN_RNDIS_XFER_SIZE 2048 #define HN_NDIS_TXCSUM_CAP_IP4 \ (NDIS_TXCSUM_CAP_IP4 | NDIS_TXCSUM_CAP_IP4OPT) #define HN_NDIS_TXCSUM_CAP_TCP4 \ (NDIS_TXCSUM_CAP_TCP4 | NDIS_TXCSUM_CAP_TCP4OPT) #define HN_NDIS_TXCSUM_CAP_TCP6 \ (NDIS_TXCSUM_CAP_TCP6 | NDIS_TXCSUM_CAP_TCP6OPT | \ NDIS_TXCSUM_CAP_IP6EXT) #define HN_NDIS_TXCSUM_CAP_UDP6 \ (NDIS_TXCSUM_CAP_UDP6 | NDIS_TXCSUM_CAP_IP6EXT) #define HN_NDIS_LSOV2_CAP_IP6 \ (NDIS_LSOV2_CAP_IP6EXT | NDIS_LSOV2_CAP_TCP6OPT) static const void *hn_rndis_xact_exec1(struct hn_softc *, struct vmbus_xact *, size_t, struct hn_nvs_sendctx *, size_t *); static const void *hn_rndis_xact_execute(struct hn_softc *, struct vmbus_xact *, uint32_t, size_t, size_t *, uint32_t); static int hn_rndis_query(struct hn_softc *, uint32_t, const void *, size_t, void *, size_t *); static int hn_rndis_query2(struct hn_softc *, uint32_t, const void *, size_t, void *, size_t *, size_t); static int hn_rndis_set(struct hn_softc *, uint32_t, const void *, size_t); static int hn_rndis_init(struct hn_softc *); static int hn_rndis_halt(struct hn_softc *); static int hn_rndis_conf_offload(struct hn_softc *, int); static int hn_rndis_query_hwcaps(struct hn_softc *, struct ndis_offload *); static __inline uint32_t hn_rndis_rid(struct hn_softc *sc) { uint32_t rid; again: rid = atomic_fetchadd_int(&sc->hn_rndis_rid, 1); if (rid == 0) goto again; /* Use upper 16 bits for non-compat RNDIS messages. */ return ((rid & 0xffff) << 16); } void hn_rndis_rx_ctrl(struct hn_softc *sc, const void *data, int dlen) { const struct rndis_comp_hdr *comp; const struct rndis_msghdr *hdr; KASSERT(dlen >= sizeof(*hdr), ("invalid RNDIS msg\n")); hdr = data; switch (hdr->rm_type) { case REMOTE_NDIS_INITIALIZE_CMPLT: case REMOTE_NDIS_QUERY_CMPLT: case REMOTE_NDIS_SET_CMPLT: case REMOTE_NDIS_KEEPALIVE_CMPLT: /* unused */ if (dlen < sizeof(*comp)) { if_printf(sc->hn_ifp, "invalid RNDIS cmplt\n"); return; } comp = data; KASSERT(comp->rm_rid > HN_RNDIS_RID_COMPAT_MAX, ("invalid RNDIS rid 0x%08x\n", comp->rm_rid)); vmbus_xact_ctx_wakeup(sc->hn_xact, comp, dlen); break; case REMOTE_NDIS_RESET_CMPLT: /* * Reset completed, no rid. * * NOTE: * RESET is not issued by hn(4), so this message should * _not_ be observed. */ if_printf(sc->hn_ifp, "RESET cmplt received\n"); break; default: if_printf(sc->hn_ifp, "unknown RNDIS msg 0x%x\n", hdr->rm_type); break; } } int hn_rndis_get_eaddr(struct hn_softc *sc, uint8_t *eaddr) { size_t eaddr_len; int error; eaddr_len = ETHER_ADDR_LEN; error = hn_rndis_query(sc, OID_802_3_PERMANENT_ADDRESS, NULL, 0, eaddr, &eaddr_len); if (error) return (error); if (eaddr_len != ETHER_ADDR_LEN) { if_printf(sc->hn_ifp, "invalid eaddr len %zu\n", eaddr_len); return (EINVAL); } return (0); } int hn_rndis_get_linkstatus(struct hn_softc *sc, uint32_t *link_status) { size_t size; int error; size = sizeof(*link_status); error = hn_rndis_query(sc, OID_GEN_MEDIA_CONNECT_STATUS, NULL, 0, link_status, &size); if (error) return (error); if (size != sizeof(uint32_t)) { if_printf(sc->hn_ifp, "invalid link status len %zu\n", size); return (EINVAL); } return (0); } static const void * hn_rndis_xact_exec1(struct hn_softc *sc, struct vmbus_xact *xact, size_t reqlen, struct hn_nvs_sendctx *sndc, size_t *comp_len) { struct vmbus_gpa gpa[HN_XACT_REQ_PGCNT]; int gpa_cnt, error; bus_addr_t paddr; KASSERT(reqlen <= HN_XACT_REQ_SIZE && reqlen > 0, ("invalid request length %zu", reqlen)); /* * Setup the SG list. */ paddr = vmbus_xact_req_paddr(xact); KASSERT((paddr & PAGE_MASK) == 0, ("vmbus xact request is not page aligned 0x%jx", (uintmax_t)paddr)); for (gpa_cnt = 0; gpa_cnt < HN_XACT_REQ_PGCNT; ++gpa_cnt) { int len = PAGE_SIZE; if (reqlen == 0) break; if (reqlen < len) len = reqlen; gpa[gpa_cnt].gpa_page = atop(paddr) + gpa_cnt; gpa[gpa_cnt].gpa_len = len; gpa[gpa_cnt].gpa_ofs = 0; reqlen -= len; } KASSERT(reqlen == 0, ("still have %zu request data left", reqlen)); /* * Send this RNDIS control message and wait for its completion * message. */ vmbus_xact_activate(xact); error = hn_nvs_send_rndis_ctrl(sc->hn_prichan, sndc, gpa, gpa_cnt); if (error) { vmbus_xact_deactivate(xact); if_printf(sc->hn_ifp, "RNDIS ctrl send failed: %d\n", error); return (NULL); } return (vmbus_chan_xact_wait(sc->hn_prichan, xact, comp_len, HN_CAN_SLEEP(sc))); } static const void * hn_rndis_xact_execute(struct hn_softc *sc, struct vmbus_xact *xact, uint32_t rid, size_t reqlen, size_t *comp_len0, uint32_t comp_type) { const struct rndis_comp_hdr *comp; size_t comp_len, min_complen = *comp_len0; KASSERT(rid > HN_RNDIS_RID_COMPAT_MAX, ("invalid rid %u\n", rid)); KASSERT(min_complen >= sizeof(*comp), ("invalid minimum complete len %zu", min_complen)); /* * Execute the xact setup by the caller. */ comp = hn_rndis_xact_exec1(sc, xact, reqlen, &hn_nvs_sendctx_none, &comp_len); if (comp == NULL) return (NULL); /* * Check this RNDIS complete message. */ if (comp_len < min_complen) { if (comp_len >= sizeof(*comp)) { /* rm_status field is valid */ if_printf(sc->hn_ifp, "invalid RNDIS comp len %zu, " "status 0x%08x\n", comp_len, comp->rm_status); } else { if_printf(sc->hn_ifp, "invalid RNDIS comp len %zu\n", comp_len); } return (NULL); } if (comp->rm_len < min_complen) { if_printf(sc->hn_ifp, "invalid RNDIS comp msglen %u\n", comp->rm_len); return (NULL); } if (comp->rm_type != comp_type) { if_printf(sc->hn_ifp, "unexpected RNDIS comp 0x%08x, " "expect 0x%08x\n", comp->rm_type, comp_type); return (NULL); } if (comp->rm_rid != rid) { if_printf(sc->hn_ifp, "RNDIS comp rid mismatch %u, " "expect %u\n", comp->rm_rid, rid); return (NULL); } /* All pass! */ *comp_len0 = comp_len; return (comp); } static int hn_rndis_query(struct hn_softc *sc, uint32_t oid, const void *idata, size_t idlen, void *odata, size_t *odlen0) { return (hn_rndis_query2(sc, oid, idata, idlen, odata, odlen0, *odlen0)); } static int hn_rndis_query2(struct hn_softc *sc, uint32_t oid, const void *idata, size_t idlen, void *odata, size_t *odlen0, size_t min_odlen) { struct rndis_query_req *req; const struct rndis_query_comp *comp; struct vmbus_xact *xact; size_t reqlen, odlen = *odlen0, comp_len; int error, ofs; uint32_t rid; reqlen = sizeof(*req) + idlen; xact = vmbus_xact_get(sc->hn_xact, reqlen); if (xact == NULL) { if_printf(sc->hn_ifp, "no xact for RNDIS query 0x%08x\n", oid); return (ENXIO); } rid = hn_rndis_rid(sc); req = vmbus_xact_req_data(xact); req->rm_type = REMOTE_NDIS_QUERY_MSG; req->rm_len = reqlen; req->rm_rid = rid; req->rm_oid = oid; /* * XXX * This is _not_ RNDIS Spec conforming: * "This MUST be set to 0 when there is no input data * associated with the OID." * * If this field was set to 0 according to the RNDIS Spec, * Hyper-V would set non-SUCCESS status in the query * completion. */ req->rm_infobufoffset = RNDIS_QUERY_REQ_INFOBUFOFFSET; if (idlen > 0) { req->rm_infobuflen = idlen; /* Input data immediately follows RNDIS query. */ memcpy(req + 1, idata, idlen); } comp_len = sizeof(*comp) + min_odlen; comp = hn_rndis_xact_execute(sc, xact, rid, reqlen, &comp_len, REMOTE_NDIS_QUERY_CMPLT); if (comp == NULL) { if_printf(sc->hn_ifp, "exec RNDIS query 0x%08x failed\n", oid); error = EIO; goto done; } if (comp->rm_status != RNDIS_STATUS_SUCCESS) { if_printf(sc->hn_ifp, "RNDIS query 0x%08x failed: " "status 0x%08x\n", oid, comp->rm_status); error = EIO; goto done; } if (comp->rm_infobuflen == 0 || comp->rm_infobufoffset == 0) { /* No output data! */ if_printf(sc->hn_ifp, "RNDIS query 0x%08x, no data\n", oid); *odlen0 = 0; error = 0; goto done; } /* * Check output data length and offset. */ /* ofs is the offset from the beginning of comp. */ ofs = RNDIS_QUERY_COMP_INFOBUFOFFSET_ABS(comp->rm_infobufoffset); if (ofs < sizeof(*comp) || ofs + comp->rm_infobuflen > comp_len) { if_printf(sc->hn_ifp, "RNDIS query invalid comp ib off/len, " "%u/%u\n", comp->rm_infobufoffset, comp->rm_infobuflen); error = EINVAL; goto done; } /* * Save output data. */ if (comp->rm_infobuflen < odlen) odlen = comp->rm_infobuflen; memcpy(odata, ((const uint8_t *)comp) + ofs, odlen); *odlen0 = odlen; error = 0; done: vmbus_xact_put(xact); return (error); } int hn_rndis_query_rsscaps(struct hn_softc *sc, int *rxr_cnt0) { struct ndis_rss_caps in, caps; size_t caps_len; int error, indsz, rxr_cnt, hash_fnidx; uint32_t hash_func = 0, hash_types = 0; *rxr_cnt0 = 0; if (sc->hn_ndis_ver < HN_NDIS_VERSION_6_20) return (EOPNOTSUPP); memset(&in, 0, sizeof(in)); in.ndis_hdr.ndis_type = NDIS_OBJTYPE_RSS_CAPS; in.ndis_hdr.ndis_rev = NDIS_RSS_CAPS_REV_2; in.ndis_hdr.ndis_size = NDIS_RSS_CAPS_SIZE; caps_len = NDIS_RSS_CAPS_SIZE; error = hn_rndis_query2(sc, OID_GEN_RECEIVE_SCALE_CAPABILITIES, &in, NDIS_RSS_CAPS_SIZE, &caps, &caps_len, NDIS_RSS_CAPS_SIZE_6_0); if (error) return (error); /* * Preliminary verification. */ if (caps.ndis_hdr.ndis_type != NDIS_OBJTYPE_RSS_CAPS) { if_printf(sc->hn_ifp, "invalid NDIS objtype 0x%02x\n", caps.ndis_hdr.ndis_type); return (EINVAL); } if (caps.ndis_hdr.ndis_rev < NDIS_RSS_CAPS_REV_1) { if_printf(sc->hn_ifp, "invalid NDIS objrev 0x%02x\n", caps.ndis_hdr.ndis_rev); return (EINVAL); } if (caps.ndis_hdr.ndis_size > caps_len) { if_printf(sc->hn_ifp, "invalid NDIS objsize %u, " "data size %zu\n", caps.ndis_hdr.ndis_size, caps_len); return (EINVAL); } else if (caps.ndis_hdr.ndis_size < NDIS_RSS_CAPS_SIZE_6_0) { if_printf(sc->hn_ifp, "invalid NDIS objsize %u\n", caps.ndis_hdr.ndis_size); return (EINVAL); } /* * Save information for later RSS configuration. */ if (caps.ndis_nrxr == 0) { if_printf(sc->hn_ifp, "0 RX rings!?\n"); return (EINVAL); } if (bootverbose) if_printf(sc->hn_ifp, "%u RX rings\n", caps.ndis_nrxr); rxr_cnt = caps.ndis_nrxr; if (caps.ndis_hdr.ndis_size == NDIS_RSS_CAPS_SIZE && caps.ndis_hdr.ndis_rev >= NDIS_RSS_CAPS_REV_2) { if (caps.ndis_nind > NDIS_HASH_INDCNT) { if_printf(sc->hn_ifp, "too many RSS indirect table entries %u\n", caps.ndis_nind); return (EOPNOTSUPP); } if (!powerof2(caps.ndis_nind)) { if_printf(sc->hn_ifp, "RSS indirect table size is not " "power-of-2 %u\n", caps.ndis_nind); } if (bootverbose) { if_printf(sc->hn_ifp, "RSS indirect table size %u\n", caps.ndis_nind); } indsz = caps.ndis_nind; } else { indsz = NDIS_HASH_INDCNT; } if (indsz < rxr_cnt) { if_printf(sc->hn_ifp, "# of RX rings (%d) > " "RSS indirect table size %d\n", rxr_cnt, indsz); rxr_cnt = indsz; } /* * NOTE: * Toeplitz is at the lowest bit, and it is prefered; so ffs(), * instead of fls(), is used here. */ hash_fnidx = ffs(caps.ndis_caps & NDIS_RSS_CAP_HASHFUNC_MASK); if (hash_fnidx == 0) { if_printf(sc->hn_ifp, "no hash functions, caps 0x%08x\n", caps.ndis_caps); return (EOPNOTSUPP); } hash_func = 1 << (hash_fnidx - 1); /* ffs is 1-based */ if (caps.ndis_caps & NDIS_RSS_CAP_IPV4) hash_types |= NDIS_HASH_IPV4 | NDIS_HASH_TCP_IPV4; if (caps.ndis_caps & NDIS_RSS_CAP_IPV6) hash_types |= NDIS_HASH_IPV6 | NDIS_HASH_TCP_IPV6; if (caps.ndis_caps & NDIS_RSS_CAP_IPV6_EX) hash_types |= NDIS_HASH_IPV6_EX | NDIS_HASH_TCP_IPV6_EX; if (hash_types == 0) { if_printf(sc->hn_ifp, "no hash types, caps 0x%08x\n", caps.ndis_caps); return (EOPNOTSUPP); } /* Commit! */ sc->hn_rss_ind_size = indsz; sc->hn_rss_hash = hash_func | hash_types; *rxr_cnt0 = rxr_cnt; return (0); } static int hn_rndis_set(struct hn_softc *sc, uint32_t oid, const void *data, size_t dlen) { struct rndis_set_req *req; const struct rndis_set_comp *comp; struct vmbus_xact *xact; size_t reqlen, comp_len; uint32_t rid; int error; KASSERT(dlen > 0, ("invalid dlen %zu", dlen)); reqlen = sizeof(*req) + dlen; xact = vmbus_xact_get(sc->hn_xact, reqlen); if (xact == NULL) { if_printf(sc->hn_ifp, "no xact for RNDIS set 0x%08x\n", oid); return (ENXIO); } rid = hn_rndis_rid(sc); req = vmbus_xact_req_data(xact); req->rm_type = REMOTE_NDIS_SET_MSG; req->rm_len = reqlen; req->rm_rid = rid; req->rm_oid = oid; req->rm_infobuflen = dlen; req->rm_infobufoffset = RNDIS_SET_REQ_INFOBUFOFFSET; /* Data immediately follows RNDIS set. */ memcpy(req + 1, data, dlen); comp_len = sizeof(*comp); comp = hn_rndis_xact_execute(sc, xact, rid, reqlen, &comp_len, REMOTE_NDIS_SET_CMPLT); if (comp == NULL) { if_printf(sc->hn_ifp, "exec RNDIS set 0x%08x failed\n", oid); error = EIO; goto done; } if (comp->rm_status != RNDIS_STATUS_SUCCESS) { if_printf(sc->hn_ifp, "RNDIS set 0x%08x failed: " "status 0x%08x\n", oid, comp->rm_status); error = EIO; goto done; } error = 0; done: vmbus_xact_put(xact); return (error); } static int hn_rndis_conf_offload(struct hn_softc *sc, int mtu) { struct ndis_offload hwcaps; struct ndis_offload_params params; uint32_t caps = 0; size_t paramsz; int error, tso_maxsz, tso_minsg; error = hn_rndis_query_hwcaps(sc, &hwcaps); if (error) { if_printf(sc->hn_ifp, "hwcaps query failed: %d\n", error); return (error); } /* NOTE: 0 means "no change" */ memset(¶ms, 0, sizeof(params)); params.ndis_hdr.ndis_type = NDIS_OBJTYPE_DEFAULT; if (sc->hn_ndis_ver < HN_NDIS_VERSION_6_30) { params.ndis_hdr.ndis_rev = NDIS_OFFLOAD_PARAMS_REV_2; paramsz = NDIS_OFFLOAD_PARAMS_SIZE_6_1; } else { params.ndis_hdr.ndis_rev = NDIS_OFFLOAD_PARAMS_REV_3; paramsz = NDIS_OFFLOAD_PARAMS_SIZE; } params.ndis_hdr.ndis_size = paramsz; /* * TSO4/TSO6 setup. */ tso_maxsz = IP_MAXPACKET; tso_minsg = 2; if (hwcaps.ndis_lsov2.ndis_ip4_encap & NDIS_OFFLOAD_ENCAP_8023) { caps |= HN_CAP_TSO4; params.ndis_lsov2_ip4 = NDIS_OFFLOAD_LSOV2_ON; if (hwcaps.ndis_lsov2.ndis_ip4_maxsz < tso_maxsz) tso_maxsz = hwcaps.ndis_lsov2.ndis_ip4_maxsz; if (hwcaps.ndis_lsov2.ndis_ip4_minsg > tso_minsg) tso_minsg = hwcaps.ndis_lsov2.ndis_ip4_minsg; } if ((hwcaps.ndis_lsov2.ndis_ip6_encap & NDIS_OFFLOAD_ENCAP_8023) && (hwcaps.ndis_lsov2.ndis_ip6_opts & HN_NDIS_LSOV2_CAP_IP6) == HN_NDIS_LSOV2_CAP_IP6) { caps |= HN_CAP_TSO6; params.ndis_lsov2_ip6 = NDIS_OFFLOAD_LSOV2_ON; if (hwcaps.ndis_lsov2.ndis_ip6_maxsz < tso_maxsz) tso_maxsz = hwcaps.ndis_lsov2.ndis_ip6_maxsz; if (hwcaps.ndis_lsov2.ndis_ip6_minsg > tso_minsg) tso_minsg = hwcaps.ndis_lsov2.ndis_ip6_minsg; } sc->hn_ndis_tso_szmax = 0; sc->hn_ndis_tso_sgmin = 0; if (caps & (HN_CAP_TSO4 | HN_CAP_TSO6)) { KASSERT(tso_maxsz <= IP_MAXPACKET, ("invalid NDIS TSO maxsz %d", tso_maxsz)); KASSERT(tso_minsg >= 2, ("invalid NDIS TSO minsg %d", tso_minsg)); if (tso_maxsz < tso_minsg * mtu) { if_printf(sc->hn_ifp, "invalid NDIS TSO config: " "maxsz %d, minsg %d, mtu %d; " "disable TSO4 and TSO6\n", tso_maxsz, tso_minsg, mtu); caps &= ~(HN_CAP_TSO4 | HN_CAP_TSO6); params.ndis_lsov2_ip4 = NDIS_OFFLOAD_LSOV2_OFF; params.ndis_lsov2_ip6 = NDIS_OFFLOAD_LSOV2_OFF; } else { sc->hn_ndis_tso_szmax = tso_maxsz; sc->hn_ndis_tso_sgmin = tso_minsg; if (bootverbose) { if_printf(sc->hn_ifp, "NDIS TSO " "szmax %d sgmin %d\n", sc->hn_ndis_tso_szmax, sc->hn_ndis_tso_sgmin); } } } /* IPv4 checksum */ if ((hwcaps.ndis_csum.ndis_ip4_txcsum & HN_NDIS_TXCSUM_CAP_IP4) == HN_NDIS_TXCSUM_CAP_IP4) { caps |= HN_CAP_IPCS; params.ndis_ip4csum = NDIS_OFFLOAD_PARAM_TX; } if (hwcaps.ndis_csum.ndis_ip4_rxcsum & NDIS_RXCSUM_CAP_IP4) { if (params.ndis_ip4csum == NDIS_OFFLOAD_PARAM_TX) params.ndis_ip4csum = NDIS_OFFLOAD_PARAM_TXRX; else params.ndis_ip4csum = NDIS_OFFLOAD_PARAM_RX; } /* TCP4 checksum */ if ((hwcaps.ndis_csum.ndis_ip4_txcsum & HN_NDIS_TXCSUM_CAP_TCP4) == HN_NDIS_TXCSUM_CAP_TCP4) { caps |= HN_CAP_TCP4CS; params.ndis_tcp4csum = NDIS_OFFLOAD_PARAM_TX; } if (hwcaps.ndis_csum.ndis_ip4_rxcsum & NDIS_RXCSUM_CAP_TCP4) { if (params.ndis_tcp4csum == NDIS_OFFLOAD_PARAM_TX) params.ndis_tcp4csum = NDIS_OFFLOAD_PARAM_TXRX; else params.ndis_tcp4csum = NDIS_OFFLOAD_PARAM_RX; } /* UDP4 checksum */ if (hwcaps.ndis_csum.ndis_ip4_txcsum & NDIS_TXCSUM_CAP_UDP4) { caps |= HN_CAP_UDP4CS; params.ndis_udp4csum = NDIS_OFFLOAD_PARAM_TX; } if (hwcaps.ndis_csum.ndis_ip4_rxcsum & NDIS_RXCSUM_CAP_UDP4) { if (params.ndis_udp4csum == NDIS_OFFLOAD_PARAM_TX) params.ndis_udp4csum = NDIS_OFFLOAD_PARAM_TXRX; else params.ndis_udp4csum = NDIS_OFFLOAD_PARAM_RX; } /* TCP6 checksum */ if ((hwcaps.ndis_csum.ndis_ip6_txcsum & HN_NDIS_TXCSUM_CAP_TCP6) == HN_NDIS_TXCSUM_CAP_TCP6) { caps |= HN_CAP_TCP6CS; params.ndis_tcp6csum = NDIS_OFFLOAD_PARAM_TX; } if (hwcaps.ndis_csum.ndis_ip6_rxcsum & NDIS_RXCSUM_CAP_TCP6) { if (params.ndis_tcp6csum == NDIS_OFFLOAD_PARAM_TX) params.ndis_tcp6csum = NDIS_OFFLOAD_PARAM_TXRX; else params.ndis_tcp6csum = NDIS_OFFLOAD_PARAM_RX; } /* UDP6 checksum */ if ((hwcaps.ndis_csum.ndis_ip6_txcsum & HN_NDIS_TXCSUM_CAP_UDP6) == HN_NDIS_TXCSUM_CAP_UDP6) { caps |= HN_CAP_UDP6CS; params.ndis_udp6csum = NDIS_OFFLOAD_PARAM_TX; } if (hwcaps.ndis_csum.ndis_ip6_rxcsum & NDIS_RXCSUM_CAP_UDP6) { if (params.ndis_udp6csum == NDIS_OFFLOAD_PARAM_TX) params.ndis_udp6csum = NDIS_OFFLOAD_PARAM_TXRX; else params.ndis_udp6csum = NDIS_OFFLOAD_PARAM_RX; } if (bootverbose) { if_printf(sc->hn_ifp, "offload csum: " "ip4 %u, tcp4 %u, udp4 %u, tcp6 %u, udp6 %u\n", params.ndis_ip4csum, params.ndis_tcp4csum, params.ndis_udp4csum, params.ndis_tcp6csum, params.ndis_udp6csum); if_printf(sc->hn_ifp, "offload lsov2: ip4 %u, ip6 %u\n", params.ndis_lsov2_ip4, params.ndis_lsov2_ip6); } error = hn_rndis_set(sc, OID_TCP_OFFLOAD_PARAMETERS, ¶ms, paramsz); if (error) { if_printf(sc->hn_ifp, "offload config failed: %d\n", error); return (error); } if (bootverbose) if_printf(sc->hn_ifp, "offload config done\n"); sc->hn_caps |= caps; return (0); } int hn_rndis_conf_rss(struct hn_softc *sc, uint16_t flags) { struct ndis_rssprm_toeplitz *rss = &sc->hn_rss; struct ndis_rss_params *prm = &rss->rss_params; int error, rss_size; /* * Only NDIS 6.20+ is supported: * We only support 4bytes element in indirect table, which has been * adopted since NDIS 6.20. */ KASSERT(sc->hn_ndis_ver >= HN_NDIS_VERSION_6_20, ("NDIS 6.20+ is required, NDIS version 0x%08x", sc->hn_ndis_ver)); /* XXX only one can be specified through, popcnt? */ KASSERT((sc->hn_rss_hash & NDIS_HASH_FUNCTION_MASK), ("no hash func")); KASSERT((sc->hn_rss_hash & NDIS_HASH_TYPE_MASK), ("no hash types")); KASSERT(sc->hn_rss_ind_size > 0, ("no indirect table size")); if (bootverbose) { if_printf(sc->hn_ifp, "RSS indirect table size %d, " "hash 0x%08x\n", sc->hn_rss_ind_size, sc->hn_rss_hash); } /* * NOTE: * DO NOT whack rss_key and rss_ind, which are setup by the caller. */ memset(prm, 0, sizeof(*prm)); rss_size = NDIS_RSSPRM_TOEPLITZ_SIZE(sc->hn_rss_ind_size); prm->ndis_hdr.ndis_type = NDIS_OBJTYPE_RSS_PARAMS; prm->ndis_hdr.ndis_rev = NDIS_RSS_PARAMS_REV_2; prm->ndis_hdr.ndis_size = rss_size; prm->ndis_flags = flags; prm->ndis_hash = sc->hn_rss_hash; prm->ndis_indsize = sizeof(rss->rss_ind[0]) * sc->hn_rss_ind_size; prm->ndis_indoffset = __offsetof(struct ndis_rssprm_toeplitz, rss_ind[0]); prm->ndis_keysize = sizeof(rss->rss_key); prm->ndis_keyoffset = __offsetof(struct ndis_rssprm_toeplitz, rss_key[0]); error = hn_rndis_set(sc, OID_GEN_RECEIVE_SCALE_PARAMETERS, rss, rss_size); if (error) { if_printf(sc->hn_ifp, "RSS config failed: %d\n", error); } else { if (bootverbose) if_printf(sc->hn_ifp, "RSS config done\n"); } return (error); } int hn_rndis_set_rxfilter(struct hn_softc *sc, uint32_t filter) { int error; error = hn_rndis_set(sc, OID_GEN_CURRENT_PACKET_FILTER, &filter, sizeof(filter)); if (error) { if_printf(sc->hn_ifp, "set RX filter 0x%08x failed: %d\n", filter, error); } else { if (bootverbose) { if_printf(sc->hn_ifp, "set RX filter 0x%08x done\n", filter); } } return (error); } static int hn_rndis_init(struct hn_softc *sc) { struct rndis_init_req *req; const struct rndis_init_comp *comp; struct vmbus_xact *xact; size_t comp_len; uint32_t rid; int error; xact = vmbus_xact_get(sc->hn_xact, sizeof(*req)); if (xact == NULL) { if_printf(sc->hn_ifp, "no xact for RNDIS init\n"); return (ENXIO); } rid = hn_rndis_rid(sc); req = vmbus_xact_req_data(xact); req->rm_type = REMOTE_NDIS_INITIALIZE_MSG; req->rm_len = sizeof(*req); req->rm_rid = rid; req->rm_ver_major = RNDIS_VERSION_MAJOR; req->rm_ver_minor = RNDIS_VERSION_MINOR; req->rm_max_xfersz = HN_RNDIS_XFER_SIZE; comp_len = RNDIS_INIT_COMP_SIZE_MIN; comp = hn_rndis_xact_execute(sc, xact, rid, sizeof(*req), &comp_len, REMOTE_NDIS_INITIALIZE_CMPLT); if (comp == NULL) { if_printf(sc->hn_ifp, "exec RNDIS init failed\n"); error = EIO; goto done; } if (comp->rm_status != RNDIS_STATUS_SUCCESS) { if_printf(sc->hn_ifp, "RNDIS init failed: status 0x%08x\n", comp->rm_status); error = EIO; goto done; } sc->hn_rndis_agg_size = comp->rm_pktmaxsz; sc->hn_rndis_agg_pkts = comp->rm_pktmaxcnt; sc->hn_rndis_agg_align = 1U << comp->rm_align; + if (sc->hn_rndis_agg_align < sizeof(uint32_t)) { + /* + * The RNDIS packet messsage encap assumes that the RNDIS + * packet message is at least 4 bytes aligned. Fix up the + * alignment here, if the remote side sets the alignment + * too low. + */ + if_printf(sc->hn_ifp, "fixup RNDIS aggpkt align: %u -> %zu\n", + sc->hn_rndis_agg_align, sizeof(uint32_t)); + sc->hn_rndis_agg_align = sizeof(uint32_t); + } + if (bootverbose) { - if_printf(sc->hn_ifp, "RNDIS ver %u.%u, pktsz %u, pktcnt %u, " - "align %u\n", comp->rm_ver_major, comp->rm_ver_minor, + if_printf(sc->hn_ifp, "RNDIS ver %u.%u, " + "aggpkt size %u, aggpkt cnt %u, aggpkt align %u\n", + comp->rm_ver_major, comp->rm_ver_minor, sc->hn_rndis_agg_size, sc->hn_rndis_agg_pkts, sc->hn_rndis_agg_align); } error = 0; done: vmbus_xact_put(xact); return (error); } static int hn_rndis_halt(struct hn_softc *sc) { struct vmbus_xact *xact; struct rndis_halt_req *halt; struct hn_nvs_sendctx sndc; size_t comp_len; xact = vmbus_xact_get(sc->hn_xact, sizeof(*halt)); if (xact == NULL) { if_printf(sc->hn_ifp, "no xact for RNDIS halt\n"); return (ENXIO); } halt = vmbus_xact_req_data(xact); halt->rm_type = REMOTE_NDIS_HALT_MSG; halt->rm_len = sizeof(*halt); halt->rm_rid = hn_rndis_rid(sc); /* No RNDIS completion; rely on NVS message send completion */ hn_nvs_sendctx_init(&sndc, hn_nvs_sent_xact, xact); hn_rndis_xact_exec1(sc, xact, sizeof(*halt), &sndc, &comp_len); vmbus_xact_put(xact); if (bootverbose) if_printf(sc->hn_ifp, "RNDIS halt done\n"); return (0); } static int hn_rndis_query_hwcaps(struct hn_softc *sc, struct ndis_offload *caps) { struct ndis_offload in; size_t caps_len, size; int error; memset(&in, 0, sizeof(in)); in.ndis_hdr.ndis_type = NDIS_OBJTYPE_OFFLOAD; if (sc->hn_ndis_ver >= HN_NDIS_VERSION_6_30) { in.ndis_hdr.ndis_rev = NDIS_OFFLOAD_REV_3; size = NDIS_OFFLOAD_SIZE; } else if (sc->hn_ndis_ver >= HN_NDIS_VERSION_6_1) { in.ndis_hdr.ndis_rev = NDIS_OFFLOAD_REV_2; size = NDIS_OFFLOAD_SIZE_6_1; } else { in.ndis_hdr.ndis_rev = NDIS_OFFLOAD_REV_1; size = NDIS_OFFLOAD_SIZE_6_0; } in.ndis_hdr.ndis_size = size; caps_len = NDIS_OFFLOAD_SIZE; error = hn_rndis_query2(sc, OID_TCP_OFFLOAD_HARDWARE_CAPABILITIES, &in, size, caps, &caps_len, NDIS_OFFLOAD_SIZE_6_0); if (error) return (error); /* * Preliminary verification. */ if (caps->ndis_hdr.ndis_type != NDIS_OBJTYPE_OFFLOAD) { if_printf(sc->hn_ifp, "invalid NDIS objtype 0x%02x\n", caps->ndis_hdr.ndis_type); return (EINVAL); } if (caps->ndis_hdr.ndis_rev < NDIS_OFFLOAD_REV_1) { if_printf(sc->hn_ifp, "invalid NDIS objrev 0x%02x\n", caps->ndis_hdr.ndis_rev); return (EINVAL); } if (caps->ndis_hdr.ndis_size > caps_len) { if_printf(sc->hn_ifp, "invalid NDIS objsize %u, " "data size %zu\n", caps->ndis_hdr.ndis_size, caps_len); return (EINVAL); } else if (caps->ndis_hdr.ndis_size < NDIS_OFFLOAD_SIZE_6_0) { if_printf(sc->hn_ifp, "invalid NDIS objsize %u\n", caps->ndis_hdr.ndis_size); return (EINVAL); } if (bootverbose) { /* * NOTE: * caps->ndis_hdr.ndis_size MUST be checked before accessing * NDIS 6.1+ specific fields. */ if_printf(sc->hn_ifp, "hwcaps rev %u\n", caps->ndis_hdr.ndis_rev); if_printf(sc->hn_ifp, "hwcaps csum: " "ip4 tx 0x%x/0x%x rx 0x%x/0x%x, " "ip6 tx 0x%x/0x%x rx 0x%x/0x%x\n", caps->ndis_csum.ndis_ip4_txcsum, caps->ndis_csum.ndis_ip4_txenc, caps->ndis_csum.ndis_ip4_rxcsum, caps->ndis_csum.ndis_ip4_rxenc, caps->ndis_csum.ndis_ip6_txcsum, caps->ndis_csum.ndis_ip6_txenc, caps->ndis_csum.ndis_ip6_rxcsum, caps->ndis_csum.ndis_ip6_rxenc); if_printf(sc->hn_ifp, "hwcaps lsov2: " "ip4 maxsz %u minsg %u encap 0x%x, " "ip6 maxsz %u minsg %u encap 0x%x opts 0x%x\n", caps->ndis_lsov2.ndis_ip4_maxsz, caps->ndis_lsov2.ndis_ip4_minsg, caps->ndis_lsov2.ndis_ip4_encap, caps->ndis_lsov2.ndis_ip6_maxsz, caps->ndis_lsov2.ndis_ip6_minsg, caps->ndis_lsov2.ndis_ip6_encap, caps->ndis_lsov2.ndis_ip6_opts); } return (0); } int hn_rndis_attach(struct hn_softc *sc, int mtu) { int error; /* * Initialize RNDIS. */ error = hn_rndis_init(sc); if (error) return (error); /* * Configure NDIS offload settings. */ hn_rndis_conf_offload(sc, mtu); return (0); } void hn_rndis_detach(struct hn_softc *sc) { /* Halt the RNDIS. */ hn_rndis_halt(sc); } Index: projects/clang400-import/sys/dev/hyperv/netvsc/if_hn.c =================================================================== --- projects/clang400-import/sys/dev/hyperv/netvsc/if_hn.c (revision 314522) +++ projects/clang400-import/sys/dev/hyperv/netvsc/if_hn.c (revision 314523) @@ -1,5740 +1,5739 @@ /*- * Copyright (c) 2010-2012 Citrix Inc. * Copyright (c) 2009-2012,2016 Microsoft Corp. * Copyright (c) 2012 NetApp Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /*- * Copyright (c) 2004-2006 Kip Macy * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_hn.h" #include "opt_inet6.h" #include "opt_inet.h" #include "opt_rss.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef RSS #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "vmbus_if.h" #define HN_IFSTART_SUPPORT #define HN_RING_CNT_DEF_MAX 8 /* YYY should get it from the underlying channel */ #define HN_TX_DESC_CNT 512 #define HN_RNDIS_PKT_LEN \ (sizeof(struct rndis_packet_msg) + \ HN_RNDIS_PKTINFO_SIZE(HN_NDIS_HASH_VALUE_SIZE) + \ HN_RNDIS_PKTINFO_SIZE(NDIS_VLAN_INFO_SIZE) + \ HN_RNDIS_PKTINFO_SIZE(NDIS_LSO2_INFO_SIZE) + \ HN_RNDIS_PKTINFO_SIZE(NDIS_TXCSUM_INFO_SIZE)) #define HN_RNDIS_PKT_BOUNDARY PAGE_SIZE #define HN_RNDIS_PKT_ALIGN CACHE_LINE_SIZE #define HN_TX_DATA_BOUNDARY PAGE_SIZE #define HN_TX_DATA_MAXSIZE IP_MAXPACKET #define HN_TX_DATA_SEGSIZE PAGE_SIZE /* -1 for RNDIS packet message */ #define HN_TX_DATA_SEGCNT_MAX (HN_GPACNT_MAX - 1) #define HN_DIRECT_TX_SIZE_DEF 128 #define HN_EARLY_TXEOF_THRESH 8 #define HN_PKTBUF_LEN_DEF (16 * 1024) #define HN_LROENT_CNT_DEF 128 #define HN_LRO_LENLIM_MULTIRX_DEF (12 * ETHERMTU) #define HN_LRO_LENLIM_DEF (25 * ETHERMTU) /* YYY 2*MTU is a bit rough, but should be good enough. */ #define HN_LRO_LENLIM_MIN(ifp) (2 * (ifp)->if_mtu) #define HN_LRO_ACKCNT_DEF 1 #define HN_LOCK_INIT(sc) \ sx_init(&(sc)->hn_lock, device_get_nameunit((sc)->hn_dev)) #define HN_LOCK_DESTROY(sc) sx_destroy(&(sc)->hn_lock) #define HN_LOCK_ASSERT(sc) sx_assert(&(sc)->hn_lock, SA_XLOCKED) #define HN_LOCK(sc) \ do { \ while (sx_try_xlock(&(sc)->hn_lock) == 0) \ DELAY(1000); \ } while (0) #define HN_UNLOCK(sc) sx_xunlock(&(sc)->hn_lock) #define HN_CSUM_IP_MASK (CSUM_IP | CSUM_IP_TCP | CSUM_IP_UDP) #define HN_CSUM_IP6_MASK (CSUM_IP6_TCP | CSUM_IP6_UDP) #define HN_CSUM_IP_HWASSIST(sc) \ ((sc)->hn_tx_ring[0].hn_csum_assist & HN_CSUM_IP_MASK) #define HN_CSUM_IP6_HWASSIST(sc) \ ((sc)->hn_tx_ring[0].hn_csum_assist & HN_CSUM_IP6_MASK) #define HN_PKTSIZE_MIN(align) \ roundup2(ETHER_MIN_LEN + ETHER_VLAN_ENCAP_LEN - ETHER_CRC_LEN + \ HN_RNDIS_PKT_LEN, (align)) #define HN_PKTSIZE(m, align) \ roundup2((m)->m_pkthdr.len + HN_RNDIS_PKT_LEN, (align)) #ifdef RSS #define HN_RING_IDX2CPU(sc, idx) rss_getcpu((idx) % rss_getnumbuckets()) #else #define HN_RING_IDX2CPU(sc, idx) (((sc)->hn_cpu + (idx)) % mp_ncpus) #endif struct hn_txdesc { #ifndef HN_USE_TXDESC_BUFRING SLIST_ENTRY(hn_txdesc) link; #endif STAILQ_ENTRY(hn_txdesc) agg_link; /* Aggregated txdescs, in sending order. */ STAILQ_HEAD(, hn_txdesc) agg_list; /* The oldest packet, if transmission aggregation happens. */ struct mbuf *m; struct hn_tx_ring *txr; int refs; uint32_t flags; /* HN_TXD_FLAG_ */ struct hn_nvs_sendctx send_ctx; uint32_t chim_index; int chim_size; bus_dmamap_t data_dmap; bus_addr_t rndis_pkt_paddr; struct rndis_packet_msg *rndis_pkt; bus_dmamap_t rndis_pkt_dmap; }; #define HN_TXD_FLAG_ONLIST 0x0001 #define HN_TXD_FLAG_DMAMAP 0x0002 #define HN_TXD_FLAG_ONAGG 0x0004 struct hn_rxinfo { uint32_t vlan_info; uint32_t csum_info; uint32_t hash_info; uint32_t hash_value; }; struct hn_update_vf { struct hn_rx_ring *rxr; struct ifnet *vf; }; #define HN_RXINFO_VLAN 0x0001 #define HN_RXINFO_CSUM 0x0002 #define HN_RXINFO_HASHINF 0x0004 #define HN_RXINFO_HASHVAL 0x0008 #define HN_RXINFO_ALL \ (HN_RXINFO_VLAN | \ HN_RXINFO_CSUM | \ HN_RXINFO_HASHINF | \ HN_RXINFO_HASHVAL) #define HN_NDIS_VLAN_INFO_INVALID 0xffffffff #define HN_NDIS_RXCSUM_INFO_INVALID 0 #define HN_NDIS_HASH_INFO_INVALID 0 static int hn_probe(device_t); static int hn_attach(device_t); static int hn_detach(device_t); static int hn_shutdown(device_t); static void hn_chan_callback(struct vmbus_channel *, void *); static void hn_init(void *); static int hn_ioctl(struct ifnet *, u_long, caddr_t); #ifdef HN_IFSTART_SUPPORT static void hn_start(struct ifnet *); #endif static int hn_transmit(struct ifnet *, struct mbuf *); static void hn_xmit_qflush(struct ifnet *); static int hn_ifmedia_upd(struct ifnet *); static void hn_ifmedia_sts(struct ifnet *, struct ifmediareq *); static int hn_rndis_rxinfo(const void *, int, struct hn_rxinfo *); static void hn_rndis_rx_data(struct hn_rx_ring *, const void *, int); static void hn_rndis_rx_status(struct hn_softc *, const void *, int); static void hn_nvs_handle_notify(struct hn_softc *, const struct vmbus_chanpkt_hdr *); static void hn_nvs_handle_comp(struct hn_softc *, struct vmbus_channel *, const struct vmbus_chanpkt_hdr *); static void hn_nvs_handle_rxbuf(struct hn_rx_ring *, struct vmbus_channel *, const struct vmbus_chanpkt_hdr *); static void hn_nvs_ack_rxbuf(struct hn_rx_ring *, struct vmbus_channel *, uint64_t); #if __FreeBSD_version >= 1100099 static int hn_lro_lenlim_sysctl(SYSCTL_HANDLER_ARGS); static int hn_lro_ackcnt_sysctl(SYSCTL_HANDLER_ARGS); #endif static int hn_trust_hcsum_sysctl(SYSCTL_HANDLER_ARGS); static int hn_chim_size_sysctl(SYSCTL_HANDLER_ARGS); #if __FreeBSD_version < 1100095 static int hn_rx_stat_int_sysctl(SYSCTL_HANDLER_ARGS); #else static int hn_rx_stat_u64_sysctl(SYSCTL_HANDLER_ARGS); #endif static int hn_rx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS); static int hn_tx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS); static int hn_tx_conf_int_sysctl(SYSCTL_HANDLER_ARGS); static int hn_ndis_version_sysctl(SYSCTL_HANDLER_ARGS); static int hn_caps_sysctl(SYSCTL_HANDLER_ARGS); static int hn_hwassist_sysctl(SYSCTL_HANDLER_ARGS); static int hn_rxfilter_sysctl(SYSCTL_HANDLER_ARGS); #ifndef RSS static int hn_rss_key_sysctl(SYSCTL_HANDLER_ARGS); static int hn_rss_ind_sysctl(SYSCTL_HANDLER_ARGS); #endif static int hn_rss_hash_sysctl(SYSCTL_HANDLER_ARGS); static int hn_txagg_size_sysctl(SYSCTL_HANDLER_ARGS); static int hn_txagg_pkts_sysctl(SYSCTL_HANDLER_ARGS); static int hn_txagg_pktmax_sysctl(SYSCTL_HANDLER_ARGS); static int hn_txagg_align_sysctl(SYSCTL_HANDLER_ARGS); static int hn_polling_sysctl(SYSCTL_HANDLER_ARGS); static int hn_vf_sysctl(SYSCTL_HANDLER_ARGS); static void hn_stop(struct hn_softc *, bool); static void hn_init_locked(struct hn_softc *); static int hn_chan_attach(struct hn_softc *, struct vmbus_channel *); static void hn_chan_detach(struct hn_softc *, struct vmbus_channel *); static int hn_attach_subchans(struct hn_softc *); static void hn_detach_allchans(struct hn_softc *); static void hn_chan_rollup(struct hn_rx_ring *, struct hn_tx_ring *); static void hn_set_ring_inuse(struct hn_softc *, int); static int hn_synth_attach(struct hn_softc *, int); static void hn_synth_detach(struct hn_softc *); static int hn_synth_alloc_subchans(struct hn_softc *, int *); static bool hn_synth_attachable(const struct hn_softc *); static void hn_suspend(struct hn_softc *); static void hn_suspend_data(struct hn_softc *); static void hn_suspend_mgmt(struct hn_softc *); static void hn_resume(struct hn_softc *); static void hn_resume_data(struct hn_softc *); static void hn_resume_mgmt(struct hn_softc *); static void hn_suspend_mgmt_taskfunc(void *, int); static void hn_chan_drain(struct hn_softc *, struct vmbus_channel *); static void hn_polling(struct hn_softc *, u_int); static void hn_chan_polling(struct vmbus_channel *, u_int); static void hn_update_link_status(struct hn_softc *); static void hn_change_network(struct hn_softc *); static void hn_link_taskfunc(void *, int); static void hn_netchg_init_taskfunc(void *, int); static void hn_netchg_status_taskfunc(void *, int); static void hn_link_status(struct hn_softc *); static int hn_create_rx_data(struct hn_softc *, int); static void hn_destroy_rx_data(struct hn_softc *); static int hn_check_iplen(const struct mbuf *, int); static int hn_set_rxfilter(struct hn_softc *, uint32_t); static int hn_rxfilter_config(struct hn_softc *); #ifndef RSS static int hn_rss_reconfig(struct hn_softc *); #endif static void hn_rss_ind_fixup(struct hn_softc *); static int hn_rxpkt(struct hn_rx_ring *, const void *, int, const struct hn_rxinfo *); static int hn_tx_ring_create(struct hn_softc *, int); static void hn_tx_ring_destroy(struct hn_tx_ring *); static int hn_create_tx_data(struct hn_softc *, int); static void hn_fixup_tx_data(struct hn_softc *); static void hn_destroy_tx_data(struct hn_softc *); static void hn_txdesc_dmamap_destroy(struct hn_txdesc *); static void hn_txdesc_gc(struct hn_tx_ring *, struct hn_txdesc *); static int hn_encap(struct ifnet *, struct hn_tx_ring *, struct hn_txdesc *, struct mbuf **); static int hn_txpkt(struct ifnet *, struct hn_tx_ring *, struct hn_txdesc *); static void hn_set_chim_size(struct hn_softc *, int); static void hn_set_tso_maxsize(struct hn_softc *, int, int); static bool hn_tx_ring_pending(struct hn_tx_ring *); static void hn_tx_ring_qflush(struct hn_tx_ring *); static void hn_resume_tx(struct hn_softc *, int); static void hn_set_txagg(struct hn_softc *); static void *hn_try_txagg(struct ifnet *, struct hn_tx_ring *, struct hn_txdesc *, int); static int hn_get_txswq_depth(const struct hn_tx_ring *); static void hn_txpkt_done(struct hn_nvs_sendctx *, struct hn_softc *, struct vmbus_channel *, const void *, int); static int hn_txpkt_sglist(struct hn_tx_ring *, struct hn_txdesc *); static int hn_txpkt_chim(struct hn_tx_ring *, struct hn_txdesc *); static int hn_xmit(struct hn_tx_ring *, int); static void hn_xmit_taskfunc(void *, int); static void hn_xmit_txeof(struct hn_tx_ring *); static void hn_xmit_txeof_taskfunc(void *, int); #ifdef HN_IFSTART_SUPPORT static int hn_start_locked(struct hn_tx_ring *, int); static void hn_start_taskfunc(void *, int); static void hn_start_txeof(struct hn_tx_ring *); static void hn_start_txeof_taskfunc(void *, int); #endif SYSCTL_NODE(_hw, OID_AUTO, hn, CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, "Hyper-V network interface"); /* Trust tcp segements verification on host side. */ static int hn_trust_hosttcp = 1; SYSCTL_INT(_hw_hn, OID_AUTO, trust_hosttcp, CTLFLAG_RDTUN, &hn_trust_hosttcp, 0, "Trust tcp segement verification on host side, " "when csum info is missing (global setting)"); /* Trust udp datagrams verification on host side. */ static int hn_trust_hostudp = 1; SYSCTL_INT(_hw_hn, OID_AUTO, trust_hostudp, CTLFLAG_RDTUN, &hn_trust_hostudp, 0, "Trust udp datagram verification on host side, " "when csum info is missing (global setting)"); /* Trust ip packets verification on host side. */ static int hn_trust_hostip = 1; SYSCTL_INT(_hw_hn, OID_AUTO, trust_hostip, CTLFLAG_RDTUN, &hn_trust_hostip, 0, "Trust ip packet verification on host side, " "when csum info is missing (global setting)"); /* Limit TSO burst size */ static int hn_tso_maxlen = IP_MAXPACKET; SYSCTL_INT(_hw_hn, OID_AUTO, tso_maxlen, CTLFLAG_RDTUN, &hn_tso_maxlen, 0, "TSO burst limit"); /* Limit chimney send size */ static int hn_tx_chimney_size = 0; SYSCTL_INT(_hw_hn, OID_AUTO, tx_chimney_size, CTLFLAG_RDTUN, &hn_tx_chimney_size, 0, "Chimney send packet size limit"); /* Limit the size of packet for direct transmission */ static int hn_direct_tx_size = HN_DIRECT_TX_SIZE_DEF; SYSCTL_INT(_hw_hn, OID_AUTO, direct_tx_size, CTLFLAG_RDTUN, &hn_direct_tx_size, 0, "Size of the packet for direct transmission"); /* # of LRO entries per RX ring */ #if defined(INET) || defined(INET6) #if __FreeBSD_version >= 1100095 static int hn_lro_entry_count = HN_LROENT_CNT_DEF; SYSCTL_INT(_hw_hn, OID_AUTO, lro_entry_count, CTLFLAG_RDTUN, &hn_lro_entry_count, 0, "LRO entry count"); #endif #endif static int hn_tx_taskq_cnt = 1; SYSCTL_INT(_hw_hn, OID_AUTO, tx_taskq_cnt, CTLFLAG_RDTUN, &hn_tx_taskq_cnt, 0, "# of TX taskqueues"); #define HN_TX_TASKQ_M_INDEP 0 #define HN_TX_TASKQ_M_GLOBAL 1 #define HN_TX_TASKQ_M_EVTTQ 2 static int hn_tx_taskq_mode = HN_TX_TASKQ_M_INDEP; SYSCTL_INT(_hw_hn, OID_AUTO, tx_taskq_mode, CTLFLAG_RDTUN, &hn_tx_taskq_mode, 0, "TX taskqueue modes: " "0 - independent, 1 - share global tx taskqs, 2 - share event taskqs"); #ifndef HN_USE_TXDESC_BUFRING static int hn_use_txdesc_bufring = 0; #else static int hn_use_txdesc_bufring = 1; #endif SYSCTL_INT(_hw_hn, OID_AUTO, use_txdesc_bufring, CTLFLAG_RD, &hn_use_txdesc_bufring, 0, "Use buf_ring for TX descriptors"); #ifdef HN_IFSTART_SUPPORT /* Use ifnet.if_start instead of ifnet.if_transmit */ static int hn_use_if_start = 0; SYSCTL_INT(_hw_hn, OID_AUTO, use_if_start, CTLFLAG_RDTUN, &hn_use_if_start, 0, "Use if_start TX method"); #endif /* # of channels to use */ static int hn_chan_cnt = 0; SYSCTL_INT(_hw_hn, OID_AUTO, chan_cnt, CTLFLAG_RDTUN, &hn_chan_cnt, 0, "# of channels to use; each channel has one RX ring and one TX ring"); /* # of transmit rings to use */ static int hn_tx_ring_cnt = 0; SYSCTL_INT(_hw_hn, OID_AUTO, tx_ring_cnt, CTLFLAG_RDTUN, &hn_tx_ring_cnt, 0, "# of TX rings to use"); /* Software TX ring deptch */ static int hn_tx_swq_depth = 0; SYSCTL_INT(_hw_hn, OID_AUTO, tx_swq_depth, CTLFLAG_RDTUN, &hn_tx_swq_depth, 0, "Depth of IFQ or BUFRING"); /* Enable sorted LRO, and the depth of the per-channel mbuf queue */ #if __FreeBSD_version >= 1100095 static u_int hn_lro_mbufq_depth = 0; SYSCTL_UINT(_hw_hn, OID_AUTO, lro_mbufq_depth, CTLFLAG_RDTUN, &hn_lro_mbufq_depth, 0, "Depth of LRO mbuf queue"); #endif /* Packet transmission aggregation size limit */ static int hn_tx_agg_size = -1; SYSCTL_INT(_hw_hn, OID_AUTO, tx_agg_size, CTLFLAG_RDTUN, &hn_tx_agg_size, 0, "Packet transmission aggregation size limit"); /* Packet transmission aggregation count limit */ static int hn_tx_agg_pkts = -1; SYSCTL_INT(_hw_hn, OID_AUTO, tx_agg_pkts, CTLFLAG_RDTUN, &hn_tx_agg_pkts, 0, "Packet transmission aggregation packet limit"); static u_int hn_cpu_index; /* next CPU for channel */ static struct taskqueue **hn_tx_taskque;/* shared TX taskqueues */ #ifndef RSS static const uint8_t hn_rss_key_default[NDIS_HASH_KEYSIZE_TOEPLITZ] = { 0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, 0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0, 0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4, 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c, 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa }; #endif /* !RSS */ static device_method_t hn_methods[] = { /* Device interface */ DEVMETHOD(device_probe, hn_probe), DEVMETHOD(device_attach, hn_attach), DEVMETHOD(device_detach, hn_detach), DEVMETHOD(device_shutdown, hn_shutdown), DEVMETHOD_END }; static driver_t hn_driver = { "hn", hn_methods, sizeof(struct hn_softc) }; static devclass_t hn_devclass; DRIVER_MODULE(hn, vmbus, hn_driver, hn_devclass, 0, 0); MODULE_VERSION(hn, 1); MODULE_DEPEND(hn, vmbus, 1, 1, 1); #if __FreeBSD_version >= 1100099 static void hn_set_lro_lenlim(struct hn_softc *sc, int lenlim) { int i; for (i = 0; i < sc->hn_rx_ring_cnt; ++i) sc->hn_rx_ring[i].hn_lro.lro_length_lim = lenlim; } #endif static int hn_txpkt_sglist(struct hn_tx_ring *txr, struct hn_txdesc *txd) { KASSERT(txd->chim_index == HN_NVS_CHIM_IDX_INVALID && txd->chim_size == 0, ("invalid rndis sglist txd")); return (hn_nvs_send_rndis_sglist(txr->hn_chan, HN_NVS_RNDIS_MTYPE_DATA, &txd->send_ctx, txr->hn_gpa, txr->hn_gpa_cnt)); } static int hn_txpkt_chim(struct hn_tx_ring *txr, struct hn_txdesc *txd) { struct hn_nvs_rndis rndis; KASSERT(txd->chim_index != HN_NVS_CHIM_IDX_INVALID && txd->chim_size > 0, ("invalid rndis chim txd")); rndis.nvs_type = HN_NVS_TYPE_RNDIS; rndis.nvs_rndis_mtype = HN_NVS_RNDIS_MTYPE_DATA; rndis.nvs_chim_idx = txd->chim_index; rndis.nvs_chim_sz = txd->chim_size; return (hn_nvs_send(txr->hn_chan, VMBUS_CHANPKT_FLAG_RC, &rndis, sizeof(rndis), &txd->send_ctx)); } static __inline uint32_t hn_chim_alloc(struct hn_softc *sc) { int i, bmap_cnt = sc->hn_chim_bmap_cnt; u_long *bmap = sc->hn_chim_bmap; uint32_t ret = HN_NVS_CHIM_IDX_INVALID; for (i = 0; i < bmap_cnt; ++i) { int idx; idx = ffsl(~bmap[i]); if (idx == 0) continue; --idx; /* ffsl is 1-based */ KASSERT(i * LONG_BIT + idx < sc->hn_chim_cnt, ("invalid i %d and idx %d", i, idx)); if (atomic_testandset_long(&bmap[i], idx)) continue; ret = i * LONG_BIT + idx; break; } return (ret); } static __inline void hn_chim_free(struct hn_softc *sc, uint32_t chim_idx) { u_long mask; uint32_t idx; idx = chim_idx / LONG_BIT; KASSERT(idx < sc->hn_chim_bmap_cnt, ("invalid chimney index 0x%x", chim_idx)); mask = 1UL << (chim_idx % LONG_BIT); KASSERT(sc->hn_chim_bmap[idx] & mask, ("index bitmap 0x%lx, chimney index %u, " "bitmap idx %d, bitmask 0x%lx", sc->hn_chim_bmap[idx], chim_idx, idx, mask)); atomic_clear_long(&sc->hn_chim_bmap[idx], mask); } #if defined(INET6) || defined(INET) /* * NOTE: If this function failed, the m_head would be freed. */ static __inline struct mbuf * hn_tso_fixup(struct mbuf *m_head) { struct ether_vlan_header *evl; struct tcphdr *th; int ehlen; KASSERT(M_WRITABLE(m_head), ("TSO mbuf not writable")); #define PULLUP_HDR(m, len) \ do { \ if (__predict_false((m)->m_len < (len))) { \ (m) = m_pullup((m), (len)); \ if ((m) == NULL) \ return (NULL); \ } \ } while (0) PULLUP_HDR(m_head, sizeof(*evl)); evl = mtod(m_head, struct ether_vlan_header *); if (evl->evl_encap_proto == ntohs(ETHERTYPE_VLAN)) ehlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; else ehlen = ETHER_HDR_LEN; #ifdef INET if (m_head->m_pkthdr.csum_flags & CSUM_IP_TSO) { struct ip *ip; int iphlen; PULLUP_HDR(m_head, ehlen + sizeof(*ip)); ip = mtodo(m_head, ehlen); iphlen = ip->ip_hl << 2; PULLUP_HDR(m_head, ehlen + iphlen + sizeof(*th)); th = mtodo(m_head, ehlen + iphlen); ip->ip_len = 0; ip->ip_sum = 0; th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(IPPROTO_TCP)); } #endif #if defined(INET6) && defined(INET) else #endif #ifdef INET6 { struct ip6_hdr *ip6; PULLUP_HDR(m_head, ehlen + sizeof(*ip6)); ip6 = mtodo(m_head, ehlen); if (ip6->ip6_nxt != IPPROTO_TCP) { m_freem(m_head); return (NULL); } PULLUP_HDR(m_head, ehlen + sizeof(*ip6) + sizeof(*th)); th = mtodo(m_head, ehlen + sizeof(*ip6)); ip6->ip6_plen = 0; th->th_sum = in6_cksum_pseudo(ip6, 0, IPPROTO_TCP, 0); } #endif return (m_head); #undef PULLUP_HDR } #endif /* INET6 || INET */ static int hn_set_rxfilter(struct hn_softc *sc, uint32_t filter) { int error = 0; HN_LOCK_ASSERT(sc); if (sc->hn_rx_filter != filter) { error = hn_rndis_set_rxfilter(sc, filter); if (!error) sc->hn_rx_filter = filter; } return (error); } static int hn_rxfilter_config(struct hn_softc *sc) { struct ifnet *ifp = sc->hn_ifp; uint32_t filter; HN_LOCK_ASSERT(sc); if ((ifp->if_flags & IFF_PROMISC) || (sc->hn_flags & HN_FLAG_VF)) { filter = NDIS_PACKET_TYPE_PROMISCUOUS; } else { filter = NDIS_PACKET_TYPE_DIRECTED; if (ifp->if_flags & IFF_BROADCAST) filter |= NDIS_PACKET_TYPE_BROADCAST; /* TODO: support multicast list */ if ((ifp->if_flags & IFF_ALLMULTI) || !TAILQ_EMPTY(&ifp->if_multiaddrs)) filter |= NDIS_PACKET_TYPE_ALL_MULTICAST; } return (hn_set_rxfilter(sc, filter)); } static void hn_set_txagg(struct hn_softc *sc) { uint32_t size, pkts; int i; /* * Setup aggregation size. */ if (sc->hn_agg_size < 0) size = UINT32_MAX; else size = sc->hn_agg_size; if (sc->hn_rndis_agg_size < size) size = sc->hn_rndis_agg_size; /* NOTE: We only aggregate packets using chimney sending buffers. */ if (size > (uint32_t)sc->hn_chim_szmax) size = sc->hn_chim_szmax; if (size <= 2 * HN_PKTSIZE_MIN(sc->hn_rndis_agg_align)) { /* Disable */ size = 0; pkts = 0; goto done; } /* NOTE: Type of the per TX ring setting is 'int'. */ if (size > INT_MAX) size = INT_MAX; /* * Setup aggregation packet count. */ if (sc->hn_agg_pkts < 0) pkts = UINT32_MAX; else pkts = sc->hn_agg_pkts; if (sc->hn_rndis_agg_pkts < pkts) pkts = sc->hn_rndis_agg_pkts; if (pkts <= 1) { /* Disable */ size = 0; pkts = 0; goto done; } /* NOTE: Type of the per TX ring setting is 'short'. */ if (pkts > SHRT_MAX) pkts = SHRT_MAX; done: /* NOTE: Type of the per TX ring setting is 'short'. */ if (sc->hn_rndis_agg_align > SHRT_MAX) { /* Disable */ size = 0; pkts = 0; } if (bootverbose) { if_printf(sc->hn_ifp, "TX agg size %u, pkts %u, align %u\n", size, pkts, sc->hn_rndis_agg_align); } for (i = 0; i < sc->hn_tx_ring_cnt; ++i) { struct hn_tx_ring *txr = &sc->hn_tx_ring[i]; mtx_lock(&txr->hn_tx_lock); txr->hn_agg_szmax = size; txr->hn_agg_pktmax = pkts; txr->hn_agg_align = sc->hn_rndis_agg_align; mtx_unlock(&txr->hn_tx_lock); } } static int hn_get_txswq_depth(const struct hn_tx_ring *txr) { KASSERT(txr->hn_txdesc_cnt > 0, ("tx ring is not setup yet")); if (hn_tx_swq_depth < txr->hn_txdesc_cnt) return txr->hn_txdesc_cnt; return hn_tx_swq_depth; } #ifndef RSS static int hn_rss_reconfig(struct hn_softc *sc) { int error; HN_LOCK_ASSERT(sc); if ((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0) return (ENXIO); /* * Disable RSS first. * * NOTE: * Direct reconfiguration by setting the UNCHG flags does * _not_ work properly. */ if (bootverbose) if_printf(sc->hn_ifp, "disable RSS\n"); error = hn_rndis_conf_rss(sc, NDIS_RSS_FLAG_DISABLE); if (error) { if_printf(sc->hn_ifp, "RSS disable failed\n"); return (error); } /* * Reenable the RSS w/ the updated RSS key or indirect * table. */ if (bootverbose) if_printf(sc->hn_ifp, "reconfig RSS\n"); error = hn_rndis_conf_rss(sc, NDIS_RSS_FLAG_NONE); if (error) { if_printf(sc->hn_ifp, "RSS reconfig failed\n"); return (error); } return (0); } #endif /* !RSS */ static void hn_rss_ind_fixup(struct hn_softc *sc) { struct ndis_rssprm_toeplitz *rss = &sc->hn_rss; int i, nchan; nchan = sc->hn_rx_ring_inuse; KASSERT(nchan > 1, ("invalid # of channels %d", nchan)); /* * Check indirect table to make sure that all channels in it * can be used. */ for (i = 0; i < NDIS_HASH_INDCNT; ++i) { if (rss->rss_ind[i] >= nchan) { if_printf(sc->hn_ifp, "RSS indirect table %d fixup: %u -> %d\n", i, rss->rss_ind[i], nchan - 1); rss->rss_ind[i] = nchan - 1; } } } static int hn_ifmedia_upd(struct ifnet *ifp __unused) { return EOPNOTSUPP; } static void hn_ifmedia_sts(struct ifnet *ifp, struct ifmediareq *ifmr) { struct hn_softc *sc = ifp->if_softc; ifmr->ifm_status = IFM_AVALID; ifmr->ifm_active = IFM_ETHER; if ((sc->hn_link_flags & HN_LINK_FLAG_LINKUP) == 0) { ifmr->ifm_active |= IFM_NONE; return; } ifmr->ifm_status |= IFM_ACTIVE; ifmr->ifm_active |= IFM_10G_T | IFM_FDX; } static void hn_update_vf_task(void *arg, int pending __unused) { struct hn_update_vf *uv = arg; uv->rxr->hn_vf = uv->vf; } static void hn_update_vf(struct hn_softc *sc, struct ifnet *vf) { struct hn_rx_ring *rxr; struct hn_update_vf uv; struct task task; int i; HN_LOCK_ASSERT(sc); TASK_INIT(&task, 0, hn_update_vf_task, &uv); for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { rxr = &sc->hn_rx_ring[i]; if (i < sc->hn_rx_ring_inuse) { uv.rxr = rxr; uv.vf = vf; vmbus_chan_run_task(rxr->hn_chan, &task); } else { rxr->hn_vf = vf; } } } static void hn_set_vf(struct hn_softc *sc, struct ifnet *ifp, bool vf) { struct ifnet *hn_ifp; HN_LOCK(sc); if (!(sc->hn_flags & HN_FLAG_SYNTH_ATTACHED)) goto out; hn_ifp = sc->hn_ifp; if (ifp == hn_ifp) goto out; if (ifp->if_alloctype != IFT_ETHER) goto out; /* Ignore lagg/vlan interfaces */ if (strcmp(ifp->if_dname, "lagg") == 0 || strcmp(ifp->if_dname, "vlan") == 0) goto out; if (bcmp(IF_LLADDR(ifp), IF_LLADDR(hn_ifp), ETHER_ADDR_LEN) != 0) goto out; /* Now we're sure 'ifp' is a real VF device. */ if (vf) { if (sc->hn_flags & HN_FLAG_VF) goto out; sc->hn_flags |= HN_FLAG_VF; hn_rxfilter_config(sc); } else { if (!(sc->hn_flags & HN_FLAG_VF)) goto out; sc->hn_flags &= ~HN_FLAG_VF; if (sc->hn_ifp->if_drv_flags & IFF_DRV_RUNNING) hn_rxfilter_config(sc); else hn_set_rxfilter(sc, NDIS_PACKET_TYPE_NONE); } hn_nvs_set_datapath(sc, vf ? HN_NVS_DATAPATH_VF : HN_NVS_DATAPATH_SYNTHETIC); hn_update_vf(sc, vf ? ifp : NULL); if (vf) { hn_suspend_mgmt(sc); sc->hn_link_flags &= ~(HN_LINK_FLAG_LINKUP | HN_LINK_FLAG_NETCHG); if_link_state_change(sc->hn_ifp, LINK_STATE_DOWN); } else { hn_resume_mgmt(sc); } devctl_notify("HYPERV_NIC_VF", if_name(hn_ifp), vf ? "VF_UP" : "VF_DOWN", NULL); if (bootverbose) if_printf(hn_ifp, "Data path is switched %s %s\n", vf ? "to" : "from", if_name(ifp)); out: HN_UNLOCK(sc); } static void hn_ifnet_event(void *arg, struct ifnet *ifp, int event) { if (event != IFNET_EVENT_UP && event != IFNET_EVENT_DOWN) return; hn_set_vf(arg, ifp, event == IFNET_EVENT_UP); } static void hn_ifaddr_event(void *arg, struct ifnet *ifp) { hn_set_vf(arg, ifp, ifp->if_flags & IFF_UP); } /* {F8615163-DF3E-46c5-913F-F2D2F965ED0E} */ static const struct hyperv_guid g_net_vsc_device_type = { .hv_guid = {0x63, 0x51, 0x61, 0xF8, 0x3E, 0xDF, 0xc5, 0x46, 0x91, 0x3F, 0xF2, 0xD2, 0xF9, 0x65, 0xED, 0x0E} }; static int hn_probe(device_t dev) { if (VMBUS_PROBE_GUID(device_get_parent(dev), dev, &g_net_vsc_device_type) == 0) { device_set_desc(dev, "Hyper-V Network Interface"); return BUS_PROBE_DEFAULT; } return ENXIO; } static int hn_attach(device_t dev) { struct hn_softc *sc = device_get_softc(dev); struct sysctl_oid_list *child; struct sysctl_ctx_list *ctx; uint8_t eaddr[ETHER_ADDR_LEN]; struct ifnet *ifp = NULL; int error, ring_cnt, tx_ring_cnt; sc->hn_dev = dev; sc->hn_prichan = vmbus_get_channel(dev); HN_LOCK_INIT(sc); /* * Initialize these tunables once. */ sc->hn_agg_size = hn_tx_agg_size; sc->hn_agg_pkts = hn_tx_agg_pkts; /* * Setup taskqueue for transmission. */ if (hn_tx_taskq_mode == HN_TX_TASKQ_M_INDEP) { int i; sc->hn_tx_taskqs = malloc(hn_tx_taskq_cnt * sizeof(struct taskqueue *), M_DEVBUF, M_WAITOK); for (i = 0; i < hn_tx_taskq_cnt; ++i) { sc->hn_tx_taskqs[i] = taskqueue_create("hn_tx", M_WAITOK, taskqueue_thread_enqueue, &sc->hn_tx_taskqs[i]); taskqueue_start_threads(&sc->hn_tx_taskqs[i], 1, PI_NET, "%s tx%d", device_get_nameunit(dev), i); } } else if (hn_tx_taskq_mode == HN_TX_TASKQ_M_GLOBAL) { sc->hn_tx_taskqs = hn_tx_taskque; } /* * Setup taskqueue for mangement tasks, e.g. link status. */ sc->hn_mgmt_taskq0 = taskqueue_create("hn_mgmt", M_WAITOK, taskqueue_thread_enqueue, &sc->hn_mgmt_taskq0); taskqueue_start_threads(&sc->hn_mgmt_taskq0, 1, PI_NET, "%s mgmt", device_get_nameunit(dev)); TASK_INIT(&sc->hn_link_task, 0, hn_link_taskfunc, sc); TASK_INIT(&sc->hn_netchg_init, 0, hn_netchg_init_taskfunc, sc); TIMEOUT_TASK_INIT(sc->hn_mgmt_taskq0, &sc->hn_netchg_status, 0, hn_netchg_status_taskfunc, sc); /* * Allocate ifnet and setup its name earlier, so that if_printf * can be used by functions, which will be called after * ether_ifattach(). */ ifp = sc->hn_ifp = if_alloc(IFT_ETHER); ifp->if_softc = sc; if_initname(ifp, device_get_name(dev), device_get_unit(dev)); /* * Initialize ifmedia earlier so that it can be unconditionally * destroyed, if error happened later on. */ ifmedia_init(&sc->hn_media, 0, hn_ifmedia_upd, hn_ifmedia_sts); /* * Figure out the # of RX rings (ring_cnt) and the # of TX rings * to use (tx_ring_cnt). * * NOTE: * The # of RX rings to use is same as the # of channels to use. */ ring_cnt = hn_chan_cnt; if (ring_cnt <= 0) { /* Default */ ring_cnt = mp_ncpus; if (ring_cnt > HN_RING_CNT_DEF_MAX) ring_cnt = HN_RING_CNT_DEF_MAX; } else if (ring_cnt > mp_ncpus) { ring_cnt = mp_ncpus; } #ifdef RSS if (ring_cnt > rss_getnumbuckets()) ring_cnt = rss_getnumbuckets(); #endif tx_ring_cnt = hn_tx_ring_cnt; if (tx_ring_cnt <= 0 || tx_ring_cnt > ring_cnt) tx_ring_cnt = ring_cnt; #ifdef HN_IFSTART_SUPPORT if (hn_use_if_start) { /* ifnet.if_start only needs one TX ring. */ tx_ring_cnt = 1; } #endif /* * Set the leader CPU for channels. */ sc->hn_cpu = atomic_fetchadd_int(&hn_cpu_index, ring_cnt) % mp_ncpus; /* * Create enough TX/RX rings, even if only limited number of * channels can be allocated. */ error = hn_create_tx_data(sc, tx_ring_cnt); if (error) goto failed; error = hn_create_rx_data(sc, ring_cnt); if (error) goto failed; /* * Create transaction context for NVS and RNDIS transactions. */ sc->hn_xact = vmbus_xact_ctx_create(bus_get_dma_tag(dev), HN_XACT_REQ_SIZE, HN_XACT_RESP_SIZE, 0); if (sc->hn_xact == NULL) { error = ENXIO; goto failed; } /* * Install orphan handler for the revocation of this device's * primary channel. * * NOTE: * The processing order is critical here: * Install the orphan handler, _before_ testing whether this * device's primary channel has been revoked or not. */ vmbus_chan_set_orphan(sc->hn_prichan, sc->hn_xact); if (vmbus_chan_is_revoked(sc->hn_prichan)) { error = ENXIO; goto failed; } /* * Attach the synthetic parts, i.e. NVS and RNDIS. */ error = hn_synth_attach(sc, ETHERMTU); if (error) goto failed; error = hn_rndis_get_eaddr(sc, eaddr); if (error) goto failed; #if __FreeBSD_version >= 1100099 if (sc->hn_rx_ring_inuse > 1) { /* * Reduce TCP segment aggregation limit for multiple * RX rings to increase ACK timeliness. */ hn_set_lro_lenlim(sc, HN_LRO_LENLIM_MULTIRX_DEF); } #endif /* * Fixup TX stuffs after synthetic parts are attached. */ hn_fixup_tx_data(sc); ctx = device_get_sysctl_ctx(dev); child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev)); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "nvs_version", CTLFLAG_RD, &sc->hn_nvs_ver, 0, "NVS version"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "ndis_version", CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0, hn_ndis_version_sysctl, "A", "NDIS version"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "caps", CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0, hn_caps_sysctl, "A", "capabilities"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "hwassist", CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0, hn_hwassist_sysctl, "A", "hwassist"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "rxfilter", CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0, hn_rxfilter_sysctl, "A", "rxfilter"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "rss_hash", CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0, hn_rss_hash_sysctl, "A", "RSS hash"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "rss_ind_size", CTLFLAG_RD, &sc->hn_rss_ind_size, 0, "RSS indirect entry count"); #ifndef RSS /* * Don't allow RSS key/indirect table changes, if RSS is defined. */ SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "rss_key", CTLTYPE_OPAQUE | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_rss_key_sysctl, "IU", "RSS key"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "rss_ind", CTLTYPE_OPAQUE | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_rss_ind_sysctl, "IU", "RSS indirect table"); #endif SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rndis_agg_size", CTLFLAG_RD, &sc->hn_rndis_agg_size, 0, "RNDIS offered packet transmission aggregation size limit"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rndis_agg_pkts", CTLFLAG_RD, &sc->hn_rndis_agg_pkts, 0, "RNDIS offered packet transmission aggregation count limit"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rndis_agg_align", CTLFLAG_RD, &sc->hn_rndis_agg_align, 0, "RNDIS packet transmission aggregation alignment"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "agg_size", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_txagg_size_sysctl, "I", "Packet transmission aggregation size, 0 -- disable, -1 -- auto"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "agg_pkts", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_txagg_pkts_sysctl, "I", "Packet transmission aggregation packets, " "0 -- disable, -1 -- auto"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "polling", CTLTYPE_UINT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_polling_sysctl, "I", "Polling frequency: [100,1000000], 0 disable polling"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "vf", CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0, hn_vf_sysctl, "A", "Virtual Function's name"); /* * Setup the ifmedia, which has been initialized earlier. */ ifmedia_add(&sc->hn_media, IFM_ETHER | IFM_AUTO, 0, NULL); ifmedia_set(&sc->hn_media, IFM_ETHER | IFM_AUTO); /* XXX ifmedia_set really should do this for us */ sc->hn_media.ifm_media = sc->hn_media.ifm_cur->ifm_media; /* * Setup the ifnet for this interface. */ ifp->if_baudrate = IF_Gbps(10); ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST; ifp->if_ioctl = hn_ioctl; ifp->if_init = hn_init; #ifdef HN_IFSTART_SUPPORT if (hn_use_if_start) { int qdepth = hn_get_txswq_depth(&sc->hn_tx_ring[0]); ifp->if_start = hn_start; IFQ_SET_MAXLEN(&ifp->if_snd, qdepth); ifp->if_snd.ifq_drv_maxlen = qdepth - 1; IFQ_SET_READY(&ifp->if_snd); } else #endif { ifp->if_transmit = hn_transmit; ifp->if_qflush = hn_xmit_qflush; } ifp->if_capabilities |= IFCAP_RXCSUM | IFCAP_LRO; #ifdef foo /* We can't diff IPv6 packets from IPv4 packets on RX path. */ ifp->if_capabilities |= IFCAP_RXCSUM_IPV6; #endif if (sc->hn_caps & HN_CAP_VLAN) { /* XXX not sure about VLAN_MTU. */ ifp->if_capabilities |= IFCAP_VLAN_HWTAGGING | IFCAP_VLAN_MTU; } ifp->if_hwassist = sc->hn_tx_ring[0].hn_csum_assist; if (ifp->if_hwassist & HN_CSUM_IP_MASK) ifp->if_capabilities |= IFCAP_TXCSUM; if (ifp->if_hwassist & HN_CSUM_IP6_MASK) ifp->if_capabilities |= IFCAP_TXCSUM_IPV6; if (sc->hn_caps & HN_CAP_TSO4) { ifp->if_capabilities |= IFCAP_TSO4; ifp->if_hwassist |= CSUM_IP_TSO; } if (sc->hn_caps & HN_CAP_TSO6) { ifp->if_capabilities |= IFCAP_TSO6; ifp->if_hwassist |= CSUM_IP6_TSO; } /* Enable all available capabilities by default. */ ifp->if_capenable = ifp->if_capabilities; /* * Disable IPv6 TSO and TXCSUM by default, they still can * be enabled through SIOCSIFCAP. */ ifp->if_capenable &= ~(IFCAP_TXCSUM_IPV6 | IFCAP_TSO6); ifp->if_hwassist &= ~(HN_CSUM_IP6_MASK | CSUM_IP6_TSO); if (ifp->if_capabilities & (IFCAP_TSO6 | IFCAP_TSO4)) { hn_set_tso_maxsize(sc, hn_tso_maxlen, ETHERMTU); ifp->if_hw_tsomaxsegcount = HN_TX_DATA_SEGCNT_MAX; ifp->if_hw_tsomaxsegsize = PAGE_SIZE; } ether_ifattach(ifp, eaddr); if ((ifp->if_capabilities & (IFCAP_TSO6 | IFCAP_TSO4)) && bootverbose) { if_printf(ifp, "TSO segcnt %u segsz %u\n", ifp->if_hw_tsomaxsegcount, ifp->if_hw_tsomaxsegsize); } /* Inform the upper layer about the long frame support. */ ifp->if_hdrlen = sizeof(struct ether_vlan_header); /* * Kick off link status check. */ sc->hn_mgmt_taskq = sc->hn_mgmt_taskq0; hn_update_link_status(sc); sc->hn_ifnet_evthand = EVENTHANDLER_REGISTER(ifnet_event, hn_ifnet_event, sc, EVENTHANDLER_PRI_ANY); sc->hn_ifaddr_evthand = EVENTHANDLER_REGISTER(ifaddr_event, hn_ifaddr_event, sc, EVENTHANDLER_PRI_ANY); return (0); failed: if (sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) hn_synth_detach(sc); hn_detach(dev); return (error); } static int hn_detach(device_t dev) { struct hn_softc *sc = device_get_softc(dev); struct ifnet *ifp = sc->hn_ifp; if (sc->hn_ifaddr_evthand != NULL) EVENTHANDLER_DEREGISTER(ifaddr_event, sc->hn_ifaddr_evthand); if (sc->hn_ifnet_evthand != NULL) EVENTHANDLER_DEREGISTER(ifnet_event, sc->hn_ifnet_evthand); if (sc->hn_xact != NULL && vmbus_chan_is_revoked(sc->hn_prichan)) { /* * In case that the vmbus missed the orphan handler * installation. */ vmbus_xact_ctx_orphan(sc->hn_xact); } if (device_is_attached(dev)) { HN_LOCK(sc); if (sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) { if (ifp->if_drv_flags & IFF_DRV_RUNNING) hn_stop(sc, true); /* * NOTE: * hn_stop() only suspends data, so managment * stuffs have to be suspended manually here. */ hn_suspend_mgmt(sc); hn_synth_detach(sc); } HN_UNLOCK(sc); ether_ifdetach(ifp); } ifmedia_removeall(&sc->hn_media); hn_destroy_rx_data(sc); hn_destroy_tx_data(sc); if (sc->hn_tx_taskqs != NULL && sc->hn_tx_taskqs != hn_tx_taskque) { int i; for (i = 0; i < hn_tx_taskq_cnt; ++i) taskqueue_free(sc->hn_tx_taskqs[i]); free(sc->hn_tx_taskqs, M_DEVBUF); } taskqueue_free(sc->hn_mgmt_taskq0); if (sc->hn_xact != NULL) { /* * Uninstall the orphan handler _before_ the xact is * destructed. */ vmbus_chan_unset_orphan(sc->hn_prichan); vmbus_xact_ctx_destroy(sc->hn_xact); } if_free(ifp); HN_LOCK_DESTROY(sc); return (0); } static int hn_shutdown(device_t dev) { return (0); } static void hn_link_status(struct hn_softc *sc) { uint32_t link_status; int error; error = hn_rndis_get_linkstatus(sc, &link_status); if (error) { /* XXX what to do? */ return; } if (link_status == NDIS_MEDIA_STATE_CONNECTED) sc->hn_link_flags |= HN_LINK_FLAG_LINKUP; else sc->hn_link_flags &= ~HN_LINK_FLAG_LINKUP; if_link_state_change(sc->hn_ifp, (sc->hn_link_flags & HN_LINK_FLAG_LINKUP) ? LINK_STATE_UP : LINK_STATE_DOWN); } static void hn_link_taskfunc(void *xsc, int pending __unused) { struct hn_softc *sc = xsc; if (sc->hn_link_flags & HN_LINK_FLAG_NETCHG) return; hn_link_status(sc); } static void hn_netchg_init_taskfunc(void *xsc, int pending __unused) { struct hn_softc *sc = xsc; /* Prevent any link status checks from running. */ sc->hn_link_flags |= HN_LINK_FLAG_NETCHG; /* * Fake up a [link down --> link up] state change; 5 seconds * delay is used, which closely simulates miibus reaction * upon link down event. */ sc->hn_link_flags &= ~HN_LINK_FLAG_LINKUP; if_link_state_change(sc->hn_ifp, LINK_STATE_DOWN); taskqueue_enqueue_timeout(sc->hn_mgmt_taskq0, &sc->hn_netchg_status, 5 * hz); } static void hn_netchg_status_taskfunc(void *xsc, int pending __unused) { struct hn_softc *sc = xsc; /* Re-allow link status checks. */ sc->hn_link_flags &= ~HN_LINK_FLAG_NETCHG; hn_link_status(sc); } static void hn_update_link_status(struct hn_softc *sc) { if (sc->hn_mgmt_taskq != NULL) taskqueue_enqueue(sc->hn_mgmt_taskq, &sc->hn_link_task); } static void hn_change_network(struct hn_softc *sc) { if (sc->hn_mgmt_taskq != NULL) taskqueue_enqueue(sc->hn_mgmt_taskq, &sc->hn_netchg_init); } static __inline int hn_txdesc_dmamap_load(struct hn_tx_ring *txr, struct hn_txdesc *txd, struct mbuf **m_head, bus_dma_segment_t *segs, int *nsegs) { struct mbuf *m = *m_head; int error; KASSERT(txd->chim_index == HN_NVS_CHIM_IDX_INVALID, ("txd uses chim")); error = bus_dmamap_load_mbuf_sg(txr->hn_tx_data_dtag, txd->data_dmap, m, segs, nsegs, BUS_DMA_NOWAIT); if (error == EFBIG) { struct mbuf *m_new; m_new = m_collapse(m, M_NOWAIT, HN_TX_DATA_SEGCNT_MAX); if (m_new == NULL) return ENOBUFS; else *m_head = m = m_new; txr->hn_tx_collapsed++; error = bus_dmamap_load_mbuf_sg(txr->hn_tx_data_dtag, txd->data_dmap, m, segs, nsegs, BUS_DMA_NOWAIT); } if (!error) { bus_dmamap_sync(txr->hn_tx_data_dtag, txd->data_dmap, BUS_DMASYNC_PREWRITE); txd->flags |= HN_TXD_FLAG_DMAMAP; } return error; } static __inline int hn_txdesc_put(struct hn_tx_ring *txr, struct hn_txdesc *txd) { KASSERT((txd->flags & HN_TXD_FLAG_ONLIST) == 0, ("put an onlist txd %#x", txd->flags)); KASSERT((txd->flags & HN_TXD_FLAG_ONAGG) == 0, ("put an onagg txd %#x", txd->flags)); KASSERT(txd->refs > 0, ("invalid txd refs %d", txd->refs)); if (atomic_fetchadd_int(&txd->refs, -1) != 1) return 0; if (!STAILQ_EMPTY(&txd->agg_list)) { struct hn_txdesc *tmp_txd; while ((tmp_txd = STAILQ_FIRST(&txd->agg_list)) != NULL) { int freed; KASSERT(STAILQ_EMPTY(&tmp_txd->agg_list), ("resursive aggregation on aggregated txdesc")); KASSERT((tmp_txd->flags & HN_TXD_FLAG_ONAGG), ("not aggregated txdesc")); KASSERT((tmp_txd->flags & HN_TXD_FLAG_DMAMAP) == 0, ("aggregated txdesc uses dmamap")); KASSERT(tmp_txd->chim_index == HN_NVS_CHIM_IDX_INVALID, ("aggregated txdesc consumes " "chimney sending buffer")); KASSERT(tmp_txd->chim_size == 0, ("aggregated txdesc has non-zero " "chimney sending size")); STAILQ_REMOVE_HEAD(&txd->agg_list, agg_link); tmp_txd->flags &= ~HN_TXD_FLAG_ONAGG; freed = hn_txdesc_put(txr, tmp_txd); KASSERT(freed, ("failed to free aggregated txdesc")); } } if (txd->chim_index != HN_NVS_CHIM_IDX_INVALID) { KASSERT((txd->flags & HN_TXD_FLAG_DMAMAP) == 0, ("chim txd uses dmamap")); hn_chim_free(txr->hn_sc, txd->chim_index); txd->chim_index = HN_NVS_CHIM_IDX_INVALID; txd->chim_size = 0; } else if (txd->flags & HN_TXD_FLAG_DMAMAP) { bus_dmamap_sync(txr->hn_tx_data_dtag, txd->data_dmap, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txr->hn_tx_data_dtag, txd->data_dmap); txd->flags &= ~HN_TXD_FLAG_DMAMAP; } if (txd->m != NULL) { m_freem(txd->m); txd->m = NULL; } txd->flags |= HN_TXD_FLAG_ONLIST; #ifndef HN_USE_TXDESC_BUFRING mtx_lock_spin(&txr->hn_txlist_spin); KASSERT(txr->hn_txdesc_avail >= 0 && txr->hn_txdesc_avail < txr->hn_txdesc_cnt, ("txdesc_put: invalid txd avail %d", txr->hn_txdesc_avail)); txr->hn_txdesc_avail++; SLIST_INSERT_HEAD(&txr->hn_txlist, txd, link); mtx_unlock_spin(&txr->hn_txlist_spin); #else /* HN_USE_TXDESC_BUFRING */ #ifdef HN_DEBUG atomic_add_int(&txr->hn_txdesc_avail, 1); #endif buf_ring_enqueue(txr->hn_txdesc_br, txd); #endif /* !HN_USE_TXDESC_BUFRING */ return 1; } static __inline struct hn_txdesc * hn_txdesc_get(struct hn_tx_ring *txr) { struct hn_txdesc *txd; #ifndef HN_USE_TXDESC_BUFRING mtx_lock_spin(&txr->hn_txlist_spin); txd = SLIST_FIRST(&txr->hn_txlist); if (txd != NULL) { KASSERT(txr->hn_txdesc_avail > 0, ("txdesc_get: invalid txd avail %d", txr->hn_txdesc_avail)); txr->hn_txdesc_avail--; SLIST_REMOVE_HEAD(&txr->hn_txlist, link); } mtx_unlock_spin(&txr->hn_txlist_spin); #else txd = buf_ring_dequeue_sc(txr->hn_txdesc_br); #endif if (txd != NULL) { #ifdef HN_USE_TXDESC_BUFRING #ifdef HN_DEBUG atomic_subtract_int(&txr->hn_txdesc_avail, 1); #endif #endif /* HN_USE_TXDESC_BUFRING */ KASSERT(txd->m == NULL && txd->refs == 0 && STAILQ_EMPTY(&txd->agg_list) && txd->chim_index == HN_NVS_CHIM_IDX_INVALID && txd->chim_size == 0 && (txd->flags & HN_TXD_FLAG_ONLIST) && (txd->flags & HN_TXD_FLAG_ONAGG) == 0 && (txd->flags & HN_TXD_FLAG_DMAMAP) == 0, ("invalid txd")); txd->flags &= ~HN_TXD_FLAG_ONLIST; txd->refs = 1; } return txd; } static __inline void hn_txdesc_hold(struct hn_txdesc *txd) { /* 0->1 transition will never work */ KASSERT(txd->refs > 0, ("invalid txd refs %d", txd->refs)); atomic_add_int(&txd->refs, 1); } static __inline void hn_txdesc_agg(struct hn_txdesc *agg_txd, struct hn_txdesc *txd) { KASSERT((agg_txd->flags & HN_TXD_FLAG_ONAGG) == 0, ("recursive aggregation on aggregating txdesc")); KASSERT((txd->flags & HN_TXD_FLAG_ONAGG) == 0, ("already aggregated")); KASSERT(STAILQ_EMPTY(&txd->agg_list), ("recursive aggregation on to-be-aggregated txdesc")); txd->flags |= HN_TXD_FLAG_ONAGG; STAILQ_INSERT_TAIL(&agg_txd->agg_list, txd, agg_link); } static bool hn_tx_ring_pending(struct hn_tx_ring *txr) { bool pending = false; #ifndef HN_USE_TXDESC_BUFRING mtx_lock_spin(&txr->hn_txlist_spin); if (txr->hn_txdesc_avail != txr->hn_txdesc_cnt) pending = true; mtx_unlock_spin(&txr->hn_txlist_spin); #else if (!buf_ring_full(txr->hn_txdesc_br)) pending = true; #endif return (pending); } static __inline void hn_txeof(struct hn_tx_ring *txr) { txr->hn_has_txeof = 0; txr->hn_txeof(txr); } static void hn_txpkt_done(struct hn_nvs_sendctx *sndc, struct hn_softc *sc, struct vmbus_channel *chan, const void *data __unused, int dlen __unused) { struct hn_txdesc *txd = sndc->hn_cbarg; struct hn_tx_ring *txr; txr = txd->txr; KASSERT(txr->hn_chan == chan, ("channel mismatch, on chan%u, should be chan%u", vmbus_chan_id(chan), vmbus_chan_id(txr->hn_chan))); txr->hn_has_txeof = 1; hn_txdesc_put(txr, txd); ++txr->hn_txdone_cnt; if (txr->hn_txdone_cnt >= HN_EARLY_TXEOF_THRESH) { txr->hn_txdone_cnt = 0; if (txr->hn_oactive) hn_txeof(txr); } } static void hn_chan_rollup(struct hn_rx_ring *rxr, struct hn_tx_ring *txr) { #if defined(INET) || defined(INET6) tcp_lro_flush_all(&rxr->hn_lro); #endif /* * NOTE: * 'txr' could be NULL, if multiple channels and * ifnet.if_start method are enabled. */ if (txr == NULL || !txr->hn_has_txeof) return; txr->hn_txdone_cnt = 0; hn_txeof(txr); } static __inline uint32_t hn_rndis_pktmsg_offset(uint32_t ofs) { KASSERT(ofs >= sizeof(struct rndis_packet_msg), ("invalid RNDIS packet msg offset %u", ofs)); return (ofs - __offsetof(struct rndis_packet_msg, rm_dataoffset)); } static __inline void * hn_rndis_pktinfo_append(struct rndis_packet_msg *pkt, size_t pktsize, size_t pi_dlen, uint32_t pi_type) { const size_t pi_size = HN_RNDIS_PKTINFO_SIZE(pi_dlen); struct rndis_pktinfo *pi; KASSERT((pi_size & RNDIS_PACKET_MSG_OFFSET_ALIGNMASK) == 0, ("unaligned pktinfo size %zu, pktinfo dlen %zu", pi_size, pi_dlen)); /* * Per-packet-info does not move; it only grows. * * NOTE: * rm_pktinfooffset in this phase counts from the beginning * of rndis_packet_msg. */ KASSERT(pkt->rm_pktinfooffset + pkt->rm_pktinfolen + pi_size <= pktsize, ("%u pktinfo overflows RNDIS packet msg", pi_type)); pi = (struct rndis_pktinfo *)((uint8_t *)pkt + pkt->rm_pktinfooffset + pkt->rm_pktinfolen); pkt->rm_pktinfolen += pi_size; pi->rm_size = pi_size; pi->rm_type = pi_type; pi->rm_pktinfooffset = RNDIS_PKTINFO_OFFSET; - /* Update RNDIS packet msg length */ - pkt->rm_len += pi_size; - return (pi->rm_data); } static __inline int hn_flush_txagg(struct ifnet *ifp, struct hn_tx_ring *txr) { struct hn_txdesc *txd; struct mbuf *m; int error, pkts; txd = txr->hn_agg_txd; KASSERT(txd != NULL, ("no aggregate txdesc")); /* * Since hn_txpkt() will reset this temporary stat, save * it now, so that oerrors can be updated properly, if * hn_txpkt() ever fails. */ pkts = txr->hn_stat_pkts; /* * Since txd's mbuf will _not_ be freed upon hn_txpkt() * failure, save it for later freeing, if hn_txpkt() ever * fails. */ m = txd->m; error = hn_txpkt(ifp, txr, txd); if (__predict_false(error)) { /* txd is freed, but m is not. */ m_freem(m); txr->hn_flush_failed++; if_inc_counter(ifp, IFCOUNTER_OERRORS, pkts); } /* Reset all aggregation states. */ txr->hn_agg_txd = NULL; txr->hn_agg_szleft = 0; txr->hn_agg_pktleft = 0; txr->hn_agg_prevpkt = NULL; return (error); } static void * hn_try_txagg(struct ifnet *ifp, struct hn_tx_ring *txr, struct hn_txdesc *txd, int pktsize) { void *chim; if (txr->hn_agg_txd != NULL) { if (txr->hn_agg_pktleft >= 1 && txr->hn_agg_szleft > pktsize) { struct hn_txdesc *agg_txd = txr->hn_agg_txd; struct rndis_packet_msg *pkt = txr->hn_agg_prevpkt; int olen; /* * Update the previous RNDIS packet's total length, * it can be increased due to the mandatory alignment * padding for this RNDIS packet. And update the * aggregating txdesc's chimney sending buffer size * accordingly. * * XXX * Zero-out the padding, as required by the RNDIS spec. */ olen = pkt->rm_len; pkt->rm_len = roundup2(olen, txr->hn_agg_align); agg_txd->chim_size += pkt->rm_len - olen; /* Link this txdesc to the parent. */ hn_txdesc_agg(agg_txd, txd); chim = (uint8_t *)pkt + pkt->rm_len; /* Save the current packet for later fixup. */ txr->hn_agg_prevpkt = chim; txr->hn_agg_pktleft--; txr->hn_agg_szleft -= pktsize; if (txr->hn_agg_szleft <= HN_PKTSIZE_MIN(txr->hn_agg_align)) { /* * Probably can't aggregate more packets, * flush this aggregating txdesc proactively. */ txr->hn_agg_pktleft = 0; } /* Done! */ return (chim); } hn_flush_txagg(ifp, txr); } KASSERT(txr->hn_agg_txd == NULL, ("lingering aggregating txdesc")); txr->hn_tx_chimney_tried++; txd->chim_index = hn_chim_alloc(txr->hn_sc); if (txd->chim_index == HN_NVS_CHIM_IDX_INVALID) return (NULL); txr->hn_tx_chimney++; chim = txr->hn_sc->hn_chim + (txd->chim_index * txr->hn_sc->hn_chim_szmax); if (txr->hn_agg_pktmax > 1 && txr->hn_agg_szmax > pktsize + HN_PKTSIZE_MIN(txr->hn_agg_align)) { txr->hn_agg_txd = txd; txr->hn_agg_pktleft = txr->hn_agg_pktmax - 1; txr->hn_agg_szleft = txr->hn_agg_szmax - pktsize; txr->hn_agg_prevpkt = chim; } return (chim); } /* * NOTE: * If this function fails, then both txd and m_head0 will be freed. */ static int hn_encap(struct ifnet *ifp, struct hn_tx_ring *txr, struct hn_txdesc *txd, struct mbuf **m_head0) { bus_dma_segment_t segs[HN_TX_DATA_SEGCNT_MAX]; int error, nsegs, i; struct mbuf *m_head = *m_head0; struct rndis_packet_msg *pkt; uint32_t *pi_data; void *chim = NULL; int pkt_hlen, pkt_size; pkt = txd->rndis_pkt; pkt_size = HN_PKTSIZE(m_head, txr->hn_agg_align); if (pkt_size < txr->hn_chim_size) { chim = hn_try_txagg(ifp, txr, txd, pkt_size); if (chim != NULL) pkt = chim; } else { if (txr->hn_agg_txd != NULL) hn_flush_txagg(ifp, txr); } pkt->rm_type = REMOTE_NDIS_PACKET_MSG; - pkt->rm_len = sizeof(*pkt) + m_head->m_pkthdr.len; + pkt->rm_len = m_head->m_pkthdr.len; pkt->rm_dataoffset = 0; pkt->rm_datalen = m_head->m_pkthdr.len; pkt->rm_oobdataoffset = 0; pkt->rm_oobdatalen = 0; pkt->rm_oobdataelements = 0; pkt->rm_pktinfooffset = sizeof(*pkt); pkt->rm_pktinfolen = 0; pkt->rm_vchandle = 0; pkt->rm_reserved = 0; if (txr->hn_tx_flags & HN_TX_FLAG_HASHVAL) { /* * Set the hash value for this packet, so that the host could * dispatch the TX done event for this packet back to this TX * ring's channel. */ pi_data = hn_rndis_pktinfo_append(pkt, HN_RNDIS_PKT_LEN, HN_NDIS_HASH_VALUE_SIZE, HN_NDIS_PKTINFO_TYPE_HASHVAL); *pi_data = txr->hn_tx_idx; } if (m_head->m_flags & M_VLANTAG) { pi_data = hn_rndis_pktinfo_append(pkt, HN_RNDIS_PKT_LEN, NDIS_VLAN_INFO_SIZE, NDIS_PKTINFO_TYPE_VLAN); *pi_data = NDIS_VLAN_INFO_MAKE( EVL_VLANOFTAG(m_head->m_pkthdr.ether_vtag), EVL_PRIOFTAG(m_head->m_pkthdr.ether_vtag), EVL_CFIOFTAG(m_head->m_pkthdr.ether_vtag)); } if (m_head->m_pkthdr.csum_flags & CSUM_TSO) { #if defined(INET6) || defined(INET) pi_data = hn_rndis_pktinfo_append(pkt, HN_RNDIS_PKT_LEN, NDIS_LSO2_INFO_SIZE, NDIS_PKTINFO_TYPE_LSO); #ifdef INET if (m_head->m_pkthdr.csum_flags & CSUM_IP_TSO) { *pi_data = NDIS_LSO2_INFO_MAKEIPV4(0, m_head->m_pkthdr.tso_segsz); } #endif #if defined(INET6) && defined(INET) else #endif #ifdef INET6 { *pi_data = NDIS_LSO2_INFO_MAKEIPV6(0, m_head->m_pkthdr.tso_segsz); } #endif #endif /* INET6 || INET */ } else if (m_head->m_pkthdr.csum_flags & txr->hn_csum_assist) { pi_data = hn_rndis_pktinfo_append(pkt, HN_RNDIS_PKT_LEN, NDIS_TXCSUM_INFO_SIZE, NDIS_PKTINFO_TYPE_CSUM); if (m_head->m_pkthdr.csum_flags & (CSUM_IP6_TCP | CSUM_IP6_UDP)) { *pi_data = NDIS_TXCSUM_INFO_IPV6; } else { *pi_data = NDIS_TXCSUM_INFO_IPV4; if (m_head->m_pkthdr.csum_flags & CSUM_IP) *pi_data |= NDIS_TXCSUM_INFO_IPCS; } if (m_head->m_pkthdr.csum_flags & (CSUM_IP_TCP | CSUM_IP6_TCP)) *pi_data |= NDIS_TXCSUM_INFO_TCPCS; else if (m_head->m_pkthdr.csum_flags & (CSUM_IP_UDP | CSUM_IP6_UDP)) *pi_data |= NDIS_TXCSUM_INFO_UDPCS; } pkt_hlen = pkt->rm_pktinfooffset + pkt->rm_pktinfolen; + /* Fixup RNDIS packet message total length */ + pkt->rm_len += pkt_hlen; /* Convert RNDIS packet message offsets */ pkt->rm_dataoffset = hn_rndis_pktmsg_offset(pkt_hlen); pkt->rm_pktinfooffset = hn_rndis_pktmsg_offset(pkt->rm_pktinfooffset); /* * Fast path: Chimney sending. */ if (chim != NULL) { struct hn_txdesc *tgt_txd = txd; if (txr->hn_agg_txd != NULL) { tgt_txd = txr->hn_agg_txd; #ifdef INVARIANTS *m_head0 = NULL; #endif } KASSERT(pkt == chim, ("RNDIS pkt not in chimney sending buffer")); KASSERT(tgt_txd->chim_index != HN_NVS_CHIM_IDX_INVALID, ("chimney sending buffer is not used")); tgt_txd->chim_size += pkt->rm_len; m_copydata(m_head, 0, m_head->m_pkthdr.len, ((uint8_t *)chim) + pkt_hlen); txr->hn_gpa_cnt = 0; txr->hn_sendpkt = hn_txpkt_chim; goto done; } KASSERT(txr->hn_agg_txd == NULL, ("aggregating sglist txdesc")); KASSERT(txd->chim_index == HN_NVS_CHIM_IDX_INVALID, ("chimney buffer is used")); KASSERT(pkt == txd->rndis_pkt, ("RNDIS pkt not in txdesc")); error = hn_txdesc_dmamap_load(txr, txd, &m_head, segs, &nsegs); if (__predict_false(error)) { int freed; /* * This mbuf is not linked w/ the txd yet, so free it now. */ m_freem(m_head); *m_head0 = NULL; freed = hn_txdesc_put(txr, txd); KASSERT(freed != 0, ("fail to free txd upon txdma error")); txr->hn_txdma_failed++; if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); return error; } *m_head0 = m_head; /* +1 RNDIS packet message */ txr->hn_gpa_cnt = nsegs + 1; /* send packet with page buffer */ txr->hn_gpa[0].gpa_page = atop(txd->rndis_pkt_paddr); txr->hn_gpa[0].gpa_ofs = txd->rndis_pkt_paddr & PAGE_MASK; txr->hn_gpa[0].gpa_len = pkt_hlen; /* * Fill the page buffers with mbuf info after the page * buffer for RNDIS packet message. */ for (i = 0; i < nsegs; ++i) { struct vmbus_gpa *gpa = &txr->hn_gpa[i + 1]; gpa->gpa_page = atop(segs[i].ds_addr); gpa->gpa_ofs = segs[i].ds_addr & PAGE_MASK; gpa->gpa_len = segs[i].ds_len; } txd->chim_index = HN_NVS_CHIM_IDX_INVALID; txd->chim_size = 0; txr->hn_sendpkt = hn_txpkt_sglist; done: txd->m = m_head; /* Set the completion routine */ hn_nvs_sendctx_init(&txd->send_ctx, hn_txpkt_done, txd); /* Update temporary stats for later use. */ txr->hn_stat_pkts++; txr->hn_stat_size += m_head->m_pkthdr.len; if (m_head->m_flags & M_MCAST) txr->hn_stat_mcasts++; return 0; } /* * NOTE: * If this function fails, then txd will be freed, but the mbuf * associated w/ the txd will _not_ be freed. */ static int hn_txpkt(struct ifnet *ifp, struct hn_tx_ring *txr, struct hn_txdesc *txd) { int error, send_failed = 0, has_bpf; again: has_bpf = bpf_peers_present(ifp->if_bpf); if (has_bpf) { /* * Make sure that this txd and any aggregated txds are not * freed before ETHER_BPF_MTAP. */ hn_txdesc_hold(txd); } error = txr->hn_sendpkt(txr, txd); if (!error) { if (has_bpf) { const struct hn_txdesc *tmp_txd; ETHER_BPF_MTAP(ifp, txd->m); STAILQ_FOREACH(tmp_txd, &txd->agg_list, agg_link) ETHER_BPF_MTAP(ifp, tmp_txd->m); } if_inc_counter(ifp, IFCOUNTER_OPACKETS, txr->hn_stat_pkts); #ifdef HN_IFSTART_SUPPORT if (!hn_use_if_start) #endif { if_inc_counter(ifp, IFCOUNTER_OBYTES, txr->hn_stat_size); if (txr->hn_stat_mcasts != 0) { if_inc_counter(ifp, IFCOUNTER_OMCASTS, txr->hn_stat_mcasts); } } txr->hn_pkts += txr->hn_stat_pkts; txr->hn_sends++; } if (has_bpf) hn_txdesc_put(txr, txd); if (__predict_false(error)) { int freed; /* * This should "really rarely" happen. * * XXX Too many RX to be acked or too many sideband * commands to run? Ask netvsc_channel_rollup() * to kick start later. */ txr->hn_has_txeof = 1; if (!send_failed) { txr->hn_send_failed++; send_failed = 1; /* * Try sending again after set hn_has_txeof; * in case that we missed the last * netvsc_channel_rollup(). */ goto again; } if_printf(ifp, "send failed\n"); /* * Caller will perform further processing on the * associated mbuf, so don't free it in hn_txdesc_put(); * only unload it from the DMA map in hn_txdesc_put(), * if it was loaded. */ txd->m = NULL; freed = hn_txdesc_put(txr, txd); KASSERT(freed != 0, ("fail to free txd upon send error")); txr->hn_send_failed++; } /* Reset temporary stats, after this sending is done. */ txr->hn_stat_size = 0; txr->hn_stat_pkts = 0; txr->hn_stat_mcasts = 0; return (error); } /* * Append the specified data to the indicated mbuf chain, * Extend the mbuf chain if the new data does not fit in * existing space. * * This is a minor rewrite of m_append() from sys/kern/uipc_mbuf.c. * There should be an equivalent in the kernel mbuf code, * but there does not appear to be one yet. * * Differs from m_append() in that additional mbufs are * allocated with cluster size MJUMPAGESIZE, and filled * accordingly. * * Return 1 if able to complete the job; otherwise 0. */ static int hv_m_append(struct mbuf *m0, int len, c_caddr_t cp) { struct mbuf *m, *n; int remainder, space; for (m = m0; m->m_next != NULL; m = m->m_next) ; remainder = len; space = M_TRAILINGSPACE(m); if (space > 0) { /* * Copy into available space. */ if (space > remainder) space = remainder; bcopy(cp, mtod(m, caddr_t) + m->m_len, space); m->m_len += space; cp += space; remainder -= space; } while (remainder > 0) { /* * Allocate a new mbuf; could check space * and allocate a cluster instead. */ n = m_getjcl(M_NOWAIT, m->m_type, 0, MJUMPAGESIZE); if (n == NULL) break; n->m_len = min(MJUMPAGESIZE, remainder); bcopy(cp, mtod(n, caddr_t), n->m_len); cp += n->m_len; remainder -= n->m_len; m->m_next = n; m = n; } if (m0->m_flags & M_PKTHDR) m0->m_pkthdr.len += len - remainder; return (remainder == 0); } #if defined(INET) || defined(INET6) static __inline int hn_lro_rx(struct lro_ctrl *lc, struct mbuf *m) { #if __FreeBSD_version >= 1100095 if (hn_lro_mbufq_depth) { tcp_lro_queue_mbuf(lc, m); return 0; } #endif return tcp_lro_rx(lc, m, 0); } #endif static int hn_rxpkt(struct hn_rx_ring *rxr, const void *data, int dlen, const struct hn_rxinfo *info) { struct ifnet *ifp; struct mbuf *m_new; int size, do_lro = 0, do_csum = 1; int hash_type; /* If the VF is active, inject the packet through the VF */ ifp = rxr->hn_vf ? rxr->hn_vf : rxr->hn_ifp; if (dlen <= MHLEN) { m_new = m_gethdr(M_NOWAIT, MT_DATA); if (m_new == NULL) { if_inc_counter(ifp, IFCOUNTER_IQDROPS, 1); return (0); } memcpy(mtod(m_new, void *), data, dlen); m_new->m_pkthdr.len = m_new->m_len = dlen; rxr->hn_small_pkts++; } else { /* * Get an mbuf with a cluster. For packets 2K or less, * get a standard 2K cluster. For anything larger, get a * 4K cluster. Any buffers larger than 4K can cause problems * if looped around to the Hyper-V TX channel, so avoid them. */ size = MCLBYTES; if (dlen > MCLBYTES) { /* 4096 */ size = MJUMPAGESIZE; } m_new = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, size); if (m_new == NULL) { if_inc_counter(ifp, IFCOUNTER_IQDROPS, 1); return (0); } hv_m_append(m_new, dlen, data); } m_new->m_pkthdr.rcvif = ifp; if (__predict_false((ifp->if_capenable & IFCAP_RXCSUM) == 0)) do_csum = 0; /* receive side checksum offload */ if (info->csum_info != HN_NDIS_RXCSUM_INFO_INVALID) { /* IP csum offload */ if ((info->csum_info & NDIS_RXCSUM_INFO_IPCS_OK) && do_csum) { m_new->m_pkthdr.csum_flags |= (CSUM_IP_CHECKED | CSUM_IP_VALID); rxr->hn_csum_ip++; } /* TCP/UDP csum offload */ if ((info->csum_info & (NDIS_RXCSUM_INFO_UDPCS_OK | NDIS_RXCSUM_INFO_TCPCS_OK)) && do_csum) { m_new->m_pkthdr.csum_flags |= (CSUM_DATA_VALID | CSUM_PSEUDO_HDR); m_new->m_pkthdr.csum_data = 0xffff; if (info->csum_info & NDIS_RXCSUM_INFO_TCPCS_OK) rxr->hn_csum_tcp++; else rxr->hn_csum_udp++; } /* * XXX * As of this write (Oct 28th, 2016), host side will turn * on only TCPCS_OK and IPCS_OK even for UDP datagrams, so * the do_lro setting here is actually _not_ accurate. We * depend on the RSS hash type check to reset do_lro. */ if ((info->csum_info & (NDIS_RXCSUM_INFO_TCPCS_OK | NDIS_RXCSUM_INFO_IPCS_OK)) == (NDIS_RXCSUM_INFO_TCPCS_OK | NDIS_RXCSUM_INFO_IPCS_OK)) do_lro = 1; } else { const struct ether_header *eh; uint16_t etype; int hoff; hoff = sizeof(*eh); if (m_new->m_len < hoff) goto skip; eh = mtod(m_new, struct ether_header *); etype = ntohs(eh->ether_type); if (etype == ETHERTYPE_VLAN) { const struct ether_vlan_header *evl; hoff = sizeof(*evl); if (m_new->m_len < hoff) goto skip; evl = mtod(m_new, struct ether_vlan_header *); etype = ntohs(evl->evl_proto); } if (etype == ETHERTYPE_IP) { int pr; pr = hn_check_iplen(m_new, hoff); if (pr == IPPROTO_TCP) { if (do_csum && (rxr->hn_trust_hcsum & HN_TRUST_HCSUM_TCP)) { rxr->hn_csum_trusted++; m_new->m_pkthdr.csum_flags |= (CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR); m_new->m_pkthdr.csum_data = 0xffff; } do_lro = 1; } else if (pr == IPPROTO_UDP) { if (do_csum && (rxr->hn_trust_hcsum & HN_TRUST_HCSUM_UDP)) { rxr->hn_csum_trusted++; m_new->m_pkthdr.csum_flags |= (CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR); m_new->m_pkthdr.csum_data = 0xffff; } } else if (pr != IPPROTO_DONE && do_csum && (rxr->hn_trust_hcsum & HN_TRUST_HCSUM_IP)) { rxr->hn_csum_trusted++; m_new->m_pkthdr.csum_flags |= (CSUM_IP_CHECKED | CSUM_IP_VALID); } } } skip: if (info->vlan_info != HN_NDIS_VLAN_INFO_INVALID) { m_new->m_pkthdr.ether_vtag = EVL_MAKETAG( NDIS_VLAN_INFO_ID(info->vlan_info), NDIS_VLAN_INFO_PRI(info->vlan_info), NDIS_VLAN_INFO_CFI(info->vlan_info)); m_new->m_flags |= M_VLANTAG; } if (info->hash_info != HN_NDIS_HASH_INFO_INVALID) { rxr->hn_rss_pkts++; m_new->m_pkthdr.flowid = info->hash_value; hash_type = M_HASHTYPE_OPAQUE_HASH; if ((info->hash_info & NDIS_HASH_FUNCTION_MASK) == NDIS_HASH_FUNCTION_TOEPLITZ) { uint32_t type = (info->hash_info & NDIS_HASH_TYPE_MASK); /* * NOTE: * do_lro is resetted, if the hash types are not TCP * related. See the comment in the above csum_flags * setup section. */ switch (type) { case NDIS_HASH_IPV4: hash_type = M_HASHTYPE_RSS_IPV4; do_lro = 0; break; case NDIS_HASH_TCP_IPV4: hash_type = M_HASHTYPE_RSS_TCP_IPV4; break; case NDIS_HASH_IPV6: hash_type = M_HASHTYPE_RSS_IPV6; do_lro = 0; break; case NDIS_HASH_IPV6_EX: hash_type = M_HASHTYPE_RSS_IPV6_EX; do_lro = 0; break; case NDIS_HASH_TCP_IPV6: hash_type = M_HASHTYPE_RSS_TCP_IPV6; break; case NDIS_HASH_TCP_IPV6_EX: hash_type = M_HASHTYPE_RSS_TCP_IPV6_EX; break; } } } else { m_new->m_pkthdr.flowid = rxr->hn_rx_idx; hash_type = M_HASHTYPE_OPAQUE; } M_HASHTYPE_SET(m_new, hash_type); /* * Note: Moved RX completion back to hv_nv_on_receive() so all * messages (not just data messages) will trigger a response. */ if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1); rxr->hn_pkts++; if ((ifp->if_capenable & IFCAP_LRO) && do_lro) { #if defined(INET) || defined(INET6) struct lro_ctrl *lro = &rxr->hn_lro; if (lro->lro_cnt) { rxr->hn_lro_tried++; if (hn_lro_rx(lro, m_new) == 0) { /* DONE! */ return 0; } } #endif } /* We're not holding the lock here, so don't release it */ (*ifp->if_input)(ifp, m_new); return (0); } static int hn_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data) { struct hn_softc *sc = ifp->if_softc; struct ifreq *ifr = (struct ifreq *)data; int mask, error = 0; switch (cmd) { case SIOCSIFMTU: if (ifr->ifr_mtu > HN_MTU_MAX) { error = EINVAL; break; } HN_LOCK(sc); if ((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0) { HN_UNLOCK(sc); break; } if ((sc->hn_caps & HN_CAP_MTU) == 0) { /* Can't change MTU */ HN_UNLOCK(sc); error = EOPNOTSUPP; break; } if (ifp->if_mtu == ifr->ifr_mtu) { HN_UNLOCK(sc); break; } /* * Suspend this interface before the synthetic parts * are ripped. */ hn_suspend(sc); /* * Detach the synthetics parts, i.e. NVS and RNDIS. */ hn_synth_detach(sc); /* * Reattach the synthetic parts, i.e. NVS and RNDIS, * with the new MTU setting. */ error = hn_synth_attach(sc, ifr->ifr_mtu); if (error) { HN_UNLOCK(sc); break; } /* * Commit the requested MTU, after the synthetic parts * have been successfully attached. */ ifp->if_mtu = ifr->ifr_mtu; /* * Make sure that various parameters based on MTU are * still valid, after the MTU change. */ if (sc->hn_tx_ring[0].hn_chim_size > sc->hn_chim_szmax) hn_set_chim_size(sc, sc->hn_chim_szmax); hn_set_tso_maxsize(sc, hn_tso_maxlen, ifp->if_mtu); #if __FreeBSD_version >= 1100099 if (sc->hn_rx_ring[0].hn_lro.lro_length_lim < HN_LRO_LENLIM_MIN(ifp)) hn_set_lro_lenlim(sc, HN_LRO_LENLIM_MIN(ifp)); #endif /* * All done! Resume the interface now. */ hn_resume(sc); HN_UNLOCK(sc); break; case SIOCSIFFLAGS: HN_LOCK(sc); if ((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0) { HN_UNLOCK(sc); break; } if (ifp->if_flags & IFF_UP) { if (ifp->if_drv_flags & IFF_DRV_RUNNING) { /* * Caller meight hold mutex, e.g. * bpf; use busy-wait for the RNDIS * reply. */ HN_NO_SLEEPING(sc); hn_rxfilter_config(sc); HN_SLEEPING_OK(sc); } else { hn_init_locked(sc); } } else { if (ifp->if_drv_flags & IFF_DRV_RUNNING) hn_stop(sc, false); } sc->hn_if_flags = ifp->if_flags; HN_UNLOCK(sc); break; case SIOCSIFCAP: HN_LOCK(sc); mask = ifr->ifr_reqcap ^ ifp->if_capenable; if (mask & IFCAP_TXCSUM) { ifp->if_capenable ^= IFCAP_TXCSUM; if (ifp->if_capenable & IFCAP_TXCSUM) ifp->if_hwassist |= HN_CSUM_IP_HWASSIST(sc); else ifp->if_hwassist &= ~HN_CSUM_IP_HWASSIST(sc); } if (mask & IFCAP_TXCSUM_IPV6) { ifp->if_capenable ^= IFCAP_TXCSUM_IPV6; if (ifp->if_capenable & IFCAP_TXCSUM_IPV6) ifp->if_hwassist |= HN_CSUM_IP6_HWASSIST(sc); else ifp->if_hwassist &= ~HN_CSUM_IP6_HWASSIST(sc); } /* TODO: flip RNDIS offload parameters for RXCSUM. */ if (mask & IFCAP_RXCSUM) ifp->if_capenable ^= IFCAP_RXCSUM; #ifdef foo /* We can't diff IPv6 packets from IPv4 packets on RX path. */ if (mask & IFCAP_RXCSUM_IPV6) ifp->if_capenable ^= IFCAP_RXCSUM_IPV6; #endif if (mask & IFCAP_LRO) ifp->if_capenable ^= IFCAP_LRO; if (mask & IFCAP_TSO4) { ifp->if_capenable ^= IFCAP_TSO4; if (ifp->if_capenable & IFCAP_TSO4) ifp->if_hwassist |= CSUM_IP_TSO; else ifp->if_hwassist &= ~CSUM_IP_TSO; } if (mask & IFCAP_TSO6) { ifp->if_capenable ^= IFCAP_TSO6; if (ifp->if_capenable & IFCAP_TSO6) ifp->if_hwassist |= CSUM_IP6_TSO; else ifp->if_hwassist &= ~CSUM_IP6_TSO; } HN_UNLOCK(sc); break; case SIOCADDMULTI: case SIOCDELMULTI: HN_LOCK(sc); if ((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0) { HN_UNLOCK(sc); break; } if (ifp->if_drv_flags & IFF_DRV_RUNNING) { /* * Multicast uses mutex; use busy-wait for * the RNDIS reply. */ HN_NO_SLEEPING(sc); hn_rxfilter_config(sc); HN_SLEEPING_OK(sc); } HN_UNLOCK(sc); break; case SIOCSIFMEDIA: case SIOCGIFMEDIA: error = ifmedia_ioctl(ifp, ifr, &sc->hn_media, cmd); break; default: error = ether_ioctl(ifp, cmd, data); break; } return (error); } static void hn_stop(struct hn_softc *sc, bool detaching) { struct ifnet *ifp = sc->hn_ifp; int i; HN_LOCK_ASSERT(sc); KASSERT(sc->hn_flags & HN_FLAG_SYNTH_ATTACHED, ("synthetic parts were not attached")); /* Disable polling. */ hn_polling(sc, 0); /* Clear RUNNING bit _before_ hn_suspend_data() */ atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_RUNNING); hn_suspend_data(sc); /* Clear OACTIVE bit. */ atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); for (i = 0; i < sc->hn_tx_ring_inuse; ++i) sc->hn_tx_ring[i].hn_oactive = 0; /* * If the VF is active, make sure the filter is not 0, even if * the synthetic NIC is down. */ if (!detaching && (sc->hn_flags & HN_FLAG_VF)) hn_rxfilter_config(sc); } static void hn_init_locked(struct hn_softc *sc) { struct ifnet *ifp = sc->hn_ifp; int i; HN_LOCK_ASSERT(sc); if ((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0) return; if (ifp->if_drv_flags & IFF_DRV_RUNNING) return; /* Configure RX filter */ hn_rxfilter_config(sc); /* Clear OACTIVE bit. */ atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); for (i = 0; i < sc->hn_tx_ring_inuse; ++i) sc->hn_tx_ring[i].hn_oactive = 0; /* Clear TX 'suspended' bit. */ hn_resume_tx(sc, sc->hn_tx_ring_inuse); /* Everything is ready; unleash! */ atomic_set_int(&ifp->if_drv_flags, IFF_DRV_RUNNING); /* Re-enable polling if requested. */ if (sc->hn_pollhz > 0) hn_polling(sc, sc->hn_pollhz); } static void hn_init(void *xsc) { struct hn_softc *sc = xsc; HN_LOCK(sc); hn_init_locked(sc); HN_UNLOCK(sc); } #if __FreeBSD_version >= 1100099 static int hn_lro_lenlim_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; unsigned int lenlim; int error; lenlim = sc->hn_rx_ring[0].hn_lro.lro_length_lim; error = sysctl_handle_int(oidp, &lenlim, 0, req); if (error || req->newptr == NULL) return error; HN_LOCK(sc); if (lenlim < HN_LRO_LENLIM_MIN(sc->hn_ifp) || lenlim > TCP_LRO_LENGTH_MAX) { HN_UNLOCK(sc); return EINVAL; } hn_set_lro_lenlim(sc, lenlim); HN_UNLOCK(sc); return 0; } static int hn_lro_ackcnt_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ackcnt, error, i; /* * lro_ackcnt_lim is append count limit, * +1 to turn it into aggregation limit. */ ackcnt = sc->hn_rx_ring[0].hn_lro.lro_ackcnt_lim + 1; error = sysctl_handle_int(oidp, &ackcnt, 0, req); if (error || req->newptr == NULL) return error; if (ackcnt < 2 || ackcnt > (TCP_LRO_ACKCNT_MAX + 1)) return EINVAL; /* * Convert aggregation limit back to append * count limit. */ --ackcnt; HN_LOCK(sc); for (i = 0; i < sc->hn_rx_ring_cnt; ++i) sc->hn_rx_ring[i].hn_lro.lro_ackcnt_lim = ackcnt; HN_UNLOCK(sc); return 0; } #endif static int hn_trust_hcsum_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int hcsum = arg2; int on, error, i; on = 0; if (sc->hn_rx_ring[0].hn_trust_hcsum & hcsum) on = 1; error = sysctl_handle_int(oidp, &on, 0, req); if (error || req->newptr == NULL) return error; HN_LOCK(sc); for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { struct hn_rx_ring *rxr = &sc->hn_rx_ring[i]; if (on) rxr->hn_trust_hcsum |= hcsum; else rxr->hn_trust_hcsum &= ~hcsum; } HN_UNLOCK(sc); return 0; } static int hn_chim_size_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int chim_size, error; chim_size = sc->hn_tx_ring[0].hn_chim_size; error = sysctl_handle_int(oidp, &chim_size, 0, req); if (error || req->newptr == NULL) return error; if (chim_size > sc->hn_chim_szmax || chim_size <= 0) return EINVAL; HN_LOCK(sc); hn_set_chim_size(sc, chim_size); HN_UNLOCK(sc); return 0; } #if __FreeBSD_version < 1100095 static int hn_rx_stat_int_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ofs = arg2, i, error; struct hn_rx_ring *rxr; uint64_t stat; stat = 0; for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { rxr = &sc->hn_rx_ring[i]; stat += *((int *)((uint8_t *)rxr + ofs)); } error = sysctl_handle_64(oidp, &stat, 0, req); if (error || req->newptr == NULL) return error; /* Zero out this stat. */ for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { rxr = &sc->hn_rx_ring[i]; *((int *)((uint8_t *)rxr + ofs)) = 0; } return 0; } #else static int hn_rx_stat_u64_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ofs = arg2, i, error; struct hn_rx_ring *rxr; uint64_t stat; stat = 0; for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { rxr = &sc->hn_rx_ring[i]; stat += *((uint64_t *)((uint8_t *)rxr + ofs)); } error = sysctl_handle_64(oidp, &stat, 0, req); if (error || req->newptr == NULL) return error; /* Zero out this stat. */ for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { rxr = &sc->hn_rx_ring[i]; *((uint64_t *)((uint8_t *)rxr + ofs)) = 0; } return 0; } #endif static int hn_rx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ofs = arg2, i, error; struct hn_rx_ring *rxr; u_long stat; stat = 0; for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { rxr = &sc->hn_rx_ring[i]; stat += *((u_long *)((uint8_t *)rxr + ofs)); } error = sysctl_handle_long(oidp, &stat, 0, req); if (error || req->newptr == NULL) return error; /* Zero out this stat. */ for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { rxr = &sc->hn_rx_ring[i]; *((u_long *)((uint8_t *)rxr + ofs)) = 0; } return 0; } static int hn_tx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ofs = arg2, i, error; struct hn_tx_ring *txr; u_long stat; stat = 0; for (i = 0; i < sc->hn_tx_ring_cnt; ++i) { txr = &sc->hn_tx_ring[i]; stat += *((u_long *)((uint8_t *)txr + ofs)); } error = sysctl_handle_long(oidp, &stat, 0, req); if (error || req->newptr == NULL) return error; /* Zero out this stat. */ for (i = 0; i < sc->hn_tx_ring_cnt; ++i) { txr = &sc->hn_tx_ring[i]; *((u_long *)((uint8_t *)txr + ofs)) = 0; } return 0; } static int hn_tx_conf_int_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ofs = arg2, i, error, conf; struct hn_tx_ring *txr; txr = &sc->hn_tx_ring[0]; conf = *((int *)((uint8_t *)txr + ofs)); error = sysctl_handle_int(oidp, &conf, 0, req); if (error || req->newptr == NULL) return error; HN_LOCK(sc); for (i = 0; i < sc->hn_tx_ring_cnt; ++i) { txr = &sc->hn_tx_ring[i]; *((int *)((uint8_t *)txr + ofs)) = conf; } HN_UNLOCK(sc); return 0; } static int hn_txagg_size_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int error, size; size = sc->hn_agg_size; error = sysctl_handle_int(oidp, &size, 0, req); if (error || req->newptr == NULL) return (error); HN_LOCK(sc); sc->hn_agg_size = size; hn_set_txagg(sc); HN_UNLOCK(sc); return (0); } static int hn_txagg_pkts_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int error, pkts; pkts = sc->hn_agg_pkts; error = sysctl_handle_int(oidp, &pkts, 0, req); if (error || req->newptr == NULL) return (error); HN_LOCK(sc); sc->hn_agg_pkts = pkts; hn_set_txagg(sc); HN_UNLOCK(sc); return (0); } static int hn_txagg_pktmax_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int pkts; pkts = sc->hn_tx_ring[0].hn_agg_pktmax; return (sysctl_handle_int(oidp, &pkts, 0, req)); } static int hn_txagg_align_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int align; align = sc->hn_tx_ring[0].hn_agg_align; return (sysctl_handle_int(oidp, &align, 0, req)); } static void hn_chan_polling(struct vmbus_channel *chan, u_int pollhz) { if (pollhz == 0) vmbus_chan_poll_disable(chan); else vmbus_chan_poll_enable(chan, pollhz); } static void hn_polling(struct hn_softc *sc, u_int pollhz) { int nsubch = sc->hn_rx_ring_inuse - 1; HN_LOCK_ASSERT(sc); if (nsubch > 0) { struct vmbus_channel **subch; int i; subch = vmbus_subchan_get(sc->hn_prichan, nsubch); for (i = 0; i < nsubch; ++i) hn_chan_polling(subch[i], pollhz); vmbus_subchan_rel(subch, nsubch); } hn_chan_polling(sc->hn_prichan, pollhz); } static int hn_polling_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int pollhz, error; pollhz = sc->hn_pollhz; error = sysctl_handle_int(oidp, &pollhz, 0, req); if (error || req->newptr == NULL) return (error); if (pollhz != 0 && (pollhz < VMBUS_CHAN_POLLHZ_MIN || pollhz > VMBUS_CHAN_POLLHZ_MAX)) return (EINVAL); HN_LOCK(sc); if (sc->hn_pollhz != pollhz) { sc->hn_pollhz = pollhz; if ((sc->hn_ifp->if_drv_flags & IFF_DRV_RUNNING) && (sc->hn_flags & HN_FLAG_SYNTH_ATTACHED)) hn_polling(sc, sc->hn_pollhz); } HN_UNLOCK(sc); return (0); } static int hn_ndis_version_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; char verstr[16]; snprintf(verstr, sizeof(verstr), "%u.%u", HN_NDIS_VERSION_MAJOR(sc->hn_ndis_ver), HN_NDIS_VERSION_MINOR(sc->hn_ndis_ver)); return sysctl_handle_string(oidp, verstr, sizeof(verstr), req); } static int hn_caps_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; char caps_str[128]; uint32_t caps; HN_LOCK(sc); caps = sc->hn_caps; HN_UNLOCK(sc); snprintf(caps_str, sizeof(caps_str), "%b", caps, HN_CAP_BITS); return sysctl_handle_string(oidp, caps_str, sizeof(caps_str), req); } static int hn_hwassist_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; char assist_str[128]; uint32_t hwassist; HN_LOCK(sc); hwassist = sc->hn_ifp->if_hwassist; HN_UNLOCK(sc); snprintf(assist_str, sizeof(assist_str), "%b", hwassist, CSUM_BITS); return sysctl_handle_string(oidp, assist_str, sizeof(assist_str), req); } static int hn_rxfilter_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; char filter_str[128]; uint32_t filter; HN_LOCK(sc); filter = sc->hn_rx_filter; HN_UNLOCK(sc); snprintf(filter_str, sizeof(filter_str), "%b", filter, NDIS_PACKET_TYPES); return sysctl_handle_string(oidp, filter_str, sizeof(filter_str), req); } #ifndef RSS static int hn_rss_key_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int error; HN_LOCK(sc); error = SYSCTL_OUT(req, sc->hn_rss.rss_key, sizeof(sc->hn_rss.rss_key)); if (error || req->newptr == NULL) goto back; error = SYSCTL_IN(req, sc->hn_rss.rss_key, sizeof(sc->hn_rss.rss_key)); if (error) goto back; sc->hn_flags |= HN_FLAG_HAS_RSSKEY; if (sc->hn_rx_ring_inuse > 1) { error = hn_rss_reconfig(sc); } else { /* Not RSS capable, at least for now; just save the RSS key. */ error = 0; } back: HN_UNLOCK(sc); return (error); } static int hn_rss_ind_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int error; HN_LOCK(sc); error = SYSCTL_OUT(req, sc->hn_rss.rss_ind, sizeof(sc->hn_rss.rss_ind)); if (error || req->newptr == NULL) goto back; /* * Don't allow RSS indirect table change, if this interface is not * RSS capable currently. */ if (sc->hn_rx_ring_inuse == 1) { error = EOPNOTSUPP; goto back; } error = SYSCTL_IN(req, sc->hn_rss.rss_ind, sizeof(sc->hn_rss.rss_ind)); if (error) goto back; sc->hn_flags |= HN_FLAG_HAS_RSSIND; hn_rss_ind_fixup(sc); error = hn_rss_reconfig(sc); back: HN_UNLOCK(sc); return (error); } #endif /* !RSS */ static int hn_rss_hash_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; char hash_str[128]; uint32_t hash; HN_LOCK(sc); hash = sc->hn_rss_hash; HN_UNLOCK(sc); snprintf(hash_str, sizeof(hash_str), "%b", hash, NDIS_HASH_BITS); return sysctl_handle_string(oidp, hash_str, sizeof(hash_str), req); } static int hn_vf_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; char vf_name[128]; struct ifnet *vf; HN_LOCK(sc); vf_name[0] = '\0'; vf = sc->hn_rx_ring[0].hn_vf; if (vf != NULL) snprintf(vf_name, sizeof(vf_name), "%s", if_name(vf)); HN_UNLOCK(sc); return sysctl_handle_string(oidp, vf_name, sizeof(vf_name), req); } static int hn_check_iplen(const struct mbuf *m, int hoff) { const struct ip *ip; int len, iphlen, iplen; const struct tcphdr *th; int thoff; /* TCP data offset */ len = hoff + sizeof(struct ip); /* The packet must be at least the size of an IP header. */ if (m->m_pkthdr.len < len) return IPPROTO_DONE; /* The fixed IP header must reside completely in the first mbuf. */ if (m->m_len < len) return IPPROTO_DONE; ip = mtodo(m, hoff); /* Bound check the packet's stated IP header length. */ iphlen = ip->ip_hl << 2; if (iphlen < sizeof(struct ip)) /* minimum header length */ return IPPROTO_DONE; /* The full IP header must reside completely in the one mbuf. */ if (m->m_len < hoff + iphlen) return IPPROTO_DONE; iplen = ntohs(ip->ip_len); /* * Check that the amount of data in the buffers is as * at least much as the IP header would have us expect. */ if (m->m_pkthdr.len < hoff + iplen) return IPPROTO_DONE; /* * Ignore IP fragments. */ if (ntohs(ip->ip_off) & (IP_OFFMASK | IP_MF)) return IPPROTO_DONE; /* * The TCP/IP or UDP/IP header must be entirely contained within * the first fragment of a packet. */ switch (ip->ip_p) { case IPPROTO_TCP: if (iplen < iphlen + sizeof(struct tcphdr)) return IPPROTO_DONE; if (m->m_len < hoff + iphlen + sizeof(struct tcphdr)) return IPPROTO_DONE; th = (const struct tcphdr *)((const uint8_t *)ip + iphlen); thoff = th->th_off << 2; if (thoff < sizeof(struct tcphdr) || thoff + iphlen > iplen) return IPPROTO_DONE; if (m->m_len < hoff + iphlen + thoff) return IPPROTO_DONE; break; case IPPROTO_UDP: if (iplen < iphlen + sizeof(struct udphdr)) return IPPROTO_DONE; if (m->m_len < hoff + iphlen + sizeof(struct udphdr)) return IPPROTO_DONE; break; default: if (iplen < iphlen) return IPPROTO_DONE; break; } return ip->ip_p; } static int hn_create_rx_data(struct hn_softc *sc, int ring_cnt) { struct sysctl_oid_list *child; struct sysctl_ctx_list *ctx; device_t dev = sc->hn_dev; #if defined(INET) || defined(INET6) #if __FreeBSD_version >= 1100095 int lroent_cnt; #endif #endif int i; /* * Create RXBUF for reception. * * NOTE: * - It is shared by all channels. * - A large enough buffer is allocated, certain version of NVSes * may further limit the usable space. */ sc->hn_rxbuf = hyperv_dmamem_alloc(bus_get_dma_tag(dev), PAGE_SIZE, 0, HN_RXBUF_SIZE, &sc->hn_rxbuf_dma, BUS_DMA_WAITOK | BUS_DMA_ZERO); if (sc->hn_rxbuf == NULL) { device_printf(sc->hn_dev, "allocate rxbuf failed\n"); return (ENOMEM); } sc->hn_rx_ring_cnt = ring_cnt; sc->hn_rx_ring_inuse = sc->hn_rx_ring_cnt; sc->hn_rx_ring = malloc(sizeof(struct hn_rx_ring) * sc->hn_rx_ring_cnt, M_DEVBUF, M_WAITOK | M_ZERO); #if defined(INET) || defined(INET6) #if __FreeBSD_version >= 1100095 lroent_cnt = hn_lro_entry_count; if (lroent_cnt < TCP_LRO_ENTRIES) lroent_cnt = TCP_LRO_ENTRIES; if (bootverbose) device_printf(dev, "LRO: entry count %d\n", lroent_cnt); #endif #endif /* INET || INET6 */ ctx = device_get_sysctl_ctx(dev); child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev)); /* Create dev.hn.UNIT.rx sysctl tree */ sc->hn_rx_sysctl_tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "rx", CTLFLAG_RD | CTLFLAG_MPSAFE, 0, ""); for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { struct hn_rx_ring *rxr = &sc->hn_rx_ring[i]; rxr->hn_br = hyperv_dmamem_alloc(bus_get_dma_tag(dev), PAGE_SIZE, 0, HN_TXBR_SIZE + HN_RXBR_SIZE, &rxr->hn_br_dma, BUS_DMA_WAITOK); if (rxr->hn_br == NULL) { device_printf(dev, "allocate bufring failed\n"); return (ENOMEM); } if (hn_trust_hosttcp) rxr->hn_trust_hcsum |= HN_TRUST_HCSUM_TCP; if (hn_trust_hostudp) rxr->hn_trust_hcsum |= HN_TRUST_HCSUM_UDP; if (hn_trust_hostip) rxr->hn_trust_hcsum |= HN_TRUST_HCSUM_IP; rxr->hn_ifp = sc->hn_ifp; if (i < sc->hn_tx_ring_cnt) rxr->hn_txr = &sc->hn_tx_ring[i]; rxr->hn_pktbuf_len = HN_PKTBUF_LEN_DEF; rxr->hn_pktbuf = malloc(rxr->hn_pktbuf_len, M_DEVBUF, M_WAITOK); rxr->hn_rx_idx = i; rxr->hn_rxbuf = sc->hn_rxbuf; /* * Initialize LRO. */ #if defined(INET) || defined(INET6) #if __FreeBSD_version >= 1100095 tcp_lro_init_args(&rxr->hn_lro, sc->hn_ifp, lroent_cnt, hn_lro_mbufq_depth); #else tcp_lro_init(&rxr->hn_lro); rxr->hn_lro.ifp = sc->hn_ifp; #endif #if __FreeBSD_version >= 1100099 rxr->hn_lro.lro_length_lim = HN_LRO_LENLIM_DEF; rxr->hn_lro.lro_ackcnt_lim = HN_LRO_ACKCNT_DEF; #endif #endif /* INET || INET6 */ if (sc->hn_rx_sysctl_tree != NULL) { char name[16]; /* * Create per RX ring sysctl tree: * dev.hn.UNIT.rx.RINGID */ snprintf(name, sizeof(name), "%d", i); rxr->hn_rx_sysctl_tree = SYSCTL_ADD_NODE(ctx, SYSCTL_CHILDREN(sc->hn_rx_sysctl_tree), OID_AUTO, name, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, ""); if (rxr->hn_rx_sysctl_tree != NULL) { SYSCTL_ADD_ULONG(ctx, SYSCTL_CHILDREN(rxr->hn_rx_sysctl_tree), OID_AUTO, "packets", CTLFLAG_RW, &rxr->hn_pkts, "# of packets received"); SYSCTL_ADD_ULONG(ctx, SYSCTL_CHILDREN(rxr->hn_rx_sysctl_tree), OID_AUTO, "rss_pkts", CTLFLAG_RW, &rxr->hn_rss_pkts, "# of packets w/ RSS info received"); SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(rxr->hn_rx_sysctl_tree), OID_AUTO, "pktbuf_len", CTLFLAG_RD, &rxr->hn_pktbuf_len, 0, "Temporary channel packet buffer length"); } } } SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_queued", CTLTYPE_U64 | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_lro.lro_queued), #if __FreeBSD_version < 1100095 hn_rx_stat_int_sysctl, #else hn_rx_stat_u64_sysctl, #endif "LU", "LRO queued"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_flushed", CTLTYPE_U64 | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_lro.lro_flushed), #if __FreeBSD_version < 1100095 hn_rx_stat_int_sysctl, #else hn_rx_stat_u64_sysctl, #endif "LU", "LRO flushed"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_tried", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_lro_tried), hn_rx_stat_ulong_sysctl, "LU", "# of LRO tries"); #if __FreeBSD_version >= 1100099 SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_length_lim", CTLTYPE_UINT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_lro_lenlim_sysctl, "IU", "Max # of data bytes to be aggregated by LRO"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_ackcnt_lim", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_lro_ackcnt_sysctl, "I", "Max # of ACKs to be aggregated by LRO"); #endif SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "trust_hosttcp", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, HN_TRUST_HCSUM_TCP, hn_trust_hcsum_sysctl, "I", "Trust tcp segement verification on host side, " "when csum info is missing"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "trust_hostudp", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, HN_TRUST_HCSUM_UDP, hn_trust_hcsum_sysctl, "I", "Trust udp datagram verification on host side, " "when csum info is missing"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "trust_hostip", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, HN_TRUST_HCSUM_IP, hn_trust_hcsum_sysctl, "I", "Trust ip packet verification on host side, " "when csum info is missing"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_ip", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_csum_ip), hn_rx_stat_ulong_sysctl, "LU", "RXCSUM IP"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_tcp", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_csum_tcp), hn_rx_stat_ulong_sysctl, "LU", "RXCSUM TCP"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_udp", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_csum_udp), hn_rx_stat_ulong_sysctl, "LU", "RXCSUM UDP"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_trusted", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_csum_trusted), hn_rx_stat_ulong_sysctl, "LU", "# of packets that we trust host's csum verification"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "small_pkts", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_small_pkts), hn_rx_stat_ulong_sysctl, "LU", "# of small packets received"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "rx_ack_failed", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_ack_failed), hn_rx_stat_ulong_sysctl, "LU", "# of RXBUF ack failures"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "rx_ring_cnt", CTLFLAG_RD, &sc->hn_rx_ring_cnt, 0, "# created RX rings"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "rx_ring_inuse", CTLFLAG_RD, &sc->hn_rx_ring_inuse, 0, "# used RX rings"); return (0); } static void hn_destroy_rx_data(struct hn_softc *sc) { int i; if (sc->hn_rxbuf != NULL) { if ((sc->hn_flags & HN_FLAG_RXBUF_REF) == 0) hyperv_dmamem_free(&sc->hn_rxbuf_dma, sc->hn_rxbuf); else device_printf(sc->hn_dev, "RXBUF is referenced\n"); sc->hn_rxbuf = NULL; } if (sc->hn_rx_ring_cnt == 0) return; for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { struct hn_rx_ring *rxr = &sc->hn_rx_ring[i]; if (rxr->hn_br == NULL) continue; if ((rxr->hn_rx_flags & HN_RX_FLAG_BR_REF) == 0) { hyperv_dmamem_free(&rxr->hn_br_dma, rxr->hn_br); } else { device_printf(sc->hn_dev, "%dth channel bufring is referenced", i); } rxr->hn_br = NULL; #if defined(INET) || defined(INET6) tcp_lro_free(&rxr->hn_lro); #endif free(rxr->hn_pktbuf, M_DEVBUF); } free(sc->hn_rx_ring, M_DEVBUF); sc->hn_rx_ring = NULL; sc->hn_rx_ring_cnt = 0; sc->hn_rx_ring_inuse = 0; } static int hn_tx_ring_create(struct hn_softc *sc, int id) { struct hn_tx_ring *txr = &sc->hn_tx_ring[id]; device_t dev = sc->hn_dev; bus_dma_tag_t parent_dtag; int error, i; txr->hn_sc = sc; txr->hn_tx_idx = id; #ifndef HN_USE_TXDESC_BUFRING mtx_init(&txr->hn_txlist_spin, "hn txlist", NULL, MTX_SPIN); #endif mtx_init(&txr->hn_tx_lock, "hn tx", NULL, MTX_DEF); txr->hn_txdesc_cnt = HN_TX_DESC_CNT; txr->hn_txdesc = malloc(sizeof(struct hn_txdesc) * txr->hn_txdesc_cnt, M_DEVBUF, M_WAITOK | M_ZERO); #ifndef HN_USE_TXDESC_BUFRING SLIST_INIT(&txr->hn_txlist); #else txr->hn_txdesc_br = buf_ring_alloc(txr->hn_txdesc_cnt, M_DEVBUF, M_WAITOK, &txr->hn_tx_lock); #endif if (hn_tx_taskq_mode == HN_TX_TASKQ_M_EVTTQ) { txr->hn_tx_taskq = VMBUS_GET_EVENT_TASKQ( device_get_parent(dev), dev, HN_RING_IDX2CPU(sc, id)); } else { txr->hn_tx_taskq = sc->hn_tx_taskqs[id % hn_tx_taskq_cnt]; } #ifdef HN_IFSTART_SUPPORT if (hn_use_if_start) { txr->hn_txeof = hn_start_txeof; TASK_INIT(&txr->hn_tx_task, 0, hn_start_taskfunc, txr); TASK_INIT(&txr->hn_txeof_task, 0, hn_start_txeof_taskfunc, txr); } else #endif { int br_depth; txr->hn_txeof = hn_xmit_txeof; TASK_INIT(&txr->hn_tx_task, 0, hn_xmit_taskfunc, txr); TASK_INIT(&txr->hn_txeof_task, 0, hn_xmit_txeof_taskfunc, txr); br_depth = hn_get_txswq_depth(txr); txr->hn_mbuf_br = buf_ring_alloc(br_depth, M_DEVBUF, M_WAITOK, &txr->hn_tx_lock); } txr->hn_direct_tx_size = hn_direct_tx_size; /* * Always schedule transmission instead of trying to do direct * transmission. This one gives the best performance so far. */ txr->hn_sched_tx = 1; parent_dtag = bus_get_dma_tag(dev); /* DMA tag for RNDIS packet messages. */ error = bus_dma_tag_create(parent_dtag, /* parent */ HN_RNDIS_PKT_ALIGN, /* alignment */ HN_RNDIS_PKT_BOUNDARY, /* boundary */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ HN_RNDIS_PKT_LEN, /* maxsize */ 1, /* nsegments */ HN_RNDIS_PKT_LEN, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txr->hn_tx_rndis_dtag); if (error) { device_printf(dev, "failed to create rndis dmatag\n"); return error; } /* DMA tag for data. */ error = bus_dma_tag_create(parent_dtag, /* parent */ 1, /* alignment */ HN_TX_DATA_BOUNDARY, /* boundary */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ HN_TX_DATA_MAXSIZE, /* maxsize */ HN_TX_DATA_SEGCNT_MAX, /* nsegments */ HN_TX_DATA_SEGSIZE, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txr->hn_tx_data_dtag); if (error) { device_printf(dev, "failed to create data dmatag\n"); return error; } for (i = 0; i < txr->hn_txdesc_cnt; ++i) { struct hn_txdesc *txd = &txr->hn_txdesc[i]; txd->txr = txr; txd->chim_index = HN_NVS_CHIM_IDX_INVALID; STAILQ_INIT(&txd->agg_list); /* * Allocate and load RNDIS packet message. */ error = bus_dmamem_alloc(txr->hn_tx_rndis_dtag, (void **)&txd->rndis_pkt, BUS_DMA_WAITOK | BUS_DMA_COHERENT | BUS_DMA_ZERO, &txd->rndis_pkt_dmap); if (error) { device_printf(dev, "failed to allocate rndis_packet_msg, %d\n", i); return error; } error = bus_dmamap_load(txr->hn_tx_rndis_dtag, txd->rndis_pkt_dmap, txd->rndis_pkt, HN_RNDIS_PKT_LEN, hyperv_dma_map_paddr, &txd->rndis_pkt_paddr, BUS_DMA_NOWAIT); if (error) { device_printf(dev, "failed to load rndis_packet_msg, %d\n", i); bus_dmamem_free(txr->hn_tx_rndis_dtag, txd->rndis_pkt, txd->rndis_pkt_dmap); return error; } /* DMA map for TX data. */ error = bus_dmamap_create(txr->hn_tx_data_dtag, 0, &txd->data_dmap); if (error) { device_printf(dev, "failed to allocate tx data dmamap\n"); bus_dmamap_unload(txr->hn_tx_rndis_dtag, txd->rndis_pkt_dmap); bus_dmamem_free(txr->hn_tx_rndis_dtag, txd->rndis_pkt, txd->rndis_pkt_dmap); return error; } /* All set, put it to list */ txd->flags |= HN_TXD_FLAG_ONLIST; #ifndef HN_USE_TXDESC_BUFRING SLIST_INSERT_HEAD(&txr->hn_txlist, txd, link); #else buf_ring_enqueue(txr->hn_txdesc_br, txd); #endif } txr->hn_txdesc_avail = txr->hn_txdesc_cnt; if (sc->hn_tx_sysctl_tree != NULL) { struct sysctl_oid_list *child; struct sysctl_ctx_list *ctx; char name[16]; /* * Create per TX ring sysctl tree: * dev.hn.UNIT.tx.RINGID */ ctx = device_get_sysctl_ctx(dev); child = SYSCTL_CHILDREN(sc->hn_tx_sysctl_tree); snprintf(name, sizeof(name), "%d", id); txr->hn_tx_sysctl_tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, name, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, ""); if (txr->hn_tx_sysctl_tree != NULL) { child = SYSCTL_CHILDREN(txr->hn_tx_sysctl_tree); #ifdef HN_DEBUG SYSCTL_ADD_INT(ctx, child, OID_AUTO, "txdesc_avail", CTLFLAG_RD, &txr->hn_txdesc_avail, 0, "# of available TX descs"); #endif #ifdef HN_IFSTART_SUPPORT if (!hn_use_if_start) #endif { SYSCTL_ADD_INT(ctx, child, OID_AUTO, "oactive", CTLFLAG_RD, &txr->hn_oactive, 0, "over active"); } SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "packets", CTLFLAG_RW, &txr->hn_pkts, "# of packets transmitted"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "sends", CTLFLAG_RW, &txr->hn_sends, "# of sends"); } } return 0; } static void hn_txdesc_dmamap_destroy(struct hn_txdesc *txd) { struct hn_tx_ring *txr = txd->txr; KASSERT(txd->m == NULL, ("still has mbuf installed")); KASSERT((txd->flags & HN_TXD_FLAG_DMAMAP) == 0, ("still dma mapped")); bus_dmamap_unload(txr->hn_tx_rndis_dtag, txd->rndis_pkt_dmap); bus_dmamem_free(txr->hn_tx_rndis_dtag, txd->rndis_pkt, txd->rndis_pkt_dmap); bus_dmamap_destroy(txr->hn_tx_data_dtag, txd->data_dmap); } static void hn_txdesc_gc(struct hn_tx_ring *txr, struct hn_txdesc *txd) { KASSERT(txd->refs == 0 || txd->refs == 1, ("invalid txd refs %d", txd->refs)); /* Aggregated txds will be freed by their aggregating txd. */ if (txd->refs > 0 && (txd->flags & HN_TXD_FLAG_ONAGG) == 0) { int freed; freed = hn_txdesc_put(txr, txd); KASSERT(freed, ("can't free txdesc")); } } static void hn_tx_ring_destroy(struct hn_tx_ring *txr) { int i; if (txr->hn_txdesc == NULL) return; /* * NOTE: * Because the freeing of aggregated txds will be deferred * to the aggregating txd, two passes are used here: * - The first pass GCes any pending txds. This GC is necessary, * since if the channels are revoked, hypervisor will not * deliver send-done for all pending txds. * - The second pass frees the busdma stuffs, i.e. after all txds * were freed. */ for (i = 0; i < txr->hn_txdesc_cnt; ++i) hn_txdesc_gc(txr, &txr->hn_txdesc[i]); for (i = 0; i < txr->hn_txdesc_cnt; ++i) hn_txdesc_dmamap_destroy(&txr->hn_txdesc[i]); if (txr->hn_tx_data_dtag != NULL) bus_dma_tag_destroy(txr->hn_tx_data_dtag); if (txr->hn_tx_rndis_dtag != NULL) bus_dma_tag_destroy(txr->hn_tx_rndis_dtag); #ifdef HN_USE_TXDESC_BUFRING buf_ring_free(txr->hn_txdesc_br, M_DEVBUF); #endif free(txr->hn_txdesc, M_DEVBUF); txr->hn_txdesc = NULL; if (txr->hn_mbuf_br != NULL) buf_ring_free(txr->hn_mbuf_br, M_DEVBUF); #ifndef HN_USE_TXDESC_BUFRING mtx_destroy(&txr->hn_txlist_spin); #endif mtx_destroy(&txr->hn_tx_lock); } static int hn_create_tx_data(struct hn_softc *sc, int ring_cnt) { struct sysctl_oid_list *child; struct sysctl_ctx_list *ctx; int i; /* * Create TXBUF for chimney sending. * * NOTE: It is shared by all channels. */ sc->hn_chim = hyperv_dmamem_alloc(bus_get_dma_tag(sc->hn_dev), PAGE_SIZE, 0, HN_CHIM_SIZE, &sc->hn_chim_dma, BUS_DMA_WAITOK | BUS_DMA_ZERO); if (sc->hn_chim == NULL) { device_printf(sc->hn_dev, "allocate txbuf failed\n"); return (ENOMEM); } sc->hn_tx_ring_cnt = ring_cnt; sc->hn_tx_ring_inuse = sc->hn_tx_ring_cnt; sc->hn_tx_ring = malloc(sizeof(struct hn_tx_ring) * sc->hn_tx_ring_cnt, M_DEVBUF, M_WAITOK | M_ZERO); ctx = device_get_sysctl_ctx(sc->hn_dev); child = SYSCTL_CHILDREN(device_get_sysctl_tree(sc->hn_dev)); /* Create dev.hn.UNIT.tx sysctl tree */ sc->hn_tx_sysctl_tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "tx", CTLFLAG_RD | CTLFLAG_MPSAFE, 0, ""); for (i = 0; i < sc->hn_tx_ring_cnt; ++i) { int error; error = hn_tx_ring_create(sc, i); if (error) return error; } SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "no_txdescs", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_no_txdescs), hn_tx_stat_ulong_sysctl, "LU", "# of times short of TX descs"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "send_failed", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_send_failed), hn_tx_stat_ulong_sysctl, "LU", "# of hyper-v sending failure"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "txdma_failed", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_txdma_failed), hn_tx_stat_ulong_sysctl, "LU", "# of TX DMA failure"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "agg_flush_failed", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_flush_failed), hn_tx_stat_ulong_sysctl, "LU", "# of packet transmission aggregation flush failure"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_collapsed", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_tx_collapsed), hn_tx_stat_ulong_sysctl, "LU", "# of TX mbuf collapsed"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_chimney", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_tx_chimney), hn_tx_stat_ulong_sysctl, "LU", "# of chimney send"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_chimney_tried", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_tx_chimney_tried), hn_tx_stat_ulong_sysctl, "LU", "# of chimney send tries"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "txdesc_cnt", CTLFLAG_RD, &sc->hn_tx_ring[0].hn_txdesc_cnt, 0, "# of total TX descs"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "tx_chimney_max", CTLFLAG_RD, &sc->hn_chim_szmax, 0, "Chimney send packet size upper boundary"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_chimney_size", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_chim_size_sysctl, "I", "Chimney send packet size limit"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "direct_tx_size", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_direct_tx_size), hn_tx_conf_int_sysctl, "I", "Size of the packet for direct transmission"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "sched_tx", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_sched_tx), hn_tx_conf_int_sysctl, "I", "Always schedule transmission " "instead of doing direct transmission"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "tx_ring_cnt", CTLFLAG_RD, &sc->hn_tx_ring_cnt, 0, "# created TX rings"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "tx_ring_inuse", CTLFLAG_RD, &sc->hn_tx_ring_inuse, 0, "# used TX rings"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "agg_szmax", CTLFLAG_RD, &sc->hn_tx_ring[0].hn_agg_szmax, 0, "Applied packet transmission aggregation size"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "agg_pktmax", CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0, hn_txagg_pktmax_sysctl, "I", "Applied packet transmission aggregation packets"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "agg_align", CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_MPSAFE, sc, 0, hn_txagg_align_sysctl, "I", "Applied packet transmission aggregation alignment"); return 0; } static void hn_set_chim_size(struct hn_softc *sc, int chim_size) { int i; for (i = 0; i < sc->hn_tx_ring_cnt; ++i) sc->hn_tx_ring[i].hn_chim_size = chim_size; } static void hn_set_tso_maxsize(struct hn_softc *sc, int tso_maxlen, int mtu) { struct ifnet *ifp = sc->hn_ifp; int tso_minlen; if ((ifp->if_capabilities & (IFCAP_TSO4 | IFCAP_TSO6)) == 0) return; KASSERT(sc->hn_ndis_tso_sgmin >= 2, ("invalid NDIS tso sgmin %d", sc->hn_ndis_tso_sgmin)); tso_minlen = sc->hn_ndis_tso_sgmin * mtu; KASSERT(sc->hn_ndis_tso_szmax >= tso_minlen && sc->hn_ndis_tso_szmax <= IP_MAXPACKET, ("invalid NDIS tso szmax %d", sc->hn_ndis_tso_szmax)); if (tso_maxlen < tso_minlen) tso_maxlen = tso_minlen; else if (tso_maxlen > IP_MAXPACKET) tso_maxlen = IP_MAXPACKET; if (tso_maxlen > sc->hn_ndis_tso_szmax) tso_maxlen = sc->hn_ndis_tso_szmax; ifp->if_hw_tsomax = tso_maxlen - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN); if (bootverbose) if_printf(ifp, "TSO size max %u\n", ifp->if_hw_tsomax); } static void hn_fixup_tx_data(struct hn_softc *sc) { uint64_t csum_assist; int i; hn_set_chim_size(sc, sc->hn_chim_szmax); if (hn_tx_chimney_size > 0 && hn_tx_chimney_size < sc->hn_chim_szmax) hn_set_chim_size(sc, hn_tx_chimney_size); csum_assist = 0; if (sc->hn_caps & HN_CAP_IPCS) csum_assist |= CSUM_IP; if (sc->hn_caps & HN_CAP_TCP4CS) csum_assist |= CSUM_IP_TCP; if (sc->hn_caps & HN_CAP_UDP4CS) csum_assist |= CSUM_IP_UDP; if (sc->hn_caps & HN_CAP_TCP6CS) csum_assist |= CSUM_IP6_TCP; if (sc->hn_caps & HN_CAP_UDP6CS) csum_assist |= CSUM_IP6_UDP; for (i = 0; i < sc->hn_tx_ring_cnt; ++i) sc->hn_tx_ring[i].hn_csum_assist = csum_assist; if (sc->hn_caps & HN_CAP_HASHVAL) { /* * Support HASHVAL pktinfo on TX path. */ if (bootverbose) if_printf(sc->hn_ifp, "support HASHVAL pktinfo\n"); for (i = 0; i < sc->hn_tx_ring_cnt; ++i) sc->hn_tx_ring[i].hn_tx_flags |= HN_TX_FLAG_HASHVAL; } } static void hn_destroy_tx_data(struct hn_softc *sc) { int i; if (sc->hn_chim != NULL) { if ((sc->hn_flags & HN_FLAG_CHIM_REF) == 0) { hyperv_dmamem_free(&sc->hn_chim_dma, sc->hn_chim); } else { device_printf(sc->hn_dev, "chimney sending buffer is referenced"); } sc->hn_chim = NULL; } if (sc->hn_tx_ring_cnt == 0) return; for (i = 0; i < sc->hn_tx_ring_cnt; ++i) hn_tx_ring_destroy(&sc->hn_tx_ring[i]); free(sc->hn_tx_ring, M_DEVBUF); sc->hn_tx_ring = NULL; sc->hn_tx_ring_cnt = 0; sc->hn_tx_ring_inuse = 0; } #ifdef HN_IFSTART_SUPPORT static void hn_start_taskfunc(void *xtxr, int pending __unused) { struct hn_tx_ring *txr = xtxr; mtx_lock(&txr->hn_tx_lock); hn_start_locked(txr, 0); mtx_unlock(&txr->hn_tx_lock); } static int hn_start_locked(struct hn_tx_ring *txr, int len) { struct hn_softc *sc = txr->hn_sc; struct ifnet *ifp = sc->hn_ifp; int sched = 0; KASSERT(hn_use_if_start, ("hn_start_locked is called, when if_start is disabled")); KASSERT(txr == &sc->hn_tx_ring[0], ("not the first TX ring")); mtx_assert(&txr->hn_tx_lock, MA_OWNED); KASSERT(txr->hn_agg_txd == NULL, ("lingering aggregating txdesc")); if (__predict_false(txr->hn_suspended)) return (0); if ((ifp->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) != IFF_DRV_RUNNING) return (0); while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) { struct hn_txdesc *txd; struct mbuf *m_head; int error; IFQ_DRV_DEQUEUE(&ifp->if_snd, m_head); if (m_head == NULL) break; if (len > 0 && m_head->m_pkthdr.len > len) { /* * This sending could be time consuming; let callers * dispatch this packet sending (and sending of any * following up packets) to tx taskqueue. */ IFQ_DRV_PREPEND(&ifp->if_snd, m_head); sched = 1; break; } #if defined(INET6) || defined(INET) if (m_head->m_pkthdr.csum_flags & CSUM_TSO) { m_head = hn_tso_fixup(m_head); if (__predict_false(m_head == NULL)) { if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); continue; } } #endif txd = hn_txdesc_get(txr); if (txd == NULL) { txr->hn_no_txdescs++; IFQ_DRV_PREPEND(&ifp->if_snd, m_head); atomic_set_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); break; } error = hn_encap(ifp, txr, txd, &m_head); if (error) { /* Both txd and m_head are freed */ KASSERT(txr->hn_agg_txd == NULL, ("encap failed w/ pending aggregating txdesc")); continue; } if (txr->hn_agg_pktleft == 0) { if (txr->hn_agg_txd != NULL) { KASSERT(m_head == NULL, ("pending mbuf for aggregating txdesc")); error = hn_flush_txagg(ifp, txr); if (__predict_false(error)) { atomic_set_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); break; } } else { KASSERT(m_head != NULL, ("mbuf was freed")); error = hn_txpkt(ifp, txr, txd); if (__predict_false(error)) { /* txd is freed, but m_head is not */ IFQ_DRV_PREPEND(&ifp->if_snd, m_head); atomic_set_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); break; } } } #ifdef INVARIANTS else { KASSERT(txr->hn_agg_txd != NULL, ("no aggregating txdesc")); KASSERT(m_head == NULL, ("pending mbuf for aggregating txdesc")); } #endif } /* Flush pending aggerated transmission. */ if (txr->hn_agg_txd != NULL) hn_flush_txagg(ifp, txr); return (sched); } static void hn_start(struct ifnet *ifp) { struct hn_softc *sc = ifp->if_softc; struct hn_tx_ring *txr = &sc->hn_tx_ring[0]; if (txr->hn_sched_tx) goto do_sched; if (mtx_trylock(&txr->hn_tx_lock)) { int sched; sched = hn_start_locked(txr, txr->hn_direct_tx_size); mtx_unlock(&txr->hn_tx_lock); if (!sched) return; } do_sched: taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_tx_task); } static void hn_start_txeof_taskfunc(void *xtxr, int pending __unused) { struct hn_tx_ring *txr = xtxr; mtx_lock(&txr->hn_tx_lock); atomic_clear_int(&txr->hn_sc->hn_ifp->if_drv_flags, IFF_DRV_OACTIVE); hn_start_locked(txr, 0); mtx_unlock(&txr->hn_tx_lock); } static void hn_start_txeof(struct hn_tx_ring *txr) { struct hn_softc *sc = txr->hn_sc; struct ifnet *ifp = sc->hn_ifp; KASSERT(txr == &sc->hn_tx_ring[0], ("not the first TX ring")); if (txr->hn_sched_tx) goto do_sched; if (mtx_trylock(&txr->hn_tx_lock)) { int sched; atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); sched = hn_start_locked(txr, txr->hn_direct_tx_size); mtx_unlock(&txr->hn_tx_lock); if (sched) { taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_tx_task); } } else { do_sched: /* * Release the OACTIVE earlier, with the hope, that * others could catch up. The task will clear the * flag again with the hn_tx_lock to avoid possible * races. */ atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_txeof_task); } } #endif /* HN_IFSTART_SUPPORT */ static int hn_xmit(struct hn_tx_ring *txr, int len) { struct hn_softc *sc = txr->hn_sc; struct ifnet *ifp = sc->hn_ifp; struct mbuf *m_head; int sched = 0; mtx_assert(&txr->hn_tx_lock, MA_OWNED); #ifdef HN_IFSTART_SUPPORT KASSERT(hn_use_if_start == 0, ("hn_xmit is called, when if_start is enabled")); #endif KASSERT(txr->hn_agg_txd == NULL, ("lingering aggregating txdesc")); if (__predict_false(txr->hn_suspended)) return (0); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0 || txr->hn_oactive) return (0); while ((m_head = drbr_peek(ifp, txr->hn_mbuf_br)) != NULL) { struct hn_txdesc *txd; int error; if (len > 0 && m_head->m_pkthdr.len > len) { /* * This sending could be time consuming; let callers * dispatch this packet sending (and sending of any * following up packets) to tx taskqueue. */ drbr_putback(ifp, txr->hn_mbuf_br, m_head); sched = 1; break; } txd = hn_txdesc_get(txr); if (txd == NULL) { txr->hn_no_txdescs++; drbr_putback(ifp, txr->hn_mbuf_br, m_head); txr->hn_oactive = 1; break; } error = hn_encap(ifp, txr, txd, &m_head); if (error) { /* Both txd and m_head are freed; discard */ KASSERT(txr->hn_agg_txd == NULL, ("encap failed w/ pending aggregating txdesc")); drbr_advance(ifp, txr->hn_mbuf_br); continue; } if (txr->hn_agg_pktleft == 0) { if (txr->hn_agg_txd != NULL) { KASSERT(m_head == NULL, ("pending mbuf for aggregating txdesc")); error = hn_flush_txagg(ifp, txr); if (__predict_false(error)) { txr->hn_oactive = 1; break; } } else { KASSERT(m_head != NULL, ("mbuf was freed")); error = hn_txpkt(ifp, txr, txd); if (__predict_false(error)) { /* txd is freed, but m_head is not */ drbr_putback(ifp, txr->hn_mbuf_br, m_head); txr->hn_oactive = 1; break; } } } #ifdef INVARIANTS else { KASSERT(txr->hn_agg_txd != NULL, ("no aggregating txdesc")); KASSERT(m_head == NULL, ("pending mbuf for aggregating txdesc")); } #endif /* Sent */ drbr_advance(ifp, txr->hn_mbuf_br); } /* Flush pending aggerated transmission. */ if (txr->hn_agg_txd != NULL) hn_flush_txagg(ifp, txr); return (sched); } static int hn_transmit(struct ifnet *ifp, struct mbuf *m) { struct hn_softc *sc = ifp->if_softc; struct hn_tx_ring *txr; int error, idx = 0; #if defined(INET6) || defined(INET) /* * Perform TSO packet header fixup now, since the TSO * packet header should be cache-hot. */ if (m->m_pkthdr.csum_flags & CSUM_TSO) { m = hn_tso_fixup(m); if (__predict_false(m == NULL)) { if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); return EIO; } } #endif /* * Select the TX ring based on flowid */ if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) { #ifdef RSS uint32_t bid; if (rss_hash2bucket(m->m_pkthdr.flowid, M_HASHTYPE_GET(m), &bid) == 0) idx = bid % sc->hn_tx_ring_inuse; else #endif idx = m->m_pkthdr.flowid % sc->hn_tx_ring_inuse; } txr = &sc->hn_tx_ring[idx]; error = drbr_enqueue(ifp, txr->hn_mbuf_br, m); if (error) { if_inc_counter(ifp, IFCOUNTER_OQDROPS, 1); return error; } if (txr->hn_oactive) return 0; if (txr->hn_sched_tx) goto do_sched; if (mtx_trylock(&txr->hn_tx_lock)) { int sched; sched = hn_xmit(txr, txr->hn_direct_tx_size); mtx_unlock(&txr->hn_tx_lock); if (!sched) return 0; } do_sched: taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_tx_task); return 0; } static void hn_tx_ring_qflush(struct hn_tx_ring *txr) { struct mbuf *m; mtx_lock(&txr->hn_tx_lock); while ((m = buf_ring_dequeue_sc(txr->hn_mbuf_br)) != NULL) m_freem(m); mtx_unlock(&txr->hn_tx_lock); } static void hn_xmit_qflush(struct ifnet *ifp) { struct hn_softc *sc = ifp->if_softc; int i; for (i = 0; i < sc->hn_tx_ring_inuse; ++i) hn_tx_ring_qflush(&sc->hn_tx_ring[i]); if_qflush(ifp); } static void hn_xmit_txeof(struct hn_tx_ring *txr) { if (txr->hn_sched_tx) goto do_sched; if (mtx_trylock(&txr->hn_tx_lock)) { int sched; txr->hn_oactive = 0; sched = hn_xmit(txr, txr->hn_direct_tx_size); mtx_unlock(&txr->hn_tx_lock); if (sched) { taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_tx_task); } } else { do_sched: /* * Release the oactive earlier, with the hope, that * others could catch up. The task will clear the * oactive again with the hn_tx_lock to avoid possible * races. */ txr->hn_oactive = 0; taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_txeof_task); } } static void hn_xmit_taskfunc(void *xtxr, int pending __unused) { struct hn_tx_ring *txr = xtxr; mtx_lock(&txr->hn_tx_lock); hn_xmit(txr, 0); mtx_unlock(&txr->hn_tx_lock); } static void hn_xmit_txeof_taskfunc(void *xtxr, int pending __unused) { struct hn_tx_ring *txr = xtxr; mtx_lock(&txr->hn_tx_lock); txr->hn_oactive = 0; hn_xmit(txr, 0); mtx_unlock(&txr->hn_tx_lock); } static int hn_chan_attach(struct hn_softc *sc, struct vmbus_channel *chan) { struct vmbus_chan_br cbr; struct hn_rx_ring *rxr; struct hn_tx_ring *txr = NULL; int idx, error; idx = vmbus_chan_subidx(chan); /* * Link this channel to RX/TX ring. */ KASSERT(idx >= 0 && idx < sc->hn_rx_ring_inuse, ("invalid channel index %d, should > 0 && < %d", idx, sc->hn_rx_ring_inuse)); rxr = &sc->hn_rx_ring[idx]; KASSERT((rxr->hn_rx_flags & HN_RX_FLAG_ATTACHED) == 0, ("RX ring %d already attached", idx)); rxr->hn_rx_flags |= HN_RX_FLAG_ATTACHED; rxr->hn_chan = chan; if (bootverbose) { if_printf(sc->hn_ifp, "link RX ring %d to chan%u\n", idx, vmbus_chan_id(chan)); } if (idx < sc->hn_tx_ring_inuse) { txr = &sc->hn_tx_ring[idx]; KASSERT((txr->hn_tx_flags & HN_TX_FLAG_ATTACHED) == 0, ("TX ring %d already attached", idx)); txr->hn_tx_flags |= HN_TX_FLAG_ATTACHED; txr->hn_chan = chan; if (bootverbose) { if_printf(sc->hn_ifp, "link TX ring %d to chan%u\n", idx, vmbus_chan_id(chan)); } } /* Bind this channel to a proper CPU. */ vmbus_chan_cpu_set(chan, HN_RING_IDX2CPU(sc, idx)); /* * Open this channel */ cbr.cbr = rxr->hn_br; cbr.cbr_paddr = rxr->hn_br_dma.hv_paddr; cbr.cbr_txsz = HN_TXBR_SIZE; cbr.cbr_rxsz = HN_RXBR_SIZE; error = vmbus_chan_open_br(chan, &cbr, NULL, 0, hn_chan_callback, rxr); if (error) { if (error == EISCONN) { if_printf(sc->hn_ifp, "bufring is connected after " "chan%u open failure\n", vmbus_chan_id(chan)); rxr->hn_rx_flags |= HN_RX_FLAG_BR_REF; } else { if_printf(sc->hn_ifp, "open chan%u failed: %d\n", vmbus_chan_id(chan), error); } } return (error); } static void hn_chan_detach(struct hn_softc *sc, struct vmbus_channel *chan) { struct hn_rx_ring *rxr; int idx, error; idx = vmbus_chan_subidx(chan); /* * Link this channel to RX/TX ring. */ KASSERT(idx >= 0 && idx < sc->hn_rx_ring_inuse, ("invalid channel index %d, should > 0 && < %d", idx, sc->hn_rx_ring_inuse)); rxr = &sc->hn_rx_ring[idx]; KASSERT((rxr->hn_rx_flags & HN_RX_FLAG_ATTACHED), ("RX ring %d is not attached", idx)); rxr->hn_rx_flags &= ~HN_RX_FLAG_ATTACHED; if (idx < sc->hn_tx_ring_inuse) { struct hn_tx_ring *txr = &sc->hn_tx_ring[idx]; KASSERT((txr->hn_tx_flags & HN_TX_FLAG_ATTACHED), ("TX ring %d is not attached attached", idx)); txr->hn_tx_flags &= ~HN_TX_FLAG_ATTACHED; } /* * Close this channel. * * NOTE: * Channel closing does _not_ destroy the target channel. */ error = vmbus_chan_close_direct(chan); if (error == EISCONN) { if_printf(sc->hn_ifp, "chan%u bufring is connected " "after being closed\n", vmbus_chan_id(chan)); rxr->hn_rx_flags |= HN_RX_FLAG_BR_REF; } else if (error) { if_printf(sc->hn_ifp, "chan%u close failed: %d\n", vmbus_chan_id(chan), error); } } static int hn_attach_subchans(struct hn_softc *sc) { struct vmbus_channel **subchans; int subchan_cnt = sc->hn_rx_ring_inuse - 1; int i, error = 0; KASSERT(subchan_cnt > 0, ("no sub-channels")); /* Attach the sub-channels. */ subchans = vmbus_subchan_get(sc->hn_prichan, subchan_cnt); for (i = 0; i < subchan_cnt; ++i) { int error1; error1 = hn_chan_attach(sc, subchans[i]); if (error1) { error = error1; /* Move on; all channels will be detached later. */ } } vmbus_subchan_rel(subchans, subchan_cnt); if (error) { if_printf(sc->hn_ifp, "sub-channels attach failed: %d\n", error); } else { if (bootverbose) { if_printf(sc->hn_ifp, "%d sub-channels attached\n", subchan_cnt); } } return (error); } static void hn_detach_allchans(struct hn_softc *sc) { struct vmbus_channel **subchans; int subchan_cnt = sc->hn_rx_ring_inuse - 1; int i; if (subchan_cnt == 0) goto back; /* Detach the sub-channels. */ subchans = vmbus_subchan_get(sc->hn_prichan, subchan_cnt); for (i = 0; i < subchan_cnt; ++i) hn_chan_detach(sc, subchans[i]); vmbus_subchan_rel(subchans, subchan_cnt); back: /* * Detach the primary channel, _after_ all sub-channels * are detached. */ hn_chan_detach(sc, sc->hn_prichan); /* Wait for sub-channels to be destroyed, if any. */ vmbus_subchan_drain(sc->hn_prichan); #ifdef INVARIANTS for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { KASSERT((sc->hn_rx_ring[i].hn_rx_flags & HN_RX_FLAG_ATTACHED) == 0, ("%dth RX ring is still attached", i)); } for (i = 0; i < sc->hn_tx_ring_cnt; ++i) { KASSERT((sc->hn_tx_ring[i].hn_tx_flags & HN_TX_FLAG_ATTACHED) == 0, ("%dth TX ring is still attached", i)); } #endif } static int hn_synth_alloc_subchans(struct hn_softc *sc, int *nsubch) { struct vmbus_channel **subchans; int nchan, rxr_cnt, error; nchan = *nsubch + 1; if (nchan == 1) { /* * Multiple RX/TX rings are not requested. */ *nsubch = 0; return (0); } /* * Query RSS capabilities, e.g. # of RX rings, and # of indirect * table entries. */ error = hn_rndis_query_rsscaps(sc, &rxr_cnt); if (error) { /* No RSS; this is benign. */ *nsubch = 0; return (0); } if (bootverbose) { if_printf(sc->hn_ifp, "RX rings offered %u, requested %d\n", rxr_cnt, nchan); } if (nchan > rxr_cnt) nchan = rxr_cnt; if (nchan == 1) { if_printf(sc->hn_ifp, "only 1 channel is supported, no vRSS\n"); *nsubch = 0; return (0); } /* * Allocate sub-channels from NVS. */ *nsubch = nchan - 1; error = hn_nvs_alloc_subchans(sc, nsubch); if (error || *nsubch == 0) { /* Failed to allocate sub-channels. */ *nsubch = 0; return (0); } /* * Wait for all sub-channels to become ready before moving on. */ subchans = vmbus_subchan_get(sc->hn_prichan, *nsubch); vmbus_subchan_rel(subchans, *nsubch); return (0); } static bool hn_synth_attachable(const struct hn_softc *sc) { int i; if (sc->hn_flags & HN_FLAG_ERRORS) return (false); for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { const struct hn_rx_ring *rxr = &sc->hn_rx_ring[i]; if (rxr->hn_rx_flags & HN_RX_FLAG_BR_REF) return (false); } return (true); } static int hn_synth_attach(struct hn_softc *sc, int mtu) { #define ATTACHED_NVS 0x0002 #define ATTACHED_RNDIS 0x0004 struct ndis_rssprm_toeplitz *rss = &sc->hn_rss; int error, nsubch, nchan, i; uint32_t old_caps, attached = 0; KASSERT((sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) == 0, ("synthetic parts were attached")); if (!hn_synth_attachable(sc)) return (ENXIO); /* Save capabilities for later verification. */ old_caps = sc->hn_caps; sc->hn_caps = 0; /* Clear RSS stuffs. */ sc->hn_rss_ind_size = 0; sc->hn_rss_hash = 0; /* * Attach the primary channel _before_ attaching NVS and RNDIS. */ error = hn_chan_attach(sc, sc->hn_prichan); if (error) goto failed; /* * Attach NVS. */ error = hn_nvs_attach(sc, mtu); if (error) goto failed; attached |= ATTACHED_NVS; /* * Attach RNDIS _after_ NVS is attached. */ error = hn_rndis_attach(sc, mtu); if (error) goto failed; attached |= ATTACHED_RNDIS; /* * Make sure capabilities are not changed. */ if (device_is_attached(sc->hn_dev) && old_caps != sc->hn_caps) { if_printf(sc->hn_ifp, "caps mismatch old 0x%08x, new 0x%08x\n", old_caps, sc->hn_caps); error = ENXIO; goto failed; } /* * Allocate sub-channels for multi-TX/RX rings. * * NOTE: * The # of RX rings that can be used is equivalent to the # of * channels to be requested. */ nsubch = sc->hn_rx_ring_cnt - 1; error = hn_synth_alloc_subchans(sc, &nsubch); if (error) goto failed; /* NOTE: _Full_ synthetic parts detach is required now. */ sc->hn_flags |= HN_FLAG_SYNTH_ATTACHED; /* * Set the # of TX/RX rings that could be used according to * the # of channels that NVS offered. */ nchan = nsubch + 1; hn_set_ring_inuse(sc, nchan); if (nchan == 1) { /* Only the primary channel can be used; done */ goto back; } /* * Attach the sub-channels. * * NOTE: hn_set_ring_inuse() _must_ have been called. */ error = hn_attach_subchans(sc); if (error) goto failed; /* * Configure RSS key and indirect table _after_ all sub-channels * are attached. */ if ((sc->hn_flags & HN_FLAG_HAS_RSSKEY) == 0) { /* * RSS key is not set yet; set it to the default RSS key. */ if (bootverbose) if_printf(sc->hn_ifp, "setup default RSS key\n"); #ifdef RSS rss_getkey(rss->rss_key); #else memcpy(rss->rss_key, hn_rss_key_default, sizeof(rss->rss_key)); #endif sc->hn_flags |= HN_FLAG_HAS_RSSKEY; } if ((sc->hn_flags & HN_FLAG_HAS_RSSIND) == 0) { /* * RSS indirect table is not set yet; set it up in round- * robin fashion. */ if (bootverbose) { if_printf(sc->hn_ifp, "setup default RSS indirect " "table\n"); } for (i = 0; i < NDIS_HASH_INDCNT; ++i) { uint32_t subidx; #ifdef RSS subidx = rss_get_indirection_to_bucket(i); #else subidx = i; #endif rss->rss_ind[i] = subidx % nchan; } sc->hn_flags |= HN_FLAG_HAS_RSSIND; } else { /* * # of usable channels may be changed, so we have to * make sure that all entries in RSS indirect table * are valid. * * NOTE: hn_set_ring_inuse() _must_ have been called. */ hn_rss_ind_fixup(sc); } error = hn_rndis_conf_rss(sc, NDIS_RSS_FLAG_NONE); if (error) goto failed; back: /* * Fixup transmission aggregation setup. */ hn_set_txagg(sc); return (0); failed: if (sc->hn_flags & HN_FLAG_SYNTH_ATTACHED) { hn_synth_detach(sc); } else { if (attached & ATTACHED_RNDIS) hn_rndis_detach(sc); if (attached & ATTACHED_NVS) hn_nvs_detach(sc); hn_chan_detach(sc, sc->hn_prichan); /* Restore old capabilities. */ sc->hn_caps = old_caps; } return (error); #undef ATTACHED_RNDIS #undef ATTACHED_NVS } /* * NOTE: * The interface must have been suspended though hn_suspend(), before * this function get called. */ static void hn_synth_detach(struct hn_softc *sc) { KASSERT(sc->hn_flags & HN_FLAG_SYNTH_ATTACHED, ("synthetic parts were not attached")); /* Detach the RNDIS first. */ hn_rndis_detach(sc); /* Detach NVS. */ hn_nvs_detach(sc); /* Detach all of the channels. */ hn_detach_allchans(sc); sc->hn_flags &= ~HN_FLAG_SYNTH_ATTACHED; } static void hn_set_ring_inuse(struct hn_softc *sc, int ring_cnt) { KASSERT(ring_cnt > 0 && ring_cnt <= sc->hn_rx_ring_cnt, ("invalid ring count %d", ring_cnt)); if (sc->hn_tx_ring_cnt > ring_cnt) sc->hn_tx_ring_inuse = ring_cnt; else sc->hn_tx_ring_inuse = sc->hn_tx_ring_cnt; sc->hn_rx_ring_inuse = ring_cnt; #ifdef RSS if (sc->hn_rx_ring_inuse != rss_getnumbuckets()) { if_printf(sc->hn_ifp, "# of RX rings (%d) does not match " "# of RSS buckets (%d)\n", sc->hn_rx_ring_inuse, rss_getnumbuckets()); } #endif if (bootverbose) { if_printf(sc->hn_ifp, "%d TX ring, %d RX ring\n", sc->hn_tx_ring_inuse, sc->hn_rx_ring_inuse); } } static void hn_chan_drain(struct hn_softc *sc, struct vmbus_channel *chan) { /* * NOTE: * The TX bufring will not be drained by the hypervisor, * if the primary channel is revoked. */ while (!vmbus_chan_rx_empty(chan) || (!vmbus_chan_is_revoked(sc->hn_prichan) && !vmbus_chan_tx_empty(chan))) pause("waitch", 1); vmbus_chan_intr_drain(chan); } static void hn_suspend_data(struct hn_softc *sc) { struct vmbus_channel **subch = NULL; struct hn_tx_ring *txr; int i, nsubch; HN_LOCK_ASSERT(sc); /* * Suspend TX. */ for (i = 0; i < sc->hn_tx_ring_inuse; ++i) { txr = &sc->hn_tx_ring[i]; mtx_lock(&txr->hn_tx_lock); txr->hn_suspended = 1; mtx_unlock(&txr->hn_tx_lock); /* No one is able send more packets now. */ /* * Wait for all pending sends to finish. * * NOTE: * We will _not_ receive all pending send-done, if the * primary channel is revoked. */ while (hn_tx_ring_pending(txr) && !vmbus_chan_is_revoked(sc->hn_prichan)) pause("hnwtx", 1 /* 1 tick */); } /* * Disable RX by clearing RX filter. */ hn_set_rxfilter(sc, NDIS_PACKET_TYPE_NONE); /* * Give RNDIS enough time to flush all pending data packets. */ pause("waitrx", (200 * hz) / 1000); /* * Drain RX/TX bufrings and interrupts. */ nsubch = sc->hn_rx_ring_inuse - 1; if (nsubch > 0) subch = vmbus_subchan_get(sc->hn_prichan, nsubch); if (subch != NULL) { for (i = 0; i < nsubch; ++i) hn_chan_drain(sc, subch[i]); } hn_chan_drain(sc, sc->hn_prichan); if (subch != NULL) vmbus_subchan_rel(subch, nsubch); /* * Drain any pending TX tasks. * * NOTE: * The above hn_chan_drain() can dispatch TX tasks, so the TX * tasks will have to be drained _after_ the above hn_chan_drain() * calls. */ for (i = 0; i < sc->hn_tx_ring_inuse; ++i) { txr = &sc->hn_tx_ring[i]; taskqueue_drain(txr->hn_tx_taskq, &txr->hn_tx_task); taskqueue_drain(txr->hn_tx_taskq, &txr->hn_txeof_task); } } static void hn_suspend_mgmt_taskfunc(void *xsc, int pending __unused) { ((struct hn_softc *)xsc)->hn_mgmt_taskq = NULL; } static void hn_suspend_mgmt(struct hn_softc *sc) { struct task task; HN_LOCK_ASSERT(sc); /* * Make sure that hn_mgmt_taskq0 can nolonger be accessed * through hn_mgmt_taskq. */ TASK_INIT(&task, 0, hn_suspend_mgmt_taskfunc, sc); vmbus_chan_run_task(sc->hn_prichan, &task); /* * Make sure that all pending management tasks are completed. */ taskqueue_drain(sc->hn_mgmt_taskq0, &sc->hn_netchg_init); taskqueue_drain_timeout(sc->hn_mgmt_taskq0, &sc->hn_netchg_status); taskqueue_drain_all(sc->hn_mgmt_taskq0); } static void hn_suspend(struct hn_softc *sc) { /* Disable polling. */ hn_polling(sc, 0); if ((sc->hn_ifp->if_drv_flags & IFF_DRV_RUNNING) || (sc->hn_flags & HN_FLAG_VF)) hn_suspend_data(sc); hn_suspend_mgmt(sc); } static void hn_resume_tx(struct hn_softc *sc, int tx_ring_cnt) { int i; KASSERT(tx_ring_cnt <= sc->hn_tx_ring_cnt, ("invalid TX ring count %d", tx_ring_cnt)); for (i = 0; i < tx_ring_cnt; ++i) { struct hn_tx_ring *txr = &sc->hn_tx_ring[i]; mtx_lock(&txr->hn_tx_lock); txr->hn_suspended = 0; mtx_unlock(&txr->hn_tx_lock); } } static void hn_resume_data(struct hn_softc *sc) { int i; HN_LOCK_ASSERT(sc); /* * Re-enable RX. */ hn_rxfilter_config(sc); /* * Make sure to clear suspend status on "all" TX rings, * since hn_tx_ring_inuse can be changed after * hn_suspend_data(). */ hn_resume_tx(sc, sc->hn_tx_ring_cnt); #ifdef HN_IFSTART_SUPPORT if (!hn_use_if_start) #endif { /* * Flush unused drbrs, since hn_tx_ring_inuse may be * reduced. */ for (i = sc->hn_tx_ring_inuse; i < sc->hn_tx_ring_cnt; ++i) hn_tx_ring_qflush(&sc->hn_tx_ring[i]); } /* * Kick start TX. */ for (i = 0; i < sc->hn_tx_ring_inuse; ++i) { struct hn_tx_ring *txr = &sc->hn_tx_ring[i]; /* * Use txeof task, so that any pending oactive can be * cleared properly. */ taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_txeof_task); } } static void hn_resume_mgmt(struct hn_softc *sc) { sc->hn_mgmt_taskq = sc->hn_mgmt_taskq0; /* * Kick off network change detection, if it was pending. * If no network change was pending, start link status * checks, which is more lightweight than network change * detection. */ if (sc->hn_link_flags & HN_LINK_FLAG_NETCHG) hn_change_network(sc); else hn_update_link_status(sc); } static void hn_resume(struct hn_softc *sc) { if ((sc->hn_ifp->if_drv_flags & IFF_DRV_RUNNING) || (sc->hn_flags & HN_FLAG_VF)) hn_resume_data(sc); /* * When the VF is activated, the synthetic interface is changed * to DOWN in hn_set_vf(). Here, if the VF is still active, we * don't call hn_resume_mgmt() until the VF is deactivated in * hn_set_vf(). */ if (!(sc->hn_flags & HN_FLAG_VF)) hn_resume_mgmt(sc); /* * Re-enable polling if this interface is running and * the polling is requested. */ if ((sc->hn_ifp->if_drv_flags & IFF_DRV_RUNNING) && sc->hn_pollhz > 0) hn_polling(sc, sc->hn_pollhz); } static void hn_rndis_rx_status(struct hn_softc *sc, const void *data, int dlen) { const struct rndis_status_msg *msg; int ofs; if (dlen < sizeof(*msg)) { if_printf(sc->hn_ifp, "invalid RNDIS status\n"); return; } msg = data; switch (msg->rm_status) { case RNDIS_STATUS_MEDIA_CONNECT: case RNDIS_STATUS_MEDIA_DISCONNECT: hn_update_link_status(sc); break; case RNDIS_STATUS_TASK_OFFLOAD_CURRENT_CONFIG: /* Not really useful; ignore. */ break; case RNDIS_STATUS_NETWORK_CHANGE: ofs = RNDIS_STBUFOFFSET_ABS(msg->rm_stbufoffset); if (dlen < ofs + msg->rm_stbuflen || msg->rm_stbuflen < sizeof(uint32_t)) { if_printf(sc->hn_ifp, "network changed\n"); } else { uint32_t change; memcpy(&change, ((const uint8_t *)msg) + ofs, sizeof(change)); if_printf(sc->hn_ifp, "network changed, change %u\n", change); } hn_change_network(sc); break; default: if_printf(sc->hn_ifp, "unknown RNDIS status 0x%08x\n", msg->rm_status); break; } } static int hn_rndis_rxinfo(const void *info_data, int info_dlen, struct hn_rxinfo *info) { const struct rndis_pktinfo *pi = info_data; uint32_t mask = 0; while (info_dlen != 0) { const void *data; uint32_t dlen; if (__predict_false(info_dlen < sizeof(*pi))) return (EINVAL); if (__predict_false(info_dlen < pi->rm_size)) return (EINVAL); info_dlen -= pi->rm_size; if (__predict_false(pi->rm_size & RNDIS_PKTINFO_SIZE_ALIGNMASK)) return (EINVAL); if (__predict_false(pi->rm_size < pi->rm_pktinfooffset)) return (EINVAL); dlen = pi->rm_size - pi->rm_pktinfooffset; data = pi->rm_data; switch (pi->rm_type) { case NDIS_PKTINFO_TYPE_VLAN: if (__predict_false(dlen < NDIS_VLAN_INFO_SIZE)) return (EINVAL); info->vlan_info = *((const uint32_t *)data); mask |= HN_RXINFO_VLAN; break; case NDIS_PKTINFO_TYPE_CSUM: if (__predict_false(dlen < NDIS_RXCSUM_INFO_SIZE)) return (EINVAL); info->csum_info = *((const uint32_t *)data); mask |= HN_RXINFO_CSUM; break; case HN_NDIS_PKTINFO_TYPE_HASHVAL: if (__predict_false(dlen < HN_NDIS_HASH_VALUE_SIZE)) return (EINVAL); info->hash_value = *((const uint32_t *)data); mask |= HN_RXINFO_HASHVAL; break; case HN_NDIS_PKTINFO_TYPE_HASHINF: if (__predict_false(dlen < HN_NDIS_HASH_INFO_SIZE)) return (EINVAL); info->hash_info = *((const uint32_t *)data); mask |= HN_RXINFO_HASHINF; break; default: goto next; } if (mask == HN_RXINFO_ALL) { /* All found; done */ break; } next: pi = (const struct rndis_pktinfo *) ((const uint8_t *)pi + pi->rm_size); } /* * Final fixup. * - If there is no hash value, invalidate the hash info. */ if ((mask & HN_RXINFO_HASHVAL) == 0) info->hash_info = HN_NDIS_HASH_INFO_INVALID; return (0); } static __inline bool hn_rndis_check_overlap(int off, int len, int check_off, int check_len) { if (off < check_off) { if (__predict_true(off + len <= check_off)) return (false); } else if (off > check_off) { if (__predict_true(check_off + check_len <= off)) return (false); } return (true); } static void hn_rndis_rx_data(struct hn_rx_ring *rxr, const void *data, int dlen) { const struct rndis_packet_msg *pkt; struct hn_rxinfo info; int data_off, pktinfo_off, data_len, pktinfo_len; /* * Check length. */ if (__predict_false(dlen < sizeof(*pkt))) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg\n"); return; } pkt = data; if (__predict_false(dlen < pkt->rm_len)) { if_printf(rxr->hn_ifp, "truncated RNDIS packet msg, " "dlen %d, msglen %u\n", dlen, pkt->rm_len); return; } if (__predict_false(pkt->rm_len < pkt->rm_datalen + pkt->rm_oobdatalen + pkt->rm_pktinfolen)) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msglen, " "msglen %u, data %u, oob %u, pktinfo %u\n", pkt->rm_len, pkt->rm_datalen, pkt->rm_oobdatalen, pkt->rm_pktinfolen); return; } if (__predict_false(pkt->rm_datalen == 0)) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, no data\n"); return; } /* * Check offests. */ #define IS_OFFSET_INVALID(ofs) \ ((ofs) < RNDIS_PACKET_MSG_OFFSET_MIN || \ ((ofs) & RNDIS_PACKET_MSG_OFFSET_ALIGNMASK)) /* XXX Hyper-V does not meet data offset alignment requirement */ if (__predict_false(pkt->rm_dataoffset < RNDIS_PACKET_MSG_OFFSET_MIN)) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, " "data offset %u\n", pkt->rm_dataoffset); return; } if (__predict_false(pkt->rm_oobdataoffset > 0 && IS_OFFSET_INVALID(pkt->rm_oobdataoffset))) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, " "oob offset %u\n", pkt->rm_oobdataoffset); return; } if (__predict_true(pkt->rm_pktinfooffset > 0) && __predict_false(IS_OFFSET_INVALID(pkt->rm_pktinfooffset))) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, " "pktinfo offset %u\n", pkt->rm_pktinfooffset); return; } #undef IS_OFFSET_INVALID data_off = RNDIS_PACKET_MSG_OFFSET_ABS(pkt->rm_dataoffset); data_len = pkt->rm_datalen; pktinfo_off = RNDIS_PACKET_MSG_OFFSET_ABS(pkt->rm_pktinfooffset); pktinfo_len = pkt->rm_pktinfolen; /* * Check OOB coverage. */ if (__predict_false(pkt->rm_oobdatalen != 0)) { int oob_off, oob_len; if_printf(rxr->hn_ifp, "got oobdata\n"); oob_off = RNDIS_PACKET_MSG_OFFSET_ABS(pkt->rm_oobdataoffset); oob_len = pkt->rm_oobdatalen; if (__predict_false(oob_off + oob_len > pkt->rm_len)) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, " "oob overflow, msglen %u, oob abs %d len %d\n", pkt->rm_len, oob_off, oob_len); return; } /* * Check against data. */ if (hn_rndis_check_overlap(oob_off, oob_len, data_off, data_len)) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, " "oob overlaps data, oob abs %d len %d, " "data abs %d len %d\n", oob_off, oob_len, data_off, data_len); return; } /* * Check against pktinfo. */ if (pktinfo_len != 0 && hn_rndis_check_overlap(oob_off, oob_len, pktinfo_off, pktinfo_len)) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, " "oob overlaps pktinfo, oob abs %d len %d, " "pktinfo abs %d len %d\n", oob_off, oob_len, pktinfo_off, pktinfo_len); return; } } /* * Check per-packet-info coverage and find useful per-packet-info. */ info.vlan_info = HN_NDIS_VLAN_INFO_INVALID; info.csum_info = HN_NDIS_RXCSUM_INFO_INVALID; info.hash_info = HN_NDIS_HASH_INFO_INVALID; if (__predict_true(pktinfo_len != 0)) { bool overlap; int error; if (__predict_false(pktinfo_off + pktinfo_len > pkt->rm_len)) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, " "pktinfo overflow, msglen %u, " "pktinfo abs %d len %d\n", pkt->rm_len, pktinfo_off, pktinfo_len); return; } /* * Check packet info coverage. */ overlap = hn_rndis_check_overlap(pktinfo_off, pktinfo_len, data_off, data_len); if (__predict_false(overlap)) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, " "pktinfo overlap data, pktinfo abs %d len %d, " "data abs %d len %d\n", pktinfo_off, pktinfo_len, data_off, data_len); return; } /* * Find useful per-packet-info. */ error = hn_rndis_rxinfo(((const uint8_t *)pkt) + pktinfo_off, pktinfo_len, &info); if (__predict_false(error)) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg " "pktinfo\n"); return; } } if (__predict_false(data_off + data_len > pkt->rm_len)) { if_printf(rxr->hn_ifp, "invalid RNDIS packet msg, " "data overflow, msglen %u, data abs %d len %d\n", pkt->rm_len, data_off, data_len); return; } hn_rxpkt(rxr, ((const uint8_t *)pkt) + data_off, data_len, &info); } static __inline void hn_rndis_rxpkt(struct hn_rx_ring *rxr, const void *data, int dlen) { const struct rndis_msghdr *hdr; if (__predict_false(dlen < sizeof(*hdr))) { if_printf(rxr->hn_ifp, "invalid RNDIS msg\n"); return; } hdr = data; if (__predict_true(hdr->rm_type == REMOTE_NDIS_PACKET_MSG)) { /* Hot data path. */ hn_rndis_rx_data(rxr, data, dlen); /* Done! */ return; } if (hdr->rm_type == REMOTE_NDIS_INDICATE_STATUS_MSG) hn_rndis_rx_status(rxr->hn_ifp->if_softc, data, dlen); else hn_rndis_rx_ctrl(rxr->hn_ifp->if_softc, data, dlen); } static void hn_nvs_handle_notify(struct hn_softc *sc, const struct vmbus_chanpkt_hdr *pkt) { const struct hn_nvs_hdr *hdr; if (VMBUS_CHANPKT_DATALEN(pkt) < sizeof(*hdr)) { if_printf(sc->hn_ifp, "invalid nvs notify\n"); return; } hdr = VMBUS_CHANPKT_CONST_DATA(pkt); if (hdr->nvs_type == HN_NVS_TYPE_TXTBL_NOTE) { /* Useless; ignore */ return; } if_printf(sc->hn_ifp, "got notify, nvs type %u\n", hdr->nvs_type); } static void hn_nvs_handle_comp(struct hn_softc *sc, struct vmbus_channel *chan, const struct vmbus_chanpkt_hdr *pkt) { struct hn_nvs_sendctx *sndc; sndc = (struct hn_nvs_sendctx *)(uintptr_t)pkt->cph_xactid; sndc->hn_cb(sndc, sc, chan, VMBUS_CHANPKT_CONST_DATA(pkt), VMBUS_CHANPKT_DATALEN(pkt)); /* * NOTE: * 'sndc' CAN NOT be accessed anymore, since it can be freed by * its callback. */ } static void hn_nvs_handle_rxbuf(struct hn_rx_ring *rxr, struct vmbus_channel *chan, const struct vmbus_chanpkt_hdr *pkthdr) { const struct vmbus_chanpkt_rxbuf *pkt; const struct hn_nvs_hdr *nvs_hdr; int count, i, hlen; if (__predict_false(VMBUS_CHANPKT_DATALEN(pkthdr) < sizeof(*nvs_hdr))) { if_printf(rxr->hn_ifp, "invalid nvs RNDIS\n"); return; } nvs_hdr = VMBUS_CHANPKT_CONST_DATA(pkthdr); /* Make sure that this is a RNDIS message. */ if (__predict_false(nvs_hdr->nvs_type != HN_NVS_TYPE_RNDIS)) { if_printf(rxr->hn_ifp, "nvs type %u, not RNDIS\n", nvs_hdr->nvs_type); return; } hlen = VMBUS_CHANPKT_GETLEN(pkthdr->cph_hlen); if (__predict_false(hlen < sizeof(*pkt))) { if_printf(rxr->hn_ifp, "invalid rxbuf chanpkt\n"); return; } pkt = (const struct vmbus_chanpkt_rxbuf *)pkthdr; if (__predict_false(pkt->cp_rxbuf_id != HN_NVS_RXBUF_SIG)) { if_printf(rxr->hn_ifp, "invalid rxbuf_id 0x%08x\n", pkt->cp_rxbuf_id); return; } count = pkt->cp_rxbuf_cnt; if (__predict_false(hlen < __offsetof(struct vmbus_chanpkt_rxbuf, cp_rxbuf[count]))) { if_printf(rxr->hn_ifp, "invalid rxbuf_cnt %d\n", count); return; } /* Each range represents 1 RNDIS pkt that contains 1 Ethernet frame */ for (i = 0; i < count; ++i) { int ofs, len; ofs = pkt->cp_rxbuf[i].rb_ofs; len = pkt->cp_rxbuf[i].rb_len; if (__predict_false(ofs + len > HN_RXBUF_SIZE)) { if_printf(rxr->hn_ifp, "%dth RNDIS msg overflow rxbuf, " "ofs %d, len %d\n", i, ofs, len); continue; } hn_rndis_rxpkt(rxr, rxr->hn_rxbuf + ofs, len); } /* * Ack the consumed RXBUF associated w/ this channel packet, * so that this RXBUF can be recycled by the hypervisor. */ hn_nvs_ack_rxbuf(rxr, chan, pkt->cp_hdr.cph_xactid); } static void hn_nvs_ack_rxbuf(struct hn_rx_ring *rxr, struct vmbus_channel *chan, uint64_t tid) { struct hn_nvs_rndis_ack ack; int retries, error; ack.nvs_type = HN_NVS_TYPE_RNDIS_ACK; ack.nvs_status = HN_NVS_STATUS_OK; retries = 0; again: error = vmbus_chan_send(chan, VMBUS_CHANPKT_TYPE_COMP, VMBUS_CHANPKT_FLAG_NONE, &ack, sizeof(ack), tid); if (__predict_false(error == EAGAIN)) { /* * NOTE: * This should _not_ happen in real world, since the * consumption of the TX bufring from the TX path is * controlled. */ if (rxr->hn_ack_failed == 0) if_printf(rxr->hn_ifp, "RXBUF ack retry\n"); rxr->hn_ack_failed++; retries++; if (retries < 10) { DELAY(100); goto again; } /* RXBUF leaks! */ if_printf(rxr->hn_ifp, "RXBUF ack failed\n"); } } static void hn_chan_callback(struct vmbus_channel *chan, void *xrxr) { struct hn_rx_ring *rxr = xrxr; struct hn_softc *sc = rxr->hn_ifp->if_softc; for (;;) { struct vmbus_chanpkt_hdr *pkt = rxr->hn_pktbuf; int error, pktlen; pktlen = rxr->hn_pktbuf_len; error = vmbus_chan_recv_pkt(chan, pkt, &pktlen); if (__predict_false(error == ENOBUFS)) { void *nbuf; int nlen; /* * Expand channel packet buffer. * * XXX * Use M_WAITOK here, since allocation failure * is fatal. */ nlen = rxr->hn_pktbuf_len * 2; while (nlen < pktlen) nlen *= 2; nbuf = malloc(nlen, M_DEVBUF, M_WAITOK); if_printf(rxr->hn_ifp, "expand pktbuf %d -> %d\n", rxr->hn_pktbuf_len, nlen); free(rxr->hn_pktbuf, M_DEVBUF); rxr->hn_pktbuf = nbuf; rxr->hn_pktbuf_len = nlen; /* Retry! */ continue; } else if (__predict_false(error == EAGAIN)) { /* No more channel packets; done! */ break; } KASSERT(!error, ("vmbus_chan_recv_pkt failed: %d", error)); switch (pkt->cph_type) { case VMBUS_CHANPKT_TYPE_COMP: hn_nvs_handle_comp(sc, chan, pkt); break; case VMBUS_CHANPKT_TYPE_RXBUF: hn_nvs_handle_rxbuf(rxr, chan, pkt); break; case VMBUS_CHANPKT_TYPE_INBAND: hn_nvs_handle_notify(sc, pkt); break; default: if_printf(rxr->hn_ifp, "unknown chan pkt %u\n", pkt->cph_type); break; } } hn_chan_rollup(rxr, rxr->hn_txr); } static void hn_tx_taskq_create(void *arg __unused) { int i; /* * Fix the # of TX taskqueues. */ if (hn_tx_taskq_cnt <= 0) hn_tx_taskq_cnt = 1; else if (hn_tx_taskq_cnt > mp_ncpus) hn_tx_taskq_cnt = mp_ncpus; /* * Fix the TX taskqueue mode. */ switch (hn_tx_taskq_mode) { case HN_TX_TASKQ_M_INDEP: case HN_TX_TASKQ_M_GLOBAL: case HN_TX_TASKQ_M_EVTTQ: break; default: hn_tx_taskq_mode = HN_TX_TASKQ_M_INDEP; break; } if (vm_guest != VM_GUEST_HV) return; if (hn_tx_taskq_mode != HN_TX_TASKQ_M_GLOBAL) return; hn_tx_taskque = malloc(hn_tx_taskq_cnt * sizeof(struct taskqueue *), M_DEVBUF, M_WAITOK); for (i = 0; i < hn_tx_taskq_cnt; ++i) { hn_tx_taskque[i] = taskqueue_create("hn_tx", M_WAITOK, taskqueue_thread_enqueue, &hn_tx_taskque[i]); taskqueue_start_threads(&hn_tx_taskque[i], 1, PI_NET, "hn tx%d", i); } } SYSINIT(hn_txtq_create, SI_SUB_DRIVERS, SI_ORDER_SECOND, hn_tx_taskq_create, NULL); static void hn_tx_taskq_destroy(void *arg __unused) { if (hn_tx_taskque != NULL) { int i; for (i = 0; i < hn_tx_taskq_cnt; ++i) taskqueue_free(hn_tx_taskque[i]); free(hn_tx_taskque, M_DEVBUF); } } SYSUNINIT(hn_txtq_destroy, SI_SUB_DRIVERS, SI_ORDER_SECOND, hn_tx_taskq_destroy, NULL); Index: projects/clang400-import/sys/kern/imgact_elf.c =================================================================== --- projects/clang400-import/sys/kern/imgact_elf.c (revision 314522) +++ projects/clang400-import/sys/kern/imgact_elf.c (revision 314523) @@ -1,2410 +1,2416 @@ /*- * Copyright (c) 2000 David O'Brien * Copyright (c) 1995-1996 Søren Schmidt * Copyright (c) 1996 Peter Wemm * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer * in this position and unchanged. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_capsicum.h" #include "opt_compat.h" #include "opt_gzio.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define ELF_NOTE_ROUNDSIZE 4 #define OLD_EI_BRAND 8 static int __elfN(check_header)(const Elf_Ehdr *hdr); static Elf_Brandinfo *__elfN(get_brandinfo)(struct image_params *imgp, const char *interp, int interp_name_len, int32_t *osrel); static int __elfN(load_file)(struct proc *p, const char *file, u_long *addr, u_long *entry, size_t pagesize); static int __elfN(load_section)(struct image_params *imgp, vm_offset_t offset, caddr_t vmaddr, size_t memsz, size_t filsz, vm_prot_t prot, size_t pagesize); static int __CONCAT(exec_, __elfN(imgact))(struct image_params *imgp); static boolean_t __elfN(freebsd_trans_osrel)(const Elf_Note *note, int32_t *osrel); static boolean_t kfreebsd_trans_osrel(const Elf_Note *note, int32_t *osrel); static boolean_t __elfN(check_note)(struct image_params *imgp, Elf_Brandnote *checknote, int32_t *osrel); static vm_prot_t __elfN(trans_prot)(Elf_Word); static Elf_Word __elfN(untrans_prot)(vm_prot_t); SYSCTL_NODE(_kern, OID_AUTO, __CONCAT(elf, __ELF_WORD_SIZE), CTLFLAG_RW, 0, ""); #define CORE_BUF_SIZE (16 * 1024) int __elfN(fallback_brand) = -1; SYSCTL_INT(__CONCAT(_kern_elf, __ELF_WORD_SIZE), OID_AUTO, fallback_brand, CTLFLAG_RWTUN, &__elfN(fallback_brand), 0, __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) " brand of last resort"); static int elf_legacy_coredump = 0; SYSCTL_INT(_debug, OID_AUTO, __elfN(legacy_coredump), CTLFLAG_RW, &elf_legacy_coredump, 0, "include all and only RW pages in core dumps"); int __elfN(nxstack) = #if defined(__amd64__) || defined(__powerpc64__) /* both 64 and 32 bit */ || \ (defined(__arm__) && __ARM_ARCH >= 7) || defined(__aarch64__) 1; #else 0; #endif SYSCTL_INT(__CONCAT(_kern_elf, __ELF_WORD_SIZE), OID_AUTO, nxstack, CTLFLAG_RW, &__elfN(nxstack), 0, __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) ": enable non-executable stack"); #if __ELF_WORD_SIZE == 32 #if defined(__amd64__) int i386_read_exec = 0; SYSCTL_INT(_kern_elf32, OID_AUTO, read_exec, CTLFLAG_RW, &i386_read_exec, 0, "enable execution from readable segments"); #endif #endif static Elf_Brandinfo *elf_brand_list[MAX_BRANDS]; #define trunc_page_ps(va, ps) rounddown2(va, ps) #define round_page_ps(va, ps) roundup2(va, ps) #define aligned(a, t) (trunc_page_ps((u_long)(a), sizeof(t)) == (u_long)(a)) static const char FREEBSD_ABI_VENDOR[] = "FreeBSD"; Elf_Brandnote __elfN(freebsd_brandnote) = { .hdr.n_namesz = sizeof(FREEBSD_ABI_VENDOR), .hdr.n_descsz = sizeof(int32_t), .hdr.n_type = NT_FREEBSD_ABI_TAG, .vendor = FREEBSD_ABI_VENDOR, .flags = BN_TRANSLATE_OSREL, .trans_osrel = __elfN(freebsd_trans_osrel) }; static boolean_t __elfN(freebsd_trans_osrel)(const Elf_Note *note, int32_t *osrel) { uintptr_t p; p = (uintptr_t)(note + 1); p += roundup2(note->n_namesz, ELF_NOTE_ROUNDSIZE); *osrel = *(const int32_t *)(p); return (TRUE); } static const char GNU_ABI_VENDOR[] = "GNU"; static int GNU_KFREEBSD_ABI_DESC = 3; Elf_Brandnote __elfN(kfreebsd_brandnote) = { .hdr.n_namesz = sizeof(GNU_ABI_VENDOR), .hdr.n_descsz = 16, /* XXX at least 16 */ .hdr.n_type = 1, .vendor = GNU_ABI_VENDOR, .flags = BN_TRANSLATE_OSREL, .trans_osrel = kfreebsd_trans_osrel }; static boolean_t kfreebsd_trans_osrel(const Elf_Note *note, int32_t *osrel) { const Elf32_Word *desc; uintptr_t p; p = (uintptr_t)(note + 1); p += roundup2(note->n_namesz, ELF_NOTE_ROUNDSIZE); desc = (const Elf32_Word *)p; if (desc[0] != GNU_KFREEBSD_ABI_DESC) return (FALSE); /* * Debian GNU/kFreeBSD embed the earliest compatible kernel version * (__FreeBSD_version: Rxx) in the LSB way. */ *osrel = desc[1] * 100000 + desc[2] * 1000 + desc[3]; return (TRUE); } int __elfN(insert_brand_entry)(Elf_Brandinfo *entry) { int i; for (i = 0; i < MAX_BRANDS; i++) { if (elf_brand_list[i] == NULL) { elf_brand_list[i] = entry; break; } } if (i == MAX_BRANDS) { printf("WARNING: %s: could not insert brandinfo entry: %p\n", __func__, entry); return (-1); } return (0); } int __elfN(remove_brand_entry)(Elf_Brandinfo *entry) { int i; for (i = 0; i < MAX_BRANDS; i++) { if (elf_brand_list[i] == entry) { elf_brand_list[i] = NULL; break; } } if (i == MAX_BRANDS) return (-1); return (0); } int __elfN(brand_inuse)(Elf_Brandinfo *entry) { struct proc *p; int rval = FALSE; sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { if (p->p_sysent == entry->sysvec) { rval = TRUE; break; } } sx_sunlock(&allproc_lock); return (rval); } static Elf_Brandinfo * __elfN(get_brandinfo)(struct image_params *imgp, const char *interp, int interp_name_len, int32_t *osrel) { const Elf_Ehdr *hdr = (const Elf_Ehdr *)imgp->image_header; Elf_Brandinfo *bi, *bi_m; boolean_t ret; int i; /* * We support four types of branding -- (1) the ELF EI_OSABI field * that SCO added to the ELF spec, (2) FreeBSD 3.x's traditional string * branding w/in the ELF header, (3) path of the `interp_path' * field, and (4) the ".note.ABI-tag" ELF section. */ /* Look for an ".note.ABI-tag" ELF section */ bi_m = NULL; for (i = 0; i < MAX_BRANDS; i++) { bi = elf_brand_list[i]; if (bi == NULL) continue; if (hdr->e_machine == bi->machine && (bi->flags & (BI_BRAND_NOTE|BI_BRAND_NOTE_MANDATORY)) != 0) { ret = __elfN(check_note)(imgp, bi->brand_note, osrel); /* Give brand a chance to veto check_note's guess */ if (ret && bi->header_supported) ret = bi->header_supported(imgp); /* * If note checker claimed the binary, but the * interpreter path in the image does not * match default one for the brand, try to * search for other brands with the same * interpreter. Either there is better brand * with the right interpreter, or, failing * this, we return first brand which accepted * our note and, optionally, header. */ if (ret && bi_m == NULL && (strlen(bi->interp_path) + 1 != interp_name_len || strncmp(interp, bi->interp_path, interp_name_len) != 0)) { bi_m = bi; ret = 0; } if (ret) return (bi); } } if (bi_m != NULL) return (bi_m); /* If the executable has a brand, search for it in the brand list. */ for (i = 0; i < MAX_BRANDS; i++) { bi = elf_brand_list[i]; if (bi == NULL || bi->flags & BI_BRAND_NOTE_MANDATORY) continue; if (hdr->e_machine == bi->machine && (hdr->e_ident[EI_OSABI] == bi->brand || strncmp((const char *)&hdr->e_ident[OLD_EI_BRAND], bi->compat_3_brand, strlen(bi->compat_3_brand)) == 0)) { /* Looks good, but give brand a chance to veto */ if (!bi->header_supported || bi->header_supported(imgp)) return (bi); } } /* No known brand, see if the header is recognized by any brand */ for (i = 0; i < MAX_BRANDS; i++) { bi = elf_brand_list[i]; if (bi == NULL || bi->flags & BI_BRAND_NOTE_MANDATORY || bi->header_supported == NULL) continue; if (hdr->e_machine == bi->machine) { ret = bi->header_supported(imgp); if (ret) return (bi); } } /* Lacking a known brand, search for a recognized interpreter. */ if (interp != NULL) { for (i = 0; i < MAX_BRANDS; i++) { bi = elf_brand_list[i]; if (bi == NULL || bi->flags & BI_BRAND_NOTE_MANDATORY) continue; if (hdr->e_machine == bi->machine && /* ELF image p_filesz includes terminating zero */ strlen(bi->interp_path) + 1 == interp_name_len && strncmp(interp, bi->interp_path, interp_name_len) == 0) return (bi); } } /* Lacking a recognized interpreter, try the default brand */ for (i = 0; i < MAX_BRANDS; i++) { bi = elf_brand_list[i]; if (bi == NULL || bi->flags & BI_BRAND_NOTE_MANDATORY) continue; if (hdr->e_machine == bi->machine && __elfN(fallback_brand) == bi->brand) return (bi); } return (NULL); } static int __elfN(check_header)(const Elf_Ehdr *hdr) { Elf_Brandinfo *bi; int i; if (!IS_ELF(*hdr) || hdr->e_ident[EI_CLASS] != ELF_TARG_CLASS || hdr->e_ident[EI_DATA] != ELF_TARG_DATA || hdr->e_ident[EI_VERSION] != EV_CURRENT || hdr->e_phentsize != sizeof(Elf_Phdr) || hdr->e_version != ELF_TARG_VER) return (ENOEXEC); /* * Make sure we have at least one brand for this machine. */ for (i = 0; i < MAX_BRANDS; i++) { bi = elf_brand_list[i]; if (bi != NULL && bi->machine == hdr->e_machine) break; } if (i == MAX_BRANDS) return (ENOEXEC); return (0); } static int __elfN(map_partial)(vm_map_t map, vm_object_t object, vm_ooffset_t offset, vm_offset_t start, vm_offset_t end, vm_prot_t prot) { struct sf_buf *sf; int error; vm_offset_t off; /* * Create the page if it doesn't exist yet. Ignore errors. */ vm_map_lock(map); vm_map_insert(map, NULL, 0, trunc_page(start), round_page(end), VM_PROT_ALL, VM_PROT_ALL, 0); vm_map_unlock(map); /* * Find the page from the underlying object. */ if (object) { sf = vm_imgact_map_page(object, offset); if (sf == NULL) return (KERN_FAILURE); off = offset - trunc_page(offset); error = copyout((caddr_t)sf_buf_kva(sf) + off, (caddr_t)start, end - start); vm_imgact_unmap_page(sf); if (error) { return (KERN_FAILURE); } } return (KERN_SUCCESS); } static int -__elfN(map_insert)(vm_map_t map, vm_object_t object, vm_ooffset_t offset, - vm_offset_t start, vm_offset_t end, vm_prot_t prot, int cow) +__elfN(map_insert)(struct image_params *imgp, vm_map_t map, vm_object_t object, + vm_ooffset_t offset, vm_offset_t start, vm_offset_t end, vm_prot_t prot, + int cow) { struct sf_buf *sf; vm_offset_t off; vm_size_t sz; - int error, rv; + int error, locked, rv; if (start != trunc_page(start)) { rv = __elfN(map_partial)(map, object, offset, start, round_page(start), prot); if (rv) return (rv); offset += round_page(start) - start; start = round_page(start); } if (end != round_page(end)) { rv = __elfN(map_partial)(map, object, offset + trunc_page(end) - start, trunc_page(end), end, prot); if (rv) return (rv); end = trunc_page(end); } if (end > start) { if (offset & PAGE_MASK) { /* * The mapping is not page aligned. This means we have * to copy the data. Sigh. */ - rv = vm_map_find(map, NULL, 0, &start, end - start, 0, - VMFS_NO_SPACE, prot | VM_PROT_WRITE, VM_PROT_ALL, - 0); + vm_map_lock(map); + rv = vm_map_insert(map, NULL, 0, start, end, + prot | VM_PROT_WRITE, VM_PROT_ALL, 0); + vm_map_unlock(map); if (rv != KERN_SUCCESS) return (rv); if (object == NULL) return (KERN_SUCCESS); for (; start < end; start += sz) { sf = vm_imgact_map_page(object, offset); if (sf == NULL) return (KERN_FAILURE); off = offset - trunc_page(offset); sz = end - start; if (sz > PAGE_SIZE - off) sz = PAGE_SIZE - off; error = copyout((caddr_t)sf_buf_kva(sf) + off, (caddr_t)start, sz); vm_imgact_unmap_page(sf); if (error != 0) return (KERN_FAILURE); offset += sz; } rv = KERN_SUCCESS; } else { vm_object_reference(object); vm_map_lock(map); rv = vm_map_insert(map, object, offset, start, end, prot, VM_PROT_ALL, cow); vm_map_unlock(map); - if (rv != KERN_SUCCESS) + if (rv != KERN_SUCCESS) { + locked = VOP_ISLOCKED(imgp->vp); + VOP_UNLOCK(imgp->vp, 0); vm_object_deallocate(object); + vn_lock(imgp->vp, locked | LK_RETRY); + } } return (rv); } else { return (KERN_SUCCESS); } } static int __elfN(load_section)(struct image_params *imgp, vm_offset_t offset, caddr_t vmaddr, size_t memsz, size_t filsz, vm_prot_t prot, size_t pagesize) { struct sf_buf *sf; size_t map_len; vm_map_t map; vm_object_t object; vm_offset_t map_addr; int error, rv, cow; size_t copy_len; vm_offset_t file_addr; /* * It's necessary to fail if the filsz + offset taken from the * header is greater than the actual file pager object's size. * If we were to allow this, then the vm_map_find() below would * walk right off the end of the file object and into the ether. * * While I'm here, might as well check for something else that * is invalid: filsz cannot be greater than memsz. */ if ((off_t)filsz + offset > imgp->attr->va_size || filsz > memsz) { uprintf("elf_load_section: truncated ELF file\n"); return (ENOEXEC); } object = imgp->object; map = &imgp->proc->p_vmspace->vm_map; map_addr = trunc_page_ps((vm_offset_t)vmaddr, pagesize); file_addr = trunc_page_ps(offset, pagesize); /* * We have two choices. We can either clear the data in the last page * of an oversized mapping, or we can start the anon mapping a page * early and copy the initialized data into that first page. We * choose the second.. */ if (memsz > filsz) map_len = trunc_page_ps(offset + filsz, pagesize) - file_addr; else map_len = round_page_ps(offset + filsz, pagesize) - file_addr; if (map_len != 0) { /* cow flags: don't dump readonly sections in core */ cow = MAP_COPY_ON_WRITE | MAP_PREFAULT | (prot & VM_PROT_WRITE ? 0 : MAP_DISABLE_COREDUMP); - rv = __elfN(map_insert)(map, + rv = __elfN(map_insert)(imgp, map, object, file_addr, /* file offset */ map_addr, /* virtual start */ map_addr + map_len,/* virtual end */ prot, cow); if (rv != KERN_SUCCESS) return (EINVAL); /* we can stop now if we've covered it all */ if (memsz == filsz) { return (0); } } /* * We have to get the remaining bit of the file into the first part * of the oversized map segment. This is normally because the .data * segment in the file is extended to provide bss. It's a neat idea * to try and save a page, but it's a pain in the behind to implement. */ copy_len = (offset + filsz) - trunc_page_ps(offset + filsz, pagesize); map_addr = trunc_page_ps((vm_offset_t)vmaddr + filsz, pagesize); map_len = round_page_ps((vm_offset_t)vmaddr + memsz, pagesize) - map_addr; /* This had damn well better be true! */ if (map_len != 0) { - rv = __elfN(map_insert)(map, NULL, 0, map_addr, map_addr + - map_len, VM_PROT_ALL, 0); + rv = __elfN(map_insert)(imgp, map, NULL, 0, map_addr, + map_addr + map_len, VM_PROT_ALL, 0); if (rv != KERN_SUCCESS) { return (EINVAL); } } if (copy_len != 0) { vm_offset_t off; sf = vm_imgact_map_page(object, offset + filsz); if (sf == NULL) return (EIO); /* send the page fragment to user space */ off = trunc_page_ps(offset + filsz, pagesize) - trunc_page(offset + filsz); error = copyout((caddr_t)sf_buf_kva(sf) + off, (caddr_t)map_addr, copy_len); vm_imgact_unmap_page(sf); if (error) { return (error); } } /* * set it to the specified protection. * XXX had better undo the damage from pasting over the cracks here! */ vm_map_protect(map, trunc_page(map_addr), round_page(map_addr + map_len), prot, FALSE); return (0); } /* * Load the file "file" into memory. It may be either a shared object * or an executable. * * The "addr" reference parameter is in/out. On entry, it specifies * the address where a shared object should be loaded. If the file is * an executable, this value is ignored. On exit, "addr" specifies * where the file was actually loaded. * * The "entry" reference parameter is out only. On exit, it specifies * the entry point for the loaded file. */ static int __elfN(load_file)(struct proc *p, const char *file, u_long *addr, u_long *entry, size_t pagesize) { struct { struct nameidata nd; struct vattr attr; struct image_params image_params; } *tempdata; const Elf_Ehdr *hdr = NULL; const Elf_Phdr *phdr = NULL; struct nameidata *nd; struct vattr *attr; struct image_params *imgp; vm_prot_t prot; u_long rbase; u_long base_addr = 0; int error, i, numsegs; #ifdef CAPABILITY_MODE /* * XXXJA: This check can go away once we are sufficiently confident * that the checks in namei() are correct. */ if (IN_CAPABILITY_MODE(curthread)) return (ECAPMODE); #endif tempdata = malloc(sizeof(*tempdata), M_TEMP, M_WAITOK); nd = &tempdata->nd; attr = &tempdata->attr; imgp = &tempdata->image_params; /* * Initialize part of the common data */ imgp->proc = p; imgp->attr = attr; imgp->firstpage = NULL; imgp->image_header = NULL; imgp->object = NULL; imgp->execlabel = NULL; NDINIT(nd, LOOKUP, LOCKLEAF | FOLLOW, UIO_SYSSPACE, file, curthread); if ((error = namei(nd)) != 0) { nd->ni_vp = NULL; goto fail; } NDFREE(nd, NDF_ONLY_PNBUF); imgp->vp = nd->ni_vp; /* * Check permissions, modes, uid, etc on the file, and "open" it. */ error = exec_check_permissions(imgp); if (error) goto fail; error = exec_map_first_page(imgp); if (error) goto fail; /* * Also make certain that the interpreter stays the same, so set * its VV_TEXT flag, too. */ VOP_SET_TEXT(nd->ni_vp); imgp->object = nd->ni_vp->v_object; hdr = (const Elf_Ehdr *)imgp->image_header; if ((error = __elfN(check_header)(hdr)) != 0) goto fail; if (hdr->e_type == ET_DYN) rbase = *addr; else if (hdr->e_type == ET_EXEC) rbase = 0; else { error = ENOEXEC; goto fail; } /* Only support headers that fit within first page for now */ if ((hdr->e_phoff > PAGE_SIZE) || (u_int)hdr->e_phentsize * hdr->e_phnum > PAGE_SIZE - hdr->e_phoff) { error = ENOEXEC; goto fail; } phdr = (const Elf_Phdr *)(imgp->image_header + hdr->e_phoff); if (!aligned(phdr, Elf_Addr)) { error = ENOEXEC; goto fail; } for (i = 0, numsegs = 0; i < hdr->e_phnum; i++) { if (phdr[i].p_type == PT_LOAD && phdr[i].p_memsz != 0) { /* Loadable segment */ prot = __elfN(trans_prot)(phdr[i].p_flags); error = __elfN(load_section)(imgp, phdr[i].p_offset, (caddr_t)(uintptr_t)phdr[i].p_vaddr + rbase, phdr[i].p_memsz, phdr[i].p_filesz, prot, pagesize); if (error != 0) goto fail; /* * Establish the base address if this is the * first segment. */ if (numsegs == 0) base_addr = trunc_page(phdr[i].p_vaddr + rbase); numsegs++; } } *addr = base_addr; *entry = (unsigned long)hdr->e_entry + rbase; fail: if (imgp->firstpage) exec_unmap_first_page(imgp); if (nd->ni_vp) vput(nd->ni_vp); free(tempdata, M_TEMP); return (error); } static int __CONCAT(exec_, __elfN(imgact))(struct image_params *imgp) { struct thread *td; const Elf_Ehdr *hdr; const Elf_Phdr *phdr; Elf_Auxargs *elf_auxargs; struct vmspace *vmspace; const char *err_str, *newinterp; char *interp, *interp_buf, *path; Elf_Brandinfo *brand_info; struct sysentvec *sv; vm_prot_t prot; u_long text_size, data_size, total_size, text_addr, data_addr; u_long seg_size, seg_addr, addr, baddr, et_dyn_addr, entry, proghdr; int32_t osrel; int error, i, n, interp_name_len, have_interp; hdr = (const Elf_Ehdr *)imgp->image_header; /* * Do we have a valid ELF header ? * * Only allow ET_EXEC & ET_DYN here, reject ET_DYN later * if particular brand doesn't support it. */ if (__elfN(check_header)(hdr) != 0 || (hdr->e_type != ET_EXEC && hdr->e_type != ET_DYN)) return (-1); /* * From here on down, we return an errno, not -1, as we've * detected an ELF file. */ if ((hdr->e_phoff > PAGE_SIZE) || (u_int)hdr->e_phentsize * hdr->e_phnum > PAGE_SIZE - hdr->e_phoff) { /* Only support headers in first page for now */ uprintf("Program headers not in the first page\n"); return (ENOEXEC); } phdr = (const Elf_Phdr *)(imgp->image_header + hdr->e_phoff); if (!aligned(phdr, Elf_Addr)) { uprintf("Unaligned program headers\n"); return (ENOEXEC); } n = error = 0; baddr = 0; osrel = 0; text_size = data_size = total_size = text_addr = data_addr = 0; entry = proghdr = 0; interp_name_len = 0; err_str = newinterp = NULL; interp = interp_buf = NULL; td = curthread; for (i = 0; i < hdr->e_phnum; i++) { switch (phdr[i].p_type) { case PT_LOAD: if (n == 0) baddr = phdr[i].p_vaddr; n++; break; case PT_INTERP: /* Path to interpreter */ if (phdr[i].p_filesz > MAXPATHLEN) { uprintf("Invalid PT_INTERP\n"); error = ENOEXEC; goto ret; } if (interp != NULL) { uprintf("Multiple PT_INTERP headers\n"); error = ENOEXEC; goto ret; } interp_name_len = phdr[i].p_filesz; if (phdr[i].p_offset > PAGE_SIZE || interp_name_len > PAGE_SIZE - phdr[i].p_offset) { VOP_UNLOCK(imgp->vp, 0); interp_buf = malloc(interp_name_len + 1, M_TEMP, M_WAITOK); vn_lock(imgp->vp, LK_EXCLUSIVE | LK_RETRY); error = vn_rdwr(UIO_READ, imgp->vp, interp_buf, interp_name_len, phdr[i].p_offset, UIO_SYSSPACE, IO_NODELOCKED, td->td_ucred, NOCRED, NULL, td); if (error != 0) { uprintf("i/o error PT_INTERP\n"); goto ret; } interp_buf[interp_name_len] = '\0'; interp = interp_buf; } else { interp = __DECONST(char *, imgp->image_header) + phdr[i].p_offset; } break; case PT_GNU_STACK: if (__elfN(nxstack)) imgp->stack_prot = __elfN(trans_prot)(phdr[i].p_flags); imgp->stack_sz = phdr[i].p_memsz; break; } } brand_info = __elfN(get_brandinfo)(imgp, interp, interp_name_len, &osrel); if (brand_info == NULL) { uprintf("ELF binary type \"%u\" not known.\n", hdr->e_ident[EI_OSABI]); error = ENOEXEC; goto ret; } et_dyn_addr = 0; if (hdr->e_type == ET_DYN) { if ((brand_info->flags & BI_CAN_EXEC_DYN) == 0) { uprintf("Cannot execute shared object\n"); error = ENOEXEC; goto ret; } /* * Honour the base load address from the dso if it is * non-zero for some reason. */ if (baddr == 0) et_dyn_addr = ET_DYN_LOAD_ADDR; } sv = brand_info->sysvec; if (interp != NULL && brand_info->interp_newpath != NULL) newinterp = brand_info->interp_newpath; /* * Avoid a possible deadlock if the current address space is destroyed * and that address space maps the locked vnode. In the common case, * the locked vnode's v_usecount is decremented but remains greater * than zero. Consequently, the vnode lock is not needed by vrele(). * However, in cases where the vnode lock is external, such as nullfs, * v_usecount may become zero. * * The VV_TEXT flag prevents modifications to the executable while * the vnode is unlocked. */ VOP_UNLOCK(imgp->vp, 0); error = exec_new_vmspace(imgp, sv); imgp->proc->p_sysent = sv; vn_lock(imgp->vp, LK_EXCLUSIVE | LK_RETRY); if (error != 0) goto ret; for (i = 0; i < hdr->e_phnum; i++) { switch (phdr[i].p_type) { case PT_LOAD: /* Loadable segment */ if (phdr[i].p_memsz == 0) break; prot = __elfN(trans_prot)(phdr[i].p_flags); error = __elfN(load_section)(imgp, phdr[i].p_offset, (caddr_t)(uintptr_t)phdr[i].p_vaddr + et_dyn_addr, phdr[i].p_memsz, phdr[i].p_filesz, prot, sv->sv_pagesize); if (error != 0) goto ret; /* * If this segment contains the program headers, * remember their virtual address for the AT_PHDR * aux entry. Static binaries don't usually include * a PT_PHDR entry. */ if (phdr[i].p_offset == 0 && hdr->e_phoff + hdr->e_phnum * hdr->e_phentsize <= phdr[i].p_filesz) proghdr = phdr[i].p_vaddr + hdr->e_phoff + et_dyn_addr; seg_addr = trunc_page(phdr[i].p_vaddr + et_dyn_addr); seg_size = round_page(phdr[i].p_memsz + phdr[i].p_vaddr + et_dyn_addr - seg_addr); /* * Make the largest executable segment the official * text segment and all others data. * * Note that obreak() assumes that data_addr + * data_size == end of data load area, and the ELF * file format expects segments to be sorted by * address. If multiple data segments exist, the * last one will be used. */ if (phdr[i].p_flags & PF_X && text_size < seg_size) { text_size = seg_size; text_addr = seg_addr; } else { data_size = seg_size; data_addr = seg_addr; } total_size += seg_size; break; case PT_PHDR: /* Program header table info */ proghdr = phdr[i].p_vaddr + et_dyn_addr; break; default: break; } } if (data_addr == 0 && data_size == 0) { data_addr = text_addr; data_size = text_size; } entry = (u_long)hdr->e_entry + et_dyn_addr; /* * Check limits. It should be safe to check the * limits after loading the segments since we do * not actually fault in all the segments pages. */ PROC_LOCK(imgp->proc); if (data_size > lim_cur_proc(imgp->proc, RLIMIT_DATA)) err_str = "Data segment size exceeds process limit"; else if (text_size > maxtsiz) err_str = "Text segment size exceeds system limit"; else if (total_size > lim_cur_proc(imgp->proc, RLIMIT_VMEM)) err_str = "Total segment size exceeds process limit"; else if (racct_set(imgp->proc, RACCT_DATA, data_size) != 0) err_str = "Data segment size exceeds resource limit"; else if (racct_set(imgp->proc, RACCT_VMEM, total_size) != 0) err_str = "Total segment size exceeds resource limit"; if (err_str != NULL) { PROC_UNLOCK(imgp->proc); uprintf("%s\n", err_str); error = ENOMEM; goto ret; } vmspace = imgp->proc->p_vmspace; vmspace->vm_tsize = text_size >> PAGE_SHIFT; vmspace->vm_taddr = (caddr_t)(uintptr_t)text_addr; vmspace->vm_dsize = data_size >> PAGE_SHIFT; vmspace->vm_daddr = (caddr_t)(uintptr_t)data_addr; /* * We load the dynamic linker where a userland call * to mmap(0, ...) would put it. The rationale behind this * calculation is that it leaves room for the heap to grow to * its maximum allowed size. */ addr = round_page((vm_offset_t)vmspace->vm_daddr + lim_max(td, RLIMIT_DATA)); PROC_UNLOCK(imgp->proc); imgp->entry_addr = entry; if (interp != NULL) { have_interp = FALSE; VOP_UNLOCK(imgp->vp, 0); if (brand_info->emul_path != NULL && brand_info->emul_path[0] != '\0') { path = malloc(MAXPATHLEN, M_TEMP, M_WAITOK); snprintf(path, MAXPATHLEN, "%s%s", brand_info->emul_path, interp); error = __elfN(load_file)(imgp->proc, path, &addr, &imgp->entry_addr, sv->sv_pagesize); free(path, M_TEMP); if (error == 0) have_interp = TRUE; } if (!have_interp && newinterp != NULL && (brand_info->interp_path == NULL || strcmp(interp, brand_info->interp_path) == 0)) { error = __elfN(load_file)(imgp->proc, newinterp, &addr, &imgp->entry_addr, sv->sv_pagesize); if (error == 0) have_interp = TRUE; } if (!have_interp) { error = __elfN(load_file)(imgp->proc, interp, &addr, &imgp->entry_addr, sv->sv_pagesize); } vn_lock(imgp->vp, LK_EXCLUSIVE | LK_RETRY); if (error != 0) { uprintf("ELF interpreter %s not found, error %d\n", interp, error); goto ret; } } else addr = et_dyn_addr; /* * Construct auxargs table (used by the fixup routine) */ elf_auxargs = malloc(sizeof(Elf_Auxargs), M_TEMP, M_WAITOK); elf_auxargs->execfd = -1; elf_auxargs->phdr = proghdr; elf_auxargs->phent = hdr->e_phentsize; elf_auxargs->phnum = hdr->e_phnum; elf_auxargs->pagesz = PAGE_SIZE; elf_auxargs->base = addr; elf_auxargs->flags = 0; elf_auxargs->entry = entry; elf_auxargs->hdr_eflags = hdr->e_flags; imgp->auxargs = elf_auxargs; imgp->interpreted = 0; imgp->reloc_base = addr; imgp->proc->p_osrel = osrel; imgp->proc->p_elf_machine = hdr->e_machine; imgp->proc->p_elf_flags = hdr->e_flags; ret: free(interp_buf, M_TEMP); return (error); } #define suword __CONCAT(suword, __ELF_WORD_SIZE) int __elfN(freebsd_fixup)(register_t **stack_base, struct image_params *imgp) { Elf_Auxargs *args = (Elf_Auxargs *)imgp->auxargs; Elf_Addr *base; Elf_Addr *pos; base = (Elf_Addr *)*stack_base; pos = base + (imgp->args->argc + imgp->args->envc + 2); if (args->execfd != -1) AUXARGS_ENTRY(pos, AT_EXECFD, args->execfd); AUXARGS_ENTRY(pos, AT_PHDR, args->phdr); AUXARGS_ENTRY(pos, AT_PHENT, args->phent); AUXARGS_ENTRY(pos, AT_PHNUM, args->phnum); AUXARGS_ENTRY(pos, AT_PAGESZ, args->pagesz); AUXARGS_ENTRY(pos, AT_FLAGS, args->flags); AUXARGS_ENTRY(pos, AT_ENTRY, args->entry); AUXARGS_ENTRY(pos, AT_BASE, args->base); #ifdef AT_EHDRFLAGS AUXARGS_ENTRY(pos, AT_EHDRFLAGS, args->hdr_eflags); #endif if (imgp->execpathp != 0) AUXARGS_ENTRY(pos, AT_EXECPATH, imgp->execpathp); AUXARGS_ENTRY(pos, AT_OSRELDATE, imgp->proc->p_ucred->cr_prison->pr_osreldate); if (imgp->canary != 0) { AUXARGS_ENTRY(pos, AT_CANARY, imgp->canary); AUXARGS_ENTRY(pos, AT_CANARYLEN, imgp->canarylen); } AUXARGS_ENTRY(pos, AT_NCPUS, mp_ncpus); if (imgp->pagesizes != 0) { AUXARGS_ENTRY(pos, AT_PAGESIZES, imgp->pagesizes); AUXARGS_ENTRY(pos, AT_PAGESIZESLEN, imgp->pagesizeslen); } if (imgp->sysent->sv_timekeep_base != 0) { AUXARGS_ENTRY(pos, AT_TIMEKEEP, imgp->sysent->sv_timekeep_base); } AUXARGS_ENTRY(pos, AT_STACKPROT, imgp->sysent->sv_shared_page_obj != NULL && imgp->stack_prot != 0 ? imgp->stack_prot : imgp->sysent->sv_stackprot); AUXARGS_ENTRY(pos, AT_NULL, 0); free(imgp->auxargs, M_TEMP); imgp->auxargs = NULL; base--; suword(base, (long)imgp->args->argc); *stack_base = (register_t *)base; return (0); } /* * Code for generating ELF core dumps. */ typedef void (*segment_callback)(vm_map_entry_t, void *); /* Closure for cb_put_phdr(). */ struct phdr_closure { Elf_Phdr *phdr; /* Program header to fill in */ Elf_Off offset; /* Offset of segment in core file */ }; /* Closure for cb_size_segment(). */ struct sseg_closure { int count; /* Count of writable segments. */ size_t size; /* Total size of all writable segments. */ }; typedef void (*outfunc_t)(void *, struct sbuf *, size_t *); struct note_info { int type; /* Note type. */ outfunc_t outfunc; /* Output function. */ void *outarg; /* Argument for the output function. */ size_t outsize; /* Output size. */ TAILQ_ENTRY(note_info) link; /* Link to the next note info. */ }; TAILQ_HEAD(note_info_list, note_info); /* Coredump output parameters. */ struct coredump_params { off_t offset; struct ucred *active_cred; struct ucred *file_cred; struct thread *td; struct vnode *vp; struct gzio_stream *gzs; }; static void cb_put_phdr(vm_map_entry_t, void *); static void cb_size_segment(vm_map_entry_t, void *); static int core_write(struct coredump_params *, const void *, size_t, off_t, enum uio_seg); static void each_dumpable_segment(struct thread *, segment_callback, void *); static int __elfN(corehdr)(struct coredump_params *, int, void *, size_t, struct note_info_list *, size_t); static void __elfN(prepare_notes)(struct thread *, struct note_info_list *, size_t *); static void __elfN(puthdr)(struct thread *, void *, size_t, int, size_t); static void __elfN(putnote)(struct note_info *, struct sbuf *); static size_t register_note(struct note_info_list *, int, outfunc_t, void *); static int sbuf_drain_core_output(void *, const char *, int); static int sbuf_drain_count(void *arg, const char *data, int len); static void __elfN(note_fpregset)(void *, struct sbuf *, size_t *); static void __elfN(note_prpsinfo)(void *, struct sbuf *, size_t *); static void __elfN(note_prstatus)(void *, struct sbuf *, size_t *); static void __elfN(note_threadmd)(void *, struct sbuf *, size_t *); static void __elfN(note_thrmisc)(void *, struct sbuf *, size_t *); static void __elfN(note_procstat_auxv)(void *, struct sbuf *, size_t *); static void __elfN(note_procstat_proc)(void *, struct sbuf *, size_t *); static void __elfN(note_procstat_psstrings)(void *, struct sbuf *, size_t *); static void note_procstat_files(void *, struct sbuf *, size_t *); static void note_procstat_groups(void *, struct sbuf *, size_t *); static void note_procstat_osrel(void *, struct sbuf *, size_t *); static void note_procstat_rlimit(void *, struct sbuf *, size_t *); static void note_procstat_umask(void *, struct sbuf *, size_t *); static void note_procstat_vmmap(void *, struct sbuf *, size_t *); #ifdef GZIO extern int compress_user_cores_gzlevel; /* * Write out a core segment to the compression stream. */ static int compress_chunk(struct coredump_params *p, char *base, char *buf, u_int len) { u_int chunk_len; int error; while (len > 0) { chunk_len = MIN(len, CORE_BUF_SIZE); /* * We can get EFAULT error here. * In that case zero out the current chunk of the segment. */ error = copyin(base, buf, chunk_len); if (error != 0) bzero(buf, chunk_len); error = gzio_write(p->gzs, buf, chunk_len); if (error != 0) break; base += chunk_len; len -= chunk_len; } return (error); } static int core_gz_write(void *base, size_t len, off_t offset, void *arg) { return (core_write((struct coredump_params *)arg, base, len, offset, UIO_SYSSPACE)); } #endif /* GZIO */ static int core_write(struct coredump_params *p, const void *base, size_t len, off_t offset, enum uio_seg seg) { return (vn_rdwr_inchunks(UIO_WRITE, p->vp, __DECONST(void *, base), len, offset, seg, IO_UNIT | IO_DIRECT | IO_RANGELOCKED, p->active_cred, p->file_cred, NULL, p->td)); } static int core_output(void *base, size_t len, off_t offset, struct coredump_params *p, void *tmpbuf) { int error; #ifdef GZIO if (p->gzs != NULL) return (compress_chunk(p, base, tmpbuf, len)); #endif /* * EFAULT is a non-fatal error that we can get, for example, * if the segment is backed by a file but extends beyond its * end. */ error = core_write(p, base, len, offset, UIO_USERSPACE); if (error == EFAULT) { log(LOG_WARNING, "Failed to fully fault in a core file segment " "at VA %p with size 0x%zx to be written at offset 0x%jx " "for process %s\n", base, len, offset, curproc->p_comm); /* * Write a "real" zero byte at the end of the target region * in the case this is the last segment. * The intermediate space will be implicitly zero-filled. */ error = core_write(p, zero_region, 1, offset + len - 1, UIO_SYSSPACE); } return (error); } /* * Drain into a core file. */ static int sbuf_drain_core_output(void *arg, const char *data, int len) { struct coredump_params *p; int error, locked; p = (struct coredump_params *)arg; /* * Some kern_proc out routines that print to this sbuf may * call us with the process lock held. Draining with the * non-sleepable lock held is unsafe. The lock is needed for * those routines when dumping a live process. In our case we * can safely release the lock before draining and acquire * again after. */ locked = PROC_LOCKED(p->td->td_proc); if (locked) PROC_UNLOCK(p->td->td_proc); #ifdef GZIO if (p->gzs != NULL) error = gzio_write(p->gzs, __DECONST(char *, data), len); else #endif error = core_write(p, __DECONST(void *, data), len, p->offset, UIO_SYSSPACE); if (locked) PROC_LOCK(p->td->td_proc); if (error != 0) return (-error); p->offset += len; return (len); } /* * Drain into a counter. */ static int sbuf_drain_count(void *arg, const char *data __unused, int len) { size_t *sizep; sizep = (size_t *)arg; *sizep += len; return (len); } int __elfN(coredump)(struct thread *td, struct vnode *vp, off_t limit, int flags) { struct ucred *cred = td->td_ucred; int error = 0; struct sseg_closure seginfo; struct note_info_list notelst; struct coredump_params params; struct note_info *ninfo; void *hdr, *tmpbuf; size_t hdrsize, notesz, coresize; #ifdef GZIO boolean_t compress; compress = (flags & IMGACT_CORE_COMPRESS) != 0; #endif hdr = NULL; tmpbuf = NULL; TAILQ_INIT(¬elst); /* Size the program segments. */ seginfo.count = 0; seginfo.size = 0; each_dumpable_segment(td, cb_size_segment, &seginfo); /* * Collect info about the core file header area. */ hdrsize = sizeof(Elf_Ehdr) + sizeof(Elf_Phdr) * (1 + seginfo.count); if (seginfo.count + 1 >= PN_XNUM) hdrsize += sizeof(Elf_Shdr); __elfN(prepare_notes)(td, ¬elst, ¬esz); coresize = round_page(hdrsize + notesz) + seginfo.size; /* Set up core dump parameters. */ params.offset = 0; params.active_cred = cred; params.file_cred = NOCRED; params.td = td; params.vp = vp; params.gzs = NULL; #ifdef RACCT if (racct_enable) { PROC_LOCK(td->td_proc); error = racct_add(td->td_proc, RACCT_CORE, coresize); PROC_UNLOCK(td->td_proc); if (error != 0) { error = EFAULT; goto done; } } #endif if (coresize >= limit) { error = EFAULT; goto done; } #ifdef GZIO /* Create a compression stream if necessary. */ if (compress) { params.gzs = gzio_init(core_gz_write, GZIO_DEFLATE, CORE_BUF_SIZE, compress_user_cores_gzlevel, ¶ms); if (params.gzs == NULL) { error = EFAULT; goto done; } tmpbuf = malloc(CORE_BUF_SIZE, M_TEMP, M_WAITOK | M_ZERO); } #endif /* * Allocate memory for building the header, fill it up, * and write it out following the notes. */ hdr = malloc(hdrsize, M_TEMP, M_WAITOK); error = __elfN(corehdr)(¶ms, seginfo.count, hdr, hdrsize, ¬elst, notesz); /* Write the contents of all of the writable segments. */ if (error == 0) { Elf_Phdr *php; off_t offset; int i; php = (Elf_Phdr *)((char *)hdr + sizeof(Elf_Ehdr)) + 1; offset = round_page(hdrsize + notesz); for (i = 0; i < seginfo.count; i++) { error = core_output((caddr_t)(uintptr_t)php->p_vaddr, php->p_filesz, offset, ¶ms, tmpbuf); if (error != 0) break; offset += php->p_filesz; php++; } #ifdef GZIO if (error == 0 && compress) error = gzio_flush(params.gzs); #endif } if (error) { log(LOG_WARNING, "Failed to write core file for process %s (error %d)\n", curproc->p_comm, error); } done: #ifdef GZIO if (compress) { free(tmpbuf, M_TEMP); if (params.gzs != NULL) gzio_fini(params.gzs); } #endif while ((ninfo = TAILQ_FIRST(¬elst)) != NULL) { TAILQ_REMOVE(¬elst, ninfo, link); free(ninfo, M_TEMP); } if (hdr != NULL) free(hdr, M_TEMP); return (error); } /* * A callback for each_dumpable_segment() to write out the segment's * program header entry. */ static void cb_put_phdr(entry, closure) vm_map_entry_t entry; void *closure; { struct phdr_closure *phc = (struct phdr_closure *)closure; Elf_Phdr *phdr = phc->phdr; phc->offset = round_page(phc->offset); phdr->p_type = PT_LOAD; phdr->p_offset = phc->offset; phdr->p_vaddr = entry->start; phdr->p_paddr = 0; phdr->p_filesz = phdr->p_memsz = entry->end - entry->start; phdr->p_align = PAGE_SIZE; phdr->p_flags = __elfN(untrans_prot)(entry->protection); phc->offset += phdr->p_filesz; phc->phdr++; } /* * A callback for each_dumpable_segment() to gather information about * the number of segments and their total size. */ static void cb_size_segment(vm_map_entry_t entry, void *closure) { struct sseg_closure *ssc = (struct sseg_closure *)closure; ssc->count++; ssc->size += entry->end - entry->start; } /* * For each writable segment in the process's memory map, call the given * function with a pointer to the map entry and some arbitrary * caller-supplied data. */ static void each_dumpable_segment(struct thread *td, segment_callback func, void *closure) { struct proc *p = td->td_proc; vm_map_t map = &p->p_vmspace->vm_map; vm_map_entry_t entry; vm_object_t backing_object, object; boolean_t ignore_entry; vm_map_lock_read(map); for (entry = map->header.next; entry != &map->header; entry = entry->next) { /* * Don't dump inaccessible mappings, deal with legacy * coredump mode. * * Note that read-only segments related to the elf binary * are marked MAP_ENTRY_NOCOREDUMP now so we no longer * need to arbitrarily ignore such segments. */ if (elf_legacy_coredump) { if ((entry->protection & VM_PROT_RW) != VM_PROT_RW) continue; } else { if ((entry->protection & VM_PROT_ALL) == 0) continue; } /* * Dont include memory segment in the coredump if * MAP_NOCORE is set in mmap(2) or MADV_NOCORE in * madvise(2). Do not dump submaps (i.e. parts of the * kernel map). */ if (entry->eflags & (MAP_ENTRY_NOCOREDUMP|MAP_ENTRY_IS_SUB_MAP)) continue; if ((object = entry->object.vm_object) == NULL) continue; /* Ignore memory-mapped devices and such things. */ VM_OBJECT_RLOCK(object); while ((backing_object = object->backing_object) != NULL) { VM_OBJECT_RLOCK(backing_object); VM_OBJECT_RUNLOCK(object); object = backing_object; } ignore_entry = object->type != OBJT_DEFAULT && object->type != OBJT_SWAP && object->type != OBJT_VNODE && object->type != OBJT_PHYS; VM_OBJECT_RUNLOCK(object); if (ignore_entry) continue; (*func)(entry, closure); } vm_map_unlock_read(map); } /* * Write the core file header to the file, including padding up to * the page boundary. */ static int __elfN(corehdr)(struct coredump_params *p, int numsegs, void *hdr, size_t hdrsize, struct note_info_list *notelst, size_t notesz) { struct note_info *ninfo; struct sbuf *sb; int error; /* Fill in the header. */ bzero(hdr, hdrsize); __elfN(puthdr)(p->td, hdr, hdrsize, numsegs, notesz); sb = sbuf_new(NULL, NULL, CORE_BUF_SIZE, SBUF_FIXEDLEN); sbuf_set_drain(sb, sbuf_drain_core_output, p); sbuf_start_section(sb, NULL); sbuf_bcat(sb, hdr, hdrsize); TAILQ_FOREACH(ninfo, notelst, link) __elfN(putnote)(ninfo, sb); /* Align up to a page boundary for the program segments. */ sbuf_end_section(sb, -1, PAGE_SIZE, 0); error = sbuf_finish(sb); sbuf_delete(sb); return (error); } static void __elfN(prepare_notes)(struct thread *td, struct note_info_list *list, size_t *sizep) { struct proc *p; struct thread *thr; size_t size; p = td->td_proc; size = 0; size += register_note(list, NT_PRPSINFO, __elfN(note_prpsinfo), p); /* * To have the debugger select the right thread (LWP) as the initial * thread, we dump the state of the thread passed to us in td first. * This is the thread that causes the core dump and thus likely to * be the right thread one wants to have selected in the debugger. */ thr = td; while (thr != NULL) { size += register_note(list, NT_PRSTATUS, __elfN(note_prstatus), thr); size += register_note(list, NT_FPREGSET, __elfN(note_fpregset), thr); size += register_note(list, NT_THRMISC, __elfN(note_thrmisc), thr); size += register_note(list, -1, __elfN(note_threadmd), thr); thr = (thr == td) ? TAILQ_FIRST(&p->p_threads) : TAILQ_NEXT(thr, td_plist); if (thr == td) thr = TAILQ_NEXT(thr, td_plist); } size += register_note(list, NT_PROCSTAT_PROC, __elfN(note_procstat_proc), p); size += register_note(list, NT_PROCSTAT_FILES, note_procstat_files, p); size += register_note(list, NT_PROCSTAT_VMMAP, note_procstat_vmmap, p); size += register_note(list, NT_PROCSTAT_GROUPS, note_procstat_groups, p); size += register_note(list, NT_PROCSTAT_UMASK, note_procstat_umask, p); size += register_note(list, NT_PROCSTAT_RLIMIT, note_procstat_rlimit, p); size += register_note(list, NT_PROCSTAT_OSREL, note_procstat_osrel, p); size += register_note(list, NT_PROCSTAT_PSSTRINGS, __elfN(note_procstat_psstrings), p); size += register_note(list, NT_PROCSTAT_AUXV, __elfN(note_procstat_auxv), p); *sizep = size; } static void __elfN(puthdr)(struct thread *td, void *hdr, size_t hdrsize, int numsegs, size_t notesz) { Elf_Ehdr *ehdr; Elf_Phdr *phdr; Elf_Shdr *shdr; struct phdr_closure phc; ehdr = (Elf_Ehdr *)hdr; ehdr->e_ident[EI_MAG0] = ELFMAG0; ehdr->e_ident[EI_MAG1] = ELFMAG1; ehdr->e_ident[EI_MAG2] = ELFMAG2; ehdr->e_ident[EI_MAG3] = ELFMAG3; ehdr->e_ident[EI_CLASS] = ELF_CLASS; ehdr->e_ident[EI_DATA] = ELF_DATA; ehdr->e_ident[EI_VERSION] = EV_CURRENT; ehdr->e_ident[EI_OSABI] = ELFOSABI_FREEBSD; ehdr->e_ident[EI_ABIVERSION] = 0; ehdr->e_ident[EI_PAD] = 0; ehdr->e_type = ET_CORE; ehdr->e_machine = td->td_proc->p_elf_machine; ehdr->e_version = EV_CURRENT; ehdr->e_entry = 0; ehdr->e_phoff = sizeof(Elf_Ehdr); ehdr->e_flags = td->td_proc->p_elf_flags; ehdr->e_ehsize = sizeof(Elf_Ehdr); ehdr->e_phentsize = sizeof(Elf_Phdr); ehdr->e_shentsize = sizeof(Elf_Shdr); ehdr->e_shstrndx = SHN_UNDEF; if (numsegs + 1 < PN_XNUM) { ehdr->e_phnum = numsegs + 1; ehdr->e_shnum = 0; } else { ehdr->e_phnum = PN_XNUM; ehdr->e_shnum = 1; ehdr->e_shoff = ehdr->e_phoff + (numsegs + 1) * ehdr->e_phentsize; KASSERT(ehdr->e_shoff == hdrsize - sizeof(Elf_Shdr), ("e_shoff: %zu, hdrsize - shdr: %zu", (size_t)ehdr->e_shoff, hdrsize - sizeof(Elf_Shdr))); shdr = (Elf_Shdr *)((char *)hdr + ehdr->e_shoff); memset(shdr, 0, sizeof(*shdr)); /* * A special first section is used to hold large segment and * section counts. This was proposed by Sun Microsystems in * Solaris and has been adopted by Linux; the standard ELF * tools are already familiar with the technique. * * See table 7-7 of the Solaris "Linker and Libraries Guide" * (or 12-7 depending on the version of the document) for more * details. */ shdr->sh_type = SHT_NULL; shdr->sh_size = ehdr->e_shnum; shdr->sh_link = ehdr->e_shstrndx; shdr->sh_info = numsegs + 1; } /* * Fill in the program header entries. */ phdr = (Elf_Phdr *)((char *)hdr + ehdr->e_phoff); /* The note segement. */ phdr->p_type = PT_NOTE; phdr->p_offset = hdrsize; phdr->p_vaddr = 0; phdr->p_paddr = 0; phdr->p_filesz = notesz; phdr->p_memsz = 0; phdr->p_flags = PF_R; phdr->p_align = ELF_NOTE_ROUNDSIZE; phdr++; /* All the writable segments from the program. */ phc.phdr = phdr; phc.offset = round_page(hdrsize + notesz); each_dumpable_segment(td, cb_put_phdr, &phc); } static size_t register_note(struct note_info_list *list, int type, outfunc_t out, void *arg) { struct note_info *ninfo; size_t size, notesize; size = 0; out(arg, NULL, &size); ninfo = malloc(sizeof(*ninfo), M_TEMP, M_ZERO | M_WAITOK); ninfo->type = type; ninfo->outfunc = out; ninfo->outarg = arg; ninfo->outsize = size; TAILQ_INSERT_TAIL(list, ninfo, link); if (type == -1) return (size); notesize = sizeof(Elf_Note) + /* note header */ roundup2(sizeof(FREEBSD_ABI_VENDOR), ELF_NOTE_ROUNDSIZE) + /* note name */ roundup2(size, ELF_NOTE_ROUNDSIZE); /* note description */ return (notesize); } static size_t append_note_data(const void *src, void *dst, size_t len) { size_t padded_len; padded_len = roundup2(len, ELF_NOTE_ROUNDSIZE); if (dst != NULL) { bcopy(src, dst, len); bzero((char *)dst + len, padded_len - len); } return (padded_len); } size_t __elfN(populate_note)(int type, void *src, void *dst, size_t size, void **descp) { Elf_Note *note; char *buf; size_t notesize; buf = dst; if (buf != NULL) { note = (Elf_Note *)buf; note->n_namesz = sizeof(FREEBSD_ABI_VENDOR); note->n_descsz = size; note->n_type = type; buf += sizeof(*note); buf += append_note_data(FREEBSD_ABI_VENDOR, buf, sizeof(FREEBSD_ABI_VENDOR)); append_note_data(src, buf, size); if (descp != NULL) *descp = buf; } notesize = sizeof(Elf_Note) + /* note header */ roundup2(sizeof(FREEBSD_ABI_VENDOR), ELF_NOTE_ROUNDSIZE) + /* note name */ roundup2(size, ELF_NOTE_ROUNDSIZE); /* note description */ return (notesize); } static void __elfN(putnote)(struct note_info *ninfo, struct sbuf *sb) { Elf_Note note; ssize_t old_len, sect_len; size_t new_len, descsz, i; if (ninfo->type == -1) { ninfo->outfunc(ninfo->outarg, sb, &ninfo->outsize); return; } note.n_namesz = sizeof(FREEBSD_ABI_VENDOR); note.n_descsz = ninfo->outsize; note.n_type = ninfo->type; sbuf_bcat(sb, ¬e, sizeof(note)); sbuf_start_section(sb, &old_len); sbuf_bcat(sb, FREEBSD_ABI_VENDOR, sizeof(FREEBSD_ABI_VENDOR)); sbuf_end_section(sb, old_len, ELF_NOTE_ROUNDSIZE, 0); if (note.n_descsz == 0) return; sbuf_start_section(sb, &old_len); ninfo->outfunc(ninfo->outarg, sb, &ninfo->outsize); sect_len = sbuf_end_section(sb, old_len, ELF_NOTE_ROUNDSIZE, 0); if (sect_len < 0) return; new_len = (size_t)sect_len; descsz = roundup(note.n_descsz, ELF_NOTE_ROUNDSIZE); if (new_len < descsz) { /* * It is expected that individual note emitters will correctly * predict their expected output size and fill up to that size * themselves, padding in a format-specific way if needed. * However, in case they don't, just do it here with zeros. */ for (i = 0; i < descsz - new_len; i++) sbuf_putc(sb, 0); } else if (new_len > descsz) { /* * We can't always truncate sb -- we may have drained some * of it already. */ KASSERT(new_len == descsz, ("%s: Note type %u changed as we " "read it (%zu > %zu). Since it is longer than " "expected, this coredump's notes are corrupt. THIS " "IS A BUG in the note_procstat routine for type %u.\n", __func__, (unsigned)note.n_type, new_len, descsz, (unsigned)note.n_type)); } } /* * Miscellaneous note out functions. */ #if defined(COMPAT_FREEBSD32) && __ELF_WORD_SIZE == 32 #include typedef struct prstatus32 elf_prstatus_t; typedef struct prpsinfo32 elf_prpsinfo_t; typedef struct fpreg32 elf_prfpregset_t; typedef struct fpreg32 elf_fpregset_t; typedef struct reg32 elf_gregset_t; typedef struct thrmisc32 elf_thrmisc_t; #define ELF_KERN_PROC_MASK KERN_PROC_MASK32 typedef struct kinfo_proc32 elf_kinfo_proc_t; typedef uint32_t elf_ps_strings_t; #else typedef prstatus_t elf_prstatus_t; typedef prpsinfo_t elf_prpsinfo_t; typedef prfpregset_t elf_prfpregset_t; typedef prfpregset_t elf_fpregset_t; typedef gregset_t elf_gregset_t; typedef thrmisc_t elf_thrmisc_t; #define ELF_KERN_PROC_MASK 0 typedef struct kinfo_proc elf_kinfo_proc_t; typedef vm_offset_t elf_ps_strings_t; #endif static void __elfN(note_prpsinfo)(void *arg, struct sbuf *sb, size_t *sizep) { struct sbuf sbarg; size_t len; char *cp, *end; struct proc *p; elf_prpsinfo_t *psinfo; int error; p = (struct proc *)arg; if (sb != NULL) { KASSERT(*sizep == sizeof(*psinfo), ("invalid size")); psinfo = malloc(sizeof(*psinfo), M_TEMP, M_ZERO | M_WAITOK); psinfo->pr_version = PRPSINFO_VERSION; psinfo->pr_psinfosz = sizeof(elf_prpsinfo_t); strlcpy(psinfo->pr_fname, p->p_comm, sizeof(psinfo->pr_fname)); PROC_LOCK(p); if (p->p_args != NULL) { len = sizeof(psinfo->pr_psargs) - 1; if (len > p->p_args->ar_length) len = p->p_args->ar_length; memcpy(psinfo->pr_psargs, p->p_args->ar_args, len); PROC_UNLOCK(p); error = 0; } else { _PHOLD(p); PROC_UNLOCK(p); sbuf_new(&sbarg, psinfo->pr_psargs, sizeof(psinfo->pr_psargs), SBUF_FIXEDLEN); error = proc_getargv(curthread, p, &sbarg); PRELE(p); if (sbuf_finish(&sbarg) == 0) len = sbuf_len(&sbarg) - 1; else len = sizeof(psinfo->pr_psargs) - 1; sbuf_delete(&sbarg); } if (error || len == 0) strlcpy(psinfo->pr_psargs, p->p_comm, sizeof(psinfo->pr_psargs)); else { KASSERT(len < sizeof(psinfo->pr_psargs), ("len is too long: %zu vs %zu", len, sizeof(psinfo->pr_psargs))); cp = psinfo->pr_psargs; end = cp + len - 1; for (;;) { cp = memchr(cp, '\0', end - cp); if (cp == NULL) break; *cp = ' '; } } psinfo->pr_pid = p->p_pid; sbuf_bcat(sb, psinfo, sizeof(*psinfo)); free(psinfo, M_TEMP); } *sizep = sizeof(*psinfo); } static void __elfN(note_prstatus)(void *arg, struct sbuf *sb, size_t *sizep) { struct thread *td; elf_prstatus_t *status; td = (struct thread *)arg; if (sb != NULL) { KASSERT(*sizep == sizeof(*status), ("invalid size")); status = malloc(sizeof(*status), M_TEMP, M_ZERO | M_WAITOK); status->pr_version = PRSTATUS_VERSION; status->pr_statussz = sizeof(elf_prstatus_t); status->pr_gregsetsz = sizeof(elf_gregset_t); status->pr_fpregsetsz = sizeof(elf_fpregset_t); status->pr_osreldate = osreldate; status->pr_cursig = td->td_proc->p_sig; status->pr_pid = td->td_tid; #if defined(COMPAT_FREEBSD32) && __ELF_WORD_SIZE == 32 fill_regs32(td, &status->pr_reg); #else fill_regs(td, &status->pr_reg); #endif sbuf_bcat(sb, status, sizeof(*status)); free(status, M_TEMP); } *sizep = sizeof(*status); } static void __elfN(note_fpregset)(void *arg, struct sbuf *sb, size_t *sizep) { struct thread *td; elf_prfpregset_t *fpregset; td = (struct thread *)arg; if (sb != NULL) { KASSERT(*sizep == sizeof(*fpregset), ("invalid size")); fpregset = malloc(sizeof(*fpregset), M_TEMP, M_ZERO | M_WAITOK); #if defined(COMPAT_FREEBSD32) && __ELF_WORD_SIZE == 32 fill_fpregs32(td, fpregset); #else fill_fpregs(td, fpregset); #endif sbuf_bcat(sb, fpregset, sizeof(*fpregset)); free(fpregset, M_TEMP); } *sizep = sizeof(*fpregset); } static void __elfN(note_thrmisc)(void *arg, struct sbuf *sb, size_t *sizep) { struct thread *td; elf_thrmisc_t thrmisc; td = (struct thread *)arg; if (sb != NULL) { KASSERT(*sizep == sizeof(thrmisc), ("invalid size")); bzero(&thrmisc._pad, sizeof(thrmisc._pad)); strcpy(thrmisc.pr_tname, td->td_name); sbuf_bcat(sb, &thrmisc, sizeof(thrmisc)); } *sizep = sizeof(thrmisc); } /* * Allow for MD specific notes, as well as any MD * specific preparations for writing MI notes. */ static void __elfN(note_threadmd)(void *arg, struct sbuf *sb, size_t *sizep) { struct thread *td; void *buf; size_t size; td = (struct thread *)arg; size = *sizep; if (size != 0 && sb != NULL) buf = malloc(size, M_TEMP, M_ZERO | M_WAITOK); else buf = NULL; size = 0; __elfN(dump_thread)(td, buf, &size); KASSERT(sb == NULL || *sizep == size, ("invalid size")); if (size != 0 && sb != NULL) sbuf_bcat(sb, buf, size); free(buf, M_TEMP); *sizep = size; } #ifdef KINFO_PROC_SIZE CTASSERT(sizeof(struct kinfo_proc) == KINFO_PROC_SIZE); #endif static void __elfN(note_procstat_proc)(void *arg, struct sbuf *sb, size_t *sizep) { struct proc *p; size_t size; int structsize; p = (struct proc *)arg; size = sizeof(structsize) + p->p_numthreads * sizeof(elf_kinfo_proc_t); if (sb != NULL) { KASSERT(*sizep == size, ("invalid size")); structsize = sizeof(elf_kinfo_proc_t); sbuf_bcat(sb, &structsize, sizeof(structsize)); sx_slock(&proctree_lock); PROC_LOCK(p); kern_proc_out(p, sb, ELF_KERN_PROC_MASK); sx_sunlock(&proctree_lock); } *sizep = size; } #ifdef KINFO_FILE_SIZE CTASSERT(sizeof(struct kinfo_file) == KINFO_FILE_SIZE); #endif static void note_procstat_files(void *arg, struct sbuf *sb, size_t *sizep) { struct proc *p; size_t size, sect_sz, i; ssize_t start_len, sect_len; int structsize, filedesc_flags; if (coredump_pack_fileinfo) filedesc_flags = KERN_FILEDESC_PACK_KINFO; else filedesc_flags = 0; p = (struct proc *)arg; structsize = sizeof(struct kinfo_file); if (sb == NULL) { size = 0; sb = sbuf_new(NULL, NULL, 128, SBUF_FIXEDLEN); sbuf_set_drain(sb, sbuf_drain_count, &size); sbuf_bcat(sb, &structsize, sizeof(structsize)); PROC_LOCK(p); kern_proc_filedesc_out(p, sb, -1, filedesc_flags); sbuf_finish(sb); sbuf_delete(sb); *sizep = size; } else { sbuf_start_section(sb, &start_len); sbuf_bcat(sb, &structsize, sizeof(structsize)); PROC_LOCK(p); kern_proc_filedesc_out(p, sb, *sizep - sizeof(structsize), filedesc_flags); sect_len = sbuf_end_section(sb, start_len, 0, 0); if (sect_len < 0) return; sect_sz = sect_len; KASSERT(sect_sz <= *sizep, ("kern_proc_filedesc_out did not respect maxlen; " "requested %zu, got %zu", *sizep - sizeof(structsize), sect_sz - sizeof(structsize))); for (i = 0; i < *sizep - sect_sz && sb->s_error == 0; i++) sbuf_putc(sb, 0); } } #ifdef KINFO_VMENTRY_SIZE CTASSERT(sizeof(struct kinfo_vmentry) == KINFO_VMENTRY_SIZE); #endif static void note_procstat_vmmap(void *arg, struct sbuf *sb, size_t *sizep) { struct proc *p; size_t size; int structsize, vmmap_flags; if (coredump_pack_vmmapinfo) vmmap_flags = KERN_VMMAP_PACK_KINFO; else vmmap_flags = 0; p = (struct proc *)arg; structsize = sizeof(struct kinfo_vmentry); if (sb == NULL) { size = 0; sb = sbuf_new(NULL, NULL, 128, SBUF_FIXEDLEN); sbuf_set_drain(sb, sbuf_drain_count, &size); sbuf_bcat(sb, &structsize, sizeof(structsize)); PROC_LOCK(p); kern_proc_vmmap_out(p, sb, -1, vmmap_flags); sbuf_finish(sb); sbuf_delete(sb); *sizep = size; } else { sbuf_bcat(sb, &structsize, sizeof(structsize)); PROC_LOCK(p); kern_proc_vmmap_out(p, sb, *sizep - sizeof(structsize), vmmap_flags); } } static void note_procstat_groups(void *arg, struct sbuf *sb, size_t *sizep) { struct proc *p; size_t size; int structsize; p = (struct proc *)arg; size = sizeof(structsize) + p->p_ucred->cr_ngroups * sizeof(gid_t); if (sb != NULL) { KASSERT(*sizep == size, ("invalid size")); structsize = sizeof(gid_t); sbuf_bcat(sb, &structsize, sizeof(structsize)); sbuf_bcat(sb, p->p_ucred->cr_groups, p->p_ucred->cr_ngroups * sizeof(gid_t)); } *sizep = size; } static void note_procstat_umask(void *arg, struct sbuf *sb, size_t *sizep) { struct proc *p; size_t size; int structsize; p = (struct proc *)arg; size = sizeof(structsize) + sizeof(p->p_fd->fd_cmask); if (sb != NULL) { KASSERT(*sizep == size, ("invalid size")); structsize = sizeof(p->p_fd->fd_cmask); sbuf_bcat(sb, &structsize, sizeof(structsize)); sbuf_bcat(sb, &p->p_fd->fd_cmask, sizeof(p->p_fd->fd_cmask)); } *sizep = size; } static void note_procstat_rlimit(void *arg, struct sbuf *sb, size_t *sizep) { struct proc *p; struct rlimit rlim[RLIM_NLIMITS]; size_t size; int structsize, i; p = (struct proc *)arg; size = sizeof(structsize) + sizeof(rlim); if (sb != NULL) { KASSERT(*sizep == size, ("invalid size")); structsize = sizeof(rlim); sbuf_bcat(sb, &structsize, sizeof(structsize)); PROC_LOCK(p); for (i = 0; i < RLIM_NLIMITS; i++) lim_rlimit_proc(p, i, &rlim[i]); PROC_UNLOCK(p); sbuf_bcat(sb, rlim, sizeof(rlim)); } *sizep = size; } static void note_procstat_osrel(void *arg, struct sbuf *sb, size_t *sizep) { struct proc *p; size_t size; int structsize; p = (struct proc *)arg; size = sizeof(structsize) + sizeof(p->p_osrel); if (sb != NULL) { KASSERT(*sizep == size, ("invalid size")); structsize = sizeof(p->p_osrel); sbuf_bcat(sb, &structsize, sizeof(structsize)); sbuf_bcat(sb, &p->p_osrel, sizeof(p->p_osrel)); } *sizep = size; } static void __elfN(note_procstat_psstrings)(void *arg, struct sbuf *sb, size_t *sizep) { struct proc *p; elf_ps_strings_t ps_strings; size_t size; int structsize; p = (struct proc *)arg; size = sizeof(structsize) + sizeof(ps_strings); if (sb != NULL) { KASSERT(*sizep == size, ("invalid size")); structsize = sizeof(ps_strings); #if defined(COMPAT_FREEBSD32) && __ELF_WORD_SIZE == 32 ps_strings = PTROUT(p->p_sysent->sv_psstrings); #else ps_strings = p->p_sysent->sv_psstrings; #endif sbuf_bcat(sb, &structsize, sizeof(structsize)); sbuf_bcat(sb, &ps_strings, sizeof(ps_strings)); } *sizep = size; } static void __elfN(note_procstat_auxv)(void *arg, struct sbuf *sb, size_t *sizep) { struct proc *p; size_t size; int structsize; p = (struct proc *)arg; if (sb == NULL) { size = 0; sb = sbuf_new(NULL, NULL, 128, SBUF_FIXEDLEN); sbuf_set_drain(sb, sbuf_drain_count, &size); sbuf_bcat(sb, &structsize, sizeof(structsize)); PHOLD(p); proc_getauxv(curthread, p, sb); PRELE(p); sbuf_finish(sb); sbuf_delete(sb); *sizep = size; } else { structsize = sizeof(Elf_Auxinfo); sbuf_bcat(sb, &structsize, sizeof(structsize)); PHOLD(p); proc_getauxv(curthread, p, sb); PRELE(p); } } static boolean_t __elfN(parse_notes)(struct image_params *imgp, Elf_Brandnote *checknote, int32_t *osrel, const Elf_Phdr *pnote) { const Elf_Note *note, *note0, *note_end; const char *note_name; char *buf; int i, error; boolean_t res; /* We need some limit, might as well use PAGE_SIZE. */ if (pnote == NULL || pnote->p_filesz > PAGE_SIZE) return (FALSE); ASSERT_VOP_LOCKED(imgp->vp, "parse_notes"); if (pnote->p_offset > PAGE_SIZE || pnote->p_filesz > PAGE_SIZE - pnote->p_offset) { VOP_UNLOCK(imgp->vp, 0); buf = malloc(pnote->p_filesz, M_TEMP, M_WAITOK); vn_lock(imgp->vp, LK_EXCLUSIVE | LK_RETRY); error = vn_rdwr(UIO_READ, imgp->vp, buf, pnote->p_filesz, pnote->p_offset, UIO_SYSSPACE, IO_NODELOCKED, curthread->td_ucred, NOCRED, NULL, curthread); if (error != 0) { uprintf("i/o error PT_NOTE\n"); res = FALSE; goto ret; } note = note0 = (const Elf_Note *)buf; note_end = (const Elf_Note *)(buf + pnote->p_filesz); } else { note = note0 = (const Elf_Note *)(imgp->image_header + pnote->p_offset); note_end = (const Elf_Note *)(imgp->image_header + pnote->p_offset + pnote->p_filesz); buf = NULL; } for (i = 0; i < 100 && note >= note0 && note < note_end; i++) { if (!aligned(note, Elf32_Addr) || (const char *)note_end - (const char *)note < sizeof(Elf_Note)) { res = FALSE; goto ret; } if (note->n_namesz != checknote->hdr.n_namesz || note->n_descsz != checknote->hdr.n_descsz || note->n_type != checknote->hdr.n_type) goto nextnote; note_name = (const char *)(note + 1); if (note_name + checknote->hdr.n_namesz >= (const char *)note_end || strncmp(checknote->vendor, note_name, checknote->hdr.n_namesz) != 0) goto nextnote; /* * Fetch the osreldate for binary * from the ELF OSABI-note if necessary. */ if ((checknote->flags & BN_TRANSLATE_OSREL) != 0 && checknote->trans_osrel != NULL) { res = checknote->trans_osrel(note, osrel); goto ret; } res = TRUE; goto ret; nextnote: note = (const Elf_Note *)((const char *)(note + 1) + roundup2(note->n_namesz, ELF_NOTE_ROUNDSIZE) + roundup2(note->n_descsz, ELF_NOTE_ROUNDSIZE)); } res = FALSE; ret: free(buf, M_TEMP); return (res); } /* * Try to find the appropriate ABI-note section for checknote, * fetch the osreldate for binary from the ELF OSABI-note. Only the * first page of the image is searched, the same as for headers. */ static boolean_t __elfN(check_note)(struct image_params *imgp, Elf_Brandnote *checknote, int32_t *osrel) { const Elf_Phdr *phdr; const Elf_Ehdr *hdr; int i; hdr = (const Elf_Ehdr *)imgp->image_header; phdr = (const Elf_Phdr *)(imgp->image_header + hdr->e_phoff); for (i = 0; i < hdr->e_phnum; i++) { if (phdr[i].p_type == PT_NOTE && __elfN(parse_notes)(imgp, checknote, osrel, &phdr[i])) return (TRUE); } return (FALSE); } /* * Tell kern_execve.c about it, with a little help from the linker. */ static struct execsw __elfN(execsw) = { __CONCAT(exec_, __elfN(imgact)), __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) }; EXEC_SET(__CONCAT(elf, __ELF_WORD_SIZE), __elfN(execsw)); static vm_prot_t __elfN(trans_prot)(Elf_Word flags) { vm_prot_t prot; prot = 0; if (flags & PF_X) prot |= VM_PROT_EXECUTE; if (flags & PF_W) prot |= VM_PROT_WRITE; if (flags & PF_R) prot |= VM_PROT_READ; #if __ELF_WORD_SIZE == 32 #if defined(__amd64__) if (i386_read_exec && (flags & PF_R)) prot |= VM_PROT_EXECUTE; #endif #endif return (prot); } static Elf_Word __elfN(untrans_prot)(vm_prot_t prot) { Elf_Word flags; flags = 0; if (prot & VM_PROT_EXECUTE) flags |= PF_X; if (prot & VM_PROT_READ) flags |= PF_R; if (prot & VM_PROT_WRITE) flags |= PF_W; return (flags); } Index: projects/clang400-import/sys/kern/subr_gtaskqueue.c =================================================================== --- projects/clang400-import/sys/kern/subr_gtaskqueue.c (revision 314522) +++ projects/clang400-import/sys/kern/subr_gtaskqueue.c (revision 314523) @@ -1,963 +1,965 @@ /*- * Copyright (c) 2000 Doug Rabson * Copyright (c) 2014 Jeff Roberson * Copyright (c) 2016 Matthew Macy * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static MALLOC_DEFINE(M_GTASKQUEUE, "taskqueue", "Task Queues"); static void gtaskqueue_thread_enqueue(void *); static void gtaskqueue_thread_loop(void *arg); +TASKQGROUP_DEFINE(softirq, mp_ncpus, 1); + struct gtaskqueue_busy { struct gtask *tb_running; TAILQ_ENTRY(gtaskqueue_busy) tb_link; }; static struct gtask * const TB_DRAIN_WAITER = (struct gtask *)0x1; struct gtaskqueue { STAILQ_HEAD(, gtask) tq_queue; gtaskqueue_enqueue_fn tq_enqueue; void *tq_context; char *tq_name; TAILQ_HEAD(, gtaskqueue_busy) tq_active; struct mtx tq_mutex; struct thread **tq_threads; int tq_tcount; int tq_spin; int tq_flags; int tq_callouts; taskqueue_callback_fn tq_callbacks[TASKQUEUE_NUM_CALLBACKS]; void *tq_cb_contexts[TASKQUEUE_NUM_CALLBACKS]; }; #define TQ_FLAGS_ACTIVE (1 << 0) #define TQ_FLAGS_BLOCKED (1 << 1) #define TQ_FLAGS_UNLOCKED_ENQUEUE (1 << 2) #define DT_CALLOUT_ARMED (1 << 0) #define TQ_LOCK(tq) \ do { \ if ((tq)->tq_spin) \ mtx_lock_spin(&(tq)->tq_mutex); \ else \ mtx_lock(&(tq)->tq_mutex); \ } while (0) #define TQ_ASSERT_LOCKED(tq) mtx_assert(&(tq)->tq_mutex, MA_OWNED) #define TQ_UNLOCK(tq) \ do { \ if ((tq)->tq_spin) \ mtx_unlock_spin(&(tq)->tq_mutex); \ else \ mtx_unlock(&(tq)->tq_mutex); \ } while (0) #define TQ_ASSERT_UNLOCKED(tq) mtx_assert(&(tq)->tq_mutex, MA_NOTOWNED) #ifdef INVARIANTS static void gtask_dump(struct gtask *gtask) { printf("gtask: %p ta_flags=%x ta_priority=%d ta_func=%p ta_context=%p\n", gtask, gtask->ta_flags, gtask->ta_priority, gtask->ta_func, gtask->ta_context); } #endif static __inline int TQ_SLEEP(struct gtaskqueue *tq, void *p, struct mtx *m, int pri, const char *wm, int t) { if (tq->tq_spin) return (msleep_spin(p, m, wm, t)); return (msleep(p, m, pri, wm, t)); } static struct gtaskqueue * _gtaskqueue_create(const char *name, int mflags, taskqueue_enqueue_fn enqueue, void *context, int mtxflags, const char *mtxname __unused) { struct gtaskqueue *queue; char *tq_name; tq_name = malloc(TASKQUEUE_NAMELEN, M_GTASKQUEUE, mflags | M_ZERO); if (!tq_name) return (NULL); snprintf(tq_name, TASKQUEUE_NAMELEN, "%s", (name) ? name : "taskqueue"); queue = malloc(sizeof(struct gtaskqueue), M_GTASKQUEUE, mflags | M_ZERO); if (!queue) return (NULL); STAILQ_INIT(&queue->tq_queue); TAILQ_INIT(&queue->tq_active); queue->tq_enqueue = enqueue; queue->tq_context = context; queue->tq_name = tq_name; queue->tq_spin = (mtxflags & MTX_SPIN) != 0; queue->tq_flags |= TQ_FLAGS_ACTIVE; if (enqueue == gtaskqueue_thread_enqueue) queue->tq_flags |= TQ_FLAGS_UNLOCKED_ENQUEUE; mtx_init(&queue->tq_mutex, tq_name, NULL, mtxflags); return (queue); } /* * Signal a taskqueue thread to terminate. */ static void gtaskqueue_terminate(struct thread **pp, struct gtaskqueue *tq) { while (tq->tq_tcount > 0 || tq->tq_callouts > 0) { wakeup(tq); TQ_SLEEP(tq, pp, &tq->tq_mutex, PWAIT, "taskqueue_destroy", 0); } } static void gtaskqueue_free(struct gtaskqueue *queue) { TQ_LOCK(queue); queue->tq_flags &= ~TQ_FLAGS_ACTIVE; gtaskqueue_terminate(queue->tq_threads, queue); KASSERT(TAILQ_EMPTY(&queue->tq_active), ("Tasks still running?")); KASSERT(queue->tq_callouts == 0, ("Armed timeout tasks")); mtx_destroy(&queue->tq_mutex); free(queue->tq_threads, M_GTASKQUEUE); free(queue->tq_name, M_GTASKQUEUE); free(queue, M_GTASKQUEUE); } int grouptaskqueue_enqueue(struct gtaskqueue *queue, struct gtask *gtask) { #ifdef INVARIANTS if (queue == NULL) { gtask_dump(gtask); panic("queue == NULL"); } #endif TQ_LOCK(queue); if (gtask->ta_flags & TASK_ENQUEUED) { TQ_UNLOCK(queue); return (0); } STAILQ_INSERT_TAIL(&queue->tq_queue, gtask, ta_link); gtask->ta_flags |= TASK_ENQUEUED; TQ_UNLOCK(queue); if ((queue->tq_flags & TQ_FLAGS_BLOCKED) == 0) queue->tq_enqueue(queue->tq_context); return (0); } static void gtaskqueue_task_nop_fn(void *context) { } /* * Block until all currently queued tasks in this taskqueue * have begun execution. Tasks queued during execution of * this function are ignored. */ static void gtaskqueue_drain_tq_queue(struct gtaskqueue *queue) { struct gtask t_barrier; if (STAILQ_EMPTY(&queue->tq_queue)) return; /* * Enqueue our barrier after all current tasks, but with * the highest priority so that newly queued tasks cannot * pass it. Because of the high priority, we can not use * taskqueue_enqueue_locked directly (which drops the lock * anyway) so just insert it at tail while we have the * queue lock. */ GTASK_INIT(&t_barrier, 0, USHRT_MAX, gtaskqueue_task_nop_fn, &t_barrier); STAILQ_INSERT_TAIL(&queue->tq_queue, &t_barrier, ta_link); t_barrier.ta_flags |= TASK_ENQUEUED; /* * Once the barrier has executed, all previously queued tasks * have completed or are currently executing. */ while (t_barrier.ta_flags & TASK_ENQUEUED) TQ_SLEEP(queue, &t_barrier, &queue->tq_mutex, PWAIT, "-", 0); } /* * Block until all currently executing tasks for this taskqueue * complete. Tasks that begin execution during the execution * of this function are ignored. */ static void gtaskqueue_drain_tq_active(struct gtaskqueue *queue) { struct gtaskqueue_busy tb_marker, *tb_first; if (TAILQ_EMPTY(&queue->tq_active)) return; /* Block taskq_terminate().*/ queue->tq_callouts++; /* * Wait for all currently executing taskqueue threads * to go idle. */ tb_marker.tb_running = TB_DRAIN_WAITER; TAILQ_INSERT_TAIL(&queue->tq_active, &tb_marker, tb_link); while (TAILQ_FIRST(&queue->tq_active) != &tb_marker) TQ_SLEEP(queue, &tb_marker, &queue->tq_mutex, PWAIT, "-", 0); TAILQ_REMOVE(&queue->tq_active, &tb_marker, tb_link); /* * Wakeup any other drain waiter that happened to queue up * without any intervening active thread. */ tb_first = TAILQ_FIRST(&queue->tq_active); if (tb_first != NULL && tb_first->tb_running == TB_DRAIN_WAITER) wakeup(tb_first); /* Release taskqueue_terminate(). */ queue->tq_callouts--; if ((queue->tq_flags & TQ_FLAGS_ACTIVE) == 0) wakeup_one(queue->tq_threads); } void gtaskqueue_block(struct gtaskqueue *queue) { TQ_LOCK(queue); queue->tq_flags |= TQ_FLAGS_BLOCKED; TQ_UNLOCK(queue); } void gtaskqueue_unblock(struct gtaskqueue *queue) { TQ_LOCK(queue); queue->tq_flags &= ~TQ_FLAGS_BLOCKED; if (!STAILQ_EMPTY(&queue->tq_queue)) queue->tq_enqueue(queue->tq_context); TQ_UNLOCK(queue); } static void gtaskqueue_run_locked(struct gtaskqueue *queue) { struct gtaskqueue_busy tb; struct gtaskqueue_busy *tb_first; struct gtask *gtask; KASSERT(queue != NULL, ("tq is NULL")); TQ_ASSERT_LOCKED(queue); tb.tb_running = NULL; while (STAILQ_FIRST(&queue->tq_queue)) { TAILQ_INSERT_TAIL(&queue->tq_active, &tb, tb_link); /* * Carefully remove the first task from the queue and * clear its TASK_ENQUEUED flag */ gtask = STAILQ_FIRST(&queue->tq_queue); KASSERT(gtask != NULL, ("task is NULL")); STAILQ_REMOVE_HEAD(&queue->tq_queue, ta_link); gtask->ta_flags &= ~TASK_ENQUEUED; tb.tb_running = gtask; TQ_UNLOCK(queue); KASSERT(gtask->ta_func != NULL, ("task->ta_func is NULL")); gtask->ta_func(gtask->ta_context); TQ_LOCK(queue); tb.tb_running = NULL; wakeup(gtask); TAILQ_REMOVE(&queue->tq_active, &tb, tb_link); tb_first = TAILQ_FIRST(&queue->tq_active); if (tb_first != NULL && tb_first->tb_running == TB_DRAIN_WAITER) wakeup(tb_first); } } static int task_is_running(struct gtaskqueue *queue, struct gtask *gtask) { struct gtaskqueue_busy *tb; TQ_ASSERT_LOCKED(queue); TAILQ_FOREACH(tb, &queue->tq_active, tb_link) { if (tb->tb_running == gtask) return (1); } return (0); } static int gtaskqueue_cancel_locked(struct gtaskqueue *queue, struct gtask *gtask) { if (gtask->ta_flags & TASK_ENQUEUED) STAILQ_REMOVE(&queue->tq_queue, gtask, gtask, ta_link); gtask->ta_flags &= ~TASK_ENQUEUED; return (task_is_running(queue, gtask) ? EBUSY : 0); } int gtaskqueue_cancel(struct gtaskqueue *queue, struct gtask *gtask) { int error; TQ_LOCK(queue); error = gtaskqueue_cancel_locked(queue, gtask); TQ_UNLOCK(queue); return (error); } void gtaskqueue_drain(struct gtaskqueue *queue, struct gtask *gtask) { if (!queue->tq_spin) WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, __func__); TQ_LOCK(queue); while ((gtask->ta_flags & TASK_ENQUEUED) || task_is_running(queue, gtask)) TQ_SLEEP(queue, gtask, &queue->tq_mutex, PWAIT, "-", 0); TQ_UNLOCK(queue); } void gtaskqueue_drain_all(struct gtaskqueue *queue) { if (!queue->tq_spin) WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, __func__); TQ_LOCK(queue); gtaskqueue_drain_tq_queue(queue); gtaskqueue_drain_tq_active(queue); TQ_UNLOCK(queue); } static int _gtaskqueue_start_threads(struct gtaskqueue **tqp, int count, int pri, cpuset_t *mask, const char *name, va_list ap) { char ktname[MAXCOMLEN + 1]; struct thread *td; struct gtaskqueue *tq; int i, error; if (count <= 0) return (EINVAL); vsnprintf(ktname, sizeof(ktname), name, ap); tq = *tqp; tq->tq_threads = malloc(sizeof(struct thread *) * count, M_GTASKQUEUE, M_NOWAIT | M_ZERO); if (tq->tq_threads == NULL) { printf("%s: no memory for %s threads\n", __func__, ktname); return (ENOMEM); } for (i = 0; i < count; i++) { if (count == 1) error = kthread_add(gtaskqueue_thread_loop, tqp, NULL, &tq->tq_threads[i], RFSTOPPED, 0, "%s", ktname); else error = kthread_add(gtaskqueue_thread_loop, tqp, NULL, &tq->tq_threads[i], RFSTOPPED, 0, "%s_%d", ktname, i); if (error) { /* should be ok to continue, taskqueue_free will dtrt */ printf("%s: kthread_add(%s): error %d", __func__, ktname, error); tq->tq_threads[i] = NULL; /* paranoid */ } else tq->tq_tcount++; } for (i = 0; i < count; i++) { if (tq->tq_threads[i] == NULL) continue; td = tq->tq_threads[i]; if (mask) { error = cpuset_setthread(td->td_tid, mask); /* * Failing to pin is rarely an actual fatal error; * it'll just affect performance. */ if (error) printf("%s: curthread=%llu: can't pin; " "error=%d\n", __func__, (unsigned long long) td->td_tid, error); } thread_lock(td); sched_prio(td, pri); sched_add(td, SRQ_BORING); thread_unlock(td); } return (0); } static int gtaskqueue_start_threads(struct gtaskqueue **tqp, int count, int pri, const char *name, ...) { va_list ap; int error; va_start(ap, name); error = _gtaskqueue_start_threads(tqp, count, pri, NULL, name, ap); va_end(ap); return (error); } static inline void gtaskqueue_run_callback(struct gtaskqueue *tq, enum taskqueue_callback_type cb_type) { taskqueue_callback_fn tq_callback; TQ_ASSERT_UNLOCKED(tq); tq_callback = tq->tq_callbacks[cb_type]; if (tq_callback != NULL) tq_callback(tq->tq_cb_contexts[cb_type]); } static void gtaskqueue_thread_loop(void *arg) { struct gtaskqueue **tqp, *tq; tqp = arg; tq = *tqp; gtaskqueue_run_callback(tq, TASKQUEUE_CALLBACK_TYPE_INIT); TQ_LOCK(tq); while ((tq->tq_flags & TQ_FLAGS_ACTIVE) != 0) { /* XXX ? */ gtaskqueue_run_locked(tq); /* * Because taskqueue_run() can drop tq_mutex, we need to * check if the TQ_FLAGS_ACTIVE flag wasn't removed in the * meantime, which means we missed a wakeup. */ if ((tq->tq_flags & TQ_FLAGS_ACTIVE) == 0) break; TQ_SLEEP(tq, tq, &tq->tq_mutex, 0, "-", 0); } gtaskqueue_run_locked(tq); /* * This thread is on its way out, so just drop the lock temporarily * in order to call the shutdown callback. This allows the callback * to look at the taskqueue, even just before it dies. */ TQ_UNLOCK(tq); gtaskqueue_run_callback(tq, TASKQUEUE_CALLBACK_TYPE_SHUTDOWN); TQ_LOCK(tq); /* rendezvous with thread that asked us to terminate */ tq->tq_tcount--; wakeup_one(tq->tq_threads); TQ_UNLOCK(tq); kthread_exit(); } static void gtaskqueue_thread_enqueue(void *context) { struct gtaskqueue **tqp, *tq; tqp = context; tq = *tqp; wakeup_one(tq); } static struct gtaskqueue * gtaskqueue_create_fast(const char *name, int mflags, taskqueue_enqueue_fn enqueue, void *context) { return _gtaskqueue_create(name, mflags, enqueue, context, MTX_SPIN, "fast_taskqueue"); } struct taskqgroup_cpu { LIST_HEAD(, grouptask) tgc_tasks; struct gtaskqueue *tgc_taskq; int tgc_cnt; int tgc_cpu; }; struct taskqgroup { struct taskqgroup_cpu tqg_queue[MAXCPU]; struct mtx tqg_lock; char * tqg_name; int tqg_adjusting; int tqg_stride; int tqg_cnt; }; struct taskq_bind_task { struct gtask bt_task; int bt_cpuid; }; static void taskqgroup_cpu_create(struct taskqgroup *qgroup, int idx, int cpu) { struct taskqgroup_cpu *qcpu; qcpu = &qgroup->tqg_queue[idx]; LIST_INIT(&qcpu->tgc_tasks); qcpu->tgc_taskq = gtaskqueue_create_fast(NULL, M_WAITOK, taskqueue_thread_enqueue, &qcpu->tgc_taskq); gtaskqueue_start_threads(&qcpu->tgc_taskq, 1, PI_SOFT, "%s_%d", qgroup->tqg_name, idx); qcpu->tgc_cpu = cpu; } static void taskqgroup_cpu_remove(struct taskqgroup *qgroup, int idx) { gtaskqueue_free(qgroup->tqg_queue[idx].tgc_taskq); } /* * Find the taskq with least # of tasks that doesn't currently have any * other queues from the uniq identifier. */ static int taskqgroup_find(struct taskqgroup *qgroup, void *uniq) { struct grouptask *n; int i, idx, mincnt; int strict; mtx_assert(&qgroup->tqg_lock, MA_OWNED); if (qgroup->tqg_cnt == 0) return (0); idx = -1; mincnt = INT_MAX; /* * Two passes; First scan for a queue with the least tasks that * does not already service this uniq id. If that fails simply find * the queue with the least total tasks; */ for (strict = 1; mincnt == INT_MAX; strict = 0) { for (i = 0; i < qgroup->tqg_cnt; i++) { if (qgroup->tqg_queue[i].tgc_cnt > mincnt) continue; if (strict) { LIST_FOREACH(n, &qgroup->tqg_queue[i].tgc_tasks, gt_list) if (n->gt_uniq == uniq) break; if (n != NULL) continue; } mincnt = qgroup->tqg_queue[i].tgc_cnt; idx = i; } } if (idx == -1) panic("taskqgroup_find: Failed to pick a qid."); return (idx); } /* * smp_started is unusable since it is not set for UP kernels or even for * SMP kernels when there is 1 CPU. This is usually handled by adding a * (mp_ncpus == 1) test, but that would be broken here since we need to * to synchronize with the SI_SUB_SMP ordering. Even in the pure SMP case * smp_started only gives a fuzzy ordering relative to SI_SUB_SMP. * * So maintain our own flag. It must be set after all CPUs are started * and before SI_SUB_SMP:SI_ORDER_ANY so that the SYSINIT for delayed * adjustment is properly delayed. SI_ORDER_FOURTH is clearly before * SI_ORDER_ANY and unclearly after the CPUs are started. It would be * simpler for adjustment to pass a flag indicating if it is delayed. */ static int tqg_smp_started; static void tqg_record_smp_started(void *arg) { tqg_smp_started = 1; } SYSINIT(tqg_record_smp_started, SI_SUB_SMP, SI_ORDER_FOURTH, tqg_record_smp_started, NULL); void taskqgroup_attach(struct taskqgroup *qgroup, struct grouptask *gtask, void *uniq, int irq, char *name) { cpuset_t mask; int qid; gtask->gt_uniq = uniq; gtask->gt_name = name; gtask->gt_irq = irq; gtask->gt_cpu = -1; mtx_lock(&qgroup->tqg_lock); qid = taskqgroup_find(qgroup, uniq); qgroup->tqg_queue[qid].tgc_cnt++; LIST_INSERT_HEAD(&qgroup->tqg_queue[qid].tgc_tasks, gtask, gt_list); gtask->gt_taskqueue = qgroup->tqg_queue[qid].tgc_taskq; if (irq != -1 && tqg_smp_started) { gtask->gt_cpu = qgroup->tqg_queue[qid].tgc_cpu; CPU_ZERO(&mask); CPU_SET(qgroup->tqg_queue[qid].tgc_cpu, &mask); mtx_unlock(&qgroup->tqg_lock); intr_setaffinity(irq, &mask); } else mtx_unlock(&qgroup->tqg_lock); } static void taskqgroup_attach_deferred(struct taskqgroup *qgroup, struct grouptask *gtask) { cpuset_t mask; int qid, cpu; mtx_lock(&qgroup->tqg_lock); qid = taskqgroup_find(qgroup, gtask->gt_uniq); cpu = qgroup->tqg_queue[qid].tgc_cpu; if (gtask->gt_irq != -1) { mtx_unlock(&qgroup->tqg_lock); CPU_ZERO(&mask); CPU_SET(cpu, &mask); intr_setaffinity(gtask->gt_irq, &mask); mtx_lock(&qgroup->tqg_lock); } qgroup->tqg_queue[qid].tgc_cnt++; LIST_INSERT_HEAD(&qgroup->tqg_queue[qid].tgc_tasks, gtask, gt_list); MPASS(qgroup->tqg_queue[qid].tgc_taskq != NULL); gtask->gt_taskqueue = qgroup->tqg_queue[qid].tgc_taskq; mtx_unlock(&qgroup->tqg_lock); } int taskqgroup_attach_cpu(struct taskqgroup *qgroup, struct grouptask *gtask, void *uniq, int cpu, int irq, char *name) { cpuset_t mask; int i, qid; qid = -1; gtask->gt_uniq = uniq; gtask->gt_name = name; gtask->gt_irq = irq; gtask->gt_cpu = cpu; mtx_lock(&qgroup->tqg_lock); if (tqg_smp_started) { for (i = 0; i < qgroup->tqg_cnt; i++) if (qgroup->tqg_queue[i].tgc_cpu == cpu) { qid = i; break; } if (qid == -1) { mtx_unlock(&qgroup->tqg_lock); return (EINVAL); } } else qid = 0; qgroup->tqg_queue[qid].tgc_cnt++; LIST_INSERT_HEAD(&qgroup->tqg_queue[qid].tgc_tasks, gtask, gt_list); gtask->gt_taskqueue = qgroup->tqg_queue[qid].tgc_taskq; cpu = qgroup->tqg_queue[qid].tgc_cpu; mtx_unlock(&qgroup->tqg_lock); CPU_ZERO(&mask); CPU_SET(cpu, &mask); if (irq != -1 && tqg_smp_started) intr_setaffinity(irq, &mask); return (0); } static int taskqgroup_attach_cpu_deferred(struct taskqgroup *qgroup, struct grouptask *gtask) { cpuset_t mask; int i, qid, irq, cpu; qid = -1; irq = gtask->gt_irq; cpu = gtask->gt_cpu; MPASS(tqg_smp_started); mtx_lock(&qgroup->tqg_lock); for (i = 0; i < qgroup->tqg_cnt; i++) if (qgroup->tqg_queue[i].tgc_cpu == cpu) { qid = i; break; } if (qid == -1) { mtx_unlock(&qgroup->tqg_lock); return (EINVAL); } qgroup->tqg_queue[qid].tgc_cnt++; LIST_INSERT_HEAD(&qgroup->tqg_queue[qid].tgc_tasks, gtask, gt_list); MPASS(qgroup->tqg_queue[qid].tgc_taskq != NULL); gtask->gt_taskqueue = qgroup->tqg_queue[qid].tgc_taskq; mtx_unlock(&qgroup->tqg_lock); CPU_ZERO(&mask); CPU_SET(cpu, &mask); if (irq != -1) intr_setaffinity(irq, &mask); return (0); } void taskqgroup_detach(struct taskqgroup *qgroup, struct grouptask *gtask) { int i; mtx_lock(&qgroup->tqg_lock); for (i = 0; i < qgroup->tqg_cnt; i++) if (qgroup->tqg_queue[i].tgc_taskq == gtask->gt_taskqueue) break; if (i == qgroup->tqg_cnt) panic("taskqgroup_detach: task not in group\n"); qgroup->tqg_queue[i].tgc_cnt--; LIST_REMOVE(gtask, gt_list); mtx_unlock(&qgroup->tqg_lock); gtask->gt_taskqueue = NULL; } static void taskqgroup_binder(void *ctx) { struct taskq_bind_task *gtask = (struct taskq_bind_task *)ctx; cpuset_t mask; int error; CPU_ZERO(&mask); CPU_SET(gtask->bt_cpuid, &mask); error = cpuset_setthread(curthread->td_tid, &mask); thread_lock(curthread); sched_bind(curthread, gtask->bt_cpuid); thread_unlock(curthread); if (error) printf("taskqgroup_binder: setaffinity failed: %d\n", error); free(gtask, M_DEVBUF); } static void taskqgroup_bind(struct taskqgroup *qgroup) { struct taskq_bind_task *gtask; int i; /* * Bind taskqueue threads to specific CPUs, if they have been assigned * one. */ if (qgroup->tqg_cnt == 1) return; for (i = 0; i < qgroup->tqg_cnt; i++) { gtask = malloc(sizeof (*gtask), M_DEVBUF, M_WAITOK); GTASK_INIT(>ask->bt_task, 0, 0, taskqgroup_binder, gtask); gtask->bt_cpuid = qgroup->tqg_queue[i].tgc_cpu; grouptaskqueue_enqueue(qgroup->tqg_queue[i].tgc_taskq, >ask->bt_task); } } static int _taskqgroup_adjust(struct taskqgroup *qgroup, int cnt, int stride) { LIST_HEAD(, grouptask) gtask_head = LIST_HEAD_INITIALIZER(NULL); struct grouptask *gtask; int i, k, old_cnt, old_cpu, cpu; mtx_assert(&qgroup->tqg_lock, MA_OWNED); if (cnt < 1 || cnt * stride > mp_ncpus || !tqg_smp_started) { printf("%s: failed cnt: %d stride: %d " "mp_ncpus: %d tqg_smp_started: %d\n", __func__, cnt, stride, mp_ncpus, tqg_smp_started); return (EINVAL); } if (qgroup->tqg_adjusting) { printf("taskqgroup_adjust failed: adjusting\n"); return (EBUSY); } qgroup->tqg_adjusting = 1; old_cnt = qgroup->tqg_cnt; old_cpu = 0; if (old_cnt < cnt) old_cpu = qgroup->tqg_queue[old_cnt].tgc_cpu; mtx_unlock(&qgroup->tqg_lock); /* * Set up queue for tasks added before boot. */ if (old_cnt == 0) { LIST_SWAP(>ask_head, &qgroup->tqg_queue[0].tgc_tasks, grouptask, gt_list); qgroup->tqg_queue[0].tgc_cnt = 0; } /* * If new taskq threads have been added. */ cpu = old_cpu; for (i = old_cnt; i < cnt; i++) { taskqgroup_cpu_create(qgroup, i, cpu); for (k = 0; k < stride; k++) cpu = CPU_NEXT(cpu); } mtx_lock(&qgroup->tqg_lock); qgroup->tqg_cnt = cnt; qgroup->tqg_stride = stride; /* * Adjust drivers to use new taskqs. */ for (i = 0; i < old_cnt; i++) { while ((gtask = LIST_FIRST(&qgroup->tqg_queue[i].tgc_tasks))) { LIST_REMOVE(gtask, gt_list); qgroup->tqg_queue[i].tgc_cnt--; LIST_INSERT_HEAD(>ask_head, gtask, gt_list); } } mtx_unlock(&qgroup->tqg_lock); while ((gtask = LIST_FIRST(>ask_head))) { LIST_REMOVE(gtask, gt_list); if (gtask->gt_cpu == -1) taskqgroup_attach_deferred(qgroup, gtask); else if (taskqgroup_attach_cpu_deferred(qgroup, gtask)) taskqgroup_attach_deferred(qgroup, gtask); } #ifdef INVARIANTS mtx_lock(&qgroup->tqg_lock); for (i = 0; i < qgroup->tqg_cnt; i++) { MPASS(qgroup->tqg_queue[i].tgc_taskq != NULL); LIST_FOREACH(gtask, &qgroup->tqg_queue[i].tgc_tasks, gt_list) MPASS(gtask->gt_taskqueue != NULL); } mtx_unlock(&qgroup->tqg_lock); #endif /* * If taskq thread count has been reduced. */ for (i = cnt; i < old_cnt; i++) taskqgroup_cpu_remove(qgroup, i); taskqgroup_bind(qgroup); mtx_lock(&qgroup->tqg_lock); qgroup->tqg_adjusting = 0; return (0); } int taskqgroup_adjust(struct taskqgroup *qgroup, int cnt, int stride) { int error; mtx_lock(&qgroup->tqg_lock); error = _taskqgroup_adjust(qgroup, cnt, stride); mtx_unlock(&qgroup->tqg_lock); return (error); } struct taskqgroup * taskqgroup_create(char *name) { struct taskqgroup *qgroup; qgroup = malloc(sizeof(*qgroup), M_GTASKQUEUE, M_WAITOK | M_ZERO); mtx_init(&qgroup->tqg_lock, "taskqgroup", NULL, MTX_DEF); qgroup->tqg_name = name; LIST_INIT(&qgroup->tqg_queue[0].tgc_tasks); return (qgroup); } void taskqgroup_destroy(struct taskqgroup *qgroup) { } Index: projects/clang400-import/sys/net/iflib.c =================================================================== --- projects/clang400-import/sys/net/iflib.c (revision 314522) +++ projects/clang400-import/sys/net/iflib.c (revision 314523) @@ -1,5244 +1,5243 @@ /*- * Copyright (c) 2014-2017, Matthew Macy * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * 2. Neither the name of Matthew Macy nor the names of its * contributors may be used to endorse or promote products derived from * this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_acpi.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "ifdi_if.h" #if defined(__i386__) || defined(__amd64__) #include #include #include #include #include #include #endif /* * enable accounting of every mbuf as it comes in to and goes out of iflib's software descriptor references */ #define MEMORY_LOGGING 0 /* * Enable mbuf vectors for compressing long mbuf chains */ /* * NB: * - Prefetching in tx cleaning should perhaps be a tunable. The distance ahead * we prefetch needs to be determined by the time spent in m_free vis a vis * the cost of a prefetch. This will of course vary based on the workload: * - NFLX's m_free path is dominated by vm-based M_EXT manipulation which * is quite expensive, thus suggesting very little prefetch. * - small packet forwarding which is just returning a single mbuf to * UMA will typically be very fast vis a vis the cost of a memory * access. */ /* * File organization: * - private structures * - iflib private utility functions * - ifnet functions * - vlan registry and other exported functions * - iflib public core functions * * */ static MALLOC_DEFINE(M_IFLIB, "iflib", "ifnet library"); struct iflib_txq; typedef struct iflib_txq *iflib_txq_t; struct iflib_rxq; typedef struct iflib_rxq *iflib_rxq_t; struct iflib_fl; typedef struct iflib_fl *iflib_fl_t; struct iflib_ctx; typedef struct iflib_filter_info { driver_filter_t *ifi_filter; void *ifi_filter_arg; struct grouptask *ifi_task; struct iflib_ctx *ifi_ctx; } *iflib_filter_info_t; struct iflib_ctx { KOBJ_FIELDS; /* * Pointer to hardware driver's softc */ void *ifc_softc; device_t ifc_dev; if_t ifc_ifp; cpuset_t ifc_cpus; if_shared_ctx_t ifc_sctx; struct if_softc_ctx ifc_softc_ctx; struct mtx ifc_mtx; uint16_t ifc_nhwtxqs; uint16_t ifc_nhwrxqs; iflib_txq_t ifc_txqs; iflib_rxq_t ifc_rxqs; uint32_t ifc_if_flags; uint32_t ifc_flags; uint32_t ifc_max_fl_buf_size; int ifc_in_detach; int ifc_link_state; int ifc_link_irq; int ifc_pause_frames; int ifc_watchdog_events; struct cdev *ifc_led_dev; struct resource *ifc_msix_mem; struct if_irq ifc_legacy_irq; struct grouptask ifc_admin_task; struct grouptask ifc_vflr_task; struct iflib_filter_info ifc_filter_info; struct ifmedia ifc_media; struct sysctl_oid *ifc_sysctl_node; uint16_t ifc_sysctl_ntxqs; uint16_t ifc_sysctl_nrxqs; uint16_t ifc_sysctl_qs_eq_override; uint16_t ifc_sysctl_ntxds[8]; uint16_t ifc_sysctl_nrxds[8]; struct if_txrx ifc_txrx; #define isc_txd_encap ifc_txrx.ift_txd_encap #define isc_txd_flush ifc_txrx.ift_txd_flush #define isc_txd_credits_update ifc_txrx.ift_txd_credits_update #define isc_rxd_available ifc_txrx.ift_rxd_available #define isc_rxd_pkt_get ifc_txrx.ift_rxd_pkt_get #define isc_rxd_refill ifc_txrx.ift_rxd_refill #define isc_rxd_flush ifc_txrx.ift_rxd_flush #define isc_rxd_refill ifc_txrx.ift_rxd_refill #define isc_rxd_refill ifc_txrx.ift_rxd_refill #define isc_legacy_intr ifc_txrx.ift_legacy_intr eventhandler_tag ifc_vlan_attach_event; eventhandler_tag ifc_vlan_detach_event; uint8_t ifc_mac[ETHER_ADDR_LEN]; char ifc_mtx_name[16]; }; void * iflib_get_softc(if_ctx_t ctx) { return (ctx->ifc_softc); } device_t iflib_get_dev(if_ctx_t ctx) { return (ctx->ifc_dev); } if_t iflib_get_ifp(if_ctx_t ctx) { return (ctx->ifc_ifp); } struct ifmedia * iflib_get_media(if_ctx_t ctx) { return (&ctx->ifc_media); } void iflib_set_mac(if_ctx_t ctx, uint8_t mac[ETHER_ADDR_LEN]) { bcopy(mac, ctx->ifc_mac, ETHER_ADDR_LEN); } if_softc_ctx_t iflib_get_softc_ctx(if_ctx_t ctx) { return (&ctx->ifc_softc_ctx); } if_shared_ctx_t iflib_get_sctx(if_ctx_t ctx) { return (ctx->ifc_sctx); } #define CACHE_PTR_INCREMENT (CACHE_LINE_SIZE/sizeof(void*)) #define LINK_ACTIVE(ctx) ((ctx)->ifc_link_state == LINK_STATE_UP) #define CTX_IS_VF(ctx) ((ctx)->ifc_sctx->isc_flags & IFLIB_IS_VF) #define RX_SW_DESC_MAP_CREATED (1 << 0) #define TX_SW_DESC_MAP_CREATED (1 << 1) #define RX_SW_DESC_INUSE (1 << 3) #define TX_SW_DESC_MAPPED (1 << 4) typedef struct iflib_sw_rx_desc_array { bus_dmamap_t *ifsd_map; /* bus_dma maps for packet */ struct mbuf **ifsd_m; /* pkthdr mbufs */ caddr_t *ifsd_cl; /* direct cluster pointer for rx */ uint8_t *ifsd_flags; } iflib_rxsd_array_t; typedef struct iflib_sw_tx_desc_array { bus_dmamap_t *ifsd_map; /* bus_dma maps for packet */ struct mbuf **ifsd_m; /* pkthdr mbufs */ uint8_t *ifsd_flags; } iflib_txsd_array_t; /* magic number that should be high enough for any hardware */ #define IFLIB_MAX_TX_SEGS 128 #define IFLIB_MAX_RX_SEGS 32 #define IFLIB_RX_COPY_THRESH 63 #define IFLIB_MAX_RX_REFRESH 32 #define IFLIB_QUEUE_IDLE 0 #define IFLIB_QUEUE_HUNG 1 #define IFLIB_QUEUE_WORKING 2 /* this should really scale with ring size - 32 is a fairly arbitrary value for this */ #define TX_BATCH_SIZE 16 #define IFLIB_RESTART_BUDGET 8 #define IFC_LEGACY 0x01 #define IFC_QFLUSH 0x02 #define IFC_MULTISEG 0x04 #define IFC_DMAR 0x08 #define IFC_SC_ALLOCATED 0x10 #define IFC_INIT_DONE 0x20 #define CSUM_OFFLOAD (CSUM_IP_TSO|CSUM_IP6_TSO|CSUM_IP| \ CSUM_IP_UDP|CSUM_IP_TCP|CSUM_IP_SCTP| \ CSUM_IP6_UDP|CSUM_IP6_TCP|CSUM_IP6_SCTP) struct iflib_txq { uint16_t ift_in_use; uint16_t ift_cidx; uint16_t ift_cidx_processed; uint16_t ift_pidx; uint8_t ift_gen; uint8_t ift_db_pending; uint8_t ift_db_pending_queued; uint8_t ift_npending; uint8_t ift_br_offset; /* implicit pad */ uint64_t ift_processed; uint64_t ift_cleaned; #if MEMORY_LOGGING uint64_t ift_enqueued; uint64_t ift_dequeued; #endif uint64_t ift_no_tx_dma_setup; uint64_t ift_no_desc_avail; uint64_t ift_mbuf_defrag_failed; uint64_t ift_mbuf_defrag; uint64_t ift_map_failed; uint64_t ift_txd_encap_efbig; uint64_t ift_pullups; struct mtx ift_mtx; struct mtx ift_db_mtx; /* constant values */ if_ctx_t ift_ctx; struct ifmp_ring **ift_br; struct grouptask ift_task; uint16_t ift_size; uint16_t ift_id; struct callout ift_timer; struct callout ift_db_check; iflib_txsd_array_t ift_sds; uint8_t ift_nbr; uint8_t ift_qstatus; uint8_t ift_active; uint8_t ift_closed; int ift_watchdog_time; struct iflib_filter_info ift_filter_info; bus_dma_tag_t ift_desc_tag; bus_dma_tag_t ift_tso_desc_tag; iflib_dma_info_t ift_ifdi; #define MTX_NAME_LEN 16 char ift_mtx_name[MTX_NAME_LEN]; char ift_db_mtx_name[MTX_NAME_LEN]; bus_dma_segment_t ift_segs[IFLIB_MAX_TX_SEGS] __aligned(CACHE_LINE_SIZE); #ifdef IFLIB_DIAGNOSTICS uint64_t ift_cpu_exec_count[256]; #endif } __aligned(CACHE_LINE_SIZE); struct iflib_fl { uint16_t ifl_cidx; uint16_t ifl_pidx; uint16_t ifl_credits; uint8_t ifl_gen; #if MEMORY_LOGGING uint64_t ifl_m_enqueued; uint64_t ifl_m_dequeued; uint64_t ifl_cl_enqueued; uint64_t ifl_cl_dequeued; #endif /* implicit pad */ /* constant */ uint16_t ifl_size; uint16_t ifl_buf_size; uint16_t ifl_cltype; uma_zone_t ifl_zone; iflib_rxsd_array_t ifl_sds; iflib_rxq_t ifl_rxq; uint8_t ifl_id; bus_dma_tag_t ifl_desc_tag; iflib_dma_info_t ifl_ifdi; uint64_t ifl_bus_addrs[IFLIB_MAX_RX_REFRESH] __aligned(CACHE_LINE_SIZE); caddr_t ifl_vm_addrs[IFLIB_MAX_RX_REFRESH]; } __aligned(CACHE_LINE_SIZE); static inline int get_inuse(int size, int cidx, int pidx, int gen) { int used; if (pidx > cidx) used = pidx - cidx; else if (pidx < cidx) used = size - cidx + pidx; else if (gen == 0 && pidx == cidx) used = 0; else if (gen == 1 && pidx == cidx) used = size; else panic("bad state"); return (used); } #define TXQ_AVAIL(txq) (txq->ift_size - get_inuse(txq->ift_size, txq->ift_cidx, txq->ift_pidx, txq->ift_gen)) #define IDXDIFF(head, tail, wrap) \ ((head) >= (tail) ? (head) - (tail) : (wrap) - (tail) + (head)) struct iflib_rxq { /* If there is a separate completion queue - * these are the cq cidx and pidx. Otherwise * these are unused. */ uint16_t ifr_size; uint16_t ifr_cq_cidx; uint16_t ifr_cq_pidx; uint8_t ifr_cq_gen; uint8_t ifr_fl_offset; if_ctx_t ifr_ctx; iflib_fl_t ifr_fl; uint64_t ifr_rx_irq; uint16_t ifr_id; uint8_t ifr_lro_enabled; uint8_t ifr_nfl; struct lro_ctrl ifr_lc; struct grouptask ifr_task; struct iflib_filter_info ifr_filter_info; iflib_dma_info_t ifr_ifdi; /* dynamically allocate if any drivers need a value substantially larger than this */ struct if_rxd_frag ifr_frags[IFLIB_MAX_RX_SEGS] __aligned(CACHE_LINE_SIZE); #ifdef IFLIB_DIAGNOSTICS uint64_t ifr_cpu_exec_count[256]; #endif } __aligned(CACHE_LINE_SIZE); /* * Only allow a single packet to take up most 1/nth of the tx ring */ #define MAX_SINGLE_PACKET_FRACTION 12 #define IF_BAD_DMA (bus_addr_t)-1 static int enable_msix = 1; #define CTX_ACTIVE(ctx) ((if_getdrvflags((ctx)->ifc_ifp) & IFF_DRV_RUNNING)) #define CTX_LOCK_INIT(_sc, _name) mtx_init(&(_sc)->ifc_mtx, _name, "iflib ctx lock", MTX_DEF) #define CTX_LOCK(ctx) mtx_lock(&(ctx)->ifc_mtx) #define CTX_UNLOCK(ctx) mtx_unlock(&(ctx)->ifc_mtx) #define CTX_LOCK_DESTROY(ctx) mtx_destroy(&(ctx)->ifc_mtx) #define TXDB_LOCK_INIT(txq) mtx_init(&(txq)->ift_db_mtx, (txq)->ift_db_mtx_name, NULL, MTX_DEF) #define TXDB_TRYLOCK(txq) mtx_trylock(&(txq)->ift_db_mtx) #define TXDB_LOCK(txq) mtx_lock(&(txq)->ift_db_mtx) #define TXDB_UNLOCK(txq) mtx_unlock(&(txq)->ift_db_mtx) #define TXDB_LOCK_DESTROY(txq) mtx_destroy(&(txq)->ift_db_mtx) #define CALLOUT_LOCK(txq) mtx_lock(&txq->ift_mtx) #define CALLOUT_UNLOCK(txq) mtx_unlock(&txq->ift_mtx) /* Our boot-time initialization hook */ static int iflib_module_event_handler(module_t, int, void *); static moduledata_t iflib_moduledata = { "iflib", iflib_module_event_handler, NULL }; DECLARE_MODULE(iflib, iflib_moduledata, SI_SUB_INIT_IF, SI_ORDER_ANY); MODULE_VERSION(iflib, 1); MODULE_DEPEND(iflib, pci, 1, 1, 1); MODULE_DEPEND(iflib, ether, 1, 1, 1); -TASKQGROUP_DEFINE(if_io_tqg, mp_ncpus, 1); TASKQGROUP_DEFINE(if_config_tqg, 1, 1); #ifndef IFLIB_DEBUG_COUNTERS #ifdef INVARIANTS #define IFLIB_DEBUG_COUNTERS 1 #else #define IFLIB_DEBUG_COUNTERS 0 #endif /* !INVARIANTS */ #endif static SYSCTL_NODE(_net, OID_AUTO, iflib, CTLFLAG_RD, 0, "iflib driver parameters"); /* * XXX need to ensure that this can't accidentally cause the head to be moved backwards */ static int iflib_min_tx_latency = 0; SYSCTL_INT(_net_iflib, OID_AUTO, min_tx_latency, CTLFLAG_RW, &iflib_min_tx_latency, 0, "minimize transmit latency at the possible expense of throughput"); #if IFLIB_DEBUG_COUNTERS static int iflib_tx_seen; static int iflib_tx_sent; static int iflib_tx_encap; static int iflib_rx_allocs; static int iflib_fl_refills; static int iflib_fl_refills_large; static int iflib_tx_frees; SYSCTL_INT(_net_iflib, OID_AUTO, tx_seen, CTLFLAG_RD, &iflib_tx_seen, 0, "# tx mbufs seen"); SYSCTL_INT(_net_iflib, OID_AUTO, tx_sent, CTLFLAG_RD, &iflib_tx_sent, 0, "# tx mbufs sent"); SYSCTL_INT(_net_iflib, OID_AUTO, tx_encap, CTLFLAG_RD, &iflib_tx_encap, 0, "# tx mbufs encapped"); SYSCTL_INT(_net_iflib, OID_AUTO, tx_frees, CTLFLAG_RD, &iflib_tx_frees, 0, "# tx frees"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_allocs, CTLFLAG_RD, &iflib_rx_allocs, 0, "# rx allocations"); SYSCTL_INT(_net_iflib, OID_AUTO, fl_refills, CTLFLAG_RD, &iflib_fl_refills, 0, "# refills"); SYSCTL_INT(_net_iflib, OID_AUTO, fl_refills_large, CTLFLAG_RD, &iflib_fl_refills_large, 0, "# large refills"); static int iflib_txq_drain_flushing; static int iflib_txq_drain_oactive; static int iflib_txq_drain_notready; static int iflib_txq_drain_encapfail; SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_flushing, CTLFLAG_RD, &iflib_txq_drain_flushing, 0, "# drain flushes"); SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_oactive, CTLFLAG_RD, &iflib_txq_drain_oactive, 0, "# drain oactives"); SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_notready, CTLFLAG_RD, &iflib_txq_drain_notready, 0, "# drain notready"); SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_encapfail, CTLFLAG_RD, &iflib_txq_drain_encapfail, 0, "# drain encap fails"); static int iflib_encap_load_mbuf_fail; static int iflib_encap_txq_avail_fail; static int iflib_encap_txd_encap_fail; SYSCTL_INT(_net_iflib, OID_AUTO, encap_load_mbuf_fail, CTLFLAG_RD, &iflib_encap_load_mbuf_fail, 0, "# busdma load failures"); SYSCTL_INT(_net_iflib, OID_AUTO, encap_txq_avail_fail, CTLFLAG_RD, &iflib_encap_txq_avail_fail, 0, "# txq avail failures"); SYSCTL_INT(_net_iflib, OID_AUTO, encap_txd_encap_fail, CTLFLAG_RD, &iflib_encap_txd_encap_fail, 0, "# driver encap failures"); static int iflib_task_fn_rxs; static int iflib_rx_intr_enables; static int iflib_fast_intrs; static int iflib_intr_link; static int iflib_intr_msix; static int iflib_rx_unavail; static int iflib_rx_ctx_inactive; static int iflib_rx_zero_len; static int iflib_rx_if_input; static int iflib_rx_mbuf_null; static int iflib_rxd_flush; static int iflib_verbose_debug; SYSCTL_INT(_net_iflib, OID_AUTO, intr_link, CTLFLAG_RD, &iflib_intr_link, 0, "# intr link calls"); SYSCTL_INT(_net_iflib, OID_AUTO, intr_msix, CTLFLAG_RD, &iflib_intr_msix, 0, "# intr msix calls"); SYSCTL_INT(_net_iflib, OID_AUTO, task_fn_rx, CTLFLAG_RD, &iflib_task_fn_rxs, 0, "# task_fn_rx calls"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_intr_enables, CTLFLAG_RD, &iflib_rx_intr_enables, 0, "# rx intr enables"); SYSCTL_INT(_net_iflib, OID_AUTO, fast_intrs, CTLFLAG_RD, &iflib_fast_intrs, 0, "# fast_intr calls"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_unavail, CTLFLAG_RD, &iflib_rx_unavail, 0, "# times rxeof called with no available data"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_ctx_inactive, CTLFLAG_RD, &iflib_rx_ctx_inactive, 0, "# times rxeof called with inactive context"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_zero_len, CTLFLAG_RD, &iflib_rx_zero_len, 0, "# times rxeof saw zero len mbuf"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_if_input, CTLFLAG_RD, &iflib_rx_if_input, 0, "# times rxeof called if_input"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_mbuf_null, CTLFLAG_RD, &iflib_rx_mbuf_null, 0, "# times rxeof got null mbuf"); SYSCTL_INT(_net_iflib, OID_AUTO, rxd_flush, CTLFLAG_RD, &iflib_rxd_flush, 0, "# times rxd_flush called"); SYSCTL_INT(_net_iflib, OID_AUTO, verbose_debug, CTLFLAG_RW, &iflib_verbose_debug, 0, "enable verbose debugging"); #define DBG_COUNTER_INC(name) atomic_add_int(&(iflib_ ## name), 1) static void iflib_debug_reset(void) { iflib_tx_seen = iflib_tx_sent = iflib_tx_encap = iflib_rx_allocs = iflib_fl_refills = iflib_fl_refills_large = iflib_tx_frees = iflib_txq_drain_flushing = iflib_txq_drain_oactive = iflib_txq_drain_notready = iflib_txq_drain_encapfail = iflib_encap_load_mbuf_fail = iflib_encap_txq_avail_fail = iflib_encap_txd_encap_fail = iflib_task_fn_rxs = iflib_rx_intr_enables = iflib_fast_intrs = iflib_intr_link = iflib_intr_msix = iflib_rx_unavail = iflib_rx_ctx_inactive = iflib_rx_zero_len = iflib_rx_if_input = iflib_rx_mbuf_null = iflib_rxd_flush = 0; } #else #define DBG_COUNTER_INC(name) static void iflib_debug_reset(void) {} #endif #define IFLIB_DEBUG 0 static void iflib_tx_structures_free(if_ctx_t ctx); static void iflib_rx_structures_free(if_ctx_t ctx); static int iflib_queues_alloc(if_ctx_t ctx); static int iflib_tx_credits_update(if_ctx_t ctx, iflib_txq_t txq); static int iflib_rxd_avail(if_ctx_t ctx, iflib_rxq_t rxq, int cidx, int budget); static int iflib_qset_structures_setup(if_ctx_t ctx); static int iflib_msix_init(if_ctx_t ctx); static int iflib_legacy_setup(if_ctx_t ctx, driver_filter_t filter, void *filterarg, int *rid, char *str); static void iflib_txq_check_drain(iflib_txq_t txq, int budget); static uint32_t iflib_txq_can_drain(struct ifmp_ring *); static int iflib_register(if_ctx_t); static void iflib_init_locked(if_ctx_t ctx); static void iflib_add_device_sysctl_pre(if_ctx_t ctx); static void iflib_add_device_sysctl_post(if_ctx_t ctx); static void iflib_ifmp_purge(iflib_txq_t txq); static void _iflib_pre_assert(if_softc_ctx_t scctx); #ifdef DEV_NETMAP #include #include #include MODULE_DEPEND(iflib, netmap, 1, 1, 1); /* * device-specific sysctl variables: * * iflib_crcstrip: 0: keep CRC in rx frames (default), 1: strip it. * During regular operations the CRC is stripped, but on some * hardware reception of frames not multiple of 64 is slower, * so using crcstrip=0 helps in benchmarks. * * iflib_rx_miss, iflib_rx_miss_bufs: * count packets that might be missed due to lost interrupts. */ SYSCTL_DECL(_dev_netmap); /* * The xl driver by default strips CRCs and we do not override it. */ int iflib_crcstrip = 1; SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_crcstrip, CTLFLAG_RW, &iflib_crcstrip, 1, "strip CRC on rx frames"); int iflib_rx_miss, iflib_rx_miss_bufs; SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_rx_miss, CTLFLAG_RW, &iflib_rx_miss, 0, "potentially missed rx intr"); SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_rx_miss_bufs, CTLFLAG_RW, &iflib_rx_miss_bufs, 0, "potentially missed rx intr bufs"); /* * Register/unregister. We are already under netmap lock. * Only called on the first register or the last unregister. */ static int iflib_netmap_register(struct netmap_adapter *na, int onoff) { struct ifnet *ifp = na->ifp; if_ctx_t ctx = ifp->if_softc; CTX_LOCK(ctx); IFDI_INTR_DISABLE(ctx); /* Tell the stack that the interface is no longer active */ ifp->if_drv_flags &= ~(IFF_DRV_RUNNING | IFF_DRV_OACTIVE); if (!CTX_IS_VF(ctx)) IFDI_CRCSTRIP_SET(ctx, onoff, iflib_crcstrip); /* enable or disable flags and callbacks in na and ifp */ if (onoff) { nm_set_native_flags(na); } else { nm_clear_native_flags(na); } IFDI_INIT(ctx); IFDI_CRCSTRIP_SET(ctx, onoff, iflib_crcstrip); // XXX why twice ? CTX_UNLOCK(ctx); return (ifp->if_drv_flags & IFF_DRV_RUNNING ? 0 : 1); } /* * Reconcile kernel and user view of the transmit ring. * * All information is in the kring. * Userspace wants to send packets up to the one before kring->rhead, * kernel knows kring->nr_hwcur is the first unsent packet. * * Here we push packets out (as many as possible), and possibly * reclaim buffers from previously completed transmission. * * The caller (netmap) guarantees that there is only one instance * running at any time. Any interference with other driver * methods should be handled by the individual drivers. */ static int iflib_netmap_txsync(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; struct ifnet *ifp = na->ifp; struct netmap_ring *ring = kring->ring; u_int nm_i; /* index into the netmap ring */ u_int nic_i; /* index into the NIC ring */ u_int n; u_int const lim = kring->nkr_num_slots - 1; u_int const head = kring->rhead; struct if_pkt_info pi; /* * interrupts on every tx packet are expensive so request * them every half ring, or where NS_REPORT is set */ u_int report_frequency = kring->nkr_num_slots >> 1; /* device-specific */ if_ctx_t ctx = ifp->if_softc; iflib_txq_t txq = &ctx->ifc_txqs[kring->ring_id]; pi.ipi_segs = txq->ift_segs; pi.ipi_qsidx = kring->ring_id; pi.ipi_ndescs = 0; bus_dmamap_sync(txq->ift_desc_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); /* * First part: process new packets to send. * nm_i is the current index in the netmap ring, * nic_i is the corresponding index in the NIC ring. * * If we have packets to send (nm_i != head) * iterate over the netmap ring, fetch length and update * the corresponding slot in the NIC ring. Some drivers also * need to update the buffer's physical address in the NIC slot * even NS_BUF_CHANGED is not set (PNMB computes the addresses). * * The netmap_reload_map() calls is especially expensive, * even when (as in this case) the tag is 0, so do only * when the buffer has actually changed. * * If possible do not set the report/intr bit on all slots, * but only a few times per ring or when NS_REPORT is set. * * Finally, on 10G and faster drivers, it might be useful * to prefetch the next slot and txr entry. */ nm_i = kring->nr_hwcur; if (nm_i != head) { /* we have new packets to send */ nic_i = netmap_idx_k2n(kring, nm_i); __builtin_prefetch(&ring->slot[nm_i]); __builtin_prefetch(&txq->ift_sds.ifsd_m[nic_i]); __builtin_prefetch(&txq->ift_sds.ifsd_map[nic_i]); for (n = 0; nm_i != head; n++) { struct netmap_slot *slot = &ring->slot[nm_i]; u_int len = slot->len; uint64_t paddr; void *addr = PNMB(na, slot, &paddr); int flags = (slot->flags & NS_REPORT || nic_i == 0 || nic_i == report_frequency) ? IPI_TX_INTR : 0; /* device-specific */ pi.ipi_pidx = nic_i; pi.ipi_flags = flags; /* Fill the slot in the NIC ring. */ ctx->isc_txd_encap(ctx->ifc_softc, &pi); /* prefetch for next round */ __builtin_prefetch(&ring->slot[nm_i + 1]); __builtin_prefetch(&txq->ift_sds.ifsd_m[nic_i + 1]); __builtin_prefetch(&txq->ift_sds.ifsd_map[nic_i + 1]); NM_CHECK_ADDR_LEN(na, addr, len); if (slot->flags & NS_BUF_CHANGED) { /* buffer has changed, reload map */ netmap_reload_map(na, txq->ift_desc_tag, txq->ift_sds.ifsd_map[nic_i], addr); } slot->flags &= ~(NS_REPORT | NS_BUF_CHANGED); /* make sure changes to the buffer are synced */ bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_sds.ifsd_map[nic_i], BUS_DMASYNC_PREWRITE); nm_i = nm_next(nm_i, lim); nic_i = nm_next(nic_i, lim); } kring->nr_hwcur = head; /* synchronize the NIC ring */ bus_dmamap_sync(txq->ift_desc_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* (re)start the tx unit up to slot nic_i (excluded) */ ctx->isc_txd_flush(ctx->ifc_softc, txq->ift_id, nic_i); } /* * Second part: reclaim buffers for completed transmissions. */ if (iflib_tx_credits_update(ctx, txq)) { /* some tx completed, increment avail */ nic_i = txq->ift_cidx_processed; kring->nr_hwtail = nm_prev(netmap_idx_n2k(kring, nic_i), lim); } return (0); } /* * Reconcile kernel and user view of the receive ring. * Same as for the txsync, this routine must be efficient. * The caller guarantees a single invocations, but races against * the rest of the driver should be handled here. * * On call, kring->rhead is the first packet that userspace wants * to keep, and kring->rcur is the wakeup point. * The kernel has previously reported packets up to kring->rtail. * * If (flags & NAF_FORCE_READ) also check for incoming packets irrespective * of whether or not we received an interrupt. */ static int iflib_netmap_rxsync(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; struct ifnet *ifp = na->ifp; struct netmap_ring *ring = kring->ring; u_int nm_i; /* index into the netmap ring */ u_int nic_i; /* index into the NIC ring */ u_int i, n; u_int const lim = kring->nkr_num_slots - 1; u_int const head = kring->rhead; int force_update = (flags & NAF_FORCE_READ) || kring->nr_kflags & NKR_PENDINTR; struct if_rxd_info ri; /* device-specific */ if_ctx_t ctx = ifp->if_softc; iflib_rxq_t rxq = &ctx->ifc_rxqs[kring->ring_id]; iflib_fl_t fl = rxq->ifr_fl; if (head > lim) return netmap_ring_reinit(kring); bzero(&ri, sizeof(ri)); ri.iri_qsidx = kring->ring_id; ri.iri_ifp = ctx->ifc_ifp; /* XXX check sync modes */ for (i = 0, fl = rxq->ifr_fl; i < rxq->ifr_nfl; i++, fl++) bus_dmamap_sync(rxq->ifr_fl[i].ifl_desc_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); /* * First part: import newly received packets. * * nm_i is the index of the next free slot in the netmap ring, * nic_i is the index of the next received packet in the NIC ring, * and they may differ in case if_init() has been called while * in netmap mode. For the receive ring we have * * nic_i = rxr->next_check; * nm_i = kring->nr_hwtail (previous) * and * nm_i == (nic_i + kring->nkr_hwofs) % ring_size * * rxr->next_check is set to 0 on a ring reinit */ if (netmap_no_pendintr || force_update) { int crclen = iflib_crcstrip ? 0 : 4; int error, avail; uint16_t slot_flags = kring->nkr_slot_flags; for (fl = rxq->ifr_fl, i = 0; i < rxq->ifr_nfl; i++, fl++) { nic_i = fl->ifl_cidx; nm_i = netmap_idx_n2k(kring, nic_i); avail = ctx->isc_rxd_available(ctx->ifc_softc, kring->ring_id, nic_i, INT_MAX); for (n = 0; avail > 0; n++, avail--) { error = ctx->isc_rxd_pkt_get(ctx->ifc_softc, &ri); if (error) ring->slot[nm_i].len = 0; else ring->slot[nm_i].len = ri.iri_len - crclen; ring->slot[nm_i].flags = slot_flags; bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_sds.ifsd_map[nic_i], BUS_DMASYNC_POSTREAD); nm_i = nm_next(nm_i, lim); nic_i = nm_next(nic_i, lim); } if (n) { /* update the state variables */ if (netmap_no_pendintr && !force_update) { /* diagnostics */ iflib_rx_miss ++; iflib_rx_miss_bufs += n; } fl->ifl_cidx = nic_i; kring->nr_hwtail = nm_i; } kring->nr_kflags &= ~NKR_PENDINTR; } } /* * Second part: skip past packets that userspace has released. * (kring->nr_hwcur to head excluded), * and make the buffers available for reception. * As usual nm_i is the index in the netmap ring, * nic_i is the index in the NIC ring, and * nm_i == (nic_i + kring->nkr_hwofs) % ring_size */ /* XXX not sure how this will work with multiple free lists */ nm_i = kring->nr_hwcur; if (nm_i != head) { nic_i = netmap_idx_k2n(kring, nm_i); for (n = 0; nm_i != head; n++) { struct netmap_slot *slot = &ring->slot[nm_i]; uint64_t paddr; caddr_t vaddr; void *addr = PNMB(na, slot, &paddr); if (addr == NETMAP_BUF_BASE(na)) /* bad buf */ goto ring_reset; vaddr = addr; if (slot->flags & NS_BUF_CHANGED) { /* buffer has changed, reload map */ netmap_reload_map(na, fl->ifl_ifdi->idi_tag, fl->ifl_sds.ifsd_map[nic_i], addr); slot->flags &= ~NS_BUF_CHANGED; } /* * XXX we should be batching this operation - TODO */ ctx->isc_rxd_refill(ctx->ifc_softc, rxq->ifr_id, fl->ifl_id, nic_i, &paddr, &vaddr, 1, fl->ifl_buf_size); bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_sds.ifsd_map[nic_i], BUS_DMASYNC_PREREAD); nm_i = nm_next(nm_i, lim); nic_i = nm_next(nic_i, lim); } kring->nr_hwcur = head; bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* * IMPORTANT: we must leave one free slot in the ring, * so move nic_i back by one unit */ nic_i = nm_prev(nic_i, lim); ctx->isc_rxd_flush(ctx->ifc_softc, rxq->ifr_id, fl->ifl_id, nic_i); } return 0; ring_reset: return netmap_ring_reinit(kring); } static int iflib_netmap_attach(if_ctx_t ctx) { struct netmap_adapter na; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; bzero(&na, sizeof(na)); na.ifp = ctx->ifc_ifp; na.na_flags = NAF_BDG_MAYSLEEP; MPASS(ctx->ifc_softc_ctx.isc_ntxqsets); MPASS(ctx->ifc_softc_ctx.isc_nrxqsets); na.num_tx_desc = scctx->isc_ntxd[0]; na.num_rx_desc = scctx->isc_nrxd[0]; na.nm_txsync = iflib_netmap_txsync; na.nm_rxsync = iflib_netmap_rxsync; na.nm_register = iflib_netmap_register; na.num_tx_rings = ctx->ifc_softc_ctx.isc_ntxqsets; na.num_rx_rings = ctx->ifc_softc_ctx.isc_nrxqsets; return (netmap_attach(&na)); } static void iflib_netmap_txq_init(if_ctx_t ctx, iflib_txq_t txq) { struct netmap_adapter *na = NA(ctx->ifc_ifp); struct netmap_slot *slot; slot = netmap_reset(na, NR_TX, txq->ift_id, 0); if (slot == NULL) return; for (int i = 0; i < ctx->ifc_softc_ctx.isc_ntxd[0]; i++) { /* * In netmap mode, set the map for the packet buffer. * NOTE: Some drivers (not this one) also need to set * the physical buffer address in the NIC ring. * netmap_idx_n2k() maps a nic index, i, into the corresponding * netmap slot index, si */ int si = netmap_idx_n2k(&na->tx_rings[txq->ift_id], i); netmap_load_map(na, txq->ift_desc_tag, txq->ift_sds.ifsd_map[i], NMB(na, slot + si)); } } static void iflib_netmap_rxq_init(if_ctx_t ctx, iflib_rxq_t rxq) { struct netmap_adapter *na = NA(ctx->ifc_ifp); struct netmap_slot *slot; bus_dmamap_t *map; int nrxd; slot = netmap_reset(na, NR_RX, rxq->ifr_id, 0); if (slot == NULL) return; map = rxq->ifr_fl[0].ifl_sds.ifsd_map; nrxd = ctx->ifc_softc_ctx.isc_nrxd[0]; for (int i = 0; i < nrxd; i++, map++) { int sj = netmap_idx_n2k(&na->rx_rings[rxq->ifr_id], i); uint64_t paddr; void *addr; caddr_t vaddr; vaddr = addr = PNMB(na, slot + sj, &paddr); netmap_load_map(na, rxq->ifr_fl[0].ifl_ifdi->idi_tag, *map, addr); /* Update descriptor and the cached value */ ctx->isc_rxd_refill(ctx->ifc_softc, rxq->ifr_id, 0 /* fl_id */, i, &paddr, &vaddr, 1, rxq->ifr_fl[0].ifl_buf_size); } /* preserve queue */ if (ctx->ifc_ifp->if_capenable & IFCAP_NETMAP) { struct netmap_kring *kring = &na->rx_rings[rxq->ifr_id]; int t = na->num_rx_desc - 1 - nm_kr_rxspace(kring); ctx->isc_rxd_flush(ctx->ifc_softc, rxq->ifr_id, 0 /* fl_id */, t); } else ctx->isc_rxd_flush(ctx->ifc_softc, rxq->ifr_id, 0 /* fl_id */, nrxd-1); } #define iflib_netmap_detach(ifp) netmap_detach(ifp) #else #define iflib_netmap_txq_init(ctx, txq) #define iflib_netmap_rxq_init(ctx, rxq) #define iflib_netmap_detach(ifp) #define iflib_netmap_attach(ctx) (0) #define netmap_rx_irq(ifp, qid, budget) (0) #endif #if defined(__i386__) || defined(__amd64__) static __inline void prefetch(void *x) { __asm volatile("prefetcht0 %0" :: "m" (*(unsigned long *)x)); } #else #define prefetch(x) #endif static void _iflib_dmamap_cb(void *arg, bus_dma_segment_t *segs, int nseg, int err) { if (err) return; *(bus_addr_t *) arg = segs[0].ds_addr; } int iflib_dma_alloc(if_ctx_t ctx, int size, iflib_dma_info_t dma, int mapflags) { int err; if_shared_ctx_t sctx = ctx->ifc_sctx; device_t dev = ctx->ifc_dev; KASSERT(sctx->isc_q_align != 0, ("alignment value not initialized")); err = bus_dma_tag_create(bus_get_dma_tag(dev), /* parent */ sctx->isc_q_align, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ size, /* maxsize */ 1, /* nsegments */ size, /* maxsegsize */ BUS_DMA_ALLOCNOW, /* flags */ NULL, /* lockfunc */ NULL, /* lockarg */ &dma->idi_tag); if (err) { device_printf(dev, "%s: bus_dma_tag_create failed: %d\n", __func__, err); goto fail_0; } err = bus_dmamem_alloc(dma->idi_tag, (void**) &dma->idi_vaddr, BUS_DMA_NOWAIT | BUS_DMA_COHERENT | BUS_DMA_ZERO, &dma->idi_map); if (err) { device_printf(dev, "%s: bus_dmamem_alloc(%ju) failed: %d\n", __func__, (uintmax_t)size, err); goto fail_1; } dma->idi_paddr = IF_BAD_DMA; err = bus_dmamap_load(dma->idi_tag, dma->idi_map, dma->idi_vaddr, size, _iflib_dmamap_cb, &dma->idi_paddr, mapflags | BUS_DMA_NOWAIT); if (err || dma->idi_paddr == IF_BAD_DMA) { device_printf(dev, "%s: bus_dmamap_load failed: %d\n", __func__, err); goto fail_2; } dma->idi_size = size; return (0); fail_2: bus_dmamem_free(dma->idi_tag, dma->idi_vaddr, dma->idi_map); fail_1: bus_dma_tag_destroy(dma->idi_tag); fail_0: dma->idi_tag = NULL; return (err); } int iflib_dma_alloc_multi(if_ctx_t ctx, int *sizes, iflib_dma_info_t *dmalist, int mapflags, int count) { int i, err; iflib_dma_info_t *dmaiter; dmaiter = dmalist; for (i = 0; i < count; i++, dmaiter++) { if ((err = iflib_dma_alloc(ctx, sizes[i], *dmaiter, mapflags)) != 0) break; } if (err) iflib_dma_free_multi(dmalist, i); return (err); } void iflib_dma_free(iflib_dma_info_t dma) { if (dma->idi_tag == NULL) return; if (dma->idi_paddr != IF_BAD_DMA) { bus_dmamap_sync(dma->idi_tag, dma->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(dma->idi_tag, dma->idi_map); dma->idi_paddr = IF_BAD_DMA; } if (dma->idi_vaddr != NULL) { bus_dmamem_free(dma->idi_tag, dma->idi_vaddr, dma->idi_map); dma->idi_vaddr = NULL; } bus_dma_tag_destroy(dma->idi_tag); dma->idi_tag = NULL; } void iflib_dma_free_multi(iflib_dma_info_t *dmalist, int count) { int i; iflib_dma_info_t *dmaiter = dmalist; for (i = 0; i < count; i++, dmaiter++) iflib_dma_free(*dmaiter); } #ifdef EARLY_AP_STARTUP static const int iflib_started = 1; #else /* * We used to abuse the smp_started flag to decide if the queues have been * fully initialized (by late taskqgroup_adjust() calls in a SYSINIT()). * That gave bad races, since the SYSINIT() runs strictly after smp_started * is set. Run a SYSINIT() strictly after that to just set a usable * completion flag. */ static int iflib_started; static void iflib_record_started(void *arg) { iflib_started = 1; } SYSINIT(iflib_record_started, SI_SUB_SMP + 1, SI_ORDER_FIRST, iflib_record_started, NULL); #endif static int iflib_fast_intr(void *arg) { iflib_filter_info_t info = arg; struct grouptask *gtask = info->ifi_task; if (!iflib_started) return (FILTER_HANDLED); DBG_COUNTER_INC(fast_intrs); if (info->ifi_filter != NULL && info->ifi_filter(info->ifi_filter_arg) == FILTER_HANDLED) return (FILTER_HANDLED); GROUPTASK_ENQUEUE(gtask); return (FILTER_HANDLED); } static int _iflib_irq_alloc(if_ctx_t ctx, if_irq_t irq, int rid, driver_filter_t filter, driver_intr_t handler, void *arg, char *name) { int rc; struct resource *res; void *tag; device_t dev = ctx->ifc_dev; MPASS(rid < 512); irq->ii_rid = rid; res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &irq->ii_rid, RF_SHAREABLE | RF_ACTIVE); if (res == NULL) { device_printf(dev, "failed to allocate IRQ for rid %d, name %s.\n", rid, name); return (ENOMEM); } irq->ii_res = res; KASSERT(filter == NULL || handler == NULL, ("filter and handler can't both be non-NULL")); rc = bus_setup_intr(dev, res, INTR_MPSAFE | INTR_TYPE_NET, filter, handler, arg, &tag); if (rc != 0) { device_printf(dev, "failed to setup interrupt for rid %d, name %s: %d\n", rid, name ? name : "unknown", rc); return (rc); } else if (name) bus_describe_intr(dev, res, tag, "%s", name); irq->ii_tag = tag; return (0); } /********************************************************************* * * Allocate memory for tx_buffer structures. The tx_buffer stores all * the information needed to transmit a packet on the wire. This is * called only once at attach, setup is done every reset. * **********************************************************************/ static int iflib_txsd_alloc(iflib_txq_t txq) { if_ctx_t ctx = txq->ift_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = ctx->ifc_dev; int err, nsegments, ntsosegments; nsegments = scctx->isc_tx_nsegments; ntsosegments = scctx->isc_tx_tso_segments_max; MPASS(scctx->isc_ntxd[0] > 0); MPASS(scctx->isc_ntxd[txq->ift_br_offset] > 0); MPASS(nsegments > 0); MPASS(ntsosegments > 0); /* * Setup DMA descriptor areas. */ if ((err = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ sctx->isc_tx_maxsize, /* maxsize */ nsegments, /* nsegments */ sctx->isc_tx_maxsegsize, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txq->ift_desc_tag))) { device_printf(dev,"Unable to allocate TX DMA tag: %d\n", err); device_printf(dev,"maxsize: %zd nsegments: %d maxsegsize: %zd\n", sctx->isc_tx_maxsize, nsegments, sctx->isc_tx_maxsegsize); goto fail; } if ((err = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ scctx->isc_tx_tso_size_max, /* maxsize */ ntsosegments, /* nsegments */ scctx->isc_tx_tso_segsize_max, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txq->ift_tso_desc_tag))) { device_printf(dev,"Unable to allocate TX TSO DMA tag: %d\n", err); goto fail; } if (!(txq->ift_sds.ifsd_flags = (uint8_t *) malloc(sizeof(uint8_t) * scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); err = ENOMEM; goto fail; } if (!(txq->ift_sds.ifsd_m = (struct mbuf **) malloc(sizeof(struct mbuf *) * scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); err = ENOMEM; goto fail; } /* Create the descriptor buffer dma maps */ #if defined(ACPI_DMAR) || (!(defined(__i386__) && !defined(__amd64__))) if ((ctx->ifc_flags & IFC_DMAR) == 0) return (0); if (!(txq->ift_sds.ifsd_map = (bus_dmamap_t *) malloc(sizeof(bus_dmamap_t) * scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer map memory\n"); err = ENOMEM; goto fail; } for (int i = 0; i < scctx->isc_ntxd[txq->ift_br_offset]; i++) { err = bus_dmamap_create(txq->ift_desc_tag, 0, &txq->ift_sds.ifsd_map[i]); if (err != 0) { device_printf(dev, "Unable to create TX DMA map\n"); goto fail; } } #endif return (0); fail: /* We free all, it handles case where we are in the middle */ iflib_tx_structures_free(ctx); return (err); } static void iflib_txsd_destroy(if_ctx_t ctx, iflib_txq_t txq, int i) { bus_dmamap_t map; map = NULL; if (txq->ift_sds.ifsd_map != NULL) map = txq->ift_sds.ifsd_map[i]; if (map != NULL) { bus_dmamap_unload(txq->ift_desc_tag, map); bus_dmamap_destroy(txq->ift_desc_tag, map); txq->ift_sds.ifsd_map[i] = NULL; } } static void iflib_txq_destroy(iflib_txq_t txq) { if_ctx_t ctx = txq->ift_ctx; for (int i = 0; i < txq->ift_size; i++) iflib_txsd_destroy(ctx, txq, i); if (txq->ift_sds.ifsd_map != NULL) { free(txq->ift_sds.ifsd_map, M_IFLIB); txq->ift_sds.ifsd_map = NULL; } if (txq->ift_sds.ifsd_m != NULL) { free(txq->ift_sds.ifsd_m, M_IFLIB); txq->ift_sds.ifsd_m = NULL; } if (txq->ift_sds.ifsd_flags != NULL) { free(txq->ift_sds.ifsd_flags, M_IFLIB); txq->ift_sds.ifsd_flags = NULL; } if (txq->ift_desc_tag != NULL) { bus_dma_tag_destroy(txq->ift_desc_tag); txq->ift_desc_tag = NULL; } if (txq->ift_tso_desc_tag != NULL) { bus_dma_tag_destroy(txq->ift_tso_desc_tag); txq->ift_tso_desc_tag = NULL; } } static void iflib_txsd_free(if_ctx_t ctx, iflib_txq_t txq, int i) { struct mbuf **mp; mp = &txq->ift_sds.ifsd_m[i]; if (*mp == NULL) return; if (txq->ift_sds.ifsd_map != NULL) { bus_dmamap_sync(txq->ift_desc_tag, txq->ift_sds.ifsd_map[i], BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txq->ift_desc_tag, txq->ift_sds.ifsd_map[i]); } m_free(*mp); DBG_COUNTER_INC(tx_frees); *mp = NULL; } static int iflib_txq_setup(iflib_txq_t txq) { if_ctx_t ctx = txq->ift_ctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; iflib_dma_info_t di; int i; /* Set number of descriptors available */ txq->ift_qstatus = IFLIB_QUEUE_IDLE; /* Reset indices */ txq->ift_cidx_processed = txq->ift_pidx = txq->ift_cidx = txq->ift_npending = 0; txq->ift_size = scctx->isc_ntxd[txq->ift_br_offset]; for (i = 0, di = txq->ift_ifdi; i < ctx->ifc_nhwtxqs; i++, di++) bzero((void *)di->idi_vaddr, di->idi_size); IFDI_TXQ_SETUP(ctx, txq->ift_id); for (i = 0, di = txq->ift_ifdi; i < ctx->ifc_nhwtxqs; i++, di++) bus_dmamap_sync(di->idi_tag, di->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); return (0); } /********************************************************************* * * Allocate memory for rx_buffer structures. Since we use one * rx_buffer per received packet, the maximum number of rx_buffer's * that we'll need is equal to the number of receive descriptors * that we've allocated. * **********************************************************************/ static int iflib_rxsd_alloc(iflib_rxq_t rxq) { if_ctx_t ctx = rxq->ifr_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = ctx->ifc_dev; iflib_fl_t fl; int err; MPASS(scctx->isc_nrxd[0] > 0); MPASS(scctx->isc_nrxd[rxq->ifr_fl_offset] > 0); fl = rxq->ifr_fl; for (int i = 0; i < rxq->ifr_nfl; i++, fl++) { fl->ifl_size = scctx->isc_nrxd[rxq->ifr_fl_offset]; /* this isn't necessarily the same */ err = bus_dma_tag_create(bus_get_dma_tag(dev), /* parent */ 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ sctx->isc_rx_maxsize, /* maxsize */ sctx->isc_rx_nsegments, /* nsegments */ sctx->isc_rx_maxsegsize, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockarg */ &fl->ifl_desc_tag); if (err) { device_printf(dev, "%s: bus_dma_tag_create failed %d\n", __func__, err); goto fail; } if (!(fl->ifl_sds.ifsd_flags = (uint8_t *) malloc(sizeof(uint8_t) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); err = ENOMEM; goto fail; } if (!(fl->ifl_sds.ifsd_m = (struct mbuf **) malloc(sizeof(struct mbuf *) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); err = ENOMEM; goto fail; } if (!(fl->ifl_sds.ifsd_cl = (caddr_t *) malloc(sizeof(caddr_t) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); err = ENOMEM; goto fail; } /* Create the descriptor buffer dma maps */ #if defined(ACPI_DMAR) || (!(defined(__i386__) && !defined(__amd64__))) if ((ctx->ifc_flags & IFC_DMAR) == 0) continue; if (!(fl->ifl_sds.ifsd_map = (bus_dmamap_t *) malloc(sizeof(bus_dmamap_t) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer map memory\n"); err = ENOMEM; goto fail; } for (int i = 0; i < scctx->isc_nrxd[rxq->ifr_fl_offset]; i++) { err = bus_dmamap_create(fl->ifl_desc_tag, 0, &fl->ifl_sds.ifsd_map[i]); if (err != 0) { device_printf(dev, "Unable to create TX DMA map\n"); goto fail; } } #endif } return (0); fail: iflib_rx_structures_free(ctx); return (err); } /* * Internal service routines */ struct rxq_refill_cb_arg { int error; bus_dma_segment_t seg; int nseg; }; static void _rxq_refill_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error) { struct rxq_refill_cb_arg *cb_arg = arg; cb_arg->error = error; cb_arg->seg = segs[0]; cb_arg->nseg = nseg; } #ifdef ACPI_DMAR #define IS_DMAR(ctx) (ctx->ifc_flags & IFC_DMAR) #else #define IS_DMAR(ctx) (0) #endif /** * rxq_refill - refill an rxq free-buffer list * @ctx: the iflib context * @rxq: the free-list to refill * @n: the number of new buffers to allocate * * (Re)populate an rxq free-buffer list with up to @n new packet buffers. * The caller must assure that @n does not exceed the queue's capacity. */ static void _iflib_fl_refill(if_ctx_t ctx, iflib_fl_t fl, int count) { struct mbuf *m; int idx, pidx = fl->ifl_pidx; caddr_t cl, *sd_cl; struct mbuf **sd_m; uint8_t *sd_flags; bus_dmamap_t *sd_map; int n, i = 0; uint64_t bus_addr; int err; sd_m = fl->ifl_sds.ifsd_m; sd_map = fl->ifl_sds.ifsd_map; sd_cl = fl->ifl_sds.ifsd_cl; sd_flags = fl->ifl_sds.ifsd_flags; idx = pidx; n = count; MPASS(n > 0); MPASS(fl->ifl_credits + n <= fl->ifl_size); if (pidx < fl->ifl_cidx) MPASS(pidx + n <= fl->ifl_cidx); if (pidx == fl->ifl_cidx && (fl->ifl_credits < fl->ifl_size)) MPASS(fl->ifl_gen == 0); if (pidx > fl->ifl_cidx) MPASS(n <= fl->ifl_size - pidx + fl->ifl_cidx); DBG_COUNTER_INC(fl_refills); if (n > 8) DBG_COUNTER_INC(fl_refills_large); while (n--) { /* * We allocate an uninitialized mbuf + cluster, mbuf is * initialized after rx. * * If the cluster is still set then we know a minimum sized packet was received */ if ((cl = sd_cl[idx]) == NULL) { if ((cl = sd_cl[idx] = m_cljget(NULL, M_NOWAIT, fl->ifl_buf_size)) == NULL) break; #if MEMORY_LOGGING fl->ifl_cl_enqueued++; #endif } if ((m = m_gethdr(M_NOWAIT, MT_NOINIT)) == NULL) { break; } #if MEMORY_LOGGING fl->ifl_m_enqueued++; #endif DBG_COUNTER_INC(rx_allocs); #ifdef notyet if ((sd_flags[pidx] & RX_SW_DESC_MAP_CREATED) == 0) { int err; if ((err = bus_dmamap_create(fl->ifl_ifdi->idi_tag, 0, &sd_map[idx]))) { log(LOG_WARNING, "bus_dmamap_create failed %d\n", err); uma_zfree(fl->ifl_zone, cl); n = 0; goto done; } sd_flags[idx] |= RX_SW_DESC_MAP_CREATED; } #endif #if defined(__i386__) || defined(__amd64__) if (!IS_DMAR(ctx)) { bus_addr = pmap_kextract((vm_offset_t)cl); } else #endif { struct rxq_refill_cb_arg cb_arg; iflib_rxq_t q; cb_arg.error = 0; q = fl->ifl_rxq; err = bus_dmamap_load(fl->ifl_desc_tag, sd_map[idx], cl, fl->ifl_buf_size, _rxq_refill_cb, &cb_arg, 0); if (err != 0 || cb_arg.error) { /* * !zone_pack ? */ if (fl->ifl_zone == zone_pack) uma_zfree(fl->ifl_zone, cl); m_free(m); n = 0; goto done; } bus_addr = cb_arg.seg.ds_addr; } sd_flags[idx] |= RX_SW_DESC_INUSE; MPASS(sd_m[idx] == NULL); sd_cl[idx] = cl; sd_m[idx] = m; fl->ifl_bus_addrs[i] = bus_addr; fl->ifl_vm_addrs[i] = cl; fl->ifl_credits++; i++; MPASS(fl->ifl_credits <= fl->ifl_size); if (++idx == fl->ifl_size) { fl->ifl_gen = 1; idx = 0; } if (n == 0 || i == IFLIB_MAX_RX_REFRESH) { ctx->isc_rxd_refill(ctx->ifc_softc, fl->ifl_rxq->ifr_id, fl->ifl_id, pidx, fl->ifl_bus_addrs, fl->ifl_vm_addrs, i, fl->ifl_buf_size); i = 0; pidx = idx; } fl->ifl_pidx = idx; } done: DBG_COUNTER_INC(rxd_flush); if (fl->ifl_pidx == 0) pidx = fl->ifl_size - 1; else pidx = fl->ifl_pidx - 1; ctx->isc_rxd_flush(ctx->ifc_softc, fl->ifl_rxq->ifr_id, fl->ifl_id, pidx); } static __inline void __iflib_fl_refill_lt(if_ctx_t ctx, iflib_fl_t fl, int max) { /* we avoid allowing pidx to catch up with cidx as it confuses ixl */ int32_t reclaimable = fl->ifl_size - fl->ifl_credits - 1; #ifdef INVARIANTS int32_t delta = fl->ifl_size - get_inuse(fl->ifl_size, fl->ifl_cidx, fl->ifl_pidx, fl->ifl_gen) - 1; #endif MPASS(fl->ifl_credits <= fl->ifl_size); MPASS(reclaimable == delta); if (reclaimable > 0) _iflib_fl_refill(ctx, fl, min(max, reclaimable)); } static void iflib_fl_bufs_free(iflib_fl_t fl) { iflib_dma_info_t idi = fl->ifl_ifdi; uint32_t i; for (i = 0; i < fl->ifl_size; i++) { struct mbuf **sd_m = &fl->ifl_sds.ifsd_m[i]; uint8_t *sd_flags = &fl->ifl_sds.ifsd_flags[i]; caddr_t *sd_cl = &fl->ifl_sds.ifsd_cl[i]; if (*sd_flags & RX_SW_DESC_INUSE) { if (fl->ifl_sds.ifsd_map != NULL) { bus_dmamap_t sd_map = fl->ifl_sds.ifsd_map[i]; bus_dmamap_unload(fl->ifl_desc_tag, sd_map); bus_dmamap_destroy(fl->ifl_desc_tag, sd_map); } if (*sd_m != NULL) { m_init(*sd_m, M_NOWAIT, MT_DATA, 0); uma_zfree(zone_mbuf, *sd_m); } if (*sd_cl != NULL) uma_zfree(fl->ifl_zone, *sd_cl); *sd_flags = 0; } else { MPASS(*sd_cl == NULL); MPASS(*sd_m == NULL); } #if MEMORY_LOGGING fl->ifl_m_dequeued++; fl->ifl_cl_dequeued++; #endif *sd_cl = NULL; *sd_m = NULL; } /* * Reset free list values */ fl->ifl_credits = fl->ifl_cidx = fl->ifl_pidx = fl->ifl_gen = 0;; bzero(idi->idi_vaddr, idi->idi_size); } /********************************************************************* * * Initialize a receive ring and its buffers. * **********************************************************************/ static int iflib_fl_setup(iflib_fl_t fl) { iflib_rxq_t rxq = fl->ifl_rxq; if_ctx_t ctx = rxq->ifr_ctx; if_softc_ctx_t sctx = &ctx->ifc_softc_ctx; /* ** Free current RX buffer structs and their mbufs */ iflib_fl_bufs_free(fl); /* Now replenish the mbufs */ MPASS(fl->ifl_credits == 0); /* * XXX don't set the max_frame_size to larger * than the hardware can handle */ if (sctx->isc_max_frame_size <= 2048) fl->ifl_buf_size = MCLBYTES; else if (sctx->isc_max_frame_size <= 4096) fl->ifl_buf_size = MJUMPAGESIZE; else if (sctx->isc_max_frame_size <= 9216) fl->ifl_buf_size = MJUM9BYTES; else fl->ifl_buf_size = MJUM16BYTES; if (fl->ifl_buf_size > ctx->ifc_max_fl_buf_size) ctx->ifc_max_fl_buf_size = fl->ifl_buf_size; fl->ifl_cltype = m_gettype(fl->ifl_buf_size); fl->ifl_zone = m_getzone(fl->ifl_buf_size); /* avoid pre-allocating zillions of clusters to an idle card * potentially speeding up attach */ _iflib_fl_refill(ctx, fl, min(128, fl->ifl_size)); MPASS(min(128, fl->ifl_size) == fl->ifl_credits); if (min(128, fl->ifl_size) != fl->ifl_credits) return (ENOBUFS); /* * handle failure */ MPASS(rxq != NULL); MPASS(fl->ifl_ifdi != NULL); bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); return (0); } /********************************************************************* * * Free receive ring data structures * **********************************************************************/ static void iflib_rx_sds_free(iflib_rxq_t rxq) { iflib_fl_t fl; int i; if (rxq->ifr_fl != NULL) { for (i = 0; i < rxq->ifr_nfl; i++) { fl = &rxq->ifr_fl[i]; if (fl->ifl_desc_tag != NULL) { bus_dma_tag_destroy(fl->ifl_desc_tag); fl->ifl_desc_tag = NULL; } free(fl->ifl_sds.ifsd_m, M_IFLIB); free(fl->ifl_sds.ifsd_cl, M_IFLIB); /* XXX destroy maps first */ free(fl->ifl_sds.ifsd_map, M_IFLIB); fl->ifl_sds.ifsd_m = NULL; fl->ifl_sds.ifsd_cl = NULL; fl->ifl_sds.ifsd_map = NULL; } free(rxq->ifr_fl, M_IFLIB); rxq->ifr_fl = NULL; rxq->ifr_cq_gen = rxq->ifr_cq_cidx = rxq->ifr_cq_pidx = 0; } } /* * MI independent logic * */ static void iflib_timer(void *arg) { iflib_txq_t txq = arg; if_ctx_t ctx = txq->ift_ctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)) return; /* ** Check on the state of the TX queue(s), this ** can be done without the lock because its RO ** and the HUNG state will be static if set. */ IFDI_TIMER(ctx, txq->ift_id); if ((txq->ift_qstatus == IFLIB_QUEUE_HUNG) && (ctx->ifc_pause_frames == 0)) goto hung; if (TXQ_AVAIL(txq) <= 2*scctx->isc_tx_nsegments || ifmp_ring_is_stalled(txq->ift_br[0])) GROUPTASK_ENQUEUE(&txq->ift_task); ctx->ifc_pause_frames = 0; if (if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING) callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq, txq->ift_timer.c_cpu); return; hung: CTX_LOCK(ctx); if_setdrvflagbits(ctx->ifc_ifp, 0, IFF_DRV_RUNNING); device_printf(ctx->ifc_dev, "TX(%d) desc avail = %d, pidx = %d\n", txq->ift_id, TXQ_AVAIL(txq), txq->ift_pidx); IFDI_WATCHDOG_RESET(ctx); ctx->ifc_watchdog_events++; ctx->ifc_pause_frames = 0; iflib_init_locked(ctx); CTX_UNLOCK(ctx); } static void iflib_init_locked(if_ctx_t ctx) { if_softc_ctx_t sctx = &ctx->ifc_softc_ctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; if_t ifp = ctx->ifc_ifp; iflib_fl_t fl; iflib_txq_t txq; iflib_rxq_t rxq; int i, j, tx_ip_csum_flags, tx_ip6_csum_flags; if_setdrvflagbits(ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING); IFDI_INTR_DISABLE(ctx); tx_ip_csum_flags = scctx->isc_tx_csum_flags & (CSUM_IP | CSUM_TCP | CSUM_UDP | CSUM_SCTP); tx_ip6_csum_flags = scctx->isc_tx_csum_flags & (CSUM_IP6_TCP | CSUM_IP6_UDP | CSUM_IP6_SCTP); /* Set hardware offload abilities */ if_clearhwassist(ifp); if (if_getcapenable(ifp) & IFCAP_TXCSUM) if_sethwassistbits(ifp, tx_ip_csum_flags, 0); if (if_getcapenable(ifp) & IFCAP_TXCSUM_IPV6) if_sethwassistbits(ifp, tx_ip6_csum_flags, 0); if (if_getcapenable(ifp) & IFCAP_TSO4) if_sethwassistbits(ifp, CSUM_IP_TSO, 0); if (if_getcapenable(ifp) & IFCAP_TSO6) if_sethwassistbits(ifp, CSUM_IP6_TSO, 0); for (i = 0, txq = ctx->ifc_txqs; i < sctx->isc_ntxqsets; i++, txq++) { CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); callout_stop(&txq->ift_db_check); CALLOUT_UNLOCK(txq); iflib_netmap_txq_init(ctx, txq); } for (i = 0, rxq = ctx->ifc_rxqs; i < sctx->isc_nrxqsets; i++, rxq++) { iflib_netmap_rxq_init(ctx, rxq); } #ifdef INVARIANTS i = if_getdrvflags(ifp); #endif IFDI_INIT(ctx); MPASS(if_getdrvflags(ifp) == i); for (i = 0, rxq = ctx->ifc_rxqs; i < sctx->isc_nrxqsets; i++, rxq++) { for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) { if (iflib_fl_setup(fl)) { device_printf(ctx->ifc_dev, "freelist setup failed - check cluster settings\n"); goto done; } } } done: if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_RUNNING, IFF_DRV_OACTIVE); IFDI_INTR_ENABLE(ctx); txq = ctx->ifc_txqs; for (i = 0; i < sctx->isc_ntxqsets; i++, txq++) callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq, txq->ift_timer.c_cpu); } static int iflib_media_change(if_t ifp) { if_ctx_t ctx = if_getsoftc(ifp); int err; CTX_LOCK(ctx); if ((err = IFDI_MEDIA_CHANGE(ctx)) == 0) iflib_init_locked(ctx); CTX_UNLOCK(ctx); return (err); } static void iflib_media_status(if_t ifp, struct ifmediareq *ifmr) { if_ctx_t ctx = if_getsoftc(ifp); CTX_LOCK(ctx); IFDI_UPDATE_ADMIN_STATUS(ctx); IFDI_MEDIA_STATUS(ctx, ifmr); CTX_UNLOCK(ctx); } static void iflib_stop(if_ctx_t ctx) { iflib_txq_t txq = ctx->ifc_txqs; iflib_rxq_t rxq = ctx->ifc_rxqs; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; iflib_dma_info_t di; iflib_fl_t fl; int i, j; /* Tell the stack that the interface is no longer active */ if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING); IFDI_INTR_DISABLE(ctx); DELAY(100000); IFDI_STOP(ctx); DELAY(100000); iflib_debug_reset(); /* Wait for current tx queue users to exit to disarm watchdog timer. */ for (i = 0; i < scctx->isc_ntxqsets; i++, txq++) { /* make sure all transmitters have completed before proceeding XXX */ /* clean any enqueued buffers */ iflib_ifmp_purge(txq); /* Free any existing tx buffers. */ for (j = 0; j < txq->ift_size; j++) { iflib_txsd_free(ctx, txq, j); } txq->ift_processed = txq->ift_cleaned = txq->ift_cidx_processed = 0; txq->ift_in_use = txq->ift_gen = txq->ift_cidx = txq->ift_pidx = txq->ift_no_desc_avail = 0; txq->ift_closed = txq->ift_mbuf_defrag = txq->ift_mbuf_defrag_failed = 0; txq->ift_no_tx_dma_setup = txq->ift_txd_encap_efbig = txq->ift_map_failed = 0; txq->ift_pullups = 0; ifmp_ring_reset_stats(txq->ift_br[0]); for (j = 0, di = txq->ift_ifdi; j < ctx->ifc_nhwtxqs; j++, di++) bzero((void *)di->idi_vaddr, di->idi_size); } for (i = 0; i < scctx->isc_nrxqsets; i++, rxq++) { /* make sure all transmitters have completed before proceeding XXX */ for (j = 0, di = txq->ift_ifdi; j < ctx->ifc_nhwrxqs; j++, di++) bzero((void *)di->idi_vaddr, di->idi_size); /* also resets the free lists pidx/cidx */ for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) iflib_fl_bufs_free(fl); } } static inline void prefetch_pkts(iflib_fl_t fl, int cidx) { int nextptr; int nrxd = fl->ifl_size; nextptr = (cidx + CACHE_PTR_INCREMENT) & (nrxd-1); prefetch(&fl->ifl_sds.ifsd_m[nextptr]); prefetch(&fl->ifl_sds.ifsd_cl[nextptr]); prefetch(fl->ifl_sds.ifsd_m[(cidx + 1) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_m[(cidx + 2) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_m[(cidx + 3) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_m[(cidx + 4) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 1) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 2) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 3) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 4) & (nrxd-1)]); } static void rxd_frag_to_sd(iflib_rxq_t rxq, if_rxd_frag_t irf, int *cltype, int unload, iflib_fl_t *pfl, int *pcidx) { int flid, cidx; bus_dmamap_t map; iflib_fl_t fl; iflib_dma_info_t di; int next; flid = irf->irf_flid; cidx = irf->irf_idx; fl = &rxq->ifr_fl[flid]; fl->ifl_credits--; #if MEMORY_LOGGING fl->ifl_m_dequeued++; if (cltype) fl->ifl_cl_dequeued++; #endif prefetch_pkts(fl, cidx); if (fl->ifl_sds.ifsd_map != NULL) { next = (cidx + CACHE_PTR_INCREMENT) & (fl->ifl_size-1); prefetch(&fl->ifl_sds.ifsd_map[next]); map = fl->ifl_sds.ifsd_map[cidx]; di = fl->ifl_ifdi; next = (cidx + CACHE_LINE_SIZE) & (fl->ifl_size-1); prefetch(&fl->ifl_sds.ifsd_flags[next]); bus_dmamap_sync(di->idi_tag, di->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); /* not valid assert if bxe really does SGE from non-contiguous elements */ MPASS(fl->ifl_cidx == cidx); if (unload) bus_dmamap_unload(fl->ifl_desc_tag, map); } if (__predict_false(++fl->ifl_cidx == fl->ifl_size)) { fl->ifl_cidx = 0; fl->ifl_gen = 0; } /* YES ick */ if (cltype) *cltype = fl->ifl_cltype; *pfl = fl; *pcidx = cidx; } static struct mbuf * assemble_segments(iflib_rxq_t rxq, if_rxd_info_t ri) { int i, padlen , flags, cltype; struct mbuf *m, *mh, *mt, *sd_m; iflib_fl_t fl; int cidx; caddr_t cl, sd_cl; i = 0; mh = NULL; do { rxd_frag_to_sd(rxq, &ri->iri_frags[i], &cltype, TRUE, &fl, &cidx); sd_m = fl->ifl_sds.ifsd_m[cidx]; sd_cl = fl->ifl_sds.ifsd_cl[cidx]; MPASS(sd_cl != NULL); MPASS(sd_m != NULL); /* Don't include zero-length frags */ if (ri->iri_frags[i].irf_len == 0) { /* XXX we can save the cluster here, but not the mbuf */ m_init(sd_m, M_NOWAIT, MT_DATA, 0); m_free(sd_m); fl->ifl_sds.ifsd_m[cidx] = NULL; continue; } m = sd_m; if (mh == NULL) { flags = M_PKTHDR|M_EXT; mh = mt = m; padlen = ri->iri_pad; } else { flags = M_EXT; mt->m_next = m; mt = m; /* assuming padding is only on the first fragment */ padlen = 0; } fl->ifl_sds.ifsd_m[cidx] = NULL; cl = fl->ifl_sds.ifsd_cl[cidx]; fl->ifl_sds.ifsd_cl[cidx] = NULL; /* Can these two be made one ? */ m_init(m, M_NOWAIT, MT_DATA, flags); m_cljset(m, cl, cltype); /* * These must follow m_init and m_cljset */ m->m_data += padlen; ri->iri_len -= padlen; m->m_len = ri->iri_frags[i].irf_len; } while (++i < ri->iri_nfrags); return (mh); } /* * Process one software descriptor */ static struct mbuf * iflib_rxd_pkt_get(iflib_rxq_t rxq, if_rxd_info_t ri) { struct mbuf *m; iflib_fl_t fl; caddr_t sd_cl; int cidx; /* should I merge this back in now that the two paths are basically duplicated? */ if (ri->iri_nfrags == 1 && ri->iri_frags[0].irf_len <= IFLIB_RX_COPY_THRESH) { rxd_frag_to_sd(rxq, &ri->iri_frags[0], NULL, FALSE, &fl, &cidx); m = fl->ifl_sds.ifsd_m[cidx]; fl->ifl_sds.ifsd_m[cidx] = NULL; sd_cl = fl->ifl_sds.ifsd_cl[cidx]; m_init(m, M_NOWAIT, MT_DATA, M_PKTHDR); memcpy(m->m_data, sd_cl, ri->iri_len); m->m_len = ri->iri_frags[0].irf_len; } else { m = assemble_segments(rxq, ri); } m->m_pkthdr.len = ri->iri_len; m->m_pkthdr.rcvif = ri->iri_ifp; m->m_flags |= ri->iri_flags; m->m_pkthdr.ether_vtag = ri->iri_vtag; m->m_pkthdr.flowid = ri->iri_flowid; M_HASHTYPE_SET(m, ri->iri_rsstype); m->m_pkthdr.csum_flags = ri->iri_csum_flags; m->m_pkthdr.csum_data = ri->iri_csum_data; return (m); } static bool iflib_rxeof(iflib_rxq_t rxq, int budget) { if_ctx_t ctx = rxq->ifr_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; int avail, i; uint16_t *cidxp; struct if_rxd_info ri; int err, budget_left, rx_bytes, rx_pkts; iflib_fl_t fl; struct ifnet *ifp; int lro_enabled; /* * XXX early demux data packets so that if_input processing only handles * acks in interrupt context */ struct mbuf *m, *mh, *mt; if (netmap_rx_irq(ctx->ifc_ifp, rxq->ifr_id, &budget)) { return (FALSE); } mh = mt = NULL; MPASS(budget > 0); rx_pkts = rx_bytes = 0; if (sctx->isc_flags & IFLIB_HAS_RXCQ) cidxp = &rxq->ifr_cq_cidx; else cidxp = &rxq->ifr_fl[0].ifl_cidx; if ((avail = iflib_rxd_avail(ctx, rxq, *cidxp, budget)) == 0) { for (i = 0, fl = &rxq->ifr_fl[0]; i < sctx->isc_nfl; i++, fl++) __iflib_fl_refill_lt(ctx, fl, budget + 8); DBG_COUNTER_INC(rx_unavail); return (false); } for (budget_left = budget; (budget_left > 0) && (avail > 0); budget_left--, avail--) { if (__predict_false(!CTX_ACTIVE(ctx))) { DBG_COUNTER_INC(rx_ctx_inactive); break; } /* * Reset client set fields to their default values */ bzero(&ri, sizeof(ri)); ri.iri_qsidx = rxq->ifr_id; ri.iri_cidx = *cidxp; ri.iri_ifp = ctx->ifc_ifp; ri.iri_frags = rxq->ifr_frags; err = ctx->isc_rxd_pkt_get(ctx->ifc_softc, &ri); /* in lieu of handling correctly - make sure it isn't being unhandled */ MPASS(err == 0); if (sctx->isc_flags & IFLIB_HAS_RXCQ) { *cidxp = ri.iri_cidx; /* Update our consumer index */ while (rxq->ifr_cq_cidx >= scctx->isc_nrxd[0]) { rxq->ifr_cq_cidx -= scctx->isc_nrxd[0]; rxq->ifr_cq_gen = 0; } /* was this only a completion queue message? */ if (__predict_false(ri.iri_nfrags == 0)) continue; } MPASS(ri.iri_nfrags != 0); MPASS(ri.iri_len != 0); /* will advance the cidx on the corresponding free lists */ m = iflib_rxd_pkt_get(rxq, &ri); if (avail == 0 && budget_left) avail = iflib_rxd_avail(ctx, rxq, *cidxp, budget_left); if (__predict_false(m == NULL)) { DBG_COUNTER_INC(rx_mbuf_null); continue; } /* imm_pkt: -- cxgb */ if (mh == NULL) mh = mt = m; else { mt->m_nextpkt = m; mt = m; } } /* make sure that we can refill faster than drain */ for (i = 0, fl = &rxq->ifr_fl[0]; i < sctx->isc_nfl; i++, fl++) __iflib_fl_refill_lt(ctx, fl, budget + 8); ifp = ctx->ifc_ifp; lro_enabled = (if_getcapenable(ifp) & IFCAP_LRO); while (mh != NULL) { m = mh; mh = mh->m_nextpkt; m->m_nextpkt = NULL; rx_bytes += m->m_pkthdr.len; rx_pkts++; #if defined(INET6) || defined(INET) if (lro_enabled && tcp_lro_rx(&rxq->ifr_lc, m, 0) == 0) continue; #endif DBG_COUNTER_INC(rx_if_input); ifp->if_input(ifp, m); } if_inc_counter(ifp, IFCOUNTER_IBYTES, rx_bytes); if_inc_counter(ifp, IFCOUNTER_IPACKETS, rx_pkts); /* * Flush any outstanding LRO work */ #if defined(INET6) || defined(INET) tcp_lro_flush_all(&rxq->ifr_lc); #endif if (avail) return true; return (iflib_rxd_avail(ctx, rxq, *cidxp, 1)); } #define M_CSUM_FLAGS(m) ((m)->m_pkthdr.csum_flags) #define M_HAS_VLANTAG(m) (m->m_flags & M_VLANTAG) #define TXQ_MAX_DB_DEFERRED(size) (size >> 5) #define TXQ_MAX_DB_CONSUMED(size) (size >> 4) static __inline void iflib_txd_db_check(if_ctx_t ctx, iflib_txq_t txq, int ring) { uint32_t dbval; if (ring || txq->ift_db_pending >= TXQ_MAX_DB_DEFERRED(txq->ift_size)) { /* the lock will only ever be contended in the !min_latency case */ if (!TXDB_TRYLOCK(txq)) return; dbval = txq->ift_npending ? txq->ift_npending : txq->ift_pidx; ctx->isc_txd_flush(ctx->ifc_softc, txq->ift_id, dbval); txq->ift_db_pending = txq->ift_npending = 0; TXDB_UNLOCK(txq); } } static void iflib_txd_deferred_db_check(void * arg) { iflib_txq_t txq = arg; /* simple non-zero boolean so use bitwise OR */ if ((txq->ift_db_pending | txq->ift_npending) && txq->ift_db_pending >= txq->ift_db_pending_queued) iflib_txd_db_check(txq->ift_ctx, txq, TRUE); txq->ift_db_pending_queued = 0; if (ifmp_ring_is_stalled(txq->ift_br[0])) iflib_txq_check_drain(txq, 4); } #ifdef PKT_DEBUG static void print_pkt(if_pkt_info_t pi) { printf("pi len: %d qsidx: %d nsegs: %d ndescs: %d flags: %x pidx: %d\n", pi->ipi_len, pi->ipi_qsidx, pi->ipi_nsegs, pi->ipi_ndescs, pi->ipi_flags, pi->ipi_pidx); printf("pi new_pidx: %d csum_flags: %lx tso_segsz: %d mflags: %x vtag: %d\n", pi->ipi_new_pidx, pi->ipi_csum_flags, pi->ipi_tso_segsz, pi->ipi_mflags, pi->ipi_vtag); printf("pi etype: %d ehdrlen: %d ip_hlen: %d ipproto: %d\n", pi->ipi_etype, pi->ipi_ehdrlen, pi->ipi_ip_hlen, pi->ipi_ipproto); } #endif #define IS_TSO4(pi) ((pi)->ipi_csum_flags & CSUM_IP_TSO) #define IS_TSO6(pi) ((pi)->ipi_csum_flags & CSUM_IP6_TSO) static int iflib_parse_header(iflib_txq_t txq, if_pkt_info_t pi, struct mbuf **mp) { if_shared_ctx_t sctx = txq->ift_ctx->ifc_sctx; struct ether_vlan_header *eh; struct mbuf *m, *n; n = m = *mp; if ((sctx->isc_flags & IFLIB_NEED_SCRATCH) && M_WRITABLE(m) == 0) { if ((m = m_dup(m, M_NOWAIT)) == NULL) { return (ENOMEM); } else { m_freem(*mp); n = *mp = m; } } /* * Determine where frame payload starts. * Jump over vlan headers if already present, * helpful for QinQ too. */ if (__predict_false(m->m_len < sizeof(*eh))) { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, sizeof(*eh))) == NULL)) return (ENOMEM); } eh = mtod(m, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { pi->ipi_etype = ntohs(eh->evl_proto); pi->ipi_ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; } else { pi->ipi_etype = ntohs(eh->evl_encap_proto); pi->ipi_ehdrlen = ETHER_HDR_LEN; } switch (pi->ipi_etype) { #ifdef INET case ETHERTYPE_IP: { struct ip *ip = NULL; struct tcphdr *th = NULL; int minthlen; minthlen = min(m->m_pkthdr.len, pi->ipi_ehdrlen + sizeof(*ip) + sizeof(*th)); if (__predict_false(m->m_len < minthlen)) { /* * if this code bloat is causing too much of a hit * move it to a separate function and mark it noinline */ if (m->m_len == pi->ipi_ehdrlen) { n = m->m_next; MPASS(n); if (n->m_len >= sizeof(*ip)) { ip = (struct ip *)n->m_data; if (n->m_len >= (ip->ip_hl << 2) + sizeof(*th)) th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2)); } else { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, minthlen)) == NULL)) return (ENOMEM); ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen); } } else { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, minthlen)) == NULL)) return (ENOMEM); ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen); if (m->m_len >= (ip->ip_hl << 2) + sizeof(*th)) th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2)); } } else { ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen); if (m->m_len >= (ip->ip_hl << 2) + sizeof(*th)) th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2)); } pi->ipi_ip_hlen = ip->ip_hl << 2; pi->ipi_ipproto = ip->ip_p; pi->ipi_flags |= IPI_TX_IPV4; if (pi->ipi_csum_flags & CSUM_IP) ip->ip_sum = 0; if (pi->ipi_ipproto == IPPROTO_TCP) { if (__predict_false(th == NULL)) { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, (ip->ip_hl << 2) + sizeof(*th))) == NULL)) return (ENOMEM); th = (struct tcphdr *)((caddr_t)ip + pi->ipi_ip_hlen); } pi->ipi_tcp_hflags = th->th_flags; pi->ipi_tcp_hlen = th->th_off << 2; pi->ipi_tcp_seq = th->th_seq; } if (IS_TSO4(pi)) { if (__predict_false(ip->ip_p != IPPROTO_TCP)) return (ENXIO); th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(IPPROTO_TCP)); pi->ipi_tso_segsz = m->m_pkthdr.tso_segsz; if (sctx->isc_flags & IFLIB_TSO_INIT_IP) { ip->ip_sum = 0; ip->ip_len = htons(pi->ipi_ip_hlen + pi->ipi_tcp_hlen + pi->ipi_tso_segsz); } } break; } #endif #ifdef INET6 case ETHERTYPE_IPV6: { struct ip6_hdr *ip6 = (struct ip6_hdr *)(m->m_data + pi->ipi_ehdrlen); struct tcphdr *th; pi->ipi_ip_hlen = sizeof(struct ip6_hdr); if (__predict_false(m->m_len < pi->ipi_ehdrlen + sizeof(struct ip6_hdr))) { if (__predict_false((m = m_pullup(m, pi->ipi_ehdrlen + sizeof(struct ip6_hdr))) == NULL)) return (ENOMEM); } th = (struct tcphdr *)((caddr_t)ip6 + pi->ipi_ip_hlen); /* XXX-BZ this will go badly in case of ext hdrs. */ pi->ipi_ipproto = ip6->ip6_nxt; pi->ipi_flags |= IPI_TX_IPV6; if (pi->ipi_ipproto == IPPROTO_TCP) { if (__predict_false(m->m_len < pi->ipi_ehdrlen + sizeof(struct ip6_hdr) + sizeof(struct tcphdr))) { if (__predict_false((m = m_pullup(m, pi->ipi_ehdrlen + sizeof(struct ip6_hdr) + sizeof(struct tcphdr))) == NULL)) return (ENOMEM); } pi->ipi_tcp_hflags = th->th_flags; pi->ipi_tcp_hlen = th->th_off << 2; } if (IS_TSO6(pi)) { if (__predict_false(ip6->ip6_nxt != IPPROTO_TCP)) return (ENXIO); /* * The corresponding flag is set by the stack in the IPv4 * TSO case, but not in IPv6 (at least in FreeBSD 10.2). * So, set it here because the rest of the flow requires it. */ pi->ipi_csum_flags |= CSUM_TCP_IPV6; th->th_sum = in6_cksum_pseudo(ip6, 0, IPPROTO_TCP, 0); pi->ipi_tso_segsz = m->m_pkthdr.tso_segsz; } break; } #endif default: pi->ipi_csum_flags &= ~CSUM_OFFLOAD; pi->ipi_ip_hlen = 0; break; } *mp = m; return (0); } static __noinline struct mbuf * collapse_pkthdr(struct mbuf *m0) { struct mbuf *m, *m_next, *tmp; m = m0; m_next = m->m_next; while (m_next != NULL && m_next->m_len == 0) { m = m_next; m->m_next = NULL; m_free(m); m_next = m_next->m_next; } m = m0; m->m_next = m_next; if ((m_next->m_flags & M_EXT) == 0) { m = m_defrag(m, M_NOWAIT); } else { tmp = m_next->m_next; memcpy(m_next, m, MPKTHSIZE); m = m_next; m->m_next = tmp; } return (m); } /* * If dodgy hardware rejects the scatter gather chain we've handed it * we'll need to remove the mbuf chain from ifsg_m[] before we can add the * m_defrag'd mbufs */ static __noinline struct mbuf * iflib_remove_mbuf(iflib_txq_t txq) { int ntxd, i, pidx; struct mbuf *m, *mh, **ifsd_m; pidx = txq->ift_pidx; ifsd_m = txq->ift_sds.ifsd_m; ntxd = txq->ift_size; mh = m = ifsd_m[pidx]; ifsd_m[pidx] = NULL; #if MEMORY_LOGGING txq->ift_dequeued++; #endif i = 1; while (m) { ifsd_m[(pidx + i) & (ntxd -1)] = NULL; #if MEMORY_LOGGING txq->ift_dequeued++; #endif m = m->m_next; i++; } return (mh); } static int iflib_busdma_load_mbuf_sg(iflib_txq_t txq, bus_dma_tag_t tag, bus_dmamap_t map, struct mbuf **m0, bus_dma_segment_t *segs, int *nsegs, int max_segs, int flags) { if_ctx_t ctx; if_shared_ctx_t sctx; if_softc_ctx_t scctx; int i, next, pidx, mask, err, maxsegsz, ntxd, count; struct mbuf *m, *tmp, **ifsd_m, **mp; m = *m0; /* * Please don't ever do this */ if (__predict_false(m->m_len == 0)) *m0 = m = collapse_pkthdr(m); ctx = txq->ift_ctx; sctx = ctx->ifc_sctx; scctx = &ctx->ifc_softc_ctx; ifsd_m = txq->ift_sds.ifsd_m; ntxd = txq->ift_size; pidx = txq->ift_pidx; if (map != NULL) { uint8_t *ifsd_flags = txq->ift_sds.ifsd_flags; err = bus_dmamap_load_mbuf_sg(tag, map, *m0, segs, nsegs, BUS_DMA_NOWAIT); if (err) return (err); ifsd_flags[pidx] |= TX_SW_DESC_MAPPED; i = 0; next = pidx; mask = (txq->ift_size-1); m = *m0; do { mp = &ifsd_m[next]; *mp = m; m = m->m_next; if (__predict_false((*mp)->m_len == 0)) { m_free(*mp); *mp = NULL; } else next = (pidx + i) & (ntxd-1); } while (m != NULL); } else { int buflen, sgsize, max_sgsize; vm_offset_t vaddr; vm_paddr_t curaddr; count = i = 0; maxsegsz = sctx->isc_tx_maxsize; m = *m0; do { if (__predict_false(m->m_len <= 0)) { tmp = m; m = m->m_next; tmp->m_next = NULL; m_free(tmp); continue; } buflen = m->m_len; vaddr = (vm_offset_t)m->m_data; /* * see if we can't be smarter about physically * contiguous mappings */ next = (pidx + count) & (ntxd-1); MPASS(ifsd_m[next] == NULL); #if MEMORY_LOGGING txq->ift_enqueued++; #endif ifsd_m[next] = m; while (buflen > 0) { max_sgsize = MIN(buflen, maxsegsz); curaddr = pmap_kextract(vaddr); sgsize = PAGE_SIZE - (curaddr & PAGE_MASK); sgsize = MIN(sgsize, max_sgsize); segs[i].ds_addr = curaddr; segs[i].ds_len = sgsize; vaddr += sgsize; buflen -= sgsize; i++; if (i >= max_segs) goto err; } count++; tmp = m; m = m->m_next; } while (m != NULL); *nsegs = i; } return (0); err: *m0 = iflib_remove_mbuf(txq); return (EFBIG); } static int iflib_encap(iflib_txq_t txq, struct mbuf **m_headp) { if_ctx_t ctx; if_shared_ctx_t sctx; if_softc_ctx_t scctx; bus_dma_segment_t *segs; struct mbuf *m_head; bus_dmamap_t map; struct if_pkt_info pi; int remap = 0; int err, nsegs, ndesc, max_segs, pidx, cidx, next, ntxd; bus_dma_tag_t desc_tag; segs = txq->ift_segs; ctx = txq->ift_ctx; sctx = ctx->ifc_sctx; scctx = &ctx->ifc_softc_ctx; segs = txq->ift_segs; ntxd = txq->ift_size; m_head = *m_headp; map = NULL; /* * If we're doing TSO the next descriptor to clean may be quite far ahead */ cidx = txq->ift_cidx; pidx = txq->ift_pidx; next = (cidx + CACHE_PTR_INCREMENT) & (ntxd-1); /* prefetch the next cache line of mbuf pointers and flags */ prefetch(&txq->ift_sds.ifsd_m[next]); if (txq->ift_sds.ifsd_map != NULL) { prefetch(&txq->ift_sds.ifsd_map[next]); map = txq->ift_sds.ifsd_map[pidx]; next = (cidx + CACHE_LINE_SIZE) & (ntxd-1); prefetch(&txq->ift_sds.ifsd_flags[next]); } if (m_head->m_pkthdr.csum_flags & CSUM_TSO) { desc_tag = txq->ift_tso_desc_tag; max_segs = scctx->isc_tx_tso_segments_max; } else { desc_tag = txq->ift_desc_tag; max_segs = scctx->isc_tx_nsegments; } m_head = *m_headp; bzero(&pi, sizeof(pi)); pi.ipi_len = m_head->m_pkthdr.len; pi.ipi_mflags = (m_head->m_flags & (M_VLANTAG|M_BCAST|M_MCAST)); pi.ipi_csum_flags = m_head->m_pkthdr.csum_flags; pi.ipi_vtag = (m_head->m_flags & M_VLANTAG) ? m_head->m_pkthdr.ether_vtag : 0; pi.ipi_pidx = pidx; pi.ipi_qsidx = txq->ift_id; /* deliberate bitwise OR to make one condition */ if (__predict_true((pi.ipi_csum_flags | pi.ipi_vtag))) { if (__predict_false((err = iflib_parse_header(txq, &pi, m_headp)) != 0)) return (err); m_head = *m_headp; } retry: err = iflib_busdma_load_mbuf_sg(txq, desc_tag, map, m_headp, segs, &nsegs, max_segs, BUS_DMA_NOWAIT); defrag: if (__predict_false(err)) { switch (err) { case EFBIG: /* try collapse once and defrag once */ if (remap == 0) m_head = m_collapse(*m_headp, M_NOWAIT, max_segs); if (remap == 1) m_head = m_defrag(*m_headp, M_NOWAIT); remap++; if (__predict_false(m_head == NULL)) goto defrag_failed; txq->ift_mbuf_defrag++; *m_headp = m_head; goto retry; break; case ENOMEM: txq->ift_no_tx_dma_setup++; break; default: txq->ift_no_tx_dma_setup++; m_freem(*m_headp); DBG_COUNTER_INC(tx_frees); *m_headp = NULL; break; } txq->ift_map_failed++; DBG_COUNTER_INC(encap_load_mbuf_fail); return (err); } /* * XXX assumes a 1 to 1 relationship between segments and * descriptors - this does not hold true on all drivers, e.g. * cxgb */ if (__predict_false(nsegs + 2 > TXQ_AVAIL(txq))) { txq->ift_no_desc_avail++; if (map != NULL) bus_dmamap_unload(desc_tag, map); DBG_COUNTER_INC(encap_txq_avail_fail); if ((txq->ift_task.gt_task.ta_flags & TASK_ENQUEUED) == 0) GROUPTASK_ENQUEUE(&txq->ift_task); return (ENOBUFS); } pi.ipi_segs = segs; pi.ipi_nsegs = nsegs; MPASS(pidx >= 0 && pidx < txq->ift_size); #ifdef PKT_DEBUG print_pkt(&pi); #endif if ((err = ctx->isc_txd_encap(ctx->ifc_softc, &pi)) == 0) { bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); DBG_COUNTER_INC(tx_encap); MPASS(pi.ipi_new_pidx >= 0 && pi.ipi_new_pidx < txq->ift_size); ndesc = pi.ipi_new_pidx - pi.ipi_pidx; if (pi.ipi_new_pidx < pi.ipi_pidx) { ndesc += txq->ift_size; txq->ift_gen = 1; } /* * drivers can need as many as * two sentinels */ MPASS(ndesc <= pi.ipi_nsegs + 2); MPASS(pi.ipi_new_pidx != pidx); MPASS(ndesc > 0); txq->ift_in_use += ndesc; /* * We update the last software descriptor again here because there may * be a sentinel and/or there may be more mbufs than segments */ txq->ift_pidx = pi.ipi_new_pidx; txq->ift_npending += pi.ipi_ndescs; } else if (__predict_false(err == EFBIG && remap < 2)) { *m_headp = m_head = iflib_remove_mbuf(txq); remap = 1; txq->ift_txd_encap_efbig++; goto defrag; } else DBG_COUNTER_INC(encap_txd_encap_fail); return (err); defrag_failed: txq->ift_mbuf_defrag_failed++; txq->ift_map_failed++; m_freem(*m_headp); DBG_COUNTER_INC(tx_frees); *m_headp = NULL; return (ENOMEM); } /* forward compatibility for cxgb */ #define FIRST_QSET(ctx) 0 #define NTXQSETS(ctx) ((ctx)->ifc_softc_ctx.isc_ntxqsets) #define NRXQSETS(ctx) ((ctx)->ifc_softc_ctx.isc_nrxqsets) #define QIDX(ctx, m) ((((m)->m_pkthdr.flowid & ctx->ifc_softc_ctx.isc_rss_table_mask) % NTXQSETS(ctx)) + FIRST_QSET(ctx)) #define DESC_RECLAIMABLE(q) ((int)((q)->ift_processed - (q)->ift_cleaned - (q)->ift_ctx->ifc_softc_ctx.isc_tx_nsegments)) #define RECLAIM_THRESH(ctx) ((ctx)->ifc_sctx->isc_tx_reclaim_thresh) #define MAX_TX_DESC(ctx) ((ctx)->ifc_softc_ctx.isc_tx_tso_segments_max) /* if there are more than TXQ_MIN_OCCUPANCY packets pending we consider deferring * doorbell writes * * ORing with 2 assures that min occupancy is never less than 2 without any conditional logic */ #define TXQ_MIN_OCCUPANCY(size) ((size >> 6)| 0x2) static inline int iflib_txq_min_occupancy(iflib_txq_t txq) { if_ctx_t ctx; ctx = txq->ift_ctx; return (get_inuse(txq->ift_size, txq->ift_cidx, txq->ift_pidx, txq->ift_gen) < TXQ_MIN_OCCUPANCY(txq->ift_size) + MAX_TX_DESC(ctx)); } static void iflib_tx_desc_free(iflib_txq_t txq, int n) { int hasmap; uint32_t qsize, cidx, mask, gen; struct mbuf *m, **ifsd_m; uint8_t *ifsd_flags; bus_dmamap_t *ifsd_map; cidx = txq->ift_cidx; gen = txq->ift_gen; qsize = txq->ift_size; mask = qsize-1; hasmap = txq->ift_sds.ifsd_map != NULL; ifsd_flags = txq->ift_sds.ifsd_flags; ifsd_m = txq->ift_sds.ifsd_m; ifsd_map = txq->ift_sds.ifsd_map; while (n--) { prefetch(ifsd_m[(cidx + 3) & mask]); prefetch(ifsd_m[(cidx + 4) & mask]); if (ifsd_m[cidx] != NULL) { prefetch(&ifsd_m[(cidx + CACHE_PTR_INCREMENT) & mask]); prefetch(&ifsd_flags[(cidx + CACHE_PTR_INCREMENT) & mask]); if (hasmap && (ifsd_flags[cidx] & TX_SW_DESC_MAPPED)) { /* * does it matter if it's not the TSO tag? If so we'll * have to add the type to flags */ bus_dmamap_unload(txq->ift_desc_tag, ifsd_map[cidx]); ifsd_flags[cidx] &= ~TX_SW_DESC_MAPPED; } if ((m = ifsd_m[cidx]) != NULL) { /* XXX we don't support any drivers that batch packets yet */ MPASS(m->m_nextpkt == NULL); m_free(m); ifsd_m[cidx] = NULL; #if MEMORY_LOGGING txq->ift_dequeued++; #endif DBG_COUNTER_INC(tx_frees); } } if (__predict_false(++cidx == qsize)) { cidx = 0; gen = 0; } } txq->ift_cidx = cidx; txq->ift_gen = gen; } static __inline int iflib_completed_tx_reclaim(iflib_txq_t txq, int thresh) { int reclaim; if_ctx_t ctx = txq->ift_ctx; KASSERT(thresh >= 0, ("invalid threshold to reclaim")); MPASS(thresh /*+ MAX_TX_DESC(txq->ift_ctx) */ < txq->ift_size); /* * Need a rate-limiting check so that this isn't called every time */ iflib_tx_credits_update(ctx, txq); reclaim = DESC_RECLAIMABLE(txq); if (reclaim <= thresh /* + MAX_TX_DESC(txq->ift_ctx) */) { #ifdef INVARIANTS if (iflib_verbose_debug) { printf("%s processed=%ju cleaned=%ju tx_nsegments=%d reclaim=%d thresh=%d\n", __FUNCTION__, txq->ift_processed, txq->ift_cleaned, txq->ift_ctx->ifc_softc_ctx.isc_tx_nsegments, reclaim, thresh); } #endif return (0); } iflib_tx_desc_free(txq, reclaim); txq->ift_cleaned += reclaim; txq->ift_in_use -= reclaim; if (txq->ift_active == FALSE) txq->ift_active = TRUE; return (reclaim); } static struct mbuf ** _ring_peek_one(struct ifmp_ring *r, int cidx, int offset) { return (__DEVOLATILE(struct mbuf **, &r->items[(cidx + offset) & (r->size-1)])); } static void iflib_txq_check_drain(iflib_txq_t txq, int budget) { ifmp_ring_check_drainage(txq->ift_br[0], budget); } static uint32_t iflib_txq_can_drain(struct ifmp_ring *r) { iflib_txq_t txq = r->cookie; if_ctx_t ctx = txq->ift_ctx; return ((TXQ_AVAIL(txq) > MAX_TX_DESC(ctx) + 2) || ctx->isc_txd_credits_update(ctx->ifc_softc, txq->ift_id, txq->ift_cidx_processed, false)); } static uint32_t iflib_txq_drain(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx) { iflib_txq_t txq = r->cookie; if_ctx_t ctx = txq->ift_ctx; if_t ifp = ctx->ifc_ifp; struct mbuf **mp, *m; int i, count, consumed, pkt_sent, bytes_sent, mcast_sent, avail, err, in_use_prev, desc_used; if (__predict_false(!(if_getdrvflags(ifp) & IFF_DRV_RUNNING) || !LINK_ACTIVE(ctx))) { DBG_COUNTER_INC(txq_drain_notready); return (0); } avail = IDXDIFF(pidx, cidx, r->size); if (__predict_false(ctx->ifc_flags & IFC_QFLUSH)) { DBG_COUNTER_INC(txq_drain_flushing); for (i = 0; i < avail; i++) { m_free(r->items[(cidx + i) & (r->size-1)]); r->items[(cidx + i) & (r->size-1)] = NULL; } return (avail); } iflib_completed_tx_reclaim(txq, RECLAIM_THRESH(ctx)); if (__predict_false(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_OACTIVE)) { txq->ift_qstatus = IFLIB_QUEUE_IDLE; CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); callout_stop(&txq->ift_db_check); CALLOUT_UNLOCK(txq); DBG_COUNTER_INC(txq_drain_oactive); return (0); } consumed = mcast_sent = bytes_sent = pkt_sent = 0; count = MIN(avail, TX_BATCH_SIZE); #ifdef INVARIANTS if (iflib_verbose_debug) printf("%s avail=%d ifc_flags=%x txq_avail=%d ", __FUNCTION__, avail, ctx->ifc_flags, TXQ_AVAIL(txq)); #endif for (desc_used = i = 0; i < count && TXQ_AVAIL(txq) > MAX_TX_DESC(ctx) + 2; i++) { mp = _ring_peek_one(r, cidx, i); MPASS(mp != NULL && *mp != NULL); in_use_prev = txq->ift_in_use; if ((err = iflib_encap(txq, mp)) == ENOBUFS) { DBG_COUNTER_INC(txq_drain_encapfail); /* no room - bail out */ break; } consumed++; if (err) { DBG_COUNTER_INC(txq_drain_encapfail); /* we can't send this packet - skip it */ continue; } pkt_sent++; m = *mp; DBG_COUNTER_INC(tx_sent); bytes_sent += m->m_pkthdr.len; if (m->m_flags & M_MCAST) mcast_sent++; txq->ift_db_pending += (txq->ift_in_use - in_use_prev); desc_used += (txq->ift_in_use - in_use_prev); iflib_txd_db_check(ctx, txq, FALSE); ETHER_BPF_MTAP(ifp, m); if (__predict_false(!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))) break; if (desc_used >= TXQ_MAX_DB_CONSUMED(txq->ift_size)) break; } if ((iflib_min_tx_latency || iflib_txq_min_occupancy(txq)) && txq->ift_db_pending) iflib_txd_db_check(ctx, txq, TRUE); else if ((txq->ift_db_pending || TXQ_AVAIL(txq) <= MAX_TX_DESC(ctx) + 2) && (callout_pending(&txq->ift_db_check) == 0)) { txq->ift_db_pending_queued = txq->ift_db_pending; callout_reset_on(&txq->ift_db_check, 1, iflib_txd_deferred_db_check, txq, txq->ift_db_check.c_cpu); } if_inc_counter(ifp, IFCOUNTER_OBYTES, bytes_sent); if_inc_counter(ifp, IFCOUNTER_OPACKETS, pkt_sent); if (mcast_sent) if_inc_counter(ifp, IFCOUNTER_OMCASTS, mcast_sent); #ifdef INVARIANTS if (iflib_verbose_debug) printf("consumed=%d\n", consumed); #endif return (consumed); } static uint32_t iflib_txq_drain_always(struct ifmp_ring *r) { return (1); } static uint32_t iflib_txq_drain_free(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx) { int i, avail; struct mbuf **mp; iflib_txq_t txq; txq = r->cookie; txq->ift_qstatus = IFLIB_QUEUE_IDLE; CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); callout_stop(&txq->ift_db_check); CALLOUT_UNLOCK(txq); avail = IDXDIFF(pidx, cidx, r->size); for (i = 0; i < avail; i++) { mp = _ring_peek_one(r, cidx, i); m_freem(*mp); } MPASS(ifmp_ring_is_stalled(r) == 0); return (avail); } static void iflib_ifmp_purge(iflib_txq_t txq) { struct ifmp_ring *r; r = txq->ift_br[0]; r->drain = iflib_txq_drain_free; r->can_drain = iflib_txq_drain_always; ifmp_ring_check_drainage(r, r->size); r->drain = iflib_txq_drain; r->can_drain = iflib_txq_can_drain; } static void _task_fn_tx(void *context) { iflib_txq_t txq = context; if_ctx_t ctx = txq->ift_ctx; #ifdef IFLIB_DIAGNOSTICS txq->ift_cpu_exec_count[curcpu]++; #endif if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)) return; ifmp_ring_check_drainage(txq->ift_br[0], TX_BATCH_SIZE); } static void _task_fn_rx(void *context) { iflib_rxq_t rxq = context; if_ctx_t ctx = rxq->ifr_ctx; bool more; int rc; #ifdef IFLIB_DIAGNOSTICS rxq->ifr_cpu_exec_count[curcpu]++; #endif DBG_COUNTER_INC(task_fn_rxs); if (__predict_false(!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))) return; if ((more = iflib_rxeof(rxq, 16 /* XXX */)) == false) { if (ctx->ifc_flags & IFC_LEGACY) IFDI_INTR_ENABLE(ctx); else { DBG_COUNTER_INC(rx_intr_enables); rc = IFDI_QUEUE_INTR_ENABLE(ctx, rxq->ifr_id); KASSERT(rc != ENOTSUP, ("MSI-X support requires queue_intr_enable, but not implemented in driver")); } } if (__predict_false(!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))) return; if (more) GROUPTASK_ENQUEUE(&rxq->ifr_task); } static void _task_fn_admin(void *context) { if_ctx_t ctx = context; if_softc_ctx_t sctx = &ctx->ifc_softc_ctx; iflib_txq_t txq; int i; if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)) return; CTX_LOCK(ctx); for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++) { CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); CALLOUT_UNLOCK(txq); } IFDI_UPDATE_ADMIN_STATUS(ctx); for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++) callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq, txq->ift_timer.c_cpu); IFDI_LINK_INTR_ENABLE(ctx); CTX_UNLOCK(ctx); if (LINK_ACTIVE(ctx) == 0) return; for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++) iflib_txq_check_drain(txq, IFLIB_RESTART_BUDGET); } static void _task_fn_iov(void *context) { if_ctx_t ctx = context; if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)) return; CTX_LOCK(ctx); IFDI_VFLR_HANDLE(ctx); CTX_UNLOCK(ctx); } static int iflib_sysctl_int_delay(SYSCTL_HANDLER_ARGS) { int err; if_int_delay_info_t info; if_ctx_t ctx; info = (if_int_delay_info_t)arg1; ctx = info->iidi_ctx; info->iidi_req = req; info->iidi_oidp = oidp; CTX_LOCK(ctx); err = IFDI_SYSCTL_INT_DELAY(ctx, info); CTX_UNLOCK(ctx); return (err); } /********************************************************************* * * IFNET FUNCTIONS * **********************************************************************/ static void iflib_if_init_locked(if_ctx_t ctx) { iflib_stop(ctx); iflib_init_locked(ctx); } static void iflib_if_init(void *arg) { if_ctx_t ctx = arg; CTX_LOCK(ctx); iflib_if_init_locked(ctx); CTX_UNLOCK(ctx); } static int iflib_if_transmit(if_t ifp, struct mbuf *m) { if_ctx_t ctx = if_getsoftc(ifp); iflib_txq_t txq; int err, qidx; if (__predict_false((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0 || !LINK_ACTIVE(ctx))) { DBG_COUNTER_INC(tx_frees); m_freem(m); return (ENOBUFS); } MPASS(m->m_nextpkt == NULL); qidx = 0; if ((NTXQSETS(ctx) > 1) && M_HASHTYPE_GET(m)) qidx = QIDX(ctx, m); /* * XXX calculate buf_ring based on flowid (divvy up bits?) */ txq = &ctx->ifc_txqs[qidx]; #ifdef DRIVER_BACKPRESSURE if (txq->ift_closed) { while (m != NULL) { next = m->m_nextpkt; m->m_nextpkt = NULL; m_freem(m); m = next; } return (ENOBUFS); } #endif #ifdef notyet qidx = count = 0; mp = marr; next = m; do { count++; next = next->m_nextpkt; } while (next != NULL); if (count > nitems(marr)) if ((mp = malloc(count*sizeof(struct mbuf *), M_IFLIB, M_NOWAIT)) == NULL) { /* XXX check nextpkt */ m_freem(m); /* XXX simplify for now */ DBG_COUNTER_INC(tx_frees); return (ENOBUFS); } for (next = m, i = 0; next != NULL; i++) { mp[i] = next; next = next->m_nextpkt; mp[i]->m_nextpkt = NULL; } #endif DBG_COUNTER_INC(tx_seen); err = ifmp_ring_enqueue(txq->ift_br[0], (void **)&m, 1, TX_BATCH_SIZE); if (err) { GROUPTASK_ENQUEUE(&txq->ift_task); /* support forthcoming later */ #ifdef DRIVER_BACKPRESSURE txq->ift_closed = TRUE; #endif ifmp_ring_check_drainage(txq->ift_br[0], TX_BATCH_SIZE); m_freem(m); } else if (TXQ_AVAIL(txq) < (txq->ift_size >> 1)) { GROUPTASK_ENQUEUE(&txq->ift_task); } return (err); } static void iflib_if_qflush(if_t ifp) { if_ctx_t ctx = if_getsoftc(ifp); iflib_txq_t txq = ctx->ifc_txqs; int i; CTX_LOCK(ctx); ctx->ifc_flags |= IFC_QFLUSH; CTX_UNLOCK(ctx); for (i = 0; i < NTXQSETS(ctx); i++, txq++) while (!(ifmp_ring_is_idle(txq->ift_br[0]) || ifmp_ring_is_stalled(txq->ift_br[0]))) iflib_txq_check_drain(txq, 0); CTX_LOCK(ctx); ctx->ifc_flags &= ~IFC_QFLUSH; CTX_UNLOCK(ctx); if_qflush(ifp); } #define IFCAP_FLAGS (IFCAP_TXCSUM_IPV6 | IFCAP_RXCSUM_IPV6 | IFCAP_HWCSUM | IFCAP_LRO | \ IFCAP_TSO4 | IFCAP_TSO6 | IFCAP_VLAN_HWTAGGING | \ IFCAP_VLAN_MTU | IFCAP_VLAN_HWFILTER | IFCAP_VLAN_HWTSO) static int iflib_if_ioctl(if_t ifp, u_long command, caddr_t data) { if_ctx_t ctx = if_getsoftc(ifp); struct ifreq *ifr = (struct ifreq *)data; #if defined(INET) || defined(INET6) struct ifaddr *ifa = (struct ifaddr *)data; #endif bool avoid_reset = FALSE; int err = 0, reinit = 0, bits; switch (command) { case SIOCSIFADDR: #ifdef INET if (ifa->ifa_addr->sa_family == AF_INET) avoid_reset = TRUE; #endif #ifdef INET6 if (ifa->ifa_addr->sa_family == AF_INET6) avoid_reset = TRUE; #endif /* ** Calling init results in link renegotiation, ** so we avoid doing it when possible. */ if (avoid_reset) { if_setflagbits(ifp, IFF_UP,0); if (!(if_getdrvflags(ifp)& IFF_DRV_RUNNING)) reinit = 1; #ifdef INET if (!(if_getflags(ifp) & IFF_NOARP)) arp_ifinit(ifp, ifa); #endif } else err = ether_ioctl(ifp, command, data); break; case SIOCSIFMTU: CTX_LOCK(ctx); if (ifr->ifr_mtu == if_getmtu(ifp)) { CTX_UNLOCK(ctx); break; } bits = if_getdrvflags(ifp); /* stop the driver and free any clusters before proceeding */ iflib_stop(ctx); if ((err = IFDI_MTU_SET(ctx, ifr->ifr_mtu)) == 0) { if (ifr->ifr_mtu > ctx->ifc_max_fl_buf_size) ctx->ifc_flags |= IFC_MULTISEG; else ctx->ifc_flags &= ~IFC_MULTISEG; err = if_setmtu(ifp, ifr->ifr_mtu); } iflib_init_locked(ctx); if_setdrvflags(ifp, bits); CTX_UNLOCK(ctx); break; case SIOCSIFFLAGS: CTX_LOCK(ctx); if (if_getflags(ifp) & IFF_UP) { if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { if ((if_getflags(ifp) ^ ctx->ifc_if_flags) & (IFF_PROMISC | IFF_ALLMULTI)) { err = IFDI_PROMISC_SET(ctx, if_getflags(ifp)); } } else reinit = 1; } else if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { iflib_stop(ctx); } ctx->ifc_if_flags = if_getflags(ifp); CTX_UNLOCK(ctx); break; case SIOCADDMULTI: case SIOCDELMULTI: if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { CTX_LOCK(ctx); IFDI_INTR_DISABLE(ctx); IFDI_MULTI_SET(ctx); IFDI_INTR_ENABLE(ctx); CTX_UNLOCK(ctx); } break; case SIOCSIFMEDIA: CTX_LOCK(ctx); IFDI_MEDIA_SET(ctx); CTX_UNLOCK(ctx); /* falls thru */ case SIOCGIFMEDIA: err = ifmedia_ioctl(ifp, ifr, &ctx->ifc_media, command); break; case SIOCGI2C: { struct ifi2creq i2c; err = copyin(ifr->ifr_data, &i2c, sizeof(i2c)); if (err != 0) break; if (i2c.dev_addr != 0xA0 && i2c.dev_addr != 0xA2) { err = EINVAL; break; } if (i2c.len > sizeof(i2c.data)) { err = EINVAL; break; } if ((err = IFDI_I2C_REQ(ctx, &i2c)) == 0) err = copyout(&i2c, ifr->ifr_data, sizeof(i2c)); break; } case SIOCSIFCAP: { int mask, setmask; mask = ifr->ifr_reqcap ^ if_getcapenable(ifp); setmask = 0; #ifdef TCP_OFFLOAD setmask |= mask & (IFCAP_TOE4|IFCAP_TOE6); #endif setmask |= (mask & IFCAP_FLAGS); if (setmask & (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6)) setmask |= (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6); if ((mask & IFCAP_WOL) && (if_getcapabilities(ifp) & IFCAP_WOL) != 0) setmask |= (mask & (IFCAP_WOL_MCAST|IFCAP_WOL_MAGIC)); if_vlancap(ifp); /* * want to ensure that traffic has stopped before we change any of the flags */ if (setmask) { CTX_LOCK(ctx); bits = if_getdrvflags(ifp); if (bits & IFF_DRV_RUNNING) iflib_stop(ctx); if_togglecapenable(ifp, setmask); if (bits & IFF_DRV_RUNNING) iflib_init_locked(ctx); if_setdrvflags(ifp, bits); CTX_UNLOCK(ctx); } break; } case SIOCGPRIVATE_0: case SIOCSDRVSPEC: case SIOCGDRVSPEC: CTX_LOCK(ctx); err = IFDI_PRIV_IOCTL(ctx, command, data); CTX_UNLOCK(ctx); break; default: err = ether_ioctl(ifp, command, data); break; } if (reinit) iflib_if_init(ctx); return (err); } static uint64_t iflib_if_get_counter(if_t ifp, ift_counter cnt) { if_ctx_t ctx = if_getsoftc(ifp); return (IFDI_GET_COUNTER(ctx, cnt)); } /********************************************************************* * * OTHER FUNCTIONS EXPORTED TO THE STACK * **********************************************************************/ static void iflib_vlan_register(void *arg, if_t ifp, uint16_t vtag) { if_ctx_t ctx = if_getsoftc(ifp); if ((void *)ctx != arg) return; if ((vtag == 0) || (vtag > 4095)) return; CTX_LOCK(ctx); IFDI_VLAN_REGISTER(ctx, vtag); /* Re-init to load the changes */ if (if_getcapenable(ifp) & IFCAP_VLAN_HWFILTER) iflib_init_locked(ctx); CTX_UNLOCK(ctx); } static void iflib_vlan_unregister(void *arg, if_t ifp, uint16_t vtag) { if_ctx_t ctx = if_getsoftc(ifp); if ((void *)ctx != arg) return; if ((vtag == 0) || (vtag > 4095)) return; CTX_LOCK(ctx); IFDI_VLAN_UNREGISTER(ctx, vtag); /* Re-init to load the changes */ if (if_getcapenable(ifp) & IFCAP_VLAN_HWFILTER) iflib_init_locked(ctx); CTX_UNLOCK(ctx); } static void iflib_led_func(void *arg, int onoff) { if_ctx_t ctx = arg; CTX_LOCK(ctx); IFDI_LED_FUNC(ctx, onoff); CTX_UNLOCK(ctx); } /********************************************************************* * * BUS FUNCTION DEFINITIONS * **********************************************************************/ int iflib_device_probe(device_t dev) { pci_vendor_info_t *ent; uint16_t pci_vendor_id, pci_device_id; uint16_t pci_subvendor_id, pci_subdevice_id; uint16_t pci_rev_id; if_shared_ctx_t sctx; if ((sctx = DEVICE_REGISTER(dev)) == NULL || sctx->isc_magic != IFLIB_MAGIC) return (ENOTSUP); pci_vendor_id = pci_get_vendor(dev); pci_device_id = pci_get_device(dev); pci_subvendor_id = pci_get_subvendor(dev); pci_subdevice_id = pci_get_subdevice(dev); pci_rev_id = pci_get_revid(dev); if (sctx->isc_parse_devinfo != NULL) sctx->isc_parse_devinfo(&pci_device_id, &pci_subvendor_id, &pci_subdevice_id, &pci_rev_id); ent = sctx->isc_vendor_info; while (ent->pvi_vendor_id != 0) { if (pci_vendor_id != ent->pvi_vendor_id) { ent++; continue; } if ((pci_device_id == ent->pvi_device_id) && ((pci_subvendor_id == ent->pvi_subvendor_id) || (ent->pvi_subvendor_id == 0)) && ((pci_subdevice_id == ent->pvi_subdevice_id) || (ent->pvi_subdevice_id == 0)) && ((pci_rev_id == ent->pvi_rev_id) || (ent->pvi_rev_id == 0))) { device_set_desc_copy(dev, ent->pvi_name); /* this needs to be changed to zero if the bus probing code * ever stops re-probing on best match because the sctx * may have its values over written by register calls * in subsequent probes */ return (BUS_PROBE_DEFAULT); } ent++; } return (ENXIO); } int iflib_device_register(device_t dev, void *sc, if_shared_ctx_t sctx, if_ctx_t *ctxp) { int err, rid, msix, msix_bar; if_ctx_t ctx; if_t ifp; if_softc_ctx_t scctx; int i; uint16_t main_txq; uint16_t main_rxq; ctx = malloc(sizeof(* ctx), M_IFLIB, M_WAITOK|M_ZERO); if (sc == NULL) { sc = malloc(sctx->isc_driver->size, M_IFLIB, M_WAITOK|M_ZERO); device_set_softc(dev, ctx); ctx->ifc_flags |= IFC_SC_ALLOCATED; } ctx->ifc_sctx = sctx; ctx->ifc_dev = dev; ctx->ifc_softc = sc; if ((err = iflib_register(ctx)) != 0) { device_printf(dev, "iflib_register failed %d\n", err); return (err); } iflib_add_device_sysctl_pre(ctx); scctx = &ctx->ifc_softc_ctx; ifp = ctx->ifc_ifp; /* * XXX sanity check that ntxd & nrxd are a power of 2 */ if (ctx->ifc_sysctl_ntxqs != 0) scctx->isc_ntxqsets = ctx->ifc_sysctl_ntxqs; if (ctx->ifc_sysctl_nrxqs != 0) scctx->isc_nrxqsets = ctx->ifc_sysctl_nrxqs; for (i = 0; i < sctx->isc_ntxqs; i++) { if (ctx->ifc_sysctl_ntxds[i] != 0) scctx->isc_ntxd[i] = ctx->ifc_sysctl_ntxds[i]; else scctx->isc_ntxd[i] = sctx->isc_ntxd_default[i]; } for (i = 0; i < sctx->isc_nrxqs; i++) { if (ctx->ifc_sysctl_nrxds[i] != 0) scctx->isc_nrxd[i] = ctx->ifc_sysctl_nrxds[i]; else scctx->isc_nrxd[i] = sctx->isc_nrxd_default[i]; } for (i = 0; i < sctx->isc_nrxqs; i++) { if (scctx->isc_nrxd[i] < sctx->isc_nrxd_min[i]) { device_printf(dev, "nrxd%d: %d less than nrxd_min %d - resetting to min\n", i, scctx->isc_nrxd[i], sctx->isc_nrxd_min[i]); scctx->isc_nrxd[i] = sctx->isc_nrxd_min[i]; } if (scctx->isc_nrxd[i] > sctx->isc_nrxd_max[i]) { device_printf(dev, "nrxd%d: %d greater than nrxd_max %d - resetting to max\n", i, scctx->isc_nrxd[i], sctx->isc_nrxd_max[i]); scctx->isc_nrxd[i] = sctx->isc_nrxd_max[i]; } } for (i = 0; i < sctx->isc_ntxqs; i++) { if (scctx->isc_ntxd[i] < sctx->isc_ntxd_min[i]) { device_printf(dev, "ntxd%d: %d less than ntxd_min %d - resetting to min\n", i, scctx->isc_ntxd[i], sctx->isc_ntxd_min[i]); scctx->isc_ntxd[i] = sctx->isc_ntxd_min[i]; } if (scctx->isc_ntxd[i] > sctx->isc_ntxd_max[i]) { device_printf(dev, "ntxd%d: %d greater than ntxd_max %d - resetting to max\n", i, scctx->isc_ntxd[i], sctx->isc_ntxd_max[i]); scctx->isc_ntxd[i] = sctx->isc_ntxd_max[i]; } } if ((err = IFDI_ATTACH_PRE(ctx)) != 0) { device_printf(dev, "IFDI_ATTACH_PRE failed %d\n", err); return (err); } _iflib_pre_assert(scctx); ctx->ifc_txrx = *scctx->isc_txrx; #ifdef INVARIANTS MPASS(scctx->isc_capenable); if (scctx->isc_capenable & IFCAP_TXCSUM) MPASS(scctx->isc_tx_csum_flags); #endif if_setcapabilities(ifp, scctx->isc_capenable); if_setcapenable(ifp, scctx->isc_capenable); if (scctx->isc_ntxqsets == 0 || (scctx->isc_ntxqsets_max && scctx->isc_ntxqsets_max < scctx->isc_ntxqsets)) scctx->isc_ntxqsets = scctx->isc_ntxqsets_max; if (scctx->isc_nrxqsets == 0 || (scctx->isc_nrxqsets_max && scctx->isc_nrxqsets_max < scctx->isc_nrxqsets)) scctx->isc_nrxqsets = scctx->isc_nrxqsets_max; #ifdef ACPI_DMAR if (dmar_get_dma_tag(device_get_parent(dev), dev) != NULL) ctx->ifc_flags |= IFC_DMAR; #endif msix_bar = scctx->isc_msix_bar; if(sctx->isc_flags & IFLIB_HAS_TXCQ) main_txq = 1; else main_txq = 0; if(sctx->isc_flags & IFLIB_HAS_RXCQ) main_rxq = 1; else main_rxq = 0; /* XXX change for per-queue sizes */ device_printf(dev, "using %d tx descriptors and %d rx descriptors\n", scctx->isc_ntxd[main_txq], scctx->isc_nrxd[main_rxq]); for (i = 0; i < sctx->isc_nrxqs; i++) { if (!powerof2(scctx->isc_nrxd[i])) { /* round down instead? */ device_printf(dev, "# rx descriptors must be a power of 2\n"); err = EINVAL; goto fail; } } for (i = 0; i < sctx->isc_ntxqs; i++) { if (!powerof2(scctx->isc_ntxd[i])) { device_printf(dev, "# tx descriptors must be a power of 2"); err = EINVAL; goto fail; } } if (scctx->isc_tx_nsegments > scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION) scctx->isc_tx_nsegments = max(1, scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION); if (scctx->isc_tx_tso_segments_max > scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION) scctx->isc_tx_tso_segments_max = max(1, scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION); /* * Protect the stack against modern hardware */ if (scctx->isc_tx_tso_size_max > FREEBSD_TSO_SIZE_MAX) scctx->isc_tx_tso_size_max = FREEBSD_TSO_SIZE_MAX; /* TSO parameters - dig these out of the data sheet - simply correspond to tag setup */ ifp->if_hw_tsomaxsegcount = scctx->isc_tx_tso_segments_max; ifp->if_hw_tsomax = scctx->isc_tx_tso_size_max; ifp->if_hw_tsomaxsegsize = scctx->isc_tx_tso_segsize_max; if (scctx->isc_rss_table_size == 0) scctx->isc_rss_table_size = 64; scctx->isc_rss_table_mask = scctx->isc_rss_table_size-1; GROUPTASK_INIT(&ctx->ifc_admin_task, 0, _task_fn_admin, ctx); /* XXX format name */ taskqgroup_attach(qgroup_if_config_tqg, &ctx->ifc_admin_task, ctx, -1, "admin"); /* ** Now setup MSI or MSI/X, should ** return us the number of supported ** vectors. (Will be 1 for MSI) */ if (sctx->isc_flags & IFLIB_SKIP_MSIX) { msix = scctx->isc_vectors; } else if (scctx->isc_msix_bar != 0) /* * The simple fact that isc_msix_bar is not 0 does not mean we * we have a good value there that is known to work. */ msix = iflib_msix_init(ctx); else { scctx->isc_vectors = 1; scctx->isc_ntxqsets = 1; scctx->isc_nrxqsets = 1; scctx->isc_intr = IFLIB_INTR_LEGACY; msix = 0; } /* Get memory for the station queues */ if ((err = iflib_queues_alloc(ctx))) { device_printf(dev, "Unable to allocate queue memory\n"); goto fail; } if ((err = iflib_qset_structures_setup(ctx))) { device_printf(dev, "qset structure setup failed %d\n", err); goto fail_queues; } /* * Group taskqueues aren't properly set up until SMP is started, * so we disable interrupts until we can handle them post * SI_SUB_SMP. * * XXX: disabling interrupts doesn't actually work, at least for * the non-MSI case. When they occur before SI_SUB_SMP completes, * we do null handling and depend on this not causing too large an * interrupt storm. */ IFDI_INTR_DISABLE(ctx); if (msix > 1 && (err = IFDI_MSIX_INTR_ASSIGN(ctx, msix)) != 0) { device_printf(dev, "IFDI_MSIX_INTR_ASSIGN failed %d\n", err); goto fail_intr_free; } if (msix <= 1) { rid = 0; if (scctx->isc_intr == IFLIB_INTR_MSI) { MPASS(msix == 1); rid = 1; } if ((err = iflib_legacy_setup(ctx, ctx->isc_legacy_intr, ctx->ifc_softc, &rid, "irq0")) != 0) { device_printf(dev, "iflib_legacy_setup failed %d\n", err); goto fail_intr_free; } } ether_ifattach(ctx->ifc_ifp, ctx->ifc_mac); if ((err = IFDI_ATTACH_POST(ctx)) != 0) { device_printf(dev, "IFDI_ATTACH_POST failed %d\n", err); goto fail_detach; } if ((err = iflib_netmap_attach(ctx))) { device_printf(ctx->ifc_dev, "netmap attach failed: %d\n", err); goto fail_detach; } *ctxp = ctx; if_setgetcounterfn(ctx->ifc_ifp, iflib_if_get_counter); iflib_add_device_sysctl_post(ctx); ctx->ifc_flags |= IFC_INIT_DONE; return (0); fail_detach: ether_ifdetach(ctx->ifc_ifp); fail_intr_free: if (scctx->isc_intr == IFLIB_INTR_MSIX || scctx->isc_intr == IFLIB_INTR_MSI) pci_release_msi(ctx->ifc_dev); fail_queues: /* XXX free queues */ fail: IFDI_DETACH(ctx); return (err); } int iflib_device_attach(device_t dev) { if_ctx_t ctx; if_shared_ctx_t sctx; if ((sctx = DEVICE_REGISTER(dev)) == NULL || sctx->isc_magic != IFLIB_MAGIC) return (ENOTSUP); pci_enable_busmaster(dev); return (iflib_device_register(dev, NULL, sctx, &ctx)); } int iflib_device_deregister(if_ctx_t ctx) { if_t ifp = ctx->ifc_ifp; iflib_txq_t txq; iflib_rxq_t rxq; device_t dev = ctx->ifc_dev; int i; struct taskqgroup *tqg; /* Make sure VLANS are not using driver */ if (if_vlantrunkinuse(ifp)) { device_printf(dev,"Vlan in use, detach first\n"); return (EBUSY); } CTX_LOCK(ctx); ctx->ifc_in_detach = 1; iflib_stop(ctx); CTX_UNLOCK(ctx); /* Unregister VLAN events */ if (ctx->ifc_vlan_attach_event != NULL) EVENTHANDLER_DEREGISTER(vlan_config, ctx->ifc_vlan_attach_event); if (ctx->ifc_vlan_detach_event != NULL) EVENTHANDLER_DEREGISTER(vlan_unconfig, ctx->ifc_vlan_detach_event); iflib_netmap_detach(ifp); ether_ifdetach(ifp); /* ether_ifdetach calls if_qflush - lock must be destroy afterwards*/ CTX_LOCK_DESTROY(ctx); if (ctx->ifc_led_dev != NULL) led_destroy(ctx->ifc_led_dev); /* XXX drain any dependent tasks */ - tqg = qgroup_if_io_tqg; + tqg = qgroup_softirq; for (txq = ctx->ifc_txqs, i = 0; i < NTXQSETS(ctx); i++, txq++) { callout_drain(&txq->ift_timer); callout_drain(&txq->ift_db_check); if (txq->ift_task.gt_uniq != NULL) taskqgroup_detach(tqg, &txq->ift_task); } for (i = 0, rxq = ctx->ifc_rxqs; i < NRXQSETS(ctx); i++, rxq++) { if (rxq->ifr_task.gt_uniq != NULL) taskqgroup_detach(tqg, &rxq->ifr_task); } tqg = qgroup_if_config_tqg; if (ctx->ifc_admin_task.gt_uniq != NULL) taskqgroup_detach(tqg, &ctx->ifc_admin_task); if (ctx->ifc_vflr_task.gt_uniq != NULL) taskqgroup_detach(tqg, &ctx->ifc_vflr_task); IFDI_DETACH(ctx); device_set_softc(ctx->ifc_dev, NULL); if (ctx->ifc_softc_ctx.isc_intr != IFLIB_INTR_LEGACY) { pci_release_msi(dev); } if (ctx->ifc_softc_ctx.isc_intr != IFLIB_INTR_MSIX) { iflib_irq_free(ctx, &ctx->ifc_legacy_irq); } if (ctx->ifc_msix_mem != NULL) { bus_release_resource(ctx->ifc_dev, SYS_RES_MEMORY, ctx->ifc_softc_ctx.isc_msix_bar, ctx->ifc_msix_mem); ctx->ifc_msix_mem = NULL; } bus_generic_detach(dev); if_free(ifp); iflib_tx_structures_free(ctx); iflib_rx_structures_free(ctx); if (ctx->ifc_flags & IFC_SC_ALLOCATED) free(ctx->ifc_softc, M_IFLIB); free(ctx, M_IFLIB); return (0); } int iflib_device_detach(device_t dev) { if_ctx_t ctx = device_get_softc(dev); return (iflib_device_deregister(ctx)); } int iflib_device_suspend(device_t dev) { if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); IFDI_SUSPEND(ctx); CTX_UNLOCK(ctx); return bus_generic_suspend(dev); } int iflib_device_shutdown(device_t dev) { if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); IFDI_SHUTDOWN(ctx); CTX_UNLOCK(ctx); return bus_generic_suspend(dev); } int iflib_device_resume(device_t dev) { if_ctx_t ctx = device_get_softc(dev); iflib_txq_t txq = ctx->ifc_txqs; CTX_LOCK(ctx); IFDI_RESUME(ctx); iflib_init_locked(ctx); CTX_UNLOCK(ctx); for (int i = 0; i < NTXQSETS(ctx); i++, txq++) iflib_txq_check_drain(txq, IFLIB_RESTART_BUDGET); return (bus_generic_resume(dev)); } int iflib_device_iov_init(device_t dev, uint16_t num_vfs, const nvlist_t *params) { int error; if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); error = IFDI_IOV_INIT(ctx, num_vfs, params); CTX_UNLOCK(ctx); return (error); } void iflib_device_iov_uninit(device_t dev) { if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); IFDI_IOV_UNINIT(ctx); CTX_UNLOCK(ctx); } int iflib_device_iov_add_vf(device_t dev, uint16_t vfnum, const nvlist_t *params) { int error; if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); error = IFDI_IOV_VF_ADD(ctx, vfnum, params); CTX_UNLOCK(ctx); return (error); } /********************************************************************* * * MODULE FUNCTION DEFINITIONS * **********************************************************************/ /* * - Start a fast taskqueue thread for each core * - Start a taskqueue for control operations */ static int iflib_module_init(void) { return (0); } static int iflib_module_event_handler(module_t mod, int what, void *arg) { int err; switch (what) { case MOD_LOAD: if ((err = iflib_module_init()) != 0) return (err); break; case MOD_UNLOAD: return (EBUSY); default: return (EOPNOTSUPP); } return (0); } /********************************************************************* * * PUBLIC FUNCTION DEFINITIONS * ordered as in iflib.h * **********************************************************************/ static void _iflib_assert(if_shared_ctx_t sctx) { MPASS(sctx->isc_tx_maxsize); MPASS(sctx->isc_tx_maxsegsize); MPASS(sctx->isc_rx_maxsize); MPASS(sctx->isc_rx_nsegments); MPASS(sctx->isc_rx_maxsegsize); MPASS(sctx->isc_nrxd_min[0]); MPASS(sctx->isc_nrxd_max[0]); MPASS(sctx->isc_nrxd_default[0]); MPASS(sctx->isc_ntxd_min[0]); MPASS(sctx->isc_ntxd_max[0]); MPASS(sctx->isc_ntxd_default[0]); } static void _iflib_pre_assert(if_softc_ctx_t scctx) { MPASS(scctx->isc_txrx->ift_txd_encap); MPASS(scctx->isc_txrx->ift_txd_flush); MPASS(scctx->isc_txrx->ift_txd_credits_update); MPASS(scctx->isc_txrx->ift_rxd_available); MPASS(scctx->isc_txrx->ift_rxd_pkt_get); MPASS(scctx->isc_txrx->ift_rxd_refill); MPASS(scctx->isc_txrx->ift_rxd_flush); } static int iflib_register(if_ctx_t ctx) { if_shared_ctx_t sctx = ctx->ifc_sctx; driver_t *driver = sctx->isc_driver; device_t dev = ctx->ifc_dev; if_t ifp; _iflib_assert(sctx); CTX_LOCK_INIT(ctx, device_get_nameunit(ctx->ifc_dev)); ifp = ctx->ifc_ifp = if_gethandle(IFT_ETHER); if (ifp == NULL) { device_printf(dev, "can not allocate ifnet structure\n"); return (ENOMEM); } /* * Initialize our context's device specific methods */ kobj_init((kobj_t) ctx, (kobj_class_t) driver); kobj_class_compile((kobj_class_t) driver); driver->refs++; if_initname(ifp, device_get_name(dev), device_get_unit(dev)); if_setsoftc(ifp, ctx); if_setdev(ifp, dev); if_setinitfn(ifp, iflib_if_init); if_setioctlfn(ifp, iflib_if_ioctl); if_settransmitfn(ifp, iflib_if_transmit); if_setqflushfn(ifp, iflib_if_qflush); if_setflags(ifp, IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST); ctx->ifc_vlan_attach_event = EVENTHANDLER_REGISTER(vlan_config, iflib_vlan_register, ctx, EVENTHANDLER_PRI_FIRST); ctx->ifc_vlan_detach_event = EVENTHANDLER_REGISTER(vlan_unconfig, iflib_vlan_unregister, ctx, EVENTHANDLER_PRI_FIRST); ifmedia_init(&ctx->ifc_media, IFM_IMASK, iflib_media_change, iflib_media_status); return (0); } static int iflib_queues_alloc(if_ctx_t ctx) { if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = ctx->ifc_dev; int nrxqsets = scctx->isc_nrxqsets; int ntxqsets = scctx->isc_ntxqsets; iflib_txq_t txq; iflib_rxq_t rxq; iflib_fl_t fl = NULL; int i, j, cpu, err, txconf, rxconf; iflib_dma_info_t ifdip; uint32_t *rxqsizes = scctx->isc_rxqsizes; uint32_t *txqsizes = scctx->isc_txqsizes; uint8_t nrxqs = sctx->isc_nrxqs; uint8_t ntxqs = sctx->isc_ntxqs; int nfree_lists = sctx->isc_nfl ? sctx->isc_nfl : 1; caddr_t *vaddrs; uint64_t *paddrs; struct ifmp_ring **brscp; int nbuf_rings = 1; /* XXX determine dynamically */ KASSERT(ntxqs > 0, ("number of queues per qset must be at least 1")); KASSERT(nrxqs > 0, ("number of queues per qset must be at least 1")); brscp = NULL; txq = NULL; rxq = NULL; /* Allocate the TX ring struct memory */ if (!(txq = (iflib_txq_t) malloc(sizeof(struct iflib_txq) * ntxqsets, M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate TX ring memory\n"); err = ENOMEM; goto fail; } /* Now allocate the RX */ if (!(rxq = (iflib_rxq_t) malloc(sizeof(struct iflib_rxq) * nrxqsets, M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate RX ring memory\n"); err = ENOMEM; goto rx_fail; } if (!(brscp = malloc(sizeof(void *) * nbuf_rings * nrxqsets, M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to buf_ring_sc * memory\n"); err = ENOMEM; goto rx_fail; } ctx->ifc_txqs = txq; ctx->ifc_rxqs = rxq; /* * XXX handle allocation failure */ for (txconf = i = 0, cpu = CPU_FIRST(); i < ntxqsets; i++, txconf++, txq++, cpu = CPU_NEXT(cpu)) { /* Set up some basics */ if ((ifdip = malloc(sizeof(struct iflib_dma_info) * ntxqs, M_IFLIB, M_WAITOK|M_ZERO)) == NULL) { device_printf(dev, "failed to allocate iflib_dma_info\n"); err = ENOMEM; goto err_tx_desc; } txq->ift_ifdi = ifdip; for (j = 0; j < ntxqs; j++, ifdip++) { if (iflib_dma_alloc(ctx, txqsizes[j], ifdip, BUS_DMA_NOWAIT)) { device_printf(dev, "Unable to allocate Descriptor memory\n"); err = ENOMEM; goto err_tx_desc; } bzero((void *)ifdip->idi_vaddr, txqsizes[j]); } txq->ift_ctx = ctx; txq->ift_id = i; if (sctx->isc_flags & IFLIB_HAS_TXCQ) { txq->ift_br_offset = 1; } else { txq->ift_br_offset = 0; } /* XXX fix this */ txq->ift_timer.c_cpu = cpu; txq->ift_db_check.c_cpu = cpu; txq->ift_nbr = nbuf_rings; if (iflib_txsd_alloc(txq)) { device_printf(dev, "Critical Failure setting up TX buffers\n"); err = ENOMEM; goto err_tx_desc; } /* Initialize the TX lock */ snprintf(txq->ift_mtx_name, MTX_NAME_LEN, "%s:tx(%d):callout", device_get_nameunit(dev), txq->ift_id); mtx_init(&txq->ift_mtx, txq->ift_mtx_name, NULL, MTX_DEF); callout_init_mtx(&txq->ift_timer, &txq->ift_mtx, 0); callout_init_mtx(&txq->ift_db_check, &txq->ift_mtx, 0); snprintf(txq->ift_db_mtx_name, MTX_NAME_LEN, "%s:tx(%d):db", device_get_nameunit(dev), txq->ift_id); TXDB_LOCK_INIT(txq); txq->ift_br = brscp + i*nbuf_rings; for (j = 0; j < nbuf_rings; j++) { err = ifmp_ring_alloc(&txq->ift_br[j], 2048, txq, iflib_txq_drain, iflib_txq_can_drain, M_IFLIB, M_WAITOK); if (err) { /* XXX free any allocated rings */ device_printf(dev, "Unable to allocate buf_ring\n"); goto err_tx_desc; } } } for (rxconf = i = 0; i < nrxqsets; i++, rxconf++, rxq++) { /* Set up some basics */ if ((ifdip = malloc(sizeof(struct iflib_dma_info) * nrxqs, M_IFLIB, M_WAITOK|M_ZERO)) == NULL) { device_printf(dev, "failed to allocate iflib_dma_info\n"); err = ENOMEM; goto err_tx_desc; } rxq->ifr_ifdi = ifdip; for (j = 0; j < nrxqs; j++, ifdip++) { if (iflib_dma_alloc(ctx, rxqsizes[j], ifdip, BUS_DMA_NOWAIT)) { device_printf(dev, "Unable to allocate Descriptor memory\n"); err = ENOMEM; goto err_tx_desc; } bzero((void *)ifdip->idi_vaddr, rxqsizes[j]); } rxq->ifr_ctx = ctx; rxq->ifr_id = i; if (sctx->isc_flags & IFLIB_HAS_RXCQ) { rxq->ifr_fl_offset = 1; } else { rxq->ifr_fl_offset = 0; } rxq->ifr_nfl = nfree_lists; if (!(fl = (iflib_fl_t) malloc(sizeof(struct iflib_fl) * nfree_lists, M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate free list memory\n"); err = ENOMEM; goto err_tx_desc; } rxq->ifr_fl = fl; for (j = 0; j < nfree_lists; j++) { rxq->ifr_fl[j].ifl_rxq = rxq; rxq->ifr_fl[j].ifl_id = j; rxq->ifr_fl[j].ifl_ifdi = &rxq->ifr_ifdi[j + rxq->ifr_fl_offset]; } /* Allocate receive buffers for the ring*/ if (iflib_rxsd_alloc(rxq)) { device_printf(dev, "Critical Failure setting up receive buffers\n"); err = ENOMEM; goto err_rx_desc; } } /* TXQs */ vaddrs = malloc(sizeof(caddr_t)*ntxqsets*ntxqs, M_IFLIB, M_WAITOK); paddrs = malloc(sizeof(uint64_t)*ntxqsets*ntxqs, M_IFLIB, M_WAITOK); for (i = 0; i < ntxqsets; i++) { iflib_dma_info_t di = ctx->ifc_txqs[i].ift_ifdi; for (j = 0; j < ntxqs; j++, di++) { vaddrs[i*ntxqs + j] = di->idi_vaddr; paddrs[i*ntxqs + j] = di->idi_paddr; } } if ((err = IFDI_TX_QUEUES_ALLOC(ctx, vaddrs, paddrs, ntxqs, ntxqsets)) != 0) { device_printf(ctx->ifc_dev, "device queue allocation failed\n"); iflib_tx_structures_free(ctx); free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); goto err_rx_desc; } free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); /* RXQs */ vaddrs = malloc(sizeof(caddr_t)*nrxqsets*nrxqs, M_IFLIB, M_WAITOK); paddrs = malloc(sizeof(uint64_t)*nrxqsets*nrxqs, M_IFLIB, M_WAITOK); for (i = 0; i < nrxqsets; i++) { iflib_dma_info_t di = ctx->ifc_rxqs[i].ifr_ifdi; for (j = 0; j < nrxqs; j++, di++) { vaddrs[i*nrxqs + j] = di->idi_vaddr; paddrs[i*nrxqs + j] = di->idi_paddr; } } if ((err = IFDI_RX_QUEUES_ALLOC(ctx, vaddrs, paddrs, nrxqs, nrxqsets)) != 0) { device_printf(ctx->ifc_dev, "device queue allocation failed\n"); iflib_tx_structures_free(ctx); free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); goto err_rx_desc; } free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); return (0); /* XXX handle allocation failure changes */ err_rx_desc: err_tx_desc: if (ctx->ifc_rxqs != NULL) free(ctx->ifc_rxqs, M_IFLIB); ctx->ifc_rxqs = NULL; if (ctx->ifc_txqs != NULL) free(ctx->ifc_txqs, M_IFLIB); ctx->ifc_txqs = NULL; rx_fail: if (brscp != NULL) free(brscp, M_IFLIB); if (rxq != NULL) free(rxq, M_IFLIB); if (txq != NULL) free(txq, M_IFLIB); fail: return (err); } static int iflib_tx_structures_setup(if_ctx_t ctx) { iflib_txq_t txq = ctx->ifc_txqs; int i; for (i = 0; i < NTXQSETS(ctx); i++, txq++) iflib_txq_setup(txq); return (0); } static void iflib_tx_structures_free(if_ctx_t ctx) { iflib_txq_t txq = ctx->ifc_txqs; int i, j; for (i = 0; i < NTXQSETS(ctx); i++, txq++) { iflib_txq_destroy(txq); for (j = 0; j < ctx->ifc_nhwtxqs; j++) iflib_dma_free(&txq->ift_ifdi[j]); } free(ctx->ifc_txqs, M_IFLIB); ctx->ifc_txqs = NULL; IFDI_QUEUES_FREE(ctx); } /********************************************************************* * * Initialize all receive rings. * **********************************************************************/ static int iflib_rx_structures_setup(if_ctx_t ctx) { iflib_rxq_t rxq = ctx->ifc_rxqs; int q; #if defined(INET6) || defined(INET) int i, err; #endif for (q = 0; q < ctx->ifc_softc_ctx.isc_nrxqsets; q++, rxq++) { #if defined(INET6) || defined(INET) tcp_lro_free(&rxq->ifr_lc); if ((err = tcp_lro_init_args(&rxq->ifr_lc, ctx->ifc_ifp, TCP_LRO_ENTRIES, min(1024, ctx->ifc_softc_ctx.isc_nrxd[rxq->ifr_fl_offset]))) != 0) { device_printf(ctx->ifc_dev, "LRO Initialization failed!\n"); goto fail; } rxq->ifr_lro_enabled = TRUE; #endif IFDI_RXQ_SETUP(ctx, rxq->ifr_id); } return (0); #if defined(INET6) || defined(INET) fail: /* * Free RX software descriptors allocated so far, we will only handle * the rings that completed, the failing case will have * cleaned up for itself. 'q' failed, so its the terminus. */ rxq = ctx->ifc_rxqs; for (i = 0; i < q; ++i, rxq++) { iflib_rx_sds_free(rxq); rxq->ifr_cq_gen = rxq->ifr_cq_cidx = rxq->ifr_cq_pidx = 0; } return (err); #endif } /********************************************************************* * * Free all receive rings. * **********************************************************************/ static void iflib_rx_structures_free(if_ctx_t ctx) { iflib_rxq_t rxq = ctx->ifc_rxqs; for (int i = 0; i < ctx->ifc_softc_ctx.isc_nrxqsets; i++, rxq++) { iflib_rx_sds_free(rxq); } } static int iflib_qset_structures_setup(if_ctx_t ctx) { int err; if ((err = iflib_tx_structures_setup(ctx)) != 0) return (err); if ((err = iflib_rx_structures_setup(ctx)) != 0) { device_printf(ctx->ifc_dev, "iflib_rx_structures_setup failed: %d\n", err); iflib_tx_structures_free(ctx); iflib_rx_structures_free(ctx); } return (err); } int iflib_irq_alloc(if_ctx_t ctx, if_irq_t irq, int rid, driver_filter_t filter, void *filter_arg, driver_intr_t handler, void *arg, char *name) { return (_iflib_irq_alloc(ctx, irq, rid, filter, handler, arg, name)); } static int find_nth(if_ctx_t ctx, cpuset_t *cpus, int qid) { int i, cpuid, eqid, count; CPU_COPY(&ctx->ifc_cpus, cpus); count = CPU_COUNT(&ctx->ifc_cpus); eqid = qid % count; /* clear up to the qid'th bit */ for (i = 0; i < eqid; i++) { cpuid = CPU_FFS(cpus); MPASS(cpuid != 0); CPU_CLR(cpuid-1, cpus); } cpuid = CPU_FFS(cpus); MPASS(cpuid != 0); return (cpuid-1); } int iflib_irq_alloc_generic(if_ctx_t ctx, if_irq_t irq, int rid, iflib_intr_type_t type, driver_filter_t *filter, void *filter_arg, int qid, char *name) { struct grouptask *gtask; struct taskqgroup *tqg; iflib_filter_info_t info; cpuset_t cpus; gtask_fn_t *fn; int tqrid, err, cpuid; void *q; info = &ctx->ifc_filter_info; tqrid = rid; switch (type) { /* XXX merge tx/rx for netmap? */ case IFLIB_INTR_TX: q = &ctx->ifc_txqs[qid]; info = &ctx->ifc_txqs[qid].ift_filter_info; gtask = &ctx->ifc_txqs[qid].ift_task; - tqg = qgroup_if_io_tqg; + tqg = qgroup_softirq; fn = _task_fn_tx; GROUPTASK_INIT(gtask, 0, fn, q); break; case IFLIB_INTR_RX: q = &ctx->ifc_rxqs[qid]; info = &ctx->ifc_rxqs[qid].ifr_filter_info; gtask = &ctx->ifc_rxqs[qid].ifr_task; - tqg = qgroup_if_io_tqg; + tqg = qgroup_softirq; fn = _task_fn_rx; GROUPTASK_INIT(gtask, 0, fn, q); break; case IFLIB_INTR_ADMIN: q = ctx; tqrid = -1; info = &ctx->ifc_filter_info; gtask = &ctx->ifc_admin_task; tqg = qgroup_if_config_tqg; fn = _task_fn_admin; break; default: panic("unknown net intr type"); } info->ifi_filter = filter; info->ifi_filter_arg = filter_arg; info->ifi_task = gtask; info->ifi_ctx = ctx; err = _iflib_irq_alloc(ctx, irq, rid, iflib_fast_intr, NULL, info, name); if (err != 0) { device_printf(ctx->ifc_dev, "_iflib_irq_alloc failed %d\n", err); return (err); } if (type == IFLIB_INTR_ADMIN) return (0); if (tqrid != -1) { cpuid = find_nth(ctx, &cpus, qid); taskqgroup_attach_cpu(tqg, gtask, q, cpuid, irq->ii_rid, name); } else { taskqgroup_attach(tqg, gtask, q, tqrid, name); } return (0); } void iflib_softirq_alloc_generic(if_ctx_t ctx, int rid, iflib_intr_type_t type, void *arg, int qid, char *name) { struct grouptask *gtask; struct taskqgroup *tqg; gtask_fn_t *fn; void *q; switch (type) { case IFLIB_INTR_TX: q = &ctx->ifc_txqs[qid]; gtask = &ctx->ifc_txqs[qid].ift_task; - tqg = qgroup_if_io_tqg; + tqg = qgroup_softirq; fn = _task_fn_tx; break; case IFLIB_INTR_RX: q = &ctx->ifc_rxqs[qid]; gtask = &ctx->ifc_rxqs[qid].ifr_task; - tqg = qgroup_if_io_tqg; + tqg = qgroup_softirq; fn = _task_fn_rx; break; case IFLIB_INTR_IOV: q = ctx; gtask = &ctx->ifc_vflr_task; tqg = qgroup_if_config_tqg; rid = -1; fn = _task_fn_iov; break; default: panic("unknown net intr type"); } GROUPTASK_INIT(gtask, 0, fn, q); taskqgroup_attach(tqg, gtask, q, rid, name); } void iflib_irq_free(if_ctx_t ctx, if_irq_t irq) { if (irq->ii_tag) bus_teardown_intr(ctx->ifc_dev, irq->ii_res, irq->ii_tag); if (irq->ii_res) bus_release_resource(ctx->ifc_dev, SYS_RES_IRQ, irq->ii_rid, irq->ii_res); } static int iflib_legacy_setup(if_ctx_t ctx, driver_filter_t filter, void *filter_arg, int *rid, char *name) { iflib_txq_t txq = ctx->ifc_txqs; iflib_rxq_t rxq = ctx->ifc_rxqs; if_irq_t irq = &ctx->ifc_legacy_irq; iflib_filter_info_t info; struct grouptask *gtask; struct taskqgroup *tqg; gtask_fn_t *fn; int tqrid; void *q; int err; q = &ctx->ifc_rxqs[0]; info = &rxq[0].ifr_filter_info; gtask = &rxq[0].ifr_task; - tqg = qgroup_if_io_tqg; + tqg = qgroup_softirq; tqrid = irq->ii_rid = *rid; fn = _task_fn_rx; ctx->ifc_flags |= IFC_LEGACY; info->ifi_filter = filter; info->ifi_filter_arg = filter_arg; info->ifi_task = gtask; info->ifi_ctx = ctx; /* We allocate a single interrupt resource */ if ((err = _iflib_irq_alloc(ctx, irq, tqrid, iflib_fast_intr, NULL, info, name)) != 0) return (err); GROUPTASK_INIT(gtask, 0, fn, q); taskqgroup_attach(tqg, gtask, q, tqrid, name); GROUPTASK_INIT(&txq->ift_task, 0, _task_fn_tx, txq); - taskqgroup_attach(qgroup_if_io_tqg, &txq->ift_task, txq, tqrid, "tx"); + taskqgroup_attach(qgroup_softirq, &txq->ift_task, txq, tqrid, "tx"); return (0); } void iflib_led_create(if_ctx_t ctx) { ctx->ifc_led_dev = led_create(iflib_led_func, ctx, device_get_nameunit(ctx->ifc_dev)); } void iflib_tx_intr_deferred(if_ctx_t ctx, int txqid) { GROUPTASK_ENQUEUE(&ctx->ifc_txqs[txqid].ift_task); } void iflib_rx_intr_deferred(if_ctx_t ctx, int rxqid) { GROUPTASK_ENQUEUE(&ctx->ifc_rxqs[rxqid].ifr_task); } void iflib_admin_intr_deferred(if_ctx_t ctx) { #ifdef INVARIANTS struct grouptask *gtask; gtask = &ctx->ifc_admin_task; MPASS(gtask->gt_taskqueue != NULL); #endif GROUPTASK_ENQUEUE(&ctx->ifc_admin_task); } void iflib_iov_intr_deferred(if_ctx_t ctx) { GROUPTASK_ENQUEUE(&ctx->ifc_vflr_task); } void iflib_io_tqg_attach(struct grouptask *gt, void *uniq, int cpu, char *name) { - taskqgroup_attach_cpu(qgroup_if_io_tqg, gt, uniq, cpu, -1, name); + taskqgroup_attach_cpu(qgroup_softirq, gt, uniq, cpu, -1, name); } void iflib_config_gtask_init(if_ctx_t ctx, struct grouptask *gtask, gtask_fn_t *fn, char *name) { GROUPTASK_INIT(gtask, 0, fn, ctx); taskqgroup_attach(qgroup_if_config_tqg, gtask, gtask, -1, name); } void iflib_config_gtask_deinit(struct grouptask *gtask) { taskqgroup_detach(qgroup_if_config_tqg, gtask); } void iflib_link_state_change(if_ctx_t ctx, int link_state, uint64_t baudrate) { if_t ifp = ctx->ifc_ifp; iflib_txq_t txq = ctx->ifc_txqs; if_setbaudrate(ifp, baudrate); /* If link down, disable watchdog */ if ((ctx->ifc_link_state == LINK_STATE_UP) && (link_state == LINK_STATE_DOWN)) { for (int i = 0; i < ctx->ifc_softc_ctx.isc_ntxqsets; i++, txq++) txq->ift_qstatus = IFLIB_QUEUE_IDLE; } ctx->ifc_link_state = link_state; if_link_state_change(ifp, link_state); } static int iflib_tx_credits_update(if_ctx_t ctx, iflib_txq_t txq) { int credits; #ifdef INVARIANTS int credits_pre = txq->ift_cidx_processed; #endif if (ctx->isc_txd_credits_update == NULL) return (0); if ((credits = ctx->isc_txd_credits_update(ctx->ifc_softc, txq->ift_id, txq->ift_cidx_processed, true)) == 0) return (0); txq->ift_processed += credits; txq->ift_cidx_processed += credits; MPASS(credits_pre + credits == txq->ift_cidx_processed); if (txq->ift_cidx_processed >= txq->ift_size) txq->ift_cidx_processed -= txq->ift_size; return (credits); } static int iflib_rxd_avail(if_ctx_t ctx, iflib_rxq_t rxq, int cidx, int budget) { return (ctx->isc_rxd_available(ctx->ifc_softc, rxq->ifr_id, cidx, budget)); } void iflib_add_int_delay_sysctl(if_ctx_t ctx, const char *name, const char *description, if_int_delay_info_t info, int offset, int value) { info->iidi_ctx = ctx; info->iidi_offset = offset; info->iidi_value = value; SYSCTL_ADD_PROC(device_get_sysctl_ctx(ctx->ifc_dev), SYSCTL_CHILDREN(device_get_sysctl_tree(ctx->ifc_dev)), OID_AUTO, name, CTLTYPE_INT|CTLFLAG_RW, info, 0, iflib_sysctl_int_delay, "I", description); } struct mtx * iflib_ctx_lock_get(if_ctx_t ctx) { return (&ctx->ifc_mtx); } static int iflib_msix_init(if_ctx_t ctx) { device_t dev = ctx->ifc_dev; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; int vectors, queues, rx_queues, tx_queues, queuemsgs, msgs; int iflib_num_tx_queues, iflib_num_rx_queues; int err, admincnt, bar; iflib_num_tx_queues = scctx->isc_ntxqsets; iflib_num_rx_queues = scctx->isc_nrxqsets; device_printf(dev, "msix_init qsets capped at %d\n", iflib_num_tx_queues); bar = ctx->ifc_softc_ctx.isc_msix_bar; admincnt = sctx->isc_admin_intrcnt; /* Override by tuneable */ if (enable_msix == 0) goto msi; /* ** When used in a virtualized environment ** PCI BUSMASTER capability may not be set ** so explicity set it here and rewrite ** the ENABLE in the MSIX control register ** at this point to cause the host to ** successfully initialize us. */ { int msix_ctrl, rid; pci_enable_busmaster(dev); rid = 0; if (pci_find_cap(dev, PCIY_MSIX, &rid) == 0 && rid != 0) { rid += PCIR_MSIX_CTRL; msix_ctrl = pci_read_config(dev, rid, 2); msix_ctrl |= PCIM_MSIXCTRL_MSIX_ENABLE; pci_write_config(dev, rid, msix_ctrl, 2); } else { device_printf(dev, "PCIY_MSIX capability not found; " "or rid %d == 0.\n", rid); goto msi; } } /* * bar == -1 => "trust me I know what I'm doing" * https://www.youtube.com/watch?v=nnwWKkNau4I * Some drivers are for hardware that is so shoddily * documented that no one knows which bars are which * so the developer has to map all bars. This hack * allows shoddy garbage to use msix in this framework. */ if (bar != -1) { ctx->ifc_msix_mem = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &bar, RF_ACTIVE); if (ctx->ifc_msix_mem == NULL) { /* May not be enabled */ device_printf(dev, "Unable to map MSIX table \n"); goto msi; } } /* First try MSI/X */ if ((msgs = pci_msix_count(dev)) == 0) { /* system has msix disabled */ device_printf(dev, "System has MSIX disabled \n"); bus_release_resource(dev, SYS_RES_MEMORY, bar, ctx->ifc_msix_mem); ctx->ifc_msix_mem = NULL; goto msi; } #if IFLIB_DEBUG /* use only 1 qset in debug mode */ queuemsgs = min(msgs - admincnt, 1); #else queuemsgs = msgs - admincnt; #endif if (bus_get_cpus(dev, INTR_CPUS, sizeof(ctx->ifc_cpus), &ctx->ifc_cpus) == 0) { #ifdef RSS queues = imin(queuemsgs, rss_getnumbuckets()); #else queues = queuemsgs; #endif queues = imin(CPU_COUNT(&ctx->ifc_cpus), queues); device_printf(dev, "pxm cpus: %d queue msgs: %d admincnt: %d\n", CPU_COUNT(&ctx->ifc_cpus), queuemsgs, admincnt); } else { device_printf(dev, "Unable to fetch CPU list\n"); /* Figure out a reasonable auto config value */ queues = min(queuemsgs, mp_ncpus); } #ifdef RSS /* If we're doing RSS, clamp at the number of RSS buckets */ if (queues > rss_getnumbuckets()) queues = rss_getnumbuckets(); #endif if (iflib_num_rx_queues > 0 && iflib_num_rx_queues < queuemsgs - admincnt) rx_queues = iflib_num_rx_queues; else rx_queues = queues; /* * We want this to be all logical CPUs by default */ if (iflib_num_tx_queues > 0 && iflib_num_tx_queues < queues) tx_queues = iflib_num_tx_queues; else tx_queues = mp_ncpus; if (ctx->ifc_sysctl_qs_eq_override == 0) { #ifdef INVARIANTS if (tx_queues != rx_queues) device_printf(dev, "queue equality override not set, capping rx_queues at %d and tx_queues at %d\n", min(rx_queues, tx_queues), min(rx_queues, tx_queues)); #endif tx_queues = min(rx_queues, tx_queues); rx_queues = min(rx_queues, tx_queues); } device_printf(dev, "using %d rx queues %d tx queues \n", rx_queues, tx_queues); vectors = rx_queues + admincnt; if ((err = pci_alloc_msix(dev, &vectors)) == 0) { device_printf(dev, "Using MSIX interrupts with %d vectors\n", vectors); scctx->isc_vectors = vectors; scctx->isc_nrxqsets = rx_queues; scctx->isc_ntxqsets = tx_queues; scctx->isc_intr = IFLIB_INTR_MSIX; return (vectors); } else { device_printf(dev, "failed to allocate %d msix vectors, err: %d - using MSI\n", vectors, err); } msi: vectors = pci_msi_count(dev); scctx->isc_nrxqsets = 1; scctx->isc_ntxqsets = 1; scctx->isc_vectors = vectors; if (vectors == 1 && pci_alloc_msi(dev, &vectors) == 0) { device_printf(dev,"Using an MSI interrupt\n"); scctx->isc_intr = IFLIB_INTR_MSI; } else { device_printf(dev,"Using a Legacy interrupt\n"); scctx->isc_intr = IFLIB_INTR_LEGACY; } return (vectors); } char * ring_states[] = { "IDLE", "BUSY", "STALLED", "ABDICATED" }; static int mp_ring_state_handler(SYSCTL_HANDLER_ARGS) { int rc; uint16_t *state = ((uint16_t *)oidp->oid_arg1); struct sbuf *sb; char *ring_state = "UNKNOWN"; /* XXX needed ? */ rc = sysctl_wire_old_buffer(req, 0); MPASS(rc == 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 80, req); MPASS(sb != NULL); if (sb == NULL) return (ENOMEM); if (state[3] <= 3) ring_state = ring_states[state[3]]; sbuf_printf(sb, "pidx_head: %04hd pidx_tail: %04hd cidx: %04hd state: %s", state[0], state[1], state[2], ring_state); rc = sbuf_finish(sb); sbuf_delete(sb); return(rc); } enum iflib_ndesc_handler { IFLIB_NTXD_HANDLER, IFLIB_NRXD_HANDLER, }; static int mp_ndesc_handler(SYSCTL_HANDLER_ARGS) { if_ctx_t ctx = (void *)arg1; enum iflib_ndesc_handler type = arg2; char buf[256] = {0}; uint16_t *ndesc; char *p, *next; int nqs, rc, i; MPASS(type == IFLIB_NTXD_HANDLER || type == IFLIB_NRXD_HANDLER); nqs = 8; switch(type) { case IFLIB_NTXD_HANDLER: ndesc = ctx->ifc_sysctl_ntxds; if (ctx->ifc_sctx) nqs = ctx->ifc_sctx->isc_ntxqs; break; case IFLIB_NRXD_HANDLER: ndesc = ctx->ifc_sysctl_nrxds; if (ctx->ifc_sctx) nqs = ctx->ifc_sctx->isc_nrxqs; break; } if (nqs == 0) nqs = 8; for (i=0; i<8; i++) { if (i >= nqs) break; if (i) strcat(buf, ","); sprintf(strchr(buf, 0), "%d", ndesc[i]); } rc = sysctl_handle_string(oidp, buf, sizeof(buf), req); if (rc || req->newptr == NULL) return rc; for (i = 0, next = buf, p = strsep(&next, " ,"); i < 8 && p; i++, p = strsep(&next, " ,")) { ndesc[i] = strtoul(p, NULL, 10); } return(rc); } #define NAME_BUFLEN 32 static void iflib_add_device_sysctl_pre(if_ctx_t ctx) { device_t dev = iflib_get_dev(ctx); struct sysctl_oid_list *child, *oid_list; struct sysctl_ctx_list *ctx_list; struct sysctl_oid *node; ctx_list = device_get_sysctl_ctx(dev); child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev)); ctx->ifc_sysctl_node = node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, "iflib", CTLFLAG_RD, NULL, "IFLIB fields"); oid_list = SYSCTL_CHILDREN(node); SYSCTL_ADD_STRING(ctx_list, oid_list, OID_AUTO, "driver_version", CTLFLAG_RD, ctx->ifc_sctx->isc_driver_version, 0, "driver version"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_ntxqs", CTLFLAG_RWTUN, &ctx->ifc_sysctl_ntxqs, 0, "# of txqs to use, 0 => use default #"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_nrxqs", CTLFLAG_RWTUN, &ctx->ifc_sysctl_nrxqs, 0, "# of rxqs to use, 0 => use default #"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_qs_enable", CTLFLAG_RWTUN, &ctx->ifc_sysctl_qs_eq_override, 0, "permit #txq != #rxq"); /* XXX change for per-queue sizes */ SYSCTL_ADD_PROC(ctx_list, oid_list, OID_AUTO, "override_ntxds", CTLTYPE_STRING|CTLFLAG_RWTUN, ctx, IFLIB_NTXD_HANDLER, mp_ndesc_handler, "A", "list of # of tx descriptors to use, 0 = use default #"); SYSCTL_ADD_PROC(ctx_list, oid_list, OID_AUTO, "override_nrxds", CTLTYPE_STRING|CTLFLAG_RWTUN, ctx, IFLIB_NRXD_HANDLER, mp_ndesc_handler, "A", "list of # of rx descriptors to use, 0 = use default #"); } static void iflib_add_device_sysctl_post(if_ctx_t ctx) { if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = iflib_get_dev(ctx); struct sysctl_oid_list *child; struct sysctl_ctx_list *ctx_list; iflib_fl_t fl; iflib_txq_t txq; iflib_rxq_t rxq; int i, j; char namebuf[NAME_BUFLEN]; char *qfmt; struct sysctl_oid *queue_node, *fl_node, *node; struct sysctl_oid_list *queue_list, *fl_list; ctx_list = device_get_sysctl_ctx(dev); node = ctx->ifc_sysctl_node; child = SYSCTL_CHILDREN(node); if (scctx->isc_ntxqsets > 100) qfmt = "txq%03d"; else if (scctx->isc_ntxqsets > 10) qfmt = "txq%02d"; else qfmt = "txq%d"; for (i = 0, txq = ctx->ifc_txqs; i < scctx->isc_ntxqsets; i++, txq++) { snprintf(namebuf, NAME_BUFLEN, qfmt, i); queue_node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, namebuf, CTLFLAG_RD, NULL, "Queue Name"); queue_list = SYSCTL_CHILDREN(queue_node); #if MEMORY_LOGGING SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_dequeued", CTLFLAG_RD, &txq->ift_dequeued, "total mbufs freed"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_enqueued", CTLFLAG_RD, &txq->ift_enqueued, "total mbufs enqueued"); #endif SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "mbuf_defrag", CTLFLAG_RD, &txq->ift_mbuf_defrag, "# of times m_defrag was called"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "m_pullups", CTLFLAG_RD, &txq->ift_pullups, "# of times m_pullup was called"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "mbuf_defrag_failed", CTLFLAG_RD, &txq->ift_mbuf_defrag_failed, "# of times m_defrag failed"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "no_desc_avail", CTLFLAG_RD, &txq->ift_no_desc_avail, "# of times no descriptors were available"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "tx_map_failed", CTLFLAG_RD, &txq->ift_map_failed, "# of times dma map failed"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txd_encap_efbig", CTLFLAG_RD, &txq->ift_txd_encap_efbig, "# of times txd_encap returned EFBIG"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "no_tx_dma_setup", CTLFLAG_RD, &txq->ift_no_tx_dma_setup, "# of times map failed for other than EFBIG"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_pidx", CTLFLAG_RD, &txq->ift_pidx, 1, "Producer Index"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_cidx", CTLFLAG_RD, &txq->ift_cidx, 1, "Consumer Index"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_cidx_processed", CTLFLAG_RD, &txq->ift_cidx_processed, 1, "Consumer Index seen by credit update"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_in_use", CTLFLAG_RD, &txq->ift_in_use, 1, "descriptors in use"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_processed", CTLFLAG_RD, &txq->ift_processed, "descriptors procesed for clean"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_cleaned", CTLFLAG_RD, &txq->ift_cleaned, "total cleaned"); SYSCTL_ADD_PROC(ctx_list, queue_list, OID_AUTO, "ring_state", CTLTYPE_STRING | CTLFLAG_RD, __DEVOLATILE(uint64_t *, &txq->ift_br[0]->state), 0, mp_ring_state_handler, "A", "soft ring state"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_enqueues", CTLFLAG_RD, &txq->ift_br[0]->enqueues, "# of enqueues to the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_drops", CTLFLAG_RD, &txq->ift_br[0]->drops, "# of drops in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_starts", CTLFLAG_RD, &txq->ift_br[0]->starts, "# of normal consumer starts in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_stalls", CTLFLAG_RD, &txq->ift_br[0]->stalls, "# of consumer stalls in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_restarts", CTLFLAG_RD, &txq->ift_br[0]->restarts, "# of consumer restarts in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_abdications", CTLFLAG_RD, &txq->ift_br[0]->abdications, "# of consumer abdications in the mp_ring for this queue"); } if (scctx->isc_nrxqsets > 100) qfmt = "rxq%03d"; else if (scctx->isc_nrxqsets > 10) qfmt = "rxq%02d"; else qfmt = "rxq%d"; for (i = 0, rxq = ctx->ifc_rxqs; i < scctx->isc_nrxqsets; i++, rxq++) { snprintf(namebuf, NAME_BUFLEN, qfmt, i); queue_node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, namebuf, CTLFLAG_RD, NULL, "Queue Name"); queue_list = SYSCTL_CHILDREN(queue_node); if (sctx->isc_flags & IFLIB_HAS_RXCQ) { SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "rxq_cq_pidx", CTLFLAG_RD, &rxq->ifr_cq_pidx, 1, "Producer Index"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "rxq_cq_cidx", CTLFLAG_RD, &rxq->ifr_cq_cidx, 1, "Consumer Index"); } for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) { snprintf(namebuf, NAME_BUFLEN, "rxq_fl%d", j); fl_node = SYSCTL_ADD_NODE(ctx_list, queue_list, OID_AUTO, namebuf, CTLFLAG_RD, NULL, "freelist Name"); fl_list = SYSCTL_CHILDREN(fl_node); SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "pidx", CTLFLAG_RD, &fl->ifl_pidx, 1, "Producer Index"); SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "cidx", CTLFLAG_RD, &fl->ifl_cidx, 1, "Consumer Index"); SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "credits", CTLFLAG_RD, &fl->ifl_credits, 1, "credits available"); #if MEMORY_LOGGING SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_m_enqueued", CTLFLAG_RD, &fl->ifl_m_enqueued, "mbufs allocated"); SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_m_dequeued", CTLFLAG_RD, &fl->ifl_m_dequeued, "mbufs freed"); SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_cl_enqueued", CTLFLAG_RD, &fl->ifl_cl_enqueued, "clusters allocated"); SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_cl_dequeued", CTLFLAG_RD, &fl->ifl_cl_dequeued, "clusters freed"); #endif } } } Index: projects/clang400-import/sys/netpfil/ipfw/ip_fw2.c =================================================================== --- projects/clang400-import/sys/netpfil/ipfw/ip_fw2.c (revision 314522) +++ projects/clang400-import/sys/netpfil/ipfw/ip_fw2.c (revision 314523) @@ -1,2941 +1,2948 @@ /*- * Copyright (c) 2002-2009 Luigi Rizzo, Universita` di Pisa * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); /* * The FreeBSD IP packet firewall, main file */ #include "opt_ipfw.h" #include "opt_ipdivert.h" #include "opt_inet.h" #ifndef INET #error "IPFIREWALL requires INET" #endif /* INET */ #include "opt_inet6.h" #include "opt_ipsec.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* for ETHERTYPE_IP */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #include #include #include #endif #include #include /* XXX for in_cksum */ #ifdef MAC #include #endif /* * static variables followed by global ones. * All ipfw global variables are here. */ static VNET_DEFINE(int, fw_deny_unknown_exthdrs); #define V_fw_deny_unknown_exthdrs VNET(fw_deny_unknown_exthdrs) static VNET_DEFINE(int, fw_permit_single_frag6) = 1; #define V_fw_permit_single_frag6 VNET(fw_permit_single_frag6) #ifdef IPFIREWALL_DEFAULT_TO_ACCEPT static int default_to_accept = 1; #else static int default_to_accept; #endif VNET_DEFINE(int, autoinc_step); VNET_DEFINE(int, fw_one_pass) = 1; VNET_DEFINE(unsigned int, fw_tables_max); VNET_DEFINE(unsigned int, fw_tables_sets) = 0; /* Don't use set-aware tables */ /* Use 128 tables by default */ static unsigned int default_fw_tables = IPFW_TABLES_DEFAULT; #ifndef LINEAR_SKIPTO static int jump_fast(struct ip_fw_chain *chain, struct ip_fw *f, int num, int tablearg, int jump_backwards); #define JUMP(ch, f, num, targ, back) jump_fast(ch, f, num, targ, back) #else static int jump_linear(struct ip_fw_chain *chain, struct ip_fw *f, int num, int tablearg, int jump_backwards); #define JUMP(ch, f, num, targ, back) jump_linear(ch, f, num, targ, back) #endif /* * Each rule belongs to one of 32 different sets (0..31). * The variable set_disable contains one bit per set. * If the bit is set, all rules in the corresponding set * are disabled. Set RESVD_SET(31) is reserved for the default rule * and rules that are not deleted by the flush command, * and CANNOT be disabled. * Rules in set RESVD_SET can only be deleted individually. */ VNET_DEFINE(u_int32_t, set_disable); #define V_set_disable VNET(set_disable) VNET_DEFINE(int, fw_verbose); /* counter for ipfw_log(NULL...) */ VNET_DEFINE(u_int64_t, norule_counter); VNET_DEFINE(int, verbose_limit); /* layer3_chain contains the list of rules for layer 3 */ VNET_DEFINE(struct ip_fw_chain, layer3_chain); /* ipfw_vnet_ready controls when we are open for business */ VNET_DEFINE(int, ipfw_vnet_ready) = 0; VNET_DEFINE(int, ipfw_nat_ready) = 0; ipfw_nat_t *ipfw_nat_ptr = NULL; struct cfg_nat *(*lookup_nat_ptr)(struct nat_list *, int); ipfw_nat_cfg_t *ipfw_nat_cfg_ptr; ipfw_nat_cfg_t *ipfw_nat_del_ptr; ipfw_nat_cfg_t *ipfw_nat_get_cfg_ptr; ipfw_nat_cfg_t *ipfw_nat_get_log_ptr; #ifdef SYSCTL_NODE uint32_t dummy_def = IPFW_DEFAULT_RULE; static int sysctl_ipfw_table_num(SYSCTL_HANDLER_ARGS); static int sysctl_ipfw_tables_sets(SYSCTL_HANDLER_ARGS); SYSBEGIN(f3) SYSCTL_NODE(_net_inet_ip, OID_AUTO, fw, CTLFLAG_RW, 0, "Firewall"); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, one_pass, CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fw_one_pass), 0, "Only do a single pass through ipfw when using dummynet(4)"); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, autoinc_step, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(autoinc_step), 0, "Rule number auto-increment step"); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, verbose, CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fw_verbose), 0, "Log matches to ipfw rules"); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, verbose_limit, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(verbose_limit), 0, "Set upper limit of matches of ipfw rules logged"); SYSCTL_UINT(_net_inet_ip_fw, OID_AUTO, default_rule, CTLFLAG_RD, &dummy_def, 0, "The default/max possible rule number."); SYSCTL_PROC(_net_inet_ip_fw, OID_AUTO, tables_max, CTLFLAG_VNET | CTLTYPE_UINT | CTLFLAG_RW, 0, 0, sysctl_ipfw_table_num, "IU", "Maximum number of concurrently used tables"); SYSCTL_PROC(_net_inet_ip_fw, OID_AUTO, tables_sets, CTLFLAG_VNET | CTLTYPE_UINT | CTLFLAG_RW, 0, 0, sysctl_ipfw_tables_sets, "IU", "Use per-set namespace for tables"); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, default_to_accept, CTLFLAG_RDTUN, &default_to_accept, 0, "Make the default rule accept all packets."); TUNABLE_INT("net.inet.ip.fw.tables_max", (int *)&default_fw_tables); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, static_count, CTLFLAG_VNET | CTLFLAG_RD, &VNET_NAME(layer3_chain.n_rules), 0, "Number of static rules"); #ifdef INET6 SYSCTL_DECL(_net_inet6_ip6); SYSCTL_NODE(_net_inet6_ip6, OID_AUTO, fw, CTLFLAG_RW, 0, "Firewall"); SYSCTL_INT(_net_inet6_ip6_fw, OID_AUTO, deny_unknown_exthdrs, CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE, &VNET_NAME(fw_deny_unknown_exthdrs), 0, "Deny packets with unknown IPv6 Extension Headers"); SYSCTL_INT(_net_inet6_ip6_fw, OID_AUTO, permit_single_frag6, CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE, &VNET_NAME(fw_permit_single_frag6), 0, "Permit single packet IPv6 fragments"); #endif /* INET6 */ SYSEND #endif /* SYSCTL_NODE */ /* * Some macros used in the various matching options. * L3HDR maps an ipv4 pointer into a layer3 header pointer of type T * Other macros just cast void * into the appropriate type */ #define L3HDR(T, ip) ((T *)((u_int32_t *)(ip) + (ip)->ip_hl)) #define TCP(p) ((struct tcphdr *)(p)) #define SCTP(p) ((struct sctphdr *)(p)) #define UDP(p) ((struct udphdr *)(p)) #define ICMP(p) ((struct icmphdr *)(p)) #define ICMP6(p) ((struct icmp6_hdr *)(p)) static __inline int icmptype_match(struct icmphdr *icmp, ipfw_insn_u32 *cmd) { int type = icmp->icmp_type; return (type <= ICMP_MAXTYPE && (cmd->d[0] & (1<icmp_type; return (type <= ICMP_MAXTYPE && (TT & (1<arg1 or cmd->d[0]. * * We scan options and store the bits we find set. We succeed if * * (want_set & ~bits) == 0 && (want_clear & ~bits) == want_clear * * The code is sometimes optimized not to store additional variables. */ static int flags_match(ipfw_insn *cmd, u_int8_t bits) { u_char want_clear; bits = ~bits; if ( ((cmd->arg1 & 0xff) & bits) != 0) return 0; /* some bits we want set were clear */ want_clear = (cmd->arg1 >> 8) & 0xff; if ( (want_clear & bits) != want_clear) return 0; /* some bits we want clear were set */ return 1; } static int ipopts_match(struct ip *ip, ipfw_insn *cmd) { int optlen, bits = 0; u_char *cp = (u_char *)(ip + 1); int x = (ip->ip_hl << 2) - sizeof (struct ip); for (; x > 0; x -= optlen, cp += optlen) { int opt = cp[IPOPT_OPTVAL]; if (opt == IPOPT_EOL) break; if (opt == IPOPT_NOP) optlen = 1; else { optlen = cp[IPOPT_OLEN]; if (optlen <= 0 || optlen > x) return 0; /* invalid or truncated */ } switch (opt) { default: break; case IPOPT_LSRR: bits |= IP_FW_IPOPT_LSRR; break; case IPOPT_SSRR: bits |= IP_FW_IPOPT_SSRR; break; case IPOPT_RR: bits |= IP_FW_IPOPT_RR; break; case IPOPT_TS: bits |= IP_FW_IPOPT_TS; break; } } return (flags_match(cmd, bits)); } static int tcpopts_match(struct tcphdr *tcp, ipfw_insn *cmd) { int optlen, bits = 0; u_char *cp = (u_char *)(tcp + 1); int x = (tcp->th_off << 2) - sizeof(struct tcphdr); for (; x > 0; x -= optlen, cp += optlen) { int opt = cp[0]; if (opt == TCPOPT_EOL) break; if (opt == TCPOPT_NOP) optlen = 1; else { optlen = cp[1]; if (optlen <= 0) break; } switch (opt) { default: break; case TCPOPT_MAXSEG: bits |= IP_FW_TCPOPT_MSS; break; case TCPOPT_WINDOW: bits |= IP_FW_TCPOPT_WINDOW; break; case TCPOPT_SACK_PERMITTED: case TCPOPT_SACK: bits |= IP_FW_TCPOPT_SACK; break; case TCPOPT_TIMESTAMP: bits |= IP_FW_TCPOPT_TS; break; } } return (flags_match(cmd, bits)); } static int iface_match(struct ifnet *ifp, ipfw_insn_if *cmd, struct ip_fw_chain *chain, uint32_t *tablearg) { if (ifp == NULL) /* no iface with this packet, match fails */ return (0); /* Check by name or by IP address */ if (cmd->name[0] != '\0') { /* match by name */ if (cmd->name[0] == '\1') /* use tablearg to match */ return ipfw_lookup_table_extended(chain, cmd->p.kidx, 0, &ifp->if_index, tablearg); /* Check name */ if (cmd->p.glob) { if (fnmatch(cmd->name, ifp->if_xname, 0) == 0) return(1); } else { if (strncmp(ifp->if_xname, cmd->name, IFNAMSIZ) == 0) return(1); } } else { #if !defined(USERSPACE) && defined(__FreeBSD__) /* and OSX too ? */ struct ifaddr *ia; if_addr_rlock(ifp); TAILQ_FOREACH(ia, &ifp->if_addrhead, ifa_link) { if (ia->ifa_addr->sa_family != AF_INET) continue; if (cmd->p.ip.s_addr == ((struct sockaddr_in *) (ia->ifa_addr))->sin_addr.s_addr) { if_addr_runlock(ifp); return(1); /* match */ } } if_addr_runlock(ifp); #endif /* __FreeBSD__ */ } return(0); /* no match, fail ... */ } /* * The verify_path function checks if a route to the src exists and * if it is reachable via ifp (when provided). * * The 'verrevpath' option checks that the interface that an IP packet * arrives on is the same interface that traffic destined for the * packet's source address would be routed out of. * The 'versrcreach' option just checks that the source address is * reachable via any route (except default) in the routing table. * These two are a measure to block forged packets. This is also * commonly known as "anti-spoofing" or Unicast Reverse Path * Forwarding (Unicast RFP) in Cisco-ese. The name of the knobs * is purposely reminiscent of the Cisco IOS command, * * ip verify unicast reverse-path * ip verify unicast source reachable-via any * * which implements the same functionality. But note that the syntax * is misleading, and the check may be performed on all IP packets * whether unicast, multicast, or broadcast. */ static int verify_path(struct in_addr src, struct ifnet *ifp, u_int fib) { #if defined(USERSPACE) || !defined(__FreeBSD__) return 0; #else struct nhop4_basic nh4; if (fib4_lookup_nh_basic(fib, src, NHR_IFAIF, 0, &nh4) != 0) return (0); /* * If ifp is provided, check for equality with rtentry. * We should use rt->rt_ifa->ifa_ifp, instead of rt->rt_ifp, * in order to pass packets injected back by if_simloop(): * routing entry (via lo0) for our own address * may exist, so we need to handle routing assymetry. */ if (ifp != NULL && ifp != nh4.nh_ifp) return (0); /* if no ifp provided, check if rtentry is not default route */ if (ifp == NULL && (nh4.nh_flags & NHF_DEFAULT) != 0) return (0); /* or if this is a blackhole/reject route */ if (ifp == NULL && (nh4.nh_flags & (NHF_REJECT|NHF_BLACKHOLE)) != 0) return (0); /* found valid route */ return 1; #endif /* __FreeBSD__ */ } #ifdef INET6 /* * ipv6 specific rules here... */ static __inline int icmp6type_match (int type, ipfw_insn_u32 *cmd) { return (type <= ICMP6_MAXTYPE && (cmd->d[type/32] & (1<<(type%32)) ) ); } static int flow6id_match( int curr_flow, ipfw_insn_u32 *cmd ) { int i; for (i=0; i <= cmd->o.arg1; ++i ) if (curr_flow == cmd->d[i] ) return 1; return 0; } /* support for IP6_*_ME opcodes */ static const struct in6_addr lla_mask = {{{ 0xff, 0xff, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff }}}; static int ipfw_localip6(struct in6_addr *in6) { struct rm_priotracker in6_ifa_tracker; struct in6_ifaddr *ia; if (IN6_IS_ADDR_MULTICAST(in6)) return (0); if (!IN6_IS_ADDR_LINKLOCAL(in6)) return (in6_localip(in6)); IN6_IFADDR_RLOCK(&in6_ifa_tracker); TAILQ_FOREACH(ia, &V_in6_ifaddrhead, ia_link) { if (!IN6_IS_ADDR_LINKLOCAL(&ia->ia_addr.sin6_addr)) continue; if (IN6_ARE_MASKED_ADDR_EQUAL(&ia->ia_addr.sin6_addr, in6, &lla_mask)) { IN6_IFADDR_RUNLOCK(&in6_ifa_tracker); return (1); } } IN6_IFADDR_RUNLOCK(&in6_ifa_tracker); return (0); } static int verify_path6(struct in6_addr *src, struct ifnet *ifp, u_int fib) { struct nhop6_basic nh6; if (IN6_IS_SCOPE_LINKLOCAL(src)) return (1); if (fib6_lookup_nh_basic(fib, src, 0, NHR_IFAIF, 0, &nh6) != 0) return (0); /* If ifp is provided, check for equality with route table. */ if (ifp != NULL && ifp != nh6.nh_ifp) return (0); /* if no ifp provided, check if rtentry is not default route */ if (ifp == NULL && (nh6.nh_flags & NHF_DEFAULT) != 0) return (0); /* or if this is a blackhole/reject route */ if (ifp == NULL && (nh6.nh_flags & (NHF_REJECT|NHF_BLACKHOLE)) != 0) return (0); /* found valid route */ return 1; } static int is_icmp6_query(int icmp6_type) { if ((icmp6_type <= ICMP6_MAXTYPE) && (icmp6_type == ICMP6_ECHO_REQUEST || icmp6_type == ICMP6_MEMBERSHIP_QUERY || icmp6_type == ICMP6_WRUREQUEST || icmp6_type == ICMP6_FQDN_QUERY || icmp6_type == ICMP6_NI_QUERY)) return (1); return (0); } static void send_reject6(struct ip_fw_args *args, int code, u_int hlen, struct ip6_hdr *ip6) { struct mbuf *m; m = args->m; if (code == ICMP6_UNREACH_RST && args->f_id.proto == IPPROTO_TCP) { struct tcphdr *tcp; tcp = (struct tcphdr *)((char *)ip6 + hlen); if ((tcp->th_flags & TH_RST) == 0) { struct mbuf *m0; m0 = ipfw_send_pkt(args->m, &(args->f_id), ntohl(tcp->th_seq), ntohl(tcp->th_ack), tcp->th_flags | TH_RST); if (m0 != NULL) ip6_output(m0, NULL, NULL, 0, NULL, NULL, NULL); } FREE_PKT(m); } else if (code != ICMP6_UNREACH_RST) { /* Send an ICMPv6 unreach. */ #if 0 /* * Unlike above, the mbufs need to line up with the ip6 hdr, * as the contents are read. We need to m_adj() the * needed amount. * The mbuf will however be thrown away so we can adjust it. * Remember we did an m_pullup on it already so we * can make some assumptions about contiguousness. */ if (args->L3offset) m_adj(m, args->L3offset); #endif icmp6_error(m, ICMP6_DST_UNREACH, code, 0); } else FREE_PKT(m); args->m = NULL; } #endif /* INET6 */ /* * sends a reject message, consuming the mbuf passed as an argument. */ static void send_reject(struct ip_fw_args *args, int code, int iplen, struct ip *ip) { #if 0 /* XXX When ip is not guaranteed to be at mtod() we will * need to account for this */ * The mbuf will however be thrown away so we can adjust it. * Remember we did an m_pullup on it already so we * can make some assumptions about contiguousness. */ if (args->L3offset) m_adj(m, args->L3offset); #endif if (code != ICMP_REJECT_RST) { /* Send an ICMP unreach */ icmp_error(args->m, ICMP_UNREACH, code, 0L, 0); } else if (args->f_id.proto == IPPROTO_TCP) { struct tcphdr *const tcp = L3HDR(struct tcphdr, mtod(args->m, struct ip *)); if ( (tcp->th_flags & TH_RST) == 0) { struct mbuf *m; m = ipfw_send_pkt(args->m, &(args->f_id), ntohl(tcp->th_seq), ntohl(tcp->th_ack), tcp->th_flags | TH_RST); if (m != NULL) ip_output(m, NULL, NULL, 0, NULL, NULL); } FREE_PKT(args->m); } else FREE_PKT(args->m); args->m = NULL; } /* * Support for uid/gid/jail lookup. These tests are expensive * (because we may need to look into the list of active sockets) * so we cache the results. ugid_lookupp is 0 if we have not * yet done a lookup, 1 if we succeeded, and -1 if we tried * and failed. The function always returns the match value. * We could actually spare the variable and use *uc, setting * it to '(void *)check_uidgid if we have no info, NULL if * we tried and failed, or any other value if successful. */ static int check_uidgid(ipfw_insn_u32 *insn, struct ip_fw_args *args, int *ugid_lookupp, struct ucred **uc) { #if defined(USERSPACE) return 0; // not supported in userspace #else #ifndef __FreeBSD__ /* XXX */ return cred_check(insn, proto, oif, dst_ip, dst_port, src_ip, src_port, (struct bsd_ucred *)uc, ugid_lookupp, ((struct mbuf *)inp)->m_skb); #else /* FreeBSD */ struct in_addr src_ip, dst_ip; struct inpcbinfo *pi; struct ipfw_flow_id *id; struct inpcb *pcb, *inp; struct ifnet *oif; int lookupflags; int match; id = &args->f_id; inp = args->inp; oif = args->oif; /* * Check to see if the UDP or TCP stack supplied us with * the PCB. If so, rather then holding a lock and looking * up the PCB, we can use the one that was supplied. */ if (inp && *ugid_lookupp == 0) { INP_LOCK_ASSERT(inp); if (inp->inp_socket != NULL) { *uc = crhold(inp->inp_cred); *ugid_lookupp = 1; } else *ugid_lookupp = -1; } /* * If we have already been here and the packet has no * PCB entry associated with it, then we can safely * assume that this is a no match. */ if (*ugid_lookupp == -1) return (0); if (id->proto == IPPROTO_TCP) { lookupflags = 0; pi = &V_tcbinfo; } else if (id->proto == IPPROTO_UDP) { lookupflags = INPLOOKUP_WILDCARD; pi = &V_udbinfo; } else return 0; lookupflags |= INPLOOKUP_RLOCKPCB; match = 0; if (*ugid_lookupp == 0) { if (id->addr_type == 6) { #ifdef INET6 if (oif == NULL) pcb = in6_pcblookup_mbuf(pi, &id->src_ip6, htons(id->src_port), &id->dst_ip6, htons(id->dst_port), lookupflags, oif, args->m); else pcb = in6_pcblookup_mbuf(pi, &id->dst_ip6, htons(id->dst_port), &id->src_ip6, htons(id->src_port), lookupflags, oif, args->m); #else *ugid_lookupp = -1; return (0); #endif } else { src_ip.s_addr = htonl(id->src_ip); dst_ip.s_addr = htonl(id->dst_ip); if (oif == NULL) pcb = in_pcblookup_mbuf(pi, src_ip, htons(id->src_port), dst_ip, htons(id->dst_port), lookupflags, oif, args->m); else pcb = in_pcblookup_mbuf(pi, dst_ip, htons(id->dst_port), src_ip, htons(id->src_port), lookupflags, oif, args->m); } if (pcb != NULL) { INP_RLOCK_ASSERT(pcb); *uc = crhold(pcb->inp_cred); *ugid_lookupp = 1; INP_RUNLOCK(pcb); } if (*ugid_lookupp == 0) { /* * We tried and failed, set the variable to -1 * so we will not try again on this packet. */ *ugid_lookupp = -1; return (0); } } if (insn->o.opcode == O_UID) match = ((*uc)->cr_uid == (uid_t)insn->d[0]); else if (insn->o.opcode == O_GID) match = groupmember((gid_t)insn->d[0], *uc); else if (insn->o.opcode == O_JAIL) match = ((*uc)->cr_prison->pr_id == (int)insn->d[0]); return (match); #endif /* __FreeBSD__ */ #endif /* not supported in userspace */ } /* * Helper function to set args with info on the rule after the matching * one. slot is precise, whereas we guess rule_id as they are * assigned sequentially. */ static inline void set_match(struct ip_fw_args *args, int slot, struct ip_fw_chain *chain) { args->rule.chain_id = chain->id; args->rule.slot = slot + 1; /* we use 0 as a marker */ args->rule.rule_id = 1 + chain->map[slot]->id; args->rule.rulenum = chain->map[slot]->rulenum; } #ifndef LINEAR_SKIPTO /* * Helper function to enable cached rule lookups using * cached_id and cached_pos fields in ipfw rule. */ static int jump_fast(struct ip_fw_chain *chain, struct ip_fw *f, int num, int tablearg, int jump_backwards) { int f_pos; /* If possible use cached f_pos (in f->cached_pos), * whose version is written in f->cached_id * (horrible hacks to avoid changing the ABI). */ if (num != IP_FW_TARG && f->cached_id == chain->id) f_pos = f->cached_pos; else { int i = IP_FW_ARG_TABLEARG(chain, num, skipto); /* make sure we do not jump backward */ if (jump_backwards == 0 && i <= f->rulenum) i = f->rulenum + 1; if (chain->idxmap != NULL) f_pos = chain->idxmap[i]; else f_pos = ipfw_find_rule(chain, i, 0); /* update the cache */ if (num != IP_FW_TARG) { f->cached_id = chain->id; f->cached_pos = f_pos; } } return (f_pos); } #else /* * Helper function to enable real fast rule lookups. */ static int jump_linear(struct ip_fw_chain *chain, struct ip_fw *f, int num, int tablearg, int jump_backwards) { int f_pos; num = IP_FW_ARG_TABLEARG(chain, num, skipto); /* make sure we do not jump backward */ if (jump_backwards == 0 && num <= f->rulenum) num = f->rulenum + 1; f_pos = chain->idxmap[num]; return (f_pos); } #endif #define TARG(k, f) IP_FW_ARG_TABLEARG(chain, k, f) /* * The main check routine for the firewall. * * All arguments are in args so we can modify them and return them * back to the caller. * * Parameters: * * args->m (in/out) The packet; we set to NULL when/if we nuke it. * Starts with the IP header. * args->eh (in) Mac header if present, NULL for layer3 packet. * args->L3offset Number of bytes bypassed if we came from L2. * e.g. often sizeof(eh) ** NOTYET ** * args->oif Outgoing interface, NULL if packet is incoming. * The incoming interface is in the mbuf. (in) * args->divert_rule (in/out) * Skip up to the first rule past this rule number; * upon return, non-zero port number for divert or tee. * * args->rule Pointer to the last matching rule (in/out) * args->next_hop Socket we are forwarding to (out). * args->next_hop6 IPv6 next hop we are forwarding to (out). * args->f_id Addresses grabbed from the packet (out) * args->rule.info a cookie depending on rule action * * Return value: * * IP_FW_PASS the packet must be accepted * IP_FW_DENY the packet must be dropped * IP_FW_DIVERT divert packet, port in m_tag * IP_FW_TEE tee packet, port in m_tag * IP_FW_DUMMYNET to dummynet, pipe in args->cookie * IP_FW_NETGRAPH into netgraph, cookie args->cookie * args->rule contains the matching rule, * args->rule.info has additional information. * */ int ipfw_chk(struct ip_fw_args *args) { /* * Local variables holding state while processing a packet: * * IMPORTANT NOTE: to speed up the processing of rules, there * are some assumption on the values of the variables, which * are documented here. Should you change them, please check * the implementation of the various instructions to make sure * that they still work. * * args->eh The MAC header. It is non-null for a layer2 * packet, it is NULL for a layer-3 packet. * **notyet** * args->L3offset Offset in the packet to the L3 (IP or equiv.) header. * * m | args->m Pointer to the mbuf, as received from the caller. * It may change if ipfw_chk() does an m_pullup, or if it * consumes the packet because it calls send_reject(). * XXX This has to change, so that ipfw_chk() never modifies * or consumes the buffer. * ip is the beginning of the ip(4 or 6) header. * Calculated by adding the L3offset to the start of data. * (Until we start using L3offset, the packet is * supposed to start with the ip header). */ struct mbuf *m = args->m; struct ip *ip = mtod(m, struct ip *); /* * For rules which contain uid/gid or jail constraints, cache * a copy of the users credentials after the pcb lookup has been * executed. This will speed up the processing of rules with * these types of constraints, as well as decrease contention * on pcb related locks. */ #ifndef __FreeBSD__ struct bsd_ucred ucred_cache; #else struct ucred *ucred_cache = NULL; #endif int ucred_lookup = 0; /* * oif | args->oif If NULL, ipfw_chk has been called on the * inbound path (ether_input, ip_input). * If non-NULL, ipfw_chk has been called on the outbound path * (ether_output, ip_output). */ struct ifnet *oif = args->oif; int f_pos = 0; /* index of current rule in the array */ int retval = 0; /* * hlen The length of the IP header. */ u_int hlen = 0; /* hlen >0 means we have an IP pkt */ /* * offset The offset of a fragment. offset != 0 means that * we have a fragment at this offset of an IPv4 packet. * offset == 0 means that (if this is an IPv4 packet) * this is the first or only fragment. * For IPv6 offset|ip6f_mf == 0 means there is no Fragment Header * or there is a single packet fragment (fragment header added * without needed). We will treat a single packet fragment as if * there was no fragment header (or log/block depending on the * V_fw_permit_single_frag6 sysctl setting). */ u_short offset = 0; u_short ip6f_mf = 0; /* * Local copies of addresses. They are only valid if we have * an IP packet. * * proto The protocol. Set to 0 for non-ip packets, * or to the protocol read from the packet otherwise. * proto != 0 means that we have an IPv4 packet. * * src_port, dst_port port numbers, in HOST format. Only * valid for TCP and UDP packets. * * src_ip, dst_ip ip addresses, in NETWORK format. * Only valid for IPv4 packets. */ uint8_t proto; uint16_t src_port = 0, dst_port = 0; /* NOTE: host format */ struct in_addr src_ip, dst_ip; /* NOTE: network format */ uint16_t iplen=0; int pktlen; uint16_t etype = 0; /* Host order stored ether type */ /* * dyn_dir = MATCH_UNKNOWN when rules unchecked, * MATCH_NONE when checked and not matched (q = NULL), * MATCH_FORWARD or MATCH_REVERSE otherwise (q != NULL) */ int dyn_dir = MATCH_UNKNOWN; uint16_t dyn_name = 0; ipfw_dyn_rule *q = NULL; struct ip_fw_chain *chain = &V_layer3_chain; /* * We store in ulp a pointer to the upper layer protocol header. * In the ipv4 case this is easy to determine from the header, * but for ipv6 we might have some additional headers in the middle. * ulp is NULL if not found. */ void *ulp = NULL; /* upper layer protocol pointer. */ /* XXX ipv6 variables */ int is_ipv6 = 0; uint8_t icmp6_type = 0; uint16_t ext_hd = 0; /* bits vector for extension header filtering */ /* end of ipv6 variables */ int is_ipv4 = 0; int done = 0; /* flag to exit the outer loop */ IPFW_RLOCK_TRACKER; if (m->m_flags & M_SKIP_FIREWALL || (! V_ipfw_vnet_ready)) return (IP_FW_PASS); /* accept */ dst_ip.s_addr = 0; /* make sure it is initialized */ src_ip.s_addr = 0; /* make sure it is initialized */ pktlen = m->m_pkthdr.len; args->f_id.fib = M_GETFIB(m); /* note mbuf not altered) */ proto = args->f_id.proto = 0; /* mark f_id invalid */ /* XXX 0 is a valid proto: IP/IPv6 Hop-by-Hop Option */ /* * PULLUP_TO(len, p, T) makes sure that len + sizeof(T) is contiguous, * then it sets p to point at the offset "len" in the mbuf. WARNING: the * pointer might become stale after other pullups (but we never use it * this way). */ #define PULLUP_TO(_len, p, T) PULLUP_LEN(_len, p, sizeof(T)) #define PULLUP_LEN(_len, p, T) \ do { \ int x = (_len) + T; \ if ((m)->m_len < x) { \ args->m = m = m_pullup(m, x); \ if (m == NULL) \ goto pullup_failed; \ } \ p = (mtod(m, char *) + (_len)); \ } while (0) /* * if we have an ether header, */ if (args->eh) etype = ntohs(args->eh->ether_type); /* Identify IP packets and fill up variables. */ if (pktlen >= sizeof(struct ip6_hdr) && (args->eh == NULL || etype == ETHERTYPE_IPV6) && ip->ip_v == 6) { struct ip6_hdr *ip6 = (struct ip6_hdr *)ip; is_ipv6 = 1; args->f_id.addr_type = 6; hlen = sizeof(struct ip6_hdr); proto = ip6->ip6_nxt; /* Search extension headers to find upper layer protocols */ while (ulp == NULL && offset == 0) { switch (proto) { case IPPROTO_ICMPV6: PULLUP_TO(hlen, ulp, struct icmp6_hdr); icmp6_type = ICMP6(ulp)->icmp6_type; break; case IPPROTO_TCP: PULLUP_TO(hlen, ulp, struct tcphdr); dst_port = TCP(ulp)->th_dport; src_port = TCP(ulp)->th_sport; /* save flags for dynamic rules */ args->f_id._flags = TCP(ulp)->th_flags; break; case IPPROTO_SCTP: PULLUP_TO(hlen, ulp, struct sctphdr); src_port = SCTP(ulp)->src_port; dst_port = SCTP(ulp)->dest_port; break; case IPPROTO_UDP: PULLUP_TO(hlen, ulp, struct udphdr); dst_port = UDP(ulp)->uh_dport; src_port = UDP(ulp)->uh_sport; break; case IPPROTO_HOPOPTS: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_hbh); ext_hd |= EXT_HOPOPTS; hlen += (((struct ip6_hbh *)ulp)->ip6h_len + 1) << 3; proto = ((struct ip6_hbh *)ulp)->ip6h_nxt; ulp = NULL; break; case IPPROTO_ROUTING: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_rthdr); switch (((struct ip6_rthdr *)ulp)->ip6r_type) { case 0: ext_hd |= EXT_RTHDR0; break; case 2: ext_hd |= EXT_RTHDR2; break; default: if (V_fw_verbose) printf("IPFW2: IPV6 - Unknown " "Routing Header type(%d)\n", ((struct ip6_rthdr *) ulp)->ip6r_type); if (V_fw_deny_unknown_exthdrs) return (IP_FW_DENY); break; } ext_hd |= EXT_ROUTING; hlen += (((struct ip6_rthdr *)ulp)->ip6r_len + 1) << 3; proto = ((struct ip6_rthdr *)ulp)->ip6r_nxt; ulp = NULL; break; case IPPROTO_FRAGMENT: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_frag); ext_hd |= EXT_FRAGMENT; hlen += sizeof (struct ip6_frag); proto = ((struct ip6_frag *)ulp)->ip6f_nxt; offset = ((struct ip6_frag *)ulp)->ip6f_offlg & IP6F_OFF_MASK; ip6f_mf = ((struct ip6_frag *)ulp)->ip6f_offlg & IP6F_MORE_FRAG; if (V_fw_permit_single_frag6 == 0 && offset == 0 && ip6f_mf == 0) { if (V_fw_verbose) printf("IPFW2: IPV6 - Invalid " "Fragment Header\n"); if (V_fw_deny_unknown_exthdrs) return (IP_FW_DENY); break; } args->f_id.extra = ntohl(((struct ip6_frag *)ulp)->ip6f_ident); ulp = NULL; break; case IPPROTO_DSTOPTS: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_hbh); ext_hd |= EXT_DSTOPTS; hlen += (((struct ip6_hbh *)ulp)->ip6h_len + 1) << 3; proto = ((struct ip6_hbh *)ulp)->ip6h_nxt; ulp = NULL; break; case IPPROTO_AH: /* RFC 2402 */ PULLUP_TO(hlen, ulp, struct ip6_ext); ext_hd |= EXT_AH; hlen += (((struct ip6_ext *)ulp)->ip6e_len + 2) << 2; proto = ((struct ip6_ext *)ulp)->ip6e_nxt; ulp = NULL; break; case IPPROTO_ESP: /* RFC 2406 */ PULLUP_TO(hlen, ulp, uint32_t); /* SPI, Seq# */ /* Anything past Seq# is variable length and * data past this ext. header is encrypted. */ ext_hd |= EXT_ESP; break; case IPPROTO_NONE: /* RFC 2460 */ /* * Packet ends here, and IPv6 header has * already been pulled up. If ip6e_len!=0 * then octets must be ignored. */ ulp = ip; /* non-NULL to get out of loop. */ break; case IPPROTO_OSPFIGP: /* XXX OSPF header check? */ PULLUP_TO(hlen, ulp, struct ip6_ext); break; case IPPROTO_PIM: /* XXX PIM header check? */ PULLUP_TO(hlen, ulp, struct pim); break; case IPPROTO_CARP: PULLUP_TO(hlen, ulp, struct carp_header); if (((struct carp_header *)ulp)->carp_version != CARP_VERSION) return (IP_FW_DENY); if (((struct carp_header *)ulp)->carp_type != CARP_ADVERTISEMENT) return (IP_FW_DENY); break; case IPPROTO_IPV6: /* RFC 2893 */ PULLUP_TO(hlen, ulp, struct ip6_hdr); break; case IPPROTO_IPV4: /* RFC 2893 */ PULLUP_TO(hlen, ulp, struct ip); break; default: if (V_fw_verbose) printf("IPFW2: IPV6 - Unknown " "Extension Header(%d), ext_hd=%x\n", proto, ext_hd); if (V_fw_deny_unknown_exthdrs) return (IP_FW_DENY); PULLUP_TO(hlen, ulp, struct ip6_ext); break; } /*switch */ } ip = mtod(m, struct ip *); ip6 = (struct ip6_hdr *)ip; args->f_id.src_ip6 = ip6->ip6_src; args->f_id.dst_ip6 = ip6->ip6_dst; args->f_id.src_ip = 0; args->f_id.dst_ip = 0; args->f_id.flow_id6 = ntohl(ip6->ip6_flow); } else if (pktlen >= sizeof(struct ip) && (args->eh == NULL || etype == ETHERTYPE_IP) && ip->ip_v == 4) { is_ipv4 = 1; hlen = ip->ip_hl << 2; args->f_id.addr_type = 4; /* * Collect parameters into local variables for faster matching. */ proto = ip->ip_p; src_ip = ip->ip_src; dst_ip = ip->ip_dst; offset = ntohs(ip->ip_off) & IP_OFFMASK; iplen = ntohs(ip->ip_len); pktlen = iplen < pktlen ? iplen : pktlen; if (offset == 0) { switch (proto) { case IPPROTO_TCP: PULLUP_TO(hlen, ulp, struct tcphdr); dst_port = TCP(ulp)->th_dport; src_port = TCP(ulp)->th_sport; /* save flags for dynamic rules */ args->f_id._flags = TCP(ulp)->th_flags; break; case IPPROTO_SCTP: PULLUP_TO(hlen, ulp, struct sctphdr); src_port = SCTP(ulp)->src_port; dst_port = SCTP(ulp)->dest_port; break; case IPPROTO_UDP: PULLUP_TO(hlen, ulp, struct udphdr); dst_port = UDP(ulp)->uh_dport; src_port = UDP(ulp)->uh_sport; break; case IPPROTO_ICMP: PULLUP_TO(hlen, ulp, struct icmphdr); //args->f_id.flags = ICMP(ulp)->icmp_type; break; default: break; } } ip = mtod(m, struct ip *); args->f_id.src_ip = ntohl(src_ip.s_addr); args->f_id.dst_ip = ntohl(dst_ip.s_addr); } #undef PULLUP_TO if (proto) { /* we may have port numbers, store them */ args->f_id.proto = proto; args->f_id.src_port = src_port = ntohs(src_port); args->f_id.dst_port = dst_port = ntohs(dst_port); } IPFW_PF_RLOCK(chain); if (! V_ipfw_vnet_ready) { /* shutting down, leave NOW. */ IPFW_PF_RUNLOCK(chain); return (IP_FW_PASS); /* accept */ } if (args->rule.slot) { /* * Packet has already been tagged as a result of a previous * match on rule args->rule aka args->rule_id (PIPE, QUEUE, * REASS, NETGRAPH, DIVERT/TEE...) * Validate the slot and continue from the next one * if still present, otherwise do a lookup. */ f_pos = (args->rule.chain_id == chain->id) ? args->rule.slot : ipfw_find_rule(chain, args->rule.rulenum, args->rule.rule_id); } else { f_pos = 0; } /* * Now scan the rules, and parse microinstructions for each rule. * We have two nested loops and an inner switch. Sometimes we * need to break out of one or both loops, or re-enter one of * the loops with updated variables. Loop variables are: * * f_pos (outer loop) points to the current rule. * On output it points to the matching rule. * done (outer loop) is used as a flag to break the loop. * l (inner loop) residual length of current rule. * cmd points to the current microinstruction. * * We break the inner loop by setting l=0 and possibly * cmdlen=0 if we don't want to advance cmd. * We break the outer loop by setting done=1 * We can restart the inner loop by setting l>0 and f_pos, f, cmd * as needed. */ for (; f_pos < chain->n_rules; f_pos++) { ipfw_insn *cmd; uint32_t tablearg = 0; int l, cmdlen, skip_or; /* skip rest of OR block */ struct ip_fw *f; f = chain->map[f_pos]; if (V_set_disable & (1 << f->set) ) continue; skip_or = 0; for (l = f->cmd_len, cmd = f->cmd ; l > 0 ; l -= cmdlen, cmd += cmdlen) { int match; /* * check_body is a jump target used when we find a * CHECK_STATE, and need to jump to the body of * the target rule. */ /* check_body: */ cmdlen = F_LEN(cmd); /* * An OR block (insn_1 || .. || insn_n) has the * F_OR bit set in all but the last instruction. * The first match will set "skip_or", and cause * the following instructions to be skipped until * past the one with the F_OR bit clear. */ if (skip_or) { /* skip this instruction */ if ((cmd->len & F_OR) == 0) skip_or = 0; /* next one is good */ continue; } match = 0; /* set to 1 if we succeed */ switch (cmd->opcode) { /* * The first set of opcodes compares the packet's * fields with some pattern, setting 'match' if a * match is found. At the end of the loop there is * logic to deal with F_NOT and F_OR flags associated * with the opcode. */ case O_NOP: match = 1; break; case O_FORWARD_MAC: printf("ipfw: opcode %d unimplemented\n", cmd->opcode); break; case O_GID: case O_UID: case O_JAIL: /* * We only check offset == 0 && proto != 0, * as this ensures that we have a * packet with the ports info. */ if (offset != 0) break; if (proto == IPPROTO_TCP || proto == IPPROTO_UDP) match = check_uidgid( (ipfw_insn_u32 *)cmd, args, &ucred_lookup, #ifdef __FreeBSD__ &ucred_cache); #else (void *)&ucred_cache); #endif break; case O_RECV: match = iface_match(m->m_pkthdr.rcvif, (ipfw_insn_if *)cmd, chain, &tablearg); break; case O_XMIT: match = iface_match(oif, (ipfw_insn_if *)cmd, chain, &tablearg); break; case O_VIA: match = iface_match(oif ? oif : m->m_pkthdr.rcvif, (ipfw_insn_if *)cmd, chain, &tablearg); break; case O_MACADDR2: if (args->eh != NULL) { /* have MAC header */ u_int32_t *want = (u_int32_t *) ((ipfw_insn_mac *)cmd)->addr; u_int32_t *mask = (u_int32_t *) ((ipfw_insn_mac *)cmd)->mask; u_int32_t *hdr = (u_int32_t *)args->eh; match = ( want[0] == (hdr[0] & mask[0]) && want[1] == (hdr[1] & mask[1]) && want[2] == (hdr[2] & mask[2]) ); } break; case O_MAC_TYPE: if (args->eh != NULL) { u_int16_t *p = ((ipfw_insn_u16 *)cmd)->ports; int i; for (i = cmdlen - 1; !match && i>0; i--, p += 2) match = (etype >= p[0] && etype <= p[1]); } break; case O_FRAG: match = (offset != 0); break; case O_IN: /* "out" is "not in" */ match = (oif == NULL); break; case O_LAYER2: match = (args->eh != NULL); break; case O_DIVERTED: { /* For diverted packets, args->rule.info * contains the divert port (in host format) * reason and direction. */ uint32_t i = args->rule.info; match = (i&IPFW_IS_MASK) == IPFW_IS_DIVERT && cmd->arg1 & ((i & IPFW_INFO_IN) ? 1 : 2); } break; case O_PROTO: /* * We do not allow an arg of 0 so the * check of "proto" only suffices. */ match = (proto == cmd->arg1); break; case O_IP_SRC: match = is_ipv4 && (((ipfw_insn_ip *)cmd)->addr.s_addr == src_ip.s_addr); break; case O_IP_SRC_LOOKUP: case O_IP_DST_LOOKUP: if (is_ipv4) { uint32_t key = (cmd->opcode == O_IP_DST_LOOKUP) ? dst_ip.s_addr : src_ip.s_addr; uint32_t v = 0; if (cmdlen > F_INSN_SIZE(ipfw_insn_u32)) { /* generic lookup. The key must be * in 32bit big-endian format. */ v = ((ipfw_insn_u32 *)cmd)->d[1]; if (v == 0) key = dst_ip.s_addr; else if (v == 1) key = src_ip.s_addr; else if (v == 6) /* dscp */ key = (ip->ip_tos >> 2) & 0x3f; else if (offset != 0) break; else if (proto != IPPROTO_TCP && proto != IPPROTO_UDP) break; else if (v == 2) key = dst_port; else if (v == 3) key = src_port; #ifndef USERSPACE else if (v == 4 || v == 5) { check_uidgid( (ipfw_insn_u32 *)cmd, args, &ucred_lookup, #ifdef __FreeBSD__ &ucred_cache); if (v == 4 /* O_UID */) key = ucred_cache->cr_uid; else if (v == 5 /* O_JAIL */) key = ucred_cache->cr_prison->pr_id; #else /* !__FreeBSD__ */ (void *)&ucred_cache); if (v ==4 /* O_UID */) key = ucred_cache.uid; else if (v == 5 /* O_JAIL */) key = ucred_cache.xid; #endif /* !__FreeBSD__ */ } #endif /* !USERSPACE */ else break; } match = ipfw_lookup_table(chain, cmd->arg1, key, &v); if (!match) break; if (cmdlen == F_INSN_SIZE(ipfw_insn_u32)) match = ((ipfw_insn_u32 *)cmd)->d[0] == v; else tablearg = v; } else if (is_ipv6) { uint32_t v = 0; void *pkey = (cmd->opcode == O_IP_DST_LOOKUP) ? &args->f_id.dst_ip6: &args->f_id.src_ip6; match = ipfw_lookup_table_extended(chain, cmd->arg1, sizeof(struct in6_addr), pkey, &v); if (cmdlen == F_INSN_SIZE(ipfw_insn_u32)) match = ((ipfw_insn_u32 *)cmd)->d[0] == v; if (match) tablearg = v; } break; case O_IP_FLOW_LOOKUP: { uint32_t v = 0; match = ipfw_lookup_table_extended(chain, cmd->arg1, 0, &args->f_id, &v); if (cmdlen == F_INSN_SIZE(ipfw_insn_u32)) match = ((ipfw_insn_u32 *)cmd)->d[0] == v; if (match) tablearg = v; } break; case O_IP_SRC_MASK: case O_IP_DST_MASK: if (is_ipv4) { uint32_t a = (cmd->opcode == O_IP_DST_MASK) ? dst_ip.s_addr : src_ip.s_addr; uint32_t *p = ((ipfw_insn_u32 *)cmd)->d; int i = cmdlen-1; for (; !match && i>0; i-= 2, p+= 2) match = (p[0] == (a & p[1])); } break; case O_IP_SRC_ME: if (is_ipv4) { struct ifnet *tif; INADDR_TO_IFP(src_ip, tif); match = (tif != NULL); break; } #ifdef INET6 /* FALLTHROUGH */ case O_IP6_SRC_ME: match= is_ipv6 && ipfw_localip6(&args->f_id.src_ip6); #endif break; case O_IP_DST_SET: case O_IP_SRC_SET: if (is_ipv4) { u_int32_t *d = (u_int32_t *)(cmd+1); u_int32_t addr = cmd->opcode == O_IP_DST_SET ? args->f_id.dst_ip : args->f_id.src_ip; if (addr < d[0]) break; addr -= d[0]; /* subtract base */ match = (addr < cmd->arg1) && ( d[ 1 + (addr>>5)] & (1<<(addr & 0x1f)) ); } break; case O_IP_DST: match = is_ipv4 && (((ipfw_insn_ip *)cmd)->addr.s_addr == dst_ip.s_addr); break; case O_IP_DST_ME: if (is_ipv4) { struct ifnet *tif; INADDR_TO_IFP(dst_ip, tif); match = (tif != NULL); break; } #ifdef INET6 /* FALLTHROUGH */ case O_IP6_DST_ME: match= is_ipv6 && ipfw_localip6(&args->f_id.dst_ip6); #endif break; case O_IP_SRCPORT: case O_IP_DSTPORT: /* * offset == 0 && proto != 0 is enough * to guarantee that we have a * packet with port info. */ if ((proto==IPPROTO_UDP || proto==IPPROTO_TCP) && offset == 0) { u_int16_t x = (cmd->opcode == O_IP_SRCPORT) ? src_port : dst_port ; u_int16_t *p = ((ipfw_insn_u16 *)cmd)->ports; int i; for (i = cmdlen - 1; !match && i>0; i--, p += 2) match = (x>=p[0] && x<=p[1]); } break; case O_ICMPTYPE: match = (offset == 0 && proto==IPPROTO_ICMP && icmptype_match(ICMP(ulp), (ipfw_insn_u32 *)cmd) ); break; #ifdef INET6 case O_ICMP6TYPE: match = is_ipv6 && offset == 0 && proto==IPPROTO_ICMPV6 && icmp6type_match( ICMP6(ulp)->icmp6_type, (ipfw_insn_u32 *)cmd); break; #endif /* INET6 */ case O_IPOPT: match = (is_ipv4 && ipopts_match(ip, cmd) ); break; case O_IPVER: match = (is_ipv4 && cmd->arg1 == ip->ip_v); break; case O_IPID: case O_IPLEN: case O_IPTTL: if (is_ipv4) { /* only for IP packets */ uint16_t x; uint16_t *p; int i; if (cmd->opcode == O_IPLEN) x = iplen; else if (cmd->opcode == O_IPTTL) x = ip->ip_ttl; else /* must be IPID */ x = ntohs(ip->ip_id); if (cmdlen == 1) { match = (cmd->arg1 == x); break; } /* otherwise we have ranges */ p = ((ipfw_insn_u16 *)cmd)->ports; i = cmdlen - 1; for (; !match && i>0; i--, p += 2) match = (x >= p[0] && x <= p[1]); } break; case O_IPPRECEDENCE: match = (is_ipv4 && (cmd->arg1 == (ip->ip_tos & 0xe0)) ); break; case O_IPTOS: match = (is_ipv4 && flags_match(cmd, ip->ip_tos)); break; case O_DSCP: { uint32_t *p; uint16_t x; p = ((ipfw_insn_u32 *)cmd)->d; if (is_ipv4) x = ip->ip_tos >> 2; else if (is_ipv6) { uint8_t *v; v = &((struct ip6_hdr *)ip)->ip6_vfc; x = (*v & 0x0F) << 2; v++; x |= *v >> 6; } else break; /* DSCP bitmask is stored as low_u32 high_u32 */ if (x >= 32) match = *(p + 1) & (1 << (x - 32)); else match = *p & (1 << x); } break; case O_TCPDATALEN: if (proto == IPPROTO_TCP && offset == 0) { struct tcphdr *tcp; uint16_t x; uint16_t *p; int i; tcp = TCP(ulp); x = iplen - ((ip->ip_hl + tcp->th_off) << 2); if (cmdlen == 1) { match = (cmd->arg1 == x); break; } /* otherwise we have ranges */ p = ((ipfw_insn_u16 *)cmd)->ports; i = cmdlen - 1; for (; !match && i>0; i--, p += 2) match = (x >= p[0] && x <= p[1]); } break; case O_TCPFLAGS: match = (proto == IPPROTO_TCP && offset == 0 && flags_match(cmd, TCP(ulp)->th_flags)); break; case O_TCPOPTS: if (proto == IPPROTO_TCP && offset == 0 && ulp){ PULLUP_LEN(hlen, ulp, (TCP(ulp)->th_off << 2)); match = tcpopts_match(TCP(ulp), cmd); } break; case O_TCPSEQ: match = (proto == IPPROTO_TCP && offset == 0 && ((ipfw_insn_u32 *)cmd)->d[0] == TCP(ulp)->th_seq); break; case O_TCPACK: match = (proto == IPPROTO_TCP && offset == 0 && ((ipfw_insn_u32 *)cmd)->d[0] == TCP(ulp)->th_ack); break; case O_TCPWIN: if (proto == IPPROTO_TCP && offset == 0) { uint16_t x; uint16_t *p; int i; x = ntohs(TCP(ulp)->th_win); if (cmdlen == 1) { match = (cmd->arg1 == x); break; } /* Otherwise we have ranges. */ p = ((ipfw_insn_u16 *)cmd)->ports; i = cmdlen - 1; for (; !match && i > 0; i--, p += 2) match = (x >= p[0] && x <= p[1]); } break; case O_ESTAB: /* reject packets which have SYN only */ /* XXX should i also check for TH_ACK ? */ match = (proto == IPPROTO_TCP && offset == 0 && (TCP(ulp)->th_flags & (TH_RST | TH_ACK | TH_SYN)) != TH_SYN); break; case O_ALTQ: { struct pf_mtag *at; struct m_tag *mtag; ipfw_insn_altq *altq = (ipfw_insn_altq *)cmd; /* * ALTQ uses mbuf tags from another * packet filtering system - pf(4). * We allocate a tag in its format * and fill it in, pretending to be pf(4). */ match = 1; at = pf_find_mtag(m); if (at != NULL && at->qid != 0) break; mtag = m_tag_get(PACKET_TAG_PF, sizeof(struct pf_mtag), M_NOWAIT | M_ZERO); if (mtag == NULL) { /* * Let the packet fall back to the * default ALTQ. */ break; } m_tag_prepend(m, mtag); at = (struct pf_mtag *)(mtag + 1); at->qid = altq->qid; at->hdr = ip; break; } case O_LOG: ipfw_log(chain, f, hlen, args, m, oif, offset | ip6f_mf, tablearg, ip); match = 1; break; case O_PROB: match = (random()<((ipfw_insn_u32 *)cmd)->d[0]); break; case O_VERREVPATH: /* Outgoing packets automatically pass/match */ match = ((oif != NULL) || (m->m_pkthdr.rcvif == NULL) || ( #ifdef INET6 is_ipv6 ? verify_path6(&(args->f_id.src_ip6), m->m_pkthdr.rcvif, args->f_id.fib) : #endif verify_path(src_ip, m->m_pkthdr.rcvif, args->f_id.fib))); break; case O_VERSRCREACH: /* Outgoing packets automatically pass/match */ match = (hlen > 0 && ((oif != NULL) || #ifdef INET6 is_ipv6 ? verify_path6(&(args->f_id.src_ip6), NULL, args->f_id.fib) : #endif verify_path(src_ip, NULL, args->f_id.fib))); break; case O_ANTISPOOF: /* Outgoing packets automatically pass/match */ if (oif == NULL && hlen > 0 && ( (is_ipv4 && in_localaddr(src_ip)) #ifdef INET6 || (is_ipv6 && in6_localaddr(&(args->f_id.src_ip6))) #endif )) match = #ifdef INET6 is_ipv6 ? verify_path6( &(args->f_id.src_ip6), m->m_pkthdr.rcvif, args->f_id.fib) : #endif verify_path(src_ip, m->m_pkthdr.rcvif, args->f_id.fib); else match = 1; break; case O_IPSEC: #ifdef IPSEC match = (m_tag_find(m, PACKET_TAG_IPSEC_IN_DONE, NULL) != NULL); #endif /* otherwise no match */ break; #ifdef INET6 case O_IP6_SRC: match = is_ipv6 && IN6_ARE_ADDR_EQUAL(&args->f_id.src_ip6, &((ipfw_insn_ip6 *)cmd)->addr6); break; case O_IP6_DST: match = is_ipv6 && IN6_ARE_ADDR_EQUAL(&args->f_id.dst_ip6, &((ipfw_insn_ip6 *)cmd)->addr6); break; case O_IP6_SRC_MASK: case O_IP6_DST_MASK: if (is_ipv6) { int i = cmdlen - 1; struct in6_addr p; struct in6_addr *d = &((ipfw_insn_ip6 *)cmd)->addr6; for (; !match && i > 0; d += 2, i -= F_INSN_SIZE(struct in6_addr) * 2) { p = (cmd->opcode == O_IP6_SRC_MASK) ? args->f_id.src_ip6: args->f_id.dst_ip6; APPLY_MASK(&p, &d[1]); match = IN6_ARE_ADDR_EQUAL(&d[0], &p); } } break; case O_FLOW6ID: match = is_ipv6 && flow6id_match(args->f_id.flow_id6, (ipfw_insn_u32 *) cmd); break; case O_EXT_HDR: match = is_ipv6 && (ext_hd & ((ipfw_insn *) cmd)->arg1); break; case O_IP6: match = is_ipv6; break; #endif case O_IP4: match = is_ipv4; break; case O_TAG: { struct m_tag *mtag; uint32_t tag = TARG(cmd->arg1, tag); /* Packet is already tagged with this tag? */ mtag = m_tag_locate(m, MTAG_IPFW, tag, NULL); /* We have `untag' action when F_NOT flag is * present. And we must remove this mtag from * mbuf and reset `match' to zero (`match' will * be inversed later). * Otherwise we should allocate new mtag and * push it into mbuf. */ if (cmd->len & F_NOT) { /* `untag' action */ if (mtag != NULL) m_tag_delete(m, mtag); match = 0; } else { if (mtag == NULL) { mtag = m_tag_alloc( MTAG_IPFW, tag, 0, M_NOWAIT); if (mtag != NULL) m_tag_prepend(m, mtag); } match = 1; } break; } case O_FIB: /* try match the specified fib */ if (args->f_id.fib == cmd->arg1) match = 1; break; case O_SOCKARG: { #ifndef USERSPACE /* not supported in userspace */ struct inpcb *inp = args->inp; struct inpcbinfo *pi; if (is_ipv6) /* XXX can we remove this ? */ break; if (proto == IPPROTO_TCP) pi = &V_tcbinfo; else if (proto == IPPROTO_UDP) pi = &V_udbinfo; else break; /* * XXXRW: so_user_cookie should almost * certainly be inp_user_cookie? */ /* For incoming packet, lookup up the inpcb using the src/dest ip/port tuple */ if (inp == NULL) { inp = in_pcblookup(pi, src_ip, htons(src_port), dst_ip, htons(dst_port), INPLOOKUP_RLOCKPCB, NULL); if (inp != NULL) { tablearg = inp->inp_socket->so_user_cookie; if (tablearg) match = 1; INP_RUNLOCK(inp); } } else { if (inp->inp_socket) { tablearg = inp->inp_socket->so_user_cookie; if (tablearg) match = 1; } } #endif /* !USERSPACE */ break; } case O_TAGGED: { struct m_tag *mtag; uint32_t tag = TARG(cmd->arg1, tag); if (cmdlen == 1) { match = m_tag_locate(m, MTAG_IPFW, tag, NULL) != NULL; break; } /* we have ranges */ for (mtag = m_tag_first(m); mtag != NULL && !match; mtag = m_tag_next(m, mtag)) { uint16_t *p; int i; if (mtag->m_tag_cookie != MTAG_IPFW) continue; p = ((ipfw_insn_u16 *)cmd)->ports; i = cmdlen - 1; for(; !match && i > 0; i--, p += 2) match = mtag->m_tag_id >= p[0] && mtag->m_tag_id <= p[1]; } break; } /* * The second set of opcodes represents 'actions', * i.e. the terminal part of a rule once the packet * matches all previous patterns. * Typically there is only one action for each rule, * and the opcode is stored at the end of the rule * (but there are exceptions -- see below). * * In general, here we set retval and terminate the * outer loop (would be a 'break 3' in some language, * but we need to set l=0, done=1) * * Exceptions: * O_COUNT and O_SKIPTO actions: * instead of terminating, we jump to the next rule * (setting l=0), or to the SKIPTO target (setting * f/f_len, cmd and l as needed), respectively. * * O_TAG, O_LOG and O_ALTQ action parameters: * perform some action and set match = 1; * * O_LIMIT and O_KEEP_STATE: these opcodes are * not real 'actions', and are stored right * before the 'action' part of the rule. * These opcodes try to install an entry in the * state tables; if successful, we continue with * the next opcode (match=1; break;), otherwise * the packet must be dropped (set retval, * break loops with l=0, done=1) * * O_PROBE_STATE and O_CHECK_STATE: these opcodes * cause a lookup of the state table, and a jump * to the 'action' part of the parent rule * if an entry is found, or * (CHECK_STATE only) a jump to the next rule if * the entry is not found. * The result of the lookup is cached so that * further instances of these opcodes become NOPs. * The jump to the next rule is done by setting * l=0, cmdlen=0. */ case O_LIMIT: case O_KEEP_STATE: if (ipfw_install_state(chain, f, (ipfw_insn_limit *)cmd, args, tablearg)) { /* error or limit violation */ retval = IP_FW_DENY; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ } match = 1; break; case O_PROBE_STATE: case O_CHECK_STATE: /* * dynamic rules are checked at the first * keep-state or check-state occurrence, * with the result being stored in dyn_dir * and dyn_name. * The compiler introduces a PROBE_STATE * instruction for us when we have a * KEEP_STATE (because PROBE_STATE needs * to be run first). * * (dyn_dir == MATCH_UNKNOWN) means this is * first lookup for such f_id. Do lookup. * * (dyn_dir != MATCH_UNKNOWN && * dyn_name != 0 && dyn_name != cmd->arg1) * means previous lookup didn't find dynamic * rule for specific state name and current * lookup will search rule with another state * name. Redo lookup. * * (dyn_dir != MATCH_UNKNOWN && dyn_name == 0) * means previous lookup was for `any' name * and it didn't find rule. No need to do * lookup again. */ if ((dyn_dir == MATCH_UNKNOWN || (dyn_name != 0 && dyn_name != cmd->arg1)) && (q = ipfw_lookup_dyn_rule(&args->f_id, &dyn_dir, proto == IPPROTO_TCP ? TCP(ulp): NULL, (dyn_name = cmd->arg1))) != NULL) { /* * Found dynamic entry, update stats * and jump to the 'action' part of * the parent rule by setting * f, cmd, l and clearing cmdlen. */ IPFW_INC_DYN_COUNTER(q, pktlen); /* XXX we would like to have f_pos * readily accessible in the dynamic * rule, instead of having to * lookup q->rule. */ f = q->rule; f_pos = ipfw_find_rule(chain, f->rulenum, f->id); cmd = ACTION_PTR(f); l = f->cmd_len - f->act_ofs; ipfw_dyn_unlock(q); cmdlen = 0; match = 1; break; } /* * Dynamic entry not found. If CHECK_STATE, * skip to next rule, if PROBE_STATE just * ignore and continue with next opcode. */ if (cmd->opcode == O_CHECK_STATE) l = 0; /* exit inner loop */ match = 1; break; case O_ACCEPT: retval = 0; /* accept */ l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; case O_PIPE: case O_QUEUE: set_match(args, f_pos, chain); args->rule.info = TARG(cmd->arg1, pipe); if (cmd->opcode == O_PIPE) args->rule.info |= IPFW_IS_PIPE; if (V_fw_one_pass) args->rule.info |= IPFW_ONEPASS; retval = IP_FW_DUMMYNET; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; case O_DIVERT: case O_TEE: if (args->eh) /* not on layer 2 */ break; /* otherwise this is terminal */ l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ retval = (cmd->opcode == O_DIVERT) ? IP_FW_DIVERT : IP_FW_TEE; set_match(args, f_pos, chain); args->rule.info = TARG(cmd->arg1, divert); break; case O_COUNT: IPFW_INC_RULE_COUNTER(f, pktlen); l = 0; /* exit inner loop */ break; case O_SKIPTO: IPFW_INC_RULE_COUNTER(f, pktlen); f_pos = JUMP(chain, f, cmd->arg1, tablearg, 0); /* * Skip disabled rules, and re-enter * the inner loop with the correct * f_pos, f, l and cmd. * Also clear cmdlen and skip_or */ for (; f_pos < chain->n_rules - 1 && (V_set_disable & (1 << chain->map[f_pos]->set)); f_pos++) ; /* Re-enter the inner loop at the skipto rule. */ f = chain->map[f_pos]; l = f->cmd_len; cmd = f->cmd; match = 1; cmdlen = 0; skip_or = 0; continue; break; /* not reached */ case O_CALLRETURN: { /* * Implementation of `subroutine' call/return, * in the stack carried in an mbuf tag. This * is different from `skipto' in that any call * address is possible (`skipto' must prevent * backward jumps to avoid endless loops). * We have `return' action when F_NOT flag is * present. The `m_tag_id' field is used as * stack pointer. */ struct m_tag *mtag; uint16_t jmpto, *stack; #define IS_CALL ((cmd->len & F_NOT) == 0) #define IS_RETURN ((cmd->len & F_NOT) != 0) /* * Hand-rolled version of m_tag_locate() with * wildcard `type'. * If not already tagged, allocate new tag. */ mtag = m_tag_first(m); while (mtag != NULL) { if (mtag->m_tag_cookie == MTAG_IPFW_CALL) break; mtag = m_tag_next(m, mtag); } if (mtag == NULL && IS_CALL) { mtag = m_tag_alloc(MTAG_IPFW_CALL, 0, IPFW_CALLSTACK_SIZE * sizeof(uint16_t), M_NOWAIT); if (mtag != NULL) m_tag_prepend(m, mtag); } /* * On error both `call' and `return' just * continue with next rule. */ if (IS_RETURN && (mtag == NULL || mtag->m_tag_id == 0)) { l = 0; /* exit inner loop */ break; } if (IS_CALL && (mtag == NULL || mtag->m_tag_id >= IPFW_CALLSTACK_SIZE)) { printf("ipfw: call stack error, " "go to next rule\n"); l = 0; /* exit inner loop */ break; } IPFW_INC_RULE_COUNTER(f, pktlen); stack = (uint16_t *)(mtag + 1); /* * The `call' action may use cached f_pos * (in f->next_rule), whose version is written * in f->next_rule. * The `return' action, however, doesn't have * fixed jump address in cmd->arg1 and can't use * cache. */ if (IS_CALL) { stack[mtag->m_tag_id] = f->rulenum; mtag->m_tag_id++; f_pos = JUMP(chain, f, cmd->arg1, tablearg, 1); } else { /* `return' action */ mtag->m_tag_id--; jmpto = stack[mtag->m_tag_id] + 1; f_pos = ipfw_find_rule(chain, jmpto, 0); } /* * Skip disabled rules, and re-enter * the inner loop with the correct * f_pos, f, l and cmd. * Also clear cmdlen and skip_or */ for (; f_pos < chain->n_rules - 1 && (V_set_disable & (1 << chain->map[f_pos]->set)); f_pos++) ; /* Re-enter the inner loop at the dest rule. */ f = chain->map[f_pos]; l = f->cmd_len; cmd = f->cmd; cmdlen = 0; skip_or = 0; continue; break; /* NOTREACHED */ } #undef IS_CALL #undef IS_RETURN case O_REJECT: /* * Drop the packet and send a reject notice * if the packet is not ICMP (or is an ICMP * query), and it is not multicast/broadcast. */ if (hlen > 0 && is_ipv4 && offset == 0 && (proto != IPPROTO_ICMP || is_icmp_query(ICMP(ulp))) && !(m->m_flags & (M_BCAST|M_MCAST)) && !IN_MULTICAST(ntohl(dst_ip.s_addr))) { send_reject(args, cmd->arg1, iplen, ip); m = args->m; } /* FALLTHROUGH */ #ifdef INET6 case O_UNREACH6: if (hlen > 0 && is_ipv6 && ((offset & IP6F_OFF_MASK) == 0) && (proto != IPPROTO_ICMPV6 || (is_icmp6_query(icmp6_type) == 1)) && !(m->m_flags & (M_BCAST|M_MCAST)) && !IN6_IS_ADDR_MULTICAST(&args->f_id.dst_ip6)) { send_reject6( args, cmd->arg1, hlen, (struct ip6_hdr *)ip); m = args->m; } /* FALLTHROUGH */ #endif case O_DENY: retval = IP_FW_DENY; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; case O_FORWARD_IP: if (args->eh) /* not valid on layer2 pkts */ break; if (q == NULL || q->rule != f || dyn_dir == MATCH_FORWARD) { struct sockaddr_in *sa; sa = &(((ipfw_insn_sa *)cmd)->sa); if (sa->sin_addr.s_addr == INADDR_ANY) { #ifdef INET6 /* * We use O_FORWARD_IP opcode for * fwd rule with tablearg, but tables * now support IPv6 addresses. And * when we are inspecting IPv6 packet, * we can use nh6 field from * table_value as next_hop6 address. */ if (is_ipv6) { struct sockaddr_in6 *sa6; sa6 = args->next_hop6 = &args->hopstore6; sa6->sin6_family = AF_INET6; sa6->sin6_len = sizeof(*sa6); sa6->sin6_addr = TARG_VAL( chain, tablearg, nh6); /* * Set sin6_scope_id only for * link-local unicast addresses. */ if (IN6_IS_ADDR_LINKLOCAL( &sa6->sin6_addr)) sa6->sin6_scope_id = TARG_VAL(chain, tablearg, zoneid); } else #endif { sa = args->next_hop = &args->hopstore; sa->sin_family = AF_INET; sa->sin_len = sizeof(*sa); sa->sin_addr.s_addr = htonl( TARG_VAL(chain, tablearg, nh4)); } } else { args->next_hop = sa; } } retval = IP_FW_PASS; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; #ifdef INET6 case O_FORWARD_IP6: if (args->eh) /* not valid on layer2 pkts */ break; if (q == NULL || q->rule != f || dyn_dir == MATCH_FORWARD) { struct sockaddr_in6 *sin6; sin6 = &(((ipfw_insn_sa6 *)cmd)->sa); args->next_hop6 = sin6; } retval = IP_FW_PASS; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; #endif case O_NETGRAPH: case O_NGTEE: set_match(args, f_pos, chain); args->rule.info = TARG(cmd->arg1, netgraph); if (V_fw_one_pass) args->rule.info |= IPFW_ONEPASS; retval = (cmd->opcode == O_NETGRAPH) ? IP_FW_NETGRAPH : IP_FW_NGTEE; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; case O_SETFIB: { uint32_t fib; IPFW_INC_RULE_COUNTER(f, pktlen); fib = TARG(cmd->arg1, fib) & 0x7FFF; if (fib >= rt_numfibs) fib = 0; M_SETFIB(m, fib); args->f_id.fib = fib; l = 0; /* exit inner loop */ break; } case O_SETDSCP: { uint16_t code; code = TARG(cmd->arg1, dscp) & 0x3F; l = 0; /* exit inner loop */ if (is_ipv4) { uint16_t old; old = *(uint16_t *)ip; ip->ip_tos = (code << 2) | (ip->ip_tos & 0x03); ip->ip_sum = cksum_adjust(ip->ip_sum, old, *(uint16_t *)ip); } else if (is_ipv6) { uint8_t *v; v = &((struct ip6_hdr *)ip)->ip6_vfc; *v = (*v & 0xF0) | (code >> 2); v++; *v = (*v & 0x3F) | ((code & 0x03) << 6); } else break; IPFW_INC_RULE_COUNTER(f, pktlen); break; } case O_NAT: l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ if (!IPFW_NAT_LOADED) { retval = IP_FW_DENY; break; } struct cfg_nat *t; int nat_id; set_match(args, f_pos, chain); /* Check if this is 'global' nat rule */ if (cmd->arg1 == IP_FW_NAT44_GLOBAL) { retval = ipfw_nat_ptr(args, NULL, m); break; } t = ((ipfw_insn_nat *)cmd)->nat; if (t == NULL) { nat_id = TARG(cmd->arg1, nat); t = (*lookup_nat_ptr)(&chain->nat, nat_id); if (t == NULL) { retval = IP_FW_DENY; break; } if (cmd->arg1 != IP_FW_TARG) ((ipfw_insn_nat *)cmd)->nat = t; } retval = ipfw_nat_ptr(args, t, m); break; case O_REASS: { int ip_off; IPFW_INC_RULE_COUNTER(f, pktlen); l = 0; /* in any case exit inner loop */ ip_off = ntohs(ip->ip_off); /* if not fragmented, go to next rule */ if ((ip_off & (IP_MF | IP_OFFMASK)) == 0) break; args->m = m = ip_reass(m); /* * do IP header checksum fixup. */ if (m == NULL) { /* fragment got swallowed */ retval = IP_FW_DENY; } else { /* good, packet complete */ int hlen; ip = mtod(m, struct ip *); hlen = ip->ip_hl << 2; ip->ip_sum = 0; if (hlen == sizeof(struct ip)) ip->ip_sum = in_cksum_hdr(ip); else ip->ip_sum = in_cksum(m, hlen); retval = IP_FW_REASS; set_match(args, f_pos, chain); } done = 1; /* exit outer loop */ break; } case O_EXTERNAL_ACTION: l = 0; /* in any case exit inner loop */ retval = ipfw_run_eaction(chain, args, cmd, &done); + /* + * If both @retval and @done are zero, + * consider this as rule matching and + * update counters. + */ + if (retval == 0 && done == 0) + IPFW_INC_RULE_COUNTER(f, pktlen); break; default: panic("-- unknown opcode %d\n", cmd->opcode); } /* end of switch() on opcodes */ /* * if we get here with l=0, then match is irrelevant. */ if (cmd->len & F_NOT) match = !match; if (match) { if (cmd->len & F_OR) skip_or = 1; } else { if (!(cmd->len & F_OR)) /* not an OR block, */ break; /* try next rule */ } } /* end of inner loop, scan opcodes */ #undef PULLUP_LEN if (done) break; /* next_rule:; */ /* try next rule */ } /* end of outer for, scan rules */ if (done) { struct ip_fw *rule = chain->map[f_pos]; /* Update statistics */ IPFW_INC_RULE_COUNTER(rule, pktlen); } else { retval = IP_FW_DENY; printf("ipfw: ouch!, skip past end of rules, denying packet\n"); } IPFW_PF_RUNLOCK(chain); #ifdef __FreeBSD__ if (ucred_cache != NULL) crfree(ucred_cache); #endif return (retval); pullup_failed: if (V_fw_verbose) printf("ipfw: pullup failed\n"); return (IP_FW_DENY); } /* * Set maximum number of tables that can be used in given VNET ipfw instance. */ #ifdef SYSCTL_NODE static int sysctl_ipfw_table_num(SYSCTL_HANDLER_ARGS) { int error; unsigned int ntables; ntables = V_fw_tables_max; error = sysctl_handle_int(oidp, &ntables, 0, req); /* Read operation or some error */ if ((error != 0) || (req->newptr == NULL)) return (error); return (ipfw_resize_tables(&V_layer3_chain, ntables)); } /* * Switches table namespace between global and per-set. */ static int sysctl_ipfw_tables_sets(SYSCTL_HANDLER_ARGS) { int error; unsigned int sets; sets = V_fw_tables_sets; error = sysctl_handle_int(oidp, &sets, 0, req); /* Read operation or some error */ if ((error != 0) || (req->newptr == NULL)) return (error); return (ipfw_switch_tables_namespace(&V_layer3_chain, sets)); } #endif /* * Module and VNET glue */ /* * Stuff that must be initialised only on boot or module load */ static int ipfw_init(void) { int error = 0; /* * Only print out this stuff the first time around, * when called from the sysinit code. */ printf("ipfw2 " #ifdef INET6 "(+ipv6) " #endif "initialized, divert %s, nat %s, " "default to %s, logging ", #ifdef IPDIVERT "enabled", #else "loadable", #endif #ifdef IPFIREWALL_NAT "enabled", #else "loadable", #endif default_to_accept ? "accept" : "deny"); /* * Note: V_xxx variables can be accessed here but the vnet specific * initializer may not have been called yet for the VIMAGE case. * Tuneables will have been processed. We will print out values for * the default vnet. * XXX This should all be rationalized AFTER 8.0 */ if (V_fw_verbose == 0) printf("disabled\n"); else if (V_verbose_limit == 0) printf("unlimited\n"); else printf("limited to %d packets/entry by default\n", V_verbose_limit); /* Check user-supplied table count for validness */ if (default_fw_tables > IPFW_TABLES_MAX) default_fw_tables = IPFW_TABLES_MAX; ipfw_init_sopt_handler(); ipfw_init_obj_rewriter(); ipfw_iface_init(); return (error); } /* * Called for the removal of the last instance only on module unload. */ static void ipfw_destroy(void) { ipfw_iface_destroy(); ipfw_destroy_sopt_handler(); ipfw_destroy_obj_rewriter(); printf("IP firewall unloaded\n"); } /* * Stuff that must be initialized for every instance * (including the first of course). */ static int vnet_ipfw_init(const void *unused) { int error, first; struct ip_fw *rule = NULL; struct ip_fw_chain *chain; chain = &V_layer3_chain; first = IS_DEFAULT_VNET(curvnet) ? 1 : 0; /* First set up some values that are compile time options */ V_autoinc_step = 100; /* bounded to 1..1000 in add_rule() */ V_fw_deny_unknown_exthdrs = 1; #ifdef IPFIREWALL_VERBOSE V_fw_verbose = 1; #endif #ifdef IPFIREWALL_VERBOSE_LIMIT V_verbose_limit = IPFIREWALL_VERBOSE_LIMIT; #endif #ifdef IPFIREWALL_NAT LIST_INIT(&chain->nat); #endif /* Init shared services hash table */ ipfw_init_srv(chain); ipfw_init_counters(); /* insert the default rule and create the initial map */ chain->n_rules = 1; chain->map = malloc(sizeof(struct ip_fw *), M_IPFW, M_WAITOK | M_ZERO); rule = ipfw_alloc_rule(chain, sizeof(struct ip_fw)); /* Set initial number of tables */ V_fw_tables_max = default_fw_tables; error = ipfw_init_tables(chain, first); if (error) { printf("ipfw2: setting up tables failed\n"); free(chain->map, M_IPFW); free(rule, M_IPFW); return (ENOSPC); } /* fill and insert the default rule */ rule->act_ofs = 0; rule->rulenum = IPFW_DEFAULT_RULE; rule->cmd_len = 1; rule->set = RESVD_SET; rule->cmd[0].len = 1; rule->cmd[0].opcode = default_to_accept ? O_ACCEPT : O_DENY; chain->default_rule = chain->map[0] = rule; chain->id = rule->id = 1; /* Pre-calculate rules length for legacy dump format */ chain->static_len = sizeof(struct ip_fw_rule0); IPFW_LOCK_INIT(chain); ipfw_dyn_init(chain); ipfw_eaction_init(chain, first); #ifdef LINEAR_SKIPTO ipfw_init_skipto_cache(chain); #endif ipfw_bpf_init(first); /* First set up some values that are compile time options */ V_ipfw_vnet_ready = 1; /* Open for business */ /* * Hook the sockopt handler and pfil hooks for ipv4 and ipv6. * Even if the latter two fail we still keep the module alive * because the sockopt and layer2 paths are still useful. * ipfw[6]_hook return 0 on success, ENOENT on failure, * so we can ignore the exact return value and just set a flag. * * Note that V_fw[6]_enable are manipulated by a SYSCTL_PROC so * changes in the underlying (per-vnet) variables trigger * immediate hook()/unhook() calls. * In layer2 we have the same behaviour, except that V_ether_ipfw * is checked on each packet because there are no pfil hooks. */ V_ip_fw_ctl_ptr = ipfw_ctl3; error = ipfw_attach_hooks(1); return (error); } /* * Called for the removal of each instance. */ static int vnet_ipfw_uninit(const void *unused) { struct ip_fw *reap; struct ip_fw_chain *chain = &V_layer3_chain; int i, last; V_ipfw_vnet_ready = 0; /* tell new callers to go away */ /* * disconnect from ipv4, ipv6, layer2 and sockopt. * Then grab, release and grab again the WLOCK so we make * sure the update is propagated and nobody will be in. */ (void)ipfw_attach_hooks(0 /* detach */); V_ip_fw_ctl_ptr = NULL; last = IS_DEFAULT_VNET(curvnet) ? 1 : 0; IPFW_UH_WLOCK(chain); IPFW_UH_WUNLOCK(chain); ipfw_dyn_uninit(0); /* run the callout_drain */ IPFW_UH_WLOCK(chain); reap = NULL; IPFW_WLOCK(chain); for (i = 0; i < chain->n_rules; i++) ipfw_reap_add(chain, &reap, chain->map[i]); free(chain->map, M_IPFW); #ifdef LINEAR_SKIPTO ipfw_destroy_skipto_cache(chain); #endif IPFW_WUNLOCK(chain); IPFW_UH_WUNLOCK(chain); ipfw_destroy_tables(chain, last); ipfw_eaction_uninit(chain, last); if (reap != NULL) ipfw_reap_rules(reap); vnet_ipfw_iface_destroy(chain); ipfw_destroy_srv(chain); IPFW_LOCK_DESTROY(chain); ipfw_dyn_uninit(1); /* free the remaining parts */ ipfw_destroy_counters(); ipfw_bpf_uninit(last); return (0); } /* * Module event handler. * In general we have the choice of handling most of these events by the * event handler or by the (VNET_)SYS(UN)INIT handlers. I have chosen to * use the SYSINIT handlers as they are more capable of expressing the * flow of control during module and vnet operations, so this is just * a skeleton. Note there is no SYSINIT equivalent of the module * SHUTDOWN handler, but we don't have anything to do in that case anyhow. */ static int ipfw_modevent(module_t mod, int type, void *unused) { int err = 0; switch (type) { case MOD_LOAD: /* Called once at module load or * system boot if compiled in. */ break; case MOD_QUIESCE: /* Called before unload. May veto unloading. */ break; case MOD_UNLOAD: /* Called during unload. */ break; case MOD_SHUTDOWN: /* Called during system shutdown. */ break; default: err = EOPNOTSUPP; break; } return err; } static moduledata_t ipfwmod = { "ipfw", ipfw_modevent, 0 }; /* Define startup order. */ #define IPFW_SI_SUB_FIREWALL SI_SUB_PROTO_FIREWALL #define IPFW_MODEVENT_ORDER (SI_ORDER_ANY - 255) /* On boot slot in here. */ #define IPFW_MODULE_ORDER (IPFW_MODEVENT_ORDER + 1) /* A little later. */ #define IPFW_VNET_ORDER (IPFW_MODEVENT_ORDER + 2) /* Later still. */ DECLARE_MODULE(ipfw, ipfwmod, IPFW_SI_SUB_FIREWALL, IPFW_MODEVENT_ORDER); FEATURE(ipfw_ctl3, "ipfw new sockopt calls"); MODULE_VERSION(ipfw, 3); /* should declare some dependencies here */ /* * Starting up. Done in order after ipfwmod() has been called. * VNET_SYSINIT is also called for each existing vnet and each new vnet. */ SYSINIT(ipfw_init, IPFW_SI_SUB_FIREWALL, IPFW_MODULE_ORDER, ipfw_init, NULL); VNET_SYSINIT(vnet_ipfw_init, IPFW_SI_SUB_FIREWALL, IPFW_VNET_ORDER, vnet_ipfw_init, NULL); /* * Closing up shop. These are done in REVERSE ORDER, but still * after ipfwmod() has been called. Not called on reboot. * VNET_SYSUNINIT is also called for each exiting vnet as it exits. * or when the module is unloaded. */ SYSUNINIT(ipfw_destroy, IPFW_SI_SUB_FIREWALL, IPFW_MODULE_ORDER, ipfw_destroy, NULL); VNET_SYSUNINIT(vnet_ipfw_uninit, IPFW_SI_SUB_FIREWALL, IPFW_VNET_ORDER, vnet_ipfw_uninit, NULL); /* end of file */ Index: projects/clang400-import/sys/netpfil/ipfw/nptv6/nptv6.c =================================================================== --- projects/clang400-import/sys/netpfil/ipfw/nptv6/nptv6.c (revision 314522) +++ projects/clang400-import/sys/netpfil/ipfw/nptv6/nptv6.c (revision 314523) @@ -1,892 +1,894 @@ /*- * Copyright (c) 2016 Yandex LLC * Copyright (c) 2016 Andrey V. Elsukov * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static VNET_DEFINE(uint16_t, nptv6_eid) = 0; #define V_nptv6_eid VNET(nptv6_eid) #define IPFW_TLV_NPTV6_NAME IPFW_TLV_EACTION_NAME(V_nptv6_eid) static struct nptv6_cfg *nptv6_alloc_config(const char *name, uint8_t set); static void nptv6_free_config(struct nptv6_cfg *cfg); static struct nptv6_cfg *nptv6_find(struct namedobj_instance *ni, const char *name, uint8_t set); static int nptv6_rewrite_internal(struct nptv6_cfg *cfg, struct mbuf **mp, int offset); static int nptv6_rewrite_external(struct nptv6_cfg *cfg, struct mbuf **mp, int offset); #define NPTV6_LOOKUP(chain, cmd) \ (struct nptv6_cfg *)SRV_OBJECT((chain), (cmd)->arg1) #ifndef IN6_MASK_ADDR #define IN6_MASK_ADDR(a, m) do { \ (a)->s6_addr32[0] &= (m)->s6_addr32[0]; \ (a)->s6_addr32[1] &= (m)->s6_addr32[1]; \ (a)->s6_addr32[2] &= (m)->s6_addr32[2]; \ (a)->s6_addr32[3] &= (m)->s6_addr32[3]; \ } while (0) #endif #ifndef IN6_ARE_MASKED_ADDR_EQUAL #define IN6_ARE_MASKED_ADDR_EQUAL(d, a, m) ( \ (((d)->s6_addr32[0] ^ (a)->s6_addr32[0]) & (m)->s6_addr32[0]) == 0 && \ (((d)->s6_addr32[1] ^ (a)->s6_addr32[1]) & (m)->s6_addr32[1]) == 0 && \ (((d)->s6_addr32[2] ^ (a)->s6_addr32[2]) & (m)->s6_addr32[2]) == 0 && \ (((d)->s6_addr32[3] ^ (a)->s6_addr32[3]) & (m)->s6_addr32[3]) == 0 ) #endif #if 0 #define NPTV6_DEBUG(fmt, ...) do { \ printf("%s: " fmt "\n", __func__, ## __VA_ARGS__); \ } while (0) #define NPTV6_IPDEBUG(fmt, ...) do { \ char _s[INET6_ADDRSTRLEN], _d[INET6_ADDRSTRLEN]; \ printf("%s: " fmt "\n", __func__, ## __VA_ARGS__); \ } while (0) #else #define NPTV6_DEBUG(fmt, ...) #define NPTV6_IPDEBUG(fmt, ...) #endif static int nptv6_getlasthdr(struct nptv6_cfg *cfg, struct mbuf *m, int *offset) { struct ip6_hdr *ip6; struct ip6_hbh *hbh; int proto, hlen; hlen = (offset == NULL) ? 0: *offset; if (m->m_len < hlen) return (-1); ip6 = mtodo(m, hlen); hlen += sizeof(*ip6); proto = ip6->ip6_nxt; while (proto == IPPROTO_HOPOPTS || proto == IPPROTO_ROUTING || proto == IPPROTO_DSTOPTS) { hbh = mtodo(m, hlen); if (m->m_len < hlen) return (-1); proto = hbh->ip6h_nxt; hlen += hbh->ip6h_len << 3; } if (offset != NULL) *offset = hlen; return (proto); } static int nptv6_translate_icmpv6(struct nptv6_cfg *cfg, struct mbuf **mp, int offset) { struct icmp6_hdr *icmp6; struct ip6_hdr *ip6; struct mbuf *m; m = *mp; if (offset > m->m_len) return (-1); icmp6 = mtodo(m, offset); NPTV6_DEBUG("ICMPv6 type %d", icmp6->icmp6_type); switch (icmp6->icmp6_type) { case ICMP6_DST_UNREACH: case ICMP6_PACKET_TOO_BIG: case ICMP6_TIME_EXCEEDED: case ICMP6_PARAM_PROB: break; case ICMP6_ECHO_REQUEST: case ICMP6_ECHO_REPLY: /* nothing to translate */ return (0); default: /* * XXX: We can add some checks to not translate NDP and MLD * messages. Currently user must explicitly allow these message * types, otherwise packets will be dropped. */ return (-1); } offset += sizeof(*icmp6); if (offset + sizeof(*ip6) > m->m_pkthdr.len) return (-1); if (offset + sizeof(*ip6) > m->m_len) *mp = m = m_pullup(m, offset + sizeof(*ip6)); if (m == NULL) return (-1); ip6 = mtodo(m, offset); NPTV6_IPDEBUG("offset %d, %s -> %s %d", offset, inet_ntop(AF_INET6, &ip6->ip6_src, _s, sizeof(_s)), inet_ntop(AF_INET6, &ip6->ip6_dst, _d, sizeof(_d)), ip6->ip6_nxt); if (IN6_ARE_MASKED_ADDR_EQUAL(&ip6->ip6_src, &cfg->external, &cfg->mask)) return (nptv6_rewrite_external(cfg, mp, offset)); else if (IN6_ARE_MASKED_ADDR_EQUAL(&ip6->ip6_dst, &cfg->internal, &cfg->mask)) return (nptv6_rewrite_internal(cfg, mp, offset)); /* * Addresses in the inner IPv6 header doesn't matched to * our prefixes. */ return (-1); } static int nptv6_search_index(struct nptv6_cfg *cfg, struct in6_addr *a) { int idx; if (cfg->flags & NPTV6_48PLEN) return (3); /* Search suitable word index for adjustment */ for (idx = 4; idx < 8; idx++) if (a->s6_addr16[idx] != 0xffff) break; /* * RFC 6296 p3.7: If an NPTv6 Translator discovers a datagram with * an IID of all-zeros while performing address mapping, that * datagram MUST be dropped, and an ICMPv6 Parameter Problem error * SHOULD be generated. */ if (idx == 8 || (a->s6_addr32[2] == 0 && a->s6_addr32[3] == 0)) return (-1); return (idx); } static void nptv6_copy_addr(struct in6_addr *src, struct in6_addr *dst, struct in6_addr *mask) { int i; for (i = 0; i < 8 && mask->s6_addr8[i] != 0; i++) { dst->s6_addr8[i] &= ~mask->s6_addr8[i]; dst->s6_addr8[i] |= src->s6_addr8[i] & mask->s6_addr8[i]; } } static int nptv6_rewrite_internal(struct nptv6_cfg *cfg, struct mbuf **mp, int offset) { struct in6_addr *addr; struct ip6_hdr *ip6; int idx, proto; uint16_t adj; ip6 = mtodo(*mp, offset); NPTV6_IPDEBUG("offset %d, %s -> %s %d", offset, inet_ntop(AF_INET6, &ip6->ip6_src, _s, sizeof(_s)), inet_ntop(AF_INET6, &ip6->ip6_dst, _d, sizeof(_d)), ip6->ip6_nxt); if (offset == 0) addr = &ip6->ip6_src; else { /* * When we rewriting inner IPv6 header, we need to rewrite * destination address back to external prefix. The datagram in * the ICMPv6 payload should looks like it was send from * external prefix. */ addr = &ip6->ip6_dst; } idx = nptv6_search_index(cfg, addr); if (idx < 0) { /* * Do not send ICMPv6 error when offset isn't zero. * This means we are rewriting inner IPv6 header in the * ICMPv6 error message. */ if (offset == 0) { icmp6_error2(*mp, ICMP6_DST_UNREACH, ICMP6_DST_UNREACH_ADDR, 0, (*mp)->m_pkthdr.rcvif); *mp = NULL; } return (IP_FW_DENY); } adj = addr->s6_addr16[idx]; nptv6_copy_addr(&cfg->external, addr, &cfg->mask); adj = cksum_add(adj, cfg->adjustment); if (adj == 0xffff) adj = 0; addr->s6_addr16[idx] = adj; if (offset == 0) { /* * We may need to translate addresses in the inner IPv6 * header for ICMPv6 error messages. */ proto = nptv6_getlasthdr(cfg, *mp, &offset); if (proto < 0 || (proto == IPPROTO_ICMPV6 && nptv6_translate_icmpv6(cfg, mp, offset) != 0)) return (IP_FW_DENY); NPTV6STAT_INC(cfg, in2ex); } return (0); } static int nptv6_rewrite_external(struct nptv6_cfg *cfg, struct mbuf **mp, int offset) { struct in6_addr *addr; struct ip6_hdr *ip6; int idx, proto; uint16_t adj; ip6 = mtodo(*mp, offset); NPTV6_IPDEBUG("offset %d, %s -> %s %d", offset, inet_ntop(AF_INET6, &ip6->ip6_src, _s, sizeof(_s)), inet_ntop(AF_INET6, &ip6->ip6_dst, _d, sizeof(_d)), ip6->ip6_nxt); if (offset == 0) addr = &ip6->ip6_dst; else { /* * When we rewriting inner IPv6 header, we need to rewrite * source address back to internal prefix. The datagram in * the ICMPv6 payload should looks like it was send from * internal prefix. */ addr = &ip6->ip6_src; } idx = nptv6_search_index(cfg, addr); if (idx < 0) { /* * Do not send ICMPv6 error when offset isn't zero. * This means we are rewriting inner IPv6 header in the * ICMPv6 error message. */ if (offset == 0) { icmp6_error2(*mp, ICMP6_DST_UNREACH, ICMP6_DST_UNREACH_ADDR, 0, (*mp)->m_pkthdr.rcvif); *mp = NULL; } return (IP_FW_DENY); } adj = addr->s6_addr16[idx]; nptv6_copy_addr(&cfg->internal, addr, &cfg->mask); adj = cksum_add(adj, ~cfg->adjustment); if (adj == 0xffff) adj = 0; addr->s6_addr16[idx] = adj; if (offset == 0) { /* * We may need to translate addresses in the inner IPv6 * header for ICMPv6 error messages. */ proto = nptv6_getlasthdr(cfg, *mp, &offset); if (proto < 0 || (proto == IPPROTO_ICMPV6 && nptv6_translate_icmpv6(cfg, mp, offset) != 0)) return (IP_FW_DENY); NPTV6STAT_INC(cfg, ex2in); } return (0); } /* * ipfw external action handler. */ static int ipfw_nptv6(struct ip_fw_chain *chain, struct ip_fw_args *args, ipfw_insn *cmd, int *done) { struct ip6_hdr *ip6; struct nptv6_cfg *cfg; ipfw_insn *icmd; int ret; *done = 0; /* try next rule if not matched */ + ret = IP_FW_DENY; icmd = cmd + 1; if (cmd->opcode != O_EXTERNAL_ACTION || cmd->arg1 != V_nptv6_eid || icmd->opcode != O_EXTERNAL_INSTANCE || (cfg = NPTV6_LOOKUP(chain, icmd)) == NULL) - return (0); + return (ret); /* * We need act as router, so when forwarding is disabled - * do nothing. */ if (V_ip6_forwarding == 0 || args->f_id.addr_type != 6) - return (0); + return (ret); /* * NOTE: we expect ipfw_chk() did m_pullup() up to upper level * protocol's headers. Also we skip some checks, that ip6_input(), * ip6_forward(), ip6_fastfwd() and ipfw_chk() already did. */ - ret = IP_FW_DENY; ip6 = mtod(args->m, struct ip6_hdr *); NPTV6_IPDEBUG("eid %u, oid %u, %s -> %s %d", cmd->arg1, icmd->arg1, inet_ntop(AF_INET6, &ip6->ip6_src, _s, sizeof(_s)), inet_ntop(AF_INET6, &ip6->ip6_dst, _d, sizeof(_d)), ip6->ip6_nxt); if (IN6_ARE_MASKED_ADDR_EQUAL(&ip6->ip6_src, &cfg->internal, &cfg->mask)) { /* * XXX: Do not translate packets when both src and dst * are from internal prefix. */ if (IN6_ARE_MASKED_ADDR_EQUAL(&ip6->ip6_dst, &cfg->internal, &cfg->mask)) - return (0); + return (ret); ret = nptv6_rewrite_internal(cfg, &args->m, 0); } else if (IN6_ARE_MASKED_ADDR_EQUAL(&ip6->ip6_dst, &cfg->external, &cfg->mask)) ret = nptv6_rewrite_external(cfg, &args->m, 0); else - return (0); + return (ret); /* - * If address wasn't rewrited - free mbuf. + * If address wasn't rewrited - free mbuf and terminate the search. */ if (ret != 0) { if (args->m != NULL) { m_freem(args->m); args->m = NULL; /* mark mbuf as consumed */ } NPTV6STAT_INC(cfg, dropped); - } - /* Terminate the search if one_pass is set */ - *done = V_fw_one_pass; - /* Update args->f_id when one_pass is off */ - if (*done == 0 && ret == 0) { - ip6 = mtod(args->m, struct ip6_hdr *); - args->f_id.src_ip6 = ip6->ip6_src; - args->f_id.dst_ip6 = ip6->ip6_dst; + *done = 1; + } else { + /* Terminate the search if one_pass is set */ + *done = V_fw_one_pass; + /* Update args->f_id when one_pass is off */ + if (*done == 0) { + ip6 = mtod(args->m, struct ip6_hdr *); + args->f_id.src_ip6 = ip6->ip6_src; + args->f_id.dst_ip6 = ip6->ip6_dst; + } } return (ret); } static struct nptv6_cfg * nptv6_alloc_config(const char *name, uint8_t set) { struct nptv6_cfg *cfg; cfg = malloc(sizeof(struct nptv6_cfg), M_IPFW, M_WAITOK | M_ZERO); COUNTER_ARRAY_ALLOC(cfg->stats, NPTV6STATS, M_WAITOK); cfg->no.name = cfg->name; cfg->no.etlv = IPFW_TLV_NPTV6_NAME; cfg->no.set = set; strlcpy(cfg->name, name, sizeof(cfg->name)); return (cfg); } static void nptv6_free_config(struct nptv6_cfg *cfg) { COUNTER_ARRAY_FREE(cfg->stats, NPTV6STATS); free(cfg, M_IPFW); } static void nptv6_export_config(struct ip_fw_chain *ch, struct nptv6_cfg *cfg, ipfw_nptv6_cfg *uc) { uc->internal = cfg->internal; uc->external = cfg->external; uc->plen = cfg->plen; uc->flags = cfg->flags & NPTV6_FLAGSMASK; uc->set = cfg->no.set; strlcpy(uc->name, cfg->no.name, sizeof(uc->name)); } struct nptv6_dump_arg { struct ip_fw_chain *ch; struct sockopt_data *sd; }; static int export_config_cb(struct namedobj_instance *ni, struct named_object *no, void *arg) { struct nptv6_dump_arg *da = (struct nptv6_dump_arg *)arg; ipfw_nptv6_cfg *uc; uc = (ipfw_nptv6_cfg *)ipfw_get_sopt_space(da->sd, sizeof(*uc)); nptv6_export_config(da->ch, (struct nptv6_cfg *)no, uc); return (0); } static struct nptv6_cfg * nptv6_find(struct namedobj_instance *ni, const char *name, uint8_t set) { struct nptv6_cfg *cfg; cfg = (struct nptv6_cfg *)ipfw_objhash_lookup_name_type(ni, set, IPFW_TLV_NPTV6_NAME, name); return (cfg); } static void nptv6_calculate_adjustment(struct nptv6_cfg *cfg) { uint16_t i, e; uint16_t *p; /* Calculate checksum of internal prefix */ for (i = 0, p = (uint16_t *)&cfg->internal; p < (uint16_t *)(&cfg->internal + 1); p++) i = cksum_add(i, *p); /* Calculate checksum of external prefix */ for (e = 0, p = (uint16_t *)&cfg->external; p < (uint16_t *)(&cfg->external + 1); p++) e = cksum_add(e, *p); /* Adjustment value for Int->Ext direction */ cfg->adjustment = cksum_add(~e, i); } /* * Creates new NPTv6 instance. * Data layout (v0)(current): * Request: [ ipfw_obj_lheader ipfw_nptv6_cfg ] * * Returns 0 on success */ static int nptv6_create(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { struct in6_addr mask; ipfw_obj_lheader *olh; ipfw_nptv6_cfg *uc; struct namedobj_instance *ni; struct nptv6_cfg *cfg; if (sd->valsize != sizeof(*olh) + sizeof(*uc)) return (EINVAL); olh = (ipfw_obj_lheader *)sd->kbuf; uc = (ipfw_nptv6_cfg *)(olh + 1); if (ipfw_check_object_name_generic(uc->name) != 0) return (EINVAL); if (uc->plen < 8 || uc->plen > 64 || uc->set >= IPFW_MAX_SETS) return (EINVAL); if (IN6_IS_ADDR_MULTICAST(&uc->internal) || IN6_IS_ADDR_MULTICAST(&uc->external) || IN6_IS_ADDR_UNSPECIFIED(&uc->internal) || IN6_IS_ADDR_UNSPECIFIED(&uc->external) || IN6_IS_ADDR_LINKLOCAL(&uc->internal) || IN6_IS_ADDR_LINKLOCAL(&uc->external)) return (EINVAL); in6_prefixlen2mask(&mask, uc->plen); if (IN6_ARE_MASKED_ADDR_EQUAL(&uc->internal, &uc->external, &mask)) return (EINVAL); ni = CHAIN_TO_SRV(ch); IPFW_UH_RLOCK(ch); if (nptv6_find(ni, uc->name, uc->set) != NULL) { IPFW_UH_RUNLOCK(ch); return (EEXIST); } IPFW_UH_RUNLOCK(ch); cfg = nptv6_alloc_config(uc->name, uc->set); cfg->plen = uc->plen; if (cfg->plen <= 48) cfg->flags |= NPTV6_48PLEN; cfg->internal = uc->internal; cfg->external = uc->external; cfg->mask = mask; IN6_MASK_ADDR(&cfg->internal, &mask); IN6_MASK_ADDR(&cfg->external, &mask); nptv6_calculate_adjustment(cfg); IPFW_UH_WLOCK(ch); if (ipfw_objhash_alloc_idx(ni, &cfg->no.kidx) != 0) { IPFW_UH_WUNLOCK(ch); nptv6_free_config(cfg); return (ENOSPC); } ipfw_objhash_add(ni, &cfg->no); IPFW_WLOCK(ch); SRV_OBJECT(ch, cfg->no.kidx) = cfg; IPFW_WUNLOCK(ch); IPFW_UH_WUNLOCK(ch); return (0); } /* * Destroys NPTv6 instance. * Data layout (v0)(current): * Request: [ ipfw_obj_header ] * * Returns 0 on success */ static int nptv6_destroy(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { ipfw_obj_header *oh; struct nptv6_cfg *cfg; if (sd->valsize != sizeof(*oh)) return (EINVAL); oh = (ipfw_obj_header *)sd->kbuf; if (ipfw_check_object_name_generic(oh->ntlv.name) != 0) return (EINVAL); IPFW_UH_WLOCK(ch); cfg = nptv6_find(CHAIN_TO_SRV(ch), oh->ntlv.name, oh->ntlv.set); if (cfg == NULL) { IPFW_UH_WUNLOCK(ch); return (ESRCH); } if (cfg->no.refcnt > 0) { IPFW_UH_WUNLOCK(ch); return (EBUSY); } IPFW_WLOCK(ch); SRV_OBJECT(ch, cfg->no.kidx) = NULL; IPFW_WUNLOCK(ch); ipfw_objhash_del(CHAIN_TO_SRV(ch), &cfg->no); ipfw_objhash_free_idx(CHAIN_TO_SRV(ch), cfg->no.kidx); IPFW_UH_WUNLOCK(ch); nptv6_free_config(cfg); return (0); } /* * Get or change nptv6 instance config. * Request: [ ipfw_obj_header [ ipfw_nptv6_cfg ] ] */ static int nptv6_config(struct ip_fw_chain *chain, ip_fw3_opheader *op, struct sockopt_data *sd) { return (EOPNOTSUPP); } /* * Lists all NPTv6 instances currently available in kernel. * Data layout (v0)(current): * Request: [ ipfw_obj_lheader ] * Reply: [ ipfw_obj_lheader ipfw_nptv6_cfg x N ] * * Returns 0 on success */ static int nptv6_list(struct ip_fw_chain *ch, ip_fw3_opheader *op3, struct sockopt_data *sd) { ipfw_obj_lheader *olh; struct nptv6_dump_arg da; /* Check minimum header size */ if (sd->valsize < sizeof(ipfw_obj_lheader)) return (EINVAL); olh = (ipfw_obj_lheader *)ipfw_get_sopt_header(sd, sizeof(*olh)); IPFW_UH_RLOCK(ch); olh->count = ipfw_objhash_count_type(CHAIN_TO_SRV(ch), IPFW_TLV_NPTV6_NAME); olh->objsize = sizeof(ipfw_nptv6_cfg); olh->size = sizeof(*olh) + olh->count * olh->objsize; if (sd->valsize < olh->size) { IPFW_UH_RUNLOCK(ch); return (ENOMEM); } memset(&da, 0, sizeof(da)); da.ch = ch; da.sd = sd; ipfw_objhash_foreach_type(CHAIN_TO_SRV(ch), export_config_cb, &da, IPFW_TLV_NPTV6_NAME); IPFW_UH_RUNLOCK(ch); return (0); } #define __COPY_STAT_FIELD(_cfg, _stats, _field) \ (_stats)->_field = NPTV6STAT_FETCH(_cfg, _field) static void export_stats(struct ip_fw_chain *ch, struct nptv6_cfg *cfg, struct ipfw_nptv6_stats *stats) { __COPY_STAT_FIELD(cfg, stats, in2ex); __COPY_STAT_FIELD(cfg, stats, ex2in); __COPY_STAT_FIELD(cfg, stats, dropped); } /* * Get NPTv6 statistics. * Data layout (v0)(current): * Request: [ ipfw_obj_header ] * Reply: [ ipfw_obj_header ipfw_obj_ctlv [ uint64_t x N ]] * * Returns 0 on success */ static int nptv6_stats(struct ip_fw_chain *ch, ip_fw3_opheader *op, struct sockopt_data *sd) { struct ipfw_nptv6_stats stats; struct nptv6_cfg *cfg; ipfw_obj_header *oh; ipfw_obj_ctlv *ctlv; size_t sz; sz = sizeof(ipfw_obj_header) + sizeof(ipfw_obj_ctlv) + sizeof(stats); if (sd->valsize % sizeof(uint64_t)) return (EINVAL); if (sd->valsize < sz) return (ENOMEM); oh = (ipfw_obj_header *)ipfw_get_sopt_header(sd, sz); if (oh == NULL) return (EINVAL); if (ipfw_check_object_name_generic(oh->ntlv.name) != 0 || oh->ntlv.set >= IPFW_MAX_SETS) return (EINVAL); memset(&stats, 0, sizeof(stats)); IPFW_UH_RLOCK(ch); cfg = nptv6_find(CHAIN_TO_SRV(ch), oh->ntlv.name, oh->ntlv.set); if (cfg == NULL) { IPFW_UH_RUNLOCK(ch); return (ESRCH); } export_stats(ch, cfg, &stats); IPFW_UH_RUNLOCK(ch); ctlv = (ipfw_obj_ctlv *)(oh + 1); memset(ctlv, 0, sizeof(*ctlv)); ctlv->head.type = IPFW_TLV_COUNTERS; ctlv->head.length = sz - sizeof(ipfw_obj_header); ctlv->count = sizeof(stats) / sizeof(uint64_t); ctlv->objsize = sizeof(uint64_t); ctlv->version = 1; memcpy(ctlv + 1, &stats, sizeof(stats)); return (0); } /* * Reset NPTv6 statistics. * Data layout (v0)(current): * Request: [ ipfw_obj_header ] * * Returns 0 on success */ static int nptv6_reset_stats(struct ip_fw_chain *ch, ip_fw3_opheader *op, struct sockopt_data *sd) { struct nptv6_cfg *cfg; ipfw_obj_header *oh; if (sd->valsize != sizeof(*oh)) return (EINVAL); oh = (ipfw_obj_header *)sd->kbuf; if (ipfw_check_object_name_generic(oh->ntlv.name) != 0 || oh->ntlv.set >= IPFW_MAX_SETS) return (EINVAL); IPFW_UH_WLOCK(ch); cfg = nptv6_find(CHAIN_TO_SRV(ch), oh->ntlv.name, oh->ntlv.set); if (cfg == NULL) { IPFW_UH_WUNLOCK(ch); return (ESRCH); } COUNTER_ARRAY_ZERO(cfg->stats, NPTV6STATS); IPFW_UH_WUNLOCK(ch); return (0); } static struct ipfw_sopt_handler scodes[] = { { IP_FW_NPTV6_CREATE, 0, HDIR_SET, nptv6_create }, { IP_FW_NPTV6_DESTROY,0, HDIR_SET, nptv6_destroy }, { IP_FW_NPTV6_CONFIG, 0, HDIR_BOTH, nptv6_config }, { IP_FW_NPTV6_LIST, 0, HDIR_GET, nptv6_list }, { IP_FW_NPTV6_STATS, 0, HDIR_GET, nptv6_stats }, { IP_FW_NPTV6_RESET_STATS,0, HDIR_SET, nptv6_reset_stats }, }; static int nptv6_classify(ipfw_insn *cmd, uint16_t *puidx, uint8_t *ptype) { ipfw_insn *icmd; icmd = cmd - 1; NPTV6_DEBUG("opcode %d, arg1 %d, opcode0 %d, arg1 %d", cmd->opcode, cmd->arg1, icmd->opcode, icmd->arg1); if (icmd->opcode != O_EXTERNAL_ACTION || icmd->arg1 != V_nptv6_eid) return (1); *puidx = cmd->arg1; *ptype = 0; return (0); } static void nptv6_update_arg1(ipfw_insn *cmd, uint16_t idx) { cmd->arg1 = idx; NPTV6_DEBUG("opcode %d, arg1 -> %d", cmd->opcode, cmd->arg1); } static int nptv6_findbyname(struct ip_fw_chain *ch, struct tid_info *ti, struct named_object **pno) { int err; err = ipfw_objhash_find_type(CHAIN_TO_SRV(ch), ti, IPFW_TLV_NPTV6_NAME, pno); NPTV6_DEBUG("uidx %u, type %u, err %d", ti->uidx, ti->type, err); return (err); } static struct named_object * nptv6_findbykidx(struct ip_fw_chain *ch, uint16_t idx) { struct namedobj_instance *ni; struct named_object *no; IPFW_UH_WLOCK_ASSERT(ch); ni = CHAIN_TO_SRV(ch); no = ipfw_objhash_lookup_kidx(ni, idx); KASSERT(no != NULL, ("NPT with index %d not found", idx)); NPTV6_DEBUG("kidx %u -> %s", idx, no->name); return (no); } static int nptv6_manage_sets(struct ip_fw_chain *ch, uint16_t set, uint8_t new_set, enum ipfw_sets_cmd cmd) { return (ipfw_obj_manage_sets(CHAIN_TO_SRV(ch), IPFW_TLV_NPTV6_NAME, set, new_set, cmd)); } static struct opcode_obj_rewrite opcodes[] = { { .opcode = O_EXTERNAL_INSTANCE, .etlv = IPFW_TLV_EACTION /* just show it isn't table */, .classifier = nptv6_classify, .update = nptv6_update_arg1, .find_byname = nptv6_findbyname, .find_bykidx = nptv6_findbykidx, .manage_sets = nptv6_manage_sets, }, }; static int destroy_config_cb(struct namedobj_instance *ni, struct named_object *no, void *arg) { struct nptv6_cfg *cfg; struct ip_fw_chain *ch; ch = (struct ip_fw_chain *)arg; IPFW_UH_WLOCK_ASSERT(ch); cfg = (struct nptv6_cfg *)SRV_OBJECT(ch, no->kidx); SRV_OBJECT(ch, no->kidx) = NULL; ipfw_objhash_del(ni, &cfg->no); ipfw_objhash_free_idx(ni, cfg->no.kidx); nptv6_free_config(cfg); return (0); } int nptv6_init(struct ip_fw_chain *ch, int first) { V_nptv6_eid = ipfw_add_eaction(ch, ipfw_nptv6, "nptv6"); if (V_nptv6_eid == 0) return (ENXIO); IPFW_ADD_SOPT_HANDLER(first, scodes); IPFW_ADD_OBJ_REWRITER(first, opcodes); return (0); } void nptv6_uninit(struct ip_fw_chain *ch, int last) { IPFW_DEL_OBJ_REWRITER(last, opcodes); IPFW_DEL_SOPT_HANDLER(last, scodes); ipfw_del_eaction(ch, V_nptv6_eid); /* * Since we already have deregistered external action, * our named objects become unaccessible via rules, because * all rules were truncated by ipfw_del_eaction(). * So, we can unlink and destroy our named objects without holding * IPFW_WLOCK(). */ IPFW_UH_WLOCK(ch); ipfw_objhash_foreach_type(CHAIN_TO_SRV(ch), destroy_config_cb, ch, IPFW_TLV_NPTV6_NAME); V_nptv6_eid = 0; IPFW_UH_WUNLOCK(ch); } Index: projects/clang400-import/sys/sys/gtaskqueue.h =================================================================== --- projects/clang400-import/sys/sys/gtaskqueue.h (revision 314522) +++ projects/clang400-import/sys/sys/gtaskqueue.h (revision 314523) @@ -1,123 +1,124 @@ /*- * Copyright (c) 2014 Jeffrey Roberson * Copyright (c) 2016 Matthew Macy * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _SYS_GTASKQUEUE_H_ #define _SYS_GTASKQUEUE_H_ #include #ifndef _KERNEL #error "no user-serviceable parts inside" #endif struct gtaskqueue; typedef void (*gtaskqueue_enqueue_fn)(void *context); /* * Taskqueue groups. Manages dynamic thread groups and irq binding for * device and other tasks. */ void gtaskqueue_block(struct gtaskqueue *queue); void gtaskqueue_unblock(struct gtaskqueue *queue); int gtaskqueue_cancel(struct gtaskqueue *queue, struct gtask *gtask); void gtaskqueue_drain(struct gtaskqueue *queue, struct gtask *task); void gtaskqueue_drain_all(struct gtaskqueue *queue); int grouptaskqueue_enqueue(struct gtaskqueue *queue, struct gtask *task); void taskqgroup_attach(struct taskqgroup *qgroup, struct grouptask *grptask, void *uniq, int irq, char *name); int taskqgroup_attach_cpu(struct taskqgroup *qgroup, struct grouptask *grptask, void *uniq, int cpu, int irq, char *name); void taskqgroup_detach(struct taskqgroup *qgroup, struct grouptask *gtask); struct taskqgroup *taskqgroup_create(char *name); void taskqgroup_destroy(struct taskqgroup *qgroup); int taskqgroup_adjust(struct taskqgroup *qgroup, int cnt, int stride); #define TASK_ENQUEUED 0x1 #define TASK_SKIP_WAKEUP 0x2 #define GTASK_INIT(task, flags, priority, func, context) do { \ (task)->ta_flags = flags; \ (task)->ta_priority = (priority); \ (task)->ta_func = (func); \ (task)->ta_context = (context); \ } while (0) #define GROUPTASK_INIT(gtask, priority, func, context) \ GTASK_INIT(&(gtask)->gt_task, TASK_SKIP_WAKEUP, priority, func, context) #define GROUPTASK_ENQUEUE(gtask) \ grouptaskqueue_enqueue((gtask)->gt_taskqueue, &(gtask)->gt_task) #define TASKQGROUP_DECLARE(name) \ extern struct taskqgroup *qgroup_##name #ifdef EARLY_AP_STARTUP #define TASKQGROUP_DEFINE(name, cnt, stride) \ \ struct taskqgroup *qgroup_##name; \ \ static void \ taskqgroup_define_##name(void *arg) \ { \ qgroup_##name = taskqgroup_create(#name); \ taskqgroup_adjust(qgroup_##name, (cnt), (stride)); \ } \ \ SYSINIT(taskqgroup_##name, SI_SUB_INIT_IF, SI_ORDER_FIRST, \ taskqgroup_define_##name, NULL) #else /* !EARLY_AP_STARTUP */ #define TASKQGROUP_DEFINE(name, cnt, stride) \ \ struct taskqgroup *qgroup_##name; \ \ static void \ taskqgroup_define_##name(void *arg) \ { \ qgroup_##name = taskqgroup_create(#name); \ } \ \ SYSINIT(taskqgroup_##name, SI_SUB_INIT_IF, SI_ORDER_FIRST, \ taskqgroup_define_##name, NULL); \ \ static void \ taskqgroup_adjust_##name(void *arg) \ { \ taskqgroup_adjust(qgroup_##name, (cnt), (stride)); \ } \ \ SYSINIT(taskqgroup_adj_##name, SI_SUB_SMP, SI_ORDER_ANY, \ taskqgroup_adjust_##name, NULL) #endif /* EARLY_AP_STARTUP */ TASKQGROUP_DECLARE(net); +TASKQGROUP_DECLARE(softirq); #endif /* !_SYS_GTASKQUEUE_H_ */ Index: projects/clang400-import/sys/sys/sysent.h =================================================================== --- projects/clang400-import/sys/sys/sysent.h (revision 314522) +++ projects/clang400-import/sys/sys/sysent.h (revision 314523) @@ -1,287 +1,287 @@ /*- * Copyright (c) 1982, 1988, 1991 The Regents of the University of California. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _SYS_SYSENT_H_ #define _SYS_SYSENT_H_ #include struct rlimit; struct sysent; struct thread; struct ksiginfo; struct syscall_args; enum systrace_probe_t { SYSTRACE_ENTRY, SYSTRACE_RETURN, }; typedef int sy_call_t(struct thread *, void *); typedef void (*systrace_probe_func_t)(struct syscall_args *, enum systrace_probe_t, int); typedef void (*systrace_args_func_t)(int, void *, uint64_t *, int *); extern systrace_probe_func_t systrace_probe_func; struct sysent { /* system call table */ int sy_narg; /* number of arguments */ sy_call_t *sy_call; /* implementing function */ au_event_t sy_auevent; /* audit event associated with syscall */ systrace_args_func_t sy_systrace_args_func; /* optional argument conversion function. */ u_int32_t sy_entry; /* DTrace entry ID for systrace. */ u_int32_t sy_return; /* DTrace return ID for systrace. */ u_int32_t sy_flags; /* General flags for system calls. */ u_int32_t sy_thrcnt; }; /* * A system call is permitted in capability mode. */ #define SYF_CAPENABLED 0x00000001 #define SY_THR_FLAGMASK 0x7 #define SY_THR_STATIC 0x1 #define SY_THR_DRAINING 0x2 #define SY_THR_ABSENT 0x4 #define SY_THR_INCR 0x8 #ifdef KLD_MODULE #define SY_THR_STATIC_KLD 0 #else #define SY_THR_STATIC_KLD SY_THR_STATIC #endif struct image_params; struct __sigset; struct trapframe; struct vnode; struct sysentvec { int sv_size; /* number of entries */ struct sysent *sv_table; /* pointer to sysent */ u_int sv_mask; /* optional mask to index */ int sv_errsize; /* size of errno translation table */ int *sv_errtbl; /* errno translation table */ int (*sv_transtrap)(int, int); /* translate trap-to-signal mapping */ int (*sv_fixup)(register_t **, struct image_params *); /* stack fixup function */ void (*sv_sendsig)(void (*)(int), struct ksiginfo *, struct __sigset *); /* send signal */ char *sv_sigcode; /* start of sigtramp code */ int *sv_szsigcode; /* size of sigtramp code */ char *sv_name; /* name of binary type */ int (*sv_coredump)(struct thread *, struct vnode *, off_t, int); /* function to dump core, or NULL */ int (*sv_imgact_try)(struct image_params *); int sv_minsigstksz; /* minimum signal stack size */ int sv_pagesize; /* pagesize */ vm_offset_t sv_minuser; /* VM_MIN_ADDRESS */ vm_offset_t sv_maxuser; /* VM_MAXUSER_ADDRESS */ vm_offset_t sv_usrstack; /* USRSTACK */ vm_offset_t sv_psstrings; /* PS_STRINGS */ int sv_stackprot; /* vm protection for stack */ register_t *(*sv_copyout_strings)(struct image_params *); void (*sv_setregs)(struct thread *, struct image_params *, u_long); void (*sv_fixlimit)(struct rlimit *, int); u_long *sv_maxssiz; u_int sv_flags; void (*sv_set_syscall_retval)(struct thread *, int); int (*sv_fetch_syscall_args)(struct thread *, struct syscall_args *); const char **sv_syscallnames; vm_offset_t sv_timekeep_base; vm_offset_t sv_shared_page_base; vm_offset_t sv_shared_page_len; vm_offset_t sv_sigcode_base; void *sv_shared_page_obj; void (*sv_schedtail)(struct thread *); void (*sv_thread_detach)(struct thread *); int (*sv_trap)(struct thread *); }; #define SV_ILP32 0x000100 /* 32-bit executable. */ #define SV_LP64 0x000200 /* 64-bit executable. */ #define SV_IA32 0x004000 /* Intel 32-bit executable. */ #define SV_AOUT 0x008000 /* a.out executable. */ #define SV_SHP 0x010000 /* Shared page. */ #define SV_CAPSICUM 0x020000 /* Force cap_enter() on startup. */ -#define SV_TIMEKEEP 0x040000 +#define SV_TIMEKEEP 0x040000 /* Shared page timehands. */ #define SV_ABI_MASK 0xff #define SV_ABI_ERRNO(p, e) ((p)->p_sysent->sv_errsize <= 0 ? e : \ ((e) >= (p)->p_sysent->sv_errsize ? -1 : (p)->p_sysent->sv_errtbl[e])) #define SV_PROC_FLAG(p, x) ((p)->p_sysent->sv_flags & (x)) #define SV_PROC_ABI(p) ((p)->p_sysent->sv_flags & SV_ABI_MASK) #define SV_CURPROC_FLAG(x) SV_PROC_FLAG(curproc, x) #define SV_CURPROC_ABI() SV_PROC_ABI(curproc) /* same as ELFOSABI_XXX, to prevent header pollution */ #define SV_ABI_LINUX 3 #define SV_ABI_FREEBSD 9 #define SV_ABI_CLOUDABI 17 #define SV_ABI_UNDEF 255 #ifdef _KERNEL extern struct sysentvec aout_sysvec; extern struct sysent sysent[]; extern const char *syscallnames[]; #if defined(__amd64__) extern int i386_read_exec; #endif #define NO_SYSCALL (-1) struct module; struct syscall_module_data { int (*chainevh)(struct module *, int, void *); /* next handler */ void *chainarg; /* arg for next event handler */ int *offset; /* offset into sysent */ struct sysent *new_sysent; /* new sysent */ struct sysent old_sysent; /* old sysent */ int flags; /* flags for syscall_register */ }; /* separate initialization vector so it can be used in a substructure */ #define SYSENT_INIT_VALS(_syscallname) { \ .sy_narg = (sizeof(struct _syscallname ## _args ) \ / sizeof(register_t)), \ .sy_call = (sy_call_t *)&sys_##_syscallname, \ .sy_auevent = SYS_AUE_##_syscallname, \ .sy_systrace_args_func = NULL, \ .sy_entry = 0, \ .sy_return = 0, \ .sy_flags = 0, \ .sy_thrcnt = 0 \ } #define MAKE_SYSENT(syscallname) \ static struct sysent syscallname##_sysent = SYSENT_INIT_VALS(syscallname); #define MAKE_SYSENT_COMPAT(syscallname) \ static struct sysent syscallname##_sysent = { \ (sizeof(struct syscallname ## _args ) \ / sizeof(register_t)), \ (sy_call_t *)& syscallname, \ SYS_AUE_##syscallname \ } #define SYSCALL_MODULE(name, offset, new_sysent, evh, arg) \ static struct syscall_module_data name##_syscall_mod = { \ evh, arg, offset, new_sysent, { 0, NULL, AUE_NULL } \ }; \ \ static moduledata_t name##_mod = { \ "sys/" #name, \ syscall_module_handler, \ &name##_syscall_mod \ }; \ DECLARE_MODULE(name, name##_mod, SI_SUB_SYSCALLS, SI_ORDER_MIDDLE) #define SYSCALL_MODULE_HELPER(syscallname) \ static int syscallname##_syscall = SYS_##syscallname; \ MAKE_SYSENT(syscallname); \ SYSCALL_MODULE(syscallname, \ & syscallname##_syscall, & syscallname##_sysent, \ NULL, NULL) #define SYSCALL_MODULE_PRESENT(syscallname) \ (sysent[SYS_##syscallname].sy_call != (sy_call_t *)lkmnosys && \ sysent[SYS_##syscallname].sy_call != (sy_call_t *)lkmressys) /* * Syscall registration helpers with resource allocation handling. */ struct syscall_helper_data { struct sysent new_sysent; struct sysent old_sysent; int syscall_no; int registered; }; #define SYSCALL_INIT_HELPER(syscallname) { \ .new_sysent = { \ .sy_narg = (sizeof(struct syscallname ## _args ) \ / sizeof(register_t)), \ .sy_call = (sy_call_t *)& sys_ ## syscallname, \ .sy_auevent = SYS_AUE_##syscallname \ }, \ .syscall_no = SYS_##syscallname \ } #define SYSCALL_INIT_HELPER_COMPAT(syscallname) { \ .new_sysent = { \ .sy_narg = (sizeof(struct syscallname ## _args ) \ / sizeof(register_t)), \ .sy_call = (sy_call_t *)& syscallname, \ .sy_auevent = SYS_AUE_##syscallname \ }, \ .syscall_no = SYS_##syscallname \ } #define SYSCALL_INIT_LAST { \ .syscall_no = NO_SYSCALL \ } int syscall_register(int *offset, struct sysent *new_sysent, struct sysent *old_sysent, int flags); int syscall_deregister(int *offset, struct sysent *old_sysent); int syscall_module_handler(struct module *mod, int what, void *arg); int syscall_helper_register(struct syscall_helper_data *sd, int flags); int syscall_helper_unregister(struct syscall_helper_data *sd); struct proc; const char *syscallname(struct proc *p, u_int code); /* Special purpose system call functions. */ struct nosys_args; int lkmnosys(struct thread *, struct nosys_args *); int lkmressys(struct thread *, struct nosys_args *); int syscall_thread_enter(struct thread *td, struct sysent *se); void syscall_thread_exit(struct thread *td, struct sysent *se); int shared_page_alloc(int size, int align); int shared_page_fill(int size, int align, const void *data); void shared_page_write(int base, int size, const void *data); void exec_sysvec_init(void *param); void exec_inittk(void); #define INIT_SYSENTVEC(name, sv) \ SYSINIT(name, SI_SUB_EXEC, SI_ORDER_ANY, \ (sysinit_cfunc_t)exec_sysvec_init, sv); #endif /* _KERNEL */ #endif /* !_SYS_SYSENT_H_ */ Index: projects/clang400-import =================================================================== --- projects/clang400-import (revision 314522) +++ projects/clang400-import (revision 314523) Property changes on: projects/clang400-import ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head:r314482-314522